From 01f4943ab93475ffa49581e8e4d1d3603151e465 Mon Sep 17 00:00:00 2001 From: master Date: Sun, 14 Dec 2025 16:23:44 +0200 Subject: [PATCH] up --- ...5 - Define a north star metric for TTFS.md | 866 +++++++++++ ...ning the Call‑Stack Reachability Engine.md | 1258 ++++++++++++++++ ...‑Diff - Defining Meaningful Risk Change.md | 892 ++++++++++++ ... - Add a dedicated “first_signal” event.md | 1295 +++++++++++++++++ ...25 - Create a small ground‑truth corpus.md | 787 ++++++++++ ...- Dissect triage and evidence workflows.md | 551 +++++++ ...ate PostgreSQL vs MongoDB for StellaOps.md | 544 +++++++ .../archived/AR-REVIVE-PLAN.md | 12 - 8 files changed, 6193 insertions(+), 12 deletions(-) create mode 100644 docs/product-advisories/13-Dec-2025 - Define a north star metric for TTFS.md create mode 100644 docs/product-advisories/13-Dec-2025 - Designing the Call‑Stack Reachability Engine.md create mode 100644 docs/product-advisories/13-Dec-2025 - Smart‑Diff - Defining Meaningful Risk Change.md create mode 100644 docs/product-advisories/14-Dec-2025 - Add a dedicated “first_signal” event.md create mode 100644 docs/product-advisories/14-Dec-2025 - Create a small ground‑truth corpus.md create mode 100644 docs/product-advisories/14-Dec-2025 - Dissect triage and evidence workflows.md create mode 100644 docs/product-advisories/14-Dec-2025 - Evaluate PostgreSQL vs MongoDB for StellaOps.md delete mode 100644 docs/product-advisories/archived/AR-REVIVE-PLAN.md diff --git a/docs/product-advisories/13-Dec-2025 - Define a north star metric for TTFS.md b/docs/product-advisories/13-Dec-2025 - Define a north star metric for TTFS.md new file mode 100644 index 000000000..df3a177a1 --- /dev/null +++ b/docs/product-advisories/13-Dec-2025 - Define a north star metric for TTFS.md @@ -0,0 +1,866 @@ +Here’s a simple, battle‑tested way to make your UX feel fast under pressure: treat **Time‑to‑First‑Signal (TTFS)** as a product SLO and design everything backwards from it. 
+ +--- + +# TTFS SLO: the idea in one line + +Guarantee **p50 < 2s, p95 < 5s** from user action (or CI event) to the **first meaningful signal** (status, cause, or next step)—fast enough to calm triage, short enough to be felt. + +--- + +## What counts as “First Signal”? + +* A clear, human message like: “Scan started; last error matched: `NU1605` (likely transitive). Retry advice →” +* Or a progress token with context: “Queued (ETA ~18s). Cached reachability graph loaded.” + +Not a spinner. Not 0% progress. A real, decision‑shaping hint. + +--- + +## Budget the pipeline backwards (guardrails) + +* **Frontend (≤150 ms):** render instant skeleton + last known state; optimistic UI; no blocking on fresh data. +* **Edge/API (≤250 ms):** return a “signal frame” fast path (status + last error signature + cached ETA) from cache. +* **Core services (≤500–1500 ms):** pre‑index failures, warm reachability summaries, enqueue heavy work, emit stream token. +* **Slow work (async):** full scan, lattice policy merge, provenance trails—arrive later via push updates. + +--- + +## Minimal implementation (1–2 sprints) + +1. **Define the signal contract** + + * `FirstSignal { kind, verb, scope, lastKnownOutcome?, ETA?, nextAction? }` + * Version it; keep it <1 KB; always return within the SLO window. + +2. **Cache last error signature** + + * Key: `(repo, branch|imageDigest, toolchain-hash)` + * Value: `{errorCode, excerpt, fixLink, firstSeenAt, hitCount}` + * Evict by LRU + TTL (e.g., 7–14 days). Use Valkey in default profile; Postgres JSONB in air‑gap. + +3. **Pre‑index the failing step** + + * When a job fails, extract and store: + + * normalized step id (e.g., `scanner:deps-restore`) + * top 1–3 error tokens (codes, regex’d phrases) + * minimal context (package id, version range) + * Write a tiny **“failure indexer”** that runs in‑band on failure and out‑of‑band on success. + +4. 
**Lazy‑load everything else** + + * UI shows FirstSignal + “Details loading…” + * Fetch heavy panes (full CVE list, call‑graph, SBOM diff) after paint. + +5. **Fast path endpoint** + + * `GET /signal/{jobId}` returns from cache or snapshot table. + * If cache miss: fall back to “cold signal” (`queued`, basic ETA) and **immediately** enqueue warmup tasks. + +6. **Streaming updates** + + * Emit compact deltas: `status:started → status:analyzing → triage:blocked(POLICY_X)` etc. + * UI subscribes; CI annotates with the same tokens. + +--- + +## TTFS SLO monitor (keep it honest) + +* Emit for every user‑visible action: `ttfs_ms`, `path` (UI|CLI|CI), `signal_kind`, `cache_hit` (T/F). +* Track **p50/p95** by surface and by repo size. +* Page on **p95 > 5s** for 5 consecutive minutes (or >2% of traffic). +* Store exemplars (trace ids) to replay slow paths. + +--- + +## Stella Ops–specific hooks (drop‑in) + +* **Scanner.WebService:** on job accept, write `FirstSignal{kind:"queued", ETA}`; if failure index has a hit, attach `lastKnownOutcome`. +* **Feedser/Vexer:** publish “known criticals changed since last run” as a hint in FirstSignal. +* **Policy Engine:** pre‑evaluate “obvious blocks” (e.g., banned license) and surface as `nextAction:"toggle waiver or update license map"`. +* **Air‑gapped profile:** skip Valkey; keep a `first_signal_snapshots` Postgres table + NOTIFY/LISTEN for streaming. + +--- + +## UX micro‑rules + +* **Never show a spinner alone**; always pair with a sentence or chip (“Warm cache found; verifying”). +* **3 taps max** to reach evidence: Button → FirstSignal → Evidence card. +* **Always include a next step** (“Retry with `--ignore NU1605` is unsafe; use `PackageReference` pin → link”). + +--- + +## Quick success criteria + +* New incident claims: “I knew what was happening within 2 seconds.” +* CI annotates within 5s on p95. +* Support tickets referencing “stuck scans” drop ≥40%. 
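The fast‑path resolution order described above (cache → snapshot → cold signal + immediate warm‑up) can be sketched in a few lines. The type and store names below are illustrative assumptions, not Stella Ops APIs:

```typescript
// Hypothetical FirstSignal shape and storage interface; field names are assumptions.
interface FirstSignal {
  kind: "queued" | "started" | "blocked" | "failed" | "succeeded";
  summary: string;
  etaSeconds?: number;
  cacheHit: boolean;
}

interface SignalStore {
  readCache(jobId: string): FirstSignal | undefined;    // Valkey in default profile
  readSnapshot(jobId: string): FirstSignal | undefined; // Postgres snapshot table
  enqueueWarmup(jobId: string): void;                   // non-blocking warm-up task
}

// Fast path: never block on scan work; always return something decision-shaping.
function getSignal(store: SignalStore, jobId: string): FirstSignal {
  const cached = store.readCache(jobId);
  if (cached) return { ...cached, cacheHit: true };

  const snapshot = store.readSnapshot(jobId);
  if (snapshot) return { ...snapshot, cacheHit: false };

  // Cache and snapshot miss: serve a cold signal and kick off warm-up immediately.
  store.enqueueWarmup(jobId);
  return { kind: "queued", summary: "Queued. Preparing scan…", cacheHit: false };
}
```

The point of the sketch is the ordering: the endpoint degrades from cached to snapshot to cold, but every branch returns within the SLO window.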
+ +--- + +If you want, I can turn this into a ready‑to‑paste **TASKS.md** (owners, DOD, metrics, endpoints, DB schemas) for your Stella Ops repos. +````md +# TASKS.md — TTFS (Time‑to‑First‑Signal) Fast Signal + Progressive Updates + +> Paste this file into the repo root (or `/docs/TTFS/TASKS.md`). +> This plan is structured as two sprints (A + B) with clear owners, dependencies, and DoD. + +--- + +## 0) Product SLO and non‑negotiables + +### SLO +- **TTFS p50 < 2s, p95 < 5s** +- Applies to: **Web UI**, **CLI**, **CI annotations** +- TTFS = time from **user action / CI start** → **first meaningful signal rendered/logged** + +### What counts as “First Signal” +A First Signal must include at least one of: +- Status + context (“Queued, ETA ~18s”; “Started, phase: restore”; “Blocked by policy XYZ”) +- Known cause hint (error token/code/category) +- Next action (open logs, docs link, retry command) + +A spinner alone does **not** count. + +### Hard constraints +- `/jobs/{id}/signal` must **never block** on full scan work +- FirstSignal payload in normal cases **< 1KB** +- **No secrets** in snapshots, excerpts, telemetry + +--- + +## 1) Scope and module owners + +### Modules (assumed) +- **Scanner.WebService** (job API + signal provider) +- **Scanner.Worker** (phase changes + event publishing) +- **Policy Engine** (block reasons + quick pre-eval hooks) +- **Feedser/Vexer** (optional: “critical changed” hint) +- **Web UI** (progressive rendering + streaming) +- **CLI** (first signal + streaming) +- **CI Integration** (checks/annotations) +- **Platform/Observability** (metrics, dashboards, alerts) +- **Security/Compliance** (redaction + tenant isolation) + +### Owners (replace with actual people/teams) +- **Backend Lead:** @be-owner +- **Frontend Lead:** @fe-owner +- **DevEx/CLI Lead:** @dx-owner +- **CI Integrations Lead:** @ci-owner +- **SRE/Obs Lead:** @sre-owner +- **Security Lead:** @sec-owner +- **PM:** @pm-owner + +--- + +## 2) Canonical contract: FirstSignal v1.0 + 
+### FirstSignal shape (canonical) +All surfaces (UI/CLI/CI) must be representable via this contract. + +```json +{ + "version": "1.0", + "signalId": "sig_...", + "jobId": "job_...", + + "timestamp": "2025-12-14T18:22:31.014Z", + "kind": "queued|started|phase|blocked|failed|succeeded|canceled|unavailable", + "phase": "resolve|fetch|restore|analyze|policy|report|unknown", + + "scope": { "type": "repo|image|artifact", "id": "org/repo@branch-or-digest" }, + + "summary": "Queued (ETA ~18s). Last failure matched: NU1605 (dependency downgrade).", + "etaSeconds": 18, + + "lastKnownOutcome": { + "signatureId": "sigerr_...", + "errorCode": "NU1605", + "token": "dependency-downgrade", + "excerpt": "Detected package downgrade: ...", + "confidence": "low|medium|high", + "firstSeenAt": "2025-12-01T00:00:00Z", + "hitCount": 14 + }, + + "nextActions": [ + { "type": "open_logs|open_job|docs|retry|cli_command", "label": "Open logs", "target": "/jobs/job_.../logs" } + ], + + "diagnostics": { + "cacheHit": true, + "source": "snapshot|failure_index|cold_start", + "correlationId": "corr_..." + } +} +```` + +### Contract rules + +* Must always include: `version`, `jobId`, `timestamp`, `kind`, `summary` +* Keep normal payload < 1KB (enforce excerpt max length; avoid lists) +* Never include secrets; excerpts must be redacted + +--- + +## 3) Milestones + +### Sprint A — “TTFS Baseline” + +Goal: Always show **some** meaningful First Signal quickly. + +Deliverables: + +* Snapshot persistence (DB) + optional cache +* `/jobs/{id}/signal` fast path +* UI skeleton + immediate FirstSignal rendering (poll fallback OK) +* Base telemetry: `ttfs_ms`, endpoint latency, cache hit + +### Sprint B — “Smart Hints + Streaming” + +Goal: First Signal is helpful and updates live. 
+ +Deliverables: + +* Failure signature indexer + lookup +* SSE events (or WebSocket) for incremental updates +* CLI streaming + CI annotations +* Dashboards + alerts + exemplars/traces +* Redaction hardening and tenant isolation validation + +--- + +## 4) Sprint A tasks — TTFS baseline + +### A1 — Implement FirstSignal types and helpers (shared package) + +**Owner:** @be-owner +**Depends on:** none +**Est:** 2–4 pts + +**Tasks** + +* [ ] Define FirstSignal v1.0 schema in a shared package (`/common/contracts/firstsignal`) +* [ ] Add validators: + + * [ ] required fields present + * [ ] size limits (excerpt length; total serialized bytes threshold warning) + * [ ] allowed enums for kind/phase +* [ ] Add builders: + + * [ ] `buildQueuedSignal(job, eta?)` + * [ ] `buildColdSignal(job)` + * [ ] `mergeHint(signal, lastKnownOutcome)` + * [ ] `addNextActions(signal, actions[])` + +**DoD** + +* Contract is versioned, unit-tested, and used by backend endpoint +* Validation rejects/flags invalid signals in tests + +--- + +### A2 — Snapshot storage: `first_signal_snapshots` table + migrations + +**Owner:** @be-owner +**Depends on:** A1 +**Est:** 3–5 pts + +**Tasks** + +* [ ] Add Postgres migration for `first_signal_snapshots` +* [ ] Implement CRUD: + + * [ ] `createSnapshot(jobId, signal)` + * [ ] `updateSnapshot(jobId, partialSignal)` (phase transitions) + * [ ] `getSnapshot(jobId)` +* [ ] Enforce: + + * [ ] `payload_json` size guard (soft warn + hard cap via excerpt limit) + * [ ] `updated_at` maintained automatically + +**Suggested schema** + +```sql +CREATE TABLE first_signal_snapshots ( + job_id TEXT PRIMARY KEY, + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + kind TEXT NOT NULL, + phase TEXT NOT NULL, + summary TEXT NOT NULL, + eta_seconds INT NULL, + payload_json JSONB NOT NULL +); +CREATE INDEX ON first_signal_snapshots (updated_at DESC); +``` + +**DoD** + +* Migration included +* Integration test: create job → 
snapshot exists within request lifecycle (or best-effort async write + immediate cold response) + +--- + +### A3 — Cache layer (default profile) with Postgres fallback + +**Owner:** @be-owner +**Depends on:** A2 +**Est:** 3–6 pts + +**Tasks** + +* [ ] Add optional Valkey/Redis support: + + * [ ] key: `signal:job:{jobId}` TTL: 24h + * [ ] read-through cache on `/signal` + * [ ] write-through on snapshot updates +* [ ] Air-gapped mode behavior: + + * [ ] cache disabled → read/write snapshots in Postgres only +* [ ] Add config toggles: + + * [ ] `TTFS_CACHE_BACKEND=valkey|postgres|none` + * [ ] `TTFS_CACHE_TTL_SECONDS=86400` + +**DoD** + +* With cache enabled: `/signal` p95 latency meets budget in load test +* With cache disabled: correctness remains; p95 within acceptable baseline + +--- + +### A4 — `/jobs/{jobId}/signal` fast-path endpoint + +**Owner:** @be-owner +**Depends on:** A2, A3 +**Est:** 4–8 pts + +**Tasks** + +* [ ] Implement `GET /jobs/{jobId}/signal` + + * [ ] Try cache snapshot + * [ ] Else DB snapshot + * [ ] Else cold signal (`kind=queued`, `phase=unknown`, summary “Queued. 
Preparing scan…”) + * [ ] Best-effort snapshot write if missing (non-blocking) +* [ ] Response headers: + + * [ ] `X-Correlation-Id` + * [ ] `Cache-Status: hit|miss|bypass` +* [ ] Add server-side timing logs (debug-level) for: + + * [ ] cache read time + * [ ] db read time + * [ ] cold path time + +**Performance budget** + +* Cache-hit response: **p95 ≤ 250ms** +* Cold response: **p95 ≤ 500ms** + +**DoD** + +* Endpoint never blocks on scan work +* Returns a valid FirstSignal every time job exists +* Load test demonstrates budgets + +--- + +### A5 — Create snapshot at job creation and update on phase changes + +**Owner:** @be-owner + @worker-owner +**Depends on:** A2 +**Est:** 5–8 pts + +**Tasks** + +* [ ] In `POST /jobs`: + + * [ ] Immediately write initial snapshot: + + * `kind=queued` + * `phase=unknown` + * summary includes “Queued” and optional ETA +* [ ] In worker: + + * [ ] When job starts: update snapshot to `kind=started`, `phase=resolve|fetch|restore…` + * [ ] On phase transitions: update snapshot + * [ ] On terminal: `kind=succeeded|failed|canceled` +* [ ] Ensure updates are idempotent and safe (replays) + +**DoD** + +* For any started job, snapshot shows phase changes within a few seconds +* Terminal kind always correct + +--- + +### A6 — UI: Immediate “First Signal” rendering with polling fallback + +**Owner:** @fe-owner +**Depends on:** A4 +**Est:** 6–10 pts + +**Tasks** + +* [ ] On scan trigger: + + * [ ] Render skeleton + “Preparing scan…” message (no spinner-only) + * [ ] Call `POST /jobs` (get jobId) + * [ ] Immediately call `GET /jobs/{jobId}/signal` + * [ ] Render summary + at least one next action button (Open job/logs) +* [ ] Poll fallback: + + * [ ] If streaming not available yet (Sprint A), poll `/signal` every 2–5s until terminal +* [ ] Lazy-load heavy panels (must not block First Signal): + + * [ ] vulnerability list + * [ ] dependency graph + * [ ] SBOM diff + +**DoD** + +* Real user monitoring shows UI TTFS p50 < 2s, p95 < 5s for the 
baseline path +* No spinner-only states + +--- + +### A7 — Telemetry: baseline metrics and tracing + +**Owner:** @sre-owner + @be-owner + @fe-owner +**Depends on:** A4, A6 +**Est:** 5–10 pts + +**Metrics** + +* [ ] `ttfs_ms` (emitted client-side for UI; server-side for CLI/CI if needed) + + * tags: `surface=ui|cli|ci`, `cache_hit=true|false`, `signal_source=snapshot|cold_start`, `kind`, `repo_size_bucket` +* [ ] `signal_endpoint_latency_ms` +* [ ] `signal_payload_bytes` +* [ ] `signal_error_rate` + +**Tracing** + +* [ ] Correlation id propagated: + + * [ ] API response header + * [ ] worker logs + * [ ] events (Sprint B) + +**Dashboards** + +* [ ] TTFS p50/p95 by surface +* [ ] cache hit rate +* [ ] endpoint latency percentiles + +**DoD** + +* Metrics visible in dashboard +* Correlation ids make it possible to trace slow examples end-to-end + +--- + +## 5) Sprint B tasks — smart hints + streaming + +### B1 — Failure signature extraction + redaction library + +**Owner:** @be-owner + @sec-owner +**Depends on:** A1 +**Est:** 6–12 pts + +**Tasks** + +* [ ] Implement redaction utility (unit-tested): + + * [ ] strip bearer tokens, API keys, access tokens, private URLs + * [ ] cap excerpt length (e.g., 240 chars) + * [ ] normalize whitespace +* [ ] Implement signature extraction from: + + * [ ] structured step errors (preferred) + * [ ] raw logs (fallback) via regex ruleset +* [ ] Map to: + + * `errorCode` (if present) + * `token` (normalized category) + * `confidence` (high/med/low) + +**DoD** + +* Redaction unit tests include “known secret-like patterns” +* Extraction produces stable tokens for top failure families + +--- + +### B2 — Failure signature storage: `failure_signatures` table + upsert on failures + +**Owner:** @be-owner +**Depends on:** B1 +**Est:** 5–10 pts + +**Tasks** + +* [ ] Add Postgres migration for `failure_signatures` +* [ ] Implement lookup key: + + * `(scope_type, scope_id, toolchain_hash)` +* [ ] On job failure: + + * [ ] extract signature → 
redaction → upsert + * [ ] increment hit_count; update last_seen_at +* [ ] Retention: + + * [ ] TTL job: delete signatures older than 14 days (configurable) + * [ ] or retain last N signatures per scope + +**Suggested schema** + +```sql +CREATE TABLE failure_signatures ( + signature_id TEXT PRIMARY KEY, + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + scope_type TEXT NOT NULL, + scope_id TEXT NOT NULL, + toolchain_hash TEXT NOT NULL, + error_code TEXT NULL, + token TEXT NOT NULL, + excerpt TEXT NULL, + confidence TEXT NOT NULL, + first_seen_at TIMESTAMPTZ NOT NULL, + last_seen_at TIMESTAMPTZ NOT NULL, + hit_count INT NOT NULL DEFAULT 1 +); +CREATE INDEX ON failure_signatures (scope_type, scope_id, toolchain_hash); +CREATE INDEX ON failure_signatures (token); +``` + +**DoD** + +* Failure runs populate signatures +* Excerpts are redacted and capped +* Retention job verified + +--- + +### B3 — Enrich FirstSignal with “lastKnownOutcome” hint + +**Owner:** @be-owner +**Depends on:** B2 +**Est:** 3–6 pts + +**Tasks** + +* [ ] On `/signal` (fast path): + + * [ ] if snapshot exists but has no hint, attempt signature lookup by scope+toolchain hash + * [ ] merge hint into signal + * [ ] include `diagnostics.source=failure_index` when used +* [ ] Add “next actions” for common tokens: + + * [ ] docs link for known error codes/tokens + * [ ] “open logs” always present + +**DoD** + +* For scopes with prior failures, FirstSignal includes hint within SLO budgets + +--- + +### B4 — Streaming updates via SSE (recommended) + +**Owner:** @be-owner + @worker-owner + @fe-owner +**Depends on:** A5 +**Est:** 8–16 pts + +**Backend tasks** + +* [ ] Add `GET /jobs/{jobId}/events` SSE endpoint +* [ ] Define event payloads: + + * `status` (kind+phase+message) + * `hint` (token+errorCode+confidence) + * `policy` (blocked + policyId) + * `complete` (terminal) +* [ ] Worker publishes events at: + + * start + * phase transitions + * policy 
decision + * terminal +* [ ] Ensure reconnect safety: + + * [ ] event id monotonic or timestamp + * [ ] optional replay window (last N events in memory or DB) + +**Frontend tasks** + +* [ ] Subscribe after jobId known +* [ ] Update FirstSignal UI in-place on deltas +* [ ] Fallback to polling when SSE fails + +**DoD** + +* UI updates without refresh +* Event stream doesn’t spam (3–8 meaningful events per job typical) +* SSE failure degrades gracefully + +--- + +### B5 — Policy Engine: “obvious block” pre-eval for early signal + +**Owner:** @be-owner + @policy-owner +**Depends on:** B4 (optional), or can enrich snapshot directly +**Est:** 5–10 pts + +**Tasks** + +* [ ] Add a quick pre-evaluation hook for high-signal blocks: + + * banned license + * disallowed package + * org-level denylist +* [ ] Emit early policy event or update snapshot: + + * `kind=blocked`, `phase=policy`, summary names the policy + * next action points to waiver/docs (if supported) + +**DoD** + +* When an obvious block is present, users see it in FirstSignal without waiting for full analysis + +--- + +### B6 — CLI: First Signal + streaming + +**Owner:** @dx-owner +**Depends on:** A4, B4 +**Est:** 5–10 pts + +**Tasks** + +* [ ] Ensure CLI prints FirstSignal within TTFS budget +* [ ] Add `--follow` default behavior: + + * connect to SSE and stream deltas +* [ ] Provide minimal, non-spammy output: + + * only on meaningful transitions +* [ ] Print correlation id for support triage + +**DoD** + +* CLI TTFS p50 < 2s, p95 < 5s +* Streaming works and degrades to polling + +--- + +### B7 — CI annotations/checks: initial First Signal within 5s p95 + +**Owner:** @ci-owner +**Depends on:** A4, B4 (optional) +**Est:** 6–12 pts + +**Tasks** + +* [ ] On CI job start: + + * [ ] call `/signal` and publish check/annotation with summary + job link +* [ ] Update annotations only on state changes: + + * queued → started + * started → blocked/failed/succeeded +* [ ] Avoid annotation spam (max 3–5 updates) + +**DoD** 
+ +* CI shows actionable first message within 5s p95 +* Updates are minimal and meaningful + +--- + +### B8 — Observability: SLO alerts + exemplars + +**Owner:** @sre-owner +**Depends on:** A7 +**Est:** 5–10 pts + +**Tasks** + +* [ ] Alerts: + + * [ ] page when `p95(ttfs_ms) > 5000` for 5 mins + * [ ] page when `signal_endpoint_error_rate > 1%` +* [ ] Add exemplars / trace links on slow TTFS samples +* [ ] Add breakdown dashboard: + + * surface (ui/cli/ci) + * cacheHit + * repo size bucket + * kind/phase + +**DoD** + +* On-call can diagnose slow TTFS with one click to traces/logs + +--- + +## 6) Cross-cutting: security, privacy, and tenancy + +### S1 — Tenant-safe caching and lookups + +**Owner:** @sec-owner + @be-owner +**Depends on:** A3, B2 +**Est:** 3–6 pts + +**Tasks** + +* [ ] Ensure cache keys include tenant/org boundary where applicable: + + * `tenant:{tenantId}:signal:job:{jobId}` +* [ ] Ensure failure signatures are only looked up within same tenant +* [ ] Add tests for cross-tenant leakage + +**DoD** + +* No cross-tenant access possible via cache or signature index + +--- + +### S2 — No secrets policy enforcement + +**Owner:** @sec-owner +**Depends on:** B1 +**Est:** 2–5 pts + +**Tasks** + +* [ ] Add “secret scanning” unit tests for redaction +* [ ] Add runtime guardrails: + + * if excerpt contains forbidden patterns → replace with “[redacted]” +* [ ] Ensure telemetry attributes never include excerpts + +**DoD** + +* Security review sign-off for snapshot + signature + telemetry + +--- + +## 7) Global Definition of Done + +A feature is “done” only when: + +* [ ] Meets TTFS SLO in staging load test and in production RUM (within agreed rollout window) +* [ ] Has: + + * [ ] unit tests + * [ ] integration tests + * [ ] basic load test coverage for `/signal` +* [ ] Has: + + * [ ] dashboards + * [ ] alerts (or explicitly deferred with signed waiver) +* [ ] Has: + + * [ ] secure redaction + * [ ] tenant isolation +* [ ] Has a rollback plan via feature flag + 
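The S2 redaction guardrail above (forbidden patterns → `[redacted]`, excerpt cap, whitespace normalization) can be sketched as follows. The regex patterns and the 240‑character cap are assumptions taken from task B1; a real deployment should use a vetted secret‑scanning ruleset:

```typescript
// Illustrative redaction guardrail for log excerpts; patterns are assumptions.
const FORBIDDEN: RegExp[] = [
  /bearer\s+[a-z0-9._-]+/gi,                // bearer tokens
  /(?:api[_-]?key|token)\s*[:=]\s*\S+/gi,   // key=value style secrets
  /https?:\/\/[^\s]*:[^\s]*@\S+/gi,         // URLs with embedded credentials
];

function redactExcerpt(raw: string, maxLen = 240): string {
  let text = raw.replace(/\s+/g, " ").trim(); // normalize whitespace first
  for (const pattern of FORBIDDEN) {
    text = text.replace(pattern, "[redacted]");
  }
  // Hard cap keeps snapshot payloads under the size budget.
  return text.length > maxLen ? text.slice(0, maxLen) : text;
}
```

Running redaction before persistence (and never in telemetry attributes) keeps the snapshot, signature, and metrics paths covered by one tested utility.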
+--- + +## 8) Test plan + +### Unit tests + +* FirstSignal contract validation (required fields, enums) +* Redaction patterns (bearer tokens, API keys, URLs, long strings) +* Signature extraction rule correctness + +### Integration tests + +* Create job → snapshot exists → `/signal` returns it +* Worker phase transitions update snapshot +* Job fail → signature stored → next job → `/signal` includes lastKnownOutcome +* SSE connect → receive events in order → terminal event once + +### Load tests (must-have) + +* `/jobs/{id}/signal`: + + * cache-hit p95 ≤ 250ms + * cold path p95 ≤ 500ms + * error rate < 0.1% under expected concurrency + +### Chaos/degraded tests + +* Cache down → Postgres fallback works +* SSE blocked → UI polls and still updates + +--- + +## 9) Feature flags and rollout + +### Flags + +* `ttfs.first_signal_enabled` (default ON in staging) +* `ttfs.cache_enabled` +* `ttfs.failure_index_enabled` +* `ttfs.sse_enabled` +* `ttfs.policy_preeval_enabled` + +### Rollout steps + +1. Enable baseline FirstSignal + snapshots for internal/staging +2. Enable cache in default profile +3. Enable failure index (read-only first; then write) +4. Enable SSE for 10% traffic → 50% → 100% +5. 
Enable CI annotations (start with non-blocking informational checks) + +--- + +## 10) PR review checklist (paste into PR template) + +* [ ] No blocking heavy work added to `/signal` path +* [ ] Signal payload size remains < 1KB in normal cases +* [ ] Excerpts are redacted + length-capped +* [ ] Tenant boundary included in cache keys and DB queries +* [ ] Metrics emitted (`ttfs_ms`, endpoint latency, cacheHit) +* [ ] UI has no spinner-only state; always shows message + next action +* [ ] Streaming has polling fallback +* [ ] Tests added/updated (unit + integration) + +--- + +## 11) “Ready for QA” scenarios + +QA should validate: + +* UI: + + * click scan → first message within 2s typical + * see queued/started/blocked states clearly + * open logs works +* CLI: + + * first output within 2s typical + * follow stream updates +* CI: + + * first annotation/check appears quickly and links to job +* Security: + + * inject fake token into logs → stored excerpt is redacted +* Multi-tenant: + + * run jobs across tenants → no leakage in signals or hints + +--- + +``` + +If you want this split into **multiple repo-local files** (e.g., `/docs/TTFS/ARCH.md`, `/docs/TTFS/SCHEMAS.sql`, `/docs/TTFS/RUNBOOK.md`, plus a PR template snippet), say the folder structure you prefer and I’ll output them in the same paste-ready format. +``` diff --git a/docs/product-advisories/13-Dec-2025 - Designing the Call‑Stack Reachability Engine.md b/docs/product-advisories/13-Dec-2025 - Designing the Call‑Stack Reachability Engine.md new file mode 100644 index 000000000..8e29fea7a --- /dev/null +++ b/docs/product-advisories/13-Dec-2025 - Designing the Call‑Stack Reachability Engine.md @@ -0,0 +1,1258 @@ +Here’s a practical blueprint for building a **reachability‑first code+binary scanner** that fuses static call‑graphs with runtime evidence, and scales to large monorepos/microservices. 
+ +--- + +# 1) Static analyzers (per language) + +* **.NET (Roslyn / IL)** + + * Parse solutions with `Microsoft.CodeAnalysis.MSBuild`, collect symbols, build call graph from `ISymbol` → `IInvocationOperation`. + * Handle reflection edges by heuristics (string literals, `Type.GetType`, DI registrations). + * IL pass: read assemblies with `System.Reflection.Metadata` to connect external/library calls. + * Minimal sample: + + ```csharp + using System.Linq; + using Microsoft.CodeAnalysis; + using Microsoft.CodeAnalysis.CSharp.Syntax; + using Microsoft.CodeAnalysis.MSBuild; + + // Call MSBuildLocator.RegisterDefaults() once before creating the workspace. + var ws = MSBuildWorkspace.Create(); + var sln = await ws.OpenSolutionAsync(@"path\to.sln"); + foreach (var proj in sln.Projects) + foreach (var doc in proj.Documents) + { + var model = await doc.GetSemanticModelAsync(); + var root = await doc.GetSyntaxRootAsync(); + foreach (var node in root.DescendantNodes().OfType<InvocationExpressionSyntax>()) + { + var sym = model.GetSymbolInfo(node).Symbol as IMethodSymbol; + if (sym != null) + { + // record edge: caller -> sym.ContainingType.Name + "." + sym.Name + } + } + } + ``` +* **Java (Soot or WALA)** + + * Build bytecode call graph (CHA/RTA/points‑to) and export edges. + * Seed entrypoints from `public static void main`, Spring Boot controllers, servlet mappings. +* **Node/Python** + + * Build AST + import graph; resolve exports (`module.exports`, `export default`, Python `__all__`). + * Track dynamic requires (best‑effort string eval); record web/router handlers as entrypoints. +* **Go/Rust** + + * Use build graph (Go modules, Cargo metadata) + AST to map `main` and handler functions. + * Include linker‑time features/conditions to avoid dead edges. +* **Binary‑only (containers, closed libs)** + + * Recover function boundaries (Ghidra/rizin), mine strings/imports, detect candidates for entrypoints from container `ENTRYPOINT/CMD`, service files, and exposed ports. + * Heuristics: exported symbols, syscall usage, and common framework stubs. 
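All of the per‑language analyzers above can reduce to one shared edge shape before storage. A minimal sketch of that model, with a merge step that deduplicates edges and lets runtime evidence upgrade a static edge (type and field names are illustrative, not the tool's actual schema):

```typescript
type EdgeKind = "static" | "runtime";

interface CallEdge {
  caller: string; // stable node/symbol id
  callee: string;
  kind: EdgeKind;
  weight: number; // observed frequency, or 1.0 for static edges
}

// Merge duplicate edges: sum weights, and prefer "runtime" over "static"
// so exactly one edge per (caller, callee) pair survives.
function mergeEdges(edges: CallEdge[]): CallEdge[] {
  const merged = new Map<string, CallEdge>();
  for (const e of edges) {
    const key = `${e.caller}→${e.callee}`;
    const prev = merged.get(key);
    if (!prev) {
      merged.set(key, { ...e });
    } else {
      prev.weight += e.weight;
      if (e.kind === "runtime") prev.kind = "runtime"; // evidence upgrades the edge
    }
  }
  return [...merged.values()];
}
```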
+ +--- + +# 2) Runtime confirmation (evidence) + +* **Windows/.NET:** ETW sampling to “mint” runtime edges (method IDs, stack samples) without heavy overhead. +* **Linux/containers:** eBPF/usdt or perf sampling to confirm hot paths; record PID→image→build info to link evidence back to SBOM components. +* **Rule:** static edge exists → mark **probable**; static+runtime match → mark **proven** (confidence ↑, prioritize). + +--- + +# 3) Entrypoint discovery + +* **Web services:** framework routers (ASP.NET Core endpoints, Spring mappings, Express routes, FastAPI decorators). +* **Jobs/CLIs:** scheduler configs (Cron, systemd timers, k8s CronJobs). +* **Events:** message consumers (RabbitMQ/Kafka topics), gRPC service maps. + +Entrypoints seed reachability: start from entry, traverse call graph, intersect with SBOM → “reachable components + reachable vulns”. + +--- + +# 4) Scale & storage + +* **Shard** by repo/service; compute graphs independently. +* **Compress** with SCCs (strongly connected components) to shrink graph size. +* **Cap cardinality** using hot‑path sampling (keep top‑N edges by observed frequency). +* **Cache**: content‑addressed graphs keyed by `(SBOM hash, compiler flags, env)`; invalidate on source/SBOM/CFG changes or new VEX/policy. +* **Store** edges as `(caller, callee, kind: static|runtime, weight, build-id)` in Postgres; keep Valkey for ephemeral reachability queries. + +--- + +# 5) SBOM/VEX linkage + +* Normalize package coordinates (purl), map symbols/binaries → SBOM components. +* For each CVE: + + * **Reachable?** (entrypoint‑anchored traversal hits affected symbol/library) + * **Proven at runtime?** (evidence present) + * **Gated by config?** (feature flags, platform checks) +* Emit VEX with machine‑explainable reasons (e.g., *not reachable*, *reachable but not loaded*, *reachable+proven*). 
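Sections 2–5 above combine into a short core algorithm: traverse from the discovered entrypoints, intersect with the symbols of the affected component, then apply the static‑vs‑runtime rule ("static edge → probable; static+runtime → proven"). A sketch under assumed node ids and label names:

```typescript
// Adjacency list keyed by caller node id; a hypothetical shape, not the real store.
interface Graph { edges: Map<string, string[]> }

// Entrypoint-seeded BFS over the call graph.
function reachableNodes(graph: Graph, entrypoints: string[]): Set<string> {
  const seen = new Set<string>(entrypoints);
  const queue = [...entrypoints];
  while (queue.length > 0) {
    const node = queue.shift()!;
    for (const next of graph.edges.get(node) ?? []) {
      if (!seen.has(next)) { seen.add(next); queue.push(next); }
    }
  }
  return seen;
}

// Intersect reachability with the SBOM mapping for one CVE's affected symbols.
function classify(
  reachable: Set<string>,
  affectedSymbols: string[],
  runtimeHits: Set<string> // node ids observed in runtime samples
): "unreachable" | "probable" | "proven" {
  const hit = affectedSymbols.filter(s => reachable.has(s));
  if (hit.length === 0) return "unreachable";
  return hit.some(s => runtimeHits.has(s)) ? "proven" : "probable";
}
```

The SCC compression and hot‑path capping from section 4 only shrink the graph this traversal runs over; they don't change the classification logic.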
+ +--- + +# 6) APIs and outputs (developer‑friendly) + +* **CLI** + + * `scan graph --lang dotnet --sln path.sln --out graph.scc.json` + * `scan runtime --target pod/myservice --duration 30s --out stacks.json` + * `reachability join --graph graph.scc.json --runtime stacks.json --sbom bom.cdx.json --out reach.cdxr.json` +* **HTTP** + + * `POST /graph` (upload call graph) + * `POST /runtime` (upload evidence) + * `POST /reachability` → returns ranked, evidence‑linked findings +* **Artifacts** + + * `graph.scc.json` (SCC‑compressed call graph) + * `reach.cdxr.json` (CycloneDX extension with evidence) + * `vex.json` (OpenVEX/CSAF w/ “justifications”) + +--- + +# 7) Quality gates & tests + +* **Golden images**: tiny test services where reachable/unreachable CVEs are known. +* **Mutation tests**: toggle entrypoints, flags, and ensure reachability shifts correctly. +* **Drift checks**: if runtime sees edges not in static graph → open “coverage debt” issue. + +--- + +# 8) Security & perf knobs + +* Sampling rate caps (CPU bound), PID/image allowlists, PII‑safe symbol hashing option. +* Offline mode: bundle symbols + evidence into a replayable archive (deterministic re‑evaluation). + +--- + +If you want, I can generate a **starter repo layout** (Roslyn worker, Java WALA worker, eBPF sampler, joiner, and a Postgres schema) tailored to your .NET 10 + microservices stack. +Below is a developer-ready **product + BA implementation specification** for the **Reachability-First Scanner** described earlier, tailored to **StellaOps (.NET 10)** and your standing architecture rules (**lattice algorithms run in `scanner.webservice`; Concelier/Excititor preserve prune source; Postgres is SoR; Valkey is ephemeral only**). 
+ +--- + +# StellaOps Reachability-First Scanner + +## Developer Implementation Specification (v1) + +## 0) Objective and boundaries + +### Objective + +Reduce vulnerability noise by classifying findings as **Unreachable / Possibly Reachable / Reachable (Static) / Proven Reachable (Runtime)** using: + +1. **Static call graph** (best-effort; language-aware) +2. **Runtime evidence** (sampling, low overhead) +3. **Entrypoint seeding** (framework-aware) +4. **Join** against SBOM component mapping + vulnerability data (from Concelier) + VEX (from Excititor) + +### Non-goals (v1) + +* Perfect points-to analysis for all languages. +* Full decompilation for every binary (support is “best-effort” with confidence). +* Executing or fuzzing workloads. + +--- + +# 1) Product behavior: what the user sees + +## 1.1 Reachability statuses (canonical) + +These labels must be stable across UI/CLI/API: + +* **UNREACHABLE**: no path from any discovered entrypoint to affected component/symbol. +* **POSSIBLY_REACHABLE**: graph incomplete / dynamic behavior; heuristics indicate risk. +* **REACHABLE_STATIC**: a static path exists from at least one entrypoint. +* **REACHABLE_PROVEN**: runtime evidence confirms code path or library load (stronger than static). + +### Required explanation fields (always returned) + +Every reachability classification must include: + +* `why[]`: list of structured reasons (machine-readable codes + human text) +* `evidence[]`: references to graph paths and/or runtime samples +* `confidence`: 0.0–1.0 +* `scope`: component-only or symbol-level (if symbol mapping exists) + +## 1.2 Key UX outputs (pipeline-first) + +* CLI output for CI gates: `stella scan reachability --format sarif|json` +* UI detail panel must show: + + * Entry point(s) → path summary (k shortest paths, default k=3) + * Whether runtime proved it (samples, timestamps, container/build IDs) + * Which assumptions/heuristics were used (reflection, DI, dynamic import, etc.) 
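The required explanation fields in §1.1 lend themselves to a small shared contract plus a validator that enforces the invariants (non‑empty `why[]`, bounded `confidence`, evidence for runtime‑proven results). The status names follow the spec; the record and validator shapes are illustrative assumptions:

```typescript
type ReachabilityStatus =
  | "UNREACHABLE" | "POSSIBLY_REACHABLE" | "REACHABLE_STATIC" | "REACHABLE_PROVEN";

interface ReachabilityFinding {
  status: ReachabilityStatus;
  why: { code: string; text: string }[]; // machine-readable code + human text
  evidence: string[];                    // refs to graph paths / runtime samples
  confidence: number;                    // 0.0–1.0
  scope: "component" | "symbol";
}

// Returns a list of violations; empty list means the finding is well-formed.
function validateFinding(f: ReachabilityFinding): string[] {
  const errors: string[] = [];
  if (f.why.length === 0) errors.push("why[] must not be empty");
  if (f.confidence < 0 || f.confidence > 1) errors.push("confidence out of range");
  // A runtime-proven claim without evidence references would be unexplainable.
  if (f.status === "REACHABLE_PROVEN" && f.evidence.length === 0)
    errors.push("REACHABLE_PROVEN requires evidence");
  return errors;
}
```

Keeping validation in one place makes the "always returned" guarantee for `why[]`, `evidence[]`, and `confidence` testable across UI, CLI, and API surfaces.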
+ +--- + +# 2) System architecture (StellaOps modules) + +## 2.1 Services and responsibilities + +### `StellaOps.Scanner.WebService` (authoritative) + +**Owns the reachability pipeline and the lattice computation for reachability decisions.** +Responsibilities: + +* Ingest static graphs from language workers +* Ingest runtime evidence (from collectors) +* Normalize symbols → components (SBOM join) +* Compute reachability results, confidence, and explanation artifacts +* Expose query APIs and CI export formats +* Persist everything to Postgres (SoR) +* Use Valkey only as ephemeral accelerator + +### Language workers (stateless compute) + +Examples: + +* `StellaOps.Scanner.Worker.DotNet` +* `StellaOps.Scanner.Worker.Java` +* `StellaOps.Scanner.Worker.Node` +* `StellaOps.Scanner.Worker.Python` +* `StellaOps.Scanner.Worker.Go` +* `StellaOps.Scanner.Worker.Rust` +* `StellaOps.Scanner.Worker.Binary` + +Responsibilities: + +* Produce `CallGraph.v1.json` (+ optional `Entrypoints.v1.json`) +* Provide symbol IDs stable within a scan (see hashing rules) + +### Runtime collectors (agent/sidecar; optional) + +* Windows: ETW/EventPipe sampling for .NET +* Linux: eBPF/perf sampling for native; plus runtime-specific exporters where feasible + +Collectors only emit **evidence events**; they never compute reachability. + +### Concelier / Excititor integration + +* Concelier provides vulnerability facts (CVE ↔ component versions). +* Excititor provides VEX statements. + **Neither computes reachability or lattice merges**; they provide **pruned sources** only. + +--- + +# 3) Data contracts (hard requirements) + +## 3.1 Stable identifiers + +All graph nodes must have: + +* `nodeId`: stable across replays when code is unchanged. 
+* `symbolKey`: canonical string (language-specific) +* `artifactKey`: assembly/jar/module/binary identity (prefer build ID + path + hash) +* Optional: `purlCandidates[]` (library mapping hints) + +**DotNet nodeId rule (v1):** +`nodeId = SHA256(assemblyMvid + ":" + metadataToken + ":" + genericArity + ":" + signatureShape)` + +* If token unavailable (source-only), fallback: SHA256(projectPath + ":" + file + ":" + span + ":" + symbolDisplayString) + +## 3.2 CallGraph.v1.json + +Minimum required schema: + +```json +{ + "schema": "stella.callgraph.v1", + "scanKey": "uuid", + "language": "dotnet|java|node|python|go|rust|binary", + "artifacts": [{ "artifactKey": "…", "kind": "assembly|jar|module|binary", "sha256": "…" }], + "nodes": [{ + "nodeId": "…", + "artifactKey": "…", + "symbolKey": "Namespace.Type::Method(…)", + "visibility": "public|internal|private|unknown", + "isEntrypointCandidate": false + }], + "edges": [{ + "from": "nodeId", + "to": "nodeId", + "kind": "static|heuristic", + "reason": "direct_call|virtual_call|reflection_string|di_binding|dynamic_import|unknown", + "weight": 1.0 + }], + "entrypoints": [{ + "nodeId": "…", + "kind": "http|grpc|cli|job|event|unknown", + "route": "/api/orders/{id}", + "framework": "aspnetcore|minimalapi|spring|express|unknown" + }] +} +``` + +## 3.3 RuntimeEvidence.v1.json + +```json +{ + "schema": "stella.runtimeevidence.v1", + "scanKey": "uuid", + "collectedAt": "2025-12-14T10:00:00Z", + "environment": { + "os": "linux|windows", + "k8s": { "namespace": "…", "pod": "…", "container": "…" }, + "imageDigest": "sha256:…", + "buildId": "…" + }, + "samples": [{ + "timestamp": "…", + "pid": 1234, + "threadId": 77, + "frames": ["nodeId","nodeId","nodeId"], + "sampleWeight": 1.0 + }], + "loadedArtifacts": [{ + "artifactKey": "…", + "evidence": "loaded_module|mapped_file|jar_loaded" + }] +} +``` + +--- + +# 4) Postgres schema (system of record) + +## 4.1 Core tables + +You can implement with migrations in `StellaOps.Scanner.Persistence` 
(EF Core 9). + +### `scan` + +* `scan_id uuid pk` +* `created_at timestamptz` +* `repo_uri text null` +* `commit_sha text null` +* `sbom_digest text` (hash of SBOM input) +* `policy_digest text` (hash of reachability policy inputs) +* `status text` (NEW/RUNNING/DONE/FAILED) + +Indexes: + +* `(commit_sha, sbom_digest)` for caching + +### `artifact` + +* `artifact_id uuid pk` +* `scan_id uuid fk` +* `artifact_key text` unique per scan +* `kind text` +* `sha256 text` +* `build_id text null` +* `purl text null` + +Index: + +* `(scan_id, artifact_key)` unique + +### `cg_node` + +* `scan_id uuid fk` +* `node_id text` (hash string) +* `artifact_key text` +* `symbol_key text` +* `visibility text` +* `flags int` (bitset: entrypointCandidate, external, generated, etc.) + PK: `(scan_id, node_id)` + +GIN index: + +* `symbol_key` trigram for search (optional) + +### `cg_edge` + +* `scan_id uuid fk` +* `from_node_id text` +* `to_node_id text` +* `kind smallint` (0 static, 1 heuristic, 2 runtime_minted) +* `reason smallint` +* `weight real` + PK: `(scan_id, from_node_id, to_node_id, kind, reason)` + +Indexes: + +* `(scan_id, from_node_id)` +* `(scan_id, to_node_id)` + +### `entrypoint` + +* `scan_id uuid` +* `node_id text` +* `kind text` +* `framework text` +* `route text null` + PK: `(scan_id, node_id, kind, framework, route)` + +### `runtime_sample` + +* `scan_id uuid` +* `collected_at timestamptz` +* `env_hash text` (hash of environment identity) +* `sample_id bigserial pk` +* `timestamp timestamptz` +* `pid int` +* `thread_id int` +* `frames text[]` (nodeIds) +* `weight real` + +Partition suggestion: + +* Partition by `scan_id` or by month depending on retention. 
+ +### `symbol_component_map` + +* `scan_id uuid` +* `node_id text` +* `purl text` +* `mapping_kind text` (exact|heuristic|external) +* `confidence real` + PK: `(scan_id, node_id, purl)` + +### `reachability_component` + +* `scan_id uuid` +* `purl text` +* `status smallint` (0 unreachable, 1 possible, 2 reachable_static, 3 reachable_proven) +* `confidence real` +* `why jsonb` +* `evidence jsonb` + PK: `(scan_id, purl)` + +### `reachability_finding` + +* `scan_id uuid` +* `cve_id text` +* `purl text` +* `status smallint` +* `confidence real` +* `why jsonb` +* `evidence jsonb` + PK: `(scan_id, cve_id, purl)` + +## 4.2 Valkey usage (ephemeral only) + +Allowed: + +* Dedup keys for evidence ingest (short TTL) +* Hot query cache: `(scan_id, purl)` → reachability result +* Rate limits / nonces + +Not allowed: + +* Authoritative queueing for scan state +* Any “only copy” of results + +--- + +# 5) Reachability computation (the actual algorithm) + +## 5.1 Inputs + +* Call graph nodes/edges + entrypoints +* Runtime evidence (optional) +* SBOM (CycloneDX/SPDX) with purls +* Concelier vulnerability facts (CVE ↔ purl/version ranges) +* Excititor VEX statements (not affected / affected / under investigation) + +## 5.2 Normalize to a graph suitable for traversal + +In `scanner.webservice`: + +1. Build adjacency list for `cg_edge.kind in (static, heuristic)` +2. Optionally compress SCCs: + + * Compute SCCs (Tarjan/Kosaraju) + * Store SCC mapping for explanation paths (must remain explainable) + +## 5.3 Entrypoint seeding rules + +Entrypoints come from: + +* Worker-reported entrypoints (preferred) +* Framework discovery in worker (ASP.NET maps, Spring mappings, etc.) +* Fallback: `Main`, exported symbols, container CMD/ENTRYPOINT + +**If entrypoints are empty**, mark all results as `POSSIBLY_REACHABLE` with reason `NO_ENTRYPOINTS_DISCOVERED`, unless runtime evidence exists. + +## 5.4 Traversal + +For each scan: + +* Start from all entrypoints; traverse reachable nodes. 
+* Track: + + * `firstSeenFromEntrypoint[node]` (for k-shortest path reconstruction) + * `pathWitness[node]` (parent pointers or compressed witness) + +Produce: + +* `reachableNodesStatic` set + +## 5.5 Join to components (SBOM) + +Map reachable nodes to purls using `symbol_component_map`. + +Mapping sources (priority order): + +1. Exact binary symbol → package metadata (where available) +2. Assembly/jar/module to SBOM component (by hash/purl) +3. Heuristics: namespace prefixes, import paths, jar manifest, npm package.json, go module path + +If a vulnerable purl is in SBOM but has **no symbol mapping**, component reachability defaults: + +* If artifact is **loaded at runtime** → at least `REACHABLE_PROVEN` (component level) +* Else if referenced by static dependency graph → `POSSIBLY_REACHABLE` +* Else → `UNREACHABLE` (with `NO_SYMBOL_MAPPING` reason) + +## 5.6 Runtime evidence upgrade (“minting”) + +If runtime evidence is present: + +* For each sample stack: + + * Mark each frame node as “executed” + * Mint runtime edges: consecutive frames become `cg_edge.kind=runtime_minted` (optional table or derived view) +* If any executed node maps to purl affected by CVE: + + * Upgrade status to `REACHABLE_PROVEN` +* If only loaded artifact exists: + + * Upgrade component status to `REACHABLE_PROVEN` (component-only), but keep symbol-level as unknown. + +## 5.7 Confidence scoring (deterministic) + +A simple deterministic scoring function (v1) used everywhere: + +* Base: + + * `UNREACHABLE` → 0.05 + * `POSSIBLY_REACHABLE` → 0.35 + * `REACHABLE_STATIC` → 0.70 + * `REACHABLE_PROVEN` → 0.95 +* Modifiers: + + * +0.10 if path uses only `static` edges (no heuristic) + * −0.15 if path includes `reflection_string|dynamic_import` + * +0.10 if runtime evidence hits a node in affected component + * −0.10 if entrypoints incomplete (`NO_ENTRYPOINTS_DISCOVERED`) + Clamp to `[0, 1]`. + +All modifiers must be recorded in `why[]`. 
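The scoring rules above are deterministic by design and fit in a few lines. A minimal Python sketch (the reason strings and parameter names are illustrative; the numbers are the v1 table above):

```python
BASE = {
    "UNREACHABLE": 0.05,
    "POSSIBLY_REACHABLE": 0.35,
    "REACHABLE_STATIC": 0.70,
    "REACHABLE_PROVEN": 0.95,
}

def confidence(status, static_only_path=False, uses_dynamic_edges=False,
               runtime_hit=False, entrypoints_incomplete=False):
    """Deterministic v1 score: base for the status plus recorded modifiers, clamped to [0, 1]."""
    why = []
    score = BASE[status]
    if static_only_path:
        score += 0.10
        why.append("MOD_STATIC_ONLY_PATH(+0.10)")
    if uses_dynamic_edges:
        score -= 0.15
        why.append("MOD_DYNAMIC_EDGE(-0.15)")
    if runtime_hit:
        score += 0.10
        why.append("MOD_RUNTIME_HIT(+0.10)")
    if entrypoints_incomplete:
        score -= 0.10
        why.append("MOD_NO_ENTRYPOINTS(-0.10)")
    return max(0.0, min(1.0, score)), why
```

Because the modifiers are returned alongside the score, the caller can copy them straight into the result's `why[]` array.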
+
+---
+
+# 6) Language worker specs (what each worker must do)
+
+## 6.1 .NET worker (Roslyn + optional IL)
+
+**Goal (v1):** produce good-enough call graph + entrypoints for ASP.NET Core and workers.
+
+### Required features
+
+* Direct invocation edges: `InvocationExpressionSyntax`
+* Object creation edges: constructors
+* Delegate invocation: best-effort; record a heuristic edge when the target is unresolved
+* Virtual/interface dispatch:
+
+  * record `virtual_call` edge to declared method
+  * optionally add edges to known overrides within solution (static, conservative)
+* Async/await: treat state machine calls as implementation detail; connect logical caller → awaited method
+
+### Entrypoint discovery (.NET)
+
+Implement these detectors:
+
+* `Program.Main` (classic)
+* ASP.NET Core:
+
+  * Controllers: `[ApiController]`, route attributes, action methods
+  * Minimal APIs: `MapGet/MapPost/MapMethods` patterns (syntactic + semantic)
+  * gRPC: `MapGrpcService<TService>()` and service methods
+  * Hosted services: `IHostedService`, `BackgroundService.ExecuteAsync` as job entrypoints
+* Message consumers (if present): known library patterns (e.g., MassTransit consumers)
+
+### Reflection and DI heuristics
+
+Produce **heuristic edges** when you see:
+
+* `Type.GetType("…")`, `Assembly.GetType`, `GetMethod("…")`, `Invoke`
+* `services.AddTransient<IFoo, Foo>()` / `AddScoped` / `AddSingleton`
+
+  * Add edge `IFoo` → `Foo` constructor as `di_binding` heuristic
+* `Activator.CreateInstance`, `ServiceProvider.GetService` patterns
+
+### Output guarantees
+
+* Must not crash on partial compilation (missing refs); produce a partial graph with `why=COMPILATION_PARTIAL`
+* Provide `artifact_key` per assembly/project output
+
+## 6.2 Java / Node / Python / Go / Rust workers
+
+v1 expectations:
+
+* Provide import graph + framework entrypoints + best-effort call edges.
+* Always label uncertain resolution as `heuristic` with a reason code.
+ +## 6.3 Binary worker + +v1 expectations: + +* Identify artifacts, exported symbols, imported libs, and candidate entrypoints from container metadata. +* Provide component-level mapping primarily; symbol-level mapping only when confident. + +--- + +# 7) APIs (scanner.webservice) + +## 7.1 Ingestion endpoints + +* `POST /api/scans` → creates scan record (returns `scanId`) +* `POST /api/scans/{scanId}/callgraphs` → accepts `CallGraph.v1.json` +* `POST /api/scans/{scanId}/runtimeevidence` → accepts `RuntimeEvidence.v1.json` +* `POST /api/scans/{scanId}/sbom` → accepts CycloneDX/SPDX +* `POST /api/scans/{scanId}/compute-reachability` → triggers computation (idempotent) + +Rules: + +* All ingests must be **idempotent** via `contentDigest` header (store seen digests in Postgres; Valkey may accelerate dedupe). +* Reject mismatched `scanKey/scanId`. + +## 7.2 Query endpoints + +* `GET /api/scans/{scanId}/reachability/components?purl=...` +* `GET /api/scans/{scanId}/reachability/findings?cve=...` +* `GET /api/scans/{scanId}/reachability/explain?cve=...&purl=...` + + * returns `why[]` + path witness + sample refs + +## 7.3 Export endpoints + +* `GET /api/scans/{scanId}/exports/sarif` +* `GET /api/scans/{scanId}/exports/cdxr` (CycloneDX reachability extension) +* `GET /api/scans/{scanId}/exports/openvex` (reachability justifications as VEX annotations) + +--- + +# 8) Deterministic replay requirements (must-have) + +Every reachability result must be reproducible from: + +* SBOM digest +* CallGraph digests (per worker) +* RuntimeEvidence digests (optional) +* Concelier feed snapshot digest +* Excititor VEX snapshot digest +* Policy digest (confidence scoring + gating rules) + +Implement `ReplayManifest.json`: + +```json +{ + "schema": "stella.replaymanifest.v1", + "scanId": "uuid", + "inputs": { + "sbomDigest": "sha256:…", + "callGraphs": [{"language":"dotnet","digest":"sha256:…"}], + "runtimeEvidence": [{"digest":"sha256:…"}], + "concelierSnapshot": "sha256:…", + 
"excititorSnapshot": "sha256:…", + "policyDigest": "sha256:…" + } +} +``` + +--- + +# 9) Quality gates and acceptance criteria + +## 9.1 Golden corpus (mandatory) + +Create `/tests/Reachability.Golden/` with: + +* Minimal ASP.NET controller app with known reachable endpoint → vulnerable lib call +* Minimal app with vulnerable lib present but never called → unreachable +* Reflection-based activation case → “possible” unless runtime proves +* BackgroundService job case + +**Acceptance**: + +* Each golden test asserts: + + * Reachability status + * At least one `why[]` reason + * Deterministic `confidence` within ±0.01 + +## 9.2 Drift detection (mandatory) + +If runtime minted edges not present in static graph above a threshold: + +* Emit `COVERAGE_DRIFT` warning with top missing edges +* Store drift report in Postgres (`reachability_drift` table or JSONB field) + +## 9.3 Performance SLOs (v1 targets) + +* 1 medium service (100k LOC .NET) static graph: < 2 minutes on CI runner class machine +* Reachability compute: < 30 seconds +* Query `GET finding`: < 200ms p95 (use Postgres indexes + optional Valkey cache) + +--- + +# 10) Implementation plan (developer checklist) + +## Milestone A — Data plumbing (1) + +* Implement Postgres tables + migrations +* Implement ingestion endpoints +* Persist call graphs and runtime evidence +* Implement digesting + idempotency + +**Done when:** you can ingest a call graph + SBOM and query stored nodes/edges. + +## Milestone B — DotNet worker MVP (2) + +* Roslyn graph extraction (direct calls + constructors) +* ASP.NET Core entrypoints (controllers + minimal APIs) +* Output `CallGraph.v1.json` + +**Done when:** golden corpus “reachable/unreachable” works on static only. 
+ +## Milestone C — Reachability compute (3) + +* Entry point seeding +* SCC compression optional +* Traversal + path witness generation +* Join to SBOM purls +* Generate `reachability_component` + `reachability_finding` + +**Done when:** component-level reachability returns stable answers. + +## Milestone D — Runtime evidence upgrade (4) + +* Ingest runtime evidence +* Upgrade statuses to `REACHABLE_PROVEN` +* Drift report + +**Done when:** runtime samples can flip a “possible” to “proven”. + +## Milestone E — Exports (5) + +* SARIF export +* CycloneDX reachability extension export +* OpenVEX annotations export + +**Done when:** CI pipeline can gate on proven/static reachability. + +--- + +If you want this converted into **repo-facing artifacts** (e.g., `docs/modules/scanner/reachability-spec.md`, EF Core entity classes skeletons, and OpenAPI snippets for the endpoints), tell me the exact module paths you use under `/src/` and I’ll emit drop-in files matching your StellaOps naming conventions. +Below is a concrete, implementable **architecture + technology stack** for **`StellaOps.Scanner.Worker.DotNet`** (the .NET static graph worker), written as if you’re handing it directly to an engineer to build. + +--- + +# 1) Worker’s purpose in the StellaOps scanner system + +**`StellaOps.Scanner.Worker.DotNet` is a stateless “graph producer.”** It does **not** compute reachability, confidence, or any lattice logic (that lives in `scanner.webservice`). The worker: + +1. **Discovers entrypoints** (ASP.NET Core controllers, minimal APIs, gRPC, hosted services, etc.) +2. **Extracts a static call graph** (method → method edges) +3. **Adds heuristic edges** for DI/reflection/dynamic patterns +4. **Emits `CallGraph.v1.json`** and optionally uploads it to `scanner.webservice` + +Key constraint: node IDs must be compatible with runtime evidence (EventPipe/ETW) mapping. That’s why we build node IDs from **(Module MVID + metadata token)** whenever possible. 
+ +--- + +# 2) Deployment model + +## 2.1 Container image choice + +You have two legitimate modes; implement both: + +### Mode A — “Artifacts-first” (preferred for security) + +* Input: already-built assemblies from CI (`bin/Release/.../*.dll` + associated files) +* Worker does **no `dotnet build`** +* Worker performs **IL/metadata scanning** + optional Roslyn source parsing for entrypoints/heuristics + +### Mode B — “Build-and-scan” (convenience; higher risk) + +* Input: repo checkout with `.sln` +* Worker runs `dotnet restore`/`dotnet build` inside a sandboxed container, then scans outputs + +Because .NET build can execute **MSBuild tasks, analyzers, and source generators** (code execution risk), the product-default should be Mode A in any untrusted scenario. + +## 2.2 Runtime requirements + +* Base runtime: **.NET 10 (LTS)**. Microsoft’s support policy lists .NET 10 as LTS with original release **Nov 11, 2025** and latest patch **10.0.1 (Dec 9, 2025)**. ([Microsoft][1]) +* If you use Mode B, the image must include **.NET 10 SDK** (not just runtime). ([Microsoft][2]) + +## 2.3 Sandbox controls (Mode B) + +If you allow building: + +* Run with **no outbound network** (or allowlist only internal NuGet proxy). +* Read-only root FS; writable temp only. +* Drop Linux capabilities; use seccomp/apparmor defaults. +* Mount repo read-only; write outputs to a dedicated volume. +* Disable telemetry: `DOTNET_CLI_TELEMETRY_OPTOUT=1`. 
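The sandbox bullets above map mechanically onto container run flags. A hedged Python sketch that assembles them (Docker-style flag names assumed; the custom network name is a placeholder, not a real resource):

```python
def sandbox_args(repo_dir, out_dir, nuget_proxy=None):
    """Assemble docker-run style flags for a Mode B build sandbox (illustrative only)."""
    # "nuget-proxy-only" is a hypothetical network that allowlists the internal NuGet proxy.
    network = "none" if nuget_proxy is None else "nuget-proxy-only"
    return [
        "--network", network,
        "--read-only",                        # read-only root FS
        "--tmpfs", "/tmp",                    # writable temp only
        "--cap-drop", "ALL",                  # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "-v", f"{repo_dir}:/src:ro",          # repo mounted read-only
        "-v", f"{out_dir}:/out",              # dedicated output volume
        "-e", "DOTNET_CLI_TELEMETRY_OPTOUT=1",
    ]
```

Keeping flag assembly in a pure function makes the sandbox policy trivially unit-testable and auditable.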
+ +--- + +# 3) Core architecture (pipeline) + +Implement the worker as a single executable (CLI) with internal pipeline stages: + +``` +┌───────────────────────────────────────────────────────────────┐ +│ Worker.DotNet CLI │ +│ Inputs: --sln / --assemblies / --repo, --scanKey, --out │ +└───────────────┬───────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────┐ +│ Stage 0: Discovery │ +│ - Find solutions/projects or assemblies │ +│ - Determine configuration/TFM │ +└───────────────┬───────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────┐ +│ Stage 1: Build (optional) │ +│ - dotnet restore/build OR skip │ +│ - Collect output assembly paths │ +└───────────────┬───────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────┐ +│ Stage 2: Reference Indexer │ +│ - Build mapping: (AssemblyName, Version) -> artifactKey │ +│ - Compute sha256 per referenced dll │ +└───────────────┬───────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────┐ +│ Stage 3: IL Call Graph Extractor │ +│ - Parse each project assembly │ +│ - Create method nodes (nodeId = hash(MVID:token)) │ +│ - Parse IL & add static edges (call/callvirt/newobj/ldftn...) │ +│ - Emit external nodes for member refs │ +└───────────────┬───────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────┐ +│ Stage 4: Roslyn Entrypoints + Heuristics │ +│ - Controllers/minimal APIs/gRPC/HostedService entrypoints │ +│ - DI binding edges (AddTransient/AddScoped/AddSingleton etc.) │ +│ - Reflection edges (Type.GetType/GetMethod/Invoke etc.) 
│ +│ - Resolve Roslyn symbols -> nodeIds via symbolKey dictionary │ +└───────────────┬───────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────┐ +│ Stage 5: Merge + Emit │ +│ - Merge nodes/edges/entrypoints │ +│ - Output CallGraph.v1.json │ +│ - Optional POST to scanner.webservice │ +└───────────────────────────────────────────────────────────────┘ +``` + +**Why IL-first?** +Because you want **metadata token + MVID** node IDs that correlate naturally with runtime stacks. Deterministic builds make MVID stable for identical compilation inputs. ([Microsoft Learn][3]) + +--- + +# 4) Technology stack (NuGet + platform APIs) + +## 4.1 Roslyn / MSBuild loading + +Use Roslyn MSBuild workspace packages: + +* `Microsoft.CodeAnalysis.Workspaces.MSBuild` (MSBuildWorkspace support) ([NuGet][4]) +* `Microsoft.CodeAnalysis.CSharp.Workspaces` (C# semantic model / operations API) +* Optional: `Microsoft.CodeAnalysis` meta-package (superset) ([NuGet][5]) +* `Microsoft.Build.Locator` (register MSBuild instances for workspace loading) + +Roslyn packages are actively published by RoslynTeam (latest shown as **5.0.0** as of Nov 2025). ([NuGet][6]) + +## 4.2 IL + metadata scanning + +Prefer BCL APIs (no extra dependencies): + +* `System.Reflection.Metadata` +* `System.Reflection.PortableExecutable` +* `System.Reflection.Emit.OpCodes` for IL decoding (operand sizes) + (This lets you implement a compact IL parser without Cecil.) 
+ +Optional alternative (faster development, more deps): + +* `Mono.Cecil` (makes IL traversal trivial) ([NuGet][7]) + +## 4.3 CLI + logging + JSON + +* `System.CommandLine` (recommended) +* `Microsoft.Extensions.Logging` (+ Console logger) +* `System.Text.Json` (source-generated serializers strongly recommended) + +## 4.4 Runtime alignment note + +Runtime collectors commonly rely on EventPipe/ETW; the .NET diagnostics client library (`Microsoft.Diagnostics.NETCore.Client`) is the standard managed API for EventPipe sessions. ([Microsoft Learn][8]) +The worker itself doesn’t collect runtime evidence, but the **nodeId algorithm must match what runtime collectors can compute** (hence MVID+token). + +--- + +# 5) Internal module decomposition + +Implement these internal components as classes/services. Keep them testable (pure functions where possible). + +## 5.1 `WorkerOptions` + +Holds CLI options: + +* `ScanKey` (uuid) +* `RepoRoot`, `SolutionPath` OR `AssembliesPath[]` +* `Configuration` (default Release) +* `TargetFramework` (optional) +* `BuildMode` = `Artifacts | Build` +* `OutFile` +* `UploadUrl` + `ApiKey` (optional) +* `MaxEdgesPerNode` (optional throttle) +* `IncludeExternalNodes` (bool) +* `Concurrency` (int) + +## 5.2 `BuildOrchestrator` (Mode B only) + +Responsibilities: + +* Run `dotnet restore` and `dotnet build` +* Capture output logs and surface them as structured diagnostics +* Return discovered output assemblies (dll paths) + +Hard requirements: + +* Support `--no-restore` and `--no-build` toggles (or equivalent) +* Support `ContinuousIntegrationBuild=true` to improve determinism when available +* If build fails, still attempt to scan any assemblies that exist, but mark output with `why=BUILD_FAILED_PARTIAL`. 
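The `BuildOrchestrator` invocation can likewise be a pure function that the sandboxed runner executes. A sketch under the requirements above (`ContinuousIntegrationBuild=true` is a standard MSBuild property; the function shape is ours):

```python
def build_commands(sln, configuration="Release", restore=True):
    """Compose the dotnet invocations for Mode B; the caller executes them in the sandbox."""
    cmds = []
    if restore:
        cmds.append(["dotnet", "restore", sln])
    cmds.append([
        "dotnet", "build", sln,
        "-c", configuration,
        "--no-restore",                        # restore is handled explicitly above (or skipped)
        "-p:ContinuousIntegrationBuild=true",  # improves determinism where supported
    ])
    return cmds
```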
+ +## 5.3 `MsbuildWorkspaceLoader` (Roslyn) + +Responsibilities: + +* Register MSBuild with `MSBuildLocator` +* Load `.sln` via `MSBuildWorkspace` +* Provide: + + * `Solution` object + * `Project` list (C# only for v1) + * Compilation(s) when needed (for semantic analysis) + +MSBuildWorkspace is the canonical Roslyn path for analyzing MSBuild solutions. ([NuGet][4]) + +## 5.4 `ReferenceIndexer` + +Responsibilities: + +* Build a map from referenced assemblies to `artifactKey` +* For each `PortableExecutableReference` with a file path: + + * compute sha256 + * read assembly identity (name, version) + * create `artifactKey` + * add to: + + * `AssemblyIdentity -> artifactKey` + * `artifactKey -> sha256/path/version` + +This index is used by IL extractor to attribute **external nodes** to correct artifacts. + +## 5.5 `IlCallGraphExtractor` + +Responsibilities: + +* For each “root” assembly (project output): + + * open PE + * get module MVID + * enumerate `MethodDefinition` rows + * create nodes for all methods + * parse IL bodies and emit edges + +### IL parsing scope (v1) + +You only need to recognize these opcodes as “calls”: + +* `call` +* `callvirt` +* `newobj` +* `jmp` +* `ldftn` +* `ldvirtftn` + +### Node identity + +* Internal method nodeId: + + * `nodeId = SHA256( MVID + ":" + metadataToken + ":" + arity + ":" + signatureShape )` + * Minimal acceptable: `SHA256(MVID + ":" + metadataToken)` + +This is intentionally compatible with how runtime stacks identify methods (module + token). + +### External method nodes + +If a call operand is a `MemberRef`/`MethodSpec` that targets another assembly: + +* Create an “external node” with: + + * `symbolKey` computed from metadata signature + * `artifactKey` resolved via `ReferenceIndexer` (assembly identity match) + * `nodeId = SHA256("ext:" + artifactKey + ":" + symbolKey)` (runtime-proof not required) + +Set `flags |= External`. 
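The identity rules above reduce to two small hash helpers. A Python sketch of the v1 scheme (the exact token formatting and separator encoding are assumptions that the spec must pin down to one canonical form):

```python
import hashlib

def _sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def internal_node_id(mvid, metadata_token, arity, signature_shape):
    """nodeId for a method defined in a scanned assembly (MVID + metadata token based)."""
    # Hypothetical encoding: token rendered as 8 uppercase hex digits.
    return _sha256(f"{mvid}:{metadata_token:08X}:{arity}:{signature_shape}")

def external_node_id(artifact_key, symbol_key):
    """nodeId for a MemberRef/MethodSpec that targets another assembly."""
    return _sha256(f"ext:{artifact_key}:{symbol_key}")
```

The same inputs always yield the same 64-character hex id, which is what makes replay and runtime-stack correlation work.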
+
+## 5.6 `RoslynEntrypointExtractor`
+
+Responsibilities:
+
+* Produce `entrypoints[]` records pointing to nodeIds.
+
+### Must support (v1)
+
+**ASP.NET Core MVC controllers**
+
+* Type has `[ApiController]` or derives from `ControllerBase`
+* Action methods: public instance methods with routing attributes `[HttpGet]`, `[HttpPost]`, `[Route]`, etc.
+* Route template:
+
+  * combine controller + action route attributes (best effort)
+* `entrypoint.kind = http`, `framework=aspnetcore`
+
+**Minimal APIs**
+
+* Detect invocation of `MapGet`, `MapPost`, `MapPut`, `MapDelete`, `MapMethods`
+* Extract route string literal when available
+* Handler target:
+
+  * lambda => map to the compiler-generated method (best effort)
+  * method group => resolve to method symbolKey => nodeId
+
+**gRPC**
+
+* Detect `MapGrpcService<TService>()` (endpoint registration)
+* Entry points: service methods on generated base types (best effort)
+
+**Background jobs**
+
+* Types implementing `IHostedService`
+* `BackgroundService.ExecuteAsync` override
+* `entrypoint.kind = job`
+
+### Mapping Roslyn → nodeId
+
+Do **not** attempt to compute metadata tokens from Roslyn symbols directly.
+
+Instead:
+
+* Generate the same canonical `symbolKey` for Roslyn symbols
+* Resolve `symbolKey -> nodeId` using a dictionary built from IL nodes
+
+If not resolvable, emit an entrypoint with a synthetic “unresolved” node:
+
+* `nodeId = SHA256("unresolved:" + symbolKey)`
+* `flags |= Unresolved`
+* `why += ENTRYPOINT_SYMBOL_UNRESOLVED`
+
+## 5.7 `RoslynHeuristicEdgeExtractor`
+
+Responsibilities:
+
+* Add **heuristic edges** that IL won’t reliably capture.
+
+### DI bindings (must-have)
+
+Detect common DI registration patterns:
+
+* `services.AddTransient<TService, TImplementation>()`
+* `AddScoped`, `AddSingleton`
+  Emit heuristic edge:
+* from: interface method set?
(v1 simplify to type-level constructor edge) +* to: `Foo..ctor(...)` node +* `reason = di_binding` + +Practical v1 implementation: + +* Create edge from a synthetic “DI container” node per assembly to implementation constructors. +* Or create edges from the registration site method to the constructor. + (Choose one and keep consistent.) + +### Reflection (must-have) + +Emit heuristic edges with lower confidence: + +* `Type.GetType("Namespace.Type, Assembly")` +* `Assembly.Load(...)`, `GetMethod("X")`, `Invoke` +* `Activator.CreateInstance(...)` + +If string literal resolves to a type/method in the solution, create edge: + +* from: caller method +* to: target method/ctor +* `reason = reflection_string` + +If not resolvable, record a `why=REFLECTION_UNRESOLVED_STRING` diagnostic; do not crash. + +## 5.8 `GraphMerger` + +Responsibilities: + +* Merge nodes/edges/entrypoints from IL and Roslyn stages +* De-duplicate edges by `(from,to,kind,reason)` +* Apply optional throttles: + + * cap edges per node + * drop low-weight heuristics if too many + +## 5.9 `CallGraphWriter` + +Responsibilities: + +* Serialize `CallGraph.v1.json` exactly to spec +* Include: + + * `artifacts[]` (project outputs + references) + * `nodes[]`, `edges[]` + * `entrypoints[]` + * `language = "dotnet"` + * `scanKey` + +--- + +# 6) Canonical symbolKey format (critical for merges) + +Pick one canonical form and use it everywhere. + +Recommended v1 `symbolKey` shape: + +``` +{Namespace}.{TypeName}[`Arity][+Nested]::{MethodName}[`Arity]({ParamType1},{ParamType2},...) +``` + +Rules: + +* Use `System.*` full names for BCL types +* Use `+` for nested types (metadata style) +* Use backtick arity for generic type/method definitions +* For arrays: `System.String[]` +* For byref: `System.String&` + +**Implementation detail:** + +* IL extractor can build this from metadata signatures. +* Roslyn extractor can build this using a controlled `SymbolDisplayFormat`. 
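To make the format concrete, here is a minimal Python sketch of a canonical symbolKey builder (the helper signature is illustrative; the bullet rules above are normative):

```python
def symbol_key(namespace, type_name, method, params,
               type_arity=0, method_arity=0, nested=None):
    """Build the canonical v1 symbolKey: Ns.Type[`N][+Nested]::Method[`N](P1,P2,...)."""
    t = f"{namespace}.{type_name}"
    if type_arity:
        t += f"`{type_arity}"          # backtick arity for generic definitions
    for n in (nested or []):
        t += f"+{n}"                   # metadata-style nested types
    m = method + (f"`{method_arity}" if method_arity else "")
    return f"{t}::{m}({','.join(params)})"
```

Both the IL extractor and the Roslyn formatter must emit byte-identical strings from their respective inputs; a shared golden test over examples like these is cheap insurance.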
+ +If you get this right, Roslyn → IL mapping becomes reliable. + +--- + +# 7) CLI surface (what developers will actually run) + +Minimum viable commands: + +### Artifacts-first scan + +```bash +stella-worker-dotnet scan \ + --scanKey 00000000-0000-0000-0000-000000000000 \ + --assemblies ./artifacts/bin/Release \ + --out ./callgraph.json +``` + +### Build-and-scan (internal trusted only) + +```bash +stella-worker-dotnet scan \ + --scanKey ... \ + --sln ./src/MySolution.sln \ + --configuration Release \ + --tfm net10.0 \ + --buildMode build \ + --out ./callgraph.json +``` + +### Upload to scanner.webservice + +```bash +stella-worker-dotnet scan \ + --scanKey ... \ + --assemblies ./artifacts/bin/Release \ + --upload https://scanner/api/scans/{scanId}/callgraphs \ + --apiKey $STELLA_API_KEY +``` + +--- + +# 8) Observability and failure behavior + +## 8.1 Structured diagnostics + +Always emit: + +* counts: nodes/edges/entrypoints +* build outcome: success/failed/partial +* list of projects scanned/skipped +* unresolved symbol counts (entrypoints + heuristic edges) + +## 8.2 Hard failure vs partial output + +* If at least one assembly scanned, output a graph even if others fail. +* Mark diagnostics in output: + + * add `why`/`notes` (if you extend schema) OR log to stderr and let webservice record the warning on ingest. + +--- + +# 9) Why this architecture works for reachability + +* **IL extraction** provides the most faithful call edges and stable node IDs. +* **Roslyn adds what IL can’t:** framework entrypoints, DI and reflection heuristics. +* Node IDs based on **MVID+token** align with deterministic compilation behavior and runtime tooling patterns. Deterministic compilation replaces timestamp/MVID with values derived from compilation inputs, enabling stable identity under stable inputs. ([Microsoft Learn][3]) +* Roslyn MSBuildWorkspace is the canonical way to load solutions/projects with correct references and compilation options. 
([NuGet][4]) + +--- + +# 10) Implementation “must-do” checklist for the developer + +1. **Define canonical symbolKey** and implement it in: + + * IL extractor + * Roslyn symbol formatter +2. Implement `ReferenceIndexer` to map assembly identity → artifactKey. +3. Implement IL extractor: + + * nodes for method defs + * edges for call opcodes +4. Implement entrypoint detectors (controllers + minimal APIs + hosted service). +5. Implement DI + reflection heuristic edges. +6. Merge and output `CallGraph.v1.json` matching schema. +7. Add golden tests (small ASP.NET apps) verifying: + + * entrypoint detection + * at least one static path exists + * DI/reflection edges appear with correct reason codes + +--- + +If you want, I can also provide: + +* a **concrete folder layout** (`/src/StellaOps.Scanner.Worker.DotNet/...`) with project files, +* the **exact `SymbolDisplayFormat`** you should use for symbolKey stability, +* and an IL opcode decoder snippet that only records call-like opcodes (fast, minimal). 
+ +[1]: https://dotnet.microsoft.com/en-us/platform/support/policy/dotnet-core ".NET and .NET Core official support policy | .NET" +[2]: https://dotnet.microsoft.com/en-US/download/dotnet/10.0 "Download .NET 10.0 (Linux, macOS, and Windows) | .NET" +[3]: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-options/code-generation?utm_source=chatgpt.com "C# Compiler Options that control code generation" +[4]: https://www.nuget.org/packages/Microsoft.CodeAnalysis.Workspaces.MSBuild/?utm_source=chatgpt.com "Microsoft.CodeAnalysis.Workspaces.MSBuild 5.0.0" +[5]: https://www.nuget.org/packages/microsoft.codeanalysis?utm_source=chatgpt.com "Microsoft.CodeAnalysis 5.0.0" +[6]: https://www.nuget.org/profiles/RoslynTeam?utm_source=chatgpt.com "NuGet Gallery | RoslynTeam" +[7]: https://www.nuget.org/packages/mono.cecil/?utm_source=chatgpt.com "Mono.Cecil 0.11.6" +[8]: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/diagnostics-client-library?utm_source=chatgpt.com "Diagnostics client library - .NET" diff --git a/docs/product-advisories/13-Dec-2025 - Smart‑Diff - Defining Meaningful Risk Change.md b/docs/product-advisories/13-Dec-2025 - Smart‑Diff - Defining Meaningful Risk Change.md new file mode 100644 index 000000000..253b7615e --- /dev/null +++ b/docs/product-advisories/13-Dec-2025 - Smart‑Diff - Defining Meaningful Risk Change.md @@ -0,0 +1,892 @@ +Here’s a crisp, first‑time‑friendly blueprint for **Smart‑Diff**—a minimal‑noise way to highlight only changes that actually shift security risk, not every tiny SBOM/VEX delta. + +--- + +# What “Smart‑Diff” means (in plain terms) + +Smart‑Diff is the **smallest set of changes** between two builds/releases that **materially change risk**. We only surface a change when it affects exploitability or policy—not when a dev-only transitive bumped a patch with no runtime path. 
+ +**Count it as a Smart‑Diff only if at least one of these flips:** + +* **Reachability:** new reachable vulnerable code appears, or previously reachable code becomes unreachable. +* **VEX status:** a CVE’s status changes (e.g., to `not_affected`). +* **Version vs affected ranges:** a dependency crosses into/out of a known vulnerable range. +* **KEV/EPSS/Policy:** CISA KEV listing, EPSS spike, or your org policy gates change. + +Ignore: + +* CVEs that are both **unreachable** and **VEX = not_affected**. +* Pure patch‑level churn that doesn’t cross an affected range and isn’t KEV‑listed. +* Dev/test‑only deps with **no runtime path**. + +--- + +# Minimal data model (practical) + +* **DiffSet { added, removed, changed }** for packages, symbols, CVEs, and policy gates. +* **AffectedGraph { package → symbol → call‑site }**: reachability edges from entrypoints to vulnerable sinks. +* **EvidenceLink { attestation | VEX | KEV | scanner trace }** per item, so every claim is traceable. + +--- + +# Core algorithms (what makes it “smart”) + +* **Reachability‑aware set ops:** run set diffs only on **reachable** vuln findings. +* **SemVer gates:** treat “crossing an affected range” as a boolean boundary; patch bumps inside a safe range don’t alert. +* **VEX merge logic:** vendor or internal VEX that says `not_affected` suppresses noise unless KEV contradicts. +* **EPSS‑weighted priority:** rank surfaced diffs by latest EPSS; KEV always escalates to top. +* **Policy overlays:** org rules (e.g., “block any KEV,” “warn if EPSS > 0.7”) applied last. + +--- + +# Example (why it’s quieter, but safer) + +* **OpenSSL 3.0.10 → 3.0.11** with VEX `not_affected` for a CVE: Smart‑Diff marks **risk down** and **closes** the prior alert. +* A **transitive dev dependency** changes with **no runtime path**: Smart‑Diff **logs only**, no red flag. 
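The flip conditions above reduce to a pure predicate over two finding states. A minimal sketch (Python for illustration; field names are assumed, and the EPSS threshold is an org policy knob, not a fixed value):

```python
EPSS_THRESHOLD = 0.7  # assumed org policy gate, e.g. "warn if EPSS > 0.7"

def is_material_change(old, new):
    """Return (material?, reasons) given old/new finding states.

    Each state is a dict with: reachable, vex_status, in_affected_range,
    kev, epss. A change is material only if at least one gate flips.
    """
    reasons = []
    if old["reachable"] != new["reachable"]:
        reasons.append("REACHABILITY_FLIP")
    if old["vex_status"] != new["vex_status"]:
        reasons.append("VEX_FLIP")
    if old["in_affected_range"] != new["in_affected_range"]:
        reasons.append("RANGE_FLIP")
    if old["kev"] != new["kev"]:
        reasons.append("KEV_FLIP")
    if (old["epss"] >= EPSS_THRESHOLD) != (new["epss"] >= EPSS_THRESHOLD):
        reasons.append("EPSS_THRESHOLD")
    return bool(reasons), reasons
```

Because the predicate only compares booleans and a thresholded score, pure patch churn inside a safe range yields `(False, [])` and never reaches the report.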
+ +--- + +# Implementation plan (Stella Ops‑ready) + +**1) Inputs** + +* SBOM (CycloneDX/SPDX) old vs new +* VEX (OpenVEX/CycloneDX VEX) +* Vuln feeds (NVD, vendor), **CISA KEV**, **EPSS** +* Reachability traces (per language analyzers) + +**2) Normalize** + +* Map all deps to **purl**, normalize versions, index CVEs → affected ranges. +* Ingest VEX and attach to CVE ↔ component with clear status precedence. + +**3) Build graphs** + +* Generate/refresh **AffectedGraph** per build: entrypoints → call stacks → vulnerable symbols. +* Tag each finding with `{reachable?, vex_status, kev?, epss, policy_flags}`. + +**4) Diff** + +* Compute **DiffSet** between builds for: + + * Reachable findings + * VEX statuses + * Version/range crossings + * Policy/KEV/EPSS gates + +**5) Prioritize & suppress** + +* Drop items that are **unreachable AND not_affected**. +* Collapse patch‑level churn unless **KEV‑listed**. +* Sort remaining by **KEV first**, then **EPSS**, then **runtime blast‑radius** (fan‑in/fan‑out). + +**6) Evidence** + +* Attach **EvidenceLink** to each surfaced change: + + * VEX doc (line/ID) + * KEV entry + * EPSS score + timestamp + * Reachability call stack (top 1‑3 paths) + +**7) UX** + +* Pipeline‑first: output a **Smart‑Diff report JSON** + concise CLI table: + + * `risk ↑/↓`, reason (reachability/VEX/KEV/EPSS), component@version, CVE, **one** example call‑stack. +* UI is an explainer: expand to full stack, VEX note, KEV link, and “minimum safe change” suggestion. + +--- + +# Module sketch (your stack) + +* **Services:** `Sbomer.Diff`, `Vexer.Merge`, `Scanner.Reachability`, `Feedser.KEV/EPSS`, `Policy.Engine`, `SmartDiff.Service` +* **Store:** PostgreSQL (SoR), Valkey cache (ephemeral). Tables: `components`, `cves`, `vex_entries`, `reachability_edges`, `smartdiff_events`, `evidence_links`. 
+* **APIs:** + + * `POST /smartdiff/compare` → returns filtered diff + priorities + * `GET /smartdiff/:id/evidence` → links to VEX/KEV/EPSS + trace +* **CI usage:** `stella smart-diff --old sbomA.json --new sbomB.json --vex vex.json --out smartdiff.json` + +--- + +# Guardrails (to keep it deterministic) + +* Freeze feed snapshots per run (hash KEV/EPSS CSVs + VEX docs). +* Version the merge rules (VEX precedence + policy) and emit in the report header. +* Log the **exact** semver comparisons that triggered/exempted an alert. + +If you want, I can draft the **Postgres schema**, the **.NET 10 DTOs** for `DiffSet` and `AffectedGraph`, and a **CLI prototype** (`stella smart-diff`) you can drop into your pipeline. +Noted: the services are **Concelier** (feeds: KEV/EPSS/NVD/vendor snapshots) and **Excititor** (VEX merge + status resolution). I’ll use those names going forward. + +Below is a **product + business analysis implementation spec** that a developer can follow to build the Smart‑Diff capability you described. + +--- + +# 1) Product objective + +## Problem + +Classic SBOM/VEX diffs are noisy: they surface *all* dependency/CVE churn, even when nothing changes in **actual exploitable risk**. + +## Goal + +Produce a **Smart‑Diff report** between two builds/releases that highlights only changes that **materially impact security risk**, with evidence attached. + +## Success criteria + +* **Noise reduction:** >80% fewer diff items vs raw SBOM diff for typical builds (measured by count). +* **No missed “high-risk flips”:** any change that creates or removes a **reachable vulnerable path** must appear. +* **Traceability:** every surfaced Smart‑Diff item has at least **one evidence link** (VEX entry, reachability trace, KEV reference, feed snapshot hash, scanner output). + +--- + +# 2) Scope + +## In scope (MVP) + +* Compare two “build snapshots”: `{SBOM, VEX, reachability traces, vuln feed snapshot, policy snapshot}` +* Detect & report these change types: + + 1. 
**Reachability flips** (reachable ↔ unreachable) + 2. **VEX status changes** (e.g., `affected` → `not_affected`) + 3. **Version crosses vuln boundary** (safe ↔ affected range) + 4. **KEV/EPSS/policy gate flips** (e.g., becomes KEV-listed) +* Suppress noise using explicit rules (see section 6) +* Output: + + * JSON report for CI + * concise CLI output (table) + * optional UI list view (later) + +## Out of scope (for now) + +* Full remediation planning / patch PR automation +* Cross-repo portfolio aggregation (doable later) +* Advanced exploit intelligence beyond KEV/EPSS + +--- + +# 3) Key definitions (developers must implement these exactly) + +## 3.1 Finding + +A “finding” is a tuple: + +`FindingKey = (component_purl, component_version, cve_id)` + +…and includes computed fields: + +* `reachable: bool | unknown` +* `vex_status: enum` (see 3.3) +* `in_affected_range: bool | unknown` +* `kev: bool` +* `epss_score: float | null` +* `policy_flags: set` +* `evidence_links: list` + +## 3.2 Material risk change (Smart‑Diff item) + +A change is “material” if it changes the computed **RiskState** for any `FindingKey` or creates/removes a `FindingKey` that is in-scope after suppression rules. + +## 3.3 VEX status vocabulary + +Normalize all incoming VEX statuses into a fixed internal enum: + +* `AFFECTED` +* `NOT_AFFECTED` +* `FIXED` +* `UNDER_INVESTIGATION` +* `UNKNOWN` (no statement or unparseable) + +> Note: Use OpenVEX/CycloneDX VEX mappings, but internal logic must operate on the above set. + +--- + +# 4) System context and responsibilities + +You already have a modular setup. 
Developers should implement Smart‑Diff as a pipeline over these components: + +## Components (names aligned to your system) + +* **Sbomer** + + * Ingest SBOM(s), normalize to purl/version graph +* **Scanner.Reachability** + + * Produce reachability traces: entrypoints → call paths → vulnerable symbol/sink +* **Concelier** + + * Fetch + snapshot vulnerability intelligence (NVD/vendor/OSV as applicable), **CISA KEV**, **EPSS** + * Provide *feed snapshot identifiers* (hashes) per run +* **Excititor** + + * Ingest and merge VEX sources + * Resolve a final `vex_status` per (component, cve) + * Provide precedence + explanation +* **Policy.Engine** + + * Evaluate org rules against a computed finding (e.g., “block if KEV”) +* **SmartDiff.Service** + + * Compute risk states for “old” and “new” + * Diff them + * Suppress noise + * Rank + output report with evidence + +--- + +# 5) Developer deliverables + +## Deliverable A: Smart‑Diff computation library + +A deterministic library that takes: + +* `OldSnapshot` and `NewSnapshot` (see section 7) +* returns a `SmartDiffReport` + +## Deliverable B: Service endpoint + +`POST /smartdiff/compare` returns report JSON. + +## Deliverable C: CLI command + +`stella smart-diff --old --new [--policy policy.json] --out smartdiff.json` + +--- + +# 6) Smart‑Diff rules + +Developers must implement these as **explicit, testable rule functions**. + +## 6.1 Suppression rules (noise filters) + +A finding is **suppressed** if ALL apply: + +1. `reachable == false` (or `unknown` treated as false only if you explicitly decide; recommended: unknown is *not* suppressible) +2. `vex_status == NOT_AFFECTED` +3. `kev == false` +4. no policy requires it (e.g., “report all vuln findings” override) + +**Patch churn suppression** + +* If a component version changes but: + + * `in_affected_range` remains false in both versions, AND + * no KEV/policy flag flips, + * then suppress (don’t surface). 
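Conditions 1–4 and the patch-churn filter above are exactly the kind of rules that benefit from table-driven tests. A sketch of both as pure predicates (Python for illustration; field names assumed):

```python
def is_suppressed(finding, policy_requires_all=False):
    """Suppress only when ALL of conditions 1-4 hold.

    Note `is False`: an unknown reachability (None) is deliberately
    NOT suppressible, per the recommendation above.
    """
    return (
        finding["reachable"] is False
        and finding["vex_status"] == "NOT_AFFECTED"
        and not finding["kev"]
        and not policy_requires_all
    )

def is_patch_churn(old_finding, new_finding):
    """Version changed, but stayed outside affected ranges and no KEV flip."""
    return (
        old_finding["version"] != new_finding["version"]
        and old_finding["in_affected_range"] is False
        and new_finding["in_affected_range"] is False
        and old_finding["kev"] == new_finding["kev"]
    )
```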
+ +**Dev/test dependency suppression (optional if you already tag scopes)** + +* If SBOM scope indicates `dev/test` AND `reachable == false`, suppress. +* If reachability is unknown, do **not** suppress by scope alone (avoid false negatives). + +## 6.2 Material change detection rules + +Surface a Smart‑Diff item when any of the following changes between old and new: + +### Rule R1: Reachability flip + +* `reachable` changes: `false → true` (risk ↑) or `true → false` (risk ↓) +* Include at least one call path as evidence if reachable is true. + +### Rule R2: VEX status flip + +* `vex_status` changes meaningfully: + + * `AFFECTED ↔ NOT_AFFECTED` + * `UNDER_INVESTIGATION → NOT_AFFECTED` etc. +* Changes involving `UNKNOWN` should be shown but ranked lower unless KEV. + +### Rule R3: Affected range boundary + +* `in_affected_range` flips: + + * `false → true` (risk ↑) + * `true → false` (risk ↓) +* This is the main guard against patch churn noise. + +### Rule R4: Intelligence / policy flip + +* `kev` changes `false → true` or `epss_score` crosses a configured threshold +* any `policy_flag` changes severity (warn → block) + +--- + +# 7) Snapshot contract (what Smart‑Diff compares) + +Define a stable internal format: + +```json +{ + "snapshot_id": "build-2025.12.14+sha.abc123", + "created_at": "2025-12-14T12:34:56Z", + "sbom": { "...": "CycloneDX or SPDX raw" }, + "vex_documents": [ { "...": "OpenVEX/CycloneDX VEX raw" } ], + "reachability": { + "analyzer": "java-callgraph@1.2.0", + "entrypoints": ["com.app.Main#main"], + "paths": [ + { + "component_purl": "pkg:maven/org.example/foo@1.2.3", + "cve": "CVE-2024-1234", + "sink": "org.example.foo.VulnClass#vulnMethod", + "callstack": ["...", "..."] + } + ] + }, + "concelier_feed_snapshot": { + "kev_hash": "sha256:...", + "epss_hash": "sha256:...", + "vuln_db_hash": "sha256:..." + }, + "policy_snapshot": { "policy_hash": "sha256:...", "rules": [ ... 
] } +} +``` + +**Implementation note** + +* SBOM/VEX can remain “raw”, but you must also build normalized indexes (in-memory or stored) for diffing. + +--- + +# 8) Data normalization requirements + +## 8.1 Component identity + +* Use **purl** as canonical component ID. +* Normalize casing, qualifiers, and version string normalization per ecosystem. + +## 8.2 Vulnerability identity + +* Use `CVE-*` as primary key where available. +* If you ingest OSV IDs too, map them to CVE when possible but keep OSV ID in evidence. + +## 8.3 Affected range evaluation + +Implement: +`bool? IsVersionInAffectedRange(version, affectedRanges)` + +Return `null` (unknown) if version cannot be parsed or range semantics are unknown. + +--- + +# 9) Excititor: VEX merge requirements + +Developers should implement Excititor as a deterministic resolver: + +## 9.1 Inputs + +* List of VEX documents, each with metadata: + + * `source` (vendor/internal/scanner) + * `issued_at` + * `signature/attestation` info (if present) + +## 9.2 Output + +For each `(component_purl, cve_id)`: + +* `final_status` +* `winning_statement_id` +* `precedence_reason` +* `all_statements[]` (for audit) + +## 9.3 Precedence rules (recommendation) + +Implement as ordered priority (highest wins), unless overridden by your org: + +1. **Internal signed VEX** (security team attested) +2. **Vendor signed VEX** +3. **Internal unsigned VEX** +4. **Scanner/VEX-like annotations** +5. None → `UNKNOWN` + +Conflict handling: + +* If two same-priority statements disagree, pick newest by `issued_at`, but **record conflict** and surface it as a low-priority Smart‑Diff meta-item (optional). + +--- + +# 10) Concelier: feed snapshot requirements + +Concelier must provide deterministic inputs to Smart‑Diff. 
+ +## 10.1 What Concelier stores + +* KEV list snapshot +* EPSS snapshot +* Vulnerability database snapshot (your choice: NVD mirror, OSV, vendor advisories) + +## 10.2 Required APIs (internal) + +* `GET /concelier/snapshots/latest` +* `GET /concelier/snapshots/{hash}` +* `GET /concelier/kev/{snapshotHash}/is_listed?cve=CVE-...` +* `GET /concelier/epss/{snapshotHash}/score?cve=CVE-...` + +## 10.3 Determinism + +Smart‑Diff report must include the snapshot hashes used, so the result can be reproduced. + +--- + +# 11) RiskState computation (core dev logic) + +Implement a pure function: + +`RiskState ComputeRiskState(FindingKey key, Snapshot snapshot)` + +### Inputs used + +* SBOM: to confirm component exists, scope, runtime path +* Concelier feeds: KEV, EPSS, affected ranges +* Excititor: VEX status +* Reachability analyzer output +* Policy engine: flags based on org rules + +### Output + +```json +{ + "finding_key": { "purl": "...", "version": "...", "cve": "..." }, + "reachable": true, + "vex_status": "AFFECTED", + "in_affected_range": true, + "kev": false, + "epss": 0.42, + "policy": { + "decision": "WARN|BLOCK|ALLOW", + "flags": ["epss_over_0_4"] + }, + "evidence": [ + { "type": "reachability_trace", "ref": "trace:abc", "detail": "short call stack..." }, + { "type": "vex", "ref": "openvex:doc123#stmt7" }, + { "type": "concelier_snapshot", "ref": "sha256:..." } + ] +} +``` + +--- + +# 12) Diff engine specification + +## 12.1 Inputs + +* `OldRiskStates: map` +* `NewRiskStates: map` + +You build these maps by: + +1. Enumerating candidate findings in each snapshot: + + * from vulnerability matching against SBOM components (affected ranges) + * plus any VEX statements referencing components +2. Joining with reachability traces +3. Resolving status via Excititor +4. 
Applying Concelier intelligence + policy + +## 12.2 Diff output types + +Return `SmartDiffItem` with: + +* `change_type`: `ADDED|REMOVED|CHANGED` +* `risk_direction`: `UP|DOWN|NEUTRAL` +* `reason_codes`: `[REACHABILITY_FLIP, VEX_FLIP, RANGE_FLIP, KEV_FLIP, POLICY_FLIP, EPSS_THRESHOLD]` +* `old_state` / `new_state` +* `priority_score` +* `evidence_links[]` + +## 12.3 Suppress AFTER diff, not before + +Important: compute diff on full sets, then suppress items by rules, because: + +* suppression itself can flip (e.g., VEX becomes `not_affected` → item disappears, which is meaningful as “risk down”). + +--- + +# 13) Priority scoring & ranking + +Implement a deterministic score: + +### Hard ordering + +1. `kev == true` in new state → top tier +2. Reachable in new state (`reachable == true`) → next tier + +### Numeric scoring (example) + +``` +score = + + 1000 if new.kev + + 500 if new.reachable + + 200 if reason includes RANGE_FLIP to affected + + 150 if VEX_FLIP to AFFECTED + + 0..100 based on EPSS (epss * 100) + + policy weight: +300 if decision BLOCK, +100 if WARN +``` + +Always include `score_breakdown` in report for explainability. 
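The scoring formula above can be implemented directly as a pure function that also emits the required `score_breakdown`. A sketch (Python for illustration; the weights are the example values from this section, not a tuned model):

```python
def priority_score(new_state, reasons):
    """Deterministic priority per the example formula; returns (score, breakdown)."""
    breakdown = {}
    if new_state["kev"]:
        breakdown["kev"] = 1000
    if new_state["reachable"]:
        breakdown["reachable"] = 500
    if "RANGE_FLIP" in reasons and new_state["in_affected_range"]:
        breakdown["range_flip_to_affected"] = 200
    if "VEX_FLIP" in reasons and new_state["vex_status"] == "AFFECTED":
        breakdown["vex_flip_to_affected"] = 150
    breakdown["epss"] = round((new_state.get("epss") or 0.0) * 100)
    decision = new_state.get("policy_decision")
    if decision == "BLOCK":
        breakdown["policy"] = 300
    elif decision == "WARN":
        breakdown["policy"] = 100
    return sum(breakdown.values()), breakdown
```

Keeping the breakdown as a plain dict makes it trivial to serialize into the report and to assert on in the determinism tests.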
+ +--- + +# 14) Evidence requirements (must implement) + +Every Smart‑Diff item must include **at least one** evidence link, and ideally 2–4: + +EvidenceLink schema: + +```json +{ + "type": "vex|reachability|kev|epss|scanner|sbom|policy", + "ref": "stable identifier", + "summary": "one-line human readable", + "blob_hash": "sha256 of raw evidence payload (optional)" +} +``` + +Examples: + +* `type=kev`: ref is `concelier:kev@{snapshotHash}#CVE-2024-1234` +* `type=reachability`: ref is `reach:{snapshotId}:{traceId}` +* `type=vex`: ref is `openvex:{docHash}#statement:{id}` + +--- + +# 15) API specification + +## 15.1 Compare endpoint + +`POST /smartdiff/compare` + +Request: + +```json +{ + "old_snapshot_id": "buildA", + "new_snapshot_id": "buildB", + "options": { + "include_suppressed": false, + "max_items": 200, + "epss_threshold": 0.7 + } +} +``` + +Response: + +```json +{ + "report_id": "smartdiff:2025-12-14:xyz", + "old": { "snapshot_id": "buildA", "feed_hashes": { ... } }, + "new": { "snapshot_id": "buildB", "feed_hashes": { ... } }, + "summary": { + "risk_up": 3, + "risk_down": 8, + "reachable_new": 2, + "kev_new": 1, + "suppressed": 143 + }, + "items": [ + { + "change_type": "CHANGED", + "risk_direction": "UP", + "priority_score": 1680, + "reason_codes": ["REACHABILITY_FLIP","RANGE_FLIP"], + "finding_key": { + "purl": "pkg:maven/org.example/foo", + "version_old": "1.2.3", + "version_new": "1.2.4", + "cve": "CVE-2024-1234" + }, + "old_state": { "...": "RiskState" }, + "new_state": { "...": "RiskState" }, + "evidence": [ ... ] + } + ] +} +``` + +## 15.2 Evidence endpoint + +`GET /smartdiff/{report_id}/evidence/{evidence_ref}` + +Returns raw stored evidence (or a signed URL if you store blobs elsewhere). 
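Building the EvidenceLink objects from section 14 is mostly a matter of deterministic hashing. A sketch of a constructor that fills the optional `blob_hash` when a raw payload is available (Python for illustration; helper name is hypothetical):

```python
import hashlib
import json

def make_evidence_link(type_, ref, summary, raw_payload=None):
    """Build an EvidenceLink; blob_hash is sha256 of the raw evidence bytes.

    JSON payloads are canonicalized (sorted keys) so the same evidence
    always hashes to the same value, which keeps reports reproducible.
    """
    link = {"type": type_, "ref": ref, "summary": summary[:140]}
    if raw_payload is not None:
        blob = (raw_payload if isinstance(raw_payload, bytes)
                else json.dumps(raw_payload, sort_keys=True).encode("utf-8"))
        link["blob_hash"] = "sha256:" + hashlib.sha256(blob).hexdigest()
    return link
```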
+ +--- + +# 16) CLI behavior + +Command: + +``` +stella smart-diff \ + --old ./snapshots/buildA \ + --new ./snapshots/buildB \ + --policy ./policy.json \ + --out ./smartdiff.json +``` + +CLI output (human): + +* Summary line: `risk ↑ 3 | risk ↓ 8 | new reachable 2 | new KEV 1` +* Then top N items sorted by priority, each one line: + + * `↑ REACHABILITY_FLIP foo@1.2.4 CVE-2024-1234 (EPSS 0.42) path: Main→...→vulnMethod` + +Exit code: + +* `0` if policy decision overall is ALLOW/WARN +* `2` if any item triggers policy BLOCK in new snapshot (configurable) + +--- + +# 17) Storage schema (Postgres) — implementation-ready + +You can implement in a single schema to start; split later. + +## Core tables + +### `snapshots` + +* `snapshot_id (pk)` +* `created_at` +* `sbom_hash` +* `policy_hash` +* `kev_hash` +* `epss_hash` +* `vuln_db_hash` +* `metadata jsonb` + +### `components` + +* `component_id (pk)` (internal UUID) +* `snapshot_id (fk)` +* `purl` +* `version` +* `scope` (runtime/dev/test/unknown) +* `direct bool` +* indexes on `(snapshot_id, purl)` and `(purl, version)` + +### `findings` + +* `finding_id (pk)` +* `snapshot_id (fk)` +* `purl` +* `version` +* `cve` +* `reachable bool null` +* `vex_status text` +* `in_affected_range bool null` +* `kev bool` +* `epss real null` +* `policy_decision text` +* `policy_flags text[]` +* index `(snapshot_id, purl, cve)` + +### `reachability_traces` + +* `trace_id (pk)` +* `snapshot_id (fk)` +* `purl` +* `cve` +* `sink` +* `callstack jsonb` +* index `(snapshot_id, purl, cve)` + +### `vex_statements` + +* `stmt_id (pk)` +* `snapshot_id (fk)` +* `purl` +* `cve` +* `source` +* `issued_at` +* `status` +* `doc_hash` +* `raw jsonb` +* index `(snapshot_id, purl, cve)` + +### `smartdiff_reports` + +* `report_id (pk)` +* `created_at` +* `old_snapshot_id` +* `new_snapshot_id` +* `options jsonb` +* `summary jsonb` + +### `smartdiff_items` + +* `item_id (pk)` +* `report_id (fk)` +* `change_type` +* `risk_direction` +* `priority_score` +* 
`reason_codes text[]` +* `purl` +* `cve` +* `old_version` +* `new_version` +* `old_state jsonb` +* `new_state jsonb` + +### `evidence_links` + +* `evidence_id (pk)` +* `report_id (fk)` +* `item_id (fk)` +* `type` +* `ref` +* `summary` +* `blob_hash` + +--- + +# 18) Implementation plan (developer-focused) + +## Phase 1 — MVP (end-to-end working) + +1. **Normalize SBOM** + + * Parse CycloneDX/SPDX + * Build `components` list with purl + version + scope +2. **Concelier integration** + + * Load KEV + EPSS snapshots (even from local files initially) + * Expose snapshot hashes +3. **Excititor integration** + + * Parse OpenVEX/CycloneDX VEX + * Implement precedence rules and output `final_status` +4. **Affected range matching** + + * For each component, query vulnerability DB snapshot for affected ranges + * Produce candidate findings `(purl, version, cve)` +5. **Reachability ingestion** + + * Accept reachability JSON traces (even if generated elsewhere initially) + * Mark `reachable=true` when trace exists for (purl,cve) +6. **Compute RiskState** + + * For each finding compute `kev`, `epss`, `policy_decision` +7. **Diff + suppression + ranking** + + * Generate `SmartDiffReport` +8. **Outputs** + + * JSON report + CLI table + * Store report + items in Postgres + +Acceptance tests for Phase 1: + +* Given a known pair of snapshots, Smart‑Diff only includes: + + * reachable vulnerable changes + * VEX flips + * affected range boundary flips + * KEV flips +* Patch churn not crossing ranges is absent. 
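The acceptance criterion "patch churn not crossing ranges is absent" hinges on the range check from step 4 (section 8.3's `IsVersionInAffectedRange`). A sketch assuming dotted-integer versions with `[introduced, fixed)` semantics — real ecosystems (maven, npm, deb, …) each need their own comparator:

```python
def _parse(v):
    """Parse 'x.y.z' into a comparable tuple; None if unparseable."""
    if v is None:
        return None
    try:
        return tuple(int(part) for part in v.split("."))
    except ValueError:
        return None

def is_version_in_affected_range(version, affected_ranges):
    """Return True/False, or None (unknown) when parsing fails."""
    parsed = _parse(version)
    if parsed is None:
        return None
    verdicts = []
    for rng in affected_ranges:
        lo = _parse(rng.get("introduced", "0"))
        hi = _parse(rng.get("fixed"))  # missing 'fixed' => open-ended range
        if lo is None:
            return None
        verdicts.append(lo <= parsed and (hi is None or parsed < hi))
    return any(verdicts)
```

Returning `None` rather than guessing is what lets edge-case rule 2 ("version parse failures") stay honest downstream.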
+ +## Phase 2 — Determinism & evidence hardening + +* Store raw evidence blobs (VEX doc hash, trace payload hash) +* Ensure feed snapshots are immutable and referenced by hash +* Add `score_breakdown` +* Add conflict surfacing for VEX merge + +## Phase 3 — Performance & scale + +* Incremental computation (only recompute affected components changed) +* Cache Concelier lookups by `(snapshotHash, cve)` +* Batch range matching queries +* Add pagination and `max_items` enforcement + +--- + +# 19) Edge cases developers must handle + +1. **Reachability unknown** + + * If no analyzer output exists, set `reachable = null` + * Do not suppress solely based on `reachable=null` +2. **Version parse failures** + + * `in_affected_range = null` + * Surface range-related changes only when one side is determinable +3. **Component renamed / purl drift** + + * Consider purl normalization rules (namespace casing, qualifiers) + * If purl changes but is same artifact, treat as new component (unless you implement alias mapping later) +4. **Multiple CVE sources / duplicates** + + * Deduplicate by CVE ID per component+version +5. **Conflicting VEX statements** + + * Pick winner deterministically, but log conflict evidence +6. **KEV listed but VEX says not affected** + + * Still suppress? Recommended: + + * Do **not** suppress; surface as “KEV listed but VEX not_affected” and rank high (KEV tier) +7. 
**Policy config changes** + + * Treat policy hash difference as a diff dimension; surface “policy flip” items even if underlying vuln unchanged + +--- + +# 20) Testing strategy (must implement) + +## Unit tests + +* SemVer compare + affected range evaluation +* Excititor precedence resolution +* Suppression rules (table-driven tests) +* Priority scoring determinism + +## Integration tests + +* Build synthetic snapshots: + + * A: vuln present, unreachable, VEX not_affected + * B: same vuln reachable + * Assert Smart‑Diff surfaces exactly one item with `REACHABILITY_FLIP` +* KEV flip test: + + * Same findings, KEV list changes between Concelier snapshots + * Assert item surfaces with `KEV_FLIP` + +## Regression suite + +Keep a folder of snapshot pairs and expected outputs: + +* `fixtures/snapA`, `fixtures/snapB`, `expected.smartdiff.json` + +--- + +# 21) What the developer should code first (practical order) + +1. DTOs: + + * `Snapshot`, `Component`, `VexStatement`, `ReachTrace`, `FindingKey`, `RiskState`, `SmartDiffItem`, `SmartDiffReport` +2. Pure functions: + + * `NormalizePurl` + * `IsVersionInAffectedRange` + * `ResolveVexStatus` (Excititor) + * `ComputeRiskState` + * `DiffRiskStates` + * `ApplySuppression` + * `ScoreAndRank` +3. Persistence: + + * store snapshots and computed findings +4. 
API + CLI wrappers + +--- + +If you want, I can also provide: + +* a **concrete JSON Schema** for `SmartDiffReport` +* **C# (.NET 10) interfaces + class skeletons** for `ConcelierClient`, `ExcititorResolver`, and `SmartDiffService` +* a **fixture set** (sample SBOM/VEX/reach traces) to bootstrap the test suite diff --git a/docs/product-advisories/14-Dec-2025 - Add a dedicated “first_signal” event.md b/docs/product-advisories/14-Dec-2025 - Add a dedicated “first_signal” event.md new file mode 100644 index 000000000..69fb49ac4 --- /dev/null +++ b/docs/product-advisories/14-Dec-2025 - Add a dedicated “first_signal” event.md @@ -0,0 +1,1295 @@ +Here’s a lightweight pattern to make failures show up instantly while keeping backends decoupled: **emit a tiny, versioned event the moment you know something failed**, and attach pointers to heavier evidence that can arrive later. + +--- + +# Why this helps + +* **UI reacts in real time**: show “Failed at Step X (E123)” immediately—no waiting for logs, SBOMs, or artifacts to upload/process. +* **Backends evolve safely**: logs, traces, SBOM/VEX, heap dumps, etc., can change format or arrive out of order without breaking the UI contract. +* **Deterministic UX**: a small, stable schema prevents flaky pipelines from blocking visibility. +* **Great for air‑gapped/offline**: the tiny event rides your internal bus/storage; bulky payloads sync or materialize when available. + +--- + +# The event itself (keep it tiny) + +**Fields (stable, versioned):** + +* `v` — schema version (e.g., `1`). +* `ts` — event timestamp (UTC, ISO 8601). +* `run_id` — pipeline/execution correlation ID. +* `stage` — coarse phase (e.g., `fetch`, `build`, `scan`, `policy`, `deploy`). +* `step` — fine-grained step (e.g., `trivy-scan`, `dotnet-restore`). +* `status` — `fail|warn|pass|info` (for this pattern, you’ll use `fail`). +* `error_class` — stable classifier (e.g., `NETWORK_DNS`, `AUTH_EXPIRED`, `POLICY_BLOCK`, `VULN_REACHABLE`). 
+* `summary` — short human string (“Reachable vuln blocks release”). +* `pointers` — array of *opaque, resolvable references* (log offsets, artifact URIs, attestation IDs). +* `kv` — optional tiny key/values for quick filtering (e.g., `severity=A`, `package=openssl`). +* `sig` (optional) — detached/inline signature (DSSE) for integrity. + +**Example** + +```json +{ + "v": 1, + "ts": "2025-12-13T12:10:03Z", + "run_id": "run_7f3c6a8", + "stage": "policy", + "step": "vex-gate", + "status": "fail", + "error_class": "VULN_REACHABLE", + "summary": "Reachable CVE blocks release", + "pointers": [ + {"type":"log", "ref":"logs://scanner/7f3c6a8#L1423-L1480"}, + {"type":"attestation", "ref":"rekor://sha256:…"}, + {"type":"sbom", "ref":"artifact://sbom/cyclonedx@run_7f3c6a8.json"} + ], + "kv": {"cve":"CVE-2025-12345", "component":"openssl", "severity":"A"} +} +``` + +--- + +# UI behavior (instant, then enrich) + +1. **Instant render** (sub-200 ms): show a red card with stage/step, `error_class`, and `summary`. +2. **Progressive hydration**: as pointers resolve, add: + + * “View log excerpt” (jump to `#L1423-L1480`) + * “Open attestation” (verify DSSE/Rekor) + * “Inspect SBOM diff” (component → version → call‑graph) +3. **Stable affordances**: UI never breaks if a pointer is slow/missing; it just shows a spinner or “awaiting evidence”. + +--- + +# Backend contract + +* **Publish early**: emit on first knowledge of failure (e.g., non‑zero exit, policy deny, TLS error). +* **Don’t embed heavy data**: only pointers or tiny facts for filters. +* **Pointer resolution is pluggable**: files, object storage, Postgres row, Valkey cache key, Rekor entry—whatever suits the deployment. +* **Version discipline**: bump `v` only for breaking schema changes; additive fields are fine. 
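The "publish early, pointers only" contract fits in one small helper. A sketch of the `FailNow`-style emitter from the checklist below it (Python for illustration; the .NET library would mirror this shape):

```python
import datetime

def fail_now(run_id, stage, step, error_class, summary, pointers=(), kv=None):
    """Build a v1 tiny failure event on first knowledge of failure.

    Only pointers and tiny key/values go in; heavy evidence (logs, SBOMs,
    dumps) arrives later via the resolvable refs. Summary is clamped so
    producers can never blow the size budget through the human string.
    """
    ts = (datetime.datetime.now(datetime.timezone.utc)
          .isoformat(timespec="seconds").replace("+00:00", "Z"))
    return {
        "v": 1,
        "ts": ts,
        "run_id": run_id,
        "stage": stage,
        "step": step,
        "status": "fail",
        "error_class": error_class,
        "summary": summary[:140],
        "pointers": list(pointers),
        "kv": dict(kv or {}),
    }
```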
+ +--- + +# Minimal topic map (so teams agree on names) + +* `stage`: `fetch|build|scan|policy|sign|package|deploy` +* `error_class` suggestions: + + * Infra: `NETWORK_DNS`, `NETWORK_TIMEOUT`, `REGISTRY_403`, `DISK_FULL` + * AuthN/Z: `AUTH_EXPIRED`, `TOKEN_SCOPE_MISS` + * Supply chain: `ATTESTATION_MISSING`, `SIGNATURE_INVALID`, `SBOM_STALE` + * Secure build: `POLICY_BLOCK`, `VULN_REACHABLE`, `MALWARE_FLAG` + * Runtime: `IMAGE_DRIFT`, `PROVENANCE_MISMATCH` + +Keep each to a 1–2 line definition in a shared doc. + +--- + +# Drop‑in for Stella Ops (tailored) + +* **Emitter**: `StellaOps.Events` (tiny .NET lib) used by Scanner/Policy/Scheduler to publish `TinyFailureEvent`. +* **Transport**: Postgres notify (default) + Valkey pub/sub accelerator. (Matches your Postgres+Valkey architecture choice.) +* **Resolver service**: `EvidenceGateway` that turns `pointers` into viewable slices (log excerpts, SBOM component focus, Rekor proof). +* **UI**: “Failure Feed” panel shows cards from the event stream; detail drawer resolves pointers on demand. +* **Signing**: optional DSSE for events; Rekor (or mirror) for attestations—your “Proof‑Linked” moat. +* **Air‑gap**: pointers use `artifact://` and `row://` schemes resolvable entirely on‑prem. + +--- + +# Quick implementation checklist + +* Define `TinyFailureEvent` schema v1 and `error_class` registry. +* Add emit helpers for each module (`FailNow(summary, error_class, pointers, kv)`). +* Build `EvidenceGateway.Resolve(pointer)` handlers. +* UI: render card instantly; hydrate sections as resolvers return. +* Telemetry: metrics on TTF**E** (Time‑To‑Failure‑Event) and pointer hydration latencies. +* Docs: 1‑page contract; examples for each error_class. + +If you want, I can draft the .NET 10 interfaces (`ITinyEventEmitter`, resolvers, and a small Razor/Angular card) and a Postgres schema you can paste into your repo. 
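One pleasant property of the "Postgres notify (default)" transport: Postgres caps `NOTIFY` payloads at 8000 bytes by default, which lines up with the tiny-event budget and enforces the "don't embed heavy data" rule at the transport layer. A sketch of the serialization step a publisher would run before `SELECT pg_notify('failure_events', $1)` (channel name is an assumption):

```python
import json

PG_NOTIFY_LIMIT = 8000  # Postgres default NOTIFY payload cap, in bytes

def to_notify_payload(event):
    """Serialize a tiny event for pg_notify.

    Compact separators and sorted keys keep the payload small and
    deterministic; oversize events fail loudly instead of being
    silently truncated, pushing bulk data back into pointers.
    """
    payload = json.dumps(event, separators=(",", ":"), sort_keys=True)
    if len(payload.encode("utf-8")) > PG_NOTIFY_LIMIT:
        raise ValueError(
            "tiny event exceeds pg_notify payload limit; move data into pointers")
    return payload
```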
+Below is a **PM-grade implementation spec** for “Real-time Failure Signaling” using **Tiny Failure Events** + **Evidence Pointers**, written so engineers can build it without guessing. + +--- + +# Product: Real-time Failure Signaling (Tiny Failure Events) + +## Goal + +When any pipeline run fails, users must see **what failed and where** (stage/step + error class + short summary) **immediately**, even if logs/SBOM/attestations are delayed, huge, or unavailable. + +The UI must render a failure card from a tiny event and then progressively enrich with evidence as it becomes resolvable. + +## Outcomes we must deliver + +1. **Instant visibility:** “Failed at Step X” appears within seconds of failure. +2. **Decoupling:** UI depends only on a stable tiny schema, not on log formats/artifact structures. +3. **Evidence linking:** Users can open logs/SBOM/attestations when available, via pointers. +4. **Reliability:** Duplicate/out-of-order events don’t break the UI; state remains consistent. +5. **Security:** Evidence access is authorized; pointers do not leak sensitive info. + +--- + +# Scope + +## In scope (MVP) + +* Emit **TinyFailureEvent v1** on first detected failure for a step. +* Transport events in near real-time to UI. +* Store events durably and allow UI to fetch a run’s event timeline. +* Support evidence pointers for: + + * logs (excerptable) + * artifacts (SBOM, reports) + * attestations (provenance/signature) +* UI: + + * show run timeline + * show failure card instantly + * hydrate evidence sections on demand (or automatically where feasible) + +## Out of scope (MVP) + +* Full trace viewer / distributed tracing UI (we can link to external trace systems via pointer). +* Automated remediation (“fix it”) actions. +* Full-blown case management. + +--- + +# Key terms and definitions + +* **Run:** A single execution of a pipeline. Identified by `run_id`. +* **Stage:** Coarse lifecycle phase (`fetch`, `build`, `scan`, `policy`, `sign`, `package`, `deploy`). 
+* **Step:** A concrete activity within a stage (`dotnet-restore`, `trivy-scan`, `vex-gate`). +* **Tiny Failure Event:** A small message representing “this step failed”, including stable classification and references to evidence. +* **Pointer:** An opaque reference that can be resolved into evidence content or a link later. + +--- + +# User stories and acceptance criteria + +## Story 1: I see failure instantly + +**As a** developer +**I want** to see which step failed immediately +**So that** I don’t wait on logs/artifacts + +**Acceptance criteria** + +* When a step fails, the UI updates within **≤ 2 seconds p95** from the time the orchestrator/runner detects failure. +* The failure card includes: + + * stage, step + * error class + * human summary + * timestamp + * (optional) primary key/value details (e.g., CVE, severity) + +## Story 2: I can open evidence when available + +**As a** release engineer +**I want** to click evidence links (logs/SBOM/attestation) +**So that** I can diagnose/root-cause + +**Acceptance criteria** + +* Failure card shows evidence sections as: + + * **Available** (clickable) + * **Pending** (spinner / “awaiting evidence”) + * **Unavailable** (“not produced” or “access denied”) +* Clicking log evidence opens an excerpt view, not a 500MB file download. +* Evidence access enforces authorization (same as run access). + +## Story 3: Events are robust to duplicates/out-of-order + +**As a** user +**I want** the timeline to remain correct +**Even if** event delivery is at-least-once + +**Acceptance criteria** + +* UI displays exactly one current “failed” state per step attempt. +* Duplicate events do not create duplicate cards. +* Out-of-order arrival does not revert a step from fail → pass. + +--- + +# Functional requirements (what developers must build) + +## FR1: TinyFailureEvent schema v1 + +### Required fields + +All producers MUST emit events that validate against this schema. 
+
+```json
+{
+  "v": 1,
+  "event_id": "evt_01J…",
+  "ts": "2025-12-13T12:10:03.123Z",
+  "run_id": "run_7f3c6a8",
+  "stage": "policy",
+  "step": "vex-gate",
+  "attempt": 1,
+  "status": "fail",
+  "error_class": "VULN_REACHABLE",
+  "summary": "Reachable CVE blocks release",
+  "pointers": [],
+  "kv": {}
+}
+```
+
+### Field definitions & constraints
+
+* `v` (int, required): must be `1` for this spec.
+* `event_id` (string, required): globally unique.
+
+  * Format: `evt_<ULID>` (ULID recommended for time-sortable IDs).
+* `ts` (RFC3339 UTC, required): creation timestamp.
+* `run_id` (string, required): stable correlation id for run.
+* `stage` (enum string, required): one of:
+
+  * `fetch|build|scan|policy|sign|package|deploy|runtime`
+* `step` (string, required): lowercase kebab-case recommended; max 80 chars.
+* `attempt` (int, required): starts at 1; increments for retries.
+* `status` (enum string, required for this feature): `fail` (MVP supports fail only; schema allows later expansion)
+* `error_class` (string, required): stable classifier from a shared registry (see FR2).
+
+  * max 64 chars; uppercase snake-case.
+* `summary` (string, required): human readable, max 140 chars.
+* `pointers` (array, optional): max 20 items; each item is a `Pointer` object (see FR3).
+* `kv` (object, optional): small metadata map for filtering.
+
+  * max 20 keys
+  * key max 32 chars; value max 120 chars
+  * no nested objects/arrays
+
+### Size limits
+
+* Entire event payload MUST be ≤ **8 KB** serialized JSON.
+* If producers exceed limits, they MUST truncate `summary` and drop low-priority `kv` keys before failing the emission.
+
+---
+
+## FR2: Error class registry (stable contract)
+
+We maintain a canonical list of `error_class` values in a shared repo/module.
+
+### Requirements
+
+* Each `error_class` MUST have:
+
+  * name (e.g., `NETWORK_DNS`)
+  * short description
+  * severity mapping (optional)
+  * recommended remediation hints (optional, can be UI-side)
+* Producers MUST use a registry value if applicable.
+* Producers MAY emit `error_class="UNKNOWN"` if no mapping exists, but must log a warning and increment a metric.
+
+### Initial registry (minimum)
+
+Infra/Network:
+
+* `NETWORK_DNS`
+* `NETWORK_TIMEOUT`
+* `DISK_FULL`
+
+Auth:
+
+* `AUTH_EXPIRED`
+* `REGISTRY_403`
+
+Supply chain:
+
+* `SIGNATURE_INVALID`
+* `ATTESTATION_MISSING`
+* `SBOM_MISSING`
+
+Policy/Security:
+
+* `POLICY_BLOCK`
+* `VULN_REACHABLE`
+* `MALWARE_FLAG`
+
+Runner/Orchestrator:
+
+* `STEP_TIMEOUT`
+* `RUN_ABORTED`
+* `WORKER_LOST`
+
+---
+
+## FR3: Evidence pointer format and rules
+
+### Pointer object schema
+
+```json
+{
+  "type": "log|artifact|attestation|url|trace",
+  "ref": "logs://scanner/run_7f3c6a8#L1423-L1480",
+  "mime": "text/plain",
+  "label": "Scanner log excerpt",
+  "expires_at": "2025-12-20T00:00:00Z",
+  "sha256": "optional hex"
+}
+```
+
+### Rules
+
+* `type` and `ref` are required.
+* `ref` is opaque to UI; UI passes it to the resolver service.
+* `label` is optional, but strongly recommended for UI friendliness.
+* `expires_at` is optional; if present UI should show “may expire”.
+* `sha256` optional for immutability verification (artifacts/attestations especially).
+
+### Allowed schemes (MVP)
+
+* `logs://<service>/<run_id>#Lx-Ly`
+* `artifact://<store>/<artifact_id>@<digest>`
+* `attestation://<store>/<attestation_id>`
+* `url://<internal-url>` (only internal allowed; resolver enforces)
+* `trace://<system>/<trace_id>`
+
+### Security constraints
+
+* Pointers MUST NOT embed secrets (tokens, passwords).
+* Any pointer that could expose sensitive data MUST be resolvable only through the Evidence Gateway (FR6), never directly client-side.
+* The resolver MUST enforce authorization for the requesting user.
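A minimal sketch of how the Evidence Gateway might split an opaque `ref` into a scheme and body before routing it to a resolver. The function and type names here are illustrative assumptions, not part of the contract; the only contract-level facts used are the allowed scheme list and the `#Lx-Ly` line-range fragment for logs.

```typescript
// Illustrative parser: scheme + body (+ optional line range) from an opaque ref.
type ParsedRef = { scheme: string; body: string; lineRange?: [number, number] };

const ALLOWED_SCHEMES = new Set(['logs', 'artifact', 'attestation', 'url', 'trace']);

function parseRef(ref: string): ParsedRef | null {
  // <scheme>://<body> with an optional #Lx-Ly fragment for log excerpts
  const m = /^([a-z]+):\/\/([^#]*)(?:#L(\d+)-L(\d+))?$/.exec(ref);
  if (!m) return null; // malformed -> resolver returns a safe 'error', never a 500
  const [, scheme, body, from, to] = m;
  if (!ALLOWED_SCHEMES.has(scheme)) return null; // unknown scheme rejected
  const parsed: ParsedRef = { scheme, body };
  if (from && to) parsed.lineRange = [Number(from), Number(to)];
  return parsed;
}
```

Keeping the `ref` opaque to the UI and centralizing parsing like this is what lets the scheme list evolve server-side without UI changes.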
+ +--- + +## FR4: Emission rules (when and how events are produced) + +### When to emit + +Producers MUST emit a TinyFailureEvent when: + +1. A step exits non-zero. +2. A policy decision is “deny/block”. +3. A required artifact/attestation is missing at gate time. +4. A step times out. +5. The worker is lost (emitted by orchestrator watchdog). + +### Exactly-once vs at-least-once + +* Transport can be **at-least-once**. +* Consumers MUST be idempotent using `(run_id, stage, step, attempt, status)` + `event_id`. + +### One failure event per step attempt + +* For a given `(run_id, stage, step, attempt)`: + + * First emitted `status=fail` is canonical. + * Later fail events for the same tuple are treated as “updates” only if they add pointers/kv (see FR5). + +### Updates / enrichment + +We support enrichment without breaking “tiny”: + +* Producers MAY emit a second event **with the same tuple** (run_id/stage/step/attempt/status) that adds pointers or kv after the initial fail. +* Consumers MUST merge pointers (dedupe identical `type+ref`) and merge kv (new keys overwrite old keys). +* Producers MUST NOT spam; max 3 enrichment events per tuple. + +--- + +## FR5: Event storage and aggregation + +### Required services/components + +1. **Event Ingest** (API or internal library endpoint) +2. **Event Store** (durable DB table) +3. **Realtime Fanout** (pub/sub channel) +4. **Run Timeline API** (query per run) + +### Behavior + +* On ingest: + + * Validate schema (reject invalid with 400/validation error). + * Persist to event store. + * Publish to realtime channel. 
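The FR4 idempotency rules (at-least-once transport, dedupe by `event_id`, later same-tuple fails treated as enrichment) can be sketched as a small consumer-side classifier. Class and field names are assumptions for illustration; the dedupe keys are the ones the spec mandates.

```typescript
// Sketch: classify each delivery as new / duplicate / enrichment per FR4.
type Ev = { event_id: string; run_id: string; stage: string; step: string; attempt: number; status: string };

class IdempotentIngest {
  private seen = new Set<string>();              // event_id dedupe (duplicates are no-ops)
  private canonical = new Map<string, string>(); // tuple -> first event_id seen

  ingest(e: Ev): 'new' | 'duplicate' | 'enrichment' {
    if (this.seen.has(e.event_id)) return 'duplicate';
    this.seen.add(e.event_id);
    const tuple = `${e.run_id}|${e.stage}|${e.step}|${e.attempt}|${e.status}`;
    if (this.canonical.has(tuple)) return 'enrichment'; // same tuple, new id: merge pointers/kv
    this.canonical.set(tuple, e.event_id);
    return 'new';
  }
}
```

A production ingest would back both maps with the `run_events` table's uniqueness constraints rather than process memory.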
+ +### Suggested DB model (Postgres) + +Table: `run_events` + +* `event_id` PK +* `run_id` indexed +* `ts` indexed +* `stage`, `step`, `attempt`, `status` indexed composite +* `payload` jsonb +* `ingested_at` + +Uniqueness constraints: + +* `event_id` unique +* Optional: unique on `(run_id, stage, step, attempt, status, hash(summary))` if you want stronger dedupe + +### Query API + +* `GET /runs/{run_id}/events` returns events sorted by `ts` ascending. +* UI should also subscribe realtime to avoid polling. + +--- + +## FR6: Evidence Gateway (pointer resolver) + +### Purpose + +A single service that resolves pointers into either: + +* log excerpts +* signed download URLs +* attestation display + verification data +* external trace links (sanitized) + +### Endpoints (MVP) + +1. **Resolve metadata** + + * `POST /evidence/resolve` + * body: `{ "run_id": "...", "pointers": [ { "type": "...", "ref": "..." } ] }` + * returns per pointer: + + * `status`: `available|pending|missing|denied|expired|error` + * `kind`: `inline|link` + * `title` + * `mime` + * `size_bytes` (if known) + * `link` (if kind=link) – must be short-lived, server-generated + * `inline_preview` (optional, small excerpt) + +2. **Fetch log excerpt** + + * `GET /evidence/log-excerpt?ref=...` + * returns: + + * `text` (max 64 KB) + * `start_line`, `end_line` + * `source` (provider info) + +3. **Fetch artifact** + + * `GET /evidence/artifact?ref=...` + * returns either: + + * short-lived download link + * or 404/403/410 + +### AuthZ requirements + +* Evidence Gateway MUST verify the caller has access to the `run_id`. +* Gateway MUST validate that the pointer belongs to that run (or is explicitly declared “global shared”). +* Gateway MUST audit-log every evidence resolution. + +### Resilience + +* If evidence is not ready, resolver returns `pending`, not 500. +* If pointer is unknown format, return `error` with a safe message. 
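One way to picture the FR6 resilience rule (evidence that is not ready yields `pending`, never a 500) is as a mapping from provider outcomes to the resolver's status vocabulary. The provider shape below is an assumption; the status values are the ones FR6 defines.

```typescript
// Sketch: map a hypothetical provider lookup to the FR6 status vocabulary.
type ResolveStatus = 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';

type ProviderLookup = { found: boolean; ready: boolean; authorized: boolean; expired: boolean };

function toResolveStatus(p: ProviderLookup | null): ResolveStatus {
  if (p === null) return 'error';   // unknown pointer format / provider failure -> safe error
  if (!p.authorized) return 'denied'; // authz first: never reveal existence to unauthorized callers
  if (!p.found) return 'missing';
  if (p.expired) return 'expired';
  if (!p.ready) return 'pending';   // evidence not produced yet -> pending, not 500
  return 'available';
}
```

Ordering the checks with authorization first keeps `denied` from leaking whether the evidence exists at all.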
+ +--- + +# UI requirements (what the product must do) + +## UI1: Run timeline renders from events + +* The run detail page MUST show: + + * stages/steps list + * current state per step (pass/warn/fail/running) + * failure details if fail exists +* The failure state MUST be derived from TinyFailureEvent without requiring any log fetch. + +## UI2: Failure card content (minimum) + +When a fail event arrives: + +* Show a red failure card with: + + * `stage` + `step` + * `summary` + * `error_class` badge + * `ts` (relative + absolute on hover) + * key kv fields (up to 4 shown; remainder behind “Show more”) + +## UI3: Progressive hydration + +* The card MUST include an “Evidence” section. +* For each pointer: + + * show a row with label and availability status + * if available, show “Open” + * if pending, show spinner + “Awaiting evidence” + * if denied, show lock icon + “No access” + * if missing, show “Not produced” +* Clicking “Open”: + + * logs open excerpt viewer (modal/drawer) + * artifacts open in viewer or download (type-dependent) + * attestations open verification view + +## UI4: Realtime behavior + +* UI MUST subscribe to realtime events for the run. +* UI MUST apply idempotent merge logic: + + * dedupe by `event_id` + * merge enrichment events by tuple (run_id/stage/step/attempt/status) + +## UI5: Ordering and out-of-order handling + +* UI MUST sort by `ts` for display. +* UI MUST NOT regress a step state if a late “pass/info” arrives after fail. + + * Rule: `fail` is terminal for a step attempt. + +--- + +# Non-functional requirements + +## Latency + +* From failure detection to UI update: **≤ 2s p95**, **≤ 5s p99** (within the same network). +* Evidence resolution: + + * `resolve` call should return in **≤ 300ms p95** for cached/known pointers. + +## Reliability + +* Event ingestion must be durable (stored) before fanout. 
+* System must tolerate: + + * duplicates + * retries + * out-of-order delivery + * partial evidence availability + +## Payload limits + +* Event size ≤ 8KB +* Evidence inline previews ≤ 4KB per pointer + +## Retention + +* Tiny events retained ≥ 30 days (configurable). +* Evidence retention depends on provider, but resolver must surface expiry. + +--- + +# Metrics and instrumentation (definition of success) + +Producers + ingestion MUST emit: + +* `ttfe_ms`: time to failure event (from step start or from failure detection) +* `event_ingest_latency_ms` +* `event_validation_fail_count` +* `unknown_error_class_count` +* `pointer_resolution_status_count{available|pending|missing|denied|expired|error}` +* `pointer_hydration_latency_ms` + +UI MUST log: + +* time from run page open → first event rendered +* evidence open clickthrough rate +* evidence resolution failure rate + +--- + +# Edge cases we explicitly handle + +1. **Runner killed before it can emit** + + * Orchestrator watchdog emits `WORKER_LOST` with stage/step best-effort. + +2. **Logs produced after failure** + + * Initial fail event has no log pointer. + * Later enrichment event adds log pointer (same tuple). + +3. **Evidence exists but user lacks access** + + * Resolver returns `denied`; UI shows locked state. + +4. **Evidence link expired** + + * Resolver returns `expired` and provides a “Refresh” action that re-resolves. + +5. **Multiple retries** + + * `attempt` increments; UI shows attempt number and keeps prior attempt history. + +--- + +# Definition of Done (engineers can ship when…) + +## Backend DoD + +* Schema validation implemented. +* Ingest → store → fanout pipeline working. +* Enrichment merge logic implemented. +* Evidence Gateway resolves at least: + + * log excerpt pointers + * artifact pointers + * attestation pointers +* AuthZ enforced. + +## Frontend DoD + +* Run page shows failure card from TinyFailureEvent alone. +* Evidence hydration works and handles all resolver statuses. 
+* Realtime updates work; duplicates/out-of-order are safe.
+
+## QA DoD (minimum test cases)
+
+1. Step fails; event arrives; UI shows failure without logs.
+2. Log pointer arrives later; UI updates evidence section.
+3. Duplicate event delivery; UI shows one failure.
+4. Evidence denied; UI shows lock state.
+5. Out-of-order: enrichment arrives before initial fail; UI still resolves correctly.
+
+---
+
+# Implementation checklist (suggested division of work)
+
+### Team A: Producer SDK
+
+* `StellaOps.Events` library:
+
+  * `EmitFailure(run_id, stage, step, attempt, error_class, summary, pointers, kv)`
+  * ULID generation
+  * payload size enforcement
+  * retry w/ backoff
+
+### Team B: Ingest + Store + Fanout
+
+* API endpoint or internal gRPC for ingest
+* Postgres `run_events` table
+* Valkey pub/sub channel: `run:{run_id}:events`
+
+### Team C: Evidence Gateway
+
+* pointer parsing + resolvers
+* log excerpt adapter(s)
+* artifact download adapter(s)
+* attestation viewer adapter(s)
+
+### Team D: UI
+
+* realtime subscription
+* timeline state reducer w/ merge rules
+* evidence drawer + resolver calls
+
+---
+
+Below is a **UI State Reducer Spec (Pure Function Inputs/Outputs)** for the Run Detail page that renders the **timeline + step statuses + failure cards + evidence hydration** using TinyFailureEvents (and future-compatible with pass/warn/info).
+
+This is written so devs can implement it as a deterministic reducer (Redux, Zustand w/ reducer, Elm-style update, etc.).
+
+---
+
+# UI State Reducer Spec v1: Run Timeline + Failure Cards
+
+## Reducer contract
+
+### Pure function
+
+```ts
+reduceRunView(state: RunViewState, action: Action): RunViewState
+```
+
+### Guarantees
+
+* **Pure & deterministic**: no IO, no timers, no random IDs, no Date.now() inside reducer.
+* **Idempotent**: applying the same `RUN_EVENT_RECEIVED` twice yields the same state after the first time.
+* **Order-safe**: out-of-order events never “downgrade” a step attempt from `fail` → `pass`.
+
+---
+
+# 1) Data types
+
+## 1.1 Event type used by reducer
+
+```ts
+type StageName =
+  | 'fetch' | 'build' | 'scan' | 'policy'
+  | 'sign' | 'package' | 'deploy' | 'runtime';
+
+type StepStatus =
+  // present now (MVP)
+  | 'fail'
+  // future-compatible
+  | 'warn' | 'pass' | 'running' | 'queued' | 'info' | 'unknown';
+
+type PointerType = 'log' | 'artifact' | 'attestation' | 'url' | 'trace';
+
+type Pointer = {
+  type: PointerType;
+  ref: string;
+  mime?: string;
+  label?: string;
+  expires_at?: string; // RFC3339
+  sha256?: string;
+};
+
+type TinyEventV1 = {
+  v: 1;
+  event_id: string;
+  ts: string; // RFC3339 UTC
+  run_id: string;
+  stage: StageName;
+  step: string;
+  attempt: number;
+  status: StepStatus; // MVP sends 'fail' only
+  error_class: string;
+  summary: string;
+  pointers?: Pointer[];
+  kv?: Record<string, string>;
+};
+
+// Normalized for sorting and comparisons (created outside or inside reducer deterministically)
+type NormalizedEvent = TinyEventV1 & {
+  tsMs: number; // parse(ts) -> number, invalid => 0
+};
+```
+
+---
+
+## 1.2 Keys and comparisons
+
+```ts
+type TupleKey = string;       // `${stage}|${step}|${attempt}|${status}`
+type StepAttemptKey = string; // `${stage}|${step}|${attempt}`
+type StepIdentityKey = string;// `${stage}|${step}` (no attempt)
+type PointerKey = string;     // `${type}|${ref}`
+
+function tupleKey(e: TinyEventV1): TupleKey {
+  return `${e.stage}|${e.step}|${e.attempt}|${e.status}`;
+}
+function stepAttemptKey(e: TinyEventV1): StepAttemptKey {
+  return `${e.stage}|${e.step}|${e.attempt}`;
+}
+function stepIdentityKey(e: TinyEventV1): StepIdentityKey {
+  return `${e.stage}|${e.step}`;
+}
+function pointerKey(p: Pointer): PointerKey {
+  return `${p.type}|${p.ref}`;
+}
+
+// Sort: ts ascending, then event_id lexicographically (stable deterministic tiebreak)
+function compareEvent(a: NormalizedEvent, b: NormalizedEvent): number {
+  if (a.tsMs !== b.tsMs) return a.tsMs - b.tsMs;
+  return a.event_id < b.event_id ? -1 : (a.event_id > b.event_id ? 1 : 0);
+}
+```
+
+---
+
+## 1.3 Status ranking rule (terminal safety)
+
+We need a single numeric ranking so we can:
+
+* prevent regressions (`fail` must remain terminal), and
+* compute rollups.
+
+```ts
+const STATUS_RANK: Record<StepStatus, number> = {
+  unknown: 0,
+  queued: 1,
+  running: 2,
+  info: 3,
+  pass: 4,
+  warn: 5,
+  fail: 6,
+};
+
+function isTerminal(status: StepStatus): boolean {
+  return status === 'fail' || status === 'warn' || status === 'pass';
+}
+```
+
+**Invariant:** A step attempt’s displayed status must never decrease in rank.
+
+---
+
+# 2) State shape
+
+This state is for a single Run Detail page (one `runId` at a time). If you store multiple runs in a global store, wrap this in a `Record<string, RunViewState>` keyed by run id.
+
+```ts
+type RealtimeStatus = 'idle' | 'connecting' | 'connected' | 'disconnected' | 'error';
+type LoadStatus = 'idle' | 'loading' | 'loaded' | 'error';
+
+type EvidenceResolveStatus =
+  | 'unresolved' // pointer exists but no resolver call made yet
+  | 'loading'    // resolver call in-flight
+  | 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';
+
+type EvidenceResolution = {
+  status: EvidenceResolveStatus;
+  kind?: 'inline' | 'link';
+  title?: string;
+  mime?: string;
+  size_bytes?: number;
+  inline_preview?: string; // small preview
+  link?: string;           // short-lived link
+  error_message?: string;
+};
+
+type EvidenceState = {
+  pointer: Pointer; // latest metadata merged from events
+  status: EvidenceResolveStatus;
+  lastResolvedAtMs?: number; // from action payload (not Date.now)
+  // for stale response protection
+  seq: number;          // increments each request
+  inFlightSeq?: number; // seq currently in-flight
+  resolution?: EvidenceResolution;
+};
+
+type PointerAggregate = {
+  pointerKey: PointerKey;
+  pointer: Pointer; // merged metadata
+};
+
+type TupleAggregate = {
+  tupleKey: TupleKey;
+
+  // all events contributing to this tuple (same stage/step/attempt/status)
+  eventIdsSorted: string[]; // sorted by (tsMs, event_id)
+  canonicalEventId: string; // min by (tsMs, event_id)
+
+  // merged view computed deterministically from eventIdsSorted
+  merged: {
+    summary: string;              // from canonical event
+    error_class: string;          // from canonical event
+    kv: Record<string, string>;   // merged by sorted order (later overwrites)
+    pointers: PointerAggregate[]; // dedup by pointerKey, merged by sorted order
+    updatedAtMs: number;          // max tsMs among contributing events
+  };
+};
+
+type StepAttemptState = {
+  key: StepAttemptKey;
+  stage: StageName;
+  step: string;
+  attempt: number;
+
+  // all tuple aggregates for this attempt (one per status)
+  tuplesByStatus: Partial<Record<StepStatus, TupleKey>>;
+
+  // derived “best” status for this attempt
+  bestStatus: StepStatus;
+  bestStatusRank: number;
+
+  updatedAtMs: number; // max of all tupleAgg.updatedAtMs for this attempt
+};
+
+type StageRollup = {
+  stage: StageName;
+  // worst status among latest attempts of steps in this stage
+  rollupStatus: StepStatus;
+  rollupRank: number;
+};
+
+type RunViewState = {
+  runId: string | null;
+
+  loading: { initialEvents: LoadStatus; error?: string };
+  realtime: { status: RealtimeStatus; error?: string };
+
+  // storage
+  eventsById: Record<string, NormalizedEvent>;
+  timelineEventIds: string[]; // global timeline sorted by (tsMs, event_id)
+
+  tupleAggByKey: Record<TupleKey, TupleAggregate>;
+  stepAttemptByKey: Record<StepAttemptKey, StepAttemptState>;
+  latestAttemptByStep: Record<StepIdentityKey, number>; // max attempt observed
+
+  stageRollups: Record<StageName, StageRollup>;
+
+  evidenceByPointer: Record<PointerKey, EvidenceState>;
+};
+```
+
+---
+
+# 3) Actions (inputs to reducer)
+
+```ts
+type Action =
+  | { type: 'RUN_VIEW_OPENED'; runId: string }
+  | { type: 'RUN_EVENTS_LOAD_STARTED'; runId: string }
+  | { type: 'RUN_EVENTS_LOADED'; runId: string; events: TinyEventV1[] }
+  | { type: 'RUN_EVENTS_LOAD_FAILED'; runId: string; error: string }
+
+  | { type: 'REALTIME_STATUS_CHANGED'; runId: string; status: RealtimeStatus; error?: string }
+  | { type: 'RUN_EVENT_RECEIVED'; event: TinyEventV1 }
+
+  // Evidence hydration lifecycle (pure reducer; side-effects happen elsewhere)
+  | { type: 'EVIDENCE_RESOLVE_REQUESTED'; runId: string; pointerKey: PointerKey }
+  | { type: 'EVIDENCE_RESOLVE_RESULT'; runId: string; pointerKey: PointerKey; seq: number; resolvedAtMs: number; resolution: EvidenceResolution }
+  | { type: 'EVIDENCE_RESOLVE_CLEARED'; runId: string; pointerKey: PointerKey };
+```
+
+**Reducer must ignore** any action where `action.runId !== state.runId` (except `RUN_VIEW_OPENED` which sets it).
+
+---
+
+# 4) Reducer semantics (outputs)
+
+## 4.1 RUN_VIEW_OPENED
+
+**Input:** `{ runId }`
+**Output:** resets all run-specific state.
+
+Rules:
+
+* Set `state.runId = runId`
+* Clear events, aggregates, evidence, timeline.
+* Set `loading.initialEvents = 'loading'` +* Set `realtime.status = 'connecting'` (optional) + +--- + +## 4.2 RUN_EVENTS_LOAD_STARTED / LOADED / FAILED + +### RUN_EVENTS_LOAD_STARTED + +* If runId matches, set `loading.initialEvents = 'loading'`. + +### RUN_EVENTS_LOADED + +* If runId matches: + + * For each event in `events`: apply the exact same logic as `RUN_EVENT_RECEIVED`. + * Then set `loading.initialEvents = 'loaded'`. + +### RUN_EVENTS_LOAD_FAILED + +* If runId matches: `loading.initialEvents = 'error'`, store error string. + +--- + +## 4.3 REALTIME_STATUS_CHANGED + +* Update `realtime.status` and `realtime.error` if runId matches. + +--- + +## 4.4 RUN_EVENT_RECEIVED (core ingestion) + +### Preconditions + +If `state.runId` is null, ignore (or treat as no-op). +If `event.run_id !== state.runId`, ignore. + +### Step A — normalize + dedupe + +* Convert to `NormalizedEvent`: + + * `tsMs = parseRFC3339ToMs(event.ts)`; if parse fails, `tsMs = 0`. + * Default `pointers = []`, `kv = {}` if missing. +* If `eventsById[event_id]` exists: **no-op**. + +### Step B — insert into global stores + +* Add to `eventsById[event_id]`. +* Insert `event_id` into `timelineEventIds` keeping sorted order by `(tsMs, event_id)`. + +### Step C — ensure evidence entries exist for pointers + +For each pointer `p`: + +* `pk = pointerKey(p)` +* If `evidenceByPointer[pk]` is missing: + + * create `{ pointer: p, status: 'unresolved', seq: 0 }` +* Else merge pointer metadata into `evidenceByPointer[pk].pointer` using pointer-merge rules (below). + (Do **not** overwrite existing resolver resolution fields.) + +### Step D — update tuple aggregate (merge/enrichment) + +Let `tk = tupleKey(event)`. + +* If `tupleAggByKey[tk]` missing, create new `TupleAggregate` with: + + * `eventIdsSorted = [event_id]` + * `canonicalEventId = event_id` + * `merged` from this event + +* Else: + + * Insert `event_id` into `eventIdsSorted` in sorted order (using `compareEvent` via `eventsById`). 
+  * Recompute:
+
+    * `canonicalEventId = min(eventIdsSorted)` by compareEvent
+    * `merged` deterministically from all contributing events (see merge rules)
+
+### Tuple merge rules (deterministic)
+
+Given contributing events `E` sorted by `(tsMs, event_id)` ascending:
+
+* `canonical = E[0]`
+* `merged.summary = canonical.summary`
+* `merged.error_class = canonical.error_class`
+* `merged.kv`:
+
+  * start empty `{}`
+  * for each event `e` in order, for each `(k,v)` in `e.kv`: `merged.kv[k] = v`
+    (later events overwrite earlier keys)
+* `merged.pointers`:
+
+  * maintain `map: Record<PointerKey, Pointer>`
+  * for each event `e` in order, for each pointer `p`:
+
+    * `pk = pointerKey(p)`
+    * if not present: set map[pk] = p
+    * else: map[pk] = mergePointerMeta(map[pk], p) (see below)
+  * output pointers as an array sorted by `PointerKey` lexicographically (for stable UI lists)
+* `merged.updatedAtMs = max(e.tsMs)`
+
+### Pointer metadata merge rule (non-null wins, later wins)
+
+```ts
+function mergePointerMeta(oldP: Pointer, newP: Pointer): Pointer {
+  // type/ref must match
+  return {
+    type: oldP.type,
+    ref: oldP.ref,
+    // later non-empty wins
+    mime: newP.mime ?? oldP.mime,
+    label: newP.label ?? oldP.label,
+    expires_at: newP.expires_at ?? oldP.expires_at,
+    sha256: newP.sha256 ?? oldP.sha256,
+  };
+}
+```
+
+---
+
+## 4.5 Update StepAttemptState (best status + no regression)
+
+After tuple aggregate update, update the parent step attempt:
+
+* Let `sak = stepAttemptKey(event)` and `sid = stepIdentityKey(event)`.
+ +### latest attempt tracking + +* `latestAttemptByStep[sid] = max(previous, event.attempt)` + +### StepAttemptState update + +* If missing, create: + + * `bestStatus = 'unknown'`, `bestStatusRank = 0`, `tuplesByStatus = {}` +* Set `tuplesByStatus[event.status] = tk` + +### Recompute best status (never decreases) + +Compute candidate best by checking all statuses present for this attempt: + +```ts +candidateBest = argmax(status in tuplesByStatus) STATUS_RANK[status] +``` + +Then apply **no-regression rule**: + +* If `STATUS_RANK[candidateBest] >= step.bestStatusRank`: + + * update `bestStatus`, `bestStatusRank` +* Else: + + * keep existing `bestStatus` (prevents fail → pass regressions) + +Set `updatedAtMs = max(updatedAtMs, tupleAgg.merged.updatedAtMs)`. + +**Important:** This rule guarantees “late pass/info” cannot override a prior fail. + +--- + +## 4.6 Stage rollups (optional but recommended) + +Whenever any `StepAttemptState` changes, update `stageRollups[stage]` deterministically: + +For each stage: + +* Consider only the **latest attempt per step identity** in that stage: + + * For each `StepIdentityKey = stage|step`, find `attempt = latestAttemptByStep[stage|step]` + * Look up `StepAttemptState` for that attempt. +* Roll up stage status as the **worst rank** among those: + + * `rollupRank = max(step.bestStatusRank)` + * `rollupStatus = status with that rank` + +If a stage has no steps yet, set `rollupStatus='unknown'`. + +--- + +# 5) Evidence hydration reducer rules + +Evidence actions update `evidenceByPointer` only; they must not mutate events/aggregates. + +## 5.1 EVIDENCE_RESOLVE_REQUESTED + +**Input:** `{ pointerKey }` + +Rules: + +* If no evidence entry exists: create one with status `unresolved` and `seq=0` (should be rare). 
+* Increment `seq = seq + 1` +* Set `inFlightSeq = seq` +* Set `status = 'loading'` +* Keep `resolution` (optional: clear it if you want UI to hide stale info; recommended to keep and show “Refreshing…”) + +**Middleware/effect contract (outside reducer):** + +* After dispatching `EVIDENCE_RESOLVE_REQUESTED`, the effect layer reads `inFlightSeq` from state and uses it in the API call. +* When the response returns, dispatch `EVIDENCE_RESOLVE_RESULT` with that same `seq`. + +## 5.2 EVIDENCE_RESOLVE_RESULT + +**Input:** `{ pointerKey, seq, resolvedAtMs, resolution }` + +Rules: + +* If `evidenceByPointer[pointerKey]` missing: ignore or create (implementation choice). +* If `evidence.inFlightSeq !== seq`: **ignore stale response**. +* Else: + + * `status = resolution.status` + * `resolution = resolution` + * `lastResolvedAtMs = resolvedAtMs` + * `inFlightSeq = undefined` + +## 5.3 EVIDENCE_RESOLVE_CLEARED + +* Reset entry back to `{ status:'unresolved', resolution: undefined, inFlightSeq: undefined }` +* Keep `pointer` metadata. + +--- + +# 6) Selectors (pure outputs for rendering) + +These are not reducer logic, but they define how UI consumes state deterministically. 
+
+## 6.1 Timeline view model
+
+```ts
+selectTimeline(state): NormalizedEvent[] {
+  return state.timelineEventIds.map(id => state.eventsById[id]);
+}
+```
+
+## 6.2 Latest attempt cards per step identity
+
+```ts
+type StepCardVM = {
+  stage: StageName;
+  step: string;
+  attempt: number;
+  status: StepStatus;
+  error_class?: string;
+  summary?: string;
+  kv: Record<string, string>;
+  pointers: PointerAggregate[];
+  updatedAtMs: number;
+};
+
+// Canonical display order for stages (matches the StageName union)
+const STAGE_ORDER: StageName[] =
+  ['fetch', 'build', 'scan', 'policy', 'sign', 'package', 'deploy', 'runtime'];
+
+selectLatestStepCards(state): StepCardVM[] {
+  const cards: StepCardVM[] = [];
+  for (const sid in state.latestAttemptByStep) {
+    const attempt = state.latestAttemptByStep[sid];
+    const [stage, step] = sid.split('|') as [StageName, string];
+    const sak = `${stage}|${step}|${attempt}`;
+
+    const sa = state.stepAttemptByKey[sak];
+    if (!sa) continue;
+
+    // Prefer fail tuple for details if present
+    const failTk = sa.tuplesByStatus['fail'];
+    const bestTk = sa.tuplesByStatus[sa.bestStatus];
+    const tk = failTk ?? bestTk;
+    const agg = tk ? state.tupleAggByKey[tk] : undefined;
+
+    cards.push({
+      stage, step, attempt,
+      status: sa.bestStatus,
+      error_class: agg?.merged.error_class,
+      summary: agg?.merged.summary,
+      kv: agg?.merged.kv ?? {},
+      pointers: agg?.merged.pointers ?? [],
+      updatedAtMs: sa.updatedAtMs,
+    });
+  }
+  // stable ordering: by stage order, then step name
+  return cards.sort((a,b) =>
+    (STAGE_ORDER.indexOf(a.stage) - STAGE_ORDER.indexOf(b.stage)) ||
+    a.step.localeCompare(b.step)
+  );
+}
+```
+
+## 6.3 Failure banner (first failure by time)
+
+```ts
+selectFirstFailure(state): StepCardVM | null {
+  const cards = selectLatestStepCards(state).filter(c => c.status === 'fail');
+  if (cards.length === 0) return null;
+  return cards.sort((a,b) => a.updatedAtMs - b.updatedAtMs)[0];
+}
+```
+
+---
+
+# 7) Worked examples (expected reducer behavior)
+
+## Example A: fail event arrives, then enrichment adds pointers
+
+1. Receive fail event (no pointers)
+
+* Step card shows `fail`, summary, error_class, evidence list empty.
+
+2. Receive second event same tupleKey with pointers
+
+* Same step card remains `fail` (no regression)
+* Evidence section now lists pointers (status `unresolved` until resolved).
+
+## Example B: out-of-order enrichment arrives before initial fail
+
+* Enrichment event arrives first (later tsMs) → creates tupleAgg; canonical is that (for now).
+* Later initial fail arrives with earlier tsMs:
+
+  * canonical becomes the earlier event (smaller tsMs)
+  * **pointers remain**, because merged pointers are union across all contributing events.
+
+## Example C: duplicate delivery
+
+* Same `event_id` received twice → second is ignored (idempotent).
+
+## Example D: late pass after fail (future-proof)
+
+* If a `pass` event arrives after a `fail` for the same step attempt:
+
+  * `bestStatusRank` is already `fail` (6)
+  * candidate is `pass` (4)
+  * no-regression rule keeps `fail`
+
+---
+
+# 8) Implementation notes (non-binding but useful)
+
+* Event counts per run are usually small; simple array insert + sort is fine.
+* If you expect thousands of events, maintain a binary insertion for `timelineEventIds` and `eventIdsSorted`.
+* Keep all “current time” out of reducer. Any timestamps used in actions (e.g., `resolvedAtMs`) must be created outside.
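The no-regression rule at the heart of the spec (section 1.3 and 4.5) condenses to a few lines. This sketch restates the spec's `STATUS_RANK` table and applies the "equal-or-higher rank wins" update; the helper name `nextBestStatus` is an illustrative assumption.

```typescript
// Sketch of the spec's terminal-safety rule: a step attempt's displayed
// status may only move up in rank, so a late 'pass' never overrides 'fail'.
type StepStatus = 'unknown' | 'queued' | 'running' | 'info' | 'pass' | 'warn' | 'fail';

const STATUS_RANK: Record<StepStatus, number> = {
  unknown: 0, queued: 1, running: 2, info: 3, pass: 4, warn: 5, fail: 6,
};

function nextBestStatus(current: StepStatus, candidate: StepStatus): StepStatus {
  // equal-or-higher rank wins; lower rank is ignored (no regression)
  return STATUS_RANK[candidate] >= STATUS_RANK[current] ? candidate : current;
}
```

Because the update depends only on the two statuses, it is trivially pure and order-safe: any interleaving of event arrivals converges to the same displayed status.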
diff --git a/docs/product-advisories/14-Dec-2025 - Create a small ground‑truth corpus.md b/docs/product-advisories/14-Dec-2025 - Create a small ground‑truth corpus.md new file mode 100644 index 000000000..8b191f3cf --- /dev/null +++ b/docs/product-advisories/14-Dec-2025 - Create a small ground‑truth corpus.md @@ -0,0 +1,787 @@ +Here’s a compact playbook for building **10–20 “toy services” with planted, labeled vulnerabilities** so you can demo reachability, measure scanner accuracy, and make the “why” behind each finding obvious. + +### Why do this + +* **Repeatable benchmarks:** same inputs → same findings → track accuracy over time. +* **Explainable demos:** each vuln has a story, proof path, and a fix. +* **Coverage sanity checks:** distinguish **reachable** vs **unreachable** vulns so tools can’t inflate results. + +### Core design + +* Each service = 1 repo with: + + * `/app` (tiny API or worker), `/infra` (Dockerfile/compose), `/tests` (PyTest/Jest + attack scripts), `/labels.yaml` (ground‑truth). + * `labels.yaml` schema: + + ```yaml + service: svc-01-password-reset + vulns: + - id: V1 + cve: CVE-2022-XXXXX + type: dep_runtime + package: express + version: 4.17.0 + reachable: true + path_tags: ["route:/reset", "call:crypto.md5", "env:DEV_MODE"] + proof: ["curl.sh#L10", "trace.json:/reset stack -> md5()"] + fix_hint: "upgrade express to 4.18.3" + - id: V2 + type: dep_build + package: lodash + version: 4.17.5 + reachable: false + path_tags: ["devDependency", "no-import"] + ``` +* **Tagged paths**: add lightweight traces (e.g., log “TAG:route:/reset” before vulnerable call) so tests can assert reachability. + +### Suggested catalog (pick 10–20) + +1. **Password reset token** (MD5, predictable tokens) – reachable via `/reset`. +2. **SQL injection** (string‑concat query) – reachable via `/search`. +3. **Path traversal** (`../` in `?file=`) – reachable but sandboxed; variant unreachable behind dead route flag. +4. 
**Deserialization bug** (unsafe `pickle`/`BinaryFormatter`) – reachable in worker queue. +5. **SSRF** (proxy fetch) – guarded by allow‑list in unreachable variant. +6. **Command injection** (`child_process.exec`) – reachable via debug param; unreachable alt uses execFile. +7. **JWT none‑alg** acceptance – only when `DEV_MODE=1`. +8. **Hardcoded credentials** (in config) – present but not used (unreachable). +9. **Dependency vuln (runtime)** old `express/fastapi` called in hot path. +10. **Dependency vuln (build‑time only)** devDependency only (unreachable at runtime). +11. **Insecure TLS** (skip verify) – gated behind feature flag. +12. **Open redirect** – requires crafted `next=` param. +13. **XXE** in XML upload – off by default in unreachable variant. +14. **Insecure deserialization in message bus consumer** – invoked by test producer. +15. **Race condition** (TOCTOU temp file) – demonstrated by parallel test. +16. **Use‑after‑free style bug** (C tiny service) – reachable with specific sequence; alt path never called. +17. **CSRF** on state‑changing route – reachable only without SameSite/CSRF tokens. +18. **Directory listing** (misconfigured static server) – reachable under `/public`. +19. **Prototype pollution** (JS merge) – only reachable when `content-type: application/json`. +20. **Zip‑slip** in archive import – prevented in unreachable variant via safe unzip. + +### Tech stack mix + +* **Languages:** Node (Express), Python (FastAPI/Flask), Go (net/http), C# (.NET Minimal API), one small C binary. +* **Packaging:** Docker per service; one multi‑stage with vulnerable build‑tool only (to test build‑time vs runtime vulns). +* **Data:** SQLite or in‑memory maps to avoid ops noise. + +### Test harness (deterministic) + +* `make test` runs: + + 1. **Smoke** (service up). + 2. **Exploit scripts** trigger each *reachable* vuln and store `evidence/trace.json`. + 3. **Scanner run** (your tool + competitors) against the image/container/fs. + 4. 
**Evaluator** compares scanner output to `labels.yaml`. + +### Metrics you’ll get + +* **Precision/recall** overall and by class (dep_runtime, dep_build, code, config). +* **Reachability precision**: % of flagged vulns with a proven path tag match. +* **Overreport index**: unreachable‑flag hits / total hits. +* **TTFS (Time‑to‑first‑signal)**: from scan start to first evidence‑backed block. +* **Fix guidance score**: did the tool propose the correct minimal upgrade/patch? + +### Minimal evaluator format + +Scanner output → normalized JSON: + +```json +{ "findings": [ + {"cve":"CVE-2022-XXXXX","package":"express","version":"4.17.0", + "class":"dep_runtime","path_tags":["route:/reset","call:crypto.md5"]} +]} +``` + +Evaluator joins on `(cve|type|package)` and checks: + +* tag overlap with `labels.vulns[*].path_tags` +* reachable expectation matches +* counts per class; exports `report.md` + `report.csv`. + +### Demo storyline (5 min) + +1. Run **svc‑01**; hit `/reset`; show trace marker. +2. Run your scanner; show it ranks the **reachable dep vuln** above the **devDependency vuln**. +3. Flip env to disable route; rerun → reachable finding disappears → score improves. +4. Show **fix hint** applied (upgrade) → green. + +### Repo layout (monorepo) + +``` +/toys/ + svc-01-reset-md5/ + svc-02-sql-injection/ + ... +/harness/ + normalize.py + evaluate.py + run_scans.sh +/docs/ + rubric.md # metric definitions & thresholds +``` + +### Guardrails + +* Keep images tiny (<150MB) and ports unique. +* Deterministic seeds for any randomness. +* No outbound calls in tests (use local mocks). +* Clearly mark **unsafe** code blocks with comments. + +### First 5 to build this week + +1. `svc-01-reset-md5` (Node) +2. `svc-02-sql-injection` (Python/FastAPI) +3. `svc-03-dep-build-only` (Node devDependency) +4. `svc-04-cmd-injection` (.NET Minimal API) +5. 
`svc-05-zip-slip` (Go) + +If you want, I can generate the skeleton repos (Dockerfile, app, tests, `labels.yaml`, and the evaluator script) so you can drop them into your monorepo and start measuring immediately. +Below is a **developer framework** you can hand to the team as the governing “contract” for implementing the full toy-service catalogue at a **best-in-class** standard, while keeping the suite deterministic, safe, and maximally useful for scanner R&D. + +--- + +## 1) Non-negotiable principles + +1. **Determinism first** + + * Same git SHA + same inputs ⇒ identical images, SBOMs, findings, scores. + * Pin everything: base image **by digest**, language deps **by lockfiles**, tool versions **by exact semver**, and record it in an evidence manifest. + +2. **Ground truth is authoritative** + + * Every planted weakness must have a **machine-readable label**, and at least one **verifiable proof artifact**. + * No “implicit” vulnerabilities; if it’s not labeled, it does not exist for scoring. + +3. **Reachability is tiered, not binary** + + * You will label and prove *how* it is reachable (imported vs executed vs tainted input), not just “reachable: true”. + +4. **Safety by construction** + + * Services run on an isolated docker network; tests must not require internet. + * Proofs should demonstrate *execution and dataflow* rather than “weaponized exploitation”. 
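Principle 1's "record it in an evidence manifest" can be as small as a content-addressing pass over the evidence directory. A hedged Python sketch (the field names are assumptions, not a fixed schema):

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def write_manifest(evidence_dir, git_sha, image_digest, tool_versions):
    """Content-address every evidence file so a re-run can be byte-compared."""
    ev = pathlib.Path(evidence_dir)
    hashes = {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(ev.glob("*"))
        if p.is_file() and p.name != "manifest.json"
    }
    manifest = {
        "git_sha": git_sha,
        "image_digest": image_digest,        # base image pinned by digest (principle 1)
        "tool_versions": tool_versions,      # exact semver per tool (principle 1)
        "generated_at_utc": datetime.now(timezone.utc).isoformat(),
        "evidence_sha256": hashes,
    }
    (ev / "manifest.json").write_text(json.dumps(manifest, sort_keys=True, indent=2))
    return manifest
```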
+ +--- + +## 2) Repository and service contract + +### Standard monorepo layout + +``` +/toys/ + svc-01-.../ + app/ + infra/ # Dockerfile, compose, network policy + tests/ # positive + negative reachability tests + labels.yaml # ground truth + evidence/ # generated by tests (trace, tags, manifests) + fix/ # minimal patch proving remediation +/harness/ + run-suite/ + normalize/ + evaluate/ +/schemas/ + labels.schema.json +/docs/ + benchmark-contract.md + scoring.md + reviewer-checklist.md +``` + +### Required service deliverables (Definition of Done) + +A service PR is “DONE” only if it includes: + +* `labels.yaml` validated by `schemas/labels.schema.json` +* Docker build reproducible enough to be stable in CI (digest pinned; lockfiles committed) +* **Positive tests** that generate evidence proving reachability tiers (see §3) +* **Negative tests** proving “unreachable” claims (feature flags off, devDependency only, dead route, etc.) +* `fix/` patch that removes/mitigates the weakness and produces a measurable delta (findings drop, reachability flips, or config gate blocks) +* An `evidence/manifest.json` capturing tool versions, git sha, image digest, timestamps (UTC), and hashes of evidence files + +--- + +## 3) Reachability tiers and evidence requirements + +### Reachability levels (use these everywhere) + +* **R0 Present**: vulnerable component exists in image/SBOM, not imported/loaded. +* **R1 Loaded**: imported/linked/initialized, but no executed path proven. +* **R2 Executed**: vulnerable function/module is executed in a test (deterministic trace). +* **R3 Tainted execution**: execution occurs with externally influenced input (route param/message/body). +* **R4 Exploitable** (optional): controlled, non-harmful PoC demonstrates full impact. 
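One way to make the tier ladder enforceable is to encode each level's minimum evidence as data and validate labels against it mechanically. A sketch with assumed evidence-kind names:

```python
# Sketch: tier -> minimum evidence kinds, so the harness can reject under-proven labels.
# The evidence-kind names here are assumptions for illustration.
TIER_MIN_EVIDENCE = {
    "R0": {"sbom"},
    "R1": {"sbom", "load_trace"},
    "R2": {"sbom", "load_trace", "sink_tag"},
    "R3": {"sbom", "load_trace", "sink_tag", "taint_tag"},
    "R4": {"sbom", "load_trace", "sink_tag", "taint_tag", "poc"},
}

def check_tier(vuln):
    """Return (ok, missing): ok is False when the claimed tier lacks required evidence."""
    need = TIER_MIN_EVIDENCE[vuln["reachability_level"]]
    have = set(vuln.get("evidence_kinds", ()))
    missing = need - have
    return (not missing, missing)
```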
+ +### Minimum evidence per level + +* R0: SBOM + file hash / package metadata +* R1: runtime startup logs or module load trace tag +* R2: callsite tag + stack trace snippet (or deterministic trace file) +* R3: R2 + taint marker showing data originated from external boundary (HTTP/queue/env) and reached call +* R4: only if safe and necessary; keep it non-weaponized and sandboxed + +**Key rule:** prefer proving **execution + dataflow** over providing “payload recipes”. + +--- + +## 4) Ground truth schema (what `labels.yaml` must capture) + +Every vuln entry must have: + +* Stable ID: `svc-XX:Vn` (never renumber once published) +* Class: `dep_runtime | dep_build | code | config | os_pkg | supply_chain` +* Identity: `cve` (if applicable), `purl`, `package`, `version`, `location` (path/module) +* Reachability: `reachability_level: R0..R4`, `entrypoint` (route/topic/cli), `preconditions` (flags/env/auth) +* Proofs: + + * `proof.artifacts[]` (e.g., trace file, tag log, coverage snippet) + * `proof.tags[]` (canonical tag strings) +* Fix: + + * `fix.type` (upgrade/config/code) + * `fix.patch_path` (under `fix/`) + * `fix.expected_delta` (what should change in findings/evidence) +* Negatives (if unreachable): + + * `negative_proof` explaining and proving why it is unreachable + +Canonical tag format (consistent across languages): + +* `TAG:route:/reset` +* `TAG:call:Crypto.Md5` +* `TAG:taint:http.body.resetToken` +* `TAG:flag:DEV_MODE=true` + +--- + +## 5) Service implementation standards (how developers build each toy) + +### A. Vulnerability planting patterns (approved) + +* **Dependency runtime**: vulnerable version is a production dependency and exercised on a normal route/job. +* **Dependency build-only**: devDependency only, or used only in build stage; prove it never ships in final image. +* **Code vuln**: the vulnerable sink is behind a clean, deterministic entrypoint and instrumented. 
+* **Config vuln**: misconfig is explicit and versioned (headers, TLS settings, authz rules), with a fix patch. + +### B. Instrumentation requirements + +* Every reachable vuln must emit: + + * one **entrypoint tag** (route/topic/command) + * one **sink tag** (the vulnerable call or module) + * optional **taint tag** for R3 +* Evidence generation must be stable and machine-parsable: + + * JSON trace preferred (`evidence/trace.json`) + * Logs acceptable if structured and anchored with tags + +### C. Negative-case discipline (unreachable means proven unreachable) + +Unreachable claims must be backed by one of: + +* compilation/linker exclusion (dead code eliminated) + proof +* dependency not present in final image (multi-stage) + proof (image file listing / SBOM diff) +* feature flag off + proof (config captured + route unavailable) +* auth gate + proof (unauthorized cannot reach sink) + +--- + +## 6) Harness and scoring gates (how you enforce “best in class”) + +### Normalization + +All scanners’ outputs must normalize into one internal shape: + +* `(identity: purl+cve+version+location) + class + reachability_claim + evidence_refs` + +### Core metrics (tracked per commit) + +* **Recall (by class)**: runtime deps, OS pkgs, code, config +* **Precision**: false positive rate, especially R0/R1 misclassified as R2/R3 +* **Reachability accuracy**: + + * overreach: predicted reachable but labeled R0/R1 + * underreach: labeled R2/R3 but predicted non-reachable +* **TTFS** (Time-to-First-Signal): time to first *evidence-backed* blocking issue +* **Fix validation**: applying `fix/` must produce the expected delta + +### Quality gates (example thresholds you can enforce in CI) + +* Runtime dependency recall ≥ 0.95 +* Unreachable false positives ≤ 0.05 (for R0/R1) +* Reachability underreport ≤ 0.10 (for labeled R2/R3) +* TTFS regression: no worse than +10% vs main +* Fix validation pass rate = 100% for modified services + +(Adjust numbers as your suite matures; the framework is 
the key.) + +--- + +## 7) Review checklist (what reviewers enforce) + +A PR adding/modifying a service is rejected if any of these fail: + +* Labels complete, schema-valid, and stable IDs preserved +* Proof artifacts are deterministic and generated by tests +* Reachability tier justified and matches evidence +* Unreachable claims have negative proofs +* Docker build uses pinned digests + lockfiles committed +* `fix/` produces measurable delta and does not introduce new unlabeled issues +* No network egress required; tests are hermetic + +--- + +## 8) Catalogue coverage plan (so the suite actually pushes “best in class”) + +Define coverage targets across axes (each service must declare which axes it covers): + +* Language/runtime: Node, Python, Go, .NET 10, Java, native +* Packaging: fat images, slim, distroless, multi-stage, vendored deps +* Dependency complexity: direct, transitive, aliased/renamed, bundled +* Reachability: R0..R3 mandatory; R4 only for a small subset +* Supply chain: SBOM mismatches, provenance metadata, build-time tools present but excluded + +This prevents you from building 20 near-duplicates and thinking you have “coverage”. + +--- + +## 9) What to hand developers today (copy/paste into your repo) + +Create these docs/templates as the “contract”: + +* `/docs/benchmark-contract.md` (principles + DoD + reachability tiers) +* `/docs/reviewer-checklist.md` (the rejection criteria) +* `/schemas/labels.schema.json` (strict validation) +* `/toys/_TEMPLATE/` (service skeleton with instrumentation + tests scaffolding) +* `/harness/evaluate/` (metrics + CI gating) + +If you want, I can draft the actual **`labels.schema.json`**, the **service template**, and the **review checklist** in ready-to-commit form (paths, names, and .NET 10 conventions included). +Below is an **extended, implementable catalogue** (20 services + 4 “meta” repos) with **concrete stacks, entrypoints, reachability tiers, canonical tags, proof tests, and fixes**. 
It is written so developers can start building immediately while keeping the suite deterministic and scanner-grade. + +--- + +## A. Catalogue index (what each service is for) + +| ID | Service | Stack | Primary capability tested | +| -: | --------------------------- | --------------------- | ------------------------------------------------------ | +| 01 | reset-token-weak-crypto | Node/Express | Code reachability + crypto misuse | +| 02 | search-sql-injection | Python/FastAPI | Taint → sink (SQLi), route evidence | +| 03 | cmd-injection-diagnostics | .NET 10 Minimal API | Taint → shell sink + gating | +| 04 | zip-import-zip-slip | Go net/http | Archive handling (Zip Slip), filesystem proof | +| 05 | xml-upload-xxe | Java/Spring Boot | XML parser config (XXE), safe proof | +| 06 | jwt-none-devmode | .NET 10 | Config-gated auth bypass (reachability depends on env) | +| 07 | fetcher-ssrf | Node/Express | SSRF to internal-only target, network isolation | +| 08 | outbound-tls-skipverify | Go | TLS misconfig + “reachable only if feature enabled” | +| 09 | queue-pickle-deser | Python worker | Async reachability via queue + unsafe deserialization | +| 10 | efcore-rawsql | .NET 10 + EF Core | ORM raw SQL misuse + input flow | +| 11 | shaded-jar-deps | Java/Gradle | Shaded/fat jar dependency discovery | +| 12 | webpack-bundled-dep | Node/Webpack | Bundled deps + SBOM correctness | +| 13 | go-static-modver | Go static | Detect module versions in static binaries | +| 14 | dotnet-singlefile-trim | .NET 10 publish | Single-file/trimmed dependency evidence | +| 15 | cors-credentials-wildcard | .NET 10 or Node | Config vulnerability (CORS) + fix delta | +| 16 | open-redirect | Node/Express | Web vuln classification + allowlist fix | +| 17 | csrf-state-change | .NET 10 Razor/Minimal | Missing CSRF protections + cookie semantics | +| 18 | prototype-pollution-merge | Node | JSON-body gated path + sink | +| 19 | path-traversal-download | Python/Flask | File handling traversal + 
normalization | +| 20 | insecure-tempfile-toctou | Go or .NET | Concurrency/race evidence (safe) | +| 21 | k8s-misconfigs | YAML/Helm | IaC scanning (privileged, hostPath, etc.) | +| 22 | docker-multistage-buildonly | Any | Build-time-only vuln exclusion proof | +| 23 | secrets-fakes-corpus | Any | Secret detection precision (fake tokens) | +| 24 | sbom-mismatch-lab | Any | SBOM validation + diff correctness | + +--- + +## B. Canonical tagging (use across all services) + +Every reachable vuln must produce at least: + +* `TAG:route: ` or `TAG:topic:` +* `TAG:call:` +* If R3: `TAG:taint:` (http.query, http.body, queue.msg, env.var) + +**Evidence artifact:** `evidence/trace.json` lines such as: + +```json +{"ts":"...","corr":"...","tags":["TAG:route:POST /reset","TAG:taint:http.body.email","TAG:call:Crypto.MD5"]} +``` + +--- + +## C. Service specs (developers can implement 1:1) + +### 01) `svc-01-reset-token-weak-crypto` (Node/Express) + +**Purpose:** R3 code reachability; crypto misuse; ensure scanner doesn’t over-rank unreachable dev deps. +**Entrypoints:** `POST /reset` and `POST /reset/confirm` +**Vulns:** + +* `V1` **CWE-327 Weak Crypto** — reset token derived from deterministic inputs (no CSPRNG). + + * Reachability: **R3** + * Tags: `TAG:route:POST /reset`, `TAG:taint:http.body.email`, `TAG:call:Crypto.WeakToken` + * Proof test: request reset; assert trace contains sink tag. + * Fix: use `crypto.randomBytes()` and store hashed token. +* `V2` **dep_build** — vulnerable npm devDependency present only in `devDependencies`. + + * Reachability: **R0** + * Negative proof: final image contains no node_modules entry for it OR it is never imported (coverage + grep import map). + +**Hard mode variant:** token generation only happens when `FEATURE_RESET_V1=1` → label unreachable when off. + +--- + +### 02) `svc-02-search-sql-injection` (Python/FastAPI + SQLite) + +**Purpose:** Classic taint → SQL sink; evidence-driven. 
+**Entrypoint:** `GET /search?q=` +**Vulns:** + +* `V1` **CWE-89 SQL Injection** — query constructed via string concatenation. + + * Reachability: **R3** + * Tags: `TAG:route:GET /search`, `TAG:taint:http.query.q`, `TAG:call:SQL.Unparameterized` + * Proof test: send query with SQL metacharacters; verify trace hits sink. + * Fix: parameterized query / query builder. + +**Hard mode variant:** same route exists but safe path uses parameters; unsafe path only if header `X-Debug=1` and env `DEV_MODE=1`. + +--- + +### 03) `svc-03-cmd-injection-diagnostics` (.NET 10 Minimal API) + +**Purpose:** Detect command execution sink and prove gating. +**Entrypoint:** `GET /diag/ping?host=` +**Vulns:** + +* `V1` **CWE-78 Command Injection** — shell invocation with user-influenced argument. + + * Reachability: **R3** when `DIAG_ENABLED=1` + * Tags: `TAG:route:GET /diag/ping`, `TAG:taint:http.query.host`, `TAG:call:Process.Start.Shell` + * Proof test: call endpoint with characters that would alter shell parsing; evidence is sink tag + controlled output marker (not destructive). + * Fix: avoid shell, use argument arrays (`ProcessStartInfo.ArgumentList`) + allowlist hostnames. + +**Hard mode variant:** sink is in a helper library referenced transitively; scanner must resolve call graph. + +--- + +### 04) `svc-04-zip-import-zip-slip` (Go) + +**Purpose:** File/archive handling; safe filesystem proof; no “real system” impact. +**Entrypoint:** `POST /import-zip` +**Vulns:** + +* `V1` **CWE-22 Path Traversal (Zip Slip)** — extraction path not normalized/validated. + + * Reachability: **R3** + * Tags: `TAG:route:POST /import-zip`, `TAG:taint:http.body.zip`, `TAG:call:Archive.Extract.UnsafeJoin` + * Proof test: upload crafted zip that attempts to place `evidence/sentinel.txt` outside dest; assert sentinel ends up outside intended folder. + * Fix: clean paths; reject entries escaping dest; forbid absolute paths. 
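The svc-04 fix (clean paths, reject escaping entries, forbid absolute paths) is the same check in any language; a Python sketch of the guard, though the service itself is Go:

```python
import os
import zipfile

def safe_extract(zip_path: str, dest: str) -> None:
    """Reject any archive entry that would land outside dest (the svc-04 fix)."""
    dest_real = os.path.realpath(dest)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if os.path.isabs(name):                      # forbid absolute paths
                raise ValueError(f"absolute path entry: {name}")
            target = os.path.realpath(os.path.join(dest_real, name))
            if os.path.commonpath([dest_real, target]) != dest_real:
                raise ValueError(f"zip-slip entry rejected: {name}")
        zf.extractall(dest_real)
```

The key move is resolving the joined path before comparing it against the destination root, so `../` sequences and symlink tricks are caught after normalization rather than by string matching.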
+ +--- + +### 05) `svc-05-xml-upload-xxe` (Java/Spring Boot) + +**Purpose:** Parser config scanning + code-path proof. +**Entrypoint:** `POST /upload-xml` +**Vulns:** + +* `V1` **CWE-611 XXE** — DocumentBuilderFactory with external entities enabled. + + * Reachability: **R3** + * Tags: `TAG:route:POST /upload-xml`, `TAG:taint:http.body.xml`, `TAG:call:XML.Parse.XXEEnabled` + * Proof test: XML references a **local test file under `/app/testdata/`** and returns its sentinel string (no external network). + * Fix: disable external entity resolution and secure processing. + +--- + +### 06) `svc-06-jwt-none-devmode` (.NET 10) + +**Purpose:** Reachability depends on environment and config. +**Entrypoint:** `GET /admin` (Bearer JWT) +**Vulns:** + +* `V1` **CWE-345 Insufficient Verification** — accepts unsigned token when `DEV_MODE=1`. + + * Reachability: **R2** (exec) / **R3** (if token from request) + * Tags: `TAG:route:GET /admin`, `TAG:flag:DEV_MODE=true`, `TAG:call:Auth.JWT.AcceptNoneAlg` + * Proof test: run container with DEV_MODE=1; request triggers sink tag. + * Negative test: DEV_MODE=0 must not hit sink tag. + * Fix: enforce algorithm + signature validation always. + +--- + +### 07) `svc-07-fetcher-ssrf` (Node/Express) + +**Purpose:** SSRF detection with internal-only target in docker network. +**Entrypoint:** `GET /fetch?url=` +**Vulns:** + +* `V1` **CWE-918 SSRF** — URL fetched without scheme/host restrictions. + + * Reachability: **R3** + * Tags: `TAG:route:GET /fetch`, `TAG:taint:http.query.url`, `TAG:call:HTTP.Client.Fetch` + * Proof test: fetch `http://internal-metadata/health` (a companion container in compose); assert response contains sentinel + sink tag. + * Fix: allowlist hosts/schemes; block private ranges; require signed destinations. 
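The svc-07 fix reduces to a scheme allowlist, a host allowlist, and a private-range block. A Python sketch (the allowlist contents are hypothetical):

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.example.com"}   # hypothetical allowlist for illustration

def is_safe_fetch(url: str) -> bool:
    """Scheme allowlist + host allowlist + private/link-local IP block (svc-07 fix)."""
    p = urlparse(url)
    if p.scheme not in {"http", "https"}:
        return False
    host = p.hostname or ""
    try:
        ip = ipaddress.ip_address(host)
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False               # literal IPs into private space are always refused
    except ValueError:
        pass                           # not an IP literal; fall through to the allowlist
    return host in ALLOWED_HOSTS
```

A production guard would also resolve the hostname and re-check the resulting addresses, since DNS can point an allowed-looking name at a private range.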
+ +--- + +### 08) `svc-08-outbound-tls-skipverify` (Go) + +**Purpose:** Config vuln + “reachable only when feature on.” +**Entrypoint:** `POST /sync` triggers outbound HTTPS call +**Vulns:** + +* `V1` **CWE-295 Improper Cert Validation** — `InsecureSkipVerify=true` when `SYNC_FAST=1`. + + * Reachability: **R2** (exec) + * Tags: `TAG:route:POST /sync`, `TAG:flag:SYNC_FAST=true`, `TAG:call:TLS.InsecureSkipVerify` + * Fix: proper CA pinning / system pool; explicit cert verification. + +--- + +### 09) `svc-09-queue-pickle-deser` (Python API + worker) + +**Purpose:** Async reachability: API enqueues → worker executes sink. +**Entrypoints:** `POST /enqueue` + worker consumer +**Vulns:** + +* `V1` **CWE-502 Unsafe Deserialization** — worker uses unsafe deserializer. + + * Reachability: **R3** (taint from HTTP → queue → worker) + * Tags: `TAG:route:POST /enqueue`, `TAG:topic:jobs`, `TAG:call:Deserialize.Unsafe` + * Proof test: enqueue benign payload that triggers sink tag and deterministic “handled” response (no arbitrary execution PoC). + * Fix: switch to safe format (JSON) and validate schema. + +--- + +### 10) `svc-10-efcore-rawsql` (.NET 10 + EF Core) + +**Purpose:** ORM misuse; taint → SQL sink detection. +**Entrypoint:** `GET /reports?where=` +**Vulns:** + +* `V1` **CWE-89 SQLi** — `FromSqlRaw`/`ExecuteSqlRaw` with interpolated input. + + * Reachability: **R3** + * Tags: `TAG:route:GET /reports`, `TAG:taint:http.query.where`, `TAG:call:EFCore.FromSqlRaw.Unsafe` + * Fix: `FromSqlInterpolated` with parameters or LINQ predicates. + +--- + +### 11) `svc-11-shaded-jar-deps` (Java/Gradle) + +**Purpose:** Dependency discovery inside fat/shaded jar; reachable vs present-only. +**Entrypoint:** `GET /parse` +**Vulns:** + +* `V1` **dep_runtime** — vulnerable lib included in shaded jar and actually invoked. 
+ + * Reachability: **R2** + * Tags: `TAG:route:GET /parse`, `TAG:call:Lib.Parse.VulnerableMethod` +* `V2` **dep_build/test** — test-scoped vulnerable lib not packaged in runtime jar. + + * Reachability: **R0** + * Negative proof: SBOM for runtime jar excludes it; file listing confirms. + +**Fix:** bump dependency and rebuild shaded jar. + +--- + +### 12) `svc-12-webpack-bundled-dep` (Node/Webpack) + +**Purpose:** Bundled dependencies, source map presence/absence, SBOM correctness. +**Entrypoint:** `GET /render?template=` +**Vulns:** + +* `V1` **dep_runtime** — vulnerable template lib bundled; invoked by render. + + * Reachability: **R2/R3** depending on input usage + * Tags: `TAG:route:GET /render`, `TAG:taint:http.query.template`, `TAG:call:Template.Render` +* `V2` **R0** — vulnerable package in lockfile but tree-shaken and absent from output bundle. + + * Negative proof: bundle inspection + build manifest. + +**Fix:** upgrade dependency and rebuild bundle; ensure SBOM maps bundle contents. + +--- + +### 13) `svc-13-go-static-modver` (Go static binary) + +**Purpose:** Scanner capability to extract module versions from static binary. +**Entrypoint:** `GET /hash?alg=` +**Vulns:** + +* `V1` **dep_runtime** — vulnerable Go module version linked; executed on route. + + * Reachability: **R2** + * Tags: `TAG:route:GET /hash`, `TAG:call:GoMod.VulnFunc` +* `V2` **R1** — module linked but only used in dead code path (guarded by constant false). + + * Negative proof: coverage/trace never hits sink. + +**Fix:** update `go.mod` and rebuild. + +--- + +### 14) `svc-14-dotnet-singlefile-trim` (.NET 10 publish single-file) + +**Purpose:** Detect assemblies in single-file + trimming edge cases. +**Entrypoint:** `GET /export` +**Vulns:** + +* `V1` **dep_runtime** — vulnerable NuGet referenced and executed. + + * Reachability: **R2** + * Tags: `TAG:route:GET /export`, `TAG:call:NuGet.VulnMethod` +* `V2` **R0** — package referenced in project but trimmed out and not present. 
+ + * Negative proof: runtime file map (single-file manifest) excludes it. + +**Fix:** bump NuGet; adjust trimming settings if needed. + +--- + +### 15) `svc-15-cors-credentials-wildcard` (.NET 10) + +**Purpose:** Config/misconfig detection; clear fix delta. +**Entrypoint:** any API route +**Vulns:** + +* `V1` **CWE-942 / CORS Misconfig** — `Access-Control-Allow-Origin: *` with credentials. + + * Reachability: **R2** (observed in response headers) + * Tags: `TAG:route:GET /health`, `TAG:call:HTTP.Headers.CORSWildcardCreds` + * Proof test: request and assert headers + tag. + * Fix: explicit allowed origins + disable credentials unless needed. + +--- + +### 16) `svc-16-open-redirect` (Node/Express) + +**Purpose:** Web vuln classification, allowlist fix. +**Entrypoint:** `GET /login?next=` +**Vulns:** + +* `V1` **CWE-601 Open Redirect** — next param used directly. + + * Reachability: **R3** + * Tags: `TAG:route:GET /login`, `TAG:taint:http.query.next`, `TAG:call:Redirect.Unvalidated` + * Fix: allowlist relative paths; reject absolute URLs. + +--- + +### 17) `svc-17-csrf-state-change` (.NET 10) + +**Purpose:** CSRF detection + cookie semantics. +**Entrypoint:** `POST /account/email` (cookie auth) +**Vulns:** + +* `V1` **CWE-352 CSRF** — no anti-forgery token; SameSite mis-set. + + * Reachability: **R2** + * Tags: `TAG:route:POST /account/email`, `TAG:call:Auth.CSRF.MissingProtection` + * Fix: antiforgery token + SameSite=Lax/Strict and proper CORS. + +--- + +### 18) `svc-18-prototype-pollution-merge` (Node) + +**Purpose:** JSON-body gated sink; reachability must respect content-type and route. +**Entrypoint:** `POST /profile` (application/json) +**Vulns:** + +* `V1` **CWE-1321 Prototype Pollution** — unsafe deep merge of user object into defaults. + + * Reachability: **R3** (only if JSON) + * Tags: `TAG:route:POST /profile`, `TAG:taint:http.body.json`, `TAG:call:Object.Merge.Unsafe` + * Negative test: same request with non-JSON must not hit sink tag. 
+ * Fix: safe merge, deny `__proto__` / `constructor` keys. + +--- + +### 19) `svc-19-path-traversal-download` (Python/Flask) + +**Purpose:** File traversal with safe, local sentinel proof. +**Entrypoint:** `GET /download?file=` +**Vulns:** + +* `V1` **CWE-22 Path Traversal** — file path concatenated without normalization. + + * Reachability: **R3** + * Tags: `TAG:route:GET /download`, `TAG:taint:http.query.file`, `TAG:call:FS.Read.UnsafePath` + * Proof test: attempt to read a known sentinel file outside the allowed directory (within container). + * Fix: normalize path, enforce base dir constraint. + +--- + +### 20) `svc-20-insecure-tempfile-toctou` (Go or .NET) + +**Purpose:** Concurrency/race category; deterministic reproduction via controlled scheduling. +**Entrypoint:** `POST /export` creates temp file and then reopens by name +**Vulns:** + +* `V1` **CWE-367 TOCTOU** — uses predictable temp name + separate open. + + * Reachability: **R2** (requires parallel test harness) + * Tags: `TAG:route:POST /export`, `TAG:call:FS.TempFile.InsecurePattern` + * Proof test: run two coordinated requests; assert race condition triggers sentinel behavior. + * Fix: use secure temp APIs + hold open FD; atomic operations. + +--- + +## D. Meta repos (not “services” but essential for best-in-class scanning) + +### 21) `svc-21-k8s-misconfigs` (YAML/Helm) + +**Purpose:** IaC scanning; false-positive discipline. +**Artifacts:** `manifests/*.yaml`, `helm/Chart.yaml` +**Findings to plant:** + +* privileged container, `hostPath`, `runAsUser: 0`, missing resource limits, writable rootfs, wildcard RBAC + **Proof:** static assertions in tests (OPA/Conftest or your harness) generate evidence tags like `TAG:iac:k8s.privileged`. + +--- + +### 22) `svc-22-docker-multistage-buildonly` + +**Purpose:** Prove build-time-only deps do not ship; prevent scanners from overreporting. +**Pattern:** builder stage installs vulnerable tooling; final stage is distroless and excludes it. 
+**Proof:** final image SBOM + `docker export` file list hash; must not include builder artifacts. + +--- + +### 23) `svc-23-secrets-fakes-corpus` + +**Purpose:** Secret detection precision/recall without storing real secrets. +**Pattern:** files containing **fake** tokens matching common regexes but clearly marked `FAKE_` and useless. +**Labels:** must distinguish: + +* `R0 present` fake secret in docs/examples +* `R2 reachable` secret injected into runtime env accidentally (then fixed) + +--- + +### 24) `svc-24-sbom-mismatch-lab` + +**Purpose:** SBOM validation and drift detection. +**Pattern:** generate an SBOM, then change deps without regenerating; label mismatch as a “supply_chain” issue. +**Proof:** harness compares `image digest + lockfile hash + sbom hash`. + +--- + +## E. Implementation notes that raise the bar (recommended defaults) + +1. **Each service ships with both**: + + * `tests/test_positive_v*.{py,js,cs}` producing evidence for reachable vulns + * `tests/test_negative_v*.{py,js,cs}` proving unreachable claims +2. **Every service includes a `fix/` patch** and a CI job that: + + * builds “vuln image”, scans, evaluates + * applies fix, rebuilds, re-scans, confirms expected delta +3. **Hard-mode toggle per service** (optional but valuable): + + * `MODE=easy`: vuln sits on hot path (for demos) + * `MODE=hard`: same vuln behind realistic conditions (auth, header, flag, content-type, async) + +--- + +If you want this to be “maxim degree” for scanner R&D, the next step is to add **one additional dimension per service** (fat jar, single-file, distroless, vendored deps, shaded deps, optional extras, transitive only, etc.). I can propose a precise pairing (which dimension goes to which service) so the suite covers all packaging and reachability edge cases without duplication. 
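The fix-validation CI job in §E.2 (build the vuln image, scan, apply `fix/`, rebuild, re-scan, confirm the expected delta) boils down to a findings diff; a minimal sketch:

```python
def fix_delta(before, after, expected_gone):
    """Compare findings before/after applying fix/: the expected IDs must disappear
    and nothing new and unlabeled may appear (reviewer-checklist rule)."""
    before_ids = {f["id"] for f in before}
    after_ids = {f["id"] for f in after}
    gone = before_ids - after_ids
    introduced = after_ids - before_ids
    return {
        "ok": set(expected_gone) <= gone and not introduced,
        "gone": gone,
        "introduced": introduced,
    }
```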
diff --git a/docs/product-advisories/14-Dec-2025 - Dissect triage and evidence workflows.md b/docs/product-advisories/14-Dec-2025 - Dissect triage and evidence workflows.md new file mode 100644 index 000000000..b1a3b9be8 --- /dev/null +++ b/docs/product-advisories/14-Dec-2025 - Dissect triage and evidence workflows.md @@ -0,0 +1,551 @@ +Here’s a tight, practical blueprint for building (and proving) a fast, evidence‑first triage workflow—plus the power‑user affordances that make Stella Ops feel “snappy” even offline. + +# What “good” looks like (background in plain words) + +* **Alert → evidence → decision** in one flow: an alert should open directly onto the concrete proof (reachability, call‑stack, provenance), then offer a one‑click decision (VEX/CSAF status) with audit logging. +* **Time‑to‑First‑Signal (TTFS)** is king: how fast a human sees the first credible piece of evidence that explains *why this alert matters here*. +* **Clicks‑to‑Closure**: count how many interactions to reach a defensible decision recorded in the audit log. + +# Minimal evidence bundle per finding + +* **Reachability proof**: function‑level path or package‑level import chain (with “toggle reachability view” hotkey). +* **Call‑stack snippet**: 5–10 frames around the sink/source with file:line anchors. +* **Provenance**: attestation / DSSE + build ancestry (image → layer → artifact → commit). +* **VEX/CSAF status**: affected/not‑affected/under‑investigation + reason. +* **Diff**: what changed since last scan (SBOM or VEX delta), rendered as a small, human‑readable “smart‑diff.” + +# KPIs to measure in CI and UI + +* **TTFS (p50/p95)** from alert creation to first rendered evidence. +* **Clicks‑to‑Closure (median)** per decision type. +* **Evidence completeness score** (0–4): reachability, call‑stack, provenance, VEX/CSAF present. +* **Offline friendliness score**: % of evidence resolvable with no network. 
+* **Audit log completeness**: every decision has: evidence hash set, actor, policy context, replay token. + +# Power‑user affordances (keyboard first) + +* **Jump to evidence** (`J`): focuses the first incomplete evidence pane. +* **Copy DSSE** (`Y`): copies the attestation block or Rekor entry ref. +* **Toggle reachability view** (`R`): path list ↔ compact graph ↔ textual proof. +* **Search‑within‑graph** (`/`): node/func/package, instant. +* **Deterministic sort** (`S`): stable sort by (reachability→severity→age→component) to remove hesitation. +* **Quick VEX set** (`A`, `N`, `U`): Affected / Not‑affected / Under‑investigation with templated reasons. + +# UX flow to implement (end‑to‑end) + +1. **Alert row** shows: TTFS timer, reachability badge, “decision state,” and a diff‑dot if something changed. +2. **Open alert** lands on **Evidence tab** (not Details). Top strip = three proof pills: + + * Reachability ✓ / Call‑stack ✓ / Provenance ✓ (click to expand inline). +3. **Decision drawer** pinned on the right: + + * VEX/CSAF radio (A/N/U) → Reason presets → “Record decision.” + * Shows **audit‑ready summary** (hashes, timestamps, policy). +4. **Diff tab**: SBOM/VEX delta since last run, grouped by “meaningful risk shift.” +5. **Activity tab**: immutable audit log; export as a signed bundle for audits. + +# Graph performance on large call‑graphs + +* **Minimal‑latency snapshots**: pre‑render static PNG/SVG thumbnails server‑side; open with tiny preview then hydrate to interactive graph lazily. +* **Progressive neighborhood expansion**: load 1‑hop first, expand on demand; keep the first TTFS < 500 ms. +* **Stable node ordering**: deterministic layout with consistent anchors to avoid “graph shuffle” anxiety. +* **Chunked graph edges** with capped fan‑out; collapse identical library paths into a **reachability macro‑edge**. 
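A minimal sketch of the macro-edge collapse described above — the edge shape here is an assumption for illustration, not an actual Stella Ops structure:

```python
from collections import defaultdict

def collapse_macro_edges(edges):
    """Collapse parallel edges over the same (src, dst) pair into one
    macro-edge with a count, capping visual fan-out."""
    grouped = defaultdict(list)
    for e in edges:  # each edge: {"src": ..., "dst": ..., "via": ...}
        grouped[(e["src"], e["dst"])].append(e["via"])
    macro = []
    # Sorted iteration keeps output deterministic across refreshes,
    # which is what prevents "graph shuffle" anxiety.
    for (src, dst), vias in sorted(grouped.items()):
        macro.append({"src": src, "dst": dst, "count": len(vias), "vias": sorted(vias)})
    return macro
```

Deterministic ordering here matters as much as the collapse itself: the same input must always yield the same macro-edge list so layout anchors stay put.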
+ +# Offline‑friendly design + +* **Local evidence cache**: store (SBOM slices, path proofs, DSSE attestations, compiled call‑stacks) in a signed bundle beside the SARIF/VEX. +* **Deferred enrichment**: mark fields that need internet (e.g., upstream CSAF fetch) and queue a background “enricher” when network returns. +* **Predictable fallbacks**: if provenance server missing, show embedded DSSE and “verification pending,” never blank states. + +# Audit & replay + +* **Deterministic replay token**: hash(feed manifests + rules + lattice policy + inputs) → attach to every decision. +* **One‑click “Reproduce”**: opens CLI snippet pinned to the exact versions and policies. +* **Evidence hash‑set**: content‑address each proof artifact; the audit entry stores only hashes + signer. + +# TTFS & Clicks‑to‑Closure: how to measure in code + +* Emit a `ttfs.start` at alert creation; first paint of any evidence card emits `ttfs.signal`. +* Increment a per‑alert **interaction counter**; on “Record decision” emit `close.clicks`. +* Log **evidence bitset** (reach, stack, prov, vex) at decision time for completeness scoring. + +# Developer tasks (concrete, shippable) + +* **Evidence API**: `GET /alerts/{id}/evidence` returns `{reachability, callstack, provenance, vex, hashes[]}` with deterministic sort. +* **Proof renderer**: tiny, no‑framework widget that can render from the offline bundle; hydrate to full only on interaction. +* **Keyboard map**: global handler with overlay help (`?`); no collisions; all actions are idempotent. +* **Graph service**: server‑side layout + snapshot PNG; client hydrates WebGL only when user expands. +* **Smart‑diff**: diff SBOM/VEX → classify into “risk‑raising / neutral / reducing,” surface only the first item by default. +* **Audit logger**: append‑only stream; signed checkpoints; export `.stella-audit.tgz` (attestations + JSONL). + +# Benchmarks to run weekly + +* **TTFS under poor network** (100 ms RTT, 1% loss): p95 < 1.5 s to first evidence. 
+* **Graph hydration on 250k‑edge image**: preview < 300 ms, interactive < 2.0 s. +* **Keyboard coverage**: ≥90% of triage actions executable without mouse. +* **Offline replay**: 100% of decisions re‑render from bundle; zero web calls required. + +# Why Stella’s approach reduces hesitation + +* **Deterministic sort orders** keep findings in place between refreshes. +* **Minimal‑latency graph snapshots** show something trustworthy immediately, then refine—no “blank panel” delay. +* **Replayable, signed bundles** make every click auditable and reversible, which builds operator confidence. + +If you want, I can turn this into: + +* a **UI checklist** for a design review, +* a **.NET 10 API contract** (DTOs + endpoints), +* or a **Cypress/Playwright test plan** that measures TTFS and clicks‑to‑closure automatically. +Below is a PM‑style implementation guideline you can hand to developers. It’s written as a **build spec**: clear goals, “MUST/SHOULD” requirements, acceptance criteria, and the non‑functional guardrails (performance, offline, auditability) that make triage feel fast and defensible. + +--- + +# Stella Ops — Evidence‑First Triage Implementation Guidelines (PM Spec) + +## 0) Assumptions and scope + +**Assumptions** + +* Stella Ops ingests vulnerability findings (SCA/SAST/image scans), has SBOM context, and can compute reachability/call paths. +* Triage outcomes must be recorded as VEX/CSAF‑compatible states with reasons and audit trails. +* Users may operate in restricted networks and need an offline mode that still shows evidence. + +**In scope** + +* Evidence‑first alert triage UI + APIs + telemetry. +* Reachability proof + call stack view + provenance attestation view. +* VEX/CSAF decision recording with audit export. +* Offline evidence bundle and deterministic replay token. + +**Out of scope (for this phase)** + +* Building the underlying static analyzer or SBOM generator (we consume their outputs). 
+* Full CSAF publishing workflow (we store and export; publishing is separate). +* Remediation automation (PRs, patching). + +--- + +## 1) Product principles (non‑negotiables) + +1. **Evidence before detail** + Opening an alert **MUST** show the best available evidence immediately (even partial/placeholder), not a generic “details” page. +2. **Fast first signal** + The UI **MUST** render a credible “first signal” quickly (reachability badge, call stack snippet, or provenance block). +3. **Determinism reduces hesitation** + Sorting, graphs, and diffs **MUST** be stable across refreshes. No jittery re-layout. +4. **Offline by design** + If evidence exists locally (bundle), the UI **MUST** render it without network access. +5. **Audit-ready by default** + Every decision **MUST** be reproducible, attributable, and exportable with evidence hashes. + +--- + +## 2) Success metrics (what we ship toward) + +These become acceptance criteria and dashboards. + +### Primary metrics (P0) + +* **TTFS (Time‑to‑First‑Signal)**: p95 < **1.5s** from opening an alert to first evidence card rendering (with 100ms RTT, 1% loss simulation). +* **Clicks‑to‑Closure**: median < **6** interactions to record a VEX decision. +* **Evidence completeness** at decision time: ≥ **90%** of decisions include evidence hash set + reason + replay token. + +### Secondary metrics (P1) + +* **Offline resolution rate**: ≥ **95%** of alerts opened with a local bundle show reachability + provenance without network. +* **Graph usability**: preview render < **300ms**, interactive hydration < **2.0s** for large graphs (see §7). + +--- + +## 3) User workflows and “Definition of Done” + +### Workflow A: Triage an alert to a decision + +**DoD**: user can open an alert, see evidence, set VEX state, and the system records a signed/auditable decision event. + +**Steps** + +1. Alert list shows key signals (reachability badge, decision state, diff indicator). +2. Open alert → Evidence view loads first. +3. 
User reviews reachability/call stack/provenance. +4. User sets VEX status + reason preset (editable). +5. User records decision. +6. Audit log entry appears instantly and is exportable. + +### Workflow B: Explain “why is this flagged?” + +**DoD**: user can show a defensible proof (path/call stack/provenance) and copy it into a ticket. + +--- + +## 4) UI requirements (MUST/SHOULD/MAY) + +## 4.1 Alert list page + +**MUST** + +* Each row includes: + + * Severity + component identifier + * **Decision state** (Unset / Under Investigation / Not Affected / Affected) + * **Reachability badge** (Reachable / Not Reachable / Unknown) where available + * **Diff indicator** if SBOM/VEX changed since last scan (simple dot/label) + * Age / first seen / last updated +* **Deterministic sort** default: + `Reachability DESC → Severity DESC → Decision state (Unset first) → Age DESC → Component name ASC` +* Keyboard navigation: + + * `↑/↓` move selection, `Enter` open alert. + * `/` search/filter focus. + +**SHOULD** + +* Inline “quick set” decision menu (Affected / Not affected / Under investigation) without leaving list for obvious cases, but still requires reason and logs evidence hashes. + +## 4.2 Alert detail — landing tab MUST be Evidence + +**MUST** + +* Default landing is **Evidence** (not “Overview”). +* Top section shows 3 “proof pills” with status: + + * Reachability (✓ / ! / …) + * Call stack (✓ / ! / …) + * Provenance (✓ / ! / …) +* Each pill expands inline (no navigation) into a compact evidence panel. + +**MUST: No blank panels** + +* If evidence is loading, show skeleton + “what’s coming.” +* If evidence missing, show a reason (“not computed”, “requires source map”, “offline – enrichment pending”). + +## 4.3 Decision drawer + +**MUST** + +* Pinned right drawer (or persistent bottom sheet on small screens). 
+* Controls: + + * VEX/CSAF status: **Affected / Not affected / Under investigation** + * Reason preset dropdown + editable reason text + * “Record decision” button +* Preview “Audit summary” before submit: + + * Evidence hashes included + * Policy context (ruleset version) + * Replay token + * Actor identity + +**MUST** + +* On submit, create an append-only audit event and immediately reflect status in UI. + +**SHOULD** + +* Allow attaching references: ticket URL, incident ID, PR link (stored as metadata). + +## 4.4 Diff tab + +**MUST** + +* Show delta since last scan: + + * SBOM diffs (component version changes, removals/additions) + * VEX diffs (status changes) +* Group diffs by **risk shift**: + + * Risk‑raising (new reachable vuln, severity increase) + * Neutral (metadata-only) + * Risk‑reducing (fixed version, reachability removed) + +**SHOULD** + +* Provide “Copy diff summary” for change management. + +## 4.5 Activity/Audit tab + +**MUST** + +* Immutable timeline of decisions and evidence changes. +* Each entry includes: + + * actor, timestamp, decision, reason + * evidence hash set + * replay token + * bundle/export availability + +--- + +## 5) Power-user and accessibility requirements + +### Keyboard shortcuts (MUST) + +* `J`: jump to next missing/incomplete evidence panel +* `R`: toggle reachability view (list ↔ compact graph ↔ textual proof) +* `Y`: copy selected evidence block (call stack / DSSE / path proof) +* `A`: set “Affected” (opens reason preset selection) +* `N`: set “Not affected” +* `U`: set “Under investigation” +* `?`: keyboard help overlay + +### Accessibility (MUST) + +* Fully navigable by keyboard +* Visible focus states +* Screen-reader labels for evidence pills and drawer controls +* Color is never the only signal (badges must have text/icon) + +--- + +## 6) Evidence model: what every alert should attempt to provide + +Treat this as the **minimum evidence bundle**. Each item may be “unavailable,” but must be explicit. 
+ +**MUST** support: + +1. **Reachability proof** + + * At least one of: + + * function-level call path: `entry → … → vulnerable_sink` + * package/module import chain + * Includes confidence/algorithm tag: `static`, `dynamic`, `heuristic` +2. **Call stack snippet** + + * 5–10 frames around the relevant node with file:line anchors where possible +3. **Provenance** + + * DSSE attestation or equivalent statement + * Artifact ancestry chain: image → layer → artifact → commit (as available) + * Verification status: verified / pending / failed (with reason) +4. **Decision state** + + * VEX status + reason + timestamps +5. **Evidence hash set** + + * Content-addressed hashes of each evidence artifact included in the decision + +**SHOULD** + +* “Evidence freshness”: when computed, tool version, input revisions. + +--- + +## 7) Performance and graph rendering requirements + +### TTFS budget (MUST) + +* When opening an alert: + + * **<200ms**: show skeleton and cached row metadata + * **<500ms**: render at least one evidence pill with meaningful content OR a cached preview image + * **<1.5s p95**: render reachability + provenance for typical alerts + +### Graph rendering for large call graphs (MUST) + +* **Two-phase rendering** + + 1. Server-generated **static snapshot** (PNG/SVG) displayed immediately + 2. Interactive graph hydrates lazily on user expand +* **Progressive expansion** + + * Load 1-hop neighborhood first; expand on click +* **Deterministic layout** + + * Same input produces same layout anchors (no reshuffles between refreshes) +* **Fan-out control** + + * Collapse repeated library paths into “macro edges” to keep the graph readable + +--- + +## 8) Offline mode requirements + +Offline is not “nice to have”; it is a defined mode. 
+ +### Offline evidence bundle (MUST) + +* A single file (e.g., `.stella.bundle.tgz`) that contains: + + * Alert metadata snapshot + * Evidence artifacts (reachability proofs, call stacks, provenance attestations) + * SBOM slice(s) necessary for diffs + * VEX decision history (if available) + * Manifest with content hashes (Merkle-ish) +* Bundle must be **signed** (or include signature material) and verifiable. + +### UI behavior (MUST) + +* If bundle is present: + + * UI loads evidence from it first + * Any missing items show “enrichment pending” (not “error”) +* If network returns: + + * Background refresh allowed, but **must not reorder** the alert list unexpectedly + * Must surface “updated evidence available” as a user-controlled refresh, not an auto-switch that changes context mid-triage + +--- + +## 9) Auditability and replay requirements + +### Decision event schema (MUST) + +Every recorded decision must store: + +* `alert_id`, `artifact_id` (image digest or commit hash) +* `actor_id`, `timestamp` +* `decision_status` (Affected/Not affected/Under investigation) +* `reason_code` (preset) + `reason_text` +* `evidence_hashes[]` (content-addressed hashes) +* `policy_context` (ruleset version, policy id) +* `replay_token` (hash of inputs needed to reproduce) + +### Replay token (MUST) + +* Deterministic hash of: + + * scan inputs (SBOM digest, image digest, tool versions) + * policy/rules versions + * reachability algorithm version +* “Reproduce” button produces a CLI snippet (copyable) pinned to these versions. + +### Export (MUST) + +* Exportable audit bundle that includes: + + * JSONL of decision events + * evidence artifacts referenced by hashes + * signatures/attestations +* Export must be stable and verifiable later. + +--- + +## 10) API and data contract guidelines (developer-facing) + +This is an implementation guideline, not a full API spec—keep it simple and cache-friendly. 
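The replay token these APIs must attach (§9) can be derived with nothing more than a canonical hash; a minimal sketch, where the field names are placeholders for whatever §9 enumerates:

```python
import hashlib
import json

def replay_token(inputs: dict) -> str:
    """Deterministic token: sha256 over a canonical (sorted-key, no-whitespace)
    JSON encoding of the replay-relevant inputs."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Identical inputs always produce the same token; changing any pinned version (policy, tool, algorithm) produces a new one, which is what makes "Reproduce" trustworthy.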
+
+### MUST endpoints (or equivalent)
+
+* `GET /alerts?filters…` → list view payload (small, cacheable)
+* `GET /alerts/{id}/evidence` → evidence payload (reachability, call stack, provenance, hashes)
+* `POST /alerts/{id}/decisions` → record decision event (append-only)
+* `GET /alerts/{id}/audit` → audit timeline
+* `GET /alerts/{id}/diff?baseline=…` → SBOM/VEX diff view
+* `GET /bundles/{id}` and/or `POST /bundles/verify` → offline bundle download/verify
+
+### Evidence payload guidelines (MUST)
+
+* Deterministic ordering for arrays and nodes (stable sorts).
+* Explicit `status` per evidence section: `available | loading | unavailable | error`.
+* Include `hash` per artifact for content addressing.
+
+**Example shape**
+
+```json
+{
+  "alert_id": "a123",
+  "reachability": { "status": "available", "hash": "sha256:…", "proof": { "type": "call_path", "nodes": [...] } },
+  "callstack": { "status": "available", "hash": "sha256:…", "frames": [...] },
+  "provenance": { "status": "loading", "hash": null, "dsse": { "embedded": true, "payload": "…" } },
+  "vex": { "status": "available", "current": {...}, "history": [...] },
+  "hashes": ["sha256:…", "sha256:…"]
+}
+```
+
+---
+
+## 11) Telemetry requirements (how we prove it’s fast)
+
+**MUST** instrument:
+
+* `alert_opened` (timestamp, alert_id)
+* `evidence_first_paint` (timestamp, evidence_type)
+* `decision_recorded` (timestamp, clicks_count, evidence_bitset)
+* `bundle_loaded` (hit/miss, size, verification_status)
+* `graph_preview_paint` and `graph_hydrated`
+
+**MUST** compute:
+
+* TTFS = `evidence_first_paint - alert_opened`
+* Clicks‑to‑Closure = interaction counter per alert until decision recorded
+* Evidence completeness bitset at decision time: reachability/callstack/provenance/vex present
+
+---
+
+## 12) Error handling and edge cases
+
+**MUST**
+
+* Never show empty states without explanation.
+* Distinguish between: + + * “not computed yet” + * “not possible due to missing inputs” + * “blocked by permissions” + * “offline—enrichment pending” + * “verification failed” + +**SHOULD** + +* Offer “Request enrichment” action when evidence missing (creates a job/task id). + +--- + +## 13) Security, permissions, and multi-tenancy + +**MUST** + +* RBAC gating for: + + * viewing provenance attestations + * recording decisions + * exporting audit bundles +* All decision events are immutable; corrections are new events (append-only). +* PII handling: + + * Avoid storing freeform reasons with secrets; warn on paste patterns (optional P1). + +--- + +## 14) Engineering execution plan (priorities) + +### P0 (ship first) + +* Evidence-first alert detail landing +* Decision drawer + append-only audit +* Deterministic alert list sort + reachability badge +* Evidence API + decision POST +* TTFS + clicks telemetry +* Static graph preview + lazy hydration + +### P1 + +* Offline bundle load/verify + offline rendering +* Smart diff view (risk shift grouping) +* Exportable audit bundle +* Keyboard shortcuts + help overlay + +### P2 + +* Inline quick decisions from list +* Advanced graph search within view +* Suggest reason presets based on evidence patterns + +--- + +## 15) Acceptance criteria checklist (what QA signs off) + +A build is acceptable when: + +* Opening an alert renders at least one evidence pill within **500ms** (with cache) and TTFS p95 meets target under network simulation. +* Users can record A/N/U decisions with reason and see an audit event immediately. +* Decision event includes evidence hashes + replay token. +* Alert list sorting is stable and deterministic across refresh. +* Graph preview appears instantly; interactive graph hydrates only on expand. +* Offline bundle renders evidence without network; missing items show “enrichment pending,” not errors. +* Keyboard shortcuts work; `?` overlay lists them; full keyboard navigation is possible. 
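The §11 telemetry computations that QA verifies above are simple folds over the per-alert event stream; a sketch (the event shapes are assumptions):

```python
# Bit flags for the evidence completeness bitset (§11).
EVIDENCE_FLAGS = {"reachability": 1, "callstack": 2, "provenance": 4, "vex": 8}

def compute_triage_metrics(events):
    """Derive TTFS, clicks-to-closure, and the evidence bitset for one alert
    from its ordered telemetry events; returns None if no decision yet."""
    opened = first_paint = None
    clicks = 0
    bitset = 0
    for ev in events:
        if ev["type"] == "alert_opened":
            opened = ev["ts"]
        elif ev["type"] == "evidence_first_paint":
            if first_paint is None:
                first_paint = ev["ts"]  # TTFS uses the *first* paint only
            bitset |= EVIDENCE_FLAGS.get(ev.get("evidence_type"), 0)
        elif ev["type"] == "interaction":
            clicks += 1
        elif ev["type"] == "decision_recorded":
            ttfs = None if opened is None or first_paint is None else first_paint - opened
            return {"ttfs_ms": ttfs, "clicks_to_closure": clicks, "evidence_bitset": bitset}
    return None
```

Aggregating `ttfs_ms` across alerts gives the p50/p95 dashboards; the bitset at decision time gives the completeness score directly.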
+ +--- + +If you want, I can also format this into a **developer-ready ticket pack** (epics + user stories + acceptance tests) so engineers can implement without interpretation drift. diff --git a/docs/product-advisories/14-Dec-2025 - Evaluate PostgreSQL vs MongoDB for StellaOps.md b/docs/product-advisories/14-Dec-2025 - Evaluate PostgreSQL vs MongoDB for StellaOps.md new file mode 100644 index 000000000..9f73ca4b2 --- /dev/null +++ b/docs/product-advisories/14-Dec-2025 - Evaluate PostgreSQL vs MongoDB for StellaOps.md @@ -0,0 +1,544 @@ +Here’s a quick, practical cheat‑sheet on choosing **PostgreSQL vs MongoDB** for security/DevOps apps—plus how I’d model SBOM/VEX and queues in Stella Ops without adding moving parts. + +--- + +# PostgreSQL you can lean on (why it often wins for ops apps) + +* **JSONB that flies:** Store documents yet query like SQL. Add **GIN indexes** on JSONB fields for fast lookups (`jsonb_ops` general; `jsonb_path_ops` great for `@>` containment). +* **Queue pattern built‑in:** `SELECT … FOR UPDATE SKIP LOCKED` lets multiple workers pop jobs from the same table safely—no head‑of‑line blocking, no extra broker. +* **Cooperative locks:** **Advisory locks** (session/transaction) for “at‑most‑once” sections or leader election. +* **Lightweight pub/sub:** **LISTEN/NOTIFY** for async nudges between services (poke a worker to re‑scan, refresh cache, etc.). +* **Search included:** **Full‑text search** (tsvector/tsquery) is native—no separate search service for moderate needs. +* **Serious backups:** **PITR** with WAL archiving / `pg_basebackup` for deterministic rollbacks and offline bundles. + +# MongoDB facts to factor in + +* **Flexible ingest:** Schemaless docs make it easy to absorb varied telemetry and vendor feeds. +* **Horizontal scale:** Sharding is mature for huge, read‑heavy datasets. +* **Consistency is a choice:** Design embedding vs refs and when to use multi‑document transactions. 
+
+---
+
+# A simple rule of thumb (Stella Ops‑style)
+
+* **System of record:** PostgreSQL (JSONB first).
+* **Hot paths:** Materialized views + JSONB GIN indexes.
+* **Queues & coordination:** PostgreSQL (skip‑locked + advisory locks).
+* **Cache/accel only:** Valkey (ephemeral).
+* **MongoDB:** Optional for **very large, read‑optimized graph snapshots** (e.g., periodically baked reachability graphs) if Postgres starts to strain.
+
+---
+
+# Concrete patterns you can drop in today
+
+**1) SBOM/VEX storage (Postgres JSONB)**
+
+```sql
+-- Documents
+CREATE TABLE sbom (
+  id BIGSERIAL PRIMARY KEY,
+  artifact_purl TEXT NOT NULL,
+  doc JSONB NOT NULL,
+  created_at TIMESTAMPTZ DEFAULT now()
+);
+CREATE INDEX sbom_purl_idx ON sbom(artifact_purl);
+CREATE INDEX sbom_doc_gin ON sbom USING GIN (doc jsonb_path_ops);
+
+-- Common queries
+-- find components by name/version:
+-- SELECT * FROM sbom WHERE doc @> '{"components":[{"name":"openssl","version":"3.0.14"}]}';
+
+-- VEX
+CREATE TABLE vex (
+  id BIGSERIAL PRIMARY KEY,
+  subject_purl TEXT NOT NULL,
+  vex_doc JSONB NOT NULL,
+  created_at TIMESTAMPTZ DEFAULT now()
+);
+CREATE INDEX vex_subject_idx ON vex(subject_purl);
+CREATE INDEX vex_doc_gin ON vex USING GIN (vex_doc jsonb_path_ops);
+```
+
+**2) Hot reads via materialized views**
+
+```sql
+CREATE MATERIALIZED VIEW mv_open_findings AS
+SELECT
+  s.artifact_purl,
+  c->>'name' AS comp,
+  c->>'version' AS ver,
+  v.vex_doc
+FROM sbom s
+CROSS JOIN LATERAL jsonb_array_elements(s.doc->'components') c
+LEFT JOIN vex v ON v.subject_purl = s.artifact_purl
+-- add WHERE clauses to pre‑filter only actionable rows
+;
+-- REFRESH … CONCURRENTLY requires a unique index on the MV
+CREATE UNIQUE INDEX mv_open_findings_idx ON mv_open_findings(artifact_purl, comp, ver);
+```
+
+Refresh cadence: on feed import or via a scheduler; `REFRESH MATERIALIZED VIEW CONCURRENTLY mv_open_findings;` (the `CONCURRENTLY` form needs the unique index above, which assumes one row per `(artifact_purl, comp, ver)`).
+
+**3) Queue without a broker**
+
+```sql
+CREATE TABLE job_queue(
+  id BIGSERIAL PRIMARY KEY,
+  kind TEXT NOT NULL,  -- e.g., 'scan', 'sbom-diff'
+  payload JSONB NOT NULL,
+ 
run_after TIMESTAMPTZ DEFAULT now(),
+  attempts INT DEFAULT 0,
+  locked_at TIMESTAMPTZ,
+  locked_by TEXT
+);
+CREATE INDEX job_queue_ready_idx ON job_queue(kind, run_after);
+
+-- Worker loop
+WITH cte AS (
+  SELECT id FROM job_queue
+  WHERE kind = $1 AND run_after <= now() AND locked_at IS NULL
+  ORDER BY id
+  FOR UPDATE SKIP LOCKED
+  LIMIT 1
+)
+UPDATE job_queue j
+SET locked_at = now(), locked_by = $2
+FROM cte
+WHERE j.id = cte.id
+RETURNING j.*;
+```
+
+Release/fail with: set `locked_at=NULL, locked_by=NULL, attempts=attempts+1` or delete on success.
+
+**4) Advisory lock for singletons**
+
+```sql
+-- Acquire (per tenant, per artifact)
+SELECT pg_try_advisory_xact_lock(hashtextextended('recalc:'||tenant||':'||artifact, 0));
+```
+
+**5) Nudge workers without a bus**
+
+```sql
+-- NOTIFY only accepts a string literal, so use pg_notify() for dynamic payloads
+SELECT pg_notify('stella_scan', json_build_object('purl', $1, 'priority', 5)::TEXT);
+-- workers LISTEN stella_scan and enqueue quickly
+```
+
+---
+
+# When to add MongoDB
+
+* You need **interactive exploration** over **hundreds of millions of nodes/edges** (e.g., historical “proof‑of‑integrity” graphs) where document fan‑out and denormalized reads beat relational joins.
+* Snapshot cadence is **batchy** (hourly/daily), and you can **re‑emit** snapshots deterministically from Postgres (single source of truth).
+* You want to isolate read spikes from the transactional core.
+
+**Snapshot pipe:** Postgres → (ETL) → MongoDB collection `{graph_id, node, edges[], attrs}` with **compound shard keys** tuned to your UI traversal.
+
+---
+
+# Why this fits Stella Ops
+
+* Fewer moving parts on‑prem/air‑gapped.
+* Deterministic replays (PITR + immutable imports).
+* Clear performance levers (GIN indexes, MVs, skip‑locked queues).
+* MongoDB stays optional, purpose‑built for giant read graphs—not a default dependency.
+
+If you want, I can turn the above into ready‑to‑run `.sql` migrations and a small **.NET 10** worker (Dapper/EF Core) that implements the queue loop + advisory locks + LISTEN/NOTIFY hooks.
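As a sketch of the snapshot pipe's transform step — the `{graph_id, node, edges[], attrs}` shape comes from above, everything else (row layout, field names) is illustrative:

```python
def bake_graph_snapshot(graph_id, edge_rows):
    """Turn relational edge rows (src, dst, attrs) into one denormalized
    document per source node, ready for a read-optimized store."""
    docs = {}
    for src, dst, attrs in edge_rows:
        doc = docs.setdefault(src, {"graph_id": graph_id, "node": src,
                                    "edges": [], "attrs": {}})
        doc["edges"].append(dst)
        doc["attrs"].update(attrs)
    for doc in docs.values():
        doc["edges"].sort()  # deterministic re-emission from the source of truth
    return sorted(docs.values(), key=lambda d: d["node"])
```

Because the transform is a pure function of Postgres rows, snapshots can be re-emitted deterministically at any time, keeping MongoDB strictly downstream of the system of record.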
+Below is a handoff-ready set of **PostgreSQL tables/views engineering guidelines** intended for developer review. It is written as a **gap-finding checklist** with **concrete DDL patterns** and **performance red flags** (Postgres as system of record, JSONB where useful, derived projections where needed). + +--- + +# PostgreSQL Tables & Views Engineering Guide + +## 0) Non-negotiable principles + +1. **Every hot query must have an index story.** If you cannot name the index that serves it, you have a performance gap. +2. **Write path stays simple.** Prefer **append-only** versioning to large updates (especially for JSONB). +3. **Multi-tenant must be explicit.** Every core table includes `tenant_id` and indexes are tenant-prefixed. +4. **Derived data is a product.** If the UI needs it fast, model it as a **projection table or materialized view**, not as an ad-hoc mega-join. +5. **Idempotency is enforced in the DB.** Unique keys for imports/jobs/results; no “best effort” dedupe in application only. + +--- + +# 1) Table taxonomy and what to look for + +Use this to classify every table; each class has different indexing/retention/locking rules. + +### A. Source-of-truth (SOR) tables + +Examples: `sbom_document`, `vex_document`, `feed_import`, `scan_manifest`, `attestation`. + +* **Expect:** immutable rows, versioning via new row inserts. +* **Gaps:** frequent updates to large JSONB; missing `content_hash`; no unique idempotency key. + +### B. Projection tables (query-optimized) + +Examples: `open_findings`, `artifact_risk_summary`, `component_index`. + +* **Expect:** denormalized, indexed for UI/API; refresh/update strategy defined. +* **Gaps:** projections rebuilt from scratch too often; missing incremental update plan; no retention plan. + +### C. Queue/outbox tables + +Examples: `job_queue`, `outbox_events`. + +* **Expect:** `SKIP LOCKED` claim pattern; retry + DLQ; minimal lock duration. 
+* **Gaps:** holding row locks while doing work; missing partial index for “ready” jobs. + +### D. Audit/event tables + +Examples: `scan_run_event`, `decision_event`, `access_audit`. + +* **Expect:** append-only; partitioned by time; BRIN on timestamps. +* **Gaps:** single huge table without partitioning; slow deletes instead of partition drops. + +--- + +# 2) Naming, keys, and required columns + +## Required columns per class + +### SOR documents (SBOM/VEX/Attestations) + +* `tenant_id uuid` +* `id bigserial` (internal PK) +* `external_id uuid` (optional API-facing id) +* `content_hash bytea` (sha256) **NOT NULL** +* `doc jsonb` **NOT NULL** +* `created_at timestamptz` **NOT NULL default now()** +* `supersedes_id bigint NULL` (version chain) OR `version int` + +**Checklist** + +* [ ] Unique constraint exists: `(tenant_id, content_hash)` +* [ ] Version strategy exists (supersedes/version) and is queryable +* [ ] “Latest” access is index-backed (see §4) + +### Queue + +* `tenant_id uuid` (if multi-tenant) +* `id bigserial` +* `kind text` +* `payload jsonb` +* `run_after timestamptz` +* `attempts int` +* `locked_at timestamptz NULL` +* `locked_by text NULL` +* `status smallint` (optional; e.g., ready/running/done/dead) + +**Checklist** + +* [ ] “Ready to claim” has a partial index (see §4) +* [ ] Claim transaction is short (claim+commit; work outside lock) + +--- + +# 3) JSONB rules that prevent “looks fine → melts in prod” + +## When JSONB is appropriate + +* Storing signed envelopes (DSSE), SBOM/VEX raw docs, vendor payloads. +* Ingest-first scenarios where schema evolves. + +## When JSONB is a performance hazard + +* You frequently query deep keys/arrays (components, vulnerabilities, call paths). +* You need sorting/aggregations on doc fields. + +**Mandatory pattern for hot JSON fields** + +1. Keep the raw JSONB for fidelity. +2. Extract **hot keys** into **stored generated columns** (or real columns), index those. +3. 
Extract **hot arrays** into child tables (components, vulnerabilities). + +Example: + +```sql +CREATE TABLE sbom_document ( + id bigserial PRIMARY KEY, + tenant_id uuid NOT NULL, + artifact_purl text NOT NULL, + content_hash bytea NOT NULL, + doc jsonb NOT NULL, + created_at timestamptz NOT NULL DEFAULT now(), + + -- hot keys as generated columns + bom_format text GENERATED ALWAYS AS ((doc->>'bomFormat')) STORED, + spec_version text GENERATED ALWAYS AS ((doc->>'specVersion')) STORED +); + +CREATE UNIQUE INDEX ux_sbom_doc_hash ON sbom_document(tenant_id, content_hash); +CREATE INDEX ix_sbom_doc_tenant_artifact ON sbom_document(tenant_id, artifact_purl, created_at DESC); +CREATE INDEX ix_sbom_doc_json_gin ON sbom_document USING GIN (doc jsonb_path_ops); +CREATE INDEX ix_sbom_doc_bomformat ON sbom_document(tenant_id, bom_format); +``` + +**Checklist** + +* [ ] Any query using `doc->>` in WHERE has either an expression index or a generated column index +* [ ] Any query using `jsonb_array_elements(...)` in hot path has been replaced by a normalized child table or a projection table + +--- + +# 4) Indexing standards (what devs must justify) + +## Core rules + +1. **Tenant-first**: `INDEX(tenant_id, …)` for anything read per tenant. +2. **Sort support**: if query uses `ORDER BY created_at DESC`, index must end with `created_at DESC`. +3. **Partial indexes** for sparse predicates (status/locked flags). +4. **BRIN** for massive append-only time series. +5. **GIN jsonb_path_ops** for containment (`@>`) on JSONB; avoid GIN for everything. + +## Required index patterns by use case + +### “Latest version per artifact” + +If you store versions as rows: + +```sql +-- supports: WHERE tenant_id=? AND artifact_purl=? 
ORDER BY created_at DESC LIMIT 1 +CREATE INDEX ix_sbom_latest ON sbom_document(tenant_id, artifact_purl, created_at DESC); +``` + +### Ready queue claims + +```sql +CREATE INDEX ix_job_ready +ON job_queue(kind, run_after, id) +WHERE locked_at IS NULL; + +-- Optional: tenant scoped +CREATE INDEX ix_job_ready_tenant +ON job_queue(tenant_id, kind, run_after, id) +WHERE locked_at IS NULL; +``` + +### JSON key lookup (expression index) + +```sql +-- supports: WHERE (doc->>'subject') = ? +CREATE INDEX ix_vex_subject_expr +ON vex_document(tenant_id, (doc->>'subject')); +``` + +### Massive event table time filtering + +```sql +CREATE INDEX brin_scan_events_time +ON scan_run_event USING BRIN (occurred_at); +``` + +**Red flags** + +* GIN index on a JSONB column + frequent updates = bloat and write amplification. +* No partial index for queue readiness → sequential scans under load. +* Composite indexes with wrong leading column order (e.g., `created_at, tenant_id`) → not used. + +--- + +# 5) Partitioning and retention (avoid “infinite tables”) + +Use partitioning for: + +* audit/events +* scan run logs +* large finding histories +* anything > tens of millions rows with time-based access + +## Standard approach + +* Partition by `occurred_at` (monthly) for event/audit tables. +* Retention by dropping partitions (fast and vacuum-free). + +Example: + +```sql +CREATE TABLE scan_run_event ( + tenant_id uuid NOT NULL, + scan_run_id bigint NOT NULL, + occurred_at timestamptz NOT NULL, + event_type text NOT NULL, + payload jsonb NOT NULL +) PARTITION BY RANGE (occurred_at); +``` + +**Checklist** + +* [ ] Partition creation/rollover process exists (migration or scheduler) +* [ ] Retention is “DROP PARTITION”, not “DELETE WHERE occurred_at < …” +* [ ] Each partition has needed local indexes (BRIN/time + tenant filters) + +--- + +# 6) Views vs Materialized Views vs Projection Tables + +## Use a normal VIEW when + +* It’s thin (renaming columns, simple joins) and not used in hot paths. 
+ +## Use a MATERIALIZED VIEW when + +* It accelerates complex joins/aggregations and can be refreshed on a schedule. +* You can tolerate refresh lag. + +**Materialized view requirements** + +* Must have a **unique index** to use `REFRESH … CONCURRENTLY`. +* Refresh must be **outside** an explicit transaction block. + +Example: + +```sql +CREATE MATERIALIZED VIEW mv_artifact_risk AS +SELECT tenant_id, artifact_purl, max(score) AS risk_score +FROM open_findings +GROUP BY tenant_id, artifact_purl; + +CREATE UNIQUE INDEX ux_mv_artifact_risk +ON mv_artifact_risk(tenant_id, artifact_purl); +``` + +## Prefer projection tables over MV when + +* You need **incremental updates** (on import/scan completion). +* You need deterministic “point-in-time” snapshots per manifest. + +**Checklist** + +* [ ] Every MV has refresh cadence + owner (which worker/job triggers it) +* [ ] UI/API queries do not depend on a heavy non-materialized view +* [ ] If “refresh cost” scales with whole dataset, projection table exists instead + +--- + +# 7) Queue and outbox patterns that do not deadlock + +## Claim pattern (short transaction) + +```sql +WITH cte AS ( + SELECT id + FROM job_queue + WHERE kind = $1 + AND run_after <= now() + AND locked_at IS NULL + ORDER BY id + FOR UPDATE SKIP LOCKED + LIMIT 1 +) +UPDATE job_queue j +SET locked_at = now(), + locked_by = $2 +FROM cte +WHERE j.id = cte.id +RETURNING j.*; +``` + +**Rules** + +* Claim + commit quickly. +* Do work outside the lock. +* On completion: update row to done (or delete if you want compactness). +* On failure: increment attempts, set `run_after = now() + backoff`, release lock. 
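The completion and failure branches above can be sketched as follows; the `done_at` column, the attempt cap, and the backoff formula are illustrative assumptions, not the actual schema:

```sql
-- On success: release the claim and mark the job done.
UPDATE job_queue
SET locked_at = NULL, locked_by = NULL, done_at = now()
WHERE id = $1;

-- On failure: bump attempts, schedule a retry with exponential backoff,
-- and release the lock so another worker can claim the row later.
UPDATE job_queue
SET attempts  = attempts + 1,
    run_after = now() + make_interval(secs => 30 * power(2, attempts)),
    locked_at = NULL,
    locked_by = NULL
WHERE id = $1;

-- DLQ condition (attempts > N) stays a plain query:
SELECT * FROM job_queue WHERE attempts > 5 AND done_at IS NULL;
```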


**Checklist**

* [ ] Worker does not keep a transaction open while scanning/importing
* [ ] Backoff policy is encoded (in DB columns) and observable
* [ ] DLQ condition exists (attempts > N) and is queryable

---

# 8) Query performance review checklist (what to require in PRs)

For each new endpoint/query:

* [ ] Provide the query (SQL) and the intended parameters.
* [ ] Provide `EXPLAIN (ANALYZE, BUFFERS)` from a dataset size that resembles staging.
* [ ] Identify the serving index(es).
* [ ] Confirm row estimates are not wildly wrong (if they are: stale statistics or a predicate mismatch).
* [ ] Confirm it is tenant-scoped and uses the tenant-leading index.

**Common fixes**

* Replace `IN (SELECT …)` with `EXISTS` for correlated checks.
* Back `ORDER BY … LIMIT` queries with an index that matches the ordering; otherwise Postgres must scan and sort the candidate rows before applying the limit.
* Avoid exploding joins with JSON arrays; pre-extract.

---

# 9) Vacuum, bloat, and “why is disk growing”

## Design to avoid bloat

* Append-only for large docs and events.
* If frequent updates are needed, isolate hot-updated columns into a smaller table.

Example split:

* `job_queue_payload` (stable)
* `job_queue_state` (locked/status/attempts, updated frequently)

**Checklist**

* [ ] Large frequently-updated JSONB tables have been questioned
* [ ] Updates do not rewrite big TOAST values repeatedly
* [ ] Retention is partition-drop where possible

---

# 10) Migration safety rules (prevent production locks)

* Index creation: `CREATE INDEX CONCURRENTLY`.
* Dropping indexes: `DROP INDEX CONCURRENTLY`.
* New column with a default on a large table (note: on PostgreSQL 11+, adding a column with a constant default is metadata-only and safe; the staged path below is for older versions or volatile defaults):

  1. `ADD COLUMN` nullable
  2. backfill in batches
  3. `ALTER COLUMN SET NOT NULL`
  4. 
add default if needed + +**Checklist** + +* [ ] No long-running `ALTER TABLE` on huge tables without plan +* [ ] Any new NOT NULL constraint is staged safely + +--- + +# 11) Stella Ops-specific schema guidance (SBOM/VEX/Finding) + +## Minimum recommended normalized tables + +Even if you keep raw SBOM/VEX JSON: + +* `sbom_document` (raw, immutable) +* `sbom_component` (extracted components) +* `vex_document` (raw, immutable) +* `vex_statement` (extracted statements per CVE/component) +* `finding` (facts: CVE ↔ component ↔ artifact ↔ scan_run) +* `scan_manifest` (determinism: feed versions/hashes, policy hash) +* `scan_run` (links results to manifest) + +**Key gap detectors** + +* If “find all artifacts affected by CVE X” is slow → missing `finding` indexing. +* If “component search” is slow → missing `sbom_component` and its indexes. +* If “replay this scan” is not exact → missing `scan_manifest` + feed import hashes. + +--- + +# 12) Minimal “definition of done” for a new table/view + +A PR adding a table/view is incomplete unless it includes: + +* [ ] Table classification (SOR / projection / queue / event) +* [ ] Primary key and idempotency unique key +* [ ] Tenant scoping strategy +* [ ] Index plan mapped to known queries +* [ ] Retention plan (especially for event/projection tables) +* [ ] Refresh/update plan if derived +* [ ] Example query + `EXPLAIN` for the top 1–3 access patterns + +--- + +If you want this as a single drop-in repo document, tell me the target path (e.g., `/docs/platform/postgres-table-view-guidelines.md`) and I will format it exactly as a team-facing guideline, including a one-page “Architecture/Performance Gaps” review form that engineers can paste into PR descriptions. 
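
As an illustration of the first gap detector above (“find all artifacts affected by CVE X”), a sketch assuming illustrative `finding` column names:

```sql
-- Hypothetical serving index for CVE → artifact lookups.
CREATE INDEX ix_finding_cve
ON finding(tenant_id, cve_id) INCLUDE (artifact_purl);

-- Tenant-scoped and index-only-scan friendly.
SELECT DISTINCT artifact_purl
FROM finding
WHERE tenant_id = $1
  AND cve_id = $2;
```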
diff --git a/docs/product-advisories/archived/AR-REVIVE-PLAN.md b/docs/product-advisories/archived/AR-REVIVE-PLAN.md deleted file mode 100644 index c7aff3cf5..000000000 --- a/docs/product-advisories/archived/AR-REVIVE-PLAN.md +++ /dev/null @@ -1,12 +0,0 @@ -# Archived Advisories Revival Plan (Stub) - -Use with sprint task 13 (ARCHIVED-GAPS-300-020). - -- Candidate advisories to revive: - - SBOM-Provenance-Spine - - Binary reachability (VB branch) - - Function-level VEX explainability - - PostgreSQL storage blueprint -- Decide canonical schemas/recipes (provenance, reachability, PURL/Build-ID). -- Document determinism seeds/SLOs, redaction/isolation rules, changelog/signing approach. -- Mark supersedes/duplicates and PostgreSQL storage blueprint guardrails.