up

2025-12-14 16:23:44 +02:00
parent 233873f620
commit 01f4943ab9
8 changed files with 6193 additions and 12 deletions
--- a/docs/product-advisories/14-Dec-2025
+++ b/docs/product-advisories/14-Dec-2025
@@ -0,0 +1,551 @@
+Here’s a tight, practical blueprint for building (and proving) a fast, evidence‑first triage workflow—plus the power‑user affordances that make Stella Ops feel “snappy” even offline.
+
+# What “good” looks like (background in plain words)
+
+* **Alert → evidence → decision** in one flow: an alert should open directly onto the concrete proof (reachability, call‑stack, provenance), then offer a one‑click decision (VEX/CSAF status) with audit logging.
+* **Time‑to‑First‑Signal (TTFS)** is king: how fast a human sees the first credible piece of evidence that explains *why this alert matters here*.
+* **Clicks‑to‑Closure**: count how many interactions it takes to reach a defensible decision recorded in the audit log.
+
+# Minimal evidence bundle per finding
+
+* **Reachability proof**: function‑level path or package‑level import chain (with “toggle reachability view” hotkey).
+* **Call‑stack snippet**: 5–10 frames around the sink/source with file:line anchors.
+* **Provenance**: attestation / DSSE + build ancestry (image → layer → artifact → commit).
+* **VEX/CSAF status**: affected/not‑affected/under‑investigation + reason.
+* **Diff**: what changed since last scan (SBOM or VEX delta), rendered as a small, human‑readable “smart‑diff.”
+
+# KPIs to measure in CI and UI
+
+* **TTFS (p50/p95)** from opening an alert to first rendered evidence.
+* **Clicks‑to‑Closure (median)** per decision type.
+* **Evidence completeness score** (0–4): reachability, call‑stack, provenance, VEX/CSAF present.
+* **Offline friendliness score**: % of evidence resolvable with no network.
+* **Audit log completeness**: every decision has: evidence hash set, actor, policy context, replay token.
+
+# Power‑user affordances (keyboard first)
+
+* **Jump to evidence** (`J`): focuses the first incomplete evidence pane.
+* **Copy DSSE** (`Y`): copies the attestation block or Rekor entry ref.
+* **Toggle reachability view** (`R`): path list ↔ compact graph ↔ textual proof.
+* **Search‑within‑graph** (`/`): node/func/package, instant.
+* **Deterministic sort** (`S`): stable sort by (reachability→severity→age→component) to remove hesitation.
+* **Quick VEX set** (`A`, `N`, `U`): Affected / Not‑affected / Under‑investigation with templated reasons.
+
+# UX flow to implement (end‑to‑end)
+
+1. **Alert row** shows: TTFS timer, reachability badge, “decision state,” and a diff‑dot if something changed.
+2. **Open alert** lands on **Evidence tab** (not Details). Top strip = three proof pills:
+
+   * Reachability ✓ / Call‑stack ✓ / Provenance ✓ (click to expand inline).
+3. **Decision drawer** pinned on the right:
+
+   * VEX/CSAF radio (A/N/U) → Reason presets → “Record decision.”
+   * Shows **audit‑ready summary** (hashes, timestamps, policy).
+4. **Diff tab**: SBOM/VEX delta since last run, grouped by “meaningful risk shift.”
+5. **Activity tab**: immutable audit log; export as a signed bundle for audits.
+
+# Graph performance on large call‑graphs
+
+* **Minimal‑latency snapshots**: pre‑render static PNG/SVG thumbnails server‑side; open with tiny preview then hydrate to interactive graph lazily.
+* **Progressive neighborhood expansion**: load 1‑hop first, expand on demand; keep the first evidence paint < 500 ms.
+* **Stable node ordering**: deterministic layout with consistent anchors to avoid “graph shuffle” anxiety.
+* **Chunked graph edges** with capped fan‑out; collapse identical library paths into a **reachability macro‑edge**.
+
+# Offline‑friendly design
+
+* **Local evidence cache**: store (SBOM slices, path proofs, DSSE attestations, compiled call‑stacks) in a signed bundle beside the SARIF/VEX.
+* **Deferred enrichment**: mark fields that need internet (e.g., upstream CSAF fetch) and queue a background “enricher” when network returns.
+* **Predictable fallbacks**: if the provenance server is unreachable, show the embedded DSSE and “verification pending”; never show a blank state.
+
+# Audit & replay
+
+* **Deterministic replay token**: hash(feed manifests + rules + lattice policy + inputs) → attach to every decision.
+* **One‑click “Reproduce”**: opens CLI snippet pinned to the exact versions and policies.
+* **Evidence hash‑set**: content‑address each proof artifact; the audit entry stores only hashes + signer.
+
+# TTFS & Clicks‑to‑Closure: how to measure in code
+
+* Emit a `ttfs.start` when the alert is opened; first paint of any evidence card emits `ttfs.signal`.
+* Increment a per‑alert **interaction counter**; on “Record decision” emit `close.clicks`.
+* Log **evidence bitset** (reach, stack, prov, vex) at decision time for completeness scoring.
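
As a minimal sketch of the three measurements above, the following Python class keeps per-alert counters. The class name `AlertTelemetry`, the bit order in `EVIDENCE_BITS`, and the shape of the returned dictionary are assumptions made for illustration; only the event names `ttfs.signal` and `close.clicks` come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional

# Bit positions for the evidence bitset (the order is an assumption of this sketch).
EVIDENCE_BITS = {"reach": 1, "stack": 2, "prov": 4, "vex": 8}


@dataclass
class AlertTelemetry:
    """Per-alert counters; timestamps are seconds from a monotonic clock."""
    started_at: float                      # timestamp carried by ttfs.start
    first_signal_at: Optional[float] = None
    clicks: int = 0
    events: list = field(default_factory=list)

    def evidence_painted(self, now: float) -> None:
        # Only the first painted evidence card counts as the signal.
        if self.first_signal_at is None:
            self.first_signal_at = now
            self.events.append(("ttfs.signal", now - self.started_at))

    def interaction(self) -> None:
        self.clicks += 1

    def record_decision(self, present: set) -> dict:
        # Fold the evidence kinds present at decision time into one integer.
        bitset = sum(bit for name, bit in EVIDENCE_BITS.items() if name in present)
        ttfs = None
        if self.first_signal_at is not None:
            ttfs = self.first_signal_at - self.started_at
        event = {
            "close.clicks": self.clicks,
            "evidence_bitset": bitset,
            "completeness": bin(bitset).count("1"),  # 0-4 score
            "ttfs_seconds": ttfs,
        }
        self.events.append(("decision", event))
        return event


telemetry = AlertTelemetry(started_at=10.0)
telemetry.evidence_painted(10.5)
telemetry.evidence_painted(11.0)  # ignored: the first signal is already recorded
for _ in range(3):
    telemetry.interaction()
summary = telemetry.record_decision({"reach", "prov", "vex"})
```

Keeping the bitset as an integer makes the completeness score a simple population count, and the same value can be logged as-is for later aggregation.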
+
+# Developer tasks (concrete, shippable)
+
+* **Evidence API**: `GET /alerts/{id}/evidence` returns `{reachability, callstack, provenance, vex, hashes[]}` with deterministic sort.
+* **Proof renderer**: tiny, no‑framework widget that can render from the offline bundle; hydrate to full only on interaction.
+* **Keyboard map**: global handler with overlay help (`?`); no collisions; all actions are idempotent.
+* **Graph service**: server‑side layout + snapshot PNG; client hydrates WebGL only when user expands.
+* **Smart‑diff**: diff SBOM/VEX → classify into “risk‑raising / neutral / reducing,” surface only the first item by default.
+* **Audit logger**: append‑only stream; signed checkpoints; export `.stella-audit.tgz` (attestations + JSONL).
+
+# Benchmarks to run weekly
+
+* **TTFS under poor network** (100 ms RTT, 1% loss): p95 < 1.5 s to first evidence.
+* **Graph hydration on 250k‑edge image**: preview < 300 ms, interactive < 2.0 s.
+* **Keyboard coverage**: ≥90% of triage actions executable without mouse.
+* **Offline replay**: 100% of decisions re‑render from bundle; zero web calls required.
+
+# Why Stella’s approach reduces hesitation
+
+* **Deterministic sort orders** keep findings in place between refreshes.
+* **Minimal‑latency graph snapshots** show something trustworthy immediately, then refine—no “blank panel” delay.
+* **Replayable, signed bundles** make every click auditable and reversible, which builds operator confidence.
+
+Below is a PM‑style implementation guideline you can hand to developers. It’s written as a **build spec**: clear goals, “MUST/SHOULD” requirements, acceptance criteria, and the non‑functional guardrails (performance, offline, auditability) that make triage feel fast and defensible.
+
+---
+
+# Stella Ops — Evidence‑First Triage Implementation Guidelines (PM Spec)
+
+## 0) Assumptions and scope
+
+**Assumptions**
+
+* Stella Ops ingests vulnerability findings (SCA/SAST/image scans), has SBOM context, and can compute reachability/call paths.
+* Triage outcomes must be recorded as VEX/CSAF‑compatible states with reasons and audit trails.
+* Users may operate in restricted networks and need an offline mode that still shows evidence.
+
+**In scope**
+
+* Evidence‑first alert triage UI + APIs + telemetry.
+* Reachability proof + call stack view + provenance attestation view.
+* VEX/CSAF decision recording with audit export.
+* Offline evidence bundle and deterministic replay token.
+
+**Out of scope (for this phase)**
+
+* Building the underlying static analyzer or SBOM generator (we consume their outputs).
+* Full CSAF publishing workflow (we store and export; publishing is separate).
+* Remediation automation (PRs, patching).
+
+---
+
+## 1) Product principles (non‑negotiables)
+
+1. **Evidence before detail**
+   Opening an alert **MUST** show the best available evidence immediately (even partial/placeholder), not a generic “details” page.
+2. **Fast first signal**
+   The UI **MUST** render a credible “first signal” quickly (reachability badge, call stack snippet, or provenance block).
+3. **Determinism reduces hesitation**
+   Sorting, graphs, and diffs **MUST** be stable across refreshes. No jittery re-layout.
+4. **Offline by design**
+   If evidence exists locally (bundle), the UI **MUST** render it without network access.
+5. **Audit-ready by default**
+   Every decision **MUST** be reproducible, attributable, and exportable with evidence hashes.
+
+---
+
+## 2) Success metrics (what we ship toward)
+
+These become acceptance criteria and dashboards.
+
+### Primary metrics (P0)
+
+* **TTFS (Time‑to‑First‑Signal)**: p95 < **1.5s** from opening an alert to first evidence card rendering (with 100ms RTT, 1% loss simulation).
+* **Clicks‑to‑Closure**: median < **6** interactions to record a VEX decision.
+* **Evidence completeness** at decision time: ≥ **90%** of decisions include evidence hash set + reason + replay token.
+
+### Secondary metrics (P1)
+
+* **Offline resolution rate**: ≥ **95%** of alerts opened with a local bundle show reachability + provenance without network.
+* **Graph usability**: preview render < **300ms**, interactive hydration < **2.0s** for large graphs (see §7).
+
+---
+
+## 3) User workflows and “Definition of Done”
+
+### Workflow A: Triage an alert to a decision
+
+**DoD**: user can open an alert, see evidence, set VEX state, and the system records a signed/auditable decision event.
+
+**Steps**
+
+1. Alert list shows key signals (reachability badge, decision state, diff indicator).
+2. Open alert → Evidence view loads first.
+3. User reviews reachability/call stack/provenance.
+4. User sets VEX status + reason preset (editable).
+5. User records decision.
+6. Audit log entry appears instantly and is exportable.
+
+### Workflow B: Explain “why is this flagged?”
+
+**DoD**: user can show a defensible proof (path/call stack/provenance) and copy it into a ticket.
+
+---
+
+## 4) UI requirements (MUST/SHOULD/MAY)
+
+### 4.1 Alert list page
+
+**MUST**
+
+* Each row includes:
+
+  * Severity + component identifier
+  * **Decision state** (Unset / Under Investigation / Not Affected / Affected)
+  * **Reachability badge** (Reachable / Not Reachable / Unknown) where available
+  * **Diff indicator** if SBOM/VEX changed since last scan (simple dot/label)
+  * Age / first seen / last updated
+* **Deterministic sort** default:
+  `Reachability DESC → Severity DESC → Decision state (Unset first) → Age DESC → Component name ASC`
+* Keyboard navigation:
+
+  * `↑/↓` move selection, `Enter` open alert.
+  * `/` search/filter focus.
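
The default sort above can be expressed as a single tuple key, sketched below in Python. Several details are assumptions of this sketch and not part of the spec: `Unknown` ranks between `Reachable` and `Not Reachable`, severity is a numeric score, the decision states after `Unset` follow the order listed above, and `alert_id` is appended as a final tie-breaker so the order is total.

```python
from dataclasses import dataclass

# Higher rank sorts first (Reachability DESC). Placing Unknown in the middle is an assumption.
REACH_RANK = {"Reachable": 2, "Unknown": 1, "Not Reachable": 0}
# Lower rank sorts first (Unset first). The order of the remaining states is an assumption.
DECISION_RANK = {"Unset": 0, "Under Investigation": 1, "Not Affected": 2, "Affected": 3}


@dataclass(frozen=True)
class AlertRow:
    alert_id: str
    reachability: str
    severity: float   # assumed numeric score, higher is worse
    decision: str
    age_days: int
    component: str


def sort_key(row: AlertRow) -> tuple:
    # Negate the DESC fields so one ascending sort covers every criterion.
    return (
        -REACH_RANK[row.reachability],
        -row.severity,
        DECISION_RANK[row.decision],
        -row.age_days,
        row.component,
        row.alert_id,  # tie-breaker (assumption) so equal rows never swap places
    )


def sort_alerts(rows):
    return sorted(rows, key=sort_key)


rows = [
    AlertRow("a1", "Reachable", 7.5, "Affected", 3, "zlib"),
    AlertRow("a2", "Reachable", 9.8, "Unset", 1, "openssl"),
    AlertRow("a3", "Unknown", 9.8, "Unset", 10, "log4j"),
    AlertRow("a4", "Reachable", 9.8, "Unset", 5, "curl"),
]
ordered_ids = [row.alert_id for row in sort_alerts(rows)]
```

Because the key ends in a unique identifier, the result does not depend on the order in which rows arrive from the backend.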
+
+**SHOULD**
+
+* Inline “quick set” decision menu (Affected / Not affected / Under investigation) for obvious cases, so the user does not have to leave the list; it still requires a reason and logs evidence hashes.
+
+### 4.2 Alert detail: landing tab MUST be Evidence
+
+**MUST**
+
+* Default landing is **Evidence** (not “Overview”).
+* Top section shows 3 “proof pills” with status:
+
+  * Reachability (✓ / ! / …)
+  * Call stack (✓ / ! / …)
+  * Provenance (✓ / ! / …)
+* Each pill expands inline (no navigation) into a compact evidence panel.
+
+**MUST: No blank panels**
+
+* If evidence is loading, show skeleton + “what’s coming.”
+* If evidence is missing, show a reason (“not computed”, “requires source map”, “offline – enrichment pending”).
+
+### 4.3 Decision drawer
+
+**MUST**
+
+* Pinned right drawer (or persistent bottom sheet on small screens).
+* Controls:
+
+  * VEX/CSAF status: **Affected / Not affected / Under investigation**
+  * Reason preset dropdown + editable reason text
+  * “Record decision” button
+* Preview “Audit summary” before submit:
+
+  * Evidence hashes included
+  * Policy context (ruleset version)
+  * Replay token
+  * Actor identity
+
+**MUST**
+
+* On submit, create an append-only audit event and immediately reflect status in UI.
+
+**SHOULD**
+
+* Allow attaching references: ticket URL, incident ID, PR link (stored as metadata).
+
+### 4.4 Diff tab
+
+**MUST**
+
+* Show delta since last scan:
+
+  * SBOM diffs (component version changes, removals/additions)
+  * VEX diffs (status changes)
+* Group diffs by **risk shift**:
+
+  * Risk‑raising (new reachable vuln, severity increase)
+  * Neutral (metadata-only)
+  * Risk‑reducing (fixed version, reachability removed)
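
The risk-shift grouping above can be sketched as a small classifier. The change-kind strings (`new_reachable_vuln`, `fixed_version`, and so on) and the record shape are hypothetical names chosen to mirror the examples in the list; treating unrecognized kinds as neutral is also an assumption, and a stricter implementation might flag them instead.

```python
# Hypothetical change kinds, named after the examples given in the list above.
RISK_RAISING = {"new_reachable_vuln", "severity_increase"}
RISK_REDUCING = {"fixed_version", "reachability_removed"}


def classify(kind: str) -> str:
    if kind in RISK_RAISING:
        return "risk-raising"
    if kind in RISK_REDUCING:
        return "risk-reducing"
    return "neutral"  # metadata-only and unrecognized kinds (assumption)


def group_by_risk_shift(changes):
    """Group diff records by risk shift, with a stable order inside each group."""
    groups = {"risk-raising": [], "neutral": [], "risk-reducing": []}
    for change in sorted(changes, key=lambda c: (c["component"], c["kind"])):
        groups[classify(change["kind"])].append(change)
    return groups


changes = [
    {"component": "openssl", "kind": "fixed_version"},
    {"component": "log4j", "kind": "new_reachable_vuln"},
    {"component": "curl", "kind": "license_metadata"},
    {"component": "busybox", "kind": "severity_increase"},
]
grouped = group_by_risk_shift(changes)
```

Sorting before grouping keeps the rendered diff identical across refreshes, in line with the determinism principle.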
+
+**SHOULD**
+
+* Provide “Copy diff summary” for change management.
+
+### 4.5 Activity/Audit tab
+
+**MUST**
+
+* Immutable timeline of decisions and evidence changes.
+* Each entry includes:
+
+  * actor, timestamp, decision, reason
+  * evidence hash set
+  * replay token
+  * bundle/export availability
+
+---
+
+## 5) Power-user and accessibility requirements
+
+### Keyboard shortcuts (MUST)
+
+* `J`: jump to next missing/incomplete evidence panel
+* `R`: toggle reachability view (list ↔ compact graph ↔ textual proof)
+* `Y`: copy selected evidence block (call stack / DSSE / path proof)
+* `A`: set “Affected” (opens reason preset selection)
+* `N`: set “Not affected”
+* `U`: set “Under investigation”
+* `?`: keyboard help overlay
+
+### Accessibility (MUST)
+
+* Fully navigable by keyboard
+* Visible focus states
+* Screen-reader labels for evidence pills and drawer controls
+* Color is never the only signal (badges must have text/icon)
+
+---
+
+## 6) Evidence model: what every alert should attempt to provide
+
+Treat this as the **minimum evidence bundle**. Each item may be “unavailable,” but must be explicit.
+
+**MUST** support:
+
+1. **Reachability proof**
+
+   * At least one of:
+
+     * function-level call path: `entry → … → vulnerable_sink`
+     * package/module import chain
+   * Includes confidence/algorithm tag: `static`, `dynamic`, `heuristic`
+2. **Call stack snippet**
+
+   * 5–10 frames around the relevant node with file:line anchors where possible
+3. **Provenance**
+
+   * DSSE attestation or equivalent statement
+   * Artifact ancestry chain: image → layer → artifact → commit (as available)
+   * Verification status: verified / pending / failed (with reason)
+4. **Decision state**
+
+   * VEX status + reason + timestamps
+5. **Evidence hash set**
+
+   * Content-addressed hashes of each evidence artifact included in the decision
+
+**SHOULD**
+
+* “Evidence freshness”: when computed, tool version, input revisions.
+
+---
+
+## 7) Performance and graph rendering requirements
+
+### TTFS budget (MUST)
+
+* When opening an alert:
+
+  * **<200ms**: show skeleton and cached row metadata
+  * **<500ms**: render at least one evidence pill with meaningful content OR a cached preview image
+  * **<1.5s p95**: render reachability + provenance for typical alerts
+
+### Graph rendering for large call graphs (MUST)
+
+* **Two-phase rendering**
+
+  1. Server-generated **static snapshot** (PNG/SVG) displayed immediately
+  2. Interactive graph hydrates lazily on user expand
+* **Progressive expansion**
+
+  * Load 1-hop neighborhood first; expand on click
+* **Deterministic layout**
+
+  * Same input produces same layout anchors (no reshuffles between refreshes)
+* **Fan-out control**
+
+  * Collapse repeated library paths into “macro edges” to keep the graph readable
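
A rough sketch of progressive expansion and fan-out control follows. The function names, the adjacency-dict graph shape, the `max_fanout` parameter, and the `library_of` mapping are all illustrative assumptions; the spec does not prescribe a data structure.

```python
from collections import Counter


def one_hop(graph: dict, node: str, max_fanout: int = 25) -> dict:
    """Return the 1-hop neighborhood in sorted order, capped at max_fanout."""
    neighbors = sorted(graph.get(node, ()))
    shown = neighbors[:max_fanout]
    return {"node": node, "neighbors": shown, "collapsed": len(neighbors) - len(shown)}


def macro_edges(edges, library_of: dict):
    """Collapse nodes that belong to the same library into one group node.

    Returns sorted (source, target, edge_count) triples; edges that stay
    inside a single library are dropped.
    """
    def group(node: str) -> str:
        return library_of.get(node) or node

    counts = Counter()
    for src, dst in edges:
        s, d = group(src), group(dst)
        if s != d:
            counts[(s, d)] += 1
    return sorted((s, d, n) for (s, d), n in counts.items())


graph = {"main": {"helper", "lib.a", "lib.b"}}
first_view = one_hop(graph, "main", max_fanout=2)

edges = [("main", "lib.a"), ("main", "lib.b"), ("lib.a", "lib.b"), ("main", "helper")]
collapsed = macro_edges(edges, {"lib.a": "libfoo", "lib.b": "libfoo"})
```

Both functions sort their output, so the same input always yields the same node and edge order, which is the precondition for a layout that does not reshuffle between refreshes.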
+
+---
+
+## 8) Offline mode requirements
+
+Offline is not “nice to have”; it is a defined mode.
+
+### Offline evidence bundle (MUST)
+
+* A single file (e.g., `.stella.bundle.tgz`) that contains:
+
+  * Alert metadata snapshot
+  * Evidence artifacts (reachability proofs, call stacks, provenance attestations)
+  * SBOM slice(s) necessary for diffs
+  * VEX decision history (if available)
+  * Manifest with content hashes (Merkle-ish)
+* Bundle must be **signed** (or include signature material) and verifiable.
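
The manifest idea can be sketched as follows, assuming the bundle contents are available as a name-to-bytes mapping. The function names and the flat root digest are assumptions of this sketch: the root here is a hash over the sorted entry list, a stand-in for a real Merkle tree, and signing is left out entirely (a signature would cover `root`).

```python
import hashlib


def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def root_of(entries: dict) -> str:
    # Flat root over "name<NUL>hash" lines in sorted order (not a true Merkle tree).
    lines = "".join(f"{name}\0{entries[name]}\n" for name in sorted(entries))
    return digest(lines.encode("utf-8"))


def build_manifest(artifacts: dict) -> dict:
    """Map each artifact name to its content hash and add one root digest."""
    entries = {name: digest(artifacts[name]) for name in sorted(artifacts)}
    return {"entries": entries, "root": root_of(entries)}


def verify_bundle(artifacts: dict, manifest: dict) -> list:
    """Return the names that are missing, unexpected, or altered (empty list = OK)."""
    expected = manifest["entries"]
    problems = []
    if root_of(expected) != manifest["root"]:
        problems.append("<manifest root>")
    for name in sorted(set(expected) | set(artifacts)):
        if name not in artifacts or name not in expected:
            problems.append(name)
        elif digest(artifacts[name]) != expected[name]:
            problems.append(name)
    return problems


bundle = {"reachability.json": b"{}", "callstack.json": b"[]"}
manifest = build_manifest(bundle)
tampered = dict(bundle)
tampered["callstack.json"] = b"[1]"
```

Returning the list of offending names, instead of a bare boolean, lets the UI show "enrichment pending" or "verification failed" per item.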
+
+### UI behavior (MUST)
+
+* If bundle is present:
+
+  * UI loads evidence from it first
+  * Any missing items show “enrichment pending” (not “error”)
+* If network returns:
+
+  * Background refresh allowed, but **must not reorder** the alert list unexpectedly
+  * Must surface “updated evidence available” as a user-controlled refresh, not an auto-switch that changes context mid-triage
+
+---
+
+## 9) Auditability and replay requirements
+
+### Decision event schema (MUST)
+
+Every recorded decision must store:
+
+* `alert_id`, `artifact_id` (image digest or commit hash)
+* `actor_id`, `timestamp`
+* `decision_status` (Affected/Not affected/Under investigation)
+* `reason_code` (preset) + `reason_text`
+* `evidence_hashes[]` (content-addressed hashes)
+* `policy_context` (ruleset version, policy id)
+* `replay_token` (hash of inputs needed to reproduce)
+
+### Replay token (MUST)
+
+* Deterministic hash of:
+
+  * scan inputs (SBOM digest, image digest, tool versions)
+  * policy/rules versions
+  * reachability algorithm version
+* “Reproduce” button produces a CLI snippet (copyable) pinned to these versions.
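
One way to get a deterministic token is to hash a canonical JSON encoding of the inputs, sketched below. The function name, the dictionary keys, and the choice of SHA-256 over sorted-key JSON are assumptions for illustration; the spec only requires that the hash be deterministic over these inputs.

```python
import hashlib
import json


def replay_token(scan_inputs: dict, policy_versions: dict, reachability_algo_version: str) -> str:
    """Hash a canonical encoding of everything needed to reproduce a decision."""
    material = {
        "scan_inputs": scan_inputs,
        "policy": policy_versions,
        "reachability_algorithm": reachability_algo_version,
    }
    # Sorted keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


token_a = replay_token(
    {"sbom_digest": "sha256:aaa", "image_digest": "sha256:bbb", "tool_versions": {"scanner": "1.2.0"}},
    {"ruleset": "1.4.0"},
    "reach-2",
)
# Same inputs supplied in a different key order.
token_b = replay_token(
    {"tool_versions": {"scanner": "1.2.0"}, "image_digest": "sha256:bbb", "sbom_digest": "sha256:aaa"},
    {"ruleset": "1.4.0"},
    "reach-2",
)
# A different reachability algorithm version must change the token.
token_c = replay_token(
    {"sbom_digest": "sha256:aaa", "image_digest": "sha256:bbb", "tool_versions": {"scanner": "1.2.0"}},
    {"ruleset": "1.4.0"},
    "reach-3",
)
```

Any input that can change the outcome has to be part of `material`; anything omitted there silently stops being covered by the token.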
+
+### Export (MUST)
+
+* Exportable audit bundle that includes:
+
+  * JSONL of decision events
+  * evidence artifacts referenced by hashes
+  * signatures/attestations
+* Export must be stable and verifiable later.
+
+---
+
+## 10) API and data contract guidelines (developer-facing)
+
+This is an implementation guideline, not a full API spec—keep it simple and cache-friendly.
+
+### MUST endpoints (or equivalent)
+
+* `GET /alerts?filters…` → list view payload (small, cacheable)
+* `GET /alerts/{id}/evidence` → evidence payload (reachability, call stack, provenance, hashes)
+* `POST /alerts/{id}/decisions` → record decision event (append-only)
+* `GET /alerts/{id}/audit` → audit timeline
+* `GET /alerts/{id}/diff?baseline=…` → SBOM/VEX diff view
+* `GET /bundles/{id}` and/or `POST /bundles/verify` → offline bundle download/verify
+
+### Evidence payload guidelines (MUST)
+
+* Deterministic ordering for arrays and nodes (stable sorts).
+* Explicit `status` per evidence section: `available | loading | pending | unavailable | error` (`pending` covers deferred verification or enrichment).
+* Include `hash` per artifact for content addressing.
+
+**Example shape**
+
+```json
+{
+  "alert_id": "a123",
+  "reachability": { "status": "available", "hash": "sha256:…", "proof": { "type": "call_path", "nodes": [...] } },
+  "callstack":     { "status": "available", "hash": "sha256:…", "frames": [...] },
+  "provenance":    { "status": "pending",   "hash": null,       "dsse": { "embedded": true, "payload": "…" } },
+  "vex":           { "status": "available", "current": {...}, "history": [...] },
+  "hashes": ["sha256:…", "sha256:…"]
+}
+```
+
+---
+
+## 11) Telemetry requirements (how we prove it’s fast)
+
+**MUST** instrument:
+
+* `alert_opened` (timestamp, alert_id)
+* `evidence_first_paint` (timestamp, evidence_type)
+* `decision_recorded` (timestamp, clicks_count, evidence_bitset)
+* `bundle_loaded` (hit/miss, size, verification_status)
+* `graph_preview_paint` and `graph_hydrated`
+
+**MUST** compute:
+
+* TTFS = `evidence_first_paint - alert_opened`
+* Clicks‑to‑Closure = interaction counter per alert until decision recorded
+* Evidence completeness bitset at decision time: reachability/callstack/provenance/vex present
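
The TTFS computation above could be aggregated into p50/p95 roughly as follows. This sketch assumes each event is a dictionary with `name`, `alert_id`, and `timestamp` keys (the list above does not give `evidence_first_paint` an `alert_id`, so that field is an assumption), and it uses the nearest-rank percentile definition.

```python
import math


def ttfs_samples(events):
    """Pair each alert_opened with the next evidence_first_paint for that alert."""
    opened = {}
    samples = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        alert_id = event["alert_id"]
        if event["name"] == "alert_opened":
            opened[alert_id] = event["timestamp"]
        elif event["name"] == "evidence_first_paint" and alert_id in opened:
            # pop() so later paints in the same session are not counted again
            samples.append(event["timestamp"] - opened.pop(alert_id))
    return samples


def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


events = [
    {"name": "alert_opened", "alert_id": "a1", "timestamp": 0.0},
    {"name": "evidence_first_paint", "alert_id": "a1", "timestamp": 0.4},
    {"name": "evidence_first_paint", "alert_id": "a1", "timestamp": 0.9},
    {"name": "alert_opened", "alert_id": "a2", "timestamp": 1.0},
    {"name": "evidence_first_paint", "alert_id": "a2", "timestamp": 2.5},
]
samples = ttfs_samples(events)
p50, p95 = percentile(samples, 50), percentile(samples, 95)
```

Nearest-rank always returns an observed sample, which keeps dashboard numbers traceable to a concrete alert session.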
+
+---
+
+## 12) Error handling and edge cases
+
+**MUST**
+
+* Never show empty states without explanation.
+* Distinguish between:
+
+  * “not computed yet”
+  * “not possible due to missing inputs”
+  * “blocked by permissions”
+  * “offline—enrichment pending”
+  * “verification failed”
+
+**SHOULD**
+
+* Offer “Request enrichment” action when evidence is missing (creates a job/task id).
+
+---
+
+## 13) Security, permissions, and multi-tenancy
+
+**MUST**
+
+* RBAC gating for:
+
+  * viewing provenance attestations
+  * recording decisions
+  * exporting audit bundles
+* All decision events are immutable; corrections are new events (append-only).
+* PII handling:
+
+  * Avoid storing freeform reasons with secrets; warn on paste patterns (optional P1).
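
The append-only rule in this section can be illustrated with a hash-chained log, sketched below. The hash chain, the `AuditLog` class, and the `supersedes` field are assumptions of this sketch and are not prescribed by the spec; the chain does not prevent edits in memory, it only makes them detectable on verification.

```python
import hashlib
import json

GENESIS = "sha256:" + "0" * 64


def _canonical(obj: dict) -> str:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def _link(prev_hash: str, body: str) -> str:
    return "sha256:" + hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries = []

    def append(self, event: dict) -> dict:
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else GENESIS
        body = _canonical(event)
        entry = {
            "prev_hash": prev_hash,
            "event": json.loads(body),  # private copy of the caller's event
            "entry_hash": _link(prev_hash, body),
        }
        self._entries.append(entry)
        return entry

    def correct(self, superseded_entry_hash: str, event: dict) -> dict:
        # A correction never rewrites history; it is a new event that points back.
        return self.append({**event, "supersedes": superseded_entry_hash})

    def verify(self) -> bool:
        prev_hash = GENESIS
        for entry in self._entries:
            expected = _link(prev_hash, _canonical(entry["event"]))
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
                return False
            prev_hash = expected
        return True

    def to_jsonl(self) -> str:
        return "\n".join(_canonical(entry) for entry in self._entries)


log = AuditLog()
first = log.append({"alert_id": "a123", "decision_status": "Affected", "reason_code": "reachable"})
second = log.correct(
    first["entry_hash"],
    {"alert_id": "a123", "decision_status": "Not affected", "reason_code": "code_not_reachable"},
)
intact = log.verify()
```

Each entry commits to its predecessor, so a JSONL export of the log can be checked offline by recomputing the chain from the first line.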
+
+---
+
+## 14) Engineering execution plan (priorities)
+
+### P0 (ship first)
+
+* Evidence-first alert detail landing
+* Decision drawer + append-only audit
+* Deterministic alert list sort + reachability badge
+* Evidence API + decision POST
+* TTFS + clicks telemetry
+* Static graph preview + lazy hydration
+
+### P1
+
+* Offline bundle load/verify + offline rendering
+* Smart diff view (risk shift grouping)
+* Exportable audit bundle
+* Keyboard shortcuts + help overlay
+
+### P2
+
+* Inline quick decisions from list
+* Advanced graph search within view
+* Suggest reason presets based on evidence patterns
+
+---
+
+## 15) Acceptance criteria checklist (what QA signs off)
+
+A build is acceptable when:
+
+* Opening an alert renders at least one evidence pill within **500ms** (with cache) and TTFS p95 meets target under network simulation.
+* Users can record A/N/U decisions with reason and see an audit event immediately.
+* Decision event includes evidence hashes + replay token.
+* Alert list sorting is stable and deterministic across refresh.
+* Graph preview paints within the **300ms** budget; interactive graph hydrates only on expand.
+* Offline bundle renders evidence without network; missing items show “enrichment pending,” not errors.
+* Keyboard shortcuts work; `?` overlay lists them; full keyboard navigation is possible.
+