This commit is contained in:
2025-12-14 16:23:44 +02:00
parent 233873f620
commit 01f4943ab9
8 changed files with 6193 additions and 12 deletions

@@ -0,0 +1,551 @@
Here's a tight, practical blueprint for building (and proving) a fast, evidence-first triage workflow, plus the power-user affordances that make Stella Ops feel “snappy” even offline.
# What “good” looks like (background in plain words)
* **Alert → evidence → decision** in one flow: an alert should open directly onto the concrete proof (reachability, call stack, provenance), then offer a one-click decision (VEX/CSAF status) with audit logging.
* **Time-to-First-Signal (TTFS)** is king: how fast a human sees the first credible piece of evidence that explains *why this alert matters here*.
* **Clicks-to-Closure**: count how many interactions it takes to reach a defensible decision recorded in the audit log.
# Minimal evidence bundle per finding
* **Reachability proof**: function-level path or package-level import chain (with “toggle reachability view” hotkey).
* **Call-stack snippet**: 5-10 frames around the sink/source with file:line anchors.
* **Provenance**: attestation / DSSE + build ancestry (image → layer → artifact → commit).
* **VEX/CSAF status**: affected / not-affected / under-investigation + reason.
* **Diff**: what changed since last scan (SBOM or VEX delta), rendered as a small, human-readable “smart-diff.”
# KPIs to measure in CI and UI
* **TTFS (p50/p95)** from alert creation to first rendered evidence.
* **Clicks-to-Closure (median)** per decision type.
* **Evidence completeness score** (0-4): reachability, call stack, provenance, VEX/CSAF present.
* **Offline friendliness score**: % of evidence resolvable with no network.
* **Audit log completeness**: every decision has: evidence hash set, actor, policy context, replay token.
# Power-user affordances (keyboard-first)
* **Jump to evidence** (`J`): focuses the first incomplete evidence pane.
* **Copy DSSE** (`Y`): copies the attestation block or Rekor entry ref.
* **Toggle reachability view** (`R`): path list ↔ compact graph ↔ textual proof.
* **Search-within-graph** (`/`): node/func/package, instant.
* **Deterministic sort** (`S`): stable sort by (reachability→severity→age→component) to remove hesitation.
* **Quick VEX set** (`A`, `N`, `U`): Affected / Not-affected / Under-investigation with templated reasons.
# UX flow to implement (end-to-end)
1. **Alert row** shows: TTFS timer, reachability badge, “decision state,” and a diff-dot if something changed.
2. **Open alert** lands on **Evidence tab** (not Details). Top strip = three proof pills:
* Reachability ✓ / Call stack ✓ / Provenance ✓ (click to expand inline).
3. **Decision drawer** pinned on the right:
* VEX/CSAF radio (A/N/U) → Reason presets → “Record decision.”
* Shows **audit-ready summary** (hashes, timestamps, policy).
4. **Diff tab**: SBOM/VEX delta since last run, grouped by “meaningful risk shift.”
5. **Activity tab**: immutable audit log; export as a signed bundle for audits.
# Graph performance on large call graphs
* **Minimal-latency snapshots**: pre-render static PNG/SVG thumbnails server-side; open with a tiny preview, then hydrate to the interactive graph lazily.
* **Progressive neighborhood expansion**: load 1-hop first, expand on demand; keep the first TTFS < 500ms.
* **Stable node ordering**: deterministic layout with consistent anchors to avoid “graph shuffle” anxiety.
* **Chunked graph edges** with capped fan-out; collapse identical library paths into a **reachability macro-edge**.
# Offline-friendly design
* **Local evidence cache**: store (SBOM slices, path proofs, DSSE attestations, compiled callstacks) in a signed bundle beside the SARIF/VEX.
* **Deferred enrichment**: mark fields that need internet (e.g., upstream CSAF fetch) and queue a background enricher when network returns.
* **Predictable fallbacks**: if the provenance server is missing, show the embedded DSSE and “verification pending,” never blank states.
# Audit & replay
* **Deterministic replay token**: hash(feed manifests + rules + lattice policy + inputs), attached to every decision.
* **One-click “Reproduce”**: opens a CLI snippet pinned to the exact versions and policies.
* **Evidence hash-set**: content-address each proof artifact; the audit entry stores only hashes + signer.
# TTFS & Clicks-to-Closure: how to measure in code
* Emit a `ttfs.start` at alert creation; first paint of any evidence card emits `ttfs.signal`.
* Increment a per-alert **interaction counter**; on “Record decision,” emit `close.clicks`.
* Log **evidence bitset** (reach, stack, prov, vex) at decision time for completeness scoring.
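The three bullets above can be reduced to a small aggregation step. The following is a minimal sketch, assuming events arrive as plain dictionaries with hypothetical fields `name`, `alert_id`, `ts_ms`, and `clicks`; the nearest-rank percentile is one common choice, not something the spec prescribes.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(events):
    """Aggregate ttfs.start / ttfs.signal / close.clicks events.

    Field names (name, alert_id, ts_ms, clicks) are assumptions for this sketch.
    """
    start, first_signal, clicks = {}, {}, []
    for event in events:
        if event["name"] == "ttfs.start":
            start[event["alert_id"]] = event["ts_ms"]
        elif event["name"] == "ttfs.signal":
            # Only the first evidence paint per alert counts toward TTFS.
            first_signal.setdefault(event["alert_id"], event["ts_ms"])
        elif event["name"] == "close.clicks":
            clicks.append(event["clicks"])
    ttfs = [first_signal[a] - start[a] for a in first_signal if a in start]
    return {
        "ttfs_p50_ms": percentile(ttfs, 50) if ttfs else None,
        "ttfs_p95_ms": percentile(ttfs, 95) if ttfs else None,
        "clicks_median": median(clicks) if clicks else None,
    }
```

Later `ttfs.signal` events for the same alert are ignored on purpose, so a slow second card cannot inflate the metric.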
# Developer tasks (concrete, shippable)
* **Evidence API**: `GET /alerts/{id}/evidence` returns `{reachability, callstack, provenance, vex, hashes[]}` with deterministic sort.
* **Proof renderer**: tiny, no-framework widget that can render from the offline bundle; hydrate to full only on interaction.
* **Keyboard map**: global handler with overlay help (`?`); no collisions; all actions are idempotent.
* **Graph service**: server-side layout + snapshot PNG; client hydrates WebGL only when user expands.
* **Smart-diff**: diff SBOM/VEX, classify changes into “risk-raising / neutral / reducing,” and surface only the first item by default.
* **Audit logger**: append-only stream; signed checkpoints; export `.stella-audit.tgz` (attestations + JSONL).
# Benchmarks to run weekly
* **TTFS under poor network** (100ms RTT, 1% loss): p95 < 1.5s to first evidence.
* **Graph hydration on 250k-edge image**: preview < 300ms, interactive < 2.0s.
* **Keyboard coverage**: at least 90% of triage actions executable without mouse.
* **Offline replay**: 100% of decisions re-render from bundle; zero web calls required.
# Why Stella's approach reduces hesitation
* **Deterministic sort orders** keep findings in place between refreshes.
* **Minimal-latency graph snapshots** show something trustworthy immediately, then refine, with no “blank panel” delay.
* **Replayable, signed bundles** make every click auditable and reversible, which builds operator confidence.
If you want, I can turn this into:
* a **UI checklist** for a design review,
* a **.NET 10 API contract** (DTOs + endpoints),
* or a **Cypress/Playwright test plan** that measures TTFS and clicks-to-closure automatically.
Below is a PM-style implementation guideline you can hand to developers. It's written as a **build spec**: clear goals, MUST/SHOULD requirements, acceptance criteria, and the non-functional guardrails (performance, offline, auditability) that make triage feel fast and defensible.
---
# Stella Ops: Evidence-First Triage Implementation Guidelines (PM Spec)
## 0) Assumptions and scope
**Assumptions**
* Stella Ops ingests vulnerability findings (SCA/SAST/image scans), has SBOM context, and can compute reachability/call paths.
* Triage outcomes must be recorded as VEX/CSAF-compatible states with reasons and audit trails.
* Users may operate in restricted networks and need an offline mode that still shows evidence.
**In scope**
* Evidence-first alert triage UI + APIs + telemetry.
* Reachability proof + call stack view + provenance attestation view.
* VEX/CSAF decision recording with audit export.
* Offline evidence bundle and deterministic replay token.
**Out of scope (for this phase)**
* Building the underlying static analyzer or SBOM generator (we consume their outputs).
* Full CSAF publishing workflow (we store and export; publishing is separate).
* Remediation automation (PRs, patching).
---
## 1) Product principles (non-negotiables)
1. **Evidence before detail**
Opening an alert **MUST** show the best available evidence immediately (even partial/placeholder), not a generic details page.
2. **Fast first signal**
The UI **MUST** render a credible first signal quickly (reachability badge, call stack snippet, or provenance block).
3. **Determinism reduces hesitation**
Sorting, graphs, and diffs **MUST** be stable across refreshes. No jittery re-layout.
4. **Offline by design**
If evidence exists locally (bundle), the UI **MUST** render it without network access.
5. **Audit-ready by default**
Every decision **MUST** be reproducible, attributable, and exportable with evidence hashes.
---
## 2) Success metrics (what we ship toward)
These become acceptance criteria and dashboards.
### Primary metrics (P0)
* **TTFS (Time-to-First-Signal)**: p95 < **1.5s** from opening an alert to first evidence card rendering (with 100ms RTT, 1% loss simulation).
* **Clicks-to-Closure**: median < **6** interactions to record a VEX decision.
* **Evidence completeness** at decision time: at least **90%** of decisions include evidence hash set + reason + replay token.
### Secondary metrics (P1)
* **Offline resolution rate**: at least **95%** of alerts opened with a local bundle show reachability + provenance without network.
* **Graph usability**: preview render < **300ms**, interactive hydration < **2.0s** for large graphs (see §7).
---
## 3) User workflows and “Definition of Done”
### Workflow A: Triage an alert to a decision
**DoD**: user can open an alert, see evidence, set VEX state, and the system records a signed/auditable decision event.
**Steps**
1. Alert list shows key signals (reachability badge, decision state, diff indicator).
2. Open alert → Evidence view loads first.
3. User reviews reachability/call stack/provenance.
4. User sets VEX status + reason preset (editable).
5. User records decision.
6. Audit log entry appears instantly and is exportable.
### Workflow B: Explain “why is this flagged?”
**DoD**: user can show a defensible proof (path/call stack/provenance) and copy it into a ticket.
---
## 4) UI requirements (MUST/SHOULD/MAY)
## 4.1 Alert list page
**MUST**
* Each row includes:
* Severity + component identifier
* **Decision state** (Unset / Under Investigation / Not Affected / Affected)
* **Reachability badge** (Reachable / Not Reachable / Unknown) where available
* **Diff indicator** if SBOM/VEX changed since last scan (simple dot/label)
* Age / first seen / last updated
* **Deterministic sort** default:
`Reachability DESC → Severity DESC → Decision state (Unset first) → Age DESC → Component name ASC`
* Keyboard navigation:
* `↑/↓` move selection, `Enter` open alert.
* `/` search/filter focus.
**SHOULD**
* Inline “quick set” decision menu (Affected / Not affected / Under investigation) that handles obvious cases without leaving the list, but still requires a reason and logs evidence hashes.
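The default ordering in the MUST list can be expressed as a single composite sort key. This is a minimal sketch under stated assumptions: the rank tables (including where `unknown` falls), the field names, and the trailing `id` tie-breaker are all hypothetical and only illustrate how to get a total, refresh-stable order.

```python
# Assumed ranks; the spec does not say where "unknown" reachability sorts.
REACHABILITY_RANK = {"reachable": 2, "unknown": 1, "not_reachable": 0}
SEVERITY_RANK = {"critical": 3, "high": 2, "medium": 1, "low": 0}


def sort_key(alert):
    """Reachability DESC, Severity DESC, Unset first, Age DESC, Component ASC."""
    return (
        -REACHABILITY_RANK[alert["reachability"]],
        -SEVERITY_RANK[alert["severity"]],
        0 if alert["decision_state"] == "unset" else 1,
        -alert["age_days"],
        alert["component"],
        alert["id"],  # assumed final tie-breaker so equal rows never swap places
    )


def sort_alerts(alerts):
    """Return a new list; the result does not depend on the input order."""
    return sorted(alerts, key=sort_key)
```

Ending the key with a unique identifier is the design choice that matters here: without it, two rows that tie on every visible column could trade places between refreshes.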
## 4.2 Alert detail — landing tab MUST be Evidence
**MUST**
* Default landing is **Evidence** (not “Overview”).
* Top section shows 3 proof pills with status:
* Reachability (✓ / ! / …)
* Call stack (✓ / ! / …)
* Provenance (✓ / ! / …)
* Each pill expands inline (no navigation) into a compact evidence panel.
**MUST: No blank panels**
* If evidence is loading, show skeleton + “what's coming.”
* If evidence is missing, show a reason (“not computed”, “requires source map”, “offline, enrichment pending”).
## 4.3 Decision drawer
**MUST**
* Pinned right drawer (or persistent bottom sheet on small screens).
* Controls:
* VEX/CSAF status: **Affected / Not affected / Under investigation**
* Reason preset dropdown + editable reason text
* “Record decision” button
* Preview “Audit summary” before submit:
* Evidence hashes included
* Policy context (ruleset version)
* Replay token
* Actor identity
**MUST**
* On submit, create an append-only audit event and immediately reflect status in UI.
**SHOULD**
* Allow attaching references: ticket URL, incident ID, PR link (stored as metadata).
## 4.4 Diff tab
**MUST**
* Show delta since last scan:
* SBOM diffs (component version changes, removals/additions)
* VEX diffs (status changes)
* Group diffs by **risk shift**:
* Risk-raising (new reachable vuln, severity increase)
* Neutral (metadata-only)
* Risk-reducing (fixed version, reachability removed)
**SHOULD**
* Provide “Copy diff summary” for change management.
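Grouping by risk shift amounts to a small classification table. The sketch below is illustrative only: the `kind` strings are hypothetical labels derived from the examples in the list above, and anything unrecognized falls back to neutral, which is an assumption rather than a stated rule.

```python
# Hypothetical change kinds, named after the examples in the Diff tab list.
RISK_RAISING = {"new_reachable_vuln", "severity_increase"}
RISK_REDUCING = {"fixed_version", "reachability_removed"}


def classify_change(change):
    """Map one SBOM/VEX change to a risk-shift group."""
    if change["kind"] in RISK_RAISING:
        return "risk_raising"
    if change["kind"] in RISK_REDUCING:
        return "risk_reducing"
    return "neutral"  # assumed default, e.g. metadata-only changes


def group_by_risk_shift(changes):
    """Group changes; items inside each group are sorted for stable rendering."""
    groups = {"risk_raising": [], "neutral": [], "risk_reducing": []}
    for change in sorted(changes, key=lambda c: (c["component"], c["kind"])):
        groups[classify_change(change)].append(change)
    return groups
```

A production classifier would likely need a stricter default than neutral for unknown kinds; that decision is left open here.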
## 4.5 Activity/Audit tab
**MUST**
* Immutable timeline of decisions and evidence changes.
* Each entry includes:
* actor, timestamp, decision, reason
* evidence hash set
* replay token
* bundle/export availability
---
## 5) Power-user and accessibility requirements
### Keyboard shortcuts (MUST)
* `J`: jump to next missing/incomplete evidence panel
* `R`: toggle reachability view (list ↔ compact graph ↔ textual proof)
* `Y`: copy selected evidence block (call stack / DSSE / path proof)
* `A`: set Affected (opens reason preset selection)
* `N`: set Not affected
* `U`: set Under investigation
* `?`: keyboard help overlay
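A shortcut registry that rejects duplicate bindings is one way to keep the map collision-free and to generate the `?` overlay from a single source. This is a language-agnostic sketch written in Python; the `KeyMap` class and its methods are hypothetical, and a real UI would wire the same logic into its own event handler.

```python
class KeyMap:
    """Registry of single-key shortcuts that refuses colliding bindings."""

    def __init__(self):
        self._bindings = {}

    def bind(self, key, action, description):
        """Register a shortcut; raise if the key is already taken."""
        if key in self._bindings:
            raise ValueError(f"shortcut {key!r} is already bound")
        self._bindings[key] = (action, description)

    def dispatch(self, key):
        """Run the action for a key, or return None for unbound keys."""
        binding = self._bindings.get(key)
        if binding is None:
            return None
        action, _description = binding
        return action()

    def help_overlay(self):
        """Lines for the help overlay, in a deterministic order."""
        return [f"{key}: {desc}" for key, (_a, desc) in sorted(self._bindings.items())]
```

Failing at bind time means a collision shows up during startup or in a unit test, not as a silently shadowed shortcut.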
### Accessibility (MUST)
* Fully navigable by keyboard
* Visible focus states
* Screen-reader labels for evidence pills and drawer controls
* Color is never the only signal (badges must have text/icon)
---
## 6) Evidence model: what every alert should attempt to provide
Treat this as the **minimum evidence bundle**. Each item may be “unavailable,” but must be explicit.
**MUST** support:
1. **Reachability proof**
* At least one of:
* function-level call path: `entry → … → vulnerable_sink`
* package/module import chain
* Includes confidence/algorithm tag: `static`, `dynamic`, `heuristic`
2. **Call stack snippet**
* 5-10 frames around the relevant node with file:line anchors where possible
3. **Provenance**
* DSSE attestation or equivalent statement
* Artifact ancestry chain: image → layer → artifact → commit (as available)
* Verification status: verified / pending / failed (with reason)
4. **Decision state**
* VEX status + reason + timestamps
5. **Evidence hash set**
* Content-addressed hashes of each evidence artifact included in the decision
**SHOULD**
* Evidence freshness: when computed, tool version, input revisions.
---
## 7) Performance and graph rendering requirements
### TTFS budget (MUST)
* When opening an alert:
* **<200ms**: show skeleton and cached row metadata
* **<500ms**: render at least one evidence pill with meaningful content OR a cached preview image
* **<1.5s p95**: render reachability + provenance for typical alerts
### Graph rendering for large call graphs (MUST)
* **Two-phase rendering**
1. Server-generated **static snapshot** (PNG/SVG) displayed immediately
2. Interactive graph hydrates lazily on user expand
* **Progressive expansion**
* Load 1-hop neighborhood first; expand on click
* **Deterministic layout**
* Same input produces same layout anchors (no reshuffles between refreshes)
* **Fan-out control**
* Collapse repeated library paths into macro edges to keep the graph readable
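Progressive expansion and fan-out control can both be prototyped on a plain edge list. The sketch below assumes edges are `(src, dst)` tuples and that a caller-supplied `package_of` function maps a node to its library; both helpers are hypothetical and stand in for whatever the graph service actually exposes.

```python
from collections import Counter
from itertools import groupby


def one_hop(edges, node, max_fanout=3):
    """Return (visible_neighbors, hidden_count) for a node's 1-hop neighborhood.

    Neighbors are sorted so the same input always yields the same view.
    """
    neighbors = sorted({dst for src, dst in edges if src == node})
    visible = neighbors[:max_fanout]
    return visible, len(neighbors) - len(visible)


def macro_edges(paths, package_of):
    """Collapse function-level paths that cross the same packages in order.

    Returns a sorted list of (package_path, number_of_collapsed_paths).
    """
    counts = Counter()
    for path in paths:
        packages = [package_of(n) for n in path]
        # Merge consecutive nodes that belong to the same package.
        collapsed = tuple(pkg for pkg, _group in groupby(packages))
        counts[collapsed] += 1
    return sorted(counts.items())
```

The hidden count returned by `one_hop` gives the UI something concrete to show on an "expand" affordance instead of silently truncating.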
---
## 8) Offline mode requirements
Offline is not “nice to have”; it is a defined mode.
### Offline evidence bundle (MUST)
* A single file (e.g., `.stella.bundle.tgz`) that contains:
* Alert metadata snapshot
* Evidence artifacts (reachability proofs, call stacks, provenance attestations)
* SBOM slice(s) necessary for diffs
* VEX decision history (if available)
* Manifest with content hashes (Merkle-ish)
* Bundle must be **signed** (or include signature material) and verifiable.
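The manifest requirement can be illustrated with a flat table of content hashes plus one root hash over that table. This is a simplified sketch, not the bundle format: the layout, the `sha256:` prefix convention, and the function names are assumptions, and signing the root is deliberately left out.

```python
import hashlib
import json


def sha256_ref(data: bytes) -> str:
    """Content address in the 'sha256:<hex>' style used in this document."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def manifest_root(entries):
    """Hash of the canonical JSON encoding of the entries table."""
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return sha256_ref(canonical.encode("utf-8"))


def build_manifest(artifacts):
    """artifacts: dict of path -> bytes. Returns {'entries': ..., 'root': ...}."""
    entries = {path: sha256_ref(data) for path, data in sorted(artifacts.items())}
    return {"entries": entries, "root": manifest_root(entries)}


def verify_bundle(artifacts, manifest):
    """Return a sorted list of (path, problem); an empty list means verified."""
    problems = []
    for path, expected in manifest["entries"].items():
        data = artifacts.get(path)
        if data is None:
            problems.append((path, "missing"))
        elif sha256_ref(data) != expected:
            problems.append((path, "hash mismatch"))
    if manifest_root(manifest["entries"]) != manifest["root"]:
        problems.append(("<manifest>", "root mismatch"))
    return sorted(problems)
```

A signature would cover only the `root` value; per-path problems then map naturally onto the “enrichment pending” and “verification failed” states in the UI.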
### UI behavior (MUST)
* If bundle is present:
* UI loads evidence from it first
* Any missing items show “enrichment pending” (not “error”)
* If network returns:
* Background refresh allowed, but **must not reorder** the alert list unexpectedly
* Must surface “updated evidence available” as a user-controlled refresh, not an auto-switch that changes context mid-triage
---
## 9) Auditability and replay requirements
### Decision event schema (MUST)
Every recorded decision must store:
* `alert_id`, `artifact_id` (image digest or commit hash)
* `actor_id`, `timestamp`
* `decision_status` (Affected/Not affected/Under investigation)
* `reason_code` (preset) + `reason_text`
* `evidence_hashes[]` (content-addressed hashes)
* `policy_context` (ruleset version, policy id)
* `replay_token` (hash of inputs needed to reproduce)
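The field list above maps directly onto an immutable record serialized as one JSONL line. The following is a sketch under assumptions: the snake_case status values, the validation rules, and the in-memory `AuditLog` are hypothetical stand-ins for the real store.

```python
import json
from dataclasses import asdict, dataclass

# Assumed machine-readable spellings of the three decision statuses.
ALLOWED_STATUSES = ("affected", "not_affected", "under_investigation")


@dataclass(frozen=True)
class DecisionEvent:
    alert_id: str
    artifact_id: str
    actor_id: str
    timestamp: str
    decision_status: str
    reason_code: str
    reason_text: str
    evidence_hashes: tuple
    policy_context: dict
    replay_token: str

    def __post_init__(self):
        if self.decision_status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown decision_status: {self.decision_status!r}")
        if not self.evidence_hashes or not self.replay_token:
            raise ValueError("evidence_hashes and replay_token are required")

    def to_jsonl(self):
        """One canonical JSON line: sorted keys and sorted evidence hashes."""
        record = asdict(self)
        record["evidence_hashes"] = sorted(self.evidence_hashes)
        return json.dumps(record, sort_keys=True)


class AuditLog:
    """In-memory append-only log; corrections are appended as new events."""

    def __init__(self):
        self._lines = []

    def append(self, event):
        self._lines.append(event.to_jsonl())

    def lines(self):
        return tuple(self._lines)
```

Freezing the dataclass and exposing only `append` mirrors the append-only rule: there is no code path that edits a recorded decision.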
### Replay token (MUST)
* Deterministic hash of:
* scan inputs (SBOM digest, image digest, tool versions)
* policy/rules versions
* reachability algorithm version
* “Reproduce” button produces a CLI snippet (copyable) pinned to these versions.
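A deterministic token needs a canonical encoding of its inputs before hashing, otherwise key order alone would change the result. Here is a minimal sketch, assuming SHA-256 over canonical JSON; the parameter names and the dictionary layout are hypothetical.

```python
import hashlib
import json


def replay_token(scan_inputs, policy_versions, reachability_algo_version):
    """Hash the inputs needed to reproduce a decision.

    scan_inputs and policy_versions are plain dicts of strings; their exact
    keys are an assumption of this sketch.
    """
    payload = {
        "scan_inputs": scan_inputs,
        "policy": policy_versions,
        "reachability_algorithm": reachability_algo_version,
    }
    # Sorted keys and fixed separators make the byte encoding order-independent.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any change to a tool version, a policy version, or an input digest produces a different token, which is what lets the audit entry prove which configuration a decision was made under.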
### Export (MUST)
* Exportable audit bundle that includes:
* JSONL of decision events
* evidence artifacts referenced by hashes
* signatures/attestations
* Export must be stable and verifiable later.
---
## 10) API and data contract guidelines (developer-facing)
This is an implementation guideline, not a full API spec; keep it simple and cache-friendly.
### MUST endpoints (or equivalent)
* `GET /alerts?filters…` → list view payload (small, cacheable)
* `GET /alerts/{id}/evidence` → evidence payload (reachability, call stack, provenance, hashes)
* `POST /alerts/{id}/decisions` → record decision event (append-only)
* `GET /alerts/{id}/audit` → audit timeline
* `GET /alerts/{id}/diff?baseline=…` → SBOM/VEX diff view
* `GET /bundles/{id}` and/or `POST /bundles/verify` → offline bundle download/verify
### Evidence payload guidelines (MUST)
* Deterministic ordering for arrays and nodes (stable sorts).
* Explicit `status` per evidence section: `available | loading | unavailable | error`.
* Include `hash` per artifact for content addressing.
**Example shape**
```json
{
"alert_id": "a123",
"reachability": { "status": "available", "hash": "sha256:…", "proof": { "type": "call_path", "nodes": [...] } },
"callstack": { "status": "available", "hash": "sha256:…", "frames": [...] },
"provenance": { "status": "pending", "hash": null, "dsse": { "embedded": true, "payload": "…" } },
"vex": { "status": "available", "current": {...}, "history": [...] },
"hashes": ["sha256:…", "sha256:…"]
}
```
---
## 11) Telemetry requirements (how we prove it's fast)
**MUST** instrument:
* `alert_opened` (timestamp, alert_id)
* `evidence_first_paint` (timestamp, evidence_type)
* `decision_recorded` (timestamp, clicks_count, evidence_bitset)
* `bundle_loaded` (hit/miss, size, verification_status)
* `graph_preview_paint` and `graph_hydrated`
**MUST** compute:
* TTFS = `evidence_first_paint - alert_opened`
* Clicks-to-Closure = interaction counter per alert until decision recorded
* Evidence completeness bitset at decision time: reachability/callstack/provenance/vex present
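The completeness bitset can be derived from the evidence payload with one flag per section. This sketch assumes the bit assignments shown and treats a section as present only when its `status` is `available`; both choices are assumptions, since the spec does not fix them.

```python
# Assumed bit layout: one flag per evidence section.
EVIDENCE_BITS = {"reachability": 1, "callstack": 2, "provenance": 4, "vex": 8}


def evidence_bitset(evidence):
    """Set a bit for each section whose status is 'available'."""
    bits = 0
    for name, bit in EVIDENCE_BITS.items():
        section = evidence.get(name) or {}
        if section.get("status") == "available":
            bits |= bit
    return bits


def completeness_score(bits):
    """Number of evidence sections present, from 0 to 4."""
    return bin(bits & 0b1111).count("1")
```

Logging the raw bitset instead of only the score keeps the telemetry queryable by which section was missing.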
---
## 12) Error handling and edge cases
**MUST**
* Never show empty states without explanation.
* Distinguish between:
* not computed yet
* not possible due to missing inputs
* blocked by permissions
* offline (enrichment pending)
* verification failed
**SHOULD**
* Offer a “Request enrichment” action when evidence is missing (creates a job/task id).
---
## 13) Security, permissions, and multi-tenancy
**MUST**
* RBAC gating for:
* viewing provenance attestations
* recording decisions
* exporting audit bundles
* All decision events are immutable; corrections are new events (append-only).
* PII handling:
* Avoid storing free-form reasons with secrets; warn on paste patterns (optional P1).
---
## 14) Engineering execution plan (priorities)
### P0 (ship first)
* Evidence-first alert detail landing
* Decision drawer + append-only audit
* Deterministic alert list sort + reachability badge
* Evidence API + decision POST
* TTFS + clicks telemetry
* Static graph preview + lazy hydration
### P1
* Offline bundle load/verify + offline rendering
* Smart diff view (risk shift grouping)
* Exportable audit bundle
* Keyboard shortcuts + help overlay
### P2
* Inline quick decisions from list
* Advanced graph search within view
* Suggest reason presets based on evidence patterns
---
## 15) Acceptance criteria checklist (what QA signs off)
A build is acceptable when:
* Opening an alert renders at least one evidence pill within **500ms** (with cache) and TTFS p95 meets target under network simulation.
* Users can record A/N/U decisions with reason and see an audit event immediately.
* Decision event includes evidence hashes + replay token.
* Alert list sorting is stable and deterministic across refresh.
* Graph preview appears instantly; interactive graph hydrates only on expand.
* Offline bundle renders evidence without network; missing items show “enrichment pending,” not errors.
* Keyboard shortcuts work; `?` overlay lists them; full keyboard navigation is possible.
---
If you want, I can also format this into a **developer-ready ticket pack** (epics + user stories + acceptance tests) so engineers can implement without interpretation drift.