git.stella-ops.org/docs/product-advisories/14-Dec-2025 - Dissect triage and evidence workflows.md
2025-12-14 16:23:44 +02:00


Here’s a tight, practical blueprint for building (and proving) a fast, evidence-first triage workflow, plus the power-user affordances that make Stella Ops feel “snappy” even offline.

What “good” looks like (background in plain words)

  • Alert → evidence → decision in one flow: an alert should open directly onto the concrete proof (reachability, call stack, provenance), then offer a one-click decision (VEX/CSAF status) with audit logging.
  • Time-to-First-Signal (TTFS) is king: how fast a human sees the first credible piece of evidence that explains why this alert matters here.
  • Clicks-to-Closure: count how many interactions to reach a defensible decision recorded in the audit log.

Minimal evidence bundle per finding

  • Reachability proof: function-level path or package-level import chain (with “toggle reachability view” hotkey).
  • Call-stack snippet: 5–10 frames around the sink/source with file:line anchors.
  • Provenance: attestation / DSSE + build ancestry (image → layer → artifact → commit).
  • VEX/CSAF status: affected/not-affected/under-investigation + reason.
  • Diff: what changed since last scan (SBOM or VEX delta), rendered as a small, human-readable “smart-diff.”

KPIs to measure in CI and UI

  • TTFS (p50/p95) from alert creation to first rendered evidence.
  • Clicks-to-Closure (median) per decision type.
  • Evidence completeness score (0–4): reachability, call stack, provenance, VEX/CSAF present.
  • Offline friendliness score: % of evidence resolvable with no network.
  • Audit log completeness: every decision has: evidence hash set, actor, policy context, replay token.
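
The completeness and offline scores above can be sketched as follows. This is a minimal illustration, assuming one bit per evidence type; the bit assignment and the names `Evidence`, `completeness_score`, and `offline_friendliness` are hypothetical and not part of any Stella Ops API.

```python
from enum import IntFlag


class Evidence(IntFlag):
    """One bit per evidence type; the bit order is an assumption of this sketch."""
    REACHABILITY = 1
    CALLSTACK = 2
    PROVENANCE = 4
    VEX = 8


def completeness_score(present: Evidence) -> int:
    """Return how many of the four evidence types are present (0-4)."""
    return bin(int(present)).count("1")


def offline_friendliness(resolved_offline: int, total: int) -> float:
    """Percentage of evidence items that resolve with no network access."""
    return 100.0 * resolved_offline / total if total else 0.0
```

Keeping the score as a bitset rather than a bare count means a dashboard can still tell which evidence type was missing at decision time.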

Power-user affordances (keyboard first)

  • Jump to evidence (J): focuses the first incomplete evidence pane.
  • Copy DSSE (Y): copies the attestation block or Rekor entry ref.
  • Toggle reachability view (R): path list ↔ compact graph ↔ textual proof.
  • Search-within-graph (/): node/func/package, instant.
  • Deterministic sort (S): stable sort by (reachability→severity→age→component) to remove hesitation.
  • Quick VEX set (A, N, U): Affected / Not-affected / Under-investigation with templated reasons.

UX flow to implement (end-to-end)

  1. Alert row shows: TTFS timer, reachability badge, “decision state,” and a diff-dot if something changed.

  2. Open alert lands on Evidence tab (not Details). Top strip = three proof pills:

    • Reachability ✓ / Call stack ✓ / Provenance ✓ (click to expand inline).
  3. Decision drawer pinned on the right:

    • VEX/CSAF radio (A/N/U) → Reason presets → “Record decision.”
    • Shows audit-ready summary (hashes, timestamps, policy).
  4. Diff tab: SBOM/VEX delta since last run, grouped by “meaningful risk shift.”

  5. Activity tab: immutable audit log; export as a signed bundle for audits.

Graph performance on large call graphs

  • Minimal-latency snapshots: pre-render static PNG/SVG thumbnails server-side; open with tiny preview then hydrate to interactive graph lazily.
  • Progressive neighborhood expansion: load 1-hop first, expand on demand; keep the first TTFS < 500ms.
  • Stable node ordering: deterministic layout with consistent anchors to avoid “graph shuffle” anxiety.
  • Chunked graph edges with capped fan-out; collapse identical library paths into a reachability macro-edge.

Offline-friendly design

  • Local evidence cache: store (SBOM slices, path proofs, DSSE attestations, compiled call stacks) in a signed bundle beside the SARIF/VEX.
  • Deferred enrichment: mark fields that need internet (e.g., upstream CSAF fetch) and queue a background “enricher” when network returns.
  • Predictable fallbacks: if provenance server missing, show embedded DSSE and “verification pending,” never blank states.

Audit & replay

  • Deterministic replay token: hash(feed manifests + rules + lattice policy + inputs) → attach to every decision.
  • One-click “Reproduce”: opens CLI snippet pinned to the exact versions and policies.
  • Evidence hash-set: content-address each proof artifact; the audit entry stores only hashes + signer.

TTFS & Clicks-to-Closure: how to measure in code

  • Emit a ttfs.start at alert creation; first paint of any evidence card emits ttfs.signal.
  • Increment a per-alert interaction counter; on “Record decision” emit close.clicks.
  • Log evidence bitset (reach, stack, prov, vex) at decision time for completeness scoring.
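
The three measurement points above could be wired together roughly like this. It is a sketch only: the `AlertTelemetry` class, its method names, and the in-memory `events` list are hypothetical stand-ins for whatever telemetry sink is actually used; only the event names `ttfs.start`, `ttfs.signal`, and `close.clicks` come from the text.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AlertTelemetry:
    """Per-alert timing and interaction counters (hypothetical helper)."""
    alert_id: str
    clock: Callable[[], float] = time.monotonic  # injectable so tests can fake time
    events: list = field(default_factory=list)
    _started_at: Optional[float] = field(default=None, init=False)
    _signalled: bool = field(default=False, init=False)
    _clicks: int = field(default=0, init=False)

    def start(self) -> None:
        """Emit ttfs.start when the alert is created."""
        self._started_at = self.clock()
        self.events.append(("ttfs.start", self.alert_id))

    def evidence_painted(self) -> None:
        """Emit ttfs.signal for the first evidence paint only; later paints are ignored."""
        if self._started_at is None or self._signalled:
            return
        self._signalled = True
        elapsed = self.clock() - self._started_at
        self.events.append(("ttfs.signal", self.alert_id, elapsed))

    def interaction(self) -> None:
        """Count one user interaction toward clicks-to-closure."""
        self._clicks += 1

    def record_decision(self, evidence_bitset: int) -> None:
        """Emit close.clicks with the interaction count and the evidence bitset."""
        self.events.append(("close.clicks", self.alert_id, self._clicks, evidence_bitset))
```

Using a monotonic clock avoids negative or inflated TTFS values when the wall clock is adjusted during a session.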

Developer tasks (concrete, shippable)

  • Evidence API: GET /alerts/{id}/evidence returns {reachability, callstack, provenance, vex, hashes[]} with deterministic sort.
  • Proof renderer: tiny, no-framework widget that can render from the offline bundle; hydrate to full only on interaction.
  • Keyboard map: global handler with overlay help (?); no collisions; all actions are idempotent.
  • Graph service: server-side layout + snapshot PNG; client hydrates WebGL only when user expands.
  • Smart-diff: diff SBOM/VEX → classify into “risk-raising / neutral / reducing,” surface only the first item by default.
  • Audit logger: append-only stream; signed checkpoints; export .stella-audit.tgz (attestations + JSONL).

Benchmarks to run weekly

  • TTFS under poor network (100ms RTT, 1% loss): p95 < 1.5s to first evidence.
  • Graph hydration on 250k-edge image: preview < 300ms, interactive < 2.0s.
  • Keyboard coverage: ≥90% of triage actions executable without mouse.
  • Offline replay: 100% of decisions re-render from bundle; zero web calls required.

Why Stella’s approach reduces hesitation

  • Deterministic sort orders keep findings in place between refreshes.
  • Minimal-latency graph snapshots show something trustworthy immediately, then refine, with no “blank panel” delay.
  • Replayable, signed bundles make every click auditable and reversible, which builds operator confidence.

If you want, I can turn this into:

  • a UI checklist for a design review,
  • a .NET 10 API contract (DTOs + endpoints),
  • or a Cypress/Playwright test plan that measures TTFS and clicks-to-closure automatically.

Below is a PM-style implementation guideline you can hand to developers. It’s written as a build spec: clear goals, “MUST/SHOULD” requirements, acceptance criteria, and the non-functional guardrails (performance, offline, auditability) that make triage feel fast and defensible.

Stella Ops: Evidence-First Triage Implementation Guidelines (PM Spec)

0) Assumptions and scope

Assumptions

  • Stella Ops ingests vulnerability findings (SCA/SAST/image scans), has SBOM context, and can compute reachability/call paths.
  • Triage outcomes must be recorded as VEX/CSAF-compatible states with reasons and audit trails.
  • Users may operate in restricted networks and need an offline mode that still shows evidence.

In scope

  • Evidence-first alert triage UI + APIs + telemetry.
  • Reachability proof + call stack view + provenance attestation view.
  • VEX/CSAF decision recording with audit export.
  • Offline evidence bundle and deterministic replay token.

Out of scope (for this phase)

  • Building the underlying static analyzer or SBOM generator (we consume their outputs).
  • Full CSAF publishing workflow (we store and export; publishing is separate).
  • Remediation automation (PRs, patching).

1) Product principles (non-negotiables)

  1. Evidence before detail: Opening an alert MUST show the best available evidence immediately (even partial/placeholder), not a generic “details” page.
  2. Fast first signal: The UI MUST render a credible “first signal” quickly (reachability badge, call stack snippet, or provenance block).
  3. Determinism reduces hesitation: Sorting, graphs, and diffs MUST be stable across refreshes. No jittery re-layout.
  4. Offline by design: If evidence exists locally (bundle), the UI MUST render it without network access.
  5. Audit-ready by default: Every decision MUST be reproducible, attributable, and exportable with evidence hashes.

2) Success metrics (what we ship toward)

These become acceptance criteria and dashboards.

Primary metrics (P0)

  • TTFS (Time-to-First-Signal): p95 < 1.5s from opening an alert to first evidence card rendering (with 100ms RTT, 1% loss simulation).
  • Clicks-to-Closure: median < 6 interactions to record a VEX decision.
  • Evidence completeness at decision time: ≥ 90% of decisions include evidence hash set + reason + replay token.

Secondary metrics (P1)

  • Offline resolution rate: ≥ 95% of alerts opened with a local bundle show reachability + provenance without network.
  • Graph usability: preview render < 300ms, interactive hydration < 2.0s for large graphs (see §7).

3) User workflows and “Definition of Done”

Workflow A: Triage an alert to a decision

DoD: user can open an alert, see evidence, set VEX state, and the system records a signed/auditable decision event.

Steps

  1. Alert list shows key signals (reachability badge, decision state, diff indicator).
  2. Open alert → Evidence view loads first.
  3. User reviews reachability/call stack/provenance.
  4. User sets VEX status + reason preset (editable).
  5. User records decision.
  6. Audit log entry appears instantly and is exportable.

Workflow B: Explain “why is this flagged?”

DoD: user can show a defensible proof (path/call stack/provenance) and copy it into a ticket.


4) UI requirements (MUST/SHOULD/MAY)

4.1 Alert list page

MUST

  • Each row includes:

    • Severity + component identifier
    • Decision state (Unset / Under Investigation / Not Affected / Affected)
    • Reachability badge (Reachable / Not Reachable / Unknown) where available
    • Diff indicator if SBOM/VEX changed since last scan (simple dot/label)
    • Age / first seen / last updated
  • Deterministic sort default: Reachability DESC → Severity DESC → Decision state (Unset first) → Age DESC → Component name ASC

  • Keyboard navigation:

    • ↑/↓ move selection, Enter open alert.
    • / search/filter focus.
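
The default sort order above can be expressed as a single composite key. The sketch below makes several assumptions that the spec does not state: the rank tables (Reachable above Unknown above Not Reachable, and a four-level severity scale), the `AlertRow` field names, and the use of `alert_id` as a final tie-breaker so that the order is total.

```python
from dataclasses import dataclass

# Rank tables are assumptions of this sketch; the spec only says "DESC".
REACHABILITY_RANK = {"reachable": 2, "unknown": 1, "not_reachable": 0}
SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1}


@dataclass(frozen=True)
class AlertRow:
    alert_id: str
    reachability: str  # "reachable" | "unknown" | "not_reachable"
    severity: str      # "critical" | "high" | "medium" | "low"
    decision: str      # "unset" | "under_investigation" | "not_affected" | "affected"
    age_days: int
    component: str


def sort_key(row: AlertRow) -> tuple:
    """Reachability DESC, severity DESC, unset decisions first, age DESC, component ASC."""
    return (
        -REACHABILITY_RANK[row.reachability],
        -SEVERITY_RANK[row.severity],
        0 if row.decision == "unset" else 1,
        -row.age_days,
        row.component,
        row.alert_id,  # final tie-breaker (assumption) so equal rows never swap
    )


def sort_alerts(rows):
    """Return rows in the deterministic default order, independent of input order."""
    return sorted(rows, key=sort_key)
```

Because the key ends in a unique identifier, two refreshes over the same data always produce the same row order regardless of how the backend returned the rows.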

SHOULD

  • Inline “quick set” decision menu (Affected / Not affected / Under investigation) for obvious cases, usable without leaving the list; it still requires a reason and logs evidence hashes.

4.2 Alert detail — landing tab MUST be Evidence

MUST

  • Default landing is Evidence (not “Overview”).

  • Top section shows 3 “proof pills” with status:

    • Reachability (✓ / ! / …)
    • Call stack (✓ / ! / …)
    • Provenance (✓ / ! / …)
  • Each pill expands inline (no navigation) into a compact evidence panel.

MUST: No blank panels

  • If evidence is loading, show skeleton + “what’s coming.”
  • If evidence is missing, show a reason (“not computed”, “requires source map”, “offline enrichment pending”).

4.3 Decision drawer

MUST

  • Pinned right drawer (or persistent bottom sheet on small screens).

  • Controls:

    • VEX/CSAF status: Affected / Not affected / Under investigation
    • Reason preset dropdown + editable reason text
    • “Record decision” button
  • Preview “Audit summary” before submit:

    • Evidence hashes included
    • Policy context (ruleset version)
    • Replay token
    • Actor identity

MUST

  • On submit, create an append-only audit event and immediately reflect status in UI.

SHOULD

  • Allow attaching references: ticket URL, incident ID, PR link (stored as metadata).

4.4 Diff tab

MUST

  • Show delta since last scan:

    • SBOM diffs (component version changes, removals/additions)
    • VEX diffs (status changes)
  • Group diffs by risk shift:

    • Risk-raising (new reachable vuln, severity increase)
    • Neutral (metadata-only)
    • Risk-reducing (fixed version, reachability removed)

SHOULD

  • Provide “Copy diff summary” for change management.
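
As a rough sketch of the risk-shift grouping in this section: the change `kind` labels below are hypothetical names derived from the examples in the list, and treating any unrecognized kind as neutral is an assumption that a real classifier may want to tighten.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str       # hypothetical label, e.g. "new_reachable_vuln" or "metadata"
    component: str


# Kinds taken from the examples in the spec; the label spellings are assumptions.
RISK_RAISING = {"new_reachable_vuln", "severity_increase"}
RISK_REDUCING = {"fixed_version", "reachability_removed"}


def classify(change: Change) -> str:
    """Map one SBOM/VEX change to a risk-shift group."""
    if change.kind in RISK_RAISING:
        return "risk-raising"
    if change.kind in RISK_REDUCING:
        return "risk-reducing"
    return "neutral"  # assumption: anything unrecognized is shown as neutral


def group_by_risk_shift(changes) -> dict:
    """Group changes by risk shift, with a stable order inside each group."""
    groups = {"risk-raising": [], "neutral": [], "risk-reducing": []}
    for change in sorted(changes, key=lambda c: (c.component, c.kind)):
        groups[classify(change)].append(change)
    return groups
```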

4.5 Activity/Audit tab

MUST

  • Immutable timeline of decisions and evidence changes.

  • Each entry includes:

    • actor, timestamp, decision, reason
    • evidence hash set
    • replay token
    • bundle/export availability

5) Power-user and accessibility requirements

Keyboard shortcuts (MUST)

  • J: jump to next missing/incomplete evidence panel
  • R: toggle reachability view (list ↔ compact graph ↔ textual proof)
  • Y: copy selected evidence block (call stack / DSSE / path proof)
  • A: set “Affected” (opens reason preset selection)
  • N: set “Not affected”
  • U: set “Under investigation”
  • ?: keyboard help overlay

Accessibility (MUST)

  • Fully navigable by keyboard
  • Visible focus states
  • Screen-reader labels for evidence pills and drawer controls
  • Color is never the only signal (badges must have text/icon)

6) Evidence model: what every alert should attempt to provide

Treat this as the minimum evidence bundle. Each item may be “unavailable,” but must be explicit.

MUST support:

  1. Reachability proof

    • At least one of:

      • function-level call path: entry → … → vulnerable_sink
      • package/module import chain
    • Includes confidence/algorithm tag: static, dynamic, heuristic

  2. Call stack snippet

    • 5–10 frames around the relevant node with file:line anchors where possible
  3. Provenance

    • DSSE attestation or equivalent statement
    • Artifact ancestry chain: image → layer → artifact → commit (as available)
    • Verification status: verified / pending / failed (with reason)
  4. Decision state

    • VEX status + reason + timestamps
  5. Evidence hash set

    • Content-addressed hashes of each evidence artifact included in the decision

SHOULD

  • “Evidence freshness”: when computed, tool version, input revisions.

7) Performance and graph rendering requirements

TTFS budget (MUST)

  • When opening an alert:

    • <200ms: show skeleton and cached row metadata
    • <500ms: render at least one evidence pill with meaningful content OR a cached preview image
    • <1.5s p95: render reachability + provenance for typical alerts

Graph rendering for large call graphs (MUST)

  • Two-phase rendering

    1. Server-generated static snapshot (PNG/SVG) displayed immediately
    2. Interactive graph hydrates lazily on user expand
  • Progressive expansion

    • Load 1-hop neighborhood first; expand on click
  • Deterministic layout

    • Same input produces same layout anchors (no reshuffles between refreshes)
  • Fan-out control

    • Collapse repeated library paths into “macro edges” to keep the graph readable
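
Progressive expansion and fan-out control might look like the sketch below. It assumes the call graph is available as an adjacency mapping and interprets “macro edges” as edges grouped by the library of each endpoint; `one_hop`, `macro_edges`, `max_fanout`, and `library_of` are hypothetical names, and the default cap of 25 is an arbitrary illustration.

```python
def one_hop(graph: dict, node: str, max_fanout: int = 25) -> tuple:
    """Return (visible_neighbors, hidden_count) for a 1-hop expansion of `node`.

    Neighbors are de-duplicated and sorted so the same input always yields
    the same visible set, which keeps the layout stable between refreshes.
    """
    neighbors = sorted(set(graph.get(node, ())))
    hidden = max(0, len(neighbors) - max_fanout)
    return neighbors[:max_fanout], hidden


def macro_edges(edges, library_of) -> list:
    """Collapse node-level edges into (src_library, dst_library) edges with counts."""
    counts = {}
    for src, dst in edges:
        key = (library_of(src), library_of(dst))
        counts[key] = counts.get(key, 0) + 1
    return sorted(counts.items())
```

The `hidden_count` value lets the UI render a “+N more” affordance instead of silently dropping edges beyond the cap.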

8) Offline mode requirements

Offline is not “nice to have”; it is a defined mode.

Offline evidence bundle (MUST)

  • A single file (e.g., .stella.bundle.tgz) that contains:

    • Alert metadata snapshot
    • Evidence artifacts (reachability proofs, call stacks, provenance attestations)
    • SBOM slice(s) necessary for diffs
    • VEX decision history (if available)
    • Manifest with content hashes (Merkle-ish)
  • Bundle must be signed (or include signature material) and verifiable.
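
A minimal sketch of the manifest with content hashes, assuming the bundle contents are already unpacked into a name-to-bytes mapping. The flat root hash here is a simplified stand-in for the “Merkle-ish” structure, signature checking is deliberately left out, and all function names are hypothetical.

```python
import hashlib


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _root(entries: dict) -> str:
    """Hash over the sorted "name:hash" lines (a flat stand-in for a Merkle root)."""
    lines = "\n".join(f"{name}:{digest}" for name, digest in sorted(entries.items()))
    return _sha256(lines.encode("utf-8"))


def build_manifest(files: dict) -> dict:
    """Build a manifest mapping each file name to its SHA-256, plus a root hash."""
    entries = {name: _sha256(data) for name, data in files.items()}
    return {"entries": entries, "root": _root(entries)}


def verify_bundle(files: dict, manifest: dict) -> list:
    """Return sorted names that are missing, unexpected, or altered (empty if clean)."""
    expected = manifest["entries"]
    if _root(expected) != manifest["root"]:
        return ["<manifest>"]  # the manifest itself was modified
    problems = []
    for name in sorted(set(expected) | set(files)):
        if name not in files or name not in expected:
            problems.append(name)
        elif _sha256(files[name]) != expected[name]:
            problems.append(name)
    return problems
```

In a real bundle the signature would cover the root hash, so verifying one signature plus the per-file hashes is enough to trust every artifact.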

UI behavior (MUST)

  • If bundle is present:

    • UI loads evidence from it first
    • Any missing items show “enrichment pending” (not “error”)
  • If network returns:

    • Background refresh allowed, but must not reorder the alert list unexpectedly
    • Must surface “updated evidence available” as a user-controlled refresh, not an auto-switch that changes context mid-triage
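
The bundle-first loading rule can be sketched as a small resolver. The section names follow the evidence model in this spec; `resolve_evidence`, the `fetch` callback, and the shape of the returned dictionaries are assumptions made for illustration.

```python
SECTIONS = ("reachability", "callstack", "provenance")


def resolve_evidence(bundle: dict, online: bool, fetch=None) -> dict:
    """Resolve each evidence section from the bundle first, then the network.

    Sections that cannot be resolved are reported as unavailable with the
    reason "enrichment pending" rather than as errors.
    """
    resolved = {}
    for section in SECTIONS:
        if section in bundle:
            resolved[section] = {"status": "available", "reason": None,
                                 "source": "bundle", "data": bundle[section]}
        elif online and fetch is not None:
            resolved[section] = {"status": "available", "reason": None,
                                 "source": "network", "data": fetch(section)}
        else:
            resolved[section] = {"status": "unavailable", "reason": "enrichment pending",
                                 "source": None, "data": None}
    return resolved
```

Note that a section already present in the bundle is never replaced by a network result here, which matches the requirement that refreshed evidence is offered to the user rather than swapped in mid-triage.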

9) Auditability and replay requirements

Decision event schema (MUST)

Every recorded decision must store:

  • alert_id, artifact_id (image digest or commit hash)
  • actor_id, timestamp
  • decision_status (Affected/Not affected/Under investigation)
  • reason_code (preset) + reason_text
  • evidence_hashes[] (content-addressed hashes)
  • policy_context (ruleset version, policy id)
  • replay_token (hash of inputs needed to reproduce)
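
The schema above maps naturally onto an immutable record plus an append-only log. In this sketch the field names come from the list above, while the `DecisionEvent` and `AuditLog` classes, the lowercase status spellings, and the in-memory storage are assumptions; signing and persistence are omitted.

```python
import json
from dataclasses import asdict, dataclass

VALID_STATUSES = {"affected", "not_affected", "under_investigation"}


@dataclass(frozen=True)
class DecisionEvent:
    alert_id: str
    artifact_id: str        # image digest or commit hash
    actor_id: str
    timestamp: str          # ISO 8601 string, supplied by the caller
    decision_status: str
    reason_code: str
    reason_text: str
    evidence_hashes: tuple  # content-addressed hashes
    policy_context: dict    # ruleset version, policy id
    replay_token: str


class AuditLog:
    """Append-only log: events are never edited; corrections are new events."""

    def __init__(self) -> None:
        self._lines = []

    def append(self, event: DecisionEvent) -> None:
        if event.decision_status not in VALID_STATUSES:
            raise ValueError(f"unknown decision_status: {event.decision_status!r}")
        # sort_keys makes the serialized line byte-stable for later hashing.
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def export_jsonl(self) -> str:
        """Return one JSON object per line, in the order events were recorded."""
        return "".join(line + "\n" for line in self._lines)
```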

Replay token (MUST)

  • Deterministic hash of:

    • scan inputs (SBOM digest, image digest, tool versions)
    • policy/rules versions
    • reachability algorithm version
  • “Reproduce” button produces a CLI snippet (copyable) pinned to these versions.
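
One way to make the replay token deterministic is to hash a canonical JSON encoding of the inputs listed above. This is a sketch under that assumption; the `replay_token` function, its parameter names, and the `sha256:` prefix are illustrative choices rather than a defined Stella Ops format.

```python
import hashlib
import json


def replay_token(scan_inputs: dict, policy_versions: dict, reachability_algo_version: str) -> str:
    """Return a deterministic token for the inputs needed to reproduce a decision.

    Canonical JSON (sorted keys, fixed separators) ensures that the same
    logical inputs always hash to the same token, whatever the dict order.
    """
    canonical = json.dumps(
        {
            "scan_inputs": scan_inputs,
            "policy": policy_versions,
            "reachability_algorithm": reachability_algo_version,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any change to a tool version, policy version, or digest produces a different token, so two decisions sharing a token were made against the same inputs.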

Export (MUST)

  • Exportable audit bundle that includes:

    • JSONL of decision events
    • evidence artifacts referenced by hashes
    • signatures/attestations
  • Export must be stable and verifiable later.


10) API and data contract guidelines (developer-facing)

This is an implementation guideline, not a full API spec—keep it simple and cache-friendly.

MUST endpoints (or equivalent)

  • GET /alerts?filters… → list view payload (small, cacheable)
  • GET /alerts/{id}/evidence → evidence payload (reachability, call stack, provenance, hashes)
  • POST /alerts/{id}/decisions → record decision event (append-only)
  • GET /alerts/{id}/audit → audit timeline
  • GET /alerts/{id}/diff?baseline=… → SBOM/VEX diff view
  • GET /bundles/{id} and/or POST /bundles/verify → offline bundle download/verify

Evidence payload guidelines (MUST)

  • Deterministic ordering for arrays and nodes (stable sorts).
  • Explicit status per evidence section: available | loading | pending | unavailable | error.
  • Include hash per artifact for content addressing.

Example shape

{
  "alert_id": "a123",
  "reachability": { "status": "available", "hash": "sha256:…", "proof": { "type": "call_path", "nodes": [...] } },
  "callstack":     { "status": "available", "hash": "sha256:…", "frames": [...] },
  "provenance":    { "status": "pending",   "hash": null,       "dsse": { "embedded": true, "payload": "…" } },
  "vex":           { "status": "available", "current": {...}, "history": [...] },
  "hashes": ["sha256:…", "sha256:…"]
}

11) Telemetry requirements (how we prove it’s fast)

MUST instrument:

  • alert_opened (timestamp, alert_id)
  • evidence_first_paint (timestamp, evidence_type)
  • decision_recorded (timestamp, clicks_count, evidence_bitset)
  • bundle_loaded (hit/miss, size, verification_status)
  • graph_preview_paint and graph_hydrated

MUST compute:

  • TTFS = evidence_first_paint - alert_opened
  • Clicks-to-Closure = interaction counter per alert until decision recorded
  • Evidence completeness bitset at decision time: reachability/callstack/provenance/vex present
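
The TTFS computation can be sketched as an aggregation over the instrumented events. The event dictionaries below (keys `name`, `alert_id`, `timestamp`) are an assumed shape, timestamps are taken to be milliseconds, and the nearest-rank percentile is one of several valid definitions, chosen here for simplicity.

```python
import math


def percentile(values, p: float):
    """Nearest-rank percentile of a non-empty sequence (p between 0 and 100)."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def ttfs_by_alert(events) -> dict:
    """Compute TTFS = evidence_first_paint - alert_opened for each alert.

    Only the first open and the first paint per alert are used; paints
    without a preceding open are ignored.
    """
    opened, ttfs = {}, {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        alert_id = event["alert_id"]
        if event["name"] == "alert_opened":
            opened.setdefault(alert_id, event["timestamp"])
        elif event["name"] == "evidence_first_paint":
            if alert_id in opened and alert_id not in ttfs:
                ttfs[alert_id] = event["timestamp"] - opened[alert_id]
    return ttfs
```

Feeding `ttfs_by_alert(events).values()` into `percentile(..., 50)` and `percentile(..., 95)` yields the p50/p95 figures that the success metrics track.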

12) Error handling and edge cases

MUST

  • Never show empty states without explanation.

  • Distinguish between:

    • “not computed yet”
    • “not possible due to missing inputs”
    • “blocked by permissions”
    • “offline—enrichment pending”
    • “verification failed”

SHOULD

  • Offer “Request enrichment” action when evidence missing (creates a job/task id).

13) Security, permissions, and multi-tenancy

MUST

  • RBAC gating for:

    • viewing provenance attestations
    • recording decisions
    • exporting audit bundles
  • All decision events are immutable; corrections are new events (append-only).

  • PII handling:

    • Avoid storing freeform reasons with secrets; warn on paste patterns (optional P1).

14) Engineering execution plan (priorities)

P0 (ship first)

  • Evidence-first alert detail landing
  • Decision drawer + append-only audit
  • Deterministic alert list sort + reachability badge
  • Evidence API + decision POST
  • TTFS + clicks telemetry
  • Static graph preview + lazy hydration

P1

  • Offline bundle load/verify + offline rendering
  • Smart diff view (risk shift grouping)
  • Exportable audit bundle
  • Keyboard shortcuts + help overlay

P2

  • Inline quick decisions from list
  • Advanced graph search within view
  • Suggest reason presets based on evidence patterns

15) Acceptance criteria checklist (what QA signs off)

A build is acceptable when:

  • Opening an alert renders at least one evidence pill within 500ms (with cache) and TTFS p95 meets target under network simulation.
  • Users can record A/N/U decisions with reason and see an audit event immediately.
  • Decision event includes evidence hashes + replay token.
  • Alert list sorting is stable and deterministic across refresh.
  • Graph preview appears instantly; interactive graph hydrates only on expand.
  • Offline bundle renders evidence without network; missing items show “enrichment pending,” not errors.
  • Keyboard shortcuts work; ? overlay lists them; full keyboard navigation is possible.

If you want, I can also format this into a developer-ready ticket pack (epics + user stories + acceptance tests) so engineers can implement without interpretation drift.