up
@@ -0,0 +1,866 @@
|
||||
Here’s a simple, battle‑tested way to make your UX feel fast under pressure: treat **Time‑to‑First‑Signal (TTFS)** as a product SLO and design everything backwards from it.
|
||||
|
||||
---
|
||||
|
||||
# TTFS SLO: the idea in one line
|
||||
|
||||
Guarantee **p50 < 2s, p95 < 5s** from user action (or CI event) to the **first meaningful signal** (status, cause, or next step)—fast enough to calm triage, short enough to be felt.
|
||||
|
||||
---
|
||||
|
||||
## What counts as “First Signal”?
|
||||
|
||||
* A clear, human message like: “Scan started; last error matched: `NU1605` (likely transitive). Retry advice →”
|
||||
* Or a progress token with context: “Queued (ETA ~18s). Cached reachability graph loaded.”
|
||||
|
||||
Not a spinner. Not 0% progress. A real, decision‑shaping hint.
|
||||
|
||||
---
|
||||
|
||||
## Budget the pipeline backwards (guardrails)
|
||||
|
||||
* **Frontend (≤150 ms):** render instant skeleton + last known state; optimistic UI; no blocking on fresh data.
|
||||
* **Edge/API (≤250 ms):** return a “signal frame” fast path (status + last error signature + cached ETA) from cache.
|
||||
* **Core services (500–1500 ms):** pre‑index failures, warm reachability summaries, enqueue heavy work, emit stream token.
|
||||
* **Slow work (async):** full scan, lattice policy merge, provenance trails—arrive later via push updates.
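As a quick sanity check, the stage budgets above can be added up against the p50 target. This is only a sketch: it takes the upper bound of each synchronous stage and assumes the stages run strictly in sequence, which the list does not state.

```python
# Upper-bound budgets (ms) for the synchronous stages listed above.
# Assumption: stages run one after another; slow work is async and excluded.
BUDGET_MS = {
    "frontend": 150,
    "edge_api": 250,
    "core_services": 1500,
}
P50_TARGET_MS = 2000


def remaining_headroom_ms(budgets, target_ms):
    """Return the milliseconds left once every stage has spent its full budget."""
    return target_ms - sum(budgets.values())


worst_case = sum(BUDGET_MS.values())
headroom = remaining_headroom_ms(BUDGET_MS, P50_TARGET_MS)
print(f"worst-case synchronous path: {worst_case} ms, headroom: {headroom} ms")
```

Under these assumptions the synchronous path tops out at 1900 ms, which leaves roughly 100 ms of slack for network transit before the 2 s p50 target.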
|
||||
|
||||
---
|
||||
|
||||
## Minimal implementation (1–2 sprints)
|
||||
|
||||
1. **Define the signal contract**
|
||||
|
||||
* `FirstSignal { kind, verb, scope, lastKnownOutcome?, ETA?, nextAction? }`
|
||||
* Version it; keep it <1 KB; always return within the SLO window.
|
||||
|
||||
2. **Cache last error signature**
|
||||
|
||||
* Key: `(repo, branch|imageDigest, toolchain-hash)`
|
||||
* Value: `{errorCode, excerpt, fixLink, firstSeenAt, hitCount}`
|
||||
* Evict by LRU + TTL (e.g., 7–14 days). Use Valkey in the default profile; Postgres JSONB in the air‑gapped profile.
|
||||
|
||||
3. **Pre‑index the failing step**
|
||||
|
||||
* When a job fails, extract and store:
|
||||
|
||||
* normalized step id (e.g., `scanner:deps-restore`)
|
||||
* top 1–3 error tokens (codes, regex’d phrases)
|
||||
* minimal context (package id, version range)
|
||||
* Write a tiny **“failure indexer”** that runs in‑band on failure and out‑of‑band on success.
|
||||
|
||||
4. **Lazy‑load everything else**
|
||||
|
||||
* UI shows FirstSignal + “Details loading…”
|
||||
* Fetch heavy panes (full CVE list, call‑graph, SBOM diff) after paint.
|
||||
|
||||
5. **Fast path endpoint**
|
||||
|
||||
* `GET /signal/{jobId}` returns from cache or snapshot table.
|
||||
* If cache miss: fall back to “cold signal” (`queued`, basic ETA) and **immediately** enqueue warmup tasks.
|
||||
|
||||
6. **Streaming updates**
|
||||
|
||||
* Emit compact deltas: `status:started → status:analyzing → triage:blocked(POLICY_X)` etc.
|
||||
* UI subscribes; CI annotates with the same tokens.
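Steps 2 and 5 above (cached error signature, fast path with a cold fallback) can be sketched together. This is a minimal in-memory illustration, not the Valkey or Postgres implementation; `SignatureCache`, `first_signal`, and the field names in the returned dict are hypothetical.

```python
import time
from collections import OrderedDict


class SignatureCache:
    """In-memory LRU + TTL cache; a stand-in for Valkey or a Postgres table."""

    def __init__(self, max_entries=1000, ttl_seconds=14 * 24 * 3600, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (expires_at, value)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)         # newest entry is most recently used
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:        # TTL expired: drop it and report a miss
            del self._entries[key]
            return None
        self._entries.move_to_end(key)         # refresh LRU position on a hit
        return value


def first_signal(cache, key):
    """Return a warm signal on a cache hit, otherwise the cold 'queued' signal."""
    hit = cache.get(key)
    if hit is None:
        return {"kind": "queued", "summary": "Queued. Preparing scan…", "cacheHit": False}
    return {
        "kind": "queued",
        "summary": f"Queued. Last failure matched: {hit['errorCode']}",
        "lastKnownOutcome": hit,
        "cacheHit": True,
    }


# Key shape follows step 2: (repo, branch|imageDigest, toolchain-hash).
cache = SignatureCache(max_entries=2)
key = ("org/repo", "main", "toolchain-abc")
cache.put(key, {"errorCode": "NU1605", "hitCount": 14})
print(first_signal(cache, key)["summary"])
```

The clock is injectable so that TTL behavior can be tested without sleeping; a real deployment would rely on the store's own expiry instead.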
|
||||
|
||||
---
|
||||
|
||||
## TTFS SLO monitor (keep it honest)
|
||||
|
||||
* Emit for every user‑visible action: `ttfs_ms`, `path` (UI|CLI|CI), `signal_kind`, `cache_hit` (T/F).
|
||||
* Track **p50/p95** by surface and by repo size.
|
||||
* Page on **p95 > 5s** for 5 consecutive minutes (or >2% of traffic).
|
||||
* Store exemplars (trace ids) to replay slow paths.
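The percentile tracking and the paging rule above can be sketched as follows. This assumes `ttfs_ms` samples are already collected and that p95 is computed once per minute; in practice the metrics backend would compute percentiles, and `should_page` is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_page(p95_per_minute_ms, threshold_ms=5000, window=5):
    """True when the last `window` one-minute p95 values all exceed the threshold."""
    recent = p95_per_minute_ms[-window:]
    return len(recent) == window and all(value > threshold_ms for value in recent)
```

A single healthy minute inside the window resets the condition, which matches the "5 consecutive minutes" wording.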
|
||||
|
||||
---
|
||||
|
||||
## Stella Ops–specific hooks (drop‑in)
|
||||
|
||||
* **Scanner.WebService:** on job accept, write `FirstSignal{kind:"queued", ETA}`; if failure index has a hit, attach `lastKnownOutcome`.
|
||||
* **Feedser/Vexer:** publish “known criticals changed since last run” as a hint in FirstSignal.
|
||||
* **Policy Engine:** pre‑evaluate “obvious blocks” (e.g., banned license) and surface as `nextAction:"toggle waiver or update license map"`.
|
||||
* **Air‑gapped profile:** skip Valkey; keep a `first_signal_snapshots` Postgres table + NOTIFY/LISTEN for streaming.
|
||||
|
||||
---
|
||||
|
||||
## UX micro‑rules
|
||||
|
||||
* **Never show a spinner alone**; always pair with a sentence or chip (“Warm cache found; verifying”).
|
||||
* **3 taps max** to reach evidence: Button → FirstSignal → Evidence card.
|
||||
* **Always include a next step** (“Retry with `--ignore NU1605` is unsafe; use `PackageReference` pin → link”).
|
||||
|
||||
---
|
||||
|
||||
## Quick success criteria
|
||||
|
||||
* Responders on a new incident report: “I knew what was happening within 2 seconds.”
|
||||
* CI annotates within 5s on p95.
|
||||
* Support tickets referencing “stuck scans” drop ≥40%.
|
||||
|
||||
---
|
||||
|
||||
If you want, I can turn this into a ready‑to‑paste **TASKS.md** (owners, DOD, metrics, endpoints, DB schemas) for your Stella Ops repos.
|
||||
````md
|
||||
# TASKS.md — TTFS (Time‑to‑First‑Signal) Fast Signal + Progressive Updates
|
||||
|
||||
> Paste this file into the repo root (or `/docs/TTFS/TASKS.md`).
|
||||
> This plan is structured as two sprints (A + B) with clear owners, dependencies, and DoD.
|
||||
|
||||
---
|
||||
|
||||
## 0) Product SLO and non‑negotiables
|
||||
|
||||
### SLO
|
||||
- **TTFS p50 < 2s, p95 < 5s**
|
||||
- Applies to: **Web UI**, **CLI**, **CI annotations**
|
||||
- TTFS = time from **user action / CI start** → **first meaningful signal rendered/logged**
|
||||
|
||||
### What counts as “First Signal”
|
||||
A First Signal must include at least one of:
|
||||
- Status + context (“Queued, ETA ~18s”; “Started, phase: restore”; “Blocked by policy XYZ”)
|
||||
- Known cause hint (error token/code/category)
|
||||
- Next action (open logs, docs link, retry command)
|
||||
|
||||
A spinner alone does **not** count.
|
||||
|
||||
### Hard constraints
|
||||
- `/jobs/{id}/signal` must **never block** on full scan work
|
||||
- FirstSignal payload in normal cases **< 1KB**
|
||||
- **No secrets** in snapshots, excerpts, telemetry
|
||||
|
||||
---
|
||||
|
||||
## 1) Scope and module owners
|
||||
|
||||
### Modules (assumed)
|
||||
- **Scanner.WebService** (job API + signal provider)
|
||||
- **Scanner.Worker** (phase changes + event publishing)
|
||||
- **Policy Engine** (block reasons + quick pre-eval hooks)
|
||||
- **Feedser/Vexer** (optional: “critical changed” hint)
|
||||
- **Web UI** (progressive rendering + streaming)
|
||||
- **CLI** (first signal + streaming)
|
||||
- **CI Integration** (checks/annotations)
|
||||
- **Platform/Observability** (metrics, dashboards, alerts)
|
||||
- **Security/Compliance** (redaction + tenant isolation)
|
||||
|
||||
### Owners (replace with actual people/teams)
|
||||
- **Backend Lead:** @be-owner
|
||||
- **Frontend Lead:** @fe-owner
|
||||
- **DevEx/CLI Lead:** @dx-owner
|
||||
- **CI Integrations Lead:** @ci-owner
|
||||
- **SRE/Obs Lead:** @sre-owner
|
||||
- **Security Lead:** @sec-owner
|
||||
- **PM:** @pm-owner
|
||||
|
||||
---
|
||||
|
||||
## 2) Canonical contract: FirstSignal v1.0
|
||||
|
||||
### FirstSignal shape (canonical)
|
||||
All surfaces (UI/CLI/CI) must be representable via this contract.
|
||||
|
||||
```json
{
  "version": "1.0",
  "signalId": "sig_...",
  "jobId": "job_...",

  "timestamp": "2025-12-14T18:22:31.014Z",
  "kind": "queued|started|phase|blocked|failed|succeeded|canceled|unavailable",
  "phase": "resolve|fetch|restore|analyze|policy|report|unknown",

  "scope": { "type": "repo|image|artifact", "id": "org/repo@branch-or-digest" },

  "summary": "Queued (ETA ~18s). Last failure matched: NU1605 (dependency downgrade).",
  "etaSeconds": 18,

  "lastKnownOutcome": {
    "signatureId": "sigerr_...",
    "errorCode": "NU1605",
    "token": "dependency-downgrade",
    "excerpt": "Detected package downgrade: ...",
    "confidence": "low|medium|high",
    "firstSeenAt": "2025-12-01T00:00:00Z",
    "hitCount": 14
  },

  "nextActions": [
    { "type": "open_logs|open_job|docs|retry|cli_command", "label": "Open logs", "target": "/jobs/job_.../logs" }
  ],

  "diagnostics": {
    "cacheHit": true,
    "source": "snapshot|failure_index|cold_start",
    "correlationId": "corr_..."
  }
}
````
|
||||
|
||||
### Contract rules
|
||||
|
||||
* Must always include: `version`, `jobId`, `timestamp`, `kind`, `summary`
|
||||
* Keep normal payload < 1KB (enforce excerpt max length; avoid lists)
|
||||
* Never include secrets; excerpts must be redacted
|
||||
|
||||
---
|
||||
|
||||
## 3) Milestones
|
||||
|
||||
### Sprint A — “TTFS Baseline”
|
||||
|
||||
Goal: Always show **some** meaningful First Signal quickly.
|
||||
|
||||
Deliverables:
|
||||
|
||||
* Snapshot persistence (DB) + optional cache
|
||||
* `/jobs/{id}/signal` fast path
|
||||
* UI skeleton + immediate FirstSignal rendering (poll fallback OK)
|
||||
* Base telemetry: `ttfs_ms`, endpoint latency, cache hit
|
||||
|
||||
### Sprint B — “Smart Hints + Streaming”
|
||||
|
||||
Goal: First Signal is helpful and updates live.
|
||||
|
||||
Deliverables:
|
||||
|
||||
* Failure signature indexer + lookup
|
||||
* SSE events (or WebSocket) for incremental updates
|
||||
* CLI streaming + CI annotations
|
||||
* Dashboards + alerts + exemplars/traces
|
||||
* Redaction hardening and tenant isolation validation
|
||||
|
||||
---
|
||||
|
||||
## 4) Sprint A tasks — TTFS baseline
|
||||
|
||||
### A1 — Implement FirstSignal types and helpers (shared package)
|
||||
|
||||
**Owner:** @be-owner
|
||||
**Depends on:** none
|
||||
**Est:** 2–4 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Define FirstSignal v1.0 schema in a shared package (`/common/contracts/firstsignal`)
|
||||
* [ ] Add validators:
|
||||
|
||||
* [ ] required fields present
|
||||
* [ ] size limits (excerpt length; total serialized bytes threshold warning)
|
||||
* [ ] allowed enums for kind/phase
|
||||
* [ ] Add builders:
|
||||
|
||||
* [ ] `buildQueuedSignal(job, eta?)`
|
||||
* [ ] `buildColdSignal(job)`
|
||||
* [ ] `mergeHint(signal, lastKnownOutcome)`
|
||||
* [ ] `addNextActions(signal, actions[])`
|
||||
|
||||
**DoD**
|
||||
|
||||
* Contract is versioned, unit-tested, and used by backend endpoint
|
||||
* Validation rejects/flags invalid signals in tests
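A minimal sketch of the A1 validators and builders, assuming plain dicts for the contract. The function names mirror the task list in snake_case, the 240-character excerpt cap is borrowed from B1, and none of this is the actual shared package.

```python
import json
from datetime import datetime, timezone

KINDS = {"queued", "started", "phase", "blocked", "failed", "succeeded", "canceled", "unavailable"}
PHASES = {"resolve", "fetch", "restore", "analyze", "policy", "report", "unknown"}
REQUIRED = ("version", "jobId", "timestamp", "kind", "summary")
MAX_BYTES = 1024          # contract rule: normal payload < 1 KB
MAX_EXCERPT_CHARS = 240   # assumed cap, borrowed from the redaction task (B1)


def validate(signal):
    """Return a list of problems; an empty list means the signal is valid."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in signal]
    if "kind" in signal and signal["kind"] not in KINDS:
        problems.append(f"invalid kind: {signal['kind']!r}")
    if "phase" in signal and signal["phase"] not in PHASES:
        problems.append(f"invalid phase: {signal['phase']!r}")
    excerpt = (signal.get("lastKnownOutcome") or {}).get("excerpt") or ""
    if len(excerpt) > MAX_EXCERPT_CHARS:
        problems.append("excerpt too long")
    if len(json.dumps(signal).encode("utf-8")) >= MAX_BYTES:
        problems.append("payload is 1 KB or larger")
    return problems


def build_queued_signal(job_id, eta_seconds=None, now=None):
    """Build the initial 'queued' signal, with an ETA when one is known."""
    now = now or datetime.now(timezone.utc)
    summary = "Queued." if eta_seconds is None else f"Queued (ETA ~{eta_seconds}s)."
    signal = {
        "version": "1.0",
        "jobId": job_id,
        "timestamp": now.isoformat(),
        "kind": "queued",
        "phase": "unknown",
        "summary": summary,
    }
    if eta_seconds is not None:
        signal["etaSeconds"] = eta_seconds
    return signal


def merge_hint(signal, last_known_outcome):
    """Return a copy of the signal with a lastKnownOutcome hint attached."""
    merged = dict(signal)
    merged["lastKnownOutcome"] = dict(last_known_outcome)
    return merged
```

Returning a list of problems rather than raising lets callers choose between rejecting a signal and merely flagging it, which is the distinction the DoD draws.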
|
||||
|
||||
---
|
||||
|
||||
### A2 — Snapshot storage: `first_signal_snapshots` table + migrations
|
||||
|
||||
**Owner:** @be-owner
|
||||
**Depends on:** A1
|
||||
**Est:** 3–5 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Add Postgres migration for `first_signal_snapshots`
|
||||
* [ ] Implement CRUD:
|
||||
|
||||
* [ ] `createSnapshot(jobId, signal)`
|
||||
* [ ] `updateSnapshot(jobId, partialSignal)` (phase transitions)
|
||||
* [ ] `getSnapshot(jobId)`
|
||||
* [ ] Enforce:
|
||||
|
||||
* [ ] `payload_json` size guard (soft warn + hard cap via excerpt limit)
|
||||
* [ ] `updated_at` maintained automatically
|
||||
|
||||
**Suggested schema**
|
||||
|
||||
```sql
CREATE TABLE first_signal_snapshots (
  job_id TEXT PRIMARY KEY,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  kind TEXT NOT NULL,
  phase TEXT NOT NULL,
  summary TEXT NOT NULL,
  eta_seconds INT NULL,
  payload_json JSONB NOT NULL
);
CREATE INDEX ON first_signal_snapshots (updated_at DESC);
```
|
||||
|
||||
**DoD**
|
||||
|
||||
* Migration included
|
||||
* Integration test: create job → snapshot exists within request lifecycle (or best-effort async write + immediate cold response)
|
||||
|
||||
---
|
||||
|
||||
### A3 — Cache layer (default profile) with Postgres fallback
|
||||
|
||||
**Owner:** @be-owner
|
||||
**Depends on:** A2
|
||||
**Est:** 3–6 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Add optional Valkey/Redis support:
|
||||
|
||||
* [ ] key: `signal:job:{jobId}` TTL: 24h
|
||||
* [ ] read-through cache on `/signal`
|
||||
* [ ] write-through on snapshot updates
|
||||
* [ ] Air-gapped mode behavior:
|
||||
|
||||
* [ ] cache disabled → read/write snapshots in Postgres only
|
||||
* [ ] Add config toggles:
|
||||
|
||||
* [ ] `TTFS_CACHE_BACKEND=valkey|postgres|none`
|
||||
* [ ] `TTFS_CACHE_TTL_SECONDS=86400`
|
||||
|
||||
**DoD**
|
||||
|
||||
* With cache enabled: `/signal` p95 latency meets budget in load test
|
||||
* With cache disabled: correctness remains; p95 within acceptable baseline
|
||||
|
||||
---
|
||||
|
||||
### A4 — `/jobs/{jobId}/signal` fast-path endpoint
|
||||
|
||||
**Owner:** @be-owner
|
||||
**Depends on:** A2, A3
|
||||
**Est:** 4–8 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Implement `GET /jobs/{jobId}/signal`
|
||||
|
||||
* [ ] Try cache snapshot
|
||||
* [ ] Else DB snapshot
|
||||
* [ ] Else cold signal (`kind=queued`, `phase=unknown`, summary “Queued. Preparing scan…”)
|
||||
* [ ] Best-effort snapshot write if missing (non-blocking)
|
||||
* [ ] Response headers:
|
||||
|
||||
* [ ] `X-Correlation-Id`
|
||||
* [ ] `Cache-Status: hit|miss|bypass`
|
||||
* [ ] Add server-side timing logs (debug-level) for:
|
||||
|
||||
* [ ] cache read time
|
||||
* [ ] db read time
|
||||
* [ ] cold path time
|
||||
|
||||
**Performance budget**
|
||||
|
||||
* Cache-hit response: **p95 ≤ 250ms**
|
||||
* Cold response: **p95 ≤ 500ms**
|
||||
|
||||
**DoD**
|
||||
|
||||
* Endpoint never blocks on scan work
|
||||
* Returns a valid FirstSignal every time job exists
|
||||
* Load test demonstrates budgets
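The lookup order in A4 (cache, then DB snapshot, then cold signal) can be sketched as below. Plain dicts stand in for Valkey and Postgres, and `enqueue_snapshot_write` is a hypothetical fire-and-forget callback; the point is only that no branch waits on scan work.

```python
def get_signal(job_id, cache, db, enqueue_snapshot_write):
    """Resolve a FirstSignal without waiting on scan work.

    Returns (signal, cache_status), where cache_status feeds the
    Cache-Status response header.
    """
    signal = cache.get(job_id)
    if signal is not None:
        return signal, "hit"

    signal = db.get(job_id)
    if signal is not None:
        cache[job_id] = signal             # read-through: warm the cache for the next call
        return signal, "miss"

    cold = {
        "jobId": job_id,
        "kind": "queued",
        "phase": "unknown",
        "summary": "Queued. Preparing scan…",
        "diagnostics": {"cacheHit": False, "source": "cold_start"},
    }
    enqueue_snapshot_write(job_id, cold)   # best effort; must not block the response
    return cold, "miss"
```

Existence and authorization checks for the job are deliberately left out here; a real handler would run them before this lookup.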
|
||||
|
||||
---
|
||||
|
||||
### A5 — Create snapshot at job creation and update on phase changes
|
||||
|
||||
**Owner:** @be-owner + @worker-owner
|
||||
**Depends on:** A2
|
||||
**Est:** 5–8 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] In `POST /jobs`:
|
||||
|
||||
* [ ] Immediately write initial snapshot:
|
||||
|
||||
* `kind=queued`
|
||||
* `phase=unknown`
|
||||
* summary includes “Queued” and optional ETA
|
||||
* [ ] In worker:
|
||||
|
||||
* [ ] When job starts: update snapshot to `kind=started`, `phase=resolve|fetch|restore…`
|
||||
* [ ] On phase transitions: update snapshot
|
||||
* [ ] On terminal: `kind=succeeded|failed|canceled`
|
||||
* [ ] Ensure updates are idempotent and safe (replays)
|
||||
|
||||
**DoD**
|
||||
|
||||
* For any started job, snapshot shows phase changes within a few seconds
|
||||
* Terminal kind always correct
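One way to make A5 updates idempotent and replay-safe is to attach a per-job sequence number to every worker update. The `seq` field is an assumption (the contract above does not define it), and `apply_update` is a hypothetical helper.

```python
TERMINAL_KINDS = {"succeeded", "failed", "canceled"}


def apply_update(snapshot, update):
    """Apply a worker update to a snapshot; replays and stale updates are no-ops.

    Both dicts carry an integer `seq` (hypothetical) that the worker increments
    on every transition. Returns the snapshot that should be stored.
    """
    if snapshot["kind"] in TERMINAL_KINDS:
        return snapshot                    # a terminal state is final
    if update["seq"] <= snapshot["seq"]:
        return snapshot                    # duplicate or out-of-order delivery
    return {**snapshot, **update}
```

In Postgres the same guard could live in the `WHERE` clause of the `UPDATE`, so that concurrent workers cannot overwrite a newer snapshot.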
|
||||
|
||||
---
|
||||
|
||||
### A6 — UI: Immediate “First Signal” rendering with polling fallback
|
||||
|
||||
**Owner:** @fe-owner
|
||||
**Depends on:** A4
|
||||
**Est:** 6–10 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] On scan trigger:
|
||||
|
||||
* [ ] Render skeleton + “Preparing scan…” message (no spinner-only)
|
||||
* [ ] Call `POST /jobs` (get jobId)
|
||||
* [ ] Immediately call `GET /jobs/{jobId}/signal`
|
||||
* [ ] Render summary + at least one next action button (Open job/logs)
|
||||
* [ ] Poll fallback:
|
||||
|
||||
* [ ] If streaming not available yet (Sprint A), poll `/signal` every 2–5s until terminal
|
||||
* [ ] Lazy-load heavy panels (must not block First Signal):
|
||||
|
||||
* [ ] vulnerability list
|
||||
* [ ] dependency graph
|
||||
* [ ] SBOM diff
|
||||
|
||||
**DoD**
|
||||
|
||||
* Real user monitoring shows UI TTFS p50 < 2s, p95 < 5s for the baseline path
|
||||
* No spinner-only states
|
||||
|
||||
---
|
||||
|
||||
### A7 — Telemetry: baseline metrics and tracing
|
||||
|
||||
**Owner:** @sre-owner + @be-owner + @fe-owner
|
||||
**Depends on:** A4, A6
|
||||
**Est:** 5–10 pts
|
||||
|
||||
**Metrics**
|
||||
|
||||
* [ ] `ttfs_ms` (emitted client-side for UI; server-side for CLI/CI if needed)
|
||||
|
||||
* tags: `surface=ui|cli|ci`, `cache_hit=true|false`, `signal_source=snapshot|failure_index|cold_start`, `kind`, `repo_size_bucket`
|
||||
* [ ] `signal_endpoint_latency_ms`
|
||||
* [ ] `signal_payload_bytes`
|
||||
* [ ] `signal_error_rate`
|
||||
|
||||
**Tracing**
|
||||
|
||||
* [ ] Correlation id propagated:
|
||||
|
||||
* [ ] API response header
|
||||
* [ ] worker logs
|
||||
* [ ] events (Sprint B)
|
||||
|
||||
**Dashboards**
|
||||
|
||||
* [ ] TTFS p50/p95 by surface
|
||||
* [ ] cache hit rate
|
||||
* [ ] endpoint latency percentiles
|
||||
|
||||
**DoD**
|
||||
|
||||
* Metrics visible in dashboard
|
||||
* Correlation ids make it possible to trace slow examples end-to-end
|
||||
|
||||
---
|
||||
|
||||
## 5) Sprint B tasks — smart hints + streaming
|
||||
|
||||
### B1 — Failure signature extraction + redaction library
|
||||
|
||||
**Owner:** @be-owner + @sec-owner
|
||||
**Depends on:** A1
|
||||
**Est:** 6–12 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Implement redaction utility (unit-tested):
|
||||
|
||||
* [ ] strip bearer tokens, API keys, access tokens, private URLs
|
||||
* [ ] cap excerpt length (e.g., 240 chars)
|
||||
* [ ] normalize whitespace
|
||||
* [ ] Implement signature extraction from:
|
||||
|
||||
* [ ] structured step errors (preferred)
|
||||
* [ ] raw logs (fallback) via regex ruleset
|
||||
* [ ] Map to:
|
||||
|
||||
* `errorCode` (if present)
|
||||
* `token` (normalized category)
|
||||
* `confidence` (high/medium/low)
|
||||
|
||||
**DoD**
|
||||
|
||||
* Redaction unit tests include “known secret-like patterns”
|
||||
* Extraction produces stable tokens for top failure families
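A minimal sketch of the B1 redaction utility. The three patterns are illustrative only (bearer tokens, `key=value` style secrets, and URLs with embedded credentials); a production ruleset would be broader and would need its own definition of "private URL".

```python
import re

MAX_EXCERPT_CHARS = 240

# Illustrative patterns only; not a complete secret-detection ruleset.
_SECRET_PATTERNS = [
    re.compile(r"(?i)bearer\s+[A-Za-z0-9\-._~+/]+=*"),
    re.compile(r"(?i)\b(api[_-]?key|access[_-]?token|password|secret)\b\s*[=:]\s*\S+"),
    re.compile(r"https?://[^\s/@]+:[^\s/@]+@\S+"),  # URLs with embedded credentials
]


def redact_excerpt(text):
    """Replace secret-like spans, normalize whitespace, and cap the length."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[redacted]", text)
    text = " ".join(text.split())      # collapse runs of whitespace and newlines
    return text[:MAX_EXCERPT_CHARS]
```

Redaction runs before truncation so that a secret straddling the cut-off cannot survive as a partial token.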
|
||||
|
||||
---
|
||||
|
||||
### B2 — Failure signature storage: `failure_signatures` table + upsert on failures
|
||||
|
||||
**Owner:** @be-owner
|
||||
**Depends on:** B1
|
||||
**Est:** 5–10 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Add Postgres migration for `failure_signatures`
|
||||
* [ ] Implement lookup key:
|
||||
|
||||
* `(scope_type, scope_id, toolchain_hash)`
|
||||
* [ ] On job failure:
|
||||
|
||||
* [ ] extract signature → redaction → upsert
|
||||
* [ ] increment hit_count; update last_seen_at
|
||||
* [ ] Retention:
|
||||
|
||||
* [ ] TTL job: delete signatures older than 14 days (configurable)
|
||||
* [ ] or retain last N signatures per scope
|
||||
|
||||
**Suggested schema**
|
||||
|
||||
```sql
CREATE TABLE failure_signatures (
  signature_id TEXT PRIMARY KEY,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  scope_type TEXT NOT NULL,
  scope_id TEXT NOT NULL,
  toolchain_hash TEXT NOT NULL,
  error_code TEXT NULL,
  token TEXT NOT NULL,
  excerpt TEXT NULL,
  confidence TEXT NOT NULL,
  first_seen_at TIMESTAMPTZ NOT NULL,
  last_seen_at TIMESTAMPTZ NOT NULL,
  hit_count INT NOT NULL DEFAULT 1
);
CREATE INDEX ON failure_signatures (scope_type, scope_id, toolchain_hash);
CREATE INDEX ON failure_signatures (token);
```
|
||||
|
||||
**DoD**
|
||||
|
||||
* Failure runs populate signatures
|
||||
* Excerpts are redacted and capped
|
||||
* Retention job verified
|
||||
|
||||
---
|
||||
|
||||
### B3 — Enrich FirstSignal with “lastKnownOutcome” hint
|
||||
|
||||
**Owner:** @be-owner
|
||||
**Depends on:** B2
|
||||
**Est:** 3–6 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] On `/signal` (fast path):
|
||||
|
||||
* [ ] if snapshot exists but has no hint, attempt signature lookup by scope+toolchain hash
|
||||
* [ ] merge hint into signal
|
||||
* [ ] include `diagnostics.source=failure_index` when used
|
||||
* [ ] Add “next actions” for common tokens:
|
||||
|
||||
* [ ] docs link for known error codes/tokens
|
||||
* [ ] “open logs” always present
|
||||
|
||||
**DoD**
|
||||
|
||||
* For scopes with prior failures, FirstSignal includes hint within SLO budgets
|
||||
|
||||
---
|
||||
|
||||
### B4 — Streaming updates via SSE (recommended)
|
||||
|
||||
**Owner:** @be-owner + @worker-owner + @fe-owner
|
||||
**Depends on:** A5
|
||||
**Est:** 8–16 pts
|
||||
|
||||
**Backend tasks**
|
||||
|
||||
* [ ] Add `GET /jobs/{jobId}/events` SSE endpoint
|
||||
* [ ] Define event payloads:
|
||||
|
||||
* `status` (kind+phase+message)
|
||||
* `hint` (token+errorCode+confidence)
|
||||
* `policy` (blocked + policyId)
|
||||
* `complete` (terminal)
|
||||
* [ ] Worker publishes events at:
|
||||
|
||||
* start
|
||||
* phase transitions
|
||||
* policy decision
|
||||
* terminal
|
||||
* [ ] Ensure reconnect safety:
|
||||
|
||||
* [ ] monotonic event id (or timestamp) so clients can resume after a reconnect
|
||||
* [ ] optional replay window (last N events in memory or DB)
|
||||
|
||||
**Frontend tasks**
|
||||
|
||||
* [ ] Subscribe after jobId known
|
||||
* [ ] Update FirstSignal UI in-place on deltas
|
||||
* [ ] Fallback to polling when SSE fails
|
||||
|
||||
**DoD**
|
||||
|
||||
* UI updates without refresh
|
||||
* Event stream doesn’t spam (3–8 meaningful events per job typical)
|
||||
* SSE failure degrades gracefully
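The event framing and the replay window in B4 can be sketched as follows. The wire format (`id:`, `event:`, `data:` lines ended by a blank line) is the standard `text/event-stream` framing; `ReplayBuffer` and its in-memory storage are assumptions for illustration.

```python
import json
from collections import deque


def format_sse(event_id, event_type, payload):
    """Serialize one event in the text/event-stream wire format."""
    data = json.dumps(payload, separators=(",", ":"))  # single line, no embedded newlines
    return f"id: {event_id}\nevent: {event_type}\ndata: {data}\n\n"


class ReplayBuffer:
    """Keeps the last N events for one job so a reconnecting client can catch up."""

    def __init__(self, max_events=50):
        self._events = deque(maxlen=max_events)
        self._next_id = 1

    def publish(self, event_type, payload):
        """Assign the next monotonic id, remember the event, and return its frame."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, event_type, payload))
        return format_sse(event_id, event_type, payload)

    def replay_after(self, last_event_id):
        """Frames with an id greater than the client's Last-Event-ID value."""
        return [format_sse(i, t, p) for (i, t, p) in self._events if i > last_event_id]
```

Browsers send the last received id back in the `Last-Event-ID` request header when `EventSource` reconnects, so the server only needs to call `replay_after` with that value.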
|
||||
|
||||
---
|
||||
|
||||
### B5 — Policy Engine: “obvious block” pre-eval for early signal
|
||||
|
||||
**Owner:** @be-owner + @policy-owner
|
||||
**Depends on:** B4 (optional), or can enrich snapshot directly
|
||||
**Est:** 5–10 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Add a quick pre-evaluation hook for high-signal blocks:
|
||||
|
||||
* banned license
|
||||
* disallowed package
|
||||
* org-level denylist
|
||||
* [ ] Emit early policy event or update snapshot:
|
||||
|
||||
* `kind=blocked`, `phase=policy`, summary names the policy
|
||||
* next action points to waiver/docs (if supported)
|
||||
|
||||
**DoD**
|
||||
|
||||
* When an obvious block is present, users see it in FirstSignal without waiting for full analysis
|
||||
|
||||
---
|
||||
|
||||
### B6 — CLI: First Signal + streaming
|
||||
|
||||
**Owner:** @dx-owner
|
||||
**Depends on:** A4, B4
|
||||
**Est:** 5–10 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Ensure CLI prints FirstSignal within TTFS budget
|
||||
* [ ] Add `--follow` default behavior:
|
||||
|
||||
* connect to SSE and stream deltas
|
||||
* [ ] Provide minimal, non-spammy output:
|
||||
|
||||
* only on meaningful transitions
|
||||
* [ ] Print correlation id for support triage
|
||||
|
||||
**DoD**
|
||||
|
||||
* CLI TTFS p50 < 2s, p95 < 5s
|
||||
* Streaming works and degrades to polling
|
||||
|
||||
---
|
||||
|
||||
### B7 — CI annotations/checks: initial First Signal within 5s p95
|
||||
|
||||
**Owner:** @ci-owner
|
||||
**Depends on:** A4, B4 (optional)
|
||||
**Est:** 6–12 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] On CI job start:
|
||||
|
||||
* [ ] call `/signal` and publish check/annotation with summary + job link
|
||||
* [ ] Update annotations only on state changes:
|
||||
|
||||
* queued → started
|
||||
* started → blocked/failed/succeeded
|
||||
* [ ] Avoid annotation spam (max 3–5 updates)
|
||||
|
||||
**DoD**
|
||||
|
||||
* CI shows actionable first message within 5s p95
|
||||
* Updates are minimal and meaningful
|
||||
|
||||
---
|
||||
|
||||
### B8 — Observability: SLO alerts + exemplars
|
||||
|
||||
**Owner:** @sre-owner
|
||||
**Depends on:** A7
|
||||
**Est:** 5–10 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Alerts:
|
||||
|
||||
* [ ] page when `p95(ttfs_ms) > 5000` for 5 mins
|
||||
* [ ] page when `signal_error_rate > 1%`
|
||||
* [ ] Add exemplars / trace links on slow TTFS samples
|
||||
* [ ] Add breakdown dashboard:
|
||||
|
||||
* surface (ui/cli/ci)
|
||||
* cacheHit
|
||||
* repo size bucket
|
||||
* kind/phase
|
||||
|
||||
**DoD**
|
||||
|
||||
* On-call can diagnose slow TTFS with one click to traces/logs
|
||||
|
||||
---
|
||||
|
||||
## 6) Cross-cutting: security, privacy, and tenancy
|
||||
|
||||
### S1 — Tenant-safe caching and lookups
|
||||
|
||||
**Owner:** @sec-owner + @be-owner
|
||||
**Depends on:** A3, B2
|
||||
**Est:** 3–6 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Ensure cache keys include tenant/org boundary where applicable:
|
||||
|
||||
* `tenant:{tenantId}:signal:job:{jobId}`
|
||||
* [ ] Ensure failure signatures are only looked up within same tenant
|
||||
* [ ] Add tests for cross-tenant leakage
|
||||
|
||||
**DoD**
|
||||
|
||||
* No cross-tenant access possible via cache or signature index
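A small sketch of the tenant-scoped key from S1. Rejecting `:` inside ids is an added assumption, meant to stop a crafted id from imitating another tenant's key prefix; `signal_cache_key` is a hypothetical helper.

```python
def signal_cache_key(tenant_id, job_id):
    """Build `tenant:{tenantId}:signal:job:{jobId}`, rejecting ids that could spoof a boundary."""
    for name, value in (("tenant_id", tenant_id), ("job_id", job_id)):
        if not value or ":" in value:
            raise ValueError(f"{name} must be non-empty and must not contain ':'")
    return f"tenant:{tenant_id}:signal:job:{job_id}"
```

A cross-tenant leakage test can then assert that the same `jobId` under two tenants never maps to the same key.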
|
||||
|
||||
---
|
||||
|
||||
### S2 — No secrets policy enforcement
|
||||
|
||||
**Owner:** @sec-owner
|
||||
**Depends on:** B1
|
||||
**Est:** 2–5 pts
|
||||
|
||||
**Tasks**
|
||||
|
||||
* [ ] Add “secret scanning” unit tests for redaction
|
||||
* [ ] Add runtime guardrails:
|
||||
|
||||
* if excerpt contains forbidden patterns → replace with “[redacted]”
|
||||
* [ ] Ensure telemetry attributes never include excerpts
|
||||
|
||||
**DoD**
|
||||
|
||||
* Security review sign-off for snapshot + signature + telemetry
|
||||
|
||||
---
|
||||
|
||||
## 7) Global Definition of Done
|
||||
|
||||
A feature is “done” only when:
|
||||
|
||||
* [ ] Meets TTFS SLO in staging load test and in production RUM (within agreed rollout window)
|
||||
* [ ] Has:
|
||||
|
||||
* [ ] unit tests
|
||||
* [ ] integration tests
|
||||
* [ ] basic load test coverage for `/signal`
|
||||
* [ ] Has:
|
||||
|
||||
* [ ] dashboards
|
||||
* [ ] alerts (or explicitly deferred with signed waiver)
|
||||
* [ ] Has:
|
||||
|
||||
* [ ] secure redaction
|
||||
* [ ] tenant isolation
|
||||
* [ ] Has a rollback plan via feature flag
|
||||
|
||||
---
|
||||
|
||||
## 8) Test plan
|
||||
|
||||
### Unit tests
|
||||
|
||||
* FirstSignal contract validation (required fields, enums)
|
||||
* Redaction patterns (bearer tokens, API keys, URLs, long strings)
|
||||
* Signature extraction rule correctness
|
||||
|
||||
### Integration tests
|
||||
|
||||
* Create job → snapshot exists → `/signal` returns it
|
||||
* Worker phase transitions update snapshot
|
||||
* Job fail → signature stored → next job → `/signal` includes lastKnownOutcome
|
||||
* SSE connect → receive events in order → terminal event once
|
||||
|
||||
### Load tests (must-have)
|
||||
|
||||
* `/jobs/{id}/signal`:
|
||||
|
||||
* cache-hit p95 ≤ 250ms
|
||||
* cold path p95 ≤ 500ms
|
||||
* error rate < 0.1% under expected concurrency
|
||||
|
||||
### Chaos/degraded tests
|
||||
|
||||
* Cache down → Postgres fallback works
|
||||
* SSE blocked → UI polls and still updates
|
||||
|
||||
---
|
||||
|
||||
## 9) Feature flags and rollout
|
||||
|
||||
### Flags
|
||||
|
||||
* `ttfs.first_signal_enabled` (default ON in staging)
|
||||
* `ttfs.cache_enabled`
|
||||
* `ttfs.failure_index_enabled`
|
||||
* `ttfs.sse_enabled`
|
||||
* `ttfs.policy_preeval_enabled`
|
||||
|
||||
### Rollout steps
|
||||
|
||||
1. Enable baseline FirstSignal + snapshots for internal/staging
|
||||
2. Enable cache in default profile
|
||||
3. Enable failure index (read-only first; then write)
|
||||
4. Enable SSE for 10% traffic → 50% → 100%
|
||||
5. Enable CI annotations (start with non-blocking informational checks)
|
||||
|
||||
---
|
||||
|
||||
## 10) PR review checklist (paste into PR template)
|
||||
|
||||
* [ ] No blocking heavy work added to `/signal` path
|
||||
* [ ] Signal payload size remains < 1KB in normal cases
|
||||
* [ ] Excerpts are redacted + length-capped
|
||||
* [ ] Tenant boundary included in cache keys and DB queries
|
||||
* [ ] Metrics emitted (`ttfs_ms`, endpoint latency, cacheHit)
|
||||
* [ ] UI has no spinner-only state; always shows message + next action
|
||||
* [ ] Streaming has polling fallback
|
||||
* [ ] Tests added/updated (unit + integration)
|
||||
|
||||
---
|
||||
|
||||
## 11) “Ready for QA” scenarios
|
||||
|
||||
QA should validate:
|
||||
|
||||
* UI:
|
||||
|
||||
* click scan → first message within 2s typical
|
||||
* see queued/started/blocked states clearly
|
||||
* open logs works
|
||||
* CLI:
|
||||
|
||||
* first output within 2s typical
|
||||
* follow stream updates
|
||||
* CI:
|
||||
|
||||
* first annotation/check appears quickly and links to job
|
||||
* Security:
|
||||
|
||||
* inject fake token into logs → stored excerpt is redacted
|
||||
* Multi-tenant:
|
||||
|
||||
* run jobs across tenants → no leakage in signals or hints
|
||||
|
||||
---
|
||||
|
||||
```
|
||||
|
||||
If you want this split into **multiple repo-local files** (e.g., `/docs/TTFS/ARCH.md`, `/docs/TTFS/SCHEMAS.sql`, `/docs/TTFS/RUNBOOK.md`, plus a PR template snippet), say the folder structure you prefer and I’ll output them in the same paste-ready format.
|
||||
```
|
||||
@@ -0,0 +1,892 @@
|
||||
Here’s a crisp, first‑time‑friendly blueprint for **Smart‑Diff**—a minimal‑noise way to highlight only changes that actually shift security risk, not every tiny SBOM/VEX delta.
|
||||
|
||||
---
|
||||
|
||||
# What “Smart‑Diff” means (in plain terms)
|
||||
|
||||
Smart‑Diff is the **smallest set of changes** between two builds/releases that **materially change risk**. We only surface a change when it affects exploitability or policy, not when a dev‑only transitive dependency takes a patch bump with no runtime path.
|
||||
|
||||
**Count it as a Smart‑Diff only if at least one of these flips:**
|
||||
|
||||
* **Reachability:** new reachable vulnerable code appears, or previously reachable code becomes unreachable.
|
||||
* **VEX status:** a CVE’s status changes (e.g., to `not_affected`).
|
||||
* **Version vs affected ranges:** a dependency crosses into/out of a known vulnerable range.
|
||||
* **KEV/EPSS/Policy:** CISA KEV listing, EPSS spike, or your org policy gates change.
|
||||
|
||||
Ignore:
|
||||
|
||||
* CVEs that are both **unreachable** and **VEX = not_affected**.
|
||||
* Pure patch‑level churn that doesn’t cross an affected range and isn’t KEV‑listed.
|
||||
* Dev/test‑only deps with **no runtime path**.
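The "at least one of these flips" rule can be sketched as a comparison of a finding's state in the old and new build. `FindingState`, its field names, and the EPSS spike threshold of 0.2 are assumptions for illustration; policy-gate changes are left out for brevity.

```python
from dataclasses import dataclass

EPSS_SPIKE = 0.2  # assumed threshold for an "EPSS spike"; not defined in the text above


@dataclass(frozen=True)
class FindingState:
    reachable: bool
    vex_status: str           # e.g. "affected", "not_affected"
    in_affected_range: bool
    kev: bool
    epss: float


def material_reasons(old, new):
    """Return which risk dimensions flipped; an empty list means no Smart-Diff entry."""
    reasons = []
    if old.reachable != new.reachable:
        reasons.append("reachability")
    if old.vex_status != new.vex_status:
        reasons.append("vex")
    if old.in_affected_range != new.in_affected_range:
        reasons.append("range")
    if old.kev != new.kev:
        reasons.append("kev")
    if new.epss - old.epss >= EPSS_SPIKE:
        reasons.append("epss")
    return reasons
```

Returning the reasons instead of a bare boolean keeps the "why" available for the report.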
|
||||
|
||||
---
|
||||
|
||||
# Minimal data model (practical)
|
||||
|
||||
* **DiffSet { added, removed, changed }** for packages, symbols, CVEs, and policy gates.
|
||||
* **AffectedGraph { package → symbol → call‑site }**: reachability edges from entrypoints to vulnerable sinks.
|
||||
* **EvidenceLink { attestation | VEX | KEV | scanner trace }** per item, so every claim is traceable.
|
||||
|
||||
---
|
||||
|
||||
# Core algorithms (what makes it “smart”)
|
||||
|
||||
* **Reachability‑aware set ops:** run set diffs only on **reachable** vuln findings.
|
||||
* **SemVer gates:** treat “crossing an affected range” as a boolean boundary; patch bumps inside a safe range don’t alert.
|
||||
* **VEX merge logic:** vendor or internal VEX that says `not_affected` suppresses noise unless a KEV listing contradicts it.
|
||||
* **EPSS‑weighted priority:** rank surfaced diffs by latest EPSS; KEV always escalates to top.
|
||||
* **Policy overlays:** org rules (e.g., “block any KEV,” “warn if EPSS > 0.7”) applied last.
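The SemVer gate above can be sketched as a boundary check on a half-open affected range. This assumes plain `MAJOR.MINOR.PATCH` versions with no pre-release tags, and the range used in the usage line is hypothetical rather than taken from a real advisory.

```python
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints (no pre-release support)."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, introduced, fixed):
    """True when introduced <= version < fixed."""
    return parse_version(introduced) <= parse_version(version) < parse_version(fixed)


def range_crossing(old, new, introduced, fixed):
    """Return 'entered', 'left', or None when the bump stays on one side of the boundary."""
    before = in_range(old, introduced, fixed)
    after = in_range(new, introduced, fixed)
    if before == after:
        return None
    return "entered" if after else "left"


# Hypothetical affected range: >= 3.0.0, < 3.0.11
print(range_crossing("3.0.10", "3.0.11", "3.0.0", "3.0.11"))
```

Comparing integer tuples avoids the string-ordering trap where `"3.0.9"` would sort after `"3.0.10"`.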
|
||||
|
||||
---
|
||||
|
||||
# Example (why it’s quieter, but safer)
|
||||
|
||||
* **OpenSSL 3.0.10 → 3.0.11** with VEX `not_affected` for a CVE: Smart‑Diff marks **risk down** and **closes** the prior alert.
|
||||
* A **transitive dev dependency** changes with **no runtime path**: Smart‑Diff **logs only**, no red flag.
|
||||
|
||||
---
|
||||
|
||||
# Implementation plan (Stella Ops‑ready)
|
||||
|
||||
**1) Inputs**
|
||||
|
||||
* SBOM (CycloneDX/SPDX) old vs new
|
||||
* VEX (OpenVEX/CycloneDX VEX)
|
||||
* Vuln feeds (NVD, vendor), **CISA KEV**, **EPSS**
|
||||
* Reachability traces (per language analyzers)
|
||||
|
||||
**2) Normalize**
|
||||
|
||||
* Map all deps to **purl**, normalize versions, index CVEs → affected ranges.
|
||||
* Ingest VEX and attach to CVE ↔ component with clear status precedence.
|
||||
|
||||
**3) Build graphs**
|
||||
|
||||
* Generate/refresh **AffectedGraph** per build: entrypoints → call stacks → vulnerable symbols.
|
||||
* Tag each finding with `{reachable?, vex_status, kev?, epss, policy_flags}`.
|
||||
|
||||
**4) Diff**
|
||||
|
||||
* Compute **DiffSet** between builds for:
|
||||
|
||||
* Reachable findings
|
||||
* VEX statuses
|
||||
* Version/range crossings
|
||||
* Policy/KEV/EPSS gates
|
||||
|
||||
**5) Prioritize & suppress**
|
||||
|
||||
* Drop items that are **unreachable AND not_affected**.
|
||||
* Collapse patch‑level churn unless **KEV‑listed**.
|
||||
* Sort remaining by **KEV first**, then **EPSS**, then **runtime blast‑radius** (fan‑in/fan‑out).
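Step 5 can be sketched as a filter followed by a sort. Findings are plain dicts here with illustrative field names; `patch_churn_only` and a single `blast_radius` number are assumptions, and churn is dropped rather than collapsed to keep the example short.

```python
def prioritize(findings):
    """Drop suppressed findings, then rank the rest: KEV first, then EPSS, then blast radius."""
    kept = [
        f for f in findings
        if not (not f["reachable"] and f["vex_status"] == "not_affected")
        and not (f.get("patch_churn_only", False) and not f["kev"])
    ]
    # False sorts before True, so `not kev` puts KEV-listed findings first.
    return sorted(kept, key=lambda f: (not f["kev"], -f["epss"], -f["blast_radius"]))
```

Python's sort is stable, so findings that tie on all three keys keep their input order, which helps keep reports deterministic.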
|
||||
|
||||
**6) Evidence**
|
||||
|
||||
* Attach **EvidenceLink** to each surfaced change:
|
||||
|
||||
* VEX doc (line/ID)
|
||||
* KEV entry
|
||||
* EPSS score + timestamp
|
||||
* Reachability call stack (top 1‑3 paths)
|
||||
|
||||
**7) UX**
|
||||
|
||||
* Pipeline‑first: output a **Smart‑Diff report JSON** + concise CLI table:
|
||||
|
||||
* `risk ↑/↓`, reason (reachability/VEX/KEV/EPSS), component@version, CVE, **one** example call‑stack.
|
||||
* UI is an explainer: expand to full stack, VEX note, KEV link, and “minimum safe change” suggestion.
|
||||
|
||||
---
|
||||
|
||||
# Module sketch (your stack)
|
||||
|
||||
* **Services:** `Sbomer.Diff`, `Vexer.Merge`, `Scanner.Reachability`, `Feedser.KEV/EPSS`, `Policy.Engine`, `SmartDiff.Service`
|
||||
* **Store:** PostgreSQL (SoR), Valkey cache (ephemeral). Tables: `components`, `cves`, `vex_entries`, `reachability_edges`, `smartdiff_events`, `evidence_links`.
|
||||
* **APIs:**
|
||||
|
||||
* `POST /smartdiff/compare` → returns filtered diff + priorities
|
||||
* `GET /smartdiff/:id/evidence` → links to VEX/KEV/EPSS + trace
|
||||
* **CI usage:** `stella smart-diff --old sbomA.json --new sbomB.json --vex vex.json --out smartdiff.json`
|
||||
|
||||
---
|
||||
|
||||
# Guardrails (to keep it deterministic)
|
||||
|
||||
* Freeze feed snapshots per run (hash KEV/EPSS CSVs + VEX docs).
|
||||
* Version the merge rules (VEX precedence + policy) and emit in the report header.
|
||||
* Log the **exact** semver comparisons that triggered/exempted an alert.
|
||||
|
||||
If you want, I can draft the **Postgres schema**, the **.NET 10 DTOs** for `DiffSet` and `AffectedGraph`, and a **CLI prototype** (`stella smart-diff`) you can drop into your pipeline.
|
||||
Noted: the services are **Concelier** (feeds: KEV/EPSS/NVD/vendor snapshots) and **Excititor** (VEX merge + status resolution). I’ll use those names going forward.
|
||||
|
||||
Below is a **product + business analysis implementation spec** that a developer can follow to build the Smart‑Diff capability you described.
|
||||
|
||||
---
|
||||
|
||||
# 1) Product objective
|
||||
|
||||
## Problem
|
||||
|
||||
Classic SBOM/VEX diffs are noisy: they surface *all* dependency/CVE churn, even when nothing changes in **actual exploitable risk**.
|
||||
|
||||
## Goal
|
||||
|
||||
Produce a **Smart‑Diff report** between two builds/releases that highlights only changes that **materially impact security risk**, with evidence attached.
|
||||
|
||||
## Success criteria
|
||||
|
||||
* **Noise reduction:** >80% fewer diff items vs raw SBOM diff for typical builds (measured by count).
|
||||
* **No missed “high-risk flips”:** any change that creates or removes a **reachable vulnerable path** must appear.
|
||||
* **Traceability:** every surfaced Smart‑Diff item has at least **one evidence link** (VEX entry, reachability trace, KEV reference, feed snapshot hash, scanner output).
|
||||
|
||||
---
|
||||
|
||||
# 2) Scope
|
||||
|
||||
## In scope (MVP)
|
||||
|
||||
* Compare two “build snapshots”: `{SBOM, VEX, reachability traces, vuln feed snapshot, policy snapshot}`
|
||||
* Detect & report these change types:
|
||||
|
||||
1. **Reachability flips** (reachable ↔ unreachable)
|
||||
2. **VEX status changes** (e.g., `affected` → `not_affected`)
|
||||
3. **Version crosses vuln boundary** (safe ↔ affected range)
|
||||
4. **KEV/EPSS/policy gate flips** (e.g., becomes KEV-listed)
|
||||
* Suppress noise using explicit rules (see section 6)
|
||||
* Output:
|
||||
|
||||
* JSON report for CI
|
||||
* concise CLI output (table)
|
||||
* optional UI list view (later)
|
||||
|
||||
## Out of scope (for now)
|
||||
|
||||
* Full remediation planning / patch PR automation
|
||||
* Cross-repo portfolio aggregation (doable later)
|
||||
* Advanced exploit intelligence beyond KEV/EPSS
|
||||
|
||||
---
|
||||
|
||||
# 3) Key definitions (developers must implement these exactly)
|
||||
|
||||
## 3.1 Finding
|
||||
|
||||
A “finding” is a tuple:
|
||||
|
||||
`FindingKey = (component_purl, component_version, cve_id)`
|
||||
|
||||
…and includes computed fields:
|
||||
|
||||
* `reachable: bool | unknown`
|
||||
* `vex_status: enum` (see 3.3)
|
||||
* `in_affected_range: bool | unknown`
|
||||
* `kev: bool`
|
||||
* `epss_score: float | null`
|
||||
* `policy_flags: set<string>`
|
||||
* `evidence_links: list<EvidenceLink>`
|
||||
|
||||
## 3.2 Material risk change (Smart‑Diff item)
|
||||
|
||||
A change is “material” if it changes the computed **RiskState** for any `FindingKey` or creates/removes a `FindingKey` that is in-scope after suppression rules.
|
||||
|
||||
## 3.3 VEX status vocabulary
|
||||
|
||||
Normalize all incoming VEX statuses into a fixed internal enum:
|
||||
|
||||
* `AFFECTED`
|
||||
* `NOT_AFFECTED`
|
||||
* `FIXED`
|
||||
* `UNDER_INVESTIGATION`
|
||||
* `UNKNOWN` (no statement or unparseable)
|
||||
|
||||
> Note: Use OpenVEX/CycloneDX VEX mappings, but internal logic must operate on the above set.
|
||||
|
||||
---
|
||||
|
||||
# 4) System context and responsibilities
|
||||
|
||||
You already have a modular setup. Developers should implement Smart‑Diff as a pipeline over these components:
|
||||
|
||||
## Components (names aligned to your system)
|
||||
|
||||
* **Sbomer**
|
||||
|
||||
* Ingest SBOM(s), normalize to purl/version graph
|
||||
* **Scanner.Reachability**
|
||||
|
||||
* Produce reachability traces: entrypoints → call paths → vulnerable symbol/sink
|
||||
* **Concelier**
|
||||
|
||||
* Fetch + snapshot vulnerability intelligence (NVD/vendor/OSV as applicable), **CISA KEV**, **EPSS**
|
||||
* Provide *feed snapshot identifiers* (hashes) per run
|
||||
* **Excititor**
|
||||
|
||||
* Ingest and merge VEX sources
|
||||
* Resolve a final `vex_status` per (component, cve)
|
||||
* Provide precedence + explanation
|
||||
* **Policy.Engine**
|
||||
|
||||
* Evaluate org rules against a computed finding (e.g., “block if KEV”)
|
||||
* **SmartDiff.Service**
|
||||
|
||||
* Compute risk states for “old” and “new”
|
||||
* Diff them
|
||||
* Suppress noise
|
||||
* Rank + output report with evidence
|
||||
|
||||
---
|
||||
|
||||
# 5) Developer deliverables
|
||||
|
||||
## Deliverable A: Smart‑Diff computation library
|
||||
|
||||
A deterministic library that takes:
|
||||
|
||||
* `OldSnapshot` and `NewSnapshot` (see section 7)
|
||||
* returns a `SmartDiffReport`
|
||||
|
||||
## Deliverable B: Service endpoint
|
||||
|
||||
`POST /smartdiff/compare` returns report JSON.
|
||||
|
||||
## Deliverable C: CLI command
|
||||
|
||||
`stella smart-diff --old <dir|file> --new <dir|file> [--policy policy.json] --out smartdiff.json`
|
||||
|
||||
---
|
||||
|
||||
# 6) Smart‑Diff rules
|
||||
|
||||
Developers must implement these as **explicit, testable rule functions**.
|
||||
|
||||
## 6.1 Suppression rules (noise filters)
|
||||
|
||||
A finding is **suppressed** if ALL apply:
|
||||
|
||||
1. `reachable == false` (`unknown` does not count as false unless you explicitly decide otherwise; recommended: treat `unknown` as *not* suppressible)
|
||||
2. `vex_status == NOT_AFFECTED`
|
||||
3. `kev == false`
|
||||
4. no policy requires it (e.g., “report all vuln findings” override)
|
||||
|
||||
**Patch churn suppression**
|
||||
|
||||
* If a component version changes but:
|
||||
|
||||
* `in_affected_range` remains false in both versions, AND
|
||||
* no KEV/policy flag flips,
|
||||
* then suppress (don’t surface).
|
||||
|
||||
**Dev/test dependency suppression (optional if you already tag scopes)**
|
||||
|
||||
* If SBOM scope indicates `dev/test` AND `reachable == false`, suppress.
|
||||
* If reachability is unknown, do **not** suppress by scope alone (avoid false negatives).
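
The suppression rules above can be sketched as small, table-testable predicates. This is a minimal sketch, not the definitive implementation: the `Finding` shape, the `REPORT_ALL` flag name, and the function names are assumptions made for illustration. Patch churn suppression is left out because it needs both the old and the new state.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy flag standing in for a "report all vuln findings" override.
REPORT_ALL = "report_all_findings"


@dataclass(frozen=True)
class Finding:
    reachable: Optional[bool]   # None means "unknown"
    vex_status: str             # internal enum value from section 3.3
    kev: bool
    scope: str = "runtime"      # runtime | dev | test | unknown
    policy_flags: frozenset = frozenset()


def suppressed_by_vex(f: Finding) -> bool:
    """Main rule: all four conditions must hold."""
    return (
        f.reachable is False                  # unknown (None) is never suppressible
        and f.vex_status == "NOT_AFFECTED"
        and not f.kev
        and REPORT_ALL not in f.policy_flags
    )


def suppressed_by_scope(f: Finding) -> bool:
    """Optional dev/test rule: scope alone is not enough; reachability must be known false."""
    return f.scope in ("dev", "test") and f.reachable is False


def is_suppressed(f: Finding) -> bool:
    return suppressed_by_vex(f) or suppressed_by_scope(f)
```

Keeping each rule as its own predicate makes it straightforward to cover them with table-driven tests and to log which rule suppressed a finding.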
|
||||
|
||||
## 6.2 Material change detection rules
|
||||
|
||||
Surface a Smart‑Diff item when any of the following changes between old and new:
|
||||
|
||||
### Rule R1: Reachability flip
|
||||
|
||||
* `reachable` changes: `false → true` (risk ↑) or `true → false` (risk ↓)
|
||||
* Include at least one call path as evidence if reachable is true.
|
||||
|
||||
### Rule R2: VEX status flip
|
||||
|
||||
* `vex_status` changes meaningfully:
|
||||
|
||||
* `AFFECTED ↔ NOT_AFFECTED`
|
||||
* `UNDER_INVESTIGATION → NOT_AFFECTED` etc.
|
||||
* Changes involving `UNKNOWN` should be shown but ranked lower unless the CVE is KEV-listed.
|
||||
|
||||
### Rule R3: Affected range boundary
|
||||
|
||||
* `in_affected_range` flips:
|
||||
|
||||
* `false → true` (risk ↑)
|
||||
* `true → false` (risk ↓)
|
||||
* This is the main guard against patch churn noise.
|
||||
|
||||
### Rule R4: Intelligence / policy flip
|
||||
|
||||
* `kev` changes `false → true` or `epss_score` crosses a configured threshold
|
||||
* any `policy_flag` changes severity (warn → block)
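
Rules R1 to R4 can be sketched as one function that compares two states and returns reason codes. This is a rough sketch under assumptions: states are plain dicts using the field names from section 3.1 plus a `policy_decision` key, and a change to or from `unknown` (`None`) is not counted as a reachability or range flip. The text above does not say how such transitions should be treated, so that choice is a placeholder.

```python
from typing import Optional


def _flipped(old: Optional[bool], new: Optional[bool]) -> bool:
    """True only for a known false<->true change; None (unknown) on either side is not a flip here."""
    return old is not None and new is not None and old != new


def reason_codes(old: dict, new: dict, epss_threshold: float = 0.7) -> list:
    codes = []
    if _flipped(old["reachable"], new["reachable"]):
        codes.append("REACHABILITY_FLIP")          # R1
    if old["vex_status"] != new["vex_status"]:
        codes.append("VEX_FLIP")                   # R2
    if _flipped(old["in_affected_range"], new["in_affected_range"]):
        codes.append("RANGE_FLIP")                 # R3
    if not old["kev"] and new["kev"]:
        codes.append("KEV_FLIP")                   # R4: newly KEV-listed

    def over(score: Optional[float]) -> bool:
        return score is not None and score > epss_threshold

    if over(old["epss"]) != over(new["epss"]):
        codes.append("EPSS_THRESHOLD")             # R4: crossed the configured threshold
    if old["policy_decision"] != new["policy_decision"]:
        codes.append("POLICY_FLIP")                # R4: e.g. WARN -> BLOCK
    return codes
```

An empty list means no rule fired, so the pair is not a material change.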
|
||||
|
||||
---
|
||||
|
||||
# 7) Snapshot contract (what Smart‑Diff compares)
|
||||
|
||||
Define a stable internal format:
|
||||
|
||||
```json
{
  "snapshot_id": "build-2025.12.14+sha.abc123",
  "created_at": "2025-12-14T12:34:56Z",
  "sbom": { "...": "CycloneDX or SPDX raw" },
  "vex_documents": [ { "...": "OpenVEX/CycloneDX VEX raw" } ],
  "reachability": {
    "analyzer": "java-callgraph@1.2.0",
    "entrypoints": ["com.app.Main#main"],
    "paths": [
      {
        "component_purl": "pkg:maven/org.example/foo@1.2.3",
        "cve": "CVE-2024-1234",
        "sink": "org.example.foo.VulnClass#vulnMethod",
        "callstack": ["...", "..."]
      }
    ]
  },
  "concelier_feed_snapshot": {
    "kev_hash": "sha256:...",
    "epss_hash": "sha256:...",
    "vuln_db_hash": "sha256:..."
  },
  "policy_snapshot": { "policy_hash": "sha256:...", "rules": [ ... ] }
}
```
|
||||
|
||||
**Implementation note**
|
||||
|
||||
* SBOM/VEX can remain “raw”, but you must also build normalized indexes (in-memory or stored) for diffing.
|
||||
|
||||
---
|
||||
|
||||
# 8) Data normalization requirements
|
||||
|
||||
## 8.1 Component identity
|
||||
|
||||
* Use **purl** as canonical component ID.
|
||||
* Normalize casing, qualifiers, and version strings per ecosystem.
|
||||
|
||||
## 8.2 Vulnerability identity
|
||||
|
||||
* Use `CVE-*` as primary key where available.
|
||||
* If you ingest OSV IDs too, map them to CVE when possible but keep OSV ID in evidence.
|
||||
|
||||
## 8.3 Affected range evaluation
|
||||
|
||||
Implement:
|
||||
`bool? IsVersionInAffectedRange(version, affectedRanges)`
|
||||
|
||||
Return `null` (unknown) if version cannot be parsed or range semantics are unknown.
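
A minimal Python sketch of the tri-state contract follows, with `None` playing the role of `null`. It assumes plain `MAJOR.MINOR.PATCH` versions and a hypothetical range shape of `(introduced, fixed)` pairs where `fixed` is exclusive and may be `None` for an open-ended range. Real ecosystems need their own comparators; anything this sketch cannot parse becomes unknown instead of a guess.

```python
import re
from typing import Optional

_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(text: str) -> Optional[tuple]:
    """Parse a strict MAJOR.MINOR.PATCH string; anything else is unparseable."""
    m = _SEMVER.match(text.strip())
    return tuple(int(p) for p in m.groups()) if m else None


def is_version_in_affected_range(version: str, affected_ranges: list) -> Optional[bool]:
    """affected_ranges: list of (introduced, fixed) pairs; fixed is exclusive, None = open-ended."""
    v = parse_version(version)
    if v is None:
        return None
    saw_unknown = False
    for introduced, fixed in affected_ranges:
        lo = parse_version(introduced)
        hi = parse_version(fixed) if fixed is not None else None
        if lo is None or (fixed is not None and hi is None):
            saw_unknown = True      # range semantics unknown; keep checking the others
            continue
        if v >= lo and (hi is None or v < hi):
            return True
    # No parseable range matched: unknown if any range was unreadable, else safe.
    return None if saw_unknown else False
```

Returning `None` when an unreadable range might have matched keeps the caller from treating an undecidable case as safe.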
|
||||
|
||||
---
|
||||
|
||||
# 9) Excititor: VEX merge requirements
|
||||
|
||||
Developers should implement Excititor as a deterministic resolver:
|
||||
|
||||
## 9.1 Inputs
|
||||
|
||||
* List of VEX documents, each with metadata:
|
||||
|
||||
* `source` (vendor/internal/scanner)
|
||||
* `issued_at`
|
||||
* `signature/attestation` info (if present)
|
||||
|
||||
## 9.2 Output
|
||||
|
||||
For each `(component_purl, cve_id)`:
|
||||
|
||||
* `final_status`
|
||||
* `winning_statement_id`
|
||||
* `precedence_reason`
|
||||
* `all_statements[]` (for audit)
|
||||
|
||||
## 9.3 Precedence rules (recommendation)
|
||||
|
||||
Implement as ordered priority (highest wins), unless overridden by your org:
|
||||
|
||||
1. **Internal signed VEX** (security team attested)
|
||||
2. **Vendor signed VEX**
|
||||
3. **Internal unsigned VEX**
|
||||
4. **Scanner/VEX-like annotations**
|
||||
5. None → `UNKNOWN`
|
||||
|
||||
Conflict handling:
|
||||
|
||||
* If two same-priority statements disagree, pick newest by `issued_at`, but **record conflict** and surface it as a low-priority Smart‑Diff meta-item (optional).
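
The precedence and conflict handling above can be sketched as a pure resolver. The `VexStatement` shape and the numeric tiers are assumptions for illustration. Unsigned vendor statements are not ranked in the list above, so this sketch places them in the lowest tier as a placeholder. Ties on `issued_at` are broken by `stmt_id`, which is an added assumption to keep the result deterministic.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VexStatement:
    stmt_id: str
    source: str          # "internal" | "vendor" | "scanner"
    signed: bool
    issued_at: datetime
    status: str          # internal enum value from section 3.3


def tier(s: VexStatement) -> int:
    """Higher wins. Mirrors the ordered list; unlisted combinations fall to the lowest tier."""
    if s.source == "internal" and s.signed:
        return 4
    if s.source == "vendor" and s.signed:
        return 3
    if s.source == "internal":
        return 2
    return 1


def resolve(statements: list) -> dict:
    if not statements:
        return {"final_status": "UNKNOWN", "winning_statement_id": None,
                "precedence_reason": "no_statement", "conflict": False}
    top = max(tier(s) for s in statements)
    peers = [s for s in statements if tier(s) == top]
    # Newest statement wins within a tier; stmt_id breaks exact timestamp ties.
    winner = max(peers, key=lambda s: (s.issued_at, s.stmt_id))
    return {"final_status": winner.status, "winning_statement_id": winner.stmt_id,
            "precedence_reason": f"tier_{top}",
            "conflict": len({s.status for s in peers}) > 1}
```

The `conflict` flag is what a caller would use to record the disagreement and optionally surface it as a low-priority meta-item.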
|
||||
|
||||
---
|
||||
|
||||
# 10) Concelier: feed snapshot requirements
|
||||
|
||||
Concelier must provide deterministic inputs to Smart‑Diff.
|
||||
|
||||
## 10.1 What Concelier stores
|
||||
|
||||
* KEV list snapshot
|
||||
* EPSS snapshot
|
||||
* Vulnerability database snapshot (your choice: NVD mirror, OSV, vendor advisories)
|
||||
|
||||
## 10.2 Required APIs (internal)
|
||||
|
||||
* `GET /concelier/snapshots/latest`
|
||||
* `GET /concelier/snapshots/{hash}`
|
||||
* `GET /concelier/kev/{snapshotHash}/is_listed?cve=CVE-...`
|
||||
* `GET /concelier/epss/{snapshotHash}/score?cve=CVE-...`
|
||||
|
||||
## 10.3 Determinism
|
||||
|
||||
Smart‑Diff report must include the snapshot hashes used, so the result can be reproduced.
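
One possible way to derive those hashes is to hash the raw feed bytes exactly as stored. This is a sketch only: the function names are hypothetical, and the `sha256:` prefix follows the `concelier_feed_snapshot` example in the snapshot contract.

```python
import hashlib


def snapshot_hash(payload: bytes) -> str:
    """Content hash in the 'sha256:<hex>' form used in the snapshot contract."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def feed_hashes(kev: bytes, epss: bytes, vuln_db: bytes) -> dict:
    """Hash each frozen feed payload; the result can be embedded in the report header."""
    return {
        "kev_hash": snapshot_hash(kev),
        "epss_hash": snapshot_hash(epss),
        "vuln_db_hash": snapshot_hash(vuln_db),
    }
```

Hashing the stored bytes, not a re-serialized form, means a later run can verify it is reading the same snapshot before reproducing a report.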
|
||||
|
||||
---
|
||||
|
||||
# 11) RiskState computation (core dev logic)
|
||||
|
||||
Implement a pure function:
|
||||
|
||||
`RiskState ComputeRiskState(FindingKey key, Snapshot snapshot)`
|
||||
|
||||
### Inputs used
|
||||
|
||||
* SBOM: to confirm component exists, scope, runtime path
|
||||
* Concelier feeds: KEV, EPSS, affected ranges
|
||||
* Excititor: VEX status
|
||||
* Reachability analyzer output
|
||||
* Policy engine: flags based on org rules
|
||||
|
||||
### Output
|
||||
|
||||
```json
{
  "finding_key": { "purl": "...", "version": "...", "cve": "..." },
  "reachable": true,
  "vex_status": "AFFECTED",
  "in_affected_range": true,
  "kev": false,
  "epss": 0.42,
  "policy": {
    "decision": "WARN|BLOCK|ALLOW",
    "flags": ["epss_over_0_4"]
  },
  "evidence": [
    { "type": "reachability_trace", "ref": "trace:abc", "detail": "short call stack..." },
    { "type": "vex", "ref": "openvex:doc123#stmt7" },
    { "type": "concelier_snapshot", "ref": "sha256:..." }
  ]
}
```
|
||||
|
||||
---
|
||||
|
||||
# 12) Diff engine specification
|
||||
|
||||
## 12.1 Inputs
|
||||
|
||||
* `OldRiskStates: map<FindingKey, RiskState>`
|
||||
* `NewRiskStates: map<FindingKey, RiskState>`
|
||||
|
||||
You build these maps by:
|
||||
|
||||
1. Enumerating candidate findings in each snapshot:
|
||||
|
||||
* from vulnerability matching against SBOM components (affected ranges)
|
||||
* plus any VEX statements referencing components
|
||||
2. Joining with reachability traces
|
||||
3. Resolving status via Excititor
|
||||
4. Applying Concelier intelligence + policy
|
||||
|
||||
## 12.2 Diff output types
|
||||
|
||||
Return `SmartDiffItem` with:
|
||||
|
||||
* `change_type`: `ADDED|REMOVED|CHANGED`
|
||||
* `risk_direction`: `UP|DOWN|NEUTRAL`
|
||||
* `reason_codes`: `[REACHABILITY_FLIP, VEX_FLIP, RANGE_FLIP, KEV_FLIP, POLICY_FLIP, EPSS_THRESHOLD]`
|
||||
* `old_state` / `new_state`
|
||||
* `priority_score`
|
||||
* `evidence_links[]`
|
||||
|
||||
## 12.3 Suppress AFTER diff, not before
|
||||
|
||||
Important: compute diff on full sets, then suppress items by rules, because:
|
||||
|
||||
* suppression itself can flip (e.g., VEX becomes `not_affected` → item disappears, which is meaningful as “risk down”).
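
A sketch of the diff-then-suppress order follows. It makes two assumptions that are not stated above. First, states are joined on `(purl, cve)` without the version, so that a version bump appears as `CHANGED` with old and new versions, as in the API response example. Second, an item counts as noise only when every state it carries is suppressible. The `is_suppressed` predicate is passed in so the rules stay separate from the diff.

```python
from typing import Callable


def diff_risk_states(old: dict, new: dict) -> list:
    """old/new map (purl, cve) -> state dict. Returns unsuppressed diff items in key order."""
    items = []
    for key in sorted(old.keys() | new.keys()):
        o, n = old.get(key), new.get(key)
        if o is None:
            change = "ADDED"
        elif n is None:
            change = "REMOVED"
        elif o != n:
            change = "CHANGED"
        else:
            continue  # identical state on both sides: nothing to report
        items.append({"key": key, "change_type": change, "old_state": o, "new_state": n})
    return items


def apply_suppression(items: list, is_suppressed: Callable[[dict], bool]) -> tuple:
    """Split items into (kept, dropped). Runs AFTER the diff so flips into suppression stay visible."""
    kept, dropped = [], []
    for item in items:
        states = [s for s in (item["old_state"], item["new_state"]) if s is not None]
        # Noise only if every side is suppressible; a flip into NOT_AFFECTED keeps its old side visible.
        (dropped if all(is_suppressed(s) for s in states) else kept).append(item)
    return kept, dropped
```

With this order, a finding whose VEX status becomes `NOT_AFFECTED` is kept as a risk-down item, while a brand-new finding that is already suppressible is only counted in the `suppressed` total.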
|
||||
|
||||
---
|
||||
|
||||
# 13) Priority scoring & ranking
|
||||
|
||||
Implement a deterministic score:
|
||||
|
||||
### Hard ordering
|
||||
|
||||
1. `kev == true` in new state → top tier
|
||||
2. Reachable in new state (`reachable == true`) → next tier
|
||||
|
||||
### Numeric scoring (example)
|
||||
|
||||
```
score =
  + 1000 if new.kev
  + 500 if new.reachable
  + 200 if reason includes RANGE_FLIP to affected
  + 150 if VEX_FLIP to AFFECTED
  + 0..100 based on EPSS (epss * 100)
  + policy weight: +300 if decision BLOCK, +100 if WARN
```
|
||||
|
||||
Always include `score_breakdown` in report for explainability.
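
The example weights can be sketched as a function that returns both the total and its `score_breakdown`. The dict field names and the breakdown keys are assumptions. Rounding the EPSS term to an integer is also an added choice, made here so scores stay integral and stable across runs.

```python
def score_item(new_state: dict, reason_codes: list) -> tuple:
    """Return (priority_score, score_breakdown) using the example weights."""
    breakdown = {}
    if new_state.get("kev"):
        breakdown["kev"] = 1000
    if new_state.get("reachable") is True:
        breakdown["reachable"] = 500
    if "RANGE_FLIP" in reason_codes and new_state.get("in_affected_range") is True:
        breakdown["range_flip_to_affected"] = 200
    if "VEX_FLIP" in reason_codes and new_state.get("vex_status") == "AFFECTED":
        breakdown["vex_flip_to_affected"] = 150
    epss = new_state.get("epss")
    if epss is not None:
        breakdown["epss"] = round(epss * 100)   # 0..100
    decision = new_state.get("policy_decision")
    if decision == "BLOCK":
        breakdown["policy"] = 300
    elif decision == "WARN":
        breakdown["policy"] = 100
    return sum(breakdown.values()), breakdown
```

For deterministic ranking, sort by the score descending and then by a stable key such as `(purl, cve)` so equal scores always appear in the same order.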
|
||||
|
||||
---
|
||||
|
||||
# 14) Evidence requirements (must implement)
|
||||
|
||||
Every Smart‑Diff item must include **at least one** evidence link, and ideally 2–4.
|
||||
|
||||
EvidenceLink schema:
|
||||
|
||||
```json
{
  "type": "vex|reachability|kev|epss|scanner|sbom|policy",
  "ref": "stable identifier",
  "summary": "one-line human readable",
  "blob_hash": "sha256 of raw evidence payload (optional)"
}
```
|
||||
|
||||
Examples:
|
||||
|
||||
* `type=kev`: ref is `concelier:kev@{snapshotHash}#CVE-2024-1234`
|
||||
* `type=reachability`: ref is `reach:{snapshotId}:{traceId}`
|
||||
* `type=vex`: ref is `openvex:{docHash}#statement:{id}`
|
||||
|
||||
---
|
||||
|
||||
# 15) API specification
|
||||
|
||||
## 15.1 Compare endpoint
|
||||
|
||||
`POST /smartdiff/compare`
|
||||
|
||||
Request:
|
||||
|
||||
```json
{
  "old_snapshot_id": "buildA",
  "new_snapshot_id": "buildB",
  "options": {
    "include_suppressed": false,
    "max_items": 200,
    "epss_threshold": 0.7
  }
}
```
|
||||
|
||||
Response:
|
||||
|
||||
```json
{
  "report_id": "smartdiff:2025-12-14:xyz",
  "old": { "snapshot_id": "buildA", "feed_hashes": { ... } },
  "new": { "snapshot_id": "buildB", "feed_hashes": { ... } },
  "summary": {
    "risk_up": 3,
    "risk_down": 8,
    "reachable_new": 2,
    "kev_new": 1,
    "suppressed": 143
  },
  "items": [
    {
      "change_type": "CHANGED",
      "risk_direction": "UP",
      "priority_score": 1680,
      "reason_codes": ["REACHABILITY_FLIP","RANGE_FLIP"],
      "finding_key": {
        "purl": "pkg:maven/org.example/foo",
        "version_old": "1.2.3",
        "version_new": "1.2.4",
        "cve": "CVE-2024-1234"
      },
      "old_state": { "...": "RiskState" },
      "new_state": { "...": "RiskState" },
      "evidence": [ ... ]
    }
  ]
}
```
|
||||
|
||||
## 15.2 Evidence endpoint
|
||||
|
||||
`GET /smartdiff/{report_id}/evidence/{evidence_ref}`
|
||||
|
||||
Returns raw stored evidence (or a signed URL if you store blobs elsewhere).
|
||||
|
||||
---
|
||||
|
||||
# 16) CLI behavior
|
||||
|
||||
Command:
|
||||
|
||||
```
stella smart-diff \
  --old ./snapshots/buildA \
  --new ./snapshots/buildB \
  --policy ./policy.json \
  --out ./smartdiff.json
```
|
||||
|
||||
CLI output (human):
|
||||
|
||||
* Summary line: `risk ↑ 3 | risk ↓ 8 | new reachable 2 | new KEV 1`
|
||||
* Then top N items sorted by priority, each one line:
|
||||
|
||||
* `↑ REACHABILITY_FLIP foo@1.2.4 CVE-2024-1234 (EPSS 0.42) path: Main→...→vulnMethod`
|
||||
|
||||
Exit code:
|
||||
|
||||
* `0` if policy decision overall is ALLOW/WARN
|
||||
* `2` if any item triggers policy BLOCK in new snapshot (configurable)
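
The exit-code rule can be sketched as a small helper. It assumes each report item carries a `new_state` with the nested `policy.decision` field shown in the RiskState output example. The `block_code` parameter is a hypothetical name for the configurable value.

```python
def exit_code(items: list, block_code: int = 2) -> int:
    """Return block_code if any item's new state is a policy BLOCK, else 0."""
    for item in items:
        new_state = item.get("new_state")          # None for REMOVED items
        if new_state and new_state.get("policy", {}).get("decision") == "BLOCK":
            return block_code
    return 0
```

A CI wrapper would pass the result to `sys.exit` after writing the JSON report, so the report is available even when the build is blocked.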
|
||||
|
||||
---
|
||||
|
||||
# 17) Storage schema (Postgres) — implementation-ready
|
||||
|
||||
You can implement in a single schema to start; split later.
|
||||
|
||||
## Core tables
|
||||
|
||||
### `snapshots`
|
||||
|
||||
* `snapshot_id (pk)`
|
||||
* `created_at`
|
||||
* `sbom_hash`
|
||||
* `policy_hash`
|
||||
* `kev_hash`
|
||||
* `epss_hash`
|
||||
* `vuln_db_hash`
|
||||
* `metadata jsonb`
|
||||
|
||||
### `components`
|
||||
|
||||
* `component_id (pk)` (internal UUID)
|
||||
* `snapshot_id (fk)`
|
||||
* `purl`
|
||||
* `version`
|
||||
* `scope` (runtime/dev/test/unknown)
|
||||
* `direct bool`
|
||||
* indexes on `(snapshot_id, purl)` and `(purl, version)`
|
||||
|
||||
### `findings`
|
||||
|
||||
* `finding_id (pk)`
|
||||
* `snapshot_id (fk)`
|
||||
* `purl`
|
||||
* `version`
|
||||
* `cve`
|
||||
* `reachable bool null`
|
||||
* `vex_status text`
|
||||
* `in_affected_range bool null`
|
||||
* `kev bool`
|
||||
* `epss real null`
|
||||
* `policy_decision text`
|
||||
* `policy_flags text[]`
|
||||
* index `(snapshot_id, purl, cve)`
|
||||
|
||||
### `reachability_traces`
|
||||
|
||||
* `trace_id (pk)`
|
||||
* `snapshot_id (fk)`
|
||||
* `purl`
|
||||
* `cve`
|
||||
* `sink`
|
||||
* `callstack jsonb`
|
||||
* index `(snapshot_id, purl, cve)`
|
||||
|
||||
### `vex_statements`
|
||||
|
||||
* `stmt_id (pk)`
|
||||
* `snapshot_id (fk)`
|
||||
* `purl`
|
||||
* `cve`
|
||||
* `source`
|
||||
* `issued_at`
|
||||
* `status`
|
||||
* `doc_hash`
|
||||
* `raw jsonb`
|
||||
* index `(snapshot_id, purl, cve)`
|
||||
|
||||
### `smartdiff_reports`
|
||||
|
||||
* `report_id (pk)`
|
||||
* `created_at`
|
||||
* `old_snapshot_id`
|
||||
* `new_snapshot_id`
|
||||
* `options jsonb`
|
||||
* `summary jsonb`
|
||||
|
||||
### `smartdiff_items`
|
||||
|
||||
* `item_id (pk)`
|
||||
* `report_id (fk)`
|
||||
* `change_type`
|
||||
* `risk_direction`
|
||||
* `priority_score`
|
||||
* `reason_codes text[]`
|
||||
* `purl`
|
||||
* `cve`
|
||||
* `old_version`
|
||||
* `new_version`
|
||||
* `old_state jsonb`
|
||||
* `new_state jsonb`
|
||||
|
||||
### `evidence_links`
|
||||
|
||||
* `evidence_id (pk)`
|
||||
* `report_id (fk)`
|
||||
* `item_id (fk)`
|
||||
* `type`
|
||||
* `ref`
|
||||
* `summary`
|
||||
* `blob_hash`
|
||||
|
||||
---
|
||||
|
||||
# 18) Implementation plan (developer-focused)
|
||||
|
||||
## Phase 1 — MVP (end-to-end working)
|
||||
|
||||
1. **Normalize SBOM**
|
||||
|
||||
* Parse CycloneDX/SPDX
|
||||
* Build `components` list with purl + version + scope
|
||||
2. **Concelier integration**
|
||||
|
||||
* Load KEV + EPSS snapshots (even from local files initially)
|
||||
* Expose snapshot hashes
|
||||
3. **Excititor integration**
|
||||
|
||||
* Parse OpenVEX/CycloneDX VEX
|
||||
* Implement precedence rules and output `final_status`
|
||||
4. **Affected range matching**
|
||||
|
||||
* For each component, query vulnerability DB snapshot for affected ranges
|
||||
* Produce candidate findings `(purl, version, cve)`
|
||||
5. **Reachability ingestion**
|
||||
|
||||
* Accept reachability JSON traces (even if generated elsewhere initially)
|
||||
* Mark `reachable=true` when trace exists for (purl,cve)
|
||||
6. **Compute RiskState**
|
||||
|
||||
* For each finding compute `kev`, `epss`, `policy_decision`
|
||||
7. **Diff + suppression + ranking**
|
||||
|
||||
* Generate `SmartDiffReport`
|
||||
8. **Outputs**
|
||||
|
||||
* JSON report + CLI table
|
||||
* Store report + items in Postgres
|
||||
|
||||
Acceptance tests for Phase 1:
|
||||
|
||||
* Given a known pair of snapshots, Smart‑Diff only includes:
|
||||
|
||||
* reachable vulnerable changes
|
||||
* VEX flips
|
||||
* affected range boundary flips
|
||||
* KEV flips
|
||||
* Patch churn not crossing ranges is absent.
|
||||
|
||||
## Phase 2 — Determinism & evidence hardening
|
||||
|
||||
* Store raw evidence blobs (VEX doc hash, trace payload hash)
|
||||
* Ensure feed snapshots are immutable and referenced by hash
|
||||
* Add `score_breakdown`
|
||||
* Add conflict surfacing for VEX merge
|
||||
|
||||
## Phase 3 — Performance & scale
|
||||
|
||||
* Incremental computation (recompute only for components that changed)
|
||||
* Cache Concelier lookups by `(snapshotHash, cve)`
|
||||
* Batch range matching queries
|
||||
* Add pagination and `max_items` enforcement
|
||||
|
||||
---
|
||||
|
||||
# 19) Edge cases developers must handle
|
||||
|
||||
1. **Reachability unknown**
|
||||
|
||||
* If no analyzer output exists, set `reachable = null`
|
||||
* Do not suppress solely based on `reachable=null`
|
||||
2. **Version parse failures**
|
||||
|
||||
* `in_affected_range = null`
|
||||
* Surface range-related changes only when at least one side is determinable
|
||||
3. **Component renamed / purl drift**
|
||||
|
||||
* Consider purl normalization rules (namespace casing, qualifiers)
|
||||
* If the purl changes but the artifact is the same, treat it as a new component (unless you implement alias mapping later)
|
||||
4. **Multiple CVE sources / duplicates**
|
||||
|
||||
* Deduplicate by CVE ID per component+version
|
||||
5. **Conflicting VEX statements**
|
||||
|
||||
* Pick winner deterministically, but log conflict evidence
|
||||
6. **KEV listed but VEX says not affected**
|
||||
|
||||
* Still suppress? Recommended:
|
||||
|
||||
* Do **not** suppress; surface as “KEV listed but VEX not_affected” and rank high (KEV tier)
|
||||
7. **Policy config changes**
|
||||
|
||||
* Treat policy hash difference as a diff dimension; surface “policy flip” items even if underlying vuln unchanged
|
||||
|
||||
---
|
||||
|
||||
# 20) Testing strategy (must implement)
|
||||
|
||||
## Unit tests
|
||||
|
||||
* SemVer compare + affected range evaluation
|
||||
* Excititor precedence resolution
|
||||
* Suppression rules (table-driven tests)
|
||||
* Priority scoring determinism
|
||||
|
||||
## Integration tests
|
||||
|
||||
* Build synthetic snapshots:
|
||||
|
||||
* A: vuln present, unreachable, VEX not_affected
|
||||
* B: same vuln reachable
|
||||
* Assert Smart‑Diff surfaces exactly one item with `REACHABILITY_FLIP`
|
||||
* KEV flip test:
|
||||
|
||||
* Same findings, KEV list changes between Concelier snapshots
|
||||
* Assert item surfaces with `KEV_FLIP`
|
||||
|
||||
## Regression suite
|
||||
|
||||
Keep a folder of snapshot pairs and expected outputs:
|
||||
|
||||
* `fixtures/snapA`, `fixtures/snapB`, `expected.smartdiff.json`
|
||||
|
||||
---
|
||||
|
||||
# 21) What the developer should code first (practical order)
|
||||
|
||||
1. DTOs:
|
||||
|
||||
* `Snapshot`, `Component`, `VexStatement`, `ReachTrace`, `FindingKey`, `RiskState`, `SmartDiffItem`, `SmartDiffReport`
|
||||
2. Pure functions:
|
||||
|
||||
* `NormalizePurl`
|
||||
* `IsVersionInAffectedRange`
|
||||
* `ResolveVexStatus` (Excititor)
|
||||
* `ComputeRiskState`
|
||||
* `DiffRiskStates`
|
||||
* `ApplySuppression`
|
||||
* `ScoreAndRank`
|
||||
3. Persistence:
|
||||
|
||||
* store snapshots and computed findings
|
||||
4. API + CLI wrappers
|
||||
|
||||
---
|
||||
|
||||
If you want, I can also provide:
|
||||
|
||||
* a **concrete JSON Schema** for `SmartDiffReport`
|
||||
* **C# (.NET 10) interfaces + class skeletons** for `ConcelierClient`, `ExcititorResolver`, and `SmartDiffService`
|
||||
* a **fixture set** (sample SBOM/VEX/reach traces) to bootstrap the test suite
|
||||
@@ -0,0 +1,787 @@
|
||||
Here’s a compact playbook for building **10–20 “toy services” with planted, labeled vulnerabilities** so you can demo reachability, measure scanner accuracy, and make the “why” behind each finding obvious.
|
||||
|
||||
### Why do this
|
||||
|
||||
* **Repeatable benchmarks:** same inputs → same findings → track accuracy over time.
|
||||
* **Explainable demos:** each vuln has a story, proof path, and a fix.
|
||||
* **Coverage sanity checks:** distinguish **reachable** vs **unreachable** vulns so tools can’t inflate results.
|
||||
|
||||
### Core design
|
||||
|
||||
* Each service = 1 repo with:
|
||||
|
||||
* `/app` (tiny API or worker), `/infra` (Dockerfile/compose), `/tests` (PyTest/Jest + attack scripts), `/labels.yaml` (ground‑truth).
|
||||
* `labels.yaml` schema:
|
||||
|
||||
```yaml
service: svc-01-password-reset
vulns:
  - id: V1
    cve: CVE-2022-XXXXX
    type: dep_runtime
    package: express
    version: 4.17.0
    reachable: true
    path_tags: ["route:/reset", "call:crypto.md5", "env:DEV_MODE"]
    proof: ["curl.sh#L10", "trace.json:/reset stack -> md5()"]
    fix_hint: "upgrade express to 4.18.3"
  - id: V2
    type: dep_build
    package: lodash
    version: 4.17.5
    reachable: false
    path_tags: ["devDependency", "no-import"]
```
|
||||
* **Tagged paths**: add lightweight traces (e.g., log “TAG:route:/reset” before vulnerable call) so tests can assert reachability.
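
A test could assert reachability from those log markers roughly as follows. This is a sketch assuming the `TAG:<tag>` log convention from the example above and tags without whitespace. The function names are hypothetical.

```python
import re

# Matches markers such as "TAG:route:/reset"; the tag runs to the next whitespace.
_TAG_RE = re.compile(r"TAG:(\S+)")


def extract_tags(log_text: str) -> set:
    """Collect every tag emitted in the service log."""
    return set(_TAG_RE.findall(log_text))


def path_proven(log_text: str, expected_tags: list) -> bool:
    """True when every expected path tag from labels.yaml appears in the log."""
    return set(expected_tags) <= extract_tags(log_text)
```

A negative test for an unreachable variant would assert that `path_proven` is false after exercising the same route.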
|
||||
|
||||
### Suggested catalog (pick 10–20)
|
||||
|
||||
1. **Password reset token** (MD5, predictable tokens) – reachable via `/reset`.
|
||||
2. **SQL injection** (string‑concat query) – reachable via `/search`.
|
||||
3. **Path traversal** (`../` in `?file=`) – reachable but sandboxed; variant unreachable behind dead route flag.
|
||||
4. **Deserialization bug** (unsafe `pickle`/`BinaryFormatter`) – reachable in worker queue.
|
||||
5. **SSRF** (proxy fetch) – guarded by allow‑list in unreachable variant.
|
||||
6. **Command injection** (`child_process.exec`) – reachable via debug param; unreachable alt uses execFile.
|
||||
7. **JWT none‑alg** acceptance – only when `DEV_MODE=1`.
|
||||
8. **Hardcoded credentials** (in config) – present but not used (unreachable).
|
||||
9. **Dependency vuln (runtime)** – old `express`/`fastapi` called in hot path.
|
||||
10. **Dependency vuln (build‑time only)** – devDependency only (unreachable at runtime).
|
||||
11. **Insecure TLS** (skip verify) – gated behind feature flag.
|
||||
12. **Open redirect** – requires crafted `next=` param.
|
||||
13. **XXE** in XML upload – off by default in unreachable variant.
|
||||
14. **Insecure deserialization in message bus consumer** – invoked by test producer.
|
||||
15. **Race condition** (TOCTOU temp file) – demonstrated by parallel test.
|
||||
16. **Use‑after‑free style bug** (C tiny service) – reachable with specific sequence; alt path never called.
|
||||
17. **CSRF** on state‑changing route – reachable only without SameSite/CSRF tokens.
|
||||
18. **Directory listing** (misconfigured static server) – reachable under `/public`.
|
||||
19. **Prototype pollution** (JS merge) – only reachable when `content-type: application/json`.
|
||||
20. **Zip‑slip** in archive import – prevented in unreachable variant via safe unzip.
|
||||
|
||||
### Tech stack mix
|
||||
|
||||
* **Languages:** Node (Express), Python (FastAPI/Flask), Go (net/http), C# (.NET Minimal API), one small C binary.
|
||||
* **Packaging:** Docker per service; one multi‑stage build where the vulnerable tool exists only in the build stage (to test build‑time vs runtime vulns).
|
||||
* **Data:** SQLite or in‑memory maps to avoid ops noise.
|
||||
|
||||
### Test harness (deterministic)
|
||||
|
||||
* `make test` runs:
|
||||
|
||||
1. **Smoke** (service up).
|
||||
2. **Exploit scripts** trigger each *reachable* vuln and store `evidence/trace.json`.
|
||||
3. **Scanner run** (your tool + competitors) against the image/container/fs.
|
||||
4. **Evaluator** compares scanner output to `labels.yaml`.
|
||||
|
||||
### Metrics you’ll get
|
||||
|
||||
* **Precision/recall** overall and by class (dep_runtime, dep_build, code, config).
|
||||
* **Reachability precision**: % of flagged vulns with a proven path tag match.
|
||||
* **Overreport index**: unreachable‑flag hits / total hits.
|
||||
* **TTFS (Time‑to‑first‑signal)**: from scan start to first evidence‑backed block.
|
||||
* **Fix guidance score**: did the tool propose the correct minimal upgrade/patch?
|
||||
|
||||
### Minimal evaluator format
|
||||
|
||||
Scanner output → normalized JSON:
|
||||
|
||||
```json
{ "findings": [
  {"cve":"CVE-2022-XXXXX","package":"express","version":"4.17.0",
   "class":"dep_runtime","path_tags":["route:/reset","call:crypto.md5"]}
]}
```
|
||||
|
||||
Evaluator joins on `(cve|type|package)` (the scanner's `class` field corresponds to `type` in `labels.yaml`) and checks:
|
||||
|
||||
* tag overlap with `labels.vulns[*].path_tags`
|
||||
* reachable expectation matches
|
||||
* counts per class; exports `report.md` + `report.csv`.
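
The join and the headline metrics could be sketched like this. It is a minimal sketch under assumptions: labels and findings are already loaded as lists of dicts, the join key is the full `(cve, type, package)` tuple with a missing `cve` treated as `None`, and the overreport index is read as "reported findings that match a labeled-unreachable vuln, divided by all reported findings". Per-class counts and the report export are omitted.

```python
def _key(entry: dict, kind_field: str) -> tuple:
    """Join key; labels use 'type', normalized scanner output uses 'class'."""
    return (entry.get("cve"), entry[kind_field], entry["package"])


def evaluate(label_vulns: list, findings: list) -> dict:
    truth = {_key(v, "type"): v for v in label_vulns}
    reported = {_key(f, "class"): f for f in findings}
    hits = truth.keys() & reported.keys()
    # A hit counts as a tag match when it shares at least one path tag with the label.
    tag_matches = sum(
        1 for k in hits
        if set(truth[k].get("path_tags", [])) & set(reported[k].get("path_tags", []))
    )
    unreachable_hits = sum(1 for k in hits if truth[k]["reachable"] is False)
    n_rep, n_truth = len(reported), len(truth)
    return {
        "precision": len(hits) / n_rep if n_rep else 0.0,
        "recall": len(hits) / n_truth if n_truth else 0.0,
        "tag_match_rate": tag_matches / len(hits) if hits else 0.0,
        "overreport_index": unreachable_hits / n_rep if n_rep else 0.0,
    }
```

Whether recall should count labeled-unreachable vulns at all is a rubric decision; this sketch counts every label and leaves that refinement to `rubric.md`.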
|
||||
|
||||
### Demo storyline (5 min)
|
||||
|
||||
1. Run **svc‑01**; hit `/reset`; show trace marker.
|
||||
2. Run your scanner; show it ranks the **reachable dep vuln** above the **devDependency vuln**.
|
||||
3. Flip env to disable route; rerun → reachable finding disappears → score improves.
|
||||
4. Show **fix hint** applied (upgrade) → green.
|
||||
|
||||
### Repo layout (monorepo)
|
||||
|
||||
```
/toys/
  svc-01-reset-md5/
  svc-02-sql-injection/
  ...
/harness/
  normalize.py
  evaluate.py
  run_scans.sh
/docs/
  rubric.md   # metric definitions & thresholds
```
|
||||
|
||||
### Guardrails
|
||||
|
||||
* Keep images tiny (<150MB) and ports unique.
|
||||
* Deterministic seeds for any randomness.
|
||||
* No outbound calls in tests (use local mocks).
|
||||
* Clearly mark **unsafe** code blocks with comments.
|
||||
|
||||
### First 5 to build this week
|
||||
|
||||
1. `svc-01-reset-md5` (Node)
|
||||
2. `svc-02-sql-injection` (Python/FastAPI)
|
||||
3. `svc-03-dep-build-only` (Node devDependency)
|
||||
4. `svc-04-cmd-injection` (.NET Minimal API)
|
||||
5. `svc-05-zip-slip` (Go)
|
||||
|
||||
If you want, I can generate the skeleton repos (Dockerfile, app, tests, `labels.yaml`, and the evaluator script) so you can drop them into your monorepo and start measuring immediately.
|
||||
Below is a **developer framework** you can hand to the team as the governing “contract” for implementing the full toy-service catalogue at a **best-in-class** standard, while keeping the suite deterministic, safe, and maximally useful for scanner R&D.
|
||||
|
||||
---
|
||||
|
||||
## 1) Non-negotiable principles
|
||||
|
||||
1. **Determinism first**
|
||||
|
||||
* Same git SHA + same inputs ⇒ identical images, SBOMs, findings, scores.
|
||||
* Pin everything: base image **by digest**, language deps **by lockfiles**, tool versions **by exact semver**, and record it in an evidence manifest.
|
||||
|
||||
2. **Ground truth is authoritative**
|
||||
|
||||
* Every planted weakness must have a **machine-readable label**, and at least one **verifiable proof artifact**.
|
||||
* No “implicit” vulnerabilities; if it’s not labeled, it does not exist for scoring.
|
||||
|
||||
3. **Reachability is tiered, not binary**
|
||||
|
||||
* You will label and prove *how* it is reachable (imported vs executed vs tainted input), not just “reachable: true”.
|
||||
|
||||
4. **Safety by construction**
|
||||
|
||||
* Services run on an isolated docker network; tests must not require internet.
|
||||
* Proofs should demonstrate *execution and dataflow* rather than “weaponized exploitation”.
|
||||
|
||||
---
|
||||
|
||||
## 2) Repository and service contract
|
||||
|
||||
### Standard monorepo layout
|
||||
|
||||
```
/toys/
  svc-01-.../
    app/
    infra/           # Dockerfile, compose, network policy
    tests/           # positive + negative reachability tests
    labels.yaml      # ground truth
    evidence/        # generated by tests (trace, tags, manifests)
    fix/             # minimal patch proving remediation
/harness/
  run-suite/
  normalize/
  evaluate/
/schemas/
  labels.schema.json
/docs/
  benchmark-contract.md
  scoring.md
  reviewer-checklist.md
```
|
||||
|
||||
### Required service deliverables (Definition of Done)
|
||||
|
||||
A service PR is “DONE” only if it includes:
|
||||
|
||||
* `labels.yaml` validated by `schemas/labels.schema.json`
|
||||
* Docker build reproducible enough to be stable in CI (digest pinned; lockfiles committed)
|
||||
* **Positive tests** that generate evidence proving reachability tiers (see §3)
|
||||
* **Negative tests** proving “unreachable” claims (feature flags off, devDependency only, dead route, etc.)
|
||||
* `fix/` patch that removes/mitigates the weakness and produces a measurable delta (findings drop, reachability flips, or config gate blocks)
|
||||
* An `evidence/manifest.json` capturing tool versions, git sha, image digest, timestamps (UTC), and hashes of evidence files
|
||||
|
||||
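As a rough sketch of the last deliverable, the manifest could be assembled like this. The field names (`git_sha`, `image_digest`, `tool_versions`, `generated_at`, `files`) and the function `build_manifest` are assumptions for illustration; only the list of things to capture comes from the Definition of Done above.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def sha256_file(path):
    """Stream a file through SHA-256 so large traces do not load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(evidence_dir, git_sha, image_digest, tool_versions):
    """Collect hashes of every evidence file except the manifest itself."""
    evidence_dir = Path(evidence_dir)
    files = sorted(
        p for p in evidence_dir.rglob("*")
        if p.is_file() and p.name != "manifest.json"
    )
    return {
        "git_sha": git_sha,
        "image_digest": image_digest,
        "tool_versions": dict(sorted(tool_versions.items())),
        "generated_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
        "files": {p.relative_to(evidence_dir).as_posix(): sha256_file(p) for p in files},
    }
```

Because `generated_at` changes on every run, a determinism check would compare everything except that field.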
---
|
||||
|
||||
## 3) Reachability tiers and evidence requirements
|
||||
|
||||
### Reachability levels (use these everywhere)
|
||||
|
||||
* **R0 Present**: vulnerable component exists in image/SBOM, not imported/loaded.
|
||||
* **R1 Loaded**: imported/linked/initialized, but no executed path proven.
|
||||
* **R2 Executed**: vulnerable function/module is executed in a test (deterministic trace).
|
||||
* **R3 Tainted execution**: execution occurs with externally influenced input (route param/message/body).
|
||||
* **R4 Exploitable** (optional): controlled, non-harmful PoC demonstrates full impact.
|
||||
|
||||
### Minimum evidence per level
|
||||
|
||||
* R0: SBOM + file hash / package metadata
|
||||
* R1: runtime startup logs or module load trace tag
|
||||
* R2: callsite tag + stack trace snippet (or deterministic trace file)
|
||||
* R3: R2 + taint marker showing data originated from external boundary (HTTP/queue/env) and reached call
|
||||
* R4: only if safe and necessary; keep it non-weaponized and sandboxed
|
||||
|
||||
**Key rule:** prefer proving **execution + dataflow** over providing “payload recipes”.
|
||||
|
||||
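One way to make the tiers and their minimum evidence checkable in the harness is sketched below. The evidence kind strings (`sbom`, `callsite_tag`, `taint_marker`, and so on) are hypothetical names for the artifacts listed above, and requiring a `sandboxed_poc` for R4 is an assumption.

```python
from enum import IntEnum

class Reach(IntEnum):
    R0 = 0  # present
    R1 = 1  # loaded
    R2 = 2  # executed
    R3 = 3  # tainted execution
    R4 = 4  # exploitable (optional)

# Each level lists requirements; each requirement is a set of acceptable
# alternatives (any one of them satisfies it).
REQUIRED = {
    Reach.R0: [{"sbom"}, {"file_hash", "package_metadata"}],
    Reach.R1: [{"startup_log", "module_load_tag"}],
    Reach.R2: [{"callsite_tag"}, {"stack_snippet", "trace_file"}],
}
REQUIRED[Reach.R3] = REQUIRED[Reach.R2] + [{"taint_marker"}]
REQUIRED[Reach.R4] = REQUIRED[Reach.R3] + [{"sandboxed_poc"}]

def unmet(level, provided):
    """Return the requirements (as sorted alternative lists) not yet satisfied."""
    have = set(provided)
    return [sorted(alts) for alts in REQUIRED[level] if not (alts & have)]
```

Using `IntEnum` keeps the tiers ordered, so comparisons like `Reach.R3 > Reach.R1` work directly when scoring over- and under-reach.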
---
|
||||
|
||||
## 4) Ground truth schema (what `labels.yaml` must capture)
|
||||
|
||||
Every vuln entry must have:
|
||||
|
||||
* Stable ID: `svc-XX:Vn` (never renumber once published)
|
||||
* Class: `dep_runtime | dep_build | code | config | os_pkg | supply_chain`
|
||||
* Identity: `cve` (if applicable), `purl`, `package`, `version`, `location` (path/module)
|
||||
* Reachability: `reachability_level: R0..R4`, `entrypoint` (route/topic/cli), `preconditions` (flags/env/auth)
|
||||
* Proofs:
|
||||
|
||||
* `proof.artifacts[]` (e.g., trace file, tag log, coverage snippet)
|
||||
* `proof.tags[]` (canonical tag strings)
|
||||
* Fix:
|
||||
|
||||
* `fix.type` (upgrade/config/code)
|
||||
* `fix.patch_path` (under `fix/`)
|
||||
* `fix.expected_delta` (what should change in findings/evidence)
|
||||
* Negatives (if unreachable):
|
||||
|
||||
* `negative_proof` explaining and proving why it is unreachable
|
||||
|
||||
Canonical tag format (consistent across languages):
|
||||
|
||||
* `TAG:route:POST /reset`
|
||||
* `TAG:call:Crypto.MD5`
|
||||
* `TAG:taint:http.body.resetToken`
|
||||
* `TAG:flag:DEV_MODE=true`
|
||||
|
||||
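Before the real `labels.schema.json` exists, the rules above can be approximated with a small check like the one below. It is a sketch under assumptions: the key names `id` and `class` are hypothetical, and treating R0/R1 as the "unreachable" levels that need `negative_proof` is one reading of the text.

```python
import re

ID_RE = re.compile(r"^svc-\d{2}:V\d+$")
TAG_RE = re.compile(r"^TAG:(route|call|taint|flag|topic|iac):(.+)$")
CLASSES = {"dep_runtime", "dep_build", "code", "config", "os_pkg", "supply_chain"}
LEVELS = {"R0", "R1", "R2", "R3", "R4"}

def validate_label(entry):
    """Return a list of human-readable problems; empty means the entry passes."""
    errors = []
    if not ID_RE.match(entry.get("id", "")):
        errors.append("id must look like svc-XX:Vn")
    if entry.get("class") not in CLASSES:
        errors.append("unknown class")
    level = entry.get("reachability_level")
    if level not in LEVELS:
        errors.append("unknown reachability_level")
    proof = entry.get("proof", {})
    if not proof.get("artifacts"):
        errors.append("proof.artifacts must not be empty")
    for tag in proof.get("tags", []):
        if not TAG_RE.match(tag):
            errors.append(f"non-canonical tag: {tag}")
    # Assumption: R0 and R1 count as "unreachable" and need a negative proof.
    if level in {"R0", "R1"} and not entry.get("negative_proof"):
        errors.append("unreachable entries need negative_proof")
    return errors
```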
---
|
||||
|
||||
## 5) Service implementation standards (how developers build each toy)
|
||||
|
||||
### A. Vulnerability planting patterns (approved)
|
||||
|
||||
* **Dependency runtime**: vulnerable version is a production dependency and exercised on a normal route/job.
|
||||
* **Dependency build-only**: devDependency only, or used only in build stage; prove it never ships in final image.
|
||||
* **Code vuln**: the vulnerable sink is behind a clean, deterministic entrypoint and instrumented.
|
||||
* **Config vuln**: misconfig is explicit and versioned (headers, TLS settings, authz rules), with a fix patch.
|
||||
|
||||
### B. Instrumentation requirements
|
||||
|
||||
* Every reachable vuln must emit:
|
||||
|
||||
* one **entrypoint tag** (route/topic/command)
|
||||
* one **sink tag** (the vulnerable call or module)
|
||||
* optional **taint tag** for R3
|
||||
* Evidence generation must be stable and machine-parsable:
|
||||
|
||||
* JSON trace preferred (`evidence/trace.json`)
|
||||
* Logs acceptable if structured and anchored with tags
|
||||
|
||||
### C. Negative-case discipline (unreachable means proven unreachable)
|
||||
|
||||
Unreachable claims must be backed by one of:
|
||||
|
||||
* compilation/linker exclusion (dead code eliminated) + proof
|
||||
* dependency not present in final image (multi-stage) + proof (image file listing / SBOM diff)
|
||||
* feature flag off + proof (config captured + route unavailable)
|
||||
* auth gate + proof (unauthorized cannot reach sink)
|
||||
|
||||
---
|
||||
|
||||
## 6) Harness and scoring gates (how you enforce “best in class”)
|
||||
|
||||
### Normalization
|
||||
|
||||
All scanners’ outputs must normalize into one internal shape:
|
||||
|
||||
* `(identity: purl+cve+version+location) + class + reachability_claim + evidence_refs`
|
||||
|
||||
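The internal shape could be pinned down as a small immutable record, roughly as below. The raw scanner keys read by `normalize` (`purl`, `class`, `reachability`, `evidence`, ...) are hypothetical; each real scanner would need its own adapter producing the same record.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class NormalizedFinding:
    purl: str
    cve: Optional[str]
    version: str
    location: str
    vuln_class: str
    reachability_claim: str              # e.g. "R0".."R4" or "unknown"
    evidence_refs: Tuple[str, ...] = ()

    @property
    def identity(self):
        """The join key: purl + cve + version + location."""
        return (self.purl, self.cve, self.version, self.location)

def normalize(raw: dict) -> NormalizedFinding:
    """Map one raw scanner record (hypothetical shape) to the internal form."""
    return NormalizedFinding(
        purl=raw["purl"],
        cve=raw.get("cve"),
        version=raw["version"],
        location=raw.get("location", ""),
        vuln_class=raw["class"],
        reachability_claim=raw.get("reachability", "unknown"),
        # Sorted so the same evidence in a different order compares equal.
        evidence_refs=tuple(sorted(raw.get("evidence", []))),
    )
```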
### Core metrics (tracked per commit)
|
||||
|
||||
* **Recall (by class)**: runtime deps, OS pkgs, code, config
|
||||
* **Precision**: share of reported findings that match a label; track the false positive rate, especially R0/R1 misclassified as R2/R3
|
||||
* **Reachability accuracy**:
|
||||
|
||||
* overreach: predicted reachable but labeled R0/R1
|
||||
* underreach: labeled R2/R3 but predicted non-reachable
|
||||
* **TTFS** (Time-to-First-Signal): time to first *evidence-backed* blocking issue
|
||||
* **Fix validation**: applying `fix/` must produce the expected delta
|
||||
|
||||
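The two reachability-accuracy rates can be computed from (labeled level, predicted reachable) pairs, as in this sketch. Following the definitions above literally, R4 labels are ignored here; whether they should count toward underreach is left open.

```python
def reachability_accuracy(pairs):
    """pairs: iterable of (labeled_level, predicted_reachable)."""
    over = under = low = high = 0
    for level, predicted in pairs:
        if level in ("R0", "R1"):
            low += 1
            over += int(bool(predicted))       # predicted reachable, labeled R0/R1
        elif level in ("R2", "R3"):
            high += 1
            under += int(not predicted)        # labeled R2/R3, predicted non-reachable
    return {
        "overreach": over / low if low else 0.0,
        "underreach": under / high if high else 0.0,
    }
```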
### Quality gates (example thresholds you can enforce in CI)
|
||||
|
||||
* Runtime dependency recall ≥ 0.95
|
||||
* Unreachable false positives ≤ 0.05 (for R0/R1)
|
||||
* Reachability underreport ≤ 0.10 (for labeled R2/R3)
|
||||
* TTFS regression: no worse than +10% vs main
|
||||
* Fix validation pass rate = 100% for modified services
|
||||
|
||||
(Adjust numbers as your suite matures; the framework is the key.)
|
||||
|
||||
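A CI gate over those thresholds might look like the sketch below. The metric names are hypothetical keys; the numbers are the example thresholds listed above, with the TTFS gate expressed as a fractional regression against main.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# Metric names are illustrative; thresholds mirror the example gates above.
GATES = {
    "runtime_dep_recall": (">=", 0.95),
    "unreachable_fp": ("<=", 0.05),
    "reachability_underreport": ("<=", 0.10),
    "ttfs_regression": ("<=", 0.10),
    "fix_validation_pass_rate": (">=", 1.0),
}

def failed_gates(metrics):
    """Return one message per failed or missing gate; empty list means pass."""
    failures = []
    for name, (op, threshold) in GATES.items():
        value = metrics.get(name)
        if value is None or not OPS[op](value, threshold):
            failures.append(f"{name}: {value} (want {op} {threshold})")
    return failures
```

A missing metric fails its gate on purpose, so a broken evaluator cannot pass silently.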
---
|
||||
|
||||
## 7) Review checklist (what reviewers enforce)
|
||||
|
||||
A PR adding/modifying a service is rejected if any of these fail:
|
||||
|
||||
* Labels complete, schema-valid, and stable IDs preserved
|
||||
* Proof artifacts are deterministic and generated by tests
|
||||
* Reachability tier justified and matches evidence
|
||||
* Unreachable claims have negative proofs
|
||||
* Docker build uses pinned digests + lockfiles committed
|
||||
* `fix/` produces measurable delta and does not introduce new unlabeled issues
|
||||
* No network egress required; tests are hermetic
|
||||
|
||||
---
|
||||
|
||||
## 8) Catalogue coverage plan (so the suite actually pushes “best in class”)
|
||||
|
||||
Define coverage targets across axes (each service must declare which axes it covers):
|
||||
|
||||
* Language/runtime: Node, Python, Go, .NET 10, Java, native
|
||||
* Packaging: fat images, slim, distroless, multi-stage, vendored deps
|
||||
* Dependency complexity: direct, transitive, aliased/renamed, bundled
|
||||
* Reachability: R0..R3 mandatory; R4 only for a small subset
|
||||
* Supply chain: SBOM mismatches, provenance metadata, build-time tools present but excluded
|
||||
|
||||
This prevents you from building 20 near-duplicates and thinking you have “coverage”.
|
||||
|
||||
---
|
||||
|
||||
## 9) What to hand developers today (copy/paste into your repo)
|
||||
|
||||
Create these docs/templates as the “contract”:
|
||||
|
||||
* `/docs/benchmark-contract.md` (principles + DoD + reachability tiers)
|
||||
* `/docs/reviewer-checklist.md` (the rejection criteria)
|
||||
* `/schemas/labels.schema.json` (strict validation)
|
||||
* `/toys/_TEMPLATE/` (service skeleton with instrumentation + tests scaffolding)
|
||||
* `/harness/evaluate/` (metrics + CI gating)
|
||||
|
||||
If you want, I can draft the actual **`labels.schema.json`**, the **service template**, and the **review checklist** in ready-to-commit form (paths, names, and .NET 10 conventions included).
|
||||
Below is an **extended, implementable catalogue** (20 services + 4 “meta” repos) with **concrete stacks, entrypoints, reachability tiers, canonical tags, proof tests, and fixes**. It is written so developers can start building immediately while keeping the suite deterministic and scanner-grade.
|
||||
|
||||
---
|
||||
|
||||
## A. Catalogue index (what each service is for)
|
||||
|
||||
| ID | Service                     | Stack                 | Primary capability tested                              |
| -: | --------------------------- | --------------------- | ------------------------------------------------------ |
| 01 | reset-token-weak-crypto     | Node/Express          | Code reachability + crypto misuse                      |
| 02 | search-sql-injection        | Python/FastAPI        | Taint → sink (SQLi), route evidence                    |
| 03 | cmd-injection-diagnostics   | .NET 10 Minimal API   | Taint → shell sink + gating                            |
| 04 | zip-import-zip-slip         | Go net/http           | Archive handling (Zip Slip), filesystem proof          |
| 05 | xml-upload-xxe              | Java/Spring Boot      | XML parser config (XXE), safe proof                    |
| 06 | jwt-none-devmode            | .NET 10               | Config-gated auth bypass (reachability depends on env) |
| 07 | fetcher-ssrf                | Node/Express          | SSRF to internal-only target, network isolation        |
| 08 | outbound-tls-skipverify     | Go                    | TLS misconfig + “reachable only if feature enabled”    |
| 09 | queue-pickle-deser          | Python worker         | Async reachability via queue + unsafe deserialization  |
| 10 | efcore-rawsql               | .NET 10 + EF Core     | ORM raw SQL misuse + input flow                        |
| 11 | shaded-jar-deps             | Java/Gradle           | Shaded/fat jar dependency discovery                    |
| 12 | webpack-bundled-dep         | Node/Webpack          | Bundled deps + SBOM correctness                        |
| 13 | go-static-modver            | Go static             | Detect module versions in static binaries              |
| 14 | dotnet-singlefile-trim      | .NET 10 publish       | Single-file/trimmed dependency evidence                |
| 15 | cors-credentials-wildcard   | .NET 10 or Node       | Config vulnerability (CORS) + fix delta                |
| 16 | open-redirect               | Node/Express          | Web vuln classification + allowlist fix                |
| 17 | csrf-state-change           | .NET 10 Razor/Minimal | Missing CSRF protections + cookie semantics            |
| 18 | prototype-pollution-merge   | Node                  | JSON-body gated path + sink                            |
| 19 | path-traversal-download     | Python/Flask          | File handling traversal + normalization                |
| 20 | insecure-tempfile-toctou    | Go or .NET            | Concurrency/race evidence (safe)                       |
| 21 | k8s-misconfigs              | YAML/Helm             | IaC scanning (privileged, hostPath, etc.)              |
| 22 | docker-multistage-buildonly | Any                   | Build-time-only vuln exclusion proof                   |
| 23 | secrets-fakes-corpus        | Any                   | Secret detection precision (fake tokens)               |
| 24 | sbom-mismatch-lab           | Any                   | SBOM validation + diff correctness                     |
|
||||
|
||||
---
|
||||
|
||||
## B. Canonical tagging (use across all services)
|
||||
|
||||
Every reachable vuln must produce at least:
|
||||
|
||||
* `TAG:route:<method> <path>` or `TAG:topic:<name>`
|
||||
* `TAG:call:<sink>`
|
||||
* If R3: `TAG:taint:<boundary>` (http.query, http.body, queue.msg, env.var)
|
||||
|
||||
**Evidence artifact:** `evidence/trace.json` lines such as:
|
||||
|
||||
```json
{"ts":"...","corr":"...","tags":["TAG:route:POST /reset","TAG:taint:http.body.email","TAG:call:Crypto.MD5"]}
```
|
||||
|
||||
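A trace line in that shape could be emitted by a helper along these lines. This is a sketch: the function name `trace_line` and the optional `corr` / `now` parameters are assumptions, added so tests can pin both values and keep evidence deterministic.

```python
import json
import uuid
from datetime import datetime, timezone

def trace_line(tags, corr=None, now=None):
    """Serialize one evidence record as a single compact JSON line."""
    record = {
        "ts": (now or datetime.now(timezone.utc)).isoformat(),
        "corr": corr or uuid.uuid4().hex,   # correlation id for the request
        "tags": list(tags),
    }
    return json.dumps(record, separators=(",", ":"))
```

In a test, passing a fixed `corr` and `now` yields byte-identical lines across runs, which is what the determinism principle asks for.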
---
|
||||
|
||||
## C. Service specs (developers can implement 1:1)
|
||||
|
||||
### 01) `svc-01-reset-token-weak-crypto` (Node/Express)
|
||||
|
||||
**Purpose:** R3 code reachability; crypto misuse; ensure scanner doesn’t over-rank unreachable dev deps.
|
||||
**Entrypoints:** `POST /reset` and `POST /reset/confirm`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-327 Weak Crypto** — reset token derived from deterministic inputs (no CSPRNG).
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:POST /reset`, `TAG:taint:http.body.email`, `TAG:call:Crypto.WeakToken`
|
||||
* Proof test: request reset; assert trace contains sink tag.
|
||||
* Fix: use `crypto.randomBytes()` and store hashed token.
|
||||
* `V2` **dep_build** — vulnerable npm devDependency present only in `devDependencies`.
|
||||
|
||||
* Reachability: **R0**
|
||||
* Negative proof: final image contains no node_modules entry for it OR it is never imported (coverage + grep import map).
|
||||
|
||||
**Hard mode variant:** token generation only happens when `FEATURE_RESET_V1=1` → label unreachable when off.
|
||||
|
||||
---
|
||||
|
||||
### 02) `svc-02-search-sql-injection` (Python/FastAPI + SQLite)
|
||||
|
||||
**Purpose:** Classic taint → SQL sink; evidence-driven.
|
||||
**Entrypoint:** `GET /search?q=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-89 SQL Injection** — query constructed via string concatenation.
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:GET /search`, `TAG:taint:http.query.q`, `TAG:call:SQL.Unparameterized`
|
||||
* Proof test: send query with SQL metacharacters; verify trace hits sink.
|
||||
* Fix: parameterized query / query builder.
|
||||
|
||||
**Hard mode variant:** same route exists but the safe path uses parameters; the unsafe path is taken only when the request carries header `X-Debug: 1` and the environment sets `DEV_MODE=1`.
|
||||
|
||||
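Since this service is Python + SQLite, the planted sink and its fix can be shown side by side. The sketch below uses the standard `sqlite3` module; the `items` table and function names are hypothetical, and the unsafe variant exists only to illustrate the labeled weakness.

```python
import sqlite3

def search_unsafe(conn, q):
    # UNSAFE (planted V1): user input is concatenated into the SQL text.
    sql = "SELECT name FROM items WHERE name LIKE '%" + q + "%'"
    return conn.execute(sql).fetchall()

def search_fixed(conn, q):
    # Fix: the value travels as a bound parameter, never as SQL text.
    return conn.execute(
        "SELECT name FROM items WHERE name LIKE ?", (f"%{q}%",)
    ).fetchall()
```

A proof test only needs input containing SQL metacharacters: the unsafe path changes the query's meaning, while the fixed path treats the same input as a literal pattern.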
---
|
||||
|
||||
### 03) `svc-03-cmd-injection-diagnostics` (.NET 10 Minimal API)
|
||||
|
||||
**Purpose:** Detect command execution sink and prove gating.
|
||||
**Entrypoint:** `GET /diag/ping?host=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-78 Command Injection** — shell invocation with user-influenced argument.
|
||||
|
||||
* Reachability: **R3** when `DIAG_ENABLED=1`
|
||||
* Tags: `TAG:route:GET /diag/ping`, `TAG:taint:http.query.host`, `TAG:call:Process.Start.Shell`
|
||||
* Proof test: call endpoint with characters that would alter shell parsing; evidence is sink tag + controlled output marker (not destructive).
|
||||
* Fix: avoid shell, use argument arrays (`ProcessStartInfo.ArgumentList`) + allowlist hostnames.
|
||||
|
||||
**Hard mode variant:** sink is in a helper library referenced transitively; scanner must resolve call graph.
|
||||
|
||||
---
|
||||
|
||||
### 04) `svc-04-zip-import-zip-slip` (Go)
|
||||
|
||||
**Purpose:** File/archive handling; safe filesystem proof; no “real system” impact.
|
||||
**Entrypoint:** `POST /import-zip`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-22 Path Traversal (Zip Slip)** — extraction path not normalized/validated.
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:POST /import-zip`, `TAG:taint:http.body.zip`, `TAG:call:Archive.Extract.UnsafeJoin`
|
||||
* Proof test: upload crafted zip that attempts to place `evidence/sentinel.txt` outside dest; assert sentinel ends up outside intended folder.
|
||||
* Fix: clean paths; reject entries escaping dest; forbid absolute paths.
|
||||
|
||||
---
|
||||
|
||||
### 05) `svc-05-xml-upload-xxe` (Java/Spring Boot)
|
||||
|
||||
**Purpose:** Parser config scanning + code-path proof.
|
||||
**Entrypoint:** `POST /upload-xml`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-611 XXE** — DocumentBuilderFactory with external entities enabled.
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:POST /upload-xml`, `TAG:taint:http.body.xml`, `TAG:call:XML.Parse.XXEEnabled`
|
||||
* Proof test: XML references a **local test file under `/app/testdata/`** and returns its sentinel string (no external network).
|
||||
* Fix: disable external entity resolution and secure processing.
|
||||
|
||||
---
|
||||
|
||||
### 06) `svc-06-jwt-none-devmode` (.NET 10)
|
||||
|
||||
**Purpose:** Reachability depends on environment and config.
|
||||
**Entrypoint:** `GET /admin` (Bearer JWT)
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-345 Insufficient Verification** — accepts unsigned token when `DEV_MODE=1`.
|
||||
|
||||
* Reachability: **R2** (exec) / **R3** (if token from request)
|
||||
* Tags: `TAG:route:GET /admin`, `TAG:flag:DEV_MODE=true`, `TAG:call:Auth.JWT.AcceptNoneAlg`
|
||||
* Proof test: run container with DEV_MODE=1; request triggers sink tag.
|
||||
* Negative test: DEV_MODE=0 must not hit sink tag.
|
||||
* Fix: enforce algorithm + signature validation always.
|
||||
|
||||
---
|
||||
|
||||
### 07) `svc-07-fetcher-ssrf` (Node/Express)
|
||||
|
||||
**Purpose:** SSRF detection with internal-only target in docker network.
|
||||
**Entrypoint:** `GET /fetch?url=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-918 SSRF** — URL fetched without scheme/host restrictions.
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:GET /fetch`, `TAG:taint:http.query.url`, `TAG:call:HTTP.Client.Fetch`
|
||||
* Proof test: fetch `http://internal-metadata/health` (a companion container in compose); assert response contains sentinel + sink tag.
|
||||
* Fix: allowlist hosts/schemes; block private ranges; require signed destinations.
|
||||
|
||||
---
|
||||
|
||||
### 08) `svc-08-outbound-tls-skipverify` (Go)
|
||||
|
||||
**Purpose:** Config vuln + “reachable only when feature on.”
|
||||
**Entrypoint:** `POST /sync` triggers outbound HTTPS call
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-295 Improper Cert Validation** — `InsecureSkipVerify=true` when `SYNC_FAST=1`.
|
||||
|
||||
* Reachability: **R2** (exec)
|
||||
* Tags: `TAG:route:POST /sync`, `TAG:flag:SYNC_FAST=true`, `TAG:call:TLS.InsecureSkipVerify`
|
||||
* Fix: proper CA pinning / system pool; explicit cert verification.
|
||||
|
||||
---
|
||||
|
||||
### 09) `svc-09-queue-pickle-deser` (Python API + worker)
|
||||
|
||||
**Purpose:** Async reachability: API enqueues → worker executes sink.
|
||||
**Entrypoints:** `POST /enqueue` + worker consumer
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-502 Unsafe Deserialization** — worker uses unsafe deserializer.
|
||||
|
||||
* Reachability: **R3** (taint from HTTP → queue → worker)
|
||||
* Tags: `TAG:route:POST /enqueue`, `TAG:topic:jobs`, `TAG:call:Deserialize.Unsafe`
|
||||
* Proof test: enqueue benign payload that triggers sink tag and deterministic “handled” response (no arbitrary execution PoC).
|
||||
* Fix: switch to safe format (JSON) and validate schema.
|
||||
|
||||
---
|
||||
|
||||
### 10) `svc-10-efcore-rawsql` (.NET 10 + EF Core)
|
||||
|
||||
**Purpose:** ORM misuse; taint → SQL sink detection.
|
||||
**Entrypoint:** `GET /reports?where=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-89 SQLi** — `FromSqlRaw`/`ExecuteSqlRaw` with interpolated input.
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:GET /reports`, `TAG:taint:http.query.where`, `TAG:call:EFCore.FromSqlRaw.Unsafe`
|
||||
* Fix: `FromSqlInterpolated` with parameters or LINQ predicates.
|
||||
|
||||
---
|
||||
|
||||
### 11) `svc-11-shaded-jar-deps` (Java/Gradle)
|
||||
|
||||
**Purpose:** Dependency discovery inside fat/shaded jar; reachable vs present-only.
|
||||
**Entrypoint:** `GET /parse`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **dep_runtime** — vulnerable lib included in shaded jar and actually invoked.
|
||||
|
||||
* Reachability: **R2**
|
||||
* Tags: `TAG:route:GET /parse`, `TAG:call:Lib.Parse.VulnerableMethod`
|
||||
* `V2` **dep_build/test** — test-scoped vulnerable lib not packaged in runtime jar.
|
||||
|
||||
* Reachability: **R0**
|
||||
* Negative proof: SBOM for runtime jar excludes it; file listing confirms.
|
||||
|
||||
**Fix:** bump dependency and rebuild shaded jar.
|
||||
|
||||
---
|
||||
|
||||
### 12) `svc-12-webpack-bundled-dep` (Node/Webpack)
|
||||
|
||||
**Purpose:** Bundled dependencies, source map presence/absence, SBOM correctness.
|
||||
**Entrypoint:** `GET /render?template=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **dep_runtime** — vulnerable template lib bundled; invoked by render.
|
||||
|
||||
* Reachability: **R2/R3** depending on input usage
|
||||
* Tags: `TAG:route:GET /render`, `TAG:taint:http.query.template`, `TAG:call:Template.Render`
|
||||
* `V2` **R0** — vulnerable package in lockfile but tree-shaken and absent from output bundle.
|
||||
|
||||
* Negative proof: bundle inspection + build manifest.
|
||||
|
||||
**Fix:** upgrade dependency and rebuild bundle; ensure SBOM maps bundle contents.
|
||||
|
||||
---
|
||||
|
||||
### 13) `svc-13-go-static-modver` (Go static binary)
|
||||
|
||||
**Purpose:** Scanner capability to extract module versions from static binary.
|
||||
**Entrypoint:** `GET /hash?alg=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **dep_runtime** — vulnerable Go module version linked; executed on route.
|
||||
|
||||
* Reachability: **R2**
|
||||
* Tags: `TAG:route:GET /hash`, `TAG:call:GoMod.VulnFunc`
|
||||
* `V2` **R1** — module linked but only used in dead code path (guarded by constant false).
|
||||
|
||||
* Negative proof: coverage/trace never hits sink.
|
||||
|
||||
**Fix:** update `go.mod` and rebuild.
|
||||
|
||||
---
|
||||
|
||||
### 14) `svc-14-dotnet-singlefile-trim` (.NET 10 publish single-file)
|
||||
|
||||
**Purpose:** Detect assemblies in single-file + trimming edge cases.
|
||||
**Entrypoint:** `GET /export`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **dep_runtime** — vulnerable NuGet referenced and executed.
|
||||
|
||||
* Reachability: **R2**
|
||||
* Tags: `TAG:route:GET /export`, `TAG:call:NuGet.VulnMethod`
|
||||
* `V2` **R0** — package referenced in project but trimmed out and not present.
|
||||
|
||||
* Negative proof: runtime file map (single-file manifest) excludes it.
|
||||
|
||||
**Fix:** bump NuGet; adjust trimming settings if needed.
|
||||
|
||||
---
|
||||
|
||||
### 15) `svc-15-cors-credentials-wildcard` (.NET 10)
|
||||
|
||||
**Purpose:** Config/misconfig detection; clear fix delta.
|
||||
**Entrypoint:** any API route
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-942 / CORS Misconfig** — `Access-Control-Allow-Origin: *` with credentials.
|
||||
|
||||
* Reachability: **R2** (observed in response headers)
|
||||
* Tags: `TAG:route:GET /health`, `TAG:call:HTTP.Headers.CORSWildcardCreds`
|
||||
* Proof test: request and assert headers + tag.
|
||||
* Fix: explicit allowed origins + disable credentials unless needed.
|
||||
|
||||
---
|
||||
|
||||
### 16) `svc-16-open-redirect` (Node/Express)
|
||||
|
||||
**Purpose:** Web vuln classification, allowlist fix.
|
||||
**Entrypoint:** `GET /login?next=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-601 Open Redirect** — next param used directly.
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:GET /login`, `TAG:taint:http.query.next`, `TAG:call:Redirect.Unvalidated`
|
||||
* Fix: allowlist relative paths; reject absolute URLs.
|
||||
|
||||
---
|
||||
|
||||
### 17) `svc-17-csrf-state-change` (.NET 10)
|
||||
|
||||
**Purpose:** CSRF detection + cookie semantics.
|
||||
**Entrypoint:** `POST /account/email` (cookie auth)
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-352 CSRF** — no anti-forgery token; SameSite mis-set.
|
||||
|
||||
* Reachability: **R2**
|
||||
* Tags: `TAG:route:POST /account/email`, `TAG:call:Auth.CSRF.MissingProtection`
|
||||
* Fix: antiforgery token + SameSite=Lax/Strict and proper CORS.
|
||||
|
||||
---
|
||||
|
||||
### 18) `svc-18-prototype-pollution-merge` (Node)
|
||||
|
||||
**Purpose:** JSON-body gated sink; reachability must respect content-type and route.
|
||||
**Entrypoint:** `POST /profile` (application/json)
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-1321 Prototype Pollution** — unsafe deep merge of user object into defaults.
|
||||
|
||||
* Reachability: **R3** (only if JSON)
|
||||
* Tags: `TAG:route:POST /profile`, `TAG:taint:http.body.json`, `TAG:call:Object.Merge.Unsafe`
|
||||
* Negative test: same request with non-JSON must not hit sink tag.
|
||||
* Fix: safe merge, deny `__proto__` / `constructor` keys.
|
||||
|
||||
---
|
||||
|
||||
### 19) `svc-19-path-traversal-download` (Python/Flask)
|
||||
|
||||
**Purpose:** File traversal with safe, local sentinel proof.
|
||||
**Entrypoint:** `GET /download?file=`
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-22 Path Traversal** — file path concatenated without normalization.
|
||||
|
||||
* Reachability: **R3**
|
||||
* Tags: `TAG:route:GET /download`, `TAG:taint:http.query.file`, `TAG:call:FS.Read.UnsafePath`
|
||||
* Proof test: attempt to read a known sentinel file outside the allowed directory (within container).
|
||||
* Fix: normalize path, enforce base dir constraint.
|
||||
|
||||
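The fix for this service (normalize, then enforce the base directory) can be sketched with `pathlib`. The function name is hypothetical, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path

def resolve_download(base_dir, requested):
    """Resolve a requested file name and refuse anything outside base_dir."""
    base = Path(base_dir).resolve()
    # Joining with an absolute `requested` discards `base`, so the
    # containment check below also rejects absolute paths.
    target = (base / requested).resolve()
    if not target.is_relative_to(base):
        raise PermissionError(f"path escapes base dir: {requested}")
    return target
```

The negative test for the fixed build then asserts that a `../` request raises instead of returning the sentinel file.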
---
|
||||
|
||||
### 20) `svc-20-insecure-tempfile-toctou` (Go or .NET)
|
||||
|
||||
**Purpose:** Concurrency/race category; deterministic reproduction via controlled scheduling.
|
||||
**Entrypoint:** `POST /export` creates temp file and then reopens by name
|
||||
**Vulns:**
|
||||
|
||||
* `V1` **CWE-367 TOCTOU** — uses predictable temp name + separate open.
|
||||
|
||||
* Reachability: **R2** (requires parallel test harness)
|
||||
* Tags: `TAG:route:POST /export`, `TAG:call:FS.TempFile.InsecurePattern`
|
||||
* Proof test: run two coordinated requests; assert race condition triggers sentinel behavior.
|
||||
* Fix: use secure temp APIs + hold open FD; atomic operations.
|
||||
|
||||
---
|
||||
|
||||
## D. Meta repos (not “services” but essential for best-in-class scanning)
|
||||
|
||||
### 21) `svc-21-k8s-misconfigs` (YAML/Helm)
|
||||
|
||||
**Purpose:** IaC scanning; false-positive discipline.
|
||||
**Artifacts:** `manifests/*.yaml`, `helm/Chart.yaml`
|
||||
**Findings to plant:**
|
||||
|
||||
* privileged container, `hostPath`, `runAsUser: 0`, missing resource limits, writable rootfs, wildcard RBAC
|
||||
**Proof:** static assertions in tests (OPA/Conftest or your harness) generate evidence tags like `TAG:iac:k8s.privileged`.
|
||||
|
||||
---
|
||||
|
||||
### 22) `svc-22-docker-multistage-buildonly`
|
||||
|
||||
**Purpose:** Prove build-time-only deps do not ship; prevent scanners from overreporting.
|
||||
**Pattern:** builder stage installs vulnerable tooling; final stage is distroless and excludes it.
|
||||
**Proof:** final image SBOM + `docker export` file list hash; must not include builder artifacts.
|
||||
|
||||
---
|
||||
|
||||
### 23) `svc-23-secrets-fakes-corpus`
|
||||
|
||||
**Purpose:** Secret detection precision/recall without storing real secrets.
|
||||
**Pattern:** files containing **fake** tokens matching common regexes but clearly marked `FAKE_` and useless.
|
||||
**Labels:** must distinguish:
|
||||
|
||||
* `R0 present` fake secret in docs/examples
|
||||
* `R2 reachable` secret injected into runtime env accidentally (then fixed)
|
||||
|
||||
---
|
||||
|
||||
### 24) `svc-24-sbom-mismatch-lab`
|
||||
|
||||
**Purpose:** SBOM validation and drift detection.
|
||||
**Pattern:** generate an SBOM, then change deps without regenerating; label mismatch as a “supply_chain” issue.
|
||||
**Proof:** harness compares `image digest + lockfile hash + sbom hash`.
|
||||
|
||||
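The drift comparison can be sketched as below, assuming the harness stored the three values when the SBOM was generated. The record keys and the function name `sbom_drift` are hypothetical.

```python
import hashlib

def sha256_bytes(data):
    return hashlib.sha256(data).hexdigest()

def sbom_drift(recorded, image_digest, lockfile_bytes, sbom_bytes):
    """Return the names of fields that no longer match the recorded values."""
    current = {
        "image_digest": image_digest,
        "lockfile_sha256": sha256_bytes(lockfile_bytes),
        "sbom_sha256": sha256_bytes(sbom_bytes),
    }
    return sorted(k for k in current if recorded.get(k) != current[k])
```

A changed lockfile hash with an unchanged SBOM hash is exactly the planted mismatch: dependencies moved, the SBOM did not.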
---
|
||||
|
||||
## E. Implementation notes that raise the bar (recommended defaults)
|
||||
|
||||
1. **Each service ships with both**:
|
||||
|
||||
* `tests/test_positive_v*.{py,js,cs}` producing evidence for reachable vulns
|
||||
* `tests/test_negative_v*.{py,js,cs}` proving unreachable claims
|
||||
2. **Every service includes a `fix/` patch** and a CI job that:
|
||||
|
||||
* builds “vuln image”, scans, evaluates
|
||||
* applies fix, rebuilds, re-scans, confirms expected delta
|
||||
3. **Hard-mode toggle per service** (optional but valuable):
|
||||
|
||||
* `MODE=easy`: vuln sits on hot path (for demos)
|
||||
* `MODE=hard`: same vuln behind realistic conditions (auth, header, flag, content-type, async)
|
||||
|
||||
---
|
||||
|
||||
If you want this to be “maximum degree” for scanner R&D, the next step is to add **one additional dimension per service** (fat jar, single-file, distroless, vendored deps, shaded deps, optional extras, transitive only, etc.). I can propose a precise pairing (which dimension goes to which service) so the suite covers all packaging and reachability edge cases without duplication.
|
||||
@@ -0,0 +1,551 @@
|
||||
Here’s a tight, practical blueprint for building (and proving) a fast, evidence‑first triage workflow—plus the power‑user affordances that make Stella Ops feel “snappy” even offline.
|
||||
|
||||
# What “good” looks like (background in plain words)
|
||||
|
||||
* **Alert → evidence → decision** in one flow: an alert should open directly onto the concrete proof (reachability, call‑stack, provenance), then offer a one‑click decision (VEX/CSAF status) with audit logging.
|
||||
* **Time‑to‑First‑Signal (TTFS)** is king: how fast a human sees the first credible piece of evidence that explains *why this alert matters here*.
|
||||
* **Clicks‑to‑Closure**: count how many interactions to reach a defensible decision recorded in the audit log.
|
||||
|
||||
# Minimal evidence bundle per finding
|
||||
|
||||
* **Reachability proof**: function‑level path or package‑level import chain (with “toggle reachability view” hotkey).
|
||||
* **Call‑stack snippet**: 5–10 frames around the sink/source with file:line anchors.
|
||||
* **Provenance**: attestation / DSSE + build ancestry (image → layer → artifact → commit).
|
||||
* **VEX/CSAF status**: affected/not‑affected/under‑investigation + reason.
|
||||
* **Diff**: what changed since last scan (SBOM or VEX delta), rendered as a small, human‑readable “smart‑diff.”
|
||||
|
||||
# KPIs to measure in CI and UI
|
||||
|
||||
* **TTFS (p50/p95)** from alert creation to first rendered evidence.
|
||||
* **Clicks‑to‑Closure (median)** per decision type.
|
||||
* **Evidence completeness score** (0–4): reachability, call‑stack, provenance, VEX/CSAF present.
|
||||
* **Offline friendliness score**: % of evidence resolvable with no network.
|
||||
* **Audit log completeness**: every decision has: evidence hash set, actor, policy context, replay token.
|
||||
|
||||
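Two of these KPIs are simple enough to sketch directly. The percentile uses the nearest-rank method (one of several common definitions), and the bundle keys are hypothetical names for the four evidence kinds counted by the 0–4 score.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

EVIDENCE_KINDS = ("reachability", "call_stack", "provenance", "vex_csaf")

def completeness(bundle):
    """Evidence completeness score: how many of the four kinds are present."""
    return sum(1 for kind in EVIDENCE_KINDS if bundle.get(kind))
```

For TTFS, feed `percentile` the per-alert durations from alert creation to first rendered evidence and report the 50th and 95th values.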
# Power‑user affordances (keyboard first)
|
||||
|
||||
* **Jump to evidence** (`J`): focuses the first incomplete evidence pane.
|
||||
* **Copy DSSE** (`Y`): copies the attestation block or Rekor entry ref.
|
||||
* **Toggle reachability view** (`R`): path list ↔ compact graph ↔ textual proof.
|
||||
* **Search‑within‑graph** (`/`): node/func/package, instant.
|
||||
* **Deterministic sort** (`S`): stable sort by (reachability→severity→age→component) to remove hesitation.
|
||||
* **Quick VEX set** (`A`, `N`, `U`): Affected / Not‑affected / Under‑investigation with templated reasons.
|
||||
|
||||
# UX flow to implement (end‑to‑end)
|
||||
|
||||
1. **Alert row** shows: TTFS timer, reachability badge, “decision state,” and a diff‑dot if something changed.
|
||||
2. **Open alert** lands on **Evidence tab** (not Details). Top strip = three proof pills:
|
||||
|
||||
* Reachability ✓ / Call‑stack ✓ / Provenance ✓ (click to expand inline).
|
||||
3. **Decision drawer** pinned on the right:
|
||||
|
||||
* VEX/CSAF radio (A/N/U) → Reason presets → “Record decision.”
|
||||
* Shows **audit‑ready summary** (hashes, timestamps, policy).
|
||||
4. **Diff tab**: SBOM/VEX delta since last run, grouped by “meaningful risk shift.”
|
||||
5. **Activity tab**: immutable audit log; export as a signed bundle for audits.
|
||||
|
||||
# Graph performance on large call‑graphs
|
||||
|
||||
* **Minimal‑latency snapshots**: pre‑render static PNG/SVG thumbnails server‑side; open with tiny preview then hydrate to interactive graph lazily.
|
||||
* **Progressive neighborhood expansion**: load the 1‑hop neighborhood first and expand on demand; keep the time to the first rendered graph view under 500 ms.
|
||||
* **Stable node ordering**: deterministic layout with consistent anchors to avoid “graph shuffle” anxiety.
|
||||
* **Chunked graph edges** with capped fan‑out; collapse identical library paths into a **reachability macro‑edge**.
|
||||
|
||||
# Offline‑friendly design
|
||||
|
||||
* **Local evidence cache**: store (SBOM slices, path proofs, DSSE attestations, compiled call‑stacks) in a signed bundle beside the SARIF/VEX.
|
||||
* **Deferred enrichment**: mark fields that need internet (e.g., upstream CSAF fetch) and queue a background “enricher” when network returns.
|
||||
* **Predictable fallbacks**: if the provenance server is unreachable, show the embedded DSSE with a “verification pending” label; never show a blank state.
|
||||
|
||||
# Audit & replay
|
||||
|
||||
* **Deterministic replay token**: hash(feed manifests + rules + lattice policy + inputs) → attach to every decision.
|
||||
* **One‑click “Reproduce”**: opens CLI snippet pinned to the exact versions and policies.
|
||||
* **Evidence hash‑set**: content‑address each proof artifact; the audit entry stores only hashes + signer.
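The evidence hash set can be sketched as follows. This is a minimal illustration, assuming SHA-256 as the content-addressing hash and a `sha256:<hex>` address format; the function names are hypothetical and not an existing Stella Ops API.

```python
import hashlib


def content_address(artifact: bytes) -> str:
    """Return a content address of the form 'sha256:<hex>' for raw artifact bytes."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


def evidence_hash_set(artifacts: dict[str, bytes]) -> list[str]:
    """Hash every proof artifact; return unique addresses in sorted order.

    Sorting makes the set deterministic, so the same evidence always
    produces the same list in the audit entry.
    """
    return sorted({content_address(data) for data in artifacts.values()})
```

Because the audit entry stores only these addresses plus the signer, the artifacts themselves can live in the bundle or an object store and be re-verified later by re-hashing.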
|
||||
|
||||
# TTFS & Clicks‑to‑Closure: how to measure in code
|
||||
|
||||
* Emit a `ttfs.start` at alert creation; first paint of any evidence card emits `ttfs.signal`.
|
||||
* Increment a per‑alert **interaction counter**; on “Record decision” emit `close.clicks`.
|
||||
* Log **evidence bitset** (reach, stack, prov, vex) at decision time for completeness scoring.
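One way to encode the evidence bitset and derive the 0–4 completeness score from it is sketched below. The bit positions and names are assumptions chosen for illustration; the source only specifies the four evidence kinds.

```python
from enum import IntFlag


class Evidence(IntFlag):
    """Hypothetical bit assignment for the four evidence kinds."""
    REACH = 1
    STACK = 2
    PROV = 4
    VEX = 8


def evidence_bitset(reach: bool, stack: bool, prov: bool, vex: bool) -> Evidence:
    """Build the bitset logged at decision time."""
    bits = Evidence(0)
    if reach:
        bits |= Evidence.REACH
    if stack:
        bits |= Evidence.STACK
    if prov:
        bits |= Evidence.PROV
    if vex:
        bits |= Evidence.VEX
    return bits


def completeness_score(bits: Evidence) -> int:
    """Completeness score 0-4: the number of evidence kinds present."""
    return bin(int(bits)).count("1")
```

Logging the bitset rather than just the score keeps the telemetry queryable by evidence kind (for example, "decisions recorded without provenance").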
|
||||
|
||||
# Developer tasks (concrete, shippable)
|
||||
|
||||
* **Evidence API**: `GET /alerts/{id}/evidence` returns `{reachability, callstack, provenance, vex, hashes[]}` with deterministic sort.
|
||||
* **Proof renderer**: tiny, no‑framework widget that can render from the offline bundle; hydrate to full only on interaction.
|
||||
* **Keyboard map**: global handler with overlay help (`?`); no collisions; all actions are idempotent.
|
||||
* **Graph service**: server‑side layout + snapshot PNG; client hydrates WebGL only when user expands.
|
||||
* **Smart‑diff**: diff SBOM/VEX → classify into “risk‑raising / neutral / reducing,” surface only the first item by default.
|
||||
* **Audit logger**: append‑only stream; signed checkpoints; export `.stella-audit.tgz` (attestations + JSONL).
|
||||
|
||||
# Benchmarks to run weekly
|
||||
|
||||
* **TTFS under poor network** (100 ms RTT, 1% loss): p95 < 1.5 s to first evidence.
|
||||
* **Graph hydration on 250k‑edge image**: preview < 300 ms, interactive < 2.0 s.
|
||||
* **Keyboard coverage**: ≥90% of triage actions executable without mouse.
|
||||
* **Offline replay**: 100% of decisions re‑render from bundle; zero web calls required.
|
||||
|
||||
# Why Stella’s approach reduces hesitation
|
||||
|
||||
* **Deterministic sort orders** keep findings in place between refreshes.
|
||||
* **Minimal‑latency graph snapshots** show something trustworthy immediately, then refine—no “blank panel” delay.
|
||||
* **Replayable, signed bundles** make every click auditable and reversible, which builds operator confidence.
|
||||
|
||||
If you want, I can turn this into:
|
||||
|
||||
* a **UI checklist** for a design review,
|
||||
* a **.NET 10 API contract** (DTOs + endpoints),
|
||||
* or a **Cypress/Playwright test plan** that measures TTFS and clicks‑to‑closure automatically.
|
||||
Below is a PM‑style implementation guideline you can hand to developers. It’s written as a **build spec**: clear goals, “MUST/SHOULD” requirements, acceptance criteria, and the non‑functional guardrails (performance, offline, auditability) that make triage feel fast and defensible.
|
||||
|
||||
---
|
||||
|
||||
# Stella Ops — Evidence‑First Triage Implementation Guidelines (PM Spec)
|
||||
|
||||
## 0) Assumptions and scope
|
||||
|
||||
**Assumptions**
|
||||
|
||||
* Stella Ops ingests vulnerability findings (SCA/SAST/image scans), has SBOM context, and can compute reachability/call paths.
|
||||
* Triage outcomes must be recorded as VEX/CSAF‑compatible states with reasons and audit trails.
|
||||
* Users may operate in restricted networks and need an offline mode that still shows evidence.
|
||||
|
||||
**In scope**
|
||||
|
||||
* Evidence‑first alert triage UI + APIs + telemetry.
|
||||
* Reachability proof + call stack view + provenance attestation view.
|
||||
* VEX/CSAF decision recording with audit export.
|
||||
* Offline evidence bundle and deterministic replay token.
|
||||
|
||||
**Out of scope (for this phase)**
|
||||
|
||||
* Building the underlying static analyzer or SBOM generator (we consume their outputs).
|
||||
* Full CSAF publishing workflow (we store and export; publishing is separate).
|
||||
* Remediation automation (PRs, patching).
|
||||
|
||||
---
|
||||
|
||||
## 1) Product principles (non‑negotiables)
|
||||
|
||||
1. **Evidence before detail**
|
||||
Opening an alert **MUST** show the best available evidence immediately (even partial/placeholder), not a generic “details” page.
|
||||
2. **Fast first signal**
|
||||
The UI **MUST** render a credible “first signal” quickly (reachability badge, call stack snippet, or provenance block).
|
||||
3. **Determinism reduces hesitation**
|
||||
Sorting, graphs, and diffs **MUST** be stable across refreshes. No jittery re-layout.
|
||||
4. **Offline by design**
|
||||
If evidence exists locally (bundle), the UI **MUST** render it without network access.
|
||||
5. **Audit-ready by default**
|
||||
Every decision **MUST** be reproducible, attributable, and exportable with evidence hashes.
|
||||
|
||||
---
|
||||
|
||||
## 2) Success metrics (what we ship toward)
|
||||
|
||||
These become acceptance criteria and dashboards.
|
||||
|
||||
### Primary metrics (P0)
|
||||
|
||||
* **TTFS (Time‑to‑First‑Signal)**: p95 < **1.5s** from opening an alert to first evidence card rendering (with 100ms RTT, 1% loss simulation).
|
||||
* **Clicks‑to‑Closure**: median < **6** interactions to record a VEX decision.
|
||||
* **Evidence completeness** at decision time: ≥ **90%** of decisions include evidence hash set + reason + replay token.
|
||||
|
||||
### Secondary metrics (P1)
|
||||
|
||||
* **Offline resolution rate**: ≥ **95%** of alerts opened with a local bundle show reachability + provenance without network.
|
||||
* **Graph usability**: preview render < **300ms**, interactive hydration < **2.0s** for large graphs (see §7).
|
||||
|
||||
---
|
||||
|
||||
## 3) User workflows and “Definition of Done”
|
||||
|
||||
### Workflow A: Triage an alert to a decision
|
||||
|
||||
**DoD**: user can open an alert, see evidence, set VEX state, and the system records a signed/auditable decision event.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. Alert list shows key signals (reachability badge, decision state, diff indicator).
|
||||
2. Open alert → Evidence view loads first.
|
||||
3. User reviews reachability/call stack/provenance.
|
||||
4. User sets VEX status + reason preset (editable).
|
||||
5. User records decision.
|
||||
6. Audit log entry appears instantly and is exportable.
|
||||
|
||||
### Workflow B: Explain “why is this flagged?”
|
||||
|
||||
**DoD**: user can show a defensible proof (path/call stack/provenance) and copy it into a ticket.
|
||||
|
||||
---
|
||||
|
||||
## 4) UI requirements (MUST/SHOULD/MAY)
|
||||
|
||||
### 4.1 Alert list page
|
||||
|
||||
**MUST**
|
||||
|
||||
* Each row includes:
|
||||
|
||||
* Severity + component identifier
|
||||
* **Decision state** (Unset / Under Investigation / Not Affected / Affected)
|
||||
* **Reachability badge** (Reachable / Not Reachable / Unknown) where available
|
||||
* **Diff indicator** if SBOM/VEX changed since last scan (simple dot/label)
|
||||
* Age / first seen / last updated
|
||||
* **Deterministic sort** default:
|
||||
`Reachability DESC → Severity DESC → Decision state (Unset first) → Age DESC → Component name ASC`
|
||||
* Keyboard navigation:
|
||||
|
||||
* `↑/↓` move selection, `Enter` open alert.
|
||||
* `/` search/filter focus.
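The default deterministic sort can be expressed as a single composite key. The sketch below is illustrative only: the rank tables, the field names, and the final `alert_id` tie-breaker are assumptions added so that no two rows ever compare equal.

```python
from dataclasses import dataclass

# Assumed rankings; higher sorts first under DESC.
REACH_RANK = {"reachable": 2, "unknown": 1, "not_reachable": 0}
SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1}


@dataclass(frozen=True)
class AlertRow:
    alert_id: str
    reachability: str
    severity: str
    decision_state: str  # "unset", "under_investigation", "not_affected", "affected"
    age_days: int
    component: str


def sort_key(row: AlertRow) -> tuple:
    """Reachability DESC, Severity DESC, Unset first, Age DESC, Component ASC."""
    return (
        -REACH_RANK[row.reachability],
        -SEVERITY_RANK[row.severity],
        0 if row.decision_state == "unset" else 1,
        -row.age_days,
        row.component,
        row.alert_id,  # assumed tie-breaker to make the order total
    )


def sort_alerts(rows: list[AlertRow]) -> list[AlertRow]:
    """Return rows in the default order, independent of input order."""
    return sorted(rows, key=sort_key)
```

With a total order, the result does not depend on arrival order, which is what keeps rows in place between refreshes.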
|
||||
|
||||
**SHOULD**
|
||||
|
||||
* Inline “quick set” decision menu (Affected / Not affected / Under investigation) for obvious cases, usable without leaving the list; it still requires a reason and logs evidence hashes.
|
||||
|
||||
### 4.2 Alert detail: landing tab MUST be Evidence
|
||||
|
||||
**MUST**
|
||||
|
||||
* Default landing is **Evidence** (not “Overview”).
|
||||
* Top section shows 3 “proof pills” with status:
|
||||
|
||||
* Reachability (✓ / ! / …)
|
||||
* Call stack (✓ / ! / …)
|
||||
* Provenance (✓ / ! / …)
|
||||
* Each pill expands inline (no navigation) into a compact evidence panel.
|
||||
|
||||
**MUST: No blank panels**
|
||||
|
||||
* If evidence is loading, show skeleton + “what’s coming.”
|
||||
* If evidence is missing, show a reason (“not computed”, “requires source map”, “offline – enrichment pending”).
|
||||
|
||||
### 4.3 Decision drawer
|
||||
|
||||
**MUST**
|
||||
|
||||
* Pinned right drawer (or persistent bottom sheet on small screens).
|
||||
* Controls:
|
||||
|
||||
* VEX/CSAF status: **Affected / Not affected / Under investigation**
|
||||
* Reason preset dropdown + editable reason text
|
||||
* “Record decision” button
|
||||
* Preview “Audit summary” before submit:
|
||||
|
||||
* Evidence hashes included
|
||||
* Policy context (ruleset version)
|
||||
* Replay token
|
||||
* Actor identity
|
||||
|
||||
**MUST**
|
||||
|
||||
* On submit, create an append-only audit event and immediately reflect status in UI.
|
||||
|
||||
**SHOULD**
|
||||
|
||||
* Allow attaching references: ticket URL, incident ID, PR link (stored as metadata).
|
||||
|
||||
### 4.4 Diff tab
|
||||
|
||||
**MUST**
|
||||
|
||||
* Show delta since last scan:
|
||||
|
||||
* SBOM diffs (component version changes, removals/additions)
|
||||
* VEX diffs (status changes)
|
||||
* Group diffs by **risk shift**:
|
||||
|
||||
* Risk‑raising (new reachable vuln, severity increase)
|
||||
* Neutral (metadata-only)
|
||||
* Risk‑reducing (fixed version, reachability removed)
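Grouping by risk shift can be sketched as a lookup from change kind to group. The `kind` strings below are hypothetical labels derived from the examples in the list above; treating unrecognized kinds as risk-raising is also an assumption, made so that unknown changes are surfaced rather than hidden.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffItem:
    kind: str       # hypothetical change label, see RISK_SHIFT
    component: str


RISK_SHIFT = {
    "new_reachable_vuln": "risk_raising",
    "severity_increase": "risk_raising",
    "metadata_only": "neutral",
    "fixed_version": "risk_reducing",
    "reachability_removed": "risk_reducing",
}


def group_by_risk_shift(items: list[DiffItem]) -> dict[str, list[DiffItem]]:
    """Bucket diff items into the three risk-shift groups, in stable order."""
    groups: dict[str, list[DiffItem]] = {
        "risk_raising": [],
        "neutral": [],
        "risk_reducing": [],
    }
    for item in sorted(items, key=lambda i: (i.kind, i.component)):
        # Unknown kinds default to risk_raising (assumption: fail loud).
        groups[RISK_SHIFT.get(item.kind, "risk_raising")].append(item)
    return groups
```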
|
||||
|
||||
**SHOULD**
|
||||
|
||||
* Provide “Copy diff summary” for change management.
|
||||
|
||||
### 4.5 Activity/Audit tab
|
||||
|
||||
**MUST**
|
||||
|
||||
* Immutable timeline of decisions and evidence changes.
|
||||
* Each entry includes:
|
||||
|
||||
* actor, timestamp, decision, reason
|
||||
* evidence hash set
|
||||
* replay token
|
||||
* bundle/export availability
|
||||
|
||||
---
|
||||
|
||||
## 5) Power-user and accessibility requirements
|
||||
|
||||
### Keyboard shortcuts (MUST)
|
||||
|
||||
* `J`: jump to next missing/incomplete evidence panel
|
||||
* `R`: toggle reachability view (list ↔ compact graph ↔ textual proof)
|
||||
* `Y`: copy selected evidence block (call stack / DSSE / path proof)
|
||||
* `A`: set “Affected” (opens reason preset selection)
|
||||
* `N`: set “Not affected”
|
||||
* `U`: set “Under investigation”
|
||||
* `?`: keyboard help overlay
|
||||
|
||||
### Accessibility (MUST)
|
||||
|
||||
* Fully navigable by keyboard
|
||||
* Visible focus states
|
||||
* Screen-reader labels for evidence pills and drawer controls
|
||||
* Color is never the only signal (badges must have text/icon)
|
||||
|
||||
---
|
||||
|
||||
## 6) Evidence model: what every alert should attempt to provide
|
||||
|
||||
Treat this as the **minimum evidence bundle**. Each item may be “unavailable,” but must be explicit.
|
||||
|
||||
**MUST** support:
|
||||
|
||||
1. **Reachability proof**
|
||||
|
||||
* At least one of:
|
||||
|
||||
* function-level call path: `entry → … → vulnerable_sink`
|
||||
* package/module import chain
|
||||
* Includes confidence/algorithm tag: `static`, `dynamic`, `heuristic`
|
||||
2. **Call stack snippet**
|
||||
|
||||
* 5–10 frames around the relevant node with file:line anchors where possible
|
||||
3. **Provenance**
|
||||
|
||||
* DSSE attestation or equivalent statement
|
||||
* Artifact ancestry chain: image → layer → artifact → commit (as available)
|
||||
* Verification status: verified / pending / failed (with reason)
|
||||
4. **Decision state**
|
||||
|
||||
* VEX status + reason + timestamps
|
||||
5. **Evidence hash set**
|
||||
|
||||
* Content-addressed hashes of each evidence artifact included in the decision
|
||||
|
||||
**SHOULD**
|
||||
|
||||
* “Evidence freshness”: when computed, tool version, input revisions.
|
||||
|
||||
---
|
||||
|
||||
## 7) Performance and graph rendering requirements
|
||||
|
||||
### TTFS budget (MUST)
|
||||
|
||||
* When opening an alert:
|
||||
|
||||
* **<200ms**: show skeleton and cached row metadata
|
||||
* **<500ms**: render at least one evidence pill with meaningful content OR a cached preview image
|
||||
* **<1.5s p95**: render reachability + provenance for typical alerts
|
||||
|
||||
### Graph rendering for large call graphs (MUST)
|
||||
|
||||
* **Two-phase rendering**
|
||||
|
||||
1. Server-generated **static snapshot** (PNG/SVG) displayed immediately
|
||||
2. Interactive graph hydrates lazily on user expand
|
||||
* **Progressive expansion**
|
||||
|
||||
* Load 1-hop neighborhood first; expand on click
|
||||
* **Deterministic layout**
|
||||
|
||||
* Same input produces same layout anchors (no reshuffles between refreshes)
|
||||
* **Fan-out control**
|
||||
|
||||
* Collapse repeated library paths into “macro edges” to keep the graph readable
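Progressive expansion and fan-out control can be illustrated on a plain adjacency map. This is a sketch under stated assumptions: node ids are taken to have the form `package::function`, the default fan-out cap of 25 is arbitrary, and both function names are hypothetical.

```python
from collections import Counter


def one_hop(adjacency: dict[str, list[str]], node: str,
            max_fanout: int = 25) -> tuple[list[str], int]:
    """Return (visible neighbors, hidden count) for the 1-hop neighborhood.

    Neighbors are deduplicated and sorted so the same input always
    yields the same visible subset.
    """
    neighbors = sorted(set(adjacency.get(node, [])))
    visible = neighbors[:max_fanout]
    return visible, len(neighbors) - len(visible)


def macro_edges(adjacency: dict[str, list[str]], node: str) -> list[tuple[str, int]]:
    """Collapse edges from `node` into one macro edge per target package.

    Returns (package, edge_count) pairs in sorted order.
    """
    counts = Counter(target.split("::", 1)[0] for target in adjacency.get(node, []))
    return sorted(counts.items())
```

The hidden count lets the UI show a "+N more" affordance instead of silently dropping edges.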
|
||||
|
||||
---
|
||||
|
||||
## 8) Offline mode requirements
|
||||
|
||||
Offline is not “nice to have”; it is a defined mode.
|
||||
|
||||
### Offline evidence bundle (MUST)
|
||||
|
||||
* A single file (e.g., `.stella.bundle.tgz`) that contains:
|
||||
|
||||
* Alert metadata snapshot
|
||||
* Evidence artifacts (reachability proofs, call stacks, provenance attestations)
|
||||
* SBOM slice(s) necessary for diffs
|
||||
* VEX decision history (if available)
|
||||
* Manifest listing a content hash for every file in the bundle (optionally arranged as a Merkle tree)
|
||||
* Bundle must be **signed** (or include signature material) and verifiable.
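The manifest check for a bundle can be sketched as below, assuming the manifest is a flat mapping from file path to SHA-256 hex digest. Signature verification over the manifest itself is deliberately out of scope here, and the function names are hypothetical.

```python
import hashlib


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each bundle path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in sorted(files.items())}


def verify_bundle(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return a sorted list of problems; an empty list means the bundle matches."""
    problems = []
    for path in sorted(set(files) | set(manifest)):
        if path not in manifest:
            problems.append(f"unlisted file: {path}")
        elif path not in files:
            problems.append(f"missing file: {path}")
        elif hashlib.sha256(files[path]).hexdigest() != manifest[path]:
            problems.append(f"hash mismatch: {path}")
    return problems
```

In practice the signature would cover the manifest, so verifying the signature plus this hash check covers every file in the bundle.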
|
||||
|
||||
### UI behavior (MUST)
|
||||
|
||||
* If bundle is present:
|
||||
|
||||
* UI loads evidence from it first
|
||||
* Any missing items show “enrichment pending” (not “error”)
|
||||
* If network returns:
|
||||
|
||||
* Background refresh allowed, but **must not reorder** the alert list unexpectedly
|
||||
* Must surface “updated evidence available” as a user-controlled refresh, not an auto-switch that changes context mid-triage
|
||||
|
||||
---
|
||||
|
||||
## 9) Auditability and replay requirements
|
||||
|
||||
### Decision event schema (MUST)
|
||||
|
||||
Every recorded decision must store:
|
||||
|
||||
* `alert_id`, `artifact_id` (image digest or commit hash)
|
||||
* `actor_id`, `timestamp`
|
||||
* `decision_status` (Affected/Not affected/Under investigation)
|
||||
* `reason_code` (preset) + `reason_text`
|
||||
* `evidence_hashes[]` (content-addressed hashes)
|
||||
* `policy_context` (ruleset version, policy id)
|
||||
* `replay_token` (hash of inputs needed to reproduce)
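The schema above maps naturally onto an immutable record serialized as one JSONL line. The sketch below uses the field names from the list; the status spellings, the validation, and the serialization settings are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_STATUSES = {"affected", "not_affected", "under_investigation"}


@dataclass(frozen=True)
class DecisionEvent:
    alert_id: str
    artifact_id: str            # image digest or commit hash
    actor_id: str
    timestamp: str              # assumed ISO 8601 UTC string
    decision_status: str
    reason_code: str
    reason_text: str
    evidence_hashes: tuple[str, ...]
    policy_context: dict
    replay_token: str

    def __post_init__(self) -> None:
        if self.decision_status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown decision_status: {self.decision_status}")


def to_jsonl_line(event: DecisionEvent) -> str:
    """Serialize one event as a single compact JSON line with sorted keys."""
    return json.dumps(asdict(event), sort_keys=True, separators=(",", ":"))
```

Sorted keys and compact separators keep the serialized form byte-stable, which matters if the line is later hashed or signed.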
|
||||
|
||||
### Replay token (MUST)
|
||||
|
||||
* Deterministic hash of:
|
||||
|
||||
* scan inputs (SBOM digest, image digest, tool versions)
|
||||
* policy/rules versions
|
||||
* reachability algorithm version
|
||||
* “Reproduce” button produces a CLI snippet (copyable) pinned to these versions.
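A deterministic replay token can be sketched by hashing a canonical JSON encoding of the inputs listed above. The parameter names, the canonicalization (sorted keys, compact separators), and the `sha256:` prefix are assumptions; any stable canonical form would serve.

```python
import hashlib
import json


def replay_token(scan_inputs: dict, policy_versions: dict,
                 reachability_algorithm_version: str) -> str:
    """Hash the inputs needed to reproduce a decision.

    Canonical JSON makes the token independent of dict key order,
    so equal inputs always give an equal token.
    """
    canonical = json.dumps(
        {
            "scan_inputs": scan_inputs,
            "policy_versions": policy_versions,
            "reachability_algorithm_version": reachability_algorithm_version,
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any change to a tool version, policy version, or digest yields a different token, which is what lets the “Reproduce” snippet pin exact versions.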
|
||||
|
||||
### Export (MUST)
|
||||
|
||||
* Exportable audit bundle that includes:
|
||||
|
||||
* JSONL of decision events
|
||||
* evidence artifacts referenced by hashes
|
||||
* signatures/attestations
|
||||
* Export must be stable and verifiable later.
|
||||
|
||||
---
|
||||
|
||||
## 10) API and data contract guidelines (developer-facing)
|
||||
|
||||
This is an implementation guideline, not a full API spec—keep it simple and cache-friendly.
|
||||
|
||||
### MUST endpoints (or equivalent)
|
||||
|
||||
* `GET /alerts?filters…` → list view payload (small, cacheable)
|
||||
* `GET /alerts/{id}/evidence` → evidence payload (reachability, call stack, provenance, hashes)
|
||||
* `POST /alerts/{id}/decisions` → record decision event (append-only)
|
||||
* `GET /alerts/{id}/audit` → audit timeline
|
||||
* `GET /alerts/{id}/diff?baseline=…` → SBOM/VEX diff view
|
||||
* `GET /bundles/{id}` and/or `POST /bundles/verify` → offline bundle download/verify
|
||||
|
||||
### Evidence payload guidelines (MUST)
|
||||
|
||||
* Deterministic ordering for arrays and nodes (stable sorts).
|
||||
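* Explicit `status` per evidence section: `available | loading | pending | unavailable | error` (`pending` means the content is present but verification or enrichment is still outstanding).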
|
||||
* Include `hash` per artifact for content addressing.
|
||||
|
||||
**Example shape**
|
||||
|
||||
```json
|
||||
{
|
||||
"alert_id": "a123",
|
||||
"reachability": { "status": "available", "hash": "sha256:…", "proof": { "type": "call_path", "nodes": [...] } },
|
||||
"callstack": { "status": "available", "hash": "sha256:…", "frames": [...] },
|
||||
"provenance": { "status": "pending", "hash": null, "dsse": { "embedded": true, "payload": "…" } },
|
||||
"vex": { "status": "available", "current": {...}, "history": [...] },
|
||||
"hashes": ["sha256:…", "sha256:…"]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 11) Telemetry requirements (how we prove it’s fast)
|
||||
|
||||
**MUST** instrument:
|
||||
|
||||
* `alert_opened` (timestamp, alert_id)
|
||||
* `evidence_first_paint` (timestamp, evidence_type)
|
||||
* `decision_recorded` (timestamp, clicks_count, evidence_bitset)
|
||||
* `bundle_loaded` (hit/miss, size, verification_status)
|
||||
* `graph_preview_paint` and `graph_hydrated`
|
||||
|
||||
**MUST** compute:
|
||||
|
||||
* TTFS = `evidence_first_paint - alert_opened`
|
||||
* Clicks‑to‑Closure = interaction counter per alert until decision recorded
|
||||
* Evidence completeness bitset at decision time: reachability/callstack/provenance/vex present
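The TTFS computation and its p50/p95 aggregation can be sketched as follows. Timestamps are assumed to be milliseconds, and the nearest-rank percentile method is one reasonable choice, not something the source prescribes.

```python
import math


def ttfs_ms(alert_opened_ms: int, evidence_first_paint_ms: int) -> int:
    """TTFS = evidence_first_paint - alert_opened, in milliseconds."""
    return evidence_first_paint_ms - alert_opened_ms


def percentile(samples: list[int], p: int) -> int:
    """Nearest-rank percentile (p in 1..100) over a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, given ten TTFS samples of 100, 200, ..., 1000 ms, `percentile(samples, 50)` is 500 and `percentile(samples, 95)` is 1000, so a single slow sample out of ten is enough to move p95.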
|
||||
|
||||
---
|
||||
|
||||
## 12) Error handling and edge cases
|
||||
|
||||
**MUST**
|
||||
|
||||
* Never show empty states without explanation.
|
||||
* Distinguish between:
|
||||
|
||||
* “not computed yet”
|
||||
* “not possible due to missing inputs”
|
||||
* “blocked by permissions”
|
||||
* “offline—enrichment pending”
|
||||
* “verification failed”
|
||||
|
||||
**SHOULD**
|
||||
|
||||
* Offer “Request enrichment” action when evidence missing (creates a job/task id).
|
||||
|
||||
---
|
||||
|
||||
## 13) Security, permissions, and multi-tenancy
|
||||
|
||||
**MUST**
|
||||
|
||||
* RBAC gating for:
|
||||
|
||||
* viewing provenance attestations
|
||||
* recording decisions
|
||||
* exporting audit bundles
|
||||
* All decision events are immutable; corrections are new events (append-only).
|
||||
* PII handling:
|
||||
|
||||
* Avoid storing secrets in freeform reason text; warn on paste patterns that look like credentials (optional P1).
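The "corrections are new events" rule can be illustrated with a small fold over the event log. This sketch assumes events are dicts carrying an `alert_id` and that list order is append order; the helper names are hypothetical.

```python
def record_correction(events: list[dict], correction: dict) -> list[dict]:
    """Append a correction as a new event; earlier events are never modified."""
    return [*events, correction]


def current_decisions(events: list[dict]) -> dict[str, dict]:
    """Derive the current decision per alert: the latest event wins."""
    current: dict[str, dict] = {}
    for event in events:
        current[event["alert_id"]] = event
    return current
```

The current state is always derived from the log, so the full history remains available for audit export.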
|
||||
|
||||
---
|
||||
|
||||
## 14) Engineering execution plan (priorities)
|
||||
|
||||
### P0 (ship first)
|
||||
|
||||
* Evidence-first alert detail landing
|
||||
* Decision drawer + append-only audit
|
||||
* Deterministic alert list sort + reachability badge
|
||||
* Evidence API + decision POST
|
||||
* TTFS + clicks telemetry
|
||||
* Static graph preview + lazy hydration
|
||||
|
||||
### P1
|
||||
|
||||
* Offline bundle load/verify + offline rendering
|
||||
* Smart diff view (risk shift grouping)
|
||||
* Exportable audit bundle
|
||||
* Keyboard shortcuts + help overlay
|
||||
|
||||
### P2
|
||||
|
||||
* Inline quick decisions from list
|
||||
* Advanced graph search within view
|
||||
* Suggest reason presets based on evidence patterns
|
||||
|
||||
---
|
||||
|
||||
## 15) Acceptance criteria checklist (what QA signs off)
|
||||
|
||||
A build is acceptable when:
|
||||
|
||||
* Opening an alert renders at least one evidence pill within **500ms** (with cache) and TTFS p95 meets target under network simulation.
|
||||
* Users can record A/N/U decisions with reason and see an audit event immediately.
|
||||
* Decision event includes evidence hashes + replay token.
|
||||
* Alert list sorting is stable and deterministic across refresh.
|
||||
* Graph preview appears within the 300ms preview budget; interactive graph hydrates only on expand.
|
||||
* Offline bundle renders evidence without network; missing items show “enrichment pending,” not errors.
|
||||
* Keyboard shortcuts work; `?` overlay lists them; full keyboard navigation is possible.
|
||||
|
||||
---
|
||||
|
||||
If you want, I can also format this into a **developer-ready ticket pack** (epics + user stories + acceptance tests) so engineers can implement without interpretation drift.
|
||||
@@ -0,0 +1,544 @@
|
||||
Here’s a quick, practical cheat‑sheet on choosing **PostgreSQL vs MongoDB** for security/DevOps apps—plus how I’d model SBOM/VEX and queues in Stella Ops without adding moving parts.
|
||||
|
||||
---
|
||||
|
||||
# PostgreSQL you can lean on (why it often wins for ops apps)
|
||||
|
||||
* **JSONB that flies:** Store documents yet query like SQL. Add **GIN indexes** on JSONB fields for fast lookups (`jsonb_ops` general; `jsonb_path_ops` great for `@>` containment).
|
||||
* **Queue pattern built‑in:** `SELECT … FOR UPDATE SKIP LOCKED` lets multiple workers pop jobs from the same table safely—no head‑of‑line blocking, no extra broker.
|
||||
* **Cooperative locks:** **Advisory locks** (session- or transaction-scoped) for mutually exclusive “one at a time” sections or leader election.
|
||||
* **Lightweight pub/sub:** **LISTEN/NOTIFY** for async nudges between services (poke a worker to re‑scan, refresh cache, etc.).
|
||||
* **Search included:** **Full‑text search** (tsvector/tsquery) is native—no separate search service for moderate needs.
|
||||
* **Serious backups:** **PITR** with WAL archiving / `pg_basebackup` for deterministic rollbacks and offline bundles.
|
||||
|
||||
# MongoDB facts to factor in
|
||||
|
||||
* **Flexible ingest:** Schemaless docs make it easy to absorb varied telemetry and vendor feeds.
|
||||
* **Horizontal scale:** Sharding is mature for huge, read‑heavy datasets.
|
||||
* **Consistency is a choice:** Design embedding vs refs and when to use multi‑document transactions.
|
||||
|
||||
---
|
||||
|
||||
# A simple rule of thumb (Stella Ops‑style)
|
||||
|
||||
* **System of record:** PostgreSQL (JSONB first).
|
||||
* **Hot paths:** Materialized views + JSONB GIN indexes.
|
||||
* **Queues & coordination:** PostgreSQL (skip‑locked + advisory locks).
|
||||
* **Cache/accel only:** Valkey (ephemeral).
|
||||
* **MongoDB:** Optional for **very large, read‑optimized graph snapshots** (e.g., periodically baked reachability graphs) if Postgres starts to strain.
|
||||
|
||||
---
|
||||
|
||||
# Concrete patterns you can drop in today
|
||||
|
||||
**1) SBOM/VEX storage (Postgres JSONB)**
|
||||
|
||||
```sql
|
||||
-- Documents
|
||||
CREATE TABLE sbom (
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
artifact_purl TEXT NOT NULL,
|
||||
doc JSONB NOT NULL,
|
||||
created_at TIMESTAMPTZ DEFAULT now()
|
||||
);
|
||||
CREATE INDEX sbom_purl_idx ON sbom(artifact_purl);
|
||||
CREATE INDEX sbom_doc_gin ON sbom USING GIN (doc jsonb_path_ops);
|
||||
|
||||
-- Common queries
|
||||
-- find components by name/version:
|
||||
-- SELECT * FROM sbom WHERE doc @> '{"components":[{"name":"openssl","version":"3.0.14"}]}';
|
||||
|
||||
-- VEX
|
||||
CREATE TABLE vex (
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
subject_purl TEXT NOT NULL,
|
||||
vex_doc JSONB NOT NULL,
|
||||
created_at TIMESTAMPTZ DEFAULT now()
|
||||
);
|
||||
CREATE INDEX vex_subject_idx ON vex(subject_purl);
|
||||
CREATE INDEX vex_doc_gin ON vex USING GIN (vex_doc jsonb_path_ops);
|
||||
```
|
||||
|
||||
**2) Hot reads via materialized views**
|
||||
|
||||
```sql
|
||||
CREATE MATERIALIZED VIEW mv_open_findings AS
|
||||
SELECT
|
||||
s.artifact_purl,
|
||||
c->>'name' AS comp,
|
||||
c->>'version' AS ver,
|
||||
v.vex_doc
|
||||
FROM sbom s
|
||||
CROSS JOIN LATERAL jsonb_array_elements(s.doc->'components') c
|
||||
LEFT JOIN vex v ON v.subject_purl = s.artifact_purl
|
||||
-- add WHERE clauses to pre‑filter only actionable rows
|
||||
;
|
||||
CREATE INDEX mv_open_findings_idx ON mv_open_findings(artifact_purl, comp);
|
||||
```
|
||||
|
||||
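Refresh cadence: on feed import or via a scheduler, using `REFRESH MATERIALIZED VIEW CONCURRENTLY mv_open_findings;`. Note that `CONCURRENTLY` requires at least one `UNIQUE` index on the view that uses only column names and covers all rows (no expressions, no `WHERE` clause); the non-unique index above is not sufficient on its own.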
|
||||
|
||||
**3) Queue without a broker**
|
||||
|
||||
```sql
|
||||
CREATE TABLE job_queue(
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
kind TEXT NOT NULL, -- e.g., 'scan', 'sbom-diff'
|
||||
payload JSONB NOT NULL,
|
||||
run_after TIMESTAMPTZ DEFAULT now(),
|
||||
attempts INT DEFAULT 0,
|
||||
locked_at TIMESTAMPTZ,
|
||||
locked_by TEXT
|
||||
);
|
||||
CREATE INDEX job_queue_ready_idx ON job_queue(kind, run_after) WHERE locked_at IS NULL;
|
||||
|
||||
-- Worker loop
|
||||
WITH cte AS (
|
||||
SELECT id FROM job_queue
|
||||
WHERE kind = $1 AND run_after <= now() AND locked_at IS NULL
|
||||
ORDER BY id
|
||||
FOR UPDATE SKIP LOCKED
|
||||
LIMIT 1
|
||||
)
|
||||
UPDATE job_queue j
|
||||
SET locked_at = now(), locked_by = $2
|
||||
FROM cte
|
||||
WHERE j.id = cte.id
|
||||
RETURNING j.*;
|
||||
```
|
||||
|
||||
On failure, release the job by setting `locked_at = NULL, locked_by = NULL, attempts = attempts + 1`; on success, delete the row.
|
||||
|
||||
**4) Advisory lock for singletons**
|
||||
|
||||
```sql
|
||||
-- Acquire (per tenant, per artifact)
|
||||
SELECT pg_try_advisory_xact_lock(hashtextextended('recalc:'||tenant||':'||artifact, 0));
|
||||
```
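If the lock key is computed in application code rather than with `hashtextextended`, a signed 64-bit key can be derived as sketched below. This is an alternative derivation, not the same hash as the SQL above, so every caller competing for the same lock must use the same method; the function name is hypothetical.

```python
import hashlib


def advisory_lock_key(tenant: str, artifact: str) -> int:
    """Derive a signed 64-bit key suitable for a bigint advisory lock argument.

    Takes the first 8 bytes of SHA-256 over 'recalc:<tenant>:<artifact>'
    and interprets them as a signed big-endian integer.
    """
    digest = hashlib.sha256(f"recalc:{tenant}:{artifact}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```

The resulting integer can be passed as a bound parameter to `pg_try_advisory_xact_lock`, which accepts a single `bigint` key.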
|
||||
|
||||
**5) Nudge workers without a bus**
|
||||
|
||||
```sql
|
||||
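-- NOTIFY accepts only a literal payload; pg_notify() takes computed or parameterized values
SELECT pg_notify('stella_scan', json_build_object('purl', $1::text, 'priority', 5)::text);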
|
||||
-- workers LISTEN stella_scan and enqueue quickly
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# When to add MongoDB
|
||||
|
||||
* You need **interactive exploration** over **hundreds of millions of nodes/edges** (e.g., historical “proof‑of‑integrity” graphs) where document fan‑out and denormalized reads beat relational joins.
|
||||
* Snapshot cadence is **batchy** (hourly/daily), and you can **re‑emit** snapshots deterministically from Postgres (single source of truth).
|
||||
* You want to isolate read spikes from the transactional core.
|
||||
|
||||
**Snapshot pipe:** Postgres → (ETL) → MongoDB collection `{graph_id, node, edges[], attrs}` with **compound shard keys** tuned to your UI traversal.
|
||||
|
||||
---
|
||||
|
||||
# Why this fits Stella Ops
|
||||
|
||||
* Fewer moving parts on‑prem/air‑gapped.
|
||||
* Deterministic replays (PITR + immutable imports).
|
||||
* Clear performance levers (GIN indexes, MVs, skip‑locked queues).
|
||||
* MongoDB stays optional, purpose‑built for giant read graphs—not a default dependency.
|
||||
|
||||
If you want, I can turn the above into ready‑to‑run `.sql` migrations and a small **.NET 10** worker (Dapper/EF Core) that implements the queue loop + advisory locks + LISTEN/NOTIFY hooks.
|
||||
Below is a handoff-ready set of **PostgreSQL tables/views engineering guidelines** intended for developer review. It is written as a **gap-finding checklist** with **concrete DDL patterns** and **performance red flags** (Postgres as system of record, JSONB where useful, derived projections where needed).
|
||||
|
||||
---
|
||||
|
||||
# PostgreSQL Tables & Views Engineering Guide
|
||||
|
||||
## 0) Non-negotiable principles
|
||||
|
||||
1. **Every hot query must have an index story.** If you cannot name the index that serves it, you have a performance gap.
|
||||
2. **Write path stays simple.** Prefer **append-only** versioning to large updates (especially for JSONB).
|
||||
3. **Multi-tenant must be explicit.** Every core table includes `tenant_id` and indexes are tenant-prefixed.
|
||||
4. **Derived data is a product.** If the UI needs it fast, model it as a **projection table or materialized view**, not as an ad-hoc mega-join.
|
||||
5. **Idempotency is enforced in the DB.** Unique keys for imports/jobs/results; no “best effort” dedupe in application only.
|
||||
|
||||
---
|
||||
|
||||
# 1) Table taxonomy and what to look for
|
||||
|
||||
Use this to classify every table; each class has different indexing/retention/locking rules.
|
||||
|
||||
### A. Source-of-truth (SOR) tables
|
||||
|
||||
Examples: `sbom_document`, `vex_document`, `feed_import`, `scan_manifest`, `attestation`.
|
||||
|
||||
* **Expect:** immutable rows, versioning via new row inserts.
|
||||
* **Gaps:** frequent updates to large JSONB; missing `content_hash`; no unique idempotency key.
|
||||
|
||||
### B. Projection tables (query-optimized)
|
||||
|
||||
Examples: `open_findings`, `artifact_risk_summary`, `component_index`.
|
||||
|
||||
* **Expect:** denormalized, indexed for UI/API; refresh/update strategy defined.
|
||||
* **Gaps:** projections rebuilt from scratch too often; missing incremental update plan; no retention plan.
|
||||
|
||||
### C. Queue/outbox tables
|
||||
|
||||
Examples: `job_queue`, `outbox_events`.
|
||||
|
||||
* **Expect:** `SKIP LOCKED` claim pattern; retry + DLQ; minimal lock duration.
|
||||
* **Gaps:** holding row locks while doing work; missing partial index for “ready” jobs.
|
||||
|
||||
### D. Audit/event tables
|
||||
|
||||
Examples: `scan_run_event`, `decision_event`, `access_audit`.
|
||||
|
||||
* **Expect:** append-only; partitioned by time; BRIN on timestamps.
|
||||
* **Gaps:** single huge table without partitioning; slow deletes instead of partition drops.
|
||||
|
||||
---
|
||||
|
||||
# 2) Naming, keys, and required columns
|
||||
|
||||
## Required columns per class
|
||||
|
||||
### SOR documents (SBOM/VEX/Attestations)
|
||||
|
||||
* `tenant_id uuid`
|
||||
* `id bigserial` (internal PK)
|
||||
* `external_id uuid` (optional API-facing id)
|
||||
* `content_hash bytea` (sha256) **NOT NULL**
|
||||
* `doc jsonb` **NOT NULL**
|
||||
* `created_at timestamptz` **NOT NULL default now()**
|
||||
* `supersedes_id bigint NULL` (version chain) OR `version int`
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] Unique constraint exists: `(tenant_id, content_hash)`
|
||||
* [ ] Version strategy exists (supersedes/version) and is queryable
|
||||
* [ ] “Latest” access is index-backed (see §4)
|
||||
|
||||
### Queue
|
||||
|
||||
* `tenant_id uuid` (if multi-tenant)
|
||||
* `id bigserial`
|
||||
* `kind text`
|
||||
* `payload jsonb`
|
||||
* `run_after timestamptz`
|
||||
* `attempts int`
|
||||
* `locked_at timestamptz NULL`
|
||||
* `locked_by text NULL`
|
||||
* `status smallint` (optional; e.g., ready/running/done/dead)
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] “Ready to claim” has a partial index (see §4)
|
||||
* [ ] Claim transaction is short (claim+commit; work outside lock)
|
||||
|
||||
---
|
||||
|
||||
# 3) JSONB rules that prevent “looks fine → melts in prod”
|
||||
|
||||
## When JSONB is appropriate
|
||||
|
||||
* Storing signed envelopes (DSSE), SBOM/VEX raw docs, vendor payloads.
|
||||
* Ingest-first scenarios where schema evolves.
|
||||
|
||||
## When JSONB is a performance hazard
|
||||
|
||||
* You frequently query deep keys/arrays (components, vulnerabilities, call paths).
|
||||
* You need sorting/aggregations on doc fields.
|
||||
|
||||
**Mandatory pattern for hot JSON fields**
|
||||
|
||||
1. Keep the raw JSONB for fidelity.
|
||||
2. Extract **hot keys** into **stored generated columns** (or real columns), index those.
|
||||
3. Extract **hot arrays** into child tables (components, vulnerabilities).
|
||||
|
||||
Example:
|
||||
|
||||
```sql
CREATE TABLE sbom_document (
    id            bigserial PRIMARY KEY,
    tenant_id     uuid NOT NULL,
    artifact_purl text NOT NULL,
    content_hash  bytea NOT NULL,
    doc           jsonb NOT NULL,
    created_at    timestamptz NOT NULL DEFAULT now(),

    -- hot keys as generated columns
    bom_format    text GENERATED ALWAYS AS ((doc->>'bomFormat')) STORED,
    spec_version  text GENERATED ALWAYS AS ((doc->>'specVersion')) STORED
);

CREATE UNIQUE INDEX ux_sbom_doc_hash ON sbom_document(tenant_id, content_hash);
CREATE INDEX ix_sbom_doc_tenant_artifact ON sbom_document(tenant_id, artifact_purl, created_at DESC);
CREATE INDEX ix_sbom_doc_json_gin ON sbom_document USING GIN (doc jsonb_path_ops);
CREATE INDEX ix_sbom_doc_bomformat ON sbom_document(tenant_id, bom_format);
```
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] Any query using `doc->>` in WHERE has either an expression index or a generated column index
|
||||
* [ ] Any query using `jsonb_array_elements(...)` in hot path has been replaced by a normalized child table or a projection table
|
||||
|
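Step 3 of the pattern (hot arrays into child tables) could be sketched like this. It assumes a CycloneDX-style document whose `components` array carries `purl`, `name`, and `version` keys; the `sbom_component` layout and the `$1` parameter are illustrative, not the actual schema.

```sql
-- Child table holding one row per extracted component (assumed layout).
CREATE TABLE sbom_component (
    tenant_id        uuid   NOT NULL,
    sbom_document_id bigint NOT NULL REFERENCES sbom_document (id),
    purl             text   NOT NULL,
    name             text,
    version          text,
    PRIMARY KEY (sbom_document_id, purl)
);

-- Tenant-first lookup index for "which documents contain component X".
CREATE INDEX ix_sbom_component_purl ON sbom_component (tenant_id, purl);

-- One-time extraction at ingest; $1 = id of the freshly inserted sbom_document row.
-- jsonb_array_elements runs once here instead of in every hot-path query.
INSERT INTO sbom_component (tenant_id, sbom_document_id, purl, name, version)
SELECT d.tenant_id,
       d.id,
       c.elem->>'purl',
       c.elem->>'name',
       c.elem->>'version'
FROM sbom_document d
CROSS JOIN LATERAL jsonb_array_elements(d.doc->'components') AS c(elem)
WHERE d.id = $1
  AND c.elem->>'purl' IS NOT NULL   -- components without a purl are skipped in this sketch
ON CONFLICT DO NOTHING;             -- duplicate purls within one document collapse to one row
```

Component searches then hit `ix_sbom_component_purl` instead of unnesting `doc` at query time.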
||||
---
|
||||
|
||||
# 4) Indexing standards (what devs must justify)
|
||||
|
||||
## Core rules
|
||||
|
||||
1. **Tenant-first**: `INDEX(tenant_id, …)` for anything read per tenant.
|
||||
2. **Sort support**: if query uses `ORDER BY created_at DESC`, index must end with `created_at DESC`.
|
||||
3. **Partial indexes** for sparse predicates (status/locked flags).
|
||||
4. **BRIN** for massive append-only time series.
|
||||
5. **GIN jsonb_path_ops** for containment (`@>`) on JSONB; do not add a GIN index to every JSONB column by default.
|
||||
|
||||
## Required index patterns by use case
|
||||
|
||||
### “Latest version per artifact”
|
||||
|
||||
If you store versions as rows:
|
||||
|
||||
```sql
-- supports: WHERE tenant_id=? AND artifact_purl=? ORDER BY created_at DESC LIMIT 1
CREATE INDEX ix_sbom_latest ON sbom_document(tenant_id, artifact_purl, created_at DESC);
```
|
||||
|
||||
### Ready queue claims
|
||||
|
||||
```sql
CREATE INDEX ix_job_ready
    ON job_queue(kind, run_after, id)
    WHERE locked_at IS NULL;

-- Optional: tenant scoped
CREATE INDEX ix_job_ready_tenant
    ON job_queue(tenant_id, kind, run_after, id)
    WHERE locked_at IS NULL;
```
|
||||
|
||||
### JSON key lookup (expression index)
|
||||
|
||||
```sql
-- supports: WHERE (doc->>'subject') = ?
CREATE INDEX ix_vex_subject_expr
    ON vex_document(tenant_id, (doc->>'subject'));
```
|
||||
|
||||
### Massive event table time filtering
|
||||
|
||||
```sql
CREATE INDEX brin_scan_events_time
    ON scan_run_event USING BRIN (occurred_at);
```
|
||||
|
||||
**Red flags**
|
||||
|
||||
* GIN index on a JSONB column + frequent updates = bloat and write amplification.
|
||||
* No partial index for queue readiness → sequential scans under load.
|
||||
* Composite indexes with the wrong leading column (e.g., `created_at, tenant_id`) → cannot serve tenant-equality lookups efficiently, so the planner often ignores them.
|
||||
|
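One way to gather evidence for the "justify your index" discussion is to list indexes that have never been scanned. The query below is a sketch using the standard `pg_stat_user_indexes` and `pg_index` catalogs; excluding unique indexes is an assumption (they enforce constraints even when unused for reads).

```sql
-- Non-unique indexes with zero scans since statistics were last reset, largest first.
SELECT s.schemaname,
       s.relname       AS table_name,
       s.indexrelname  AS index_name,
       s.idx_scan,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;
```

The counters are per server, so an index that looks unused on the primary may still serve read replicas; check each node before dropping anything.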
||||
---
|
||||
|
||||
# 5) Partitioning and retention (avoid “infinite tables”)
|
||||
|
||||
Use partitioning for:
|
||||
|
||||
* audit/events
|
||||
* scan run logs
|
||||
* large finding histories
|
||||
* anything over tens of millions of rows with time-based access
|
||||
|
||||
## Standard approach
|
||||
|
||||
* Partition by `occurred_at` (monthly) for event/audit tables.
|
||||
* Retention by dropping partitions (fast and vacuum-free).
|
||||
|
||||
Example:
|
||||
|
||||
```sql
CREATE TABLE scan_run_event (
    tenant_id   uuid NOT NULL,
    scan_run_id bigint NOT NULL,
    occurred_at timestamptz NOT NULL,
    event_type  text NOT NULL,
    payload     jsonb NOT NULL
) PARTITION BY RANGE (occurred_at);
```
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] Partition creation/rollover process exists (migration or scheduler)
|
||||
* [ ] Retention drops whole partitions (`ALTER TABLE … DETACH PARTITION`, then `DROP TABLE`), not “DELETE WHERE occurred_at < …”
|
||||
* [ ] Each partition has needed local indexes (BRIN/time + tenant filters)
|
||||
|
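The rollover and retention items in the checklist could look like the sketch below for the `scan_run_event` table. Partition names, the monthly boundaries, and the second index are hypothetical; a scheduler or migration would generate the real ones.

```sql
-- Create next month's partition ahead of time (explicit UTC bounds avoid
-- depending on the session time zone).
CREATE TABLE scan_run_event_2025_01
    PARTITION OF scan_run_event
    FOR VALUES FROM ('2025-01-01 00:00:00+00') TO ('2025-02-01 00:00:00+00');

-- Local indexes on the new partition: BRIN for time filtering,
-- B-tree for tenant-scoped lookups of one scan run.
CREATE INDEX brin_scan_run_event_2025_01_time
    ON scan_run_event_2025_01 USING BRIN (occurred_at);
CREATE INDEX ix_scan_run_event_2025_01_run
    ON scan_run_event_2025_01 (tenant_id, scan_run_id, occurred_at);

-- Retention: remove an expired month without row-by-row deletes.
-- Assumes a partition named scan_run_event_2024_01 exists.
ALTER TABLE scan_run_event DETACH PARTITION scan_run_event_2024_01;
DROP TABLE scan_run_event_2024_01;
```

Detaching first keeps the expensive part (the drop) off the parent table's lock; on PostgreSQL 14 and later, `DETACH PARTITION … CONCURRENTLY` reduces blocking further.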
||||
---
|
||||
|
||||
# 6) Views vs Materialized Views vs Projection Tables
|
||||
|
||||
## Use a normal VIEW when
|
||||
|
||||
* It’s thin (renaming columns, simple joins) and not used in hot paths.
|
||||
|
||||
## Use a MATERIALIZED VIEW when
|
||||
|
||||
* It accelerates complex joins/aggregations and can be refreshed on a schedule.
|
||||
* You can tolerate refresh lag.
|
||||
|
||||
**Materialized view requirements**
|
||||
|
||||
* Must have a **unique index** to use `REFRESH … CONCURRENTLY`.
|
||||
* Refresh should run as its own short statement, **not** inside a long explicit transaction, because its locks are held until commit.
|
||||
|
||||
Example:
|
||||
|
||||
```sql
CREATE MATERIALIZED VIEW mv_artifact_risk AS
SELECT tenant_id, artifact_purl, max(score) AS risk_score
FROM open_findings
GROUP BY tenant_id, artifact_purl;

CREATE UNIQUE INDEX ux_mv_artifact_risk
    ON mv_artifact_risk(tenant_id, artifact_purl);
```
|
||||
|
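A scheduled refresh of `mv_artifact_risk` might then be a single statement like the one below. Which worker or scheduler issues it is left open here; that choice is an assumption for each deployment.

```sql
-- Rebuilds the view contents while readers keep seeing the previous snapshot.
-- Requires the unique index ux_mv_artifact_risk on the materialized view.
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_artifact_risk;
```

The concurrent form still recomputes the full query, so its cost grows with the whole dataset; it only avoids blocking readers.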
||||
## Prefer projection tables over MV when
|
||||
|
||||
* You need **incremental updates** (on import/scan completion).
|
||||
* You need deterministic “point-in-time” snapshots per manifest.
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] Every MV has refresh cadence + owner (which worker/job triggers it)
|
||||
* [ ] UI/API queries do not depend on a heavy non-materialized view
|
||||
* [ ] If “refresh cost” scales with whole dataset, projection table exists instead
|
||||
|
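Where a projection table replaces the materialized view, the incremental update can be sketched as an upsert scoped to the artifact that just changed. The table name `artifact_risk_projection`, the `numeric` score type, and the `$1`/`$2` parameters are assumptions for illustration.

```sql
-- Projection table: one row per (tenant, artifact), updated on scan completion.
CREATE TABLE artifact_risk_projection (
    tenant_id     uuid        NOT NULL,
    artifact_purl text        NOT NULL,
    risk_score    numeric     NOT NULL,
    updated_at    timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, artifact_purl)
);

-- Recompute only the affected artifact; $1 = tenant_id, $2 = artifact_purl.
INSERT INTO artifact_risk_projection (tenant_id, artifact_purl, risk_score)
SELECT tenant_id, artifact_purl, max(score)
FROM open_findings
WHERE tenant_id = $1
  AND artifact_purl = $2
GROUP BY tenant_id, artifact_purl
ON CONFLICT (tenant_id, artifact_purl)
DO UPDATE SET risk_score = EXCLUDED.risk_score,
              updated_at = now();

-- If the artifact has no open findings left, the upsert selects nothing,
-- so remove the stale projection row explicitly.
DELETE FROM artifact_risk_projection p
WHERE p.tenant_id = $1
  AND p.artifact_purl = $2
  AND NOT EXISTS (
      SELECT 1
      FROM open_findings f
      WHERE f.tenant_id = p.tenant_id
        AND f.artifact_purl = p.artifact_purl
  );
```

The work per update is proportional to one artifact's findings rather than the whole dataset, which is the property the last checklist item asks for.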
||||
---
|
||||
|
||||
# 7) Queue and outbox patterns that do not deadlock
|
||||
|
||||
## Claim pattern (short transaction)
|
||||
|
||||
```sql
WITH cte AS (
    SELECT id
    FROM job_queue
    WHERE kind = $1
      AND run_after <= now()
      AND locked_at IS NULL
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE job_queue j
SET locked_at = now(),
    locked_by = $2
FROM cte
WHERE j.id = cte.id
RETURNING j.*;
```
|
||||
|
||||
**Rules**
|
||||
|
||||
* Claim + commit quickly.
|
||||
* Do work outside the lock.
|
||||
* On completion: update row to done (or delete if you want compactness).
|
||||
* On failure: increment attempts, set `run_after = now() + backoff`, release lock.
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] Worker does not keep transaction open while scanning/importing
|
||||
* [ ] Backoff policy is encoded (in DB columns) and observable
|
||||
* [ ] DLQ condition exists (attempts > N) and is queryable
|
||||
|
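The completion, failure, and DLQ rules could be implemented along these lines. This is a sketch under stated assumptions: the `job_dead_letter` table, the attempt limit of 5, the 5-second base backoff capped at one hour, and the 15-minute stale-lock timeout are all hypothetical values.

```sql
-- Hypothetical dead-letter table for jobs that exhausted their retries.
CREATE TABLE job_dead_letter (
    job_id    bigint      PRIMARY KEY,
    tenant_id uuid,
    kind      text        NOT NULL,
    payload   jsonb       NOT NULL,
    attempts  int         NOT NULL,
    died_at   timestamptz NOT NULL DEFAULT now()
);

-- Success: delete the row ($1 = job id, $2 = worker id that holds the claim).
DELETE FROM job_queue WHERE id = $1 AND locked_by = $2;

-- Failure: exponential backoff computed from the pre-update attempt count, then release.
UPDATE job_queue
SET attempts  = attempts + 1,
    run_after = now() + least(interval '1 hour',
                              interval '5 seconds' * power(2, attempts)),
    locked_at = NULL,
    locked_by = NULL
WHERE id = $1 AND locked_by = $2;

-- DLQ sweep: move released jobs that reached the attempt limit out of the hot queue.
WITH dead AS (
    DELETE FROM job_queue
    WHERE locked_at IS NULL AND attempts >= 5
    RETURNING id, tenant_id, kind, payload, attempts
)
INSERT INTO job_dead_letter (job_id, tenant_id, kind, payload, attempts)
SELECT id, tenant_id, kind, payload, attempts FROM dead;

-- Crash recovery: release claims held longer than the assumed worker timeout.
UPDATE job_queue
SET locked_at = NULL, locked_by = NULL
WHERE locked_at < now() - interval '15 minutes';
```

Each statement is a short transaction of its own, so no lock is held while the worker does the actual scan or import.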
||||
---
|
||||
|
||||
# 8) Query performance review checklist (what to require in PRs)
|
||||
|
||||
For each new endpoint/query:
|
||||
|
||||
* [ ] Provide the query (SQL) and the intended parameters.
|
||||
* [ ] Provide `EXPLAIN (ANALYZE, BUFFERS)` from a dataset size that resembles staging.
|
||||
* [ ] Identify the serving index(es).
|
||||
* [ ] Confirm row estimates are not wildly wrong (if they are: stats or predicate mismatch).
|
||||
* [ ] Confirm it is tenant-scoped and uses the tenant-leading index.
|
||||
|
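For a PR, the evidence asked for above might be produced with something like the following. The literal tenant id and purl are placeholders, not real data.

```sql
-- Note: ANALYZE actually executes the query, so run this against a staging copy.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, created_at
FROM sbom_document
WHERE tenant_id = '00000000-0000-0000-0000-000000000001'   -- placeholder tenant
  AND artifact_purl = 'pkg:oci/example@sha256:abc'          -- placeholder purl
ORDER BY created_at DESC
LIMIT 1;
```

In the plan, reviewers would typically look for an index scan on the tenant-leading index, the absence of a separate `Sort` node, and estimated row counts close to the actual ones.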
||||
**Common fixes**
|
||||
|
||||
* Replace `IN (SELECT …)` with `EXISTS` for correlated checks.
|
||||
* Replace `ORDER BY … LIMIT` without index with an index that matches ordering.
|
||||
* Avoid exploding joins with JSON arrays; pre-extract.
|
||||
|
||||
---
|
||||
|
||||
# 9) Vacuum, bloat, and “why is disk growing”
|
||||
|
||||
## Design to avoid bloat
|
||||
|
||||
* Append-only for large docs and events.
|
||||
* If frequent updates are needed, isolate hot-updated columns into a smaller table.
|
||||
|
||||
Example split:
|
||||
|
||||
* `job_queue_payload` (stable)
|
||||
* `job_queue_state` (locked/status/attempts updated frequently)
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] Large frequently-updated JSONB tables have been questioned
|
||||
* [ ] Updates do not rewrite big TOAST values repeatedly
|
||||
* [ ] Retention is partition-drop where possible
|
||||
|
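The example split could be sketched as two tables joined by id, plus a quick check for dead-tuple build-up. Column choices mirror the queue columns listed earlier; the exact layout is an assumption.

```sql
-- Stable part: written once, never updated, so the large payload is not copied on retries.
CREATE TABLE job_queue_payload (
    id      bigserial PRIMARY KEY,
    kind    text  NOT NULL,
    payload jsonb NOT NULL
);

-- Hot part: narrow rows that absorb every claim/retry update.
CREATE TABLE job_queue_state (
    job_id    bigint      PRIMARY KEY
              REFERENCES job_queue_payload (id) ON DELETE CASCADE,
    run_after timestamptz NOT NULL DEFAULT now(),
    attempts  int         NOT NULL DEFAULT 0,
    locked_at timestamptz,
    locked_by text
);

-- Bloat triage: tables with the most dead tuples and their last autovacuum run.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```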
||||
---
|
||||
|
||||
# 10) Migration safety rules (prevent production locks)
|
||||
|
||||
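* Index creation: `CREATE INDEX CONCURRENTLY` (must run outside a transaction block; if it fails, drop the leftover `INVALID` index before retrying).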
|
||||
* Dropping indexes: `DROP INDEX CONCURRENTLY`.
|
||||
* New column with default on large table:
|
||||
|
||||
1. `ADD COLUMN` nullable
|
||||
2. backfill in batches
|
||||
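3. `ALTER COLUMN SET NOT NULL` (on PostgreSQL 12+, first add a `CHECK (col IS NOT NULL)` constraint as `NOT VALID` and validate it, so this step can skip the full-table scan)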
|
||||
4. add default if needed
|
||||
|
||||
**Checklist**
|
||||
|
||||
* [ ] No long-running `ALTER TABLE` on huge tables without plan
|
||||
* [ ] Any new NOT NULL constraint is staged safely
|
||||
|
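The staged NOT NULL procedure could be sketched as follows for a hypothetical `severity` column on `finding`. The column, the `'unknown'` backfill value, the batch size, and the presence of an `id` key are all assumptions.

```sql
-- 1. Add the column as nullable (metadata-only change).
ALTER TABLE finding ADD COLUMN severity text;

-- 2. Backfill in batches; rerun until it reports 0 rows updated.
UPDATE finding
SET severity = 'unknown'
WHERE id IN (SELECT id FROM finding WHERE severity IS NULL LIMIT 10000);

-- 3. Prove non-nullness without a long exclusive lock:
--    NOT VALID skips the scan, VALIDATE scans under a weaker lock.
ALTER TABLE finding
    ADD CONSTRAINT finding_severity_not_null
    CHECK (severity IS NOT NULL) NOT VALID;
ALTER TABLE finding VALIDATE CONSTRAINT finding_severity_not_null;

-- 4. On PostgreSQL 12+, SET NOT NULL reuses the validated constraint
--    instead of rescanning the table; the helper constraint can then go.
ALTER TABLE finding ALTER COLUMN severity SET NOT NULL;
ALTER TABLE finding DROP CONSTRAINT finding_severity_not_null;

-- 5. Add the default for future rows if needed.
ALTER TABLE finding ALTER COLUMN severity SET DEFAULT 'unknown';

-- Index built without blocking writes; must run outside a transaction block.
CREATE INDEX CONCURRENTLY ix_finding_severity ON finding (tenant_id, severity);
```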
||||
---
|
||||
|
||||
# 11) Stella Ops-specific schema guidance (SBOM/VEX/Finding)
|
||||
|
||||
## Minimum recommended normalized tables
|
||||
|
||||
Even if you keep raw SBOM/VEX JSON:
|
||||
|
||||
* `sbom_document` (raw, immutable)
|
||||
* `sbom_component` (extracted components)
|
||||
* `vex_document` (raw, immutable)
|
||||
* `vex_statement` (extracted statements per CVE/component)
|
||||
* `finding` (facts: CVE ↔ component ↔ artifact ↔ scan_run)
|
||||
* `scan_manifest` (determinism: feed versions/hashes, policy hash)
|
||||
* `scan_run` (links results to manifest)
|
||||
|
||||
**Key gap detectors**
|
||||
|
||||
* If “find all artifacts affected by CVE X” is slow → missing `finding` indexing.
|
||||
* If “component search” is slow → missing `sbom_component` and its indexes.
|
||||
* If “replay this scan” is not exact → missing `scan_manifest` + feed import hashes.
|
||||
|
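As an illustration of the first gap detector, a `finding` table indexed for the "artifacts affected by CVE X" query might look like this. The column set is a guess based on the description above (CVE ↔ component ↔ artifact ↔ scan_run), not the actual schema.

```sql
-- Assumed shape of the finding fact table.
CREATE TABLE finding (
    id             bigserial PRIMARY KEY,
    tenant_id      uuid   NOT NULL,
    scan_run_id    bigint NOT NULL,
    artifact_purl  text   NOT NULL,
    component_purl text   NOT NULL,
    cve_id         text   NOT NULL,
    -- idempotency key: the same fact is recorded once per scan run
    UNIQUE (tenant_id, scan_run_id, artifact_purl, component_purl, cve_id)
);

-- Tenant-first index serving lookups by CVE.
CREATE INDEX ix_finding_cve ON finding (tenant_id, cve_id, artifact_purl);

-- "Find all artifacts affected by CVE X"; $1 = tenant_id, $2 = CVE id.
SELECT DISTINCT artifact_purl
FROM finding
WHERE tenant_id = $1
  AND cve_id = $2;
```

Because `artifact_purl` is the trailing index column, the query can often be answered from the index alone, provided the visibility map is reasonably current.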
||||
---
|
||||
|
||||
# 12) Minimal “definition of done” for a new table/view
|
||||
|
||||
A PR adding a table/view is incomplete unless it includes:
|
||||
|
||||
* [ ] Table classification (SOR / projection / queue / event)
|
||||
* [ ] Primary key and idempotency unique key
|
||||
* [ ] Tenant scoping strategy
|
||||
* [ ] Index plan mapped to known queries
|
||||
* [ ] Retention plan (especially for event/projection tables)
|
||||
* [ ] Refresh/update plan if derived
|
||||
* [ ] Example query + `EXPLAIN` for the top 1–3 access patterns
|
||||
|
||||
---
|
||||
|
||||
If you want this as a single drop-in repo document, tell me the target path (e.g., `/docs/platform/postgres-table-view-guidelines.md`) and I will format it exactly as a team-facing guideline, including a one-page “Architecture/Performance Gaps” review form that engineers can paste into PR descriptions.
|
||||
@@ -1,12 +0,0 @@
|
||||
# Archived Advisories Revival Plan (Stub)
|
||||
|
||||
Use with sprint task 13 (ARCHIVED-GAPS-300-020).
|
||||
|
||||
- Candidate advisories to revive:
|
||||
- SBOM-Provenance-Spine
|
||||
- Binary reachability (VB branch)
|
||||
- Function-level VEX explainability
|
||||
- PostgreSQL storage blueprint
|
||||
- Decide canonical schemas/recipes (provenance, reachability, PURL/Build-ID).
|
||||
- Document determinism seeds/SLOs, redaction/isolation rules, changelog/signing approach.
|
||||
- Mark supersedes/duplicates and PostgreSQL storage blueprint guardrails.
|
||||