From dee252940bfd1d1c874a64a3088c500693f37f5d Mon Sep 17 00:00:00 2001 From: master Date: Thu, 18 Dec 2025 00:02:31 +0200 Subject: [PATCH] SPRINT_3600_0001_0001 - Reachability Drift Detection Master Plan --- docs/api/triage.contract.v1.md | 334 ++++ docs/db/triage_schema.sql | 249 +++ docs/dev/performance-testing-playbook.md | 663 +++++++ ...600_0001_0001_reachability_drift_master.md | 365 ++++ ...600_0002_0001_call_graph_infrastructure.md | 1273 ++++++++++++ ...T_3600_0003_0001_drift_detection_engine.md | 949 +++++++++ ...SPRINT_3600_0004_0001_ui_evidence_chain.md | 886 +++++++++ .../SPRINT_3700_0001_0001_triage_db_schema.md | 241 +++ ...Dec-2025 - Reachability Drift Detection.md | 395 ++++ ...odeling StellaRouter Performance Curves.md | 0 ...g Proof‑Linked UX in Security Workflows.md | 1759 +---------------- docs/ux/TRIAGE_UI_REDUCER_SPEC.md | 400 ++++ docs/ux/TRIAGE_UX_GUIDE.md | 236 +++ 13 files changed, 6099 insertions(+), 1651 deletions(-) create mode 100644 docs/api/triage.contract.v1.md create mode 100644 docs/db/triage_schema.sql create mode 100644 docs/dev/performance-testing-playbook.md create mode 100644 docs/implplan/SPRINT_3600_0001_0001_reachability_drift_master.md create mode 100644 docs/implplan/SPRINT_3600_0002_0001_call_graph_infrastructure.md create mode 100644 docs/implplan/SPRINT_3600_0003_0001_drift_detection_engine.md create mode 100644 docs/implplan/SPRINT_3600_0004_0001_ui_evidence_chain.md create mode 100644 docs/implplan/SPRINT_3700_0001_0001_triage_db_schema.md create mode 100644 docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md rename docs/product-advisories/{unprocessed => archived}/15-Dec-2025 - Modeling StellaRouter Performance Curves.md (100%) create mode 100644 docs/ux/TRIAGE_UI_REDUCER_SPEC.md create mode 100644 docs/ux/TRIAGE_UX_GUIDE.md diff --git a/docs/api/triage.contract.v1.md b/docs/api/triage.contract.v1.md new file mode 100644 index 00000000..9e0f19a7 --- /dev/null +++ b/docs/api/triage.contract.v1.md 
@@ -0,0 +1,334 @@ +# Stella Ops Triage API Contract v1 + +Base path: `/api/triage/v1` + +This contract is served by `scanner.webservice` (or a dedicated triage facade that reads scanner-owned tables). +All risk/lattice outputs originate from `scanner.webservice`. + +Key requirements: +- Deterministic outputs (policyId + policyVersion + inputsHash). +- Proof-linking (chips reference evidenceIds). +- `concelier` and `excititor` preserve prune source: API surfaces source chains via `sourceRefs`. + +## 0. Conventions + +### 0.1 Identifiers +- `caseId` == `findingId` (UUID). A case is a finding scoped to an asset/environment. +- Hashes are hex strings. + +### 0.2 Caching +- GET endpoints SHOULD return `ETag`. +- Clients SHOULD send `If-None-Match`. + +### 0.3 Errors +Standard error envelope: + +```json +{ + "error": { + "code": "string", + "message": "string", + "details": { "any": "json" }, + "traceId": "string" + } +} +``` + +Common codes: + +* `not_found` +* `validation_error` +* `conflict` +* `unauthorized` +* `forbidden` +* `rate_limited` + +## 1. Findings Table + +### 1.1 List findings + +`GET /findings` + +Query params: + +* `showMuted` (bool, default false) +* `lane` (optional, enum) +* `search` (optional string; searches asset, purl, cveId) +* `page` (int, default 1) +* `pageSize` (int, default 50; max 200) +* `sort` (optional: `updatedAt`, `score`, `lane`) +* `order` (optional: `asc|desc`) + +Response 200: + +```json +{ + "page": 1, + "pageSize": 50, + "total": 12345, + "mutedCounts": { "reach": 1904, "vex": 513, "compensated": 18 }, + "rows": [ + { + "id": "uuid", + "lane": "BLOCKED", + "verdict": "BLOCK", + "score": 87, + "reachable": "YES", + "vex": "affected", + "exploit": "YES", + "asset": "prod/api-gateway:1.2.3", + "updatedAt": "2025-12-16T01:02:03Z" + } + ] +} +``` + +## 2. 
Case Narrative + +### 2.1 Get case header + +`GET /cases/{caseId}` + +Response 200: + +```json +{ + "id": "uuid", + "verdict": "BLOCK", + "lane": "BLOCKED", + "score": 87, + "policyId": "prod-strict", + "policyVersion": "2025.12.14", + "inputsHash": "hex", + "why": "Reachable path observed; exploit signal present; prod-strict blocks.", + "chips": [ + { "key": "reachability", "label": "Reachability", "value": "Reachable (92%)", "evidenceIds": ["uuid"] }, + { "key": "vex", "label": "VEX", "value": "affected", "evidenceIds": ["uuid"] }, + { "key": "gate", "label": "Gate", "value": "BLOCKED by prod-strict", "evidenceIds": ["uuid"] } + ], + "sourceRefs": [ + { + "domain": "concelier", + "kind": "cve_record", + "ref": "concelier:osv:...", + "pruned": false + }, + { + "domain": "excititor", + "kind": "effective_vex", + "ref": "excititor:openvex:...", + "pruned": false + } + ], + "updatedAt": "2025-12-16T01:02:03Z" +} +``` + +Notes: + +* `sourceRefs` provides preserved provenance chains (including pruned markers when applicable). + +## 3. Evidence + +### 3.1 List evidence for case + +`GET /cases/{caseId}/evidence` + +Response 200: + +```json +{ + "caseId": "uuid", + "items": [ + { + "id": "uuid", + "type": "VEX_DOC", + "title": "Vendor OpenVEX assertion", + "issuer": "vendor.example", + "signed": true, + "signedBy": "CN=Vendor VEX Signer", + "contentHash": "hex", + "createdAt": "2025-12-15T22:10:00Z", + "previewUrl": "/api/triage/v1/evidence/uuid/preview", + "rawUrl": "/api/triage/v1/evidence/uuid/raw" + } + ] +} +``` + +### 3.2 Get raw evidence object + +`GET /evidence/{evidenceId}/raw` + +Returns: + +* `application/json` for JSON evidence +* `application/octet-stream` for binary +* MUST include `Content-SHA256` header (hex) when possible. + +### 3.3 Preview evidence object + +`GET /evidence/{evidenceId}/preview` + +Returns a compact representation safe for UI preview. + +## 4. 
Decisions + +### 4.1 Create decision + +`POST /decisions` + +Request body: + +```json +{ + "caseId": "uuid", + "kind": "MUTE_REACH", + "reasonCode": "NON_REACHABLE", + "note": "No entry path in this env; reviewed runtime traces.", + "ttl": "2026-01-16T00:00:00Z" +} +``` + +Response 201: + +```json +{ + "decision": { + "id": "uuid", + "kind": "MUTE_REACH", + "reasonCode": "NON_REACHABLE", + "note": "No entry path in this env; reviewed runtime traces.", + "ttl": "2026-01-16T00:00:00Z", + "actor": { "subject": "user:abc", "display": "Vlad" }, + "createdAt": "2025-12-16T01:10:00Z", + "signatureRef": "dsse:rekor:uuid" + } +} +``` + +Rules: + +* Server signs decisions (DSSE) and persists signature reference. +* Creating a decision MUST create a `Snapshot` with trigger `DECISION`. + +### 4.2 Revoke decision + +`POST /decisions/{decisionId}/revoke` + +Body (optional): + +```json +{ "reason": "Mistake; reachability now observed." } +``` + +Response 200: + +```json +{ "revokedAt": "2025-12-16T02:00:00Z", "signatureRef": "dsse:rekor:uuid" } +``` + +## 5. Snapshots & Smart-Diff + +### 5.1 List snapshots + +`GET /cases/{caseId}/snapshots` + +Response 200: + +```json +{ + "caseId": "uuid", + "items": [ + { + "id": "uuid", + "trigger": "POLICY_UPDATE", + "changedAt": "2025-12-16T00:00:00Z", + "fromInputsHash": "hex", + "toInputsHash": "hex", + "summary": "Policy version changed; gate threshold crossed." + } + ] +} +``` + +### 5.2 Smart-Diff between two snapshots + +`GET /cases/{caseId}/smart-diff?from={inputsHashA}&to={inputsHashB}` + +Response 200: + +```json +{ + "fromInputsHash": "hex", + "toInputsHash": "hex", + "inputsChanged": [ + { "key": "policyVersion", "before": "2025.12.14", "after": "2025.12.16", "evidenceIds": ["uuid"] } + ], + "outputsChanged": [ + { "key": "verdict", "before": "SHIP", "after": "BLOCK", "evidenceIds": ["uuid"] } + ] +} +``` + +## 6. 
Export Evidence Bundle + +### 6.1 Start export + +`POST /cases/{caseId}/export` + +Response 202: + +```json +{ + "exportId": "uuid", + "status": "QUEUED" +} +``` + +### 6.2 Poll export + +`GET /exports/{exportId}` + +Response 200: + +```json +{ + "exportId": "uuid", + "status": "READY", + "downloadUrl": "/api/triage/v1/exports/uuid/download" +} +``` + +### 6.3 Download bundle + +`GET /exports/{exportId}/download` + +Returns: + +* `application/zip` +* DSSE envelope embedded (or alongside in zip) +* bundle contains replay manifest, artifacts, risk result, snapshots + +## 7. Events (Notify.WebService integration) + +These are emitted by `notify.webservice` when scanner outputs change. + +* `first_signal` + * fired on first actionable detection for an asset/environment +* `risk_changed` + * fired when verdict/lane changes or thresholds crossed +* `gate_blocked` + * fired when CI gate blocks + +Event payload includes: + +* caseId +* old/new verdict/lane/score (for changed events) +* inputsHash +* links to `/cases/{caseId}` + +--- + +**Document Version**: 1.0 +**Target Platform**: .NET 10, PostgreSQL >= 16 diff --git a/docs/db/triage_schema.sql b/docs/db/triage_schema.sql new file mode 100644 index 00000000..de6c7824 --- /dev/null +++ b/docs/db/triage_schema.sql @@ -0,0 +1,249 @@ +-- Stella Ops Triage Schema (PostgreSQL) +-- System of record: PostgreSQL +-- Ephemeral acceleration: Valkey (not represented here) + +BEGIN; + +-- Extensions +CREATE EXTENSION IF NOT EXISTS pgcrypto; + +-- Enums +DO $$ +BEGIN + IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'triage_lane') THEN + CREATE TYPE triage_lane AS ENUM ( + 'ACTIVE', + 'BLOCKED', + 'NEEDS_EXCEPTION', + 'MUTED_REACH', + 'MUTED_VEX', + 'COMPENSATED' + ); + END IF; + + IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'triage_verdict') THEN + CREATE TYPE triage_verdict AS ENUM ('SHIP', 'BLOCK', 'EXCEPTION'); + END IF; + + IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'triage_reachability') THEN + CREATE 
TYPE triage_reachability AS ENUM ('YES', 'NO', 'UNKNOWN'); + END IF; + + IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'triage_vex_status') THEN + CREATE TYPE triage_vex_status AS ENUM ('affected', 'not_affected', 'under_investigation', 'unknown'); + END IF; + + IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'triage_decision_kind') THEN + CREATE TYPE triage_decision_kind AS ENUM ('MUTE_REACH', 'MUTE_VEX', 'ACK', 'EXCEPTION'); + END IF; + + IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'triage_snapshot_trigger') THEN + CREATE TYPE triage_snapshot_trigger AS ENUM ( + 'FEED_UPDATE', + 'VEX_UPDATE', + 'SBOM_UPDATE', + 'RUNTIME_TRACE', + 'POLICY_UPDATE', + 'DECISION', + 'RESCAN' + ); + END IF; + + IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'triage_evidence_type') THEN + CREATE TYPE triage_evidence_type AS ENUM ( + 'SBOM_SLICE', + 'VEX_DOC', + 'PROVENANCE', + 'CALLSTACK_SLICE', + 'REACHABILITY_PROOF', + 'REPLAY_MANIFEST', + 'POLICY', + 'SCAN_LOG', + 'OTHER' + ); + END IF; +END $$; + +-- Core: finding (caseId == findingId) +CREATE TABLE IF NOT EXISTS triage_finding ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + asset_id uuid NOT NULL, + environment_id uuid NULL, + asset_label text NOT NULL, -- e.g. 
"prod/api-gateway:1.2.3" + purl text NOT NULL, -- package-url + cve_id text NULL, + rule_id text NULL, + first_seen_at timestamptz NOT NULL DEFAULT now(), + last_seen_at timestamptz NOT NULL DEFAULT now(), + UNIQUE (asset_id, environment_id, purl, cve_id, rule_id) +); + +CREATE INDEX IF NOT EXISTS ix_triage_finding_last_seen ON triage_finding (last_seen_at DESC); +CREATE INDEX IF NOT EXISTS ix_triage_finding_asset_label ON triage_finding (asset_label); +CREATE INDEX IF NOT EXISTS ix_triage_finding_purl ON triage_finding (purl); +CREATE INDEX IF NOT EXISTS ix_triage_finding_cve ON triage_finding (cve_id); + +-- Effective VEX (post-merge), with preserved provenance pointers +CREATE TABLE IF NOT EXISTS triage_effective_vex ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + finding_id uuid NOT NULL REFERENCES triage_finding(id) ON DELETE CASCADE, + status triage_vex_status NOT NULL, + source_domain text NOT NULL, -- "excititor" + source_ref text NOT NULL, -- stable ref string (preserve prune source) + pruned_sources jsonb NULL, -- array of pruned items with reasons (optional) + dsse_envelope_hash text NULL, + signature_ref text NULL, -- rekor/ledger ref + issuer text NULL, + valid_from timestamptz NOT NULL DEFAULT now(), + valid_to timestamptz NULL, + collected_at timestamptz NOT NULL DEFAULT now() +); + +CREATE INDEX IF NOT EXISTS ix_triage_effective_vex_finding ON triage_effective_vex (finding_id, collected_at DESC); + +-- Reachability results +CREATE TABLE IF NOT EXISTS triage_reachability_result ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + finding_id uuid NOT NULL REFERENCES triage_finding(id) ON DELETE CASCADE, + reachable triage_reachability NOT NULL, + confidence smallint NOT NULL CHECK (confidence >= 0 AND confidence <= 100), + static_proof_ref text NULL, -- evidence ref (callgraph slice / CFG slice) + runtime_proof_ref text NULL, -- evidence ref (runtime hits) + inputs_hash text NOT NULL, -- hash of inputs used to compute reachability + computed_at 
timestamptz NOT NULL DEFAULT now() +); + +CREATE INDEX IF NOT EXISTS ix_triage_reachability_finding ON triage_reachability_result (finding_id, computed_at DESC); + +-- Risk/lattice result (scanner.webservice output) +CREATE TABLE IF NOT EXISTS triage_risk_result ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + finding_id uuid NOT NULL REFERENCES triage_finding(id) ON DELETE CASCADE, + policy_id text NOT NULL, + policy_version text NOT NULL, + inputs_hash text NOT NULL, + score int NOT NULL CHECK (score >= 0 AND score <= 100), + verdict triage_verdict NOT NULL, + lane triage_lane NOT NULL, + why text NOT NULL, -- short narrative + explanation jsonb NULL, -- structured lattice explanation for UI diffing + computed_at timestamptz NOT NULL DEFAULT now(), + UNIQUE (finding_id, policy_id, policy_version, inputs_hash) +); + +CREATE INDEX IF NOT EXISTS ix_triage_risk_finding ON triage_risk_result (finding_id, computed_at DESC); +CREATE INDEX IF NOT EXISTS ix_triage_risk_lane ON triage_risk_result (lane, computed_at DESC); + +-- Signed Decisions (mute/ack/exception), reversible by revoke +CREATE TABLE IF NOT EXISTS triage_decision ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + finding_id uuid NOT NULL REFERENCES triage_finding(id) ON DELETE CASCADE, + kind triage_decision_kind NOT NULL, + reason_code text NOT NULL, + note text NULL, + policy_ref text NULL, -- optional: policy that allowed decision + ttl timestamptz NULL, + actor_subject text NOT NULL, -- Authority subject (sub) + actor_display text NULL, + signature_ref text NULL, -- DSSE signature reference + dsse_hash text NULL, + created_at timestamptz NOT NULL DEFAULT now(), + revoked_at timestamptz NULL, + revoke_reason text NULL, + revoke_signature_ref text NULL, + revoke_dsse_hash text NULL +); + +CREATE INDEX IF NOT EXISTS ix_triage_decision_finding ON triage_decision (finding_id, created_at DESC); +CREATE INDEX IF NOT EXISTS ix_triage_decision_kind ON triage_decision (kind, created_at DESC); +CREATE INDEX 
IF NOT EXISTS ix_triage_decision_active ON triage_decision (finding_id) WHERE revoked_at IS NULL; + +-- Evidence artifacts (hash-addressed, signed) +CREATE TABLE IF NOT EXISTS triage_evidence_artifact ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + finding_id uuid NOT NULL REFERENCES triage_finding(id) ON DELETE CASCADE, + type triage_evidence_type NOT NULL, + title text NOT NULL, + issuer text NULL, + signed boolean NOT NULL DEFAULT false, + signed_by text NULL, + content_hash text NOT NULL, + signature_ref text NULL, + media_type text NULL, + uri text NOT NULL, -- object store / file path / inline ref + size_bytes bigint NULL, + metadata jsonb NULL, + created_at timestamptz NOT NULL DEFAULT now(), + UNIQUE (finding_id, type, content_hash) +); + +CREATE INDEX IF NOT EXISTS ix_triage_evidence_finding ON triage_evidence_artifact (finding_id, created_at DESC); +CREATE INDEX IF NOT EXISTS ix_triage_evidence_type ON triage_evidence_artifact (type, created_at DESC); + +-- Snapshots for Smart-Diff (immutable records of input/output changes) +CREATE TABLE IF NOT EXISTS triage_snapshot ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + finding_id uuid NOT NULL REFERENCES triage_finding(id) ON DELETE CASCADE, + trigger triage_snapshot_trigger NOT NULL, + from_inputs_hash text NULL, + to_inputs_hash text NOT NULL, + summary text NOT NULL, + diff_json jsonb NULL, -- optional: precomputed diff + created_at timestamptz NOT NULL DEFAULT now(), + UNIQUE (finding_id, to_inputs_hash, created_at) +); + +CREATE INDEX IF NOT EXISTS ix_triage_snapshot_finding ON triage_snapshot (finding_id, created_at DESC); +CREATE INDEX IF NOT EXISTS ix_triage_snapshot_trigger ON triage_snapshot (trigger, created_at DESC); + +-- Current-case view: latest risk + latest reachability + latest effective VEX +CREATE OR REPLACE VIEW v_triage_case_current AS +WITH latest_risk AS ( + SELECT DISTINCT ON (finding_id) + finding_id, policy_id, policy_version, inputs_hash, score, verdict, lane, why, 
computed_at + FROM triage_risk_result + ORDER BY finding_id, computed_at DESC +), +latest_reach AS ( + SELECT DISTINCT ON (finding_id) + finding_id, reachable, confidence, static_proof_ref, runtime_proof_ref, computed_at + FROM triage_reachability_result + ORDER BY finding_id, computed_at DESC +), +latest_vex AS ( + SELECT DISTINCT ON (finding_id) + finding_id, status, issuer, signature_ref, source_domain, source_ref, collected_at + FROM triage_effective_vex + ORDER BY finding_id, collected_at DESC +) +SELECT + f.id AS case_id, + f.asset_id, + f.environment_id, + f.asset_label, + f.purl, + f.cve_id, + f.rule_id, + f.first_seen_at, + f.last_seen_at, + r.policy_id, + r.policy_version, + r.inputs_hash, + r.score, + r.verdict, + r.lane, + r.why, + r.computed_at AS risk_computed_at, + coalesce(re.reachable, 'UNKNOWN'::triage_reachability) AS reachable, + re.confidence AS reach_confidence, + v.status AS vex_status, + v.issuer AS vex_issuer, + v.signature_ref AS vex_signature_ref, + v.source_domain AS vex_source_domain, + v.source_ref AS vex_source_ref +FROM triage_finding f +LEFT JOIN latest_risk r ON r.finding_id = f.id +LEFT JOIN latest_reach re ON re.finding_id = f.id +LEFT JOIN latest_vex v ON v.finding_id = f.id; + +COMMIT; diff --git a/docs/dev/performance-testing-playbook.md b/docs/dev/performance-testing-playbook.md new file mode 100644 index 00000000..bacca1e7 --- /dev/null +++ b/docs/dev/performance-testing-playbook.md @@ -0,0 +1,663 @@ +# Performance Testing Pipeline for Queue-Based Workflows + +> **Note**: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing. + +--- + +## What we're measuring (plain English) + +* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job. +* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round-trip. +* **Worker service time:** time to pick up, process, and ack. 
+* **Queueing delay:** time spent waiting in the queue (arrival → start of worker). + +These four add up to the "hop latency" users feel when the system is under load. + +--- + +## Minimal tracing you can add today + +Emit these IDs/headers end-to-end: + +* `x-stella-corr-id` (uuid) +* `x-stella-enq-ts` (gateway enqueue ts, ns) +* `x-stella-claim-ts` (worker claim ts, ns) +* `x-stella-done-ts` (worker done ts, ns) + +From these, compute: + +* `queue_delay = claim_ts - enq_ts` +* `service_time = done_ts - claim_ts` +* `http_ttfs = gateway_first_byte_ts - http_request_start_ts` +* `hop_latency = done_ts - enq_ts` (or return-path if synchronous) + +Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock. + +--- + +## Valkey commands (safe, BSD Valkey) + +Use **Valkey Streams + Consumer Groups** for fairness and metrics: + +* Enqueue: `XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>` +* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >` +* Ack: `XACK jobs workers <entry-id>` + +Add a small Lua script for atomic timestamping at enqueue: + +```lua +-- KEYS[1]=stream +-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload +return redis.call('XADD', KEYS[1], '*', + 'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3]) +``` + +--- + +## Load shapes to test (find the envelope) + +1. **Open-loop (arrival-rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset. +2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time. +3. **Step-up/down:** double every 2 min until SLO breach; then halve down. +4. **Long tail soak:** run at 70–80% of max for 1h; watch p95-p99.9 drift. + +Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**. 
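To make the per-step roll-up concrete, here is a minimal offline sketch in plain Node.js; `summarizeStep` and the nearest-rank percentile method are illustrative choices, not part of any existing Stella Ops harness:

```javascript
// Nearest-rank percentile over one plateau's samples (any unit: ns, ms, ...).
// Illustrative only; production runs should prefer the Prometheus histograms.
function percentile(sortedSamples, p) {
  if (sortedSamples.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sortedSamples.length);
  return sortedSamples[Math.max(rank, 1) - 1];
}

// Summarize one load step: the required p50/p90/p95/p99 for a metric series.
function summarizeStep(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p90: percentile(sorted, 90),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

Run this once per plateau over each of `queue_delay`, `service_time`, and `hop_latency`, and pair the result with the step's offered rate and error count.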
+ +--- + +## k6 script (HTTP client pressure) + +```javascript +// save as hop-test.js +import http from 'k6/http'; +import { check, sleep } from 'k6'; + +export let options = { + scenarios: { + step_load: { + executor: 'ramping-arrival-rate', + startRate: 20, timeUnit: '1s', + preAllocatedVUs: 200, maxVUs: 5000, + stages: [ + { target: 50, duration: '1m' }, + { target: 100, duration: '1m' }, + { target: 200, duration: '1m' }, + { target: 400, duration: '1m' }, + { target: 800, duration: '1m' }, + ], + }, + }, + thresholds: { + 'http_req_failed': ['rate<0.01'], + 'http_req_duration{phase:hop}': ['p(95)<500'], + }, +}; + +export default function () { + const corr = crypto.randomUUID(); + const res = http.post( + __ENV.GW_URL, + JSON.stringify({ data: 'ping', corr }), + { + headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr }, + tags: { phase: 'hop' }, + } + ); + check(res, { 'status 2xx/202': r => r.status === 200 || r.status === 202 }); + sleep(0.01); +} +``` + +Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js` + +--- + +## Worker hooks (.NET 10 sketch) + +```csharp +// At claim +var now = Stopwatch.GetTimestamp(); // monotonic +var claimNs = now.ToNanoseconds(); +log.AddTag("x-stella-claim-ts", claimNs); + +// After processing +var doneNs = Stopwatch.GetTimestamp().ToNanoseconds(); +log.AddTag("x-stella-done-ts", doneNs); +// Include corr-id and stream entry id in logs/metrics +``` + +Helper: + +```csharp +public static class MonoTime { + static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency; + public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick); +} +``` + +--- + +## Prometheus metrics to expose + +* `valkey_enqueue_ns` (histogram) +* `valkey_claim_block_ms` (gauge) +* `worker_service_ns` (histogram, labels: worker_type, route) +* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`) +* `enqueue_rate`, `dequeue_rate` (counters) + +Example recording rules: + +```yaml +- record: 
hop:queue_delay_p95 + expr: histogram_quantile(0.95, sum(rate(valkey_enqueue_ns_bucket[1m])) by (le)) +- record: hop:service_time_p95 + expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le)) +- record: hop:latency_budget_p95 + expr: hop:queue_delay_p95 + hop:service_time_p95 +``` + +--- + +## Autoscaling signals (HPA/KEDA friendly) + +* **Primary:** queue depth & its derivative (d/dt). +* **Secondary:** p95 `queue_delay` and worker CPU. +* **Safety:** max in-flight per worker; backpressure HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8`. + +--- + +## Plot the "envelope" (what you'll look at) + +* X-axis: **offered load** (req/s). +* Y-axis: **p95 hop latency** (ms). +* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises). +* Add secondary panel: **queue depth** vs load. + +--- + +# Performance Test Guidelines + +## HTTP → Valkey → Worker pipeline + +## 1) Objectives and scope + +### Primary objectives + +Your performance tests MUST answer these questions with evidence: + +1. **Capacity knee**: At what offered load does **queue delay** start growing sharply? +2. **User-impact envelope**: What are p50/p95/p99 **hop latency** curves vs offered load? +3. **Decomposition**: How much of hop latency is: + * gateway enqueue time + * Valkey enqueue/claim RTT + * queue wait time + * worker service time +4. **Scaling behavior**: How do these change with worker replica counts (N workers)? +5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)? 
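The decomposition in objective 3 falls directly out of the `enq`/`claim`/`done` timestamps defined earlier; a minimal sketch (field and function names are illustrative):

```javascript
// Decompose one job's hop latency from its three pipeline timestamps (ns).
// By construction, queue delay + service time equals the end-to-end hop.
function decomposeHop({ enqTsNs, claimTsNs, doneTsNs }) {
  return {
    queueDelayNs: claimTsNs - enqTsNs, // waiting in Valkey
    serviceNs: doneTsNs - claimTsNs,   // worker processing
    hopNs: doneTsNs - enqTsNs,         // what callers actually feel
  };
}
```

Gateway enqueue time and Valkey RTT are measured separately (server-side and client-side respectively) and reported alongside this breakdown.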
+ +### Non-goals (explicitly out of scope unless you add them later) + +* Micro-optimizing single function runtime +* Synthetic "max QPS" records without a representative payload +* Tests that don't collect segment metrics (end-to-end only) for anything beyond basic smoke + +--- + +## 2) Definitions and required metrics + +### Required latency definitions (standardize these names) + +Agents MUST compute and report these per request/job: + +* **`t_http_accept`**: time from client send → gateway accepts request +* **`t_enqueue`**: time spent in gateway to enqueue into Valkey (server-side) +* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s) +* **`t_queue_delay`**: `claim_ts - enq_ts` +* **`t_service`**: `done_ts - claim_ts` +* **`t_hop`**: `done_ts - enq_ts` (this is the "true pipeline hop" latency) +* Optional but recommended: + * **`t_ack`**: time to ack completion (Valkey ack RTT) + * **`t_http_response`**: request start → gateway response sent (TTFB/TTFS) + +### Required percentiles and aggregations + +Per scenario step (e.g., each offered load plateau), agents MUST output: + +* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue` +* Throughput: offered rps and achieved rps +* Error rate: HTTP failures, enqueue failures, worker failures +* Queue depth and backlog drain time + +### Required system-level telemetry (minimum) + +Agents MUST collect these time series during tests: + +* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators +* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count +* **Gateway**: CPU/mem, request rate, response codes, request duration histogram + +--- + +## 3) Environment and test hygiene requirements + +### Environment requirements + +Agents SHOULD run tests in an environment that matches production in: + +* container CPU/memory limits +* number of nodes, network topology +* Valkey topology (single, cluster, sentinel, 
etc.) +* worker replica autoscaling rules (or deliberately disabled) + +If exact parity isn't possible, agents MUST record all known differences in the report. + +### Test hygiene (non-negotiable) + +Agents MUST: + +1. **Start from empty queues** (no backlog). +2. **Disable client retries** (or explicitly run two variants: retries off / retries on). +3. **Warm up** before measuring (e.g., 60s warm-up minimum). +4. **Hold steady plateaus** long enough to stabilize (usually 2–5 minutes per step). +5. **Cool down** and verify backlog drains (queue depth returns to baseline). +6. Record exact versions/SHAs of gateway/worker and Valkey config. + +### Load generator hygiene + +Agents MUST ensure the load generator is not the bottleneck: + +* CPU < ~70% during test +* no local socket exhaustion +* enough VUs/connections +* if needed, distributed load generation + +--- + +## 4) Instrumentation spec (agents implement this first) + +### Correlation and timestamps + +Agents MUST propagate an end-to-end correlation ID and timestamps. 
+ +**Required fields** + +* `corr_id` (UUID) +* `enq_ts_ns` (set at enqueue, monotonic or consistent clock) +* `claim_ts_ns` (set by worker when job is claimed) +* `done_ts_ns` (set by worker when job processing ends) + +**Where these live** + +* HTTP request header: `x-corr-id: <uuid>` +* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type +* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns` + +### Clock requirements + +Agents MUST use a consistent timing source: + +* Prefer monotonic timers for durations (Stopwatch / monotonic clock) +* If timestamps cross machines, ensure they're comparable: + * either rely on synchronized clocks (NTP) **and** monitor drift + * or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay) + +**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity. + +### Valkey queue semantics (recommended) + +Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability: + +* Enqueue: `XADD jobs * corr <corr_id> enq <enq_ts_ns> payload <...>` +* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >` +* Ack: `XACK jobs workers <entry-id>` + +Agents MUST record stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag. + +### Metrics exposure + +Agents MUST publish Prometheus (or equivalent) histograms: + +* `gateway_enqueue_seconds` (or ns) histogram +* `valkey_enqueue_rtt_seconds` histogram +* `worker_service_seconds` histogram +* `queue_delay_seconds` histogram (derived from timestamps; can be computed in worker or offline) +* `hop_latency_seconds` histogram + +--- + +## 5) Workload modeling and test data + +Agents MUST define a workload model before running capacity tests: + +1. **Endpoint(s)**: list exact gateway routes under test +2. **Payload types**: small/typical/large +3. 
**Mix**: e.g., 70/25/5 by payload size +4. **Idempotency rules**: ensure repeated jobs don't corrupt state +5. **Data reset strategy**: how test data is cleaned or isolated per run + +Agents SHOULD test at least: + +* Typical payload (p50) +* Large payload (p95) +* Worst-case allowed payload (bounded by your API limits) + +--- + +## 6) Scenario suite your agents MUST implement + +Each scenario MUST be defined as code/config (not manual). + +### Scenario A — Smoke (fast sanity) + +**Goal**: verify instrumentation + basic correctness +**Load**: low (e.g., 1–5 rps), 2 minutes +**Pass**: + +* 0 backlog after run +* error rate < 0.1% +* metrics present for all segments + +### Scenario B — Baseline (repeatable reference point) + +**Goal**: establish a stable baseline for regression tracking +**Load**: fixed moderate load (e.g., 30–50% of expected capacity), 10 minutes +**Pass**: + +* p95 `t_hop` within baseline ± tolerance (set after first runs) +* no upward drift in p95 across time (trend line ~flat) + +### Scenario C — Capacity ramp (open-loop) + +**Goal**: find the knee where queueing begins +**Method**: open-loop arrival-rate ramp with plateaus +Example stages (edit to fit your system): + +* 50 rps for 2m +* 100 rps for 2m +* 200 rps for 2m +* 400 rps for 2m +* … until SLO breach or errors spike + +**MUST**: + +* warm-up stage before first plateau +* record per-plateau summary + +**Stop conditions** (any triggers stop): + +* error rate > 1% +* queue depth grows without bound over an entire plateau +* p95 `t_hop` exceeds SLO for 2 consecutive plateaus + +### Scenario D — Stress (push past capacity) + +**Goal**: characterize failure mode and recovery +**Load**: 120–200% of knee load, 5–10 minutes +**Pass** (for resilience): + +* system does not crash permanently +* once load stops, backlog drains within target time (define it) + +### Scenario E — Burst / spike + +**Goal**: see how quickly queue grows and drains +**Load shape**: + +* baseline low load +* sudden burst 
(e.g., 10× for 10–30s) +* return to baseline + +**Report**: + +* peak queue depth +* time to drain to baseline +* p99 `t_hop` during burst + +### Scenario F — Soak (long-running) + +**Goal**: detect drift (leaks, fragmentation, GC patterns) +**Load**: 70–85% of knee, 60–180 minutes +**Pass**: + +* p95 does not trend upward beyond threshold +* memory remains bounded +* no rising error rate + +### Scenario G — Scaling curve (worker replica sweep) + +**Goal**: turn results into scaling rules +**Method**: + +* Repeat Scenario C with worker replicas = 1, 2, 4, 8… + +**Deliverable**: + +* plot of knee load vs worker count +* p95 `t_service` vs worker count (should remain similar; queue delay should drop) + +--- + +## 7) Execution protocol (runbook) + +Agents MUST run every scenario using the same disciplined flow: + +### Pre-run checklist + +* confirm system versions/SHAs +* confirm autoscaling mode: + * **Off** for baseline capacity characterization + * **On** for validating autoscaling policies +* clear queues and consumer group pending entries +* restart or at least record "time since deploy" for services (cold vs warm) + +### During run + +* ensure load is truly open-loop when required (arrival-rate based) +* continuously record: + * offered vs achieved rate + * queue depth + * CPU/mem for gateway/worker/Valkey + +### Post-run + +* stop load +* wait until backlog drains (or record that it doesn't) +* export: + * k6/runner raw output + * Prometheus time series snapshot + * sampled logs with corr_id fields +* generate a summary report automatically (no hand calculations) + +--- + +## 8) Analysis rules (how agents compute "the envelope") + +Agents MUST generate at minimum two plots per run: + +1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis) + * overlay p99 (and SLO line) +2. 
**Queue behavior**: offered load vs queue depth (or lag), plus drain time + +### How to identify the "knee" + +Agents SHOULD mark the knee as the first plateau where: + +* queue depth grows monotonically within the plateau, **or** +* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50–100%) + +### Convert results into scaling guidance + +Agents SHOULD compute: + +* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker) +* recommended replicas for offered load λ at target utilization U: + * `workers_needed = ceil(λ * mean(t_service) / U)` + * choose U ~ 0.6–0.75 for headroom + +This should be reported alongside the measured envelope. + +--- + +## 9) Pass/fail criteria and regression gates + +Agents MUST define gates in configuration, not in someone's head. + +Suggested gating structure: + +* **Smoke gate**: error rate < 0.1%, backlog drains +* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history) +* **Capacity gate**: knee load regression < 10% (optional but very valuable) +* **Soak gate**: p95 drift over time < 15% and no memory runaway + +--- + +## 10) Common pitfalls (agents must avoid) + +1. **Closed-loop tests used for capacity** + Closed-loop ("N concurrent users") self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity. + +2. **Ignoring queue depth** + A system can look "healthy" in request latency while silently building backlog. + +3. **Measuring only gateway latency** + You must measure enqueue → claim → done to see the real hop. + +4. **Load generator bottleneck** + If the generator saturates, you'll under-estimate capacity. + +5. **Retries enabled by default** + Retries can inflate load and hide root causes; run with retries off first. + +6. **Not controlling warm vs cold** + Cold caches vs warmed services produce different envelopes; record the condition. + +--- + +# Agent implementation checklist (deliverables) + +Assign these as concrete tasks to your agents. 
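The sizing arithmetic from section 8 (`capacity_per_worker` and `workers_needed`) is simple enough to sanity-check before handing tasks to agents. A minimal sketch — the function names follow the formulas above; the example numbers are illustrative, not measured values:

```python
import math

def capacity_per_worker(mean_t_service_s: float) -> float:
    """Approximate jobs/sec one worker can sustain: 1 / mean(t_service)."""
    return 1.0 / mean_t_service_s

def workers_needed(offered_rps: float, mean_t_service_s: float, utilization: float = 0.7) -> int:
    """Replicas required to serve offered_rps at target utilization U.

    U ~ 0.6-0.75 leaves headroom for bursts and variance.
    """
    return math.ceil(offered_rps * mean_t_service_s / utilization)

# Example: 80 ms mean service time, 200 rps offered load, U = 0.7
print(capacity_per_worker(0.080))   # 12.5 jobs/sec per worker
print(workers_needed(200, 0.080))   # ceil(16 / 0.7) = 23 replicas
```

The analyzer (Agent 3) should compute `mean_t_service_s` from the measured hop timings, never from configured expectations.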
+ +## Agent 1 — Observability & tracing + +MUST deliver: + +* correlation id propagation gateway → Valkey → worker +* timestamps `enq/claim/done` +* Prometheus histograms for enqueue, service, hop +* queue depth metric (`XLEN` / `XINFO` lag) + +## Agent 2 — Load test harness + +MUST deliver: + +* test runner scripts (k6 or equivalent) for scenarios A–G +* test config file (YAML/JSON) controlling: + * stages (rates/durations) + * payload mix + * headers (corr-id) +* reproducible seeds and version stamping + +## Agent 3 — Result collector and analyzer + +MUST deliver: + +* a pipeline that merges: + * load generator output + * hop timing data (from logs or a completion stream) + * Prometheus snapshots +* automatic summary + plots: + * latency envelope + * queue depth/drain +* CSV/JSON exports for long-term tracking + +## Agent 4 — Reporting and dashboards + +MUST deliver: + +* a standard report template that includes: + * environment details + * scenario details + * key charts + * knee estimate + * scaling recommendation +* Grafana dashboard with the required panels + +## Agent 5 — CI / release integration + +SHOULD deliver: + +* PR-level smoke test (Scenario A) +* nightly baseline (Scenario B) +* weekly capacity sweep (Scenario C + scaling curve) + +--- + +## Template: scenario spec (agents can copy/paste) + +```yaml +test_run: + system_under_test: + gateway_sha: "" + worker_sha: "" + valkey_version: "" + environment: + cluster: "" + workers: 4 + autoscaling: "off" # off|on + workload: + endpoint: "/hop" + payload_profile: "p50" + mix: + p50: 0.7 + p95: 0.25 + max: 0.05 + scenario: + name: "capacity_ramp" + mode: "open_loop" + warmup_seconds: 60 + stages: + - rps: 50 + duration_seconds: 120 + - rps: 100 + duration_seconds: 120 + - rps: 200 + duration_seconds: 120 + - rps: 400 + duration_seconds: 120 + gates: + max_error_rate: 0.01 + slo_ms_p95_hop: 500 + backlog_must_drain_seconds: 300 + outputs: + artifacts_dir: "./artifacts//" +``` + +--- + +## Sample folder layout 
+ +``` +perf/ + docker-compose.yml + prometheus/ + prometheus.yml + k6/ + lib.js + smoke.js + capacity_ramp.js + burst.js + soak.js + stress.js + scaling_curve.sh + tools/ + analyze.py + src/ + Perf.Gateway/ + Perf.Worker/ +``` + +--- + +**Document Version**: 1.0 +**Archived From**: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md +**Archive Reason**: Wrong content was pasted; this performance testing content preserved for future use. diff --git a/docs/implplan/SPRINT_3600_0001_0001_reachability_drift_master.md b/docs/implplan/SPRINT_3600_0001_0001_reachability_drift_master.md new file mode 100644 index 00000000..16191259 --- /dev/null +++ b/docs/implplan/SPRINT_3600_0001_0001_reachability_drift_master.md @@ -0,0 +1,365 @@ +# SPRINT_3600_0001_0001 - Reachability Drift Detection Master Plan + +**Status:** TODO +**Priority:** P0 - CRITICAL +**Module:** Scanner, Signals, Web +**Working Directory:** `src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/` +**Estimated Effort:** X-Large (3 sub-sprints) +**Dependencies:** SPRINT_3500 (Smart-Diff) - COMPLETE + +--- + +## Topic & Scope + +Implementation of Reachability Drift Detection as specified in `docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md`. This extends Smart-Diff to detect when vulnerable code paths become reachable/unreachable between container image versions, with causal attribution and UI visualization. 
+ +**Business Value:** +- Transform from "all vulnerabilities" to "material reachability changes" +- Reduce alert fatigue by 90%+ through meaningful drift detection +- Enable causal attribution ("guard removed in AuthFilter.cs:42") +- Provide actionable UI for security review + +--- + +## Dependencies & Concurrency + +**Internal Dependencies:** +- `SPRINT_3500` (Smart-Diff) - COMPLETE - Provides MaterialRiskChangeDetector, VexCandidateEmitter +- `StellaOps.Signals.Contracts` - Provides CallPath, ReachabilitySignal models +- `StellaOps.Scanner.SmartDiff` - Provides detection infrastructure +- `vex.graph_nodes/edges` - Existing graph storage schema + +**Concurrency:** +- Sprint 3600.2 (Call Graph) must complete before 3600.3 (Drift Detection) +- Sprint 3600.4 (UI) can start in parallel once 3600.3 API contracts are defined + +--- + +## Documentation Prerequisites + +Before starting implementation, read: +- `docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md` +- `docs/product-advisories/14-Dec-2025 - Smart-Diff Technical Reference.md` +- `docs/product-advisories/14-Dec-2025 - Reachability Analysis Technical Reference.md` +- `docs/modules/scanner/architecture.md` +- `docs/reachability/lattice.md` +- `bench/reachability-benchmark/README.md` + +--- + +## Wave Coordination + +``` +SPRINT_3600_0002 (Call Graph Infrastructure) + │ + ▼ +SPRINT_3600_0003 (Drift Detection Engine) + │ + ├──────────────────────┐ + ▼ ▼ +SPRINT_3600_0004 (UI) API Integration + │ │ + └──────────────┬───────┘ + ▼ + Integration Tests +``` + +--- + +## Wave Detail Snapshots + +### Wave 1: Call Graph Infrastructure (SPRINT_3600_0002_0001) +- .NET call graph extraction via Roslyn +- Node.js call graph extraction via AST parsing +- Entrypoint discovery for ASP.NET Core, Express, Fastify +- Sink taxonomy implementation +- Call graph storage and caching + +### Wave 2: Drift Detection Engine (SPRINT_3600_0003_0001) +- Code change facts extraction (AST-level) +- Cross-scan graph comparison 
+- Drift cause attribution +- Path compression for storage +- API endpoints + +### Wave 3: UI and Evidence Chain (SPRINT_3600_0004_0001) +- Angular Path Viewer component +- Risk Drift Card component +- Evidence chain linking +- DSSE attestation for drift results +- CLI output enhancements + +--- + +## Interlocks + +1. **Schema Versioning**: New tables must be versioned migrations (006_reachability_drift_tables.sql) +2. **Determinism**: Call graph extraction must be deterministic (stable node IDs) +3. **Benchmark Alignment**: Must pass `bench/reachability-benchmark` cases +4. **Smart-Diff Compat**: Must integrate with existing MaterialRiskChangeDetector + +--- + +## Upcoming Checkpoints + +- TBD + +--- + +## Action Tracker + +| Date (UTC) | Action | Owner | Notes | +|---|---|---|---| +| 2025-12-17 | Created master sprint from advisory analysis | Agent | Initial planning | + +--- + +## 1. EXECUTIVE SUMMARY + +Reachability Drift Detection extends Smart-Diff to track **function-level reachability changes** between scans. Instead of reporting all vulnerabilities, it identifies: + +1. **New reachable paths** - Vulnerable sinks that became reachable +2. **Mitigated paths** - Vulnerable sinks that became unreachable +3. **Causal attribution** - Why the change occurred (guard removed, new route, etc.) + +### Technical Approach + +| Phase | Component | Description | +|-------|-----------|-------------| +| Extract | Call Graph Extractor | Per-language AST analysis | +| Model | Entrypoint Discovery | HTTP handlers, CLI commands, jobs | +| Diff | Code Change Facts | AST-level symbol changes | +| Analyze | Reachability BFS | Multi-source traversal from entrypoints | +| Compare | Drift Detector | Graph N vs N-1 comparison | +| Attribute | Cause Explainer | Link drift to code changes | +| Present | Path Viewer | Angular UI component | + +--- + +## 2. 
ARCHITECTURE OVERVIEW + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ REACHABILITY DRIFT ARCHITECTURE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Scan T-1 │ │ Scan T │ │ Call Graph │ │ +│ │ (Baseline) │────►│ (Current) │────►│ Extractor │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ GRAPH EXTRACTION │ │ +│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │ +│ │ │ .NET/Roslyn│ │ Node/AST │ │ Go/SSA │ │ │ +│ │ └────────────┘ └────────────┘ └────────────┘ │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ REACHABILITY ANALYSIS │ │ +│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ +│ │ │Entrypoint│ │BFS/DFS │ │ Sink │ │Reachable│ │ │ +│ │ │Discovery │ │Traversal│ │Detection│ │ Set │ │ │ +│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ DRIFT DETECTION │ │ +│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │ +│ │ │Code Change │ │Graph Diff │ │ Cause │ │ │ +│ │ │ Facts │ │ Comparison │ │ Attribution│ │ │ +│ │ └────────────┘ └────────────┘ └────────────┘ │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ OUTPUT GENERATION │ │ +│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │ +│ │ │ Path Viewer│ │ SARIF │ │ DSSE │ │ │ +│ │ │ UI │ │ Output │ │ Attestation│ │ │ +│ │ └────────────┘ └────────────┘ └────────────┘ │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ 
+└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 3. SUB-SPRINT STRUCTURE + +| Sprint | ID | Topic | Status | Priority | Dependencies | +|--------|-----|-------|--------|----------|--------------| +| 1 | SPRINT_3600_0002_0001 | Call Graph Infrastructure | TODO | P0 | Master | +| 2 | SPRINT_3600_0003_0001 | Drift Detection Engine | TODO | P0 | Sprint 1 | +| 3 | SPRINT_3600_0004_0001 | UI and Evidence Chain | TODO | P1 | Sprint 2 | + +### Sprint Dependency Graph + +``` +SPRINT_3600_0002 (Call Graph) + │ + ├──────────────────────┐ + ▼ │ +SPRINT_3600_0003 (Detection) │ + │ │ + ├──────────────────────┤ + ▼ ▼ +SPRINT_3600_0004 (UI) Integration +``` + +--- + +## 4. GAP ANALYSIS SUMMARY + +### 4.1 Existing Infrastructure (Leverage Points) + +| Component | Location | Status | +|-----------|----------|--------| +| MaterialRiskChangeDetector | `Scanner.SmartDiff.Detection` | COMPLETE | +| VexCandidateEmitter | `Scanner.SmartDiff.Detection` | COMPLETE | +| ReachabilityGateBridge | `Scanner.SmartDiff.Detection` | COMPLETE | +| CallPath model | `Signals.Contracts.Evidence` | COMPLETE | +| ReachabilityLatticeState | `Signals.Contracts.Evidence` | COMPLETE | +| vex.graph_nodes/edges | Database | COMPLETE | +| scanner.material_risk_changes | Database | COMPLETE | +| FN-Drift tracking | `Scanner.Core.Drift` | COMPLETE | +| Reachability benchmark | `bench/reachability-benchmark` | COMPLETE | +| Language analyzers | `Scanner.Analyzers.Lang.*` | PARTIAL | + +### 4.2 Missing Components (Implementation Required) + +| Component | Sprint | Priority | +|-----------|--------|----------| +| CallGraphExtractor.DotNet (Roslyn) | 3600.2 | P0 | +| CallGraphExtractor.Node (AST) | 3600.2 | P0 | +| EntrypointDiscovery.AspNetCore | 3600.2 | P0 | +| EntrypointDiscovery.Express | 3600.2 | P0 | +| SinkDetector (taxonomy) | 3600.2 | P0 | +| scanner.code_changes table | 3600.3 | P0 | +| scanner.call_graph_snapshots table | 3600.2 | P0 | +| CodeChangeFact 
extractor | 3600.3 | P0 | +| DriftCauseExplainer | 3600.3 | P0 | +| PathCompressor | 3600.3 | P1 | +| PathViewerComponent (Angular) | 3600.4 | P1 | +| RiskDriftCardComponent (Angular) | 3600.4 | P1 | +| DSSE attestation for drift | 3600.4 | P1 | + +--- + +## 5. MODULE OWNERSHIP + +| Module | Owner Role | Sprints | +|--------|------------|---------| +| Scanner | Scanner Guild | 3600.2, 3600.3 | +| Signals | Signals Guild | 3600.2 (contracts) | +| Web | Frontend Guild | 3600.4 | +| Attestor | Attestor Guild | 3600.4 (DSSE) | + +--- + +## Delivery Tracker + +| # | Task ID | Sprint | Status | Description | +|---|---------|--------|--------|-------------| +| 1 | RDRIFT-MASTER-0001 | 3600 | DOING | Coordinate all sub-sprints | +| 2 | RDRIFT-MASTER-0002 | 3600 | TODO | Create integration test suite | +| 3 | RDRIFT-MASTER-0003 | 3600 | TODO | Update Scanner AGENTS.md | +| 4 | RDRIFT-MASTER-0004 | 3600 | TODO | Update Web AGENTS.md | +| 5 | RDRIFT-MASTER-0005 | 3600 | TODO | Validate benchmark cases pass | +| 6 | RDRIFT-MASTER-0006 | 3600 | TODO | Document air-gap workflows | + +--- + +## 6. 
SUCCESS CRITERIA + +### 6.1 Functional Requirements + +- [ ] .NET call graph extraction via Roslyn +- [ ] Node.js call graph extraction via AST +- [ ] ASP.NET Core entrypoint discovery +- [ ] Express/Fastify entrypoint discovery +- [ ] Sink taxonomy (9 categories) +- [ ] Code change facts extraction +- [ ] Cross-scan drift detection +- [ ] Causal attribution +- [ ] Path Viewer UI +- [ ] DSSE attestation + +### 6.2 Determinism Requirements + +- [ ] Same inputs produce identical call graph hash +- [ ] Node IDs stable across extractions +- [ ] Drift detection order-independent +- [ ] Path compression reversible + +### 6.3 Test Requirements + +- [ ] Unit tests for each extractor +- [ ] Integration tests with benchmark cases +- [ ] Golden fixtures for drift detection +- [ ] UI component tests + +### 6.4 Performance Requirements + +- [ ] Call graph extraction < 60s for 100K LOC +- [ ] Drift comparison < 5s per image pair +- [ ] Path storage < 10KB per compressed path + +--- + +## Decisions & Risks + +### 7.1 Architectural Decisions + +| ID | Decision | Rationale | +|----|----------|-----------| +| RDRIFT-DEC-001 | Use scan_id not commit_sha | StellaOps is image-centric | +| RDRIFT-DEC-002 | Store graphs in CAS, metadata in Postgres | Separate large blobs from metadata | +| RDRIFT-DEC-003 | Start with .NET + Node only | Highest ROI languages | +| RDRIFT-DEC-004 | Extend existing schema, don't duplicate | Leverage vex.graph_* tables | + +### 7.2 Risks & Mitigations + +| ID | Risk | Likelihood | Impact | Mitigation | +|----|------|------------|--------|------------| +| RDRIFT-RISK-001 | Roslyn memory pressure on large solutions | Medium | High | Incremental analysis, streaming | +| RDRIFT-RISK-002 | Call graph over-approximation | Medium | Medium | Conservative static analysis | +| RDRIFT-RISK-003 | UI performance with large paths | Low | Medium | Path compression, lazy loading | +| RDRIFT-RISK-004 | False positive drift detection | Medium | Medium | Confidence scoring, 
review workflow | + +--- + +## 8. DEPENDENCIES + +### 8.1 Internal Dependencies + +- `StellaOps.Scanner.SmartDiff` - Detection infrastructure +- `StellaOps.Signals.Contracts` - CallPath models +- `StellaOps.Attestor.ProofChain` - DSSE attestations +- `StellaOps.Scanner.Analyzers.Lang.*` - Language parsers + +### 8.2 External Dependencies + +- Microsoft.CodeAnalysis (Roslyn) - .NET analysis +- @babel/parser, @babel/traverse - Node.js analysis +- golang.org/x/tools/go/ssa - Go analysis (future) + +--- + +## Execution Log + +| Date (UTC) | Update | Owner | +|---|---|---| +| 2025-12-17 | Created master sprint from advisory analysis | Agent | + +--- + +## 9. REFERENCES + +- **Source Advisory**: `docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md` +- **Smart-Diff Reference**: `docs/product-advisories/14-Dec-2025 - Smart-Diff Technical Reference.md` +- **Reachability Reference**: `docs/product-advisories/14-Dec-2025 - Reachability Analysis Technical Reference.md` +- **Benchmark**: `bench/reachability-benchmark/README.md` diff --git a/docs/implplan/SPRINT_3600_0002_0001_call_graph_infrastructure.md b/docs/implplan/SPRINT_3600_0002_0001_call_graph_infrastructure.md new file mode 100644 index 00000000..88a6f867 --- /dev/null +++ b/docs/implplan/SPRINT_3600_0002_0001_call_graph_infrastructure.md @@ -0,0 +1,1273 @@ +# SPRINT_3600_0002_0001 - Call Graph Infrastructure + +**Status:** TODO +**Priority:** P0 - CRITICAL +**Module:** Scanner +**Working Directory:** `src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/` +**Estimated Effort:** Large +**Dependencies:** SPRINT_3500 (Smart-Diff) - COMPLETE + +--- + +## Topic & Scope + +Implement call graph extraction infrastructure for reachability drift detection. 
This sprint covers: +- Per-language call graph extractors (.NET via Roslyn, Node.js via AST) +- Entrypoint discovery for web frameworks (ASP.NET Core, Express, Fastify) +- Sink taxonomy implementation +- Call graph storage and caching + +--- + +## Documentation Prerequisites + +- `docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md` +- `docs/product-advisories/14-Dec-2025 - Reachability Analysis Technical Reference.md` +- `bench/reachability-benchmark/README.md` +- `src/Scanner/__Libraries/StellaOps.Scanner.Analyzers.Lang.Node/AGENTS.md` + +--- + +## Wave Coordination + +Single wave with parallel tracks: +- Track A: .NET/Roslyn extractor +- Track B: Node.js/AST extractor +- Track C: Entrypoint discovery +- Track D: PostgreSQL storage +- Track E: Valkey caching (align with Router.Gateway.RateLimit patterns) + +--- + +## Interlocks + +- Must produce stable node IDs for deterministic comparison +- Must align with existing `vex.graph_nodes` schema where possible +- Must pass `bench/reachability-benchmark` cases + +--- + +## Action Tracker + +| Date (UTC) | Action | Owner | Notes | +|---|---|---|---| +| 2025-12-17 | Created sprint from master plan | Agent | Initial | + +--- + +## 1. OBJECTIVE + +Build the foundation for reachability drift detection: +1. **Call Graph Extraction** - Per-language AST analysis +2. **Entrypoint Discovery** - HTTP handlers, CLI commands, scheduled jobs +3. **Sink Detection** - Vulnerable API patterns +4. **Storage** - Efficient graph persistence + +--- + +## 2. TECHNICAL DESIGN + +### 2.1 Core Models + +```csharp +// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/Models/CallGraphSnapshot.cs + +namespace StellaOps.Scanner.CallGraph; + +using System.Collections.Immutable; +using System.Text.Json.Serialization; + +/// +/// Complete call graph snapshot for a scan. 
+///
+public sealed record CallGraphSnapshot
+{
+    [JsonPropertyName("scanId")]
+    public required string ScanId { get; init; }
+
+    [JsonPropertyName("graphDigest")]
+    public required string GraphDigest { get; init; }
+
+    [JsonPropertyName("language")]
+    public required string Language { get; init; }
+
+    [JsonPropertyName("extractedAt")]
+    public required DateTimeOffset ExtractedAt { get; init; }
+
+    [JsonPropertyName("nodes")]
+    public required ImmutableArray<CallGraphNode> Nodes { get; init; }
+
+    [JsonPropertyName("edges")]
+    public required ImmutableArray<CallGraphEdge> Edges { get; init; }
+
+    [JsonPropertyName("entrypointIds")]
+    public required ImmutableArray<string> EntrypointIds { get; init; }
+
+    [JsonPropertyName("sinkIds")]
+    public required ImmutableArray<string> SinkIds { get; init; }
+}
+
+///
+/// Node in the call graph representing a function/method.
+///
+public sealed record CallGraphNode
+{
+    [JsonPropertyName("nodeId")]
+    public required string NodeId { get; init; }
+
+    [JsonPropertyName("symbol")]
+    public required string Symbol { get; init; }
+
+    [JsonPropertyName("file")]
+    public required string File { get; init; }
+
+    [JsonPropertyName("line")]
+    public required int Line { get; init; }
+
+    [JsonPropertyName("package")]
+    public required string Package { get; init; }
+
+    [JsonPropertyName("visibility")]
+    public required Visibility Visibility { get; init; }
+
+    [JsonPropertyName("isEntrypoint")]
+    public required bool IsEntrypoint { get; init; }
+
+    [JsonPropertyName("entrypointType")]
+    public EntrypointType? EntrypointType { get; init; }
+
+    [JsonPropertyName("isSink")]
+    public required bool IsSink { get; init; }
+
+    [JsonPropertyName("sinkCategory")]
+    public SinkCategory? SinkCategory { get; init; }
+}
+
+///
+/// Edge representing a call relationship.
+///
+public sealed record CallGraphEdge
+{
+    [JsonPropertyName("sourceId")]
+    public required string SourceId { get; init; }
+
+    [JsonPropertyName("targetId")]
+    public required string TargetId { get; init; }
+
+    [JsonPropertyName("callKind")]
+    public required CallKind CallKind { get; init; }
+
+    [JsonPropertyName("callSite")]
+    public string? CallSite { get; init; } // file:line
+}
+
+public enum Visibility { Public, Internal, Protected, Private }
+
+public enum CallKind { Direct, Virtual, Delegate, Reflection, Dynamic }
+
+///
+/// Entrypoint types per advisory.
+///
+public enum EntrypointType
+{
+    HttpHandler,      // ASP.NET, Express, Fastify routes
+    GrpcMethod,       // gRPC service methods
+    CliCommand,       // Console entry points
+    ScheduledJob,     // Hangfire, Quartz, cron
+    MessageHandler,   // RabbitMQ, Kafka consumers
+    WebSocket,        // SignalR, socket.io
+    GraphQL,          // GraphQL resolvers
+    EventHandler,     // Event-driven entry
+    BackgroundService // IHostedService, workers
+}
+
+///
+/// Sink categories per advisory taxonomy.
+///
+public enum SinkCategory
+{
+    CmdExec,          // Command injection
+    UnsafeDeser,      // Deserialization
+    SqlRaw,           // SQL injection
+    Ssrf,             // Server-side request forgery
+    FileWrite,        // Arbitrary file write
+    PathTraversal,    // Path traversal
+    TemplateInjection,// SSTI
+    CryptoWeak,       // Weak cryptography
+    AuthzBypass       // Authorization bypass
+}
+```
+
+### 2.2 Call Graph Extractor Interface
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/ICallGraphExtractor.cs
+
+namespace StellaOps.Scanner.CallGraph;
+
+///
+/// Extracts call graphs from source code.
+///
+public interface ICallGraphExtractor
+{
+    ///
+    /// Languages supported by this extractor.
+    ///
+    IReadOnlySet<string> SupportedLanguages { get; }
+
+    ///
+    /// Extracts a call graph from the given input.
+    ///
+    Task<CallGraphSnapshot> ExtractAsync(
+        CallGraphExtractionContext context,
+        CancellationToken cancellationToken = default);
+}
+
+///
+/// Context for call graph extraction.
+///
+public sealed record CallGraphExtractionContext
+{
+    public required string ScanId { get; init; }
+    public required string RootPath { get; init; }
+    public required string Language { get; init; }
+    public CallGraphExtractionOptions Options { get; init; } = CallGraphExtractionOptions.Default;
+}
+
+///
+/// Options for call graph extraction.
+///
+public sealed record CallGraphExtractionOptions
+{
+    public bool IncludeTestCode { get; init; } = false;
+    public bool IncludeVendored { get; init; } = false;
+    public int MaxDepth { get; init; } = 100;
+    public int MaxNodes { get; init; } = 100_000;
+    public TimeSpan Timeout { get; init; } = TimeSpan.FromMinutes(5);
+
+    public static CallGraphExtractionOptions Default { get; } = new();
+}
+```
+
+### 2.3 .NET/Roslyn Extractor
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/Extractors/DotNetCallGraphExtractor.cs
+
+namespace StellaOps.Scanner.CallGraph.Extractors;
+
+using System.Collections.Immutable;
+using System.Security.Cryptography;
+using System.Text;
+using Microsoft.Build.Locator;
+using Microsoft.CodeAnalysis;
+using Microsoft.CodeAnalysis.CSharp;
+using Microsoft.CodeAnalysis.CSharp.Syntax;
+using Microsoft.CodeAnalysis.MSBuild;
+
+///
+/// Extracts call graphs from .NET solutions using Roslyn.
+///
+public sealed class DotNetCallGraphExtractor : ICallGraphExtractor
+{
+    private static readonly Lazy<bool> _msbuildRegistered = new(() =>
+    {
+        MSBuildLocator.RegisterDefaults();
+        return true;
+    });
+
+    public IReadOnlySet<string> SupportedLanguages { get; } =
+        new HashSet<string> { "csharp", "dotnet", "cs" };
+
+    public async Task<CallGraphSnapshot> ExtractAsync(
+        CallGraphExtractionContext context,
+        CancellationToken cancellationToken = default)
+    {
+        _ = _msbuildRegistered.Value;
+
+        var nodes = new Dictionary<string, CallGraphNode>();
+        var edges = new List<CallGraphEdge>();
+        var entrypointIds = new List<string>();
+        var sinkIds = new List<string>();
+
+        using var workspace = MSBuildWorkspace.Create();
+
+        // Find solution or project file
+        var solutionPath = FindSolutionOrProject(context.RootPath);
+        if (solutionPath is null)
+            throw new InvalidOperationException($"No .sln or .csproj found in {context.RootPath}");
+
+        Solution solution;
+        if (solutionPath.EndsWith(".sln", StringComparison.OrdinalIgnoreCase))
+        {
+            solution = await workspace.OpenSolutionAsync(solutionPath, cancellationToken: cancellationToken);
+        }
+        else
+        {
+            var project = await workspace.OpenProjectAsync(solutionPath, cancellationToken: cancellationToken);
+            solution = project.Solution;
+        }
+
+        foreach (var project in solution.Projects)
+        {
+            if (context.Options.IncludeTestCode == false && IsTestProject(project))
+                continue;
+
+            var compilation = await project.GetCompilationAsync(cancellationToken);
+            if (compilation is null) continue;
+
+            foreach (var tree in compilation.SyntaxTrees)
+            {
+                var model = compilation.GetSemanticModel(tree);
+                var root = await tree.GetRootAsync(cancellationToken);
+
+                // Extract method nodes
+                foreach (var method in root.DescendantNodes().OfType<MethodDeclarationSyntax>())
+                {
+                    var methodSymbol = model.GetDeclaredSymbol(method);
+                    if (methodSymbol is null) continue;
+
+                    var nodeId = GenerateNodeId(methodSymbol);
+                    var node = CreateNode(methodSymbol, method, tree);
+                    nodes.TryAdd(nodeId, node);
+
+                    if (node.IsEntrypoint)
+                        entrypointIds.Add(nodeId);
+                    if (node.IsSink)
+                        sinkIds.Add(nodeId);
+
+                    // Extract call edges
+                    foreach (var invocation in method.DescendantNodes().OfType<InvocationExpressionSyntax>())
+                    {
+                        var targetSymbol = model.GetSymbolInfo(invocation).Symbol as IMethodSymbol;
+                        if (targetSymbol is null) continue;
+
+                        var targetId = GenerateNodeId(targetSymbol);
+                        edges.Add(new CallGraphEdge
+                        {
+                            SourceId = nodeId,
+                            TargetId = targetId,
+                            CallKind = ClassifyCallKind(invocation, targetSymbol),
+                            CallSite = $"{tree.FilePath}:{invocation.GetLocation().GetLineSpan().StartLinePosition.Line + 1}"
+                        });
+
+                        // Add target node if not already present
+                        if (!nodes.ContainsKey(targetId))
+                        {
+                            var targetNode = CreateNodeFromSymbol(targetSymbol);
+                            nodes.TryAdd(targetId, targetNode);
+                        }
+                    }
+                }
+            }
+        }
+
+        var snapshot = new CallGraphSnapshot
+        {
+            ScanId = context.ScanId,
+            GraphDigest = ComputeGraphDigest(nodes.Values, edges),
+            Language = "dotnet",
+            ExtractedAt = DateTimeOffset.UtcNow,
+            Nodes = nodes.Values.ToImmutableArray(),
+            Edges = edges.ToImmutableArray(),
+            EntrypointIds = entrypointIds.ToImmutableArray(),
+            SinkIds = sinkIds.ToImmutableArray()
+        };
+
+        return snapshot;
+    }
+
+    private static string? FindSolutionOrProject(string rootPath)
+    {
+        // Try .sln first
+        var slnFiles = Directory.GetFiles(rootPath, "*.sln", SearchOption.TopDirectoryOnly);
+        if (slnFiles.Length > 0)
+            return slnFiles[0];
+
+        // Fall back to .csproj
+        var csprojFiles = Directory.GetFiles(rootPath, "*.csproj", SearchOption.AllDirectories);
+        return csprojFiles.Length > 0 ?
csprojFiles[0] : null; + } + + private static bool IsTestProject(Project project) + { + return project.Name.Contains("Test", StringComparison.OrdinalIgnoreCase) || + project.Name.Contains("Spec", StringComparison.OrdinalIgnoreCase); + } + + private static string GenerateNodeId(IMethodSymbol method) + { + // Stable ID: namespace.type.method(param_types) + var paramTypes = string.Join(",", method.Parameters.Select(p => p.Type.ToDisplayString())); + return $"{method.ContainingType?.ToDisplayString()}.{method.Name}({paramTypes})"; + } + + private static CallGraphNode CreateNode(IMethodSymbol symbol, MethodDeclarationSyntax syntax, SyntaxTree tree) + { + var isEntrypoint = IsEntrypoint(symbol); + var entrypointType = isEntrypoint ? ClassifyEntrypointType(symbol) : null; + var isSink = IsSink(symbol); + var sinkCategory = isSink ? ClassifySinkCategory(symbol) : null; + + return new CallGraphNode + { + NodeId = GenerateNodeId(symbol), + Symbol = symbol.ToDisplayString(), + File = tree.FilePath, + Line = syntax.GetLocation().GetLineSpan().StartLinePosition.Line + 1, + Package = symbol.ContainingAssembly?.Name ?? "unknown", + Visibility = MapVisibility(symbol.DeclaredAccessibility), + IsEntrypoint = isEntrypoint, + EntrypointType = entrypointType, + IsSink = isSink, + SinkCategory = sinkCategory + }; + } + + private static CallGraphNode CreateNodeFromSymbol(IMethodSymbol symbol) + { + var isSink = IsSink(symbol); + return new CallGraphNode + { + NodeId = GenerateNodeId(symbol), + Symbol = symbol.ToDisplayString(), + File = symbol.Locations.FirstOrDefault()?.SourceTree?.FilePath ?? "external", + Line = symbol.Locations.FirstOrDefault()?.GetLineSpan().StartLinePosition.Line + 1 ?? 0, + Package = symbol.ContainingAssembly?.Name ?? "unknown", + Visibility = MapVisibility(symbol.DeclaredAccessibility), + IsEntrypoint = false, + IsSink = isSink, + SinkCategory = isSink ? 
ClassifySinkCategory(symbol) : null + }; + } + + private static bool IsEntrypoint(IMethodSymbol method) + { + // Check for ASP.NET Core controller actions + var hasHttpAttribute = method.GetAttributes().Any(a => + a.AttributeClass?.Name.StartsWith("Http") == true || + a.AttributeClass?.Name == "RouteAttribute"); + + // Check for gRPC methods + var isGrpcMethod = method.ContainingType?.AllInterfaces.Any(i => + i.Name.EndsWith("Base") && i.ContainingNamespace?.Name == "Grpc") == true; + + // Check for Main entry point + var isMain = method.Name == "Main" && method.IsStatic; + + // Check for IHostedService + var isHostedService = method.Name == "ExecuteAsync" && + method.ContainingType?.AllInterfaces.Any(i => i.Name == "IHostedService") == true; + + return hasHttpAttribute || isGrpcMethod || isMain || isHostedService; + } + + private static EntrypointType? ClassifyEntrypointType(IMethodSymbol method) + { + var attributes = method.GetAttributes(); + + if (attributes.Any(a => a.AttributeClass?.Name.StartsWith("Http") == true)) + return EntrypointType.HttpHandler; + + if (method.ContainingType?.AllInterfaces.Any(i => i.Name.EndsWith("Base")) == true) + return EntrypointType.GrpcMethod; + + if (method.Name == "Main" && method.IsStatic) + return EntrypointType.CliCommand; + + if (method.ContainingType?.AllInterfaces.Any(i => i.Name == "IHostedService") == true) + return EntrypointType.BackgroundService; + + return null; + } + + private static bool IsSink(IMethodSymbol method) + { + var fullName = method.ToDisplayString(); + + // Command execution + if (fullName.Contains("Process.Start") || fullName.Contains("Runtime.exec")) + return true; + + // SQL + if (fullName.Contains("ExecuteNonQuery") || fullName.Contains("ExecuteReader") || + fullName.Contains("FromSqlRaw")) + return true; + + // Deserialization + if (fullName.Contains("JsonSerializer.Deserialize") || + fullName.Contains("BinaryFormatter.Deserialize") || + fullName.Contains("XmlSerializer.Deserialize")) + return 
true;
+
+        // File operations
+        if (fullName.Contains("File.WriteAllText") || fullName.Contains("File.WriteAllBytes"))
+            return true;
+
+        // HTTP client (SSRF)
+        if (fullName.Contains("HttpClient.GetAsync") || fullName.Contains("HttpClient.PostAsync"))
+            return true;
+
+        return false;
+    }
+
+    private static SinkCategory? ClassifySinkCategory(IMethodSymbol method)
+    {
+        var fullName = method.ToDisplayString();
+
+        if (fullName.Contains("Process.Start") || fullName.Contains("Runtime.exec"))
+            return SinkCategory.CmdExec;
+
+        if (fullName.Contains("ExecuteNonQuery") || fullName.Contains("ExecuteReader") ||
+            fullName.Contains("FromSqlRaw"))
+            return SinkCategory.SqlRaw;
+
+        if (fullName.Contains("Deserialize"))
+            return SinkCategory.UnsafeDeser;
+
+        if (fullName.Contains("File.Write"))
+            return SinkCategory.FileWrite;
+
+        if (fullName.Contains("HttpClient"))
+            return SinkCategory.Ssrf;
+
+        return null;
+    }
+
+    private static Visibility MapVisibility(Accessibility accessibility)
+    {
+        return accessibility switch
+        {
+            Accessibility.Public => Visibility.Public,
+            Accessibility.Internal => Visibility.Internal,
+            Accessibility.Protected => Visibility.Protected,
+            Accessibility.Private => Visibility.Private,
+            _ => Visibility.Private
+        };
+    }
+
+    private static CallKind ClassifyCallKind(InvocationExpressionSyntax invocation, IMethodSymbol target)
+    {
+        if (target.IsVirtual || target.IsOverride || target.IsAbstract)
+            return CallKind.Virtual;
+
+        if (invocation.Expression is MemberAccessExpressionSyntax memberAccess &&
+            memberAccess.Expression is InvocationExpressionSyntax)
+            return CallKind.Dynamic;
+
+        return CallKind.Direct;
+    }
+
+    private static string ComputeGraphDigest(IEnumerable<CallGraphNode> nodes, IEnumerable<CallGraphEdge> edges)
+    {
+        var builder = new StringBuilder();
+
+        foreach (var node in nodes.OrderBy(n => n.NodeId))
+            builder.AppendLine(node.NodeId);
+
+        foreach (var edge in edges.OrderBy(e => e.SourceId).ThenBy(e => e.TargetId))
+            
builder.AppendLine($"{edge.SourceId}->{edge.TargetId}");
+
+        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(builder.ToString()));
+        return Convert.ToHexString(hash).ToLowerInvariant();
+    }
+}
+```
+
+### 2.4 Node.js Extractor (Skeleton)
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/Extractors/NodeCallGraphExtractor.cs
+
+namespace StellaOps.Scanner.CallGraph.Extractors;
+
+/// <summary>
+/// Extracts call graphs from Node.js projects.
+/// Uses an external Babel-based tool for AST parsing.
+/// </summary>
+public sealed class NodeCallGraphExtractor : ICallGraphExtractor
+{
+    public IReadOnlySet<string> SupportedLanguages { get; } =
+        new HashSet<string> { "javascript", "typescript", "node", "js", "ts" };
+
+    public Task<CallGraphSnapshot> ExtractAsync(
+        CallGraphExtractionContext context,
+        CancellationToken cancellationToken = default)
+    {
+        // Implementation will invoke external Node.js tool:
+        //   npx stella-callgraph-node --root {rootPath} --output json
+        throw new NotImplementedException("Node.js extractor to be implemented in separate task");
+    }
+}
+```
+
+### 2.5 Reachability Analyzer
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/Analysis/ReachabilityAnalyzer.cs
+
+namespace StellaOps.Scanner.CallGraph.Analysis;
+
+/// <summary>
+/// Performs multi-source BFS to determine reachable sinks.
+/// </summary>
+public sealed class ReachabilityAnalyzer
+{
+    /// <summary>
+    /// Finds all sinks reachable from entrypoints.
+    /// </summary>
+    public ReachabilityResult Analyze(CallGraphSnapshot graph)
+    {
+        var reachableFromEntrypoint = new HashSet<string>();
+        var pathsToSinks = new Dictionary<string, List<string>>();
+
+        // Build adjacency list
+        var adjacency = graph.Edges
+            .GroupBy(e => e.SourceId)
+            .ToDictionary(g => g.Key, g => g.Select(e => e.TargetId).ToList());
+
+        // Multi-source BFS from all entrypoints
+        var queue = new Queue<(string NodeId, List<string> Path)>();
+        var visited = new HashSet<string>();
+
+        foreach (var entrypoint in graph.EntrypointIds)
+        {
+            queue.Enqueue((entrypoint, new List<string> { entrypoint }));
+            visited.Add(entrypoint);
+        }
+
+        while (queue.Count > 0)
+        {
+            var (current, path) = queue.Dequeue();
+            reachableFromEntrypoint.Add(current);
+
+            // Record path on first arrival at a sink; BFS guarantees it is shortest
+            if (graph.SinkIds.Contains(current) && !pathsToSinks.ContainsKey(current))
+                pathsToSinks[current] = path;
+
+            // Explore neighbors
+            if (adjacency.TryGetValue(current, out var neighbors))
+            {
+                foreach (var neighbor in neighbors)
+                {
+                    if (visited.Add(neighbor))
+                    {
+                        var newPath = new List<string>(path) { neighbor };
+                        queue.Enqueue((neighbor, newPath));
+                    }
+                }
+            }
+        }
+
+        return new ReachabilityResult
+        {
+            ReachableNodes = reachableFromEntrypoint.ToImmutableHashSet(),
+            ReachableSinks = pathsToSinks.Keys.ToImmutableArray(),
+            ShortestPaths = pathsToSinks.ToImmutableDictionary(
+                kvp => kvp.Key,
+                kvp => kvp.Value.ToImmutableArray())
+        };
+    }
+}
+
+public sealed record ReachabilityResult
+{
+    public required ImmutableHashSet<string> ReachableNodes { get; init; }
+    public required ImmutableArray<string> ReachableSinks { get; init; }
+    public required ImmutableDictionary<string, ImmutableArray<string>> ShortestPaths { get; init; }
+}
+```
+
+### 2.6 Database Schema
+
+```sql
+-- File: src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/006_call_graph_tables.sql
+-- Sprint: SPRINT_3600_0002_0001
+-- Description: Call graph infrastructure tables
+
+-- Call graph 
snapshots (metadata only, full graph in CAS) +CREATE TABLE IF NOT EXISTS scanner.call_graph_snapshots ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + tenant_id UUID NOT NULL, + scan_id TEXT NOT NULL, + language TEXT NOT NULL, + graph_digest TEXT NOT NULL, + node_count INT NOT NULL, + edge_count INT NOT NULL, + entrypoint_count INT NOT NULL, + sink_count INT NOT NULL, + extracted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + cas_uri TEXT NOT NULL, -- cas://graphs/{digest} + + CONSTRAINT call_graph_snapshots_unique + UNIQUE (tenant_id, scan_id, language) +); + +CREATE INDEX IF NOT EXISTS idx_call_graph_snapshots_digest + ON scanner.call_graph_snapshots(graph_digest); +CREATE INDEX IF NOT EXISTS idx_call_graph_snapshots_scan + ON scanner.call_graph_snapshots(scan_id); + +-- Reachability results (cached per snapshot) +CREATE TABLE IF NOT EXISTS scanner.reachability_results ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + tenant_id UUID NOT NULL, + graph_snapshot_id UUID NOT NULL REFERENCES scanner.call_graph_snapshots(id), + sink_node_id TEXT NOT NULL, + is_reachable BOOLEAN NOT NULL, + shortest_path JSONB, -- Array of node IDs + path_length INT, + computed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + + CONSTRAINT reachability_results_unique + UNIQUE (graph_snapshot_id, sink_node_id) +); + +CREATE INDEX IF NOT EXISTS idx_reachability_results_reachable + ON scanner.reachability_results(is_reachable) + WHERE is_reachable = TRUE; + +-- Enable RLS +ALTER TABLE scanner.call_graph_snapshots ENABLE ROW LEVEL SECURITY; +ALTER TABLE scanner.reachability_results ENABLE ROW LEVEL SECURITY; + +DROP POLICY IF EXISTS call_graph_tenant_isolation ON scanner.call_graph_snapshots; +CREATE POLICY call_graph_tenant_isolation ON scanner.call_graph_snapshots + USING (tenant_id = scanner.current_tenant_id()); + +DROP POLICY IF EXISTS reachability_tenant_isolation ON scanner.reachability_results; +CREATE POLICY reachability_tenant_isolation ON scanner.reachability_results + USING (tenant_id = 
scanner.current_tenant_id());
+
+COMMENT ON TABLE scanner.call_graph_snapshots IS
+    'Per-scan call graph metadata; full graph stored in CAS';
+COMMENT ON TABLE scanner.reachability_results IS
+    'Cached reachability results per sink per graph snapshot';
+```
+
+### 2.7 Valkey Caching (Track E)
+
+Valkey is already integrated in `Router.Gateway.RateLimit`. Use the same patterns for call graph caching.
+
+#### Configuration
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/Caching/CallGraphCacheConfig.cs
+
+namespace StellaOps.Scanner.CallGraph.Caching;
+
+using Microsoft.Extensions.Configuration;
+
+/// <summary>
+/// Configuration for Valkey-based call graph caching.
+/// Aligned with Router.Gateway.RateLimit patterns.
+/// </summary>
+public sealed class CallGraphCacheConfig
+{
+    [ConfigurationKeyName("valkey_connection")]
+    public string ValkeyConnection { get; set; } = "localhost:6379";
+
+    [ConfigurationKeyName("valkey_bucket")]
+    public string ValkeyBucket { get; set; } = "stella-callgraph";
+
+    [ConfigurationKeyName("cache_ttl_hours")]
+    public int CacheTtlHours { get; set; } = 24;
+
+    [ConfigurationKeyName("circuit_breaker")]
+    public CircuitBreakerConfig? 
CircuitBreaker { get; set; }
+}
+
+public sealed class CircuitBreakerConfig
+{
+    [ConfigurationKeyName("failure_threshold")]
+    public int FailureThreshold { get; set; } = 5;
+
+    [ConfigurationKeyName("timeout_seconds")]
+    public int TimeoutSeconds { get; set; } = 30;
+
+    [ConfigurationKeyName("half_open_max_attempts")]
+    public int HalfOpenMaxAttempts { get; set; } = 3;
+}
+```
+
+#### Cache Key Pattern
+
+Keys mirror the `FormatKey` layout in the implementation; `{bucket}` defaults to `stella-callgraph`:
+
+```
+{bucket}:{scan_id}:{lang}               → full CallGraphSnapshot (GZip-compressed JSON)
+{bucket}:{scan_id}:{lang}:reachability  → ReachabilityResult (GZip-compressed JSON)
+```
+
+Invalidation deletes every key matching `{bucket}:{scan_id}:*`.
+
+#### Cache Service Interface
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/Caching/ICallGraphCacheService.cs
+
+namespace StellaOps.Scanner.CallGraph.Caching;
+
+/// <summary>
+/// Valkey-backed cache for call graph snapshots and reachability results.
+/// </summary>
+public interface ICallGraphCacheService
+{
+    /// <summary>
+    /// Gets a cached call graph snapshot, or null on a miss.
+    /// </summary>
+    Task<CallGraphSnapshot?> GetSnapshotAsync(
+        string scanId,
+        string language,
+        CancellationToken cancellationToken = default);
+
+    /// <summary>
+    /// Caches a call graph snapshot.
+    /// </summary>
+    Task SetSnapshotAsync(
+        CallGraphSnapshot snapshot,
+        CancellationToken cancellationToken = default);
+
+    /// <summary>
+    /// Gets the cached reachability result for a scan/language, or null on a miss.
+    /// </summary>
+    Task<ReachabilityResult?> GetReachabilityAsync(
+        string scanId,
+        string language,
+        CancellationToken cancellationToken = default);
+
+    /// <summary>
+    /// Caches a reachability result.
+    /// </summary>
+    Task SetReachabilityAsync(
+        string scanId,
+        string language,
+        ReachabilityResult result,
+        CancellationToken cancellationToken = default);
+
+    /// <summary>
+    /// Invalidates all cache entries for a scan.
+    /// </summary>
+    Task InvalidateAsync(
+        string scanId,
+        CancellationToken cancellationToken = default);
+}
+```
+
+#### Cache Service Implementation
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/Caching/ValkeyCallGraphCacheService.cs
+
+namespace StellaOps.Scanner.CallGraph.Caching;
+
+using System.IO.Compression;
+using System.Text.Json;
+using Microsoft.Extensions.Logging;
+using Microsoft.Extensions.Options;
+using StackExchange.Redis;
+
+/// <summary>
+/// Valkey-backed call graph cache with circuit breaker support.
+/// </summary>
+public sealed class ValkeyCallGraphCacheService : ICallGraphCacheService, IDisposable
+{
+    private readonly CallGraphCacheConfig _config;
+    private readonly ILogger<ValkeyCallGraphCacheService> _logger;
+    private readonly Lazy<ConnectionMultiplexer> _connection;
+    private readonly CircuitBreakerState _circuitBreaker;
+
+    public ValkeyCallGraphCacheService(
+        IOptions<CallGraphCacheConfig> config,
+        ILogger<ValkeyCallGraphCacheService> logger)
+    {
+        _config = config.Value;
+        _logger = logger;
+        _connection = new Lazy<ConnectionMultiplexer>(() =>
+            ConnectionMultiplexer.Connect(_config.ValkeyConnection));
+        _circuitBreaker = new CircuitBreakerState(_config.CircuitBreaker ?? 
new());
+    }
+
+    private IDatabase Database => _connection.Value.GetDatabase();
+
+    public async Task<CallGraphSnapshot?> GetSnapshotAsync(
+        string scanId,
+        string language,
+        CancellationToken cancellationToken = default)
+    {
+        if (!_circuitBreaker.AllowRequest())
+        {
+            _logger.LogWarning("Circuit breaker open, skipping cache lookup");
+            return null;
+        }
+
+        try
+        {
+            var key = FormatKey(scanId, language);
+            var value = await Database.StringGetAsync(key);
+
+            if (value.IsNullOrEmpty)
+                return null;
+
+            var decompressed = Decompress(value!);
+            _circuitBreaker.RecordSuccess();
+            return JsonSerializer.Deserialize<CallGraphSnapshot>(decompressed);
+        }
+        catch (Exception ex)
+        {
+            _circuitBreaker.RecordFailure();
+            _logger.LogWarning(ex, "Failed to get cached call graph for {ScanId}/{Language}", scanId, language);
+            return null;
+        }
+    }
+
+    public async Task SetSnapshotAsync(
+        CallGraphSnapshot snapshot,
+        CancellationToken cancellationToken = default)
+    {
+        if (!_circuitBreaker.AllowRequest())
+            return;
+
+        try
+        {
+            var key = FormatKey(snapshot.ScanId, snapshot.Language);
+            var json = JsonSerializer.Serialize(snapshot);
+            var compressed = Compress(json);
+
+            await Database.StringSetAsync(
+                key,
+                compressed,
+                TimeSpan.FromHours(_config.CacheTtlHours));
+
+            _circuitBreaker.RecordSuccess();
+            _logger.LogDebug(
+                "Cached call graph {Digest} for {ScanId}/{Language}",
+                snapshot.GraphDigest,
+                snapshot.ScanId,
+                snapshot.Language);
+        }
+        catch (Exception ex)
+        {
+            _circuitBreaker.RecordFailure();
+            _logger.LogWarning(ex, "Failed to cache call graph for {ScanId}", snapshot.ScanId);
+        }
+    }
+
+    public async Task<ReachabilityResult?> GetReachabilityAsync(
+        string scanId,
+        string language,
+        CancellationToken cancellationToken = default)
+    {
+        if (!_circuitBreaker.AllowRequest())
+            return null;
+
+        try
+        {
+            var key = $"{FormatKey(scanId, language)}:reachability";
+            var value = await Database.StringGetAsync(key);
+
+            if (value.IsNullOrEmpty)
+                return null;
+
+            _circuitBreaker.RecordSuccess();
+            return 
JsonSerializer.Deserialize<ReachabilityResult>(Decompress(value!));
+        }
+        catch (Exception ex)
+        {
+            _circuitBreaker.RecordFailure();
+            _logger.LogWarning(ex, "Failed to get cached reachability for {ScanId}", scanId);
+            return null;
+        }
+    }
+
+    public async Task SetReachabilityAsync(
+        string scanId,
+        string language,
+        ReachabilityResult result,
+        CancellationToken cancellationToken = default)
+    {
+        if (!_circuitBreaker.AllowRequest())
+            return;
+
+        try
+        {
+            var key = $"{FormatKey(scanId, language)}:reachability";
+            var json = JsonSerializer.Serialize(result);
+            await Database.StringSetAsync(
+                key,
+                Compress(json),
+                TimeSpan.FromHours(_config.CacheTtlHours));
+            _circuitBreaker.RecordSuccess();
+        }
+        catch (Exception ex)
+        {
+            _circuitBreaker.RecordFailure();
+            _logger.LogWarning(ex, "Failed to cache reachability for {ScanId}", scanId);
+        }
+    }
+
+    public async Task InvalidateAsync(
+        string scanId,
+        CancellationToken cancellationToken = default)
+    {
+        if (!_circuitBreaker.AllowRequest())
+            return;
+
+        try
+        {
+            var server = _connection.Value.GetServer(_config.ValkeyConnection);
+            var pattern = $"{_config.ValkeyBucket}:{scanId}:*";
+
+            await foreach (var key in server.KeysAsync(pattern: pattern))
+            {
+                await Database.KeyDeleteAsync(key);
+            }
+
+            _circuitBreaker.RecordSuccess();
+            _logger.LogDebug("Invalidated cache for {ScanId}", scanId);
+        }
+        catch (Exception ex)
+        {
+            _circuitBreaker.RecordFailure();
+            _logger.LogWarning(ex, "Failed to invalidate cache for {ScanId}", scanId);
+        }
+    }
+
+    private string FormatKey(string scanId, string language) =>
+        $"{_config.ValkeyBucket}:{scanId}:{language}";
+
+    private static byte[] Compress(string json)
+    {
+        using var output = new MemoryStream();
+        using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
+        {
+            var bytes = System.Text.Encoding.UTF8.GetBytes(json);
+            gzip.Write(bytes, 0, bytes.Length);
+        }
+        return output.ToArray();
+    }
+
+    private static string Decompress(byte[] compressed)
+    {
+        using var input = new 
MemoryStream(compressed);
+        using var gzip = new GZipStream(input, CompressionMode.Decompress);
+        using var reader = new StreamReader(gzip);
+        return reader.ReadToEnd();
+    }
+
+    public void Dispose()
+    {
+        if (_connection.IsValueCreated)
+            _connection.Value.Dispose();
+    }
+}
+
+/// <summary>
+/// Simple circuit breaker state machine.
+/// </summary>
+internal sealed class CircuitBreakerState
+{
+    private readonly CircuitBreakerConfig _config;
+    private int _failureCount;
+    private DateTimeOffset _lastFailure;
+    private bool _isOpen;
+
+    public CircuitBreakerState(CircuitBreakerConfig config) => _config = config;
+
+    public bool AllowRequest()
+    {
+        if (!_isOpen)
+            return true;
+
+        // Check if timeout has elapsed
+        if (DateTimeOffset.UtcNow - _lastFailure > TimeSpan.FromSeconds(_config.TimeoutSeconds))
+        {
+            _isOpen = false;
+            _failureCount = 0;
+            return true;
+        }
+
+        return false;
+    }
+
+    public void RecordSuccess()
+    {
+        _failureCount = 0;
+        _isOpen = false;
+    }
+
+    public void RecordFailure()
+    {
+        _failureCount++;
+        _lastFailure = DateTimeOffset.UtcNow;
+
+        if (_failureCount >= _config.FailureThreshold)
+            _isOpen = true;
+    }
+}
+```
+
+#### DI Registration
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.CallGraph/DependencyInjection/CallGraphServiceCollectionExtensions.cs
+
+namespace Microsoft.Extensions.DependencyInjection;
+
+using Microsoft.Extensions.Configuration;
+using StellaOps.Scanner.CallGraph.Caching;
+
+public static class CallGraphServiceCollectionExtensions
+{
+    public static IServiceCollection AddCallGraphServices(
+        this IServiceCollection services,
+        IConfiguration configuration)
+    {
+        services.Configure<CallGraphCacheConfig>(
+            configuration.GetSection("CallGraph:Cache"));
+
+        services.AddSingleton<ICallGraphCacheService, ValkeyCallGraphCacheService>();
+
+        return services;
+    }
+}
+```
+
+---
+
+## Delivery Tracker
+
+| # | Task ID | Status | Description | Notes |
+|---|---------|--------|-------------|-------|
+| 1 | CG-001 | TODO | Create CallGraphSnapshot model | Core models |
+| 2 | CG-002 | TODO | Create CallGraphNode model | With entrypoint/sink flags |
+| 
3 | CG-003 | TODO | Create CallGraphEdge model | With call kind | +| 4 | CG-004 | TODO | Create SinkCategory enum | 9 categories | +| 5 | CG-005 | TODO | Create EntrypointType enum | 9 types | +| 6 | CG-006 | TODO | Create ICallGraphExtractor interface | Base contract | +| 7 | CG-007 | TODO | Implement DotNetCallGraphExtractor | Roslyn-based | +| 8 | CG-008 | TODO | Implement Roslyn solution loading | MSBuildWorkspace | +| 9 | CG-009 | TODO | Implement method node extraction | MethodDeclarationSyntax | +| 10 | CG-010 | TODO | Implement call edge extraction | InvocationExpressionSyntax | +| 11 | CG-011 | TODO | Implement ASP.NET entrypoint detection | [Http*] attributes | +| 12 | CG-012 | TODO | Implement gRPC entrypoint detection | Service base classes | +| 13 | CG-013 | TODO | Implement IHostedService detection | Background services | +| 14 | CG-014 | TODO | Implement sink detection | Pattern matching | +| 15 | CG-015 | TODO | Implement stable node ID generation | Deterministic | +| 16 | CG-016 | TODO | Implement graph digest computation | SHA-256 | +| 17 | CG-017 | TODO | Create NodeCallGraphExtractor skeleton | Babel integration planned | +| 18 | CG-018 | TODO | Implement ReachabilityAnalyzer | Multi-source BFS | +| 19 | CG-019 | TODO | Implement shortest path extraction | For UI display | +| 20 | CG-020 | TODO | Create Postgres migration 006 | call_graph_snapshots, reachability_results | +| 21 | CG-021 | TODO | Implement ICallGraphSnapshotRepository | Storage contract | +| 22 | CG-022 | TODO | Implement PostgresCallGraphSnapshotRepository | With Dapper | +| 23 | CG-023 | TODO | Implement IReachabilityResultRepository | Storage contract | +| 24 | CG-024 | TODO | Implement PostgresReachabilityResultRepository | With Dapper | +| 25 | CG-025 | TODO | Unit tests for DotNetCallGraphExtractor | Mock workspace | +| 26 | CG-026 | TODO | Unit tests for ReachabilityAnalyzer | Various graph shapes | +| 27 | CG-027 | TODO | Unit tests for entrypoint detection | All types | 
+| 28 | CG-028 | TODO | Unit tests for sink detection | All categories | +| 29 | CG-029 | TODO | Integration tests with benchmark cases | js-unsafe-eval, etc. | +| 30 | CG-030 | TODO | Golden fixtures for graph extraction | Determinism | +| 31 | CG-031 | TODO | Create CallGraphCacheConfig model | Track E: Valkey | +| 32 | CG-032 | TODO | Create CircuitBreakerConfig model | Align with Router.Gateway | +| 33 | CG-033 | TODO | Create ICallGraphCacheService interface | Cache contract | +| 34 | CG-034 | TODO | Implement ValkeyCallGraphCacheService | StackExchange.Redis | +| 35 | CG-035 | TODO | Implement CircuitBreakerState | Failure tracking | +| 36 | CG-036 | TODO | Implement GZip compression for cached graphs | Reduce memory | +| 37 | CG-037 | TODO | Create CallGraphServiceCollectionExtensions | DI registration | +| 38 | CG-038 | TODO | Unit tests for ValkeyCallGraphCacheService | Mock Redis | +| 39 | CG-039 | TODO | Unit tests for CircuitBreakerState | State transitions | +| 40 | CG-040 | TODO | Integration tests with Testcontainers Redis | End-to-end caching | + +--- + +## 3. 
ACCEPTANCE CRITERIA + +### 3.1 Call Graph Extraction + +- [ ] Extracts all methods from .NET solutions +- [ ] Correctly identifies call relationships +- [ ] Handles virtual/abstract calls +- [ ] Produces stable node IDs +- [ ] Computes deterministic graph digest + +### 3.2 Entrypoint Detection + +- [ ] Detects ASP.NET Core controller actions +- [ ] Detects gRPC service methods +- [ ] Detects Main entry points +- [ ] Detects IHostedService implementations +- [ ] Classifies entrypoint types correctly + +### 3.3 Sink Detection + +- [ ] Detects command execution sinks +- [ ] Detects SQL injection sinks +- [ ] Detects deserialization sinks +- [ ] Detects file write sinks +- [ ] Detects SSRF sinks +- [ ] Classifies sink categories correctly + +### 3.4 Reachability Analysis + +- [ ] Finds all reachable nodes from entrypoints +- [ ] Correctly identifies reachable sinks +- [ ] Computes shortest paths +- [ ] Handles cyclic graphs + +### 3.5 Performance + +- [ ] Extracts 100K LOC solution in < 60s +- [ ] Reachability analysis completes in < 5s +- [ ] Memory usage < 2GB for large solutions + +### 3.6 Valkey Caching + +- [ ] Caches call graph snapshots with configurable TTL +- [ ] Caches reachability results per scan/language +- [ ] Compresses cached data with GZip +- [ ] Implements circuit breaker pattern (aligned with Router.Gateway) +- [ ] Supports cache invalidation per scan +- [ ] Graceful degradation when Valkey unavailable +- [ ] Cache hit improves reachability lookup by 10x + +--- + +## Decisions & Risks + +| ID | Decision | Rationale | +|----|----------|-----------| +| CG-DEC-001 | Use MSBuildWorkspace not Compilation | Full semantic model needed | +| CG-DEC-002 | Store full graph in CAS, metadata in Postgres | Separate large blobs | +| CG-DEC-003 | Start with .NET only | Highest priority, Node.js next | +| CG-DEC-004 | Use Valkey for call graph caching | Aligns with Router.Gateway.RateLimit patterns | +| CG-DEC-005 | Circuit breaker for cache resilience | Graceful 
degradation on Valkey failure | + +| ID | Risk | Mitigation | +|----|------|------------| +| CG-RISK-001 | Roslyn memory pressure | Incremental analysis, GC hints | +| CG-RISK-002 | MSBuild version conflicts | MSBuildLocator isolation | +| CG-RISK-003 | Missing external dependencies | Graceful degradation | +| CG-RISK-004 | Valkey unavailable in air-gapped environments | Circuit breaker + Postgres fallback | +| CG-RISK-005 | Cache invalidation consistency | TTL-based expiry + explicit invalidation on rescan | + +--- + +## Execution Log + +| Date (UTC) | Update | Owner | +|---|---|---| +| 2025-12-17 | Created sprint from master plan | Agent | +| 2025-12-17 | Added Valkey caching Track E (§2.7), tasks CG-031 to CG-040, acceptance criteria §3.6 | Agent | + +--- + +## References + +- **Master Sprint**: `SPRINT_3600_0001_0001_reachability_drift_master.md` +- **Advisory**: `17-Dec-2025 - Reachability Drift Detection.md` +- **Benchmark**: `bench/reachability-benchmark/README.md` +- **Roslyn Docs**: https://docs.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/ diff --git a/docs/implplan/SPRINT_3600_0003_0001_drift_detection_engine.md b/docs/implplan/SPRINT_3600_0003_0001_drift_detection_engine.md new file mode 100644 index 00000000..5edb6650 --- /dev/null +++ b/docs/implplan/SPRINT_3600_0003_0001_drift_detection_engine.md @@ -0,0 +1,949 @@ +# SPRINT_3600_0003_0001 - Drift Detection Engine + +**Status:** TODO +**Priority:** P0 - CRITICAL +**Module:** Scanner +**Working Directory:** `src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/` +**Estimated Effort:** Medium +**Dependencies:** SPRINT_3600_0002_0001 (Call Graph Infrastructure) + +--- + +## Topic & Scope + +Implement the drift detection engine that compares call graphs between scans to identify reachability changes. 
This sprint covers: +- Code change facts extraction (AST-level) +- Cross-scan graph comparison +- Drift cause attribution +- Path compression for storage +- API endpoints for drift results + +--- + +## Documentation Prerequisites + +- `docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md` +- `docs/implplan/SPRINT_3600_0002_0001_call_graph_infrastructure.md` +- `src/Scanner/__Libraries/StellaOps.Scanner.SmartDiff/AGENTS.md` + +--- + +## Wave Coordination + +Single wave with sequential tasks: +1. Code change models and extraction +2. Cross-scan comparison engine +3. Cause attribution +4. Path compression +5. API integration + +--- + +## Interlocks + +- Depends on CallGraphSnapshot model from Sprint 3600.2 +- Must integrate with existing MaterialRiskChangeDetector +- Must extend scanner.material_risk_changes table + +--- + +## Action Tracker + +| Date (UTC) | Action | Owner | Notes | +|---|---|---|---| +| 2025-12-17 | Created sprint from master plan | Agent | Initial | + +--- + +## 1. OBJECTIVE + +Build the drift detection engine: +1. **Code Change Facts** - Extract AST-level changes between scans +2. **Graph Comparison** - Detect reachability flips +3. **Cause Attribution** - Explain why drift occurred +4. **Path Compression** - Efficient storage for UI display + +--- + +## 2. TECHNICAL DESIGN + +### 2.1 Code Change Facts Model + +```csharp +// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Models/CodeChangeFact.cs + +namespace StellaOps.Scanner.ReachabilityDrift; + +using System.Text.Json; +using System.Text.Json.Serialization; + +/// +/// Represents an AST-level code change fact. 
+/// </summary>
+public sealed record CodeChangeFact
+{
+    [JsonPropertyName("id")]
+    public required Guid Id { get; init; }
+
+    [JsonPropertyName("scanId")]
+    public required string ScanId { get; init; }
+
+    [JsonPropertyName("baseScanId")]
+    public required string BaseScanId { get; init; }
+
+    [JsonPropertyName("file")]
+    public required string File { get; init; }
+
+    [JsonPropertyName("symbol")]
+    public required string Symbol { get; init; }
+
+    [JsonPropertyName("kind")]
+    public required CodeChangeKind Kind { get; init; }
+
+    [JsonPropertyName("details")]
+    public JsonDocument? Details { get; init; }
+
+    [JsonPropertyName("detectedAt")]
+    public required DateTimeOffset DetectedAt { get; init; }
+}
+
+/// <summary>
+/// Types of code changes relevant to reachability.
+/// </summary>
+[JsonConverter(typeof(JsonStringEnumConverter))]
+public enum CodeChangeKind
+{
+    /// <summary>Symbol added (new function/method).</summary>
+    [JsonStringEnumMemberName("added")]
+    Added,
+
+    /// <summary>Symbol removed.</summary>
+    [JsonStringEnumMemberName("removed")]
+    Removed,
+
+    /// <summary>Function signature changed (parameters, return type).</summary>
+    [JsonStringEnumMemberName("signature_changed")]
+    SignatureChanged,
+
+    /// <summary>Guard condition around call modified.</summary>
+    [JsonStringEnumMemberName("guard_changed")]
+    GuardChanged,
+
+    /// <summary>Callee package/version changed.</summary>
+    [JsonStringEnumMemberName("dependency_changed")]
+    DependencyChanged,
+
+    /// <summary>Visibility changed (public/internal/private).</summary>
+    [JsonStringEnumMemberName("visibility_changed")]
+    VisibilityChanged
+}
+```
+
+### 2.2 Drift Result Model
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Models/ReachabilityDriftResult.cs
+
+namespace StellaOps.Scanner.ReachabilityDrift;
+
+using System.Collections.Immutable;
+using System.Text.Json.Serialization;
+
+/// <summary>
+/// Result of reachability drift detection between two scans.
+/// </summary>
+public sealed record ReachabilityDriftResult
+{
+    [JsonPropertyName("baseScanId")]
+    public required string BaseScanId { get; init; }
+
+    [JsonPropertyName("headScanId")]
+    public required string HeadScanId { get; init; }
+
+    [JsonPropertyName("detectedAt")]
+    public required DateTimeOffset DetectedAt { get; init; }
+
+    [JsonPropertyName("newlyReachable")]
+    public required ImmutableArray<DriftedSink> NewlyReachable { get; init; }
+
+    [JsonPropertyName("newlyUnreachable")]
+    public required ImmutableArray<DriftedSink> NewlyUnreachable { get; init; }
+
+    [JsonPropertyName("totalDriftCount")]
+    public int TotalDriftCount => NewlyReachable.Length + NewlyUnreachable.Length;
+
+    [JsonPropertyName("hasMaterialDrift")]
+    public bool HasMaterialDrift => TotalDriftCount > 0;
+}
+
+/// <summary>
+/// A sink that changed reachability status.
+/// </summary>
+public sealed record DriftedSink
+{
+    [JsonPropertyName("sinkNodeId")]
+    public required string SinkNodeId { get; init; }
+
+    [JsonPropertyName("symbol")]
+    public required string Symbol { get; init; }
+
+    [JsonPropertyName("sinkCategory")]
+    public required SinkCategory SinkCategory { get; init; }
+
+    [JsonPropertyName("direction")]
+    public required DriftDirection Direction { get; init; }
+
+    [JsonPropertyName("cause")]
+    public required DriftCause Cause { get; init; }
+
+    [JsonPropertyName("path")]
+    public required CompressedPath Path { get; init; }
+
+    [JsonPropertyName("associatedVulns")]
+    public ImmutableArray<AssociatedVuln> AssociatedVulns { get; init; } = [];
+}
+
+/// <summary>
+/// Direction of reachability drift.
+/// </summary>
+[JsonConverter(typeof(JsonStringEnumConverter))]
+public enum DriftDirection
+{
+    [JsonStringEnumMemberName("became_reachable")]
+    BecameReachable,
+
+    [JsonStringEnumMemberName("became_unreachable")]
+    BecameUnreachable
+}
+
+/// <summary>
+/// Cause of the drift, linked to code changes.
+/// </summary>
+public sealed record DriftCause
+{
+    [JsonPropertyName("kind")]
+    public required DriftCauseKind Kind { get; init; }
+
+    [JsonPropertyName("description")]
+    public required string Description { get; init; }
+
+    [JsonPropertyName("changedSymbol")]
+    public string? ChangedSymbol { get; init; }
+
+    [JsonPropertyName("changedFile")]
+    public string? ChangedFile { get; init; }
+
+    [JsonPropertyName("changedLine")]
+    public int? ChangedLine { get; init; }
+
+    [JsonPropertyName("codeChangeId")]
+    public Guid? CodeChangeId { get; init; }
+
+    public static DriftCause GuardRemoved(string symbol, string file, int line) =>
+        new()
+        {
+            Kind = DriftCauseKind.GuardRemoved,
+            Description = $"Guard condition removed in {symbol}",
+            ChangedSymbol = symbol,
+            ChangedFile = file,
+            ChangedLine = line
+        };
+
+    public static DriftCause NewPublicRoute(string symbol) =>
+        new()
+        {
+            Kind = DriftCauseKind.NewPublicRoute,
+            Description = $"New public entrypoint: {symbol}",
+            ChangedSymbol = symbol
+        };
+
+    public static DriftCause VisibilityEscalated(string symbol) =>
+        new()
+        {
+            Kind = DriftCauseKind.VisibilityEscalated,
+            Description = $"Visibility escalated to public: {symbol}",
+            ChangedSymbol = symbol
+        };
+
+    public static DriftCause DependencyUpgraded(string package, string fromVersion, string toVersion) =>
+        new()
+        {
+            Kind = DriftCauseKind.DependencyUpgraded,
+            Description = $"Dependency upgraded: {package} {fromVersion} -> {toVersion}"
+        };
+
+    public static DriftCause GuardAdded(string symbol) =>
+        new()
+        {
+            Kind = DriftCauseKind.GuardAdded,
+            Description = $"Guard condition added in {symbol}",
+            ChangedSymbol = symbol
+        };
+
+    public static DriftCause SymbolRemoved(string symbol) =>
+        new()
+        {
+            Kind = DriftCauseKind.SymbolRemoved,
+            Description = $"Symbol removed: {symbol}",
+            ChangedSymbol = symbol
+        };
+
+    public static DriftCause Unknown() =>
+        new()
+        {
+            Kind = DriftCauseKind.Unknown,
+            Description = "Cause could not be determined"
+        };
+}
+
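+// Hypothetical call-site sketch (the `change` and `line` names are illustrative, not
+// part of this contract): the factory methods above map CodeChangeKind values to
+// causes so that attribution sites stay terse and descriptions stay uniform, e.g.:
+//
+//   DriftCause cause = change.Kind switch
+//   {
+//       CodeChangeKind.GuardChanged => DriftCause.GuardRemoved(change.Symbol, change.File, line),
+//       CodeChangeKind.VisibilityChanged => DriftCause.VisibilityEscalated(change.Symbol),
+//       CodeChangeKind.Removed => DriftCause.SymbolRemoved(change.Symbol),
+//       _ => DriftCause.Unknown()
+//   };
+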
+[JsonConverter(typeof(JsonStringEnumConverter))]
+public enum DriftCauseKind
+{
+    [JsonStringEnumMemberName("guard_removed")]
+    GuardRemoved,
+
+    [JsonStringEnumMemberName("guard_added")]
+    GuardAdded,
+
+    [JsonStringEnumMemberName("new_public_route")]
+    NewPublicRoute,
+
+    [JsonStringEnumMemberName("visibility_escalated")]
+    VisibilityEscalated,
+
+    [JsonStringEnumMemberName("dependency_upgraded")]
+    DependencyUpgraded,
+
+    [JsonStringEnumMemberName("symbol_removed")]
+    SymbolRemoved,
+
+    [JsonStringEnumMemberName("unknown")]
+    Unknown
+}
+
+/// <summary>
+/// Vulnerability associated with a sink.
+/// </summary>
+public sealed record AssociatedVuln
+{
+    [JsonPropertyName("cveId")]
+    public required string CveId { get; init; }
+
+    [JsonPropertyName("epss")]
+    public double? Epss { get; init; }
+
+    [JsonPropertyName("cvss")]
+    public double? Cvss { get; init; }
+
+    [JsonPropertyName("vexStatus")]
+    public string? VexStatus { get; init; }
+
+    [JsonPropertyName("packagePurl")]
+    public string? PackagePurl { get; init; }
+}
+```
+
+### 2.3 Compressed Path Model
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Models/CompressedPath.cs
+
+namespace StellaOps.Scanner.ReachabilityDrift;
+
+using System.Collections.Immutable;
+using System.Text.Json.Serialization;
+
+/// <summary>
+/// Compressed representation of a call path for storage and UI.
+/// </summary>
+public sealed record CompressedPath
+{
+    [JsonPropertyName("entrypoint")]
+    public required PathNode Entrypoint { get; init; }
+
+    [JsonPropertyName("sink")]
+    public required PathNode Sink { get; init; }
+
+    [JsonPropertyName("intermediateCount")]
+    public required int IntermediateCount { get; init; }
+
+    [JsonPropertyName("keyNodes")]
+    public required ImmutableArray<PathNode> KeyNodes { get; init; }
+
+    [JsonPropertyName("fullPath")]
+    public ImmutableArray<string>? FullPath { get; init; } // Node IDs for expansion
+}
+
+/// <summary>
+/// Node in a compressed path.
+/// +public sealed record PathNode +{ + [JsonPropertyName("nodeId")] + public required string NodeId { get; init; } + + [JsonPropertyName("symbol")] + public required string Symbol { get; init; } + + [JsonPropertyName("file")] + public string? File { get; init; } + + [JsonPropertyName("line")] + public int? Line { get; init; } + + [JsonPropertyName("package")] + public string? Package { get; init; } + + [JsonPropertyName("isChanged")] + public bool IsChanged { get; init; } + + [JsonPropertyName("changeKind")] + public CodeChangeKind? ChangeKind { get; init; } +} +``` + +### 2.4 Drift Detector Service + +```csharp +// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Services/ReachabilityDriftDetector.cs + +namespace StellaOps.Scanner.ReachabilityDrift.Services; + +using StellaOps.Scanner.CallGraph; +using StellaOps.Scanner.CallGraph.Analysis; + +/// +/// Detects reachability drift between two scan snapshots. +/// +public sealed class ReachabilityDriftDetector +{ + private readonly ReachabilityAnalyzer _reachabilityAnalyzer = new(); + private readonly DriftCauseExplainer _causeExplainer = new(); + private readonly PathCompressor _pathCompressor = new(); + + /// + /// Compares two call graph snapshots and returns drift results. 
+    ///
+    public ReachabilityDriftResult Detect(
+        CallGraphSnapshot baseGraph,
+        CallGraphSnapshot headGraph,
+        IReadOnlyList<CodeChangeFact> codeChanges)
+    {
+        // Compute reachability for both graphs
+        var baseReachability = _reachabilityAnalyzer.Analyze(baseGraph);
+        var headReachability = _reachabilityAnalyzer.Analyze(headGraph);
+
+        var newlyReachable = new List<DriftedSink>();
+        var newlyUnreachable = new List<DriftedSink>();
+
+        // Find sinks that became reachable
+        foreach (var sinkId in headGraph.SinkIds)
+        {
+            var wasReachable = baseReachability.ReachableSinks.Contains(sinkId);
+            var isReachable = headReachability.ReachableSinks.Contains(sinkId);
+
+            if (!wasReachable && isReachable)
+            {
+                var sink = headGraph.Nodes.First(n => n.NodeId == sinkId);
+                var path = headReachability.ShortestPaths.TryGetValue(sinkId, out var p) ? p : [];
+                var cause = _causeExplainer.Explain(baseGraph, headGraph, sinkId, path, codeChanges);
+
+                newlyReachable.Add(new DriftedSink
+                {
+                    SinkNodeId = sinkId,
+                    Symbol = sink.Symbol,
+                    SinkCategory = sink.SinkCategory ?? SinkCategory.CmdExec,
+                    Direction = DriftDirection.BecameReachable,
+                    Cause = cause,
+                    Path = _pathCompressor.Compress(path, headGraph, codeChanges)
+                });
+            }
+        }
+
+        // Find sinks that became unreachable
+        foreach (var sinkId in baseGraph.SinkIds)
+        {
+            var wasReachable = baseReachability.ReachableSinks.Contains(sinkId);
+            var isReachable = headReachability.ReachableSinks.Contains(sinkId);
+
+            if (wasReachable && !isReachable)
+            {
+                var sink = baseGraph.Nodes.First(n => n.NodeId == sinkId);
+                var path = baseReachability.ShortestPaths.TryGetValue(sinkId, out var p) ? p : [];
+                var cause = _causeExplainer.ExplainUnreachable(baseGraph, headGraph, sinkId, path, codeChanges);
+
+                newlyUnreachable.Add(new DriftedSink
+                {
+                    SinkNodeId = sinkId,
+                    Symbol = sink.Symbol,
+                    SinkCategory = sink.SinkCategory ?? SinkCategory.CmdExec,
+                    Direction = DriftDirection.BecameUnreachable,
+                    Cause = cause,
+                    Path = _pathCompressor.Compress(path, baseGraph, codeChanges)
+                });
+            }
+        }
+
+        return new ReachabilityDriftResult
+        {
+            BaseScanId = baseGraph.ScanId,
+            HeadScanId = headGraph.ScanId,
+            DetectedAt = DateTimeOffset.UtcNow,
+            NewlyReachable = newlyReachable.ToImmutableArray(),
+            NewlyUnreachable = newlyUnreachable.ToImmutableArray()
+        };
+    }
+}
+```
+
+### 2.5 Drift Cause Explainer
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Services/DriftCauseExplainer.cs
+
+namespace StellaOps.Scanner.ReachabilityDrift.Services;
+
+using System.Collections.Immutable;
+using StellaOps.Scanner.CallGraph;
+
+///
+/// Explains why a reachability drift occurred.
+///
+public sealed class DriftCauseExplainer
+{
+    ///
+    /// Explains why a sink became reachable.
+    ///
+    public DriftCause Explain(
+        CallGraphSnapshot baseGraph,
+        CallGraphSnapshot headGraph,
+        string sinkNodeId,
+        ImmutableArray<string> path,
+        IReadOnlyList<CodeChangeFact> codeChanges)
+    {
+        if (path.IsDefaultOrEmpty)
+            return DriftCause.Unknown();
+
+        // Check each node on path for code changes
+        foreach (var nodeId in path)
+        {
+            var headNode = headGraph.Nodes.FirstOrDefault(n => n.NodeId == nodeId);
+            if (headNode is null) continue;
+
+            var change = codeChanges.FirstOrDefault(c =>
+                c.Symbol == headNode.Symbol ||
+                c.Symbol == ExtractTypeName(headNode.Symbol));
+
+            if (change is not null)
+            {
+                return change.Kind switch
+                {
+                    CodeChangeKind.GuardChanged => DriftCause.GuardRemoved(
+                        headNode.Symbol, headNode.File, headNode.Line),
+                    CodeChangeKind.Added => DriftCause.NewPublicRoute(headNode.Symbol),
+                    CodeChangeKind.VisibilityChanged => DriftCause.VisibilityEscalated(headNode.Symbol),
+                    CodeChangeKind.DependencyChanged => ExplainDependencyChange(change),
+                    _ => DriftCause.Unknown()
+                };
+            }
+        }
+
+        // Check if entrypoint is new
+        var entrypoint = path.FirstOrDefault();
+        if (entrypoint is not null)
+        {
+            var baseHasEntrypoint = baseGraph.EntrypointIds.Contains(entrypoint);
+            var headHasEntrypoint = headGraph.EntrypointIds.Contains(entrypoint);
+
+            if (!baseHasEntrypoint && headHasEntrypoint)
+            {
+                var epNode = headGraph.Nodes.First(n => n.NodeId == entrypoint);
+                return DriftCause.NewPublicRoute(epNode.Symbol);
+            }
+        }
+
+        return DriftCause.Unknown();
+    }
+
+    ///
+    /// Explains why a sink became unreachable.
+    ///
+    public DriftCause ExplainUnreachable(
+        CallGraphSnapshot baseGraph,
+        CallGraphSnapshot headGraph,
+        string sinkNodeId,
+        ImmutableArray<string> basePath,
+        IReadOnlyList<CodeChangeFact> codeChanges)
+    {
+        // Check if any node on path was removed
+        foreach (var nodeId in basePath)
+        {
+            var existsInHead = headGraph.Nodes.Any(n => n.NodeId == nodeId);
+            if (!existsInHead)
+            {
+                var baseNode = baseGraph.Nodes.First(n => n.NodeId == nodeId);
+                return DriftCause.SymbolRemoved(baseNode.Symbol);
+            }
+        }
+
+        // Check for guard additions on nodes that sit on the old path
+        foreach (var nodeId in basePath)
+        {
+            var node = baseGraph.Nodes.FirstOrDefault(n => n.NodeId == nodeId);
+            if (node is null) continue;
+
+            var change = codeChanges.FirstOrDefault(c =>
+                c.Kind == CodeChangeKind.GuardChanged && c.Symbol == node.Symbol);
+
+            if (change is not null)
+            {
+                return DriftCause.GuardAdded(change.Symbol);
+            }
+        }
+
+        return DriftCause.Unknown();
+    }
+
+    private static string ExtractTypeName(string symbol)
+    {
+        var lastDot = symbol.LastIndexOf('.');
+        if (lastDot > 0)
+        {
+            var beforeMethod = symbol[..lastDot];
+            var typeEnd = beforeMethod.LastIndexOf('.');
+            return typeEnd > 0 ? beforeMethod[(typeEnd + 1)..] : beforeMethod;
+        }
+        return symbol;
+    }
+
+    private static DriftCause ExplainDependencyChange(CodeChangeFact change)
+    {
+        if (change.Details is not null)
+        {
+            var details = change.Details.RootElement;
+            var package = details.TryGetProperty("package", out var p) ? p.GetString() : "unknown";
+            var from = details.TryGetProperty("fromVersion", out var f) ? f.GetString() : "?";
+            var to = details.TryGetProperty("toVersion", out var t) ? t.GetString() : "?";
+            return DriftCause.DependencyUpgraded(package ?? "unknown", from ?? "?", to ??
"?");
+        }
+        return DriftCause.Unknown();
+    }
+}
+```
+
+### 2.6 Path Compressor
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Services/PathCompressor.cs
+
+namespace StellaOps.Scanner.ReachabilityDrift.Services;
+
+using System.Collections.Immutable;
+using StellaOps.Scanner.CallGraph;
+
+///
+/// Compresses call paths for efficient storage and UI display.
+///
+public sealed class PathCompressor
+{
+    private const int MaxKeyNodes = 5;
+
+    ///
+    /// Compresses a full path to key nodes only.
+    ///
+    public CompressedPath Compress(
+        ImmutableArray<string> fullPath,
+        CallGraphSnapshot graph,
+        IReadOnlyList<CodeChangeFact> codeChanges)
+    {
+        if (fullPath.IsDefaultOrEmpty)
+        {
+            return new CompressedPath
+            {
+                Entrypoint = new PathNode { NodeId = "unknown", Symbol = "unknown" },
+                Sink = new PathNode { NodeId = "unknown", Symbol = "unknown" },
+                IntermediateCount = 0,
+                KeyNodes = []
+            };
+        }
+
+        var entrypointNode = graph.Nodes.FirstOrDefault(n => n.NodeId == fullPath[0]);
+        var sinkNode = graph.Nodes.FirstOrDefault(n => n.NodeId == fullPath[^1]);
+
+        // Identify key nodes (changed, entry, sink, or interesting)
+        var keyNodes = new List<PathNode>();
+        var changedSymbols = codeChanges.Select(c => c.Symbol).ToHashSet();
+
+        for (var i = 1; i < fullPath.Length - 1 && keyNodes.Count < MaxKeyNodes; i++)
+        {
+            var nodeId = fullPath[i];
+            var node = graph.Nodes.FirstOrDefault(n => n.NodeId == nodeId);
+            if (node is null) continue;
+
+            var isChanged = changedSymbols.Contains(node.Symbol);
+            var change = codeChanges.FirstOrDefault(c => c.Symbol == node.Symbol);
+
+            if (isChanged || node.IsEntrypoint || node.IsSink)
+            {
+                keyNodes.Add(new PathNode
+                {
+                    NodeId = node.NodeId,
+                    Symbol = node.Symbol,
+                    File = node.File,
+                    Line = node.Line,
+                    Package = node.Package,
+                    IsChanged = isChanged,
+                    ChangeKind = change?.Kind
+                });
+            }
+        }
+
+        return new CompressedPath
+        {
+            Entrypoint = CreatePathNode(entrypointNode, changedSymbols, codeChanges),
+            Sink = CreatePathNode(sinkNode, changedSymbols, codeChanges),
+            IntermediateCount = fullPath.Length - 2,
+            KeyNodes = keyNodes.ToImmutableArray(),
+            FullPath = fullPath // Optionally include for expansion
+        };
+    }
+
+    private static PathNode CreatePathNode(
+        CallGraphNode? node,
+        HashSet<string> changedSymbols,
+        IReadOnlyList<CodeChangeFact> codeChanges)
+    {
+        if (node is null)
+        {
+            return new PathNode { NodeId = "unknown", Symbol = "unknown" };
+        }
+
+        var isChanged = changedSymbols.Contains(node.Symbol);
+        var change = codeChanges.FirstOrDefault(c => c.Symbol == node.Symbol);
+
+        return new PathNode
+        {
+            NodeId = node.NodeId,
+            Symbol = node.Symbol,
+            File = node.File,
+            Line = node.Line,
+            Package = node.Package,
+            IsChanged = isChanged,
+            ChangeKind = change?.Kind
+        };
+    }
+}
+```
+
+### 2.7 Database Schema Extensions
+
+```sql
+-- File: src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/007_drift_detection_tables.sql
+-- Sprint: SPRINT_3600_0003_0001
+-- Description: Drift detection engine tables
+
+-- Code change facts from AST-level analysis
+CREATE TABLE IF NOT EXISTS scanner.code_changes (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    tenant_id UUID NOT NULL,
+    scan_id TEXT NOT NULL,
+    base_scan_id TEXT NOT NULL,
+    file TEXT NOT NULL,
+    symbol TEXT NOT NULL,
+    change_kind TEXT NOT NULL,
+    details JSONB,
+    detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+
+    CONSTRAINT code_changes_unique UNIQUE (tenant_id, scan_id, base_scan_id, file, symbol)
+);
+
+CREATE INDEX IF NOT EXISTS idx_code_changes_scan ON scanner.code_changes(scan_id);
+CREATE INDEX IF NOT EXISTS idx_code_changes_symbol ON scanner.code_changes(symbol);
+CREATE INDEX IF NOT EXISTS idx_code_changes_kind ON scanner.code_changes(change_kind);
+
+-- Extend material_risk_changes with drift-specific columns
+ALTER TABLE scanner.material_risk_changes
+ADD COLUMN IF NOT EXISTS cause TEXT,
+ADD COLUMN IF NOT EXISTS cause_kind TEXT,
+ADD COLUMN IF NOT EXISTS path_nodes JSONB,
+ADD COLUMN IF NOT EXISTS base_scan_id TEXT,
+ADD COLUMN IF NOT EXISTS associated_vulns
JSONB; + +CREATE INDEX IF NOT EXISTS idx_material_risk_changes_cause + ON scanner.material_risk_changes(cause_kind) + WHERE cause_kind IS NOT NULL; + +CREATE INDEX IF NOT EXISTS idx_material_risk_changes_base_scan + ON scanner.material_risk_changes(base_scan_id) + WHERE base_scan_id IS NOT NULL; + +-- Reachability drift results (aggregate per scan pair) +CREATE TABLE IF NOT EXISTS scanner.reachability_drift_results ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + tenant_id UUID NOT NULL, + base_scan_id TEXT NOT NULL, + head_scan_id TEXT NOT NULL, + newly_reachable_count INT NOT NULL DEFAULT 0, + newly_unreachable_count INT NOT NULL DEFAULT 0, + detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + result_digest TEXT NOT NULL, -- Hash for dedup + + CONSTRAINT reachability_drift_unique UNIQUE (tenant_id, base_scan_id, head_scan_id) +); + +CREATE INDEX IF NOT EXISTS idx_drift_results_head_scan + ON scanner.reachability_drift_results(head_scan_id); + +-- Drifted sinks (individual sink drift records) +CREATE TABLE IF NOT EXISTS scanner.drifted_sinks ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + tenant_id UUID NOT NULL, + drift_result_id UUID NOT NULL REFERENCES scanner.reachability_drift_results(id), + sink_node_id TEXT NOT NULL, + symbol TEXT NOT NULL, + sink_category TEXT NOT NULL, + direction TEXT NOT NULL, -- became_reachable|became_unreachable + cause_kind TEXT NOT NULL, + cause_description TEXT NOT NULL, + cause_symbol TEXT, + cause_file TEXT, + cause_line INT, + code_change_id UUID REFERENCES scanner.code_changes(id), + compressed_path JSONB NOT NULL, + associated_vulns JSONB, + + CONSTRAINT drifted_sinks_unique UNIQUE (drift_result_id, sink_node_id) +); + +CREATE INDEX IF NOT EXISTS idx_drifted_sinks_drift_result + ON scanner.drifted_sinks(drift_result_id); +CREATE INDEX IF NOT EXISTS idx_drifted_sinks_direction + ON scanner.drifted_sinks(direction); +CREATE INDEX IF NOT EXISTS idx_drifted_sinks_category + ON scanner.drifted_sinks(sink_category); + +-- 
Enable RLS +ALTER TABLE scanner.code_changes ENABLE ROW LEVEL SECURITY; +ALTER TABLE scanner.reachability_drift_results ENABLE ROW LEVEL SECURITY; +ALTER TABLE scanner.drifted_sinks ENABLE ROW LEVEL SECURITY; + +DROP POLICY IF EXISTS code_changes_tenant_isolation ON scanner.code_changes; +CREATE POLICY code_changes_tenant_isolation ON scanner.code_changes + USING (tenant_id = scanner.current_tenant_id()); + +DROP POLICY IF EXISTS drift_results_tenant_isolation ON scanner.reachability_drift_results; +CREATE POLICY drift_results_tenant_isolation ON scanner.reachability_drift_results + USING (tenant_id = scanner.current_tenant_id()); + +DROP POLICY IF EXISTS drifted_sinks_tenant_isolation ON scanner.drifted_sinks; +CREATE POLICY drifted_sinks_tenant_isolation ON scanner.drifted_sinks + USING (tenant_id = ( + SELECT tenant_id FROM scanner.reachability_drift_results + WHERE id = drift_result_id + )); + +COMMENT ON TABLE scanner.code_changes IS 'AST-level code change facts for drift analysis'; +COMMENT ON TABLE scanner.reachability_drift_results IS 'Aggregate drift results per scan pair'; +COMMENT ON TABLE scanner.drifted_sinks IS 'Individual drifted sink records with causes and paths'; +``` + +--- + +## Delivery Tracker + +| # | Task ID | Status | Description | Notes | +|---|---------|--------|-------------|-------| +| 1 | DRIFT-001 | TODO | Create CodeChangeFact model | With all change kinds | +| 2 | DRIFT-002 | TODO | Create CodeChangeKind enum | 6 types | +| 3 | DRIFT-003 | TODO | Create ReachabilityDriftResult model | Aggregate result | +| 4 | DRIFT-004 | TODO | Create DriftedSink model | With cause and path | +| 5 | DRIFT-005 | TODO | Create DriftDirection enum | 2 directions | +| 6 | DRIFT-006 | TODO | Create DriftCause model | With factory methods | +| 7 | DRIFT-007 | TODO | Create DriftCauseKind enum | 7 kinds | +| 8 | DRIFT-008 | TODO | Create CompressedPath model | For UI display | +| 9 | DRIFT-009 | TODO | Create PathNode model | With change flags | +| 10 | 
DRIFT-010 | TODO | Implement ReachabilityDriftDetector | Core detection | +| 11 | DRIFT-011 | TODO | Implement DriftCauseExplainer | Cause attribution | +| 12 | DRIFT-012 | TODO | Implement ExplainUnreachable method | Reverse direction | +| 13 | DRIFT-013 | TODO | Implement PathCompressor | Key node selection | +| 14 | DRIFT-014 | TODO | Create Postgres migration 007 | code_changes, drift tables | +| 15 | DRIFT-015 | TODO | Implement ICodeChangeRepository | Storage contract | +| 16 | DRIFT-016 | TODO | Implement PostgresCodeChangeRepository | With Dapper | +| 17 | DRIFT-017 | TODO | Implement IDriftResultRepository | Storage contract | +| 18 | DRIFT-018 | TODO | Implement PostgresDriftResultRepository | With Dapper | +| 19 | DRIFT-019 | TODO | Unit tests for ReachabilityDriftDetector | Various scenarios | +| 20 | DRIFT-020 | TODO | Unit tests for DriftCauseExplainer | All cause kinds | +| 21 | DRIFT-021 | TODO | Unit tests for PathCompressor | Compression logic | +| 22 | DRIFT-022 | TODO | Integration tests with benchmark cases | End-to-end | +| 23 | DRIFT-023 | TODO | Golden fixtures for drift detection | Determinism | +| 24 | DRIFT-024 | TODO | API endpoint GET /scans/{id}/drift | Drift results | +| 25 | DRIFT-025 | TODO | API endpoint GET /drift/{id}/sinks | Individual sinks | +| 26 | DRIFT-026 | TODO | Integrate with MaterialRiskChangeDetector | Extend R1 rule | + +--- + +## 3. 
ACCEPTANCE CRITERIA + +### 3.1 Code Change Detection + +- [ ] Detects added symbols +- [ ] Detects removed symbols +- [ ] Detects signature changes +- [ ] Detects guard changes +- [ ] Detects dependency changes +- [ ] Detects visibility changes + +### 3.2 Drift Detection + +- [ ] Correctly identifies newly reachable sinks +- [ ] Correctly identifies newly unreachable sinks +- [ ] Handles graphs with different node sets +- [ ] Handles cyclic graphs + +### 3.3 Cause Attribution + +- [ ] Attributes guard removal causes +- [ ] Attributes new route causes +- [ ] Attributes visibility escalation causes +- [ ] Attributes dependency upgrade causes +- [ ] Provides unknown cause for undetectable cases + +### 3.4 Path Compression + +- [ ] Selects appropriate key nodes +- [ ] Marks changed nodes correctly +- [ ] Preserves entrypoint and sink +- [ ] Limits key nodes to max count + +### 3.5 Integration + +- [ ] Integrates with MaterialRiskChangeDetector +- [ ] Extends material_risk_changes table correctly +- [ ] API endpoints return correct data + +--- + +## Decisions & Risks + +| ID | Decision | Rationale | +|----|----------|-----------| +| DRIFT-DEC-001 | Extend existing tables, don't duplicate | Leverage scanner.material_risk_changes | +| DRIFT-DEC-002 | Store full path optionally | Enable UI expansion without re-computation | +| DRIFT-DEC-003 | Limit key nodes to 5 | Balance detail vs. 
storage | + +| ID | Risk | Mitigation | +|----|------|------------| +| DRIFT-RISK-001 | Cause attribution false positives | Conservative matching, show "unknown" | +| DRIFT-RISK-002 | Large path storage | Compression, CAS for full paths | +| DRIFT-RISK-003 | Performance on large graphs | Caching, pre-computed reachability | + +--- + +## Execution Log + +| Date (UTC) | Update | Owner | +|---|---|---| +| 2025-12-17 | Created sprint from master plan | Agent | + +--- + +## References + +- **Master Sprint**: `SPRINT_3600_0001_0001_reachability_drift_master.md` +- **Call Graph Sprint**: `SPRINT_3600_0002_0001_call_graph_infrastructure.md` +- **Advisory**: `17-Dec-2025 - Reachability Drift Detection.md` diff --git a/docs/implplan/SPRINT_3600_0004_0001_ui_evidence_chain.md b/docs/implplan/SPRINT_3600_0004_0001_ui_evidence_chain.md new file mode 100644 index 00000000..75b3abc9 --- /dev/null +++ b/docs/implplan/SPRINT_3600_0004_0001_ui_evidence_chain.md @@ -0,0 +1,886 @@ +# SPRINT_3600_0004_0001 - UI and Evidence Chain + +**Status:** TODO +**Priority:** P1 - HIGH +**Module:** Web, Attestor +**Working Directory:** `src/Web/StellaOps.Web/`, `src/Attestor/` +**Estimated Effort:** Medium +**Dependencies:** SPRINT_3600_0003_0001 (Drift Detection Engine) + +--- + +## Topic & Scope + +Implement the UI components and evidence chain integration for reachability drift. 
This sprint covers: +- Angular Path Viewer component +- Risk Drift Card component +- DSSE attestation for drift results +- CLI output enhancements +- SARIF integration + +--- + +## Documentation Prerequisites + +- `docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md` +- `docs/implplan/SPRINT_3600_0003_0001_drift_detection_engine.md` +- `docs/modules/attestor/architecture.md` +- `src/Web/StellaOps.Web/README.md` + +--- + +## Wave Coordination + +Parallel tracks: +- Track A: Angular UI components +- Track B: DSSE attestation +- Track C: CLI enhancements + +--- + +## Interlocks + +- Depends on drift detection API from Sprint 3600.3 +- Must align with existing Console design patterns +- Must use existing Attestor infrastructure + +--- + +## Action Tracker + +| Date (UTC) | Action | Owner | Notes | +|---|---|---|---| +| 2025-12-17 | Created sprint from master plan | Agent | Initial | + +--- + +## 1. OBJECTIVE + +Build the user-facing components: +1. **Path Viewer** - Interactive call path visualization +2. **Risk Drift Card** - Summary view for PRs/scans +3. **Evidence Chain** - DSSE attestation linking +4. **CLI Output** - Enhanced drift reporting + +--- + +## 2. TECHNICAL DESIGN + +### 2.1 Angular Path Viewer Component + +```typescript +// File: src/Web/StellaOps.Web/src/app/components/path-viewer/path-viewer.component.ts + +import { Component, Input, Output, EventEmitter } from '@angular/core'; +import { CommonModule } from '@angular/common'; + +export interface PathNode { + nodeId: string; + symbol: string; + file?: string; + line?: number; + package?: string; + isChanged: boolean; + changeKind?: string; +} + +export interface CompressedPath { + entrypoint: PathNode; + sink: PathNode; + intermediateCount: number; + keyNodes: PathNode[]; + fullPath?: string[]; +} + +@Component({ + selector: 'app-path-viewer', + standalone: true, + imports: [CommonModule], + template: ` +
+    <div class="path-viewer">
+      <div class="path-viewer__header" *ngIf="collapsible" (click)="toggleCollapse()">
+        <span class="path-viewer__title">{{ title }}</span>
+        <span class="path-viewer__toggle">{{ collapsed ? '▸' : '▾' }}</span>
+      </div>
+
+      <div class="path-viewer__body" *ngIf="!collapsed">
+        <!-- Entrypoint -->
+        <div class="path-node path-node--entrypoint">
+          <span class="path-node__symbol">{{ path.entrypoint.symbol }}</span>
+          <span class="path-node__location" *ngIf="path.entrypoint.file">
+            {{ path.entrypoint.file }}:{{ path.entrypoint.line }}
+          </span>
+          <span class="path-node__badge">ENTRYPOINT</span>
+        </div>
+
+        <!-- Key intermediate nodes -->
+        <div class="path-node" *ngFor="let node of path.keyNodes">
+          <span class="path-node__marker" [class.path-node__marker--changed]="node.isChanged">
+            {{ node.isChanged ? '●' : '○' }}
+          </span>
+          <span class="path-node__symbol">{{ node.symbol }}</span>
+          <span class="path-node__location" *ngIf="node.file">
+            {{ node.file }}:{{ node.line }}
+          </span>
+          <span class="path-node__badge" *ngIf="node.isChanged">
+            {{ formatChangeKind(node.changeKind) }}
+          </span>
+        </div>
+
+        <!-- Elided intermediate nodes -->
+        <div class="path-viewer__elided"
+             *ngIf="path.intermediateCount > path.keyNodes.length"
+             (click)="requestFullPath()">
+          ... {{ path.intermediateCount - path.keyNodes.length }} more nodes ...
+        </div>
+
+        <!-- Sink -->
+        <div class="path-node path-node--sink">
+          <span class="path-node__symbol">{{ path.sink.symbol }}</span>
+          <span class="path-node__location" *ngIf="path.sink.file">
+            {{ path.sink.file }}:{{ path.sink.line }}
+          </span>
+          <span class="path-node__badge path-node__badge--sink">VULNERABLE SINK</span>
+          <span class="path-node__package" *ngIf="path.sink.package">
+            {{ path.sink.package }}
+          </span>
+        </div>
+
+        <!-- Legend -->
+        <div class="path-viewer__legend" *ngIf="showLegend">
+          <span>○ Node</span>
+          <span>● Changed</span>
+          <span>◆ Sink</span>
+          <span>→ Call</span>
+        </div>
+      </div>
+    </div>
+  `,
+  styleUrls: ['./path-viewer.component.scss']
+})
+export class PathViewerComponent {
+  @Input() path!: CompressedPath;
+  @Input() title = 'Call Path';
+  @Input() collapsible = true;
+  @Input() showLegend = true;
+  @Input() collapsed = false;
+
+  @Output() expandPath = new EventEmitter<string[]>();
+
+  toggleCollapse(): void {
+    this.collapsed = !this.collapsed;
+  }
+
+  requestFullPath(): void {
+    if (this.path.fullPath) {
+      this.expandPath.emit(this.path.fullPath);
+    }
+  }
+
+  formatChangeKind(kind?: string): string {
+    if (!kind) return 'Changed';
+    return kind
+      .replace(/_/g, ' ')
+      .replace(/\b\w/g, c => c.toUpperCase());
+  }
+}
+```
+
+### 2.2 Risk Drift Card Component
+
+```typescript
+// File: src/Web/StellaOps.Web/src/app/components/risk-drift-card/risk-drift-card.component.ts
+
+import { Component, Input, Output, EventEmitter } from '@angular/core';
+import { CommonModule } from '@angular/common';
+import { PathViewerComponent, CompressedPath } from '../path-viewer/path-viewer.component';
+
+export interface DriftedSink {
+  sinkNodeId: string;
+  symbol: string;
+  sinkCategory: string;
+  direction: 'became_reachable' | 'became_unreachable';
+  cause: DriftCause;
+  path: CompressedPath;
+  associatedVulns: AssociatedVuln[];
+}
+
+export interface DriftCause {
+  kind: string;
+  description: string;
+  changedSymbol?: string;
+  changedFile?: string;
+  changedLine?: number;
+}
+
+export interface AssociatedVuln {
+  cveId: string;
+  epss?: number;
+  cvss?: number;
+  vexStatus?: string;
+  packagePurl?: string;
+}
+
+export interface DriftResult {
+  baseScanId: string;
+  headScanId: string;
+  newlyReachable: DriftedSink[];
+  newlyUnreachable: DriftedSink[];
+}
+
+@Component({
+  selector: 'app-risk-drift-card',
+  standalone: true,
+  imports: [CommonModule, PathViewerComponent],
+  template: `
+    <div class="risk-drift-card">
+      <div class="risk-drift-card__header" (click)="toggleExpand()">
+        <span class="risk-drift-card__title">Risk Drift</span>
+        <span class="risk-drift-card__toggle">{{ expanded ? '▾' : '▸' }}</span>
+      </div>
+
+      <div class="risk-drift-card__summary">
+        <span class="risk-drift-card__count risk-drift-card__count--reachable"
+              *ngIf="result.newlyReachable.length > 0">
+          +{{ result.newlyReachable.length }} new reachable paths
+        </span>
+        <span class="risk-drift-card__count risk-drift-card__count--mitigated"
+              *ngIf="result.newlyUnreachable.length > 0">
+          -{{ result.newlyUnreachable.length }} mitigated paths
+        </span>
+        <span class="risk-drift-card__count" *ngIf="!hasDrift">
+          No material drift detected
+        </span>
+      </div>
+
+      <div class="risk-drift-card__body" *ngIf="expanded">
+        <!-- Newly reachable sinks -->
+        <section *ngIf="result.newlyReachable.length > 0">
+          <h4>New Reachable Paths</h4>
+          <div class="drift-sink" *ngFor="let sink of result.newlyReachable">
+            <div class="drift-sink__route">
+              <span>{{ formatRoute(sink) }}</span>
+            </div>
+            <div class="drift-sink__vulns">
+              <span *ngFor="let vuln of sink.associatedVulns">
+                {{ vuln.cveId }}
+                <span *ngIf="vuln.epss != null">(EPSS {{ vuln.epss | number:'1.2-2' }})</span>
+              </span>
+              <span *ngIf="sink.associatedVulns[0]?.vexStatus">
+                VEX: {{ sink.associatedVulns[0].vexStatus }}
+              </span>
+            </div>
+            <div class="drift-sink__cause">
+              Cause: {{ sink.cause.description }}
+            </div>
+            <app-path-viewer [path]="sink.path"
+                             (expandPath)="viewPath.emit(sink)">
+            </app-path-viewer>
+            <div class="drift-sink__actions">
+              <button (click)="quarantine.emit(sink)">Quarantine</button>
+              <button (click)="pinVersion.emit(sink)">Pin Version</button>
+              <button (click)="addException.emit(sink)">Add Exception</button>
+            </div>
+          </div>
+        </section>
+
+        <!-- Mitigated sinks -->
+        <section *ngIf="result.newlyUnreachable.length > 0">
+          <h4>Mitigated Paths</h4>
+          <div class="drift-sink drift-sink--mitigated" *ngFor="let sink of result.newlyUnreachable">
+            <div class="drift-sink__route">
+              <span>{{ formatRoute(sink) }}</span>
+            </div>
+            <div class="drift-sink__vulns">
+              <span *ngFor="let vuln of sink.associatedVulns">
+                {{ vuln.cveId }}
+              </span>
+            </div>
+            <div class="drift-sink__cause">
+              Reason: {{ sink.cause.description }}
+            </div>
+          </div>
+        </section>
+      </div>
+    </div>
+  `,
+  styleUrls: ['./risk-drift-card.component.scss']
+})
+export class RiskDriftCardComponent {
+  @Input() result!: DriftResult;
+  @Input() expanded = true;
+
+  @Output() viewPath = new EventEmitter<DriftedSink>();
+  @Output() quarantine = new EventEmitter<DriftedSink>();
+  @Output() pinVersion = new EventEmitter<DriftedSink>();
+  @Output() addException = new EventEmitter<DriftedSink>();
+
+  get hasDrift(): boolean {
+    return this.result.newlyReachable.length > 0 ||
+           this.result.newlyUnreachable.length > 0;
+  }
+
+  toggleExpand(): void {
+    this.expanded = !this.expanded;
+  }
+
+  formatRoute(sink: DriftedSink): string {
+    const entrypoint = sink.path.entrypoint.symbol;
+    const sinkSymbol = sink.path.sink.symbol;
+    const intermediateCount = sink.path.intermediateCount;
+
+    if (intermediateCount <= 2) {
+      return `${entrypoint} → ${sinkSymbol}`;
+    }
+    return `${entrypoint} → ... → ${sinkSymbol}`;
+  }
+}
+```
+
+### 2.3 DSSE Predicate for Drift
+
+```csharp
+// File: src/Attestor/StellaOps.Attestor.Types/Predicates/ReachabilityDriftPredicate.cs
+
+namespace StellaOps.Attestor.Types.Predicates;
+
+using System.Collections.Immutable;
+using System.Text.Json.Serialization;
+
+///
+/// DSSE predicate for reachability drift attestation.
+/// predicateType: stellaops.dev/predicates/reachability-drift@v1
+///
+public sealed record ReachabilityDriftPredicate
+{
+    public const string PredicateType = "stellaops.dev/predicates/reachability-drift@v1";
+
+    [JsonPropertyName("baseImage")]
+    public required ImageReference BaseImage { get; init; }
+
+    [JsonPropertyName("targetImage")]
+    public required ImageReference TargetImage { get; init; }
+
+    [JsonPropertyName("baseScanId")]
+    public required string BaseScanId { get; init; }
+
+    [JsonPropertyName("headScanId")]
+    public required string HeadScanId { get; init; }
+
+    [JsonPropertyName("drift")]
+    public required DriftSummary Drift { get; init; }
+
+    [JsonPropertyName("analysis")]
+    public required AnalysisMetadata Analysis { get; init; }
+}
+
+public sealed record ImageReference
+{
+    [JsonPropertyName("name")]
+    public required string Name { get; init; }
+
+    [JsonPropertyName("digest")]
+    public required string Digest { get; init; }
+}
+
+public sealed record DriftSummary
+{
+    [JsonPropertyName("newlyReachableCount")]
+    public required int NewlyReachableCount { get; init; }
+
+    [JsonPropertyName("newlyUnreachableCount")]
+    public required int NewlyUnreachableCount { get; init; }
+
+    [JsonPropertyName("newlyReachable")]
+    public required ImmutableArray<DriftedSinkSummary> NewlyReachable { get; init; }
+
+    [JsonPropertyName("newlyUnreachable")]
+    public required ImmutableArray<DriftedSinkSummary> NewlyUnreachable { get; init; }
+}
+
+public sealed record DriftedSinkSummary
+{
+    [JsonPropertyName("sinkNodeId")]
+    public required string SinkNodeId { get; init; }
+
+    [JsonPropertyName("symbol")]
+    public required string Symbol { get; init; }
+
+    [JsonPropertyName("sinkCategory")]
+    public required string SinkCategory { get; init; }
+
+    [JsonPropertyName("causeKind")]
+    public required string CauseKind { get; init; }
+
+    [JsonPropertyName("causeDescription")]
+    public required string CauseDescription { get; init; }
+
+    [JsonPropertyName("associatedCves")]
+    public ImmutableArray<string> AssociatedCves { get;
init; } = [];
+}
+
+public sealed record AnalysisMetadata
+{
+    [JsonPropertyName("analyzedAt")]
+    public required DateTimeOffset AnalyzedAt { get; init; }
+
+    [JsonPropertyName("scanner")]
+    public required ScannerInfo Scanner { get; init; }
+
+    [JsonPropertyName("baseGraphDigest")]
+    public required string BaseGraphDigest { get; init; }
+
+    [JsonPropertyName("headGraphDigest")]
+    public required string HeadGraphDigest { get; init; }
+}
+
+public sealed record ScannerInfo
+{
+    [JsonPropertyName("name")]
+    public required string Name { get; init; }
+
+    [JsonPropertyName("version")]
+    public required string Version { get; init; }
+
+    [JsonPropertyName("ruleset")]
+    public string? Ruleset { get; init; }
+}
+```
+
+### 2.4 CLI Output Enhancement
+
+```csharp
+// File: src/Cli/StellaOps.Cli/Commands/DriftCommand.cs
+
+namespace StellaOps.Cli.Commands;
+
+using System.CommandLine;
+using System.Text.Json;
+using Spectre.Console;
+
+public class DriftCommand : Command
+{
+    public DriftCommand() : base("drift", "Detect reachability drift between image versions")
+    {
+        var baseOption = new Option<string>("--base", "Base image reference") { IsRequired = true };
+        var targetOption = new Option<string>("--target", "Target image reference") { IsRequired = true };
+        var formatOption = new Option<string>("--format", () => "table", "Output format (table|json|sarif)");
+        var verboseOption = new Option<bool>("--verbose", () => false, "Show detailed path information");
+
+        AddOption(baseOption);
+        AddOption(targetOption);
+        AddOption(formatOption);
+        AddOption(verboseOption);
+
+        this.SetHandler(ExecuteAsync, baseOption, targetOption, formatOption, verboseOption);
+    }
+
+    private async Task ExecuteAsync(string baseImage, string targetImage, string format, bool verbose)
+    {
+        AnsiConsole.MarkupLine($"[bold]Analyzing drift:[/] {baseImage} → {targetImage}");
+
+        // TODO: Call drift detection service
+        var result = await DetectDriftAsync(baseImage, targetImage);
+
+        switch (format.ToLowerInvariant())
+        {
+            case
"json":
+                OutputJson(result);
+                break;
+            case "sarif":
+                OutputSarif(result);
+                break;
+            default:
+                OutputTable(result, verbose);
+                break;
+        }
+
+        // Exit code based on drift
+        Environment.ExitCode = result.TotalDriftCount switch
+        {
+            0 => 0,                                          // No drift
+            > 0 when result.NewlyReachable.Length > 0 => 1,  // New reachable (info)
+            _ => 0                                           // Only mitigated
+        };
+    }
+
+    private void OutputTable(ReachabilityDriftResult result, bool verbose)
+    {
+        if (result.NewlyReachable.Length > 0)
+        {
+            AnsiConsole.MarkupLine("\n[red bold]NEW REACHABLE PATHS[/]");
+
+            var table = new Table();
+            table.AddColumn("Sink");
+            table.AddColumn("Category");
+            table.AddColumn("Cause");
+            if (verbose)
+            {
+                table.AddColumn("Path");
+            }
+            table.AddColumn("CVEs");
+
+            foreach (var sink in result.NewlyReachable)
+            {
+                var row = new List<string>
+                {
+                    sink.Symbol,
+                    sink.SinkCategory.ToString(),
+                    sink.Cause.Description
+                };
+
+                if (verbose)
+                {
+                    row.Add($"{sink.Path.Entrypoint.Symbol} → ... → {sink.Path.Sink.Symbol}");
+                }
+
+                row.Add(string.Join(", ", sink.AssociatedVulns.Select(v => v.CveId)));
+
+                table.AddRow(row.ToArray());
+            }
+
+            AnsiConsole.Write(table);
+        }
+
+        if (result.NewlyUnreachable.Length > 0)
+        {
+            AnsiConsole.MarkupLine("\n[green bold]MITIGATED PATHS[/]");
+
+            var table = new Table();
+            table.AddColumn("Sink");
+            table.AddColumn("Category");
+            table.AddColumn("Reason");
+
+            foreach (var sink in result.NewlyUnreachable)
+            {
+                table.AddRow(
+                    sink.Symbol,
+                    sink.SinkCategory.ToString(),
+                    sink.Cause.Description);
+            }
+
+            AnsiConsole.Write(table);
+        }
+
+        if (result.TotalDriftCount == 0)
+        {
+            AnsiConsole.MarkupLine("\n[green]No material reachability drift detected.[/]");
+        }
+
+        AnsiConsole.MarkupLine($"\n[bold]Summary:[/] +{result.NewlyReachable.Length} reachable, -{result.NewlyUnreachable.Length} mitigated");
+    }
+
+    private void OutputJson(ReachabilityDriftResult result)
+    {
+        var json = JsonSerializer.Serialize(result, new JsonSerializerOptions
+        {
+            WriteIndented = true,
PropertyNamingPolicy = JsonNamingPolicy.CamelCase
+        });
+        Console.WriteLine(json);
+    }
+
+    private void OutputSarif(ReachabilityDriftResult result)
+    {
+        // Generate SARIF 2.1.0 output
+        // TODO: Implement SARIF generation
+        throw new NotImplementedException("SARIF output to be implemented");
+    }
+
+    private Task<ReachabilityDriftResult> DetectDriftAsync(string baseImage, string targetImage)
+    {
+        // TODO: Implement actual drift detection
+        throw new NotImplementedException();
+    }
+}
+```
+
+### 2.5 SARIF Integration
+
+```csharp
+// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Output/DriftSarifGenerator.cs
+
+namespace StellaOps.Scanner.ReachabilityDrift.Output;
+
+using System.Collections.Generic;
+using System.Linq;
+using System.Text.Json;
+
+///
+/// Generates SARIF 2.1.0 output for drift results.
+///
+public sealed class DriftSarifGenerator
+{
+    private const string ToolName = "StellaOps.ReachabilityDrift";
+    private const string ToolVersion = "1.0.0";
+
+    public JsonDocument Generate(ReachabilityDriftResult result)
+    {
+        var rules = new List<object>();
+        var results = new List<object>();
+
+        // Add rules for each drift type
+        rules.Add(new
+        {
+            id = "RDRIFT001",
+            name = "NewlyReachableSink",
+            shortDescription = new { text = "Vulnerable sink became reachable" },
+            fullDescription = new { text = "A vulnerable code sink became reachable from application entrypoints due to code changes." },
+            defaultConfiguration = new { level = "error" }
+        });
+
+        rules.Add(new
+        {
+            id = "RDRIFT002",
+            name = "MitigatedSink",
+            shortDescription = new { text = "Vulnerable sink became unreachable" },
+            fullDescription = new { text = "A vulnerable code sink is no longer reachable from application entrypoints." },
+            defaultConfiguration = new { level = "note" }
+        });
+
+        // Add results for newly reachable sinks
+        foreach (var sink in result.NewlyReachable)
+        {
+            results.Add(new
+            {
+                ruleId = "RDRIFT001",
+                level = "error",
+                message = new
+                {
+                    text = $"Sink {sink.Symbol} became reachable. Cause: {sink.Cause.Description}"
+                },
+                locations = sink.Cause.ChangedFile is not null ? new object[]
+                {
+                    new
+                    {
+                        physicalLocation = new
+                        {
+                            artifactLocation = new { uri = sink.Cause.ChangedFile },
+                            region = new { startLine = sink.Cause.ChangedLine ?? 1 }
+                        }
+                    }
+                } : Array.Empty<object>(),
+                properties = new
+                {
+                    sinkCategory = sink.SinkCategory.ToString(),
+                    causeKind = sink.Cause.Kind.ToString(),
+                    associatedVulns = sink.AssociatedVulns.Select(v => v.CveId).ToArray()
+                }
+            });
+        }
+
+        // Add results for mitigated sinks
+        foreach (var sink in result.NewlyUnreachable)
+        {
+            results.Add(new
+            {
+                ruleId = "RDRIFT002",
+                level = "note",
+                message = new
+                {
+                    text = $"Sink {sink.Symbol} is no longer reachable. Reason: {sink.Cause.Description}"
+                },
+                properties = new
+                {
+                    sinkCategory = sink.SinkCategory.ToString(),
+                    causeKind = sink.Cause.Kind.ToString()
+                }
+            });
+        }
+
+        // Anonymous types cannot declare a "$schema" member, so build the root as a dictionary.
+        var sarif = new Dictionary<string, object>
+        {
+            ["version"] = "2.1.0",
+            ["$schema"] = "https://json.schemastore.org/sarif-2.1.0.json",
+            ["runs"] = new object[]
+            {
+                new
+                {
+                    tool = new
+                    {
+                        driver = new
+                        {
+                            name = ToolName,
+                            version = ToolVersion,
+                            informationUri = "https://stellaops.dev/docs/reachability-drift",
+                            rules = rules.ToArray()
+                        }
+                    },
+                    results = results.ToArray(),
+                    invocations = new object[]
+                    {
+                        new
+                        {
+                            executionSuccessful = true,
+                            endTimeUtc = result.DetectedAt.UtcDateTime.ToString("o")
+                        }
+                    }
+                }
+            }
+        };
+
+        var json = JsonSerializer.Serialize(sarif, new JsonSerializerOptions
+        {
+            WriteIndented = true,
+            PropertyNamingPolicy = JsonNamingPolicy.CamelCase
+        });
+
+        return JsonDocument.Parse(json);
+    }
+}
+```
+
+---
+
+## Delivery Tracker
+
+| # | Task ID | Status | Description | Notes |
+|---|---------|--------|-------------|-------|
+| 1 | UI-001 | TODO | Create PathNode TypeScript interface | Angular model |
+| 2 | UI-002 | TODO | Create CompressedPath TypeScript interface | Angular model |
+| 3 | UI-003 | TODO | Create PathViewerComponent | Core visualization |
+| 4 | UI-004 | TODO | Style PathViewerComponent | SCSS
styling | +| 5 | UI-005 | TODO | Create DriftedSink TypeScript interface | Angular model | +| 6 | UI-006 | TODO | Create DriftResult TypeScript interface | Angular model | +| 7 | UI-007 | TODO | Create RiskDriftCardComponent | Summary card | +| 8 | UI-008 | TODO | Style RiskDriftCardComponent | SCSS styling | +| 9 | UI-009 | TODO | Create drift API service | Angular HTTP service | +| 10 | UI-010 | TODO | Integrate PathViewer into scan details | Page integration | +| 11 | UI-011 | TODO | Integrate RiskDriftCard into PR view | Page integration | +| 12 | UI-012 | TODO | Unit tests for PathViewerComponent | Jest tests | +| 13 | UI-013 | TODO | Unit tests for RiskDriftCardComponent | Jest tests | +| 14 | UI-014 | TODO | Create ReachabilityDriftPredicate model | DSSE predicate | +| 15 | UI-015 | TODO | Register predicate in Attestor | Type registration | +| 16 | UI-016 | TODO | Implement drift attestation service | DSSE signing | +| 17 | UI-017 | TODO | Add attestation to drift API | API integration | +| 18 | UI-018 | TODO | Unit tests for attestation | Predicate validation | +| 19 | UI-019 | TODO | Create DriftCommand for CLI | CLI command | +| 20 | UI-020 | TODO | Implement table output | Spectre.Console | +| 21 | UI-021 | TODO | Implement JSON output | JSON serialization | +| 22 | UI-022 | TODO | Create DriftSarifGenerator | SARIF 2.1.0 | +| 23 | UI-023 | TODO | Implement SARIF output for CLI | CLI integration | +| 24 | UI-024 | TODO | Update CLI documentation | docs/cli/ | +| 25 | UI-025 | TODO | Integration tests for CLI | End-to-end | + +--- + +## 3. 
ACCEPTANCE CRITERIA + +### 3.1 Path Viewer Component + +- [ ] Displays entrypoint and sink nodes +- [ ] Shows key intermediate nodes +- [ ] Highlights changed nodes +- [ ] Supports collapse/expand +- [ ] Shows legend +- [ ] Handles paths of various lengths + +### 3.2 Risk Drift Card Component + +- [ ] Shows summary badges +- [ ] Lists newly reachable paths +- [ ] Lists mitigated paths +- [ ] Shows associated vulnerabilities +- [ ] Provides action buttons +- [ ] Supports expand/collapse + +### 3.3 DSSE Attestation + +- [ ] Generates valid predicate +- [ ] Signs with DSSE envelope +- [ ] Includes graph digests +- [ ] Includes all drift details +- [ ] Passes schema validation + +### 3.4 CLI Output + +- [ ] Table output is readable +- [ ] JSON output is valid +- [ ] SARIF output passes schema validation +- [ ] Exit codes are correct +- [ ] Verbose mode shows paths + +--- + +## Decisions & Risks + +| ID | Decision | Rationale | +|----|----------|-----------| +| UI-DEC-001 | Standalone Angular components | Reusability across pages | +| UI-DEC-002 | SARIF rule IDs prefixed with RDRIFT | Distinguish from other SARIF sources | +| UI-DEC-003 | CLI uses Spectre.Console | Consistent with existing CLI style | + +| ID | Risk | Mitigation | +|----|------|------------| +| UI-RISK-001 | Large paths slow UI | Lazy loading, pagination | +| UI-RISK-002 | SARIF compatibility issues | Test against multiple consumers | +| UI-RISK-003 | Attestation size limits | Summary only, link to full data | + +--- + +## Execution Log + +| Date (UTC) | Update | Owner | +|---|---|---| +| 2025-12-17 | Created sprint from master plan | Agent | + +--- + +## References + +- **Master Sprint**: `SPRINT_3600_0001_0001_reachability_drift_master.md` +- **Drift Detection Sprint**: `SPRINT_3600_0003_0001_drift_detection_engine.md` +- **Advisory**: `17-Dec-2025 - Reachability Drift Detection.md` +- **Angular Style Guide**: https://angular.io/guide/styleguide +- **SARIF 2.1.0 Spec**: 
https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html diff --git a/docs/implplan/SPRINT_3700_0001_0001_triage_db_schema.md b/docs/implplan/SPRINT_3700_0001_0001_triage_db_schema.md new file mode 100644 index 00000000..0c811eb2 --- /dev/null +++ b/docs/implplan/SPRINT_3700_0001_0001_triage_db_schema.md @@ -0,0 +1,241 @@ +# SPRINT_3700_0001_0001_triage_db_schema + +**Epic:** Triage Infrastructure +**Module:** Scanner +**Working Directory:** `src/Scanner/__Libraries/StellaOps.Scanner.Triage/` +**Status:** TODO +**Created:** 2025-12-17 +**Target Completion:** TBD +**Depends On:** None + +--- + +## 1. Overview + +Implement the PostgreSQL database schema for the Narrative-First Triage UX system, including all tables, enums, indexes, and views required to support the triage workflow. + +### 1.1 Deliverables + +1. PostgreSQL migration script (`triage_schema.sql`) +2. EF Core entities for all triage tables +3. `TriageDbContext` with proper configuration +4. Integration tests using Testcontainers +5. Performance validation for indexed queries + +### 1.2 Dependencies + +- PostgreSQL >= 16 +- EF Core 9.0 +- `StellaOps.Infrastructure.Postgres` for base patterns + +--- + +## 2. 
Delivery Tracker
+
+| ID | Task | Owner | Status | Notes |
+|----|------|-------|--------|-------|
+| T1 | Create migration script from `docs/db/triage_schema.sql` | — | TODO | |
+| T2 | Create PostgreSQL enums (7 types) | — | TODO | See schema |
+| T3 | Create `TriageFinding` entity | — | TODO | |
+| T4 | Create `TriageEffectiveVex` entity | — | TODO | |
+| T5 | Create `TriageReachabilityResult` entity | — | TODO | |
+| T6 | Create `TriageRiskResult` entity | — | TODO | |
+| T7 | Create `TriageDecision` entity | — | TODO | |
+| T8 | Create `TriageEvidenceArtifact` entity | — | TODO | |
+| T9 | Create `TriageSnapshot` entity | — | TODO | |
+| T10 | Create `TriageDbContext` with Fluent API | — | TODO | |
+| T11 | Implement `v_triage_case_current` view mapping | — | TODO | |
+| T12 | Add performance indexes | — | TODO | |
+| T13 | Write integration tests with Testcontainers | — | TODO | |
+| T14 | Validate query performance (explain analyze) | — | TODO | |
+
+---
+
+## 3. Task Details
+
+### T1: Create migration script
+
+**Location:** `src/Scanner/__Libraries/StellaOps.Scanner.Triage/Migrations/`
+
+Use the schema from `docs/db/triage_schema.sql` as the authoritative source. Create an EF Core migration that matches.
+
+### T2-T9: Entity Classes
+
+Create entities in `src/Scanner/__Libraries/StellaOps.Scanner.Triage/Entities/`
+
+```csharp
+// Example structure
+namespace StellaOps.Scanner.Triage.Entities;
+
+public enum TriageLane
+{
+    Active,
+    Blocked,
+    NeedsException,
+    MutedReach,
+    MutedVex,
+    Compensated
+}
+
+public enum TriageVerdict
+{
+    Ship,
+    Block,
+    Exception
+}
+
+public sealed record TriageFinding
+{
+    public Guid Id { get; init; }
+    public Guid AssetId { get; init; }
+    public Guid? EnvironmentId { get; init; }
+    public required string AssetLabel { get; init; }
+    public required string Purl { get; init; }
+    public string? CveId { get; init; }
+    public string? RuleId { get; init; }
+    public DateTimeOffset FirstSeenAt { get; init; }
+    public DateTimeOffset LastSeenAt { get; init; }
+}
+```
+
+### T10: DbContext Configuration
+
+```csharp
+public sealed class TriageDbContext : DbContext
+{
+    public DbSet<TriageFinding> Findings => Set<TriageFinding>();
+    public DbSet<TriageEffectiveVex> EffectiveVex => Set<TriageEffectiveVex>();
+    public DbSet<TriageReachabilityResult> ReachabilityResults => Set<TriageReachabilityResult>();
+    public DbSet<TriageRiskResult> RiskResults => Set<TriageRiskResult>();
+    public DbSet<TriageDecision> Decisions => Set<TriageDecision>();
+    public DbSet<TriageEvidenceArtifact> EvidenceArtifacts => Set<TriageEvidenceArtifact>();
+    public DbSet<TriageSnapshot> Snapshots => Set<TriageSnapshot>();
+
+    protected override void OnModelCreating(ModelBuilder modelBuilder)
+    {
+        // Configure PostgreSQL enums
+        modelBuilder.HasPostgresEnum<TriageLane>(name: "triage_lane");
+        modelBuilder.HasPostgresEnum<TriageVerdict>(name: "triage_verdict");
+        // ... more enums
+
+        // Configure entities
+        modelBuilder.Entity<TriageFinding>(entity =>
+        {
+            entity.ToTable("triage_finding");
+            entity.HasKey(e => e.Id);
+            entity.HasIndex(e => e.LastSeenAt).IsDescending();
+            // ... more configuration
+        });
+    }
+}
+```
+
+### T11: View Mapping
+
+Map the `v_triage_case_current` view as a keyless entity:
+
+```csharp
+[Keyless]
+public sealed record TriageCaseCurrent
+{
+    public Guid CaseId { get; init; }
+    public Guid AssetId { get; init; }
+    // ... all view columns
+}
+
+// In DbContext
+modelBuilder.Entity<TriageCaseCurrent>()
+    .ToView("v_triage_case_current")
+    .HasNoKey();
+```
+
+### T13: Integration Tests
+
+```csharp
+public class TriageSchemaTests : IAsyncLifetime
+{
+    private readonly PostgreSqlContainer _postgres = new PostgreSqlBuilder()
+        .WithImage("postgres:16-alpine")
+        .Build();
+
+    [Fact]
+    public async Task Schema_Creates_Successfully()
+    {
+        await using var context = CreateContext();
+        await context.Database.EnsureCreatedAsync();
+
+        // Verify tables exist
+        var tables = await context.Database.SqlQuery<string>(
+            $"SELECT tablename FROM pg_tables WHERE schemaname = 'public'")
+            .ToListAsync();
+
+        Assert.Contains("triage_finding", tables);
+        Assert.Contains("triage_decision", tables);
+        // ... more assertions
+    }
+
+    [Fact]
+    public async Task View_Returns_Correct_Columns()
+    {
+        await using var context = CreateContext();
+        await context.Database.EnsureCreatedAsync();
+
+        // Insert test data
+        var finding = new TriageFinding { /* ... */ };
+        context.Findings.Add(finding);
+        await context.SaveChangesAsync();
+
+        // Query view
+        var cases = await context.Set<TriageCaseCurrent>().ToListAsync();
+        Assert.Single(cases);
+    }
+}
+```
+
+---
+
+## 4. Decisions & Risks
+
+### 4.1 Decisions
+
+| Decision | Rationale | Date |
+|----------|-----------|------|
+| Use PostgreSQL enums | Type safety, smaller storage | 2025-12-17 |
+| Use `DISTINCT ON` in view | Efficient "latest" queries | 2025-12-17 |
+| Store explanation as JSONB | Flexible schema for lattice output | 2025-12-17 |
+
+### 4.2 Risks
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| Enum changes require migration | Medium | Use versioned enums, add-only pattern |
+| View performance on large datasets | High | Monitor, add materialized view if needed |
+
+---
+
+## 5. Acceptance Criteria (Sprint)
+
+- [ ] All 8 tables created with correct constraints
+- [ ] All 7 enums registered in PostgreSQL
+- [ ] View `v_triage_case_current` returns correct data
+- [ ] Indexes created and verified with EXPLAIN ANALYZE
+- [ ] Integration tests pass with Testcontainers
+- [ ] No circular dependencies in foreign keys
+- [ ] Migration is idempotent (can run multiple times)
+
+---
+
+## 6. Execution Log
+
+| Date | Update | Owner |
+|------|--------|-------|
+| 2025-12-17 | Sprint file created | Claude |
+
+---
+
+## 7.
Reference Files + +- Schema definition: `docs/db/triage_schema.sql` +- UX Guide: `docs/ux/TRIAGE_UX_GUIDE.md` +- API Contract: `docs/api/triage.contract.v1.md` +- Advisory: `docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md` diff --git a/docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md b/docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md new file mode 100644 index 00000000..8d17b3a7 --- /dev/null +++ b/docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md @@ -0,0 +1,395 @@ +# Reachability Drift Detection + +**Date**: 2025-12-17 +**Status**: ANALYZED - Ready for Implementation Planning +**Related Advisories**: +- 14-Dec-2025 - Smart-Diff Technical Reference +- 14-Dec-2025 - Reachability Analysis Technical Reference + +--- + +## 1. EXECUTIVE SUMMARY + +This advisory proposes extending StellaOps' Smart-Diff capabilities to detect **reachability drift** - changes in whether vulnerable code paths are reachable from application entry points between container image versions. + +**Core Insight**: Raw diffs don't equal risk. Most changed lines don't matter for exploitability. Reachability drift detection fuses **call-stack reachability graphs** with **Smart-Diff metadata** to flag only paths that went from **unreachable to reachable** (or vice-versa), tied to **SBOM components** and **VEX statements**. + +--- + +## 2. 
GAP ANALYSIS vs EXISTING INFRASTRUCTURE + +### 2.1 What Already Exists (Leverage Points) + +| Component | Location | Status | +|-----------|----------|--------| +| `MaterialRiskChangeDetector` | `Scanner.SmartDiff.Detection` | DONE - R1-R4 rules | +| `VexCandidateEmitter` | `Scanner.SmartDiff.Detection` | DONE - Absent API detection | +| `ReachabilityGateBridge` | `Scanner.SmartDiff.Detection` | DONE - Lattice to 3-bit | +| `ReachabilitySignal` | `Signals.Contracts` | DONE - Call path model | +| `ReachabilityLatticeState` | `Signals.Contracts.Evidence` | DONE - 5-state enum | +| `CallPath`, `CallPathNode` | `Signals.Contracts.Evidence` | DONE - Path representation | +| `ReachabilityEvidenceChain` | `Signals.Contracts.Evidence` | DONE - Proof chain | +| `vex.graph_nodes/edges` | DB Schema | DONE - Graph storage | +| `scanner.risk_state_snapshots` | DB Schema | DONE - State storage | +| `scanner.material_risk_changes` | DB Schema | DONE - Change storage | +| `FnDriftCalculator` | `Scanner.Core.Drift` | DONE - Classification drift | +| `SarifOutputGenerator` | `Scanner.SmartDiff.Output` | DONE - CI output | +| Reachability Benchmark | `bench/reachability-benchmark/` | DONE - Ground truth cases | +| Language Analyzers | `Scanner.Analyzers.Lang.*` | PARTIAL - Package detection, limited call graph | + +### 2.2 What's Missing (New Implementation Required) + +| Component | Advisory Ref | Gap Description | +|-----------|-------------|-----------------| +| **Call Graph Extractor (.NET)** | §7 C# Roslyn | No MSBuildWorkspace/Roslyn analysis exists | +| **Call Graph Extractor (Go)** | §7 Go SSA | No golang.org/x/tools/go/ssa integration | +| **Call Graph Extractor (Java)** | §7 | No Soot/WALA integration | +| **Call Graph Extractor (Node)** | §7 | No @babel/traverse integration | +| **`scanner.code_changes` table** | §4 Smart-Diff | AST-level diff facts not persisted | +| **Drift Cause Explainer** | §6 Timeline | No causal attribution on path nodes | +| **Path Viewer UI** | 
§UX | No Angular component for call path visualization | +| **Cross-scan Function-level Drift** | §6 | State drift exists, function-level doesn't | +| **Entrypoint Discovery (per-framework)** | §3 | Limited beyond package.json/manifest parsing | + +### 2.3 Terminology Mapping + +| Advisory Term | StellaOps Equivalent | Notes | +|--------------|---------------------|-------| +| `commit_sha` | `scan_id` | StellaOps is image-centric, not commit-centric | +| `call_node` | `vex.graph_nodes` | Existing schema, extend don't duplicate | +| `call_edge` | `vex.graph_edges` | Existing schema | +| `reachability_drift` | `scanner.material_risk_changes` | Add `cause`, `path_nodes` columns | +| Risk Drift | Material Risk Change | Existing term is more precise | +| Router, Signals | Signals module only | Router module is not implemented | + +--- + +## 3. RECOMMENDED IMPLEMENTATION PATH + +### 3.1 What to Ship (Delta from Current State) + +``` +NEW TABLES: +├── scanner.code_changes # AST-level diff facts +└── scanner.call_graph_snapshots # Per-scan call graph cache + +NEW COLUMNS: +├── scanner.material_risk_changes.cause # TEXT - "guard_removed", "new_route", etc. 
+├── scanner.material_risk_changes.path_nodes    # JSONB - Compressed path representation
+└── scanner.material_risk_changes.base_scan_id  # UUID - For cross-scan comparison
+
+NEW SERVICES:
+├── CallGraphExtractor.DotNet    # Roslyn-based for .NET projects
+├── CallGraphExtractor.Node      # AST-based for Node.js
+├── DriftCauseExplainer          # Attribute causes to code changes
+└── PathCompressor               # Compress paths for storage/UI
+
+NEW UI:
+└── PathViewerComponent          # Angular component for call path visualization
+```
+
+### 3.2 What NOT to Ship (Avoid Duplication)
+
+- **Don't create `call_node`/`call_edge` tables** - Use existing `vex.graph_nodes`/`vex.graph_edges`
+- **Don't add `commit_sha` columns** - Use `scan_id` consistently
+- **Don't build React components** - Angular v17 is the stack
+
+### 3.3 Use Valkey for Graph Caching
+
+Valkey is already integrated in `Router.Gateway.RateLimit`. Use it for:
+
+- **Call graph snapshot caching** - Fast cross-instance lookups
+- **Reachability result caching** - Avoid recomputation
+- **Key pattern**: `stella:callgraph:{scan_id}:{lang}:{digest}`
+
+```yaml
+# Configuration pattern (align with existing Router rate limiting)
+reachability:
+  valkey_connection: "localhost:6379"
+  valkey_bucket: "stella-reachability"
+  cache_ttl_hours: 24
+  circuit_breaker:
+    failure_threshold: 5
+    timeout_seconds: 30
+```
+
+---
+
+## 4. TECHNICAL DESIGN
+
+### 4.1 Call Graph Extraction Model
+
+```csharp
+/// <summary>
+/// Per-scan call graph snapshot for drift comparison.
+/// </summary>
+public sealed record CallGraphSnapshot
+{
+    public required string ScanId { get; init; }
+    public required string GraphDigest { get; init; }    // Content hash
+    public required string Language { get; init; }
+    public required DateTimeOffset ExtractedAt { get; init; }
+    public required ImmutableArray<CallGraphNode> Nodes { get; init; }
+    public required ImmutableArray<CallGraphEdge> Edges { get; init; }
+    public required ImmutableArray<string> EntrypointIds { get; init; }
+}
+
+public sealed record CallGraphNode
+{
+    public required string NodeId { get; init; }         // Stable identifier
+    public required string Symbol { get; init; }         // Fully qualified name
+    public required string File { get; init; }
+    public required int Line { get; init; }
+    public required string Package { get; init; }
+    public required string Visibility { get; init; }     // public/internal/private
+    public required bool IsEntrypoint { get; init; }
+    public required bool IsSink { get; init; }
+    public string? SinkCategory { get; init; }           // CMD_EXEC, SQL_RAW, etc.
+}
+
+public sealed record CallGraphEdge
+{
+    public required string SourceId { get; init; }
+    public required string TargetId { get; init; }
+    public required string CallKind { get; init; }       // direct/virtual/delegate
+}
+```
+
+### 4.2 Code Change Facts Model
+
+```csharp
+/// <summary>
+/// AST-level code change facts from Smart-Diff.
+/// </summary>
+public sealed record CodeChangeFact
+{
+    public required string ScanId { get; init; }
+    public required string File { get; init; }
+    public required string Symbol { get; init; }
+    public required CodeChangeKind Kind { get; init; }
+    public required JsonDocument Details { get; init; }
+}
+
+public enum CodeChangeKind
+{
+    Added,
+    Removed,
+    SignatureChanged,
+    GuardChanged,       // Boolean condition around call modified
+    DependencyChanged,  // Callee package/version changed
+    VisibilityChanged   // public<->internal<->private
+}
+```
+
+### 4.3 Drift Cause Attribution
+
+```csharp
+/// <summary>
+/// Explains why a reachability flip occurred.
+/// </summary>
+public sealed class DriftCauseExplainer
+{
+    public DriftCause Explain(
+        CallGraphSnapshot baseGraph,
+        CallGraphSnapshot headGraph,
+        string sinkSymbol,
+        IReadOnlyList<CodeChangeFact> codeChanges)
+    {
+        // Find shortest path to sink in head graph
+        var path = ShortestPath(headGraph.EntrypointIds, sinkSymbol, headGraph);
+        if (path is null)
+            return DriftCause.Unknown;
+
+        // Check each node on path for code changes
+        foreach (var nodeId in path.NodeIds)
+        {
+            var node = headGraph.Nodes.First(n => n.NodeId == nodeId);
+            var change = codeChanges.FirstOrDefault(c => c.Symbol == node.Symbol);
+
+            if (change is not null)
+            {
+                return change.Kind switch
+                {
+                    CodeChangeKind.GuardChanged => DriftCause.GuardRemoved(node.Symbol, node.File, node.Line),
+                    CodeChangeKind.Added => DriftCause.NewPublicRoute(node.Symbol),
+                    CodeChangeKind.VisibilityChanged => DriftCause.VisibilityEscalated(node.Symbol),
+                    CodeChangeKind.DependencyChanged => DriftCause.DepUpgraded(change.Details),
+                    _ => DriftCause.CodeModified(node.Symbol)
+                };
+            }
+        }
+
+        return DriftCause.Unknown;
+    }
+}
+```
+
+### 4.4 Database Schema Extensions
+
+```sql
+-- New table: Code change facts from AST-level Smart-Diff
+CREATE TABLE scanner.code_changes (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    tenant_id UUID NOT NULL,
+    scan_id TEXT NOT NULL,
+    file TEXT NOT NULL,
+    symbol TEXT NOT NULL,
+    change_kind TEXT NOT NULL,  -- added|removed|signature|guard|dep|visibility
+    details JSONB,
+    detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+
+    CONSTRAINT code_changes_unique UNIQUE (tenant_id, scan_id, file, symbol)
+);
+
+CREATE INDEX idx_code_changes_scan ON scanner.code_changes(scan_id);
+CREATE INDEX idx_code_changes_symbol ON scanner.code_changes(symbol);
+
+-- New table: Per-scan call graph snapshots (compressed)
+CREATE TABLE scanner.call_graph_snapshots (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    tenant_id UUID NOT NULL,
+    scan_id TEXT NOT NULL,
+    language TEXT NOT NULL,
+    graph_digest TEXT NOT NULL,  -- 
Content hash for dedup + node_count INT NOT NULL, + edge_count INT NOT NULL, + entrypoint_count INT NOT NULL, + extracted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + cas_uri TEXT NOT NULL, -- Reference to CAS for full graph + + CONSTRAINT call_graph_snapshots_unique UNIQUE (tenant_id, scan_id, language) +); + +CREATE INDEX idx_call_graph_snapshots_digest ON scanner.call_graph_snapshots(graph_digest); + +-- Extend existing material_risk_changes table +ALTER TABLE scanner.material_risk_changes +ADD COLUMN IF NOT EXISTS cause TEXT, +ADD COLUMN IF NOT EXISTS path_nodes JSONB, +ADD COLUMN IF NOT EXISTS base_scan_id TEXT; + +CREATE INDEX IF NOT EXISTS idx_material_risk_changes_cause +ON scanner.material_risk_changes(cause) WHERE cause IS NOT NULL; +``` + +--- + +## 5. UI DESIGN + +### 5.1 Risk Drift Card (PR/Commit View) + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ RISK DRIFT ▼ │ +├─────────────────────────────────────────────────────────────────────┤ +│ +3 new reachable paths -2 mitigated paths │ +│ │ +│ ┌─ NEW REACHABLE ──────────────────────────────────────────────┐ │ +│ │ POST /payments → PaymentsController.Capture → ... → │ │ +│ │ crypto.Verify(legacy) │ │ +│ │ │ │ +│ │ [pkg:payments@1.8.2] [CVE-2024-1234] [EPSS 0.72] [VEX:affected]│ │ +│ │ │ │ +│ │ Cause: guard removed in AuthFilter.cs:42 │ │ +│ │ │ │ +│ │ [View Path] [Quarantine Route] [Pin Version] [Add Exception] │ │ +│ └───────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─ MITIGATED ──────────────────────────────────────────────────┐ │ +│ │ GET /admin → AdminController.Execute → ... 
→ cmd.Run │ │ +│ │ │ │ +│ │ [pkg:admin@2.0.0] [CVE-2024-5678] [VEX:not_affected] │ │ +│ │ │ │ +│ │ Reason: Vulnerable API removed in upgrade │ │ +│ └───────────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +### 5.2 Path Viewer Component + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ CALL PATH: POST /payments → crypto.Verify(legacy) [Collapse] │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ ○ POST /payments [ENTRYPOINT] │ +│ │ PaymentsController.cs:45 │ +│ │ │ +│ ├──○ PaymentsController.Capture() │ +│ │ │ PaymentsController.cs:89 │ +│ │ │ │ +│ │ ├──○ PaymentService.ProcessPayment() │ +│ │ │ │ PaymentService.cs:156 │ +│ │ │ │ │ +│ │ │ ├──● CryptoHelper.Verify() ← GUARD REMOVED │ +│ │ │ │ │ CryptoHelper.cs:42 [Changed: AuthFilter removed] │ +│ │ │ │ │ │ +│ │ │ │ └──◆ crypto.Verify(legacy) [VULNERABLE SINK] │ +│ │ │ │ pkg:crypto@1.2.3 │ +│ │ │ │ CVE-2024-1234 (CVSS 9.8) │ +│ │ +│ Legend: ○ Node ● Changed ◆ Sink ─ Call │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 6. 
POLICY INTEGRATION + +### 6.1 CI Gate Behavior + +```yaml +# Policy wiring for drift detection +smart_diff: + gates: + # Fail PR when new reachable paths to affected sinks + - condition: "delta_reachable > 0 AND vex_status IN ['affected', 'under_investigation']" + action: block + message: "New reachable paths to vulnerable sinks detected" + + # Warn when new paths to any sink + - condition: "delta_reachable > 0" + action: warn + message: "New reachable paths detected - review recommended" + + # Auto-mitigate when VEX confirms not_affected + - condition: "vex_status == 'not_affected' AND vex_justification IN ['component_not_present', 'fix_applied']" + action: allow + auto_mitigate: true +``` + +### 6.2 Exit Codes + +| Code | Meaning | +|------|---------| +| 0 | Success, no material drift | +| 1 | Success, material drift found (info) | +| 2 | Success, hardening regression detected | +| 3 | Success, new KEV reachable | +| 10+ | Errors | + +--- + +## 7. SPRINT STRUCTURE + +### 7.1 Master Sprint: SPRINT_3600_0001_0001 + +**Topic**: Reachability Drift Detection +**Dependencies**: SPRINT_3500 (Smart-Diff) - COMPLETE + +### 7.2 Sub-Sprints + +| ID | Topic | Priority | Effort | Dependencies | +|----|-------|----------|--------|--------------| +| SPRINT_3600_0002_0001 | Call Graph Infrastructure | P0 | Large | Master | +| SPRINT_3600_0003_0001 | Drift Detection Engine | P0 | Medium | 3600.2 | +| SPRINT_3600_0004_0001 | UI and Evidence Chain | P1 | Medium | 3600.3 | + +--- + +## 8. 
REFERENCES + +- `docs/product-advisories/14-Dec-2025 - Smart-Diff Technical Reference.md` +- `docs/product-advisories/14-Dec-2025 - Reachability Analysis Technical Reference.md` +- `docs/implplan/SPRINT_3500_0001_0001_smart_diff_master.md` +- `docs/reachability/lattice.md` +- `bench/reachability-benchmark/README.md` diff --git a/docs/product-advisories/unprocessed/15-Dec-2025 - Modeling StellaRouter Performance Curves.md b/docs/product-advisories/archived/15-Dec-2025 - Modeling StellaRouter Performance Curves.md similarity index 100% rename from docs/product-advisories/unprocessed/15-Dec-2025 - Modeling StellaRouter Performance Curves.md rename to docs/product-advisories/archived/15-Dec-2025 - Modeling StellaRouter Performance Curves.md diff --git a/docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof‑Linked UX in Security Workflows.md b/docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof‑Linked UX in Security Workflows.md index a97820be..a8e97808 100644 --- a/docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof‑Linked UX in Security Workflows.md +++ b/docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof‑Linked UX in Security Workflows.md @@ -1,1683 +1,140 @@ -Here’s a compact, ready‑to‑use playbook to **measure and plot performance envelopes for an HTTP → Valkey → Worker hop under variable concurrency**, so you can tune autoscaling and predict user‑visible spikes. +# Reimagining Proof-Linked UX in Security Workflows + +**Date**: 2025-12-16 +**Status**: PROCESSED +**Last Updated**: 2025-12-17 --- -## What we’re measuring (plain English) +## Overview -* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job. -* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round‑trip. -* **Worker service time:** time to pick up, process, and ack. -* **Queueing delay:** time spent waiting in the queue (arrival → start of worker). 
+This advisory introduces a **"Narrative-First" Triage UX** paradigm for Stella Ops, designed to dramatically reduce time-to-evidence and provide cryptographically verifiable proof chains for every security decision. -These four add up to the “hop latency” users feel when the system is under load. +### Core Innovation + +**Competitor pattern (Lists-First):** +- Big table of CVEs → filters → click into details → hunt for reachability/SBOM/VEX links scattered across tabs. +- Weak proof chain; noisy; slow "time-to-evidence". + +**Stella pattern (Narrative-First):** +- Case Header answers "Can I ship?" above the fold +- Single Evidence tab with proof-linked artifacts +- Quiet-by-default noise controls with reversible, signed decisions +- Smart-Diff history explaining meaningful risk changes + +### Key Deliverables + +This advisory has been split into formal documentation: + +| Document | Location | Purpose | +|----------|----------|---------| +| **Triage UX Guide** | `docs/ux/TRIAGE_UX_GUIDE.md` | Complete UX specification | +| **UI Reducer Spec** | `docs/ux/TRIAGE_UI_REDUCER_SPEC.md` | Angular 17 state machine | +| **API Contract v1** | `docs/api/triage.contract.v1.md` | REST endpoint specifications | +| **Database Schema** | `docs/db/triage_schema.sql` | PostgreSQL tables and views | --- -## Minimal tracing you can add today +## Key Concepts -Emit these IDs/headers end‑to‑end: +### 1. Lanes (Visibility Buckets) -* `x-stella-corr-id` (uuid) -* `x-stella-enq-ts` (gateway enqueue ts, ns) -* `x-stella-claim-ts` (worker claim ts, ns) -* `x-stella-done-ts` (worker done ts, ns) +| Lane | Description | +|------|-------------| +| ACTIVE | Actionable findings requiring attention | +| BLOCKED | Findings blocking deployment | +| NEEDS_EXCEPTION | Findings requiring exception approval | +| MUTED_REACH | Auto-muted due to non-reachability | +| MUTED_VEX | Auto-muted due to VEX not_affected status | +| COMPENSATED | Muted due to compensating controls | -From these, compute: +### 2. 
Verdicts -* `queue_delay = claim_ts - enq_ts` -* `service_time = done_ts - claim_ts` -* `http_ttfs = gateway_first_byte_ts - http_request_start_ts` -* `hop_latency = done_ts - enq_ts` (or return‑path if synchronous) +- **SHIP**: Safe to deploy +- **BLOCK**: Deployment blocked by policy +- **EXCEPTION**: Needs exception approval -Clock‑sync tip: use monotonic clocks in code and convert to ns; don’t mix wall‑clock. +### 3. Evidence Types + +- SBOM_SLICE +- VEX_DOC +- PROVENANCE +- CALLSTACK_SLICE +- REACHABILITY_PROOF +- REPLAY_MANIFEST +- POLICY +- SCAN_LOG + +### 4. Decision Types + +All decisions are DSSE-signed and reversible: +- MUTE_REACH +- MUTE_VEX +- ACK +- EXCEPTION --- -## Valkey commands (safe, BSD Valkey) +## Performance Targets -Use **Valkey Streams + Consumer Groups** for fairness and metrics: +| Metric | Target | +|--------|--------| +| Time-to-Evidence (TTFS) | P95 ≤ 30s | +| Mute Correctness | < 3% reversal rate | +| Audit Coverage | > 98% complete bundles | -* Enqueue: `XADD jobs * corr-id enq-ts payload <...>` -* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >` -* Ack: `XACK jobs workers ` +--- -Add a small Lua for timestamping at enqueue (atomic): +## Sprint Plan -```lua --- KEYS[1]=stream --- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload -return redis.call('XADD', KEYS[1], '*', - 'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3]) +See `docs/implplan/` for detailed sprint files: + +1. **SPRINT_3700_0001_0001**: Database Schema & Migrations +2. **SPRINT_3701_0001_0001**: Triage API Endpoints +3. **SPRINT_3702_0001_0001**: Decision Service with DSSE Signing +4. **SPRINT_0360_0001_0001**: Angular State Management +5. **SPRINT_0361_0001_0001**: UI Components +6. 
**SPRINT_0362_0001_0001**: Evidence Bundle Export + +--- + +## Integration Points + +### Services Involved + +- `scanner.webservice`: Risk evaluation, evidence storage, API +- `concelier`: Vuln feed aggregation (preserve prune source) +- `excititor`: VEX merge (preserve prune source) +- `notify.webservice`: Event emission (first_signal, risk_changed, gate_blocked) +- `scheduler.webservice`: Re-evaluation triggers + +### Data Flow + +``` +concelier (feeds) ─┬─► scanner.webservice ─► triage API ─► UI + │ ▲ +excititor (VEX) ───┘ │ + │ +scheduler (re-eval) ───────┘ ``` --- -## Load shapes to test (find the envelope) +## Related Documentation -1. **Open‑loop (arrival‑rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset. -2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time. -3. **Step‑up/down:** double every 2 min until SLO breach; then halve down. -4. **Long tail soak:** run at 70–80% of max for 1h; watch p95‑p99.9 drift. - -Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**. 
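The per-step summaries called for above (p50/p90/p95/p99 plus throughput and error rate per plateau) can be computed with a small, dependency-free helper. A minimal sketch — the nearest-rank percentile method and the sample numbers are illustrative choices, not part of any harness in this repo:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a sorted copy of the samples (p in [0, 100])."""
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

def step_summary(hop_ms, duration_s, errors):
    """Summarize one load plateau: hop-latency percentiles, achieved rps, error rate."""
    total = len(hop_ms) + errors
    return {
        "p50": percentile(hop_ms, 50),
        "p90": percentile(hop_ms, 90),
        "p95": percentile(hop_ms, 95),
        "p99": percentile(hop_ms, 99),
        "achieved_rps": total / duration_s,
        "error_rate": errors / total,
    }

# Illustrative plateau: 1000 successful hops (1..1000 ms) over 120 s, 10 failures
summary = step_summary(list(range(1, 1001)), 120.0, 10)
```

The same function is applied once per plateau, so each step of a ramp yields one comparable row.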
+- `docs/product-advisories/14-Dec-2025 - Proof and Evidence Chain Technical Reference.md` +- `docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md` +- `docs/15_UI_GUIDE.md` +- `docs/modules/ui/architecture.md` --- -## k6 script (HTTP client pressure) +## Archive Note -```javascript -// save as hop-test.js -import http from 'k6/http'; -import { check, sleep } from 'k6'; - -export let options = { - scenarios: { - step_load: { - executor: 'ramping-arrival-rate', - startRate: 20, timeUnit: '1s', - preAllocatedVUs: 200, maxVUs: 5000, - stages: [ - { target: 50, duration: '1m' }, - { target: 100, duration: '1m' }, - { target: 200, duration: '1m' }, - { target: 400, duration: '1m' }, - { target: 800, duration: '1m' }, - ], - }, - }, - thresholds: { - 'http_req_failed': ['rate<0.01'], - 'http_req_duration{phase:hop}': ['p(95)<500'], - }, -}; - -export default function () { - const corr = crypto.randomUUID(); - const res = http.post( - __ENV.GW_URL, - JSON.stringify({ data: 'ping', corr }), - { - headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr }, - tags: { phase: 'hop' }, - } - ); - check(res, { 'status 2xx/202': r => r.status === 200 || r.status === 202 }); - sleep(0.01); -} -``` - -Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js` +The original content of this advisory was incorrectly populated with performance testing infrastructure content. 
That content has been preserved at: +`docs/dev/performance-testing-playbook.md` --- -## Worker hooks (.NET 10 sketch) - -```csharp -// At claim -var now = Stopwatch.GetTimestamp(); // monotonic -var claimNs = now.ToNanoseconds(); -log.AddTag("x-stella-claim-ts", claimNs); - -// After processing -var doneNs = Stopwatch.GetTimestamp().ToNanoseconds(); -log.AddTag("x-stella-done-ts", doneNs); -// Include corr-id and stream entry id in logs/metrics -``` - -Helper: - -```csharp -public static class MonoTime { - static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency; - public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick); -} -``` - ---- - -## Prometheus metrics to expose - -* `valkey_enqueue_ns` (histogram) -* `valkey_claim_block_ms` (gauge) -* `worker_service_ns` (histogram, labels: worker_type, route) -* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`) -* `enqueue_rate`, `dequeue_rate` (counters) - -Example recording rules: - -```yaml -- record: hop:queue_delay_p95 - expr: histogram_quantile(0.95, sum(rate(valkey_enqueue_ns_bucket[1m])) by (le)) -- record: hop:service_time_p95 - expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le)) -- record: hop:latency_budget_p95 - expr: hop:queue_delay_p95 + hop:service_time_p95 -``` - ---- - -## Autoscaling signals (HPA/KEDA friendly) - -* **Primary:** queue depth & its derivative (d/dt). -* **Secondary:** p95 `queue_delay` and worker CPU. -* **Safety:** max in‑flight per worker; backpressure HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8`. - ---- - -## Plot the “envelope” (what you’ll look at) - -* X‑axis: **offered load** (req/s). -* Y‑axis: **p95 hop latency** (ms). -* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises). -* Add secondary panel: **queue depth** vs load. - -If you want, I can generate a ready‑made notebook that ingests your logs/metrics CSV and outputs these plots. 
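The "capacity knee" on such an envelope can also be estimated mechanically rather than by eye. A hedged sketch — the 1.5× jump ratio and the plateau numbers are illustrative, not calibrated values:

```python
def find_knee(steps, jump_ratio=1.5):
    """steps: list of (offered_rps, p95_ms) plateaus in increasing load order.
    Returns the offered rps of the first step whose p95 exceeds the previous
    step's p95 by more than jump_ratio, or None if no knee is visible."""
    for (_, prev_p95), (rps, p95) in zip(steps, steps[1:]):
        if prev_p95 > 0 and p95 / prev_p95 > jump_ratio:
            return rps
    return None

# Illustrative plateaus: p95 stays flat, then rises sharply at 400 rps
envelope = [(50, 40), (100, 42), (200, 48), (400, 180), (800, 900)]
```

Marking the knee this way keeps regression comparisons between runs reproducible.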
-Below is a **set of implementation guidelines** your agents can follow to build a repeatable performance test system for the **HTTP → Valkey → Worker** pipeline. It’s written as a “spec + runbook” with clear MUST/SHOULD requirements and concrete scenario definitions. - ---- - -# Performance Test Guidelines - -## HTTP → Valkey → Worker pipeline - -## 1) Objectives and scope - -### Primary objectives - -Your performance tests MUST answer these questions with evidence: - -1. **Capacity knee**: At what offered load does **queue delay** start growing sharply? -2. **User-impact envelope**: What are p50/p95/p99 **hop latency** curves vs offered load? -3. **Decomposition**: How much of hop latency is: - - * gateway enqueue time - * Valkey enqueue/claim RTT - * queue wait time - * worker service time -4. **Scaling behavior**: How do these change with worker replica counts (N workers)? -5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)? - -### Non-goals (explicitly out of scope unless you add them later) - -* Micro-optimizing single function runtime -* Synthetic “max QPS” records without a representative payload -* Tests that don’t collect segment metrics (end-to-end only) for anything beyond basic smoke - ---- - -## 2) Definitions and required metrics - -### Required latency definitions (standardize these names) - -Agents MUST compute and report these per request/job: - -* **`t_http_accept`**: time from client send → gateway accepts request -* **`t_enqueue`**: time spent in gateway to enqueue into Valkey (server-side) -* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s) -* **`t_queue_delay`**: `claim_ts - enq_ts` -* **`t_service`**: `done_ts - claim_ts` -* **`t_hop`**: `done_ts - enq_ts` (this is the “true pipeline hop” latency) -* Optional but recommended: - - * **`t_ack`**: time to ack completion (Valkey ack RTT) - * **`t_http_response`**: request start → gateway response sent (TTFB/TTFS) - -### 
Required percentiles and aggregations - -Per scenario step (e.g., each offered load plateau), agents MUST output: - -* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue` -* Throughput: offered rps and achieved rps -* Error rate: HTTP failures, enqueue failures, worker failures -* Queue depth and backlog drain time - -### Required system-level telemetry (minimum) - -Agents MUST collect these time series during tests: - -* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators -* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count -* **Gateway**: CPU/mem, request rate, response codes, request duration histogram - ---- - -## 3) Environment and test hygiene requirements - -### Environment requirements - -Agents SHOULD run tests in an environment that matches production in: - -* container CPU/memory limits -* number of nodes, network topology -* Valkey topology (single, cluster, sentinel, etc.) -* worker replica autoscaling rules (or deliberately disabled) - -If exact parity isn’t possible, agents MUST record all known differences in the report. - -### Test hygiene (non-negotiable) - -Agents MUST: - -1. **Start from empty queues** (no backlog). -2. **Disable client retries** (or explicitly run two variants: retries off / retries on). -3. **Warm up** before measuring (e.g., 60s warm-up minimum). -4. **Hold steady plateaus** long enough to stabilize (usually 2–5 minutes per step). -5. **Cool down** and verify backlog drains (queue depth returns to baseline). -6. Record exact versions/SHAs of gateway/worker and Valkey config. 
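Hygiene items 1 and 5 (start from empty queues; verify the backlog drains) are checkable mechanically from sampled queue depths. A minimal sketch — the sample data is illustrative:

```python
def drain_time(depth_samples, baseline=0):
    """depth_samples: list of (seconds_after_load_stop, queue_depth) samples.
    Returns the first timestamp at which depth is back at baseline, else None
    (None means the run must be flagged: the backlog never drained)."""
    for t, depth in depth_samples:
        if depth <= baseline:
            return t
    return None

# Illustrative post-run samples: backlog drains 15 s after load stops
samples = [(0, 1200), (5, 700), (10, 220), (15, 0), (20, 0)]
```

The same check at t = 0 (before the run) enforces the "start from empty queues" rule.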
-
-### Load generator hygiene
-
-Agents MUST ensure the load generator is not the bottleneck:
-
-* CPU < ~70% during test
-* no local socket exhaustion
-* enough VUs/connections
-* if needed, distributed load generation
-
----
-
-## 4) Instrumentation spec (agents implement this first)
-
-### Correlation and timestamps
-
-Agents MUST propagate an end-to-end correlation ID and timestamps.
-
-**Required fields**
-
-* `corr_id` (UUID)
-* `enq_ts_ns` (set at enqueue, monotonic or consistent clock)
-* `claim_ts_ns` (set by worker when job is claimed)
-* `done_ts_ns` (set by worker when job processing ends)
-
-**Where these live**
-
-* HTTP request header: `x-corr-id: <uuid>`
-* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type
-* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns`
-
-### Clock requirements
-
-Agents MUST use a consistent timing source:
-
-* Prefer monotonic timers for durations (Stopwatch / monotonic clock)
-* If timestamps cross machines, ensure they’re comparable:
-
-  * either rely on synchronized clocks (NTP) **and** monitor drift
-  * or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)
-
-**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.
-
-### Valkey queue semantics (recommended)
-
-Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability:
-
-* Enqueue: `XADD jobs * corr enq payload <...>`
-* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >`
-* Ack: `XACK jobs workers <id>`
-
-Agents MUST record stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag.
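The queue depth/lag requirement can be sketched against redis-py-style `XINFO GROUPS` replies. A hedged sketch — the helper name `group_lag` and the sample dicts are illustrative; the `lag` field is only reported by Redis/Valkey 7+, so older servers need the `pending` fallback:

```python
def group_lag(groups):
    """groups: list of dicts shaped like redis-py's xinfo_groups() reply.
    Prefer the server-reported 'lag' (entries not yet delivered to the group);
    fall back to 'pending' (delivered but unacked) when 'lag' is unavailable."""
    return {
        g["name"]: g["lag"] if g.get("lag") is not None else g["pending"]
        for g in groups
    }

# With a live connection this would be driven by:
#   r = redis.Redis(host="valkey", port=6379, decode_responses=True)
#   depth = r.xlen("jobs")
#   lag = group_lag(r.xinfo_groups("jobs"))
groups = [{"name": "workers", "pending": 7, "lag": 42}]
```

Exporting both numbers (stream length and per-group lag) distinguishes "not yet claimed" from "claimed but not acked".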
- -### Metrics exposure - -Agents MUST publish Prometheus (or equivalent) histograms: - -* `gateway_enqueue_seconds` (or ns) histogram -* `valkey_enqueue_rtt_seconds` histogram -* `worker_service_seconds` histogram -* `queue_delay_seconds` histogram (derived from timestamps; can be computed in worker or offline) -* `hop_latency_seconds` histogram - ---- - -## 5) Workload modeling and test data - -Agents MUST define a workload model before running capacity tests: - -1. **Endpoint(s)**: list exact gateway routes under test -2. **Payload types**: small/typical/large -3. **Mix**: e.g., 70/25/5 by payload size -4. **Idempotency rules**: ensure repeated jobs don’t corrupt state -5. **Data reset strategy**: how test data is cleaned or isolated per run - -Agents SHOULD test at least: - -* Typical payload (p50) -* Large payload (p95) -* Worst-case allowed payload (bounded by your API limits) - ---- - -## 6) Scenario suite your agents MUST implement - -Each scenario MUST be defined as code/config (not manual). 
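The payload mix defined in section 5 (e.g., 70/25/5 by payload size) can be driven by a seeded weighted sampler so every run draws the same sequence. A minimal sketch — `make_payload_picker` is an illustrative helper, not part of any existing harness:

```python
import random

def make_payload_picker(mix, seed=1234):
    """mix: dict of profile name -> probability (should sum to ~1.0).
    Returns a zero-argument function that picks a profile per request,
    deterministically for a given seed (reproducible runs)."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[n] for n in names]
    return lambda: rng.choices(names, weights=weights, k=1)[0]

# Illustrative 70/25/5 mix from the workload model
pick = make_payload_picker({"p50": 0.70, "p95": 0.25, "max": 0.05})
sample = [pick() for _ in range(1000)]
```

Stamping the seed into the run's artifacts keeps the mix reproducible across reruns.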
- -### Scenario A — Smoke (fast sanity) - -**Goal**: verify instrumentation + basic correctness -**Load**: low (e.g., 1–5 rps), 2 minutes -**Pass**: - -* 0 backlog after run -* error rate < 0.1% -* metrics present for all segments - -### Scenario B — Baseline (repeatable reference point) - -**Goal**: establish a stable baseline for regression tracking -**Load**: fixed moderate load (e.g., 30–50% of expected capacity), 10 minutes -**Pass**: - -* p95 `t_hop` within baseline ± tolerance (set after first runs) -* no upward drift in p95 across time (trend line ~flat) - -### Scenario C — Capacity ramp (open-loop) - -**Goal**: find the knee where queueing begins -**Method**: open-loop arrival-rate ramp with plateaus -Example stages (edit to fit your system): - -* 50 rps for 2m -* 100 rps for 2m -* 200 rps for 2m -* 400 rps for 2m -* … until SLO breach or errors spike - -**MUST**: - -* warm-up stage before first plateau -* record per-plateau summary - -**Stop conditions** (any triggers stop): - -* error rate > 1% -* queue depth grows without bound over an entire plateau -* p95 `t_hop` exceeds SLO for 2 consecutive plateaus - -### Scenario D — Stress (push past capacity) - -**Goal**: characterize failure mode and recovery -**Load**: 120–200% of knee load, 5–10 minutes -**Pass** (for resilience): - -* system does not crash permanently -* once load stops, backlog drains within target time (define it) - -### Scenario E — Burst / spike - -**Goal**: see how quickly queue grows and drains -**Load shape**: - -* baseline low load -* sudden burst (e.g., 10× for 10–30s) -* return to baseline - -**Report**: - -* peak queue depth -* time to drain to baseline -* p99 `t_hop` during burst - -### Scenario F — Soak (long-running) - -**Goal**: detect drift (leaks, fragmentation, GC patterns) -**Load**: 70–85% of knee, 60–180 minutes -**Pass**: - -* p95 does not trend upward beyond threshold -* memory remains bounded -* no rising error rate - -### Scenario G — Scaling curve (worker replica 
sweep) - -**Goal**: turn results into scaling rules -**Method**: - -* Repeat Scenario C with worker replicas = 1, 2, 4, 8… - **Deliverable**: -* plot of knee load vs worker count -* p95 `t_service` vs worker count (should remain similar; queue delay should drop) - ---- - -## 7) Execution protocol (runbook) - -Agents MUST run every scenario using the same disciplined flow: - -### Pre-run checklist - -* confirm system versions/SHAs -* confirm autoscaling mode: - - * **Off** for baseline capacity characterization - * **On** for validating autoscaling policies -* clear queues and consumer group pending entries -* restart or at least record “time since deploy” for services (cold vs warm) - -### During run - -* ensure load is truly open-loop when required (arrival-rate based) -* continuously record: - - * offered vs achieved rate - * queue depth - * CPU/mem for gateway/worker/Valkey - -### Post-run - -* stop load -* wait until backlog drains (or record that it doesn’t) -* export: - - * k6/runner raw output - * Prometheus time series snapshot - * sampled logs with corr_id fields -* generate a summary report automatically (no hand calculations) - ---- - -## 8) Analysis rules (how agents compute “the envelope”) - -Agents MUST generate at minimum two plots per run: - -1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis) - - * overlay p99 (and SLO line) -2. 
**Queue behavior**: offered load vs queue depth (or lag), plus drain time - -### How to identify the “knee” - -Agents SHOULD mark the knee as the first plateau where: - -* queue depth grows monotonically within the plateau, **or** -* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50–100%) - -### Convert results into scaling guidance - -Agents SHOULD compute: - -* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker) -* recommended replicas for offered load λ at target utilization U: - - * `workers_needed = ceil(λ * mean(t_service) / U)` - * choose U ~ 0.6–0.75 for headroom - -This should be reported alongside the measured envelope. - ---- - -## 9) Pass/fail criteria and regression gates - -Agents MUST define gates in configuration, not in someone’s head. - -Suggested gating structure: - -* **Smoke gate**: error rate < 0.1%, backlog drains -* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history) -* **Capacity gate**: knee load regression < 10% (optional but very valuable) -* **Soak gate**: p95 drift over time < 15% and no memory runaway - ---- - -## 10) Common pitfalls (agents must avoid) - -1. **Closed-loop tests used for capacity** - Closed-loop (“N concurrent users”) self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity. - -2. **Ignoring queue depth** - A system can look “healthy” in request latency while silently building backlog. - -3. **Measuring only gateway latency** - You must measure enqueue → claim → done to see the real hop. - -4. **Load generator bottleneck** - If the generator saturates, you’ll under-estimate capacity. - -5. **Retries enabled by default** - Retries can inflate load and hide root causes; run with retries off first. - -6. **Not controlling warm vs cold** - Cold caches vs warmed services produce different envelopes; record the condition. - ---- - -# Agent implementation checklist (deliverables) - -Assign these as concrete tasks to your agents. 
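The scaling-guidance formulas in section 8 reduce to a few lines of arithmetic. A sketch — the function names are illustrative, and the 0.7 default utilization follows the U ≈ 0.6–0.75 headroom recommendation above:

```python
import math

def capacity_per_worker(mean_service_s):
    """Approximate jobs/sec one worker can sustain: 1 / mean(t_service)."""
    return 1.0 / mean_service_s

def workers_needed(offered_rps, mean_service_s, target_utilization=0.7):
    """Replicas required to serve offered_rps at the chosen utilization headroom:
    ceil(lambda * mean(t_service) / U)."""
    return math.ceil(offered_rps * mean_service_s / target_utilization)

# Example: 20 ms mean service time at 200 rps offered, U = 0.7
replicas = workers_needed(200, 0.02)
```

Reporting this number next to the measured envelope turns a load test into an actionable scaling rule.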
- -## Agent 1 — Observability & tracing - -MUST deliver: - -* correlation id propagation gateway → Valkey → worker -* timestamps `enq/claim/done` -* Prometheus histograms for enqueue, service, hop -* queue depth metric (`XLEN` / `XINFO` lag) - -## Agent 2 — Load test harness - -MUST deliver: - -* test runner scripts (k6 or equivalent) for scenarios A–G -* test config file (YAML/JSON) controlling: - - * stages (rates/durations) - * payload mix - * headers (corr-id) -* reproducible seeds and version stamping - -## Agent 3 — Result collector and analyzer - -MUST deliver: - -* a pipeline that merges: - - * load generator output - * hop timing data (from logs or a completion stream) - * Prometheus snapshots -* automatic summary + plots: - - * latency envelope - * queue depth/drain -* CSV/JSON exports for long-term tracking - -## Agent 4 — Reporting and dashboards - -MUST deliver: - -* a standard report template that includes: - - * environment details - * scenario details - * key charts - * knee estimate - * scaling recommendation -* Grafana dashboard with the required panels - -## Agent 5 — CI / release integration - -SHOULD deliver: - -* PR-level smoke test (Scenario A) -* nightly baseline (Scenario B) -* weekly capacity sweep (Scenario C + scaling curve) - ---- - -## Template: scenario spec (agents can copy/paste) - -```yaml -test_run: - system_under_test: - gateway_sha: "" - worker_sha: "" - valkey_version: "" - environment: - cluster: "" - workers: 4 - autoscaling: "off" # off|on - workload: - endpoint: "/hop" - payload_profile: "p50" - mix: - p50: 0.7 - p95: 0.25 - max: 0.05 - scenario: - name: "capacity_ramp" - mode: "open_loop" - warmup_seconds: 60 - stages: - - rps: 50 - duration_seconds: 120 - - rps: 100 - duration_seconds: 120 - - rps: 200 - duration_seconds: 120 - - rps: 400 - duration_seconds: 120 - gates: - max_error_rate: 0.01 - slo_ms_p95_hop: 500 - backlog_must_drain_seconds: 300 - outputs: - artifacts_dir: "./artifacts//" -``` - ---- - -If you want, I 
can also provide a **single “golden” folder structure** (tests/ scripts/ dashboards/ analysis/) and a “definition of done” checklist that matches how your repo is organized—but the above is already sufficient for agents to start implementing immediately. -Below is a **sample / partial implementation** that gives **full functional coverage** of your performance-test requirements (instrumentation, correlation, timestamps, queue semantics, scenarios A–G, artifact export, and analysis). It is intentionally minimal and “swap-in-real-code” friendly. - -You can copy these files into a `perf/` folder in your repo, build, and run locally with Docker Compose. - ---- - -## 1) Suggested folder layout - -``` -perf/ - docker-compose.yml - prometheus/ - prometheus.yml - k6/ - lib.js - smoke.js - capacity_ramp.js - burst.js - soak.js - stress.js - scaling_curve.sh - tools/ - analyze.py - src/ - Perf.Gateway/ - Perf.Gateway.csproj - Program.cs - Metrics.cs - ValkeyStreams.cs - TimeNs.cs - Perf.Worker/ - Perf.Worker.csproj - Program.cs - WorkerService.cs - Metrics.cs - ValkeyStreams.cs - TimeNs.cs -``` - ---- - -## 2) Gateway sample (.NET 10, Minimal API) - -### `perf/src/Perf.Gateway/Perf.Gateway.csproj` - -```xml - - - net10.0 - enable - enable - - - - - - - -``` - -### `perf/src/Perf.Gateway/TimeNs.cs` - -```csharp -namespace Perf.Gateway; - -public static class TimeNs -{ - private static readonly long UnixEpochTicks = DateTime.UnixEpoch.Ticks; // 100ns units - - public static long UnixNowNs() - { - var ticks = DateTime.UtcNow.Ticks - UnixEpochTicks; // 100ns - return ticks * 100L; // ns - } -} -``` - -### `perf/src/Perf.Gateway/Metrics.cs` - -```csharp -using System.Collections.Concurrent; -using System.Globalization; -using System.Text; - -namespace Perf.Gateway; - -public sealed class Metrics -{ - private readonly ConcurrentDictionary _counters = new(); - - // Simple fixed-bucket histograms in seconds (Prometheus histogram format) - private readonly ConcurrentDictionary _h = 
new(); - - public void Inc(string name, long by = 1) => _counters.AddOrUpdate(name, by, (_, v) => v + by); - - public Histogram Hist(string name, double[] bucketsSeconds) => - _h.GetOrAdd(name, _ => new Histogram(name, bucketsSeconds)); - - public string ExportPrometheus() - { - var sb = new StringBuilder(16 * 1024); - - foreach (var (k, v) in _counters.OrderBy(kv => kv.Key)) - { - sb.Append("# TYPE ").Append(k).Append(" counter\n"); - sb.Append(k).Append(' ').Append(v.ToString(CultureInfo.InvariantCulture)).Append('\n'); - } - - foreach (var hist in _h.Values.OrderBy(h => h.Name)) - { - sb.Append(hist.Export()); - } - - return sb.ToString(); - } - - public sealed class Histogram - { - public string Name { get; } - private readonly double[] _buckets; // sorted - private readonly long[] _bucketCounts; // cumulative exposed later - private long _count; - private double _sum; - - private readonly object _lock = new(); - - public Histogram(string name, double[] bucketsSeconds) - { - Name = name; - _buckets = bucketsSeconds.OrderBy(x => x).ToArray(); - _bucketCounts = new long[_buckets.Length]; - } - - public void Observe(double seconds) - { - lock (_lock) - { - _count++; - _sum += seconds; - - for (int i = 0; i < _buckets.Length; i++) - { - if (seconds <= _buckets[i]) _bucketCounts[i]++; - } - } - } - - public string Export() - { - // Prometheus hist buckets are cumulative; we already maintain that. 
- var sb = new StringBuilder(2048); - sb.Append("# TYPE ").Append(Name).Append(" histogram\n"); - - lock (_lock) - { - for (int i = 0; i < _buckets.Length; i++) - { - sb.Append(Name).Append("_bucket{le=\"") - .Append(_buckets[i].ToString("0.################", CultureInfo.InvariantCulture)) - .Append("\"} ") - .Append(_bucketCounts[i].ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - } - - sb.Append(Name).Append("_bucket{le=\"+Inf\"} ") - .Append(_count.ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - - sb.Append(Name).Append("_sum ") - .Append(_sum.ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - - sb.Append(Name).Append("_count ") - .Append(_count.ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - } - - return sb.ToString(); - } - } -} -``` - -### `perf/src/Perf.Gateway/ValkeyStreams.cs` - -```csharp -using StackExchange.Redis; - -namespace Perf.Gateway; - -public sealed class ValkeyStreams -{ - private readonly IDatabase _db; - public ValkeyStreams(IConnectionMultiplexer mux) => _db = mux.GetDatabase(); - - public async Task EnsureConsumerGroupAsync(string stream, string group) - { - try - { - // XGROUP CREATE $ MKSTREAM - await _db.ExecuteAsync("XGROUP", "CREATE", stream, group, "$", "MKSTREAM"); - } - catch (RedisServerException ex) when (ex.Message.Contains("BUSYGROUP", StringComparison.OrdinalIgnoreCase)) - { - // ok - } - } - - public async Task XAddAsync(string stream, NameValueEntry[] fields) - { - // XADD stream * field value field value ... - var args = new List(2 + fields.Length * 2) { stream, "*" }; - foreach (var f in fields) { args.Add(f.Name); args.Add(f.Value); } - return await _db.ExecuteAsync("XADD", args.ToArray()); - } -} -``` - -### `perf/src/Perf.Gateway/Program.cs` - -```csharp -using Perf.Gateway; -using StackExchange.Redis; -using System.Diagnostics; - -var builder = WebApplication.CreateBuilder(args); - -var valkey = builder.Configuration["VALKEY_ENDPOINT"] ?? 
"valkey:6379"; -builder.Services.AddSingleton(_ => ConnectionMultiplexer.Connect(valkey)); -builder.Services.AddSingleton(); -builder.Services.AddSingleton(); - -var app = builder.Build(); - -var metrics = app.Services.GetRequiredService(); -var streams = app.Services.GetRequiredService(); - -const string JobsStream = "stella:perf:jobs"; -const string DoneStream = "stella:perf:done"; -const string Group = "workers"; - -await streams.EnsureConsumerGroupAsync(JobsStream, Group); - -var allowTestControl = (app.Configuration["ALLOW_TEST_CONTROL"] ?? "1") == "1"; -var runs = new Dictionary(StringComparer.Ordinal); // run_id -> start_ns - -if (allowTestControl) -{ - app.MapPost("/test/start", () => - { - var runId = Guid.NewGuid().ToString("N"); - var startNs = TimeNs.UnixNowNs(); - lock (runs) runs[runId] = startNs; - - metrics.Inc("perf_test_start_total"); - return Results.Ok(new { run_id = runId, start_ns = startNs, jobs_stream = JobsStream, done_stream = DoneStream }); - }); - - app.MapPost("/test/end/{runId}", (string runId) => - { - lock (runs) runs.Remove(runId); - metrics.Inc("perf_test_end_total"); - return Results.Ok(new { run_id = runId }); - }); -} - -app.MapGet("/metrics", () => Results.Text(metrics.ExportPrometheus(), "text/plain; version=0.0.4")); - -app.MapPost("/hop", async (HttpRequest req) => -{ - // Correlation / run id - var corr = req.Headers["x-stella-corr-id"].FirstOrDefault() ?? Guid.NewGuid().ToString(); - var runId = req.Headers["x-stella-run-id"].FirstOrDefault() ?? 
"no-run"; - - // Enqueue timestamp (UTC-derived ns) - var enqNs = TimeNs.UnixNowNs(); - - // Read raw body (payload) - keep it simple for perf harness - string payload; - using (var sr = new StreamReader(req.Body)) - payload = await sr.ReadToEndAsync(); - - var sw = Stopwatch.GetTimestamp(); - - // Valkey enqueue - var valkeySw = Stopwatch.GetTimestamp(); - var entryId = await streams.XAddAsync(JobsStream, new[] - { - new NameValueEntry("corr", corr), - new NameValueEntry("run", runId), - new NameValueEntry("enq_ns", enqNs), - new NameValueEntry("payload", payload), - }); - var valkeyRttSec = (Stopwatch.GetTimestamp() - valkeySw) / (double)Stopwatch.Frequency; - - var enqueueSec = (Stopwatch.GetTimestamp() - sw) / (double)Stopwatch.Frequency; - - metrics.Inc("hop_requests_total"); - metrics.Hist("gateway_enqueue_seconds", new[] { .001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2 }).Observe(enqueueSec); - metrics.Hist("valkey_enqueue_rtt_seconds", new[] { .0005, .001, .002, .005, .01, .02, .05, .1, .2 }).Observe(valkeyRttSec); - - return Results.Accepted(value: new { corr, run_id = runId, enq_ns = enqNs, entry_id = entryId.ToString() }); -}); - -app.Run("http://0.0.0.0:8080"); -``` - ---- - -## 3) Worker sample (.NET 10 hosted service + metrics) - -### `perf/src/Perf.Worker/Perf.Worker.csproj` - -```xml - - - net10.0 - enable - enable - - - - - - -``` - -### `perf/src/Perf.Worker/TimeNs.cs` - -```csharp -namespace Perf.Worker; - -public static class TimeNs -{ - private static readonly long UnixEpochTicks = DateTime.UnixEpoch.Ticks; - public static long UnixNowNs() => (DateTime.UtcNow.Ticks - UnixEpochTicks) * 100L; -} -``` - -### `perf/src/Perf.Worker/Metrics.cs` - -```csharp -// Same as gateway Metrics.cs (copy/paste). Keep identical for consistency. 
-using System.Collections.Concurrent; -using System.Globalization; -using System.Text; - -namespace Perf.Worker; - -public sealed class Metrics -{ - private readonly ConcurrentDictionary _counters = new(); - private readonly ConcurrentDictionary _h = new(); - - public void Inc(string name, long by = 1) => _counters.AddOrUpdate(name, by, (_, v) => v + by); - public Histogram Hist(string name, double[] bucketsSeconds) => - _h.GetOrAdd(name, _ => new Histogram(name, bucketsSeconds)); - - public string ExportPrometheus() - { - var sb = new StringBuilder(16 * 1024); - - foreach (var (k, v) in _counters.OrderBy(kv => kv.Key)) - { - sb.Append("# TYPE ").Append(k).Append(" counter\n"); - sb.Append(k).Append(' ').Append(v.ToString(CultureInfo.InvariantCulture)).Append('\n'); - } - - foreach (var hist in _h.Values.OrderBy(h => h.Name)) - sb.Append(hist.Export()); - - return sb.ToString(); - } - - public sealed class Histogram - { - public string Name { get; } - private readonly double[] _buckets; - private readonly long[] _bucketCounts; - private long _count; - private double _sum; - private readonly object _lock = new(); - - public Histogram(string name, double[] bucketsSeconds) - { - Name = name; - _buckets = bucketsSeconds.OrderBy(x => x).ToArray(); - _bucketCounts = new long[_buckets.Length]; - } - - public void Observe(double seconds) - { - lock (_lock) - { - _count++; - _sum += seconds; - for (int i = 0; i < _buckets.Length; i++) - if (seconds <= _buckets[i]) _bucketCounts[i]++; - } - } - - public string Export() - { - var sb = new StringBuilder(2048); - sb.Append("# TYPE ").Append(Name).Append(" histogram\n"); - lock (_lock) - { - for (int i = 0; i < _buckets.Length; i++) - { - sb.Append(Name).Append("_bucket{le=\"") - .Append(_buckets[i].ToString("0.################", CultureInfo.InvariantCulture)) - .Append("\"} ") - .Append(_bucketCounts[i].ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - } - - sb.Append(Name).Append("_bucket{le=\"+Inf\"} ") - 
.Append(_count.ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - - sb.Append(Name).Append("_sum ") - .Append(_sum.ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - - sb.Append(Name).Append("_count ") - .Append(_count.ToString(CultureInfo.InvariantCulture)) - .Append('\n'); - } - return sb.ToString(); - } - } -} -``` - -### `perf/src/Perf.Worker/ValkeyStreams.cs` - -```csharp -using StackExchange.Redis; - -namespace Perf.Worker; - -public sealed class ValkeyStreams -{ - private readonly IDatabase _db; - public ValkeyStreams(IConnectionMultiplexer mux) => _db = mux.GetDatabase(); - - public async Task EnsureConsumerGroupAsync(string stream, string group) - { - try - { - await _db.ExecuteAsync("XGROUP", "CREATE", stream, group, "$", "MKSTREAM"); - } - catch (RedisServerException ex) when (ex.Message.Contains("BUSYGROUP", StringComparison.OrdinalIgnoreCase)) { } - } - - public async Task XReadGroupAsync(string group, string consumer, string stream, string id, int count, int blockMs) - => await _db.ExecuteAsync("XREADGROUP", "GROUP", group, consumer, "COUNT", count, "BLOCK", blockMs, "STREAMS", stream, id); - - public async Task XAckAsync(string stream, string group, RedisValue id) - => await _db.ExecuteAsync("XACK", stream, group, id); - - public async Task XAddAsync(string stream, NameValueEntry[] fields) - { - var args = new List(2 + fields.Length * 2) { stream, "*" }; - foreach (var f in fields) { args.Add(f.Name); args.Add(f.Value); } - return await _db.ExecuteAsync("XADD", args.ToArray()); - } -} -``` - -### `perf/src/Perf.Worker/WorkerService.cs` - -```csharp -using StackExchange.Redis; -using System.Diagnostics; - -namespace Perf.Worker; - -public sealed class WorkerService : BackgroundService -{ - private readonly ValkeyStreams _streams; - private readonly Metrics _metrics; - private readonly ILogger _log; - - private const string JobsStream = "stella:perf:jobs"; - private const string DoneStream = "stella:perf:done"; - private const string 
Group = "workers"; - - private readonly string _consumer; - - public WorkerService(ValkeyStreams streams, Metrics metrics, ILogger log) - { - _streams = streams; - _metrics = metrics; - _log = log; - _consumer = Environment.GetEnvironmentVariable("WORKER_CONSUMER") ?? $"w-{Environment.MachineName}-{Guid.NewGuid():N}"; - } - - protected override async Task ExecuteAsync(CancellationToken stoppingToken) - { - await _streams.EnsureConsumerGroupAsync(JobsStream, Group); - - var serviceBuckets = new[] { .001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5 }; - var queueBuckets = new[] { .001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10, 30 }; - var hopBuckets = new[] { .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10, 30 }; - - while (!stoppingToken.IsCancellationRequested) - { - RedisResult res; - try - { - res = await _streams.XReadGroupAsync(Group, _consumer, JobsStream, ">", count: 1, blockMs: 1000); - } - catch (Exception ex) - { - _metrics.Inc("worker_xread_errors_total"); - _log.LogWarning(ex, "XREADGROUP failed"); - await Task.Delay(250, stoppingToken); - continue; - } - - if (res.IsNull) continue; - - // Parse XREADGROUP result (array -> stream -> entries) - // Expected shape: [[stream, [[id, [field, value, field, value...]], ...]]] - var outer = (RedisResult[])res!; - foreach (var streamBlock in outer) - { - var sb = (RedisResult[])streamBlock!; - var entries = (RedisResult[])sb[1]!; - - foreach (var entry in entries) - { - var e = (RedisResult[])entry!; - var entryId = (RedisValue)e[0]!; - var fields = (RedisResult[])e[1]!; - - string corr = "", run = "no-run"; - long enqNs = 0; - - for (int i = 0; i < fields.Length; i += 2) - { - var key = (string)fields[i]!; - var val = fields[i + 1].ToString(); - if (key == "corr") corr = val; - else if (key == "run") run = val; - else if (key == "enq_ns") _ = long.TryParse(val, out enqNs); - } - - var claimNs = TimeNs.UnixNowNs(); - - var sw = Stopwatch.GetTimestamp(); - - // Placeholder "service work" – replace 
with real processing - // Keep it deterministic-ish; use env var to model different service times. - var workMs = int.TryParse(Environment.GetEnvironmentVariable("WORK_MS"), out var ms) ? ms : 5; - await Task.Delay(workMs, stoppingToken); - - var doneNs = TimeNs.UnixNowNs(); - var serviceSec = (Stopwatch.GetTimestamp() - sw) / (double)Stopwatch.Frequency; - - var queueDelaySec = enqNs > 0 ? (claimNs - enqNs) / 1_000_000_000d : double.NaN; - var hopSec = enqNs > 0 ? (doneNs - enqNs) / 1_000_000_000d : double.NaN; - - // Ack then publish "done" record for offline analysis - await _streams.XAckAsync(JobsStream, Group, entryId); - - await _streams.XAddAsync(DoneStream, new[] - { - new NameValueEntry("run", run), - new NameValueEntry("corr", corr), - new NameValueEntry("entry", entryId), - new NameValueEntry("enq_ns", enqNs), - new NameValueEntry("claim_ns", claimNs), - new NameValueEntry("done_ns", doneNs), - new NameValueEntry("work_ms", workMs), - }); - - _metrics.Inc("worker_jobs_total"); - _metrics.Hist("worker_service_seconds", serviceBuckets).Observe(serviceSec); - - if (!double.IsNaN(queueDelaySec)) - _metrics.Hist("queue_delay_seconds", queueBuckets).Observe(queueDelaySec); - - if (!double.IsNaN(hopSec)) - _metrics.Hist("hop_latency_seconds", hopBuckets).Observe(hopSec); - } - } - } - } -} -``` - -### `perf/src/Perf.Worker/Program.cs` - -```csharp -using Perf.Worker; -using StackExchange.Redis; - -var builder = Host.CreateApplicationBuilder(args); - -var valkey = builder.Configuration["VALKEY_ENDPOINT"] ?? 
"valkey:6379"; -builder.Services.AddSingleton(_ => ConnectionMultiplexer.Connect(valkey)); -builder.Services.AddSingleton(); -builder.Services.AddSingleton(); -builder.Services.AddHostedService(); - -// Minimal metrics endpoint -builder.Services.AddSingleton(sp => -{ - return new SimpleMetricsServer( - sp.GetRequiredService(), - url: "http://0.0.0.0:8081/metrics" - ); -}); - -var host = builder.Build(); -await host.RunAsync(); - -// ---- minimal metrics server ---- -file sealed class SimpleMetricsServer : BackgroundService -{ - private readonly Metrics _metrics; - private readonly string _url; - - public SimpleMetricsServer(Metrics metrics, string url) { _metrics = metrics; _url = url; } - - protected override async Task ExecuteAsync(CancellationToken stoppingToken) - { - var builder = WebApplication.CreateBuilder(); - var app = builder.Build(); - app.MapGet("/metrics", () => Results.Text(_metrics.ExportPrometheus(), "text/plain; version=0.0.4")); - await app.RunAsync(_url, stoppingToken); - } -} -``` - ---- - -## 4) Docker Compose (Valkey + gateway + worker + Prometheus) - -### `perf/docker-compose.yml` - -```yaml -services: - valkey: - image: valkey/valkey:7.2 - ports: ["6379:6379"] - - gateway: - build: - context: ./src/Perf.Gateway - environment: - - VALKEY_ENDPOINT=valkey:6379 - - ALLOW_TEST_CONTROL=1 - ports: ["8080:8080"] - depends_on: [valkey] - - worker: - build: - context: ./src/Perf.Worker - environment: - - VALKEY_ENDPOINT=valkey:6379 - - WORK_MS=5 - ports: ["8081:8081"] - depends_on: [valkey] - - prometheus: - image: prom/prometheus:v2.55.0 - volumes: - - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro - ports: ["9090:9090"] - depends_on: [gateway, worker] -``` - -### `perf/prometheus/prometheus.yml` - -```yaml -global: - scrape_interval: 5s - -scrape_configs: - - job_name: gateway - static_configs: - - targets: ["gateway:8080"] - - - job_name: worker - static_configs: - - targets: ["worker:8081"] -``` - -Run: - -```bash -cd perf -docker 
compose up -d --build -``` - ---- - -## 5) k6 scenarios A–G (open-loop where required) - -### `perf/k6/lib.js` - -```javascript -import http from "k6/http"; - -export function startRun(baseUrl) { - const res = http.post(`${baseUrl}/test/start`, null, { tags: { phase: "control" } }); - if (res.status !== 200) throw new Error(`startRun failed: ${res.status} ${res.body}`); - return res.json(); -} - -export function hop(baseUrl, runId) { - const corr = crypto.randomUUID(); - const payload = JSON.stringify({ corr, data: "ping" }); - - return http.post( - `${baseUrl}/hop`, - payload, - { - headers: { - "content-type": "application/json", - "x-stella-run-id": runId, - "x-stella-corr-id": corr - }, - tags: { phase: "hop" } - } - ); -} -``` - -### Scenario A: Smoke — `perf/k6/smoke.js` - -```javascript -import { check, sleep } from "k6"; -import { startRun, hop } from "./lib.js"; - -export const options = { - scenarios: { - smoke: { - executor: "constant-arrival-rate", - rate: 2, - timeUnit: "1s", - duration: "2m", - preAllocatedVUs: 20, - maxVUs: 200 - } - }, - thresholds: { - http_req_failed: ["rate<0.001"] - } -}; - -export function setup() { - return startRun(__ENV.GW_URL); -} - -export default function (data) { - const res = hop(__ENV.GW_URL, data.run_id); - check(res, { "202 accepted": r => r.status === 202 }); - sleep(0.01); -} -``` - -### Scenario C: Capacity ramp (open-loop) — `perf/k6/capacity_ramp.js` - -```javascript -import { check } from "k6"; -import { startRun, hop } from "./lib.js"; - -export const options = { - scenarios: { - ramp: { - executor: "ramping-arrival-rate", - startRate: 50, - timeUnit: "1s", - preAllocatedVUs: 200, - maxVUs: 5000, - stages: [ - { target: 50, duration: "2m" }, - { target: 100, duration: "2m" }, - { target: 200, duration: "2m" }, - { target: 400, duration: "2m" }, - { target: 800, duration: "2m" } - ] - } - }, - thresholds: { - http_req_failed: ["rate<0.01"] - } -}; - -export function setup() { - return startRun(__ENV.GW_URL); -} 
- -export default function (data) { - const res = hop(__ENV.GW_URL, data.run_id); - check(res, { "202 accepted": r => r.status === 202 }); -} -``` - -### Scenario E: Burst — `perf/k6/burst.js` - -```javascript -import { check } from "k6"; -import { startRun, hop } from "./lib.js"; - -export const options = { - scenarios: { - burst: { - executor: "ramping-arrival-rate", - startRate: 20, - timeUnit: "1s", - preAllocatedVUs: 200, - maxVUs: 5000, - stages: [ - { target: 20, duration: "60s" }, - { target: 400, duration: "20s" }, - { target: 20, duration: "120s" } - ] - } - } -}; - -export function setup() { return startRun(__ENV.GW_URL); } - -export default function (data) { - const res = hop(__ENV.GW_URL, data.run_id); - check(res, { "202": r => r.status === 202 }); -} -``` - -### Scenario F: Soak — `perf/k6/soak.js` - -```javascript -import { check } from "k6"; -import { startRun, hop } from "./lib.js"; - -export const options = { - scenarios: { - soak: { - executor: "constant-arrival-rate", - rate: 200, - timeUnit: "1s", - duration: "60m", - preAllocatedVUs: 500, - maxVUs: 5000 - } - } -}; - -export function setup() { return startRun(__ENV.GW_URL); } - -export default function (data) { - const res = hop(__ENV.GW_URL, data.run_id); - check(res, { "202": r => r.status === 202 }); -} -``` - -### Scenario D: Stress — `perf/k6/stress.js` - -```javascript -import { check } from "k6"; -import { startRun, hop } from "./lib.js"; - -export const options = { - scenarios: { - stress: { - executor: "constant-arrival-rate", - rate: 1500, - timeUnit: "1s", - duration: "10m", - preAllocatedVUs: 2000, - maxVUs: 15000 - } - }, - thresholds: { - http_req_failed: ["rate<0.05"] - } -}; - -export function setup() { return startRun(__ENV.GW_URL); } - -export default function (data) { - const res = hop(__ENV.GW_URL, data.run_id); - check(res, { "202": r => r.status === 202 }); -} -``` - -### Scenario G: Scaling curve orchestration — `perf/k6/scaling_curve.sh` - -```bash -#!/usr/bin/env bash 
-set -euo pipefail - -GW_URL="${GW_URL:-http://localhost:8080}" - -for n in 1 2 4 8; do - echo "== Scaling workers to $n ==" - docker compose -f ../docker-compose.yml up -d --scale worker="$n" - - mkdir -p "../artifacts/scale-$n" - k6 run \ - -e GW_URL="$GW_URL" \ - --summary-export "../artifacts/scale-$n/k6-summary.json" \ - ./capacity_ramp.js -done -``` - -Run (examples): - -```bash -cd perf/k6 -GW_URL=http://localhost:8080 k6 run --summary-export ../artifacts/smoke-summary.json smoke.js -GW_URL=http://localhost:8080 k6 run --summary-export ../artifacts/ramp-summary.json capacity_ramp.js -``` - ---- - -## 6) Offline analysis tool (reads “done” stream by run_id) - -### `perf/tools/analyze.py` - -```python -import os, sys, json, math -from datetime import datetime, timezone - -import redis - -def pct(values, p): - if not values: - return None - values = sorted(values) - k = (len(values) - 1) * (p / 100.0) - f = math.floor(k); c = math.ceil(k) - if f == c: - return values[int(k)] - return values[f] * (c - k) + values[c] * (k - f) - -def main(): - valkey = os.getenv("VALKEY_ENDPOINT", "localhost:6379") - host, port = valkey.split(":") - r = redis.Redis(host=host, port=int(port), decode_responses=True) - - run_id = os.getenv("RUN_ID") - if not run_id: - print("Set RUN_ID env var (from /test/start response).", file=sys.stderr) - sys.exit(2) - - done_stream = os.getenv("DONE_STREAM", "stella:perf:done") - - # Read all entries (sample scale). For big runs use XREAD with cursor. 
- entries = r.xrange(done_stream, min='-', max='+', count=200000) - - hop_ms = [] - queue_ms = [] - service_ms = [] - - matched = 0 - for entry_id, fields in entries: - if fields.get("run") != run_id: - continue - matched += 1 - - enq_ns = int(fields.get("enq_ns", "0")) - claim_ns = int(fields.get("claim_ns", "0")) - done_ns = int(fields.get("done_ns", "0")) - - if enq_ns > 0 and claim_ns > 0: - queue_ms.append((claim_ns - enq_ns) / 1_000_000.0) - if claim_ns > 0 and done_ns > 0: - service_ms.append((done_ns - claim_ns) / 1_000_000.0) - if enq_ns > 0 and done_ns > 0: - hop_ms.append((done_ns - enq_ns) / 1_000_000.0) - - summary = { - "run_id": run_id, - "done_stream": done_stream, - "matched_jobs": matched, - "hop_ms": { - "p50": pct(hop_ms, 50), "p95": pct(hop_ms, 95), "p99": pct(hop_ms, 99) - }, - "queue_ms": { - "p50": pct(queue_ms, 50), "p95": pct(queue_ms, 95), "p99": pct(queue_ms, 99) - }, - "service_ms": { - "p50": pct(service_ms, 50), "p95": pct(service_ms, 95), "p99": pct(service_ms, 99) - }, - "generated_at": datetime.now(timezone.utc).isoformat() - } - - print(json.dumps(summary, indent=2)) - -if __name__ == "__main__": - main() -``` - -Run: - -```bash -pip install redis -RUN_ID= python perf/tools/analyze.py -``` - -This yields the **key percentiles** for `hop`, `queue_delay`, and `service` from the authoritative worker-side timestamps. - ---- - -## 7) What this sample already covers - -* **Correlation**: `x-stella-corr-id` end-to-end. -* **Run isolation**: `x-stella-run-id` created via `/test/start`, used to filter results. -* **Valkey Streams + consumer group**: fair claim semantics. -* **Required timestamps**: `enq_ns`, `claim_ns`, `done_ns`. 
-* **Metrics**: - - * `gateway_enqueue_seconds` histogram - * `valkey_enqueue_rtt_seconds` histogram - * `worker_service_seconds`, `queue_delay_seconds`, `hop_latency_seconds` histograms -* **Scenarios**: - - * A Smoke, C Capacity ramp, D Stress, E Burst, F Soak - * G Scaling curve via script (repeat ramp across worker counts) - ---- - -## 8) Immediate next hardening steps (still “small”) - -1. **Add queue depth / lag gauges**: in worker or gateway poll `XLEN stella:perf:jobs` and export as a gauge metric in Prometheus format. -2. **Drain-time measurement**: implement `/test/end/{runId}` that waits until “matched jobs stop increasing” + queue depth returns baseline, and records a final metric. -3. **Stage slicing** (per plateau stats): extend `analyze.py` to accept your k6 stage plan and compute p95 per stage window (based on start_ns). - -If you want, I can extend the sample with (1) queue-depth export and (2) per-plateau slicing in `analyze.py` without adding any new dependencies. +**Document Version**: 1.0 +**Target Platform**: .NET 10, PostgreSQL >= 16, Angular v17 diff --git a/docs/ux/TRIAGE_UI_REDUCER_SPEC.md b/docs/ux/TRIAGE_UI_REDUCER_SPEC.md new file mode 100644 index 00000000..e3166a2b --- /dev/null +++ b/docs/ux/TRIAGE_UI_REDUCER_SPEC.md @@ -0,0 +1,400 @@ +# Stella Ops Triage UI Reducer Spec (Pure State + Explicit Commands) + +## 0. Purpose + +Define a deterministic, testable UI state machine for the triage UI. +- State transitions are pure functions. +- Side effects are emitted as explicit Commands. +- Enables UI "replay" for debugging (aligns with Stella's deterministic ethos). + +Target stack: Angular 17 + TypeScript. + +## 1. Core Concepts + +- Action: user/system event (route change, button click, HTTP success). +- State: all data required to render triage surfaces. +- Command: side-effect request (HTTP, download, navigation). 
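
The Action / State / Command split above is essentially an Elm-style reducer loop. A minimal, framework-free sketch (toy names like `CounterState` are illustrative only, not part of the triage spec) shows both halves of the contract — pure transitions and the replay property:

```typescript
// Minimal Elm-style reducer: pure state transitions plus explicit commands.
// All names here (CounterState, Cmd, ...) are illustrative, not the triage API.
type CounterState = { count: number; log: string[] };

type CounterAction =
  | { type: "INC" }
  | { type: "SAVE" }
  | { type: "SAVE_OK" };

type Cmd =
  | { type: "NONE" }
  | { type: "HTTP_POST"; url: string; body: unknown; onSuccess: CounterAction };

function reduce(state: CounterState, action: CounterAction): { state: CounterState; cmd: Cmd } {
  switch (action.type) {
    case "INC":
      return { state: { ...state, count: state.count + 1 }, cmd: { type: "NONE" } };
    case "SAVE":
      // The reducer never performs I/O; it only *describes* the effect.
      return {
        state,
        cmd: { type: "HTTP_POST", url: "/api/save", body: state.count, onSuccess: { type: "SAVE_OK" } },
      };
    case "SAVE_OK":
      return { state: { ...state, log: [...state.log, "saved"] }, cmd: { type: "NONE" } };
  }
}

// Replay: because reduce is pure, re-running a recorded action log
// deterministically reconstructs the final state (the debugging property above).
function replay(initial: CounterState, actions: CounterAction[]): CounterState {
  return actions.reduce((s, a) => reduce(s, a).state, initial);
}
```

In the real UI an interpreter (effect layer) executes each emitted `Command` and dispatches the resulting `Action` back into `reduce`; recording the action log is what makes the deterministic replay described above possible.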
+ +Reducer signature: + +```ts +type ReduceResult = { state: TriageState; cmd: Command }; +function reduce(state: TriageState, action: Action): ReduceResult; +``` + +## 2. State Model + +```ts +export type Lane = + | "ACTIVE" + | "BLOCKED" + | "NEEDS_EXCEPTION" + | "MUTED_REACH" + | "MUTED_VEX" + | "COMPENSATED"; + +export type Verdict = "SHIP" | "BLOCK" | "EXCEPTION"; + +export interface MutedCounts { + reach: number; + vex: number; + compensated: number; +} + +export interface FindingRow { + id: string; // caseId == findingId + lane: Lane; + verdict: Verdict; + score: number; + reachable: "YES" | "NO" | "UNKNOWN"; + vex: "affected" | "not_affected" | "under_investigation" | "unknown"; + exploit: "YES" | "NO" | "UNKNOWN"; + asset: string; + updatedAt: string; // ISO +} + +export interface CaseHeader { + id: string; + verdict: Verdict; + lane: Lane; + score: number; + policyId: string; + policyVersion: string; + inputsHash: string; + why: string; // short narrative + chips: Array<{ key: string; label: string; value: string; evidenceIds?: string[] }>; +} + +export type EvidenceType = + | "SBOM_SLICE" + | "VEX_DOC" + | "PROVENANCE" + | "CALLSTACK_SLICE" + | "REACHABILITY_PROOF" + | "REPLAY_MANIFEST" + | "POLICY" + | "SCAN_LOG" + | "OTHER"; + +export interface EvidenceItem { + id: string; + type: EvidenceType; + title: string; + issuer?: string; + signed: boolean; + signedBy?: string; + contentHash: string; + createdAt: string; + previewUrl?: string; + rawUrl: string; +} + +export type DecisionKind = "MUTE_REACH" | "MUTE_VEX" | "ACK" | "EXCEPTION"; + +export interface DecisionItem { + id: string; + kind: DecisionKind; + reasonCode: string; + note?: string; + ttl?: string; + actor: { subject: string; display?: string }; + createdAt: string; + revokedAt?: string; + signatureRef?: string; +} + +export type SnapshotTrigger = + | "FEED_UPDATE" + | "VEX_UPDATE" + | "SBOM_UPDATE" + | "RUNTIME_TRACE" + | "POLICY_UPDATE" + | "DECISION" + | "RESCAN"; + +export interface 
SnapshotItem {
+  id: string;
+  trigger: SnapshotTrigger;
+  changedAt: string;
+  fromInputsHash: string;
+  toInputsHash: string;
+  summary: string;
+}
+
+export interface SmartDiff {
+  fromInputsHash: string;
+  toInputsHash: string;
+  inputsChanged: Array<{ key: string; before?: string; after?: string; evidenceIds?: string[] }>;
+  outputsChanged: Array<{ key: string; before?: string; after?: string; evidenceIds?: string[] }>;
+}
+
+export interface TriageState {
+  route: { page: "TABLE" | "CASE"; caseId?: string };
+  filters: {
+    showMuted: boolean;
+    lane?: Lane;
+    search?: string;
+    page: number;
+    pageSize: number;
+  };
+
+  table: {
+    loading: boolean;
+    rows: FindingRow[];
+    mutedCounts?: MutedCounts;
+    error?: string;
+    etag?: string;
+  };
+
+  caseView: {
+    loading: boolean;
+    header?: CaseHeader;
+    evidenceLoading: boolean;
+    evidence?: EvidenceItem[];
+    decisionsLoading: boolean;
+    decisions?: DecisionItem[];
+    snapshotsLoading: boolean;
+    snapshots?: SnapshotItem[];
+    diffLoading: boolean;
+    activeDiff?: SmartDiff;
+    error?: string;
+    etag?: string;
+  };
+
+  ui: {
+    decisionDrawerOpen: boolean;
+    diffPanelOpen: boolean;
+    toast?: { kind: "success" | "error" | "info"; message: string };
+  };
+}
+```
+
+## 3. Commands
+
+```ts
+export type Command =
+  | { type: "NONE" }
+  | { type: "HTTP_GET"; url: string; headers?: Record<string, string>; onSuccess: Action; onError: Action }
+  | { type: "HTTP_POST"; url: string; body: unknown; headers?: Record<string, string>; onSuccess: Action; onError: Action }
+  | { type: "HTTP_DELETE"; url: string; headers?: Record<string, string>; onSuccess: Action; onError: Action }
+  | { type: "DOWNLOAD"; url: string }
+  | { type: "NAVIGATE"; route: TriageState["route"] };
+```
+
+## 4. 
Actions + +```ts +export type Action = + // routing + | { type: "ROUTE_TABLE" } + | { type: "ROUTE_CASE"; caseId: string } + + // table + | { type: "TABLE_LOAD" } + | { type: "TABLE_LOAD_OK"; rows: FindingRow[]; mutedCounts: MutedCounts; etag?: string } + | { type: "TABLE_LOAD_ERR"; error: string } + + | { type: "FILTER_SET_SEARCH"; search?: string } + | { type: "FILTER_SET_LANE"; lane?: Lane } + | { type: "FILTER_TOGGLE_SHOW_MUTED" } + | { type: "FILTER_SET_PAGE"; page: number } + | { type: "FILTER_SET_PAGE_SIZE"; pageSize: number } + + // case header + | { type: "CASE_LOAD"; caseId: string } + | { type: "CASE_LOAD_OK"; header: CaseHeader; etag?: string } + | { type: "CASE_LOAD_ERR"; error: string } + + // evidence + | { type: "EVIDENCE_LOAD"; caseId: string } + | { type: "EVIDENCE_LOAD_OK"; evidence: EvidenceItem[] } + | { type: "EVIDENCE_LOAD_ERR"; error: string } + + // decisions + | { type: "DECISIONS_LOAD"; caseId: string } + | { type: "DECISIONS_LOAD_OK"; decisions: DecisionItem[] } + | { type: "DECISIONS_LOAD_ERR"; error: string } + + | { type: "DECISION_DRAWER_OPEN"; open: boolean } + | { type: "DECISION_CREATE"; caseId: string; kind: DecisionKind; reasonCode: string; note?: string; ttl?: string } + | { type: "DECISION_CREATE_OK"; decision: DecisionItem } + | { type: "DECISION_CREATE_ERR"; error: string } + + | { type: "DECISION_REVOKE"; caseId: string; decisionId: string } + | { type: "DECISION_REVOKE_OK"; decisionId: string } + | { type: "DECISION_REVOKE_ERR"; error: string } + + // snapshots + smart diff + | { type: "SNAPSHOTS_LOAD"; caseId: string } + | { type: "SNAPSHOTS_LOAD_OK"; snapshots: SnapshotItem[] } + | { type: "SNAPSHOTS_LOAD_ERR"; error: string } + + | { type: "DIFF_OPEN"; open: boolean } + | { type: "DIFF_LOAD"; caseId: string; fromInputsHash: string; toInputsHash: string } + | { type: "DIFF_LOAD_OK"; diff: SmartDiff } + | { type: "DIFF_LOAD_ERR"; error: string } + + // export bundle + | { type: "BUNDLE_EXPORT"; caseId: string } + | { 
type: "BUNDLE_EXPORT_OK"; downloadUrl: string } + | { type: "BUNDLE_EXPORT_ERR"; error: string }; +``` + +## 5. Reducer Invariants + +* Pure: no I/O in reducer. +* Any mutation of gating/visibility must originate from: + * `CASE_LOAD_OK` (new computed risk) + * `DECISION_CREATE_OK` / `DECISION_REVOKE_OK` +* Evidence is loaded lazily; header is loaded first. +* "Show muted" affects only table filtering, never deletes data. + +## 6. Reducer Implementation (Reference) + +```ts +export function reduce(state: TriageState, action: Action): { state: TriageState; cmd: Command } { + switch (action.type) { + case "ROUTE_TABLE": + return { + state: { ...state, route: { page: "TABLE" } }, + cmd: { type: "NAVIGATE", route: { page: "TABLE" } } + }; + + case "ROUTE_CASE": + return { + state: { + ...state, + route: { page: "CASE", caseId: action.caseId }, + caseView: { ...state.caseView, loading: true, error: undefined } + }, + cmd: { + type: "HTTP_GET", + url: `/api/triage/v1/cases/${encodeURIComponent(action.caseId)}`, + headers: state.caseView.etag ? { "If-None-Match": state.caseView.etag } : undefined, + onSuccess: { type: "CASE_LOAD_OK", header: undefined as any }, + onError: { type: "CASE_LOAD_ERR", error: "" } + } + }; + + case "TABLE_LOAD": + return { + state: { ...state, table: { ...state.table, loading: true, error: undefined } }, + cmd: { + type: "HTTP_GET", + url: `/api/triage/v1/findings?showMuted=${state.filters.showMuted}&page=${state.filters.page}&pageSize=${state.filters.pageSize}` + + (state.filters.lane ? `&lane=${state.filters.lane}` : "") + + (state.filters.search ? `&search=${encodeURIComponent(state.filters.search)}` : ""), + headers: state.table.etag ? 
{ "If-None-Match": state.table.etag } : undefined, + onSuccess: { type: "TABLE_LOAD_OK", rows: [], mutedCounts: { reach: 0, vex: 0, compensated: 0 } }, + onError: { type: "TABLE_LOAD_ERR", error: "" } + } + }; + + case "TABLE_LOAD_OK": + return { + state: { ...state, table: { ...state.table, loading: false, rows: action.rows, mutedCounts: action.mutedCounts, etag: action.etag } }, + cmd: { type: "NONE" } + }; + + case "TABLE_LOAD_ERR": + return { + state: { ...state, table: { ...state.table, loading: false, error: action.error } }, + cmd: { type: "NONE" } + }; + + case "CASE_LOAD_OK": { + const header = action.header; + return { + state: { + ...state, + caseView: { + ...state.caseView, + loading: false, + header, + etag: action.etag, + evidenceLoading: true, + decisionsLoading: true, + snapshotsLoading: true + } + }, + cmd: { + type: "HTTP_GET", + url: `/api/triage/v1/cases/${encodeURIComponent(header.id)}/evidence`, + onSuccess: { type: "EVIDENCE_LOAD_OK", evidence: [] }, + onError: { type: "EVIDENCE_LOAD_ERR", error: "" } + } + }; + } + + case "EVIDENCE_LOAD_OK": + return { + state: { ...state, caseView: { ...state.caseView, evidenceLoading: false, evidence: action.evidence } }, + cmd: { type: "NONE" } + }; + + case "DECISION_DRAWER_OPEN": + return { state: { ...state, ui: { ...state.ui, decisionDrawerOpen: action.open } }, cmd: { type: "NONE" } }; + + case "DECISION_CREATE": + return { + state: state, + cmd: { + type: "HTTP_POST", + url: `/api/triage/v1/decisions`, + body: { caseId: action.caseId, kind: action.kind, reasonCode: action.reasonCode, note: action.note, ttl: action.ttl }, + onSuccess: { type: "DECISION_CREATE_OK", decision: undefined as any }, + onError: { type: "DECISION_CREATE_ERR", error: "" } + } + }; + + case "DECISION_CREATE_OK": + return { + state: { + ...state, + ui: { ...state.ui, decisionDrawerOpen: false, toast: { kind: "success", message: "Decision applied. Undo available in History." 
} } + }, + // after decision, refresh header + snapshots (re-compute may occur server-side) + cmd: { type: "HTTP_GET", url: `/api/triage/v1/cases/${encodeURIComponent(state.route.caseId!)}`, onSuccess: { type: "CASE_LOAD_OK", header: undefined as any }, onError: { type: "CASE_LOAD_ERR", error: "" } } + }; + + case "BUNDLE_EXPORT": + return { + state, + cmd: { + type: "HTTP_POST", + url: `/api/triage/v1/cases/${encodeURIComponent(action.caseId)}/export`, + body: {}, + onSuccess: { type: "BUNDLE_EXPORT_OK", downloadUrl: "" }, + onError: { type: "BUNDLE_EXPORT_ERR", error: "" } + } + }; + + case "BUNDLE_EXPORT_OK": + return { + state: { ...state, ui: { ...state.ui, toast: { kind: "success", message: "Evidence bundle ready." } } }, + cmd: { type: "DOWNLOAD", url: action.downloadUrl } + }; + + default: + return { state, cmd: { type: "NONE" } }; + } +} +``` + +## 7. Unit Testing Requirements + +Minimum tests: + +* Reducer purity: no global mutation. +* TABLE_LOAD produces correct URL for filters. +* ROUTE_CASE triggers case header load. +* CASE_LOAD_OK triggers EVIDENCE load (and separately decisions/snapshots in your integration layer). +* DECISION_CREATE_OK closes drawer and refreshes case header. +* BUNDLE_EXPORT_OK emits DOWNLOAD. + +Recommended: golden-state snapshots to ensure backwards compatibility when the state model evolves. + +--- + +**Document Version**: 1.0 +**Target Platform**: Angular v17 + TypeScript diff --git a/docs/ux/TRIAGE_UX_GUIDE.md b/docs/ux/TRIAGE_UX_GUIDE.md new file mode 100644 index 00000000..30a2eaaf --- /dev/null +++ b/docs/ux/TRIAGE_UX_GUIDE.md @@ -0,0 +1,236 @@ +# Stella Ops Triage UX Guide (Narrative-First + Proof-Linked) + +## 0. Scope + +This guide specifies the user experience for Stella Ops triage and evidence workflows: +- Narrative-first case view that answers DevOps' three questions quickly. +- Proof-linked evidence surfaces (SBOM/VEX/provenance/reachability/replay). 
+- Quiet-by-default noise controls with reversible, signed decisions. +- Smart-Diff history that explains meaningful risk changes. + +Architecture constraints: +- Lattice/risk evaluation executes in `scanner.webservice`. +- `concelier` and `excititor` must **preserve prune source** (every merged/pruned datum remains traceable to origin). + +## 1. UX Contract + +Every triage surface must answer, in order: + +1) Can I ship this? +2) If not, what exactly blocks me? +3) What's the minimum safe change to unblock? + +Everything else is secondary and should be progressively disclosed. + +## 2. Primary Objects in the UX + +- Finding/Case: a specific vuln/rule tied to an asset (image/artifact/environment). +- Risk Result: deterministic lattice output (score/verdict/lane), computed by `scanner.webservice`. +- Evidence Artifact: signed, hash-addressed proof objects (SBOM slice, VEX doc, provenance, reachability slice, replay manifest). +- Decision: reversible user/system action that changes visibility/gating (mute/ack/exception) and is always signed/auditable. +- Snapshot: immutable record of inputs/outputs hashes enabling Smart-Diff. + +## 3. Global UX Principles + +### 3.1 Narrative-first, list-second +Default view is a "Case" narrative header + evidence rail. Lists exist for scanning and sorting, but not as the primary cognitive surface. + +### 3.2 Time-to-evidence (TTFS) target +From pipeline alert click → human-readable verdict + first evidence link: +- p95 ≤ 30 seconds (including auth and initial fetch). +- "Evidence" is always one click away (no deep tab chains). + +### 3.3 Proof-linking is mandatory +Any chip/badge that asserts a fact must link to the exact evidence object(s) that justify it. 
+ +Examples: +- "Reachable: Yes" → call-stack slice (and/or runtime hit record) +- "VEX: not_affected" → effective VEX assertion + signature details +- "Blocked by Policy Gate X" → policy artifact + lattice explanation + +### 3.4 Quiet by default, never silent +Muted lanes are hidden by default but surfaced with counts and a toggle. +Muting never deletes; it creates a signed Decision with TTL/reason and is reversible. + +### 3.5 Deterministic and replayable +Users must be able to export an evidence bundle containing: +- scan replay manifest (feeds/rules/policies/hashes) +- signed artifacts +- outputs (risk result, snapshots) +so auditors can replay identically. + +## 4. Information Architecture + +### 4.1 Screens + +1) Findings Table (global) +- Purpose: scan, sort, filter, jump into cases +- Default: muted lanes hidden +- Banner: shows count of auto-muted by policy with "Show" toggle + +2) Case View (single-page narrative) +- Purpose: decision making + proof review +- Above fold: verdict + chips + deterministic score +- Right rail: evidence list +- Tabs (max 3): + - Evidence (default) + - Reachability & Impact + - History (Smart-Diff) + +3) Export / Verify Bundle +- Purpose: offline/audit verification +- Async export job, then download DSSE-signed zip +- Verification UI: signature status, hash tree, issuer chain + +### 4.2 Lanes (visibility buckets) + +Lanes are a UX categorization derived from deterministic risk + decisions: + +- ACTIVE +- BLOCKED +- NEEDS_EXCEPTION +- MUTED_REACH (non-reachable) +- MUTED_VEX (effective VEX says not_affected) +- COMPENSATED (controls satisfy policy) + +Default: show ACTIVE/BLOCKED/NEEDS_EXCEPTION. +Muted lanes appear behind a toggle and via the banner counts. + +## 5. 
Case View Layout (Required) + +### 5.1 Top Bar +- Asset name / Image tag / Environment +- Last evaluated time +- Policy profile name (e.g., "Strict CI Gate") + +### 5.2 Verdict Banner (Above fold) +Large, unambiguous verdict: +- SHIP +- BLOCKED +- NEEDS EXCEPTION + +Below verdict: +- One-line "why" summary (max 140 chars), e.g.: + - "Reachable path observed; exploit signal present; Policy 'prod-strict' blocks." + +### 5.3 Chips (Each chip is clickable) +Minimum set: +- Reachability: Reachable / Not reachable / Unknown (with confidence) +- Effective VEX: affected / not_affected / under_investigation +- Exploit signal: yes/no + source indicator +- Exposure: internet-exposed yes/no (if available) +- Asset tier: tier label +- Gate: allow/block/exception-needed (policy gate name) + +Chip click behavior: +- Opens evidence panel anchored to the proof objects +- Shows source chain (concelier/excititor preserved sources) + +### 5.4 Evidence Rail (Always visible right side) +List of evidence artifacts with: +- Type icon +- Title +- Issuer +- Signed/verified indicator +- Content hash (short) +- Created timestamp +Actions per item: +- Preview +- Copy hash +- Open raw +- "Show in bundle" marker + +### 5.5 Actions Footer (Only primary actions) +- Create work item +- Acknowledge / Mute (opens Decision drawer) +- Propose exception (Decision with TTL + approver chain) +- Export evidence bundle + +No more than 4 primary buttons. Secondary actions go into kebab menu. + +## 6. 
Decision Flows (Mute/Ack/Exception) + +### 6.1 Decision Drawer (common UI) +Fields: +- Decision kind: Mute reach / Mute VEX / Acknowledge / Exception +- Reason code (dropdown) + free-text note +- TTL (required for exceptions; optional for mutes) +- Policy ref (auto-filled; editable only by admins) +- "Sign and apply" (server-side DSSE signing; user identity included) + +On submit: +- Create Decision (signed) +- Re-evaluate lane/verdict if applicable +- Create Snapshot ("DECISION" trigger) +- Show toast with undo link + +### 6.2 Undo +Undo is implemented as "revoke decision" (signed revoke record or revocation fields). +Never delete. + +## 7. Smart-Diff UX + +### 7.1 Timeline +Chronological snapshots: +- when (timestamp) +- trigger (feed/vex/sbom/policy/runtime/decision/rescan) +- summary (short) + +### 7.2 Diff panel +Two-column diff: +- Inputs changed (with proof links): VEX assertion changed, policy version changed, runtime trace arrived, etc. +- Outputs changed: lane, verdict, score, gates + +### 7.3 Meaningful change definition +The UI only highlights "meaningful" changes: +- verdict change +- lane change +- score crosses a policy threshold +- reachability state changes +- effective VEX status changes +Other changes remain in "details" expandable. + +## 8. Performance & UI Engineering Requirements + +- Findings table uses virtual scroll and server-side pagination. +- Case view loads in 2 steps: + 1) Header narrative (small payload) + 2) Evidence list + snapshots (lazy) +- Evidence previews are lazy-loaded and cancellable. +- Use ETag/If-None-Match for case and evidence list endpoints. +- UI must remain usable under high latency (air-gapped / offline kits): + - show cached last-known verdict with clear "stale" marker + - allow exporting bundles from cached artifacts when permissible + +## 9. 
Accessibility & Operator Usability + +- Keyboard navigation: table rows, chips, evidence list +- High contrast mode supported +- All status is conveyed by text + shape (not color only) +- Copy-to-clipboard for hashes, purls, CVE IDs + +## 10. Telemetry (Must instrument) + +- TTFS: notification click → verdict banner rendered +- Time-to-proof: click chip → proof preview shown +- Mute reversal rate (auto-muted later becomes actionable) +- Bundle export success/latency + +## 11. Responsibilities by Service + +- `scanner.webservice`: + - produces reachability results, risk results, snapshots + - stores/serves case narrative header, evidence indexes, Smart-Diff +- `concelier`: + - aggregates vuln feeds and preserves per-source provenance ("preserve prune source") +- `excititor`: + - merges VEX and preserves original assertion sources ("preserve prune source") +- `notify.webservice`: + - emits first_signal / risk_changed / gate_blocked +- `scheduler.webservice`: + - re-evaluates existing images on feed/policy updates, triggers snapshots + +--- + +**Document Version**: 1.0 +**Target Platform**: .NET 10, PostgreSQL >= 16, Angular v17
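
As a closing illustration of §7.3, the "meaningful change" rule can be expressed as a small pure predicate. This is a hedged sketch: the `RiskOutputs` shape and the `policyThresholds` parameter are assumptions made for illustration, not the service contract:

```typescript
// Hypothetical sketch of §7.3: decide whether a Smart-Diff between two
// snapshots is "meaningful". Field names are assumptions, not the real API.
interface RiskOutputs {
  verdict: "SHIP" | "BLOCK" | "EXCEPTION";
  lane: string;
  score: number;
  reachable: "YES" | "NO" | "UNKNOWN";
  vexStatus: "affected" | "not_affected" | "under_investigation" | "unknown";
}

function isMeaningfulChange(
  before: RiskOutputs,
  after: RiskOutputs,
  policyThresholds: number[] // score gates defined by the active policy
): boolean {
  if (before.verdict !== after.verdict) return true;
  if (before.lane !== after.lane) return true;
  if (before.reachable !== after.reachable) return true;
  if (before.vexStatus !== after.vexStatus) return true;
  // Score movement only matters when it crosses a policy threshold.
  return policyThresholds.some((t) => (before.score < t) !== (after.score < t));
}
```

Keeping the predicate pure makes it trivial to unit-test against recorded snapshots; everything it rejects stays available in the "details" expandable rather than being highlighted.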