housekeeping work

2025-12-19 22:19:08 +02:00
parent 91f3610b9d
commit 5b57b04484
64 changed files with 4702 additions and 4 deletions
--- a/docs/product-advisories/unprocessed/19-Dec-2025
+++ b/docs/product-advisories/unprocessed/19-Dec-2025
@@ -0,0 +1,338 @@
+## Executive directive
+
+Build **Reachability as Evidence**, not as a UI feature.
+
+Every reachability conclusion must produce a **portable, signed, replayable evidence bundle** that answers:
+
+1. **What vulnerable code unit is being discussed?** (symbol/method/function + version)
+2. **What entrypoint is assumed?** (HTTP handler, RPC method, CLI, scheduled job, etc.)
+3. **What is the witness?** (a call-path subgraph, not a screenshot)
+4. **What assumptions/gates apply?** (config flags, feature toggles, runtime wiring)
+5. **Can a third party reproduce it?** (same inputs → same evidence hash)
+
+This must work for **source** and **post-build artifacts**.
+
+---
+
+# Directions for Product Managers
+
+## 1) Define the product contract in one page
+
+### Capability name
+**Proof‑carrying reachability**.
+
+### Contract
+Given an artifact (source or built) and a vulnerability mapping, Stella Ops outputs:
+
+- **Reachability verdict:** `REACHABLE | NOT_PROVEN_REACHABLE | INCONCLUSIVE`
+- **Witness evidence:** a minimal **reachability subgraph** + one or more witness paths
+- **Reproducibility bundle:** all inputs and toolchain metadata needed to replay
+- **Attestation:** signed statement tied to the artifact digest
+
+### Important language choice
+Avoid claiming “unreachable” unless you can prove non-reachability under a formally sound model.
+
+- Use **NOT_PROVEN_REACHABLE** for “no path found under current analysis + assumptions.”
+- Use **INCONCLUSIVE** when analysis cannot be performed reliably (missing symbols, obfuscation, unsupported language, dynamic dispatch uncertainty, etc.).
+
+This is essential for credibility and audit use.
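
The verdict vocabulary above can be sketched as a small enum plus a classification rule. This is a minimal illustration, not the Stella Ops implementation: the names `Verdict`, `classify`, and `analysis_gaps` are assumptions introduced here, and the precedence (a witness wins over recorded gaps) is one possible reading of the contract.

```python
from enum import Enum
from typing import Sequence


class Verdict(str, Enum):
    """The three verdicts from the contract; there is deliberately no UNREACHABLE."""
    REACHABLE = "REACHABLE"
    NOT_PROVEN_REACHABLE = "NOT_PROVEN_REACHABLE"
    INCONCLUSIVE = "INCONCLUSIVE"


def classify(witness_paths: Sequence[Sequence[str]],
             analysis_gaps: Sequence[str]) -> Verdict:
    """Map analysis output to a verdict (hypothetical helper).

    witness_paths: paths found from an entrypoint to a vulnerable node.
    analysis_gaps: reasons the analysis could not run reliably
                   (missing symbols, unsupported language, ...).
    """
    if witness_paths:
        # A concrete witness exists, so the positive claim is provable.
        return Verdict.REACHABLE
    if analysis_gaps:
        # No witness, and the analysis itself was unreliable.
        return Verdict.INCONCLUSIVE
    # Analysis ran cleanly but found no path: still not a proof of absence.
    return Verdict.NOT_PROVEN_REACHABLE
```

Keeping `UNREACHABLE` out of the type entirely means downstream code cannot emit the stronger claim by accident.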
+
+---
+
+## 2) Anchor personas and top workflows
+
+### Primary personas
+- Security governance / AppSec: wants fewer false positives and defensible prioritization.
+- Compliance/audit: wants evidence and replayability.
+- Engineering teams: want specific call paths and what to change.
+
+### Top workflows (must support in MVP)
+1. **CI gate with signed verdict**
+   - “Block release if any `REACHABLE` high-severity finding is present OR if the count of `INCONCLUSIVE` findings exceeds a threshold.”
+2. **Audit replay**
+   - “Reproduce the reachability proof for artifact digest X using snapshot Y.”
+3. **Release delta**
+   - “Show what reachability changed between release A and B.”
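
The CI gate rule quoted in the first workflow could be evaluated roughly as follows. This is a sketch under assumptions: the `Finding` shape, the `"high"` severity label, and the `max_inconclusive` parameter are hypothetical and not defined by this advisory.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; field names are illustrative only."""
    finding_id: str
    verdict: str   # REACHABLE | NOT_PROVEN_REACHABLE | INCONCLUSIVE
    severity: str  # e.g. "low", "medium", "high"


def should_block_release(findings: Iterable[Finding], max_inconclusive: int) -> bool:
    """Block if any REACHABLE high-severity finding exists,
    or if the number of INCONCLUSIVE findings exceeds the threshold."""
    findings = list(findings)
    reachable_high = any(
        f.verdict == "REACHABLE" and f.severity == "high" for f in findings
    )
    inconclusive_count = sum(1 for f in findings if f.verdict == "INCONCLUSIVE")
    return reachable_high or inconclusive_count > max_inconclusive
```

Note that `NOT_PROVEN_REACHABLE` findings never block on their own in this sketch; whether they should is a policy decision.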
+
+---
+
+## 3) Minimum viable scope: pick targets that make “post-build” real early
+
+To satisfy “source and post-build artifacts” without biting off ELF-level complexity first:
+
+### MVP artifact types (recommended)
+- **Source repository** for 1–2 languages with mature static IR
+- **Post-build intermediate artifacts** that retain symbol structure:
+  - Java `.jar/.class`
+  - .NET assemblies
+  - Python wheels (bytecode)
+  - Node bundles with sourcemaps (optional)
+
+These give you “post-build” support where call graphs are tractable.
+
+### Defer for later phases
+- Native ELF/Mach-O deep reachability (harder due to stripping, inlining, indirect calls, dynamic loading)
+- Highly dynamic languages without strong type info, unless you accept “witness-only” semantics
+
+Your differentiator is proof portability and determinism, not “supports every binary on day one.”
+
+---
+
+## 4) Product requirements: what “proof-carrying” means in requirements language
+
+### Functional requirements
+- Output must include a **reachability subgraph**:
+  - Nodes = code units (function/method) with stable IDs
+  - Edges = call or dispatch edges with type annotations
+  - Must include at least one **witness path** from entrypoint to vulnerable node when `REACHABLE`
+- Output must be **artifact-tied**:
+  - Evidence must reference artifact digest(s) (source commit, build artifact digest, container image digest)
+- Output must be **attestable**:
+  - Produce a signed attestation (DSSE/in-toto style) attached to the artifact digest
+- Output must be **replayable**:
+  - Provide a “replay recipe” (analyzer versions, configs, vulnerability mapping version, and input digests)
+
+### Non-functional requirements
+- Deterministic: repeated runs on same inputs produce identical evidence hash
+- Size-bounded: subgraph evidence must be bounded (e.g., path-based extraction + limited context)
+- Privacy-controllable:
+  - Support a mode that avoids embedding raw source content (store pointers/hashes instead)
+- Verifiable offline:
+  - Verification and replay must work air-gapped given the snapshot bundle
+
+---
+
+## 5) Acceptance criteria (use as Definition of Done)
+
+A feature is “done” only when:
+
+1. **Verifier can validate** the attestation signature and confirm the evidence hash matches content.
+2. A second machine can **reproduce the same evidence hash** given the replay bundle.
+3. Evidence includes at least one witness path for `REACHABLE`.
+4. Evidence includes explicit assumptions/gates; when gating information is unavailable, that gap is itself recorded as an assumption (e.g., “config unknown”).
+5. Evidence is **linked to the precise artifact digest** being deployed/scanned.
+
+---
+
+## 6) Product packaging decisions that create switching cost
+
+These are product decisions that turn engineering into a moat:
+
+- **Make “reachability proof” an exportable object**, not just a UI view.
+- Provide an API: `GET /findings/{id}/proof` returning canonical evidence.
+- Support policy gates on:
+  - `verdict`
+  - `confidence`
+  - `assumption_count`
+  - `inconclusive_reasons`
+- Make “proof replay” a one-command workflow in CLI.
+
+---
+
+# Directions for Development Managers
+
+## 1) Architecture: build a “proof pipeline” with strict boundaries
+
+Implement as composable modules with stable interfaces:
+
+1. **Artifact Resolver**
+   - Inputs: repo URL/commit, build artifact path, container image digest
+   - Output: normalized “artifact record” with digests and metadata
+
+2. **Graph Builder (language-specific adapters)**
+   - Inputs: artifact record
+   - Output: canonical **Program Graph**
+     - Nodes: code units
+     - Edges: calls/dispatch
+     - Optional: config gates, dependency edges
+
+3. **Vulnerability-to-Code Mapper**
+   - Inputs: vulnerability record (CVE), package coordinates, symbol metadata (if available)
+   - Output: vulnerable node set + mapping confidence
+
+4. **Entrypoint Modeler**
+   - Inputs: artifact + runtime context (framework detection, routing tables, main methods)
+   - Output: entrypoint node set with types (HTTP, RPC, CLI, cron)
+
+5. **Reachability Engine**
+   - Inputs: graph + entrypoints + vulnerable nodes + constraints
+   - Output: witness paths + minimal subgraph extraction
+
+6. **Evidence Canonicalizer**
+   - Inputs: witness paths + subgraph + metadata
+   - Output: canonical JSON (stable ordering, stable IDs), plus content hash
+
+7. **Attestor**
+   - Inputs: evidence hash + artifact digest
+   - Output: signed attestation object (OCI attachable)
+
+8. **Verifier (separate component)**
+   - Must validate signatures + evidence integrity independently of generator
+
+Critical: generator and verifier must be decoupled to preserve trust.
+
+---
+
+## 2) Evidence model: what to store (and how to keep it stable)
+
+### Node identity must be stable across runs
+Define a canonical NodeID scheme:
+
+- Source node ID:
+  - `{language}:{repo_digest}:{symbol_signature}:{optional_source_location_hash}`
+- Post-build node ID:
+  - `{language}:{artifact_digest}:{symbol_signature}:{optional_offset_or_token}`
+
+Avoid raw file paths or non-deterministic compiler offsets as primary IDs unless normalized.
+
+### Edge identity
+`{caller_node_id} -> {callee_node_id} : {edge_type}`  
+Edge types matter (direct call, virtual dispatch, reflection, dynamic import, etc.)
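
The NodeID and edge identity templates above can be sketched as plain string builders. The function names and sample values are assumptions for illustration; the sketch also does not escape `:` inside components, which a real scheme would need to handle (digests and some symbol signatures contain colons).

```python
from typing import Optional


def source_node_id(language: str, repo_digest: str, symbol_signature: str,
                   location_hash: Optional[str] = None) -> str:
    """{language}:{repo_digest}:{symbol_signature}:{optional_source_location_hash}"""
    parts = [language, repo_digest, symbol_signature]
    if location_hash:
        parts.append(location_hash)
    return ":".join(parts)


def post_build_node_id(language: str, artifact_digest: str, symbol_signature: str,
                       offset_or_token: Optional[str] = None) -> str:
    """{language}:{artifact_digest}:{symbol_signature}:{optional_offset_or_token}"""
    parts = [language, artifact_digest, symbol_signature]
    if offset_or_token:
        parts.append(offset_or_token)
    return ":".join(parts)


def edge_id(caller_node_id: str, callee_node_id: str, edge_type: str) -> str:
    """{caller_node_id} -> {callee_node_id} : {edge_type}"""
    return f"{caller_node_id} -> {callee_node_id} : {edge_type}"


# Hypothetical values, for illustration only.
caller = source_node_id("java", "abc123", "com.example.Api#handle()")
callee = source_node_id("java", "abc123", "com.example.Parser#parse(String)")
print(edge_id(caller, callee, "direct_call"))
```

Because the IDs are built only from digests and signatures, two runs over the same inputs produce the same strings regardless of checkout path or machine.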
+
+### Subgraph extraction rule
+Store:
+- All nodes/edges on at least one witness path (or k witness paths)
+- Plus bounded context:
+  - 1–2 hop neighborhood around the vulnerable node and entrypoint
+  - routing edges (HTTP route → handler) where applicable
+
+This makes the proof compact and audit-friendly.
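
One way to realize the extraction rule is a breadth-first search for a single shortest witness path, followed by a bounded neighborhood around the two endpoints. This is a sketch under assumptions: the graph is a plain adjacency dict, the neighborhood ignores edge direction, and routing edges are not modeled; all function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_witness_path(graph: Dict[str, List[str]], entry: str,
                          target: str) -> Optional[List[str]]:
    """BFS from entry to target; neighbors are visited in sorted order
    so the chosen path is deterministic."""
    parents: Dict[str, Optional[str]] = {entry: None}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def neighborhood(graph: Dict[str, List[str]], center: str, hops: int) -> Set[str]:
    """Nodes within `hops` edges of center, ignoring edge direction."""
    undirected: Dict[str, Set[str]] = {}
    for src, dsts in graph.items():
        for dst in dsts:
            undirected.setdefault(src, set()).add(dst)
            undirected.setdefault(dst, set()).add(src)
    seen = {center}
    frontier = {center}
    for _ in range(hops):
        frontier = {n for f in frontier for n in undirected.get(f, ())} - seen
        seen |= frontier
    return seen


def extract_subgraph(graph: Dict[str, List[str]], entry: str, target: str,
                     hops: int = 1) -> Optional[dict]:
    """Witness path plus bounded context, with sorted nodes and edges."""
    path = shortest_witness_path(graph, entry, target)
    if path is None:
        return None
    keep = set(path) | neighborhood(graph, entry, hops) | neighborhood(graph, target, hops)
    edges = sorted((s, d) for s, ds in graph.items() for d in ds
                   if s in keep and d in keep)
    return {"witness_path": path, "nodes": sorted(keep), "edges": edges}
```

Returning `None` when no path exists keeps the "no witness" case explicit, so the caller has to choose between `NOT_PROVEN_REACHABLE` and `INCONCLUSIVE` rather than receiving an empty subgraph.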
+
+### Canonicalization requirements
+- Stable sorting of nodes and edges
+- Canonical JSON serialization (no map-order nondeterminism)
+- Explicit analyzer version + config included in evidence
+- Hash everything that influences results
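
A minimal sketch of these canonicalization requirements using only the standard library follows. The field names (`nodes`, `edges`, `id`, `from`, `to`, `type`) are assumptions for illustration, and `json.dumps` with sorted keys is a simplification: it is not a full canonical JSON scheme (for example, it does not normalize number formatting the way RFC 8785 does).

```python
import hashlib
import json


def canonical_bytes(evidence: dict) -> bytes:
    """Serialize evidence with stable node/edge ordering and sorted keys."""
    normalized = dict(evidence)
    normalized["nodes"] = sorted(evidence["nodes"], key=lambda n: n["id"])
    normalized["edges"] = sorted(
        evidence["edges"], key=lambda e: (e["from"], e["to"], e["type"])
    )
    return json.dumps(
        normalized, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def evidence_hash(evidence: dict) -> str:
    """SHA-256 over the canonical bytes, prefixed with the algorithm name."""
    return "sha256:" + hashlib.sha256(canonical_bytes(evidence)).hexdigest()


# Two logically identical evidence objects built in different orders.
a = {
    "nodes": [{"id": "n2"}, {"id": "n1"}],
    "edges": [{"from": "n1", "to": "n2", "type": "direct"}],
    "analyzer": {"version": "1.0", "config_hash": "abc"},
}
b = {
    "analyzer": {"config_hash": "abc", "version": "1.0"},
    "edges": [{"from": "n1", "to": "n2", "type": "direct"}],
    "nodes": [{"id": "n1"}, {"id": "n2"}],
}
print(evidence_hash(a) == evidence_hash(b))  # True
```

Because the analyzer version and config hash sit inside the hashed object, changing either one changes the evidence hash, which is the intent of "hash everything that influences results."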
+
+---
+
+## 3) Determinism and reproducibility: engineering guardrails
+
+### Deterministic computation
+- Avoid parallel graph traversal that yields nondeterministic order without canonical sorting
+- If using concurrency, collect results and sort deterministically before emitting
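
The "collect, then sort" guardrail might look like this with a thread pool. The function name and the adjacency-dict graph are assumptions for illustration; the point is only that completion order never reaches the output.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Tuple


def collect_edges(graph: Dict[str, List[str]],
                  max_workers: int = 4) -> List[Tuple[str, str]]:
    """Resolve outgoing edges per node in parallel, then sort before emitting."""

    def edges_from(node: str) -> List[Tuple[str, str]]:
        return [(node, callee) for callee in graph[node]]

    edges: List[Tuple[str, str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(edges_from, node) for node in graph]
        # as_completed yields in completion order, which varies run to run.
        for future in as_completed(futures):
            edges.extend(future.result())
    # Canonical sort removes any dependence on scheduling.
    return sorted(edges)
```

Sorting once at the boundary is usually cheaper to reason about than trying to make every worker deterministic.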
+
+### Repro bundle (“time travel”)
+Persist, as digests:
+- Analyzer container/image digest
+- Analyzer config hash
+- Vulnerability mapping dataset version hash
+- Artifact digest(s)
+- Graph builder version hash
+
+A replay must be possible without “calling home.”
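
The digests listed above could be carried as a single immutable record. This is a sketch, assuming field names that mirror the list; `ReplayBundle` and `bundle_hash` are hypothetical names, not an existing schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReplayBundle:
    """Everything needed to replay an analysis, stored as digests only."""
    analyzer_image_digest: str
    analyzer_config_hash: str
    mapping_dataset_hash: str
    artifact_digests: Tuple[str, ...]
    graph_builder_hash: str

    def bundle_hash(self) -> str:
        """Order-independent hash of the bundle contents."""
        payload = asdict(self)
        # Sort artifact digests so input order does not affect the hash.
        payload["artifact_digests"] = sorted(self.artifact_digests)
        data = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(data.encode("utf-8")).hexdigest()
```

Since the record holds only digests, it can be shipped into an air-gapped environment alongside the referenced blobs and checked without any network access.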
+
+### Golden tests
+Create fixtures where:
+- Same input graph + mapping → exactly the same evidence hash on every run
+- Regression test for canonicalization changes (version the schema intentionally)
+
+---
+
+## 4) Attestation format and verification
+
+### Attestation contents (minimum)
+- Subject: artifact digest (image digest / build artifact digest)
+- Predicate: reachability evidence hash + metadata
+- Predicate type: `reachability` (custom) with versioning
+
+### Verification requirements
+- Verification must run offline
+- It must validate:
+  1) signature
+  2) subject digest binding
+  3) evidence hash matches serialized evidence
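
The three checks can be sketched offline with the standard library. Important assumption: HMAC-SHA256 is used here only as a stand-in so the example runs without third-party packages; it is a symmetric scheme and not appropriate for real attestations, which would use an asymmetric signature inside a DSSE-style envelope. The statement layout (`subject`, `predicate.evidence_hash`) and function names are also illustrative.

```python
import hashlib
import hmac
import json
from typing import List


def _payload(statement: dict) -> bytes:
    """Deterministic serialization of the statement being signed."""
    return json.dumps(statement, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(statement: dict, key: bytes) -> str:
    """Stand-in signer (HMAC); a real attestor would use asymmetric keys."""
    return hmac.new(key, _payload(statement), hashlib.sha256).hexdigest()


def verify(statement: dict, signature: str, key: bytes,
           deployed_digest: str, evidence_bytes: bytes) -> List[str]:
    """Return the list of failed checks; an empty list means verified."""
    failures = []
    # 1) signature
    if not hmac.compare_digest(sign(statement, key), signature):
        failures.append("signature")
    # 2) subject digest binding
    if statement["subject"] != deployed_digest:
        failures.append("subject digest binding")
    # 3) evidence hash matches serialized evidence
    actual = "sha256:" + hashlib.sha256(evidence_bytes).hexdigest()
    if statement["predicate"]["evidence_hash"] != actual:
        failures.append("evidence hash")
    return failures
```

Reporting every failed check, rather than stopping at the first, gives an auditor a fuller picture of what went wrong with a rejected attestation.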
+
+### Storage model
+Use content-addressable storage keyed by evidence hash.  
+Attestation references the hash; evidence stored separately or embedded (size tradeoff).
+
+---
+
+## 5) Source + post-build support: engineering plan
+
+### Unifying principle
+Source analyzers and post-build analyzers both produce the same canonical Program Graph abstraction.
+
+#### Source analyzers produce:
+- Function/method nodes using language signatures
+- Edges from static analysis IR
+
+#### Post-build analyzers produce:
+- Nodes from bytecode/assembly symbol tables (where available)
+- Edges from bytecode call instructions / metadata
+
+### Practical sequencing (recommended)
+1. Implement one source language adapter (fastest to prove model)
+2. Implement one post-build adapter where symbols are rich (e.g., Java bytecode)
+3. Ensure the evidence schema and attestation workflow work identically for both
+4. Expand to more ecosystems once the proof pipeline is stable
+
+---
+
+## 6) Operational constraints (performance, size, security)
+
+### Performance
+- Cache program graphs per artifact digest
+- Cache vulnerability-to-code mapping per package/version
+- Compute reachability on-demand per vulnerability, but reuse graphs
+
+### Evidence size
+- Limit witness paths (e.g., up to N shortest paths)
+- Prefer “witness + bounded neighborhood” over exporting full call graph
+
+### Security and privacy
+- Provide a “redacted proof mode”
+  - include symbol hashes instead of raw names if needed
+  - store source locations as hashes/pointers
+- Never embed raw source code unless explicitly enabled
+
+---
+
+## 7) Definition of Done for the engineering team
+
+A milestone is complete when you can demonstrate:
+
+1. Generate a reachability proof for a known vulnerable code unit with a witness path.
+2. Serialize a canonical evidence subgraph and compute a stable hash.
+3. Sign the attestation bound to the artifact digest.
+4. Verify the attestation on a clean machine (offline).
+5. Replay the analysis from the replay bundle and reproduce the same evidence hash.
+
+---
+
+# Concrete artifact example (for alignment)
+
+A reachability evidence object should look structurally like:
+
+- `subject`: artifact digest(s)
+- `claim`:
+  - `verdict`: REACHABLE / NOT_PROVEN_REACHABLE / INCONCLUSIVE
+  - `entrypoints`: list of NodeIDs
+  - `vulnerable_nodes`: list of NodeIDs
+  - `witness_paths`: list of paths (each path = ordered NodeIDs)
+- `subgraph`:
+  - `nodes`: list with stable IDs + metadata
+  - `edges`: list with stable ordering + edge types
+- `assumptions`:
+  - gating conditions, unresolved dynamic dispatch notes, etc.
+- `tooling`:
+  - analyzer name/version/digest
+  - config hash
+  - mapping dataset hash
+- `hashes`:
+  - evidence content hash
+  - schema version
+
+Then wrap and sign it as an attestation tied to the artifact digest.
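
For concreteness, the structure above can be written out as a small object. Every value below is a made-up placeholder, and two choices are assumptions of this sketch rather than part of the advisory: the content hash is computed over the object with its own `evidence_content_hash` field excluded, and `check_witness` is a hypothetical consistency check.

```python
import hashlib
import json

ENTRY = "java:sha256-aaaa:com.example.Api#handle()"
VULN = "java:sha256-aaaa:com.example.Parser#parse(String)"

EVIDENCE = {
    "subject": ["sha256-aaaa"],
    "claim": {
        "verdict": "REACHABLE",
        "entrypoints": [ENTRY],
        "vulnerable_nodes": [VULN],
        "witness_paths": [[ENTRY, VULN]],
    },
    "subgraph": {
        "nodes": [{"id": ENTRY, "kind": "http_handler"}, {"id": VULN, "kind": "method"}],
        "edges": [{"from": ENTRY, "to": VULN, "type": "direct_call"}],
    },
    "assumptions": ["config unknown"],
    "tooling": {
        "analyzer": {"name": "example-analyzer", "version": "0.0.0", "digest": "sha256-bbbb"},
        "config_hash": "sha256-cccc",
        "mapping_dataset_hash": "sha256-dddd",
    },
    "hashes": {"schema_version": "1"},
}


def content_hash(evidence: dict) -> str:
    """Hash the evidence without its own content-hash field (sketch assumption)."""
    body = dict(evidence)
    body["hashes"] = {k: v for k, v in evidence["hashes"].items()
                      if k != "evidence_content_hash"}
    data = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(data).hexdigest()


def check_witness(evidence: dict) -> bool:
    """A REACHABLE claim needs a path from an entrypoint to a vulnerable node."""
    claim = evidence["claim"]
    if claim["verdict"] != "REACHABLE":
        return True
    return any(
        path and path[0] in claim["entrypoints"] and path[-1] in claim["vulnerable_nodes"]
        for path in claim["witness_paths"]
    )


EVIDENCE["hashes"]["evidence_content_hash"] = content_hash(EVIDENCE)
```

Excluding the hash field from its own input avoids a self-reference while still covering the schema version and everything else that influences the result.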
+
+---
+
+## The one decision you should force early
+
+Decide (and document) whether your semantics are:
+
+- **Witness-based** (“REACHABLE only if we can produce a witness path”), and
+- **Conservative on negative claims** (“NOT_PROVEN_REACHABLE” is not “unreachable”).
+
+This single decision will keep the system honest, reduce legal/audit risk, and prevent the product from drifting into hand-wavy “trust us” scoring.