diff --git a/docs/product-advisories/23-Nov-2025 - Where Stella Ops Can Truly Lead.md b/docs/product-advisories/23-Nov-2025 - Where Stella Ops Can Truly Lead.md new file mode 100644 index 000000000..0c935031c --- /dev/null +++ b/docs/product-advisories/23-Nov-2025 - Where Stella Ops Can Truly Lead.md @@ -0,0 +1,2030 @@ +Here’s a quick, beginner‑friendly rundown of **reachability analysis** and how Stella Ops can leapfrog today’s tools with **deterministic, signed call‑graphs** and **in‑toto attestations**—plus a concrete action plan. + +--- + +## Why reachability matters (in plain terms) + +* Most scanners list every CVE in your code or containers. That’s noisy. +* **Reachability analysis** asks: *“Is the vulnerable function actually callable in this artifact, on this path, in this runtime?”* +* Modern vendors enrich this with context (code → cloud links, runtime traces, attack paths), but they usually **infer** paths and **don’t cryptographically prove** them. + +## Where today’s market sits + +* Commercial/OSS tools emphasize **prioritization signals** and **partial reachability** (select languages, heuristics, runtime traces). +* What’s typically missing: **deterministic graphs** per artifact, **stable IDs** for nodes/edges, and **signed, auditable proofs** for “why” a vuln is reachable (or not). + +--- + +## Stella Ops: how to lead (moats → outcomes) + +1. **Deterministic call‑graphs per artifact** + + * Build language/binary call‑graphs with **stable node IDs** (e.g., `purl` + symbol signature + build‑ID) and **edge determinism** (repeatable from the same inputs). +2. **Signed edges with DSSE / in‑toto** + + * Each edge `(caller → callee)` becomes a tiny attestation: who computed it, inputs (hashes), tool version, and policy context. + * Publish summaries to **Rekor** (or a mirror) to get a **tamper‑evident audit log**. +3. **Explainability by default** + + * Every finding carries its **“why path”** (edge chain), **VEX gate decisions**, and (for containers) **layer‑diff cause** (which layer introduced the symbol/file). +4. **VEX with proof** + + * When you say “not affected,” attach the **exact proof trail** (non‑reachable edge chain, eliminated by policy, or sandboxed at runtime), not just a boolean. +5. **Outcome targets (internal, program goals)** + + * **≥40% fewer noisy vulns** presented to engineers. + * **≥25% faster triage** thanks to inspectable “why” paths and signed evidence. + +--- + +## Minimal architecture (clean and pragmatic) + +* **Sbomer/Scanner**: produce SBOM + symbol maps + per‑layer file indexes; embed **build‑IDs** (ELF/PE/Mach‑O) and **purls**. +* **Cartographer**: construct **deterministic call‑graphs** (per language/binary) → emit **EdgeList.jsonl** with stable IDs. +* **Attestor**: wrap edges into **DSSE in‑toto** attestations; batch‑sign; push digests to **Rekor**. +* **Vexer**: evaluate policies (trust lattice) to yield **VEX** with linked edge‑proofs. +* **Ledger**: retain proofs; sync to Rekor mirror for sovereignty/offline. + +--- + +## Practical specs to hand to engineers (short version) + +* **Stable IDs** + + * Node: `purl@version!build-id!symbol-signature` (lang: fqdn; binary: `objfile:offset+size`, demangled name if available). + * Edge: `SHA256(nodeA||nodeB||tool-version||inputs-hash)`. +* **Determinism** + + * Pin parsers, symbolizers, compilers, and analysis flags. + * Emit **manifest.lock** (feeds, rules, tool hashes) to replay any scan identically. 
+* **Signing**
+
+  * DSSE envelope with: tool version, inputs (file hashes), graph slice, time, signer.
+  * Store full attestations in object store; **Rekor** receives the envelope digest + inclusion proof.
+* **Explainability payload** (attach to each finding)
+
+  * `call_chain[]`, per‑edge evidence (file, line, symbol, layer), VEX gate decisions, and counter‑evidence for “not affected”.
+* **Container layer attribution**
+
+  * Track file provenance to the exact layer; show “introduced in layer X from base Y”.
+* **APIs**
+
+  * `POST /graph/edges:attest` (idempotent; same inputs → same edge IDs).
+  * `GET /findings/:id/proof` returns the call‑chain + Rekor inclusion proofs.
+  * `GET /vex/:artifact` streams VEX with embedded proofs.
+
+---
+
+## Quick wins (do these first)
+
+* Add **Build‑ID capture** to your scanners for ELF/PE/Mach‑O; normalize to nodes with stable IDs.
+* Ship a **Graph Determinism Manifest** (hashes of inputs + toolchain) per build.
+* Start **edge attestations** for 1–2 ecosystems (e.g., PHP/JS) and **container layer provenance**.
+* Integrate **Rekor** logging (digest‑only is fine to start); keep full DSSE envelopes in your bucket.
+* Turn on **explainable VEX**: every verdict must have a **machine‑readable why‑path**.
+
+---
+
+## How this helps day‑to‑day
+
+* Security and developers see **only reachable, explainable vulns** with a **clickable proof path**.
+* Auditors get **cryptographic evidence** (DSSE + Rekor inclusion) that triage decisions weren’t hand‑waved.
+* Ops can trace “which layer introduced this risk” and fix the **real** source quickly.
+
+The remainder of this advisory is a **detailed engineering spec** for Stella Ops’ reachability and explainability system, with **explicit “why this matters” notes** under each major group that a documentation writer can translate directly into customer‑facing language. The spec can be copied into an internal design doc; the “Why” paragraphs can be kept verbatim or adapted into more marketing‑friendly copy.
+
+---
+
+# 0. Scope & Goals
+
+This spec covers:
+
+1. **Deterministic call‑graph construction** per artifact
+2. **Stable node/edge identities**
+3. **DSSE / in‑toto edge attestations** and **Rekor logging**
+4. **VEX engine** that uses those graphs as proof
+5. **Proof & explainability APIs** (call‑chains, layer attribution, etc.)
+6. **Determinism manifest & replay** for audits
+7. **Container layer provenance** integration
+
+Non‑goals (for this document):
+
+* UI/UX design
+* NVD/OSV ingestion pipeline details
+* AuthN/AuthZ and tenancy model (assume existing platform patterns)
+
+We’ll use RFC‑style language:
+
+* **MUST** / **MUST NOT**
+* **SHOULD** / **SHOULD NOT**
+* **MAY**
+
+---
+
+# 1. Core Concepts (Global)
+
+These terms are referenced throughout the spec.
+
+### 1.1 Artifact
+
+* **Definition**: A concrete software unit that can be scanned and deployed.
+
+  * Examples: container image, VM image, binary, library, zip, lambda bundle.
+* **Identity**:
+
+  * `artifact_id` (string, globally unique)
+  * `artifact_hash` (sha256 of canonical bytes)
+  * `artifact_kind` (`"container" | "binary" | "library" | "archive" | "source_bundle"`)
+
+**Why this matters (docs hook):**
+Clients need to see evidence tied to *exact* software units (e.g., “this image:tag”).
Stable artifact identities make proofs auditable and prevent confusion between versions. + +--- + +### 1.2 Node, Edge, Call Graph + +* **Node**: A callable unit (function, method, exported symbol, entrypoint) in an artifact. +* **Edge**: A possible call from one node to another in that artifact. +* **Call Graph**: The directed graph **G = (V, E)** of nodes V and edges E. + +We distinguish: + +* **Static call‑graph**: Derived from code/binaries (no runtime traces). +* **Augmented call‑graph**: Static graph plus optional runtime evidence (future extension; not required by this spec but must be accommodated by the schema). + +**Why this matters (docs hook):** +Call‑graphs are the backbone of reachability. Without them, tools can only guess whether a vulnerable function is used. Graphs make that decision explainable and repeatable. + +--- + +### 1.3 Determinism + +**Deterministic analysis**: For the same inputs (artifact bytes + config + tool versions), the system MUST produce: + +* The same set of nodes and edges, +* The same IDs for nodes and edges, +* The same graph hash / revision ID, +* The same attestations (payload-wise). + +Determinism is enforced via: + +* Strict canonicalization rules (sorting, formatting), +* Fixed analysis options locked per version, +* Explicit recording of all inputs in a **determinism manifest**. + +**Why this matters (docs hook):** +Determinism turns vulnerability triage from a “black box guess” into a reproducible math problem. An auditor can rerun analysis and verify the same results and proofs. + +--- + +### 1.4 Attestations & Ledger + +* **Attestation**: A signed statement about artifacts, edges, or findings. +* **DSSE envelope**: The signing wrapper. +* **in‑toto Statement**: Typed payload inside the envelope. +* **Rekor**: Transparency log where attestation digests are recorded. + +**Why this matters (docs hook):** +Signed attestations create a **chain of custody** for analysis results. Rekor logging means tampering is detectable and clients can verify that Stella Ops didn’t silently change history. + +--- + +### 1.5 VEX & Proofs + +* **VEX**: Vulnerability Exploitability eXchange statement per vulnerability–artifact pair. + + * Key fields: `vulnerability_id`, `status`, `justification`, `proof_ref`. +* **Proof**: A structured explanation referencing call‑graph edges and attestations that show *why* a vulnerability is or is not exploitable. + +**Why this matters (docs hook):** +A VEX without proof is a “trust me” statement. A VEX with proof is a **verifiable, cryptographically linked explanation** that customers and auditors can inspect. + +--- + +# 2. Artifact Identity & Metadata Service + +This service normalizes artifact identity and provides a foundation for graph and attestation linkage. + +## 2.1 Functional Requirements + +**FR‑A1: Artifact Registration** + +* The system MUST expose `POST /v1/artifacts` to register an artifact. +* Request (simplified): + +```json +{ + "kind": "container", + "coordinates": { + "image_ref": "gcr.io/acme/api@sha256:abc123...", + "tag": "v1.2.3" + }, + "hash": { + "algo": "sha256", + "value": "..." 
+ }, + "build_metadata": { + "build_id": "build-2025-01-01-1234", + "ci_run_id": "ci-abc-123", + "timestamp": "2025-01-01T12:34:56Z" + }, + "sbom_ref": { + "format": "spdx-json", + "location": "s3://.../api-v1.2.3.spdx.json" + } +} +``` + +* Response: + +```json +{ + "artifact_id": "art-01HXXXXX...", + "artifact_hash": "sha256:...", + "kind": "container" +} +``` + +**FR‑A2: Artifact Uniqueness** + +* `artifact_id` MUST be stable for an `(artifact_kind, artifact_hash)` pair. +* Registering the same `(kind, hash)` again MUST return the existing `artifact_id`. + +**FR‑A3: Build‑ID & Binary Identity Capture** + +For binaries within artifacts: + +* The scanner MUST extract: + + * `build_id` or equivalent (ELF `.note.gnu.build-id`, PE debug GUID, Mach‑O UUID), + * File path in the artifact, + * File hash. +* For each binary, the service MUST assign a `binary_id`: + +```json +{ + "binary_id": "bin-01HYYY...", + "artifact_id": "art-01HXXXXX...", + "path": "/usr/local/bin/server", + "file_hash": "sha256:...", + "build_id": "0x1234abcd..." +} +``` + +**FR‑A4: SBOM Integration** + +* The service MUST support linking SBOM components (e.g., purl) to binaries and files: + + * `component_id` ⇔ `binary_id` and/or file. + +**Why this matters (docs hook):** +Having a clean `artifact_id` and `binary_id` abstraction allows Stella Ops to say “this edge comes from *this* binary in *this* image” with zero ambiguity. It’s the basis for trustable, cross‑tool correlations (SBOM ↔ graph ↔ VEX). + +--- + +# 3. Node & Edge Identity (Call‑Graph Model) + +This is the core of determinism and explainability. + +## 3.1 Node Identity + +**FR‑N1: Node Canonical Tuple** + +Each node MUST be defined by a canonical tuple: + +* `artifact_id` +* `binary_id` (nullable for source‑level nodes) +* `language` (`"php" | "js" | "java" | "go" | "c" | "cpp" | "dotnet" | ...`) +* `symbol_kind` (`"function" | "method" | "constructor" | "lambda" | "entrypoint" | "indirect_stub"`) +* `symbol_signature`: + + * Language‑specific, fully qualified form + * Examples: + + * PHP: `\Acme\Service\UserService::findByEmail(string): User` + * Java: `com.acme.UserService.findByEmail(Ljava/lang/String;)Lcom/acme/User;` + * C: `acme_user_find_by_email(const char *email)` +* `source_location` OR `binary_offset`: + + * `source_location`: + + * `file_path` (normalized), + * `line` (int), + * `column` (optional). + * `binary_offset`: + + * `section` (e.g., `.text`), + * `offset` (hex), + * `size` (bytes). + +**FR‑N2: NodeID Derivation** + +* A `node_id` MUST be derived as: + +```text +canonical_string = JSONCanonicalize({ + "artifact_id": "...", + "binary_id": "...", + "language": "...", + "symbol_kind": "...", + "symbol_signature": "...", + "source_location" | "binary_offset": ... +}) + +node_id = "node-" + hex(sha256(canonical_string)) +``` + +* `JSONCanonicalize`: + + * MUST be deterministic (sorted keys, no extraneous whitespace, fixed number formats). + * Exact algorithm MUST be documented and versioned. + +**FR‑N3: Stability** + +* For a given canonical tuple, the same `node_id` MUST be produced across runs, machines, and time. + +**Why this matters (docs hook):** +Stable node IDs allow Stella Ops to “bookmark” a function across scans, policies, and attestations. It means a client or auditor can see the exact same identifier referenced in SBOMs, VEX, and evidence—even years later. 
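+
+A minimal sketch of the FR‑N2 derivation, assuming Python's `json.dumps` with sorted keys and compact separators as a stand‑in for the versioned `JSONCanonicalize` (a production implementation would pin an exact scheme such as RFC 8785 JCS):
+
+```python
+import hashlib
+import json
+
+def json_canonicalize(obj) -> bytes:
+    # Stand-in: sorted keys, no extra whitespace; the real algorithm is versioned.
+    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode()
+
+def derive_node_id(artifact_id, binary_id, language, symbol_kind, symbol_signature, location) -> str:
+    canonical = json_canonicalize({
+        "artifact_id": artifact_id,
+        "binary_id": binary_id,
+        "language": language,
+        "symbol_kind": symbol_kind,
+        "symbol_signature": symbol_signature,
+        "source_location": location,  # or "binary_offset" for binary-level nodes
+    })
+    return "node-" + hashlib.sha256(canonical).hexdigest()
+
+# FR-N3: the same canonical tuple yields the same node_id on any machine, any day.
+loc = {"file_path": "/app/src/UserService.php", "line": 42, "column": 3}
+sig = "\\Acme\\Service\\UserService::findByEmail(string): User"
+assert derive_node_id("art-x", None, "php", "method", sig, loc) == \
+       derive_node_id("art-x", None, "php", "method", sig, loc)
+```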
+ +--- + +## 3.2 Edge Identity + +**FR‑E1: Edge Canonical Tuple** + +Each edge MUST be defined by: + +* `caller_node_id` +* `callee_node_id` +* `edge_kind` (`"static" | "virtual" | "interface" | "dynamic" | "indirect"`) +* `evidence`: + + * `origin` (`"static_analysis" | "runtime_trace"`), + * `tool_version`, + * `analysis_profile` (e.g., optimization level, language mode), + * `confidence` (0–1 decimal), + * optional `metadata` (language‑specific details). + +**FR‑E2: EdgeID Derivation** + +* `edge_id` MUST be derived as: + +```text +canonical_edge_string = JSONCanonicalize({ + "caller": "node-...", + "callee": "node-...", + "edge_kind": "...", + "analysis_tool_version": "...", + "analysis_profile": "..." +}) + +edge_id = "edge-" + hex(sha256(canonical_edge_string)) +``` + +**FR‑E3: Edge Determinism** + +* For identical inputs (artifact, analysis config, tool version) the same set of `(edge_id, fields)` MUST be produced. +* Sorting rules: + + * All edges MUST be emitted in lexicographic order of `edge_id`. + * This ordering MUST be used when computing graph hashes and signing batches. + +**Why this matters (docs hook):** +Stable edges make “reachable” vs “not reachable” mathematically testable. If a client re‑runs analysis, they can verify they get the same edges and therefore the same reachability verdicts—building confidence in the tool’s fairness and repeatability. + +--- + +## 3.3 Graph Revision & Manifest + +**FR‑G1: Graph Revision ID** + +For an artifact’s graph: + +```text +graph_canonical = JSONCanonicalize({ + "nodes": [sorted by node_id], + "edges": [sorted by edge_id] +}) + +graph_revision_id = "graph-" + hex(sha256(graph_canonical)) +``` + +**FR‑G2: Determinism Manifest** + +The system MUST generate a **determinism manifest** per graph: + +```json +{ + "graph_revision_id": "graph-...", + "artifact_id": "art-...", + "inputs": { + "artifact_hash": "sha256:...", + "sbom_hash": "sha256:...", + "scanner_version": "1.4.0", + "cartographer_version": "2.1.3", + "config": { + "languages": ["php", "js"], + "analysis_profile": "default" + } + }, + "toolchain_hashes": { + "php_parser": "sha256:...", + "js_parser": "sha256:..." + }, + "timestamp": "2025-01-01T12:34:56Z" +} +``` + +**Why this matters (docs hook):** +The manifest is the “recipe card” for the call‑graph. If a regulator or customer wants to re‑bake the same result, they use this recipe and confirm they get the same graph hash. + +--- + +# 4. Cartographer (Call‑Graph Builder) Service + +## 4.1 Responsibilities + +* Parse artifacts and binaries. +* Extract functions/symbols as nodes. +* Generate edges with deterministic algorithms. +* Produce `EdgeList` and `NodeList` structures plus graph revision and determinism manifest. 
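+
+A sketch of the FR‑E2 and FR‑G1 derivations Cartographer is responsible for, reusing the same canonicalization stand‑in as the node example in §3.1 (illustrative, not the pinned algorithm):
+
+```python
+import hashlib
+import json
+
+def canon(obj) -> bytes:
+    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
+
+def derive_edge_id(caller_node_id, callee_node_id, edge_kind, tool_version, profile) -> str:
+    return "edge-" + hashlib.sha256(canon({
+        "caller": caller_node_id,
+        "callee": callee_node_id,
+        "edge_kind": edge_kind,
+        "analysis_tool_version": tool_version,
+        "analysis_profile": profile,
+    })).hexdigest()
+
+def derive_graph_revision_id(nodes: list, edges: list) -> str:
+    # FR-E3 / FR-G1: sort by ID before hashing so input order cannot change the hash.
+    return "graph-" + hashlib.sha256(canon({
+        "nodes": sorted(nodes, key=lambda n: n["node_id"]),
+        "edges": sorted(edges, key=lambda e: e["edge_id"]),
+    })).hexdigest()
+```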
+ +## 4.2 Inputs & Outputs + +**FR‑C1: API to Trigger Analysis** + +`POST /v1/graphs:analyze` + +Request: + +```json +{ + "artifact_id": "art-01HXXXXX...", + "analysis_profile": "default", + "languages": ["php", "js", "c"], + "options": { + "include_runtime_traces": false + } +} +``` + +Response: + +```json +{ + "graph_id": "graph-job-01ABC...", + "artifact_id": "art-...", + "status": "queued" +} +``` + +**FR‑C2: Graph Retrieval** + +`GET /v1/graphs/{graph_id}` + +Response: + +```json +{ + "graph_revision_id": "graph-...", + "artifact_id": "art-...", + "nodes": [ + { + "node_id": "node-...", + "artifact_id": "art-...", + "binary_id": "bin-...", + "language": "php", + "symbol_kind": "function", + "symbol_signature": "\\Acme\\UserService::findByEmail(string): User", + "source_location": { + "file_path": "/app/src/UserService.php", + "line": 42, + "column": 3 + } + } + // ... + ], + "edges": [ + { + "edge_id": "edge-...", + "caller_node_id": "node-...", + "callee_node_id": "node-...", + "edge_kind": "static", + "evidence": { + "origin": "static_analysis", + "tool_version": "2.1.3", + "analysis_profile": "default", + "confidence": 1.0 + } + } + // ... + ], + "determinism_manifest": { ... } +} +``` + +**FR‑C3: Language Coverage** + +* Cartographer MUST support at least: + + * PHP, JavaScript, TypeScript, Java, Go, C/C++, .NET, Python. +* Each language implementation MUST: + + * Use pinned parser versions, + * Use deterministic traversal order (e.g., AST preorder + sorted children), + * Emit nodes/edges consistent with the canonical schema. + +**FR‑C4: Error Handling** + +* If analysis fails, `status` MUST be one of: `"failed_parse"`, `"failed_analysis"`, `"unsupported_language"`, with a machine‑readable error code and human‑readable message. + +**Why this matters (docs hook):** +Cartographer is where Stella Ops’ “magic” becomes structured data. The deterministic API and schemas mean customers can export, inspect, and even combine these graphs with their own tooling. + +--- + +# 5. Edge Attestations (Attestor Service) + +The Attestor turns graph edges into cryptographically verifiable DSSE/in‑toto attestations and logs them to Rekor. + +## 5.1 Attestation Content + +**FR‑T1: in‑toto Statement Format** + +Each attestation payload MUST conform to in‑toto Statement: + +```json +{ + "_type": "https://in-toto.io/Statement/v0.1", + "subject": [ + { + "name": "artifact:art-01HXXXXX...", + "digest": { + "sha256": "..." + } + } + ], + "predicateType": "https://stella.dev/attestations/callgraph-edges/v1", + "predicate": { + "graph_revision_id": "graph-...", + "edges": [ + { + "edge_id": "edge-...", + "caller_node_id": "node-...", + "callee_node_id": "node-...", + "edge_kind": "static", + "evidence": { + "origin": "static_analysis", + "tool_version": "2.1.3", + "analysis_profile": "default", + "confidence": 1.0 + } + } + ], + "analysis_metadata": { + "cartographer_version": "2.1.3", + "options": { + "languages": ["php", "js"], + "analysis_profile": "default" + }, + "timestamp": "2025-01-01T12:34:56Z" + } + } +} +``` + +**FR‑T2: DSSE Envelope** + +The Statement MUST be wrapped in a DSSE envelope: + +```json +{ + "payloadType": "application/vnd.stella.callgraph-edges.v1+json", + "payload": "base64url()", + "signatures": [ + { + "keyid": "k-01KEY...", + "sig": "base64url(signature-bytes)" + } + ] +} +``` + +**FR‑T3: Signing Requirements** + +* Signatures MUST use Ed25519 (or another modern scheme, but the algorithm MUST be recorded). 
+* Private keys MUST be stored in KMS/HSM (not in application memory except ephemeral signing use). +* Key metadata: + + * `keyid`, + * `algorithm`, + * `valid_from`, `valid_until`. +* Keys MUST be rotated regularly; new keys MUST NOT invalidate old attestations (verification MUST support multiple keys). + +**FR‑T4: Batching Strategy** + +* Attestor MAY batch up to N edges per attestation (configurable, default 10,000) for performance. +* Each attestation MUST: + + * Only contain edges from a single `graph_revision_id`. + * Preserve deterministic ordering of `edges` by `edge_id`. + +**FR‑T5: Rekor Logging** + +* For each DSSE envelope: + + * Compute digest of the envelope. + * Submit to Rekor (kind: `dsse` or `in-toto`). + * Persist in internal Ledger: + + * `rekor_log_index`, + * `rekor_uuid`, + * `rekor_inclusion_proof`. + +**Why this matters (docs hook):** +Attestations let Stella Ops prove that the call‑graph wasn’t tampered with after analysis. Clients can independently verify signatures and Rekor proofs to ensure that evidence is authentic and unmodified. + +--- + +## 5.2 Attestor API + +**FR‑T6: Trigger Attestation** + +`POST /v1/attestations/graphs/{graph_revision_id}` + +Request: + +```json +{ + "artifact_id": "art-01HXXXXX...", + "batch_size": 10000 +} +``` + +Response: + +```json +{ + "attestation_batch_id": "att-batch-01...", + "graph_revision_id": "graph-...", + "status": "in_progress" +} +``` + +**FR‑T7: Attestation Status & Retrieval** + +`GET /v1/attestations/batches/{attestation_batch_id}` + +Response: + +```json +{ + "attestation_batch_id": "att-batch-01...", + "graph_revision_id": "graph-...", + "status": "completed", + "attestations": [ + { + "attestation_id": "att-01...", + "dsse_envelope_ref": "s3://.../att-01.json", + "rekor_log_index": 12345, + "rekor_uuid": "..." + } + ] +} +``` + +**Why this matters (docs hook):** +A clean attestation API allows clients and integrators to subscribe to evidence as a first‑class output (not just raw JSON). This is crucial for advanced audit and compliance workflows. + +--- + +# 6. Container Layer Provenance + +Reachability proofs should show *which container layer* introduced a function or vulnerability. + +## 6.1 Layer Indexing + +**FR‑L1: Layer Metadata** + +For container artifacts: + +* The scanner MUST extract the image’s layer list and config. +* For each layer: + +```json +{ + "layer_digest": "sha256:layer1...", + "index": 0, + "created_by": "FROM ubuntu:22.04", + "size_bytes": 123456 +} +``` + +**FR‑L2: File Provenance** + +* Build a file tree mapping `file_path` → `origin_layer_digest`. +* For each file, also track: + + * `file_hash`, + * `mode`, `uid`, `gid`. + +**FR‑L3: Linking Nodes to Layers** + +* When creating nodes, if `source_location.file_path` is inside a container layer: + + * Resolve the path to its `origin_layer_digest`. +* Each node MUST have: + +```json +"container_origin": { + "layer_digest": "sha256:layerX...", + "path_in_layer": "/app/src/UserService.php" +} +``` + +**Why this matters (docs hook):** +Layer provenance lets Stella Ops say “this issue comes from the base image” or “this function was added by your team’s Dockerfile step.” This makes it clear where to fix the problem and who owns it. + +--- + +# 7. VEX & Policy Engine + +The VEX engine converts vulnerabilities and call‑graphs into **status + proof** decisions. 
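+
+Before the detailed requirements, a compressed sketch of the flow §§7.1–7.3 specify. The adjacency form (`callers_of`: node_id → list of caller node_ids) and the `coverage_sufficient` flag are illustrative assumptions, not part of the spec:
+
+```python
+from collections import deque
+
+def reverse_search(callers_of, sink, entrypoints, max_depth=50):
+    # FR-V5: bounded reverse BFS from the sink toward any entrypoint.
+    seen, queue = {sink}, deque([(sink, [sink])])
+    while queue:
+        node, path = queue.popleft()
+        if node in entrypoints:
+            return list(reversed(path))  # entrypoint -> ... -> sink
+        if len(path) > max_depth:
+            continue
+        for caller in callers_of.get(node, []):
+            if caller not in seen:
+                seen.add(caller)
+                queue.append((caller, path + [caller]))
+    return None
+
+def classify(callers_of, sinks, entrypoints, coverage_sufficient, max_depth=50):
+    # FR-V6: "unknown" is reserved for insufficient coverage, never guessed away.
+    paths = [p for p in (reverse_search(callers_of, s, entrypoints, max_depth) for s in sinks) if p]
+    if paths:
+        return "reachable", paths
+    return ("not_reachable", []) if coverage_sufficient else ("unknown", [])
+```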
+ +## 7.1 Internal Vulnerability Model + +**FR‑V1: Vulnerability Record** + +Internally, for each vulnerability: + +```json +{ + "vulnerability_id": "CVE-2025-12345", + "source": "OSV", + "affected_symbols": [ + { + "language": "php", + "symbol_signature": "\\Vendor\\Lib\\Foo::dangerous()", + "matching_rule": "exact" + } + ], + "severity": "HIGH", + "feeds": ["osv", "nvd"], + "metadata": { ... } +} +``` + +**FR‑V2: Component Mapping** + +* Use SBOM + artifact metadata to map `vulnerability_id` to: + + * `artifact_id`, + * `component_id` (library/package), + * potential source/binary files. + +--- + +## 7.2 Reachability Evaluation + +**FR‑V3: Node Matching** + +For each `(artifact_id, vulnerability_id)`: + +1. Map `affected_symbols` to `node_id`s in the artifact’s graph using: + + * `language`, + * `symbol_signature`, + * optional fuzzy/heuristic matching rules. + +2. Record the set `Sinks = { node_id_1, node_id_2, ... }`. + +**FR‑V4: Entry Points** + +Define entry points set `Entrypoints`: + +* Defaults: + + * For containers: common `main`/server functions, binaries invoked by entrypoint/cmd. + * For libraries: exported funcs/classes. +* MUST be configurable via policy. + +**FR‑V5: Path Search** + +* For each `sink` in `Sinks`: + + * Run a bounded graph traversal (e.g., BFS or DFS) backwards from `sink` to see if any `entrypoint` ∈ `Entrypoints` is reachable. + * Limit depth and time by configuration, but record those limits in the proof. + +**FR‑V6: Reachability Result** + +For each `(artifact_id, vulnerability_id)`: + +* `reachable`: if any path `entrypoint → ... → sink` exists. +* `not_reachable`: if no path is found within bounds and analysis coverage is sufficient. +* `unknown`: if analysis bounds or gaps prevent a confident answer. + +**Why this matters (docs hook):** +This is the step that turns thousands of “you have a CVE” alerts into a small list of actually exploitable issues with a clear path, massively reducing noise for engineers. + +--- + +## 7.3 VEX Statement & Proof Linking + +**FR‑V7: Internal VEX Record** + +For each `(artifact_id, vulnerability_id)`: + +```json +{ + "vex_id": "vex-01...", + "artifact_id": "art-...", + "vulnerability_id": "CVE-2025-12345", + "status": "affected" | "not_affected" | "fixed" | "under_investigation", + "justification": "reachability_not_detected" | "component_not_present" | "vulnerable_code_not_in_execute_path" | ..., + "proof_ref": "proof-01...", + "generated_at": "2025-01-01T12:34:56Z", + "graph_revision_id": "graph-...", + "policy_version": "pol-2025.01" +} +``` + +**FR‑V8: Proof Bundles** + +`proof-01...` MUST refer to a **Proof Bundle**: + +```json +{ + "proof_id": "proof-01...", + "artifact_id": "art-...", + "vulnerability_id": "CVE-2025-12345", + "graph_revision_id": "graph-...", + "kind": "reachable" | "unreachable" | "unknown", + "paths": [ + { + "path_id": "path-01...", + "entrypoint_node_id": "node-...", + "sink_node_id": "node-...", + "edge_ids": [ + "edge-...", + "edge-..." + ], + "attestation_refs": [ + { + "edge_id": "edge-...", + "attestation_id": "att-01...", + "rekor_log_index": 12345 + } + ] + } + ], + "analysis_limits": { + "max_depth": 50, + "max_time_ms": 5000, + "coverage_notes": "no runtime traces included" + } +} +``` + +**Why this matters (docs hook):** +The VEX record gives the “headline” (affected or not), while the Proof Bundle is the “story behind the headline.” Clients can inspect exact call‑chains and even verify the underlying attestations if they choose. + +--- + +# 8. 
Proof & Explainability API + +This is what customers and UIs will query to display **“why is this vulnerability shown (or suppressed)?”** + +## 8.1 Findings API + +**FR‑P1: List Findings** + +`GET /v1/artifacts/{artifact_id}/findings` + +Response: + +```json +{ + "artifact_id": "art-...", + "findings": [ + { + "finding_id": "find-01...", + "vulnerability_id": "CVE-2025-12345", + "status": "affected", + "severity": "HIGH", + "vex_id": "vex-01...", + "proof_id": "proof-01...", + "component_id": "pkg:composer/acme/foo@1.2.3", + "summary": "Reachable from /app/index.php via Acme\\Controller::handle()." + } + ] +} +``` + +**FR‑P2: Proof Retrieval** + +`GET /v1/findings/{finding_id}/proof` + +Response: + +```json +{ + "finding_id": "find-01...", + "proof": { + "proof_id": "proof-01...", + "kind": "reachable", + "paths": [ + { + "path_id": "path-01...", + "entrypoint": { + "node_id": "node-...", + "symbol_signature": "\\App\\Controller::handle()", + "source_location": { + "file_path": "/app/src/Controller.php", + "line": 10 + }, + "container_origin": { + "layer_digest": "sha256:layer0...", + "index": 0 + } + }, + "sink": { + "node_id": "node-...", + "symbol_signature": "\\Vendor\\Lib\\Foo::dangerous()", + "source_location": { + "file_path": "/vendor/vendor/lib/Foo.php", + "line": 99 + }, + "container_origin": { + "layer_digest": "sha256:layer1...", + "index": 1 + } + }, + "edge_chain": [ + { + "edge_id": "edge-...", + "caller_node_id": "node-...", + "callee_node_id": "node-...", + "evidence_summary": { + "origin": "static_analysis", + "confidence": 1.0 + }, + "attestation_ref": { + "attestation_id": "att-01...", + "rekor_log_index": 12345 + } + } + ] + } + ], + "analysis_limits": { ... } + } +} +``` + +**FR‑P3: Negative Proofs** + +For `status = "not_affected"`, `kind` MUST be `"unreachable"` and `paths` MAY be omitted. Instead, include: + +```json +"unreachable_reason": { + "no_paths_found": true, + "graph_coverage": "full", + "notes": "All entrypoints explored up to depth 50; no path to sink." +} +``` + +**Why this matters (docs hook):** +This API lets Stella Ops show *exactly why* something is prioritized or suppressed. It’s the difference between “we think this is safe” and “here’s the graph that proves why it’s safe.” + +--- + +# 9. Determinism & Replay Guarantees + +## 9.1 Re‑analysis + +**FR‑D1: Replay Endpoint** + +`POST /v1/graphs/{graph_revision_id}:replay` + +* MUST re‑run Cartographer with the same determinism manifest. +* MUST produce: + + * Same `graph_revision_id`, + * Same `nodes` / `edges` (including IDs), + * Same attestations payloads (signatures will differ if keys rotated, but payload digests MUST match). + +**FR‑D2: Audit Report** + +`GET /v1/graphs/{graph_revision_id}/audit-report` + +* MUST include: + + * Original determinism manifest, + * Replay comparison (equal / not equal), + * Differences if any (for internal troubleshooting). + +**Why this matters (docs hook):** +Replay is key for regulated customers: they can prove that Stella Ops would make the same decision again given the same inputs, supporting audit trails and incident investigations. + +--- + +# 10. Non‑Functional Requirements + +## 10.1 Performance & Scale + +**FR‑NF1: Edge Volume** + +* System MUST handle at least: + + * 10^6 edges per artifact, + * 10^8 edges per day (multi‑tenant). + +**FR‑NF2: Latency Targets** + +* For medium artifacts (≤100k edges): + + * Call‑graph build: p95 ≤ 60s. + * Attestation batching + Rekor: p95 ≤ 120s (background). 
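+
+A back‑of‑envelope check that these targets compose with the FR‑T4 batching default (10,000 edges per attestation): even at the FR‑NF1 daily ceiling, Rekor submission volume stays modest.
+
+```python
+edges_per_day = 10**8                  # FR-NF1 platform-wide ceiling
+edges_per_attestation = 10_000         # FR-T4 default batch size
+envelopes_per_day = edges_per_day // edges_per_attestation
+print(envelopes_per_day)               # 10,000 DSSE envelopes/day to Rekor
+print(envelopes_per_day / 86_400)      # ~0.12 submissions/second on average
+```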
+
+**FR‑NF3: Storage**
+
+* Graphs, nodes, edges, and attestations MUST be stored in a way that:
+
+  * Allows efficient lookup by:
+
+    * `artifact_id`,
+    * `vulnerability_id`,
+    * `node_id` / `edge_id`,
+    * `graph_revision_id`.
+  * Supports retention policies (e.g., keep graphs for N years).
+
+## 10.2 Security
+
+**FR‑NF4: Data Integrity**
+
+* All evidence blobs (graphs, attestations, manifests) MUST be hashed and checksummed.
+* Any corruption MUST be detectable via hash mismatch.
+
+**FR‑NF5: Tenant Isolation**
+
+* Multi‑tenant stores MUST enforce separation such that:
+
+  * Evidence from tenant A cannot be read by tenant B.
+  * Rekor remains public, but tenant‑specific metadata stays private.
+
+**Why this matters (docs hook):**
+Non‑functional guarantees are what make the system usable in real life: fast enough to run in CI, safe enough to satisfy security teams, and scalable enough to handle entire fleets.
+
+---
+
+# 11. Documentation Hooks Summary (for your writer)
+
+Here’s a quick mapping from technical feature → story you can tell clients:
+
+1. **Artifact & Binary IDs**
+
+   > “Every finding is anchored to a precise artifact and binary, so you always know *exactly* what code we’re talking about.”
+
+2. **Deterministic node/edge IDs & graph hash**
+
+   > “If you or your auditor re‑run the analysis, you’ll get the same graph and the same decisions, byte for byte.”
+
+3. **DSSE + in‑toto attestations, Rekor logging**
+
+   > “We sign and transparently log our analysis so nobody—including us—can silently change the evidence later.”
+
+4. **Container layer provenance**
+
+   > “We can tell you whether a risk comes from your base image, your Dockerfile, or a specific library update.”
+
+5. **VEX with proof bundles**
+
+   > “For every ‘not affected’ verdict, we include a proof that shows how we know the vulnerable code is unreachable.”
+
+6. **Proof & explainability API**
+
+   > “You can always drill down from a red vulnerability indicator to the exact call‑chain that justifies it.”
+
+7. **Replay & determinism manifest**
+
+   > “Your compliance team can rerun our analysis later and confirm we’d make the same call, which is critical for audits and incident reviews.”
+
+---
+
+The remainder of this advisory packages the spec in two forms:
+
+1. An internal, numbered **RFC-style engineering spec** (suitable for `docs/rfcs/RFC-XXXX-deterministic-reachability.md`).
+2. A concise **customer-facing explainer** that documentation/marketing can turn into website copy, whitepapers, or slideware.
+
+Names (IDs, paths, versions) can be adapted to internal conventions.
+
+---
+
+## Part 1 — Internal RFC: Deterministic Reachability & Proof System
+
+**RFC ID:** RFC-2025-DRP
+**Title:** Deterministic Reachability & Proof System for Stella Ops
+**Status:** Draft
+**Owners:** Platform / Scanner / Graph Guilds
+**Target Release:** Q2–Q3 2026
+**Affected Components:** Scanner, Sbomer, Cartographer, Attestor, Vexer, Ledger, APIs
+
+---
+
+### 0. Abstract
+
+This RFC defines the deterministic reachability and proof system for Stella Ops.
It standardizes: + +* Deterministic call-graph construction per artifact +* Stable identities for nodes, edges, and graphs +* DSSE / in-toto attestations for call-graph edges logged to Rekor +* A VEX and policy engine that uses graphs as cryptographic proof +* Explainability APIs that expose call-chains, layer provenance, and analysis limits +* Determinism manifests and replay capabilities for audits + +The objective is to move from heuristic, non-repeatable vulnerability prioritization to a cryptographically verifiable, replayable reachability model. + +--- + +### 1. Motivation + +Current vulnerability scanners and “reachability” features in the market: + +* Rely on heuristics or partial coverage for call-graphs +* Are typically non-deterministic (different runs = different results) +* Provide limited or opaque explainability (“we think this is reachable”) +* Do not provide cryptographic guarantees of integrity (no DSSE/in-toto chain) +* Do not systematically link reachability to container layer provenance + +Stella Ops aims to: + +1. **Reduce noise** (unreachable CVEs) with strong, repeatable evidence. +2. **Provide proof** (signed attestations, Rekor logs) that decisions have not been tampered with. +3. **Support audits** via determinism manifests and replay. +4. **Clarify responsibility** (base image vs application image layers) via layer provenance. + +This RFC describes the engineering design to achieve these goals. + +--- + +### 2. Goals and Non-Goals + +#### 2.1 Goals + +* G-1: Deterministic call-graphs per artifact with stable node, edge, and graph IDs. +* G-2: DSSE / in-toto attestations over call-graph edges, with digests logged to Rekor. +* G-3: VEX decisions driven by graph reachability, not only presence of vulnerable components. +* G-4: Explainable findings: each vulnerability has an attached proof bundle (paths, edges, attestations, layer provenance). +* G-5: Determinism manifest and replay endpoints for audit and forensic workflows. +* G-6: Integration with existing Stella Ops components (Scanner, Sbomer, Ledger, etc.) with minimal disruption. + +#### 2.2 Non-Goals + +* N-1: UI/UX design details (dashboards, widgets, styling) are outside scope. +* N-2: Feed ingestion for vulnerability data (NVD, OSV, vendor advisories) is treated as given. +* N-3: AuthN/AuthZ and multi-tenant boundaries are assumed to follow the existing platform. +* N-4: Runtime trace ingestion is considered a later enhancement; this RFC prioritizes static call-graphs. + +--- + +### 3. Terminology + +* **Artifact** – Scannable unit: container image, VM image, binary, library, archive, or source bundle. +* **Binary** – Executable or library file within an artifact (ELF, PE, Mach-O, etc.). +* **Node** – Callable unit (function, method, entrypoint, exported symbol) within an artifact. +* **Edge** – Possible call from one node to another. +* **Call-graph** – Directed graph G = (V, E) of nodes V and edges E. +* **Graph Revision** – Canonical hash of the node and edge sets for an artifact. +* **Determinism Manifest** – Structured record capturing all inputs and versions needed to reproduce a graph. +* **Attestation** – DSSE-wrapped in-toto statement describing edges for a given graph revision. +* **Rekor** – Transparency log where DSSE envelope digests are recorded. +* **VEX** – Vulnerability Exploitability eXchange statement for a given vulnerability–artifact pair. +* **Proof Bundle** – Structured, machine-readable explanation of how a VEX verdict was obtained (paths, limits, attestations, etc.). 
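+
+For orientation, the core terms above rendered as types (a sketch; field names follow the data model in §5, trimmed to essentials):
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass(frozen=True)
+class Node:
+    node_id: str
+    artifact_id: str
+    language: str
+    symbol_kind: str              # function | method | constructor | lambda | entrypoint
+    symbol_signature: str
+    binary_id: str | None = None  # null for source-level nodes
+
+@dataclass(frozen=True)
+class Edge:
+    edge_id: str
+    caller_node_id: str
+    callee_node_id: str
+    edge_kind: str                # static | virtual | interface | dynamic | indirect
+
+@dataclass
+class CallGraph:
+    graph_revision_id: str
+    artifact_id: str
+    nodes: list[Node] = field(default_factory=list)
+    edges: list[Edge] = field(default_factory=list)
+```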
+
+---
+
+### 4. Architecture Overview
+
+Components and responsibilities:
+
+1. **Artifact Service**
+
+   * Normalizes artifacts, assigns `artifact_id`, manages binary identities and SBOM links.
+
+2. **Cartographer Service**
+
+   * Parses artifacts and binaries.
+   * Builds deterministic call-graphs (nodes + edges).
+   * Produces `graph_revision_id` and determinism manifest.
+
+3. **Attestor Service**
+
+   * Consumes graphs and emits DSSE / in-toto attestations over edges.
+   * Logs attestation digests to Rekor.
+   * Stores attestation metadata in Ledger.
+
+4. **VEX Engine**
+
+   * Maps vulnerabilities to nodes via SBOM + symbol mapping.
+   * Performs reachability analysis from entrypoints to vulnerable sinks.
+   * Emits VEX records and associated proof bundles.
+
+5. **Proof & Explainability API**
+
+   * Exposes findings, proofs, call-chains, and layer provenance.
+
+6. **Container Layer Provenance**
+
+   * Maps files and nodes to container layers and build steps.
+
+7. **Ledger**
+
+   * Persists graphs, determinism manifests, attestations, and proofs.
+   * Supports replay and audit queries.
+
+---
+
+### 5. Data Model
+
+#### 5.1 Artifact and Binary Identity
+
+* `artifact_id`
+
+  * Globally unique ID for `(kind, artifact_hash)`.
+  * Idempotent: same `(kind, hash)` → same `artifact_id`.
+
+* `artifact` record (logical view):
+
+```json
+{
+  "artifact_id": "art-01HXXXXX...",
+  "kind": "container",
+  "artifact_hash": "sha256:...",
+  "coordinates": {
+    "image_ref": "gcr.io/acme/api@sha256:...",
+    "tag": "v1.2.3"
+  },
+  "build_metadata": {
+    "build_id": "build-2025-01-01-1234",
+    "ci_run_id": "ci-abc-123",
+    "timestamp": "2025-01-01T12:34:56Z"
+  },
+  "sbom_ref": {
+    "format": "spdx-json",
+    "location": "s3://.../api-v1.2.3.spdx.json",
+    "hash": "sha256:..."
+  }
+}
+```
+
+* `binary_id`
+
+  * Assigned for each binary discovered within an artifact.
+
+```json
+{
+  "binary_id": "bin-01HYYY...",
+  "artifact_id": "art-01HXXXXX...",
+  "path": "/usr/local/bin/server",
+  "file_hash": "sha256:...",
+  "build_id": "0x1234abcd..."
+}
+```
+
+**Rationale:** Stable IDs for artifacts and binaries enable precise binding between SBOM components, graph nodes, and VEX decisions, and are essential for cross-run comparability and auditability.
+
+---
+
+#### 5.2 Node Model
+
+Each node is defined by a canonical tuple:
+
+* `artifact_id`
+* `binary_id` (nullable for source-level nodes)
+* `language`
+* `symbol_kind` (function, method, constructor, lambda, entrypoint, etc.)
+* `symbol_signature` (language-specific, fully qualified)
+* `source_location` OR `binary_offset`
+* Optional `container_origin` (for containerized artifacts; see §6.4)
+
+Example:
+
+```json
+{
+  "node_id": "node-...",
+  "artifact_id": "art-...",
+  "binary_id": "bin-...",
+  "language": "php",
+  "symbol_kind": "method",
+  "symbol_signature": "\\Acme\\Service\\UserService::findByEmail(string): User",
+  "source_location": {
+    "file_path": "/app/src/Service/UserService.php",
+    "line": 42,
+    "column": 3
+  },
+  "container_origin": {
+    "layer_digest": "sha256:layer1...",
+    "path_in_layer": "/app/src/Service/UserService.php"
+  }
+}
+```
+
+**Node ID Derivation**
+
+```text
+canonical_node_string = JSONCanonicalize({
+  "artifact_id": "...",
+  "binary_id": "...",
+  "language": "...",
+  "symbol_kind": "...",
+  "symbol_signature": "...",
+  "location": { ...
} // source_location or binary_offset +}) + +node_id = "node-" + hex(sha256(canonical_node_string)) +``` + +* `JSONCanonicalize` MUST be deterministic: + + * Sorted keys + * Fixed number formats + * Normalized file paths and line/column representation + +**Rationale:** Deterministic node IDs allow persistent references to functions across scans, policies, and proofs. This is critical for long-term VEX stability and cross-tool integration. + +--- + +#### 5.3 Edge Model + +Canonical fields: + +* `edge_id` +* `caller_node_id` +* `callee_node_id` +* `edge_kind` – `"static" | "virtual" | "interface" | "dynamic" | "indirect"` +* `evidence`: + + * `origin` – `"static_analysis" | "runtime_trace"` (runtime is future extension) + * `tool_version` – Cartographer version + * `analysis_profile` – Profile name (e.g. `"default"`, `"aggressive"`) + * `confidence` – Float in [0,1] + * `metadata` – Language-specific info (dispatch tables, type information, etc.) + +**Edge ID Derivation** + +```text +canonical_edge_string = JSONCanonicalize({ + "caller": "node-...", + "callee": "node-...", + "edge_kind": "...", + "analysis_tool_version": "...", + "analysis_profile": "..." +}) + +edge_id = "edge-" + hex(sha256(canonical_edge_string)) +``` + +* Edges MUST be emitted sorted by `edge_id`. + +**Rationale:** Stable edges, combined with deterministic sorting, are necessary to ensure that graph hashes and attestations remain identical across re-runs with the same inputs. + +--- + +#### 5.4 Graph Revision & Determinism Manifest + +Graph canonical representation: + +```text +graph_canonical = JSONCanonicalize({ + "nodes": [ ... ] // sorted by node_id + "edges": [ ... ] // sorted by edge_id +}) + +graph_revision_id = "graph-" + hex(sha256(graph_canonical)) +``` + +Determinism manifest: + +```json +{ + "graph_revision_id": "graph-...", + "artifact_id": "art-...", + "inputs": { + "artifact_hash": "sha256:...", + "sbom_hash": "sha256:...", + "scanner_version": "1.4.0", + "cartographer_version": "2.1.3", + "config": { + "languages": ["php", "js"], + "analysis_profile": "default" + } + }, + "toolchain_hashes": { + "php_parser": "sha256:...", + "js_parser": "sha256:..." + }, + "timestamp": "2025-01-01T12:34:56Z" +} +``` + +**Rationale:** The manifest is the “recipe” to regenerate the same graph. It is required for replay, forensic analysis, and third-party audits. + +--- + +### 6. Service APIs + +#### 6.1 Artifact Service + +**POST `/v1/artifacts`** + +* Registers or retrieves an artifact. +* Idempotent on `(kind, artifact_hash)`. + +Request (simplified): + +```json +{ + "kind": "container", + "coordinates": { + "image_ref": "gcr.io/acme/api@sha256:...", + "tag": "v1.2.3" + }, + "hash": { + "algo": "sha256", + "value": "..." + }, + "build_metadata": { ... }, + "sbom_ref": { ... } +} +``` + +Response: + +```json +{ + "artifact_id": "art-01HXXXXX...", + "artifact_hash": "sha256:...", + "kind": "container" +} +``` + +**Binary registration** is normally performed automatically by Scanner/Cartographer, but logically lives under Artifact Service. 
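+
+A sketch of the FR‑A2 idempotency contract behind `POST /v1/artifacts`. The in‑memory index and the ID derivation are illustrative; the spec only fixes that the `(kind, hash)` → `artifact_id` mapping is stable:
+
+```python
+import hashlib
+
+_registry: dict = {}  # stand-in for a durable unique index on (kind, artifact_hash)
+
+def register_artifact(kind: str, artifact_hash: str) -> str:
+    key = (kind, artifact_hash)
+    if key not in _registry:
+        # Illustrative ID scheme; any scheme works if it is stable per (kind, hash).
+        digest = hashlib.sha256(f"{kind}:{artifact_hash}".encode()).hexdigest()
+        _registry[key] = "art-" + digest[:26].upper()
+    return _registry[key]
+
+# FR-A2: re-registering the same (kind, hash) returns the existing artifact_id.
+assert register_artifact("container", "sha256:abc") == register_artifact("container", "sha256:abc")
+```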
+ +--- + +#### 6.2 Cartographer (Call-Graph Builder) + +**POST `/v1/graphs:analyze`** + +Request: + +```json +{ + "artifact_id": "art-01HXXXXX...", + "analysis_profile": "default", + "languages": ["php", "js", "c"], + "options": { + "include_runtime_traces": false + } +} +``` + +Response (job): + +```json +{ + "graph_id": "graph-job-01ABC...", + "artifact_id": "art-...", + "status": "queued" +} +``` + +**GET `/v1/graphs/{graph_id}`** + +Response (normalized): + +```json +{ + "graph_revision_id": "graph-...", + "artifact_id": "art-...", + "nodes": [ ... ], + "edges": [ ... ], + "determinism_manifest": { ... } +} +``` + +Requirements: + +* Language coverage: PHP, JS/TS, Java, Go, C/C++, .NET, Python. +* Deterministic parsing and traversal (AST order, sorted children, pinned parser versions). +* Error modes: `failed_parse`, `failed_analysis`, `unsupported_language` with machine-readable codes. + +--- + +#### 6.3 Attestor Service + +**Payload format** + +in-toto Statement: + +```json +{ + "_type": "https://in-toto.io/Statement/v0.1", + "subject": [ + { + "name": "artifact:art-01HXXXXX...", + "digest": { "sha256": "..." } + } + ], + "predicateType": "https://stella.dev/attestations/callgraph-edges/v1", + "predicate": { + "graph_revision_id": "graph-...", + "edges": [ + { + "edge_id": "edge-...", + "caller_node_id": "node-...", + "callee_node_id": "node-...", + "edge_kind": "static", + "evidence": { + "origin": "static_analysis", + "tool_version": "2.1.3", + "analysis_profile": "default", + "confidence": 1.0 + } + } + ], + "analysis_metadata": { + "cartographer_version": "2.1.3", + "options": { + "languages": ["php", "js"], + "analysis_profile": "default" + }, + "timestamp": "2025-01-01T12:34:56Z" + } + } +} +``` + +Wrapped in DSSE envelope: + +```json +{ + "payloadType": "application/vnd.stella.callgraph-edges.v1+json", + "payload": "base64url()", + "signatures": [ + { + "keyid": "k-01KEY...", + "sig": "base64url(signature-bytes)" + } + ] +} +``` + +Signing: + +* Default: Ed25519 (others may be added; must be recorded). +* Private keys in KMS/HSM. +* Key rotation supported without invalidating old attestations (verification must handle multiple keys). + +Batching: + +* Up to N edges per attestation (default: 10,000). +* Single `graph_revision_id` per attestation. +* Edges within attestation sorted by `edge_id`. + +Rekor: + +* Submit DSSE envelope digest to Rekor (kind `dsse` / `in-toto`). +* Persist: + + * `attestation_id` + * `rekor_log_index` + * `rekor_uuid` + * `rekor_inclusion_proof` + +**POST `/v1/attestations/graphs/{graph_revision_id}`** + +Response: + +```json +{ + "attestation_batch_id": "att-batch-01...", + "graph_revision_id": "graph-...", + "status": "in_progress" +} +``` + +**GET `/v1/attestations/batches/{attestation_batch_id}`** + +Response: + +```json +{ + "attestation_batch_id": "att-batch-01...", + "graph_revision_id": "graph-...", + "status": "completed", + "attestations": [ + { + "attestation_id": "att-01...", + "dsse_envelope_ref": "s3://.../att-01.json", + "rekor_log_index": 12345, + "rekor_uuid": "..." + } + ] +} +``` + +--- + +#### 6.4 Container Layer Provenance + +Scanner MUST: + +* Parse image configuration and layer chain. +* Emit per-layer metadata: + +```json +{ + "layer_digest": "sha256:layer0...", + "index": 0, + "created_by": "FROM ubuntu:22.04", + "size_bytes": 123456 +} +``` + +File provenance: + +* Map `file_path` to `origin_layer_digest`, `file_hash`, and mode/ownership. 
+* When building nodes, resolve `source_location.file_path` to `origin_layer_digest` whenever possible. + +This yields: + +```json +"container_origin": { + "layer_digest": "sha256:layerX...", + "index": 1, + "path_in_layer": "/app/src/UserService.php" +} +``` + +--- + +#### 6.5 VEX Engine & Policy + +Vulnerability representation: + +```json +{ + "vulnerability_id": "CVE-2025-12345", + "source": "OSV", + "affected_symbols": [ + { + "language": "php", + "symbol_signature": "\\Vendor\\Lib\\Foo::dangerous()", + "matching_rule": "exact" + } + ], + "severity": "HIGH", + "feeds": ["osv", "nvd"], + "metadata": { } +} +``` + +Process for `(artifact_id, vulnerability_id)`: + +1. **Node Matching** + + * Map `affected_symbols` to `node_id`s via language + signature + heuristic matching as needed. + * Define `Sinks = { node_id_1, node_id_2, ... }`. + +2. **Entrypoint Resolution** + + * Default `Entrypoints` from: + + * Container entrypoint/cmd and their transitive calls. + * Exported/public API functions where appropriate. + * Allow policy override (e.g. custom entrypoint list per app). + +3. **Reachability Search** + + * Reverse traversal from each sink towards entrypoints. + * Bounded BFS/DFS (configurable `max_depth`, `max_time_ms`). + * Record explored nodes and edges for proof building. + +4. **Result Classification** + + * `reachable`: at least one `entrypoint → ... → sink` path found. + * `not_reachable`: no path within bounds, with sufficient graph coverage. + * `unknown`: bounds or missing data prevent a confident classification. + +VEX record: + +```json +{ + "vex_id": "vex-01...", + "artifact_id": "art-...", + "vulnerability_id": "CVE-2025-12345", + "status": "affected" | "not_affected" | "fixed" | "under_investigation", + "justification": "reachability_not_detected" | "component_not_present" | "vulnerable_code_not_in_execute_path" | "...", + "proof_ref": "proof-01...", + "generated_at": "2025-01-01T12:34:56Z", + "graph_revision_id": "graph-...", + "policy_version": "pol-2025.01" +} +``` + +--- + +#### 6.6 Proof Bundles & Explainability API + +Proof bundle: + +```json +{ + "proof_id": "proof-01...", + "artifact_id": "art-...", + "vulnerability_id": "CVE-2025-12345", + "graph_revision_id": "graph-...", + "kind": "reachable" | "unreachable" | "unknown", + "paths": [ + { + "path_id": "path-01...", + "entrypoint_node_id": "node-...", + "sink_node_id": "node-...", + "edge_ids": ["edge-1...", "edge-2..."], + "attestation_refs": [ + { + "edge_id": "edge-1...", + "attestation_id": "att-01...", + "rekor_log_index": 12345 + } + ] + } + ], + "analysis_limits": { + "max_depth": 50, + "max_time_ms": 5000, + "coverage_notes": "no runtime traces included" + }, + "unreachable_reason": { + "no_paths_found": true, + "graph_coverage": "full", + "notes": "All entrypoints explored up to depth 50; no path to sink." + } +} +``` + +API: + +* `GET /v1/artifacts/{artifact_id}/findings` – Lists findings and high-level summaries. +* `GET /v1/findings/{finding_id}/proof` – Returns the full proof bundle as above, including path details, node metadata (symbol signatures, source location, container layers), and attestation pointers. + +--- + +### 7. Determinism & Replay + +**Replay Endpoint** + +`POST /v1/graphs/{graph_revision_id}:replay` + +* Uses determinism manifest to re-run Cartographer. +* Expected outputs: + + * Same `graph_revision_id`. + * Identical sets of nodes and edges (IDs and canonical content). + * For attestations, payload digests MUST match (signatures may differ due to key rotation). 
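+
+A sketch of the check a verifier could run over these replay guarantees. The record shapes (`attestation_payloads` as raw payload bytes) are illustrative; the key point is comparing payload digests rather than signatures, per the key‑rotation caveat above:
+
+```python
+import hashlib
+
+def verify_replay(original: dict, replayed: dict) -> bool:
+    # Identical graph_revision_id implies identical canonical node/edge sets.
+    if original["graph_revision_id"] != replayed["graph_revision_id"]:
+        return False
+    # DSSE signatures may legitimately differ after key rotation, so compare
+    # digests of the decoded in-toto payload bytes instead of the envelopes.
+    def digests(payloads):
+        return {hashlib.sha256(p).hexdigest() for p in payloads}
+    return digests(original["attestation_payloads"]) == digests(replayed["attestation_payloads"])
+```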
+ +**Audit Report** + +`GET /v1/graphs/{graph_revision_id}/audit-report` + +* Includes: + + * Original determinism manifest. + * Replay manifest. + * Comparison result: `match` / `mismatch`. + * If mismatch: enumerated differences (nodes, edges, toolchain or config drift). + +--- + +### 8. Security and Compliance Considerations + +* All graphs, manifests, and attestations must be integrity-checked (hashes) before use. +* Key management for signing must follow internal KMS/HSM best practices. +* Attestations and proofs may contain code locations but must not leak customer secrets beyond configured scopes. +* Multi-tenant isolation rules apply to all stored artifacts, graphs, and proofs; Rekor is public by design, but tenant IDs and internal references must not leak into Rekor payloads. + +--- + +### 9. Performance & Scalability + +Baseline targets (initial): + +* Per artifact up to 10^6 edges. +* Platform capacity 10^8 edges/day across tenants. +* p95 for medium artifacts (≤100k edges): + + * Graph build ≤ 60s. + * Attestation batching + Rekor ≤ 120s (async). + +Storage: + +* Graph data indexed by `artifact_id`, `graph_revision_id`. +* Edges indexable by `edge_id` and `caller_node_id`/`callee_node_id`. +* Proof bundles indexable by `artifact_id`, `vulnerability_id`, and `vex_id`. + +--- + +### 10. Operational Concerns + +* Feature flags for: + + * Graph-based VEX decisions (on/off). + * Attestations + Rekor logging (on/off; some customers may not allow external logs). + * Determinism replay endpoints (may be limited to Enterprise tiers). + +* Monitoring: + + * Graph build failure rate, latency histograms. + * Attestation creation and Rekor submission errors. + * Replay mismatch rate. + +--- + +### 11. Alternatives Considered + +* Non-deterministic traversal with approximate edges – rejected due to poor auditability. +* Runtime-only reachability – rejected as primary mechanism due to incomplete coverage (not all paths exercised in tests). +* Storing unsigned graph data only – rejected because it does not meet integrity or non-repudiation goals. + +--- + +### 12. Rollout Plan (High Level) + +1. Phase 1 – Internal graphs only + + * Implement Cartographer determinism + manifest. + * No attestations; internal verification. + +2. Phase 2 – Attestations & Rekor + + * Integrate Attestor, DSSE, and Rekor logging. + * Build internal tools to inspect and verify attestations. + +3. Phase 3 – VEX with proof bundles + + * Switch VEX decisions for selected tenants to graph-based reachability. + * Expose Proof & Explainability API. + +4. Phase 4 – Replay & audit workflows + + * Make replay endpoints GA for Enterprise customers. + * Add reporting and export features. + +--- + +### 13. Changelog + +* **v0.1 (Draft)** – Initial RFC, definitions of node/edge identities, graph determinism, VEX and proof model. +* **v0.2 (Planned)** – Add runtime traces integration and trust lattice hooks. +* **v1.0 (Planned)** – Marked stable once at least two major tenants successfully use reachability-based VEX in production. + +--- + +## Part 2 — Customer-Facing Explainer (Draft Copy) + +You can give this to your documentation / marketing team to adapt into website, PDF, or slide decks. + +--- + +### Title: Deterministic Reachability & Cryptographic Proofs in Stella Ops + +Modern vulnerability scanners all suffer from the same problem: they report every known CVE in your code and containers, but they cannot reliably tell you which ones are actually exploitable in your environment. 
Security teams are left drowning in noise, and auditors have to trust whatever the tool claims without real evidence. + +Stella Ops takes a different approach. We build **deterministic call-graphs** for every artifact you scan and then attach **cryptographic proofs** to our reachability analysis and VEX decisions. + +--- + +### 1. From “You Have a CVE” to “Here Is the Reachable Path” + +A call-graph is a map of which functions can call which other functions inside your containers, binaries, and services. Using language-specific analysis pipelines, Stella Ops: + +1. Identifies every callable function or method (nodes). +2. Establishes possible calls between them (edges). +3. Computes whether a vulnerable function can actually be reached from any real entry point in your application. + +The result is a clear distinction between: + +* Vulnerabilities that are **reachable** and worth urgent remediation. +* Vulnerabilities that are **present but not reachable** under any execution path we can construct. +* Vulnerabilities that are **unknown** (e.g. where analysis limits or missing information prevent a confident verdict). + +This dramatically reduces noise for your developers and AppSec teams. + +--- + +### 2. Deterministic, Replayable Results + +Most tools behave like a black box: re-running analysis on the same artifact may yield slightly different results over time. That is not acceptable in regulated or high-assurance environments. + +Stella Ops ensures that: + +* For the same inputs (artifact, configuration, tool versions), we always produce the **same call-graph**. +* Every function and edge gets a **stable identifier**. +* The entire graph is summarized into a **graph hash** and a **determinism manifest** that records exactly which tools and configurations were used. + +If you or your auditor re-run the analysis in the future, you can confirm that: + +* You get the same graph hash. +* The same nodes and edges are present. +* Our decisions are **reproducible**, not heuristic. + +--- + +### 3. Cryptographic Evidence: DSSE, in-toto, and Rekor + +We do not expect you to just trust our analysis; we sign it. + +For each call-graph, Stella Ops creates **DSSE-wrapped in-toto attestations** that describe the edges (who calls whom) for that artifact. These attestations are: + +* Digitally signed using modern cryptography. +* Logged to a **transparency ledger (Rekor)** so that any tampering is publicly detectable. +* Stored in your Stella Ops Ledger so you can retrieve and verify them later. + +This gives you a verifiable chain of custody: + +1. From artifact and SBOM. +2. To call-graph construction. +3. To reachability analysis and VEX decisions. + +--- + +### 4. VEX With Real Proof, Not Just a Checkbox + +Stella Ops emits VEX statements for each vulnerability–artifact pair. Unlike most tools, every VEX decision is backed by a **proof bundle**: + +* For “affected” findings, we show the **exact call-chain** from a real entrypoint (e.g. `/app/index.php` or a REST controller) to the vulnerable function, along with references to the signed attestations that prove each edge. +* For “not affected” findings, we document that we explored all configured entrypoints to a defined depth and found **no path** to the vulnerable function. We also include explicit notes about analysis limits and coverage. + +In other words, every VEX verdict comes with a “why,” not just a “yes/no.” + +--- + +### 5. 
Understanding Where Vulnerabilities Come From + +Many issues originate in: + +* Base images +* Shared layers used by multiple teams +* Vendor-provided components + +Stella Ops tracks **container layer provenance** for each file and function. For any finding, we can show: + +* Which layer introduced the vulnerable function. +* Whether it came from a base image or from your own Dockerfile steps. +* The exact file and path in that layer. + +This makes it much easier to assign ownership (“platform team vs application team”) and to fix the problem at the right source (updating a base image vs patching local code). + +--- + +### 6. Audit-Ready: Replay and Forensics + +For regulated industries, it is not enough to know that a decision was made; you must be able to prove it was made correctly and could be made again in the same way. + +Stella Ops provides: + +* A **determinism manifest** per analysis run, listing all inputs and tool versions. +* A **replay capability**: re-run the analysis and verify that you obtain the same graph and the same vulnerability verdicts. +* **Exportable reports** that link vulnerabilities, VEX decisions, call-chains, and attestations. + +This supports internal audits, external regulators, and incident response activities where accurate historical reconstruction is essential. + +--- + +### 7. Practical Benefits for Your Teams + +By combining deterministic graphs, cryptographic attestations, and explainable VEX, Stella Ops: + +* Reduces the number of vulnerabilities engineers must look at by focusing on actually **reachable** issues. +* Gives security teams **hard evidence** they can bring to auditors and management. +* Helps operations and platform teams quickly see whether an issue belongs to a base image, a shared library, or a specific application. +* Makes vulnerability management more like an engineering discipline and less like guesswork. + +--- + +If you want, I can next reshape these into: + +* A shorter, 1-page executive summary, or +* A developer-facing “How Stella Ops Reachability Works” guide with diagrams and example JSON snippets for your docs site. diff --git a/docs/product-advisories/23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring.md b/docs/product-advisories/23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring.md new file mode 100644 index 000000000..37cc97876 --- /dev/null +++ b/docs/product-advisories/23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring.md @@ -0,0 +1,972 @@ +Here’s a compact, ready‑to‑run plan to benchmark how consistently different vulnerability scanners score the *same* SBOM/VEX—so we can quantify Stella Ops’ determinism advantage. + +# Why this matters (quickly) + +Scanners often disagree on CVSS and “severity after VEX.” Measuring variance under identical inputs lets us prove scoring stability (a Stella Ops moat: deterministic, replayable scans). + +# What we’ll measure + +* **Determinism rate**: % of runs yielding identical (hash‑equal) results per scanner. +* **CVSS delta σ**: standard deviation of (scanner_score − reference_score) across vulns. +* **Order‑invariance**: re-feed inputs in randomized orders; expect identical outputs. +* **VEX application stability**: variance before vs. after applying VEX justifications. +* **Drift vs. feeds**: pin feeds to content hashes; any change must be attributable. + +# Inputs (frozen & hashed) + +* 3–5 **SBOMs** (CycloneDX 1.6 + SPDX 3.0.1) from well‑known images (e.g., nginx, keycloak, alpine‑glibc, a Java app, a Node app). 
* Matching **VEX** docs (CycloneDX VEX) covering “not affected,” “affected,” and “fixed.”
* **Feeds bundle**: vendor DBs (NVD, GHSA, distro OVAL), all vendored and hashed.
* **Policy**: identical normalization rules (CVSS v3.1 only, prefer vendor over NVD, etc.).

# Scanners (example set)

* Anchore/Grype, Trivy, Snyk CLI, osv‑scanner, Dependency‑Track API (server mode), plus **Stella Ops Scanner**.

# Protocol (10 runs × 2 orders)

1. **Pin environment** (Docker images + air‑gapped tarballs). Record:

   * tool version, container digest, feed bundle SHA‑256, SBOM/VEX SHA‑256.
2. **Run matrix**: for each SBOM/VEX, per scanner:

   * 10 runs with canonical file order.
   * 10 runs with randomized SBOM component order + shuffled VEX statements.
3. **Capture** normalized JSON: `{purl, vuln_id, base_cvss, effective_severity, vex_applied, notes}`.
4. **Hash** each run’s full result (SHA‑256 over canonical JSON).
5. **Compute** metrics per scanner:

   * Determinism rate = identical_hash_runs / total_runs.
   * σ(CVSS delta) vs. reference (choose NVD base as reference, or Stella policy).
   * Order‑invariance failures (# of distinct hashes between canonical vs. shuffled).
   * VEX stability: σ before vs. after VEX; Δσ should shrink, not grow.

# Minimal harness (Python outline)

```python
# run_bench.py
# prerequisites: docker CLI, Python 3.10+, numpy, pandas
from pathlib import Path
import json, hashlib, random, subprocess
import numpy as np

WORK = str(Path.cwd())  # absolute host path for the volume mount; "$PWD" would not expand without a shell

SBOMS = ["sboms/nginx.cdx.json", "sboms/keycloak.spdx.json", ...]
VEXES = ["vex/nginx.vex.json", "vex/keycloak.vex.json", ...]
SCANNERS = {
    "grype": ["docker","run","--rm","-v",f"{WORK}:/w","grype:TAG","--input","/w/{sbom}","--vex","/w/{vex}","--output","json"],
    "trivy": ["docker","run","--rm","-v",f"{WORK}:/w","aquasec/trivy:TAG","sbom","/w/{sbom}","--vex","/w/{vex}","--format","json"],
    # add more…
    "stella": ["docker","run","--rm","-v",f"{WORK}:/w","stellaops/scanner:TAG","scan","--sbom","/w/{sbom}","--vex","/w/{vex}","--normalize","json"]
}
def canon(obj): return json.dumps(obj, sort_keys=True, separators=(",",":")).encode()
def shas(b): return hashlib.sha256(b).hexdigest()
def shuffle_file(src, dst):  # shuffle top-level component/VEX statement lists, preserving semantics
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    with open(src) as f:
        data = json.load(f)
    for k in ("components","vulnerabilities","vex","statements"):
        if isinstance(data, dict) and k in data and isinstance(data[k], list):
            random.shuffle(data[k])
    with open(dst, "w") as f:
        json.dump(data, f, indent=0, separators=(",",":"))
def run(cmd): return subprocess.check_output(cmd, text=True)

results=[]
for sbom, vex in zip(SBOMS, VEXES):
    for scanner, tmpl in SCANNERS.items():
        for mode in ("canonical","shuffled"):
            for i in range(10):
                sb, vx = sbom, vex
                if mode=="shuffled":
                    sb, vx = f"tmp/{Path(sbom).stem}.json", f"tmp/{Path(vex).stem}.json"
                    shuffle_file(sbom, sb); shuffle_file(vex, vx)
                out = run([c.format(sbom=sb, vex=vx) for c in tmpl])
                j = json.loads(out)
                # normalize to minimal tuple per finding (purl,id,base_cvss,effective)
                norm = [{"purl":x["purl"],"id":x["id"],"base":x.get("cvss","NA"),
                         "eff":x.get("effectiveSeverity","NA")} for x in j.get("findings",[])]
                blob = canon({"scanner":scanner,"sbom":sbom,"vex":vex,"findings":norm})
                results.append({
                    "scanner":scanner,"sbom":sbom,"mode":mode,"run":i,
                    "hash":shas(blob),"norm":norm
                })
# compute stats (pandas groupby): determinism %, std dev of (eff - ref) per (scanner,sbom)
```

# Pass/Fail gates (suggested)

* **Determinism ≥ 99%** across 20 runs per (scanner, SBOM).
* **Order‑invariance = 100%** identical hashes.
* **VEX stability**: σ_after ≤ σ_before (VEX reduces variance).
* **Provenance**: any change must correlate to a different feed bundle hash.

# Deliverables

* `bench/` with SBOMs, VEX, feeds bundle manifest (hashes).
* `run_bench.py` + `analyze.ipynb` (charts: determinism%, σ by scanner).
* One‑page **Stella Ops Differentiator**: “Provable Scoring Stability” with the above metrics and reproducibility recipe.

# Next step

If you want, I’ll generate the folder skeleton, example SBOM/VEX, and the analysis notebook stub so you can drop in your scanners and hit run.
Here’s a concrete, .NET‑friendly implementation plan you can actually build, not just admire in a doc.

I’ll assume:

* .NET 8 (or 6) SDK
* Windows or Linux dev machine with Docker installed
* You’re comfortable with basic C#, CLI, and JSON

---

## 1. Project structure

Create a simple solution with two projects:

```bash
dotnet new sln -n ScannerBench
cd ScannerBench

dotnet new console -n ScannerBench.Runner
dotnet new xunit -n ScannerBench.Tests

dotnet sln add ScannerBench.Runner/ScannerBench.Runner.csproj
dotnet sln add ScannerBench.Tests/ScannerBench.Tests.csproj
dotnet add ScannerBench.Tests reference ScannerBench.Runner
```

Inside `ScannerBench.Runner` create folders:

* `Inputs/` – SBOM & VEX files

  * `Inputs/Sboms/nginx.cdx.json`
  * `Inputs/Vex/nginx.vex.json`
  * (and a few more pairs)
* `Config/` – scanner config JSON or YAML later if you want
* `Results/` – captured run outputs (for debugging / manual inspection)

---

## 2. Define core domain models (C#)

In `ScannerBench.Runner` add a file `Models.cs`:

```csharp
using System.Collections.Generic;

namespace ScannerBench.Runner;

public sealed record ScannerConfig(
    string Name,
    string DockerImage,
    string[] CommandTemplate // tokens; use {sbom} and {vex} placeholders
);

public sealed record BenchInput(
    string Id,       // e.g. "nginx-cdx"
    string SbomPath,
    string VexPath
);

public sealed record NormalizedFinding(
    string Purl,
    string VulnerabilityId,  // CVE-2021‑1234, GHSA‑xxx, etc.
    string BaseCvss,         // normalized to string for simplicity
    string EffectiveSeverity // e.g. "LOW", "MEDIUM", "HIGH"
);

public sealed record ScanRun(
    string ScannerName,
    string InputId,
    int RunIndex,
    string Mode, // "canonical" | "shuffled"
    string ResultHash,
    IReadOnlyList<NormalizedFinding> Findings
);

public sealed record DeterminismStats(
    string ScannerName,
    string InputId,
    string Mode,
    int TotalRuns,
    int DistinctHashes
);

public sealed record CvssDeltaStats(
    string ScannerName,
    string InputId,
    double MeanDelta,
    double StdDevDelta
);
```

You can grow this later, but this is enough to get the first version working.

---

## 3. 
Hard‑code scanner configs (first pass) + +In `ScannerConfigs.cs`: + +```csharp +namespace ScannerBench.Runner; + +public static class ScannerConfigs +{ + public static readonly ScannerConfig[] All = + { + new( + Name: "grype", + DockerImage: "anchore/grype:v0.79.0", + CommandTemplate: new[] + { + "grype", + "--input", "/work/{sbom}", + "--output", "json" + // add flags like --vex when supported + } + ), + new( + Name: "trivy", + DockerImage: "aquasec/trivy:0.55.0", + CommandTemplate: new[] + { + "trivy", "sbom", "/work/{sbom}", + "--format", "json" + } + ), + new( + Name: "stella", + DockerImage: "stellaops/scanner:latest", + CommandTemplate: new[] + { + "scanner", "scan", + "--sbom", "/work/{sbom}", + "--vex", "/work/{vex}", + "--output-format", "json" + } + ) + }; +} +``` + +You can tweak command templates once you wire up actual tools. + +--- + +## 4. Input set (SBOM + VEX pairs) + +In `BenchInputs.cs`: + +```csharp +namespace ScannerBench.Runner; + +public static class BenchInputs +{ + public static readonly BenchInput[] All = + { + new("nginx-cdx", "Inputs/Sboms/nginx.cdx.json", "Inputs/Vex/nginx.vex.json"), + new("keycloak-spdx", "Inputs/Sboms/keycloak.spdx.json", "Inputs/Vex/keycloak.vex.json") + // add more as needed + }; +} +``` + +Populate `Inputs/Sboms` and `Inputs/Vex` manually or with a script (doesn’t need to be .NET). + +--- + +## 5. Utility: JSON shuffle to test order‑invariance + +You want to randomize component/vulnerability/VEX statement order to confirm that scanners don’t change results based on input ordering. + +Create `JsonShuffler.cs`: + +```csharp +using System; +using System.Collections.Generic; +using System.IO; +using System.Linq; +using System.Text.Json; +using System.Text.Json.Nodes; + +namespace ScannerBench.Runner; + +public static class JsonShuffler +{ + private static readonly string[] ListKeysToShuffle = + { + "components", + "vulnerabilities", + "statements", + "vex" + }; + + public static string CreateShuffledCopy(string sourcePath, string tmpDir) + { + Directory.CreateDirectory(tmpDir); + + var jsonText = File.ReadAllText(sourcePath); + var node = JsonNode.Parse(jsonText); + if (node is null) + throw new InvalidOperationException($"Could not parse JSON: {sourcePath}"); + + ShuffleLists(node); + + var fileName = Path.GetFileName(sourcePath); + var destPath = Path.Combine(tmpDir, fileName); + File.WriteAllText(destPath, node.ToJsonString(new JsonSerializerOptions + { + WriteIndented = false + })); + return destPath; + } + + private static void ShuffleLists(JsonNode node) + { + if (node is JsonObject obj) + { + foreach (var kvp in obj.ToList()) + { + if (kvp.Value is JsonArray arr && ListKeysToShuffle.Contains(kvp.Key)) + { + ShuffleInPlace(arr); + } + else if (kvp.Value is not null) + { + ShuffleLists(kvp.Value); + } + } + } + else if (node is JsonArray arr) + { + foreach (var child in arr) + { + if (child is not null) + ShuffleLists(child); + } + } + } + + private static void ShuffleInPlace(JsonArray arr) + { + var rnd = new Random(); + var list = arr.ToList(); + arr.Clear(); + foreach (var item in list.OrderBy(_ => rnd.Next())) + { + arr.Add(item); + } + } +} +``` + +--- + +## 6. Utility: run Dockerized scanner from C# + +Create `DockerRunner.cs`: + +```csharp +using System; +using System.Diagnostics; +using System.IO; +using System.Linq; +using System.Text; + +namespace ScannerBench.Runner; + +public static class DockerRunner +{ + public static string RunScanner( + ScannerConfig scanner, + string sbomPath, + string? 
vexPath,
        string workDir)
    {
        // Build the container command (inside container)
        var innerCmdTokens = scanner.CommandTemplate
            .Select(t => t.Replace("{sbom}", sbomPath.Replace("\\", "/"))
                          .Replace("{vex}", vexPath?.Replace("\\", "/") ?? ""))
            .ToArray();

        // We run: docker run --rm -v <workDir>:/work <image> <args...>
        var dockerArgs = new StringBuilder();
        dockerArgs.Append("run --rm ");
        dockerArgs.Append($"-v \"{workDir}:/work\" ");
        dockerArgs.Append(scanner.DockerImage);
        dockerArgs.Append(' ');
        dockerArgs.Append(string.Join(' ', innerCmdTokens.Select(Escape)));

        return RunProcess("docker", dockerArgs.ToString());
    }

    private static string RunProcess(string fileName, string arguments)
    {
        var psi = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using var process = Process.Start(psi)
            ?? throw new InvalidOperationException("Failed to start process");
        var stdout = process.StandardOutput.ReadToEnd();
        var stderr = process.StandardError.ReadToEnd();
        process.WaitForExit();

        if (process.ExitCode != 0)
        {
            throw new InvalidOperationException(
                $"Process failed ({fileName} {arguments}): {stderr}");
        }

        return stdout;
    }

    private static string Escape(string arg)
    {
        if (string.IsNullOrEmpty(arg)) return "\"\"";
        if (arg.Contains(' ') || arg.Contains('"'))
        {
            return "\"" + arg.Replace("\"", "\\\"") + "\"";
        }
        return arg;
    }
}
```

Notes:

* `workDir` will be your project directory (so `/work/Inputs/...` inside the container).
* For simplicity, I’m not handling Windows vs Linux nuances heavily; adjust path escaping if needed on Windows.

---

## 7. Utility: Normalize scanner JSON output

Different scanners have different JSON; you just need a **mapping** from each scanner to the `NormalizedFinding` shape.

Create `Normalizer.cs`:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Nodes;

namespace ScannerBench.Runner;

public static class Normalizer
{
    public static IReadOnlyList<NormalizedFinding> Normalize(
        string scannerName,
        string rawJson)
    {
        var node = JsonNode.Parse(rawJson)
            ?? throw new InvalidOperationException("Cannot parse scanner JSON");

        return scannerName switch
        {
            "grype" => NormalizeGrype(node),
            "trivy" => NormalizeTrivy(node),
            "stella" => NormalizeStella(node),
            _ => throw new NotSupportedException($"Unknown scanner: {scannerName}")
        };
    }

    private static IReadOnlyList<NormalizedFinding> NormalizeGrype(JsonNode root)
    {
        // Adjust based on actual Grype JSON
        var findings = new List<NormalizedFinding>();
        var matches = root["matches"] as JsonArray;
        if (matches is null) return findings;

        foreach (var m in matches)
        {
            if (m is null) continue;
            var artifact = m["artifact"];
            var vuln = m["vulnerability"];

            var purl = artifact?["purl"]?.ToString() ?? "";
            var id = vuln?["id"]?.ToString() ?? "";
            var cvss = vuln?["cvss"]?[0]?["metrics"]?["baseScore"]?.ToString() ?? "NA";
            var severity = vuln?["severity"]?.ToString() ?? "UNKNOWN";

            findings.Add(new NormalizedFinding(
                Purl: purl,
                VulnerabilityId: id,
                BaseCvss: cvss,
                EffectiveSeverity: severity.ToUpperInvariant()
            ));
        }

        return findings;
    }

    private static IReadOnlyList<NormalizedFinding> NormalizeTrivy(JsonNode root)
    {
        var list = new List<NormalizedFinding>();

        var results = root["Results"] as JsonArray;
        if (results is null) return list;

        foreach (var r in results)
        {
            var vulnerabilities = r?["Vulnerabilities"] as JsonArray;
            if (vulnerabilities is null) continue;

            foreach (var v in vulnerabilities)
            {
                if (v is null) continue;
                var pkgName = v["PkgName"]?.ToString() ?? "";
                var purl = v["Purl"]?.ToString() ?? pkgName;
                var id = v["VulnerabilityID"]?.ToString() ?? "";
                var cvss = v["CVSS"]?["nvd"]?["V3Score"]?.ToString()
                           ?? v["CVSS"]?["nvd"]?["V2Score"]?.ToString()
                           ?? "NA";
                var severity = v["Severity"]?.ToString() ?? "UNKNOWN";

                list.Add(new NormalizedFinding(
                    Purl: purl,
                    VulnerabilityId: id,
                    BaseCvss: cvss,
                    EffectiveSeverity: severity.ToUpperInvariant()
                ));
            }
        }

        return list;
    }

    private static IReadOnlyList<NormalizedFinding> NormalizeStella(JsonNode root)
    {
        // Adjust to match Stella Ops output schema
        var list = new List<NormalizedFinding>();
        var findings = root["findings"] as JsonArray;
        if (findings is null) return list;

        foreach (var f in findings)
        {
            if (f is null) continue;
            var purl = f["purl"]?.ToString() ?? "";
            var id = f["id"]?.ToString() ?? "";
            var cvss = f["baseCvss"]?.ToString()
                       ?? f["cvss"]?.ToString()
                       ?? "NA";
            var severity = f["effectiveSeverity"]?.ToString()
                           ?? f["severity"]?.ToString()
                           ?? "UNKNOWN";

            list.Add(new NormalizedFinding(
                Purl: purl,
                VulnerabilityId: id,
                BaseCvss: cvss,
                EffectiveSeverity: severity.ToUpperInvariant()
            ));
        }

        return list;
    }
}
```

You’ll need to tweak the JSON paths once you inspect real outputs, but the pattern is clear.

---

## 8. Utility: Hashing & canonicalization

Create `Hashing.cs`:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

namespace ScannerBench.Runner;

public static class Hashing
{
    public static string ComputeResultHash(
        string scannerName,
        string inputId,
        IReadOnlyList<NormalizedFinding> findings)
    {
        // Ensure deterministic ordering before hashing
        var ordered = findings
            .OrderBy(f => f.Purl)
            .ThenBy(f => f.VulnerabilityId)
            .ToList();

        var payload = new
        {
            scanner = scannerName,
            input = inputId,
            findings = ordered
        };

        var json = JsonSerializer.Serialize(payload,
            new JsonSerializerOptions
            {
                PropertyNamingPolicy = JsonNamingPolicy.CamelCase
            });

        using var sha = SHA256.Create();
        var bytes = Encoding.UTF8.GetBytes(json);
        var hashBytes = sha.ComputeHash(bytes);
        return ConvertToHex(hashBytes);
    }

    private static string ConvertToHex(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length * 2);
        foreach (var b in bytes)
            sb.Append(b.ToString("x2"));
        return sb.ToString();
    }
}
```

---

## 9. Metric computation (determinism & CVSS deltas)

Create `StatsCalculator.cs`:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

namespace ScannerBench.Runner;

public static class StatsCalculator
{
    public static DeterminismStats ComputeDeterminism(
        string scannerName,
        string inputId,
        string mode,
        IReadOnlyList<ScanRun> runs)
    {
        var hashes = runs.Select(r => r.ResultHash).Distinct().Count();
        return new DeterminismStats(
            ScannerName: scannerName,
            InputId: inputId,
            Mode: mode,
            TotalRuns: runs.Count,
            DistinctHashes: hashes
        );
    }

    public static CvssDeltaStats ComputeCvssDeltas(
        string scannerName,
        string inputId,
        IReadOnlyList<ScanRun> scannerRuns,
        IReadOnlyList<ScanRun> referenceRuns)
    {
        // Use the *first* run of each as baseline (assuming deterministic)
        var scannerFindings = scannerRuns.First().Findings;
        var refFindings = referenceRuns.First().Findings;

        // Map by (purl,id)
        var refMap = refFindings.ToDictionary(
            f => (f.Purl, f.VulnerabilityId),
            f => ParseCvss(f.BaseCvss)
        );

        var deltas = new List<double>();

        foreach (var f in scannerFindings)
        {
            if (!refMap.TryGetValue((f.Purl, f.VulnerabilityId), out var refScore))
                continue;

            var score = ParseCvss(f.BaseCvss);
            if (double.IsNaN(score) || double.IsNaN(refScore))
                continue;

            deltas.Add(score - refScore);
        }

        if (deltas.Count == 0)
        {
            return new CvssDeltaStats(scannerName, inputId, double.NaN, double.NaN);
        }

        var mean = deltas.Average();
        var variance = deltas.Sum(d => Math.Pow(d - mean, 2)) / deltas.Count;
        var stdDev = Math.Sqrt(variance);

        return new CvssDeltaStats(scannerName, inputId, mean, stdDev);
    }

    private static double ParseCvss(string value)
    {
        if (double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out var v))
            return v;
        return double.NaN;
    }
}
```

Pick your “reference” scanner (e.g., NVD‑aligned policy or Stella) when you call this method.

---

## 10. Main runner: orchestrate everything

In `Program.cs`:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using ScannerBench.Runner;

class Program
{
    static void Main(string[] args)
    {
        var projectRoot = GetProjectRoot();
        var tmpDir = Path.Combine(projectRoot, "Tmp");
        Directory.CreateDirectory(tmpDir);

        const int runsPerMode = 10;

        var allRuns = new List<ScanRun>();

        foreach (var input in BenchInputs.All)
        {
            Console.WriteLine($"=== Input: {input.Id} ===");

            foreach (var scanner in ScannerConfigs.All)
            {
                Console.WriteLine($"  Scanner: {scanner.Name}");

                // Canonical runs
                var canonicalRuns = RunMultiple(
                    scanner, input, projectRoot, tmpDir,
                    mode: "canonical", runsPerMode);

                // Shuffled runs
                var shuffledRuns = RunMultiple(
                    scanner, input, projectRoot, tmpDir,
                    mode: "shuffled", runsPerMode);

                allRuns.AddRange(canonicalRuns);
                allRuns.AddRange(shuffledRuns);

                // Determinism stats
                var canonStats = StatsCalculator.ComputeDeterminism(
                    scanner.Name, input.Id, "canonical", canonicalRuns);
                var shuffleStats = StatsCalculator.ComputeDeterminism(
                    scanner.Name, input.Id, "shuffled", shuffledRuns);

                Console.WriteLine($"    Canonical: {canonStats.DistinctHashes}/{canonStats.TotalRuns} distinct hashes");
                Console.WriteLine($"    Shuffled:  {shuffleStats.DistinctHashes}/{shuffleStats.TotalRuns} distinct hashes");
            }
        }

        // Example: compute CVSS deltas vs Stella
        var stellaByInput = allRuns.Where(r => r.ScannerName == "stella")
                                   .GroupBy(r => r.InputId)
                                   .ToDictionary(g => g.Key, g => g.ToList());

        foreach (var scanner in ScannerConfigs.All.Where(s => s.Name != "stella"))
        {
            foreach (var input in BenchInputs.All)
            {
                var scannerRuns = allRuns
                    .Where(r => r.ScannerName == scanner.Name &&
                                r.InputId == input.Id &&
                                r.Mode == "canonical")
                    .ToList();

                if (scannerRuns.Count == 0 || !stellaByInput.TryGetValue(input.Id, out var stellaRuns))
                    continue;

                var stats = StatsCalculator.ComputeCvssDeltas(
                    scanner.Name,
                    input.Id,
                    scannerRuns,
                    stellaRuns.Where(r => r.Mode == "canonical").ToList());

                Console.WriteLine(
                    $"CVSS delta vs Stella [{scanner.Name}, {input.Id}]: mean={stats.MeanDelta:F2}, stddev={stats.StdDevDelta:F2}");
            }
        }

        Console.WriteLine("Done.");
    }

    private static List<ScanRun> RunMultiple(
        ScannerConfig scanner,
        BenchInput input,
        string projectRoot,
        string tmpDir,
        string mode,
        int runsPerMode)
    {
        var list = new List<ScanRun>();
        var inputSbomFull = Path.Combine(projectRoot, input.SbomPath);
        var inputVexFull = Path.Combine(projectRoot, input.VexPath);

        for (int i = 0; i < runsPerMode; i++)
        {
            string sbomPath;
            string vexPath;

            if (mode == "canonical")
            {
                sbomPath = input.SbomPath; // path relative to /work
                vexPath = input.VexPath;
            }
            else
            {
                sbomPath = Path.GetRelativePath(
                    projectRoot,
                    JsonShuffler.CreateShuffledCopy(inputSbomFull, tmpDir));

                vexPath = Path.GetRelativePath(
                    projectRoot,
                    JsonShuffler.CreateShuffledCopy(inputVexFull, tmpDir));
            }

            var rawJson = DockerRunner.RunScanner(
                scanner,
                sbomPath,
                vexPath,
                projectRoot);

            var findings = Normalizer.Normalize(scanner.Name, rawJson);
            var hash = Hashing.ComputeResultHash(scanner.Name, input.Id, findings);

            list.Add(new ScanRun(
                ScannerName: scanner.Name,
                InputId: input.Id,
                RunIndex: i,
                Mode: mode,
                ResultHash: hash,
                Findings: findings));

            Console.WriteLine($"    {mode} run {i + 1}/{runsPerMode}: hash={hash[..8]}...");
        }

        return list;
    }
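
    // Optional helper (an addition to this walkthrough, not required by it):
    // persist the collected runs to Results/all-runs.json so they can be
    // inspected or plotted later (see "How you'll use this in practice").
    private static void SaveRuns(string projectRoot, IReadOnlyList<ScanRun> runs)
    {
        var resultsDir = Path.Combine(projectRoot, "Results");
        Directory.CreateDirectory(resultsDir);
        var json = System.Text.Json.JsonSerializer.Serialize(
            runs,
            new System.Text.Json.JsonSerializerOptions { WriteIndented = true });
        File.WriteAllText(Path.Combine(resultsDir, "all-runs.json"), json);
    }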

    private static string GetProjectRoot()
    {
        // When running from bin/Debug/..., walk up until we find the project
        // file (which sits next to Inputs/); fall back to the current directory.
        var dir = new DirectoryInfo(Directory.GetCurrentDirectory());
        while (dir is not null && dir.GetFiles("*.csproj").Length == 0)
        {
            dir = dir.Parent;
        }
        return dir?.FullName ?? Directory.GetCurrentDirectory();
    }
}
```

This is intentionally straightforward: run all scanners × inputs × modes, gather runs, print determinism stats and CVSS deltas vs Stella.

---

## 11. Add a couple of automated tests (xUnit)

In `ScannerBench.Tests`, create `StatsTests.cs`:

```csharp
using System.Collections.Generic;
using ScannerBench.Runner;
using Xunit;

public class StatsTests
{
    [Fact]
    public void Determinism_Is_One_When_All_Hashes_Equal()
    {
        var runs = new List<ScanRun>
        {
            new("s", "i", 0, "canonical", "aaa", new List<NormalizedFinding>()),
            new("s", "i", 1, "canonical", "aaa", new List<NormalizedFinding>()),
        };

        var stats = StatsCalculator.ComputeDeterminism("s", "i", "canonical", runs);
        Assert.Equal(1, stats.DistinctHashes);
        Assert.Equal(2, stats.TotalRuns);
    }

    [Fact]
    public void CvssDelta_Computes_Mean_And_StdDev()
    {
        var refRuns = new List<ScanRun>
        {
            new("ref", "i", 0, "canonical", "h1", new List<NormalizedFinding>
            {
                new("pkg1","CVE-1","5.0","HIGH"),
                new("pkg2","CVE-2","7.0","HIGH")
            })
        };

        var scannerRuns = new List<ScanRun>
        {
            new("scan", "i", 0, "canonical", "h2", new List<NormalizedFinding>
            {
                new("pkg1","CVE-1","6.0","HIGH"), // +1
                new("pkg2","CVE-2","8.0","HIGH") // +1
            })
        };

        var stats = StatsCalculator.ComputeCvssDeltas("scan", "i", scannerRuns, refRuns);
        Assert.Equal(1.0, stats.MeanDelta, 3);
        Assert.Equal(0.0, stats.StdDevDelta, 3);
    }
}
```

Run tests:

```bash
dotnet test
```

---

## 12. How you’ll use this in practice

1. **Drop SBOM & VEX files** into `Inputs/Sboms` and `Inputs/Vex`.

2. **Install Docker** and make sure CLI works.

3. Pull scanner images (optional but nice):

   ```bash
   docker pull anchore/grype:v0.79.0
   docker pull aquasec/trivy:0.55.0
   docker pull stellaops/scanner:latest
   ```

4. `cd ScannerBench.Runner` and run:

   ```bash
   dotnet run
   ```

5. Inspect console output:

   * For each scanner & SBOM:

     * Determinism: `distinct hashes / total runs` (expect 1 / N).
     * Order‑invariance: compare canonical vs shuffled determinism.
     * CVSS deltas vs Stella: look at standard deviation (lower = more aligned).

6. Optional: serialize `allRuns` and metrics to `Results/*.json` and plot them in whatever you like.

---

If you’d like, next step I can help you:

* tighten the JSON normalization against real scanner outputs, or
* add a small HTML/Blazor or minimal API endpoint that renders the stats as a web dashboard instead of console output.
diff --git a/docs/product-advisories/23-Nov-2025 - Publishing a Reachability Benchmark Dataset.md b/docs/product-advisories/23-Nov-2025 - Publishing a Reachability Benchmark Dataset.md
new file mode 100644
index 000000000..04f01783e
--- /dev/null
+++ b/docs/product-advisories/23-Nov-2025 - Publishing a Reachability Benchmark Dataset.md
@@ -0,0 +1,1031 @@
+Here’s a crisp plan to **publish a small, public “vulnerable binaries” dataset** (PHP, JS, C#) and a way to **compare reachability results** across tools—so you can ship something useful fast, gather feedback, and iterate.

---

# Scope (MVP)

* **Languages:** PHP (composer), JavaScript (npm), C# (.NET).
* **Artifacts per sample:**

  1) minimal app, 2) lockfile, 3) SBOM (CycloneDX JSON), 4) VEX (OSV/CycloneDX VEX), 5) ground‑truth reachability notes, 6) scriptable repro (Docker).
* **Size:** 3–5 samples per language (9–15 total). Keep each sample ≤200 LOC. 

---

# Repo layout

```
vuln-reach-dataset/
  LICENSE
  README.md
  schema/
    ground-truth.schema.json
    run-matrix.schema.json
  runners/
    run_all.sh
    run_all.ps1
  results/
    <tool>/<lang>/<sample>/run.json
  samples/
    php/...
    js/...
    csharp/...
```

---

# “Ground truth” format (minimal)

```json
{
  "sample_id": "php-001-phar-deserialize",
  "lang": "php",
  "package_manager": "composer",
  "vuln_ids": ["CVE-2019-XXXX","OSV:GHSA-..."],
  "entrypoints": ["public/index.php"],
  "reachable_symbols": [
    {"purl":"pkg:composer/vendor/package@1.2.3","symbol":"Vendor\\Unsafe::unserialize"},
    {"purl":"pkg:composer/monolog/monolog@2.9.0","symbol":"Monolog\\Logger::pushHandler","note":"benign"}
  ],
  "evidence": [
    {"type":"path","file":"public/index.php","line":18,"desc":"tainted input -> unserialize"},
    {"type":"exec","cmd":"curl 'http://localhost/?p=O:...'", "result":"triggered sink"}
  ]
}
```

---

# Samples to include (suggested)

## PHP (composer)

1. **php-001-phar-deserialize**

   * Risk: unsafe `unserialize()` on user input; optional PHAR gadget.
   * Ground truth: reachable sink `unserialize`.
2. **php-002-xxe-simplexml**

   * Risk: XML external entity in `simplexml_load_string` with libxml options off.
   * Ground truth: reachable XXE sink.
3. **php-003-ssrf-guzzle**

   * Risk: user‑controlled URL into Guzzle client.
   * Ground truth: SSRF call chain to `Client::request`.

## JavaScript (npm)

1. **js-001-prototype-pollution**

   * Risk: `lodash.merge` (known vulns historically) with user object.
   * Ground truth: polluted `{__proto__}` path reaches object creation site.
2. **js-002-yaml-unsafe-load**

   * Risk: `js-yaml` `load` on untrusted text.
   * Ground truth: call to `load` reachable from HTTP route.
3. **js-003-ssrf-node-fetch**

   * Risk: user URL to `node-fetch`.
   * Ground truth: request issued to attacker-controlled host.

## C# (.NET)

1. **cs-001-binaryformatter-deserialize**

   * Risk: `BinaryFormatter.Deserialize` on user input (legacy).
   * Ground truth: reachable call to `Deserialize`.
2. **cs-002-processstartinfo-injection**

   * Risk: `Process.Start` with unsanitized arg (Windows/Linux).
   * Ground truth: taint to `Process.Start`.
3. **cs-003-xmlreader-xxe**

   * Risk: insecure `XmlReader` settings (DtdProcessing = Parse).
   * Ground truth: external entity resolved.

Each sample should:

* Pin a known vulnerable version in lockfile.
* Provide a **positive** (reachable) and **negative** (not reachable) path.
* Include a tiny HTTP entrypoint to exercise the path.

---

# SBOM & VEX per sample

* **CycloneDX 1.6 JSON** SBOM produced via native tool (composer, npm, dotnet) + converter.
* **VEX**: one document stating the vulnerability is **affected** and **exploitable** for the positive path; **not_affected** for the negative path with justification (e.g., “vulnerable code not invoked”).
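
To make the negative path concrete, a minimal CycloneDX analysis block could look like the sketch below (the vulnerability ID and purl are placeholders; `code_not_reachable` is one of CycloneDX’s standard justification values):

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "vulnerabilities": [
    {
      "id": "OSV:PLACEHOLDER-EXAMPLE",
      "affects": [{ "ref": "pkg:npm/example@1.0.0" }],
      "analysis": {
        "state": "not_affected",
        "justification": "code_not_reachable",
        "detail": "The vulnerable function is never invoked from the sample's guarded (negative) path."
      }
    }
  ]
}
```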
+ +--- + +# Runner & result format (tool-agnostic) + +* Runners call each selected tool (e.g., “ToolA”, “ToolB”), then normalize outputs to: + +```json +{ + "tool": "ToolA", + "version": "x.y.z", + "sample_id": "js-002-yaml-unsafe-load", + "detected_vulns": ["OSV:GHSA-..."], + "reachable_symbols_reported": [ + {"purl":"pkg:npm/js-yaml@4.1.0","symbol":"load"} + ], + "verdict": { + "reachable": true, + "confidence": 0.92 + }, + "raw": "path/to/original/tool/output.json" +} +``` + +--- + +# Comparison metrics + +For each sample: + +* **TP** (tool says reachable & ground truth reachable) +* **FP** (tool says reachable but ground truth not reachable) +* **FN** (tool says not reachable but ground truth reachable) +* **TN** (tool says not reachable & ground truth not reachable) + +Aggregate per language & tool: + +* Precision, recall, F1, and **Reachability Accuracy** = (TP+TN)/All. +* Optional: **Path depth** agreement (did tool cite the expected symbol/edge?). +* Optional: **Time-to-result** (seconds) and **scan mode** (static, dynamic, hybrid). + +--- + +# Minimal example (JS) — `samples/js/js-002-yaml-unsafe-load` + +``` +package.json +package-lock.json +server.js # express route POST /parse -> js-yaml load(body.text) +README.md +sbom.cdx.json +vex.cdx.json +repro.sh # npm ci; node server.js; curl -XPOST ... +GROUND_TRUTH.json +``` + +* Positive path: POST `{"text":"a: &a 1\nb: *a"}` to exercise parser. +* Negative path: guarded route that rejects user input unless whitelisted. + +--- + +# Publishing checklist + +* **License:** CC BY 4.0 (dataset) + MIT (runners). +* **Data hygiene:** no real secrets; deterministic scripts; pinned versions. +* **Repro:** one‑command `docker compose up` per language. +* **Docs:** + + * What is “reachability”? (vulnerable code is actually callable from app inputs). + * How we built ground truth (static review + runnable PoC). + * How to add a new sample (template folder + PR checklist). + +--- + +# Fast path to first release (1–2 days of focused work) + +1. Ship **one sample per language** with full ground truth + SBOM/VEX. +2. Include **one tool runner** (even a no‑op placeholder) and the **result schema**. +3. Add a **results/README** with the confusion‑matrix table filled for these 3 samples. +4. Open **issues** inviting contributions: more samples, more tools, more sinks. + +--- + +# Why this helps + +* Creates a **neutral, reproducible** yardstick for reachability. +* Lets vendors & researchers **compare apples to apples**. +* Encourages **PRs** (small, self‑contained samples) and **early citations** for Stella Ops. + +If you want, I can generate the repo skeleton (folders, sample stubs, JSON schemas, and runner scripts) so you can push it directly to GitHub. +Here’s a “drop in” **developer guide + concrete samples** you can paste into your repo (or split into `README.md` / `docs/DEVELOPER_GUIDE.md`). I’ll show: + +1. How the project is structured +2. **Very detailed** example samples for PHP, JS, C# +3. How tool authors integrate their reachability tool +4. How contributors add new samples + +You can tweak names/IDs, but everything below is self‑consistent. + +--- + +## 1. Repository structure (recap) + +```text +vuln-reach-dataset/ + README.md + docs/ + DEVELOPER_GUIDE.md # this file (or paste sections into README) + schema/ + ground-truth.schema.json + run-matrix.schema.json + samples/ + php/ + php-001-phar-deserialize/ + php-002-xxe-simplexml/ + ... + js/ + js-002-yaml-unsafe-load/ + ... + csharp/ + cs-001-binaryformatter-deserialize/ + ... 
  runners/
    run_all.sh
    run_all.ps1
    run_with_tool_mytool.py  # example tool integration
  results/
    mytool/
      php/php-001-phar-deserialize/run.json
      js/js-002-yaml-unsafe-load/run.json
      ...
```

**Core idea:**
Each `samples/<lang>/<sample-id>/` folder is:

* A minimal runnable app containing a known vulnerability
* A **positive path** (vulnerable code reachable) and (ideally) a **negative path** (package present but not reachable)
* `GROUND_TRUTH.json` describing what is actually reachable
* SBOM + VEX files describing vulnerabilities at the component level
* A `repro.sh` script to run the app and trigger the bug

Tool authors plug in by reading each sample folder, running their scanner, and writing normalized results to `results/<tool>/<lang>/<sample-id>/run.json`.

---

## 2. Ground truth schema (what tools are judged against)

**Minimal JSON format** (you can store a full JSON Schema in `schema/ground-truth.schema.json`):

```jsonc
{
  "sample_id": "php-001-phar-deserialize",
  "lang": "php",
  "package_manager": "composer",
  "vuln_ids": [
    "OSV:PLACEHOLDER-2019-XXXX"
  ],
  "entrypoints": [
    "public/index.php"
  ],
  "reachable_symbols": [
    {
      "purl": "pkg:composer/example/vendor@1.2.3",
      "symbol": "Example\\Unsafe::unserialize",
      "kind": "sink",
      "note": "User-controlled input can reach this sink in /?mode=unsafe&data=..."
    },
    {
      "purl": "pkg:composer/example/vendor@1.2.3",
      "symbol": "Example\\Unsafe::unserialize",
      "kind": "sink",
      "note": "NOT reached in /?mode=safe (negative path)."
    }
  ],
  "evidence": [
    {
      "type": "path",
      "file": "public/index.php",
      "line": 25,
      "desc": "Tainted $_GET['data'] flows into Example\\Unsafe::unserialize"
    },
    {
      "type": "exec",
      "cmd": "curl 'http://localhost:8000/?mode=unsafe&data=...payload...'",
      "result": "Trigger behavior / exploit / exception"
    }
  ]
}
```

Fields are intentionally simple:

* `reachable_symbols` describes **what is reachable** and from which package/version.
* `evidence` explains *why* we marked it reachable (code path + repro command).

---

## 3. PHP sample (php-001-phar-deserialize)

### 3.1 Folder layout

`samples/php/php-001-phar-deserialize/`:

```text
composer.json
composer.lock       # pinned, checked-in
public/
  index.php
src/
  UnsafeDeser.php
sbom.cdx.json
vex.cdx.json
GROUND_TRUTH.json
repro.sh
Dockerfile          # optional, but recommended
README.md           # local sample README
```

### 3.2 `composer.json`

Pin a vulnerable (or pretend-vulnerable) version — here `example/vendor` stands in for the vulnerable package:

```json
{
  "name": "dataset/php-001-phar-deserialize",
  "description": "Minimal PHP app demonstrating unsafe unserialize reachability.",
  "require": {
    "php": "^8.1",
    "example/vendor": "1.2.3"
  },
  "autoload": {
    "psr-4": {
      "Dataset\\Php001\\": "src/"
    }
  }
}
```

### 3.3 `src/UnsafeDeser.php`

```php
<?php

declare(strict_types=1);

namespace Dataset\Php001;

final class UnsafeDeser
{
    // POSITIVE PATH sink: unserialize() on attacker-controlled input.
    public static function unsafeUnserialize(string $data): mixed
    {
        return unserialize($data); // vulnerable sink
    }

    // NEGATIVE PATH helper: dependency present, sink never invoked.
    public static function safeCompare(string $data): bool
    {
        return hash_equals('s:2:"ok";', $data);
    }
}
```

### 3.4 `public/index.php`

```php
<?php

require __DIR__ . '/../vendor/autoload.php';

use Dataset\Php001\UnsafeDeser;

$mode = $_GET['mode'] ?? 'safe';
$data = (string)($_GET['data'] ?? '');

if ($mode === 'unsafe') {
    // POSITIVE PATH: tainted input -> vulnerable sink
    $result = UnsafeDeser::unsafeUnserialize($data);
    echo "UNSAFE RESULT:\n";
    var_dump($result);
} else {
    // NEGATIVE PATH: package is present but sink not invoked
    $isOk = UnsafeDeser::safeCompare($data);
    echo "SAFE RESULT:\n";
    var_dump($isOk);
}
```

### 3.5 `GROUND_TRUTH.json`

```json
{
  "sample_id": "php-001-phar-deserialize",
  "lang": "php",
  "package_manager": "composer",
  "vuln_ids": [
    "OSV:PLACEHOLDER-2019-UNSERIALIZE"
  ],
  "entrypoints": [
    "public/index.php"
  ],
  "reachable_symbols": [
    {
      "purl": "pkg:composer/example/vendor@1.2.3",
      "symbol": "Dataset\\Php001\\UnsafeDeser::unsafeUnserialize",
      "kind": "sink",
      "note": "Reachable when mode=unsafe (positive path)."
    }
  ],
  "evidence": [
    {
      "type": "path",
      "file": "public/index.php",
      "line": 12,
      "desc": "$_GET['data'] flows into UnsafeDeser::unsafeUnserialize without validation."
    },
    {
      "type": "exec",
      "cmd": "php -S 0.0.0.0:8000 -t public",
      "result": "Dev server started at http://0.0.0.0:8000"
    },
    {
      "type": "exec",
      "cmd": "curl 'http://localhost:8000/?mode=unsafe&data=O:4:\"Test\":0:{}'",
      "result": "unserialize() instantiates the attacker-supplied object payload"
    }
  ]
}
```

### 3.6 Minimal SBOM (`sbom.cdx.json`)

Very small CycloneDX 1.6 example (trim or enrich as needed):

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "metadata": {
    "component": {
      "type": "application",
      "name": "php-001-phar-deserialize"
    }
  },
  "components": [
    {
      "type": "library",
      "name": "example/vendor",
      "version": "1.2.3",
      "purl": "pkg:composer/example/vendor@1.2.3"
    }
  ]
}
```

### 3.7 Minimal VEX (`vex.cdx.json`)

CycloneDX VEX example (`exploitable` is the CycloneDX analysis state for a confirmed-affected finding):

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "metadata": {
    "component": {
      "type": "application",
      "name": "php-001-phar-deserialize"
    }
  },
  "vulnerabilities": [
    {
      "id": "OSV:PLACEHOLDER-2019-UNSERIALIZE",
      "source": {
        "name": "OSV",
        "url": "https://osv.dev/"
      },
      "affects": [
        {
          "ref": "pkg:composer/example/vendor@1.2.3"
        }
      ],
      "analysis": {
        "state": "exploitable",
        "detail": "UnsafeDeser::unsafeUnserialize is reachable from HTTP query parameter 'data' when mode=unsafe."
      }
    }
  ]
}
```

### 3.8 `repro.sh`

```bash
#!/usr/bin/env bash
set -euxo pipefail

# Install dependencies
composer install --no-interaction --no-progress

# Start built-in PHP server in background
php -S 0.0.0.0:8000 -t public &
SERVER_PID=$!
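
# Optional hardening (an addition, not in the original script): instead of the
# fixed sleep below you could poll until the server answers, e.g.:
#   until curl -s -o /dev/null http://localhost:8000/; do sleep 0.2; done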

# Give server a moment to start
sleep 2

echo "[+] Safe path (should NOT reach vulnerable sink)"
curl -s 'http://localhost:8000/?mode=safe&data=s:2:"ok";' || true

echo "[+] Unsafe path (should reach vulnerable sink)"
curl -s 'http://localhost:8000/?mode=unsafe&data=O:4:"Test":0:{}' || true

kill "$SERVER_PID"
wait || true
```

---

## 4. JavaScript sample (js-002-yaml-unsafe-load)

This example is intentionally simple: an Express server that calls `js-yaml`’s unsafe `load()` on user input.

### 4.1 Layout

`samples/js/js-002-yaml-unsafe-load/`:

```text
package.json
package-lock.json
server.js
sbom.cdx.json
vex.cdx.json
GROUND_TRUTH.json
repro.sh
Dockerfile (optional)
README.md
```

### 4.2 `package.json`

```json
{
  "name": "js-002-yaml-unsafe-load",
  "version": "1.0.0",
  "description": "Minimal Node.js sample demonstrating unsafe js-yaml load reachability.",
  "main": "server.js",
  "scripts": {
    "start": "node server.js"
  },
  "dependencies": {
    "express": "^4.19.0",
    "js-yaml": "4.1.0"
  }
}
```

### 4.3 `server.js`

```js
const express = require('express');
const yaml = require('js-yaml');

const app = express();
app.use(express.text({ type: '*/*' }));

// POSITIVE PATH: unsafe load of attacker-controlled YAML
app.post('/parse-unsafe', (req, res) => {
  try {
    const doc = yaml.load(req.body); // vulnerable symbol
    res.json({ parsed: doc });
  } catch (err) {
    res.status(400).json({ error: String(err) });
  }
});

// NEGATIVE PATH: same dependency, but not reachable as a sink
app.post('/parse-safe', (req, res) => {
  // Pretend we validated and reject anything non-whitelisted
  if (req.body.length > 100) {
    return res.status(400).json({ error: 'Too big' });
  }
  // No call to yaml.load() here; dependency is present but sink not invoked
  res.json({ length: req.body.length });
});

const port = process.env.PORT || 3000;
app.listen(port, () => {
  console.log(`js-002-yaml-unsafe-load listening on http://localhost:${port}`);
});
```

### 4.4 `GROUND_TRUTH.json`

```json
{
  "sample_id": "js-002-yaml-unsafe-load",
  "lang": "javascript",
  "package_manager": "npm",
  "vuln_ids": [
    "OSV:PLACEHOLDER-js-yaml-unsafe-load"
  ],
  "entrypoints": [
    "server.js"
  ],
  "reachable_symbols": [
    {
      "purl": "pkg:npm/js-yaml@4.1.0",
      "symbol": "load",
      "kind": "sink",
      "note": "Reachable from POST /parse-unsafe body."
    }
  ],
  "evidence": [
    {
      "type": "path",
      "file": "server.js",
      "line": 10,
      "desc": "req.body passes directly into yaml.load() without validation."
    },
    {
      "type": "exec",
      "cmd": "node server.js",
      "result": "Server listening on http://localhost:3000"
    },
    {
      "type": "exec",
      "cmd": "curl -XPOST localhost:3000/parse-unsafe -d 'foo: bar'",
      "result": "JSON response with {\"parsed\":{\"foo\":\"bar\"}}"
    }
  ]
}
```

### 4.5 SBOM & VEX

Very similar to the PHP example, but with npm purl:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "js-yaml",
      "version": "4.1.0",
      "purl": "pkg:npm/js-yaml@4.1.0"
    }
  ]
}
```

and:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "vulnerabilities": [
    {
      "id": "OSV:PLACEHOLDER-js-yaml-unsafe-load",
      "affects": [
        { "ref": "pkg:npm/js-yaml@4.1.0" }
      ],
      "analysis": {
        "state": "exploitable",
        "detail": "yaml.load() reachable from POST /parse-unsafe."
      }
    }
  ]
}
```
+ } + } + ] +} +``` + +### 4.6 `repro.sh` + +```bash +#!/usr/bin/env bash +set -euxo pipefail + +npm ci +node server.js & +SERVER_PID=$! + +sleep 2 + +echo "[+] Positive (reachable) path" +curl -s -XPOST localhost:3000/parse-unsafe -d 'foo: bar' || true + +echo "[+] Negative (not reaching sink) path" +curl -s -XPOST localhost:3000/parse-safe -d 'foo: bar' || true + +kill "$SERVER_PID" +wait || true +``` + +--- + +## 5. C# sample (cs-001-binaryformatter-deserialize) + +Minimal ASP.NET Core-style sample that uses `BinaryFormatter.Deserialize` on request data. + +### 5.1 Layout + +`samples/csharp/cs-001-binaryformatter-deserialize/`: + +```text +Cs001BinaryFormatter.csproj +Program.cs +sbom.cdx.json +vex.cdx.json +GROUND_TRUTH.json +repro.sh +Dockerfile (optional) +README.md +``` + +### 5.2 `Cs001BinaryFormatter.csproj` + +```xml + + + net8.0 + enable + enable + + + + + + +``` + +### 5.3 `Program.cs` + +```csharp +using System.Runtime.Serialization.Formatters.Binary; +using System.Text; + +var builder = WebApplication.CreateBuilder(args); +var app = builder.Build(); + +app.MapPost("/deserialize-unsafe", async (HttpContext ctx) => +{ + // POSITIVE PATH: body -> BinaryFormatter.Deserialize + using var ms = new MemoryStream(await ToBytes(ctx.Request.Body)); +#pragma warning disable SYSLIB0011 + var formatter = new BinaryFormatter(); + var obj = formatter.Deserialize(ms); // vulnerable symbol +#pragma warning restore SYSLIB0011 + + await ctx.Response.WriteAsJsonAsync(new { success = true, type = obj?.GetType().FullName }); +}); + +app.MapPost("/deserialize-safe", async (HttpContext ctx) => +{ + // NEGATIVE PATH: we read input, but never deserialize + using var reader = new StreamReader(ctx.Request.Body, Encoding.UTF8); + var text = await reader.ReadToEndAsync(); + await ctx.Response.WriteAsJsonAsync(new { length = text.Length }); +}); + +app.Run(); + +static async Task ToBytes(Stream stream) +{ + using var ms = new MemoryStream(); + await stream.CopyToAsync(ms); + return ms.ToArray(); +} +``` + +### 5.4 `GROUND_TRUTH.json` + +```json +{ + "sample_id": "cs-001-binaryformatter-deserialize", + "lang": "csharp", + "package_manager": "nuget", + "vuln_ids": [ + "OSV:PLACEHOLDER-BinaryFormatter" + ], + "entrypoints": [ + "Program.cs" + ], + "reachable_symbols": [ + { + "purl": "pkg:nuget/Example.VulnerableLib@1.0.0", + "symbol": "System.Runtime.Serialization.Formatters.Binary.BinaryFormatter::Deserialize", + "kind": "sink", + "note": "Reachable from POST /deserialize-unsafe body." + } + ], + "evidence": [ + { + "type": "path", + "file": "Program.cs", + "line": 15, + "desc": "Request body copied verbatim into BinaryFormatter.Deserialize." + }, + { + "type": "exec", + "cmd": "dotnet run", + "result": "App listening on http://localhost:5000" + }, + { + "type": "exec", + "cmd": "curl -XPOST http://localhost:5000/deserialize-unsafe --data-binary @payload.bin", + "result": "Response includes type name of deserialized object." + } + ] +} +``` + +SBOM/VEX same pattern as previous examples, with `purl: "pkg:nuget/Example.VulnerableLib@1.0.0"`. + +--- + +## 6. Tool output schema and integration + +This is the **normalized output** your runners should produce for each `(tool, sample)` pair. 
+ +### 6.1 `run.json` schema (results////run.json) + +```jsonc +{ + "tool": "mytool", + "version": "1.2.3", + "sample_id": "js-002-yaml-unsafe-load", + "lang": "javascript", + "detected_vulns": [ + "OSV:PLACEHOLDER-js-yaml-unsafe-load" + ], + "reachable_symbols_reported": [ + { + "purl": "pkg:npm/js-yaml@4.1.0", + "symbol": "load", + "kind": "sink", + "evidence": "Taint flow from POST /parse-unsafe body to js-yaml load()." + } + ], + "verdict": { + "reachable": true, + "confidence": 0.92 + }, + "timing": { + "scan_ms": 2300 + }, + "raw": "tool-output.json" // optional path to original tool output +} +``` + +Fields: + +* `reachable` is your top-level yes/no reachability verdict **for the specific vulnerability(ies) listed in the sample**. +* `reachable_symbols_reported` should map onto `GROUND_TRUTH.reachable_symbols` where possible. + +--- + +## 7. Example integration: running a tool against all samples + +### 7.1 Simple Bash runner (`runners/run_all.sh`) + +```bash +#!/usr/bin/env bash +set -euo pipefail + +TOOL_NAME="${1:-mytool}" + +ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" + +for lang_dir in "$ROOT_DIR/samples"/*; do + lang="$(basename "$lang_dir")" + for sample_dir in "$lang_dir"/*; do + sample_id="$(basename "$sample_dir")" + echo "[*] Running $TOOL_NAME on $lang/$sample_id" + + sbom="$sample_dir/sbom.cdx.json" + vex="$sample_dir/vex.cdx.json" + + mkdir -p "$ROOT_DIR/results/$TOOL_NAME/$lang/$sample_id" + + # Example: assume your tool supports a CLI like: + # mytool scan --sbom sbom.cdx.json --vex vex.cdx.json --project-root . + mytool scan \ + --sbom "$sbom" \ + --vex "$vex" \ + --project-root "$sample_dir" \ + > "$ROOT_DIR/results/$TOOL_NAME/$lang/$sample_id/tool-output.json" + + # Normalize output to run.json via helper script + python "$ROOT_DIR/runners/normalize_${TOOL_NAME}.py" \ + "$sample_dir" \ + "$ROOT_DIR/results/$TOOL_NAME/$lang/$sample_id/tool-output.json" \ + > "$ROOT_DIR/results/$TOOL_NAME/$lang/$sample_id/run.json" + done +done +``` + +### 7.2 Python normalizer example (`runners/normalize_mytool.py`) + +This script turns your proprietary tool output into our `run.json` schema. + +```python +#!/usr/bin/env python +import json +import sys +from pathlib import Path + +sample_dir = Path(sys.argv[1]) +tool_output_path = Path(sys.argv[2]) + +ground_truth = json.loads((sample_dir / "GROUND_TRUTH.json").read_text()) +tool_output = json.loads(tool_output_path.read_text()) + +# Example: adapt based on your tool's own schema +run = { + "tool": "mytool", + "version": tool_output.get("tool_version", "unknown"), + "sample_id": ground_truth["sample_id"], + "lang": ground_truth["lang"], + "detected_vulns": tool_output.get("vuln_ids", []), + "reachable_symbols_reported": [], + "verdict": { + "reachable": bool(tool_output.get("reachable", False)), + "confidence": float(tool_output.get("confidence", 0.0)) + }, + "timing": { + "scan_ms": tool_output.get("scan_ms", None) + }, + "raw": str(tool_output_path.name) +} + +for r in tool_output.get("reachable_sinks", []): + run["reachable_symbols_reported"].append({ + "purl": r.get("purl"), + "symbol": r.get("symbol"), + "kind": r.get("kind", "sink"), + "evidence": r.get("evidence", "") + }) + +print(json.dumps(run, indent=2)) +``` + +So **tool authors** only need to: + +1. Implement a CLI to scan a project (given SBOM/VEX). +2. Implement a small normalizer to produce `run.json`. + +--- + +## 8. Adding a new sample (for contributors) + +This is what you’d document so others can extend the dataset. + +1. 
1. **Pick a language & ID**

   * Folder: `samples/<lang>/<lang>-NNN-<short-name>/`
   * Example: `samples/php/php-004-guzzle-ssrf/`

2. **Create a minimal app**

   * It must install with one command (`composer install`, `npm ci`, `dotnet restore`, etc.).
   * Include:

     * **Positive path**: user-controllable data reaches the vulnerable sink.
     * **Negative path** (if possible): same dependency present but sink not reachable.

3. **Pin dependencies**

   * Commit lockfiles (`composer.lock`, `package-lock.json`, etc.).
   * Make sure the vulnerable version is used.

4. **Write `GROUND_TRUTH.json`**

   * Fill all required fields from the schema above.
   * Be explicit about which symbol(s) are reachable and how to reproduce.

5. **Generate SBOM**

   * Use your preferred SBOM generator and convert to CycloneDX 1.6 JSON (`sbom.cdx.json`).
   * Ensure PURLs match those you reference in `GROUND_TRUTH.json`.

6. **Write VEX (`vex.cdx.json`)**

   * At minimum: one vulnerability with `analysis.state = affected` or `not_affected`.
   * Link to the SBOM component via `affects.ref`.

7. **Add `repro.sh`**

   * Script that:

     * Installs deps.
     * Starts the app.
     * Executes at least one positive and one negative HTTP/CLI call.
   * Must exit non‑zero on obvious failure.

8. **Document briefly in local README**

   * What vulnerability pattern this sample represents (e.g., SSRF, XXE, unsafe deserialization).
   * Expected tool behavior (what should be marked reachable).

---

If you want, you can literally copy-paste the code and JSON above as your initial three samples (`php-001`, `js-002`, `cs-001`) and then we can layer in more patterns (XXE, SSRF, prototype pollution, etc.) the same way.
diff --git a/docs/product-advisories/23-Nov-2025 - Stella Ops vs Competitors.md b/docs/product-advisories/23-Nov-2025 - Stella Ops vs Competitors.md
new file mode 100644
index 000000000..ee495cebd
--- /dev/null
+++ b/docs/product-advisories/23-Nov-2025 - Stella Ops vs Competitors.md
@@ -0,0 +1,1044 @@
+Understood. Below is the **full vendor comparison sheet (15 vendors)** with a focus on SBOM generation, SBOM ingestion, attestations, Rekor/Sigstore integration, offline/air-gapped readiness, and gaps relative to your Stella Ops moats (deterministic replay, sovereign crypto, lattice policy engine, provenance graph, explainability layer, trust economics, quantum-resilient signatures).

This is written in a format that you can hand directly to agents or drop into your architecture docs.

If you want, I can also deliver a **table-only export**, **JSON export**, or a **short 1-page executive sheet**.

---

# Comprehensive SBOM / VEX / Attestation Vendor Comparison (2025)

## Legend

* SBOM Gen = Native SBOM generation
* SBOM Ingest = Can scan from SBOM (CycloneDX/SPDX)
* Attest = Can **produce** in-toto/DSSE/Cosign attestations
* Attest Verify = Can **verify** attestations
* Rekor = Can **write** or **query** Rekor transparency logs
* Offline = Has explicit offline/air-gap support
* Gap Score = Severity of gap relative to Stella Ops moats (High/Medium/Low)

---

# 1. Trivy (Aqua Security)

**SBOM Gen:** Yes (CycloneDX/SPDX)
**SBOM Ingest:** Yes
**Attest:** Yes (Cosign-compatible)
**Attest Verify:** Yes
**Rekor:** Query yes, write via Cosign
**Offline:** Strong
**Gap Score:** Medium

**Gaps vs Stella Ops Moats:**

* No deterministic replay bundles (no freezeable feed+rules snapshot).
* No lattice VEX engine.
* No crypto-sovereign mode (GOST/SM/EIDAS). 
+* No trust economics or provenance graph. + +--- + +# 2. Syft + Grype (Anchore) + +**SBOM Gen:** Yes (Syft) +**SBOM Ingest:** Yes (Grype) +**Attest:** Limited (requires Cosign) +**Attest Verify:** Limited +**Rekor:** Via Cosign only +**Offline:** Partial but not enterprise-grade +**Gap Score:** Medium + +**Gaps:** + +* No attestation-first workflow; SBOMs are unsigned unless user orchestrates signing. +* No deterministic replay. +* No lattice merge logic. +* No sovereign crypto. +* No provenance chain. + +--- + +# 3. Snyk + +**SBOM Gen:** Yes +**SBOM Ingest:** Limited +**Attest:** No +**Attest Verify:** No +**Rekor:** No +**Offline:** Weak (SaaS-first) +**Gap Score:** High + +**Gaps:** + +* Attestations, signing, deterministic replay, and offline mode are missing entirely. +* No VEX, lattice rules, or provenance graphs. + +--- + +# 4. Prisma Cloud (Palo Alto) + +**SBOM Gen:** Yes (CycloneDX) +**SBOM Ingest:** Limited +**Attest:** No +**Attest Verify:** No +**Rekor:** No +**Offline:** Yes, strong on Intel Stream updates +**Gap Score:** High + +**Gaps:** + +* Attestations and SBOM provenance are not part of system. +* No deterministic audit replay. +* No trust graph or crypto-sovereign mode. + +--- + +# 5. AWS Inspector + AWS Signer + ECR Notary v2 (AWS) + +**SBOM Gen:** Partial +**SBOM Ingest:** Partial (Inspector) +**Attest:** Yes (Notary v2) +**Attest Verify:** Yes +**Rekor:** No (private transparency solution) +**Offline:** Weak +**Gap Score:** Medium + +**Gaps:** + +* No SBOM/VEX formal unification. +* Closed ecosystem, no sovereign crypto. +* Deterministic replay impossible outside AWS. +* No lattice engine. + +--- + +# 6. Google Artifact Registry + Cloud Build Attestations + +**SBOM Gen:** Yes +**SBOM Ingest:** Yes +**Attest:** Yes (SLSA provenance) +**Attest Verify:** Yes +**Rekor:** Optional through Sigstore, not default +**Offline:** Weak +**Gap Score:** Medium + +**Gaps:** + +* No offline bundles. +* No trust economics. +* No lattice VEX or regional-crypto. + +--- + +# 7. GitHub Advanced Security + Dependabot + Actions Attestation + +**SBOM Gen:** Yes (GHAS) +**SBOM Ingest:** Partial +**Attest:** Yes (OIDC + Sigstore) +**Attest Verify:** Yes +**Rekor:** Yes (Sigstore integrated) +**Offline:** No +**Gap Score:** Medium–High + +**Gaps:** + +* No deterministic replay of scans. +* No custom cryptographic modes. +* No provenance graph beyond simple attestation. + +--- + +# 8. GitLab Ultimate + Dependency Scanning + +**SBOM Gen:** Yes +**SBOM Ingest:** Limited +**Attest:** Partial +**Attest Verify:** Partial +**Rekor:** No native +**Offline:** Medium +**Gap Score:** Medium + +**Gaps:** + +* Attestations are not first-class objects. +* No lattice engine or trust economics. +* No deterministic replay. + +--- + +# 9. Microsoft Defender for DevOps + +**SBOM Gen:** Partial +**SBOM Ingest:** Partial +**Attest:** No +**Attest Verify:** No +**Rekor:** No +**Offline:** Weak +**Gap Score:** High + +**Gaps:** + +* Missing all advanced SBOM/VEX/Attestation constructs. + +--- + +# 10. Anchore Enterprise + +**SBOM Gen:** Yes +**SBOM Ingest:** Yes +**Attest:** Some (enterprise extensions) +**Attest Verify:** Partial +**Rekor:** No +**Offline:** Good +**Gap Score:** Medium + +**Gaps:** + +* No sovereign crypto. +* No deterministic replay. +* No visual lattice editor. + +--- + +# 11. 
JFrog Xray + +**SBOM Gen:** Yes +**SBOM Ingest:** Yes +**Attest:** No +**Attest Verify:** No +**Rekor:** No +**Offline:** Medium +**Gap Score:** High + +**Gaps:** + +* SBOM processing only; attestation and provenance missing. +* No graph-based trust or VEX merge layers. + +--- + +# 12. Tenable + Tenable Cloud Security + +**SBOM Gen:** Partial +**SBOM Ingest:** Limited +**Attest:** No +**Attest Verify:** No +**Rekor:** No +**Offline:** Weak +**Gap Score:** High + +**Gaps:** + +* Not an SBOM or attestation system. + +--- + +# 13. Qualys + +**SBOM Gen:** Limited +**SBOM Ingest:** Limited +**Attest:** No +**Attest Verify:** No +**Rekor:** No +**Offline:** Medium +**Gap Score:** High + +--- + +# 14. Rezilion + +**SBOM Gen:** Yes +**SBOM Ingest:** Yes +**Attest:** No +**Attest Verify:** No +**Rekor:** No +**Offline:** Medium +**Gap Score:** Medium–High + +**Differentiator:** Reachability analysis based on runtime—but no attestation stack. + +--- + +# 15. Chainguard + Wolfi + +**SBOM Gen:** Yes (SLSA-native) +**SBOM Ingest:** Yes +**Attest:** Yes +**Attest Verify:** Yes +**Rekor:** Yes +**Offline:** Medium +**Gap Score:** Low–Medium + +**Gaps:** + +* Very strong attestation and provenance, but no deterministic replay bundles. +* No lattice/VEX merge engine. +* No crypto-sovereign modes. + +--- + +# Consolidated Comparison Table (Compact) + +| Vendor | SBOM Gen | SBOM Ingest | Attest | Attest Verify | Rekor | Offline | Gap Score | +| ----------- | -------- | ----------- | ------- | ------------- | -------- | ------- | ----------- | +| Trivy | Yes | Yes | Yes | Yes | Yes | Strong | Medium | +| Syft+Grype | Yes | Yes | Limited | Limited | Indirect | Medium | Medium | +| Snyk | Yes | Limited | No | No | No | Weak | High | +| Prisma | Yes | Limited | No | No | No | Strong | High | +| AWS | Partial | Partial | Yes | Yes | No | Weak | Medium | +| Google | Yes | Yes | Yes | Yes | Optional | Weak | Medium | +| GitHub | Yes | Partial | Yes | Yes | Yes | Weak | Medium-High | +| GitLab | Yes | Limited | Partial | Partial | No | Medium | Medium | +| Microsoft | Partial | Partial | No | No | No | Weak | High | +| Anchore Ent | Yes | Yes | Some | Partial | No | Good | Medium | +| JFrog Xray | Yes | Yes | No | No | No | Medium | High | +| Tenable | Partial | Limited | No | No | No | Weak | High | +| Qualys | Limited | Limited | No | No | No | Medium | High | +| Rezilion | Yes | Yes | No | No | No | Medium | Medium-High | +| Chainguard | Yes | Yes | Yes | Yes | Yes | Medium | Low–Medium | + +--- + +# High-Level Summary for Stella Ops Positioning + +## 1. You beat all 15 vendors on **deterministic replayability** + +None of them provide: + +* Feed+rules snapshot capture +* Replayable SBOM & VEX evaluation identical across time +* Hash-locked manifests for audits/regulators + +This is a primary moat. + +## 2. You are the only one with **crypto-sovereign readiness** + +No vendor supports: + +* GOST R 34.10-2012 +* SM2/SM3/SM4 +* EIDAS/QES native pipeline +* PQC (Dilithium/Falcon) as first-class switches + +Another moat. + +## 3. You are the only one with a **lattice-based VEX engine** + +No vendor even attempts: + +* Policy-merging lattice logic +* Visual trust algebra editor +* VEX reduction semantics + +Massive moat. + +## 4. **Proof-of-Integrity Graph** is unmatched + +Some vendors do provenance; only Chainguard does strong provenance. +None build a user-facing graph with: + +* DSSE chain +* Build ancestry +* Material locality +* Container runtime verification + +## 5. 
No vendor has **trust economics / market layer** + +Your “Proof-Market Ledger” is completely unique. + +## 6. No vendor has **AI explainability for SBOM/VEX** + +This is a differentiator that can become a moat if done with deterministic reasoning constraints. + +--- + +# Final Result + +Delivered: full vendor comparison sheet across 15 competitors, with SBOM/attestation/offline features and a gap analysis against Stella Ops’ moats. + +If you want, I can now: + +1. Produce the **PowerPoint deck** for investors. +2. Generate a **machine-readable JSON** for internal agent consumption. +3. Produce **competitive battlecards** per vendor for sales. +4. Integrate into your **AGENTS.md** or **TASKS.md** structure. +Perfect, let’s turn this into something you can literally hand to your documentation team as a working specification. + +Below I’ll structure it as: + +1. Global principles and structure. +2. Blueprint for client-facing docs (with outlines and text samples). +3. Blueprint for developer-facing docs (with outlines and text samples). +4. Standard per-module template (Scanner, Sbomer, Concelier, Vexer, etc.). +5. Two concrete, copy-pasteable full-page samples. + +You can paste this into `docs/authoring-guide.md` (or similar) in your repo. + +--- + +## 1. Global Principles + +### 1.1 Audiences + +Your documentators must always write with a clear audience in mind: + +* **Client / Business / Security leadership** + CIO, CISO, Head of Infrastructure, Architects, Risk/Compliance. + +* **Technical Integrators / DevOps / Platform Engineers** + People who deploy Stella Ops and integrate with CI/CD, registries, proxies. + +* **Developers / Product Engineers** + People who consume APIs/SDKs, embed agents, or extend the system. + +Every document must declare its audience in front-matter: + +```yaml +--- +title: "Deterministic Scan Replay" +audience: ["client", "security-lead", "architect"] +level: "introductory" +version: "1.0" +module: "Platform" +--- +``` + +### 1.2 Document Types + +We standardize on four types: + +1. **Concept / Overview** – explain *what* and *why* in plain language. +2. **How-to / Guide** – stepwise instructions to accomplish a task. +3. **Reference** – APIs, CLI, config, schemas. +4. **Architecture / Deep Dive** – internals, invariants, interaction diagrams. + +Each page must state its type in front-matter: + +```yaml +type: "concept" # or "guide" | "reference" | "architecture" +``` + +--- + +## 2. Client-Facing Documentation Blueprint + +The client side should answer: “What does Stella Ops do for my risk, compliance, and operations?” + +### 2.1 Top-Level Structure (Clients) + +Proposed tree: + +* `docs/clients/01-solution-overview.md` +* `docs/clients/02-platform-architecture-and-trust-model.md` +* `docs/clients/03-sbom-and-vex-capabilities.md` +* `docs/clients/04-attestations-and-transparency.md` +* `docs/clients/05-deterministic-replay-and-audit-readiness.md` +* `docs/clients/06-crypto-sovereign-and-regional-standards.md` +* `docs/clients/07-deployment-models-and-operations.md` +* `docs/clients/08-compliance-and-certifications.md` +* `docs/clients/09-faq-and-glossary.md` + +Below are concrete directions and samples. + +--- + +### 2.2 01 – Solution Overview + +**Goal:** In 2–4 pages, explain the value proposition with minimal jargon. + +**Outline for documentators:** + +1. Problem statement: SBOM, VEX, attestation chaos today. +2. What Stella Ops is (one paragraph). +3. Key capabilities: + + * Deterministic, replayable scans. + * Crypto-sovereign readiness. 
+ * Lattice-based VEX / policy engine. + * Proof-of-Integrity Graph. +4. Key deployment scenarios: + + * On-prem, air-gapped. + * Hybrid. + * Cloud-only. +5. Outcomes: + + * Faster audits. + * Lower vulnerability noise. + * Vendor-proof cryptography and attestations. + +**Sample paragraph (they can adapt language):** + +> Stella Ops is a sovereign, deterministic SBOM and VEX platform that turns software supply chain metadata into auditable evidence. It ingests container images and build artifacts, produces cryptographically signed SBOMs and VEX statements, and evaluates risk using a configurable lattice engine. Every scan can be deterministically replayed against frozen feeds and policies, enabling regulators and auditors to independently verify historical decisions. + +--- + +### 2.3 02 – Platform Architecture & Trust Model + +**Goal:** Explain the core subsystems to architects, without drowning them in implementation details. + +**Outline:** + +1. High-level diagram (required): modules and major data flows. +2. Core components: + + * Scanner (binary/SBOM generation). + * Sbomer (SBOM normalization and storage). + * Concelier/Feedser (feed ingestion, normalization). + * Excitior/Vexer (VEX evaluation, lattice engine). + * Authority (signing, policy, key management). + * Ledger (proof market, transparency-log integration). +3. Trust boundaries: + + * What runs in customer infra vs external registries/logs. + * Where signatures are created and verified. +4. Threat model summary: + + * Supply chain tampering. + * Compromised registry. + * Compromised feed sources. + +**Sample section header and short text:** + +```md +## Components at a Glance + +Stella Ops is composed of six cooperating services: + +- **Scanner** – inspects container images and binaries, produces SBOMs. +- **Sbomer** – normalizes and stores SBOMs in CycloneDX/SPDX formats. +- **Concelier/Feedser** – ingests vulnerability feeds, advisories, and vendor VEX. +- **Excitior/Vexer** – applies lattice-based policies to SBOM+VEX to compute effective risk. +- **Authority** – manages keys, signatures, and attestation policies. +- **Ledger** – integrates with transparency logs (e.g. Rekor) and maintains a local proof store. +``` + +--- + +### 2.4 03 – SBOM & VEX Capabilities + +**Goal:** Show clients what standards you support and how SBOM/VEX are produced and consumed. + +**Outline:** + +1. Standards support: + + * CycloneDX (version list). + * SPDX (version list). + * VEX formats supported/normalized. +2. SBOM lifecycle: + + * Generation at build. + * Ingestion from third parties. + * Storage, indexing, retention. +3. VEX lifecycle: + + * Ingestion from vendors. + * Internal VEX issuance. + * Propagation to downstream systems. +4. Benefits: + + * Noise reduction. + * Consistent view across multiple vendors. + +**Short sample:** + +> Stella Ops consumes and produces SBOMs in CycloneDX and SPDX formats, normalizing them into an internal graph model. On top of that graph, the platform evaluates vulnerability information combined with VEX statements (both vendor-supplied and locally authored). This separation of “what is present” (SBOM) from “what is exploitable” (VEX + lattice rules) is central to reducing false positives while maintaining explainability. + +--- + +### 2.5 05 – Deterministic Replay & Audit Readiness + +This is one of your moats; it deserves its own client-facing concept doc. + +**Outline:** + +1. What is a “deterministic scan”? +2. What is captured in a replay bundle: + + * Feeds snapshot (hashes). 
+   * Policies, lattice configuration.
+   * SBOM and VEX input.
+   * Scan result + proof object.
+3. Use cases:
+
+   * Regulatory audits.
+   * Internal forensics.
+   * Dispute resolution with vendors.
+4. Operational model:
+
+   * Retention strategy (how long we keep bundles).
+   * Export/import for external verification.
+
+**Sample explainer paragraph:**
+
+> A deterministic scan in Stella Ops is a vulnerability and risk evaluation that can be reproduced bit-for-bit at a later time. For each scan, the platform records a manifest of all inputs: SBOMs, VEX statements, vulnerability feeds, and policy rules, each identified by content hash. These inputs, together with the evaluation engine version, form a replay bundle. Auditors can re-run the bundle in an offline environment and confirm that the platform’s decision at that time was mathematically consistent with the evidence and policies in force.
+
+---
+
+### 2.6 06 – Crypto-Sovereign & Regional Standards
+
+**Outline:**
+
+1. Supported cryptographic algorithms:
+
+   * Standard (RSA/ECDSA, etc.).
+   * Regional (GOST, SM, eIDAS, etc.) – list per region.
+   * PQC options (Dilithium/Falcon) – if planned/available.
+2. Deployment patterns:
+
+   * How keys are hosted (HSM, external KMS, bring-your-own).
+   * How to ensure regional legal compliance (data residency, crypto regulations).
+3. Interoperability:
+
+   * How a GOST-signed attestation can still be verifiable by non-GOST consumers, and vice versa (if applicable).
+
+**Short sample:**
+
+> Stella Ops supports multiple cryptographic profiles, including EU eIDAS-qualified signatures and regional standards such as GOST and SM-series algorithms. Each deployment can be configured with a crypto profile that aligns with regulatory and organizational requirements. The same SBOM and attestation data can be signed under multiple profiles, enabling cross-border verification without forcing customers onto a single cryptographic regime.
+
+---
+
+## 3. Developer-Facing Documentation Blueprint
+
+Developer documentation must be immediately actionable.
+
+### 3.1 Top-Level Structure (Developers)
+
+Proposed tree:
+
+* `docs/dev/01-getting-started.md`
+* `docs/dev/02-core-concepts.md`
+* `docs/dev/03-apis/` (REST/GraphQL/CLI)
+* `docs/dev/04-sdks/`
+* `docs/dev/05-examples-and-recipes/`
+* `docs/dev/06-architecture-deep-dive/`
+* `docs/dev/07-data-models-and-schemas/`
+* `docs/dev/08-operations-and-runbooks/`
+* `docs/dev/09-contributing-and-extension-points/`
+
+---
+
+### 3.2 01 – Getting Started (Developers)
+
+**Outline:**
+
+1. Minimal environment prerequisites (Docker, Kubernetes, DB).
+2. Quick install (one simple path: e.g., docker-compose, Helm).
+3. First scan:
+
+   * Run scanner against sample image.
+   * View SBOM and results via UI/API.
+4. Links to deeper sections.
+
+**Sample skeleton:**
+
+````md
+# Getting Started with Stella Ops
+
+This guide helps you run your first scan in under 30 minutes.
+
+## Prerequisites
+
+- Docker 24+ or Kubernetes 1.27+
+- Access to a Postgres-compatible database
+- A container registry containing at least one test image
+
+## Step 1: Deploy the Core Services
+
+```bash
+docker compose -f deploy/docker-compose.minimal.yml up -d
+```
+
+## Step 2: Run Your First Scan
+
+```bash
+stella scanner image scan \
+  --image ghcr.io/example/app:latest \
+  --output sbom.json
+```
+
+## Step 3: View Results
+
+Use the API:
+
+```bash
+curl http://localhost:8080/api/scans/{scanId}
+```
+
+Or open the web UI at [http://localhost:8080](http://localhost:8080).
+
+````
+
+---
+
+### 3.3 03 – API Reference (REST/GraphQL/CLI)
+
+For each API, documentators must provide:
+
+- Endpoint/path (or GraphQL query/mutation).
+- Description.
+- Request schema & example.
+- Response schema & example.
+- Error codes and typical causes.
+
+**Example structure for a REST endpoint:**
+
+````md
+## POST /api/scans
+
+Create a new scan for a container image or SBOM.
+
+### Request
+
+```json
+{
+  "target": {
+    "type": "container-image",
+    "imageRef": "ghcr.io/example/app:1.2.3"
+  },
+  "policies": ["default"],
+  "attest": true
+}
+```
+
+### Response
+
+```json
+{
+  "scanId": "scn_01J8YF4ZK7...",
+  "status": "queued",
+  "links": {
+    "self": "/api/scans/scn_01J8YF4ZK7...",
+    "results": "/api/scans/scn_01J8YF4ZK7.../results"
+  }
+}
+```
+
+### Error Codes
+
+* `400 INVALID_TARGET` – the target specification is missing or invalid.
+* `403 POLICY_NOT_ALLOWED` – caller not permitted to use specified policies.
+* `503 BACKEND_UNAVAILABLE` – core scanning service currently unavailable.
+
+````
+
+---
+
+### 3.4 05 – Examples & Recipes
+
+This section is extremely practical. You want short, targeted recipes such as:
+
+- “Integrate Stella Ops in GitLab CI pipeline.”
+- “Use Stella Ops to sign SBOMs with GOST keys.”
+- “Replay a historical scan bundle for audit purposes.”
+- “Publish attestations to Rekor and verify them.”
+
+Each recipe follows a standard pattern:
+
+1. Problem / goal.
+2. Preconditions.
+3. Step-by-step instructions.
+4. Example code/config.
+5. Verification.
+
+---
+
+### 3.5 06 – Architecture Deep Dive (Developers)
+
+This is where your internal devs and advanced customers go.
+
+Per deep dive:
+
+- Explain internal data flows in detail.
+- Include sequence diagrams (build → scanner → sbomer → vexer → ledger).
+- Document invariants and assumptions.
+- Explain failure modes and how they surface (e.g. what happens if Rekor is down).
+
+---
+
+## 4. Per-Module Template
+
+For every major module (Scanner, Sbomer, Concelier/Feedser, Excitior/Vexer, Authority, Ledger), documentators should create a module page using this template:
+
+```md
+---
+title: "Scanner Module Overview"
+audience: ["developer", "devops", "architect"]
+type: "architecture"
+module: "Scanner"
+---
+
+# Purpose
+
+One paragraph: what this module does in the system.
+
+# Responsibilities
+
+- Bullet list of responsibilities.
+- What input it accepts.
+- What output it produces.
+
+# Interfaces
+
+## Incoming
+
+- API endpoints, message queues, or CLI commands that hit this module.
+
+## Outgoing
+
+- Which services it calls.
+- Which queues/topics it publishes to.
+
+# Data Model
+
+- Key entities and fields.
+- Links to detailed schema documents.
+
+# Configuration
+
+- Config options (names, types, defaults).
+- How configuration is loaded (env, file, config service).
+
+# Operational Considerations
+
+- Scaling characteristics.
+- Resource usage patterns.
+- Common failure modes and how they are reported.
+
+# Security & Trust
+
+- What security boundaries apply.
+- Which keys or tokens are used.
+- Logging and audit fields produced.
+
+# Examples
+
+- Example request/response.
+- Example logs for a typical operation.
+````
+
+---
+
+## 5. Two Concrete, Ready-to-Use Sample Pages
+
+You asked for “very detailed samples”. Here are two you can almost drop in.
+ +### 5.1 Sample Client Doc: “Deterministic Scan Replay – Overview” + +```md +--- +title: "Deterministic Scan Replay" +audience: ["client", "security-lead", "architect"] +type: "concept" +version: "1.0" +module: "Platform" +--- + +# Deterministic Scan Replay + +Stella Ops allows every vulnerability scan to be reproduced bit-for-bit at a later date. This property is called **deterministic scan replay** and is central to our audit and compliance story. + +## Why Determinism Matters + +Most security tools evaluate vulnerabilities against live feeds and mutable policies. Six months later, it is often impossible to reconstruct why a specific risk decision was made, because: + +- Vulnerability feeds have changed. +- Vendor advisories have been updated or withdrawn. +- Internal policies have evolved. + +Stella Ops addresses this by capturing all inputs into a scan in a **replay bundle**. + +## What is a Replay Bundle? + +For each scan, Stella Ops records: + +- **SBOM inputs** – the exact SBOM documents used, with content hashes. +- **VEX inputs** – vendor and local VEX statements applied, with content hashes. +- **Vulnerability feeds** – references to feed snapshots with content hashes and timestamps. +- **Policy and lattice configuration** – the full set of rules and lattice parameters in effect. +- **Engine version** – the precise version of the evaluation engine used. + +These elements are referenced in a signed manifest. The manifest itself can be stored locally or anchored to an external transparency log. + +## How Auditors Use Replay + +During an audit, an authorized party can: + +1. Export a replay bundle for the time period or system under review. +2. Load the bundle into a replay-capable environment (online or offline). +3. Re-run the evaluation and confirm that: + - The same inputs produce the same findings. + - The risk decisions claimed at the time are consistent with the policies that were in force. + +This removes guesswork and narrative from the conversation. Auditors work with verifiable evidence and deterministic behavior instead of reconstructed stories. + +## Impact on Compliance + +Deterministic scan replay supports: + +- **Regulatory compliance** – Demonstrate that vulnerability decisions were aligned with documented policies and evidence at any historical point. +- **Vendor accountability** – Show precisely which vendor advisories and VEX statements were applied to which software versions. +- **Internal governance** – Trace how changes in policy or feeds affect risk assessments over time. +``` + +--- + +### 5.2 Sample Developer Doc: “Replay Bundle Manifest – Technical Specification” + +````md +--- +title: "Replay Bundle Manifest" +audience: ["developer", "devops"] +type: "reference" +module: "Authority" +version: "1.0" +--- + +# Replay Bundle Manifest + +This document defines the JSON schema for the replay bundle manifest used by Stella Ops to support deterministic scan replay. + +The manifest describes all inputs required to reproduce a specific scan result. 
+
+## Top-Level Structure
+
+```json
+{
+  "schemaVersion": "1.0",
+  "bundleId": "rbn_01J8YF4ZK7...",
+  "scanId": "scn_01J8XE3FQ8...",
+  "createdAt": "2025-10-21T14:32:45Z",
+  "engine": {
+    "name": "stella-vex-engine",
+    "version": "2.3.1"
+  },
+  "inputs": {
+    "sboms": [],
+    "vexStatements": [],
+    "feeds": [],
+    "policies": []
+  },
+  "signatures": []
+}
+```
+
+## SBOM Inputs
+
+```json
+"sboms": [
+  {
+    "id": "sbom_01J8XD7A9K...",
+    "role": "primary",
+    "format": "CycloneDX",
+    "version": "1.6",
+    "hash": {
+      "alg": "sha256",
+      "value": "4f3c9c..."
+    }
+  }
+]
+```
+
+* `role` – `primary` or `dependency`.
+* `format` – `CycloneDX` or `SPDX`.
+* `hash` – content hash of the SBOM document.
+
+## VEX Inputs
+
+```json
+"vexStatements": [
+  {
+    "id": "vex_01J8XD9VTR...",
+    "source": "vendor",
+    "format": "CycloneDX-VEX",
+    "hash": {
+      "alg": "sha256",
+      "value": "b3d2a1..."
+    }
+  }
+]
+```
+
+* `source` – `vendor` or `local`.
+* `format` – VEX format identifier.
+* `hash` – content hash of the VEX document.
+
+## Feed Snapshots
+
+```json
+"feeds": [
+  {
+    "id": "feed_nvd_2025-10-21",
+    "provider": "NVD",
+    "snapshotDate": "2025-10-21",
+    "hash": {
+      "alg": "sha256",
+      "value": "89a7c5..."
+    }
+  }
+]
+```
+
+Each entry represents a frozen snapshot of a vulnerability feed used during the scan.
+
+## Policies and Lattice Configuration
+
+```json
+"policies": [
+  {
+    "id": "policy_default_2025-09",
+    "type": "lattice-config",
+    "hash": {
+      "alg": "sha256",
+      "value": "cc44ee..."
+    }
+  }
+]
+```
+
+The manifest does not inline the full policy, only its hash and ID. Policies are stored separately but are referentially immutable.
+
+## Signatures
+
+```json
+"signatures": [
+  {
+    "profile": "eidas-qes",
+    "alg": "ECDSA_P256",
+    "sig": "MEQCIG...",
+    "certChainRef": "cert_01J8Z0..."
+  }
+]
+```
+
+* `profile` – cryptographic profile, e.g. `eidas-qes`, `gost-2012`, `pqc-dilithium`.
+* `alg` – concrete algorithm used.
+* `sig` – signature over the canonical JSON form.
+* `certChainRef` – reference to the certificate chain used for verification.
+
+## Backwards Compatibility
+
+* New fields must be additive.
+* Existing fields must not change semantics.
+* The `schemaVersion` field is used to interpret optional sections.
+
+Replay tools MUST ignore unknown fields.
+
+````
+
+---
+
+If you give this to your documentators, they will have:
+
+- A clear structure for client vs developer docs.
+- Concrete outlines per key area (SBOM/VEX, replay, crypto-sovereign).
+- A per-module template.
+- Example pages that show the level of depth and style you expect.
+
+If you want next, I can:
+- Turn this into a full `docs/authoring-guide.md` plus a directory tree, or
+- Add explicit naming conventions and linting rules (e.g., required headings, glossary tags).
diff --git a/docs/product-advisories/23-Nov-2025 - Verifying Binary Reachability via DSSE Envelopes.md b/docs/product-advisories/23-Nov-2025 - Verifying Binary Reachability via DSSE Envelopes.md
new file mode 100644
index 000000000..ae5904a1f
--- /dev/null
+++ b/docs/product-advisories/23-Nov-2025 - Verifying Binary Reachability via DSSE Envelopes.md
@@ -0,0 +1,830 @@
+Here’s a crisp idea I think you’ll like: **attested, offline‑verifiable call graphs** for binaries.
+
+![abstract graph with signed edges concept](https://images.unsplash.com/photo-1558494949-ef010cbdcc31?q=80\&w=1200\&auto=format\&fit=crop)
+
+### The gist
+
+* **Goal:** Make binary reachability (who calls whom) something an auditor can replay **deterministically**, even air‑gapped.
+* **How:**
+
+  1. Build the call graph for ELF/PE/Mach‑O.
+  2. **Seal each edge (caller → callee) as its own artifact** and sign it in a **DSSE** (in‑toto envelope).
+  3. Bundle a **reachability graph manifest** listing all edge‑artifacts + hashes of the inputs (binary, debug info, decompiler version, lattice/policy config).
+  4. Upload edge‑attestations to a **transparency log** (e.g., Rekor v2).
+  5. Anyone can later fetch/verify the envelopes and **replay the analysis identically** (same inputs ⇒ same graph).
+
+### Why this matters
+
+* **Deterministic audits:** “Prove this edge existed at analysis time.” No hand‑wavy “our tool said so last week.”
+* **Granular trust:** You can quarantine or dispute **just one edge** without invalidating the whole graph.
+* **Supply‑chain fit:** Edge‑artifacts compose nicely with SBOM/VEX; you can say “CVE‑123 is reachable via these signed edges.”
+
+### Minimal vocabulary
+
+* **DSSE:** A standard envelope that signs the *statement* (here: an edge) and its *subject* (binary, build‑ID, PURLs).
+* **Rekor (v2):** An append‑only public log for attestations. Inclusion proofs = tamper‑evidence.
+* **Reachability graph:** Nodes are functions/symbols; edges are possible calls; roots are entrypoints (exports, handlers, ctors, etc.).
+
+### What “best‑in‑class” looks like in Stella Ops
+
+* **Edge schema (per envelope):**
+
+  * `subject`: binary digest + **build‑id**, container image digest (if relevant)
+  * `caller`: {binary‑offset | symbol | demangled | PURL, version}
+  * `callee`: same structure
+  * `reason`: static pattern (PLT/JMP, thunk), **init_array/ctors**, EH frames, import table, or **dynamic witness** (trace sample ID)
+  * `provenance`: tool name + version, pipeline run ID, OS, container digest
+  * `policy-hash`: hash of lattice/policy/rules used
+  * `evidence`: (optional) byte slice, CFG snippet hash, or trace excerpt hash
+* **Graph manifest (DSSE too):**
+
+  * list of edge envelope digests, **roots set**, toolchain hashes, input feeds, **PURL map** (component/function ↔ PURL).
+* **Verification flow:**
+
+  * Verify envelopes → verify Rekor inclusion → recompute edges from inputs (or check cached proofs) → compare manifest hash.
+* **Roots you must include:** exports, syscalls, signal handlers, **.init_array / .ctors**, TLS callbacks, exception trampolines, plugin entrypoints, registered callbacks.
+
+### Quick implementation plan (C#/.NET 10, fits your stack)
+
+1. **Parsers**: ELF/PE/Mach‑O loaders (SymbolTable, DynSym, Reloc/Relr, Import/Export, Sections, Build‑ID), plus DWARF/PDB stubs when present.
+2. **Normalizer**: stable symbol IDs (image base + RVA) and **PURL resolver** (package → function namespace).
+3. **Edge extractors** (pluggable):
+
+   * Static: import thunks, PLT/JMP, reloc‑targets, vtable patterns, .init_array, EH tables, jump tables.
+   * Dynamic (optional): eBPF/ETW/Perf trace ingester → produce **witness edges**.
+4. **Edge attestation**: one DSSE per edge + signer (FIPS/SM/GOST/EIDAS as needed).
+5. **Manifest builder**: emit graph manifest + policy/lattice hash; store in your **Ledger**.
+6. **Transparency client**: Rekor v2 submit/query; cache inclusion proofs for offline bundles.
+7. **Verifier**: deterministic replay runner; diff engine (edge‑set, roots, policy changes).
+8. **UI**: “Edge provenance” panel; click an edge → see DSSE, Rekor proof, extraction reason.
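+
+One way to make the idempotence guardrail below concrete is a content-addressed edge ID. This is a sketch, not a fixed spec: the newline delimiter and the field order are assumptions — whatever canonical form you pick, keep it stable across runs so re-analysis yields identical IDs.
+
+```csharp
+using System.Security.Cryptography;
+using System.Text;
+
+public static class EdgeId
+{
+    // Deterministic edge identity: same (caller, callee, reason, tool) => same ID.
+    // The '\n' delimiter and field order are assumptions of this sketch.
+    public static string Compute(string callerId, string calleeId, string reason, string toolVersion)
+    {
+        var canonical = string.Join('\n', callerId, calleeId, reason, toolVersion);
+        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
+        return Convert.ToHexString(hash).ToLowerInvariant();
+    }
+}
+```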
+
+### Practical guardrails
+
+* **Idempotence:** Edge IDs = `hash(callerID, calleeID, reason, tool-version)`, as sketched above. Re‑runs don’t duplicate.
+* **Explainability:** Every edge must say *why it exists* (pattern or witness).
+* **Stripped binaries:** fall back to pattern heuristics + patch oracles; mark edges **probabilistic** with separate attestation type.
+* **Hybrid truth:** Keep static and dynamic edges distinct; policies can require both for “reachable”.
+
+### How this helps your day‑to‑day
+
+* **Compliance**: Ship an SBOM/VEX plus a **proof pack**; auditors can verify offline.
+* **Triage**: For a CVE, show **the exact signed path** from entrypoint → vulnerable function; suppresses noisy “maybe‑reachable” claims.
+* **Vendor claims**: Accept third‑party edges only if they come with DSSE + Rekor inclusion.
+
+If you want, I can draft the **DSSE edge schema (JSON)**, the **manifest format**, and the **.NET 10 interfaces** (`IEdgeExtractor`, `IAttestor`, `IReplayer`, `ITransparencyClient`) so your mid‑level dev can start coding today.
+Here’s a concrete, “give this to a mid‑level .NET dev” implementation plan for the attested, offline‑verifiable call graph.
+
+I’ll assume:
+
+* Recent .NET (your “.NET 10”)
+* C#
+* You can add NuGet packages
+* You already have (or will have) an “Authority Signer” for DSSE signatures (file key, KMS, etc.)
+
+---
+
+## 0. Solution layout (what projects to create)
+
+Create a new solution, e.g. `StellaOps.CallGraph.sln` with:
+
+1. **`StellaOps.CallGraph.Core`** (Class Library)
+
+   * Domain models (functions, edges, manifests)
+   * Interfaces (`IBinaryParser`, `IEdgeExtractor`, `IAttestor`, `IRekorClient`, etc.)
+   * DSSE envelope and helpers
+
+2. **`StellaOps.CallGraph.BinaryParsers`** (Class Library)
+
+   * Implementations of `IBinaryParser` for:
+
+     * **PE/.NET assemblies** using `System.Reflection.Metadata` / `PEReader`([NuGet][1])
+     * Optionally native PE / ELF using `Microsoft.Binary.Parsers`([NuGet][2]) or `ELFSharp`([NuGet][3])
+
+3. **`StellaOps.CallGraph.EdgeExtraction`** (Class Library)
+
+   * Call‑graph builder / edge extractors (import table, IL call instructions, .ctors, etc.)
+
+4. **`StellaOps.CallGraph.Attestation`** (Class Library)
+
+   * DSSE helpers
+   * Attestation logic for edges + graph manifest
+   * Transparency log (Rekor) client
+
+5. **`StellaOps.CallGraph.Cli`** (Console app)
+
+   * Developer entrypoint: `callgraph analyze <path>`
+   * Outputs:
+
+     * Edge DSSE envelopes (one per edge, or batched)
+     * Graph manifest DSSE
+     * Human‑readable summary
+
+6. **`StellaOps.CallGraph.Tests`** (xUnit / NUnit)
+
+   * Unit tests per layer
+
+---
+
+## 1. Define the core domain (Core project)
+
+### 1.1 Records and enums
+
+Create these in `StellaOps.CallGraph.Core`:
+
+```csharp
+public sealed record BinaryIdentity(
+    string LogicalId,                            // e.g. build-id or image digest
+    string Path,                                 // local path used during analysis
+    string? BuildId,
+    string? ImageDigest,                         // e.g. OCI digest
+    IReadOnlyDictionary<string, string> Digests  // sha256, sha512, etc.
+);
+
+public sealed record FunctionRef(
+    string BinaryLogicalId,   // link to BinaryIdentity.LogicalId
+    ulong Rva,                // Relative virtual address (for native) or metadata token for managed
+    string? SymbolName,       // raw symbol if available
+    string? DisplayName,      // demangled, user-facing
+    string? Purl              // optional: pkg/function mapping
+);
+
+public enum EdgeReasonKind
+{
+    ImportTable,
+    StaticCall,        // direct call instruction
+    VirtualDispatch,   // via vtable / callvirt
+    InitArrayOrCtor,
+    ExceptionHandler,
+    DynamicWitness     // from traces
+}
+
+public sealed record EdgeReason(
+    EdgeReasonKind Kind,
+    string Detail      // e.g. ".text: call 0x401234", "import: kernel32!CreateFileW"
+);
+
+public sealed record ReachabilityEdge(
+    FunctionRef Caller,
+    FunctionRef Callee,
+    EdgeReason Reason,
+    string ToolVersion,
+    string PolicyHash,   // hash of lattice/policy
+    string EvidenceHash  // hash of raw evidence blob (CFG snippet, trace, etc.)
+);
+```
+
+Graph manifest:
+
+```csharp
+public sealed record CallGraphManifest(
+    string SchemaVersion,
+    BinaryIdentity Binary,
+    IReadOnlyList<FunctionRef> Roots,
+    IReadOnlyList<string> EdgeEnvelopeDigests, // sha256 of DSSE envelopes
+    string PolicyHash,
+    IReadOnlyDictionary<string, string> ToolMetadata
+);
+```
+
+### 1.2 Core interfaces
+
+```csharp
+public interface IBinaryParser
+{
+    BinaryIdentity Identify(string path);
+    IReadOnlyList<FunctionRef> GetFunctions(BinaryIdentity binary);
+    IReadOnlyList<FunctionRef> GetRoots(BinaryIdentity binary); // exports, entrypoint, handlers, etc.
+    BinaryCodeRegion GetCodeRegion(BinaryIdentity binary);      // raw bytes + mappings, see below
+}
+
+public sealed record BinaryCodeRegion(
+    byte[] Bytes,
+    ulong ImageBase,
+    IReadOnlyList<SectionInfo> Sections
+);
+
+public sealed record SectionInfo(
+    string Name,
+    ulong Rva,
+    uint Size
+);
+
+public interface IEdgeExtractor
+{
+    IReadOnlyList<ReachabilityEdge> Extract(
+        BinaryIdentity binary,
+        IReadOnlyList<FunctionRef> functions,
+        BinaryCodeRegion code);
+}
+
+public interface IAttestor
+{
+    Task<DsseEnvelope> SignEdgeAsync(
+        ReachabilityEdge edge,
+        BinaryIdentity binary,
+        CancellationToken ct = default);
+
+    Task<DsseEnvelope> SignManifestAsync(
+        CallGraphManifest manifest,
+        CancellationToken ct = default);
+}
+
+public interface IRekorClient
+{
+    Task<RekorEntryRef> UploadAsync(DsseEnvelope envelope, CancellationToken ct = default);
+}
+
+public sealed record RekorEntryRef(string LogId, long Index, string Uuid);
+```
+
+(We’ll define `DsseEnvelope` in section 3.)
+
+---
+
+## 2. Implement minimal PE parser (BinaryParsers project)
+
+Start with **PE/.NET** only; expand later.
+
+### 2.1 Add NuGet packages
+
+* `System.Reflection.Metadata` (if you’re not already on a shared framework that has it)([NuGet][1])
+* Optionally `Microsoft.Binary.Parsers` for native PE & ELF; it already knows how to parse PE headers and ELF.([NuGet][2])
+
+### 2.2 Implement `PeBinaryParser` (managed assemblies)
+
+In `StellaOps.CallGraph.BinaryParsers`:
+
+* `BinaryIdentity Identify(string path)`
+
+  * Open file, compute SHA‑256 (streaming).
+  * Use `PEReader` and `MetadataReader` to pull:
+
+    * MVID (`ModuleDefinition`).
+    * Assembly name, version.
+  * Derive `LogicalId`, e.g. `"dotnet:<assembly-name>/<mvid>"`.
+
+* `IReadOnlyList<FunctionRef> GetFunctions(...)`
+
+  * Use `PEReader` → `GetMetadataReader()` to enumerate methods:
+
+    * `reader.TypeDefinitions` → methods in each type.
+    * For each `MethodDefinition`, compute:
+
+      * `BinaryLogicalId = binary.LogicalId`
+      * `Rva = methodDef.RelativeVirtualAddress`
+      * `SymbolName = reader.GetString(methodDef.Name)`
+      * `DisplayName = typeFullName + "::" + methodName + signature`
+      * `Purl` optional mapping (you can fill later from SBOM).
+
+* `IReadOnlyList<FunctionRef> GetRoots(...)`
+
+  * Roots for .NET:
+
+    * `Main` methods in entry assembly.
+    * Public exported API if you want (public methods in public types).
+    * Static constructors (.cctor) for public types (init roots).
+  * Keep it simple for v1: treat `Main` as the only root.
+
+* `BinaryCodeRegion GetCodeRegion(...)`
+
+  * For managed assemblies, you only need IL for now:
+
+    * Use `PEReader.GetMethodBody(rva)` to get `MethodBodyBlock`.([Microsoft Learn][4])
+    * For v1, you can assemble per‑method IL as you go in the extractor instead of pre‑building a whole region.
+
+Implementation trick: have `PeBinaryParser` expose a helper:
+
+```csharp
+public MethodBodyBlock? TryGetMethodBody(BinaryIdentity binary, uint rva);
+```
+
+You’ll pass this down to the edge extractor.
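+
+To orient the implementer, here is a minimal sketch of `Identify` along the lines above. The `dotnet:<assembly-name>/<mvid>` format follows the convention suggested earlier, the remaining members are stubbed, and error handling is omitted — a sketch, not the final parser.
+
+```csharp
+using System.Reflection.Metadata;
+using System.Reflection.PortableExecutable;
+using System.Security.Cryptography;
+
+public sealed class PeBinaryParser : IBinaryParser
+{
+    public BinaryIdentity Identify(string path)
+    {
+        // Stream the file once for the content hash.
+        string sha256;
+        using (var hashStream = File.OpenRead(path))
+            sha256 = Convert.ToHexString(SHA256.HashData(hashStream)).ToLowerInvariant();
+
+        // Re-open for metadata; PEReader wants a seekable stream.
+        using var peStream = File.OpenRead(path);
+        using var pe = new PEReader(peStream);
+        var md = pe.GetMetadataReader();
+
+        var mvid = md.GetGuid(md.GetModuleDefinition().Mvid);
+        var name = md.GetString(md.GetAssemblyDefinition().Name);
+
+        return new BinaryIdentity(
+            LogicalId: $"dotnet:{name}/{mvid:N}",
+            Path: path,
+            BuildId: mvid.ToString("N"),
+            ImageDigest: null,
+            Digests: new Dictionary<string, string> { ["sha256"] = sha256 });
+    }
+
+    // Elided in this sketch; see the notes above for the intended behavior.
+    public IReadOnlyList<FunctionRef> GetFunctions(BinaryIdentity binary) => throw new NotImplementedException();
+    public IReadOnlyList<FunctionRef> GetRoots(BinaryIdentity binary) => throw new NotImplementedException();
+    public BinaryCodeRegion GetCodeRegion(BinaryIdentity binary) => throw new NotImplementedException();
+}
+```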
+
+### 2.3 (Optional) native PE/ELF
+
+Once managed assemblies work:
+
+* Add `Microsoft.Binary.Parsers` for PE + ELF.([NuGet][2])
+* Or `ELFSharp` if you prefer.([NuGet][3])
+
+You can then:
+
+* Parse import table → edges from “import stub” → imported function.
+* Parse export table → roots (exports).
+* Parse `.pdata`, `.xdata` → exception handlers.
+* Parse `.init_array` (ELF) / TLS callbacks, C runtime init functions.
+
+For an “average dev” first iteration, you can **skip native** and get a lot of value from .NET assemblies only.
+
+---
+
+## 3. DSSE attestation primitives (Attestation project)
+
+You already use DSSE elsewhere, but here’s a self‑contained minimal version.
+
+### 3.1 Envelope models
+
+```csharp
+public sealed record DsseSignature(
+    string KeyId,
+    string Sig   // base64 signature
+);
+
+public sealed record DsseEnvelope(
+    string PayloadType,                      // e.g. "application/vnd.stella.call-edge+json"
+    string Payload,                          // base64-encoded JSON statement
+    IReadOnlyList<DsseSignature> Signatures
+);
+```
+
+Statement for a **single edge**:
+
+```csharp
+public sealed record EdgeStatement(
+    string _type,     // e.g. "https://stella.ops/Statement/CallEdge/v1"
+    object subject,   // Binary info + maybe PURLs
+    ReachabilityEdge edge
+);
+```
+
+You can loosely follow the DSSE / in‑toto style: Google’s Grafeas `Envelope` type also matches DSSE’s `envelope.proto`.([Google Cloud][5])
+
+### 3.2 Pre‑authentication encoding (PAE)
+
+Implement DSSE PAE once:
+
+```csharp
+public static class Dsse
+{
+    public static byte[] PreAuthEncode(string payloadType, byte[] payload)
+    {
+        static byte[] Cat(params byte[][] parts)
+        {
+            var total = parts.Sum(p => p.Length);
+            var buf = new byte[total];
+            var offset = 0;
+            foreach (var part in parts)
+            {
+                Buffer.BlockCopy(part, 0, buf, offset, part.Length);
+                offset += part.Length;
+            }
+            return buf;
+        }
+
+        static byte[] Utf8(string s) => Encoding.UTF8.GetBytes(s);
+
+        var header = Utf8("DSSEv1");
+        var pt = Utf8(payloadType);
+        var lenPt = Utf8(pt.Length.ToString(CultureInfo.InvariantCulture));
+        var lenPayload = Utf8(payload.Length.ToString(CultureInfo.InvariantCulture));
+        var space = Utf8(" ");
+
+        return Cat(header, space, lenPt, space, pt, space, lenPayload, space, payload);
+    }
+}
+```
+
+### 3.3 Implement `IAttestor`
+
+Assume you already have some `IAuthoritySigner` that can sign arbitrary byte arrays (Ed25519, RSA, etc.).
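+
+Since `IAuthoritySigner` lives outside this plan, here is the minimal contract the code below (and the verifier in section 8) relies on — the method names are assumptions taken from the call sites in this document, not the real Authority API:
+
+```csharp
+public interface IAuthoritySigner
+{
+    // Sign raw bytes (here: the DSSE pre-authentication encoding).
+    Task<byte[]> SignAsync(byte[] data, CancellationToken ct = default);
+
+    // Logical key identifier placed into DsseSignature.KeyId.
+    Task<string> GetKeyIdAsync(CancellationToken ct = default);
+
+    // Used by the offline verifier (section 8).
+    Task<bool> VerifyAsync(byte[] data, byte[] signature, string keyId, CancellationToken ct = default);
+}
+```
+
+With that contract in place, the attestor itself: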
+
+```csharp
+public sealed class DsseAttestor : IAttestor
+{
+    private readonly IAuthoritySigner _signer;
+    private readonly JsonSerializerOptions _jsonOptions = new()
+    {
+        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
+    };
+
+    public DsseAttestor(IAuthoritySigner signer) => _signer = signer;
+
+    public async Task<DsseEnvelope> SignEdgeAsync(
+        ReachabilityEdge edge,
+        BinaryIdentity binary,
+        CancellationToken ct = default)
+    {
+        var stmt = new EdgeStatement(
+            _type: "https://stella.ops/Statement/CallEdge/v1",
+            subject: new
+            {
+                type = "file",
+                name = binary.Path,
+                digest = binary.Digests
+            },
+            edge: edge
+        );
+
+        return await SignStatementAsync(
+            stmt,
+            payloadType: "application/vnd.stella.call-edge+json",
+            ct);
+    }
+
+    public async Task<DsseEnvelope> SignManifestAsync(
+        CallGraphManifest manifest,
+        CancellationToken ct = default)
+    {
+        var stmt = new
+        {
+            _type = "https://stella.ops/Statement/CallGraphManifest/v1",
+            subject = new
+            {
+                type = "file",
+                name = manifest.Binary.Path,
+                digest = manifest.Binary.Digests
+            },
+            manifest
+        };
+
+        return await SignStatementAsync(
+            stmt,
+            payloadType: "application/vnd.stella.call-manifest+json",
+            ct);
+    }
+
+    private async Task<DsseEnvelope> SignStatementAsync(
+        object statement,
+        string payloadType,
+        CancellationToken ct)
+    {
+        var payloadBytes = JsonSerializer.SerializeToUtf8Bytes(statement, _jsonOptions);
+        var pae = Dsse.PreAuthEncode(payloadType, payloadBytes);
+
+        var signatureBytes = await _signer.SignAsync(pae, ct).ConfigureAwait(false);
+        var keyId = await _signer.GetKeyIdAsync(ct).ConfigureAwait(false);
+
+        return new DsseEnvelope(
+            PayloadType: payloadType,
+            Payload: Convert.ToBase64String(payloadBytes),
+            Signatures: new[]
+            {
+                new DsseSignature(keyId, Convert.ToBase64String(signatureBytes))
+            });
+    }
+}
+```
+
+You can plug in:
+
+* `IAuthoritySigner` using `System.Security.Cryptography.Ed25519` on .NET (or BouncyCastle) for signatures.([Stack Overflow][6])
+
+---
+
+## 4. Edge extraction (EdgeExtraction project)
+
+### 4.1 Choose strategy per binary type
+
+For **managed .NET assemblies** the easiest route is to use `Mono.Cecil` to read IL opcodes.([NuGet][7])
+
+Add package: `Mono.Cecil`.
+
+```csharp
+public sealed class ManagedIlEdgeExtractor : IEdgeExtractor
+{
+    public IReadOnlyList<ReachabilityEdge> Extract(
+        BinaryIdentity binary,
+        IReadOnlyList<FunctionRef> functions,
+        BinaryCodeRegion code)
+    {
+        // For managed we won't use BinaryCodeRegion; we’ll re-open file with Cecil.
+        var result = new List<ReachabilityEdge>();
+        var filePath = binary.Path;
+
+        var module = ModuleDefinition.ReadModule(filePath, new ReaderParameters
+        {
+            ReadSymbols = false
+        });
+
+        foreach (var type in module.Types)
+        foreach (var method in type.Methods.Where(m => m.HasBody))
+        {
+            var callerRef = ToFunctionRef(binary, method);
+
+            foreach (var instr in method.Body.Instructions)
+            {
+                if (instr.OpCode.FlowControl != FlowControl.Call)
+                    continue;
+
+                if (instr.Operand is not MethodReference calleeMethod)
+                    continue;
+
+                var calleeRef = ToFunctionRef(binary, calleeMethod);
+
+                var edge = new ReachabilityEdge(
+                    Caller: callerRef,
+                    Callee: calleeRef,
+                    Reason: new EdgeReason(
+                        EdgeReasonKind.StaticCall,
+                        Detail: $"IL {instr.OpCode} {calleeMethod.FullName}"
+                    ),
+                    ToolVersion: "stella-callgraph/0.1.0",
+                    PolicyHash: "TODO",
+                    EvidenceHash: "TODO"   // later: hash of snippet
+                );
+
+                result.Add(edge);
+            }
+        }
+
+        return result;
+    }
+
+    private static FunctionRef ToFunctionRef(BinaryIdentity binary, MethodReference method)
+    {
+        var displayName = $"{method.DeclaringType.FullName}::{method.Name}";
+        return new FunctionRef(
+            BinaryLogicalId: binary.LogicalId,
+            Rva: (ulong)method.MetadataToken.ToInt32(),
+            SymbolName: method.FullName,
+            DisplayName: displayName,
+            Purl: null
+        );
+    }
+}
+```
+
+Later, you can add:
+
+* Import table edges (`EdgeReasonKind.ImportTable`).
+* Virtual dispatch edges, heuristics, etc.
+* Dynamic edges from trace logs (`EdgeReasonKind.DynamicWitness`).
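+
+The solution layout reserves `StellaOps.CallGraph.Tests` for unit tests per layer; a minimal xUnit smoke test for this extractor could look like the sketch below. It analyzes the test assembly itself, so it assumes a normal (non-single-file) test run where `Assembly.Location` is populated:
+
+```csharp
+using Xunit;
+
+public sealed class ManagedIlEdgeExtractorTests
+{
+    [Fact]
+    public void Extract_FindsCallEdges_InOwnTestAssembly()
+    {
+        var path = typeof(ManagedIlEdgeExtractorTests).Assembly.Location;
+        var binary = new BinaryIdentity(
+            LogicalId: "dotnet:test",
+            Path: path,
+            BuildId: null,
+            ImageDigest: null,
+            Digests: new Dictionary<string, string>());
+
+        var edges = new ManagedIlEdgeExtractor().Extract(
+            binary,
+            functions: Array.Empty<FunctionRef>(),
+            code: new BinaryCodeRegion(Array.Empty<byte>(), 0, Array.Empty<SectionInfo>()));
+
+        // Any real assembly contains at least one call instruction.
+        Assert.NotEmpty(edges);
+        Assert.All(edges, e => Assert.Equal(EdgeReasonKind.StaticCall, e.Reason.Kind));
+    }
+}
+```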
+
+### 4.2 Call‑graph builder
+
+Add a thin orchestration service:
+
+```csharp
+public sealed class CallGraphBuilder
+{
+    private readonly IBinaryParser _parser;
+    private readonly IReadOnlyList<IEdgeExtractor> _extractors;
+
+    public CallGraphBuilder(
+        IBinaryParser parser,
+        IEnumerable<IEdgeExtractor> extractors)
+    {
+        _parser = parser;
+        _extractors = extractors.ToList();
+    }
+
+    public (BinaryIdentity binary,
+            IReadOnlyList<FunctionRef> functions,
+            IReadOnlyList<FunctionRef> roots,
+            IReadOnlyList<ReachabilityEdge> edges) Build(string path)
+    {
+        var binary = _parser.Identify(path);
+        var functions = _parser.GetFunctions(binary);
+        var roots = _parser.GetRoots(binary);
+
+        // Optionally, pack code region if needed
+        var code = new BinaryCodeRegion(Array.Empty<byte>(), 0, Array.Empty<SectionInfo>());
+
+        var edges = _extractors
+            .SelectMany(e => e.Extract(binary, functions, code))
+            .ToList();
+
+        return (binary, functions, roots, edges);
+    }
+}
+```
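+
+The wiring in section 5 below calls a `Crypto.HashSha256` helper that is not defined anywhere else in this plan; a minimal sketch follows. Lowercase hex is an assumption — any stable encoding works, as long as builder and verifier agree, since these digests feed the manifest and idempotency checks.
+
+```csharp
+using System.Security.Cryptography;
+
+public static class Crypto
+{
+    // Stable, lowercase-hex SHA-256; used for edge-envelope digests in the manifest.
+    public static string HashSha256(byte[] data)
+        => Convert.ToHexString(SHA256.HashData(data)).ToLowerInvariant();
+}
+```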
+
+---
+
+## 5. Edge→DSSE and manifest→DSSE wiring
+
+In `StellaOps.CallGraph.Attestation`, create a coordinator:
+
+```csharp
+public sealed class CallGraphAttestationService
+{
+    private readonly CallGraphBuilder _builder;
+    private readonly IAttestor _attestor;
+    private readonly IRekorClient _rekor;
+
+    public CallGraphAttestationService(
+        CallGraphBuilder builder,
+        IAttestor attestor,
+        IRekorClient rekor)
+    {
+        _builder = builder;
+        _attestor = attestor;
+        _rekor = rekor;
+    }
+
+    public async Task<CallGraphAttestationResult> AnalyzeAndAttestAsync(
+        string path,
+        CancellationToken ct = default)
+    {
+        var (binary, functions, roots, edges) = _builder.Build(path);
+
+        // 1) Sign each edge
+        var edgeEnvelopes = new List<DsseEnvelope>();
+        foreach (var edge in edges)
+        {
+            var env = await _attestor.SignEdgeAsync(edge, binary, ct);
+            edgeEnvelopes.Add(env);
+        }
+
+        // 2) Compute digests for manifest
+        var edgeEnvelopeDigests = edgeEnvelopes
+            .Select(e => Crypto.HashSha256(JsonSerializer.SerializeToUtf8Bytes(e)))
+            .ToList();
+
+        var manifest = new CallGraphManifest(
+            SchemaVersion: "1.0",
+            Binary: binary,
+            Roots: roots,
+            EdgeEnvelopeDigests: edgeEnvelopeDigests,
+            PolicyHash: edges.FirstOrDefault()?.PolicyHash ?? "",
+            ToolMetadata: new Dictionary<string, string>
+            {
+                ["builder"] = "stella-callgraph/0.1.0",
+                ["created-at"] = DateTimeOffset.UtcNow.ToString("O")
+            });
+
+        var manifestEnvelope = await _attestor.SignManifestAsync(manifest, ct);
+
+        // 3) Publish DSSE envelopes to Rekor (if configured)
+        var rekorRefs = new List<RekorEntryRef>();
+        foreach (var env in edgeEnvelopes.Append(manifestEnvelope))
+        {
+            var entry = await _rekor.UploadAsync(env, ct);
+            rekorRefs.Add(entry);
+        }
+
+        return new CallGraphAttestationResult(
+            Manifest: manifest,
+            ManifestEnvelope: manifestEnvelope,
+            EdgeEnvelopes: edgeEnvelopes,
+            RekorEntries: rekorRefs);
+    }
+}
+
+public sealed record CallGraphAttestationResult(
+    CallGraphManifest Manifest,
+    DsseEnvelope ManifestEnvelope,
+    IReadOnlyList<DsseEnvelope> EdgeEnvelopes,
+    IReadOnlyList<RekorEntryRef> RekorEntries);
+```
+
+---
+
+## 6. Rekor v2 client (transparency log)
+
+Rekor is a REST‑based transparency log (part of Sigstore).([Sigstore][8])
+
+For an average dev, keep it **simple**:
+
+1. Add `HttpClient`‑based `RekorClient`:
+
+   * `UploadAsync(DsseEnvelope)`:
+
+     * POST to your Rekor server’s `/api/v1/log/entries` (v1 today; v2 is under active development, but the pattern is similar).
+     * Store returned `logID`, `logIndex`, `uuid` in `RekorEntryRef`.
+
+2. For offline replay you’ll want to store:
+
+   * The DSSE envelopes.
+   * Rekor entry references (and ideally inclusion proofs, but that can come later).
+
+You don’t need to fully implement Merkle tree verification in v1; you can add that when you harden the verifier.
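+
+Following that stub-first order, a no-op client plus the skeleton of the HTTP client might look like the sketch below. The POST path matches the v1 pattern described above; the entry body schema depends on the Rekor entry kind you adopt, so it is deliberately left as a thrown placeholder rather than a guessed schema, and the response field names should be checked against your server version:
+
+```csharp
+using System.Net.Http.Json;
+using System.Text.Json;
+
+public sealed class DummyRekorClient : IRekorClient
+{
+    // Air-gap / test mode: no network I/O, just a fake reference.
+    public Task<RekorEntryRef> UploadAsync(DsseEnvelope envelope, CancellationToken ct = default)
+        => Task.FromResult(new RekorEntryRef("dummy-log", 0, Guid.NewGuid().ToString("N")));
+}
+
+public sealed class HttpRekorClient : IRekorClient
+{
+    private readonly HttpClient _http;
+
+    public HttpRekorClient(HttpClient http) => _http = http;
+
+    public async Task<RekorEntryRef> UploadAsync(DsseEnvelope envelope, CancellationToken ct = default)
+    {
+        // Map the DSSE envelope into the Rekor entry kind you use (e.g. dsse/intoto).
+        var entryBody = MapToRekorEntry(envelope);
+
+        using var response = await _http.PostAsJsonAsync("api/v1/log/entries", entryBody, ct);
+        response.EnsureSuccessStatusCode();
+
+        // Rekor v1 responds with a map of { uuid -> entry }; verify field names against your server.
+        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync(ct));
+        var entry = doc.RootElement.EnumerateObject().First();
+
+        return new RekorEntryRef(
+            LogId: entry.Value.GetProperty("logID").GetString() ?? "",
+            Index: entry.Value.GetProperty("logIndex").GetInt64(),
+            Uuid: entry.Name);
+    }
+
+    private static object MapToRekorEntry(DsseEnvelope envelope)
+        => throw new NotImplementedException(
+            "Shape this per the Rekor OpenAPI spec for the entry kind you choose.");
+}
+```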
+
+---
+
+## 7. CLI for developers (Cli project)
+
+A simple console app gives you fast feedback:
+
+```bash
+stella-callgraph analyze myapp.dll \
+  --output-dir artifacts/callgraph
+```
+
+Implementation sketch:
+
+```csharp
+static async Task<int> Main(string[] args)
+{
+    var input = args[1]; // TODO: proper parser
+
+    var services = Bootstrap(); // DI container
+
+    var svc = services.GetRequiredService<CallGraphAttestationService>();
+    var result = await svc.AnalyzeAndAttestAsync(input);
+
+    // Write DSSE envelopes & manifest as JSON files
+    var outDir = Path.Combine("artifacts", "callgraph");
+    Directory.CreateDirectory(outDir);
+
+    await File.WriteAllTextAsync(
+        Path.Combine(outDir, "manifest.dsse.json"),
+        JsonSerializer.Serialize(result.ManifestEnvelope, new JsonSerializerOptions { WriteIndented = true }));
+
+    for (var i = 0; i < result.EdgeEnvelopes.Count; i++)
+    {
+        var path = Path.Combine(outDir, $"edge-{i:D6}.dsse.json");
+        await File.WriteAllTextAsync(path,
+            JsonSerializer.Serialize(result.EdgeEnvelopes[i], new JsonSerializerOptions { WriteIndented = true }));
+    }
+
+    return 0;
+}
+```
+
+---
+
+## 8. Verifier (same libraries, different flow)
+
+Later (or in parallel), add a **verification** mode:
+
+1. Inputs:
+
+   * Binary file.
+   * Manifest DSSE file.
+   * Edge DSSE files.
+   * (Optionally) Rekor log inclusion proof bundle.
+
+2. Steps (same dev can implement):
+
+   * Verify DSSE signatures for manifest and edges (using `IAuthoritySigner.VerifyAsync`).
+   * Check:
+
+     * Manifest’s binary digest matches the current file.
+     * Manifest’s edge‑envelope digests match hashes of the provided DSSE edge files.
+   * Rebuild call graph using the same tool & policy version and diff against attested edges:
+
+     * For deterministic replay, their differences should be zero.
+   * Optionally:
+
+     * Ask Rekor for current log info and verify inclusion proof (advanced).
+
+---
+
+## 9. Order of work for a mid‑level .NET dev
+
+If you hand this as a sequence of tasks:
+
+1. **Core models & interfaces**
+
+   * Add domain records (`BinaryIdentity`, `FunctionRef`, `ReachabilityEdge`, `CallGraphManifest`).
+   * Add `IBinaryParser`, `IEdgeExtractor`, `IAttestor`, `IRekorClient`.
+
+2. **Managed PE parser**
+
+   * Implement `PeBinaryParser` using `System.Reflection.Metadata` (`PEReader`, `MetadataReader`).([NuGet][1])
+   * Return `BinaryIdentity`, a list of methods as `FunctionRef`, and roots (`Main`).
+
+3. **IL edge extractor**
+
+   * Add `Mono.Cecil`.
+   * Implement `ManagedIlEdgeExtractor` that:
+
+     * Iterates methods and IL instructions.
+     * Emits edges for `call` and `callvirt`.
+
+4. **CallGraphBuilder**
+
+   * Wire `.Build(path)` to use `PeBinaryParser` + `ManagedIlEdgeExtractor`.
+
+5. **DSSE library**
+
+   * Add `DsseEnvelope`, `DsseSignature`, `Dsse.PreAuthEncode`.
+   * Implement `DsseAttestor` that wraps `ReachabilityEdge` and `CallGraphManifest` into DSSE envelopes using an `IAuthoritySigner`.
+
+6. **Rekor client (stub, then real)**
+
+   * First: `DummyRekorClient` that just returns fake IDs.
+   * Then: `HttpRekorClient` that POSTs to your Rekor server.
+
+7. **CallGraphAttestationService + CLI**
+
+   * Implement `CallGraphAttestationService`.
+   * CLI command to:
+
+     * Run analysis.
+     * Write DSSE files + a human readable summary.
+
+8. **Verifier**
+
+   * Implement basic “offline verify” command:
+
+     * Verify DSSE signatures on manifest + edges.
+     * Verify manifest ↔ edge digest linkage.
+ +--- + +If you want, I can next: + +* Propose the **exact JSON schema** for `EdgeStatement` and `CallGraphManifest` (with sample instances). +* Or help turn this into a **Jira/Linear ticket breakdown** ready for your team. + +[1]: https://www.nuget.org/packages/system.reflection.metadata/?utm_source=chatgpt.com "System.Reflection.Metadata 10.0.0" +[2]: https://www.nuget.org/packages/Microsoft.Binary.Parsers?utm_source=chatgpt.com "Microsoft.Binary.Parsers 4.4.8" +[3]: https://www.nuget.org/packages/ELFSharp?utm_source=chatgpt.com "ELFSharp 2.17.3" +[4]: https://learn.microsoft.com/en-us/dotnet/api/system.reflection.portableexecutable.pereader?view=net-10.0&utm_source=chatgpt.com "PEReader Class (System.Reflection.PortableExecutable)" +[5]: https://cloud.google.com/dotnet/docs/reference/Grafeas.V1/latest/Grafeas.V1.Envelope?utm_source=chatgpt.com "Grafeas v1 API - Class Envelope (3.10.0) | .NET client library" +[6]: https://stackoverflow.com/questions/72152837/get-public-and-private-key-from-pem-ed25519-in-c-sharp?utm_source=chatgpt.com "Get public and private key from PEM ed25519 in C#" +[7]: https://www.nuget.org/packages/mono.cecil/?utm_source=chatgpt.com "Mono.Cecil 0.11.6" +[8]: https://docs.sigstore.dev/logging/overview/?utm_source=chatgpt.com "Rekor" diff --git a/docs/product-advisories/archived/15-Nov-2026 - embedded in-toto provenance events.md b/docs/product-advisories/archived/15-Nov-2025 - embedded in-toto provenance events.md similarity index 100% rename from docs/product-advisories/archived/15-Nov-2026 - embedded in-toto provenance events.md rename to docs/product-advisories/archived/15-Nov-2025 - embedded in-toto provenance events.md diff --git a/docs/product-advisories/archived/15-Nov-2026 - function-level vex explainability.md b/docs/product-advisories/archived/15-Nov-2025 - function-level vex explainability.md similarity index 100% rename from docs/product-advisories/archived/15-Nov-2026 - function-level vex explainability.md rename to docs/product-advisories/archived/15-Nov-2025 - function-level vex explainability.md diff --git a/docs/product-advisories/archived/15-Nov-2026 - ipal serdica census excel import blueprint.md b/docs/product-advisories/archived/15-Nov-2025 - ipal serdica census excel import blueprint.md similarity index 100% rename from docs/product-advisories/archived/15-Nov-2026 - ipal serdica census excel import blueprint.md rename to docs/product-advisories/archived/15-Nov-2025 - ipal serdica census excel import blueprint.md diff --git a/docs/product-advisories/archived/15-Nov-2026 - proof spine for explainable quiet alerts.md b/docs/product-advisories/archived/15-Nov-2025 - proof spine for explainable quiet alerts.md similarity index 100% rename from docs/product-advisories/archived/15-Nov-2026 - proof spine for explainable quiet alerts.md rename to docs/product-advisories/archived/15-Nov-2025 - proof spine for explainable quiet alerts.md diff --git a/docs/product-advisories/archived/15-Nov-2026 - scanner roadmap with deterministic diff-aware rescans.md b/docs/product-advisories/archived/15-Nov-2025 - scanner roadmap with deterministic diff-aware rescans.md similarity index 100% rename from docs/product-advisories/archived/15-Nov-2026 - scanner roadmap with deterministic diff-aware rescans.md rename to docs/product-advisories/archived/15-Nov-2025 - scanner roadmap with deterministic diff-aware rescans.md diff --git a/docs/product-advisories/archived/16-Nov-2026 - layer-sbom cache hash reuse.md b/docs/product-advisories/archived/16-Nov-2025 - layer-sbom 
cache hash reuse.md similarity index 100% rename from docs/product-advisories/archived/16-Nov-2026 - layer-sbom cache hash reuse.md rename to docs/product-advisories/archived/16-Nov-2025 - layer-sbom cache hash reuse.md diff --git a/docs/product-advisories/archived/16-Nov-2026 - multi-runtime reachability corpus.md b/docs/product-advisories/archived/16-Nov-2025 - multi-runtime reachability corpus.md similarity index 100% rename from docs/product-advisories/archived/16-Nov-2026 - multi-runtime reachability corpus.md rename to docs/product-advisories/archived/16-Nov-2025 - multi-runtime reachability corpus.md diff --git a/docs/product-advisories/archived/16-Nov-2026 - spdx canonical persistence cyclonedx interchange.md b/docs/product-advisories/archived/16-Nov-2025 - spdx canonical persistence cyclonedx interchange.md similarity index 100% rename from docs/product-advisories/archived/16-Nov-2026 - spdx canonical persistence cyclonedx interchange.md rename to docs/product-advisories/archived/16-Nov-2025 - spdx canonical persistence cyclonedx interchange.md diff --git a/docs/product-advisories/archived/16-Nov-2026 - validation plan for quiet scans provenance diff-ci.md b/docs/product-advisories/archived/16-Nov-2025 - validation plan for quiet scans provenance diff-ci.md similarity index 100% rename from docs/product-advisories/archived/16-Nov-2026 - validation plan for quiet scans provenance diff-ci.md rename to docs/product-advisories/archived/16-Nov-2025 - validation plan for quiet scans provenance diff-ci.md diff --git a/docs/product-advisories/archived/18-Nov-2026 - SBOM-Provenance-Spine.md b/docs/product-advisories/archived/17-Nov-2025 - SBOM-Provenance-Spine.md similarity index 96% rename from docs/product-advisories/archived/18-Nov-2026 - SBOM-Provenance-Spine.md rename to docs/product-advisories/archived/17-Nov-2025 - SBOM-Provenance-Spine.md index 303991e51..b6d9a0421 100644 --- a/docs/product-advisories/archived/18-Nov-2026 - SBOM-Provenance-Spine.md +++ b/docs/product-advisories/archived/17-Nov-2025 - SBOM-Provenance-Spine.md @@ -1,784 +1,785 @@ -Here’s a clean, air‑gap‑ready spine for turning container images into verifiable SBOMs and provenance—built to be idempotent and easy to slot into Stella Ops or any CI/CD. - -```mermaid -flowchart LR - A[OCI Image/Repo]-->B[Layer Extractor] - B-->C[Sbomer: CycloneDX/SPDX] - C-->D[DSSE Sign] - D-->E[in-toto Statement (SLSA Provenance)] - E-->F[Transparency Log Adapter] - C-->G[POST /sbom/ingest] - F-->H[POST /attest/verify] -``` - -### What this does (in plain words) - -* **Pull & crack the image** → extract layers, metadata (labels, env, history). -* **Build an SBOM** → emit **CycloneDX 1.6** and **SPDX 3.0.1** (pick one or both). -* **Sign artifacts** → wrap SBOM/provenance in **DSSE** envelopes. -* **Provenance** → generate **in‑toto Statement** with **SLSA Provenance v1** as the predicate. -* **Auditability** → optionally publish attestations to a transparency log (e.g., Rekor) so they’re tamper‑evident via Merkle proofs. -* **APIs are idempotent** → safe to re‑ingest the same image/SBOM/attestation without version churn. - -### Design notes you can hand to an agent - -* **Idempotency keys** - - * `contentAddress` = SHA256 of OCI manifest (or full image digest) - * `sbomHash` = SHA256 of normalized SBOM JSON - * `attHash` = SHA256 of DSSE payload (base64‑stable) - Store these; reject duplicates with HTTP 200 + `"status":"already_present"`. 
- -* **Default formats** - - * SBOM export: CycloneDX v1.6 (`application/vnd.cyclonedx+json`), SPDX 3.0.1 (`application/spdx+json`) - * DSSE envelope: `application/dsse+json` - * in‑toto Statement: `application/vnd.in-toto+json` with `predicateType` = SLSA Provenance v1 - -* **Air‑gap mode** - - * No external calls required; Rekor publish is optional. - * Keep a local Merkle log (pluggable) and allow later “sync‑to‑Rekor” when online. - -* **Transparency log adapter** - - * Interface: `Put(entry) -> {logIndex, logID, inclusionProof}` - * Backends: `rekor`, `local-merkle`, `null` (no‑op) - -### Minimal API sketch - -* `POST /sbom/ingest` - - * Body: `{ imageDigest, sbom, format, dsseSignature? }` - * Returns: `{ sbomId, status, sbomHash }` (status: `stored|already_present`) -* `POST /attest/verify` - - * Body: `{ dsseEnvelope, expectedSubjects:[{name, digest}] }` - * Verifies DSSE, checks in‑toto subject ↔ image digest, optionally records/logs. - * Returns: `{ verified:true, predicateType, logIndex?, inclusionProof? }` - -### CLI flow (pseudocode) - -```bash -# 1) Extract -stella-extract --image $IMG --out /work/extract - -# 2) SBOM (Cdx + SPDX) -stella-sbomer cdx --in /work/extract --out /work/sbom.cdx.json -stella-sbomer spdx --in /work/extract --out /work/sbom.spdx.json - -# 3) DSSE sign (offline keyring or HSM) -stella-sign dsse --in /work/sbom.cdx.json --out /work/sbom.cdx.dsse.json --key file:k.pem - -# 4) SLSA provenance (in‑toto Statement) -stella-provenance slsa-v1 --subject $IMG_DIGEST --materials /work/extract/manifest.json \ - --out /work/prov.dsse.json --key file:k.pem - -# 5) (optional) Publish to transparency log -stella-log publish --in /work/prov.dsse.json --backend rekor --rekor-url $REKOR -``` - -### Validation rules (quick) - -* **Subject binding**: in‑toto Statement `subject[].digest.sha256` must equal the OCI image digest you scanned. -* **Key policy**: enforce allowed issuers (Fulcio, internal CA, GOST/SM/EIDAS/FIPS as needed). -* **Normalization**: canonicalize JSON before hashing/signing to keep idempotency stable. - -### Why this matters - -* **Audit‑ready**: You can always prove *what* you scanned, *how* it was built, and *who* signed it. -* **Noise‑gated**: With deterministic SBOMs + provenance, downstream VEX/reachability gets much cleaner. -* **Drop‑in**: Works in harsh environments—offline, mirrors, sovereign crypto stacks—without changing your pipeline. - -If you want, I can generate: - -* a ready‑to‑use OpenAPI stub for `POST /sbom/ingest` and `POST /attest/verify`, -* C# (.NET 10) DSSE + in‑toto helpers (interfaces + test fixtures), -* or a Docker‑compose “air‑gap bundle” showing the full spine end‑to‑end. -Below is a full architecture plan you can hand to an agent as the “master spec” for implementing the SBOM & provenance spine (image → SBOM → DSSE → in-toto/SLSA → transparency log → REST APIs), with idempotent APIs and air-gap readiness. - ---- - -## 1. Scope and Objectives - -**Goal:** Implement a deterministic, air-gap-ready “SBOM spine” that: - -* Converts OCI images into SBOMs (CycloneDX 1.6 and SPDX 3.0.1). -* Generates SLSA v1 provenance wrapped in in-toto Statements. -* Signs all artifacts with DSSE envelopes using pluggable crypto providers. -* Optionally publishes attestations to transparency logs (Rekor/local-Merkle/none). -* Exposes stable, idempotent APIs: - - * `POST /sbom/ingest` - * `POST /attest/verify` -* Avoids versioning by design; APIs are extended, not versioned; all mutations are idempotent keyed by content digests. 
- -**Out of scope (for this iteration):** - -* Full vulnerability scanning (delegated to Scanner service). -* Policy evaluation / lattice logic (delegated to Scanner/Graph engine). -* Vendor-facing proof-market ledger and trust economics (future module). - ---- - -## 2. High-Level Architecture - -### 2.1 Logical Components - -1. **StellaOps.SupplyChain.Core (Library)** - - * Shared types and utilities: - - * Domain models: SBOM, DSSE, in-toto Statement, SLSA predicates. - * Canonicalization & hashing utilities. - * DSSE sign/verify abstractions. - * Transparency log entry model & Merkle proof verification. - -2. **StellaOps.Sbomer.Engine (Library)** - - * Image → SBOM functionality: - - * Layer & manifest analysis. - * SBOM generation: CycloneDX, SPDX. - * Extraction of metadata (labels, env, history). - * Deterministic ordering & normalization. - -3. **StellaOps.Provenance.Engine (Library)** - - * Build provenance & in-toto: - - * In-toto Statement generator. - * SLSA v1 provenance predicate builder. - * Subject and material resolution from image metadata & SBOM. - -4. **StellaOps.Authority (Service/Library)** - - * Crypto & keys: - - * Key management abstraction (file, HSM, KMS, sovereign crypto). - * DSSE signing & verification with multiple key types. - * Trust roots, certificate chains, key policies. - -5. **StellaOps.LogBridge (Service/Library)** - - * Transparency log adapter: - - * Rekor backend. - * Local Merkle log backend (for air-gap). - * Null backend (no-op). - * Merkle proof validation. - -6. **StellaOps.SupplyChain.Api (Service)** - - * The SBOM spine HTTP API: - - * `POST /sbom/ingest` - * `POST /attest/verify` - * Optionally: `GET /sbom/{id}`, `GET /attest/{id}`, `GET /image/{digest}/summary`. - * Performs orchestrations: - - * SBOM/attestation parsing, canonicalization, hashing. - * Idempotency and persistence. - * Delegation to Authority and LogBridge. - -7. **CLI Tools (optional but recommended)** - - * `stella-extract`, `stella-sbomer`, `stella-sign`, `stella-provenance`, `stella-log`. - * Thin wrappers over the above libraries; usable offline and in CI pipelines. - -8. **Persistence Layer** - - * Primary DB: PostgreSQL (or other RDBMS). - * Optional object storage: S3/MinIO for large SBOM/attestation blobs. - * Tables: `images`, `sboms`, `attestations`, `signatures`, `log_entries`, `keys`. - -### 2.2 Deployment View (Kubernetes / Docker) - -```mermaid -flowchart LR - subgraph Node1[Cluster Node] - A[StellaOps.SupplyChain.Api (ASP.NET Core)] - B[StellaOps.Authority Service] - C[StellaOps.LogBridge Service] - end - - subgraph Node2[Worker Node] - D[Runner / CI / Air-gap host] - E[CLI Tools\nstella-extract/sbomer/sign/provenance/log] - end - - F[(PostgreSQL)] - G[(Object Storage\nS3/MinIO)] - H[(Local Merkle Log\nor Rekor)] - - A --> F - A --> G - A --> C - A --> B - C --> H - E --> A -``` - -* **Air-gap mode:** - - * Rekor backend disabled; LogBridge uses local Merkle log (`H`) or `null`. - * All components run within the offline network. -* **Online mode:** - - * LogBridge talks to external Rekor instance using outbound HTTPS only. - ---- - -## 3. Domain Model and Storage Design - -Use EF Core 9 with PostgreSQL in .NET 10. - -### 3.1 Core Entities - -1. **ImageArtifact** - - * `Id` (GUID/ULID, internal). - * `ImageDigest` (string; OCI digest; UNIQUE). - * `Registry` (string). - * `Repository` (string). - * `Tag` (string, nullable, since digest is canonical). - * `FirstSeenAt` (timestamp). - * `MetadataJson` (JSONB; manifest, labels, env). - -2. 
**Sbom** - - * `Id` (string, primary key = `SbomHash` or derived ULID). - * `ImageArtifactId` (FK). - * `Format` (enum: `CycloneDX_1_6`, `SPDX_3_0_1`). - * `ContentHash` (string; normalized JSON SHA-256; UNIQUE with `TenantId`). - * `StorageLocation` (inline JSONB or external object storage key). - * `CreatedAt`. - * `Origin` (enum: `Generated`, `Uploaded`, `ExternalVendor`). - * Unique constraint: `(TenantId, ContentHash)`. - -3. **Attestation** - - * `Id` (string, primary key = `AttestationHash` or derived ULID). - * `ImageArtifactId` (FK). - * `Type` (enum: `InTotoStatement_SLSA_v1`, `Other`). - * `PayloadHash` (hash of DSSE payload, before envelope). - * `DsseEnvelopeHash` (hash of full DSSE JSON). - * `StorageLocation` (inline JSONB or object storage). - * `CreatedAt`. - * `Issuer` (string; signer identity / certificate subject). - * Unique constraint: `(TenantId, DsseEnvelopeHash)`. - -4. **SignatureInfo** - - * `Id` (GUID/ULID). - * `AttestationId` (FK). - * `KeyId` (logical key identifier). - * `Algorithm` (enum; includes PQ & sovereign algs). - * `VerifiedAt`. - * `VerificationStatus` (enum: `Valid`, `Invalid`, `Unknown`). - * `DetailsJson` (JSONB; trust-chain, error reasons, etc.). - -5. **TransparencyLogEntry** - - * `Id` (GUID/ULID). - * `AttestationId` (FK). - * `Backend` (enum: `Rekor`, `LocalMerkle`). - * `LogIndex` (string). - * `LogId` (string). - * `InclusionProofJson` (JSONB). - * `RecordedAt`. - * Unique constraint: `(Backend, LogId, LogIndex)`. - -6. **KeyRecord** (optional if not reusing Authority’s DB) - - * `KeyId` (string, PK). - * `KeyType` (enum). - * `Usage` (enum: `Signing`, `Verification`, `Both`). - * `Status` (enum: `Active`, `Retired`, `Revoked`). - * `MetadataJson` (JSONB; KMS ARN, HSM slot, etc.). - -### 3.2 Idempotency Keys - -* SBOM: - - * `sbomHash = SHA256(canonicalJson(sbom))`. - * Uniqueness enforced by `(TenantId, sbomHash)` in DB. -* Attestation: - - * `attHash = SHA256(canonicalJson(dsse.payload))` or full envelope. - * Uniqueness enforced by `(TenantId, attHash)` in DB. -* Image: - - * `imageDigest` is globally unique (per OCI spec). - ---- - -## 4. Service-Level Architecture - -### 4.1 StellaOps.SupplyChain.Api (.NET 10, ASP.NET Core) - -**Responsibilities:** - -* Expose HTTP API for ingest / verify. -* Handle idempotency logic & persistence. -* Delegate cryptographic operations to Authority. -* Delegate transparency logging to LogBridge. -* Perform basic validation against schemas (SBOM, DSSE, in-toto, SLSA). - -**Key Endpoints:** - -1. `POST /sbom/ingest` - - * Request: - - * `imageDigest` (string). - * `sbom` (raw JSON). - * `format` (enum/string). - * Optional: `dsseSignature` or `dsseEnvelope`. - * Behavior: - - * Parse & validate SBOM structure. - * Canonicalize JSON, compute `sbomHash`. - * If `sbomHash` exists for `imageDigest` and tenant: - - * Return `200` with `{ status: "already_present", sbomId, sbomHash }`. - * Else: - - * Persist `Sbom` entity. - * Optionally verify DSSE signature via Authority. - * Return `201` with `{ status: "stored", sbomId, sbomHash }`. - -2. `POST /attest/verify` - - * Request: - - * `dsseEnvelope` (JSON). - * `expectedSubjects` (list of `{ name, digest }`). - * Behavior: - - * Canonicalize payload, compute `attHash`. - * Verify DSSE signature via Authority. - * Parse in-toto Statement; ensure `subject[].digest.sha256` matches `expectedSubjects`. - * Persist `Attestation` & `SignatureInfo`. - * If configured, call LogBridge to publish and store `TransparencyLogEntry`. 
- * If `attHash` already exists: - - * Return `200` with `status: "already_present"` and existing references. - * Else, return `201` with `verified:true`, plus log info when available. - -3. Optional read APIs: - - * `GET /sbom/by-image/{digest}` - * `GET /attest/by-image/{digest}` - * `GET /image/{digest}/summary` (SBOM + attestations + log status). - -### 4.2 StellaOps.Sbomer.Engine - -**Responsibilities:** - -* Given: - - * OCI image manifest & layers (from local tarball or remote registry). -* Produce: - - * CycloneDX 1.6 JSON. - * SPDX 3.0.1 JSON. - -**Design:** - -* Use layered analyzers: - - * `ILayerAnalyzer` for generic filesystem traversal. - * Language-specific analyzers (optional for SBOM detail): - - * `DotNetAnalyzer`, `NodeJsAnalyzer`, `PythonAnalyzer`, `JavaAnalyzer`, `PhpAnalyzer`, etc. -* Determinism: - - * Sort all lists (components, dependencies) by stable keys. - * Remove unstable fields (timestamps, machine IDs, ephemeral paths). - * Provide `Normalize()` method per format that returns canonical JSON. - -### 4.3 StellaOps.Provenance.Engine - -**Responsibilities:** - -* Build in-toto Statement with SLSA v1 predicate: - - * `subject` derived from image digest(s). - * `materials` from: - - * Git commit, tag, builder image, SBOM components if available. -* Ensure determinism: - - * Sort materials by URI + digest. - * Normalize nested maps. - -**Key APIs (internal library):** - -* `InTotoStatement BuildSlsaProvenance(ImageArtifact image, Sbom sbom, ProvenanceContext ctx)` -* `string ToCanonicalJson(InTotoStatement stmt)` - -### 4.4 StellaOps.Authority - -**Responsibilities:** - -* DSSE signing & verification. -* Key management abstraction. -* Policy enforcement (which keys/trust roots are allowed). - -**Interfaces:** - -* `ISigningProvider` - - * `Task SignAsync(byte[] payload, string payloadType, string keyId)` -* `IVerificationProvider` - - * `Task VerifyAsync(DsseEnvelope envelope, VerificationPolicy policy)` - -**Backends:** - -* File-based keys (PEM). -* HSM/KMS (AWS KMS, Azure Key Vault, on-prem HSM). -* Sovereign crypto providers (GOST, SMx, etc.). -* Optional PQ providers (Dilithium, Falcon). - -### 4.5 StellaOps.LogBridge - -**Responsibilities:** - -* Abstract interaction with transparency logs. - -**Interface:** - -* `ILogBackend` - - * `Task PutAsync(byte[] canonicalPayloadHash, DsseEnvelope env)` - * `Task VerifyInclusionAsync(LogEntryResult entry)` - -**Backends:** - -* `RekorBackend`: - - * Calls Rekor REST API with hashed payload. -* `LocalMerkleBackend`: - - * Maintains Merkle tree in local DB. - * Returns `logIndex`, `logId`, and inclusion proof. -* `NullBackend`: - - * Returns empty/no-op results. - -### 4.6 CLI Tools (Optional) - -Use the same libraries as the services: - -* `stella-extract`: - - * Input: image reference. - * Output: local tarball + manifest JSON. -* `stella-sbomer`: - - * Input: manifest & layers. - * Output: SBOM JSON. -* `stella-sign`: - - * Input: JSON file. - * Output: DSSE envelope. -* `stella-provenance`: - - * Input: image digest, build metadata. - * Output: signed in-toto/SLSA DSSE. -* `stella-log`: - - * Input: DSSE envelope. - * Output: log entry details. - ---- - -## 5. 
End-to-End Flows - -### 5.1 SBOM Ingest (Upload Path) - -```mermaid -sequenceDiagram - participant Client - participant API as SupplyChain.Api - participant Core as SupplyChain.Core - participant DB as PostgreSQL - - Client->>API: POST /sbom/ingest (imageDigest, sbom, format) - API->>Core: Validate & canonicalize SBOM - Core-->>API: sbomHash - API->>DB: SELECT Sbom WHERE sbomHash & imageDigest - DB-->>API: Not found - API->>DB: INSERT Sbom (sbomHash, imageDigest, content) - DB-->>API: ok - API-->>Client: 201 { status:"stored", sbomId, sbomHash } -``` - -Re-ingest of the same SBOM repeats steps up to SELECT, then returns `status:"already_present"` with `200`. - -### 5.2 Attestation Verify & Record - -```mermaid -sequenceDiagram - participant Client - participant API as SupplyChain.Api - participant Auth as Authority - participant Log as LogBridge - participant DB as PostgreSQL - - Client->>API: POST /attest/verify (dsseEnvelope, expectedSubjects) - API->>Auth: Verify DSSE (keys, policy) - Auth-->>API: VerificationResult(Valid/Invalid) - API->>API: Parse in-toto, check subjects vs expected - API->>DB: SELECT Attestation WHERE attHash - DB-->>API: Not found - API->>DB: INSERT Attestation + SignatureInfo - alt Logging enabled - API->>Log: PutAsync(attHash, envelope) - Log-->>API: LogEntryResult(logIndex, logId, proof) - API->>DB: INSERT TransparencyLogEntry - end - API-->>Client: 201 { verified:true, attestationId, logIndex?, inclusionProof? } -``` - -If attestation already exists, API returns `200` with `status:"already_present"`. - ---- - -## 6. Idempotency and Determinism Strategy - -1. **Canonicalization rules:** - - * Remove insignificant whitespace. - * Sort all object keys lexicographically. - * Sort arrays where order is not semantically meaningful (components, materials). - * Strip non-deterministic fields (timestamps, random IDs) where allowed. - -2. **Hashing:** - - * Always hash canonical JSON as UTF-8. - * Use SHA-256 for core IDs; allow crypto provider to also compute other digests if needed. - -3. **Persistence:** - - * Enforce uniqueness in DB via indices on: - - * `(TenantId, ContentHash)` for SBOMs. - * `(TenantId, AttHash)` for attestations. - * `(Backend, LogId, LogIndex)` for log entries. - * API behavior: - - * Existing row → `200` with `"already_present"`. - * New row → `201` with `"stored"`. - -4. **API design:** - - * No version numbers in path. - * Add fields over time; never break or repurpose existing ones. - * Use explicit capability discovery via `GET /meta/capabilities` if needed. - ---- - -## 7. Air-Gap Mode and Synchronization - -### 7.1 Air-Gap Mode - -* Configuration flag `Mode = Offline` on SupplyChain.Api. -* LogBridge backend: - - * Default to `LocalMerkle` or `Null`. -* Rekor-specific configuration disabled or absent. -* DB & Merkle log stored locally inside the secure network. - -### 7.2 Later Synchronization to Rekor (Optional Future Step) - -Not mandatory for first iteration, but prepare for: - -* Background job (Scheduler module) that: - - * Enumerates local `TransparencyLogEntry` not yet exported. - * Publishes hashed payloads to Rekor when network is available. - * Stores mapping between local log entries and remote Rekor entries. - ---- - -## 8. Security, Access Control, and Observability - -### 8.1 Security - -* mTLS between internal services (SupplyChain.Api, Authority, LogBridge). -* Authentication: - - * API keys/OIDC for clients. - * Per-tenant scoping; `TenantId` must be present in context. 
-* Authorization: - - * RBAC: which tenants/users can write/verify/only read. - -### 8.2 Crypto Policies - -* Policy object defines: - - * Allowed key types and algorithms. - * Trust roots (Fulcio, internal CA, sovereign PKI). - * Revocation checking strategy (CRL/OCSP, offline lists). -* Authority enforces policies; SupplyChain.Api only consumes `VerificationResult`. - -### 8.3 Observability - -* Logs: - - * Structured logs with correlation IDs; log imageDigest, sbomHash, attHash. -* Metrics: - - * SBOM ingest count, dedup hit rate. - * Attestation verify latency. - * Transparency log publish success/failure counts. -* Traces: - - * OpenTelemetry tracing across API → Authority → LogBridge. - ---- - -## 9. Implementation Plan (Epics & Work Packages) - -You can give this section directly to agents to split. - -### Epic 1: Core Domain & Canonicalization - -1. Define .NET 10 solution structure: - - * Projects: - - * `StellaOps.SupplyChain.Core` - * `StellaOps.Sbomer.Engine` - * `StellaOps.Provenance.Engine` - * `StellaOps.SupplyChain.Api` - * `StellaOps.Authority` (if not already present) - * `StellaOps.LogBridge` -2. Implement core domain models: - - * SBOM, DSSE, in-toto, SLSA v1. -3. Implement canonicalization & hashing utilities. -4. Unit tests: - - * Given semantically equivalent JSON, hashes must match. - * Negative tests where order changes but meaning does not. - -### Epic 2: Persistence Layer - -1. Design EF Core models for: - - * ImageArtifact, Sbom, Attestation, SignatureInfo, TransparencyLogEntry, KeyRecord. -2. Write migrations for PostgreSQL. -3. Implement repository interfaces for read/write. -4. Tests: - - * Unique constraints and idempotency behavior. - * Query performance for common access paths (by imageDigest). - -### Epic 3: SBOM Engine - -1. Implement minimal layer analysis: - - * Accepts local tarball or path (for now). -2. Implement CycloneDX 1.6 generator. -3. Implement SPDX 3.0.1 generator. -4. Deterministic normalization across formats. -5. Tests: - - * Golden files for images → SBOM output. - * Stability under repeated runs. - -### Epic 4: Provenance Engine - -1. Implement in-toto Statement model with SLSA v1 predicate. -2. Implement builder to map: - - * ImageDigest → subject. - * Build metadata → materials. -3. Deterministic canonicalization. -4. Tests: - - * Golden in-toto/SLSA statements for sample inputs. - * Subject matching logic. - -### Epic 5: Authority Integration - -1. Implement `ISigningProvider`, `IVerificationProvider` contracts. -2. Implement file-based key backend as default. -3. Implement DSSE wrapper: - - * `SignAsync(payload, payloadType, keyId)`. - * `VerifyAsync(envelope, policy)`. -4. Tests: - - * DSSE round-trip; invalid signature scenarios. - * Policy enforcement tests. - -### Epic 6: Transparency Log Bridge - -1. Implement `ILogBackend` interface. -2. Implement `LocalMerkleBackend`: - - * Simple Merkle tree with DB storage. -3. Implement `NullBackend`. -4. Define configuration model to select backend. -5. (Optional later) Implement `RekorBackend`. -6. Tests: - - * Stable Merkle root; inclusion proof verification. - -### Epic 7: SupplyChain.Api - -1. Implement `POST /sbom/ingest`: - - * Request/response DTOs. - * Integration with canonicalization, persistence, idempotency logic. -2. Implement `POST /attest/verify`: - - * End-to-end verification and persistence. - * Integration with Authority and LogBridge. -3. Optional read APIs. -4. Add input validation (JSON schema, basic constraints). -5. 
Integration tests: - - * Full flows for new and duplicate inputs. - * Error cases (invalid DSSE, subject mismatch). - -### Epic 8: CLI Tools - -1. Implement `stella-sbomer` (wraps Sbomer.Engine). -2. Implement `stella-provenance` (wraps Provenance.Engine + Authority). -3. Implement `stella-sign` and `stella-log`. -4. Provide clear help/usage and sample scripts. - -### Epic 9: Hardening, Air-Gap Profile, and Docs - -1. Configuration profiles: - - * `Offline` vs `Online`. - * Log backend selection. -2. Security hardening: - - * mTLS, authentication, authorization. -3. Observability: - - * Metrics, logs, traces wiring. -4. Documentation: - - * API reference. - * Sequence diagrams. - * Deployment recipes for: - - * Single-node air-gap. - * Clustered online deployment. - ---- - -If you want, next step I can: - -* Turn this into an AGENTS/TASKS/PROMPT set for your codex workers, or -* Produce concrete .NET 10 project skeletons (csproj layout, folder structure, and initial interfaces) for the core libraries and API service. + +Here’s a clean, air‑gap‑ready spine for turning container images into verifiable SBOMs and provenance—built to be idempotent and easy to slot into Stella Ops or any CI/CD. + +```mermaid +flowchart LR + A[OCI Image/Repo]-->B[Layer Extractor] + B-->C[Sbomer: CycloneDX/SPDX] + C-->D[DSSE Sign] + D-->E[in-toto Statement (SLSA Provenance)] + E-->F[Transparency Log Adapter] + C-->G[POST /sbom/ingest] + F-->H[POST /attest/verify] +``` + +### What this does (in plain words) + +* **Pull & crack the image** → extract layers, metadata (labels, env, history). +* **Build an SBOM** → emit **CycloneDX 1.6** and **SPDX 3.0.1** (pick one or both). +* **Sign artifacts** → wrap SBOM/provenance in **DSSE** envelopes. +* **Provenance** → generate **in‑toto Statement** with **SLSA Provenance v1** as the predicate. +* **Auditability** → optionally publish attestations to a transparency log (e.g., Rekor) so they’re tamper‑evident via Merkle proofs. +* **APIs are idempotent** → safe to re‑ingest the same image/SBOM/attestation without version churn. + +### Design notes you can hand to an agent + +* **Idempotency keys** + + * `contentAddress` = SHA256 of OCI manifest (or full image digest) + * `sbomHash` = SHA256 of normalized SBOM JSON + * `attHash` = SHA256 of DSSE payload (base64‑stable) + Store these; reject duplicates with HTTP 200 + `"status":"already_present"`. + +* **Default formats** + + * SBOM export: CycloneDX v1.6 (`application/vnd.cyclonedx+json`), SPDX 3.0.1 (`application/spdx+json`) + * DSSE envelope: `application/dsse+json` + * in‑toto Statement: `application/vnd.in-toto+json` with `predicateType` = SLSA Provenance v1 + +* **Air‑gap mode** + + * No external calls required; Rekor publish is optional. + * Keep a local Merkle log (pluggable) and allow later “sync‑to‑Rekor” when online. + +* **Transparency log adapter** + + * Interface: `Put(entry) -> {logIndex, logID, inclusionProof}` + * Backends: `rekor`, `local-merkle`, `null` (no‑op) + +### Minimal API sketch + +* `POST /sbom/ingest` + + * Body: `{ imageDigest, sbom, format, dsseSignature? }` + * Returns: `{ sbomId, status, sbomHash }` (status: `stored|already_present`) +* `POST /attest/verify` + + * Body: `{ dsseEnvelope, expectedSubjects:[{name, digest}] }` + * Verifies DSSE, checks in‑toto subject ↔ image digest, optionally records/logs. + * Returns: `{ verified:true, predicateType, logIndex?, inclusionProof? 
}` + +### CLI flow (pseudocode) + +```bash +# 1) Extract +stella-extract --image $IMG --out /work/extract + +# 2) SBOM (Cdx + SPDX) +stella-sbomer cdx --in /work/extract --out /work/sbom.cdx.json +stella-sbomer spdx --in /work/extract --out /work/sbom.spdx.json + +# 3) DSSE sign (offline keyring or HSM) +stella-sign dsse --in /work/sbom.cdx.json --out /work/sbom.cdx.dsse.json --key file:k.pem + +# 4) SLSA provenance (in‑toto Statement) +stella-provenance slsa-v1 --subject $IMG_DIGEST --materials /work/extract/manifest.json \ + --out /work/prov.dsse.json --key file:k.pem + +# 5) (optional) Publish to transparency log +stella-log publish --in /work/prov.dsse.json --backend rekor --rekor-url $REKOR +``` + +### Validation rules (quick) + +* **Subject binding**: in‑toto Statement `subject[].digest.sha256` must equal the OCI image digest you scanned. +* **Key policy**: enforce allowed issuers (Fulcio, internal CA, GOST/SM/EIDAS/FIPS as needed). +* **Normalization**: canonicalize JSON before hashing/signing to keep idempotency stable. + +### Why this matters + +* **Audit‑ready**: You can always prove *what* you scanned, *how* it was built, and *who* signed it. +* **Noise‑gated**: With deterministic SBOMs + provenance, downstream VEX/reachability gets much cleaner. +* **Drop‑in**: Works in harsh environments—offline, mirrors, sovereign crypto stacks—without changing your pipeline. + +If you want, I can generate: + +* a ready‑to‑use OpenAPI stub for `POST /sbom/ingest` and `POST /attest/verify`, +* C# (.NET 10) DSSE + in‑toto helpers (interfaces + test fixtures), +* or a Docker‑compose “air‑gap bundle” showing the full spine end‑to‑end. +Below is a full architecture plan you can hand to an agent as the “master spec” for implementing the SBOM & provenance spine (image → SBOM → DSSE → in-toto/SLSA → transparency log → REST APIs), with idempotent APIs and air-gap readiness. + +--- + +## 1. Scope and Objectives + +**Goal:** Implement a deterministic, air-gap-ready “SBOM spine” that: + +* Converts OCI images into SBOMs (CycloneDX 1.6 and SPDX 3.0.1). +* Generates SLSA v1 provenance wrapped in in-toto Statements. +* Signs all artifacts with DSSE envelopes using pluggable crypto providers. +* Optionally publishes attestations to transparency logs (Rekor/local-Merkle/none). +* Exposes stable, idempotent APIs: + + * `POST /sbom/ingest` + * `POST /attest/verify` +* Avoids versioning by design; APIs are extended, not versioned; all mutations are idempotent keyed by content digests. + +**Out of scope (for this iteration):** + +* Full vulnerability scanning (delegated to Scanner service). +* Policy evaluation / lattice logic (delegated to Scanner/Graph engine). +* Vendor-facing proof-market ledger and trust economics (future module). + +--- + +## 2. High-Level Architecture + +### 2.1 Logical Components + +1. **StellaOps.SupplyChain.Core (Library)** + + * Shared types and utilities: + + * Domain models: SBOM, DSSE, in-toto Statement, SLSA predicates. + * Canonicalization & hashing utilities. + * DSSE sign/verify abstractions. + * Transparency log entry model & Merkle proof verification. + +2. **StellaOps.Sbomer.Engine (Library)** + + * Image → SBOM functionality: + + * Layer & manifest analysis. + * SBOM generation: CycloneDX, SPDX. + * Extraction of metadata (labels, env, history). + * Deterministic ordering & normalization. + +3. **StellaOps.Provenance.Engine (Library)** + + * Build provenance & in-toto: + + * In-toto Statement generator. + * SLSA v1 provenance predicate builder. 
+     * Subject and material resolution from image metadata & SBOM.
+
+4. **StellaOps.Authority (Service/Library)**
+
+   * Crypto & keys:
+
+     * Key management abstraction (file, HSM, KMS, sovereign crypto).
+     * DSSE signing & verification with multiple key types.
+     * Trust roots, certificate chains, key policies.
+
+5. **StellaOps.LogBridge (Service/Library)**
+
+   * Transparency log adapter:
+
+     * Rekor backend.
+     * Local Merkle log backend (for air-gap).
+     * Null backend (no-op).
+     * Merkle proof validation.
+
+6. **StellaOps.SupplyChain.Api (Service)**
+
+   * The SBOM spine HTTP API:
+
+     * `POST /sbom/ingest`
+     * `POST /attest/verify`
+     * Optionally: `GET /sbom/{id}`, `GET /attest/{id}`, `GET /image/{digest}/summary`.
+   * Performs orchestrations:
+
+     * SBOM/attestation parsing, canonicalization, hashing.
+     * Idempotency and persistence.
+     * Delegation to Authority and LogBridge.
+
+7. **CLI Tools (optional but recommended)**
+
+   * `stella-extract`, `stella-sbomer`, `stella-sign`, `stella-provenance`, `stella-log`.
+   * Thin wrappers over the above libraries; usable offline and in CI pipelines.
+
+8. **Persistence Layer**
+
+   * Primary DB: PostgreSQL (or other RDBMS).
+   * Optional object storage: S3/MinIO for large SBOM/attestation blobs.
+   * Tables: `images`, `sboms`, `attestations`, `signatures`, `log_entries`, `keys`.
+
+### 2.2 Deployment View (Kubernetes / Docker)
+
+```mermaid
+flowchart LR
+  subgraph Node1[Cluster Node]
+    A["StellaOps.SupplyChain.Api (ASP.NET Core)"]
+    B[StellaOps.Authority Service]
+    C[StellaOps.LogBridge Service]
+  end
+
+  subgraph Node2[Worker Node]
+    D[Runner / CI / Air-gap host]
+    E["CLI Tools<br/>stella-extract/sbomer/sign/provenance/log"]
+  end
+
+  F[(PostgreSQL)]
+  G[("Object Storage<br/>S3/MinIO")]
+  H[("Local Merkle Log<br/>or Rekor")]
+
+  A --> F
+  A --> G
+  A --> C
+  A --> B
+  C --> H
+  E --> A
+```
+
+* **Air-gap mode:**
+
+  * Rekor backend disabled; LogBridge uses local Merkle log (`H`) or `null`.
+  * All components run within the offline network.
+* **Online mode:**
+
+  * LogBridge talks to external Rekor instance using outbound HTTPS only.
+
+---
+
+## 3. Domain Model and Storage Design
+
+Use EF Core 9 with PostgreSQL in .NET 10.
+
+### 3.1 Core Entities
+
+1. **ImageArtifact**
+
+   * `Id` (GUID/ULID, internal).
+   * `ImageDigest` (string; OCI digest; UNIQUE).
+   * `Registry` (string).
+   * `Repository` (string).
+   * `Tag` (string, nullable, since digest is canonical).
+   * `FirstSeenAt` (timestamp).
+   * `MetadataJson` (JSONB; manifest, labels, env).
+
+2. **Sbom**
+
+   * `Id` (string, primary key = `SbomHash` or derived ULID).
+   * `ImageArtifactId` (FK).
+   * `Format` (enum: `CycloneDX_1_6`, `SPDX_3_0_1`).
+   * `ContentHash` (string; normalized JSON SHA-256; UNIQUE with `TenantId`).
+   * `StorageLocation` (inline JSONB or external object storage key).
+   * `CreatedAt`.
+   * `Origin` (enum: `Generated`, `Uploaded`, `ExternalVendor`).
+   * Unique constraint: `(TenantId, ContentHash)`.
+
+3. **Attestation**
+
+   * `Id` (string, primary key = `AttestationHash` or derived ULID).
+   * `ImageArtifactId` (FK).
+   * `Type` (enum: `InTotoStatement_SLSA_v1`, `Other`).
+   * `PayloadHash` (hash of DSSE payload, before envelope).
+   * `DsseEnvelopeHash` (hash of full DSSE JSON).
+   * `StorageLocation` (inline JSONB or object storage).
+   * `CreatedAt`.
+   * `Issuer` (string; signer identity / certificate subject).
+   * Unique constraint: `(TenantId, DsseEnvelopeHash)`.
+
+4. **SignatureInfo**
+
+   * `Id` (GUID/ULID).
+   * `AttestationId` (FK).
+   * `KeyId` (logical key identifier).
+ * `Algorithm` (enum; includes PQ & sovereign algs). + * `VerifiedAt`. + * `VerificationStatus` (enum: `Valid`, `Invalid`, `Unknown`). + * `DetailsJson` (JSONB; trust-chain, error reasons, etc.). + +5. **TransparencyLogEntry** + + * `Id` (GUID/ULID). + * `AttestationId` (FK). + * `Backend` (enum: `Rekor`, `LocalMerkle`). + * `LogIndex` (string). + * `LogId` (string). + * `InclusionProofJson` (JSONB). + * `RecordedAt`. + * Unique constraint: `(Backend, LogId, LogIndex)`. + +6. **KeyRecord** (optional if not reusing Authority’s DB) + + * `KeyId` (string, PK). + * `KeyType` (enum). + * `Usage` (enum: `Signing`, `Verification`, `Both`). + * `Status` (enum: `Active`, `Retired`, `Revoked`). + * `MetadataJson` (JSONB; KMS ARN, HSM slot, etc.). + +### 3.2 Idempotency Keys + +* SBOM: + + * `sbomHash = SHA256(canonicalJson(sbom))`. + * Uniqueness enforced by `(TenantId, sbomHash)` in DB. +* Attestation: + + * `attHash = SHA256(canonicalJson(dsse.payload))` or full envelope. + * Uniqueness enforced by `(TenantId, attHash)` in DB. +* Image: + + * `imageDigest` is globally unique (per OCI spec). + +--- + +## 4. Service-Level Architecture + +### 4.1 StellaOps.SupplyChain.Api (.NET 10, ASP.NET Core) + +**Responsibilities:** + +* Expose HTTP API for ingest / verify. +* Handle idempotency logic & persistence. +* Delegate cryptographic operations to Authority. +* Delegate transparency logging to LogBridge. +* Perform basic validation against schemas (SBOM, DSSE, in-toto, SLSA). + +**Key Endpoints:** + +1. `POST /sbom/ingest` + + * Request: + + * `imageDigest` (string). + * `sbom` (raw JSON). + * `format` (enum/string). + * Optional: `dsseSignature` or `dsseEnvelope`. + * Behavior: + + * Parse & validate SBOM structure. + * Canonicalize JSON, compute `sbomHash`. + * If `sbomHash` exists for `imageDigest` and tenant: + + * Return `200` with `{ status: "already_present", sbomId, sbomHash }`. + * Else: + + * Persist `Sbom` entity. + * Optionally verify DSSE signature via Authority. + * Return `201` with `{ status: "stored", sbomId, sbomHash }`. + +2. `POST /attest/verify` + + * Request: + + * `dsseEnvelope` (JSON). + * `expectedSubjects` (list of `{ name, digest }`). + * Behavior: + + * Canonicalize payload, compute `attHash`. + * Verify DSSE signature via Authority. + * Parse in-toto Statement; ensure `subject[].digest.sha256` matches `expectedSubjects`. + * Persist `Attestation` & `SignatureInfo`. + * If configured, call LogBridge to publish and store `TransparencyLogEntry`. + * If `attHash` already exists: + + * Return `200` with `status: "already_present"` and existing references. + * Else, return `201` with `verified:true`, plus log info when available. + +3. Optional read APIs: + + * `GET /sbom/by-image/{digest}` + * `GET /attest/by-image/{digest}` + * `GET /image/{digest}/summary` (SBOM + attestations + log status). + +### 4.2 StellaOps.Sbomer.Engine + +**Responsibilities:** + +* Given: + + * OCI image manifest & layers (from local tarball or remote registry). +* Produce: + + * CycloneDX 1.6 JSON. + * SPDX 3.0.1 JSON. + +**Design:** + +* Use layered analyzers: + + * `ILayerAnalyzer` for generic filesystem traversal. + * Language-specific analyzers (optional for SBOM detail): + + * `DotNetAnalyzer`, `NodeJsAnalyzer`, `PythonAnalyzer`, `JavaAnalyzer`, `PhpAnalyzer`, etc. +* Determinism: + + * Sort all lists (components, dependencies) by stable keys. + * Remove unstable fields (timestamps, machine IDs, ephemeral paths). + * Provide `Normalize()` method per format that returns canonical JSON. 
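+
+A minimal sketch of that ordering rule (the `SbomComponent` record is a hypothetical stand-in for the real CycloneDX/SPDX component models):
+
+```csharp
+// Sketch of the determinism pass above. Unstable fields (timestamps,
+// machine IDs, ephemeral paths) are assumed to be stripped before this point.
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public sealed record SbomComponent(string Purl, string Name, string Version);
+
+public static class SbomNormalizer
+{
+    public static IReadOnlyList<SbomComponent> Normalize(IEnumerable<SbomComponent> components) =>
+        components
+            .Distinct()                                   // drop exact duplicates
+            .OrderBy(c => c.Purl, StringComparer.Ordinal) // stable, culture-free keys
+            .ThenBy(c => c.Name, StringComparer.Ordinal)
+            .ThenBy(c => c.Version, StringComparer.Ordinal)
+            .ToList();
+}
+```
+
+With ordering and field-stripping pinned, `Normalize()` output serializes to the same canonical JSON on every run, which is what makes `ContentHash` usable as the dedup key in §3.1.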
+ +### 4.3 StellaOps.Provenance.Engine + +**Responsibilities:** + +* Build in-toto Statement with SLSA v1 predicate: + + * `subject` derived from image digest(s). + * `materials` from: + + * Git commit, tag, builder image, SBOM components if available. +* Ensure determinism: + + * Sort materials by URI + digest. + * Normalize nested maps. + +**Key APIs (internal library):** + +* `InTotoStatement BuildSlsaProvenance(ImageArtifact image, Sbom sbom, ProvenanceContext ctx)` +* `string ToCanonicalJson(InTotoStatement stmt)` + +### 4.4 StellaOps.Authority + +**Responsibilities:** + +* DSSE signing & verification. +* Key management abstraction. +* Policy enforcement (which keys/trust roots are allowed). + +**Interfaces:** + +* `ISigningProvider` + + * `Task SignAsync(byte[] payload, string payloadType, string keyId)` +* `IVerificationProvider` + + * `Task VerifyAsync(DsseEnvelope envelope, VerificationPolicy policy)` + +**Backends:** + +* File-based keys (PEM). +* HSM/KMS (AWS KMS, Azure Key Vault, on-prem HSM). +* Sovereign crypto providers (GOST, SMx, etc.). +* Optional PQ providers (Dilithium, Falcon). + +### 4.5 StellaOps.LogBridge + +**Responsibilities:** + +* Abstract interaction with transparency logs. + +**Interface:** + +* `ILogBackend` + + * `Task PutAsync(byte[] canonicalPayloadHash, DsseEnvelope env)` + * `Task VerifyInclusionAsync(LogEntryResult entry)` + +**Backends:** + +* `RekorBackend`: + + * Calls Rekor REST API with hashed payload. +* `LocalMerkleBackend`: + + * Maintains Merkle tree in local DB. + * Returns `logIndex`, `logId`, and inclusion proof. +* `NullBackend`: + + * Returns empty/no-op results. + +### 4.6 CLI Tools (Optional) + +Use the same libraries as the services: + +* `stella-extract`: + + * Input: image reference. + * Output: local tarball + manifest JSON. +* `stella-sbomer`: + + * Input: manifest & layers. + * Output: SBOM JSON. +* `stella-sign`: + + * Input: JSON file. + * Output: DSSE envelope. +* `stella-provenance`: + + * Input: image digest, build metadata. + * Output: signed in-toto/SLSA DSSE. +* `stella-log`: + + * Input: DSSE envelope. + * Output: log entry details. + +--- + +## 5. End-to-End Flows + +### 5.1 SBOM Ingest (Upload Path) + +```mermaid +sequenceDiagram + participant Client + participant API as SupplyChain.Api + participant Core as SupplyChain.Core + participant DB as PostgreSQL + + Client->>API: POST /sbom/ingest (imageDigest, sbom, format) + API->>Core: Validate & canonicalize SBOM + Core-->>API: sbomHash + API->>DB: SELECT Sbom WHERE sbomHash & imageDigest + DB-->>API: Not found + API->>DB: INSERT Sbom (sbomHash, imageDigest, content) + DB-->>API: ok + API-->>Client: 201 { status:"stored", sbomId, sbomHash } +``` + +Re-ingest of the same SBOM repeats steps up to SELECT, then returns `status:"already_present"` with `200`. 
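+
+Condensed as code, the stored and dedup paths of §5.1 look like this (a sketch in minimal-API style; the repository and request shapes are illustrative placeholders, not the shipped contracts):
+
+```csharp
+// Condensed Program.cs sketch for the flow above; ISbomRepository and the
+// record shapes are hypothetical, and auth, validation, and error handling
+// are omitted.
+var builder = WebApplication.CreateBuilder(args);
+// builder.Services.AddScoped<ISbomRepository, PostgresSbomRepository>(); // assumed registration
+var app = builder.Build();
+
+app.MapPost("/sbom/ingest", async (SbomIngestRequest req, ISbomRepository repo) =>
+{
+    // Same bytes in => same key out: the canonical-JSON SHA-256 is the idempotency key.
+    string sbomHash = ContentDigest.Compute(req.Sbom); // hypothetical Core helper
+
+    var existing = await repo.FindAsync(req.ImageDigest, sbomHash);
+    if (existing is not null)
+        return Results.Ok(new { status = "already_present", sbomId = existing.Id, sbomHash });
+
+    var stored = await repo.InsertAsync(req.ImageDigest, req.Format, sbomHash, req.Sbom);
+    return Results.Created($"/sbom/{stored.Id}", new { status = "stored", sbomId = stored.Id, sbomHash });
+});
+
+app.Run();
+
+public sealed record SbomIngestRequest(string ImageDigest, string Format, string Sbom);
+public sealed record SbomRecord(string Id);
+
+public interface ISbomRepository
+{
+    Task<SbomRecord?> FindAsync(string imageDigest, string sbomHash);
+    Task<SbomRecord> InsertAsync(string imageDigest, string format, string sbomHash, string sbom);
+}
+```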
+ +### 5.2 Attestation Verify & Record + +```mermaid +sequenceDiagram + participant Client + participant API as SupplyChain.Api + participant Auth as Authority + participant Log as LogBridge + participant DB as PostgreSQL + + Client->>API: POST /attest/verify (dsseEnvelope, expectedSubjects) + API->>Auth: Verify DSSE (keys, policy) + Auth-->>API: VerificationResult(Valid/Invalid) + API->>API: Parse in-toto, check subjects vs expected + API->>DB: SELECT Attestation WHERE attHash + DB-->>API: Not found + API->>DB: INSERT Attestation + SignatureInfo + alt Logging enabled + API->>Log: PutAsync(attHash, envelope) + Log-->>API: LogEntryResult(logIndex, logId, proof) + API->>DB: INSERT TransparencyLogEntry + end + API-->>Client: 201 { verified:true, attestationId, logIndex?, inclusionProof? } +``` + +If attestation already exists, API returns `200` with `status:"already_present"`. + +--- + +## 6. Idempotency and Determinism Strategy + +1. **Canonicalization rules:** + + * Remove insignificant whitespace. + * Sort all object keys lexicographically. + * Sort arrays where order is not semantically meaningful (components, materials). + * Strip non-deterministic fields (timestamps, random IDs) where allowed. + +2. **Hashing:** + + * Always hash canonical JSON as UTF-8. + * Use SHA-256 for core IDs; allow crypto provider to also compute other digests if needed. + +3. **Persistence:** + + * Enforce uniqueness in DB via indices on: + + * `(TenantId, ContentHash)` for SBOMs. + * `(TenantId, AttHash)` for attestations. + * `(Backend, LogId, LogIndex)` for log entries. + * API behavior: + + * Existing row → `200` with `"already_present"`. + * New row → `201` with `"stored"`. + +4. **API design:** + + * No version numbers in path. + * Add fields over time; never break or repurpose existing ones. + * Use explicit capability discovery via `GET /meta/capabilities` if needed. + +--- + +## 7. Air-Gap Mode and Synchronization + +### 7.1 Air-Gap Mode + +* Configuration flag `Mode = Offline` on SupplyChain.Api. +* LogBridge backend: + + * Default to `LocalMerkle` or `Null`. +* Rekor-specific configuration disabled or absent. +* DB & Merkle log stored locally inside the secure network. + +### 7.2 Later Synchronization to Rekor (Optional Future Step) + +Not mandatory for first iteration, but prepare for: + +* Background job (Scheduler module) that: + + * Enumerates local `TransparencyLogEntry` not yet exported. + * Publishes hashed payloads to Rekor when network is available. + * Stores mapping between local log entries and remote Rekor entries. + +--- + +## 8. Security, Access Control, and Observability + +### 8.1 Security + +* mTLS between internal services (SupplyChain.Api, Authority, LogBridge). +* Authentication: + + * API keys/OIDC for clients. + * Per-tenant scoping; `TenantId` must be present in context. +* Authorization: + + * RBAC: which tenants/users can write/verify/only read. + +### 8.2 Crypto Policies + +* Policy object defines: + + * Allowed key types and algorithms. + * Trust roots (Fulcio, internal CA, sovereign PKI). + * Revocation checking strategy (CRL/OCSP, offline lists). +* Authority enforces policies; SupplyChain.Api only consumes `VerificationResult`. + +### 8.3 Observability + +* Logs: + + * Structured logs with correlation IDs; log imageDigest, sbomHash, attHash. +* Metrics: + + * SBOM ingest count, dedup hit rate. + * Attestation verify latency. + * Transparency log publish success/failure counts. +* Traces: + + * OpenTelemetry tracing across API → Authority → LogBridge. + +--- + +## 9. 
Implementation Plan (Epics & Work Packages) + +You can give this section directly to agents to split. + +### Epic 1: Core Domain & Canonicalization + +1. Define .NET 10 solution structure: + + * Projects: + + * `StellaOps.SupplyChain.Core` + * `StellaOps.Sbomer.Engine` + * `StellaOps.Provenance.Engine` + * `StellaOps.SupplyChain.Api` + * `StellaOps.Authority` (if not already present) + * `StellaOps.LogBridge` +2. Implement core domain models: + + * SBOM, DSSE, in-toto, SLSA v1. +3. Implement canonicalization & hashing utilities. +4. Unit tests: + + * Given semantically equivalent JSON, hashes must match. + * Negative tests where order changes but meaning does not. + +### Epic 2: Persistence Layer + +1. Design EF Core models for: + + * ImageArtifact, Sbom, Attestation, SignatureInfo, TransparencyLogEntry, KeyRecord. +2. Write migrations for PostgreSQL. +3. Implement repository interfaces for read/write. +4. Tests: + + * Unique constraints and idempotency behavior. + * Query performance for common access paths (by imageDigest). + +### Epic 3: SBOM Engine + +1. Implement minimal layer analysis: + + * Accepts local tarball or path (for now). +2. Implement CycloneDX 1.6 generator. +3. Implement SPDX 3.0.1 generator. +4. Deterministic normalization across formats. +5. Tests: + + * Golden files for images → SBOM output. + * Stability under repeated runs. + +### Epic 4: Provenance Engine + +1. Implement in-toto Statement model with SLSA v1 predicate. +2. Implement builder to map: + + * ImageDigest → subject. + * Build metadata → materials. +3. Deterministic canonicalization. +4. Tests: + + * Golden in-toto/SLSA statements for sample inputs. + * Subject matching logic. + +### Epic 5: Authority Integration + +1. Implement `ISigningProvider`, `IVerificationProvider` contracts. +2. Implement file-based key backend as default. +3. Implement DSSE wrapper: + + * `SignAsync(payload, payloadType, keyId)`. + * `VerifyAsync(envelope, policy)`. +4. Tests: + + * DSSE round-trip; invalid signature scenarios. + * Policy enforcement tests. + +### Epic 6: Transparency Log Bridge + +1. Implement `ILogBackend` interface. +2. Implement `LocalMerkleBackend`: + + * Simple Merkle tree with DB storage. +3. Implement `NullBackend`. +4. Define configuration model to select backend. +5. (Optional later) Implement `RekorBackend`. +6. Tests: + + * Stable Merkle root; inclusion proof verification. + +### Epic 7: SupplyChain.Api + +1. Implement `POST /sbom/ingest`: + + * Request/response DTOs. + * Integration with canonicalization, persistence, idempotency logic. +2. Implement `POST /attest/verify`: + + * End-to-end verification and persistence. + * Integration with Authority and LogBridge. +3. Optional read APIs. +4. Add input validation (JSON schema, basic constraints). +5. Integration tests: + + * Full flows for new and duplicate inputs. + * Error cases (invalid DSSE, subject mismatch). + +### Epic 8: CLI Tools + +1. Implement `stella-sbomer` (wraps Sbomer.Engine). +2. Implement `stella-provenance` (wraps Provenance.Engine + Authority). +3. Implement `stella-sign` and `stella-log`. +4. Provide clear help/usage and sample scripts. + +### Epic 9: Hardening, Air-Gap Profile, and Docs + +1. Configuration profiles: + + * `Offline` vs `Online`. + * Log backend selection. +2. Security hardening: + + * mTLS, authentication, authorization. +3. Observability: + + * Metrics, logs, traces wiring. +4. Documentation: + + * API reference. + * Sequence diagrams. + * Deployment recipes for: + + * Single-node air-gap. 
+ * Clustered online deployment. + +--- + +If you want, next step I can: + +* Turn this into an AGENTS/TASKS/PROMPT set for your codex workers, or +* Produce concrete .NET 10 project skeletons (csproj layout, folder structure, and initial interfaces) for the core libraries and API service. diff --git a/docs/product-advisories/archived/17-Nov-2026 - Stripped-ELF-Reachability.md b/docs/product-advisories/archived/17-Nov-2025 - Stripped-ELF-Reachability.md similarity index 96% rename from docs/product-advisories/archived/17-Nov-2026 - Stripped-ELF-Reachability.md rename to docs/product-advisories/archived/17-Nov-2025 - Stripped-ELF-Reachability.md index f894edcd5..adf1157c2 100644 --- a/docs/product-advisories/archived/17-Nov-2026 - Stripped-ELF-Reachability.md +++ b/docs/product-advisories/archived/17-Nov-2025 - Stripped-ELF-Reachability.md @@ -1,846 +1,846 @@ - -Here’s a compact blueprint for bringing **stripped ELF binaries** into StellaOps’s **call‑graph + reachability scoring**—from raw bytes → neutral JSON → deterministic scoring. - ---- - -# Why this matters (quick) - -Even when symbols are missing, you can still (1) recover functions, (2) build a call graph, and (3) decide if a vulnerable function is *actually* reachable from the binary’s entrypoints. This feeds StellaOps’s deterministic scoring/lattice engine so VEX decisions are evidence‑backed, not guesswork. - ---- - -# High‑level pipeline - -1. **Ingest** - -* Accept: ELF (static/dynamic), PIE, musl/glibc, multiple arches (x86_64, aarch64, armhf, riscv64). -* Normalize: compute file hash set (SHA‑256, BLAKE3), note `PT_DYNAMIC`, `DT_NEEDED`, interpreter, RPATH/RUNPATH. - -2. **Symbolization (best‑effort)** - -* **If DWARF present**: read `.debug_*` (function names, inlines, CU boundaries, ranges). -* **If stripped**: - - * Use disassembler to **discover functions** (prolog patterns, xref‑to‑targets, thunk detection). - * Derive **synthetic names**: `sub_`, `plt_` (from dynamic symbol table if available), `extern@libc.so.6:memcpy`. - * Lift exported dynsyms and PLT stubs even when local symbols are removed. - * Recover **string‑referenced names** (e.g., Go/Python/C++ RTTI/Itanium mangling where present). - -3. **Disassembly & IR** - -* Disassemble to basic blocks; lift to a neutral IR (SSA‑like) sufficient for: - - * Call edges (direct `call`/`bl`). - * **Indirect calls** via GOT/IAT, vtables, function pointers (approximate with points‑to sets). - * Tailcalls, thunks, PLT interposition. - -4. **Call‑graph build** - -* Start from **entrypoints**: - - * ELF entry (`_start`), constructors (`.init_array`), exported API (public symbols), `main` (if recoverable). - * Optional: **entry‑trace** (cmd‑line + env + loader path) from container image to seed realistic roots. -* Build **CG** with: - - * Direct edges: precise. - * Indirect edges: conservative, with **evidence tags** (GOT target set, vtable class set, signature match). -* Record **inter‑module edges** to shared libs (soname + version) with relocation evidence. - -5. **Reachability scoring (deterministic)** - -* Input: list of vulnerable functions/paths (from CSAF/CVE KB) normalized to **function‑level identifiers** (soname!symbol or hash‑based if unnamed). -* Compute **reachability** from roots → target: - - * `REACHABLE_CONFIRMED` (path with only precise edges), - * `REACHABLE_POSSIBLE` (path contains conservative edges), - * `NOT_REACHABLE_FOUNDATION` (no path in current graph), - * Add **confidence** derived from edge evidence + relocation proof. 
-* Emit **proof trails** (the exact path: nodes, edges, evidence). - -6. **Neutral JSON intermediate (NJIF)** - -* Stored in cache; signed for deterministic replay. -* Consumed by StellaOps.Policy/Lattice to merge with VEX. - ---- - -# Neutral JSON Intermediate Format (NJIF) - -```json -{ - "artifact": { - "path": "/work/bin/app", - "hashes": {"sha256": "…", "blake3": "…"}, - "arch": "x86_64", - "elf": { - "type": "ET_DYN", - "interpreter": "/lib64/ld-linux-x86-64.so.2", - "needed": ["libc.so.6", "libssl.so.3"], - "rpath": [], - "runpath": [] - } - }, - "symbols": { - "exported": [ - {"id": "libc.so.6!memcpy", "kind": "dynsym", "addr": "0x0", "plt": true} - ], - "functions": [ - {"id": "sub_401000", "addr": "0x401000", "size": 112, "name_hint": null, "from": "disasm"}, - {"id": "main", "addr": "0x4023d0", "size": 348, "from": "dwarf|heuristic"} - ] - }, - "cfg": [ - {"func": "main", "blocks": [ - {"b": "0x4023d0", "succ": ["0x402415"], "calls": [{"type": "direct", "target": "sub_401000"}]}, - {"b": "0x402415", "succ": ["0x402440"], "calls": [{"type": "plt", "target": "libc.so.6!memcpy"}]} - ]} - ], - "cg": { - "nodes": [ - {"id": "main", "evidence": ["dwarf|heuristic"]}, - {"id": "sub_401000"}, - {"id": "libc.so.6!memcpy", "external": true, "lib": "libc.so.6"} - ], - "edges": [ - {"from": "main", "to": "sub_401000", "kind": "direct"}, - {"from": "main", "to": "libc.so.6!memcpy", "kind": "plt", "evidence": ["reloc@GOT"]} - ], - "roots": ["_start", "init_array[]", "main"] - }, - "reachability": [ - { - "target": "libssl.so.3!SSL_free", - "status": "NOT_REACHABLE_FOUNDATION", - "path": [] - }, - { - "target": "libc.so.6!memcpy", - "status": "REACHABLE_CONFIRMED", - "path": ["main", "libc.so.6!memcpy"], - "confidence": 0.98, - "evidence": ["plt", "dynsym", "reloc"] - } - ], - "provenance": { - "toolchain": { - "disasm": "ghidra_headless|radare2|llvm-mca", - "version": "…" - }, - "scan_manifest_hash": "…", - "timestamp_utc": "2025-11-16T00:00:00Z" - } -} -``` - ---- - -# Practical extractors (headless/CLI) - -* **DWARF**: `llvm-dwarfdump`/`eu-readelf` for quick CU/function ranges; fall back to the disassembler. -* **Disassembly/CFG/CG** (choose one or more; wrap with a stable adapter): - - * **Ghidra Headless API**: recover functions, basic blocks, references, PLT/GOT, vtables; export via a custom headless script to NJIF. - * **radare2 / rizin**: `aaa`, `agCd`, `aflj`, `agj` to export functions/graphs as JSON. - * **Binary Ninja headless** (if license permits) for cleaner IL and indirect‑call modeling. - * **angr** for path‑sensitive refinement on tricky indirect calls (optional, gated by budget). - -**Adapter principle:** All tools output a **small, consistent NJIF** so the scoring engine and lattice logic never depend on any single RE tool. - ---- - -# Indirect call modeling (concise rules) - -* **PLT/GOT**: edge from caller → `soname!symbol` with evidence: `plt`, `reloc@GOT`. -* **Function pointers**: if a store to a pointer is found and targets a known function set `{f1…fk}`, add edges with `kind: "indirect"`, `evidence: ["xref-store", "sig-compatible"]`. -* **Virtual calls / vtables**: class‑method set from RTTI/vtable scans; mark edges `evidence: ["vtable-match"]`. -* **Tailcalls**: treat as edges, not fallthrough. - -Each conservative step lowers **confidence**, but keeps determinism: the rules and their hashes are in the scan manifest. 
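-
-A sketch of how these rules can travel as plain data; the weights mirror the scoring section below, and the tailcall weight is an assumption:
-
-```csharp
-// Illustrative encoding of the edge rules above. Weights follow the scoring
-// section's defaults (direct=1.0, plt=0.98, vtable=0.85, funcptr=0.7); the
-// tailcall weight is an assumption, treating tailcalls as ordinary edges.
-using System.Collections.Generic;
-using System.Linq;
-
-public enum EdgeKind { Direct, Plt, IndirectVtable, IndirectFuncPtr, Tailcall }
-
-public sealed record CallEdge(string From, string To, EdgeKind Kind, IReadOnlyList<string> Evidence);
-
-public static class EdgeConfidence
-{
-    private static readonly IReadOnlyDictionary<EdgeKind, double> Weights =
-        new Dictionary<EdgeKind, double>
-        {
-            [EdgeKind.Direct] = 1.0,
-            [EdgeKind.Plt] = 0.98,
-            [EdgeKind.IndirectVtable] = 0.85,
-            [EdgeKind.IndirectFuncPtr] = 0.70,
-            [EdgeKind.Tailcall] = 1.0, // assumption: scored like a direct edge
-        };
-
-    // Path confidence = product of edge weights; deterministic because the
-    // weight table itself is part of the hashed scan manifest.
-    public static double ForPath(IEnumerable<CallEdge> path) =>
-        path.Aggregate(1.0, (acc, e) => acc * Weights[e.Kind]);
-}
-```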
-
----
-
-# Deterministic scoring (plug into Stella’s lattice)
-
-* **Inputs**: NJIF, CVE→function mapping (`soname!symbol` or function hash), policy knobs.
-* **States**: `{NOT_OBSERVED < POSSIBLE < REACHABLE_CONFIRMED}` with **monotone** merge (never oscillates).
-* **Confidence**: product of edge evidences (configurable weights): `direct=1.0, plt=0.98, vtable=0.85, funcptr=0.7`.
-* **Output**: OpenVEX/CSAF annotations + human proof path; signed with DSSE to preserve replayability.
-
----
-
-# Minimal Ghidra headless skeleton (exporter idea)
-
-```bash
-analyzeHeadless /work/gh_proj MyProj -import app -scriptPath scripts \
- -postScript ExportNjif.java /out/app.njif.json
-```
-
-```java
-// ExportNjif.java (outline)
-import ghidra.app.script.GhidraScript;
-import ghidra.program.model.listing.Function;
-
-public class ExportNjif extends GhidraScript {
-  @Override
-  protected void run() throws Exception {
-    // walk every recovered function in forward address order
-    for (Function fn : currentProgram.getFunctionManager().getFunctions(true)) {
-      // collect functions, blocks, calls, externs/PLT
-      // map non‑named functions to sub_
-      // detect PLT thunks → dynsym names
-    }
-    // write NJIF JSON deterministically (sorted keys, stable ordering)
-  }
-}
-```
-
----
-
-# Integration points in StellaOps
-
-* **Scanner.Analyzers.Binary.Elf**
-
-  * `ElfNormalizer` → hashes, dynamic deps.
-  * `Symbolizer` → DWARF reader + HeuristicDisasm (via tool adapter).
-  * `CgBuilder` → NJIF builder/merger (multi‑module).
-  * `ReachabilityEngine` → path search + confidence math.
-  * `Emitter` → NJIF cache + VEX/CSAF notes.
-
-* **Scheduler**: memoize by `(hashes, toolchain_version, ruleset_hash)` to ensure replayable results.
-
-* **Authority**: sign NJIF + scoring outputs; store manifests (feeds, rule weights, tool versions).
-
----
-
-# Test fixtures (suggested)
-
-* Tiny ELF zoo: statically linked, PIE, stripped/non‑stripped, C++ with vtables, musl vs glibc.
-* Known CVE libs (e.g., `libssl`, `zlib`) with versioned symbols to validate soname!symbol mapping.
-* Synthetic binaries with function‑pointer tables to validate conservative edges.
-
----
-
-If you want, I can generate:
-
-* A ready‑to‑run **Ghidra headless exporter** (Java) that writes NJIF exactly like above.
-* A small **.NET parser** that ingests NJIF and emits StellaOps reachability + OpenVEX notes.
-Below is a full architecture plan for implementing **stripped-ELF binary reachability** (call graph + NJIF + deterministic scoring, with a hook for patch-oracles) inside **StellaOps**.
-
-I will assume .NET 10, existing microservice split (Scanner.WebService, Scanner.Worker, Concelier, Excitior, Authority, Scheduler, Sbomer, Signals), and your standing rule: **all lattice logic runs in Scanner.WebService**.
-
----
-
-## 1. Scope, Objectives, Non-Goals
-
-### 1.1 Objectives
-
-1. **Recover function-level call graphs from ELF binaries**, including **stripped** ones:
-
-* Support ET_EXEC / ET_DYN / PIE, static & dynamic linking.
-* Support at least **x86_64, aarch64** in v1, later armhf, riscv64.
-
-2. **Produce a neutral, deterministic JSON representation (NJIF)**:
-
-* Tool-agnostic: can be generated from Ghidra, radare2/rizin, Binary Ninja, angr, etc.
-* Stable identifiers and schema so downstream services don’t depend on a specific RE engine.
-
-3. **Compute function-level reachability for vulnerabilities**:
-
-* Given CVE → `soname!symbol` (and later function-hash) mappings from Concelier,
-* Decide `REACHABLE_CONFIRMED` / `REACHABLE_POSSIBLE` / `NOT_REACHABLE_FOUNDATION` with evidence and confidence.
-
-4. **Integrate with StellaOps lattice and VEX outputs**:
-
-* Lattice logic runs in **Scanner.WebService**.
-* Results flow into Excitior (VEX) and Sbomer (SBOM annotations), preserving provenance. - -5. **Enable deterministic replay**: - -* Every analysis run is tied to a **Scan Manifest**: tool versions, ruleset hashes, policy hashes, container image digests. - -### 1.2 Non-Goals (v1) - -* No dynamic runtime probes (EventPipe/JFR) in this phase. -* No full decompilation; we only need enough IR for calls/edges. -* No aggressive path-sensitive analysis (symbolic execution) in v1; that can be a v2 enhancement. - ---- - -## 2. High-Level System Architecture - -### 2.1 Components - -* **Scanner.WebService (existing)** - - * REST/gRPC API for scans. - * Orchestrates analysis jobs via Scheduler. - * Hosts **Lattice & Reachability Engine** for all artifact types. - * Reads NJIF results, merges with Concelier function mappings and policies. - -* **Scanner.Worker (existing, extended)** - - * Executes **Binary Analyzer Pipelines**. - * Invokes RE tools (Ghidra, rizin, etc.) in dedicated containers. - * Produces NJIF and persists it. - -* **Binary Tools Containers (new)** - - * `stellaops-tools-ghidra:` - * `stellaops-tools-rizin:` - * Optionally `stellaops-tools-angr` for advanced passes. - * Pinned versions, no network access (for determinism & air-gap). - -* **Storage & Metadata** - - * **DB (PostgreSQL)**: scan records, NJIF metadata, reachability summaries. - * **Object store** (MinIO/S3/Filesystem): NJIF JSON blobs, tool logs. - * **Authority**: DSSE signatures for Scan Manifest, NJIF, and reachability outputs. - -* **Concelier** - - * Provides **CVE → component → function symbol/hashes** resolution. - * Exposes “Link-Not-Merge” graph of advisory, component, and function nodes. - -* **Excitior (VEX)** - - * Consumes Scanner.WebService reachability states. - * Emits OpenVEX/CSAF with properly justified statuses. - -* **UnknownsRegistry (future)** - - * Receives unresolvable call edges / ambiguous functions from the analyzer, - * Feeds them into “adaptive security” workflows. - -### 2.2 End-to-End Flow (Binary / Image Scan) - -1. Client requests scan (binary or container image) via **Scanner.WebService**. -2. WebService: - - * Extracts binaries from OCI layers (if scanning image), - * Registers **Scan Manifest**, - * Submits a job to Scheduler (queue: `binary-elfflow`). -3. Scanner.Worker dequeues the job: - - * Detects ELF binaries, - * Runs **Binary Analyzer Pipeline** for each unique binary hash. -4. Worker uses tools containers: - - * Ghidra/rizin → CFG, function discovery, call graph, - * Converts to **NJIF**. -5. Worker persists NJIF + metadata; marks analysis complete. -6. Scanner.WebService picks up NJIF: - - * Fetches advisory function mappings from Concelier, - * Runs **Reachability & Lattice scoring**, - * Updates scan results and triggers Excitior / Sbomer. - -All steps are deterministic given: - -* Input artifact, -* Tool container digests, -* Ruleset/policy versions. - ---- - -## 3. Binary Analyzer Subsystem (Scanner.Worker) - -Introduce a dedicated module: - -* `StellaOps.Scanner.Analyzers.Binary.Elf` - -### 3.1 Internal Layers - -1. **ElfDetector** - - * Inspects files in a scan: - - * Magic `0x7f 'E' 'L' 'F'`, - * Confirms architecture via ELF header. - * Produces `BinaryArtifact` records with: - - * `hashes` (SHA-256, BLAKE3), - * `path` in container, - * `arch`, `endianness`. - -2. 
**ElfNormalizer** - - * Uses a lightweight library (e.g., ElfSharp) to extract: - - * `ElfType` (ET_EXEC, ET_DYN), - * interpreter (`PT_INTERP`), - * `DT_NEEDED` list, - * RPATH/RUNPATH, - * presence/absence of DWARF sections. - * Emits a normalized `ElfMetadata` DTO. - -3. **Symbolization Layer** - - * Sub-components: - - * `DwarfSymbolReader`: if DWARF present, read CU, function ranges, names, inlines. - * `DynsymReader`: parse `.dynsym`, `.plt`, exported symbols. - * `HeuristicFunctionFinder`: - - * For stripped binaries: - - * Use disassembler xrefs, prolog patterns, return instructions, call-targets. - * Recognize PLT thunks → `soname!symbol`. - * Consolidates into `FunctionSymbol` entities: - - * `id` (e.g., `main`, `sub_401000`, `libc.so.6!memcpy`), - * `addr`, `size`, `is_external`, `from` (`dwarf`, `dynsym`, `heuristic`). - -4. **Disassembly & IR Layer** - - * Abstraction: `IDisassemblyAdapter`: - - * `Task AnalyzeAsync(BinaryArtifact, ElfMetadata, ScanManifest)` - * Implementations: - - * `GhidraDisassemblyAdapter`: - - * Invokes headless Ghidra in container, - * Receives machine-readable JSON (script-produced), - * Extracts functions, basic blocks, calls, GOT/PLT info, vtables. - * `RizinDisassemblyAdapter` (backup/fallback). - * Produces: - - * `BasicBlock` objects, - * `Instruction` metadata where needed for calls, - * `CallSite` records (direct, PLT, indirect). - -5. **Call-Graph Builder** - - * Consumes `FunctionSymbol` + `CallSite` sets. - * Identifies **roots**: - - * `_start`, `.init_array` entries, - * `main` (if present), - * Exported API functions for shared libs. - * Creates `CallGraph`: - - * Nodes: functions (`FunctionNode`), - * Edges: `CallEdge` with: - - * `kind`: `direct`, `plt`, `indirect-funcptr`, `indirect-vtable`, `tailcall`, - * `evidence`: tags like `["reloc@GOT", "sig-match", "vtable-class"]`. - -6. **Evidence & Confidence Annotator** - - * For each edge, computes a **local confidence**: - - * `direct`: 1.0 - * `plt`: 0.98 - * `indirect-funcptr`: 0.7 - * `indirect-vtable`: 0.85 - * For each path later, Scanner.WebService composes these. - -7. **NJIF Serializer** - - * Transforms domain objects into **NJIF JSON**: - - * Sorted keys, stable ordering for determinism. - * Writes: - - * `artifact`, `elf`, `symbols`, `cfg`, `cg`, and partial `reachability: []` (filled by WebService). - * Stores in object store, returns location + hash to DB. - -8. **Unknowns Reporting** - - * Any unresolved: - - * Indirect call with empty target set, - * Function region not mapped to symbol, - * Logged as `UnknownEvidence` records and optionally published to **UnknownsRegistry** stream. - ---- - -## 4. NJIF Data Model (Neutral JSON Intermediate Format) - -Define a stable schema with a top-level `njif_schema_version` field. - -### 4.1 Top-Level Shape - -```json -{ - "njif_schema_version": "1.0.0", - "artifact": { ... }, - "symbols": { ... }, - "cfg": [ ... ], - "cg": { ... }, - "reachability": [ ... ], - "provenance": { ... } -} -``` - -### 4.2 Key Sections - -1. `artifact` - - * `path`, `hashes`, `arch`, `elf.type`, `interpreter`, `needed`, `rpath`, `runpath`. - -2. `symbols` - - * `exported`: external/dynamic symbols, especially PLT: - - * `id`, `kind`, `plt`, `lib`. - * `functions`: - - * `id` (synthetic or real name), - * `addr`, `size`, `from` (source of naming info), - * `name_hint` (optional). - -3. `cfg` - - * Per-function basic block CFG plus call sites: - - * Blocks with `succ`, `calls` entries. - * Sufficient for future static checks, not full IR. - -4. 
`cg` - - * `nodes`: function nodes with evidence tags. - * `edges`: call edges with: - - * `from`, `to`, `kind`, `evidence`. - * `roots`: entrypoints for reachability algorithms. - -5. `reachability` - - * Initially empty from Worker. - * Populated in Scanner.WebService as: - -```json -{ - "target": "libssl.so.3!SSL_free", - "status": "REACHABLE_CONFIRMED", - "path": ["_start", "main", "libssl.so.3!SSL_free"], - "confidence": 0.93, - "evidence": ["plt", "dynsym", "reloc"] -} -``` - -6. `provenance` - - * `toolchain`: - - * `disasm`: `"ghidra_headless:10.4"`, etc. - * `scan_manifest_hash`, - * `timestamp_utc`. - -### 4.3 Persisting NJIF - -* Object store (versioned path): - - * `njif/{sha256}/njif-v1.json` -* DB table `binary_njif`: - - * `binary_hash`, `njif_hash`, `schema_version`, `toolchain_digest`, `scan_manifest_id`. - ---- - -## 5. Reachability & Lattice Integration (Scanner.WebService) - -### 5.1 Inputs - -* **NJIF** for each binary (possibly multiple binaries per container). -* Concelier’s **CVE → (component, function)** resolution: - - * `component_id` → `soname!symbol` sets, and where available, function hashes. -* Scanner’s existing **lattice policies**: - - * States: e.g. `NOT_OBSERVED < POSSIBLE < REACHABLE_CONFIRMED`. - * Merge rules are monotone. - -### 5.2 Reachability Engine - -New service module: - -* `StellaOps.Scanner.Domain.Reachability` - - * `INjifRepository` (reads NJIF JSON), - * `IFunctionMappingResolver` (Concelier adapter), - * `IReachabilityCalculator`. - -Algorithm per target function: - -1. Resolve vulnerable function(s): - - * From Concelier: `soname!symbol` and/or `func_hash`. - * Map to NJIF `symbols.exported` or `symbols.functions`. - -2. For each binary: - - * Use `cg.roots` as entry set. - * BFS/DFS along edges until: - - * Reaching target node(s), - * Or graph fully explored. - -3. For each successful path: - - * Collect edges’ `confidence` weights, compute path confidence: - - * e.g., product of edge confidences or a log/additive scheme. - -4. Aggregate result: - - * If ≥ 1 path with only `direct/plt` edges: - - * `status = REACHABLE_CONFIRMED`. - * Else if only paths with indirect edges: - - * `status = REACHABLE_POSSIBLE`. - * Else: - - * `status = NOT_REACHABLE_FOUNDATION`. - -5. Emit `reachability` entry back into NJIF (or as separate DB table) and into scan result graph. - -### 5.3 Lattice & VEX - -* Lattice computation is done per `(CVE, component, binary)` triple: - - * Input: reachability status + other signals. -* Resulting state is: - - * Exposed to **Excitior** as a set of **evidence-annotated VEX facts**. -* Excitior translates: - - * `NOT_REACHABLE_FOUNDATION` → likely `not_affected` with justification “code_not_reachable”. - * `REACHABLE_CONFIRMED` → `affected` or “present_and_exploitable” (depending on overall policy). - ---- - -## 6. Patch-Oracle Extension (Advanced, but Architected Now) - -While not strictly required for v1, we should reserve architecture hooks. - -### 6.1 Concept - -* Given: - - * A **vulnerable** library build (or binary), - * A **patched** build. -* Run analyzers on both; produce NJIF for each. -* Compare call graphs & function bodies (e.g., hash of normalized bytes): - - * Identify **changed functions** and potentially changed code regions. -* Concelier links those function IDs to specific CVEs (via vendor patch metadata). -* These become authoritative “patched function sets” (the **patch oracle**). 
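To make "hash of normalized bytes" concrete, here is a minimal sketch, assuming the analyzer already knows each function's raw bytes and the offsets of relocated operands inside it; `FnHasher` and the 4-byte fixup width are illustrative, not the shipping normalizer:

```csharp
using System;
using System.Security.Cryptography;

static class FnHasher
{
    // Hash a function body with relocation targets zeroed out, so the result
    // survives layout-only rebuilds and flags genuine body changes.
    public static string NormalizedHash(byte[] body, int[] relocOffsets)
    {
        var normalized = (byte[])body.Clone();
        foreach (var off in relocOffsets)
            for (int i = off; i < Math.Min(off + 4, normalized.Length); i++)
                normalized[i] = 0; // assumption: 4-byte (32-bit) fixups

        return Convert.ToHexString(SHA256.HashData(normalized)).ToLowerInvariant();
    }
}
```

Masking relocated operands is the key design choice: addresses that merely moved between the vulnerable and patched builds no longer perturb the hash, so a differing hash indicates a real code change.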
- -### 6.2 Integration Points - -Add a module: - -* `StellaOps.Scanner.Analysis.PatchOracle` - - * Input: pair of artifact hashes (old, new) + NJIF. - * Output: list of `FunctionPatchRecord`: - - * `function_id`, `binary_hash_old`, `binary_hash_new`, `change_kind` (`added`, `modified`, `deleted`). - -Concelier: - -* Ingests `FunctionPatchRecord` via internal API and updates advisory graph: - - * CVE → function set derived from real patch. -* Reachability Engine: - - * Uses patch-derived function sets instead of or in addition to symbol mapping from vendor docs. - ---- - -## 7. Persistence, Determinism, Caching - -### 7.1 Scan Manifest - -For every scan job, create: - -* `scan_manifest`: - - * Input artifact hashes, - * List of binaries, - * Tool container digests (Ghidra, rizin, etc.), - * Ruleset/policy/lattice hashes, - * Time, user, and config flags. - -Authority signs this manifest with DSSE. - -### 7.2 Binary Analysis Cache - -Key: `(binary_hash, arch, toolchain_digest, njif_schema_version)`. - -* If present: - - * Skip re-running Ghidra/rizin; reuse NJIF. -* If absent: - - * Run analysis, then cache NJIF. - -This provides deterministic replay and prevents re-analysis across scans and across customers (if allowed by tenancy model). - ---- - -## 8. APIs & Integration Contracts - -### 8.1 Scanner.WebService External API (REST) - -1. `POST /api/scans/images` - - * Existing; extended to flag: `includeBinaryReachability: true`. -2. `POST /api/scans/binaries` - - * Upload a standalone ELF; returns `scan_id`. -3. `GET /api/scans/{scanId}/reachability` - - * Returns list of `(cve_id, component, binary_path, function_id, status, confidence, path)`. - -No path versioning; idempotent and additive (new fields appear, old ones remain valid). - -### 8.2 Internal APIs - -* **Worker ↔ Object Store**: - - * `PUT /binary-njif/{sha256}/njif-v1.json`. - -* **WebService ↔ Worker (via Scheduler)**: - - * Job payload includes: - - * `scan_manifest_id`, - * `binary_hashes`, - * `analysis_profile` (`default`, `deep`). - -* **WebService ↔ Concelier**: - - * `POST /internal/functions/resolve`: - - * Input: `(cve_id, component_ids[])`, - * Output: `soname!symbol[]`, optional `func_hash[]`. - -* **WebService ↔ Excitior**: - - * Existing VEX ingestion extended with **reachability evidence** fields. - ---- - -## 9. Observability, Security, Resource Model - -### 9.1 Observability - -* **Metrics**: - - * Analysis duration per binary, - * NJIF size, - * Cache hit ratio, - * Reachability evaluation time per CVE. - -* **Logs**: - - * Ghidra/rizin container logs stored alongside NJIF, - * Unknowns logs for unresolved call targets. - -* **Tracing**: - - * Each scan/analysis annotated with `scan_manifest_id` to allow end-to-end trace. - -### 9.2 Security - -* Tools containers: - - * No outbound network. - * Limited to read-only artifact mount + write-only result mount. -* Binary content: - - * Treated as confidential; stored encrypted at rest if your global policy requires it. -* DSSE: - - * Authority signs: - - * Scan Manifest, - * NJIF blob hash, - * Reachability summary. - * Enables “Proof-of-Integrity Graph” linkage later. - -### 9.3 Resource Model - -* ELF analysis can be heavy; design for: - - * Separate **worker queue** and autoscaling group for binary analysis. - * Configurable max concurrency and per-job CPU/memory limits. -* Deep analysis (indirect calls, vtables) can be toggled via `analysis_profile`. - ---- - -## 10. 
Implementation Roadmap - -A pragmatic, staged plan: - -### Phase 0 – Foundations (1–2 sprints) - -* Create `StellaOps.Scanner.Analyzers.Binary.Elf` project. -* Implement: - - * `ElfDetector`, `ElfNormalizer`. - * DB tables: `binary_artifacts`, `binary_njif`. -* Integrate with Scheduler and Worker pipeline. - -### Phase 1 – Non-stripped ELF + NJIF v1 (2–3 sprints) - -* Implement **DWARF + dynsym symbolization**. -* Implement **GhidraDisassemblyAdapter** for x86_64. -* Build **CallGraphBuilder** (direct + PLT calls). -* Implement NJIF serializer v1; store in object store. -* Basic reachability engine in WebService: - - * Only direct and PLT edges, - * Only for DWARF-named functions. -* Integrate with Concelier function mapping via `soname!symbol`. - -### Phase 2 – Stripped ELF Support (2–3 sprints) - -* Implement `HeuristicFunctionFinder` for function discovery in stripped binaries. -* Extend Ghidra script to mark PLT/GOT, vtables, function pointers. -* Call graph: add: - - * `indirect-funcptr`, `indirect-vtable`, `tailcall` edges. -* Evidence tagging and local confidence scoring. -* Extend reachability engine to: - - * Distinguish `REACHABLE_CONFIRMED` vs `REACHABLE_POSSIBLE`. - -### Phase 3 – Multi-Arch & Performance (2–3 sprints) - -* Add support for **aarch64** (Ghidra language, appropriate calling conventions). -* Optimize: - - * Binary analysis cache, - * Tool container lifecycle, - * Concurrent analysis. -* Add Unknowns reporting and hookup to UnknownsRegistry (if already implemented). - -### Phase 4 – Patch-Oracle Pilot (2–3 sprints) - -* Implement PatchOracle module: - - * Compare old/new NJIFs, - * Detect changed functions. -* Integrate with Concelier’s advisory graph. -* Start validating against curated CVE/patch datasets. - -### Phase 5 – Hardening & Documentation - -* Golden fixtures: - - * Small ELF zoo (stripped/non-stripped, PIE, static, C++, vtables). - * Known vulnerable libs (e.g., OpenSSL, zlib) to confirm correct function mapping. -* Add CLI/demo in `StellaOps.Scanner.Cli`: - - * `stellaops scan-binary --file app --show-reachability`. -* Customer-facing and internal docs: - - * NJIF schema, - * API usage, - * Limitations and interpretation guidelines. - ---- - -If you want, next step I can do is take this plan and: - -* Break it into **epics / tickets** (SCAN-BINARY-xxx) with clear DoD per phase, or -* Draft the **Ghidra headless Java script** and the **.NET NJIF model classes** so your agents can plug them straight into the Scanner repo. + +Here’s a compact blueprint for bringing **stripped ELF binaries** into StellaOps’s **call‑graph + reachability scoring**—from raw bytes → neutral JSON → deterministic scoring. + +--- + +# Why this matters (quick) + +Even when symbols are missing, you can still (1) recover functions, (2) build a call graph, and (3) decide if a vulnerable function is *actually* reachable from the binary’s entrypoints. This feeds StellaOps’s deterministic scoring/lattice engine so VEX decisions are evidence‑backed, not guesswork. + +--- + +# High‑level pipeline + +1. **Ingest** + +* Accept: ELF (static/dynamic), PIE, musl/glibc, multiple arches (x86_64, aarch64, armhf, riscv64). +* Normalize: compute file hash set (SHA‑256, BLAKE3), note `PT_DYNAMIC`, `DT_NEEDED`, interpreter, RPATH/RUNPATH. + +2. **Symbolization (best‑effort)** + +* **If DWARF present**: read `.debug_*` (function names, inlines, CU boundaries, ranges). +* **If stripped**: + + * Use disassembler to **discover functions** (prolog patterns, xref‑to‑targets, thunk detection). 
  * Derive **synthetic names**: `sub_<addr>`, `plt_<name>` (from the dynamic symbol table if available), `extern@libc.so.6:memcpy`.
  * Lift exported dynsyms and PLT stubs even when local symbols are removed.
  * Recover **string‑referenced names** where present (e.g., Go and Python runtime metadata, C++ RTTI/Itanium‑mangled symbols).

3. **Disassembly & IR**

* Disassemble to basic blocks; lift to a neutral IR (SSA‑like) sufficient for:

  * Call edges (direct `call`/`bl`).
  * **Indirect calls** via GOT/IAT, vtables, function pointers (approximated with points‑to sets).
  * Tailcalls, thunks, PLT interposition.

4. **Call‑graph build**

* Start from **entrypoints**:

  * ELF entry (`_start`), constructors (`.init_array`), exported API (public symbols), `main` (if recoverable).
  * Optional: **entry‑trace** (cmd‑line + env + loader path) from the container image to seed realistic roots.
* Build the **CG** with:

  * Direct edges: precise.
  * Indirect edges: conservative, with **evidence tags** (GOT target set, vtable class set, signature match).
* Record **inter‑module edges** to shared libs (soname + version) with relocation evidence.

5. **Reachability scoring (deterministic)** (a scoring sketch follows this list)

* Input: list of vulnerable functions/paths (from the CSAF/CVE KB) normalized to **function‑level identifiers** (`soname!symbol`, or hash‑based if unnamed).
* Compute **reachability** from roots → target:

  * `REACHABLE_CONFIRMED` (path with only precise edges),
  * `REACHABLE_POSSIBLE` (path contains conservative edges),
  * `NOT_REACHABLE_FOUNDATION` (no path in the current graph),
  * plus a **confidence** derived from edge evidence + relocation proof.
* Emit **proof trails** (the exact path: nodes, edges, evidence).

6. **Neutral JSON intermediate (NJIF)**

* Stored in cache; signed for deterministic replay.
* Consumed by StellaOps.Policy/Lattice to merge with VEX.
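Before pinning down the JSON, here is a minimal scoring sketch for step 5, assuming a flat edge list with per-kind confidence weights (the weights mirror the ones used later in this document); `Edge`, `EdgeKind`, and `ReachabilityScorer` are illustrative names, and the BFS returns the first qualifying path rather than the globally highest-confidence one:

```csharp
using System.Collections.Generic;
using System.Linq;

enum EdgeKind { Direct, Plt, IndirectVtable, IndirectFuncPtr }

record Edge(string From, string To, EdgeKind Kind);

static class ReachabilityScorer
{
    static readonly Dictionary<EdgeKind, double> Weight = new()
    {
        [EdgeKind.Direct] = 1.0,
        [EdgeKind.Plt] = 0.98,
        [EdgeKind.IndirectVtable] = 0.85,
        [EdgeKind.IndirectFuncPtr] = 0.70,
    };

    // BFS from the roots; "precise" means the path used only direct/PLT edges.
    public static (string Status, double Confidence, IReadOnlyList<string> Path) Score(
        IReadOnlyList<Edge> edges, IEnumerable<string> roots, string target)
    {
        var byCaller = edges.ToLookup(e => e.From);
        var queue = new Queue<(string Node, double Conf, bool Precise, List<string> Path)>();
        foreach (var r in roots)
            queue.Enqueue((r, 1.0, true, new List<string> { r }));

        (string, double, IReadOnlyList<string>)? possible = null;
        var seen = new HashSet<(string, bool)>();

        while (queue.Count > 0)
        {
            var (node, conf, precise, path) = queue.Dequeue();
            if (node == target)
            {
                if (precise) return ("REACHABLE_CONFIRMED", conf, path); // precise path wins outright
                possible ??= ("REACHABLE_POSSIBLE", conf, path);
                continue;
            }
            if (!seen.Add((node, precise))) continue;

            foreach (var e in byCaller[node])
            {
                var stillPrecise = precise && e.Kind is EdgeKind.Direct or EdgeKind.Plt;
                queue.Enqueue((e.To, conf * Weight[e.Kind], stillPrecise,
                               new List<string>(path) { e.To }));
            }
        }
        return possible ?? ("NOT_REACHABLE_FOUNDATION", 0.0, new List<string>());
    }
}
```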
+ +--- + +# Neutral JSON Intermediate Format (NJIF) + +```json +{ + "artifact": { + "path": "/work/bin/app", + "hashes": {"sha256": "…", "blake3": "…"}, + "arch": "x86_64", + "elf": { + "type": "ET_DYN", + "interpreter": "/lib64/ld-linux-x86-64.so.2", + "needed": ["libc.so.6", "libssl.so.3"], + "rpath": [], + "runpath": [] + } + }, + "symbols": { + "exported": [ + {"id": "libc.so.6!memcpy", "kind": "dynsym", "addr": "0x0", "plt": true} + ], + "functions": [ + {"id": "sub_401000", "addr": "0x401000", "size": 112, "name_hint": null, "from": "disasm"}, + {"id": "main", "addr": "0x4023d0", "size": 348, "from": "dwarf|heuristic"} + ] + }, + "cfg": [ + {"func": "main", "blocks": [ + {"b": "0x4023d0", "succ": ["0x402415"], "calls": [{"type": "direct", "target": "sub_401000"}]}, + {"b": "0x402415", "succ": ["0x402440"], "calls": [{"type": "plt", "target": "libc.so.6!memcpy"}]} + ]} + ], + "cg": { + "nodes": [ + {"id": "main", "evidence": ["dwarf|heuristic"]}, + {"id": "sub_401000"}, + {"id": "libc.so.6!memcpy", "external": true, "lib": "libc.so.6"} + ], + "edges": [ + {"from": "main", "to": "sub_401000", "kind": "direct"}, + {"from": "main", "to": "libc.so.6!memcpy", "kind": "plt", "evidence": ["reloc@GOT"]} + ], + "roots": ["_start", "init_array[]", "main"] + }, + "reachability": [ + { + "target": "libssl.so.3!SSL_free", + "status": "NOT_REACHABLE_FOUNDATION", + "path": [] + }, + { + "target": "libc.so.6!memcpy", + "status": "REACHABLE_CONFIRMED", + "path": ["main", "libc.so.6!memcpy"], + "confidence": 0.98, + "evidence": ["plt", "dynsym", "reloc"] + } + ], + "provenance": { + "toolchain": { + "disasm": "ghidra_headless|radare2|llvm-mca", + "version": "…" + }, + "scan_manifest_hash": "…", + "timestamp_utc": "2025-11-16T00:00:00Z" + } +} +``` + +--- + +# Practical extractors (headless/CLI) + +* **DWARF**: `llvm-dwarfdump`/`eu-readelf` for quick CU/function ranges; fall back to the disassembler. +* **Disassembly/CFG/CG** (choose one or more; wrap with a stable adapter): + + * **Ghidra Headless API**: recover functions, basic blocks, references, PLT/GOT, vtables; export via a custom headless script to NJIF. + * **radare2 / rizin**: `aaa`, `agCd`, `aflj`, `agj` to export functions/graphs as JSON. + * **Binary Ninja headless** (if license permits) for cleaner IL and indirect‑call modeling. + * **angr** for path‑sensitive refinement on tricky indirect calls (optional, gated by budget). + +**Adapter principle:** All tools output a **small, consistent NJIF** so the scoring engine and lattice logic never depend on any single RE tool. + +--- + +# Indirect call modeling (concise rules) + +* **PLT/GOT**: edge from caller → `soname!symbol` with evidence: `plt`, `reloc@GOT`. +* **Function pointers**: if a store to a pointer is found and targets a known function set `{f1…fk}`, add edges with `kind: "indirect"`, `evidence: ["xref-store", "sig-compatible"]`. +* **Virtual calls / vtables**: class‑method set from RTTI/vtable scans; mark edges `evidence: ["vtable-match"]`. +* **Tailcalls**: treat as edges, not fallthrough. + +Each conservative step lowers **confidence**, but keeps determinism: the rules and their hashes are in the scan manifest. + +--- + +# Deterministic scoring (plug into Stella’s lattice) + +* **Inputs**: NJIF, CVE→function mapping (`soname!symbol` or function hash), policy knobs. +* **States**: `{NOT_OBSERVED < POSSIBLE < REACHABLE_CONFIRMED}` with **monotone** merge (never oscillates). 
* **Confidence**: product of edge-evidence weights (configurable): `direct=1.0, plt=0.98, vtable=0.85, funcptr=0.7`.
* **Output**: OpenVEX/CSAF annotations + human proof path; signed with DSSE to preserve replayability.

---

# Minimal Ghidra headless skeleton (exporter idea)

```bash
analyzeHeadless /work/gh_proj MyProj -import app -scriptPath scripts \
  -postScript ExportNjif.java /out/app.njif.json
```

```java
// ExportNjif.java (outline)
import ghidra.app.script.GhidraScript;

public class ExportNjif extends GhidraScript {
  @Override
  public void run() throws Exception {
    // iterate all functions in address order
    var fns = currentProgram.getFunctionManager().getFunctions(true);
    // collect functions, blocks, calls, externs/PLT
    // map unnamed functions to sub_<addr>
    // detect PLT thunks → dynsym names
    // write NJIF JSON deterministically (sorted keys, stable ordering)
  }
}
```

---

# Integration points in StellaOps

* **Scanner.Analyzers.Binary.Elf**

  * `ElfNormalizer` → hashes, dynamic deps.
  * `Symbolizer` → DWARF reader + HeuristicDisasm (via tool adapter).
  * `CgBuilder` → NJIF builder/merger (multi-module).
  * `ReachabilityEngine` → path search + confidence math.
  * `Emitter` → NJIF cache + VEX/CSAF notes.

* **Scheduler**: memoize by `(hashes, toolchain_version, ruleset_hash)` to ensure replayable results.

* **Authority**: sign NJIF + scoring outputs; store manifests (feeds, rule weights, tool versions).

---

# Test fixtures (suggested)

* Tiny ELF zoo: statically linked, PIE, stripped/non-stripped, C++ with vtables, musl vs glibc.
* Known CVE libs (e.g., `libssl`, `zlib`) with versioned symbols to validate soname!symbol mapping.
* Synthetic binaries with function-pointer tables to validate conservative edges.

---

If you want, I can generate:

* A ready-to-run **Ghidra headless exporter** (Java) that writes NJIF exactly like the above.
* A small **.NET parser** that ingests NJIF and emits StellaOps reachability + OpenVEX notes.

Below is a full architecture plan for implementing **stripped-ELF binary reachability** (call graph + NJIF + deterministic scoring, with a hook for patch-oracles) inside **StellaOps**.

I will assume .NET 10, the existing microservice split (Scanner.WebService, Scanner.Worker, Concelier, Excitior, Authority, Scheduler, Sbomer, Signals), and your standing rule: **all lattice logic runs in Scanner.WebService**.

---

## 1. Scope, Objectives, Non-Goals

### 1.1 Objectives

1. **Recover function-level call graphs from ELF binaries**, including **stripped** ones:

* Support ET_EXEC / ET_DYN / PIE, static & dynamic linking.
* Support at least **x86_64, aarch64** in v1, later armhf, riscv64.

2. **Produce a neutral, deterministic JSON representation (NJIF)**:

* Tool-agnostic: can be generated from Ghidra, radare2/rizin, Binary Ninja, angr, etc.
* Stable identifiers and schema so downstream services don’t depend on a specific RE engine.

3. **Compute function-level reachability for vulnerabilities**:

* Given CVE → `soname!symbol` (and later function-hash) mappings from Concelier,
* Decide `REACHABLE_CONFIRMED` / `REACHABLE_POSSIBLE` / `NOT_REACHABLE_FOUNDATION` with evidence and confidence.

4. **Integrate with StellaOps lattice and VEX outputs**:

* Lattice logic runs in **Scanner.WebService**.
* Results flow into Excitior (VEX) and Sbomer (SBOM annotations), preserving provenance.

5. **Enable deterministic replay** (a digest sketch follows this list):

* Every analysis run is tied to a **Scan Manifest**: tool versions, ruleset hashes, policy hashes, container image digests.
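To ground objective 5, here is a minimal sketch of reducing a Scan Manifest to one replayable digest, assuming canonical serialization (sorted keys, no incidental whitespace); the field set is illustrative, not the full manifest schema:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

record ScanManifest(
    string ImageDigest,
    IReadOnlyDictionary<string, string> ToolContainerDigests, // e.g. ghidra, rizin images
    string RulesetHash,
    string PolicyHash);

static class ManifestDigest
{
    public static string Compute(ScanManifest m)
    {
        // Sorted containers give byte-stable JSON for identical inputs.
        var canonical = new SortedDictionary<string, object>
        {
            ["imageDigest"] = m.ImageDigest,
            ["policyHash"] = m.PolicyHash,
            ["rulesetHash"] = m.RulesetHash,
            ["toolContainerDigests"] = new SortedDictionary<string, string>(
                m.ToolContainerDigests.ToDictionary(kv => kv.Key, kv => kv.Value)),
        };
        var bytes = Encoding.UTF8.GetBytes(JsonSerializer.Serialize(canonical));
        return "sha256:" + Convert.ToHexString(SHA256.HashData(bytes)).ToLowerInvariant();
    }
}
```

This digest is what Authority would sign with DSSE, so a replayed scan with identical inputs can be verified byte-for-byte.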
### 1.2 Non-Goals (v1)

* No dynamic runtime probes (EventPipe/JFR) in this phase.
* No full decompilation; we only need enough IR for calls/edges.
* No aggressive path-sensitive analysis (symbolic execution) in v1; that can be a v2 enhancement.

---

## 2. High-Level System Architecture

### 2.1 Components

* **Scanner.WebService (existing)**

  * REST/gRPC API for scans.
  * Orchestrates analysis jobs via Scheduler.
  * Hosts the **Lattice & Reachability Engine** for all artifact types.
  * Reads NJIF results, merges with Concelier function mappings and policies.

* **Scanner.Worker (existing, extended)**

  * Executes **Binary Analyzer Pipelines**.
  * Invokes RE tools (Ghidra, rizin, etc.) in dedicated containers.
  * Produces NJIF and persists it.

* **Binary Tools Containers (new)**

  * `stellaops-tools-ghidra:<version>`
  * `stellaops-tools-rizin:<version>`
  * Optionally `stellaops-tools-angr` for advanced passes.
  * Pinned versions, no network access (for determinism & air-gap).

* **Storage & Metadata**

  * **DB (PostgreSQL)**: scan records, NJIF metadata, reachability summaries.
  * **Object store** (MinIO/S3/Filesystem): NJIF JSON blobs, tool logs.
  * **Authority**: DSSE signatures for the Scan Manifest, NJIF, and reachability outputs.

* **Concelier**

  * Provides **CVE → component → function symbol/hashes** resolution.
  * Exposes a “Link-Not-Merge” graph of advisory, component, and function nodes.

* **Excitior (VEX)**

  * Consumes Scanner.WebService reachability states.
  * Emits OpenVEX/CSAF with properly justified statuses.

* **UnknownsRegistry (future)**

  * Receives unresolvable call edges / ambiguous functions from the analyzer,
  * Feeds them into “adaptive security” workflows.

### 2.2 End-to-End Flow (Binary / Image Scan)

1. Client requests a scan (binary or container image) via **Scanner.WebService**.
2. WebService:

   * Extracts binaries from OCI layers (if scanning an image),
   * Registers a **Scan Manifest**,
   * Submits a job to Scheduler (queue: `binary-elfflow`).
3. Scanner.Worker dequeues the job:

   * Detects ELF binaries,
   * Runs the **Binary Analyzer Pipeline** for each unique binary hash.
4. Worker uses the tools containers:

   * Ghidra/rizin → CFG, function discovery, call graph,
   * Converts to **NJIF**.
5. Worker persists NJIF + metadata; marks analysis complete.
6. Scanner.WebService picks up the NJIF:

   * Fetches advisory function mappings from Concelier,
   * Runs **Reachability & Lattice scoring**,
   * Updates scan results and triggers Excitior / Sbomer.

All steps are deterministic given:

* Input artifact,
* Tool container digests,
* Ruleset/policy versions.

---

## 3. Binary Analyzer Subsystem (Scanner.Worker)

Introduce a dedicated module:

* `StellaOps.Scanner.Analyzers.Binary.Elf`

### 3.1 Internal Layers

1. **ElfDetector**

   * Inspects files in a scan:

     * Magic `0x7f 'E' 'L' 'F'`,
     * Confirms architecture via the ELF header.
   * Produces `BinaryArtifact` records with:

     * `hashes` (SHA-256, BLAKE3),
     * `path` in container,
     * `arch`, `endianness`.

2. **ElfNormalizer**

   * Uses a lightweight library (e.g., ElfSharp) to extract:

     * `ElfType` (ET_EXEC, ET_DYN),
     * interpreter (`PT_INTERP`),
     * `DT_NEEDED` list,
     * RPATH/RUNPATH,
     * presence/absence of DWARF sections.
   * Emits a normalized `ElfMetadata` DTO.

3. **Symbolization Layer**

   * Sub-components:

     * `DwarfSymbolReader`: if DWARF is present, read CU, function ranges, names, inlines.
+ * `DynsymReader`: parse `.dynsym`, `.plt`, exported symbols. + * `HeuristicFunctionFinder`: + + * For stripped binaries: + + * Use disassembler xrefs, prolog patterns, return instructions, call-targets. + * Recognize PLT thunks → `soname!symbol`. + * Consolidates into `FunctionSymbol` entities: + + * `id` (e.g., `main`, `sub_401000`, `libc.so.6!memcpy`), + * `addr`, `size`, `is_external`, `from` (`dwarf`, `dynsym`, `heuristic`). + +4. **Disassembly & IR Layer** + + * Abstraction: `IDisassemblyAdapter`: + + * `Task AnalyzeAsync(BinaryArtifact, ElfMetadata, ScanManifest)` + * Implementations: + + * `GhidraDisassemblyAdapter`: + + * Invokes headless Ghidra in container, + * Receives machine-readable JSON (script-produced), + * Extracts functions, basic blocks, calls, GOT/PLT info, vtables. + * `RizinDisassemblyAdapter` (backup/fallback). + * Produces: + + * `BasicBlock` objects, + * `Instruction` metadata where needed for calls, + * `CallSite` records (direct, PLT, indirect). + +5. **Call-Graph Builder** + + * Consumes `FunctionSymbol` + `CallSite` sets. + * Identifies **roots**: + + * `_start`, `.init_array` entries, + * `main` (if present), + * Exported API functions for shared libs. + * Creates `CallGraph`: + + * Nodes: functions (`FunctionNode`), + * Edges: `CallEdge` with: + + * `kind`: `direct`, `plt`, `indirect-funcptr`, `indirect-vtable`, `tailcall`, + * `evidence`: tags like `["reloc@GOT", "sig-match", "vtable-class"]`. + +6. **Evidence & Confidence Annotator** + + * For each edge, computes a **local confidence**: + + * `direct`: 1.0 + * `plt`: 0.98 + * `indirect-funcptr`: 0.7 + * `indirect-vtable`: 0.85 + * For each path later, Scanner.WebService composes these. + +7. **NJIF Serializer** + + * Transforms domain objects into **NJIF JSON**: + + * Sorted keys, stable ordering for determinism. + * Writes: + + * `artifact`, `elf`, `symbols`, `cfg`, `cg`, and partial `reachability: []` (filled by WebService). + * Stores in object store, returns location + hash to DB. + +8. **Unknowns Reporting** + + * Any unresolved: + + * Indirect call with empty target set, + * Function region not mapped to symbol, + * Logged as `UnknownEvidence` records and optionally published to **UnknownsRegistry** stream. + +--- + +## 4. NJIF Data Model (Neutral JSON Intermediate Format) + +Define a stable schema with a top-level `njif_schema_version` field. + +### 4.1 Top-Level Shape + +```json +{ + "njif_schema_version": "1.0.0", + "artifact": { ... }, + "symbols": { ... }, + "cfg": [ ... ], + "cg": { ... }, + "reachability": [ ... ], + "provenance": { ... } +} +``` + +### 4.2 Key Sections + +1. `artifact` + + * `path`, `hashes`, `arch`, `elf.type`, `interpreter`, `needed`, `rpath`, `runpath`. + +2. `symbols` + + * `exported`: external/dynamic symbols, especially PLT: + + * `id`, `kind`, `plt`, `lib`. + * `functions`: + + * `id` (synthetic or real name), + * `addr`, `size`, `from` (source of naming info), + * `name_hint` (optional). + +3. `cfg` + + * Per-function basic block CFG plus call sites: + + * Blocks with `succ`, `calls` entries. + * Sufficient for future static checks, not full IR. + +4. `cg` + + * `nodes`: function nodes with evidence tags. + * `edges`: call edges with: + + * `from`, `to`, `kind`, `evidence`. + * `roots`: entrypoints for reachability algorithms. + +5. `reachability` + + * Initially empty from Worker. 
+ * Populated in Scanner.WebService as: + +```json +{ + "target": "libssl.so.3!SSL_free", + "status": "REACHABLE_CONFIRMED", + "path": ["_start", "main", "libssl.so.3!SSL_free"], + "confidence": 0.93, + "evidence": ["plt", "dynsym", "reloc"] +} +``` + +6. `provenance` + + * `toolchain`: + + * `disasm`: `"ghidra_headless:10.4"`, etc. + * `scan_manifest_hash`, + * `timestamp_utc`. + +### 4.3 Persisting NJIF + +* Object store (versioned path): + + * `njif/{sha256}/njif-v1.json` +* DB table `binary_njif`: + + * `binary_hash`, `njif_hash`, `schema_version`, `toolchain_digest`, `scan_manifest_id`. + +--- + +## 5. Reachability & Lattice Integration (Scanner.WebService) + +### 5.1 Inputs + +* **NJIF** for each binary (possibly multiple binaries per container). +* Concelier’s **CVE → (component, function)** resolution: + + * `component_id` → `soname!symbol` sets, and where available, function hashes. +* Scanner’s existing **lattice policies**: + + * States: e.g. `NOT_OBSERVED < POSSIBLE < REACHABLE_CONFIRMED`. + * Merge rules are monotone. + +### 5.2 Reachability Engine + +New service module: + +* `StellaOps.Scanner.Domain.Reachability` + + * `INjifRepository` (reads NJIF JSON), + * `IFunctionMappingResolver` (Concelier adapter), + * `IReachabilityCalculator`. + +Algorithm per target function: + +1. Resolve vulnerable function(s): + + * From Concelier: `soname!symbol` and/or `func_hash`. + * Map to NJIF `symbols.exported` or `symbols.functions`. + +2. For each binary: + + * Use `cg.roots` as entry set. + * BFS/DFS along edges until: + + * Reaching target node(s), + * Or graph fully explored. + +3. For each successful path: + + * Collect edges’ `confidence` weights, compute path confidence: + + * e.g., product of edge confidences or a log/additive scheme. + +4. Aggregate result: + + * If ≥ 1 path with only `direct/plt` edges: + + * `status = REACHABLE_CONFIRMED`. + * Else if only paths with indirect edges: + + * `status = REACHABLE_POSSIBLE`. + * Else: + + * `status = NOT_REACHABLE_FOUNDATION`. + +5. Emit `reachability` entry back into NJIF (or as separate DB table) and into scan result graph. + +### 5.3 Lattice & VEX + +* Lattice computation is done per `(CVE, component, binary)` triple: + + * Input: reachability status + other signals. +* Resulting state is: + + * Exposed to **Excitior** as a set of **evidence-annotated VEX facts**. +* Excitior translates: + + * `NOT_REACHABLE_FOUNDATION` → likely `not_affected` with justification “code_not_reachable”. + * `REACHABLE_CONFIRMED` → `affected` or “present_and_exploitable” (depending on overall policy). + +--- + +## 6. Patch-Oracle Extension (Advanced, but Architected Now) + +While not strictly required for v1, we should reserve architecture hooks. + +### 6.1 Concept + +* Given: + + * A **vulnerable** library build (or binary), + * A **patched** build. +* Run analyzers on both; produce NJIF for each. +* Compare call graphs & function bodies (e.g., hash of normalized bytes): + + * Identify **changed functions** and potentially changed code regions. +* Concelier links those function IDs to specific CVEs (via vendor patch metadata). +* These become authoritative “patched function sets” (the **patch oracle**). + +### 6.2 Integration Points + +Add a module: + +* `StellaOps.Scanner.Analysis.PatchOracle` + + * Input: pair of artifact hashes (old, new) + NJIF. + * Output: list of `FunctionPatchRecord`: + + * `function_id`, `binary_hash_old`, `binary_hash_new`, `change_kind` (`added`, `modified`, `deleted`). 
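A minimal sketch of what the PatchOracle emits, assuming functions are keyed by a stable id (symbol name where present) and carry a normalized body hash; stripped binaries would need fuzzier matching than this exact-key join:

```csharp
using System.Collections.Generic;
using System.Linq;

enum ChangeKind { Added, Modified, Deleted }

record FunctionPatchRecord(
    string FunctionId, string BinaryHashOld, string BinaryHashNew, ChangeKind Kind);

static class PatchOracleDiff
{
    public static List<FunctionPatchRecord> Diff(
        string oldBin, string newBin,
        IReadOnlyDictionary<string, string> oldFns,   // function id -> FnHash
        IReadOnlyDictionary<string, string> newFns)
    {
        var records = new List<FunctionPatchRecord>();
        foreach (var (id, hash) in newFns)
        {
            if (!oldFns.TryGetValue(id, out var oldHash))
                records.Add(new(id, oldBin, newBin, ChangeKind.Added));
            else if (oldHash != hash)
                records.Add(new(id, oldBin, newBin, ChangeKind.Modified)); // body changed under normalization
        }
        foreach (var id in oldFns.Keys.Where(id => !newFns.ContainsKey(id)))
            records.Add(new(id, oldBin, newBin, ChangeKind.Deleted));
        return records;
    }
}
```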
+ +Concelier: + +* Ingests `FunctionPatchRecord` via internal API and updates advisory graph: + + * CVE → function set derived from real patch. +* Reachability Engine: + + * Uses patch-derived function sets instead of or in addition to symbol mapping from vendor docs. + +--- + +## 7. Persistence, Determinism, Caching + +### 7.1 Scan Manifest + +For every scan job, create: + +* `scan_manifest`: + + * Input artifact hashes, + * List of binaries, + * Tool container digests (Ghidra, rizin, etc.), + * Ruleset/policy/lattice hashes, + * Time, user, and config flags. + +Authority signs this manifest with DSSE. + +### 7.2 Binary Analysis Cache + +Key: `(binary_hash, arch, toolchain_digest, njif_schema_version)`. + +* If present: + + * Skip re-running Ghidra/rizin; reuse NJIF. +* If absent: + + * Run analysis, then cache NJIF. + +This provides deterministic replay and prevents re-analysis across scans and across customers (if allowed by tenancy model). + +--- + +## 8. APIs & Integration Contracts + +### 8.1 Scanner.WebService External API (REST) + +1. `POST /api/scans/images` + + * Existing; extended to flag: `includeBinaryReachability: true`. +2. `POST /api/scans/binaries` + + * Upload a standalone ELF; returns `scan_id`. +3. `GET /api/scans/{scanId}/reachability` + + * Returns list of `(cve_id, component, binary_path, function_id, status, confidence, path)`. + +No path versioning; idempotent and additive (new fields appear, old ones remain valid). + +### 8.2 Internal APIs + +* **Worker ↔ Object Store**: + + * `PUT /binary-njif/{sha256}/njif-v1.json`. + +* **WebService ↔ Worker (via Scheduler)**: + + * Job payload includes: + + * `scan_manifest_id`, + * `binary_hashes`, + * `analysis_profile` (`default`, `deep`). + +* **WebService ↔ Concelier**: + + * `POST /internal/functions/resolve`: + + * Input: `(cve_id, component_ids[])`, + * Output: `soname!symbol[]`, optional `func_hash[]`. + +* **WebService ↔ Excitior**: + + * Existing VEX ingestion extended with **reachability evidence** fields. + +--- + +## 9. Observability, Security, Resource Model + +### 9.1 Observability + +* **Metrics**: + + * Analysis duration per binary, + * NJIF size, + * Cache hit ratio, + * Reachability evaluation time per CVE. + +* **Logs**: + + * Ghidra/rizin container logs stored alongside NJIF, + * Unknowns logs for unresolved call targets. + +* **Tracing**: + + * Each scan/analysis annotated with `scan_manifest_id` to allow end-to-end trace. + +### 9.2 Security + +* Tools containers: + + * No outbound network. + * Limited to read-only artifact mount + write-only result mount. +* Binary content: + + * Treated as confidential; stored encrypted at rest if your global policy requires it. +* DSSE: + + * Authority signs: + + * Scan Manifest, + * NJIF blob hash, + * Reachability summary. + * Enables “Proof-of-Integrity Graph” linkage later. + +### 9.3 Resource Model + +* ELF analysis can be heavy; design for: + + * Separate **worker queue** and autoscaling group for binary analysis. + * Configurable max concurrency and per-job CPU/memory limits. +* Deep analysis (indirect calls, vtables) can be toggled via `analysis_profile`. + +--- + +## 10. Implementation Roadmap + +A pragmatic, staged plan: + +### Phase 0 – Foundations (1–2 sprints) + +* Create `StellaOps.Scanner.Analyzers.Binary.Elf` project. +* Implement: + + * `ElfDetector`, `ElfNormalizer`. + * DB tables: `binary_artifacts`, `binary_njif`. +* Integrate with Scheduler and Worker pipeline. 
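Phase 0's `ElfDetector` can stay tiny; here is a minimal probe sketch, assuming classification needs only the ELF identification bytes (the machine architecture would come from `e_machine` later); `ElfIdent` is an illustrative DTO:

```csharp
using System;
using System.IO;

record ElfIdent(bool IsElf, string? Class, string? Endianness);

static class ElfDetector
{
    public static ElfIdent Probe(string path)
    {
        Span<byte> ident = stackalloc byte[6];
        using var fs = File.OpenRead(path);
        if (fs.Read(ident) < 6 ||
            ident[0] != 0x7f || ident[1] != (byte)'E' ||
            ident[2] != (byte)'L' || ident[3] != (byte)'F')
            return new ElfIdent(false, null, null);

        // EI_CLASS: 1 = 32-bit, 2 = 64-bit; EI_DATA: 1 = little endian, 2 = big endian.
        return new ElfIdent(
            true,
            ident[4] == 2 ? "elf64" : "elf32",
            ident[5] == 2 ? "big" : "little");
    }
}
```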
+ +### Phase 1 – Non-stripped ELF + NJIF v1 (2–3 sprints) + +* Implement **DWARF + dynsym symbolization**. +* Implement **GhidraDisassemblyAdapter** for x86_64. +* Build **CallGraphBuilder** (direct + PLT calls). +* Implement NJIF serializer v1; store in object store. +* Basic reachability engine in WebService: + + * Only direct and PLT edges, + * Only for DWARF-named functions. +* Integrate with Concelier function mapping via `soname!symbol`. + +### Phase 2 – Stripped ELF Support (2–3 sprints) + +* Implement `HeuristicFunctionFinder` for function discovery in stripped binaries. +* Extend Ghidra script to mark PLT/GOT, vtables, function pointers. +* Call graph: add: + + * `indirect-funcptr`, `indirect-vtable`, `tailcall` edges. +* Evidence tagging and local confidence scoring. +* Extend reachability engine to: + + * Distinguish `REACHABLE_CONFIRMED` vs `REACHABLE_POSSIBLE`. + +### Phase 3 – Multi-Arch & Performance (2–3 sprints) + +* Add support for **aarch64** (Ghidra language, appropriate calling conventions). +* Optimize: + + * Binary analysis cache, + * Tool container lifecycle, + * Concurrent analysis. +* Add Unknowns reporting and hookup to UnknownsRegistry (if already implemented). + +### Phase 4 – Patch-Oracle Pilot (2–3 sprints) + +* Implement PatchOracle module: + + * Compare old/new NJIFs, + * Detect changed functions. +* Integrate with Concelier’s advisory graph. +* Start validating against curated CVE/patch datasets. + +### Phase 5 – Hardening & Documentation + +* Golden fixtures: + + * Small ELF zoo (stripped/non-stripped, PIE, static, C++, vtables). + * Known vulnerable libs (e.g., OpenSSL, zlib) to confirm correct function mapping. +* Add CLI/demo in `StellaOps.Scanner.Cli`: + + * `stellaops scan-binary --file app --show-reachability`. +* Customer-facing and internal docs: + + * NJIF schema, + * API usage, + * Limitations and interpretation guidelines. + +--- + +If you want, next step I can do is take this plan and: + +* Break it into **epics / tickets** (SCAN-BINARY-xxx) with clear DoD per phase, or +* Draft the **Ghidra headless Java script** and the **.NET NJIF model classes** so your agents can plug them straight into the Scanner repo. diff --git a/docs/product-advisories/archived/18-Nov-2026 - Binary-Reachability-Engine.md b/docs/product-advisories/archived/18-Nov-2025 - Binary-Reachability-Engine.md similarity index 96% rename from docs/product-advisories/archived/18-Nov-2026 - Binary-Reachability-Engine.md rename to docs/product-advisories/archived/18-Nov-2025 - Binary-Reachability-Engine.md index 74e1709a7..d99db6660 100644 --- a/docs/product-advisories/archived/18-Nov-2026 - Binary-Reachability-Engine.md +++ b/docs/product-advisories/archived/18-Nov-2025 - Binary-Reachability-Engine.md @@ -1,927 +1,927 @@ - -Here’s a crisp idea that could give Stella Ops a real moat: **binary‑level reachability**—linking CVEs directly to the exact functions and offsets inside compiled artifacts (ELF/PE/Mach‑O), not just to packages. - ---- - -### Why this matters (quick background) - -* **Package‑level flags are noisy.** Most scanners say “vuln in `libX v1.2`,” but that library might be present and never executed. -* **Language‑level call graphs help** (when you have source or rich metadata), but containers often ship only **stripped binaries**. -* **Binary reachability** answers: *Is the vulnerable function actually in this image? 
Is its code path reachable from the entrypoints we observed or can construct?* - ---- - -### The missing layer: Symbolization - -Build a **symbolization layer** that normalizes debug and symbol info across platforms: - -* **Inputs**: DWARF (ELF/Mach‑O), PDB (PE/Windows), symtabs, exported symbols, `.eh_frame`, and (when stripped) heuristic signatures (e.g., function byte‑hashes, CFG fingerprints). -* **Outputs**: a source‑agnostic map: `{binary → sections → functions → (addresses, ranges, hashes, demangled names, inlined frames)}`. -* **Normalization**: Put everything into a common schema (e.g., `Stella.Symbolix.v1`) so higher layers don’t care if it came from DWARF or PDB. - ---- - -### End‑to‑end reachability (binary‑first, source‑agnostic) - -1. **Acquire & parse** - - * Detect format (ELF/PE/Mach‑O), parse headers, sections, symbol tables. - * If debug info present: parse DWARF/PDB; else fall back to disassembly + function boundary recovery. -2. **Function catalog** - - * Assign stable IDs per function: `(imageHash, textSectionHash, startVA, size, fnHashXX)`. - * Record x‑refs (calls/jumps), imports/exports, PLT/IAT edges. -3. **Entrypoint discovery** - - * Docker entry, process launch args, service scripts; infer likely mains (Go `main.main`, .NET hostfxr path, JVM launcher, etc.). -4. **Call‑graph build (binary CFG)** - - * Build inter/intra‑procedural graph (direct + resolved indirect via IAT/PLT). Keep “unknown‑target” edges for conservative safety. -5. **CVE→function linking** - - * Maintain a **signature bank** per CVE advisory: vulnerable function names, file paths, and—crucially—**byte‑sequence or basic‑block fingerprints** for patched vs vulnerable versions (works even when stripped). -6. **Reachability analysis** - - * Is the vulnerable function present? Is there a path from any entrypoint to it (under conservative assumptions)? Tag as `Present+Reachable`, `Present+Uncertain`, or `Absent`. -7. **Runtime confirmation (optional, when users allow)** - - * Lightweight probes (eBPF on Linux, ETW on Windows, perf/JFR/EventPipe) capture function hits; cross‑check with the static result to upgrade confidence. - ---- - -### Minimal component plan (drop into Stella Ops) - -* **Scanner.Symbolizer** - Parsers: ELF/DWARF (libdw or pure‑managed reader), PE/PDB (Dia/LLVM PDB), Mach‑O/DSYM. - Output: `Symbolix.v1` blobs stored in OCI layer cache. -* **Scanner.CFG** - Lifts functions to a normalized IR (capstone/iced‑x86 for decode) → builds CFG & call graph. -* **Advisory.FingerprintBank** - Ingests CSAF/OpenVEX plus curated fingerprints (fn names, block hashes, patch diff markers). Versioned, signed, air‑gap‑syncable. -* **Reachability.Engine** - Joins (`Symbolix` + `CFG` + `FingerprintBank`) → emits `ReachabilityEvidence` with lattice states for VEX. -* **VEXer.Adapter** - Emits **OpenVEX** statements with `status: affected/not_affected` and `justification: function_not_present | function_not_reachable | mitigated_at_runtime`, attaching Evidence URIs. -* **Console UX** - “Why not affected?” panel showing entrypoint→…→function path (or absence), with byte‑hash proof. 
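For the VEXer.Adapter output above, an illustrative OpenVEX statement (hypothetical CVE/product values). Note that the justification labels in this plan are Stella-internal vocabulary; the OpenVEX spec's nearest built-in terms are `vulnerable_code_not_present` and `vulnerable_code_not_in_execute_path`:

```json
{
  "@context": "https://openvex.dev/ns/v0.2.0",
  "author": "Stella Ops VEXer",
  "timestamp": "2025-11-18T00:00:00Z",
  "statements": [
    {
      "vulnerability": { "name": "CVE-2023-0464" },
      "products": [ { "@id": "pkg:deb/debian/openssl@3.0.9-1" } ],
      "status": "not_affected",
      "justification": "vulnerable_code_not_in_execute_path",
      "impact_statement": "Function present, but no path from observed entrypoints (static call-graph evidence attached)."
    }
  ]
}
```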
- ---- - -### Data model sketch (concise) - -* `ImageFunction { id, name?, startVA, size, fnHash, sectionHash, demangled?, provenance:{DWARF|PDB|Heuristic} }` -* `Edge { srcFnId, dstFnId, kind:{direct|plt|iat|indirect?} }` -* `CveSignature { cveId, fnName?, libHints[], blockFingerprints[], versionRanges }` -* `Evidence { cveId, imageId, functionMatches[], reachable: bool?, confidence:[low|med|high], method:[static|runtime|hybrid] }` - ---- - -### Practical phases (8–10 weeks of focused work) - -1. **P0**: ELF/DWARF symbolizer + basic function catalog; link a handful of CVEs via name‑only; emit OpenVEX `function_not_present`. -2. **P1**: CFG builder (direct calls) + PLT/IAT resolution; simple reachability; first fingerprints for top 50 CVEs in glibc, openssl, curl, zlib. -3. **P2**: Stripped‑binary heuristics (block hashing) + Go/Rust name demangling; Windows PDB ingestion for PE. -4. **P3**: Runtime probes (opt‑in) + confidence upgrade logic; Console path explorer; evidence signing (DSSE). - ---- - -### KPIs to prove the moat - -* **Noise cut**: % reduction in “affected” flags after reachability (target 40–70% on typical containers). -* **Precision**: Ground‑truth validation vs PoC images (TP/FP/FN on presence & reachability). -* **Coverage**: % images where we can make a determination without source (goal: >80%). -* **Latency**: Added scan time per image (<15s typical with caches). - ---- - -### Risks & how to handle them - -* **Stripped binaries** → mitigate with block‑hash fingerprints & library‑version heuristics. -* **Obfuscated/packed code** → mark `Uncertain`; allow user‑supplied hints; prefer runtime confirmation. -* **Advisory inconsistency** → keep our own curated CVE→function fingerprint bank; sign & version it. -* **Platform spread** → start Linux/ELF, then Windows/PDB, then Mach‑O. - ---- - -### Why competitors struggle - -Most tools stop at packages because binary CFG + fingerprint curation is hard and expensive. Shipping a **source‑agnostic reachability engine** tied to signed evidence in VEX would set Stella Ops apart—especially in offline/air‑gapped and sovereign contexts you already target. - -If you want, I can draft: - -* the `Symbolix.v1` protobuf, -* a tiny PoC (ELF→functions→match CVE with a block fingerprint), -* and the OpenVEX emission snippet your VEXer can produce. -Below is a detailed architecture plan for implementing reachability and call-graph analysis in Stella Ops, covering JavaScript, Python, PHP, and binaries, and integrating with your existing Scanner / Concelier / VEXer stack. - -I will assume: - -* .NET 10 for core services. -* Scanner is the place where all “trust algebra / lattice” runs (per your standing rule). -* Concelier and VEXer remain “preserve/prune” layers and do not run lattice logic. -* Output must be JSON-centric with PURLs and OpenVEX. - ---- - -## 1. Scope & Objectives - -### 1.1 Primary goals - -1. From an OCI image, build: - - * A **library-level usage graph** (which libraries are used by which entrypoints). - * A **function-level call graph** for JS / Python / PHP / binaries. -2. Map CVEs (from Concelier) to: - - * Concrete **components** (PURLs) in the SBOM. - * Concrete **functions / entrypoints / code regions** inside those components. -3. Perform **reachability analysis** to classify each vulnerability as: - - * `present + reachable` - * `present + not_reachable` - * `function_not_present` (no vulnerable symbol) - * `uncertain` (dynamic features, unresolved calls) -4. 
Emit:

* **Structured JSON** with PURLs and call-graph nodes/edges (“reachability evidence”).
* **OpenVEX** documents with appropriate `status`/`justification`.

### 1.2 Non-goals (for now)

* Full dynamic analysis of the running container (eBPF, ptrace, etc.) – leave as a Phase 3+ optional add-on.
* Perfect call-graph precision for dynamic languages (aim for safe, conservative approximations).
* Automatic “fix recommendations” (handled by other Stella Ops agents later).

---

## 2. High-Level Architecture

### 2.1 Major components

Within Stella Ops:

* **Scanner.WebService**

  * User-facing API.
  * Orchestrates the full scan (SBOM, CVEs, reachability).
  * Hosts the **Lattice/Policy engine** that merges evidence and produces decisions.
* **Scanner.Worker**

  * Runs per-image analysis jobs.
  * Invokes analyzers (JS, Python, PHP, Binary) inside its own container context.
* **Scanner.Reachability Core Library**

  * Unified IR for call graphs and reachability evidence.
  * Interfaces for language and binary analyzers.
  * Graph algorithms (BFS/DFS, lattice evaluation, entrypoint expansion).
* **Language Analyzers**

  * `Scanner.Analyzers.JavaScript`
  * `Scanner.Analyzers.Python`
  * `Scanner.Analyzers.Php`
  * `Scanner.Analyzers.Binary`
* **Symbolization & CFG (for binaries)**

  * `Scanner.Symbolization` (ELF, PE, Mach-O parsers, DWARF/PDB)
  * `Scanner.Cfg` (CFG + call graph for binaries)
* **Vulnerability Signature Bank**

  * `Concelier.Signatures` (curated CVE→function/library fingerprints).
  * Exposed to Scanner as an **offline bundle**.
* **VEXer**

  * `Vexer.Adapter.Reachability` – transforms reachability evidence into OpenVEX.

### 2.2 Data flow (logical)

```mermaid
flowchart LR
    A[OCI Image / Tar] --> B["Scanner.Worker: Extract FS"]
    B --> C["SBOM Engine (CycloneDX/SPDX)"]
    C --> D["Vuln Match (Concelier feeds)"]
    B --> E1[JS Analyzer]
    B --> E2[Python Analyzer]
    B --> E3[PHP Analyzer]
    B --> E4["Binary Analyzer + Symbolizer/CFG"]

    D --> F[Reachability Orchestrator]
    E1 --> F
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G["Lattice/Policy Engine (Scanner.WebService)"]
    G --> H[Reachability Evidence JSON]
    G --> I["VEXer: OpenVEX"]
    G --> J["Graph/Cartographer (optional)"]
```

---

## 3. Data Model & JSON Contracts

### 3.1 Core IR types (Scanner.Reachability)

Define in a central assembly, e.g. `StellaOps.Scanner.Reachability`:

```csharp
public record ComponentRef(
    string Purl,
    string? BomRef,
    string? Name,
    string? Version);

public enum SymbolKind { Function, Method, Constructor, Lambda, Import, Export }

public record SymbolId(
    string Language,       // "js", "python", "php", "binary"
    string ComponentPurl,  // SBOM component PURL or "" for app code
    string LogicalName,    // e.g., "server.js:handleLogin"
    string? FilePath,
    int? Line);

public record CallGraphNode(
    string Id,             // stable id, e.g., hash(SymbolId)
    SymbolId Symbol,
    SymbolKind Kind,
    bool IsEntrypoint);

public enum CallEdgeKind { Direct, Indirect, Dynamic, External, Ffi }

public record CallGraphEdge(
    string FromNodeId,
    string ToNodeId,
    CallEdgeKind Kind);

public record CallGraph(
    string GraphId,
    IReadOnlyList<CallGraphNode> Nodes,
    IReadOnlyList<CallGraphEdge> Edges);
```

### 3.2 Vulnerability mapping

```csharp
public record VulnerabilitySignature(
    string Source,                               // "csaf", "nvd", "vendor"
    string Id,                                   // "CVE-2023-12345"
    IReadOnlyList<string> Purls,
    IReadOnlyList<string> TargetSymbolPatterns,  // glob-like or regex
    IReadOnlyList<string>? FilePathPatterns,
    IReadOnlyList<string>? BlockFingerprints     // for binaries, optional
);
```

### 3.3 Reachability evidence

```csharp
public enum ReachabilityStatus
{
    PresentReachable,
    PresentNotReachable,
    FunctionNotPresent,
    Unknown
}

public record ReachabilityEvidence
(
    string ImageRef,
    string VulnId,                               // CVE or advisory id
    ComponentRef Component,
    ReachabilityStatus Status,
    double Confidence,                           // 0..1
    string Method,                               // "static-callgraph", "binary-fingerprint", etc.
    IReadOnlyList<string> EntrypointNodeIds,
    IReadOnlyList<IReadOnlyList<string>>? ExamplePaths  // optional list of node-paths
);
```

### 3.4 JSON structure (external)

Minimal external JSON (what you store / expose):

```json
{
  "image": "registry.example.com/app:1.2.3",
  "components": [
    {
      "purl": "pkg:npm/express@4.18.0",
      "bomRef": "component-1"
    }
  ],
  "callGraphs": [
    {
      "graphId": "js-main",
      "language": "js",
      "nodes": [ /* CallGraphNode */ ],
      "edges": [ /* CallGraphEdge */ ]
    }
  ],
  "reachability": [
    {
      "vulnId": "CVE-2023-12345",
      "componentPurl": "pkg:npm/express@4.18.0",
      "status": "PresentReachable",
      "confidence": 0.92,
      "entrypoints": [ "node:..." ],
      "paths": [
        ["node:entry", "node:routeHandler", "node:vulnFn"]
      ]
    }
  ]
}
```

---

## 4. Scanner-Side Architecture

### 4.1 Project layout (suggested)

```text
src/
  Scanner/
    StellaOps.Scanner.WebService/
    StellaOps.Scanner.Worker/
    StellaOps.Scanner.Core/             # shared scan domain
    StellaOps.Scanner.Reachability/
    StellaOps.Scanner.Symbolization/
    StellaOps.Scanner.Cfg/
    StellaOps.Scanner.Analyzers.JavaScript/
    StellaOps.Scanner.Analyzers.Python/
    StellaOps.Scanner.Analyzers.Php/
    StellaOps.Scanner.Analyzers.Binary/
```

### 4.2 API surface (Scanner.WebService)

* `POST /api/scan/image`

  * Request: `{ "imageRef": "...", "profile": { "reachability": true, ... } }`
  * Returns: scan id.
* `GET /api/scan/{id}/reachability`

  * Returns: `ReachabilityEvidence[]`, plus a call-graph summary (optional).
* `GET /api/scan/{id}/vex`

  * Returns: OpenVEX with statuses based on the reachability lattice.

### 4.3 Worker orchestration

`StellaOps.Scanner.Worker`:

1. Receives a scan job with `imageRef`.

2. Extracts the filesystem (layered rootfs) under `/mnt/scans/{scanId}/rootfs`.

3. Invokes the SBOM generator (CycloneDX/SPDX).

4. Invokes Concelier via offline feeds to get:

   * Component vulnerabilities (CVE list per PURL).
   * Vulnerability signatures (fingerprints).

5. Builds a `ReachabilityPlan`:

   ```csharp
   public record ReachabilityPlan(
       IReadOnlyList<ComponentRef> Components,
       IReadOnlyList<VulnerabilitySignature> Vulns,
       IReadOnlyList<AnalyzerTarget> AnalyzerTargets  // files/dirs grouped by language
   );
   ```

6. For each language target, dispatch an analyzer:

   * JavaScript: `IReachabilityAnalyzer` implementation for JS.
   * Python: likewise.
   * PHP: likewise.
   * Binary: symbolizer + CFG.

7. Collects call graphs from each analyzer and merges them into a single IR (or separate per-language graphs with shared IDs).

8. Sends the merged graphs + vuln list to the **Reachability Engine** (Scanner.Reachability).

---
## 5. Language Analyzers (JS / Python / PHP)

All analyzers implement a common interface:

```csharp
public interface IReachabilityAnalyzer
{
    string Language { get; }  // "js", "python", "php"

    Task<CallGraph> AnalyzeAsync(AnalyzerContext context, CancellationToken ct);
}

public record AnalyzerContext(
    string RootFsPath,
    IReadOnlyList<ComponentRef> Components,
    IReadOnlyList<VulnerabilitySignature> Vulnerabilities,
    IReadOnlyDictionary<string, string> Env,  // container env, entrypoint, etc.
    string? EntrypointCommand                 // container CMD/ENTRYPOINT
);
```

### 5.1 JavaScript (Node.js focus)

**Inputs:**

* `/app` tree inside the container (or discovered via SBOM).
* `package.json` files.
* Container entrypoint (e.g., `["node", "server.js"]`).

**Core steps:**

1. Identify the **app root**:

   * Heuristics: the directory containing the `package.json` that owns the entry script.
2. Parse:

   * All `.js`, `.mjs`, `.cjs` in the app tree and in `node_modules` for vulnerable PURLs.
   * Use a parsing frontend (e.g., Tree-sitter via a .NET binding, or Node+AST-as-JSON).
3. Build the module graph:

   * `require`, `import`, `export`.
4. Function-level graph:

   * For each function/method, create a `CallGraphNode`.
   * For each `callExpression`, create a `CallGraphEdge` (try to resolve the callee).
5. Entrypoints:

   * Main script in CMD/ENTRYPOINT.
   * HTTP route handlers (for express/koa) detected by patterns (e.g., `app.get("/...")`).
6. Map vulnerable symbols:

   * From `VulnerabilitySignature.TargetSymbolPatterns` (e.g., `express/lib/router/layer.js:handle_request`).
   * Identify nodes whose `SymbolId` matches the patterns.

**Output:**

* `CallGraph` for JS with:

  * `IsEntrypoint = true` for main and detected handlers.
  * Node attributes include file path, line, component PURL.

### 5.2 Python

**Inputs:**

* Site-packages paths from the SBOM.
* Entrypoint script (CMD/ENTRYPOINT).
* Framework heuristics (Django, Flask) from environment variables or common entrypoints.

**Core steps:**

1. Discover the Python interpreter chain: not needed for pure static analysis, but useful for heuristics.
2. Parse `.py` files of:

   * App code.
   * Vulnerable packages (per PURL).
3. Build the module import graph (`import`, `from x import y`).
4. Function-level graph:

   * Nodes for functions, methods, class constructors.
   * Edges for call expressions; conservative for dynamic calls.
5. Entrypoints:

   * Main script.
   * WSGI callable (e.g., `application` in `wsgi.py`).
   * Django URLconf -> view functions.
6. Map vulnerable symbols using `TargetSymbolPatterns` like `django.middleware.security.SecurityMiddleware.__call__`.

### 5.3 PHP

**Inputs:**

* Web root (from the container image or conventional paths `/var/www/html`, `/app/public`, etc.).
* Composer metadata (`composer.json`, `vendor/`).
* Web server config if present (optional).

**Core steps:**

1. Discover front controllers (e.g., `index.php`, `public/index.php`).
2. Parse PHP files (again, via Tree-sitter or any suitable parser).
3. Resolve include/require chains to build a file-level inclusion graph.
4. Build the function/method graph:

   * Functions, methods, class constructors.
   * Calls with best-effort resolution for namespaced functions.
5. Entrypoints:

   * Front controllers and router entrypoints (e.g., Symfony, Laravel detection).
6. Map vulnerable symbols (e.g., functions in certain vendor packages, particular methods).

---

## 6. Binary Analyzer & Symbolizer

Project: `StellaOps.Scanner.Analyzers.Binary` + `Symbolization` + `Cfg`.
- -### 6.1 Inputs - -* All binaries and shared libraries in: - - * `/usr/lib`, `/lib`, `/app/bin`, etc. -* SBOM link: each binary mapped to its component PURL when possible. -* Vulnerability signatures for native libs: function names, symbol names, fingerprints. - -### 6.2 Symbolization - -Module: `StellaOps.Scanner.Symbolization` - -* Detect format: ELF, PE, Mach-O. -* For ELF/Mach-O: - - * Parse symbol tables (`.symtab`, `.dynsym`). - * Parse DWARF (if present) to map functions to source files/lines. -* For PE: - - * Parse PDB (if present) or export table. -* For stripped binaries: - - * Run function boundary recovery (linear sweep + heuristic). - * Compute block/fn-level hashes for fingerprinting. - -Output: - -```csharp -public record ImageFunction( - string ImageId, // e.g., SHA256 of file - ulong StartVa, - uint Size, - string? SymbolName, // demangled if possible - string FnHash, // stable hash of bytes / CFG - string? SourceFile, - int? SourceLine); -``` - -### 6.3 CFG + Call graph - -Module: `StellaOps.Scanner.Cfg` - -* Disassemble `.text` using Capstone/Iced.x86. -* Build basic blocks and CFG. -* Identify: - - * Direct calls (resolved). - * PLT/IAT indirections to shared libraries. -* Build `CallGraph` for binary functions: - - * Entrypoints: `main`, exported functions, Go `main.main`, etc. - * Map application functions to library functions via PLT/IAT edges. - -### 6.4 Linking vulnerabilities - -* For each vulnerability affecting a native library (e.g., OpenSSL): - - * Map to candidate binaries via SBOM + PURL. - * Within library image, find `ImageFunction`s matching: - - * `SymbolName` patterns. - * `FnHash` / `BlockFingerprints` (for precise detection). -* Determine reachability: - - * Starting from application entrypoints, traverse call graph to see if calls to vulnerable library function occur. - ---- - -## 7. Reachability Engine & Lattice (Scanner.WebService) - -Project: `StellaOps.Scanner.Reachability` - -### 7.1 Inputs to engine - -* Combined `CallGraph[]` (per language + binary). -* Vulnerability list (CVE, GHSA, etc.) with affected PURLs. -* Vulnerability signatures. -* Entrypoint hints: - - * Container CMD/ENTRYPOINT. - * Detected HTTP handlers, WSGI/PSGI entrypoints, etc. - -### 7.2 Algorithm steps - -1. **Entrypoint expansion** - - * Identify all `CallGraphNode` with `IsEntrypoint=true`. - * Add language-specific “framework entrypoints” (e.g., Express route dispatch, Django URL dispatch) when detected. - -2. **Graph traversal** - - * For each entrypoint node: - - * BFS/DFS through edges. - * Maintain `reachable` bit on each node. - * For dynamic edges: - - * Conservative: if target cannot be resolved, mark affected path as partially unknown and downgrade confidence. - -3. **Vuln symbol resolution** - - * For each vulnerability: - - * For each vulnerable component PURL found in SBOM: - - * Find candidate nodes whose `SymbolId` matches `TargetSymbolPatterns` / binary fingerprints. - * If none found: - - * `FunctionNotPresent` (if component version range indicates vulnerable but we cannot find symbol – low confidence). - * If found: - - * Check `reachable` bit: - - * If reachable by at least one entrypoint, `PresentReachable`. - * Else, `PresentNotReachable`. - -4. **Confidence computation** - - * Start from: - - * `1.0` for direct match with explicit function name & static call. - * Lower for: - - * Heuristic framework entrypoints. - * Dynamic calls. - * Fingerprint-only matches on stripped binaries. - * Example rule-of-thumb: - - * direct static path only: 0.95–1.0. 
- * dynamic edges but symbol found: 0.7–0.9. - * symbol not found but version says vulnerable: 0.4–0.6. - -5. **Lattice merge** - - * Represent each CVE+component pair as a lattice element with states: `{affected, not_affected, unknown}`. - * Reachability engine produces a **local state**: - - * `PresentReachable` → candidate `affected`. - * `PresentNotReachable` or `FunctionNotPresent` → candidate `not_affected`. - * `Unknown` → `unknown`. - * Merge with: - - * Upstream vendor VEX (from Concelier). - * Policy overrides (e.g., “treat certain CVEs as affected unless vendor says otherwise”). - * Final state computed here (Scanner.WebService), not in Concelier or VEXer. - -6. **Evidence output** - - * For each vulnerability: - - * Emit `ReachabilityEvidence` with: - - * Status. - * Confidence. - * Method. - * Example entrypoint paths (for UX and audit). - * Persist this evidence alongside regular scan results. - ---- - -## 8. Integration with SBOM & VEX - -### 8.1 SBOM annotation - -* Extend SBOM documents (CycloneDX / SPDX) with extra properties: - - * CycloneDX: - - * `component.properties`: - - * `stellaops:reachability:status` = `present_reachable|present_not_reachable|function_not_present|unknown` - * `stellaops:reachability:confidence` = `0.0-1.0` - * SPDX: - - * `Annotation` or `ExternalRef` with similar metadata. - -### 8.2 OpenVEX generation - -Module: `StellaOps.Vexer.Adapter.Reachability` - -* For each `(vuln, component)` pair: - - * Map to VEX statement: - - * If `PresentReachable`: - - * `status: affected` - * `justification: component_not_fixed` or similar. - * If `PresentNotReachable`: - - * `status: not_affected` - * `justification: function_not_reachable` - * If `FunctionNotPresent`: - - * `status: not_affected` - * `justification: component_not_present` or `function_not_present` - * If `Unknown`: - - * `status: under_investigation` (configurable). - -* Attach evidence via: - - * `analysis` / `details` fields (link to internal evidence JSON or audit link). - -* VEXer does not recalculate reachability; it uses the already computed decision + evidence. - ---- - -## 9. Executable Containers & Offline Operation - -### 9.1 Executable containers - -* Analyzers run inside a dedicated Scanner worker container that has: - - * .NET 10 runtime. - * Language runtimes if needed for parsing (Node, Python, PHP), or Tree-sitter-based parsing. -* Target image filesystem is mounted read-only under `/mnt/rootfs`. -* No network access (offline/air-gap). -* This satisfies “we will use executable containers” while keeping separation between: - - * Target image (mount only). - * Analyzer container (StellaOps code). - -### 9.2 Offline signature bundles - -* Concelier periodically exports: - - * Vulnerability database (CSAF/NVD). - * Vulnerability Signature Bank. -* Bundles are: - - * DSSE-signed. - * Versioned (e.g., `signatures-2025-11-01.tar.zst`). -* Scanner uses: - - * The bundle digest as part of the **Scan Manifest** for deterministic replay. - ---- - -## 10. Determinism & Caching - -### 10.1 Layer-level caching - -* Key: `layerDigest + analyzerVersion + signatureBundleVersion`. -* Cache artifacts: - - * CallGraph(s) per layer (for JS/Python/PHP code present in that layer). - * Symbolization results per binary file hash. -* For images sharing layers: - - * Merge cached graphs instead of re-analyzing. 
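The cache key above maps naturally onto an object-store path; a minimal sketch, with `IObjectStore` as a stand-in for the real storage interface:

```csharp
using System;
using System.Threading.Tasks;

interface IObjectStore
{
    Task<byte[]?> TryGetAsync(string key);
    Task PutAsync(string key, byte[] value);
}

static class LayerGraphCache
{
    // Key mirrors 10.1: layerDigest + analyzerVersion + signatureBundleVersion.
    public static string Key(string layerDigest, string analyzerVersion, string bundleVersion)
        => $"callgraphs/{layerDigest}/{analyzerVersion}/{bundleVersion}.json";

    public static async Task<byte[]> GetOrComputeAsync(
        IObjectStore store, string key, Func<Task<byte[]>> analyze)
    {
        var cached = await store.TryGetAsync(key);
        if (cached is not null) return cached;  // shared layers skip re-analysis

        var fresh = await analyze();
        await store.PutAsync(key, fresh);
        return fresh;
    }
}
```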
-
-### 10.2 Deterministic scan manifest
-
-For each scan, produce:
-
-```json
-{
-  "imageRef": "registry/app:1.2.3",
-  "imageDigest": "sha256:...",
-  "scannerVersion": "1.4.0",
-  "analyzerVersions": {
-    "js": "1.0.0",
-    "python": "1.0.0",
-    "php": "1.0.0",
-    "binary": "1.0.0"
-  },
-  "signatureBundleDigest": "sha256:...",
-  "callGraphDigest": "sha256:...",   // canonical JSON hash
-  "reachabilityEvidenceDigest": "sha256:..."
-}
-```
-
-This manifest can be signed (Authority module) and used for audits and replay.
-
----
-
-## 11. Implementation Roadmap (Phased)
-
-### Phase 0 – Infrastructure & Binary presence
-
-**Duration:** 1 sprint
-
-* Set up `Scanner.Reachability` core types and interfaces.
-* Implement:
-
-  * Basic Symbolizer for ELF + DWARF.
-  * Binary function catalog without CFG.
-* Link a small set of CVEs to binary function presence via `SymbolName`.
-* Expose minimal evidence:
-
-  * `PresentReachable`/`FunctionNotPresent` based only on presence (no call graph).
-* Integrate with VEXer to emit `function_not_present` justifications.
-
-**Success criteria:**
-
-* For selected demo images with known vulnerable/patched OpenSSL, scanner can:
-
-  * Distinguish images where the vulnerable function is present vs. absent.
-  * Emit OpenVEX with correct `not_affected` when patched.
-
----
-
-### Phase 1 – JS/Python/PHP call graphs & basic reachability
-
-**Duration:** 1–2 sprints
-
-* Implement:
-
-  * `Scanner.Analyzers.JavaScript` with module + function call graph.
-  * `Scanner.Analyzers.Python` and `Scanner.Analyzers.Php` with basic graphs.
-* Entrypoint detection:
-
-  * JS: main script from CMD, basic HTTP handlers.
-  * Python: main script + Django/Flask heuristics.
-  * PHP: front controllers.
-* Implement core reachability algorithm (BFS/DFS).
-* Implement simple `VulnerabilitySignature` that uses function names and file paths.
-* Hook lattice engine in Scanner.WebService and integrate with:
-
-  * Concelier vulnerability feeds.
-  * VEXer.
-
-**Success criteria:**
-
-* For demo apps (Node, Django, Laravel):
-
-  * Identify vulnerable functions and mark them reachable/unreachable.
-  * Demonstrate noise reduction (some CVEs flagged as `not_affected`).
-
----
-
-### Phase 2 – Binary CFG & Fingerprinting, Improved Confidence
-
-**Duration:** 1–2 sprints
-
-* Extend Symbolizer & CFG for:
-
-  * Stripped binaries (function hashing).
-  * Shared libraries (PLT/IAT resolution).
-* Implement `VulnerabilitySignature.BlockFingerprints` to distinguish patched vs vulnerable binary functions.
-* Refine confidence scoring:
-
-  * Use fingerprint match quality.
-  * Consider presence/absence of debug info.
-* Expand coverage:
-
-  * glibc, curl, zlib, OpenSSL, libxml2, etc.
-
-**Success criteria:**
-
-* For curated images:
-
-  * Confirm ability to differentiate patched vs vulnerable versions even when binaries are stripped.
-  * Reachability reflects true call paths across app→lib boundaries.
-
----
-
-### Phase 3 – Runtime hooks (optional), UX, and Hardening
-
-**Duration:** 2+ sprints
-
-* Add opt-in runtime confirmation:
-
-  * eBPF probes for function hits (Linux).
-  * Map runtime addresses back to `ImageFunction` via symbolization.
-* Enhance console UX:
-
-  * Path explorer UI: show entrypoint → … → vulnerable function path.
-  * Evidence view with hash-based proofs.
-* Hardening:
-
-  * Performance optimization for large images (parallel analysis, caching).
-  * Conservative fallbacks for dynamic language features.
- -**Success criteria:** - -* For selected environments where runtime is allowed: - - * Static reachability is confirmed by runtime traces in majority of cases. - * No significant performance regression on typical images. - ---- - -## 12. How this satisfies your initial bullets - -From your initial requirements: - -1. **JavaScript, Python, PHP, binary** - → Dedicated analyzers per language + binary symbolization/CFG, unified in `Scanner.Reachability`. - -2. **Executable containers** - → Analyzers run inside Scanner’s worker container, mounting the target image rootfs; no network access. - -3. **Libraries usage call graph** - → Call graphs map from entrypoints → app code → library functions; SBOM + PURLs tie functions to libraries. - -4. **Reachability analysis** - → BFS/DFS from entrypoints over per-language and binary graphs, with lattice-based merging in `Scanner.WebService`. - -5. **JSON + PURLs** - → All evidence is JSON with PURL-tagged components; SBOM is annotated, and VEX statements reference those PURLs. - ---- - -If you like, next step can be: I draft concrete C# interface definitions (including some initial Tree-sitter integration stubs for JS/Python/PHP) and a skeleton of the `ReachabilityPlan` and `ReachabilityEngine` classes that you can drop into the monorepo. + +Here’s a crisp idea that could give Stella Ops a real moat: **binary‑level reachability**—linking CVEs directly to the exact functions and offsets inside compiled artifacts (ELF/PE/Mach‑O), not just to packages. + +--- + +### Why this matters (quick background) + +* **Package‑level flags are noisy.** Most scanners say “vuln in `libX v1.2`,” but that library might be present and never executed. +* **Language‑level call graphs help** (when you have source or rich metadata), but containers often ship only **stripped binaries**. +* **Binary reachability** answers: *Is the vulnerable function actually in this image? Is its code path reachable from the entrypoints we observed or can construct?* + +--- + +### The missing layer: Symbolization + +Build a **symbolization layer** that normalizes debug and symbol info across platforms: + +* **Inputs**: DWARF (ELF/Mach‑O), PDB (PE/Windows), symtabs, exported symbols, `.eh_frame`, and (when stripped) heuristic signatures (e.g., function byte‑hashes, CFG fingerprints). +* **Outputs**: a source‑agnostic map: `{binary → sections → functions → (addresses, ranges, hashes, demangled names, inlined frames)}`. +* **Normalization**: Put everything into a common schema (e.g., `Stella.Symbolix.v1`) so higher layers don’t care if it came from DWARF or PDB. + +--- + +### End‑to‑end reachability (binary‑first, source‑agnostic) + +1. **Acquire & parse** + + * Detect format (ELF/PE/Mach‑O), parse headers, sections, symbol tables. + * If debug info present: parse DWARF/PDB; else fall back to disassembly + function boundary recovery. +2. **Function catalog** + + * Assign stable IDs per function: `(imageHash, textSectionHash, startVA, size, fnHashXX)`. + * Record x‑refs (calls/jumps), imports/exports, PLT/IAT edges. +3. **Entrypoint discovery** + + * Docker entry, process launch args, service scripts; infer likely mains (Go `main.main`, .NET hostfxr path, JVM launcher, etc.). +4. **Call‑graph build (binary CFG)** + + * Build inter/intra‑procedural graph (direct + resolved indirect via IAT/PLT). Keep “unknown‑target” edges for conservative safety. +5. 
**CVE→function linking** + + * Maintain a **signature bank** per CVE advisory: vulnerable function names, file paths, and—crucially—**byte‑sequence or basic‑block fingerprints** for patched vs vulnerable versions (works even when stripped). +6. **Reachability analysis** + + * Is the vulnerable function present? Is there a path from any entrypoint to it (under conservative assumptions)? Tag as `Present+Reachable`, `Present+Uncertain`, or `Absent`. +7. **Runtime confirmation (optional, when users allow)** + + * Lightweight probes (eBPF on Linux, ETW on Windows, perf/JFR/EventPipe) capture function hits; cross‑check with the static result to upgrade confidence. + +--- + +### Minimal component plan (drop into Stella Ops) + +* **Scanner.Symbolizer** + Parsers: ELF/DWARF (libdw or pure‑managed reader), PE/PDB (Dia/LLVM PDB), Mach‑O/DSYM. + Output: `Symbolix.v1` blobs stored in OCI layer cache. +* **Scanner.CFG** + Lifts functions to a normalized IR (capstone/iced‑x86 for decode) → builds CFG & call graph. +* **Advisory.FingerprintBank** + Ingests CSAF/OpenVEX plus curated fingerprints (fn names, block hashes, patch diff markers). Versioned, signed, air‑gap‑syncable. +* **Reachability.Engine** + Joins (`Symbolix` + `CFG` + `FingerprintBank`) → emits `ReachabilityEvidence` with lattice states for VEX. +* **VEXer.Adapter** + Emits **OpenVEX** statements with `status: affected/not_affected` and `justification: function_not_present | function_not_reachable | mitigated_at_runtime`, attaching Evidence URIs. +* **Console UX** + “Why not affected?” panel showing entrypoint→…→function path (or absence), with byte‑hash proof. + +--- + +### Data model sketch (concise) + +* `ImageFunction { id, name?, startVA, size, fnHash, sectionHash, demangled?, provenance:{DWARF|PDB|Heuristic} }` +* `Edge { srcFnId, dstFnId, kind:{direct|plt|iat|indirect?} }` +* `CveSignature { cveId, fnName?, libHints[], blockFingerprints[], versionRanges }` +* `Evidence { cveId, imageId, functionMatches[], reachable: bool?, confidence:[low|med|high], method:[static|runtime|hybrid] }` + +--- + +### Practical phases (8–10 weeks of focused work) + +1. **P0**: ELF/DWARF symbolizer + basic function catalog; link a handful of CVEs via name‑only; emit OpenVEX `function_not_present`. +2. **P1**: CFG builder (direct calls) + PLT/IAT resolution; simple reachability; first fingerprints for top 50 CVEs in glibc, openssl, curl, zlib. +3. **P2**: Stripped‑binary heuristics (block hashing) + Go/Rust name demangling; Windows PDB ingestion for PE. +4. **P3**: Runtime probes (opt‑in) + confidence upgrade logic; Console path explorer; evidence signing (DSSE). + +--- + +### KPIs to prove the moat + +* **Noise cut**: % reduction in “affected” flags after reachability (target 40–70% on typical containers). +* **Precision**: Ground‑truth validation vs PoC images (TP/FP/FN on presence & reachability). +* **Coverage**: % images where we can make a determination without source (goal: >80%). +* **Latency**: Added scan time per image (<15s typical with caches). + +--- + +### Risks & how to handle them + +* **Stripped binaries** → mitigate with block‑hash fingerprints & library‑version heuristics. +* **Obfuscated/packed code** → mark `Uncertain`; allow user‑supplied hints; prefer runtime confirmation. +* **Advisory inconsistency** → keep our own curated CVE→function fingerprint bank; sign & version it. +* **Platform spread** → start Linux/ELF, then Windows/PDB, then Mach‑O. 
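+
+To make the VEX output concrete, here is a minimal sketch of a statement the VEXer adapter could emit. Note that the justification labels above are internal vocabulary; standard OpenVEX defines values such as `vulnerable_code_not_present` and `vulnerable_code_not_in_execute_path`, onto which `function_not_present` / `function_not_reachable` would map. The ids, purl, and evidence wording below are placeholders.
+
+```json
+{
+  "@context": "https://openvex.dev/ns/v0.2.0",
+  "@id": "https://stella-ops.example/vex/2025-0001",
+  "author": "Stella Ops Reachability.Engine",
+  "timestamp": "2025-11-23T00:00:00Z",
+  "statements": [
+    {
+      "vulnerability": { "name": "CVE-2023-12345" },
+      "products": [{ "@id": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64" }],
+      "status": "not_affected",
+      "justification": "vulnerable_code_not_in_execute_path",
+      "impact_statement": "Vulnerable function present but unreachable from observed entrypoints; see attached evidence URI."
+    }
+  ]
+}
+```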
+ +--- + +### Why competitors struggle + +Most tools stop at packages because binary CFG + fingerprint curation is hard and expensive. Shipping a **source‑agnostic reachability engine** tied to signed evidence in VEX would set Stella Ops apart—especially in offline/air‑gapped and sovereign contexts you already target. + +If you want, I can draft: + +* the `Symbolix.v1` protobuf, +* a tiny PoC (ELF→functions→match CVE with a block fingerprint), +* and the OpenVEX emission snippet your VEXer can produce. +Below is a detailed architecture plan for implementing reachability and call-graph analysis in Stella Ops, covering JavaScript, Python, PHP, and binaries, and integrating with your existing Scanner / Concelier / VEXer stack. + +I will assume: + +* .NET 10 for core services. +* Scanner is the place where all “trust algebra / lattice” runs (per your standing rule). +* Concelier and VEXer remain “preserve/prune” layers and do not run lattice logic. +* Output must be JSON-centric with PURLs and OpenVEX. + +--- + +## 1. Scope & Objectives + +### 1.1 Primary goals + +1. From an OCI image, build: + + * A **library-level usage graph** (which libraries are used by which entrypoints). + * A **function-level call graph** for JS / Python / PHP / binaries. +2. Map CVEs (from Concelier) to: + + * Concrete **components** (PURLs) in the SBOM. + * Concrete **functions / entrypoints / code regions** inside those components. +3. Perform **reachability analysis** to classify each vulnerability as: + + * `present + reachable` + * `present + not_reachable` + * `function_not_present` (no vulnerable symbol) + * `uncertain` (dynamic features, unresolved calls) +4. Emit: + + * **Structured JSON** with PURLs and call-graph nodes/edges (“reachability evidence”). + * **OpenVEX** documents with appropriate `status`/`justification`. + +### 1.2 Non-goals (for now) + +* Full dynamic analysis of the running container (eBPF, ptrace, etc.) – leave as Phase 3+ optional add-on. +* Perfect call graph precision for dynamic languages (aim for safe, conservative approximations). +* Automatic “fix recommendations” (handled by other Stella Ops agents later). + +--- + +## 2. High-Level Architecture + +### 2.1 Major components + +Within Stella Ops: + +* **Scanner.WebService** + + * User-facing API. + * Orchestrates full scan (SBOM, CVEs, reachability). + * Hosts the **Lattice/Policy engine** that merges evidence and produces decisions. +* **Scanner.Worker** + + * Runs per-image analysis jobs. + * Invokes analyzers (JS, Python, PHP, Binary) inside its own container context. +* **Scanner.Reachability Core Library** + + * Unified IR for call graphs and reachability evidence. + * Interfaces for language and binary analyzers. + * Graph algorithms (BFS/DFS, lattice evaluation, entrypoint expansion). +* **Language Analyzers** + + * `Scanner.Analyzers.JavaScript` + * `Scanner.Analyzers.Python` + * `Scanner.Analyzers.Php` + * `Scanner.Analyzers.Binary` +* **Symbolization & CFG (for binaries)** + + * `Scanner.Symbolization` (ELF, PE, Mach-O parsers, DWARF/PDB) + * `Scanner.Cfg` (CFG + call graph for binaries) +* **Vulnerability Signature Bank** + + * `Concelier.Signatures` (curated CVE→function/library fingerprints). + * Exposed to Scanner as **offline bundle**. +* **VEXer** + + * `Vexer.Adapter.Reachability` – transforms reachability evidence into OpenVEX. 
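+
+How these pieces plug together in code: the worker can resolve every analyzer behind the common interface defined in section 5. A minimal wiring sketch, assuming standard `Microsoft.Extensions.DependencyInjection` and illustrative class names:
+
+```csharp
+// Each analyzer registers against the same contract; the worker then resolves
+// IEnumerable<IReachabilityAnalyzer> and fans out per language target.
+services.AddSingleton<IReachabilityAnalyzer, JavaScriptReachabilityAnalyzer>();
+services.AddSingleton<IReachabilityAnalyzer, PythonReachabilityAnalyzer>();
+services.AddSingleton<IReachabilityAnalyzer, PhpReachabilityAnalyzer>();
+services.AddSingleton<IReachabilityAnalyzer, BinaryReachabilityAnalyzer>();
+```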
+
+### 2.2 Data flow (logical)
+
+```mermaid
+flowchart LR
+    A[OCI Image / Tar] --> B[Scanner.Worker: Extract FS]
+    B --> C["SBOM Engine (CycloneDX/SPDX)"]
+    C --> D["Vuln Match (Concelier feeds)"]
+    B --> E1[JS Analyzer]
+    B --> E2[Python Analyzer]
+    B --> E3[PHP Analyzer]
+    B --> E4[Binary Analyzer + Symbolizer/CFG]
+
+    D --> F[Reachability Orchestrator]
+    E1 --> F
+    E2 --> F
+    E3 --> F
+    E4 --> F
+    F --> G["Lattice/Policy Engine (Scanner.WebService)"]
+    G --> H[Reachability Evidence JSON]
+    G --> I[VEXer: OpenVEX]
+    G --> J["Graph/Cartographer (optional)"]
+```
+
+---
+
+## 3. Data Model & JSON Contracts
+
+### 3.1 Core IR types (Scanner.Reachability)
+
+Define in a central assembly, e.g. `StellaOps.Scanner.Reachability`:
+
+```csharp
+public record ComponentRef(
+    string Purl,
+    string? BomRef,
+    string? Name,
+    string? Version);
+
+public enum SymbolKind { Function, Method, Constructor, Lambda, Import, Export }
+
+public record SymbolId(
+    string Language,       // "js", "python", "php", "binary"
+    string ComponentPurl,  // SBOM component PURL or "" for app code
+    string LogicalName,    // e.g., "server.js:handleLogin"
+    string? FilePath,
+    int? Line);
+
+public record CallGraphNode(
+    string Id,             // stable id, e.g., hash(SymbolId)
+    SymbolId Symbol,
+    SymbolKind Kind,
+    bool IsEntrypoint);
+
+public enum CallEdgeKind { Direct, Indirect, Dynamic, External, Ffi }
+
+public record CallGraphEdge(
+    string FromNodeId,
+    string ToNodeId,
+    CallEdgeKind Kind);
+
+public record CallGraph(
+    string GraphId,
+    IReadOnlyList<CallGraphNode> Nodes,
+    IReadOnlyList<CallGraphEdge> Edges);
+```
+
+### 3.2 Vulnerability mapping
+
+```csharp
+public record VulnerabilitySignature(
+    string Source,                               // "csaf", "nvd", "vendor"
+    string Id,                                   // "CVE-2023-12345"
+    IReadOnlyList<string> Purls,
+    IReadOnlyList<string> TargetSymbolPatterns,  // glob-like or regex
+    IReadOnlyList<string>? FilePathPatterns,
+    IReadOnlyList<string>? BlockFingerprints     // for binaries, optional
+);
+```
+
+### 3.3 Reachability evidence
+
+```csharp
+public enum ReachabilityStatus
+{
+    PresentReachable,
+    PresentNotReachable,
+    FunctionNotPresent,
+    Unknown
+}
+
+public record ReachabilityEvidence
+(
+    string ImageRef,
+    string VulnId,                                      // CVE or advisory id
+    ComponentRef Component,
+    ReachabilityStatus Status,
+    double Confidence,                                  // 0..1
+    string Method,                                      // "static-callgraph", "binary-fingerprint", etc.
+    IReadOnlyList<string> EntrypointNodeIds,
+    IReadOnlyList<IReadOnlyList<string>>? ExamplePaths  // optional list of node-paths
+);
+```
+
+### 3.4 JSON structure (external)
+
+Minimal external JSON (what you store / expose):
+
+```json
+{
+  "image": "registry.example.com/app:1.2.3",
+  "components": [
+    {
+      "purl": "pkg:npm/express@4.18.0",
+      "bomRef": "component-1"
+    }
+  ],
+  "callGraphs": [
+    {
+      "graphId": "js-main",
+      "language": "js",
+      "nodes": [ /* CallGraphNode */ ],
+      "edges": [ /* CallGraphEdge */ ]
+    }
+  ],
+  "reachability": [
+    {
+      "vulnId": "CVE-2023-12345",
+      "componentPurl": "pkg:npm/express@4.18.0",
+      "status": "PresentReachable",
+      "confidence": 0.92,
+      "entrypoints": [ "node:..." ],
+      "paths": [
+        ["node:entry", "node:routeHandler", "node:vulnFn"]
+      ]
+    }
+  ]
+}
```
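+
+To make 3.2 concrete, a hypothetical signature instance; the id, purl, and symbol pattern are borrowed from the surrounding examples (3.4 above and 5.1 below), and all values are illustrative only:
+
+```csharp
+// One curated signature: "this CVE lives in these purls, in functions
+// matching these patterns". Arrays satisfy the IReadOnlyList<string> contract.
+var signature = new VulnerabilitySignature(
+    Source: "csaf",
+    Id: "CVE-2023-12345",
+    Purls: new[] { "pkg:npm/express@4.18.0" },
+    TargetSymbolPatterns: new[] { "express/lib/router/layer.js:handle_request" },
+    FilePathPatterns: new[] { "**/node_modules/express/**" },
+    BlockFingerprints: null);  // only populated for native/binary signatures
+```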
+
+---
+
+## 4. Scanner-Side Architecture
+
+### 4.1 Project layout (suggested)
+
+```text
+src/
+  Scanner/
+    StellaOps.Scanner.WebService/
+    StellaOps.Scanner.Worker/
+    StellaOps.Scanner.Core/                # shared scan domain
+    StellaOps.Scanner.Reachability/
+    StellaOps.Scanner.Symbolization/
+    StellaOps.Scanner.Cfg/
+    StellaOps.Scanner.Analyzers.JavaScript/
+    StellaOps.Scanner.Analyzers.Python/
+    StellaOps.Scanner.Analyzers.Php/
+    StellaOps.Scanner.Analyzers.Binary/
+```
+
+### 4.2 API surface (Scanner.WebService)
+
+* `POST /api/scan/image`
+
+  * Request: `{ "imageRef": "...", "profile": { "reachability": true, ... } }`
+  * Returns: scan id.
+* `GET /api/scan/{id}/reachability`
+
+  * Returns: `ReachabilityEvidence[]`, plus call graph summary (optional).
+* `GET /api/scan/{id}/vex`
+
+  * Returns: OpenVEX with statuses based on reachability lattice.
+
+### 4.3 Worker orchestration
+
+`StellaOps.Scanner.Worker`:
+
+1. Receives scan job with `imageRef`.
+
+2. Extracts filesystem (layered rootfs) under `/mnt/scans/{scanId}/rootfs`.
+
+3. Invokes SBOM generator (CycloneDX/SPDX).
+
+4. Invokes Concelier via offline feeds to get:
+
+   * Component vulnerabilities (CVE list per PURL).
+   * Vulnerability signatures (fingerprints).
+
+5. Builds a `ReachabilityPlan`:
+
+   ```csharp
+   public record ReachabilityPlan(
+       IReadOnlyList<ComponentRef> Components,
+       IReadOnlyList<VulnerabilitySignature> Vulns,
+       IReadOnlyList<AnalyzerTarget> AnalyzerTargets  // files/dirs grouped by language; AnalyzerTarget is an assumed type name
+   );
+   ```
+
+6. For each language target, dispatch analyzer:
+
+   * JavaScript: `IReachabilityAnalyzer` implementation for JS.
+   * Python: likewise.
+   * PHP: likewise.
+   * Binary: symbolizer + CFG.
+
+7. Collects call graphs from each analyzer and merges them into a single IR (or separate per-language graphs with shared IDs).
+
+8. Sends merged graphs + vuln list to **Reachability Engine** (Scanner.Reachability).
+
+---
+
+## 5. Language Analyzers (JS / Python / PHP)
+
+All analyzers implement a common interface:
+
+```csharp
+public interface IReachabilityAnalyzer
+{
+    string Language { get; }  // "js", "python", "php"
+
+    // Assumed return: the analyzer's call graph for its language targets.
+    Task<CallGraph> AnalyzeAsync(AnalyzerContext context, CancellationToken ct);
+}
+
+public record AnalyzerContext(
+    string RootFsPath,
+    IReadOnlyList<ComponentRef> Components,
+    IReadOnlyList<VulnerabilitySignature> Vulnerabilities,
+    IReadOnlyDictionary<string, string> Env,  // container env, entrypoint, etc.
+    string? EntrypointCommand                 // container CMD/ENTRYPOINT
+);
+```
+
+### 5.1 JavaScript (Node.js focus)
+
+**Inputs:**
+
+* `/app` tree inside container (or discovered via SBOM).
+* `package.json` files.
+* Container entrypoint (e.g., `["node", "server.js"]`).
+
+**Core steps:**
+
+1. Identify **app root**:
+
+   * Heuristics: directory containing `package.json` that owns the entry script.
+2. Parse:
+
+   * All `.js`, `.mjs`, `.cjs` in app and `node_modules` for vulnerable PURLs.
+   * Use a parsing frontend (e.g., Tree-sitter via .NET binding, or Node+AST-as-JSON).
+3. Build module graph:
+
+   * `require`, `import`, `export`.
+4. Function-level graph:
+
+   * For each function/method, create `CallGraphNode`.
+   * For each `callExpression`, create `CallGraphEdge` (try to resolve callee).
+5. Entrypoints:
+
+   * Main script in CMD/ENTRYPOINT.
+   * HTTP route handlers (for express/koa) detected by patterns (e.g., `app.get("/...")`).
+6. Map vulnerable symbols:
+
+   * From `VulnerabilitySignature.TargetSymbolPatterns` (e.g., `express/lib/router/layer.js:handle_request`).
+   * Identify nodes whose `SymbolId` matches patterns.
+
+**Output:**
+
+* `CallGraph` for JS with:
+
+  * `IsEntrypoint = true` for main and detected handlers.
+ * Node attributes include file path, line, component PURL. + +### 5.2 Python + +**Inputs:** + +* Site-packages paths from SBOM. +* Entrypoint script (CMD/ENTRYPOINT). +* Framework heuristics (Django, Flask) from environment variables or common entrypoints. + +**Core steps:** + +1. Discover Python interpreter chain: not needed for pure static, but useful for heuristics. +2. Parse `.py` files of: + + * App code. + * Vulnerable packages (per PURL). +3. Build module import graph (`import`, `from x import y`). +4. Function-level graph: + + * Nodes for functions, methods, class constructors. + * Edges for call expressions; conservative for dynamic calls. +5. Entrypoints: + + * Main script. + * WSGI callable (e.g., `application` in `wsgi.py`). + * Django URLconf -> view functions. +6. Map vulnerable symbols using `TargetSymbolPatterns` like `django.middleware.security.SecurityMiddleware.__call__`. + +### 5.3 PHP + +**Inputs:** + +* Web root (from container image or conventional paths `/var/www/html`, `/app/public`, etc.). +* Composer metadata (`composer.json`, `vendor/`). +* Web server config if present (optional). + +**Core steps:** + +1. Discover front controllers (e.g., `index.php`, `public/index.php`). +2. Parse PHP files (again, via Tree-sitter or any suitable parser). +3. Resolve include/require chains to build file-level inclusion graph. +4. Build function/method graph: + + * Functions, methods, class constructors. + * Calls with best-effort resolution for namespaced functions. +5. Entrypoints: + + * Front controllers and router entrypoints (e.g., Symfony, Laravel detection). +6. Map vulnerable symbols (e.g., functions in certain vendor packages, particular methods). + +--- + +## 6. Binary Analyzer & Symbolizer + +Project: `StellaOps.Scanner.Analyzers.Binary` + `Symbolization` + `Cfg`. + +### 6.1 Inputs + +* All binaries and shared libraries in: + + * `/usr/lib`, `/lib`, `/app/bin`, etc. +* SBOM link: each binary mapped to its component PURL when possible. +* Vulnerability signatures for native libs: function names, symbol names, fingerprints. + +### 6.2 Symbolization + +Module: `StellaOps.Scanner.Symbolization` + +* Detect format: ELF, PE, Mach-O. +* For ELF/Mach-O: + + * Parse symbol tables (`.symtab`, `.dynsym`). + * Parse DWARF (if present) to map functions to source files/lines. +* For PE: + + * Parse PDB (if present) or export table. +* For stripped binaries: + + * Run function boundary recovery (linear sweep + heuristic). + * Compute block/fn-level hashes for fingerprinting. + +Output: + +```csharp +public record ImageFunction( + string ImageId, // e.g., SHA256 of file + ulong StartVa, + uint Size, + string? SymbolName, // demangled if possible + string FnHash, // stable hash of bytes / CFG + string? SourceFile, + int? SourceLine); +``` + +### 6.3 CFG + Call graph + +Module: `StellaOps.Scanner.Cfg` + +* Disassemble `.text` using Capstone/Iced.x86. +* Build basic blocks and CFG. +* Identify: + + * Direct calls (resolved). + * PLT/IAT indirections to shared libraries. +* Build `CallGraph` for binary functions: + + * Entrypoints: `main`, exported functions, Go `main.main`, etc. + * Map application functions to library functions via PLT/IAT edges. + +### 6.4 Linking vulnerabilities + +* For each vulnerability affecting a native library (e.g., OpenSSL): + + * Map to candidate binaries via SBOM + PURL. + * Within library image, find `ImageFunction`s matching: + + * `SymbolName` patterns. + * `FnHash` / `BlockFingerprints` (for precise detection). 
+* Determine reachability: + + * Starting from application entrypoints, traverse call graph to see if calls to vulnerable library function occur. + +--- + +## 7. Reachability Engine & Lattice (Scanner.WebService) + +Project: `StellaOps.Scanner.Reachability` + +### 7.1 Inputs to engine + +* Combined `CallGraph[]` (per language + binary). +* Vulnerability list (CVE, GHSA, etc.) with affected PURLs. +* Vulnerability signatures. +* Entrypoint hints: + + * Container CMD/ENTRYPOINT. + * Detected HTTP handlers, WSGI/PSGI entrypoints, etc. + +### 7.2 Algorithm steps + +1. **Entrypoint expansion** + + * Identify all `CallGraphNode` with `IsEntrypoint=true`. + * Add language-specific “framework entrypoints” (e.g., Express route dispatch, Django URL dispatch) when detected. + +2. **Graph traversal** + + * For each entrypoint node: + + * BFS/DFS through edges. + * Maintain `reachable` bit on each node. + * For dynamic edges: + + * Conservative: if target cannot be resolved, mark affected path as partially unknown and downgrade confidence. + +3. **Vuln symbol resolution** + + * For each vulnerability: + + * For each vulnerable component PURL found in SBOM: + + * Find candidate nodes whose `SymbolId` matches `TargetSymbolPatterns` / binary fingerprints. + * If none found: + + * `FunctionNotPresent` (if component version range indicates vulnerable but we cannot find symbol – low confidence). + * If found: + + * Check `reachable` bit: + + * If reachable by at least one entrypoint, `PresentReachable`. + * Else, `PresentNotReachable`. + +4. **Confidence computation** + + * Start from: + + * `1.0` for direct match with explicit function name & static call. + * Lower for: + + * Heuristic framework entrypoints. + * Dynamic calls. + * Fingerprint-only matches on stripped binaries. + * Example rule-of-thumb: + + * direct static path only: 0.95–1.0. + * dynamic edges but symbol found: 0.7–0.9. + * symbol not found but version says vulnerable: 0.4–0.6. + +5. **Lattice merge** + + * Represent each CVE+component pair as a lattice element with states: `{affected, not_affected, unknown}`. + * Reachability engine produces a **local state**: + + * `PresentReachable` → candidate `affected`. + * `PresentNotReachable` or `FunctionNotPresent` → candidate `not_affected`. + * `Unknown` → `unknown`. + * Merge with: + + * Upstream vendor VEX (from Concelier). + * Policy overrides (e.g., “treat certain CVEs as affected unless vendor says otherwise”). + * Final state computed here (Scanner.WebService), not in Concelier or VEXer. + +6. **Evidence output** + + * For each vulnerability: + + * Emit `ReachabilityEvidence` with: + + * Status. + * Confidence. + * Method. + * Example entrypoint paths (for UX and audit). + * Persist this evidence alongside regular scan results. + +--- + +## 8. Integration with SBOM & VEX + +### 8.1 SBOM annotation + +* Extend SBOM documents (CycloneDX / SPDX) with extra properties: + + * CycloneDX: + + * `component.properties`: + + * `stellaops:reachability:status` = `present_reachable|present_not_reachable|function_not_present|unknown` + * `stellaops:reachability:confidence` = `0.0-1.0` + * SPDX: + + * `Annotation` or `ExternalRef` with similar metadata. + +### 8.2 OpenVEX generation + +Module: `StellaOps.Vexer.Adapter.Reachability` + +* For each `(vuln, component)` pair: + + * Map to VEX statement: + + * If `PresentReachable`: + + * `status: affected` + * `justification: component_not_fixed` or similar. 
+    * If `PresentNotReachable`:
+
+      * `status: not_affected`
+      * `justification: function_not_reachable`
+    * If `FunctionNotPresent`:
+
+      * `status: not_affected`
+      * `justification: component_not_present` or `function_not_present`
+    * If `Unknown`:
+
+      * `status: under_investigation` (configurable).
+
+* Attach evidence via:
+
+  * `analysis` / `details` fields (link to internal evidence JSON or audit link).
+
+* VEXer does not recalculate reachability; it uses the already computed decision + evidence.
+
+---
+
+## 9. Executable Containers & Offline Operation
+
+### 9.1 Executable containers
+
+* Analyzers run inside a dedicated Scanner worker container that has:
+
+  * .NET 10 runtime.
+  * Language runtimes if needed for parsing (Node, Python, PHP), or Tree-sitter-based parsing.
+* Target image filesystem is mounted read-only under `/mnt/rootfs`.
+* No network access (offline/air-gap).
+* This satisfies “we will use executable containers” while keeping separation between:
+
+  * Target image (mount only).
+  * Analyzer container (StellaOps code).
+
+### 9.2 Offline signature bundles
+
+* Concelier periodically exports:
+
+  * Vulnerability database (CSAF/NVD).
+  * Vulnerability Signature Bank.
+* Bundles are:
+
+  * DSSE-signed.
+  * Versioned (e.g., `signatures-2025-11-01.tar.zst`).
+* Scanner uses:
+
+  * The bundle digest as part of the **Scan Manifest** for deterministic replay.
+
+---
+
+## 10. Determinism & Caching
+
+### 10.1 Layer-level caching
+
+* Key: `layerDigest + analyzerVersion + signatureBundleVersion`.
+* Cache artifacts:
+
+  * CallGraph(s) per layer (for JS/Python/PHP code present in that layer).
+  * Symbolization results per binary file hash.
+* For images sharing layers:
+
+  * Merge cached graphs instead of re-analyzing.
+
+### 10.2 Deterministic scan manifest
+
+For each scan, produce:
+
+```json
+{
+  "imageRef": "registry/app:1.2.3",
+  "imageDigest": "sha256:...",
+  "scannerVersion": "1.4.0",
+  "analyzerVersions": {
+    "js": "1.0.0",
+    "python": "1.0.0",
+    "php": "1.0.0",
+    "binary": "1.0.0"
+  },
+  "signatureBundleDigest": "sha256:...",
+  "callGraphDigest": "sha256:...",   // canonical JSON hash
+  "reachabilityEvidenceDigest": "sha256:..."
+}
+```
+
+This manifest can be signed (Authority module) and used for audits and replay.
+
+---
+
+## 11. Implementation Roadmap (Phased)
+
+### Phase 0 – Infrastructure & Binary presence
+
+**Duration:** 1 sprint
+
+* Set up `Scanner.Reachability` core types and interfaces.
+* Implement:
+
+  * Basic Symbolizer for ELF + DWARF.
+  * Binary function catalog without CFG.
+* Link a small set of CVEs to binary function presence via `SymbolName`.
+* Expose minimal evidence:
+
+  * `PresentReachable`/`FunctionNotPresent` based only on presence (no call graph).
+* Integrate with VEXer to emit `function_not_present` justifications.
+
+**Success criteria:**
+
+* For selected demo images with known vulnerable/patched OpenSSL, scanner can:
+
+  * Distinguish images where the vulnerable function is present vs. absent.
+  * Emit OpenVEX with correct `not_affected` when patched.
+
+---
+
+### Phase 1 – JS/Python/PHP call graphs & basic reachability
+
+**Duration:** 1–2 sprints
+
+* Implement:
+
+  * `Scanner.Analyzers.JavaScript` with module + function call graph.
+  * `Scanner.Analyzers.Python` and `Scanner.Analyzers.Php` with basic graphs.
+* Entrypoint detection:
+
+  * JS: main script from CMD, basic HTTP handlers.
+  * Python: main script + Django/Flask heuristics.
+  * PHP: front controllers.
+* Implement core reachability algorithm (BFS/DFS).
+* Implement simple `VulnerabilitySignature` that uses function names and file paths. +* Hook lattice engine in Scanner.WebService and integrate with: + + * Concelier vulnerability feeds. + * VEXer. + +**Success criteria:** + +* For demo apps (Node, Django, Laravel): + + * Identify vulnerable functions and mark them reachable/unreachable. + * Demonstrate noise reduction (some CVEs flagged as `not_affected`). + +--- + +### Phase 2 – Binary CFG & Fingerprinting, Improved Confidence + +**Duration:** 1–2 sprints + +* Extend Symbolizer & CFG for: + + * Stripped binaries (function hashing). + * Shared libraries (PLT/IAT resolution). +* Implement `VulnerabilitySignature.BlockFingerprints` to distinguish patched vs vulnerable binary functions. +* Refine confidence scoring: + + * Use fingerprint match quality. + * Consider presence/absence of debug info. +* Expand coverage: + + * glibc, curl, zlib, OpenSSL, libxml2, etc. + +**Success criteria:** + +* For curated images: + + * Confirm ability to differentiate patched vs vulnerable versions even when binaries are stripped. + * Reachability reflects true call paths across app→lib boundaries. + +--- + +### Phase 3 – Runtime hooks (optional), UX, and Hardening + +**Duration:** 2+ sprints + +* Add opt-in runtime confirmation: + + * eBPF probes for function hits (Linux). + * Map runtime addresses back to `ImageFunction` via symbolization. +* Enhance console UX: + + * Path explorer UI: show entrypoint → … → vulnerable function path. + * Evidence view with hash-based proofs. +* Hardening: + + * Performance optimization for large images (parallel analysis, caching). + * Conservative fallbacks for dynamic language features. + +**Success criteria:** + +* For selected environments where runtime is allowed: + + * Static reachability is confirmed by runtime traces in majority of cases. + * No significant performance regression on typical images. + +--- + +## 12. How this satisfies your initial bullets + +From your initial requirements: + +1. **JavaScript, Python, PHP, binary** + → Dedicated analyzers per language + binary symbolization/CFG, unified in `Scanner.Reachability`. + +2. **Executable containers** + → Analyzers run inside Scanner’s worker container, mounting the target image rootfs; no network access. + +3. **Libraries usage call graph** + → Call graphs map from entrypoints → app code → library functions; SBOM + PURLs tie functions to libraries. + +4. **Reachability analysis** + → BFS/DFS from entrypoints over per-language and binary graphs, with lattice-based merging in `Scanner.WebService`. + +5. **JSON + PURLs** + → All evidence is JSON with PURL-tagged components; SBOM is annotated, and VEX statements reference those PURLs. + +--- + +If you like, next step can be: I draft concrete C# interface definitions (including some initial Tree-sitter integration stubs for JS/Python/PHP) and a skeleton of the `ReachabilityPlan` and `ReachabilityEngine` classes that you can drop into the monorepo. 
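+
+For orientation until that skeleton lands, here is a minimal sketch of the section 7.2 traversal core (entrypoint BFS plus status classification) over the section 3 types. It ignores confidence scoring and dynamic-edge downgrades, and the class shape and helper names are assumptions, not final API:
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+
+public static class ReachabilityTraversal
+{
+    // Steps 1–2 of 7.2: expand entrypoints, then BFS over the edge list,
+    // marking every node reachable from at least one entrypoint.
+    public static HashSet<string> ComputeReachable(CallGraph graph)
+    {
+        var outEdges = graph.Edges
+            .GroupBy(e => e.FromNodeId)
+            .ToDictionary(g => g.Key, g => g.Select(e => e.ToNodeId).ToList());
+
+        var queue = new Queue<string>(
+            graph.Nodes.Where(n => n.IsEntrypoint).Select(n => n.Id));
+        var reachable = new HashSet<string>(queue);
+
+        while (queue.Count > 0)
+        {
+            if (!outEdges.TryGetValue(queue.Dequeue(), out var targets)) continue;
+            foreach (var target in targets)
+                if (reachable.Add(target)) queue.Enqueue(target);
+        }
+
+        return reachable;
+    }
+
+    // Step 3 of 7.2: vulnerable-symbol node matches + reachable set -> local status.
+    public static ReachabilityStatus Classify(
+        IReadOnlyList<string> vulnerableNodeIds, HashSet<string> reachable) =>
+        vulnerableNodeIds.Count == 0 ? ReachabilityStatus.FunctionNotPresent
+            : vulnerableNodeIds.Any(reachable.Contains) ? ReachabilityStatus.PresentReachable
+            : ReachabilityStatus.PresentNotReachable;
+}
+```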
diff --git a/docs/product-advisories/archived/18-Nov-2026 - CSharp-Binary-Analyzer.md b/docs/product-advisories/archived/18-Nov-2025 - CSharp-Binary-Analyzer.md similarity index 97% rename from docs/product-advisories/archived/18-Nov-2026 - CSharp-Binary-Analyzer.md rename to docs/product-advisories/archived/18-Nov-2025 - CSharp-Binary-Analyzer.md index 4b4e92700..fe0ccd594 100644 --- a/docs/product-advisories/archived/18-Nov-2026 - CSharp-Binary-Analyzer.md +++ b/docs/product-advisories/archived/18-Nov-2025 - CSharp-Binary-Analyzer.md @@ -1,989 +1,989 @@ -Vlad, here’s a concrete, **pure‑C#** blueprint to build a multi‑format binary analyzer (Mach‑O, ELF, PE) that produces **call graphs + reachability**, with **no external tools**. Where needed, I point to permissively‑licensed code you can **port** (copy) from other ecosystems. - ---- - -## 0) Targets & non‑negotiables - -* **Formats:** Mach‑O (inc. LC_DYLD_INFO / LC_DYLD_CHAINED_FIXUPS), ELF (SysV gABI), PE/COFF -* **Architectures:** x86‑64 (and x86), AArch64 (ARM64) -* **Outputs:** JSON with **purls** per module + function‑level call graph & reachability -* **No tool reuse:** Only pure C# libraries or code **ported** from permissive sources - ---- - -## 1) Parsing the containers (pure C#) - -**Pick one C# reader per format, keeping licenses permissive:** - -* **ELF & Mach‑O:** `ELFSharp` (pure managed C#; ELF + Mach‑O reading). MIT/X11 license. ([GitHub][1]) -* **ELF & PE (+ DWARF v4):** `LibObjectFile` (C#, BSD‑2). Good ELF relocations (i386, x86_64, ARM, AArch64), PE directories, DWARF sections. Use it as your **common object model** for ELF+PE, then add a Mach‑O adapter. ([GitHub][2]) -* **PE (optional alternative):** `PeNet` (pure C#, broad PE directories, imp/exp, TLS, certs). MIT. Useful if you want a second implementation for cross‑checks. ([GitHub][3]) - -> Why two libs? `LibObjectFile` gives you DWARF and clean models for ELF/PE; `ELFSharp` covers Mach‑O today (and ELF as a fallback). You control the code paths. - -**Spec references you’ll implement against** (for correctness of your readers & link‑time semantics): - -* **ELF (gABI, AMD64 supplement):** dynamic section, PLT/GOT, `R_X86_64_JUMP_SLOT` semantics (eager vs lazy). ([refspecs.linuxbase.org][4]) -* **PE/COFF:** imports/exports/IAT, delay‑load, TLS. ([Microsoft Learn][5]) -* **Mach‑O:** file layout, load commands (`LC_SYMTAB`, `LC_DYSYMTAB`, `LC_FUNCTION_STARTS`, `LC_DYLD_INFO(_ONLY)`), and the modern `LC_DYLD_CHAINED_FIXUPS`. ([leopard-adc.pepas.com][6]) - ---- - -## 2) Mach‑O: what you must **port** (byte‑for‑byte compatible) - -Apple moved from traditional dyld bind opcodes to **chained fixups** on macOS 12/iOS 15+; you need both: - -* **Dyld bind opcodes** (`LC_DYLD_INFO(_ONLY)`): parse the BIND/LAZY_BIND streams (tuples of ``). Port minimal logic from **LLVM** or **LIEF** (both Apache‑2.0‑compatible) into C#. ([LIEF][7]) -* **Chained fixups** (`LC_DYLD_CHAINED_FIXUPS`): port `dyld_chained_fixups_header` structs & chain walking from LLVM’s `MachO.h` or Apple’s dyld headers. This restores imports/rebases without running dyld. ([LLVM][8]) -* **Function discovery hint:** read `LC_FUNCTION_STARTS` (ULEB128 deltas) to seed function boundaries—very helpful on stripped binaries. ([Stack Overflow][9]) -* **Stubs mapping:** resolve `__TEXT,__stubs` ↔ `__DATA,__la_symbol_ptr` via the **indirect symbol table**; conceptually identical to ELF’s PLT/GOT. 
([MaskRay][10]) - -> If you prefer an in‑C# base for Mach‑O manipulation, **Melanzana.MachO** exists (MIT) and has been used by .NET folks for Mach‑O/Code Signing/obj writing; you can mine its approach for load‑command modeling. ([GitHub][11]) - ---- - -## 3) Disassembly (pure C#, multi‑arch) - -* **x86/x64:** `iced` (C# decoder/disassembler/encoder; MIT; fast & complete). ([GitHub][12]) -* **AArch64/ARM64:** two options that keep you pure‑C#: - - * **Disarm** (pure C# ARM64 disassembler; MIT). Good starting point to decode & get branch/call kinds. ([GitHub][13]) - * **Port from Ryujinx ARMeilleure** (ARMv8 decoder/JIT in C#, MIT). You can lift only the **decoder** pieces you need. ([Gitee][14]) -* **x86 fallback:** `SharpDisasm` (udis86 port in C#; BSD‑2). Older than iced; keep as a reference. ([GitHub][15]) - ---- - -## 4) Call graph recovery (static) - -**4.1 Function seeds** - -* From symbols (`.dynsym`/`LC_SYMTAB`/PE exports) -* From **LC_FUNCTION_STARTS** (Mach‑O) for stripped code ([Stack Overflow][9]) -* From entrypoints (`_start`/`main` or PE AddressOfEntryPoint) -* From exception/unwind tables & DWARF (when present)—`LibObjectFile` already models DWARF v4. ([GitHub][2]) - -**4.2 CFG & interprocedural calls** - -* **Decode** with iced/Disarm from each seed; form **basic blocks** by following control‑flow until terminators (ret/jmp/call). -* **Direct calls:** immediate targets become edges (PC‑relative fixups where needed). -* **Imported calls:** - - * **ELF:** calls to PLT stubs → resolve via `.rela.plt` & `R_*_JUMP_SLOT` to symbol names (link‑time target). ([cs61.seas.harvard.edu][16]) - * **PE:** calls through the **IAT** → resolve via `IMAGE_IMPORT_DESCRIPTOR` / thunk tables. ([Microsoft Learn][5]) - * **Mach‑O:** calls to `__stubs` use **indirect symbol table** + `__la_symbol_ptr` (or chained fixups) → map to dylib/symbol. ([reinterpretcast.com][17]) -* **Indirect calls within the binary:** heuristics only (function pointer tables, vtables, small constant pools). Keep them labeled **“indirect‑unresolved”** unless a heuristic yields a concrete target. - -**4.3 Cross‑binary graph** - -* Build module‑level edges by simulating the platform’s loader: - - * **ELF:** honor `DT_NEEDED`, `DT_RPATH/RUNPATH`, versioning (`.gnu.version*`) to pick the definer of an imported symbol. gABI rules apply. ([refspecs.linuxbase.org][4]) - * **PE:** pick DLL from the import descriptors. ([Microsoft Learn][5]) - * **Mach‑O:** `LC_LOAD_DYLIB` + dyld binding / chained fixups determine the provider image. ([LIEF][7]) - ---- - -## 5) Reachability analysis - -Represent the **call graph** using a .NET graph lib (or a simple adjacency set). I suggest: - -* **QuikGraph** (successor of QuickGraph; MIT) for algorithms (DFS/BFS, SCCs). Use it to compute reachability from chosen roots (entrypoint(s), exported APIs, or “sinks”). ([GitHub][18]) - -You can visualize with **MSAGL** (MIT) when you need layouts, but your core output is JSON. ([GitHub][19]) - ---- - -## 6) Symbol demangling (nice‑to‑have, pure C#) - -* **Itanium (ELF/Mach‑O):** Either port LLVM’s Itanium demangler or use a C# lib like **CxxDemangler** (a C# rewrite of `cpp_demangle`). ([LLVM][20]) -* **MSVC (PE):** Port LLVM’s `MicrosoftDemangle.cpp` (Apache‑2.0 with LLVM exception) to C#. 
([LLVM][21]) - ---- - -## 7) JSON output (with purls) - -Use a stable schema (example) to feed SBOM/vuln matching downstream: - -```json -{ - "modules": [ - { - "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64", - "format": "ELF", - "arch": "x86_64", - "path": "/usr/lib/x86_64-linux-gnu/libssl.so.1.1", - "exports": ["SSL_read", "SSL_write"], - "imports": ["BIO_new", "EVP_CipherInit_ex"], - "functions": [{"name":"SSL_do_handshake","va":"0x401020","size":512,"demangled": "..."}] - } - ], - "graph": { - "nodes": [ - {"id":"bin:main@0x401000","module": "pkg:generic/myapp@1.0.0"}, - {"id":"lib:SSL_read","module":"pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64"} - ], - "edges": [ - {"src":"bin:main@0x401000","dst":"lib:SSL_read","kind":"import_call","evidence":"ELF.R_X86_64_JUMP_SLOT"} - ] - }, - "reachability": { - "roots": ["bin:_start","bin:main@0x401000"], - "reachable": ["lib:SSL_read", "lib:SSL_write"], - "unresolved_indirect_calls": [ - {"site":"0x402ABC","reason":"register-indirect"} - ] - } -} -``` - ---- - -## 8) Minimal C# module layout (sketch) - -``` -Stella.Analysis.Core/ - BinaryModule.cs // common model (sections, symbols, relocs, imports/exports) - Loader/ - PeLoader.cs // wrap LibObjectFile (or PeNet) to BinaryModule - ElfLoader.cs // wrap LibObjectFile to BinaryModule - MachOLoader.cs // wrap ELFSharp + your ported Dyld/ChainedFixups - Disasm/ - X86Disassembler.cs // iced bridge: bytes -> instructions - Arm64Disassembler.cs // Disarm (or ARMeilleure port) bridge - Graph/ - CallGraphBuilder.cs // builds CFG per function + inter-procedural edges - Reachability.cs // BFS/DFS over QuikGraph - Demangle/ - ItaniumDemangler.cs // port or wrap CxxDemangler - MicrosoftDemangler.cs // port from LLVM - Export/ - JsonWriter.cs // writes schema above -``` - ---- - -## 9) Implementation notes (where issues usually bite) - -* **Mach‑O moderns:** Implement both dyld opcode **and** chained fixups; many macOS 12+/iOS15+ binaries only have chained fixups. ([emergetools.com][22]) -* **Stubs vs real targets (Mach‑O):** map `__stubs` → `__la_symbol_ptr` via **indirect symbols** to the true imported symbol (or its post‑fixup target). ([reinterpretcast.com][17]) -* **ELF PLT/GOT:** treat `.plt` entries as **call trampolines**; ultimate edge should point to the symbol (library) that satisfies `DT_NEEDED` + version. ([refspecs.linuxbase.org][4]) -* **PE delay‑load:** don’t forget `IMAGE_DELAYLOAD_DESCRIPTOR` for delayed IATs. ([Microsoft Learn][5]) -* **Function discovery:** use `LC_FUNCTION_STARTS` when symbols are stripped; it’s a cheap way to seed analysis. ([Stack Overflow][9]) -* **Name clarity:** demangle Itanium/MSVC so downstream vuln rules can match consistently. ([LLVM][20]) - ---- - -## 10) What to **copy/port** verbatim (safe licenses) - -* **Dyld bind & exports trie logic:** from **LLVM** or **LIEF** Mach‑O (Apache‑2.0). Great for getting the exact opcode semantics right. ([LIEF][7]) -* **Chained fixups structs/walkers:** from **LLVM MachO.h** or Apple dyld headers (permissive headers). ([LLVM][8]) -* **Itanium/MS demanglers:** LLVM demangler sources are standalone; easy to translate to C#. ([LLVM][23]) -* **ARM64 decoder:** if Disarm gaps hurt, lift just the **decoder** pieces from **Ryujinx ARMeilleure** (MIT). ([Gitee][14]) - -*(Avoid GPL’d parsers like binutils/BFD; they will contaminate your codebase’s licensing.)* - ---- - -## 11) End‑to‑end pipeline (per container image) - -1. **Enumerate binaries** in the container FS. -2. 
**Parse** each with the appropriate loader → `BinaryModule` (+ imports/exports/symbols/relocs). -3. **Simulate linking** per platform to resolve imported functions to provider libraries. ([refspecs.linuxbase.org][4]) -4. **Disassemble** functions (iced/Disarm) → CFGs → **call edges** (direct, PLT/IAT/stub, indirect). -5. **Assemble call graph** across modules; normalize names via demangling. -6. **Reachability**: given roots (entry or user‑specified) compute reachable set; emit JSON with **purls** (from your SBOM/package resolver). -7. **(Optional)** dump GraphViz / MSAGL views for debugging. ([GitHub][19]) - ---- - -## 12) Quick heuristics for vulnerability triage - -* **Sink maps**: flag edges to high‑risk APIs (`strcpy`, `gets`, legacy SSL ciphers) even without CVE versioning. -* **DWARF line info** (when present): attach file:line to nodes for developer action. `LibObjectFile` gives you DWARF v4 reads. ([GitHub][2]) - ---- - -## 13) Test corpora - -* **ELF:** glibc/openssl/libpng from distro repos; validate `R_*_JUMP_SLOT` handling and PLT edges. ([cs61.seas.harvard.edu][16]) -* **PE:** system DLLs (Kernel32, Advapi32) and a small MSVC console app; validate IAT & delay‑load. ([Microsoft Learn][5]) -* **Mach‑O:** Xcode‑built binaries across macOS 11 & 12+ to cover both dyld opcode and chained fixups paths; verify `LC_FUNCTION_STARTS` improves discovery. ([Stack Overflow][9]) - ---- - -## 14) Deliverables you can start coding now - -* **MachOLoader.cs** - - * Parse headers + load commands (ELFSharp). - * Implement `DyldInfoParser` (port from LLVM/LIEF) and `ChainedFixupsParser` (port structs & walkers). ([LIEF][7]) -* **X86Disassembler.cs / Arm64Disassembler.cs** (iced / Disarm bridges). ([GitHub][12]) -* **CallGraphBuilder.cs** (recursive descent + linear sweep fallback; PLT/IAT/stub resolution). -* **Reachability.cs** (QuikGraph BFS/DFS). ([GitHub][18]) -* **JsonWriter.cs** (schema above with purls). - ---- - -### References (core, load‑bearing) - -* **ELFSharp** (ELF + Mach‑O pure C#). ([GitHub][1]) -* **LibObjectFile** (ELF/PE/DWARF C#, BSD‑2). ([GitHub][2]) -* **iced** (x86/x64 disasm, C#, MIT). ([GitHub][12]) -* **Disarm** (ARM64 disasm, C#, MIT). ([GitHub][13]) -* **Ryujinx (ARMeilleure)** (ARMv8 decode/JIT in C#, MIT). ([Gitee][14]) -* **ELF gABI & AMD64 supplement** (PLT/GOT, relocations). ([refspecs.linuxbase.org][4]) -* **PE/COFF** (imports/exports/IAT). ([Microsoft Learn][5]) -* **Mach‑O docs** (load commands; LC_FUNCTION_STARTS; dyld bindings; chained fixups). ([Apple Developer][24]) - ---- - -If you want, I can draft **`MachOLoader` + `DyldInfoParser`** in C# next, including chained‑fixups structs (ported from LLVM’s headers) and an **iced**‑based call‑edge walker for x86‑64. - -[1]: https://github.com/konrad-kruczynski/elfsharp "GitHub - konrad-kruczynski/elfsharp: Pure managed C# library for reading ELF, UImage, Mach-O binaries." -[2]: https://github.com/xoofx/LibObjectFile "GitHub - xoofx/LibObjectFile: LibObjectFile is a .NET library to read, manipulate and write linker and executable object files (e.g ELF, PE, DWARF, ar...)" -[3]: https://github.com/secana/PeNet?utm_source=chatgpt.com "secana/PeNet: Portable Executable (PE) library written in . ..." 
-[4]: https://refspecs.linuxbase.org/elf/gabi4%2B/contents.html?utm_source=chatgpt.com "System V Application Binary Interface - DRAFT - 24 April 2001" -[5]: https://learn.microsoft.com/en-us/windows/win32/debug/pe-format?utm_source=chatgpt.com "PE Format - Win32 apps" -[6]: https://leopard-adc.pepas.com/documentation/DeveloperTools/Conceptual/MachOTopics/0-Introduction/introduction.html?utm_source=chatgpt.com "Mach-O Programming Topics: Introduction" -[7]: https://lief.re/doc/stable/doxygen/classLIEF_1_1MachO_1_1DyldInfo.html?utm_source=chatgpt.com "MachO::DyldInfo Class Reference - LIEF" -[8]: https://llvm.org/doxygen/structllvm_1_1MachO_1_1dyld__chained__fixups__header.html?utm_source=chatgpt.com "MachO::dyld_chained_fixups_header Struct Reference" -[9]: https://stackoverflow.com/questions/9602438/mach-o-file-lc-function-starts-load-command?utm_source=chatgpt.com "Mach-O file LC_FUNCTION_STARTS load command" -[10]: https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table?utm_source=chatgpt.com "All about Procedure Linkage Table" -[11]: https://github.com/dotnet/runtime/issues/77178 "Discussion: ObjWriter in C# · Issue #77178 · dotnet/runtime · GitHub" -[12]: https://github.com/icedland/iced?utm_source=chatgpt.com "icedland/iced: Blazing fast and correct x86/x64 ..." -[13]: https://github.com/SamboyCoding/Disarm?utm_source=chatgpt.com "SamboyCoding/Disarm: Fast, pure-C# ARM64 Disassembler" -[14]: https://gitee.com/ryujinx/Ryujinx/blob/master/LICENSE.txt?utm_source=chatgpt.com "Ryujinx/Ryujinx" -[15]: https://github.com/justinstenning/SharpDisasm?utm_source=chatgpt.com "justinstenning/SharpDisasm" -[16]: https://cs61.seas.harvard.edu/site/2022/pdf/x86-64-abi-20210928.pdf?utm_source=chatgpt.com "System V Application Binary Interface" -[17]: https://www.reinterpretcast.com/hello-world-mach-o?utm_source=chatgpt.com "The Nitty Gritty of “Hello World” on macOS | reinterpretcast.com" -[18]: https://github.com/KeRNeLith/QuikGraph?utm_source=chatgpt.com "KeRNeLith/QuikGraph: Generic Graph Data Structures and ..." -[19]: https://github.com/microsoft/automatic-graph-layout?utm_source=chatgpt.com "microsoft/automatic-graph-layout: A set of tools for ..." -[20]: https://llvm.org/doxygen/structllvm_1_1ItaniumPartialDemangler.html?utm_source=chatgpt.com "ItaniumPartialDemangler Struct Reference" -[21]: https://llvm.org/doxygen/MicrosoftDemangle_8cpp_source.html?utm_source=chatgpt.com "lib/Demangle/MicrosoftDemangle.cpp Source File" -[22]: https://www.emergetools.com/blog/posts/iOS15LaunchTime?utm_source=chatgpt.com "How iOS 15 makes your app launch faster" -[23]: https://llvm.org/doxygen/ItaniumDemangle_8cpp.html?utm_source=chatgpt.com "lib/Demangle/ItaniumDemangle.cpp File Reference" -[24]: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/CodeFootprint/Articles/MachOOverview.html?utm_source=chatgpt.com "Overview of the Mach-O Executable Format" -Good, this is exactly the point where you want to nail the internal model, otherwise everything on top (VEX, policy, reporting) will be shaky. - -Below is a detailed, implementation-ready plan for a **reachability graph with purl-aware edges**, covering ELF, PE, and Mach-O, in C#. - -I’ll structure it as: - -1. Overall graph design (3 layers: function, module, purl) -2. Core C# data model -3. Pipeline steps (end-to-end) -4. Format-specific edge construction (ELF / PE / Mach-O) -5. Reachability queries (from entrypoints to vulnerable purls / functions) -6. JSON output layout and integration with SBOM - ---- - -## 1. 
Overall graph design
-
-You want three tightly linked graph layers:
-
-1. **Function-level call graph (FLG)**
-
-   * Nodes: individual **functions** inside binaries
-   * Edges: calls from function A → function B (intra- or inter-module)
-
-2. **Module-level graph (MLG)**
-
-   * Nodes: **binaries** (ELF/PE/Mach-O files)
-   * Edges: “module A calls module B at least once” (aggregated from FLG)
-
-3. **Purl-level graph (PLG)**
-
-   * Nodes: **purls** (packages or generic artifacts)
-   * Edges: “purl P1 depends-at-runtime on purl P2” (aggregated from module edges)
-
-The **reachability algorithm** runs primarily on the **function graph**, but:
-
-* You can project reachability results to **module** and **purl** nodes.
-* You can also run coarse-grained analysis directly on the **purl graph** when needed (“Is any code in purl X reachable from the container entrypoint?”).
-
----
-
-## 2. Core C# data model
-
-### 2.1 Identifiers and enums
-
-```csharp
-public enum BinaryFormat { Elf, Pe, MachO }
-
-public readonly record struct ModuleId(string Path, BinaryFormat Format);
-
-public readonly record struct Purl(string Value);
-
-public enum EdgeKind
-{
-    IntraModuleDirect,   // call foo -> bar in same module
-    ImportCall,          // call via plt/iat/stub to imported function
-    SyntheticRoot,       // root (entrypoint) edge
-    IndirectUnresolved   // optional: we saw an indirect call we couldn't resolve
-}
-```
-
-### 2.2 Function node
-
-```csharp
-public sealed class FunctionNode
-{
-    public int Id { get; init; }                 // internal numeric id
-    public ModuleId Module { get; init; }
-    public Purl Purl { get; init; }              // resolved from Module -> Purl
-    public ulong Address { get; init; }          // VA or RVA
-    public string Name { get; init; }            // mangled
-    public string? DemangledName { get; init; }  // optional
-    public bool IsExported { get; init; }
-    public bool IsImportedStub { get; init; }    // e.g. PLT stub, Mach-O stub, PE thunks
-    public bool IsRoot { get; set; }             // _start/main/entrypoint etc.
-}
-```
-
-### 2.3 Edges
-
-```csharp
-public sealed class CallEdge
-{
-    public int FromId { get; init; }       // FunctionNode.Id
-    public int ToId { get; init; }         // FunctionNode.Id
-    public EdgeKind Kind { get; init; }
-    public string Evidence { get; init; }  // e.g. "ELF.R_X86_64_JUMP_SLOT", "PE.IAT", "MachO.indirectSym"
-}
-```
-
-### 2.4 Graph container
-
-```csharp
-public sealed class CallGraph
-{
-    public IReadOnlyDictionary<int, FunctionNode> Nodes { get; init; }
-    public IReadOnlyDictionary<int, List<CallEdge>> OutEdges { get; init; }
-    public IReadOnlyDictionary<int, List<CallEdge>> InEdges { get; init; }
-
-    // Convenience: mappings
-    public IReadOnlyDictionary<ModuleId, List<int>> FunctionsByModule { get; init; }
-    public IReadOnlyDictionary<Purl, List<int>> FunctionsByPurl { get; init; }
-}
-```
-
-### 2.5 Purl-level graph view
-
-You don’t store a separate physical graph; you **derive** it on demand:
-
-```csharp
-public sealed class PurlEdge
-{
-    public Purl From { get; init; }
-    public Purl To { get; init; }
-    public List<(int FromFnId, int ToFnId)> SupportingCalls { get; init; }
-}
-
-public sealed class PurlGraphView
-{
-    public IReadOnlyDictionary<Purl, List<Purl>> Adjacent { get; init; }
-    public IReadOnlyList<PurlEdge> Edges { get; init; }
-}
-```
-
----
-
-## 3. Pipeline steps (end-to-end)
-
-### Step 0 – Inputs
-
-* Set of binaries (files) extracted from container image.
-* SBOM or other metadata that can map a file path (or hash) → **purl**.
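-
-Step 3 below resolves each module to its purl via an `IPurlResolver`; a minimal sketch of that seam, assuming a dictionary built from the SBOM (the implementation shape is illustrative):
-
-```csharp
-public interface IPurlResolver
-{
-    Purl Resolve(ModuleId module);
-}
-
-// SBOM-backed implementation: file path -> purl, with a pkg:generic
-// fallback for binaries the SBOM does not cover.
-public sealed class SbomPurlResolver : IPurlResolver
-{
-    private readonly IReadOnlyDictionary<string, Purl> _purlByPath;
-
-    public SbomPurlResolver(IReadOnlyDictionary<string, Purl> purlByPath)
-        => _purlByPath = purlByPath;
-
-    public Purl Resolve(ModuleId module) =>
-        _purlByPath.TryGetValue(module.Path, out var purl)
-            ? purl
-            : new Purl($"pkg:generic/{System.IO.Path.GetFileName(module.Path)}");
-}
-```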
-
-### Step 1 – Parse binaries → `BinaryModule` objects
-
-You define a common in-memory model:
-
-```csharp
-// SectionInfo, SymbolInfo, and RelocationInfo are illustrative element types.
-public sealed class BinaryModule
-{
-    public ModuleId Id { get; init; }
-    public Purl Purl { get; init; }
-    public BinaryFormat Format { get; init; }
-
-    // Raw sections / segments
-    public IReadOnlyList<SectionInfo> Sections { get; init; }
-
-    // Symbols
-    public IReadOnlyList<SymbolInfo> Symbols { get; init; }  // imports + exports + locals
-
-    // Relocations / fixups
-    public IReadOnlyList<RelocationInfo> Relocations { get; init; }
-
-    // Import/export tables (PE)/dylib commands (Mach-O)/DT_NEEDED (ELF)
-    public ImportInfo[] Imports { get; init; }
-    public ExportInfo[] Exports { get; init; }
-}
-```
-
-Implement format-specific loaders:
-
-* `ElfLoader : IBinaryLoader`
-* `PeLoader : IBinaryLoader`
-* `MachOLoader : IBinaryLoader`
-
-Each loader uses your chosen C# parsers or ported code and fills `BinaryModule`.
-
-### Step 2 – Disassembly → basic blocks & candidate functions
-
-For each `BinaryModule`:
-
-1. Use appropriate decoder (iced for x86/x64; Disarm/ported ARMeilleure for AArch64).
-2. Seed function starts:
-
-   * Exported functions
-   * Entry points (`_start`, `main`, AddressOfEntryPoint)
-   * Mach-O `LC_FUNCTION_STARTS` if available
-3. Walk instructions to build basic blocks:
-
-   * Stop blocks at conditional/unconditional branches, calls, rets.
-   * Record for each call site:
-
-     * Address of caller function
-     * Operand type (immediate, memory with import table address, etc.)
-
-Disassembler outputs a list of `FunctionNode` skeletons (no cross-module link yet) and a list of **raw call sites**:
-
-```csharp
-public sealed class RawCallSite
-{
-    public int CallerFunctionId { get; init; }
-    public ulong InstructionAddress { get; init; }
-    public ulong? DirectTargetAddress { get; init; }  // e.g. CALL 0x401000
-    public ulong? MemoryTargetAddress { get; init; }  // e.g. CALL [0x404000]
-    public bool IsIndirect { get; init; }             // register-based etc.
-}
-```
-
-### Step 3 – Build function nodes
-
-Using disassembly + symbol tables:
-
-* For each discovered function:
-
-  * Determine: address, name (if sym available), export/import flags.
-  * Map `ModuleId` → `Purl` using `IPurlResolver`.
-* Populate `FunctionNode` instances and index them by `Id`.
-
-### Step 4 – Construct intra-module edges
-
-For each `RawCallSite`:
-
-* If `DirectTargetAddress` falls inside a known function’s address range in the **same module**, add an **IntraModuleDirect** edge.
-
-This gives you “normal” calls like `foo()` calling `bar()` in the same .so/.dll/.dylib.
-
-### Step 5 – Construct inter-module edges (import calls)
-
-This is where ELF/PE/Mach-O differ; details in section 4 below.
-
-But the abstract logic is:
-
-1. For each call site with `MemoryTargetAddress` (IAT slot / GOT entry / la_symbol_ptr / PLT):
-2. From the module’s import, relocation or fixup tables, determine:
-
-   * Which **imported symbol** it corresponds to (name, ordinal, etc.).
-   * Which **imported module / dylib / DLL** provides that symbol.
-3. Find (or create) a `FunctionNode` representing that imported symbol in the **provider module**.
-4. Add an **ImportCall** edge from caller function to the provider `FunctionNode`.
-
-This is the key to turning low-level dynamic linking into **purl-aware cross-module edges**, because each `FunctionNode` is already stamped with a `Purl`.
-
-### Step 6 – Build adjacency structures
-
-Once you have all `FunctionNode`s and `CallEdge`s:
-
-* Build `OutEdges` and `InEdges` dictionaries keyed by `FunctionNode.Id`.
-* Build `FunctionsByModule` / `FunctionsByPurl`. - ---- - -## 4. Format-specific edge construction - -This is the “how” for step 5, per binary format. - -### 4.1 ELF - -Goal: map call sites that go via PLT/GOT to an imported function in a `DT_NEEDED` library. - -Algorithm: - -1. Parse: - - * `.dynsym`, `.dynstr` – dynamic symbol table - * `.rela.plt` / `.rel.plt` – relocation entries for PLT - * `.got.plt` / `.got` – PLT’s GOT - * `DT_NEEDED` entries – list of linked shared objects and their sonames - -2. For each relocation of type `R_*_JUMP_SLOT`: - - * It applies to an entry in the PLT GOT; that GOT entry is what CALL instructions read from. - * Relocation gives you: - - * Offset in GOT (`r_offset`) - * Symbol index (`r_info` → symbol) → dynamic symbol (`ElfSymbol`) - * Symbol name, type (FUNC), binding, etc. - -3. Link GOT entries to call sites: - - * For each `RawCallSite` with `MemoryTargetAddress`, check if that address falls inside `.got.plt` (or `.got`). If it does: - - * Find relocation whose `r_offset` equals that GOT entry offset. - * That tells you which **symbol** is being called. - -4. Determine provider module: - - * From the symbol’s `st_name` and `DT_NEEDED` list, decide which shared object is expected to define it (an approximation is: first DT_NEEDED that provides that name). - * Map DT_NEEDED → `ModuleId` (you’ll have loaded these modules separately, or you can create “placeholder modules” if they’re not in the container image). - -5. Create edges: - - * Create/find `FunctionNode` for the **imported symbol** in provider module. - * Add `CallEdge` from caller function to imported function, `EdgeKind = ImportCall`, `Evidence = "ELF.R_X86_64_JUMP_SLOT"` (or arch-specific). - -This yields edges like: - -* `myapp:main` → `libssl.so.1.1:SSL_read` -* `libfoo.so:foo` → `libc.so.6:malloc` - -### 4.2 PE - -Goal: map call sites that go via the Import Address Table (IAT) to imported functions in DLLs. - -Algorithm: - -1. Parse: - - * `IMAGE_IMPORT_DESCRIPTOR[]` – each for a DLL name. - * Original thunk table (INT) – names/ordinals of imported symbols. - * IAT – where the loader writes function addresses at runtime. - -2. For each import entry: - - * Determine: - - * DLL name (`Name`) - * Function name or ordinal (from INT) - * IAT slot address (RVA) - -3. Link IAT slots to call sites: - - * For each `RawCallSite` with `MemoryTargetAddress`: - - * Check if this address equals the VA of an IAT slot. - * If yes, the call site is effectively calling that imported function. - -4. Determine provider module: - - * The DLL name gives you a target module (e.g. `KERNEL32.dll` → `ModuleId`). - * Ensure that DLL is represented as a `BinaryModule` or a “placeholder” if not present in image. - -5. Create edges: - - * Create/find `FunctionNode` for imported function in provider module. - * Add `CallEdge` with `EdgeKind = ImportCall` and `Evidence = "PE.IAT"` (or `"PE.DelayLoad"` if using delay load descriptors). - -Example: - -* `myservice.exe:Start` → `SSPICLI.dll:AcquireCredentialsHandleW` - -### 4.3 Mach-O - -Goal: map stub calls via `__TEXT,__stubs` / `__DATA,__la_symbol_ptr` (and / or chained fixups) to symbols in dependent dylibs. - -Algorithm (for classic dyld opcodes, not chained fixups, then extend): - -1. 
Parse: - - * Load commands: - - * `LC_SYMTAB`, `LC_DYSYMTAB` - * `LC_LOAD_DYLIB` (to know dependent dylibs) - * `LC_FUNCTION_STARTS` (for seeding functions) - * `LC_DYLD_INFO` (rebase/bind/lazy bind) - * `__TEXT,__stubs` – stub code - * `__DATA,__la_symbol_ptr` (or `__DATA_CONST,__la_symbol_ptr`) – lazy pointer table - * **Indirect symbol table** – maps slot indices to symbol table indices - -2. Stub → la_symbol_ptr mapping: - - * Stubs are small functions (usually a few instructions) that indirect through the corresponding `la_symbol_ptr` entry. - * For each stub function: - - * Determine which la_symbol_ptr entry it uses (based on stub index and linking metadata). - * From the indirect symbol table, find which dynamic symbol that la_symbol_ptr entry corresponds to. - - * This gives you symbol name and the index in `LC_LOAD_DYLIB` (dylib ordinal). - -3. Link stub call sites: - - * In disassembly, treat calls to these stub functions as **import calls**. - * For each call instruction `CALL stub_function`: - - * `RawCallSite.DirectTargetAddress` lies inside `__TEXT,__stubs`. - * Resolve stub → la_symbol_ptr → symbol → dylib. - -4. Determine provider module: - - * From dylib ordinal and load commands, get the path / install name of dylib (`libssl.1.1.dylib`, etc.). - * Map that to a `ModuleId` in your module set. - -5. Create edges: - - * Create/find imported `FunctionNode` in provider module. - * Add `CallEdge` from caller to that function with `EdgeKind = ImportCall`, `Evidence = "MachO.IndirectSymbol"`. - -For **chained fixups** (`LC_DYLD_CHAINED_FIXUPS`), you’ll compute a similar mapping but walking chain entries instead of traditional lazy/weak binds. The key is still: - -* Map a stub or function to a **fixup** entry. -* From fixup, determine the symbol and dylib. -* Then connect call-site → imported function. - ---- - -## 5. Reachability queries - -Once the graph is built, reachability is “just graph algorithms” + mapping back to purls. - -### 5.1 Roots - -Decide what are your **root functions**: - -* Binary entrypoints: - - * ELF: `_start`, `main`, constructors (`.init_array`) - * PE: AddressOfEntryPoint, registered service entrypoints - * Mach-O: `_main`, constructors -* Optionally, any exported API function that a container orchestrator or plugin system will call. 
- -Mark them as `FunctionNode.IsRoot = true` and create synthetic edges from a special root node if you want: - -```csharp -var syntheticRoot = new FunctionNode -{ - Id = 0, - Name = "", - IsRoot = true, - // Module, Purl can be special markers -}; - -foreach (var fn in allFunctions.Where(f => f.IsRoot)) -{ - edges.Add(new CallEdge - { - FromId = syntheticRoot.Id, - ToId = fn.Id, - Kind = EdgeKind.SyntheticRoot, - Evidence = "Root" - }); -} -``` - -### 5.2 Reachability algorithm (function-level) - -Use BFS/DFS from the root node(s): - -```csharp -public sealed class ReachabilityResult -{ - public HashSet ReachableFunctions { get; } = new(); -} - -public ReachabilityResult ComputeReachableFunctions(CallGraph graph, IEnumerable rootIds) -{ - var visited = new HashSet(); - var stack = new Stack(); - - foreach (var root in rootIds) - { - if (visited.Add(root)) - stack.Push(root); - } - - while (stack.Count > 0) - { - var current = stack.Pop(); - - if (!graph.OutEdges.TryGetValue(current, out var edges)) - continue; - - foreach (var edge in edges) - { - if (visited.Add(edge.ToId)) - stack.Push(edge.ToId); - } - } - - return new ReachabilityResult { ReachableFunctions = visited }; -} -``` - -### 5.3 Project reachability to modules and purls - -Given `ReachableFunctions`: - -```csharp -public sealed class ReachabilityProjection -{ - public HashSet ReachableModules { get; } = new(); - public HashSet ReachablePurls { get; } = new(); -} - -public ReachabilityProjection ProjectToModulesAndPurls(CallGraph graph, ReachabilityResult result) -{ - var projection = new ReachabilityProjection(); - - foreach (var fnId in result.ReachableFunctions) - { - if (!graph.Nodes.TryGetValue(fnId, out var fn)) - continue; - - projection.ReachableModules.Add(fn.Module); - projection.ReachablePurls.Add(fn.Purl); - } - - return projection; -} -``` - -Now you can answer questions like: - -* “Is any code from purl `pkg:deb/openssl@1.1.1w-1` reachable from the container entrypoint?” -* “Which purls are reachable at all?” - -### 5.4 Vulnerability reachability - -Assume you’ve mapped each vulnerability to: - -* `Purl` (where it lives) -* `AffectedFunctionNames` (symbols; optionally demangled) - -You can implement: - -```csharp -public sealed class VulnerabilitySink -{ - public string VulnerabilityId { get; init; } // CVE-... - public Purl Purl { get; init; } - public string FunctionName { get; init; } // symbol name or demangled -} -``` - -Resolution algorithm: - -1. For each `VulnerabilitySink`, find all `FunctionNode` with: - - * `node.Purl == sink.Purl` and - * `node.Name` or `node.DemangledName` matches `sink.FunctionName`. - -2. For each such node, check `ReachableFunctions.Contains(node.Id)`. - -3. Build a `Finding` object: - -```csharp -public sealed class VulnerabilityFinding -{ - public string VulnerabilityId { get; init; } - public Purl Purl { get; init; } - public bool IsReachable { get; init; } - public List SinkFunctionIds { get; init; } = new(); -} -``` - -Plus, if you want **path evidence**, you run a shortest-path search (BFS predecessor map) from root to sink and store the sequence of `FunctionNode.Id`s. - ---- - -## 6. Purl edges (derived graph) - -For reporting and analytics, it’s useful to produce a **purl-level dependency graph**. 
- -Given `CallGraph`: - -```csharp -public PurlGraphView BuildPurlGraph(CallGraph graph) -{ - var edgesByPair = new Dictionary<(Purl From, Purl To), PurlEdge>(); - - foreach (var kv in graph.OutEdges) - { - var fromFn = graph.Nodes[kv.Key]; - - foreach (var edge in kv.Value) - { - var toFn = graph.Nodes[edge.ToId]; - - if (fromFn.Purl.Equals(toFn.Purl)) - continue; // intra-purl, skip if you only care about inter-purl - - var key = (fromFn.Purl, toFn.Purl); - if (!edgesByPair.TryGetValue(key, out var pe)) - { - pe = new PurlEdge - { - From = fromFn.Purl, - To = toFn.Purl, - SupportingCalls = new List<(int, int)>() - }; - edgesByPair[key] = pe; - } - - pe.SupportingCalls.Add((fromFn.Id, toFn.Id)); - } - } - - var adj = new Dictionary>(); - - foreach (var kv in edgesByPair) - { - var (from, to) = kv.Key; - if (!adj.TryGetValue(from, out var list)) - { - list = new HashSet(); - adj[from] = list; - } - list.Add(to); - } - - return new PurlGraphView - { - Adjacent = adj, - Edges = edgesByPair.Values.ToList() - }; -} -``` - -This gives you: - -* A coarse view of runtime dependencies between purls (“Purl A calls into Purl B”). -* Enough context to emit purl-level VEX or to reason about trust at package granularity. - ---- - -## 7. JSON output and SBOM integration - -### 7.1 JSON shape (high level) - -You can emit a composite document: - -```json -{ - "image": "registry.example.com/app@sha256:...", - "modules": [ - { - "moduleId": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" }, - "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1", - "arch": "x86_64" - } - ], - "functions": [ - { - "id": 42, - "name": "SSL_do_handshake", - "demangledName": null, - "module": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" }, - "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1", - "address": "0x401020", - "exported": true - } - ], - "edges": [ - { - "from": 10, - "to": 42, - "kind": "ImportCall", - "evidence": "ELF.R_X86_64_JUMP_SLOT" - } - ], - "reachability": { - "roots": [1], - "reachableFunctions": [1,10,42] - }, - "purlGraph": { - "edges": [ - { - "from": "pkg:generic/myapp@1.0.0", - "to": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1", - "supportingCalls": [[10,42]] - } - ] - }, - "vulnerabilities": [ - { - "id": "CVE-2024-XXXX", - "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1", - "sinkFunctions": [42], - "reachable": true, - "paths": [ - [1, 10, 42] - ] - } - ] -} -``` - -### 7.2 Purl resolution - -Implement an `IPurlResolver` interface: - -```csharp -public interface IPurlResolver -{ - Purl ResolveForModule(string filePath, byte[] contentHash); -} -``` - -Possible implementations: - -* `SbomPurlResolver` – given a CycloneDX/SPDX SBOM for the image, match by path or checksum. -* `LinuxPackagePurlResolver` – read `/var/lib/dpkg/status` / rpm DB in the filesystem. -* `GenericPurlResolver` – fallback: `pkg:generic/`. - -You call the resolver in your loaders so that **every `BinaryModule` has a purl** and thus every `FunctionNode` has a purl. - ---- - -## 8. Concrete implementation tasks for your team - -1. **Data model & interfaces** - - * Implement `ModuleId`, `FunctionNode`, `CallEdge`, `CallGraph`. - * Define `RawCallSite`, `BinaryModule`, and `IPurlResolver`. - -2. **Loaders** - - * `ElfLoader`: fill symbols, dynamic relocations (PLT), DT_NEEDED, etc. - * `PeLoader`: import descriptors, IAT, delay-load. - * `MachOLoader`: load commands, stubs, la_symbol_ptr, indirect symbols / chained fixups. - -3. **Disassembly** - - * `X86Disassembler` (iced) and `Arm64Disassembler` (Disarm or port). 
- * Function detection and `RawCallSite` extraction. - -4. **CallGraphBuilder** - - * Build intra-module edges from direct calls. - * Build inter-module edges using the format-specific rules above. - * Construct final `CallGraph` with adjacency maps and purl mappings. - -5. **Reachability** - - * Implement BFS/DFS from root functions. - * Projection to modules + purls. - * Vulnerability sink resolution & path reconstruction. - -6. **Export** - - * JSON serializer for the schema above. - * Optional: purl-level summary / VEX generator. - ---- - -If you want, next step I can do a **more concrete design for `CallGraphBuilder`** (including per-format helper classes with method signatures) or a **C# skeleton** for the `ElfImportResolver`, `PeImportResolver`, and `MachOStubResolver` that plug directly into this plan. +Vlad, here’s a concrete, **pure‑C#** blueprint to build a multi‑format binary analyzer (Mach‑O, ELF, PE) that produces **call graphs + reachability**, with **no external tools**. Where needed, I point to permissively‑licensed code you can **port** (copy) from other ecosystems. + +--- + +## 0) Targets & non‑negotiables + +* **Formats:** Mach‑O (inc. LC_DYLD_INFO / LC_DYLD_CHAINED_FIXUPS), ELF (SysV gABI), PE/COFF +* **Architectures:** x86‑64 (and x86), AArch64 (ARM64) +* **Outputs:** JSON with **purls** per module + function‑level call graph & reachability +* **No tool reuse:** Only pure C# libraries or code **ported** from permissive sources + +--- + +## 1) Parsing the containers (pure C#) + +**Pick one C# reader per format, keeping licenses permissive:** + +* **ELF & Mach‑O:** `ELFSharp` (pure managed C#; ELF + Mach‑O reading). MIT/X11 license. ([GitHub][1]) +* **ELF & PE (+ DWARF v4):** `LibObjectFile` (C#, BSD‑2). Good ELF relocations (i386, x86_64, ARM, AArch64), PE directories, DWARF sections. Use it as your **common object model** for ELF+PE, then add a Mach‑O adapter. ([GitHub][2]) +* **PE (optional alternative):** `PeNet` (pure C#, broad PE directories, imp/exp, TLS, certs). MIT. Useful if you want a second implementation for cross‑checks. ([GitHub][3]) + +> Why two libs? `LibObjectFile` gives you DWARF and clean models for ELF/PE; `ELFSharp` covers Mach‑O today (and ELF as a fallback). You control the code paths. + +**Spec references you’ll implement against** (for correctness of your readers & link‑time semantics): + +* **ELF (gABI, AMD64 supplement):** dynamic section, PLT/GOT, `R_X86_64_JUMP_SLOT` semantics (eager vs lazy). ([refspecs.linuxbase.org][4]) +* **PE/COFF:** imports/exports/IAT, delay‑load, TLS. ([Microsoft Learn][5]) +* **Mach‑O:** file layout, load commands (`LC_SYMTAB`, `LC_DYSYMTAB`, `LC_FUNCTION_STARTS`, `LC_DYLD_INFO(_ONLY)`), and the modern `LC_DYLD_CHAINED_FIXUPS`. ([leopard-adc.pepas.com][6]) + +--- + +## 2) Mach‑O: what you must **port** (byte‑for‑byte compatible) + +Apple moved from traditional dyld bind opcodes to **chained fixups** on macOS 12/iOS 15+; you need both: + +* **Dyld bind opcodes** (`LC_DYLD_INFO(_ONLY)`): parse the BIND/LAZY_BIND streams (tuples of ``). Port minimal logic from **LLVM** or **LIEF** (both Apache‑2.0‑compatible) into C#. ([LIEF][7]) +* **Chained fixups** (`LC_DYLD_CHAINED_FIXUPS`): port `dyld_chained_fixups_header` structs & chain walking from LLVM’s `MachO.h` or Apple’s dyld headers. This restores imports/rebases without running dyld. ([LLVM][8]) +* **Function discovery hint:** read `LC_FUNCTION_STARTS` (ULEB128 deltas) to seed function boundaries—very helpful on stripped binaries. 
([Stack Overflow][9]) +* **Stubs mapping:** resolve `__TEXT,__stubs` ↔ `__DATA,__la_symbol_ptr` via the **indirect symbol table**; conceptually identical to ELF’s PLT/GOT. ([MaskRay][10]) + +> If you prefer an in‑C# base for Mach‑O manipulation, **Melanzana.MachO** exists (MIT) and has been used by .NET folks for Mach‑O/Code Signing/obj writing; you can mine its approach for load‑command modeling. ([GitHub][11]) + +--- + +## 3) Disassembly (pure C#, multi‑arch) + +* **x86/x64:** `iced` (C# decoder/disassembler/encoder; MIT; fast & complete). ([GitHub][12]) +* **AArch64/ARM64:** two options that keep you pure‑C#: + + * **Disarm** (pure C# ARM64 disassembler; MIT). Good starting point to decode & get branch/call kinds. ([GitHub][13]) + * **Port from Ryujinx ARMeilleure** (ARMv8 decoder/JIT in C#, MIT). You can lift only the **decoder** pieces you need. ([Gitee][14]) +* **x86 fallback:** `SharpDisasm` (udis86 port in C#; BSD‑2). Older than iced; keep as a reference. ([GitHub][15]) + +--- + +## 4) Call graph recovery (static) + +**4.1 Function seeds** + +* From symbols (`.dynsym`/`LC_SYMTAB`/PE exports) +* From **LC_FUNCTION_STARTS** (Mach‑O) for stripped code ([Stack Overflow][9]) +* From entrypoints (`_start`/`main` or PE AddressOfEntryPoint) +* From exception/unwind tables & DWARF (when present)—`LibObjectFile` already models DWARF v4. ([GitHub][2]) + +**4.2 CFG & interprocedural calls** + +* **Decode** with iced/Disarm from each seed; form **basic blocks** by following control‑flow until terminators (ret/jmp/call). +* **Direct calls:** immediate targets become edges (PC‑relative fixups where needed). +* **Imported calls:** + + * **ELF:** calls to PLT stubs → resolve via `.rela.plt` & `R_*_JUMP_SLOT` to symbol names (link‑time target). ([cs61.seas.harvard.edu][16]) + * **PE:** calls through the **IAT** → resolve via `IMAGE_IMPORT_DESCRIPTOR` / thunk tables. ([Microsoft Learn][5]) + * **Mach‑O:** calls to `__stubs` use **indirect symbol table** + `__la_symbol_ptr` (or chained fixups) → map to dylib/symbol. ([reinterpretcast.com][17]) +* **Indirect calls within the binary:** heuristics only (function pointer tables, vtables, small constant pools). Keep them labeled **“indirect‑unresolved”** unless a heuristic yields a concrete target. + +**4.3 Cross‑binary graph** + +* Build module‑level edges by simulating the platform’s loader: + + * **ELF:** honor `DT_NEEDED`, `DT_RPATH/RUNPATH`, versioning (`.gnu.version*`) to pick the definer of an imported symbol. gABI rules apply. ([refspecs.linuxbase.org][4]) + * **PE:** pick DLL from the import descriptors. ([Microsoft Learn][5]) + * **Mach‑O:** `LC_LOAD_DYLIB` + dyld binding / chained fixups determine the provider image. ([LIEF][7]) + +--- + +## 5) Reachability analysis + +Represent the **call graph** using a .NET graph lib (or a simple adjacency set). I suggest: + +* **QuikGraph** (successor of QuickGraph; MIT) for algorithms (DFS/BFS, SCCs). Use it to compute reachability from chosen roots (entrypoint(s), exported APIs, or “sinks”). ([GitHub][18]) + +You can visualize with **MSAGL** (MIT) when you need layouts, but your core output is JSON. ([GitHub][19]) + +--- + +## 6) Symbol demangling (nice‑to‑have, pure C#) + +* **Itanium (ELF/Mach‑O):** Either port LLVM’s Itanium demangler or use a C# lib like **CxxDemangler** (a C# rewrite of `cpp_demangle`). ([LLVM][20]) +* **MSVC (PE):** Port LLVM’s `MicrosoftDemangle.cpp` (Apache‑2.0 with LLVM exception) to C#. 
([LLVM][21]) + +--- + +## 7) JSON output (with purls) + +Use a stable schema (example) to feed SBOM/vuln matching downstream: + +```json +{ + "modules": [ + { + "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64", + "format": "ELF", + "arch": "x86_64", + "path": "/usr/lib/x86_64-linux-gnu/libssl.so.1.1", + "exports": ["SSL_read", "SSL_write"], + "imports": ["BIO_new", "EVP_CipherInit_ex"], + "functions": [{"name":"SSL_do_handshake","va":"0x401020","size":512,"demangled": "..."}] + } + ], + "graph": { + "nodes": [ + {"id":"bin:main@0x401000","module": "pkg:generic/myapp@1.0.0"}, + {"id":"lib:SSL_read","module":"pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64"} + ], + "edges": [ + {"src":"bin:main@0x401000","dst":"lib:SSL_read","kind":"import_call","evidence":"ELF.R_X86_64_JUMP_SLOT"} + ] + }, + "reachability": { + "roots": ["bin:_start","bin:main@0x401000"], + "reachable": ["lib:SSL_read", "lib:SSL_write"], + "unresolved_indirect_calls": [ + {"site":"0x402ABC","reason":"register-indirect"} + ] + } +} +``` + +--- + +## 8) Minimal C# module layout (sketch) + +``` +Stella.Analysis.Core/ + BinaryModule.cs // common model (sections, symbols, relocs, imports/exports) + Loader/ + PeLoader.cs // wrap LibObjectFile (or PeNet) to BinaryModule + ElfLoader.cs // wrap LibObjectFile to BinaryModule + MachOLoader.cs // wrap ELFSharp + your ported Dyld/ChainedFixups + Disasm/ + X86Disassembler.cs // iced bridge: bytes -> instructions + Arm64Disassembler.cs // Disarm (or ARMeilleure port) bridge + Graph/ + CallGraphBuilder.cs // builds CFG per function + inter-procedural edges + Reachability.cs // BFS/DFS over QuikGraph + Demangle/ + ItaniumDemangler.cs // port or wrap CxxDemangler + MicrosoftDemangler.cs // port from LLVM + Export/ + JsonWriter.cs // writes schema above +``` + +--- + +## 9) Implementation notes (where issues usually bite) + +* **Mach‑O moderns:** Implement both dyld opcode **and** chained fixups; many macOS 12+/iOS15+ binaries only have chained fixups. ([emergetools.com][22]) +* **Stubs vs real targets (Mach‑O):** map `__stubs` → `__la_symbol_ptr` via **indirect symbols** to the true imported symbol (or its post‑fixup target). ([reinterpretcast.com][17]) +* **ELF PLT/GOT:** treat `.plt` entries as **call trampolines**; ultimate edge should point to the symbol (library) that satisfies `DT_NEEDED` + version. ([refspecs.linuxbase.org][4]) +* **PE delay‑load:** don’t forget `IMAGE_DELAYLOAD_DESCRIPTOR` for delayed IATs. ([Microsoft Learn][5]) +* **Function discovery:** use `LC_FUNCTION_STARTS` when symbols are stripped; it’s a cheap way to seed analysis. ([Stack Overflow][9]) +* **Name clarity:** demangle Itanium/MSVC so downstream vuln rules can match consistently. ([LLVM][20]) + +--- + +## 10) What to **copy/port** verbatim (safe licenses) + +* **Dyld bind & exports trie logic:** from **LLVM** or **LIEF** Mach‑O (Apache‑2.0). Great for getting the exact opcode semantics right. ([LIEF][7]) +* **Chained fixups structs/walkers:** from **LLVM MachO.h** or Apple dyld headers (permissive headers). ([LLVM][8]) +* **Itanium/MS demanglers:** LLVM demangler sources are standalone; easy to translate to C#. ([LLVM][23]) +* **ARM64 decoder:** if Disarm gaps hurt, lift just the **decoder** pieces from **Ryujinx ARMeilleure** (MIT). ([Gitee][14]) + +*(Avoid GPL’d parsers like binutils/BFD; they will contaminate your codebase’s licensing.)* + +--- + +## 11) End‑to‑end pipeline (per container image) + +1. **Enumerate binaries** in the container FS. +2. 
**Parse** each with the appropriate loader → `BinaryModule` (+ imports/exports/symbols/relocs). +3. **Simulate linking** per platform to resolve imported functions to provider libraries. ([refspecs.linuxbase.org][4]) +4. **Disassemble** functions (iced/Disarm) → CFGs → **call edges** (direct, PLT/IAT/stub, indirect). +5. **Assemble call graph** across modules; normalize names via demangling. +6. **Reachability**: given roots (entry or user‑specified) compute reachable set; emit JSON with **purls** (from your SBOM/package resolver). +7. **(Optional)** dump GraphViz / MSAGL views for debugging. ([GitHub][19]) + +--- + +## 12) Quick heuristics for vulnerability triage + +* **Sink maps**: flag edges to high‑risk APIs (`strcpy`, `gets`, legacy SSL ciphers) even without CVE versioning. +* **DWARF line info** (when present): attach file:line to nodes for developer action. `LibObjectFile` gives you DWARF v4 reads. ([GitHub][2]) + +--- + +## 13) Test corpora + +* **ELF:** glibc/openssl/libpng from distro repos; validate `R_*_JUMP_SLOT` handling and PLT edges. ([cs61.seas.harvard.edu][16]) +* **PE:** system DLLs (Kernel32, Advapi32) and a small MSVC console app; validate IAT & delay‑load. ([Microsoft Learn][5]) +* **Mach‑O:** Xcode‑built binaries across macOS 11 & 12+ to cover both dyld opcode and chained fixups paths; verify `LC_FUNCTION_STARTS` improves discovery. ([Stack Overflow][9]) + +--- + +## 14) Deliverables you can start coding now + +* **MachOLoader.cs** + + * Parse headers + load commands (ELFSharp). + * Implement `DyldInfoParser` (port from LLVM/LIEF) and `ChainedFixupsParser` (port structs & walkers). ([LIEF][7]) +* **X86Disassembler.cs / Arm64Disassembler.cs** (iced / Disarm bridges). ([GitHub][12]) +* **CallGraphBuilder.cs** (recursive descent + linear sweep fallback; PLT/IAT/stub resolution). +* **Reachability.cs** (QuikGraph BFS/DFS). ([GitHub][18]) +* **JsonWriter.cs** (schema above with purls). + +--- + +### References (core, load‑bearing) + +* **ELFSharp** (ELF + Mach‑O pure C#). ([GitHub][1]) +* **LibObjectFile** (ELF/PE/DWARF C#, BSD‑2). ([GitHub][2]) +* **iced** (x86/x64 disasm, C#, MIT). ([GitHub][12]) +* **Disarm** (ARM64 disasm, C#, MIT). ([GitHub][13]) +* **Ryujinx (ARMeilleure)** (ARMv8 decode/JIT in C#, MIT). ([Gitee][14]) +* **ELF gABI & AMD64 supplement** (PLT/GOT, relocations). ([refspecs.linuxbase.org][4]) +* **PE/COFF** (imports/exports/IAT). ([Microsoft Learn][5]) +* **Mach‑O docs** (load commands; LC_FUNCTION_STARTS; dyld bindings; chained fixups). ([Apple Developer][24]) + +--- + +If you want, I can draft **`MachOLoader` + `DyldInfoParser`** in C# next, including chained‑fixups structs (ported from LLVM’s headers) and an **iced**‑based call‑edge walker for x86‑64. + +[1]: https://github.com/konrad-kruczynski/elfsharp "GitHub - konrad-kruczynski/elfsharp: Pure managed C# library for reading ELF, UImage, Mach-O binaries." +[2]: https://github.com/xoofx/LibObjectFile "GitHub - xoofx/LibObjectFile: LibObjectFile is a .NET library to read, manipulate and write linker and executable object files (e.g ELF, PE, DWARF, ar...)" +[3]: https://github.com/secana/PeNet?utm_source=chatgpt.com "secana/PeNet: Portable Executable (PE) library written in . ..." 
+[4]: https://refspecs.linuxbase.org/elf/gabi4%2B/contents.html?utm_source=chatgpt.com "System V Application Binary Interface - DRAFT - 24 April 2001" +[5]: https://learn.microsoft.com/en-us/windows/win32/debug/pe-format?utm_source=chatgpt.com "PE Format - Win32 apps" +[6]: https://leopard-adc.pepas.com/documentation/DeveloperTools/Conceptual/MachOTopics/0-Introduction/introduction.html?utm_source=chatgpt.com "Mach-O Programming Topics: Introduction" +[7]: https://lief.re/doc/stable/doxygen/classLIEF_1_1MachO_1_1DyldInfo.html?utm_source=chatgpt.com "MachO::DyldInfo Class Reference - LIEF" +[8]: https://llvm.org/doxygen/structllvm_1_1MachO_1_1dyld__chained__fixups__header.html?utm_source=chatgpt.com "MachO::dyld_chained_fixups_header Struct Reference" +[9]: https://stackoverflow.com/questions/9602438/mach-o-file-lc-function-starts-load-command?utm_source=chatgpt.com "Mach-O file LC_FUNCTION_STARTS load command" +[10]: https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table?utm_source=chatgpt.com "All about Procedure Linkage Table" +[11]: https://github.com/dotnet/runtime/issues/77178 "Discussion: ObjWriter in C# · Issue #77178 · dotnet/runtime · GitHub" +[12]: https://github.com/icedland/iced?utm_source=chatgpt.com "icedland/iced: Blazing fast and correct x86/x64 ..." +[13]: https://github.com/SamboyCoding/Disarm?utm_source=chatgpt.com "SamboyCoding/Disarm: Fast, pure-C# ARM64 Disassembler" +[14]: https://gitee.com/ryujinx/Ryujinx/blob/master/LICENSE.txt?utm_source=chatgpt.com "Ryujinx/Ryujinx" +[15]: https://github.com/justinstenning/SharpDisasm?utm_source=chatgpt.com "justinstenning/SharpDisasm" +[16]: https://cs61.seas.harvard.edu/site/2022/pdf/x86-64-abi-20210928.pdf?utm_source=chatgpt.com "System V Application Binary Interface" +[17]: https://www.reinterpretcast.com/hello-world-mach-o?utm_source=chatgpt.com "The Nitty Gritty of “Hello World” on macOS | reinterpretcast.com" +[18]: https://github.com/KeRNeLith/QuikGraph?utm_source=chatgpt.com "KeRNeLith/QuikGraph: Generic Graph Data Structures and ..." +[19]: https://github.com/microsoft/automatic-graph-layout?utm_source=chatgpt.com "microsoft/automatic-graph-layout: A set of tools for ..." +[20]: https://llvm.org/doxygen/structllvm_1_1ItaniumPartialDemangler.html?utm_source=chatgpt.com "ItaniumPartialDemangler Struct Reference" +[21]: https://llvm.org/doxygen/MicrosoftDemangle_8cpp_source.html?utm_source=chatgpt.com "lib/Demangle/MicrosoftDemangle.cpp Source File" +[22]: https://www.emergetools.com/blog/posts/iOS15LaunchTime?utm_source=chatgpt.com "How iOS 15 makes your app launch faster" +[23]: https://llvm.org/doxygen/ItaniumDemangle_8cpp.html?utm_source=chatgpt.com "lib/Demangle/ItaniumDemangle.cpp File Reference" +[24]: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/CodeFootprint/Articles/MachOOverview.html?utm_source=chatgpt.com "Overview of the Mach-O Executable Format" +Good, this is exactly the point where you want to nail the internal model, otherwise everything on top (VEX, policy, reporting) will be shaky. + +Below is a detailed, implementation-ready plan for a **reachability graph with purl-aware edges**, covering ELF, PE, and Mach-O, in C#. + +I’ll structure it as: + +1. Overall graph design (3 layers: function, module, purl) +2. Core C# data model +3. Pipeline steps (end-to-end) +4. Format-specific edge construction (ELF / PE / Mach-O) +5. Reachability queries (from entrypoints to vulnerable purls / functions) +6. JSON output layout and integration with SBOM + +--- + +## 1. 
Overall graph design

You want three tightly linked graph layers:

1. **Function-level call graph (FLG)**

   * Nodes: individual **functions** inside binaries
   * Edges: calls from function A → function B (intra- or inter-module)

2. **Module-level graph (MLG)**

   * Nodes: **binaries** (ELF/PE/Mach-O files)
   * Edges: “module A calls module B at least once” (aggregated from the FLG)

3. **Purl-level graph (PLG)**

   * Nodes: **purls** (packages or generic artifacts)
   * Edges: “purl P1 depends-at-runtime on purl P2” (aggregated from module edges)

The **reachability algorithm** runs primarily on the **function graph**, but:

* You can project reachability results to **module** and **purl** nodes.
* You can also run coarse-grained analysis directly on the **purl graph** when needed (“Is any code in purl X reachable from the container entrypoint?”).

---

## 2. Core C# data model

### 2.1 Identifiers and enums

```csharp
public enum BinaryFormat { Elf, Pe, MachO }

public readonly record struct ModuleId(string Path, BinaryFormat Format);

public readonly record struct Purl(string Value);

public enum EdgeKind
{
    IntraModuleDirect,  // call foo -> bar in same module
    ImportCall,         // call via plt/iat/stub to imported function
    SyntheticRoot,      // root (entrypoint) edge
    IndirectUnresolved  // optional: we saw an indirect call we couldn't resolve
}
```

### 2.2 Function node

```csharp
public sealed class FunctionNode
{
    public int Id { get; init; }                 // internal numeric id
    public ModuleId Module { get; init; }
    public Purl Purl { get; init; }              // resolved from Module -> Purl
    public ulong Address { get; init; }          // VA or RVA
    public string Name { get; init; }            // mangled
    public string? DemangledName { get; init; }  // optional
    public bool IsExported { get; init; }
    public bool IsImportedStub { get; init; }    // e.g. PLT stub, Mach-O stub, PE thunks
    public bool IsRoot { get; set; }             // _start/main/entrypoint etc.
}
```

### 2.3 Edges

```csharp
public sealed class CallEdge
{
    public int FromId { get; init; }      // FunctionNode.Id
    public int ToId { get; init; }        // FunctionNode.Id
    public EdgeKind Kind { get; init; }
    public string Evidence { get; init; } // e.g. "ELF.R_X86_64_JUMP_SLOT", "PE.IAT", "MachO.indirectSym"
}
```

### 2.4 Graph container

```csharp
public sealed class CallGraph
{
    public IReadOnlyDictionary<int, FunctionNode> Nodes { get; init; }
    public IReadOnlyDictionary<int, List<CallEdge>> OutEdges { get; init; }
    public IReadOnlyDictionary<int, List<CallEdge>> InEdges { get; init; }

    // Convenience: mappings
    public IReadOnlyDictionary<ModuleId, List<int>> FunctionsByModule { get; init; }
    public IReadOnlyDictionary<Purl, List<int>> FunctionsByPurl { get; init; }
}
```

### 2.5 Purl-level graph view

You don’t store a separate physical graph; you **derive** it on demand:

```csharp
public sealed class PurlEdge
{
    public Purl From { get; init; }
    public Purl To { get; init; }
    public List<(int FromFnId, int ToFnId)> SupportingCalls { get; init; }
}

public sealed class PurlGraphView
{
    public IReadOnlyDictionary<Purl, HashSet<Purl>> Adjacent { get; init; }
    public IReadOnlyList<PurlEdge> Edges { get; init; }
}
```

---

## 3. Pipeline steps (end-to-end)

### Step 0 – Inputs

* Set of binaries (files) extracted from the container image.
* SBOM or other metadata that can map a file path (or hash) → **purl**.
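Before Step 1 can dispatch to a loader, something has to decide which format a file is. A minimal magic-byte sniffer is enough; this is a sketch (class and method names are illustrative; `BinaryFormat` is the enum from §2.1):

```csharp
using System.Buffers.Binary;

public static class FormatSniffer
{
    // Returns null when the file is not a recognized native binary.
    public static BinaryFormat? Detect(ReadOnlySpan<byte> head)
    {
        if (head.Length < 4) return null;

        // ELF: 0x7F 'E' 'L' 'F'
        if (head[0] == 0x7F && head[1] == (byte)'E' && head[2] == (byte)'L' && head[3] == (byte)'F')
            return BinaryFormat.Elf;

        // PE: DOS stub starts with "MZ"; the loader validates the PE\0\0 signature later.
        if (head[0] == (byte)'M' && head[1] == (byte)'Z')
            return BinaryFormat.Pe;

        var magic = BinaryPrimitives.ReadUInt32LittleEndian(head);

        // Mach-O thin images (MH_MAGIC_64 / MH_CIGAM_64 and the 32-bit variants).
        if (magic is 0xFEEDFACF or 0xCFFAEDFE or 0xFEEDFACE or 0xCEFAEDFE)
            return BinaryFormat.MachO;

        // Fat/universal Mach-O (FAT_MAGIC is big-endian on disk).
        // Caveat: Java .class files also start with 0xCAFEBABE; disambiguate via the arch count.
        if (magic is 0xBEBAFECA or 0xCAFEBABE)
            return BinaryFormat.MachO;

        return null;
    }
}
```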
### Step 1 – Parse binaries → `BinaryModule` objects

You define a common in-memory model:

```csharp
public sealed class BinaryModule
{
    public ModuleId Id { get; init; }
    public Purl Purl { get; init; }
    public BinaryFormat Format { get; init; }

    // Raw sections / segments (element types are illustrative)
    public IReadOnlyList<SectionInfo> Sections { get; init; }

    // Symbols: imports + exports + locals
    public IReadOnlyList<SymbolInfo> Symbols { get; init; }

    // Relocations / fixups
    public IReadOnlyList<RelocationInfo> Relocations { get; init; }

    // Import/export tables (PE) / dylib commands (Mach-O) / DT_NEEDED (ELF)
    public ImportInfo[] Imports { get; init; }
    public ExportInfo[] Exports { get; init; }
}
```

Implement format-specific loaders:

* `ElfLoader : IBinaryLoader`
* `PeLoader : IBinaryLoader`
* `MachOLoader : IBinaryLoader`

Each loader uses your chosen C# parsers or ported code and fills `BinaryModule`.

### Step 2 – Disassembly → basic blocks & candidate functions

For each `BinaryModule`:

1. Use the appropriate decoder (iced for x86/x64; Disarm or a ported ARMeilleure for AArch64).
2. Seed function starts:

   * Exported functions
   * Entry points (`_start`, `main`, AddressOfEntryPoint)
   * Mach-O `LC_FUNCTION_STARTS` if available
3. Walk instructions to build basic blocks:

   * Stop blocks at conditional/unconditional branches, calls, rets.
   * Record for each call site:

     * Address of the caller function
     * Operand type (immediate, memory with import table address, etc.)

The disassembler outputs a list of `FunctionNode` skeletons (no cross-module links yet) and a list of **raw call sites** (a decode-loop sketch follows Step 6 below):

```csharp
public sealed class RawCallSite
{
    public int CallerFunctionId { get; init; }
    public ulong InstructionAddress { get; init; }
    public ulong? DirectTargetAddress { get; init; }  // e.g. CALL 0x401000
    public ulong? MemoryTargetAddress { get; init; }  // e.g. CALL [0x404000]
    public bool IsIndirect { get; init; }             // register-based etc.
}
```

### Step 3 – Build function nodes

Using disassembly + symbol tables:

* For each discovered function:

  * Determine: address, name (if a symbol is available), export/import flags.
  * Map `ModuleId` → `Purl` using `IPurlResolver`.
* Populate `FunctionNode` instances and index them by `Id`.

### Step 4 – Construct intra-module edges

For each `RawCallSite`:

* If `DirectTargetAddress` falls inside a known function’s address range in the **same module**, add an **IntraModuleDirect** edge.

This gives you “normal” calls like `foo()` calling `bar()` in the same .so/.dll/.dylib.

### Step 5 – Construct inter-module edges (import calls)

This is where ELF/PE/Mach-O differ; details in section 4 below.

The abstract logic is:

1. For each call site with a `MemoryTargetAddress` (IAT slot / GOT entry / la_symbol_ptr / PLT):
2. From the module’s import, relocation, or fixup tables, determine:

   * Which **imported symbol** it corresponds to (name, ordinal, etc.).
   * Which **imported module / dylib / DLL** provides that symbol.
3. Find (or create) a `FunctionNode` representing that imported symbol in the **provider module**.
4. Add an **ImportCall** edge from the caller function to the provider `FunctionNode`.

This is the key to turning low-level dynamic linking into **purl-aware cross-module edges**, because each `FunctionNode` is already stamped with a `Purl`.

### Step 6 – Build adjacency structures

Once you have all `FunctionNode`s and `CallEdge`s:

* Build the `OutEdges` and `InEdges` dictionaries keyed by `FunctionNode.Id`.
* Build `FunctionsByModule` / `FunctionsByPurl`.
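The decode loop promised in Step 2 is short on the x86-64 side. A minimal sketch using iced, doing a linear sweep over one function’s bytes (verify the exact property names, notably `NearBranchTarget` and the RIP-relative helpers, against the iced version you pin):

```csharp
using System.Collections.Generic;
using Iced.Intel;

public static class CallSiteWalker
{
    // Linear sweep over one function's bytes; the caller supplies the function's VA and id.
    public static IEnumerable<RawCallSite> Extract(byte[] code, ulong functionVa, int callerFunctionId)
    {
        var reader = new ByteArrayCodeReader(code);
        var decoder = Decoder.Create(64, reader);
        decoder.IP = functionVa;

        while (reader.CanReadByte)
        {
            decoder.Decode(out var instr);
            if (instr.FlowControl != FlowControl.Call)
                continue;

            yield return new RawCallSite
            {
                CallerFunctionId = callerFunctionId,
                InstructionAddress = instr.IP,
                // CALL rel32 -> immediate target
                DirectTargetAddress = instr.Op0Kind == OpKind.NearBranch64
                    ? instr.NearBranchTarget : (ulong?)null,
                // CALL qword ptr [rip+disp] -> address of the GOT/IAT slot being read
                MemoryTargetAddress = instr.Op0Kind == OpKind.Memory && instr.IsIPRelativeMemoryOperand
                    ? instr.IPRelativeMemoryAddress : (ulong?)null,
                IsIndirect = instr.Op0Kind == OpKind.Register
            };
        }
    }
}
```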
---

## 4. Format-specific edge construction

This is the “how” for Step 5, per binary format.

### 4.1 ELF

Goal: map call sites that go via PLT/GOT to an imported function in a `DT_NEEDED` library.

Algorithm:

1. Parse:

   * `.dynsym`, `.dynstr` – dynamic symbol table
   * `.rela.plt` / `.rel.plt` – relocation entries for the PLT
   * `.got.plt` / `.got` – the PLT’s GOT
   * `DT_NEEDED` entries – list of linked shared objects and their sonames

2. For each relocation of type `R_*_JUMP_SLOT`:

   * It applies to an entry in the PLT GOT; that GOT entry is what CALL instructions read from.
   * The relocation gives you:

     * Offset in the GOT (`r_offset`)
     * Symbol index (`r_info` → symbol) → dynamic symbol (`ElfSymbol`)
     * Symbol name, type (FUNC), binding, etc.

3. Link GOT entries to call sites:

   * For each `RawCallSite` with a `MemoryTargetAddress`, check whether that address falls inside `.got.plt` (or `.got`). If it does:

     * Find the relocation whose `r_offset` equals that GOT entry offset.
     * That tells you which **symbol** is being called.

4. Determine the provider module:

   * From the symbol’s `st_name` and the `DT_NEEDED` list, decide which shared object is expected to define it (a workable approximation: the first `DT_NEEDED` entry that provides that name).
   * Map DT_NEEDED → `ModuleId` (you’ll have loaded these modules separately, or you can create “placeholder modules” if they’re not in the container image).

5. Create edges:

   * Create/find the `FunctionNode` for the **imported symbol** in the provider module.
   * Add a `CallEdge` from the caller function to the imported function, `EdgeKind = ImportCall`, `Evidence = "ELF.R_X86_64_JUMP_SLOT"` (or arch-specific).

This yields edges like:

* `myapp:main` → `libssl.so.1.1:SSL_read`
* `libfoo.so:foo` → `libc.so.6:malloc`

### 4.2 PE

Goal: map call sites that go via the Import Address Table (IAT) to imported functions in DLLs.

Algorithm:

1. Parse:

   * `IMAGE_IMPORT_DESCRIPTOR[]` – one per DLL name.
   * The original thunk table (INT) – names/ordinals of imported symbols.
   * The IAT – where the loader writes function addresses at runtime.

2. For each import entry, determine:

   * DLL name (`Name`)
   * Function name or ordinal (from the INT)
   * IAT slot address (RVA)

3. Link IAT slots to call sites:

   * For each `RawCallSite` with a `MemoryTargetAddress`:

     * Check whether this address equals the VA of an IAT slot.
     * If yes, the call site is effectively calling that imported function.

4. Determine the provider module:

   * The DLL name gives you a target module (e.g. `KERNEL32.dll` → `ModuleId`).
   * Ensure that DLL is represented as a `BinaryModule`, or as a “placeholder” if it is not present in the image.

5. Create edges:

   * Create/find the `FunctionNode` for the imported function in the provider module.
   * Add a `CallEdge` with `EdgeKind = ImportCall` and `Evidence = "PE.IAT"` (or `"PE.DelayLoad"` if using delay-load descriptors).

Example:

* `myservice.exe:Start` → `SSPICLI.dll:AcquireCredentialsHandleW`

### 4.3 Mach-O

Goal: map stub calls via `__TEXT,__stubs` / `__DATA,__la_symbol_ptr` (and/or chained fixups) to symbols in dependent dylibs.

Algorithm (classic dyld opcodes first; the chained-fixups variant follows the same pattern):
1. Parse:

   * Load commands:

     * `LC_SYMTAB`, `LC_DYSYMTAB`
     * `LC_LOAD_DYLIB` (to know the dependent dylibs)
     * `LC_FUNCTION_STARTS` (for seeding functions)
     * `LC_DYLD_INFO` (rebase/bind/lazy bind)
   * `__TEXT,__stubs` – stub code
   * `__DATA,__la_symbol_ptr` (or `__DATA_CONST,__la_symbol_ptr`) – lazy pointer table
   * **Indirect symbol table** – maps slot indices to symbol table indices

2. Stub → la_symbol_ptr mapping:

   * Stubs are small functions (usually a few instructions) that indirect through the corresponding `la_symbol_ptr` entry.
   * For each stub function:

     * Determine which `la_symbol_ptr` entry it uses (based on stub index and linking metadata).
     * From the indirect symbol table, find which dynamic symbol that `la_symbol_ptr` entry corresponds to. This gives you the symbol name and the index into `LC_LOAD_DYLIB` (dylib ordinal).

3. Link stub call sites:

   * In disassembly, treat calls to these stub functions as **import calls**.
   * For each call instruction `CALL stub_function`:

     * `RawCallSite.DirectTargetAddress` lies inside `__TEXT,__stubs`.
     * Resolve stub → la_symbol_ptr → symbol → dylib.

4. Determine the provider module:

   * From the dylib ordinal and load commands, get the path / install name of the dylib (`libssl.1.1.dylib`, etc.).
   * Map that to a `ModuleId` in your module set.

5. Create edges:

   * Create/find the imported `FunctionNode` in the provider module.
   * Add a `CallEdge` from the caller to that function with `EdgeKind = ImportCall`, `Evidence = "MachO.IndirectSymbol"`.

For **chained fixups** (`LC_DYLD_CHAINED_FIXUPS`), you compute a similar mapping by walking chain entries instead of the traditional lazy/weak binds. The key is still:

* Map a stub or function to a **fixup** entry.
* From the fixup, determine the symbol and dylib.
* Then connect call-site → imported function.

---

## 5. Reachability queries

Once the graph is built, reachability is “just graph algorithms” plus mapping back to purls.

### 5.1 Roots

Decide what your **root functions** are:

* Binary entrypoints:

  * ELF: `_start`, `main`, constructors (`.init_array`)
  * PE: AddressOfEntryPoint, registered service entrypoints
  * Mach-O: `_main`, constructors
* Optionally, any exported API function that a container orchestrator or plugin system will call.
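To make the entrypoint roots concrete, here is a sketch of seeding them from the parsed headers. It assumes ELFSharp’s `ELF<T>.EntryPoint` and PeNet’s `ImageNtHeaders`/`OptionalHeader` model; verify both against the library versions you pin:

```csharp
using System.Collections.Generic;

public static class RootSeeder
{
    public static ulong? GetEntryPoint(string path, BinaryFormat format) => format switch
    {
        BinaryFormat.Elf => ELFSharp.ELF.ELFReader.Load<ulong>(path).EntryPoint,
        BinaryFormat.Pe => GetPeEntry(path),
        _ => null, // Mach-O: read LC_MAIN / LC_UNIXTHREAD in your MachOLoader instead
    };

    static ulong? GetPeEntry(string path)
    {
        var pe = new PeNet.PeFile(path);
        var opt = pe.ImageNtHeaders?.OptionalHeader;
        return opt is null ? null : opt.ImageBase + opt.AddressOfEntryPoint;
    }

    // Flag the FunctionNode whose address matches the entry point.
    public static void MarkRoots(IEnumerable<FunctionNode> functions, ulong entryPoint)
    {
        foreach (var fn in functions)
            if (fn.Address == entryPoint)
                fn.IsRoot = true;
    }
}
```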
Mark them as `FunctionNode.IsRoot = true` and, if you want, create synthetic edges from a special root node:

```csharp
var syntheticRoot = new FunctionNode
{
    Id = 0,
    Name = "<synthetic-root>", // placeholder name; Module and Purl can be special marker values
    IsRoot = true,
};

foreach (var fn in allFunctions.Where(f => f.IsRoot))
{
    edges.Add(new CallEdge
    {
        FromId = syntheticRoot.Id,
        ToId = fn.Id,
        Kind = EdgeKind.SyntheticRoot,
        Evidence = "Root"
    });
}
```

### 5.2 Reachability algorithm (function-level)

Use BFS/DFS from the root node(s):

```csharp
public sealed class ReachabilityResult
{
    public HashSet<int> ReachableFunctions { get; init; } = new();
}

public ReachabilityResult ComputeReachableFunctions(CallGraph graph, IEnumerable<int> rootIds)
{
    var visited = new HashSet<int>();
    var stack = new Stack<int>();

    foreach (var root in rootIds)
    {
        if (visited.Add(root))
            stack.Push(root);
    }

    while (stack.Count > 0)
    {
        var current = stack.Pop();

        if (!graph.OutEdges.TryGetValue(current, out var edges))
            continue;

        foreach (var edge in edges)
        {
            if (visited.Add(edge.ToId))
                stack.Push(edge.ToId);
        }
    }

    return new ReachabilityResult { ReachableFunctions = visited };
}
```

### 5.3 Project reachability to modules and purls

Given `ReachableFunctions`:

```csharp
public sealed class ReachabilityProjection
{
    public HashSet<ModuleId> ReachableModules { get; } = new();
    public HashSet<Purl> ReachablePurls { get; } = new();
}

public ReachabilityProjection ProjectToModulesAndPurls(CallGraph graph, ReachabilityResult result)
{
    var projection = new ReachabilityProjection();

    foreach (var fnId in result.ReachableFunctions)
    {
        if (!graph.Nodes.TryGetValue(fnId, out var fn))
            continue;

        projection.ReachableModules.Add(fn.Module);
        projection.ReachablePurls.Add(fn.Purl);
    }

    return projection;
}
```

Now you can answer questions like:

* “Is any code from purl `pkg:deb/openssl@1.1.1w-1` reachable from the container entrypoint?”
* “Which purls are reachable at all?”

### 5.4 Vulnerability reachability

Assume you’ve mapped each vulnerability to:

* `Purl` (where it lives)
* `AffectedFunctionNames` (symbols; optionally demangled)

You can implement:

```csharp
public sealed class VulnerabilitySink
{
    public string VulnerabilityId { get; init; } // CVE-...
    public Purl Purl { get; init; }
    public string FunctionName { get; init; }    // symbol name or demangled
}
```

Resolution algorithm:

1. For each `VulnerabilitySink`, find all `FunctionNode`s with:

   * `node.Purl == sink.Purl` and
   * `node.Name` or `node.DemangledName` matching `sink.FunctionName`.

2. For each such node, check `ReachableFunctions.Contains(node.Id)`.

3. Build a `Finding` object:

```csharp
public sealed class VulnerabilityFinding
{
    public string VulnerabilityId { get; init; }
    public Purl Purl { get; init; }
    public bool IsReachable { get; init; }
    public List<int> SinkFunctionIds { get; init; } = new();
}
```

Plus, if you want **path evidence**, run a shortest-path search (BFS with a predecessor map) from root to sink and store the sequence of `FunctionNode.Id`s, as sketched below.
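A minimal sketch of that path reconstruction, over the same `CallGraph` shape defined above (the method name is illustrative):

```csharp
// Returns one shortest call chain root -> sink as FunctionNode ids, or null if unreachable.
public static List<int>? FindCallChain(CallGraph graph, int rootId, int sinkId)
{
    var pred = new Dictionary<int, int> { [rootId] = rootId };
    var queue = new Queue<int>();
    queue.Enqueue(rootId);

    while (queue.Count > 0)
    {
        var current = queue.Dequeue();
        if (current == sinkId) break;
        if (!graph.OutEdges.TryGetValue(current, out var edges)) continue;

        foreach (var edge in edges)
        {
            // First discovery wins; TryAdd keeps the BFS tree (and thus a shortest path).
            if (pred.TryAdd(edge.ToId, current))
                queue.Enqueue(edge.ToId);
        }
    }

    if (!pred.ContainsKey(sinkId)) return null;

    var chain = new List<int>();
    for (var node = sinkId; ; node = pred[node])
    {
        chain.Add(node);
        if (node == rootId) break;
    }
    chain.Reverse();
    return chain;
}
```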
---

## 6. Purl edges (derived graph)

For reporting and analytics, it’s useful to produce a **purl-level dependency graph**.

Given `CallGraph`:

```csharp
public PurlGraphView BuildPurlGraph(CallGraph graph)
{
    var edgesByPair = new Dictionary<(Purl From, Purl To), PurlEdge>();

    foreach (var kv in graph.OutEdges)
    {
        var fromFn = graph.Nodes[kv.Key];

        foreach (var edge in kv.Value)
        {
            var toFn = graph.Nodes[edge.ToId];

            if (fromFn.Purl.Equals(toFn.Purl))
                continue; // intra-purl; skip if you only care about inter-purl edges

            var key = (fromFn.Purl, toFn.Purl);
            if (!edgesByPair.TryGetValue(key, out var pe))
            {
                pe = new PurlEdge
                {
                    From = fromFn.Purl,
                    To = toFn.Purl,
                    SupportingCalls = new List<(int, int)>()
                };
                edgesByPair[key] = pe;
            }

            pe.SupportingCalls.Add((fromFn.Id, toFn.Id));
        }
    }

    var adj = new Dictionary<Purl, HashSet<Purl>>();

    foreach (var kv in edgesByPair)
    {
        var (from, to) = kv.Key;
        if (!adj.TryGetValue(from, out var list))
        {
            list = new HashSet<Purl>();
            adj[from] = list;
        }
        list.Add(to);
    }

    return new PurlGraphView
    {
        Adjacent = adj,
        Edges = edgesByPair.Values.ToList()
    };
}
```

This gives you:

* A coarse view of runtime dependencies between purls (“Purl A calls into Purl B”).
* Enough context to emit purl-level VEX or to reason about trust at package granularity.

---

## 7. JSON output and SBOM integration

### 7.1 JSON shape (high level)

You can emit a composite document:

```json
{
  "image": "registry.example.com/app@sha256:...",
  "modules": [
    {
      "moduleId": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "arch": "x86_64"
    }
  ],
  "functions": [
    {
      "id": 42,
      "name": "SSL_do_handshake",
      "demangledName": null,
      "module": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "address": "0x401020",
      "exported": true
    }
  ],
  "edges": [
    {
      "from": 10,
      "to": 42,
      "kind": "ImportCall",
      "evidence": "ELF.R_X86_64_JUMP_SLOT"
    }
  ],
  "reachability": {
    "roots": [1],
    "reachableFunctions": [1, 10, 42]
  },
  "purlGraph": {
    "edges": [
      {
        "from": "pkg:generic/myapp@1.0.0",
        "to": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
        "supportingCalls": [[10, 42]]
      }
    ]
  },
  "vulnerabilities": [
    {
      "id": "CVE-2024-XXXX",
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "sinkFunctions": [42],
      "reachable": true,
      "paths": [
        [1, 10, 42]
      ]
    }
  ]
}
```

### 7.2 Purl resolution

Implement an `IPurlResolver` interface:

```csharp
public interface IPurlResolver
{
    Purl ResolveForModule(string filePath, byte[] contentHash);
}
```

Possible implementations:

* `SbomPurlResolver` – given a CycloneDX/SPDX SBOM for the image, match by path or checksum.
* `LinuxPackagePurlResolver` – read `/var/lib/dpkg/status` / the rpm DB in the filesystem.
* `GenericPurlResolver` – fallback: emit a `pkg:generic/<filename>` purl derived from the file name and hash.

You call the resolver in your loaders so that **every `BinaryModule` has a purl** and thus every `FunctionNode` has a purl.

---

## 8. Concrete implementation tasks for your team

1. **Data model & interfaces**

   * Implement `ModuleId`, `FunctionNode`, `CallEdge`, `CallGraph`.
   * Define `RawCallSite`, `BinaryModule`, and `IPurlResolver`.

2. **Loaders**

   * `ElfLoader`: fill symbols, dynamic relocations (PLT), DT_NEEDED, etc.
   * `PeLoader`: import descriptors, IAT, delay-load.
   * `MachOLoader`: load commands, stubs, la_symbol_ptr, indirect symbols / chained fixups.

3. **Disassembly**

   * `X86Disassembler` (iced) and `Arm64Disassembler` (Disarm or a port).
+ * Function detection and `RawCallSite` extraction. + +4. **CallGraphBuilder** + + * Build intra-module edges from direct calls. + * Build inter-module edges using the format-specific rules above. + * Construct final `CallGraph` with adjacency maps and purl mappings. + +5. **Reachability** + + * Implement BFS/DFS from root functions. + * Projection to modules + purls. + * Vulnerability sink resolution & path reconstruction. + +6. **Export** + + * JSON serializer for the schema above. + * Optional: purl-level summary / VEX generator. + +--- + +If you want, next step I can do a **more concrete design for `CallGraphBuilder`** (including per-format helper classes with method signatures) or a **C# skeleton** for the `ElfImportResolver`, `PeImportResolver`, and `MachOStubResolver` that plug directly into this plan. diff --git a/docs/product-advisories/archived/18-Nov-2026 - Patch-Oracles.md b/docs/product-advisories/archived/18-Nov-2025 - Patch-Oracles.md similarity index 96% rename from docs/product-advisories/archived/18-Nov-2026 - Patch-Oracles.md rename to docs/product-advisories/archived/18-Nov-2025 - Patch-Oracles.md index 79b6cee81..f353f445f 100644 --- a/docs/product-advisories/archived/18-Nov-2026 - Patch-Oracles.md +++ b/docs/product-advisories/archived/18-Nov-2025 - Patch-Oracles.md @@ -1,635 +1,635 @@ - -Here’s a simple, cheap way to sanity‑check your vuln function recovery without fancy ground truth: **build “patch oracles.”** - ---- - -### What it is (in plain words) - -Take a known CVE and compile two **tiny** binaries from the same source: - -* **Vulnerable** commit/revision -* **Fixed** commit/revision - Then diff the discovered functions + call edges between the two. If your analyzer can’t see the symbol (or guard) the patch adds/removes/tightens, your recall is suspect. - ---- - -### Why it works - -Patches for real CVEs usually: - -* add/remove a **function** (e.g., `validate_len`) -* change a **call site** (new guard before `memcpy`) -* tweak **control flow** (early return on bounds check) - -Those are precisely the things your function recovery / call‑graph pass should surface—even on stripped ELFs. If they don’t move in your graph, you’ve got blind spots. - ---- - -### Minimal workflow (5 steps) - -1. **Pick a CVE** with a clean, public fix (e.g., OpenSSL/zlib/busybox). -2. **Isolate the patch** (git range or cherry‑pick) and craft a *tiny harness* that calls the affected code path. -3. **Build both** with the same toolchain/flags; produce **stripped** ELFs (`-s`) to mimic production. -4. **Run your discovery** on both: - - * function list, demangled where possible - * call edges (A→B), basic blocks (optional) -5. **Diff the graphs**: look for the new guard function, removed unsafe call, or altered edge count. - ---- - -### A tiny “oracle spec” (drop-in YAML for your test runner) - -```yaml -cve: CVE-YYYY-XXXX -target: libfoo 1.2.3 -build: - cc: clang - cflags: [-O2, -fno-omit-frame-pointer] - ldflags: [] - strip: true -evidence: - expect_functions_added: [validate_len] - expect_functions_removed: [unsafe_copy] # optional - expect_call_added: - - caller: foo_parse - callee: validate_len - expect_call_removed: - - caller: foo_parse - callee: memcpy -tolerances: - allow_unresolved_symbols: 0 - allow_extra_funcs: 2 -``` - ---- - -### Quick harness pattern (C) - -```c -// before: foo_parse -> memcpy(buf, src, len); -// after : foo_parse -> validate_len(len) -> memcpy(...) -extern int foo_parse(const char*); - -int main(int argc, char** argv) { - const char* in = argc > 1 ? 
argv[1] : "AAAA";
  return foo_parse(in);
}
```

---

### What to flag as a failure

* Expected **function not discovered** (e.g., `validate_len` missing).
* Expected **edge not present** (`foo_parse → validate_len` absent).
* **No CFG change** where the patch clearly adds a guard/early return.

---

### Where this plugs into Stella Ops

* Put these oracles under `Scanner/tests/patch-oracles/*` per language.
* Run them in CI for **.NET/JVM/C/C++/Go/Rust** analyzers.
* Use them to gate any changes to symbolization, demangling, or call-graph building.
* Record per-analyzer **recall deltas** when you tweak heuristics or switch disassemblers.

---

If you want, I can scaffold the first three oracles (e.g., zlib overflow fix, OpenSSL length check, BusyBox `ash` patch) with ready-to-run Makefiles and expected graph diffs.
Understood — let us turn the “patch oracle” idea into something you can actually drop into the Stella Ops repo and CI.

I will walk through:

1. How to structure this inside the monorepo
2. How to build one oracle end-to-end (C/C++ example)
3. How to do the same for .NET/JVM
4. How to automate running and asserting them
5. Practical rules and pitfalls so these stay stable and useful

---

## 1. Where this lives in Stella Ops

A simple, language-agnostic layout that will scale:

```text
src/
  StellaOps.Scanner/
    ...                          # your scanner code
  StellaOps.Scanner.Tests/       # existing tests (if any)
    PatchOracles/
      c/
        CVE-YYYY-XXXX-<short-name>/
          src/
          build.sh
          oracle.yml
          README.md
      cpp/
        ...
      dotnet/
        CVE-YYYY-XXXX-<short-name>/
          src/
          build.ps1
          oracle.yml
          README.md
      jvm/
        ...
      go/
        ...
      rust/
        ...
  tools/
    scanner-oracle-runner/       # tiny runner (C# console or bash)
```

Key principles:

* Each CVE/test case is **self-contained** (its own folder with sources, build script, oracle.yml).
* Build scripts produce **two binaries/artifacts**: `vuln` and `fixed`.
* `oracle.yml` describes: how to build, what to scan, and what differences to expect in Scanner’s call graph/function list.

---

## 2. How to build a single patch oracle (C/C++)

Think of a patch oracle as: “Given these two binaries, Scanner must see specific changes in functions and call edges.”

### 2.1. Step-by-step workflow

For one C/C++ CVE:

1. **Pick & freeze the patch**

   * Choose a small, clean CVE in a library with easily buildable code (zlib, OpenSSL, BusyBox, etc.).
   * Identify commit `A` (vulnerable) and commit `B` (fixed).
   * Extract only the minimal sources needed to build the affected function + a harness into `src/`.

2. **Create a minimal harness**

Example: the patch adds `validate_len` and guards a `memcpy` in `foo_parse`.

```c
// src/main.c
#include <stdio.h>

int foo_parse(const char* in); // from the library code under test

int main(int argc, char** argv) {
    const char* in = (argc > 1) ? argv[1] : "AAAA";
    return foo_parse(in);
}
```

Under `src/`, you keep two sets of sources:

```text
src/
  vuln/
    foo.c      # vulnerable version
    api.h
    main.c
  fixed/
    foo.c      # fixed version (adds validate_len, changes calls)
    api.h
    main.c
```
3. **Provide a deterministic build script**

Example `build.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

CC="${CC:-clang}"
CFLAGS="${CFLAGS:- -O2 -fno-omit-frame-pointer -g0}"
LDFLAGS="${LDFLAGS:- }"

build_one() {
  local name="$1"   # vuln or fixed
  mkdir -p build
  ${CC} ${CFLAGS} src/${name}/*.c ${LDFLAGS} -o build/${name}
  # Strip symbols to simulate production
  strip build/${name}
}

build_one "vuln"
build_one "fixed"
```

Guidelines:

* Fix the toolchain: either run this inside a Docker image (e.g., `debian:bookworm` with a specific `clang` version) or at least document the required versions in `README.md`.
* Always build both artifacts with **identical flags**; the only difference should be the code change.
* Use `strip` to ensure Scanner doesn’t accidentally rely on debug symbols.

4. **Define the oracle (what must change)**

You define expectations based on the patch:

* Functions added/removed/renamed.
* New call edges (e.g., `foo_parse -> validate_len`).
* Removed call edges (e.g., `foo_parse -> memcpy`).
* Optionally: new basic blocks, conditional branches, or early returns.

A practical `oracle.yml` for this case:

```yaml
cve: CVE-YYYY-XXXX
name: zlib_len_guard_example
language: c
toolchain:
  cc: clang
  cflags: "-O2 -fno-omit-frame-pointer -g0"
  ldflags: ""
build:
  script: "./build.sh"
  artifacts:
    vulnerable: "build/vuln"
    fixed: "build/fixed"

scan:
  scanner_cli: "dotnet run --project ../../StellaOps.Scanner.Cli"
  # If you have a Dockerized scanner, you could do:
  # scanner_cli: "docker run --rm -v $PWD:/work stellaops/scanner:dev"
  args:
    - "--format=json"
    - "--analyzers=native"
  timeout_seconds: 120

expectations:
  functions:
    must_exist_in_fixed:
      - name: "validate_len"
    must_not_exist_in_vuln:
      - name: "validate_len"
  calls:
    must_add:
      - caller: "foo_parse"
        callee: "validate_len"
    must_remove:
      - caller: "foo_parse"
        callee: "memcpy"
  tolerances:
    allow_unresolved_symbols: 0
    allow_extra_functions: 5
    allow_missing_calls: 0
```

5. **Connect Scanner output to the oracle**

Assume your Scanner CLI produces something like:

```json
{
  "binary": "build/fixed",
  "functions": [
    { "name": "foo_parse", "address": "0x401000" },
    { "name": "validate_len", "address": "0x401080" },
    ...
  ],
  "calls": [
    { "caller": "foo_parse", "callee": "validate_len" },
    { "caller": "validate_len", "callee": "memcpy" }
  ]
}
```

Your oracle runner will:

* Run Scanner on `vuln` → `vuln.json`
* Run Scanner on `fixed` → `fixed.json`
* Compare each expectation in `oracle.yml` against `vuln.json` and `fixed.json`

Pseudo-logic for a function expectation:

```csharp
bool HasFunction(JsonElement doc, string name) =>
    doc.GetProperty("functions")
       .EnumerateArray()
       .Any(f => f.GetProperty("name").GetString() == name);

bool HasCall(JsonElement doc, string caller, string callee) =>
    doc.GetProperty("calls")
       .EnumerateArray()
       .Any(c =>
           c.GetProperty("caller").GetString() == caller &&
           c.GetProperty("callee").GetString() == callee);
```

The runner will produce a small report, per oracle:

```text
[PASS] CVE-YYYY-XXXX zlib_len_guard_example
  + validate_len appears only in fixed → OK
  + foo_parse → validate_len call added → OK
  + foo_parse → memcpy call removed → OK
```

If anything fails, it prints the mismatches and exits with a non-zero code so CI fails.
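Wiring those two helpers to the `expectations` block is then a few lines per rule. A sketch (the `OracleExpectations` POCO and its sub-records are hypothetical; shape them to match however you deserialize `oracle.yml`, and assume `HasFunction`/`HasCall` from above are in scope as static helpers):

```csharp
using System.Collections.Generic;
using System.Text.Json;

public sealed record FunctionRule(string Name);
public sealed record CallRule(string Caller, string Callee);
public sealed record FunctionRules(List<FunctionRule> MustExistInFixed, List<FunctionRule> MustNotExistInVuln);
public sealed record CallRules(List<CallRule> MustAdd, List<CallRule> MustRemove);
public sealed record OracleExpectations(FunctionRules Functions, CallRules Calls);

public static IEnumerable<string> Evaluate(OracleExpectations exp, JsonElement vuln, JsonElement fixedDoc)
{
    foreach (var fn in exp.Functions.MustExistInFixed)
        if (!HasFunction(fixedDoc, fn.Name))
            yield return $"function '{fn.Name}' missing in fixed binary";

    foreach (var fn in exp.Functions.MustNotExistInVuln)
        if (HasFunction(vuln, fn.Name))
            yield return $"function '{fn.Name}' unexpectedly present in vulnerable binary";

    // "must_add" means: present in fixed AND absent in vuln.
    foreach (var call in exp.Calls.MustAdd)
        if (!HasCall(fixedDoc, call.Caller, call.Callee) || HasCall(vuln, call.Caller, call.Callee))
            yield return $"call {call.Caller} -> {call.Callee} was not newly added by the patch";

    foreach (var call in exp.Calls.MustRemove)
        if (HasCall(fixedDoc, call.Caller, call.Callee))
            yield return $"call {call.Caller} -> {call.Callee} still present in fixed binary";
}
```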
Implementing the oracle runner (practical variant) - -You can implement this either as: - -* A standalone C# console (`StellaOps.Scanner.PatchOracleRunner`), or -* A set of xUnit tests that read `oracle.yml` and run dynamically. - -### 3.1. Console runner skeleton (C#) - -High-level structure: - -```text -src/tools/scanner-oracle-runner/ - Program.cs - Oracles/ - (symlink or reference to src/StellaOps.Scanner.Tests/PatchOracles) -``` - -Core responsibilities: - -1. Discover all `oracle.yml` files under `PatchOracles/`. -2. For each: - - * Run the `build` script. - * Run the scanner on both artifacts. - * Evaluate expectations. -3. Aggregate results and exit with appropriate status. - -Pseudo-code outline: - -```csharp -static int Main(string[] args) -{ - var root = args.Length > 0 ? args[0] : "src/StellaOps.Scanner.Tests/PatchOracles"; - var oracleFiles = Directory.GetFiles(root, "oracle.yml", SearchOption.AllDirectories); - var failures = new List(); - - foreach (var oracleFile in oracleFiles) - { - var result = RunOracle(oracleFile); - if (!result.Success) - { - failures.Add($"{result.Name}: {result.FailureReason}"); - } - } - - if (failures.Any()) - { - Console.Error.WriteLine("Patch oracle failures:"); - foreach (var f in failures) Console.Error.WriteLine(" - " + f); - return 1; - } - - Console.WriteLine("All patch oracles passed."); - return 0; -} -``` - -`RunOracle` does: - -* Deserialize YAML (e.g., via `YamlDotNet`). -* `Process.Start` for `build.script`. -* `Process.Start` for `scanner_cli` twice (vuln/fixed). -* Read/parse JSON outputs. -* Run checks `functions.must_*` and `calls.must_*`. - -This is straightforward plumbing code; once built, adding a new patch oracle is just adding a folder + `oracle.yml`. - ---- - -## 4. Managed (.NET / JVM) patch oracles - -Exact same concept, slightly different mechanics. - -### 4.1. .NET example - -Directory: - -```text -PatchOracles/ - dotnet/ - CVE-2021-XXXXX-systemtextjson/ - src/ - vuln/ - Example.sln - Api/... - fixed/ - Example.sln - Api/... - build.ps1 - oracle.yml -``` - -`build.ps1` (PowerShell, simplified): - -```powershell -param( - [string]$Configuration = "Release" -) - -$ErrorActionPreference = "Stop" - -function Build-One([string]$name) { - Push-Location "src/$name" - dotnet clean - dotnet publish -c $Configuration -p:DebugType=None -p:DebugSymbols=false -o ../../build/$name - Pop-Location -} - -New-Item -ItemType Directory -Force -Path "build" | Out-Null - -Build-One "vuln" -Build-One "fixed" -``` - -`oracle.yml`: - -```yaml -cve: CVE-2021-XXXXX -name: systemtextjson_escape_fix -language: dotnet -build: - script: "pwsh ./build.ps1" - artifacts: - vulnerable: "build/vuln/Api.dll" - fixed: "build/fixed/Api.dll" - -scan: - scanner_cli: "dotnet run --project ../../StellaOps.Scanner.Cli" - args: - - "--format=json" - - "--analyzers=dotnet" - timeout_seconds: 120 - -expectations: - methods: - must_exist_in_fixed: - - "Api.JsonHelper::EscapeString" - must_not_exist_in_vuln: - - "Api.JsonHelper::EscapeString" - calls: - must_add: - - caller: "Api.Controller::Handle" - callee: "Api.JsonHelper::EscapeString" - tolerances: - allow_missing_calls: 0 - allow_extra_methods: 10 -``` - -Scanner’s .NET analyzer should produce method identifiers in a stable format (e.g., `Namespace.Type::Method(Signature)`), which you then use in the oracle. - -### 4.2. 
JVM example - -Similar structure, but artifacts are JARs: - -```yaml -build: - script: "./gradlew :app:assemble" - artifacts: - vulnerable: "app-vuln.jar" - fixed: "app-fixed.jar" - -scan: - scanner_cli: "dotnet run --project ../../StellaOps.Scanner.Cli" - args: - - "--format=json" - - "--analyzers=jvm" -``` - -Expectations then refer to methods like `com.example.JsonHelper.escapeString:(Ljava/lang/String;)Ljava/lang/String;`. - ---- - -## 5. Wiring into CI - -You can integrate this in your existing pipeline (GitLab Runner / Gitea / etc.) as a separate job. - -Example CI job skeleton (GitLab-like YAML for illustration): - -```yaml -patch-oracle-tests: - stage: test - image: mcr.microsoft.com/dotnet/sdk:10.0 - script: - - dotnet build src/StellaOps.Scanner/StellaOps.Scanner.csproj -c Release - - dotnet build src/tools/scanner-oracle-runner/scanner-oracle-runner.csproj -c Release - - dotnet run --project src/tools/scanner-oracle-runner/scanner-oracle-runner.csproj -- \ - src/StellaOps.Scanner.Tests/PatchOracles - artifacts: - when: on_failure - paths: - - src/StellaOps.Scanner.Tests/PatchOracles/**/build - - oracle-results.log -``` - -You can also: - -* Tag the job (e.g., `oracle` or `reachability`) so you can run it nightly or on changes to Scanner analyzers. -* Pin Docker images with the exact C/C++/Java toolchains used by patch oracles so results are deterministic. - ---- - -## 6. Practical guidelines and pitfalls - -Here are concrete rules of thumb for making this robust: - -### 6.1. Choosing good CVE oracles - -Prefer cases where: - -* The patch clearly adds/removes a **function** or **method**, or introduces a separate helper such as `validate_len`, `check_bounds`, etc. -* The patch adds/removes a **call** that is easy to see even under optimization (e.g., non-inline, non-template). -* The project is easy to build and not heavily reliant on obscure toolchains. - -For each supported language in Scanner, target: - -* 3–5 small C or C++ oracles. -* 3–5 .NET or JVM oracles. -* 1–3 for Go and Rust once those analyzers exist. - -You do not need many; you want **sharp, surgical tests**, not coverage. - -### 6.2. Handle inlining and optimization - -Compilers may inline small functions; this can break naive “must have call edge” expectations. - -Mitigations: - -* Choose functions that are “large enough” or mark them `__attribute__((noinline))` (GCC/Clang) in your test harness code if necessary. -* Alternatively, relax expectations using `should_add` vs `must_add` for some edges: - -```yaml -calls: - must_add: [] - should_add: - - caller: "foo_parse" - callee: "validate_len" -``` - -In the runner, `should_add` failures can mark the oracle as “degraded” but not fatal, while `must_*` failures break the build. - -### 6.3. Keep oracles stable over time - -To avoid flakiness: - -* **Vendor sources** into the repo (or at least snapshot the patch) so upstream changes do not affect builds. -* Pin toolchain versions in Docker images for CI. -* Capture and pin scanner configuration: analyzers enabled, rules, version. If you support “deterministic scan manifests” later, these oracles are perfect consumers of that. - -### 6.4. 
What to assert beyond functions/calls - -When your Scanner gets more advanced, you can extend `oracle.yml`: - -```yaml -cfg: - must_increase_blocks: - - function: "foo_parse" - must_add_branch_on: - - function: "foo_parse" - operand_pattern: "len <= MAX_LEN" -``` - -Initially, I would keep it to: - -* Function presence/absence -* Call edges presence/absence - -and add CFG assertions only when your analyzers and JSON model for CFG stabilize. - -### 6.5. How to use failures - -When a patch oracle fails, it is a **signal** that either: - -* A change in Scanner or a new optimization pattern created a blind spot, or -* The oracle is too strict (e.g., relying on a call that got inlined). - -You then: - -1. Inspect the disassembly / Scanner JSON for `vuln` and `fixed`. -2. Decide if Scanner is wrong (fix analyzer) or oracle is too rigid (relax to `should_*`). -3. Commit both the code change and updated oracle (if needed) in the same merge request. - ---- - -## 7. Minimal checklist for adding a new patch oracle - -For your future self and your agents, here is a compressed checklist: - -1. Select CVE + patch; copy minimal affected sources into `src/…///src/{vuln,fixed}`. -2. Add a tiny harness that calls the patched code path. -3. Write `build.sh` / `build.ps1` to produce `build/vuln` and `build/fixed` artifacts, stripped/Release. -4. Run manual `scanner` on both artifacts once; inspect JSON to find real symbol names and call edges. -5. Create `oracle.yml` with: - - * `build.script` and `artifacts.*` paths - * `scan.scanner_cli` + args - * `expectations.functions.*` and `expectations.calls.*` -6. Run `scanner-oracle-runner` locally; fix any mismatches or over-strict expectations. -7. Commit and ensure CI job `patch-oracle-tests` runs and must pass on MR. - -If you wish, next step we can design the actual JSON schema that Scanner should emit for function/call graphs and write a first C# implementation of `scanner-oracle-runner` aligned with that schema. + +Here’s a simple, cheap way to sanity‑check your vuln function recovery without fancy ground truth: **build “patch oracles.”** + +--- + +### What it is (in plain words) + +Take a known CVE and compile two **tiny** binaries from the same source: + +* **Vulnerable** commit/revision +* **Fixed** commit/revision + Then diff the discovered functions + call edges between the two. If your analyzer can’t see the symbol (or guard) the patch adds/removes/tightens, your recall is suspect. + +--- + +### Why it works + +Patches for real CVEs usually: + +* add/remove a **function** (e.g., `validate_len`) +* change a **call site** (new guard before `memcpy`) +* tweak **control flow** (early return on bounds check) + +Those are precisely the things your function recovery / call‑graph pass should surface—even on stripped ELFs. If they don’t move in your graph, you’ve got blind spots. + +--- + +### Minimal workflow (5 steps) + +1. **Pick a CVE** with a clean, public fix (e.g., OpenSSL/zlib/busybox). +2. **Isolate the patch** (git range or cherry‑pick) and craft a *tiny harness* that calls the affected code path. +3. **Build both** with the same toolchain/flags; produce **stripped** ELFs (`-s`) to mimic production. +4. **Run your discovery** on both: + + * function list, demangled where possible + * call edges (A→B), basic blocks (optional) +5. **Diff the graphs**: look for the new guard function, removed unsafe call, or altered edge count. 
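+
+A minimal sketch of that diff step, assuming each discovery run was saved as JSON with the `functions`/`calls` shape used by the runner examples later in this document (the file names `vuln.json`/`fixed.json` are placeholders):
+
+```csharp
+// Compare two scanner outputs and report what the patch changed.
+using System;
+using System.Collections.Generic;
+using System.IO;
+using System.Linq;
+using System.Text.Json;
+
+static HashSet<string> Functions(JsonElement doc) =>
+    doc.GetProperty("functions").EnumerateArray()
+       .Select(f => f.GetProperty("name").GetString()!)
+       .ToHashSet();
+
+static HashSet<(string Caller, string Callee)> Edges(JsonElement doc) =>
+    doc.GetProperty("calls").EnumerateArray()
+       .Select(c => (c.GetProperty("caller").GetString()!,
+                     c.GetProperty("callee").GetString()!))
+       .ToHashSet();
+
+var vuln   = JsonDocument.Parse(File.ReadAllText("vuln.json")).RootElement;
+var @fixed = JsonDocument.Parse(File.ReadAllText("fixed.json")).RootElement;
+
+foreach (var fn in Functions(@fixed).Except(Functions(vuln)))
+    Console.WriteLine($"function added:   {fn}");   // e.g., validate_len
+foreach (var fn in Functions(vuln).Except(Functions(@fixed)))
+    Console.WriteLine($"function removed: {fn}");
+foreach (var e in Edges(@fixed).Except(Edges(vuln)))
+    Console.WriteLine($"edge added:   {e.Caller} -> {e.Callee}");
+foreach (var e in Edges(vuln).Except(Edges(@fixed)))
+    Console.WriteLine($"edge removed: {e.Caller} -> {e.Callee}");
+```
+
+If the guard function or its new call edge never shows up in this diff, the blind spot is in discovery, not in the oracle.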
+ +--- + +### A tiny “oracle spec” (drop-in YAML for your test runner) + +```yaml +cve: CVE-YYYY-XXXX +target: libfoo 1.2.3 +build: + cc: clang + cflags: [-O2, -fno-omit-frame-pointer] + ldflags: [] + strip: true +evidence: + expect_functions_added: [validate_len] + expect_functions_removed: [unsafe_copy] # optional + expect_call_added: + - caller: foo_parse + callee: validate_len + expect_call_removed: + - caller: foo_parse + callee: memcpy +tolerances: + allow_unresolved_symbols: 0 + allow_extra_funcs: 2 +``` + +--- + +### Quick harness pattern (C) + +```c +// before: foo_parse -> memcpy(buf, src, len); +// after : foo_parse -> validate_len(len) -> memcpy(...) +extern int foo_parse(const char*); + +int main(int argc, char** argv) { + const char* in = argc > 1 ? argv[1] : "AAAA"; + return foo_parse(in); +} +``` + +--- + +### What to flag as a failure + +* Expected **function not discovered** (e.g., `validate_len` missing). +* Expected **edge not present** (`foo_parse → validate_len` absent). +* **No CFG change** where patch clearly adds a guard/early return. + +--- + +### Where this plugs into Stella Ops + +* Put these oracles under `Scanner/tests/patch-oracles/*` per language. +* Run them in CI for **.NET/JVM/C/C++/Go/Rust** analyzers. +* Use them to gate any changes to symbolization, demangling, or call‑graph building. +* Record per‑analyzer **recall deltas** when you tweak heuristics or switch disassemblers. + +--- + +If you want, I can scaffold the first three oracles (e.g., zlib overflow fix, OpenSSL length check, BusyBox `ash` patch) with ready‑to‑run Makefiles and expected graph diffs. +Understood — let us turn the “patch oracle” idea into something you can actually drop into the Stella Ops repo and CI. + +I will walk through: + +1. How to structure this inside the monorepo +2. How to build one oracle end-to-end (C/C++ example) +3. How to do the same for .NET/JVM +4. How to automate running and asserting them +5. Practical rules and pitfalls so these stay stable and useful + +--- + +## 1. Where this lives in Stella Ops + +A simple, language-agnostic layout that will scale: + +```text +src/ + StellaOps.Scanner/ + ... # your scanner code + StellaOps.Scanner.Tests/ # existing tests (if any) + PatchOracles/ + c/ + CVE-YYYY-XXXX-/ + src/ + build.sh + oracle.yml + README.md + cpp/ + ... + dotnet/ + CVE-YYYY-XXXX-/ + src/ + build.ps1 + oracle.yml + README.md + jvm/ + ... + go/ + ... + rust/ + ... + tools/ + scanner-oracle-runner/ # tiny runner (C# console or bash) +``` + +Key principles: + +* Each CVE/test case is **self-contained** (its own folder with sources, build script, oracle.yml). +* Build scripts produce **two binaries/artifacts**: `vuln` and `fixed`. +* `oracle.yml` describes: how to build, what to scan, and what differences to expect in Scanner’s call graph/function list. + +--- + +## 2. How to build a single patch oracle (C/C++) + +Think of a patch oracle as: “Given these two binaries, Scanner must see specific changes in functions and call edges.” + +### 2.1. Step-by-step workflow + +For one C/C++ CVE: + +1. **Pick & freeze the patch** + + * Choose a small, clean CVE in a library with easily buildable code (zlib, OpenSSL, BusyBox, etc.). + * Identify commit `A` (vulnerable) and commit `B` (fixed). + * Extract only the minimal sources needed to build the affected function + a harness into `src/`. + +2. **Create a minimal harness** + +Example: patch adds `validate_len` and guards a `memcpy` in `foo_parse`. 
+
+```c
+// src/main.c
+int foo_parse(const char* in); // from the library code under test
+
+int main(int argc, char** argv) {
+  const char* in = (argc > 1) ? argv[1] : "AAAA";
+  return foo_parse(in);
+}
+```
+
+Under `src/`, you keep two sets of sources:
+
+```text
+src/
+  vuln/
+    foo.c    # vulnerable version
+    api.h
+    main.c
+  fixed/
+    foo.c    # fixed version (adds validate_len, changes calls)
+    api.h
+    main.c
+```
+
+3. **Provide a deterministic build script**
+
+Example `build.sh`:
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+CC="${CC:-clang}"
+CFLAGS="${CFLAGS:- -O2 -fno-omit-frame-pointer -g0}"
+LDFLAGS="${LDFLAGS:- }"
+
+build_one() {
+  local name="$1"   # vuln or fixed
+  mkdir -p build
+  ${CC} ${CFLAGS} src/${name}/*.c ${LDFLAGS} -o build/${name}
+  # Strip symbols to simulate production
+  strip build/${name}
+}
+
+build_one "vuln"
+build_one "fixed"
+```
+
+Guidelines:
+
+* Fix the toolchain: either run this inside a Docker image (e.g., `debian:bookworm` with a specific `clang` version) or at least document required versions in `README.md`.
+* Always build both artifacts with **identical flags**; the only difference should be the code change.
+* Use `strip` to ensure Scanner doesn’t accidentally rely on debug symbols.
+
+4. **Define the oracle (what must change)**
+
+You define expectations based on the patch:
+
+* Functions added/removed/renamed.
+* New call edges (e.g., `foo_parse -> validate_len`).
+* Removed call edges (e.g., `foo_parse -> memcpy`).
+* Optionally: new basic blocks, conditional branches, or early returns.
+
+A practical `oracle.yml` for this case:
+
+```yaml
+cve: CVE-YYYY-XXXX
+name: zlib_len_guard_example
+language: c
+toolchain:
+  cc: clang
+  cflags: "-O2 -fno-omit-frame-pointer -g0"
+  ldflags: ""
+build:
+  script: "./build.sh"
+  artifacts:
+    vulnerable: "build/vuln"
+    fixed: "build/fixed"
+
+scan:
+  scanner_cli: "dotnet run --project ../../StellaOps.Scanner.Cli"
+  # If you have a Dockerized scanner, you could do:
+  # scanner_cli: "docker run --rm -v $PWD:/work stellaops/scanner:dev"
+  args:
+    - "--format=json"
+    - "--analyzers=native"
+  timeout_seconds: 120
+
+expectations:
+  functions:
+    must_exist_in_fixed:
+      - name: "validate_len"
+    must_not_exist_in_vuln:
+      - name: "validate_len"
+  calls:
+    must_add:
+      - caller: "foo_parse"
+        callee: "validate_len"
+    must_remove:
+      - caller: "foo_parse"
+        callee: "memcpy"
+  tolerances:
+    allow_unresolved_symbols: 0
+    allow_extra_functions: 5
+    allow_missing_calls: 0
+```
+
+5. **Connect Scanner output to the oracle**
+
+Assume your Scanner CLI produces something like:
+
+```json
+{
+  "binary": "build/fixed",
+  "functions": [
+    { "name": "foo_parse", "address": "0x401000" },
+    { "name": "validate_len", "address": "0x401080" },
+    ...
+  ],
+  "calls": [
+    { "caller": "foo_parse", "callee": "validate_len" },
+    { "caller": "validate_len", "callee": "memcpy" }
+  ]
+}
+```
+
+Your oracle-runner will:
+
+* Run scanner on `vuln` → `vuln.json`
+* Run scanner on `fixed` → `fixed.json`
+* Compare each expectation in `oracle.yml` against `vuln.json` and `fixed.json`
+
+Pseudo-logic for a function expectation:
+
+```csharp
+bool HasFunction(JsonElement doc, string name) =>
+    doc.GetProperty("functions")
+       .EnumerateArray()
+       .Any(f => f.GetProperty("name").GetString() == name);
+
+bool HasCall(JsonElement doc, string caller, string callee) =>
+    doc.GetProperty("calls")
+       .EnumerateArray()
+       .Any(c =>
+           c.GetProperty("caller").GetString() == caller &&
+           c.GetProperty("callee").GetString() == callee);
+```
+
+The runner will produce a small report, per oracle:
+
+```text
+[PASS] CVE-YYYY-XXXX zlib_len_guard_example
+  + validate_len appears only in fixed → OK
+  + foo_parse → validate_len call added → OK
+  + foo_parse → memcpy call removed → OK
+```
+
+If anything fails, it prints the mismatches and exits with a non-zero code so CI fails.
+
+---
+
+## 3. Implementing the oracle runner (practical variant)
+
+You can implement this either as:
+
+* A standalone C# console (`StellaOps.Scanner.PatchOracleRunner`), or
+* A set of xUnit tests that read `oracle.yml` and run dynamically.
+
+### 3.1. Console runner skeleton (C#)
+
+High-level structure:
+
+```text
+src/tools/scanner-oracle-runner/
+  Program.cs
+  Oracles/
+    (symlink or reference to src/StellaOps.Scanner.Tests/PatchOracles)
+```
+
+Core responsibilities:
+
+1. Discover all `oracle.yml` files under `PatchOracles/`.
+2. For each:
+
+   * Run the `build` script.
+   * Run the scanner on both artifacts.
+   * Evaluate expectations.
+3. Aggregate results and exit with appropriate status.
+
+Pseudo-code outline:
+
+```csharp
+static int Main(string[] args)
+{
+    var root = args.Length > 0 ? args[0] : "src/StellaOps.Scanner.Tests/PatchOracles";
+    var oracleFiles = Directory.GetFiles(root, "oracle.yml", SearchOption.AllDirectories);
+    var failures = new List<string>();
+
+    foreach (var oracleFile in oracleFiles)
+    {
+        var result = RunOracle(oracleFile);
+        if (!result.Success)
+        {
+            failures.Add($"{result.Name}: {result.FailureReason}");
+        }
+    }
+
+    if (failures.Any())
+    {
+        Console.Error.WriteLine("Patch oracle failures:");
+        foreach (var f in failures) Console.Error.WriteLine("  - " + f);
+        return 1;
+    }
+
+    Console.WriteLine("All patch oracles passed.");
+    return 0;
+}
+```
+
+`RunOracle` does:
+
+* Deserialize YAML (e.g., via `YamlDotNet`).
+* `Process.Start` for `build.script`.
+* `Process.Start` for `scanner_cli` twice (vuln/fixed).
+* Read/parse JSON outputs.
+* Run checks `functions.must_*` and `calls.must_*`.
+
+This is straightforward plumbing code; once built, adding a new patch oracle is just adding a folder + `oracle.yml`.
+
+---
+
+## 4. Managed (.NET / JVM) patch oracles
+
+Exact same concept, slightly different mechanics.
+
+### 4.1. .NET example
+
+Directory:
+
+```text
+PatchOracles/
+  dotnet/
+    CVE-2021-XXXXX-systemtextjson/
+      src/
+        vuln/
+          Example.sln
+          Api/...
+        fixed/
+          Example.sln
+          Api/...
+ build.ps1 + oracle.yml +``` + +`build.ps1` (PowerShell, simplified): + +```powershell +param( + [string]$Configuration = "Release" +) + +$ErrorActionPreference = "Stop" + +function Build-One([string]$name) { + Push-Location "src/$name" + dotnet clean + dotnet publish -c $Configuration -p:DebugType=None -p:DebugSymbols=false -o ../../build/$name + Pop-Location +} + +New-Item -ItemType Directory -Force -Path "build" | Out-Null + +Build-One "vuln" +Build-One "fixed" +``` + +`oracle.yml`: + +```yaml +cve: CVE-2021-XXXXX +name: systemtextjson_escape_fix +language: dotnet +build: + script: "pwsh ./build.ps1" + artifacts: + vulnerable: "build/vuln/Api.dll" + fixed: "build/fixed/Api.dll" + +scan: + scanner_cli: "dotnet run --project ../../StellaOps.Scanner.Cli" + args: + - "--format=json" + - "--analyzers=dotnet" + timeout_seconds: 120 + +expectations: + methods: + must_exist_in_fixed: + - "Api.JsonHelper::EscapeString" + must_not_exist_in_vuln: + - "Api.JsonHelper::EscapeString" + calls: + must_add: + - caller: "Api.Controller::Handle" + callee: "Api.JsonHelper::EscapeString" + tolerances: + allow_missing_calls: 0 + allow_extra_methods: 10 +``` + +Scanner’s .NET analyzer should produce method identifiers in a stable format (e.g., `Namespace.Type::Method(Signature)`), which you then use in the oracle. + +### 4.2. JVM example + +Similar structure, but artifacts are JARs: + +```yaml +build: + script: "./gradlew :app:assemble" + artifacts: + vulnerable: "app-vuln.jar" + fixed: "app-fixed.jar" + +scan: + scanner_cli: "dotnet run --project ../../StellaOps.Scanner.Cli" + args: + - "--format=json" + - "--analyzers=jvm" +``` + +Expectations then refer to methods like `com.example.JsonHelper.escapeString:(Ljava/lang/String;)Ljava/lang/String;`. + +--- + +## 5. Wiring into CI + +You can integrate this in your existing pipeline (GitLab Runner / Gitea / etc.) as a separate job. + +Example CI job skeleton (GitLab-like YAML for illustration): + +```yaml +patch-oracle-tests: + stage: test + image: mcr.microsoft.com/dotnet/sdk:10.0 + script: + - dotnet build src/StellaOps.Scanner/StellaOps.Scanner.csproj -c Release + - dotnet build src/tools/scanner-oracle-runner/scanner-oracle-runner.csproj -c Release + - dotnet run --project src/tools/scanner-oracle-runner/scanner-oracle-runner.csproj -- \ + src/StellaOps.Scanner.Tests/PatchOracles + artifacts: + when: on_failure + paths: + - src/StellaOps.Scanner.Tests/PatchOracles/**/build + - oracle-results.log +``` + +You can also: + +* Tag the job (e.g., `oracle` or `reachability`) so you can run it nightly or on changes to Scanner analyzers. +* Pin Docker images with the exact C/C++/Java toolchains used by patch oracles so results are deterministic. + +--- + +## 6. Practical guidelines and pitfalls + +Here are concrete rules of thumb for making this robust: + +### 6.1. Choosing good CVE oracles + +Prefer cases where: + +* The patch clearly adds/removes a **function** or **method**, or introduces a separate helper such as `validate_len`, `check_bounds`, etc. +* The patch adds/removes a **call** that is easy to see even under optimization (e.g., non-inline, non-template). +* The project is easy to build and not heavily reliant on obscure toolchains. + +For each supported language in Scanner, target: + +* 3–5 small C or C++ oracles. +* 3–5 .NET or JVM oracles. +* 1–3 for Go and Rust once those analyzers exist. + +You do not need many; you want **sharp, surgical tests**, not coverage. + +### 6.2. 
Handle inlining and optimization + +Compilers may inline small functions; this can break naive “must have call edge” expectations. + +Mitigations: + +* Choose functions that are “large enough” or mark them `__attribute__((noinline))` (GCC/Clang) in your test harness code if necessary. +* Alternatively, relax expectations using `should_add` vs `must_add` for some edges: + +```yaml +calls: + must_add: [] + should_add: + - caller: "foo_parse" + callee: "validate_len" +``` + +In the runner, `should_add` failures can mark the oracle as “degraded” but not fatal, while `must_*` failures break the build. + +### 6.3. Keep oracles stable over time + +To avoid flakiness: + +* **Vendor sources** into the repo (or at least snapshot the patch) so upstream changes do not affect builds. +* Pin toolchain versions in Docker images for CI. +* Capture and pin scanner configuration: analyzers enabled, rules, version. If you support “deterministic scan manifests” later, these oracles are perfect consumers of that. + +### 6.4. What to assert beyond functions/calls + +When your Scanner gets more advanced, you can extend `oracle.yml`: + +```yaml +cfg: + must_increase_blocks: + - function: "foo_parse" + must_add_branch_on: + - function: "foo_parse" + operand_pattern: "len <= MAX_LEN" +``` + +Initially, I would keep it to: + +* Function presence/absence +* Call edges presence/absence + +and add CFG assertions only when your analyzers and JSON model for CFG stabilize. + +### 6.5. How to use failures + +When a patch oracle fails, it is a **signal** that either: + +* A change in Scanner or a new optimization pattern created a blind spot, or +* The oracle is too strict (e.g., relying on a call that got inlined). + +You then: + +1. Inspect the disassembly / Scanner JSON for `vuln` and `fixed`. +2. Decide if Scanner is wrong (fix analyzer) or oracle is too rigid (relax to `should_*`). +3. Commit both the code change and updated oracle (if needed) in the same merge request. + +--- + +## 7. Minimal checklist for adding a new patch oracle + +For your future self and your agents, here is a compressed checklist: + +1. Select CVE + patch; copy minimal affected sources into `src/…///src/{vuln,fixed}`. +2. Add a tiny harness that calls the patched code path. +3. Write `build.sh` / `build.ps1` to produce `build/vuln` and `build/fixed` artifacts, stripped/Release. +4. Run manual `scanner` on both artifacts once; inspect JSON to find real symbol names and call edges. +5. Create `oracle.yml` with: + + * `build.script` and `artifacts.*` paths + * `scan.scanner_cli` + args + * `expectations.functions.*` and `expectations.calls.*` +6. Run `scanner-oracle-runner` locally; fix any mismatches or over-strict expectations. +7. Commit and ensure CI job `patch-oracle-tests` runs and must pass on MR. + +If you wish, next step we can design the actual JSON schema that Scanner should emit for function/call graphs and write a first C# implementation of `scanner-oracle-runner` aligned with that schema. 
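+
+As a starting point for that runner, here is a minimal sketch of the `RunOracle` step from section 3.1, assuming the `oracle.yml` fields shown above, a bash build script as in the C example, and the YamlDotNet package; the `OracleSpec`/`OracleResult` shapes are illustrative, not a fixed Stella Ops schema:
+
+```csharp
+using System.Collections.Generic;
+using System.Diagnostics;
+using System.IO;
+using YamlDotNet.Serialization;
+using YamlDotNet.Serialization.NamingConventions;
+
+public sealed class OracleSpec
+{
+    public string Cve { get; set; } = "";
+    public string Name { get; set; } = "";
+    public BuildSpec Build { get; set; } = new();
+
+    public sealed class BuildSpec
+    {
+        public string Script { get; set; } = "";
+        // "vulnerable" / "fixed" -> artifact paths, as in oracle.yml
+        public Dictionary<string, string> Artifacts { get; set; } = new();
+    }
+}
+
+public sealed record OracleResult(string Name, bool Success, string? FailureReason);
+
+public static class OracleRunner
+{
+    public static OracleResult RunOracle(string oracleFile)
+    {
+        var dir = Path.GetDirectoryName(Path.GetFullPath(oracleFile))!;
+        var spec = new DeserializerBuilder()
+            .WithNamingConvention(UnderscoredNamingConvention.Instance)
+            .IgnoreUnmatchedProperties() // tolerate fields this sketch does not model yet
+            .Build()
+            .Deserialize<OracleSpec>(File.ReadAllText(oracleFile));
+
+        // 1) Build both artifacts with the oracle's own script.
+        using var build = Process.Start(new ProcessStartInfo("bash", spec.Build.Script)
+        {
+            WorkingDirectory = dir,
+        })!;
+        build.WaitForExit();
+        if (build.ExitCode != 0)
+            return new(spec.Name, false, "build script failed");
+
+        // 2) Running the scanner on spec.Build.Artifacts["vulnerable"/"fixed"] and
+        // 3) evaluating functions.must_* / calls.must_* (see HasFunction/HasCall
+        //    in section 2) would follow here.
+        return new(spec.Name, true, null);
+    }
+}
+```
+
+The `Main` loop from section 3.1 can call this as-is; the missing scan/evaluate steps are exactly the pieces that JSON schema discussion would pin down.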
diff --git a/docs/product-advisories/archived/17-Nov-2026 - SBOM-Provenance-Spine.md b/docs/product-advisories/archived/18-Nov-2025 - SBOM-Provenance-Spine.md
similarity index 96%
rename from docs/product-advisories/archived/17-Nov-2026 - SBOM-Provenance-Spine.md
rename to docs/product-advisories/archived/18-Nov-2025 - SBOM-Provenance-Spine.md
index 9cadb0314..f594c37d5 100644
--- a/docs/product-advisories/archived/17-Nov-2026 - SBOM-Provenance-Spine.md
+++ b/docs/product-advisories/archived/18-Nov-2025 - SBOM-Provenance-Spine.md
@@ -1,785 +1,784 @@
+Here’s a clean, air‑gap‑ready spine for turning container images into verifiable SBOMs and provenance—built to be idempotent and easy to slot into Stella Ops or any CI/CD.
+
+```mermaid
+flowchart LR
+  A[OCI Image/Repo]-->B[Layer Extractor]
+  B-->C["Sbomer: CycloneDX/SPDX"]
+  C-->D[DSSE Sign]
+  D-->E["in-toto Statement (SLSA Provenance)"]
+  E-->F[Transparency Log Adapter]
+  C-->G[POST /sbom/ingest]
+  F-->H[POST /attest/verify]
+```
+
+### What this does (in plain words)
+
+* **Pull & crack the image** → extract layers, metadata (labels, env, history).
+* **Build an SBOM** → emit **CycloneDX 1.6** and **SPDX 3.0.1** (pick one or both).
+* **Sign artifacts** → wrap SBOM/provenance in **DSSE** envelopes.
+* **Provenance** → generate **in‑toto Statement** with **SLSA Provenance v1** as the predicate.
+* **Auditability** → optionally publish attestations to a transparency log (e.g., Rekor) so they’re tamper‑evident via Merkle proofs.
+* **APIs are idempotent** → safe to re‑ingest the same image/SBOM/attestation without version churn.
+
+### Design notes you can hand to an agent
+
+* **Idempotency keys**
+
+  * `contentAddress` = SHA256 of OCI manifest (or full image digest)
+  * `sbomHash` = SHA256 of normalized SBOM JSON
+  * `attHash` = SHA256 of DSSE payload (base64‑stable)
+    Store these; reject duplicates with HTTP 200 + `"status":"already_present"`.
+
+* **Default formats**
+
+  * SBOM export: CycloneDX v1.6 (`application/vnd.cyclonedx+json`), SPDX 3.0.1 (`application/spdx+json`)
+  * DSSE envelope: `application/dsse+json`
+  * in‑toto Statement: `application/vnd.in-toto+json` with `predicateType` = SLSA Provenance v1
+
+* **Air‑gap mode**
+
+  * No external calls required; Rekor publish is optional.
+  * Keep a local Merkle log (pluggable) and allow later “sync‑to‑Rekor” when online.
+
+* **Transparency log adapter**
+
+  * Interface: `Put(entry) -> {logIndex, logID, inclusionProof}`
+  * Backends: `rekor`, `local-merkle`, `null` (no‑op)
+
+### Minimal API sketch
+
+* `POST /sbom/ingest`
+
+  * Body: `{ imageDigest, sbom, format, dsseSignature? }`
+  * Returns: `{ sbomId, status, sbomHash }` (status: `stored|already_present`)
+* `POST /attest/verify`
+
+  * Body: `{ dsseEnvelope, expectedSubjects:[{name, digest}] }`
+  * Verifies DSSE, checks in‑toto subject ↔ image digest, optionally records/logs.
+  * Returns: `{ verified:true, predicateType, logIndex?, inclusionProof? }`
+
+### CLI flow (pseudocode)
+
+```bash
+# 1) Extract
+stella-extract --image $IMG --out /work/extract
+
+# 2) SBOM (Cdx + SPDX)
+stella-sbomer cdx --in /work/extract --out /work/sbom.cdx.json
+stella-sbomer spdx --in /work/extract --out /work/sbom.spdx.json
+
+# 3) DSSE sign (offline keyring or HSM)
+stella-sign dsse --in /work/sbom.cdx.json --out /work/sbom.cdx.dsse.json --key file:k.pem
+
+# 4) SLSA provenance (in‑toto Statement)
+stella-provenance slsa-v1 --subject $IMG_DIGEST --materials /work/extract/manifest.json \
+  --out /work/prov.dsse.json --key file:k.pem
+
+# 5) (optional) Publish to transparency log
+stella-log publish --in /work/prov.dsse.json --backend rekor --rekor-url $REKOR
+```
+
+### Validation rules (quick)
+
+* **Subject binding**: in‑toto Statement `subject[].digest.sha256` must equal the OCI image digest you scanned.
+* **Key policy**: enforce allowed issuers (Fulcio, internal CA, GOST/SM/EIDAS/FIPS as needed). +* **Normalization**: canonicalize JSON before hashing/signing to keep idempotency stable. + +### Why this matters + +* **Audit‑ready**: You can always prove *what* you scanned, *how* it was built, and *who* signed it. +* **Noise‑gated**: With deterministic SBOMs + provenance, downstream VEX/reachability gets much cleaner. +* **Drop‑in**: Works in harsh environments—offline, mirrors, sovereign crypto stacks—without changing your pipeline. + +If you want, I can generate: + +* a ready‑to‑use OpenAPI stub for `POST /sbom/ingest` and `POST /attest/verify`, +* C# (.NET 10) DSSE + in‑toto helpers (interfaces + test fixtures), +* or a Docker‑compose “air‑gap bundle” showing the full spine end‑to‑end. +Below is a full architecture plan you can hand to an agent as the “master spec” for implementing the SBOM & provenance spine (image → SBOM → DSSE → in-toto/SLSA → transparency log → REST APIs), with idempotent APIs and air-gap readiness. + +--- + +## 1. Scope and Objectives + +**Goal:** Implement a deterministic, air-gap-ready “SBOM spine” that: + +* Converts OCI images into SBOMs (CycloneDX 1.6 and SPDX 3.0.1). +* Generates SLSA v1 provenance wrapped in in-toto Statements. +* Signs all artifacts with DSSE envelopes using pluggable crypto providers. +* Optionally publishes attestations to transparency logs (Rekor/local-Merkle/none). +* Exposes stable, idempotent APIs: + + * `POST /sbom/ingest` + * `POST /attest/verify` +* Avoids versioning by design; APIs are extended, not versioned; all mutations are idempotent keyed by content digests. + +**Out of scope (for this iteration):** + +* Full vulnerability scanning (delegated to Scanner service). +* Policy evaluation / lattice logic (delegated to Scanner/Graph engine). +* Vendor-facing proof-market ledger and trust economics (future module). + +--- + +## 2. High-Level Architecture + +### 2.1 Logical Components + +1. **StellaOps.SupplyChain.Core (Library)** + + * Shared types and utilities: + + * Domain models: SBOM, DSSE, in-toto Statement, SLSA predicates. + * Canonicalization & hashing utilities. + * DSSE sign/verify abstractions. + * Transparency log entry model & Merkle proof verification. + +2. **StellaOps.Sbomer.Engine (Library)** + + * Image → SBOM functionality: + + * Layer & manifest analysis. + * SBOM generation: CycloneDX, SPDX. + * Extraction of metadata (labels, env, history). + * Deterministic ordering & normalization. + +3. **StellaOps.Provenance.Engine (Library)** + + * Build provenance & in-toto: + + * In-toto Statement generator. + * SLSA v1 provenance predicate builder. + * Subject and material resolution from image metadata & SBOM. + +4. **StellaOps.Authority (Service/Library)** + + * Crypto & keys: + + * Key management abstraction (file, HSM, KMS, sovereign crypto). + * DSSE signing & verification with multiple key types. + * Trust roots, certificate chains, key policies. + +5. **StellaOps.LogBridge (Service/Library)** + + * Transparency log adapter: + + * Rekor backend. + * Local Merkle log backend (for air-gap). + * Null backend (no-op). + * Merkle proof validation. + +6. **StellaOps.SupplyChain.Api (Service)** + + * The SBOM spine HTTP API: + + * `POST /sbom/ingest` + * `POST /attest/verify` + * Optionally: `GET /sbom/{id}`, `GET /attest/{id}`, `GET /image/{digest}/summary`. + * Performs orchestrations: + + * SBOM/attestation parsing, canonicalization, hashing. + * Idempotency and persistence. + * Delegation to Authority and LogBridge. 
+
+7. **CLI Tools (optional but recommended)**
+
+   * `stella-extract`, `stella-sbomer`, `stella-sign`, `stella-provenance`, `stella-log`.
+   * Thin wrappers over the above libraries; usable offline and in CI pipelines.
+
+8. **Persistence Layer**
+
+   * Primary DB: PostgreSQL (or other RDBMS).
+   * Optional object storage: S3/MinIO for large SBOM/attestation blobs.
+   * Tables: `images`, `sboms`, `attestations`, `signatures`, `log_entries`, `keys`.
+
+### 2.2 Deployment View (Kubernetes / Docker)
+
+```mermaid
+flowchart LR
+  subgraph Node1["Cluster Node"]
+    A["StellaOps.SupplyChain.Api (ASP.NET Core)"]
+    B["StellaOps.Authority Service"]
+    C["StellaOps.LogBridge Service"]
+  end
+
+  subgraph Node2["Worker Node"]
+    D["Runner / CI / Air-gap host"]
+    E["CLI Tools<br/>stella-extract/sbomer/sign/provenance/log"]
+  end
+
+  F[("PostgreSQL")]
+  G[("Object Storage<br/>S3/MinIO")]
+  H[("Local Merkle Log<br/>or Rekor")]
+
+  A --> F
+  A --> G
+  A --> C
+  A --> B
+  C --> H
+  E --> A
+```
+
+* **Air-gap mode:**
+
+  * Rekor backend disabled; LogBridge uses local Merkle log (`H`) or `null`.
+  * All components run within the offline network.
+* **Online mode:**
+
+  * LogBridge talks to external Rekor instance using outbound HTTPS only.
+
+---
+
+## 3. Domain Model and Storage Design
+
+Use EF Core 9 with PostgreSQL in .NET 10.
+
+### 3.1 Core Entities
+
+1. **ImageArtifact**
+
+   * `Id` (GUID/ULID, internal).
+   * `ImageDigest` (string; OCI digest; UNIQUE).
+   * `Registry` (string).
+   * `Repository` (string).
+   * `Tag` (string, nullable, since digest is canonical).
+   * `FirstSeenAt` (timestamp).
+   * `MetadataJson` (JSONB; manifest, labels, env).
+
+2. **Sbom**
+
+   * `Id` (string, primary key = `SbomHash` or derived ULID).
+   * `ImageArtifactId` (FK).
+   * `Format` (enum: `CycloneDX_1_6`, `SPDX_3_0_1`).
+   * `ContentHash` (string; normalized JSON SHA-256; UNIQUE with `TenantId`).
+   * `StorageLocation` (inline JSONB or external object storage key).
+   * `CreatedAt`.
+   * `Origin` (enum: `Generated`, `Uploaded`, `ExternalVendor`).
+   * Unique constraint: `(TenantId, ContentHash)`.
+
+3. **Attestation**
+
+   * `Id` (string, primary key = `AttestationHash` or derived ULID).
+   * `ImageArtifactId` (FK).
+   * `Type` (enum: `InTotoStatement_SLSA_v1`, `Other`).
+   * `PayloadHash` (hash of DSSE payload, before envelope).
+   * `DsseEnvelopeHash` (hash of full DSSE JSON).
+   * `StorageLocation` (inline JSONB or object storage).
+   * `CreatedAt`.
+   * `Issuer` (string; signer identity / certificate subject).
+   * Unique constraint: `(TenantId, DsseEnvelopeHash)`.
+
+4. **SignatureInfo**
+
+   * `Id` (GUID/ULID).
+   * `AttestationId` (FK).
+   * `KeyId` (logical key identifier).
+   * `Algorithm` (enum; includes PQ & sovereign algs).
+   * `VerifiedAt`.
+   * `VerificationStatus` (enum: `Valid`, `Invalid`, `Unknown`).
+   * `DetailsJson` (JSONB; trust-chain, error reasons, etc.).
+
+5. **TransparencyLogEntry**
+
+   * `Id` (GUID/ULID).
+   * `AttestationId` (FK).
+   * `Backend` (enum: `Rekor`, `LocalMerkle`).
+   * `LogIndex` (string).
+   * `LogId` (string).
+   * `InclusionProofJson` (JSONB).
+   * `RecordedAt`.
+   * Unique constraint: `(Backend, LogId, LogIndex)`.
+
+6. **KeyRecord** (optional if not reusing Authority’s DB)
+
+   * `KeyId` (string, PK).
+   * `KeyType` (enum).
+   * `Usage` (enum: `Signing`, `Verification`, `Both`).
+   * `Status` (enum: `Active`, `Retired`, `Revoked`).
+   * `MetadataJson` (JSONB; KMS ARN, HSM slot, etc.).
+
+### 3.2 Idempotency Keys
+
+* SBOM:
+
+  * `sbomHash = SHA256(canonicalJson(sbom))`.
+  * Uniqueness enforced by `(TenantId, sbomHash)` in DB.
+* Attestation: + + * `attHash = SHA256(canonicalJson(dsse.payload))` or full envelope. + * Uniqueness enforced by `(TenantId, attHash)` in DB. +* Image: + + * `imageDigest` is globally unique (per OCI spec). + +--- + +## 4. Service-Level Architecture + +### 4.1 StellaOps.SupplyChain.Api (.NET 10, ASP.NET Core) + +**Responsibilities:** + +* Expose HTTP API for ingest / verify. +* Handle idempotency logic & persistence. +* Delegate cryptographic operations to Authority. +* Delegate transparency logging to LogBridge. +* Perform basic validation against schemas (SBOM, DSSE, in-toto, SLSA). + +**Key Endpoints:** + +1. `POST /sbom/ingest` + + * Request: + + * `imageDigest` (string). + * `sbom` (raw JSON). + * `format` (enum/string). + * Optional: `dsseSignature` or `dsseEnvelope`. + * Behavior: + + * Parse & validate SBOM structure. + * Canonicalize JSON, compute `sbomHash`. + * If `sbomHash` exists for `imageDigest` and tenant: + + * Return `200` with `{ status: "already_present", sbomId, sbomHash }`. + * Else: + + * Persist `Sbom` entity. + * Optionally verify DSSE signature via Authority. + * Return `201` with `{ status: "stored", sbomId, sbomHash }`. + +2. `POST /attest/verify` + + * Request: + + * `dsseEnvelope` (JSON). + * `expectedSubjects` (list of `{ name, digest }`). + * Behavior: + + * Canonicalize payload, compute `attHash`. + * Verify DSSE signature via Authority. + * Parse in-toto Statement; ensure `subject[].digest.sha256` matches `expectedSubjects`. + * Persist `Attestation` & `SignatureInfo`. + * If configured, call LogBridge to publish and store `TransparencyLogEntry`. + * If `attHash` already exists: + + * Return `200` with `status: "already_present"` and existing references. + * Else, return `201` with `verified:true`, plus log info when available. + +3. Optional read APIs: + + * `GET /sbom/by-image/{digest}` + * `GET /attest/by-image/{digest}` + * `GET /image/{digest}/summary` (SBOM + attestations + log status). + +### 4.2 StellaOps.Sbomer.Engine + +**Responsibilities:** + +* Given: + + * OCI image manifest & layers (from local tarball or remote registry). +* Produce: + + * CycloneDX 1.6 JSON. + * SPDX 3.0.1 JSON. + +**Design:** + +* Use layered analyzers: + + * `ILayerAnalyzer` for generic filesystem traversal. + * Language-specific analyzers (optional for SBOM detail): + + * `DotNetAnalyzer`, `NodeJsAnalyzer`, `PythonAnalyzer`, `JavaAnalyzer`, `PhpAnalyzer`, etc. +* Determinism: + + * Sort all lists (components, dependencies) by stable keys. + * Remove unstable fields (timestamps, machine IDs, ephemeral paths). + * Provide `Normalize()` method per format that returns canonical JSON. + +### 4.3 StellaOps.Provenance.Engine + +**Responsibilities:** + +* Build in-toto Statement with SLSA v1 predicate: + + * `subject` derived from image digest(s). + * `materials` from: + + * Git commit, tag, builder image, SBOM components if available. +* Ensure determinism: + + * Sort materials by URI + digest. + * Normalize nested maps. + +**Key APIs (internal library):** + +* `InTotoStatement BuildSlsaProvenance(ImageArtifact image, Sbom sbom, ProvenanceContext ctx)` +* `string ToCanonicalJson(InTotoStatement stmt)` + +### 4.4 StellaOps.Authority + +**Responsibilities:** + +* DSSE signing & verification. +* Key management abstraction. +* Policy enforcement (which keys/trust roots are allowed). 
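+
+For orientation, the DSSE envelope these providers exchange has a small, fixed shape, and signatures are computed over DSSE's pre-authentication encoding (PAE) rather than the raw JSON. A minimal sketch (record names are illustrative, not the final Authority types; the wire format follows the DSSE spec):
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Text;
+
+public sealed record DsseSignature(string KeyId, string Sig);   // sig = base64
+
+public sealed record DsseEnvelope(
+    string Payload,      // base64(canonical statement JSON)
+    string PayloadType,  // e.g. "application/vnd.in-toto+json"
+    IReadOnlyList<DsseSignature> Signatures);
+
+public static class DssePae
+{
+    // PAE(type, body) = "DSSEv1 " LEN(type) " " type " " LEN(body) " " body.
+    // Signing and verification always run over these bytes.
+    public static byte[] Encode(string payloadType, byte[] payload)
+    {
+        string header =
+            $"DSSEv1 {Encoding.UTF8.GetByteCount(payloadType)} {payloadType} {payload.Length} ";
+        return Encoding.UTF8.GetBytes(header).Concat(payload).ToArray();
+    }
+}
+```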
+
+**Interfaces:**
+
+* `ISigningProvider`
+
+  * `Task<DsseEnvelope> SignAsync(byte[] payload, string payloadType, string keyId)`
+* `IVerificationProvider`
+
+  * `Task<VerificationResult> VerifyAsync(DsseEnvelope envelope, VerificationPolicy policy)`
+
+**Backends:**
+
+* File-based keys (PEM).
+* HSM/KMS (AWS KMS, Azure Key Vault, on-prem HSM).
+* Sovereign crypto providers (GOST, SMx, etc.).
+* Optional PQ providers (Dilithium, Falcon).
+
+### 4.5 StellaOps.LogBridge
+
+**Responsibilities:**
+
+* Abstract interaction with transparency logs.
+
+**Interface:**
+
+* `ILogBackend`
+
+  * `Task<LogEntryResult> PutAsync(byte[] canonicalPayloadHash, DsseEnvelope env)`
+  * `Task<bool> VerifyInclusionAsync(LogEntryResult entry)`
+
+**Backends:**
+
+* `RekorBackend`:
+
+  * Calls Rekor REST API with hashed payload.
+* `LocalMerkleBackend`:
+
+  * Maintains Merkle tree in local DB.
+  * Returns `logIndex`, `logId`, and inclusion proof.
+* `NullBackend`:
+
+  * Returns empty/no-op results.
+
+### 4.6 CLI Tools (Optional)
+
+Use the same libraries as the services:
+
+* `stella-extract`:
+
+  * Input: image reference.
+  * Output: local tarball + manifest JSON.
+* `stella-sbomer`:
+
+  * Input: manifest & layers.
+  * Output: SBOM JSON.
+* `stella-sign`:
+
+  * Input: JSON file.
+  * Output: DSSE envelope.
+* `stella-provenance`:
+
+  * Input: image digest, build metadata.
+  * Output: signed in-toto/SLSA DSSE.
+* `stella-log`:
+
+  * Input: DSSE envelope.
+  * Output: log entry details.
+
+---
+
+## 5. End-to-End Flows
+
+### 5.1 SBOM Ingest (Upload Path)
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant API as SupplyChain.Api
+    participant Core as SupplyChain.Core
+    participant DB as PostgreSQL
+
+    Client->>API: POST /sbom/ingest (imageDigest, sbom, format)
+    API->>Core: Validate & canonicalize SBOM
+    Core-->>API: sbomHash
+    API->>DB: SELECT Sbom WHERE sbomHash & imageDigest
+    DB-->>API: Not found
+    API->>DB: INSERT Sbom (sbomHash, imageDigest, content)
+    DB-->>API: ok
+    API-->>Client: 201 { status:"stored", sbomId, sbomHash }
+```
+
+Re-ingest of the same SBOM repeats steps up to SELECT, then returns `status:"already_present"` with `200`.
+
+### 5.2 Attestation Verify & Record
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant API as SupplyChain.Api
+    participant Auth as Authority
+    participant Log as LogBridge
+    participant DB as PostgreSQL
+
+    Client->>API: POST /attest/verify (dsseEnvelope, expectedSubjects)
+    API->>Auth: Verify DSSE (keys, policy)
+    Auth-->>API: VerificationResult(Valid/Invalid)
+    API->>API: Parse in-toto, check subjects vs expected
+    API->>DB: SELECT Attestation WHERE attHash
+    DB-->>API: Not found
+    API->>DB: INSERT Attestation + SignatureInfo
+    alt Logging enabled
+        API->>Log: PutAsync(attHash, envelope)
+        Log-->>API: LogEntryResult(logIndex, logId, proof)
+        API->>DB: INSERT TransparencyLogEntry
+    end
+    API-->>Client: 201 { verified:true, attestationId, logIndex?, inclusionProof? }
+```
+
+If the attestation already exists, the API returns `200` with `status:"already_present"`.
+
+---
+
+## 6. Idempotency and Determinism Strategy
+
+1. **Canonicalization rules:**
+
+   * Remove insignificant whitespace.
+   * Sort all object keys lexicographically.
+   * Sort arrays where order is not semantically meaningful (components, materials).
+   * Strip non-deterministic fields (timestamps, random IDs) where allowed.
+
+2. **Hashing:**
+
+   * Always hash canonical JSON as UTF-8.
+   * Use SHA-256 for core IDs; allow the crypto provider to also compute other digests if needed.
+
+3. **Persistence:**
+
+   * Enforce uniqueness in DB via indices on:
+
+     * `(TenantId, ContentHash)` for SBOMs.
+     * `(TenantId, AttHash)` for attestations.
+     * `(Backend, LogId, LogIndex)` for log entries.
+   * API behavior:
+
+     * Existing row → `200` with `"already_present"`.
+     * New row → `201` with `"stored"`.
+
+4. **API design:**
+
+   * No version numbers in path.
+   * Add fields over time; never break or repurpose existing ones.
+   * Use explicit capability discovery via `GET /meta/capabilities` if needed.
+
+---
+
+## 7. Air-Gap Mode and Synchronization
+
+### 7.1 Air-Gap Mode
+
+* Configuration flag `Mode = Offline` on SupplyChain.Api.
+* LogBridge backend:
+
+  * Default to `LocalMerkle` or `Null`.
+* Rekor-specific configuration disabled or absent.
+* DB & Merkle log stored locally inside the secure network.
+
+### 7.2 Later Synchronization to Rekor (Optional Future Step)
+
+Not mandatory for the first iteration, but prepare for:
+
+* Background job (Scheduler module) that:
+
+  * Enumerates local `TransparencyLogEntry` rows not yet exported.
+  * Publishes hashed payloads to Rekor when the network is available.
+  * Stores the mapping between local log entries and remote Rekor entries.
+
+---
+
+## 8. Security, Access Control, and Observability
+
+### 8.1 Security
+
+* mTLS between internal services (SupplyChain.Api, Authority, LogBridge).
+* Authentication:
+
+  * API keys/OIDC for clients.
+  * Per-tenant scoping; `TenantId` must be present in context.
+* Authorization:
+
+  * RBAC: which tenants/users can write, verify, or only read.
+
+### 8.2 Crypto Policies
+
+* Policy object defines:
+
+  * Allowed key types and algorithms.
+  * Trust roots (Fulcio, internal CA, sovereign PKI).
+  * Revocation checking strategy (CRL/OCSP, offline lists).
+* Authority enforces policies; SupplyChain.Api only consumes `VerificationResult`.
+
+### 8.3 Observability
+
+* Logs:
+
+  * Structured logs with correlation IDs; log imageDigest, sbomHash, attHash.
+* Metrics:
+
+  * SBOM ingest count, dedup hit rate.
+  * Attestation verify latency.
+  * Transparency log publish success/failure counts.
+* Traces:
+
+  * OpenTelemetry tracing across API → Authority → LogBridge.
+
+---
+
+## 9. Implementation Plan (Epics & Work Packages)
+
+You can give this section directly to agents to split.
+
+### Epic 1: Core Domain & Canonicalization
+
+1. Define .NET 10 solution structure:
+
+   * Projects:
+
+     * `StellaOps.SupplyChain.Core`
+     * `StellaOps.Sbomer.Engine`
+     * `StellaOps.Provenance.Engine`
+     * `StellaOps.SupplyChain.Api`
+     * `StellaOps.Authority` (if not already present)
+     * `StellaOps.LogBridge`
+2. Implement core domain models:
+
+   * SBOM, DSSE, in-toto, SLSA v1.
+3. Implement canonicalization & hashing utilities.
+4. Unit tests (a runnable sketch follows Epic 3 below):
+
+   * Given semantically equivalent JSON, hashes must match.
+   * Negative tests: inputs that differ semantically (not merely in key order) must hash differently.
+
+### Epic 2: Persistence Layer
+
+1. Design EF Core models for:
+
+   * ImageArtifact, Sbom, Attestation, SignatureInfo, TransparencyLogEntry, KeyRecord.
+2. Write migrations for PostgreSQL.
+3. Implement repository interfaces for read/write.
+4. Tests:
+
+   * Unique constraints and idempotency behavior.
+   * Query performance for common access paths (by imageDigest).
+
+### Epic 3: SBOM Engine
+
+1. Implement minimal layer analysis:
+
+   * Accepts local tarball or path (for now).
+2. Implement CycloneDX 1.6 generator.
+3. Implement SPDX 3.0.1 generator.
+4. Deterministic normalization across formats.
+5. Tests:
+
+   * Golden files for images → SBOM output.
+   * Stability under repeated runs.
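+
+To make the §6 canonicalization contract and Epic 1's acceptance test concrete, a minimal self-checking sketch (the `Canonicalize` helper is illustrative, not the actual `SupplyChain.Core` API):
+
+```csharp
+using System;
+using System.Linq;
+using System.Security.Cryptography;
+using System.Text;
+using System.Text.Json.Nodes;
+
+var a = JsonNode.Parse("""{"name":"openssl","version":"3.0.2","licenses":["Apache-2.0"]}""")!;
+var b = JsonNode.Parse("""{"version":"3.0.2","licenses":["Apache-2.0"],"name":"openssl"}""")!;
+
+string ha = Sha256Hex(Canonicalize(a)!.ToJsonString());
+string hb = Sha256Hex(Canonicalize(b)!.ToJsonString());
+if (ha != hb) throw new Exception("equivalent JSON hashed differently");
+Console.WriteLine($"ok: {ha}");
+
+// Sort object keys recursively; arrays keep their order here (sorting the
+// semantically unordered ones, per §6, needs schema knowledge).
+static JsonNode? Canonicalize(JsonNode? node) => node switch
+{
+    null => null,
+    JsonObject o => new JsonObject(o.OrderBy(p => p.Key, StringComparer.Ordinal)
+        .Select(p => KeyValuePair.Create(p.Key, Canonicalize(p.Value)))),
+    JsonArray arr => new JsonArray(arr.Select(Canonicalize).ToArray()),
+    _ => JsonNode.Parse(node.ToJsonString()), // clones leaf values
+};
+
+static string Sha256Hex(string canonicalJson) =>
+    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonicalJson))).ToLowerInvariant();
+```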
+ +### Epic 4: Provenance Engine + +1. Implement in-toto Statement model with SLSA v1 predicate. +2. Implement builder to map: + + * ImageDigest → subject. + * Build metadata → materials. +3. Deterministic canonicalization. +4. Tests: + + * Golden in-toto/SLSA statements for sample inputs. + * Subject matching logic. + +### Epic 5: Authority Integration + +1. Implement `ISigningProvider`, `IVerificationProvider` contracts. +2. Implement file-based key backend as default. +3. Implement DSSE wrapper: + + * `SignAsync(payload, payloadType, keyId)`. + * `VerifyAsync(envelope, policy)`. +4. Tests: + + * DSSE round-trip; invalid signature scenarios. + * Policy enforcement tests. + +### Epic 6: Transparency Log Bridge + +1. Implement `ILogBackend` interface. +2. Implement `LocalMerkleBackend`: + + * Simple Merkle tree with DB storage. +3. Implement `NullBackend`. +4. Define configuration model to select backend. +5. (Optional later) Implement `RekorBackend`. +6. Tests: + + * Stable Merkle root; inclusion proof verification. + +### Epic 7: SupplyChain.Api + +1. Implement `POST /sbom/ingest`: + + * Request/response DTOs. + * Integration with canonicalization, persistence, idempotency logic. +2. Implement `POST /attest/verify`: + + * End-to-end verification and persistence. + * Integration with Authority and LogBridge. +3. Optional read APIs. +4. Add input validation (JSON schema, basic constraints). +5. Integration tests: + + * Full flows for new and duplicate inputs. + * Error cases (invalid DSSE, subject mismatch). + +### Epic 8: CLI Tools + +1. Implement `stella-sbomer` (wraps Sbomer.Engine). +2. Implement `stella-provenance` (wraps Provenance.Engine + Authority). +3. Implement `stella-sign` and `stella-log`. +4. Provide clear help/usage and sample scripts. + +### Epic 9: Hardening, Air-Gap Profile, and Docs + +1. Configuration profiles: + + * `Offline` vs `Online`. + * Log backend selection. +2. Security hardening: + + * mTLS, authentication, authorization. +3. Observability: + + * Metrics, logs, traces wiring. +4. Documentation: + + * API reference. + * Sequence diagrams. + * Deployment recipes for: + + * Single-node air-gap. + * Clustered online deployment. + +--- + +If you want, next step I can: + +* Turn this into an AGENTS/TASKS/PROMPT set for your codex workers, or +* Produce concrete .NET 10 project skeletons (csproj layout, folder structure, and initial interfaces) for the core libraries and API service. diff --git a/docs/product-advisories/archived/18-Nov-2026 - Unknowns-Registry.md b/docs/product-advisories/archived/18-Nov-2025 - Unknowns-Registry.md similarity index 96% rename from docs/product-advisories/archived/18-Nov-2026 - Unknowns-Registry.md rename to docs/product-advisories/archived/18-Nov-2025 - Unknowns-Registry.md index a52cc069c..124ac5727 100644 --- a/docs/product-advisories/archived/18-Nov-2026 - Unknowns-Registry.md +++ b/docs/product-advisories/archived/18-Nov-2025 - Unknowns-Registry.md @@ -1,719 +1,719 @@ - -Here’s a crisp idea you can drop straight into Stella Ops: treat “unknowns” as first‑class data, not noise. - ---- - -# Unknowns Registry — turning uncertainty into signals - -**Why:** Scanners and VEX feeds miss things (ambiguous package IDs, unverifiable hashes, orphaned layers, missing SBOM edges, runtime-only artifacts). Today these get logged and forgotten. If we **structure** them, downstream agents can reason about risk and shrink blast radius proactively. 
- -**What it is:** A small service + schema that records every uncertainty with enough context for later inference. - -## Core model (v0) - -```json -{ - "unknown_id": "unk:sha256:…", - "observed_at": "2025-11-18T12:00:00Z", - "provenance": { - "source": "Scanner.Analyzer.DotNet|Sbomer|Signals|Vexer", - "host": "runner-42", - "scan_id": "scan:…" - }, - "scope": { - "artifact": { "type": "oci.image", "ref": "registry/app@sha256:…" }, - "subpath": "/app/bin/Contoso.dll", - "phase": "build|scan|runtime" - }, - "unknown_type": "identity_gap|version_conflict|hash_mismatch|missing_edge|runtime_shadow|policy_undecidable", - "evidence": { - "raw": "nuget id 'Serilog' but assembly name 'Serilog.Core'", - "signals": ["sym:Serilog.Core.Logger", "procopen:/app/agent"] - }, - "transitive": { - "depth": 2, - "parents": ["pkg:nuget/Serilog@?"], - "children": [] - }, - "confidence": { "p": 0.42, "method": "bayes-merge|rule" }, - "exposure_hints": { - "surface": ["logging pipeline", "startup path"], - "runtime_hits": 3 - }, - "status": "open|triaged|suppressed|resolved", - "labels": ["reachability:possible", "sbom:incomplete"] -} -``` - -## Categorize by three axes - -* **Provenance** (where it came from): Scanner vs Sbomer vs Vexer vs Signals. -* **Scope** (what it touches): image/layer/file/symbol/runtime‑proc/policy. -* **Transitive depth** (how far from an entry point): 0 = direct, 1..N via deps. - -## How agents use it - -* **Cartographer**: includes unknown edges in the graph with special weight; lets Policy/Lattice down‑rank vulnerable nodes near high‑impact unknowns. -* **Remedy Assistant (Zastava)**: proposes micro‑probes (“add EventPipe/JFR tap for X symbol”) or build‑time assertions (“pin Serilog>=3.1, regenerate SBOM”). -* **Scheduler**: prioritizes scans where unknown density × asset criticality is highest. - -## Minimal API (idempotent, additive) - -* `POST /unknowns/ingest` — upsert by `unknown_id` (hash of type+scope+evidence). -* `GET /unknowns?artifact=…&status=open` — list for a target. -* `POST /unknowns/:id/triage` — set status/labels, attach rationale. -* `GET /metrics` — density by artifact/namespace/unknown_type. - -*All additive; no versioning required. Repeat calls with the same payload are no‑ops.* - -## Scoring hook (into your lattice) - -* Add a **“Unknowns Pressure”** term: - `risk = base ⊕ (α * density_depth≤1) ⊕ (β * runtime_shadow) ⊕ (γ * policy_undecidable)` -* Gate “green” only if `density_depth≤1 == 0` **or** compensating controls active. - -## Storage & plumbing - -* **Store:** append‑only KV (Badger/Rocks) + Graph overlay (SQLite/Neo4j—your call). -* **Emit:** DSSE‑signed “Unknowns Attestation” per scan for replayable audits. -* **UI:** heatmap per artifact (unknowns by type × depth), drill‑down to evidence. - -## First 2‑day slice - -1. Define `unknown_type` enum + hashable `unknown_id`. -2. Wire Scanner/Sbomer/Vexer to emit unknowns (start with: identity_gap, missing_edge). -3. Persist + expose `/metrics` (density, by depth and type). -4. In Policy Studio, add the Unknowns Pressure term with default α/β/γ. - -If you want, I’ll draft the exact protobuf/JSON schema and drop a .NET 10 record types + EF model, plus a tiny CLI to query and a Grafana panel JSON. -I will treat “it” as the whole vision behind **Pushing Binary Reachability Toward True Determinism** inside Stella Ops: function-/symbol-level reachability for binaries and higher-level languages, wired into Scanner, Cartographer, Signals, and VEX. 
- -Below is an implementation-oriented architecture plan you can hand directly to agents. - ---- - -## 1. Scope, goals, and non-negotiable invariants - -### 1.1. Scope - -Deliver a deterministic reachability pipeline for containers that: - -1. Builds **call graphs** and **symbol usage maps** for: - - * Native binaries (ELF, PE, Mach-O) — primary for this branch. - * Scripted/VM languages later: JS, Python, PHP (as part of the same architecture). -2. Maps symbols and functions to: - - * Packages (purls). - * Vulnerabilities (CVE → symbol/function list via Concelier/VEX data). -3. Computes **deterministic reachability states** for each `(vulnerability, artifact)` pair. -4. Emits: - - * Machine-readable JSON (with `purl`s). - * Graph overlays for Cartographer. - * Inputs for the lattice/trust engine and VEXer/Excitor. - -### 1.2. Invariants - -* **Deterministic replay**: Given the same: - - * Image digest(s), - * Analyzer versions, - * Config + policy, - * Runtime trace inputs (if any), - the same reachability outputs must be produced, bit-for-bit. -* **Idempotent, additive APIs**: - - * No versioning of endpoints, only additive/optional fields. - * Same request = same response, no side effects besides storing/caching. -* **Lattice logic runs in `Scanner.WebService`**: - - * All “reachable/unreachable/unknown” and confidence merging lives in Scanner, not Concelier/Excitors. -* **Preserve prune source**: - - * Concelier and Excitors preserve provenance and do not “massage” reachability; they only consume it. -* **Offline, air-gap friendly**: - - * No mandatory external calls; dependency on local analyzers and local advisory/VEX cache. - ---- - -## 2. High-level pipeline - -From container image to reachability output: - -1. **Image enumeration** - `Scanner.WebService` receives an image ref or tarball and spawns an analysis run. -2. **Binary discovery & classification** - Binary analyzers detect ELF/PE/Mach-O + main interpreters (python, node, php) and scripts. -3. **Symbolization & call graph building** - - * For each binary/module, we produce: - - * Symbol table (exported + imported). - * Call graph edges (function-level where possible). - * For dynamic languages, we later plug in appropriate analyzers. -4. **Symbol→package mapping** - - * Match symbols to packages and `purl`s using: - - * Known vendor symbol maps (from Concelier / Feedser). - * Heuristics, path patterns, build IDs. -5. **Vulnerability→symbol mapping** - - * From Concelier/VEX/CSAF: map each CVE to the set of symbols/functions it affects. -6. **Reachability solving** - - * For each `(CVE, artifact)`: - - * Determine presence and reachability of affected symbols from known entrypoints. - * Merge static call graph and runtime signals (if available) via deterministic lattice. -7. **Output & storage** - - * Reachability JSON with purls and confidence. - * Graph overlay into Cartographer. - * Signals/events for downstream scoring. - * DSSE-signed reachability attestation for replay/audit. - ---- - -## 3. Component architecture - -### 3.1. New and extended services - -1. **`StellaOps.Scanner.WebService` (extended)** - - * Orchestration of reachability analyses. - * Lattice/merging engine. - * Idempotent reachability APIs. - -2. **`StellaOps.Scanner.Analyzers.Binary.*` (new)** - - * `…Binary.Discovery`: file type detection, ELF/PE/Mach-O parsing. - * `…Binary.Symbolizer`: resolves symbols, imports/exports, relocations. - * `…Binary.CallGraph.Native`: builds call graphs where possible (via disassembly/CFG). 
- * `…Binary.CallGraph.DynamicStubs`: heuristics for indirect calls, PLT/GOT, vtables. - -3. **`StellaOps.Scanner.Analyzers.Script.*` (future extension)** - - * `…Lang.JavaScript.CallGraph` - * `…Lang.Python.CallGraph` - * `…Lang.Php.CallGraph` - * These emit the same generic call-graph IR. - -4. **`StellaOps.Reachability.Engine` (within Scanner.WebService)** - - * Normalizes all call graphs into a common IR. - * Merges static and dynamic evidence. - * Computes reachability states and scores. - -5. **`StellaOps.Cartographer.ReachabilityOverlay` (new overlay module)** - - * Stores per-artifact call graphs and reachability tags. - * Provides graph queries for UI and policy tools. - -6. **`StellaOps.Signals` (extended)** - - * Ingests runtime call traces (e.g., from EventPipe/JFR/ebpf in other branches). - * Feeds function-hit events into the Reachability Engine. - -7. **Unknowns Registry integration (optional but recommended)** - - * Stores unresolved symbol/package mappings and incomplete edges as `unknowns`. - * Used to adjust risk scores (“Unknowns Pressure”) when binary analysis is incomplete. - ---- - -## 4. Detailed design by layer - -### 4.1. Static analysis layer (binaries) - -#### 4.1.1. Binary discovery - -Module: `StellaOps.Scanner.Analyzers.Binary.Discovery` - -* Inputs: - - * Per-image file list (from existing Scanner). - * Byte slices of candidate binaries. -* Logic: - - * Detect ELF/PE/Mach-O via magic bytes, not extensions. - * Classify as: - - * Main executable - * Shared library - * Plugin/module -* Output: - - * `binary_manifest.json` per image: - - ```json - { - "image_ref": "registry/app@sha256:…", - "binaries": [ - { - "id": "bin:elf:/usr/local/bin/app", - "path": "/usr/local/bin/app", - "format": "elf", - "arch": "x86_64", - "role": "executable" - } - ] - } - ``` - -#### 4.1.2. Symbolization - -Module: `StellaOps.Scanner.Analyzers.Binary.Symbolizer` - -* Uses: - - * ELF/PE/Mach-O parsers (internal or third-party), no external calls. -* Output per binary: - - ```json - { - "binary_id": "bin:elf:/usr/local/bin/app", - "build_id": "buildid:abcd…", - "exports": ["pkg1::ClassA::method1", "..."], - "imports": ["openssl::EVP_EncryptInit_ex", "..."], - "sections": { "text": { "va": "0x...", "size": 12345 } } - } - ``` -* Writes unresolved symbol sets to Unknowns Registry when: - - * Imports cannot be tied to known packages or symbols. - -#### 4.1.3. Call graph construction - -Module: `StellaOps.Scanner.Analyzers.Binary.CallGraph.Native` - -* Core tasks: - - * Build control-flow graphs (CFG) for each function via: - - * Disassembly. - * Basic block detection. - * Identify direct calls (`call func`) and indirect calls (function pointers, vtables). -* IR model: - - ```json - { - "binary_id": "bin:elf:/usr/local/bin/app", - "functions": [ - { "fid": "func:app::main", "va": "0x401000", "size": 128 }, - { "fid": "func:libssl::EVP_EncryptInit_ex", "external": true } - ], - "edges": [ - { "caller": "func:app::main", "callee": "func:app::init_config", "type": "direct" }, - { "caller": "func:app::main", "callee": "func:libssl::EVP_EncryptInit_ex", "type": "import" } - ] - } - ``` -* Edge confidence: - - * `type: direct|import|indirect|heuristic` - * Used later by the lattice. - -#### 4.1.4. Entry point inference - -* Sources: - - * ELF `PT_INTERP`, PE `AddressOfEntryPoint`. - * Application-level hints (known frameworks, service main methods). - * Container metadata (CMD, ENTRYPOINT). 
-* Output: - - ```json - { - "binary_id": "bin:elf:/usr/local/bin/app", - "entrypoints": ["func:app::main"] - } - ``` - -> Note: For JS/Python/PHP, equivalent analyzers will later define module entrypoints (`index.js`, `wsgi_app`, `public/index.php`). - ---- - -### 4.2. Symbol-to-package and CVE-to-symbol mapping - -#### 4.2.1. Symbol→package mapping - -Module: `StellaOps.Reachability.Mapping.SymbolToPurl` - -* Inputs: - - * Binary symbolization outputs. - * Local mapping DB in Concelier (vendor symbol maps, debug info, name patterns). - * File path + container context (`/usr/lib/...`, `/site-packages/...`). -* Output: - - ```json - { - "symbol": "libssl::EVP_EncryptInit_ex", - "purl": "pkg:apk/alpine/openssl@3.1.5-r2", - "confidence": 0.93, - "method": "vendor_map+path_heuristic" - } - ``` -* Unresolved / ambiguous symbols: - - * Stored as `unknowns` of type `identity_gap`. - -#### 4.2.2. CVE→symbol mapping - -Responsibility: Concelier + its advisory ingestion. - -* For each vulnerability: - - ```json - { - "cve_id": "CVE-2025-12345", - "purl": "pkg:apk/alpine/openssl@3.1.5-r2", - "affected_symbols": [ - "libssl::EVP_EncryptInit_ex", - "libssl::EVP_EncryptUpdate" - ], - "source": "vendor_vex", - "confidence": 1.0 - } - ``` -* Reachability Engine consumes this mapping read-only. - ---- - -### 4.3. Reachability Engine - -Module: `StellaOps.Reachability.Engine` (in Scanner.WebService) - -#### 4.3.1. Core data model - -Per `(artifact, cve, purl)`: - -```json -{ - "artifact": { "type": "oci.image", "ref": "registry/app@sha256:…" }, - "cve_id": "CVE-2025-12345", - "purl": "pkg:apk/alpine/openssl@3.1.5-r2", - "symbols": [ - { - "symbol": "libssl::EVP_EncryptInit_ex", - "static_presence": "present|absent|unknown", - "static_reachability": "reachable|unreachable|unknown", - "runtime_hits": 3, - "runtime_reachability": "observed|not_observed|unknown" - } - ], - "reachability_state": "confirmed_reachable|statically_reachable|present_not_reachable|not_present|unknown", - "confidence": { - "p": 0.87, - "evidence": ["static_callgraph", "runtime_trace", "symbol_map"], - "unknowns_pressure": 0.12 - } -} -``` - -#### 4.3.2. Lattice / state machine - -Define a deterministic lattice over states: - -* `NOT_PRESENT` -* `PRESENT_NOT_REACHABLE` -* `STATICALLY_REACHABLE` -* `RUNTIME_OBSERVED` - -And “unknown” flags overlayed when evidence is missing. - -Merging rules (simplified): - -* If `NOT_PRESENT` and no conflicting evidence → `NOT_PRESENT`. -* If at least one affected symbol is on a static path from any entrypoint → `STATICALLY_REACHABLE`. -* If symbol observed at runtime → `RUNTIME_OBSERVED` (top state). -* If symbol present in binary but not on any static path → `PRESENT_NOT_REACHABLE`, unless unknown edges exist near it (then downgrade with lower confidence). -* Unknowns Registry entries near affected symbols increase `unknowns_pressure` and may push from `NOT_PRESENT` to `UNKNOWN`. - -Implementation: pure functional merge functions inside Scanner.WebService: - -```csharp -ReachabilityState Merge(ReachabilityState a, ReachabilityState b); -ReachabilityState FromEvidence(StaticEvidence s, RuntimeEvidence r, UnknownsPressure u); -``` - -#### 4.3.3. 
Deterministic inputs - -To guarantee replay: - -* Build **Reachability Plan Manifest** per run: - - ```json - { - "plan_id": "reach:sha256:…", - "scanner_version": "1.4.0", - "analyzers": { - "binary_discovery": "1.0.0", - "binary_symbolizer": "1.1.0", - "binary_callgraph": "1.2.0" - }, - "inputs": { - "image_digest": "sha256:…", - "runtime_trace_files": ["signals:run:2025-11-18T12:00:00Z"], - "config": { - "assume_indirect_calls": "conservative", - "max_call_depth": 10 - } - } - } - ``` -* DSSE-sign the plan + result. - ---- - -### 4.4. Storage and graph overlay - -#### 4.4.1. Reachability store - -Backend: re-use existing Scanner/Cartographer storage stack (e.g., Postgres or SQLite + blob store). - -Tables/collections: - -* `reachability_runs` - - * `plan_id`, `image_ref`, `created_at`, `scanner_version`. - -* `reachability_results` - - * `plan_id`, `cve_id`, `purl`, `state`, `confidence_p`, `unknowns_pressure`, `payload_json`. - -* Indexes on `(image_ref, cve_id)`, `(image_ref, purl)`. - -#### 4.4.2. Cartographer overlay - -Edges: - -* `IMAGE` → `BINARY` → `FUNCTION` → `PACKAGE` → `CVE` -* Extra property on `IMAGE -[AFFECTED_BY]-> CVE`: - - * `reachability_state` - * `reachability_plan_id` - -Enables queries: - -* “Show me all CVEs with `STATICALLY_REACHABLE` in this namespace.” -* “Show me binaries with high density of reachable crypto CVEs.” - ---- - -### 4.5. APIs (idempotent, additive) - -#### 4.5.1. Trigger reachability - -`POST /reachability/runs` - -Request: - -```json -{ - "artifact": { "type": "oci.image", "ref": "registry/app@sha256:…" }, - "config": { - "include_languages": ["binary"], - "max_call_depth": 10, - "assume_indirect_calls": "conservative" - } -} -``` - -Response: - -```json -{ "plan_id": "reach:sha256:…" } -``` - -* Idempotent key: `(image_ref, config_hash)`. Subsequent calls return same `plan_id`. - -#### 4.5.2. Fetch results - -`GET /reachability/runs/:plan_id` - -```json -{ - "plan": { /* reachability plan manifest */ }, - "results": [ - { - "cve_id": "CVE-2025-12345", - "purl": "pkg:apk/alpine/openssl@3.1.5-r2", - "reachability_state": "static_reachable", - "confidence": { "p": 0.84, "unknowns_pressure": 0.1 } - } - ] -} -``` - -#### 4.5.3. Per-CVE view for VEXer/Excitor - -`GET /reachability/by-cve?artifact=…&cve_id=…` - -* Returns filtered result for downstream VEX creation. - -All APIs are **read-only** except for the side effect of storing/caching runs. - ---- - -## 5. Interaction with other Stella Ops modules - -### 5.1. Concelier - -* Provides: - - * CVE→purl→symbol mapping. - * Vendor VEX statements indicating affected functions. -* Consumes: - - * Nothing from reachability directly; Scanner/WebService passes reachability summary to VEXer/Excitor which merges with vendor statements. - -### 5.2. VEXer / Excitor - -* Input: - - * For each `(artifact, cve)`: - - * Reachability state. - * Confidence. -* Logic: - - * Translate states to VEX statements: - - * `NOT_PRESENT` → `not_affected` - * `PRESENT_NOT_REACHABLE` → `not_affected` (with justification “code not reachable according to analysis”) - * `STATICALLY_REACHABLE` → `affected` - * `RUNTIME_OBSERVED` → `affected` (higher severity) - * Attach determinism proof: - - * Plan ID + DSSE of reachability run. - -### 5.3. Signals - -* Provides: - - * Function hit events: `(binary_id, function_id, timestamp)` aggregated per image. -* Reachability Engine: - - * Marks `runtime_hits` and state `RUNTIME_OBSERVED` for symbols with hits. 
-* Unknowns: - - * If runtime sees hits in functions with no static edges to entrypoints (or unmapped symbols), these produce Unknowns and increase `unknowns_pressure`. - -### 5.4. Unknowns Registry - -* From reachability pipeline, create Unknowns when: - - * Symbol→package mapping is ambiguous. - * CVE→symbol mapping exists, but symbol cannot be found in binaries. - * Call graph has indirect calls that cannot be resolved. -* The “Unknowns Pressure” term is fed into: - - * Reachability confidence. - * Global risk scoring (Trust Algebra Studio). - ---- - -## 6. Implementation phases and engineering plan - -### Phase 0 – Scaffolding & manifests (1 sprint) - -* Create: - - * `StellaOps.Reachability.Engine` skeleton. - * Reachability Plan Manifest schema. - * Reachability Run + Result persistence. -* Add `/reachability/runs` and `/reachability/runs/:plan_id` endpoints, returning mock data. -* Wire DSSE attestation generation for reachability results (even if payload is empty). - -### Phase 1 – Binary discovery + symbolization (1–2 sprints) - -* Implement `Binary.Discovery` and `Binary.Symbolizer`. -* Feed symbol tables into Reachability Engine as “presence-only evidence”: - - * States: `NOT_PRESENT` vs `PRESENT_NOT_REACHABLE` vs `UNKNOWN`. -* Integrate with Concelier’s CVE→purl mapping (no symbol-level yet): - - * For CVEs affecting a package present in the image, mark as `PRESENT_NOT_REACHABLE`. -* Emit Unknowns for unresolved binary roles and ambiguous package mapping. - -Deliverable: package-level reachability with deterministic manifests. - -### Phase 2 – Binary call graphs & entrypoints (2–3 sprints) - -* Implement `Binary.CallGraph.Native`: - - * CFG + direct call edges. -* Implement entrypoint inference from binary + container ENTRYPOINT/CMD. -* Add static reachability algorithm: - - * DFS/BFS from entrypoints through call graph. - * Mark affected symbols as reachable if found on paths. -* Extend Concelier to ingest symbol-aware vulnerability metadata (for pilots; can be partial). - -Deliverable: function-level static reachability for native binaries where symbol maps exist. - -### Phase 3 – Runtime integration (2 sprints, may be in parallel workstream) - -* Integrate Signals runtime evidence: - - * Define schema for function hit events. - * Add ingestion path into Reachability Engine. -* Update lattice: - - * Promote symbols to `RUNTIME_OBSERVED` when hits exist. -* Extend DSSE attestation to reference runtime evidence URIs (hashes of trace inputs). - -Deliverable: static + runtime-confirmed reachability. - -### Phase 4 – Unknowns & pressure (1 sprint) - -* Wire Unknowns Registry: - - * Emit unknowns from Symbolizer and CallGraph (identity gaps, missing edges). - * Compute `unknowns_pressure` per `(artifact, cve)` as density of unknowns near affected symbols. -* Adjust confidence calculation in Reachability Engine. -* Expose unknowns metrics in API and Cartographer. - -Deliverable: explicit modelling of uncertainty, feeding into trust/lattice. - -### Phase 5 – Language extensions (JS/Python/PHP) (ongoing) - -* Implement per-language call-graph analyzers creating the same IR as binary. -* Extend symbol→purl mapping for these ecosystems (npm, PyPI, Packagist). -* Update reachability solver to include multi-language edges (e.g., Python calling into native modules). - ---- - -## 7. Minimal contracts for agents - -To hand off to agents, you can codify: - -1. **IR schemas** - - * Call graph IR. - * Reachability Result JSON. - * Reachability Plan Manifest. -2. 
**API contracts** - - * `POST /reachability/runs` - * `GET /reachability/runs/:plan_id` - * `GET /reachability/by-cve` -3. **Module boundaries** - - * `Scanner.Analyzers.Binary.*` produce IR only; NO network calls. - * `Reachability.Engine` is the only place where lattice logic lives. - * `Concelier` is read-only for reachability; no custom logic there. -4. **Determinism practices** - - * All algorithmic randomness is banned; where unavoidable, seed with values derived from plan_id. - * All external inputs must be listed in the Plan Manifest. - -If you like, next step I can draft: - -* Concrete C# record types for the IRs. -* A small pseudo-code implementation of the lattice functions and static reachability DFS. -* A proposed directory layout under `src/StellaOps.Scanner` and `src/StellaOps.Cartographer`. + +Here’s a crisp idea you can drop straight into Stella Ops: treat “unknowns” as first‑class data, not noise. + +--- + +# Unknowns Registry — turning uncertainty into signals + +**Why:** Scanners and VEX feeds miss things (ambiguous package IDs, unverifiable hashes, orphaned layers, missing SBOM edges, runtime-only artifacts). Today these get logged and forgotten. If we **structure** them, downstream agents can reason about risk and shrink blast radius proactively. + +**What it is:** A small service + schema that records every uncertainty with enough context for later inference. + +## Core model (v0) + +```json +{ + "unknown_id": "unk:sha256:…", + "observed_at": "2025-11-18T12:00:00Z", + "provenance": { + "source": "Scanner.Analyzer.DotNet|Sbomer|Signals|Vexer", + "host": "runner-42", + "scan_id": "scan:…" + }, + "scope": { + "artifact": { "type": "oci.image", "ref": "registry/app@sha256:…" }, + "subpath": "/app/bin/Contoso.dll", + "phase": "build|scan|runtime" + }, + "unknown_type": "identity_gap|version_conflict|hash_mismatch|missing_edge|runtime_shadow|policy_undecidable", + "evidence": { + "raw": "nuget id 'Serilog' but assembly name 'Serilog.Core'", + "signals": ["sym:Serilog.Core.Logger", "procopen:/app/agent"] + }, + "transitive": { + "depth": 2, + "parents": ["pkg:nuget/Serilog@?"], + "children": [] + }, + "confidence": { "p": 0.42, "method": "bayes-merge|rule" }, + "exposure_hints": { + "surface": ["logging pipeline", "startup path"], + "runtime_hits": 3 + }, + "status": "open|triaged|suppressed|resolved", + "labels": ["reachability:possible", "sbom:incomplete"] +} +``` + +## Categorize by three axes + +* **Provenance** (where it came from): Scanner vs Sbomer vs Vexer vs Signals. +* **Scope** (what it touches): image/layer/file/symbol/runtime‑proc/policy. +* **Transitive depth** (how far from an entry point): 0 = direct, 1..N via deps. + +## How agents use it + +* **Cartographer**: includes unknown edges in the graph with special weight; lets Policy/Lattice down‑rank vulnerable nodes near high‑impact unknowns. +* **Remedy Assistant (Zastava)**: proposes micro‑probes (“add EventPipe/JFR tap for X symbol”) or build‑time assertions (“pin Serilog>=3.1, regenerate SBOM”). +* **Scheduler**: prioritizes scans where unknown density × asset criticality is highest. + +## Minimal API (idempotent, additive) + +* `POST /unknowns/ingest` — upsert by `unknown_id` (hash of type+scope+evidence). +* `GET /unknowns?artifact=…&status=open` — list for a target. +* `POST /unknowns/:id/triage` — set status/labels, attach rationale. +* `GET /metrics` — density by artifact/namespace/unknown_type. + +*All additive; no versioning required. 
Repeat calls with the same payload are no‑ops.* + +## Scoring hook (into your lattice) + +* Add a **“Unknowns Pressure”** term: + `risk = base ⊕ (α * density_depth≤1) ⊕ (β * runtime_shadow) ⊕ (γ * policy_undecidable)` +* Gate “green” only if `density_depth≤1 == 0` **or** compensating controls active. + +## Storage & plumbing + +* **Store:** append‑only KV (Badger/Rocks) + Graph overlay (SQLite/Neo4j—your call). +* **Emit:** DSSE‑signed “Unknowns Attestation” per scan for replayable audits. +* **UI:** heatmap per artifact (unknowns by type × depth), drill‑down to evidence. + +## First 2‑day slice + +1. Define `unknown_type` enum + hashable `unknown_id`. +2. Wire Scanner/Sbomer/Vexer to emit unknowns (start with: identity_gap, missing_edge). +3. Persist + expose `/metrics` (density, by depth and type). +4. In Policy Studio, add the Unknowns Pressure term with default α/β/γ. + +If you want, I’ll draft the exact protobuf/JSON schema and drop a .NET 10 record types + EF model, plus a tiny CLI to query and a Grafana panel JSON. +I will treat “it” as the whole vision behind **Pushing Binary Reachability Toward True Determinism** inside Stella Ops: function-/symbol-level reachability for binaries and higher-level languages, wired into Scanner, Cartographer, Signals, and VEX. + +Below is an implementation-oriented architecture plan you can hand directly to agents. + +--- + +## 1. Scope, goals, and non-negotiable invariants + +### 1.1. Scope + +Deliver a deterministic reachability pipeline for containers that: + +1. Builds **call graphs** and **symbol usage maps** for: + + * Native binaries (ELF, PE, Mach-O) — primary for this branch. + * Scripted/VM languages later: JS, Python, PHP (as part of the same architecture). +2. Maps symbols and functions to: + + * Packages (purls). + * Vulnerabilities (CVE → symbol/function list via Concelier/VEX data). +3. Computes **deterministic reachability states** for each `(vulnerability, artifact)` pair. +4. Emits: + + * Machine-readable JSON (with `purl`s). + * Graph overlays for Cartographer. + * Inputs for the lattice/trust engine and VEXer/Excitor. + +### 1.2. Invariants + +* **Deterministic replay**: Given the same: + + * Image digest(s), + * Analyzer versions, + * Config + policy, + * Runtime trace inputs (if any), + the same reachability outputs must be produced, bit-for-bit. +* **Idempotent, additive APIs**: + + * No versioning of endpoints, only additive/optional fields. + * Same request = same response, no side effects besides storing/caching. +* **Lattice logic runs in `Scanner.WebService`**: + + * All “reachable/unreachable/unknown” and confidence merging lives in Scanner, not Concelier/Excitors. +* **Preserve prune source**: + + * Concelier and Excitors preserve provenance and do not “massage” reachability; they only consume it. +* **Offline, air-gap friendly**: + + * No mandatory external calls; dependency on local analyzers and local advisory/VEX cache. + +--- + +## 2. High-level pipeline + +From container image to reachability output: + +1. **Image enumeration** + `Scanner.WebService` receives an image ref or tarball and spawns an analysis run. +2. **Binary discovery & classification** + Binary analyzers detect ELF/PE/Mach-O + main interpreters (python, node, php) and scripts. +3. **Symbolization & call graph building** + + * For each binary/module, we produce: + + * Symbol table (exported + imported). + * Call graph edges (function-level where possible). + * For dynamic languages, we later plug in appropriate analyzers. +4. 
**Symbol→package mapping** + + * Match symbols to packages and `purl`s using: + + * Known vendor symbol maps (from Concelier / Feedser). + * Heuristics, path patterns, build IDs. +5. **Vulnerability→symbol mapping** + + * From Concelier/VEX/CSAF: map each CVE to the set of symbols/functions it affects. +6. **Reachability solving** + + * For each `(CVE, artifact)`: + + * Determine presence and reachability of affected symbols from known entrypoints. + * Merge static call graph and runtime signals (if available) via deterministic lattice. +7. **Output & storage** + + * Reachability JSON with purls and confidence. + * Graph overlay into Cartographer. + * Signals/events for downstream scoring. + * DSSE-signed reachability attestation for replay/audit. + +--- + +## 3. Component architecture + +### 3.1. New and extended services + +1. **`StellaOps.Scanner.WebService` (extended)** + + * Orchestration of reachability analyses. + * Lattice/merging engine. + * Idempotent reachability APIs. + +2. **`StellaOps.Scanner.Analyzers.Binary.*` (new)** + + * `…Binary.Discovery`: file type detection, ELF/PE/Mach-O parsing. + * `…Binary.Symbolizer`: resolves symbols, imports/exports, relocations. + * `…Binary.CallGraph.Native`: builds call graphs where possible (via disassembly/CFG). + * `…Binary.CallGraph.DynamicStubs`: heuristics for indirect calls, PLT/GOT, vtables. + +3. **`StellaOps.Scanner.Analyzers.Script.*` (future extension)** + + * `…Lang.JavaScript.CallGraph` + * `…Lang.Python.CallGraph` + * `…Lang.Php.CallGraph` + * These emit the same generic call-graph IR. + +4. **`StellaOps.Reachability.Engine` (within Scanner.WebService)** + + * Normalizes all call graphs into a common IR. + * Merges static and dynamic evidence. + * Computes reachability states and scores. + +5. **`StellaOps.Cartographer.ReachabilityOverlay` (new overlay module)** + + * Stores per-artifact call graphs and reachability tags. + * Provides graph queries for UI and policy tools. + +6. **`StellaOps.Signals` (extended)** + + * Ingests runtime call traces (e.g., from EventPipe/JFR/ebpf in other branches). + * Feeds function-hit events into the Reachability Engine. + +7. **Unknowns Registry integration (optional but recommended)** + + * Stores unresolved symbol/package mappings and incomplete edges as `unknowns`. + * Used to adjust risk scores (“Unknowns Pressure”) when binary analysis is incomplete. + +--- + +## 4. Detailed design by layer + +### 4.1. Static analysis layer (binaries) + +#### 4.1.1. Binary discovery + +Module: `StellaOps.Scanner.Analyzers.Binary.Discovery` + +* Inputs: + + * Per-image file list (from existing Scanner). + * Byte slices of candidate binaries. +* Logic: + + * Detect ELF/PE/Mach-O via magic bytes, not extensions. + * Classify as: + + * Main executable + * Shared library + * Plugin/module +* Output: + + * `binary_manifest.json` per image: + + ```json + { + "image_ref": "registry/app@sha256:…", + "binaries": [ + { + "id": "bin:elf:/usr/local/bin/app", + "path": "/usr/local/bin/app", + "format": "elf", + "arch": "x86_64", + "role": "executable" + } + ] + } + ``` + +#### 4.1.2. Symbolization + +Module: `StellaOps.Scanner.Analyzers.Binary.Symbolizer` + +* Uses: + + * ELF/PE/Mach-O parsers (internal or third-party), no external calls. 
+* Output per binary: + + ```json + { + "binary_id": "bin:elf:/usr/local/bin/app", + "build_id": "buildid:abcd…", + "exports": ["pkg1::ClassA::method1", "..."], + "imports": ["openssl::EVP_EncryptInit_ex", "..."], + "sections": { "text": { "va": "0x...", "size": 12345 } } + } + ``` +* Writes unresolved symbol sets to Unknowns Registry when: + + * Imports cannot be tied to known packages or symbols. + +#### 4.1.3. Call graph construction + +Module: `StellaOps.Scanner.Analyzers.Binary.CallGraph.Native` + +* Core tasks: + + * Build control-flow graphs (CFG) for each function via: + + * Disassembly. + * Basic block detection. + * Identify direct calls (`call func`) and indirect calls (function pointers, vtables). +* IR model: + + ```json + { + "binary_id": "bin:elf:/usr/local/bin/app", + "functions": [ + { "fid": "func:app::main", "va": "0x401000", "size": 128 }, + { "fid": "func:libssl::EVP_EncryptInit_ex", "external": true } + ], + "edges": [ + { "caller": "func:app::main", "callee": "func:app::init_config", "type": "direct" }, + { "caller": "func:app::main", "callee": "func:libssl::EVP_EncryptInit_ex", "type": "import" } + ] + } + ``` +* Edge confidence: + + * `type: direct|import|indirect|heuristic` + * Used later by the lattice. + +#### 4.1.4. Entry point inference + +* Sources: + + * ELF `PT_INTERP`, PE `AddressOfEntryPoint`. + * Application-level hints (known frameworks, service main methods). + * Container metadata (CMD, ENTRYPOINT). +* Output: + + ```json + { + "binary_id": "bin:elf:/usr/local/bin/app", + "entrypoints": ["func:app::main"] + } + ``` + +> Note: For JS/Python/PHP, equivalent analyzers will later define module entrypoints (`index.js`, `wsgi_app`, `public/index.php`). + +--- + +### 4.2. Symbol-to-package and CVE-to-symbol mapping + +#### 4.2.1. Symbol→package mapping + +Module: `StellaOps.Reachability.Mapping.SymbolToPurl` + +* Inputs: + + * Binary symbolization outputs. + * Local mapping DB in Concelier (vendor symbol maps, debug info, name patterns). + * File path + container context (`/usr/lib/...`, `/site-packages/...`). +* Output: + + ```json + { + "symbol": "libssl::EVP_EncryptInit_ex", + "purl": "pkg:apk/alpine/openssl@3.1.5-r2", + "confidence": 0.93, + "method": "vendor_map+path_heuristic" + } + ``` +* Unresolved / ambiguous symbols: + + * Stored as `unknowns` of type `identity_gap`. + +#### 4.2.2. CVE→symbol mapping + +Responsibility: Concelier + its advisory ingestion. + +* For each vulnerability: + + ```json + { + "cve_id": "CVE-2025-12345", + "purl": "pkg:apk/alpine/openssl@3.1.5-r2", + "affected_symbols": [ + "libssl::EVP_EncryptInit_ex", + "libssl::EVP_EncryptUpdate" + ], + "source": "vendor_vex", + "confidence": 1.0 + } + ``` +* Reachability Engine consumes this mapping read-only. + +--- + +### 4.3. Reachability Engine + +Module: `StellaOps.Reachability.Engine` (in Scanner.WebService) + +#### 4.3.1. 
Core data model + +Per `(artifact, cve, purl)`: + +```json +{ + "artifact": { "type": "oci.image", "ref": "registry/app@sha256:…" }, + "cve_id": "CVE-2025-12345", + "purl": "pkg:apk/alpine/openssl@3.1.5-r2", + "symbols": [ + { + "symbol": "libssl::EVP_EncryptInit_ex", + "static_presence": "present|absent|unknown", + "static_reachability": "reachable|unreachable|unknown", + "runtime_hits": 3, + "runtime_reachability": "observed|not_observed|unknown" + } + ], + "reachability_state": "confirmed_reachable|statically_reachable|present_not_reachable|not_present|unknown", + "confidence": { + "p": 0.87, + "evidence": ["static_callgraph", "runtime_trace", "symbol_map"], + "unknowns_pressure": 0.12 + } +} +``` + +#### 4.3.2. Lattice / state machine + +Define a deterministic lattice over states: + +* `NOT_PRESENT` +* `PRESENT_NOT_REACHABLE` +* `STATICALLY_REACHABLE` +* `RUNTIME_OBSERVED` + +And “unknown” flags overlayed when evidence is missing. + +Merging rules (simplified): + +* If `NOT_PRESENT` and no conflicting evidence → `NOT_PRESENT`. +* If at least one affected symbol is on a static path from any entrypoint → `STATICALLY_REACHABLE`. +* If symbol observed at runtime → `RUNTIME_OBSERVED` (top state). +* If symbol present in binary but not on any static path → `PRESENT_NOT_REACHABLE`, unless unknown edges exist near it (then downgrade with lower confidence). +* Unknowns Registry entries near affected symbols increase `unknowns_pressure` and may push from `NOT_PRESENT` to `UNKNOWN`. + +Implementation: pure functional merge functions inside Scanner.WebService: + +```csharp +ReachabilityState Merge(ReachabilityState a, ReachabilityState b); +ReachabilityState FromEvidence(StaticEvidence s, RuntimeEvidence r, UnknownsPressure u); +``` + +#### 4.3.3. Deterministic inputs + +To guarantee replay: + +* Build **Reachability Plan Manifest** per run: + + ```json + { + "plan_id": "reach:sha256:…", + "scanner_version": "1.4.0", + "analyzers": { + "binary_discovery": "1.0.0", + "binary_symbolizer": "1.1.0", + "binary_callgraph": "1.2.0" + }, + "inputs": { + "image_digest": "sha256:…", + "runtime_trace_files": ["signals:run:2025-11-18T12:00:00Z"], + "config": { + "assume_indirect_calls": "conservative", + "max_call_depth": 10 + } + } + } + ``` +* DSSE-sign the plan + result. + +--- + +### 4.4. Storage and graph overlay + +#### 4.4.1. Reachability store + +Backend: re-use existing Scanner/Cartographer storage stack (e.g., Postgres or SQLite + blob store). + +Tables/collections: + +* `reachability_runs` + + * `plan_id`, `image_ref`, `created_at`, `scanner_version`. + +* `reachability_results` + + * `plan_id`, `cve_id`, `purl`, `state`, `confidence_p`, `unknowns_pressure`, `payload_json`. + +* Indexes on `(image_ref, cve_id)`, `(image_ref, purl)`. + +#### 4.4.2. Cartographer overlay + +Edges: + +* `IMAGE` → `BINARY` → `FUNCTION` → `PACKAGE` → `CVE` +* Extra property on `IMAGE -[AFFECTED_BY]-> CVE`: + + * `reachability_state` + * `reachability_plan_id` + +Enables queries: + +* “Show me all CVEs with `STATICALLY_REACHABLE` in this namespace.” +* “Show me binaries with high density of reachable crypto CVEs.” + +--- + +### 4.5. APIs (idempotent, additive) + +#### 4.5.1. 
Trigger reachability + +`POST /reachability/runs` + +Request: + +```json +{ + "artifact": { "type": "oci.image", "ref": "registry/app@sha256:…" }, + "config": { + "include_languages": ["binary"], + "max_call_depth": 10, + "assume_indirect_calls": "conservative" + } +} +``` + +Response: + +```json +{ "plan_id": "reach:sha256:…" } +``` + +* Idempotent key: `(image_ref, config_hash)`. Subsequent calls return same `plan_id`. + +#### 4.5.2. Fetch results + +`GET /reachability/runs/:plan_id` + +```json +{ + "plan": { /* reachability plan manifest */ }, + "results": [ + { + "cve_id": "CVE-2025-12345", + "purl": "pkg:apk/alpine/openssl@3.1.5-r2", + "reachability_state": "static_reachable", + "confidence": { "p": 0.84, "unknowns_pressure": 0.1 } + } + ] +} +``` + +#### 4.5.3. Per-CVE view for VEXer/Excitor + +`GET /reachability/by-cve?artifact=…&cve_id=…` + +* Returns filtered result for downstream VEX creation. + +All APIs are **read-only** except for the side effect of storing/caching runs. + +--- + +## 5. Interaction with other Stella Ops modules + +### 5.1. Concelier + +* Provides: + + * CVE→purl→symbol mapping. + * Vendor VEX statements indicating affected functions. +* Consumes: + + * Nothing from reachability directly; Scanner/WebService passes reachability summary to VEXer/Excitor which merges with vendor statements. + +### 5.2. VEXer / Excitor + +* Input: + + * For each `(artifact, cve)`: + + * Reachability state. + * Confidence. +* Logic: + + * Translate states to VEX statements: + + * `NOT_PRESENT` → `not_affected` + * `PRESENT_NOT_REACHABLE` → `not_affected` (with justification “code not reachable according to analysis”) + * `STATICALLY_REACHABLE` → `affected` + * `RUNTIME_OBSERVED` → `affected` (higher severity) + * Attach determinism proof: + + * Plan ID + DSSE of reachability run. + +### 5.3. Signals + +* Provides: + + * Function hit events: `(binary_id, function_id, timestamp)` aggregated per image. +* Reachability Engine: + + * Marks `runtime_hits` and state `RUNTIME_OBSERVED` for symbols with hits. +* Unknowns: + + * If runtime sees hits in functions with no static edges to entrypoints (or unmapped symbols), these produce Unknowns and increase `unknowns_pressure`. + +### 5.4. Unknowns Registry + +* From reachability pipeline, create Unknowns when: + + * Symbol→package mapping is ambiguous. + * CVE→symbol mapping exists, but symbol cannot be found in binaries. + * Call graph has indirect calls that cannot be resolved. +* The “Unknowns Pressure” term is fed into: + + * Reachability confidence. + * Global risk scoring (Trust Algebra Studio). + +--- + +## 6. Implementation phases and engineering plan + +### Phase 0 – Scaffolding & manifests (1 sprint) + +* Create: + + * `StellaOps.Reachability.Engine` skeleton. + * Reachability Plan Manifest schema. + * Reachability Run + Result persistence. +* Add `/reachability/runs` and `/reachability/runs/:plan_id` endpoints, returning mock data. +* Wire DSSE attestation generation for reachability results (even if payload is empty). + +### Phase 1 – Binary discovery + symbolization (1–2 sprints) + +* Implement `Binary.Discovery` and `Binary.Symbolizer`. +* Feed symbol tables into Reachability Engine as “presence-only evidence”: + + * States: `NOT_PRESENT` vs `PRESENT_NOT_REACHABLE` vs `UNKNOWN`. +* Integrate with Concelier’s CVE→purl mapping (no symbol-level yet): + + * For CVEs affecting a package present in the image, mark as `PRESENT_NOT_REACHABLE`. +* Emit Unknowns for unresolved binary roles and ambiguous package mapping. 
+ +Deliverable: package-level reachability with deterministic manifests. + +### Phase 2 – Binary call graphs & entrypoints (2–3 sprints) + +* Implement `Binary.CallGraph.Native`: + + * CFG + direct call edges. +* Implement entrypoint inference from binary + container ENTRYPOINT/CMD. +* Add static reachability algorithm: + + * DFS/BFS from entrypoints through call graph. + * Mark affected symbols as reachable if found on paths. +* Extend Concelier to ingest symbol-aware vulnerability metadata (for pilots; can be partial). + +Deliverable: function-level static reachability for native binaries where symbol maps exist. + +### Phase 3 – Runtime integration (2 sprints, may be in parallel workstream) + +* Integrate Signals runtime evidence: + + * Define schema for function hit events. + * Add ingestion path into Reachability Engine. +* Update lattice: + + * Promote symbols to `RUNTIME_OBSERVED` when hits exist. +* Extend DSSE attestation to reference runtime evidence URIs (hashes of trace inputs). + +Deliverable: static + runtime-confirmed reachability. + +### Phase 4 – Unknowns & pressure (1 sprint) + +* Wire Unknowns Registry: + + * Emit unknowns from Symbolizer and CallGraph (identity gaps, missing edges). + * Compute `unknowns_pressure` per `(artifact, cve)` as density of unknowns near affected symbols. +* Adjust confidence calculation in Reachability Engine. +* Expose unknowns metrics in API and Cartographer. + +Deliverable: explicit modelling of uncertainty, feeding into trust/lattice. + +### Phase 5 – Language extensions (JS/Python/PHP) (ongoing) + +* Implement per-language call-graph analyzers creating the same IR as binary. +* Extend symbol→purl mapping for these ecosystems (npm, PyPI, Packagist). +* Update reachability solver to include multi-language edges (e.g., Python calling into native modules). + +--- + +## 7. Minimal contracts for agents + +To hand off to agents, you can codify: + +1. **IR schemas** + + * Call graph IR. + * Reachability Result JSON. + * Reachability Plan Manifest. +2. **API contracts** + + * `POST /reachability/runs` + * `GET /reachability/runs/:plan_id` + * `GET /reachability/by-cve` +3. **Module boundaries** + + * `Scanner.Analyzers.Binary.*` produce IR only; NO network calls. + * `Reachability.Engine` is the only place where lattice logic lives. + * `Concelier` is read-only for reachability; no custom logic there. +4. **Determinism practices** + + * All algorithmic randomness is banned; where unavoidable, seed with values derived from plan_id. + * All external inputs must be listed in the Plan Manifest. + +If you like, next step I can draft: + +* Concrete C# record types for the IRs. +* A small pseudo-code implementation of the lattice functions and static reachability DFS. +* A proposed directory layout under `src/StellaOps.Scanner` and `src/StellaOps.Cartographer`. 
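+
+To close the loop on §4.3.2 and the Phase 2 walk, here is a self-contained sketch of the lattice join and the static reachability set (the enum ordering encodes the lattice, the "unknown" overlay stays a separate flag, and all names are illustrative, not the final Scanner.WebService types):
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+public enum ReachabilityState
+{
+    NotPresent = 0,
+    PresentNotReachable = 1,
+    StaticallyReachable = 2,
+    RuntimeObserved = 3, // top of the lattice
+}
+
+public static class ReachabilityLattice
+{
+    // Join = max over the order above. Evidence can only move a symbol up,
+    // so merging is associative and commutative, i.e. deterministic
+    // regardless of the order in which evidence arrives.
+    public static ReachabilityState Merge(ReachabilityState a, ReachabilityState b)
+        => (ReachabilityState)Math.Max((int)a, (int)b);
+
+    // BFS from inferred entrypoints over call-graph edges; the result is a
+    // set, so traversal order cannot affect the outcome.
+    public static HashSet<string> StaticallyReachable(
+        IEnumerable<string> entrypoints,
+        IReadOnlyDictionary<string, IReadOnlyList<string>> calleesByCaller)
+    {
+        var seen = new HashSet<string>(entrypoints, StringComparer.Ordinal);
+        var queue = new Queue<string>(seen);
+        while (queue.Count > 0)
+        {
+            if (!calleesByCaller.TryGetValue(queue.Dequeue(), out var callees)) continue;
+            foreach (var callee in callees)
+                if (seen.Add(callee)) queue.Enqueue(callee);
+        }
+        return seen;
+    }
+}
+```
+
+A symbol whose function id lands in the returned set is `STATICALLY_REACHABLE`; Signals hits then promote it to `RUNTIME_OBSERVED` via `Merge`.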
diff --git a/docs/product-advisories/archived/20-Nov-2026 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md b/docs/product-advisories/archived/20-Nov-2025 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md
similarity index 96%
rename from docs/product-advisories/archived/20-Nov-2026 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md
rename to docs/product-advisories/archived/20-Nov-2025 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md
index 8bd9bdf87..34c59abdf 100644
--- a/docs/product-advisories/archived/20-Nov-2026 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md
+++ b/docs/product-advisories/archived/20-Nov-2025 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md
@@ -1,1246 +1,1246 @@
+Here’s a quick, practical win for your SBOM/runtime join story: **record the ELF build‑id alongside soname and path when mapping modules to purls.**
+
+Why it matters:
+
+* **build‑id** (from `.note.gnu.build-id`) is a **content hash** that uniquely identifies an ELF image—even if filenames/paths change.
+* Distros and **debuginfod** index debug symbols **by build‑id**, so you can reliably join runtime traces → binaries → SBOM entries → debug artifacts.
+* It hardens reachability and VEX joins (no “same soname, different bits” ambiguity).
+
+### What to capture per ELF
+
+* `soname` (if shared object)
+* `full path` at runtime
+* `purl` (package URL from your resolver)
+* **`build_id`** (hex, no colons)
+* `arch`, `file type` (ET_DYN/ET_EXEC), and `build-id source` (NT_GNU_BUILD_ID)
+
+### How to read it (portable snippets)
+
+**CLI**
+
+```bash
+# show build-id quickly
+readelf -n /path/to/bin | awk '/Build ID:/ {print $3}'
+# or:
+objdump -s --section .note.gnu.build-id /path/to/bin
+```
+
+**C (runtime collector)**
+
+```c
+#define _GNU_SOURCE
+#include <link.h>
+#include <elf.h>
+
+static int note_cb(struct dl_phdr_info *info, size_t size, void *data) {
+    for (int i = 0; i < info->dlpi_phnum; i++) {
+        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
+        if (ph->p_type == PT_NOTE) {
+            // scan notes for NT_GNU_BUILD_ID (type=3, name="GNU")
+            // extract desc bytes → hex string build_id
+        }
+    }
+    return 0;
+}
+// call dl_iterate_phdr(note_cb, NULL);
+```
+
+**Go (scanner)**
+
+```go
+// debug/elf exposes no note iterator, so read the note section directly.
+f, _ := elf.Open(path)
+defer f.Close()
+if s := f.Section(".note.gnu.build-id"); s != nil {
+	d, _ := s.Data()
+	namesz := f.ByteOrder.Uint32(d[0:4])
+	descsz := f.ByteOrder.Uint32(d[4:8])
+	if f.ByteOrder.Uint32(d[8:12]) == 3 { // NT_GNU_BUILD_ID, name "GNU"
+		off := 12 + (int(namesz)+3)&^3 // name is 4-byte padded
+		buildID := fmt.Sprintf("%x", d[off:off+int(descsz)])
+		_ = buildID // record buildID
+	}
+}
+```
+
+### Suggested Stella Ops schema (add field, no versioning break)
+
+```json
+{
+  "module": {
+    "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
+    "soname": "libssl.so.3",
+    "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
+    "elf": {
+      "build_id": "a1b2c3d4e5f6...",
+      "type": "ET_DYN",
+      "arch": "x86_64",
+      "notes": { "source": "NT_GNU_BUILD_ID" }
+    }
+  }
+}
+```
+
+### Join strategy
+
+1. **Runtime → build‑id:** collect from process maps (or `dl_iterate_phdr`) and a file-scan fallback.
+2. **SBOM → candidate binaries:** map by purl/filename, then **confirm by build‑id** where available.
+3. **Debug/Source:** query debuginfod or distro debug repos by build‑id to fetch symbols for precise call‑graph and reachability.
+4. **VEX/Policies:** treat build‑id as the primary key for binary‑level assertions; purl stays as the package‑level key.
+
+### Edge handling
+
+* **Stripped binaries:** the build‑id note survives `strip`; if it is missing entirely, fall back to a **full‑file hash** and flag `build_id_absent=true`.
+* **Containers:** compute build‑ids inside image layers and cache them in your “Proof‑of‑Integrity Graph.”
+* **Kernel/Modules:** same idea—`/sys/module/*/notes/.note.gnu.build-id`.
+
+### Quick acceptance tests
+
+* Scan a container image (Debian/Ubuntu/RHEL) and verify >90% of ELF objects yield a build‑id.
+* Cross‑check one binary: the path changes across containers, but the **build‑id stays identical**.
+* Fetch symbols via debuginfod using that build‑id and run a tiny call‑graph demo to prove determinism.
+
+If you want, I can draft the exact .NET 10 collector for Linux (P/Invoke `dl_iterate_phdr`) and a CycloneDX extension block to store `build_id`.
+Here’s a concrete “implementation spec” for a C# dev to build an **ELF metadata / build-id collector** (“elf builder”). I’ll treat this as a small reusable .NET library plus some process-level helpers.
+
+---
+
+## 1. Goal & Scope
+
+**Goal:** From C# on Linux, be able to:
+
+1. Given an ELF file path, extract:
+
+   * `build-id` (from `.note.gnu.build-id`, i.e. NT_GNU_BUILD_ID)
+   * `soname` (for shared objects)
+   * ELF type (ET_EXEC / ET_DYN / etc.)
+   * machine architecture
+   * file path
+   * optional fallback: full-file hash if build-id is missing
+
+2. Given a running process (usually self), enumerate loaded ELF modules and attach the above metadata per module.
+
+The output will power your SBOM/runtime join (path + soname + build-id → purl).
+
+---
+
+## 2. Public API Spec
+
+### 2.1 Core model
+
+```csharp
+public enum ElfFileType
+{
+    Unknown = 0,
+    Relocatable = 1,  // ET_REL
+    Executable = 2,   // ET_EXEC
+    SharedObject = 3, // ET_DYN
+    Core = 4          // ET_CORE
+}
+
+public sealed class ElfMetadata
+{
+    public required string Path { get; init; }
+    public string? Soname { get; init; }
+    public string? BuildId { get; init; }            // Hex, lowercase, no colons
+    public string BuildIdSource { get; init; } = ""; // "NT_GNU_BUILD_ID" | "FileHash" | ""
+    public ElfFileType FileType { get; init; }
+
+    public string Machine { get; init; } = "";       // e.g. "x86_64", "aarch64"
+    public bool Is64Bit { get; init; }
+    public bool IsLittleEndian { get; init; }
+
+    public string? FileHashSha256 { get; init; }     // only if BuildId == null
+}
+```
+
+### 2.2 File-level API
+
+```csharp
+public static class ElfReader
+{
+    /// <summary>
+    /// Parse the ELF file at the given path and extract metadata.
+    /// Throws if the file is not ELF or cannot be read.
+    /// </summary>
+    public static ElfMetadata ReadMetadata(string path);
+}
+```
+
+**Behavior:**
+
+* Validates the ELF magic.
+* Supports both 32-bit and 64-bit ELF.
+* Supports little and big endian (though you can initially test only little-endian).
+* Uses program headers (PT_NOTE) and note parsing to extract the build-id.
+* Uses section headers + `.dynamic` to extract `DT_SONAME`.
+* Sets `BuildIdSource = "NT_GNU_BUILD_ID"` if a build-id is present.
+* If there is no build-id, computes `FileHashSha256` and sets `BuildIdSource = "FileHash"`.
+
+### 2.3 Process-level API (Linux)
+
+```csharp
+public static class ElfProcessScanner
+{
+    /// <summary>
+    /// Enumerate ELF modules for the current process (default) or a given pid.
+    /// Only returns unique paths that are actual ELF files.
+    /// </summary>
+    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null);
+}
+```
+
+**Default implementation:**
+
+* Only supports Linux.
+* Reads `/proc/<pid>/maps`.
+* Filters entries that map regular files (path not `[vdso]`, `[heap]`, etc.).
+* De-duplicates by canonical path (e.g. `realpath` behavior).
+* For each unique path:
+
+  * Check the first 4 bytes for the ELF magic.
+  * Call `ElfReader.ReadMetadata(path)`.
+
+---
+
+## 3. ELF Parsing: Binary Layout & Rules
+
+You do **not** need unsafe code; a `BinaryReader` is enough.
+
+### 3.1 ELF header
+
+First 16 bytes: `e_ident[]`.
+
+Key fields:
+
+* `e_ident[0..3]` = `0x7F, 'E', 'L', 'F'` (magic)
+* `e_ident[4]` = `EI_CLASS`:
+
+  * 1 = 32-bit (`ELFCLASS32`)
+  * 2 = 64-bit (`ELFCLASS64`)
+* `e_ident[5]` = `EI_DATA`:
+
+  * 1 = little-endian (`ELFDATA2LSB`)
+  * 2 = big-endian (`ELFDATA2MSB`)
+
+Then come the “native” header fields, which differ slightly between 32- and 64-bit.
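+To make “differ slightly” concrete, here is one plausible shape for the `ReadElfHeader` helper used by the skeleton in section 4 (a hedged sketch: it assumes the `ElfHeaderCommon` struct and endian helpers defined just below; the field order follows the ELF spec):
+
+```csharp
+static ElfHeaderCommon ReadElfHeader(BinaryReader br, byte[] ident, bool is64, bool little)
+{
+    var h = new ElfHeaderCommon { Ident = ident };
+    h.Type    = ReadUInt16(br, little); // e_type
+    h.Machine = ReadUInt16(br, little); // e_machine
+    h.Version = ReadUInt32(br, little); // e_version
+
+    // Only these three fields change width between ELFCLASS32 and ELFCLASS64.
+    if (is64)
+    {
+        h.Entry = ReadUInt64(br, little);
+        h.Phoff = ReadUInt64(br, little);
+        h.Shoff = ReadUInt64(br, little);
+    }
+    else
+    {
+        h.Entry = ReadUInt32(br, little); // 4-byte values, zero-extended to 64-bit
+        h.Phoff = ReadUInt32(br, little);
+        h.Shoff = ReadUInt32(br, little);
+    }
+
+    h.Flags     = ReadUInt32(br, little);
+    h.Ehsize    = ReadUInt16(br, little);
+    h.Phentsize = ReadUInt16(br, little);
+    h.Phnum     = ReadUInt16(br, little);
+    h.Shentsize = ReadUInt16(br, little);
+    h.Shnum     = ReadUInt16(br, little);
+    h.Shstrndx  = ReadUInt16(br, little);
+    return h;
+}
+```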
+
+Define two internal structs (don’t use `[StructLayout]`; just read the fields manually):
+
+```csharp
+internal sealed class ElfHeaderCommon
+{
+    public byte[] Ident = new byte[16];
+    public ushort Type;      // e_type
+    public ushort Machine;   // e_machine
+    public uint Version;     // e_version
+    public ulong Entry;      // e_entry (32/64 sized)
+    public ulong Phoff;      // e_phoff
+    public ulong Shoff;      // e_shoff
+    public uint Flags;       // e_flags
+    public ushort Ehsize;    // e_ehsize
+    public ushort Phentsize; // e_phentsize
+    public ushort Phnum;     // e_phnum
+    public ushort Shentsize; // e_shentsize
+    public ushort Shnum;     // e_shnum
+    public ushort Shstrndx;  // e_shstrndx
+}
+```
+
+**Algorithm to read the header:**
+
+1. `ReadBytes(16)` → `Ident`. Validate the magic & EI_CLASS/EI_DATA.
+
+2. Decide `is64` (from EI_CLASS) and `littleEndian` (from EI_DATA).
+
+3. Use helper methods:
+
+   ```csharp
+   static ushort ReadUInt16(BinaryReader br, bool little) { ... }
+   static uint ReadUInt32(BinaryReader br, bool little) { ... }
+   static ulong ReadUInt64(BinaryReader br, bool little) { ... }
+   ```
+
+   These helpers swap bytes when the file is big-endian and the host is little-endian.
+
+4. For 32-bit ELF: the `Entry`, `Phoff`, `Shoff` fields are 4-byte values that you zero-extend to 64-bit.
+
+5. For 64-bit ELF: they are 8-byte values.
+
+### 3.2 Program headers (for build-id)
+
+Each program header:
+
+* 32-bit:
+
+  ```text
+  uint32 p_type;
+  uint32 p_offset;
+  uint32 p_vaddr;
+  uint32 p_paddr;
+  uint32 p_filesz;
+  uint32 p_memsz;
+  uint32 p_flags;
+  uint32 p_align;
+  ```
+
+* 64-bit:
+
+  ```text
+  uint32 p_type;
+  uint32 p_flags;
+  uint64 p_offset;
+  uint64 p_vaddr;
+  uint64 p_paddr;
+  uint64 p_filesz;
+  uint64 p_memsz;
+  uint64 p_align;
+  ```
+
+You only really need:
+
+* `p_type` (look for `PT_NOTE` = 4)
+* `p_offset`
+* `p_filesz`
+
+**Reading algorithm:**
+
+```csharp
+internal sealed class ProgramHeader
+{
+    public uint Type;
+    public ulong Offset;
+    public ulong FileSize;
+}
+```
+
+* Seek to `header.Phoff`.
+* For `i = 0..Phnum-1`:
+
+  * For 32-bit:
+
+    * `Type = ReadUInt32()`
+    * `Offset = ReadUInt32()` (`p_offset` comes right after `p_type`)
+    * Skip `p_vaddr` and `p_paddr` (8 bytes), then `FileSize = ReadUInt32()` (`p_filesz`).
+    * Skip the rest (`p_memsz`, `p_flags`, `p_align`).
+  * For 64-bit:
+
+    * `Type = ReadUInt32()`
+    * `flags = ReadUInt32()` (ignored)
+    * `Offset = ReadUInt64()`
+    * Skip `p_vaddr` and `p_paddr` (16 bytes), then `FileSize = ReadUInt64()` (`p_filesz`).
+    * Skip the rest.
+* Keep those with `Type == 4` (PT_NOTE).
+
+### 3.3 Note segments & NT_GNU_BUILD_ID
+
+Each **note** has:
+
+```text
+uint32 namesz;
+uint32 descsz;
+uint32 type;
+char name[namesz]; // padded to 4-byte boundary
+byte desc[descsz]; // padded to 4-byte boundary
+```
+
+We care about:
+
+* `type == 3` (NT_GNU_BUILD_ID)
+* `name == "GNU"` (null-terminated; usually `"GNU\0"`)
+
+**Algorithm:**
+
+For each `PT_NOTE` program header:
+
+1. Seek to `ph.Offset`, set `remaining = ph.FileSize`.
+2. While `remaining >= 12`:
+
+   * `namesz = ReadUInt32()`
+   * `descsz = ReadUInt32()`
+   * `type = ReadUInt32()`
+   * `remaining -= 12`.
+   * Read `nameBytes = ReadBytes(namesz)`; `remaining -= namesz`.
+     Skip padding: `pad = (4 - (namesz % 4)) & 3`; `Seek(pad)`, `remaining -= pad`.
+   * Read `desc = ReadBytes(descsz)`; `remaining -= descsz`.
+     Skip padding: `pad = (4 - (descsz % 4)) & 3`; `Seek(pad)`, `remaining -= pad`.
+   * If `type == 3` and `Encoding.ASCII.GetString(nameBytes).TrimEnd('\0') == "GNU"`:
+
+     * Convert `desc` to hex:
+
+       ```csharp
+       string buildId = BitConverter.ToString(desc).Replace("-", "").ToLowerInvariant();
+       ```
+
+     * Return immediately.
+
+If no note matches, return null; you can later fall back to `FileHashSha256`.
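+Tying 3.3 together, a minimal sketch of `TryReadBuildIdFromNotes` as the skeleton later calls it (hedged: it assumes the `ProgramHeader` list and endian helpers above, and omits the defensive bounds checks a real implementation needs):
+
+```csharp
+static string? TryReadBuildIdFromNotes(
+    BinaryReader br, List<ProgramHeader> phdrs, bool little, bool is64)
+{
+    // is64 is deliberately unused: note records share one layout in both ELF classes.
+    static uint Pad(uint size) => (4 - (size & 3)) & 3; // 4-byte note padding
+
+    foreach (var ph in phdrs.Where(p => p.Type == 4)) // PT_NOTE
+    {
+        br.BaseStream.Seek((long)ph.Offset, SeekOrigin.Begin);
+        long remaining = (long)ph.FileSize;
+        while (remaining >= 12)
+        {
+            uint namesz = ReadUInt32(br, little);
+            uint descsz = ReadUInt32(br, little);
+            uint type   = ReadUInt32(br, little);
+
+            byte[] nameBytes = br.ReadBytes((int)namesz);
+            br.BaseStream.Seek(Pad(namesz), SeekOrigin.Current);
+            byte[] desc = br.ReadBytes((int)descsz);
+            br.BaseStream.Seek(Pad(descsz), SeekOrigin.Current);
+            remaining -= 12 + namesz + Pad(namesz) + descsz + Pad(descsz);
+
+            if (type == 3 && Encoding.ASCII.GetString(nameBytes).TrimEnd('\0') == "GNU")
+                return BitConverter.ToString(desc).Replace("-", "").ToLowerInvariant();
+        }
+    }
+    return null;
+}
+```
+
+Note that the padded sizes are subtracted before the match test, so a segment carrying several notes (common on distro builds) is walked without drift.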
+ +### 3.4 Section headers & SONAME + +You need `DT_SONAME` from the dynamic section. Steps: + +1. Read **section headers** from `Shoff` (ELF header). + + Minimal section header model: + + ```csharp + internal sealed class SectionHeader + { + public uint Name; // index into shstrtab + public uint Type; // SHT_* + public ulong Offset; + public ulong Size; + public uint Link; // for some types + } + ``` + + For each section: + + * Read `Name`, `Type`, `Flags` (ignored), `Addr` (ignored), `Offset`, `Size`, `Link`, etc. + * Keep these in an array. + +2. Find the **section header string table** (`shstrtab`): + + * Use `header.Shstrndx` to locate its section header. + * Read that section’s bytes into `shStrTab`. + * Define helper to get section name: + + ```csharp + static string ReadNullTerminatedString(byte[] table, uint offset) + { + int i = (int)offset; + int start = i; + while (i < table.Length && table[i] != 0) i++; + return Encoding.ASCII.GetString(table, start, i - start); + } + ``` + +3. Use `shStrTab` to find: + + * `.dynamic` section (`Type == 6` i.e. `SHT_DYNAMIC`). + * The string table it references (`SectionHeader.Link` → index of the dynamic string table, often `.dynstr`). + +4. Parse the **dynamic section**: + + * `Elf64_Dyn` is array of entries: + + ```text + int64 d_tag; + uint64 d_val; + ``` + + (For 32-bit, both are 4 bytes; you can cast to 64-bit.) + + * For each entry: + + * Read `d_tag` (signed, but you can treat as 64-bit). + * Read `d_val`. + * If `d_tag == 14` (`DT_SONAME`), then `d_val` is an offset into the dynstr string table. + +5. Read `SONAME`: + + * Use dynstr bytes + `d_val` as index, decode null-terminated ASCII → `Soname`. + +If there is no `.dynamic` section or no `DT_SONAME`, set `Soname = null`. + +### 3.5 Mapping `e_machine` to architecture string + +`e_machine` is a numeric code. Map the most common ones: + +```csharp +static string MapMachine(ushort eMachine) => eMachine switch +{ + 3 => "x86", // EM_386 + 62 => "x86_64", // EM_X86_64 + 40 => "arm", // EM_ARM + 183 => "aarch64", // EM_AARCH64 + 8 => "mips", // EM_MIPS + _ => $"unknown({eMachine})" +}; +``` + +### 3.6 Mapping `e_type` to `ElfFileType` + +```csharp +static ElfFileType MapFileType(ushort eType) => eType switch +{ + 1 => ElfFileType.Relocatable, // ET_REL + 2 => ElfFileType.Executable, // ET_EXEC + 3 => ElfFileType.SharedObject,// ET_DYN + 4 => ElfFileType.Core, // ET_CORE + _ => ElfFileType.Unknown +}; +``` + +### 3.7 Fallback: SHA-256 hash + +If build-id is missing: + +```csharp +static string ComputeFileSha256(string path) +{ + using var sha = System.Security.Cryptography.SHA256.Create(); + using var fs = File.OpenRead(path); + var hash = sha.ComputeHash(fs); + return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant(); +} +``` + +Set: + +* `BuildId = null` +* `BuildIdSource = "FileHash"` +* `FileHashSha256 = computedHash` + +--- + +## 4. Implementation Skeleton (ElfReader) + +Here’s a compact skeleton tying it together: + +```csharp +public static class ElfReader +{ + public static ElfMetadata ReadMetadata(string path) + { + using var fs = File.OpenRead(path); + using var br = new BinaryReader(fs); + + // 1. Read e_ident + byte[] ident = br.ReadBytes(16); + if (ident.Length < 16 || + ident[0] != 0x7F || ident[1] != (byte)'E' || + ident[2] != (byte)'L' || ident[3] != (byte)'F') + { + throw new InvalidDataException("Not an ELF file."); + } + + bool is64 = ident[4] == 2; // EI_CLASS + bool little = ident[5] == 1; // EI_DATA + + // 2. 
Read header
+        var header = ReadElfHeader(br, ident, is64, little);
+
+        // 3. Read program headers
+        var phdrs = ReadProgramHeaders(br, header, is64, little);
+
+        // 4. Extract build-id from PT_NOTE
+        string? buildId = TryReadBuildIdFromNotes(br, phdrs, little, is64);
+
+        // 5. Read SONAME from .dynamic
+        string? soname = TryReadSoname(br, header, is64, little);
+
+        // 6. Map machine & type
+        string machine = MapMachine(header.Machine);
+        ElfFileType fileType = MapFileType(header.Type);
+
+        // 7. Hash fallback
+        string? fileHash = null;
+        string source;
+        if (buildId is null)
+        {
+            fileHash = ComputeFileSha256(path);
+            source = "FileHash";
+        }
+        else
+        {
+            source = "NT_GNU_BUILD_ID";
+        }
+
+        return new ElfMetadata
+        {
+            Path = path,
+            Soname = soname,
+            BuildId = buildId,
+            BuildIdSource = source,
+            FileType = fileType,
+            Machine = machine,
+            Is64Bit = is64,
+            IsLittleEndian = little,
+            FileHashSha256 = fileHash
+        };
+    }
+
+    // ... implement ReadElfHeader, ReadProgramHeaders,
+    // TryReadBuildIdFromNotes, TryReadSoname, MapMachine,
+    // MapFileType, ComputeFileSha256, + endian helpers ...
+}
+```
+
+I didn’t expand *every* helper to keep this readable, but all helpers follow exactly the rules in section 3.
+
+---
+
+## 5. Process Scanner Spec (Linux)
+
+### 5.1 Reading `/proc/<pid>/maps`
+
+Each line looks roughly like:
+
+```text
+7f2d9c214000-7f2d9c234000 r--p 00000000 08:01 1234567   /usr/lib/x86_64-linux-gnu/libssl.so.3
+```
+
+Last field is the file path, if any.
+
+**Algorithm:**
+
+```csharp
+public static class ElfProcessScanner
+{
+    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null)
+    {
+        int actualPid = pid ?? Environment.ProcessId;
+        string mapsPath = $"/proc/{actualPid}/maps";
+
+        if (!File.Exists(mapsPath))
+            throw new PlatformNotSupportedException("Only supported on Linux with /proc.");
+
+        var paths = new HashSet<string>(StringComparer.Ordinal);
+        foreach (var line in File.ReadLines(mapsPath))
+        {
+            int idx = line.IndexOf('/');
+            if (idx < 0)
+                continue;
+
+            string p = line.Substring(idx).Trim();
+            if (p.StartsWith("["))
+                continue; // skip [heap], [vdso], etc.
+
+            if (!File.Exists(p))
+                continue;
+
+            paths.Add(p); // HashSet de-duplicates
+        }
+
+        var result = new List<ElfMetadata>();
+        foreach (var p in paths)
+        {
+            if (!IsElfFile(p))
+                continue;
+
+            try
+            {
+                var meta = ElfReader.ReadMetadata(p);
+                result.Add(meta);
+            }
+            catch
+            {
+                // swallow or log; not all mapped files are valid ELF
+            }
+        }
+
+        return result;
+    }
+
+    private static bool IsElfFile(string path)
+    {
+        try
+        {
+            using var fs = File.OpenRead(path);
+            Span<byte> magic = stackalloc byte[4];
+            if (fs.Read(magic) != 4) return false;
+            return magic[0] == 0x7F && magic[1] == (byte)'E' &&
+                   magic[2] == (byte)'L' && magic[3] == (byte)'F';
+        }
+        catch { return false; }
+    }
+}
+```
+
+This is simple and robust. If you later want **even more accurate** results (e.g., also non-file-backed shared objects), you can add a P/Invoke path that uses `dl_iterate_phdr`, but `/proc/<pid>/maps` gets you the SBOM-relevant modules.
+
+---
+
+## 6. JSON / SBOM Integration (Optional but Recommended)
+
+When you serialize `ElfMetadata` into your runtime evidence / graph, I’d recommend a nested ELF block, e.g.:
+
+```json
+{
+  "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
+  "soname": "libssl.so.3",
+  "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
+  "elf": {
+    "build_id": "a1b2c3d4e5f6...",
+    "build_id_source": "NT_GNU_BUILD_ID",
+    "file_type": "SharedObject",
+    "machine": "x86_64",
+    "is_64bit": true,
+    "is_little_endian": true,
+    "file_hash_sha256": null
+  }
+}
+```
+
+You can keep `purl` on the higher-level module object; `build_id` becomes the primary key for binary-accurate joins.
+
+---
+
+## 7. Testing Checklist
+
+For a C# dev implementing this, I’d suggest these tests:
+
+1. **Basic ELF detection**
+
+   * Provide a non-ELF file → `ReadMetadata` throws.
+   * Provide `/bin/ls` (or similar) → parses successfully.
+
+2. **32-bit vs 64-bit**
+
+   * If you have a 32-bit ELF on the system, verify `Is64Bit` and the basic fields.
+
+3. **Build-id extraction**
+
+   * Compare `ReadMetadata(path).BuildId` against `readelf -n path | grep 'Build ID:'`.
+   * Ensure they match.
+
+4. **SONAME extraction**
+
+   * For a shared library, ensure `Soname` matches `readelf -d path | grep SONAME`.
+
+5. **Hash fallback**
+
+   * On an ELF without a build-id (strip one or use a test file), verify `BuildId == null` and `FileHashSha256 != null`.
+
+6. **Process scanner**
+
+   * Call `GetProcessModules()` in your test app; ensure it returns at least:
+
+     * your own executable
+     * core system libs (`libc.so.6`, `ld-linux*`, etc.)
+
+---
+
+If you’d like, the next step I can take is to expand any of the helper methods (e.g. `ReadElfHeader` or `TryReadSoname`) into complete C# code, but this should be enough as a full spec & directions for a C# dev to implement the ELF build-id collector.
+You’re already in “pretty good” territory with the current spec. To get to “best in class,” I’d tighten it in a few dimensions: **API design, correctness, performance, observability, testing, and extensibility**.
+
+I’ll walk through concrete ways to upgrade what we already have, without rewriting everything from scratch.
+
+---
+
+## 1. Sharpen the public API
+
+### 1.1 Split responsibilities into clear layers
+
+Right now we have:
+
+* `ElfReader.ReadMetadata(string path)`
+* `ElfProcessScanner.GetProcessModules(int? pid = null)`
+
+For a best‑in‑class library, I’d explicitly layer things:
+
+```csharp
+public interface IElfParser
+{
+    ElfMetadata Parse(Stream stream, string? pathHint = null);
+}
+
+public interface IElfFileInspector
+{
+    ElfMetadata InspectFile(string path);
+}
+
+public interface IElfProcessInspector
+{
+    IReadOnlyList<ElfMetadata> GetProcessModules(ElfProcessScanOptions? options = null);
+}
+```
+
+With default implementations:
+
+* `ElfParser` – a pure, stateless binary parser (no file I/O).
+* `ElfFileInspector` – wraps `ElfParser` + the file system.
+* `ElfProcessInspector` – wraps `/proc/<pid>/maps` (and optionally `dl_iterate_phdr`).
+
+This makes testing simpler (you can feed a `MemoryStream`) and keeps “how we read” decoupled from “how we parse.”
+
+### 1.2 Options objects & async variants
+
+Give users knobs and modern .NET ergonomics:
+
+```csharp
+public sealed class ElfProcessScanOptions
+{
+    public int? Pid { get; init; }
MaxFiles { get; init; } // safety valve on huge systems +} + +public static class ElfProcessScanner +{ + public static IReadOnlyList GetProcessModules( + ElfProcessScanOptions? options = null); + + public static IAsyncEnumerable GetProcessModulesAsync( + ElfProcessScanOptions? options = null, + CancellationToken cancellationToken = default); +} +``` + +Same for file scans: + +```csharp +public sealed class ElfFileScanOptions +{ + public bool ComputeFileHashWhenBuildIdPresent { get; init; } = false; + public bool ThrowOnNonElf { get; init; } = true; +} + +public static ElfMetadata ReadMetadata( + string path, + ElfFileScanOptions? options = null); +``` + +### 1.3 Strong types for identity + +Instead of `string BuildId`, add a value type: + +```csharp +public readonly struct ElfBuildId : IEquatable +{ + public string HexString { get; } // "a1b2c3..." + public string DebugPathComponent => $"{HexString[..2]}/{HexString[2..]}"; + + // Parse, TryParse, equality, GetHashCode, etc. +} +``` + +Then in `ElfMetadata`: + +```csharp +public ElfBuildId? BuildId { get; init; } // nullable +public string BuildIdSource { get; init; } // "NT_GNU_BUILD_ID" | "FileHash" | "None" +``` + +This prevents subtle bugs from string normalization and gives you the debuginfod‑style path precomputed. + +--- + +## 2. Make parsing spec‑accurate & robust + +### 2.1 Handle both PT_NOTE and SHT_NOTE `.note.gnu.build-id` + +Many binaries place build‑id in: + +* `PT_NOTE` segments **and/or** +* a section named `.note.gnu.build-id` (`SHT_NOTE`) + +Your spec only mentions `PT_NOTE`. For best coverage: + +1. Search all `PT_NOTE` segments for `NT_GNU_BUILD_ID`. +2. If none found, search `SHT_NOTE` sections with name `.note.gnu.build-id`. +3. If both exist and disagree (extremely rare), decide a precedence and log a diagnostic. + +### 2.2 Correct note alignment rules + +Spec nuance: + +* Note *fields* (`namesz`, `descsz`, `type`) are always 4‑byte aligned. +* On 64‑bit, the **overall note segment** may be aligned to 8 bytes, but the internal padding rules still use 4‑byte boundaries. + +Your spec uses `pad = (4 - (size % 4)) & 3`, which is correct, but I’d codify it clearly: + +```csharp +static int NotePadding(int size) => (4 - (size & 3)) & 3; +``` + +And call that everywhere you advance across notes so future maintainers don’t “optimize” it incorrectly. + +### 2.3 Be strict on bounds & corruption + +Add explicit, defensive checks: + +* Do not trust `p_offset` + `p_filesz` blindly. +* Before any read, verify `offset + length <= streamLength`. +* If the file lies about sizes, **fail gracefully** with a structured error. + +E.g.: + +```csharp +public sealed class ElfParseException : Exception +{ + public ElfParseErrorKind Kind { get; } + public string? Detail { get; } + + // ... +} + +public enum ElfParseErrorKind +{ + NotElf, + TruncatedHeader, + TruncatedProgramHeader, + TruncatedSectionHeader, + TruncatedNote, + UnsupportedClass, + UnsupportedEndianess, + IoError, + Unknown +} +``` + +And then: + +```csharp +if (header.Phoff + (ulong)header.Phnum * header.Phentsize > (ulong)fs.Length) + throw new ElfParseException(ElfParseErrorKind.TruncatedProgramHeader, "..."); +``` + +Best‑in‑class means you *never* trust the file, and your errors are debuggable. + +### 2.4 Big‑endian and 32‑bit are first‑class citizens + +Even if your primary target is x86_64 Linux, a robust spec: + +* Fully supports EI_CLASS = 1 and 2 (32/64). +* Fully supports EI_DATA = 1 and 2 (LSB/MSB). 
+* Has tests for at least one big‑endian ELF (e.g., sample artifacts in your test assets).
+
+Your current spec *mentions* big-endian, but I’d explicitly require:
+
+* A generic `EndianBinaryReader` abstraction that:
+
+  * Wraps a `Stream`
+  * Exposes `ReadUInt16/32/64`, `ReadInt64`, `ReadBytes` with endianness.
+
+---
+
+## 3. Performance & scale improvements
+
+### 3.1 Avoid full-file reads by design
+
+Your current design lets devs accidentally hash everything or read all sections even when not needed.
+
+Refine the spec so that the **default path** is minimal I/O:
+
+* Read the ELF header.
+* Read program headers.
+* Read only:
+
+  * PT_NOTE ranges
+  * Section headers (once)
+  * `.shstrtab`, `.dynamic`, and its dynstr.
+
+Only compute SHA‑256 when expressly configured (via `ElfFileScanOptions.ComputeFileHashWhenBuildIdPresent` or `ComputeFileHashWhenBuildIdMissing`).
+
+### 3.2 Optional memory‑mapped mode
+
+For very large scans (filesystem crawls, containers), allow a mode that uses `MemoryMappedFile`:
+
+```csharp
+public sealed class ElfReaderOptions
+{
+    public bool UseMemoryMappedFile { get; init; } = false;
+}
+```
+
+Internally, you can spec that the implementation:
+
+* Uses `MemoryMappedFile.CreateFromFile`
+* Creates views over relevant ranges (header, program headers, etc.)
+* Avoids multiple OS reads for repeated random access.
+
+### 3.3 Parallel directory / image scanning
+
+If you foresee scanning whole images or file trees, define a helper:
+
+```csharp
+public static class ElfDirectoryScanner
+{
+    public static IReadOnlyList<ElfMetadata> Scan(
+        string rootDirectory,
+        ElfDirectoryScanOptions? options = null);
+
+    public static IAsyncEnumerable<ElfMetadata> ScanAsync(
+        string rootDirectory,
+        ElfDirectoryScanOptions? options = null,
+        CancellationToken cancellationToken = default);
+}
+
+public sealed class ElfDirectoryScanOptions
+{
+    public SearchOption SearchOption { get; init; } = SearchOption.AllDirectories;
+    public int MaxDegreeOfParallelism { get; init; } = Environment.ProcessorCount;
+    public Func<string, bool>? PathFilter { get; init; }   // e.g., skip /proc, /sys
+}
+```
+
+And explicitly say that the implementation:
+
+* Uses `Parallel.ForEach` (or `Parallel.ForEachAsync`, available since .NET 6) with bounded parallelism.
+* Shares a single `ElfParser` across threads (it’s stateless).
+* De‑dups by `(device, inode)` when possible (see below).
+
+---
+
+## 4. Process scanner: correctness & completeness
+
+### 4.1 De‑duplication by inode, not just path
+
+The current spec de‑dups only by path. On Linux:
+
+* The same inode may have multiple paths (hard links, bind mounts, chroot/container overlays).
+
+For best‑in‑class accuracy of “unique binaries,” spec:
+
+* De‑duplicate entries by `(st_dev, st_ino)` from `stat(2)`, not just the string path.
+* Provide both views: unique by file identity and by path.
+
+API example:
+
+```csharp
+public sealed class ElfProcessModules
+{
+    public IReadOnlyList<ElfMetadata> UniqueFiles { get; init; }       // dedup by inode
+    public IReadOnlyList<ElfModuleInstance> Instances { get; init; }   // per mapping
+}
+
+public sealed class ElfModuleInstance
+{
+    public ElfMetadata Metadata { get; init; }
+    public string Path { get; init; }
+    public string? MappingRange { get; init; }   // "7f2d9c214000-7f2d9c234000"
+}
+```
+
+And `ElfProcessScanner.GetProcessModules` returns an `ElfProcessModules`, not just a flat list.
+
+### 4.2 Optional `dl_iterate_phdr` P/Invoke path
+
+For a “maximum correctness” mode, you can specify:
+
+* A secondary implementation that uses `dl_iterate_phdr` via P/Invoke.
+* This gives you module base addresses and sometimes more consistent views across distros.
+* You can hybridize: use `/proc/<pid>/maps` for path enumeration and `dl_iterate_phdr` to confirm loaded segments (future feature).
+
+You don’t **have** to implement it day one, but the spec can carve out an extension point:
+
+```csharp
+public enum ElfProcessModuleSource
+{
+    ProcMaps,
+    DlIteratePhdr
+}
+
+public sealed class ElfProcessScanOptions
+{
+    public ElfProcessModuleSource Source { get; init; } = ElfProcessModuleSource.ProcMaps;
+}
+```
+
+And define behavior if the requested source isn’t available.
+
+---
+
+## 5. Observability & diagnostics
+
+Best‑in‑class libraries are easy to debug.
+
+### 5.1 Structured diagnostics on parse failures
+
+Instead of “swallow or log” in the scanner, define:
+
+```csharp
+public sealed class ElfScanResult
+{
+    public IReadOnlyList<ElfMetadata> Successes { get; init; }
+    public IReadOnlyList<ElfScanError> Errors { get; init; }
+}
+
+public sealed class ElfScanError
+{
+    public string Path { get; init; }
+    public ElfParseErrorKind Kind { get; init; }
+    public string Message { get; init; }
+}
+```
+
+And make `ElfProcessScanner.GetProcessModules` optionally return `ElfScanResult` (or have an overload).
+
+This way you can:
+
+* Report how many files failed.
+* See common misconfigurations (e.g., insufficient permissions, truncated files).
+
+### 5.2 Logging hooks instead of hard-coded logging
+
+Don’t bake in a logging framework, but add a hook:
+
+```csharp
+public interface IElfLogger
+{
+    void Debug(string message);
+    void Info(string message);
+    void Warn(string message);
+    void Error(string message, Exception? ex = null);
+}
+
+public sealed class ElfReaderOptions
+{
+    public IElfLogger? Logger { get; init; }
+}
+```
+
+Then use it for “soft failures” (skipping non‑ELF files, ignoring suspect sections, etc.).
+
+---
+
+## 6. Security & safety considerations
+
+### 6.1 Treat inputs as untrusted
+
+State explicitly in the spec that:
+
+* No ELF is ever loaded or executed.
+* No ld.so / dynamic loading is used: all reading is via `FileStream` / `MemoryMappedFile`.
+* No writes occur to inspected paths.
+
+### 6.2 Control resource usage
+
+For environments scanning untrusted file trees (e.g., user uploads):
+
+* Have configurable caps on:
+
+  * `MaxFileSizeBytes` to parse.
+  * `MaxNotesPerSegment` / `MaxSections` to avoid pathological “zip bomb” style ELFs.
+* Fail with `ElfParseErrorKind.TruncatedHeader` or one of the `Unsupported*` kinds rather than exhausting RAM.
+
+---
+
+## 7. Testing & validation: make it part of the spec
+
+Instead of just “add tests,” bake them in as requirements.
+
+### 7.1 Golden tests vs `readelf` or `llvm-readobj`
+
+Define that CI must include:
+
+* For a set of ELFs (32‑bit, 64‑bit, big‑endian, stripped, PIE, static):
+
+  * Compare `ElfMetadata.BuildId` with `readelf -n` output.
+  * Compare `ElfMetadata.Soname` with `readelf -d` / `objdump -p`.
+
+You don’t need to name the exact tools in the API, but the spec can say:
+
+> The library’s test suite **must** cross‑validate build‑id and SONAME values against a trusted system tool (such as `readelf` or `llvm-readobj`) for a curated set of binaries.
+
+### 7.2 Fuzzing & corruption tests
+
+Add:
+
+* A small fuzz harness that:
+
+  * Mutates bytes in real ELF samples.
+  * Feeds them to `ElfParser`.
+  * Asserts: no crashes, only `ElfParseException`s.
+
+This directly supports the “never trust input” goal.
+
+### 7.3 Regression fixtures
+
+Check in a `testdata/` folder with:
+
+* Minimal 32‑bit/64‑bit ELF with build‑id.
+* Minimal ELF without build‑id.
+* Shared library with SONAME.
+* Big‑endian sample.
+
+---
+
+## 8. Extensibility hooks (future-friendly)
+
+Even if you only care about Linux/ELF today, you can design with “other formats later” in mind.
+
+### 8.1 Generalized module metadata interface
+
+```csharp
+public interface IModuleMetadata
+{
+    string Path { get; }
+    string? Soname { get; }
+    string? BuildId { get; }
+    string Format { get; }   // "ELF", "PE", "MachO"
+}
+```
+
+`ElfMetadata` implements `IModuleMetadata`. That way, a future `PeMetadata` or `MachOMetadata` can slot into the same pipelines.
+
+### 8.2 Integration with SBOM & VEX
+
+Add a tiny, optional interface that lines up with your SBOM graph:
+
+```csharp
+public interface IHasPackageCoordinates
+{
+    string? Purl { get; }
+}
+
+public sealed partial class ElfMetadata : IHasPackageCoordinates
+{
+    public string? Purl { get; init; }   // populated by your higher-layer resolver
+}
+```
+
+The ELF layer doesn’t know how to compute `Purl`, but it gives a spot for higher layers to attach it without wrapping everything in another type.
+
+---
+
+## 9. Documentation & usage examples
+
+Finally, “best in class” is as much about *developer experience* as code.
+
+Your spec should require:
+
+* XML docs on all public types/members (shown in IntelliSense).
+* Samples:
+
+  * “Read build‑id from a single file”
+  * “Enumerate current process modules and print build‑ids”
+  * “Scan a container filesystem for unique ELFs and dump JSON” (sketched after the TL;DR below)
+
+For example:
+
+```csharp
+// Example: dump all unique modules for the current process
+var modules = ElfProcessScanner.GetProcessModules();
+foreach (var m in modules.UniqueFiles)
+{
+    Console.WriteLine($"{m.Path} | SONAME={m.Soname} | BUILD-ID={m.BuildId?.HexString ?? "<none>"}");
+}
+```
+
+---
+
+## TL;DR: What to actually change in your current spec
+
+If you just want a concrete checklist:
+
+1. **Refine API**
+
+   * Introduce `ElfBuildId` struct, options objects, async variants.
+   * Split parser vs file/process scanners.
+
+2. **Parsing correctness**
+
+   * Support build‑id in both PT_NOTE and `.note.gnu.build-id`.
+   * Add strict bounds checks and `ElfParseException` with `ElfParseErrorKind`.
+   * Treat big‑endian & 32‑bit as first‑class.
+
+3. **Performance**
+
+   * Make full file hashing opt‑in.
+   * Avoid unnecessary section reads.
+   * Add optional memory‑mapped mode.
+
+4. **Process scanner**
+
+   * De‑dup by inode, not just path.
+   * Return both unique files and per‑mapping instances.
+   * Add structured error reporting (successes + failures).
+
+5. **Testing & security**
+
+   * Mandate cross‑validation vs `readelf`.
+   * Add fuzz/corruption tests.
+   * Add resource caps (max file size, max sections/notes).
+
+If you’d like, I can next **rewrite the public C# surface** (interfaces, classes, XML docs) in one place with all of these improvements baked in, so your team can just drop it into a project and fill in the internals.
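+As a concrete taste of the surface sketched above, here is a minimal usage sketch for the third sample (“scan a container filesystem for unique ELFs and dump JSON”). It leans on the `ElfDirectoryScanner` / `ElfDirectoryScanOptions` shapes proposed in section 3.3, which do not exist yet; the root path is a placeholder:
+
+```csharp
+using System;
+using System.Text.Json;
+
+// Bound parallelism and skip pseudo-filesystems (options proposed in section 3.3).
+var options = new ElfDirectoryScanOptions
+{
+    MaxDegreeOfParallelism = 4,
+    PathFilter = path => !path.StartsWith("/proc") && !path.StartsWith("/sys")
+};
+
+// "/mnt/rootfs" is a placeholder for an unpacked container filesystem.
+var modules = ElfDirectoryScanner.Scan("/mnt/rootfs", options);
+
+// Dump path / SONAME / build-id metadata as indented JSON.
+Console.WriteLine(JsonSerializer.Serialize(
+    modules,
+    new JsonSerializerOptions { WriteIndented = true }));
+```
+
+Because `Scan` returns plain `ElfMetadata` records, serializing the result directly gives you an evidence file that can be attached to a scan bundle.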
diff --git a/docs/product-advisories/archived/20-Nov-2026 - Branch · Model .init_array Constructors as Reachability Roots.md b/docs/product-advisories/archived/20-Nov-2025 - Branch · Model .init_array Constructors as Reachability Roots.md
similarity index 96%
rename from docs/product-advisories/archived/20-Nov-2026 - Branch · Model .init_array Constructors as Reachability Roots.md
rename to docs/product-advisories/archived/20-Nov-2025 - Branch · Model .init_array Constructors as Reachability Roots.md
index f26f072e5..106903742 100644
--- a/docs/product-advisories/archived/20-Nov-2026 - Branch · Model .init_array Constructors as Reachability Roots.md
+++ b/docs/product-advisories/archived/20-Nov-2025 - Branch · Model .init_array Constructors as Reachability Roots.md
@@ -1,768 +1,768 @@
+Here’s a quick, practical heads‑up about **binary initialization routines** and why they matter for reachability and vuln triage.
+
+---
+
+### What’s happening before `main()`
+
+In ELF binaries/shared objects, the runtime linker runs **constructors** *before* `main()`:
+
+* `.preinit_array` → runs first (rare, but highest priority)
+* `.init_array` → common place for constructors (ordered by index)
+* Legacy sections: `.init` (function) and `.ctors` (older toolchains)
+* On exit you also have `.fini_array` / `.fini`
+
+These constructors can:
+
+* Register signal/atexit handlers
+* Start threads, open sockets/files, tweak `LD_PRELOAD` hooks
+* Call library code you assumed was only used later
+
+So if you’re doing **call‑graph reachability** for vulnerability impact, starting from only `main()` (or exported APIs) can **miss real edges** that execute at load time.
+
+---
+
+### What to model (synthetic roots)
+
+Treat the following as **synthetic entry points** in your graph:
+
+1. All function pointers in `.preinit_array`
+2. All function pointers in `.init_array`
+3. The symbol `_init` (if present) and legacy `.ctors` entries
+4.
For completeness on teardown paths: `.fini_array`, `_fini` +5. **Dynamic loader interposition**: if `DT_NEEDED` libs have their own constructors, they’re roots too (even if you never call them explicitly) + +For PIE/DSO builds, remember that every loaded **dependency’s** init arrays run as part of `dlopen()`/program start—model those edges across DSOs. + +--- + +### How to extract quickly + +* **Static parse**: read `PT_DYNAMIC`, then `DT_PREINIT_ARRAY`, `DT_INIT_ARRAY`, their sizes; iterate pointers and add edges to your graph. +* **Symbol fallback**: if `DT_INIT`/`_init` exists, add it as a root. +* **Ordering**: preserve index order inside arrays (it can matter). +* **Relocations**: resolve `R_X86_64_RELATIVE` (etc.) so pointers point to the real code addresses. + +Mini‑C example (constructor runs pre‑main): + +```c +static void __attribute__((constructor)) boot(void) { + // vulnerable call here executes before main() +} +int main(){ return 0; } +``` + +--- + +### For Stella Ops (binary reachability) + +* **Graph seeds**: `roots = { init arrays of main ELF + all DT_NEEDED DSOs }` +* **Policy**: mark edges from these roots as `phase=load` vs `phase=runtime`, so your explainer can say “reachable at load time.” +* **PURLs**: attach edges to the package/node that owns the constructor symbol (DSO package purl), not just the main app. +* **Attestation**: store the discovered root list (addresses + resolved symbols + DSO soname) in your deterministic scan manifest, so audits can replay it. +* **Heuristics**: if `dlopen()` is detected statically (strings/symbols), add a potential root “DLOPEN_INIT[*]” bucket for libs found under common plugin dirs. + +--- + +### Quick checklist + +* [ ] Parse `.preinit_array`, `.init_array`, `.init` (and legacy `.ctors`) +* [ ] Resolve relocations; preserve order +* [ ] Seed graph with these as **synthetic roots** +* [ ] Include constructors of every `DT_NEEDED` DSO +* [ ] Tag edges as `phase=load` for prioritization/explainability +* [ ] Persist root list in the scan’s evidence bundle + +If you want, I can drop in a tiny .NET/ELF parser snippet or a Rust routine that walks `DT_INIT_ARRAY` and returns symbol‑resolved roots next. +Here’s a concrete, C#‑oriented spec you can hand to a developer to implement ELF init/constructor discovery and plug it into a reachability engine like Stella Ops. + +I’ll structure it like an internal design doc: + +1. What we need to do +2. Public API (what the rest of the system calls) +3. ELF parsing details (minimal, but correct) +4. Constructor / init routine discovery algorithm +5. Dynamic deps (DT_NEEDED) and load‑time roots +6. Integration with the call graph / reachability +7. Attestation / evidence output +8. Testing strategy + +--- + +## 1. 
Goal / Requirements
+
+**Business goal**
+
+When scanning ELF binaries and shared libraries, we must model functions that run **before `main()`** or at **library load/unload** as *synthetic entry points* in the call graph:
+
+* `.preinit_array` (pre‑init constructors)
+* `.init_array` (constructors)
+* Legacy constructs:
+
+  * `.ctors` array
+  * `_init` (via `DT_INIT`)
+* For teardown (optional but recommended):
+
+  * `.fini_array`
+  * `_fini` (via `DT_FINI`)
+
+**We must:**
+
+* Discover all these routines in:
+
+  * The main executable
+  * All its `DT_NEEDED` shared libraries (and any DSOs subsequently loaded, if we scan them)
+* Represent them as **roots** in the reachability graph:
+
+  * `phase = Load` for preinit/init/constructors
+  * `phase = Unload` for finalizers
+* Resolve each routine to:
+
+  * Owning binary path and SONAME
+  * Virtual address in the ELF
+  * Best‑effort symbol name (`_ZN...`, `my_ctor`, etc.)
+  * Order/index within its array (to preserve call order)
+* Emit a structured **evidence/attestation** record so scans are replayable.
+
+---
+
+## 2. Public API (C#)
+
+### 2.1 Data model
+
+Create a small domain model in a library, e.g. `StellaOps.ElfInit`:
+
+```csharp
+namespace StellaOps.ElfInit;
+
+public enum InitRoutineKind
+{
+    PreInitArray,
+    InitArray,
+    LegacyCtorsSection,
+    LegacyInitSymbol,
+    FiniArray,
+    LegacyFiniSymbol
+}
+
+public enum InitPhase
+{
+    Load,
+    Unload
+}
+
+public sealed record InitRoutineRoot(
+    string BinaryPath,       // Full path on disk
+    string? Soname,          // From DT_SONAME if present
+    InitRoutineKind Kind,
+    InitPhase Phase,
+    ulong VirtualAddress,    // VA within this ELF
+    ulong? FileOffset,       // File offset (if resolved), null if unknown
+    string? SymbolName,      // Best-effort name from symbol table
+    int? ArrayIndex          // Index for array-based roots
+);
+```
+
+### 2.2 Discovery service
+
+Public entry point that other components use:
+
+```csharp
+public interface IInitRoutineDiscovery
+{
+    /// <summary>
+    /// Discover load/unload routines (constructors) in a single ELF file
+    /// and, optionally, in its DT_NEEDED dependencies.
+    /// </summary>
+    InitDiscoveryResult Discover(string elfPath, InitDiscoveryOptions options);
+}
+
+public sealed record InitDiscoveryOptions
+{
+    /// <summary>
+    /// If true, also discover init routines in DT_NEEDED shared libraries
+    /// (using IElfDependencyResolver to locate them on disk).
+    /// </summary>
+    public bool IncludeDependencies { get; init; } = true;
+
+    /// <summary>
+    /// If true, include fini routines (.fini_array, DT_FINI, etc.)
+    /// as unload-phase roots.
+    /// </summary>
+    public bool IncludeUnloadPhase { get; init; } = true;
+}
+
+public sealed record InitDiscoveryResult(
+    IReadOnlyList<InitRoutineRoot> Roots,
+    IReadOnlyList<InitRoutineError> Errors   // non-fatal problems per binary
+);
+
+public sealed record InitRoutineError(
+    string BinaryPath,
+    string Message,
+    Exception? Exception = null
+);
+```
+
+### 2.3 Dependency resolution
+
+We don’t hard‑code how to find `DT_NEEDED` libraries on disk. Define an abstraction:
+
+```csharp
+public interface IElfDependencyResolver
+{
+    /// <summary>
+    /// Resolve a SONAME (e.g. "libc.so.6") to a local file path.
+    /// Returns null if not found.
+    /// </summary>
+    string? ResolveLibrary(string soname, string referencingBinaryPath);
+}
+```
+
+The implementation can respect `LD_LIBRARY_PATH`, typical system dirs, container images, etc., but that’s outside this spec.
+
+`IInitRoutineDiscovery` will depend on:
+
+* `IElfParser`
+* `IElfDependencyResolver`
+* `ISymbolResolver` (symbol tables)
+
+---
+
+## 3. ELF Parsing Spec (C#‑friendly)
+
+You can either use a NuGet ELF library or implement a minimal in‑house parser. This spec assumes a **minimal custom parser** that supports:
+
+* ELF64, little‑endian
+* ET_EXEC, ET_DYN
+* x86‑64 (`e_machine == EM_X86_64`) as v1; keep architecture pluggable for later
+
+### 3.1 Core types
+
+Create an internal parser namespace, e.g. `StellaOps.Elf`:
+
+```csharp
+internal sealed class ElfFile
+{
+    public string Path { get; }
+    public ElfClass ElfClass { get; }
+    public ElfEndianness Endianness { get; }
+    public ElfHeader Header { get; }
+    public IReadOnlyList<ProgramHeader> ProgramHeaders { get; }
+    public IReadOnlyList<SectionHeader> SectionHeaders { get; }
+    public DynamicSection? Dynamic { get; }
+
+    public ReadOnlyMemory<byte> RawBytes { get; }
+
+    // Helper: mapping VA -> file offset using PT_LOAD segments
+    public bool TryMapVaToFileOffset(ulong virtualAddress, out ulong fileOffset);
+}
+
+internal enum ElfClass { Elf32, Elf64 }
+internal enum ElfEndianness { Little, Big }
+
+// Fill out ElfHeader / ProgramHeader / SectionHeader / DynamicEntry types
+```
+
+Implementation notes:
+
+* Read the ELF header:
+
+  * Validate magic: `0x7F 'E' 'L' 'F'`
+  * `EI_CLASS` → 32/64‑bit
+  * `EI_DATA` → endianness
+* Read **program headers** (`e_phoff`, `e_phnum`).
+
+  * Identify `PT_LOAD` (for VA→file mapping).
+  * Identify `PT_DYNAMIC` (for `DynamicSection`).
+* Read **section headers** (`e_shoff`, `e_shnum`).
+
+  * Identify sections by name: `.preinit_array`, `.init_array`, `.fini_array`, `.ctors`.
+  * You need the section name string table `.shstrtab` to decode names.
+
+### 3.2 Dynamic section parsing
+
+Define the dynamic section model:
+
+```csharp
+internal sealed class DynamicSection
+{
+    public IReadOnlyList<DynamicEntry> Entries { get; }
+    public ulong? InitFunction { get; }         // DT_INIT
+    public ulong? FiniFunction { get; }         // DT_FINI
+    public ulong? InitArrayAddress { get; }     // DT_INIT_ARRAY
+    public ulong? InitArraySize { get; }        // DT_INIT_ARRAYSZ
+    public ulong? FiniArrayAddress { get; }     // DT_FINI_ARRAY
+    public ulong? FiniArraySize { get; }        // DT_FINI_ARRAYSZ
+    public ulong? PreInitArrayAddress { get; }  // DT_PREINIT_ARRAY
+    public ulong? PreInitArraySize { get; }     // DT_PREINIT_ARRAYSZ
+
+    public string? Soname { get; }                // DT_SONAME (decoded via DT_STRTAB)
+    public IReadOnlyList<string> Needed { get; }  // DT_NEEDED list
+
+    public ulong? StrTabAddress { get; }
+    public ulong? SymTabAddress { get; }
+    public ulong? StrTabSize { get; }
+}
+```
+
+Implementation details:
+
+* Dynamic entries are at `PT_DYNAMIC.p_offset`, each an `Elf64_Dyn`:
+
+  * `d_tag` (signed 64‑bit)
+  * `d_un` union (`d_val` or `d_ptr`, treat as `ulong`)
+
+* Tags of interest (values are from the ELF spec):
+
+  * `DT_NULL = 0`
+  * `DT_NEEDED = 1`
+  * `DT_STRTAB = 5`
+  * `DT_SYMTAB = 6`
+  * `DT_STRSZ = 10`
+  * `DT_INIT = 12`
+  * `DT_FINI = 13`
+  * `DT_SONAME = 14`
+  * `DT_INIT_ARRAY = 25`
+  * `DT_FINI_ARRAY = 26`
+  * `DT_INIT_ARRAYSZ = 27`
+  * `DT_FINI_ARRAYSZ = 28`
+  * `DT_PREINIT_ARRAY = 32`
+  * `DT_PREINIT_ARRAYSZ = 33`
+
+* To decode SONAME and NEEDED:
+
+  * Use `DT_STRTAB` as the base VA of the dynamic string table.
+  * Map the VA to a file offset with `TryMapVaToFileOffset`.
+  * For each `DT_NEEDED` / `DT_SONAME`, treat `d_val` as an offset into that string table; read a null‑terminated UTF‑8 C‑string.
+
+---
+
+## 4. Constructor & Init Routine Discovery
+
+We now define the algorithm implemented by `InitRoutineDiscovery` for a **single ELF file**.
+
+High‑level steps:
+
+1. Parse `ElfFile`.
+2. Parse `DynamicSection`.
+3. Resolve:
+
+   * Pre‑init array (`DT_PREINIT_ARRAY`, `.preinit_array`)
+   * Init array (`DT_INIT_ARRAY`, `.init_array`)
+   * Legacy `.ctors`
+   * `_init`, `_fini` via `DT_INIT`/`DT_FINI`
+   * Fini array (`DT_FINI_ARRAY`, `.fini_array`)
+4. For each VA, optionally resolve a symbol name.
+5. Build `InitRoutineRoot` entries.
+
+### 4.1 Pointer size & endianness
+
+* For ELF64:
+
+  * Pointer size = 8 bytes.
+* For ELF32:
+
+  * Pointer size = 4 bytes (if/when you support it).
+* Use `BinaryPrimitives.ReadUInt64LittleEndian` or `ReadUInt64BigEndian` depending on `ElfEndianness`.
+
+### 4.2 Mapping VA → file offset
+
+`ElfFile.TryMapVaToFileOffset`:
+
+* Iterate `ProgramHeaders` with `p_type == PT_LOAD`.
+* If `virtualAddress` is in `[p_vaddr, p_vaddr + p_memsz)`:
+
+  * `fileOffset = p_offset + (virtualAddress - p_vaddr)`
+* Return false if no matching segment.
+
+### 4.3 Reading init arrays
+
+Generic helper:
+
+```csharp
+internal static IReadOnlyList<ulong> ReadPointerArray(
+    ElfFile elf,
+    ulong arrayVa,
+    ulong arrayBytes)
+{
+    var results = new List<ulong>();
+    if (!elf.TryMapVaToFileOffset(arrayVa, out var fileOffset))
+        return results;
+
+    int pointerSize = elf.ElfClass == ElfClass.Elf64 ? 8 : 4;
+    int count = (int)(arrayBytes / (ulong)pointerSize);
+
+    var span = elf.RawBytes.Span;
+    for (int i = 0; i < count; i++)
+    {
+        ulong offset = fileOffset + (ulong)(i * pointerSize);
+        if (offset + (ulong)pointerSize > (ulong)span.Length)
+            break;
+
+        ulong pointerValue = elf.Endianness switch
+        {
+            ElfEndianness.Little when pointerSize == 8
+                => System.Buffers.Binary.BinaryPrimitives.ReadUInt64LittleEndian(span[(int)offset..]),
+            ElfEndianness.Little
+                => System.Buffers.Binary.BinaryPrimitives.ReadUInt32LittleEndian(span[(int)offset..]),
+            ElfEndianness.Big when pointerSize == 8
+                => System.Buffers.Binary.BinaryPrimitives.ReadUInt64BigEndian(span[(int)offset..]),
+            _ // Big, 32-bit
+                => System.Buffers.Binary.BinaryPrimitives.ReadUInt32BigEndian(span[(int)offset..]),
+        };
+
+        if (pointerValue != 0)
+            results.Add(pointerValue);
+    }
+
+    return results;
+}
+```
+
+Apply it to:
+
+* Pre‑init: if `Dynamic.PreInitArrayAddress` and `Dynamic.PreInitArraySize` are present.
+* Init: if `Dynamic.InitArrayAddress` and `Dynamic.InitArraySize` are present.
+* Fini: if `Dynamic.FiniArrayAddress` and `Dynamic.FiniArraySize` are present.
+
+### 4.4 Legacy `.ctors` section
+
+Fallback for older toolchains:
+
+* Find the section with `Name == ".ctors"`.
+* Its contents are just an array of pointers (same pointer size as the ELF).
+* Some compilers include a sentinel `-1` or `0` at the beginning or end. Treat:
+
+  * `0` or `0xFFFFFFFFFFFFFFFF` (for 64‑bit) as sentinels; skip them.
+* Use similar `ReadPointerArray` logic but starting from `sh_offset` rather than a VA.
+
+### 4.5 `_init` / `_fini` functions
+
+* `Dynamic.InitFunction` (from `DT_INIT`) is a single VA.
+* `Dynamic.FiniFunction` (from `DT_FINI`) likewise.
+
+Even if arrays exist, these may also be present; treat them as **independent roots**.
+
+---
+
+## 5. Symbol Resolution (best‑effort names)
+
+Define the interface:
+
+```csharp
+public interface ISymbolResolver
+{
+    /// <summary>
+    /// Find the symbol whose address matches `virtualAddress` exactly,
+    /// or, if not found, the closest preceding symbol (with an offset).
+    /// </summary>
+    SymbolInfo? ResolveSymbol(ElfFile elf, ulong virtualAddress);
+}
+
+public sealed record SymbolInfo(
+    string Name,
+    ulong Value,
+    ulong Size
+);
+```
+
+Implementation sketch:
+
+* Use `.dynsym` (the dynamic symbol table), and `.symtab` (the full symbol table) if available.
+* Each symbol entry includes:
+
+  * Name offset in the string table
+  * Value (VA)
+  * Size
+  * Type/binding (function, object, etc.)
+* Build an in‑memory index (e.g. sorted by `Value`) per ELF file.
+* `ResolveSymbol`:
+
+  * Prefer an exact match of `Value`.
+  * If none, find the symbol with the largest `Value` less than `virtualAddress` and treat it as “nearest symbol + offset”.
+  * You can show just `Name` or `Name+0xOFFSET` in explanations; for `InitRoutineRoot` we store the plain `Name`.
+
+---
+
+## 6. Dynamic Dependencies & Load-Time Roots
+
+When `InitDiscoveryOptions.IncludeDependencies == true`:
+
+1. For the root ELF:
+
+   * Discover its roots as above.
+2. For each `neededSoname` in `Dynamic.Needed`:
+
+   * Ask `IElfDependencyResolver.ResolveLibrary(neededSoname, rootElfPath)`.
+   * If it returns a path not yet processed:
+
+     * Parse this ELF and recursively discover its roots.
+3. Return a **flat list** of all `InitRoutineRoot` objects, each with its own `BinaryPath`/`Soname`.
+
+Important: **we do not implicitly model `dlopen()`** at this stage. That’s separate:
+
+* As an optional heuristic, if the binary imports `dlopen`, tag those DSOs so later we can add “potential plugin load” roots. You can park this as a TODO in the comments.
+
+---
+
+## 7. Call Graph / Reachability Integration
+
+This depends on your existing modeling, but here’s a generic spec a C# dev can follow.
+
+Assume there is an internal model:
+
+```csharp
+public sealed class CallGraph
+{
+    public Node GetOrCreateNode(string binaryPath, ulong virtualAddress, string? symbolName);
+    public Node GetOrCreateSyntheticRoot(string rootId, string description);
+    public void AddEdge(Node from, Node to, CallEdgeMetadata metadata);
+}
+
+public sealed record CallEdgeMetadata(
+    string EdgeKind,     // e.g. "loader-init"
+    InitPhase Phase,     // Load / Unload
+    InitRoutineKind InitKind,
+    int? ArrayIndex
+);
+```
+
+### 7.1 Synthetic loader node
+
+Create a single graph node representing the dynamic loader / program start:
+
+```csharp
+var loaderNode = callGraph.GetOrCreateSyntheticRoot(
+    "LOADER",
+    "ELF dynamic loader / process start"
+);
+```
+
+### 7.2 Adding edges for each root
+
+For each `InitRoutineRoot root`:
+
+1. Get or create a node for the target function:
+
+   ```csharp
+   var target = callGraph.GetOrCreateNode(
+       root.BinaryPath,
+       root.VirtualAddress,
+       root.SymbolName
+   );
+   ```
+
+2. Add an edge from the loader:
+
+   ```csharp
+   callGraph.AddEdge(
+       loaderNode,
+       target,
+       new CallEdgeMetadata(
+           EdgeKind: "loader-init",
+           Phase: root.Phase,
+           InitKind: root.Kind,
+           ArrayIndex: root.ArrayIndex
+       )
+   );
+   ```
+
+3. Optional: If you model **per‑library** loader nodes, you can add:
+
+   * `LOADER -> libLoaderNode`
+   * `libLoaderNode -> each constructor`
+
+   but that’s a nice‑to‑have, not required.
+
+### 7.3 Phases
+
+* For `.preinit_array`, `.init_array`, `.ctors`, `_init`:
+
+  * `Phase = InitPhase.Load`
+* For `.fini_array`, `_fini`:
+
+  * `Phase = InitPhase.Unload`
+
+This allows downstream UI to say e.g.:
+
+> This vulnerable function is reachable at **load time** via constructor `foo()` in `libbar.so`.
+
+---
+
+## 8. Attestation / Evidence Output
+
+We want deterministic, auditable output per scan.
+
+Define a JSON schema (C# record) stored alongside other scan artifacts:
+
+```csharp
+public sealed record InitRoutineEvidence(
+    string ScannerVersion,
+    DateTimeOffset ScanTimeUtc,
+    IReadOnlyList<InitRoutineEvidenceEntry> Entries
+);
+
+public sealed record InitRoutineEvidenceEntry(
+    string BinaryPath,
+    string?
Soname, + InitRoutineKind Kind, + InitPhase Phase, + ulong VirtualAddress, + ulong? FileOffset, + string? SymbolName, + int? ArrayIndex +); +``` + +Implementation details: + +* After `IInitRoutineDiscovery.Discover` completes: + + * Convert each `InitRoutineRoot` to `InitRoutineEvidenceEntry`. + * Serialize with `System.Text.Json` (property names in camelCase or snake_case; choose a stable convention). +* Store the evidence file e.g. `init_roots.json` inside the scan’s result directory. + +--- + +## 9. Implementation Details & Edge Cases + +### 9.1 Architectures + +First version: + +* Support: + + * `ElfClass.Elf64` + * `ElfEndianness.Little` + * `EM_X86_64` +* For anything else: + + * Log an `InitRoutineError` and skip (but don’t hard‑fail the whole scan). + +Design the parser so architecture is an enum: + +```csharp +internal enum ElfMachine : ushort +{ + X86_64 = 62, + // others later +} +``` + +### 9.2 Relocations (simplification) + +Real loaders apply relocations to constructor arrays; some pointers may be stored as relative relocations. + +For **v1 implementation**: + +* Assume that: + + * Array entries are already absolute VAs in the ELF’s address space (which is typical for non‑PIE or when link‑time addresses are used). +* If you need better fidelity later: + + * Parse `.rela.dyn` / `.rel.dyn`. + * Apply `R_X86_64_RELATIVE` relocations whose `r_offset` falls within the array’s address range: + + * Effective address = (base address + addend); if you treat base as 0, you get a VA that’s correct **within the file** (relative). + +Document this as a TODO so later you can extend without breaking the API. + +### 9.3 Error handling + +* All parsing errors **must be non‑fatal** to the overall scan: + + * Record `InitRoutineError` with `BinaryPath`, message, and exception. + * Continue with other binaries. +* If a binary is not ELF or has invalid magic: + + * Return no roots, but optionally log a low‑severity error. + +--- + +## 10. Testing Strategy + +### 10.1 Unit tests with synthetic ELF fixtures + +Create a small test project `StellaOps.ElfInit.Tests` with known ELF files checked into test resources: + +* Binaries compiled with small C programs like: + + ```c + static void __attribute__((constructor)) c1(void) {} + static void __attribute__((constructor)) c2(void) {} + static void __attribute__((destructor)) d1(void) {} + int main() { return 0; } + ``` + +* Variants: + + * Using `.ctors` (old GCC flags) for legacy coverage. + * Shared library with `__attribute__((constructor))` and `DT_NEEDED` from a main binary. + * Binary with no constructors (expect zero roots). + +Assertions: + +* The count of `InitRoutineRoot` matches expected. +* `Kind` and `Phase` are correct. +* `ArrayIndex` is correctly ordered: 0,1,2 … +* `SymbolName` contains expected mangled function names (if compiler doesn’t drop them). +* For dependencies: + + * Discover roots in `libfoo.so` when main depends on it via `DT_NEEDED`. + +### 10.2 Integration tests with call graph + +* Given a small binary and a known vulnerable function reachable from a constructor: + + * Run full pipeline. + * Assert that the vulnerable function is marked reachable from synthetic `LOADER` node via the constructor. + +### 10.3 Fuzz / robustness + +* Run the discovery on: + + * Random non‑ELF files. + * Truncated ELF files. + * Very large binaries. +* Ensure no unhandled exceptions; only `InitRoutineError` entries. + +--- + +## 11. 
Suggested C# Project Layout + +```text +src/ + StellaOps.ElfInit/ + IInitRoutineDiscovery.cs + InitRoutineModels.cs + InitRoutineDiscovery.cs + IElfDependencyResolver.cs + ISymbolResolver.cs + Evidence/ + InitRoutineEvidence.cs + Elf/ + ElfFile.cs + ElfParser.cs + ElfHeader.cs + ProgramHeader.cs + SectionHeader.cs + DynamicSection.cs + VaMapper.cs + PointerArrayReader.cs +tests/ + StellaOps.ElfInit.Tests/ + Resources/ + sample_no_ctor + sample_init_array + sample_preinit_init_fini + sample_with_deps_main + libsample_ctor.so + InitRoutineDiscoveryTests.cs +``` + +--- + +If you’d like, I can next: + +* Draft `InitRoutineDiscovery` in C# with full method bodies, or +* Provide a minimal `ElfFile`/`ElfParser` implementation skeleton you can fill in. diff --git a/docs/product-advisories/archived/20-Nov-2026 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md b/docs/product-advisories/archived/20-Nov-2025 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md similarity index 96% rename from docs/product-advisories/archived/20-Nov-2026 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md rename to docs/product-advisories/archived/20-Nov-2025 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md index af2e26950..640c28196 100644 --- a/docs/product-advisories/archived/20-Nov-2026 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md +++ b/docs/product-advisories/archived/20-Nov-2025 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md @@ -1,1859 +1,1859 @@ -Short answer: -Yes, we have more than enough public information and ecosystem knowledge to design *better* reachability graphs for PHP and JavaScript than what Snyk/Semgrep expose today—especially in terms of openness, precision, and determinism. What we do **not** have is their exact proprietary heuristics, but we don’t need those to surpass them architecturally. - -Let me break it down in the dimensions that matter for Stella Ops. - ---- - -## 1. What we concretely know from Snyk & Semgrep - -From public material we can infer the *shape* of their systems: - -* **Snyk** - - * Builds a call graph of the application + dependencies to decide if vulnerable functions are on an execution path from “entry points” (e.g., HTTP handlers, CLI entry, etc.). ([Snyk][1]) - * For its “Reachable Vulnerabilities” feature, Snyk explicitly states that it ingests your repo, builds a call graph, then discards source and keeps only the graph + function names. ([docs.snyk.io][2]) - * Combines SCA with static analysis and uses reachability as a factor in “risk score” / prioritization. ([docs.snyk.io][3]) - -* **Semgrep (Supply Chain)** - - * Reachability is computed by correlating manifests/lockfiles with static analysis of the code to see whether vulnerable components are actually used. ([semgrep.dev][4]) - * Uses a dependency graph for supply chain (including transitive deps) and classifies findings as “always reachable / conditionally reachable / needs review / no reachability analysis.” ([semgrep.dev][5]) - * For PHP specifically, they now advertise reachability as GA in Supply Chain (we saw that in your earlier search). This tells us they do at least basic call-graph level reasoning + data flow for PHP. - -Conceptually, that already gives us the core primitives: - -* Call graphs (application + dependencies). -* Entry point modeling. -* Mapping vulnerable symbols (functions/methods/routes) to nodes in that graph. 
-* Reachability classification at the level of “reachable / no-path / conditional / not analyzed”. - -We also have additional public references (Endor Labs, Coana, GitLab, GitHub, etc.) that all describe more or less the same model: build call graphs or code property graphs and do forward/backward reachability over them. ([endorlabs.com][6]) - -So: the algorithmic *space* is well-documented. The secret sauce is mostly heuristics and engineering, not unknown math. - ---- - -## 2. Where the gaps actually are - -What we **do not** get from Snyk/Semgrep publicly: - -* Concrete internal call-graph algorithms and framework models (how they resolve dynamic imports, reflection, magic in PHP, complex JS bundler semantics). -* Their framework-specific “entry point catalogs” (e.g., mapping Express/Koa/NestJS/Next.js routes, Laravel/Symfony/WordPress hooks, etc.). -* Their internal tuning of false-positive / false-negative trade-offs per language and framework. -* Their private benchmarks and labeled datasets. - -That means we cannot “clone Snyk’s reachability,” but we absolutely can design: - -1. A **better graph spec**. -2. A **more transparent and deterministic pipeline**. -3. Stronger **binary + container + SBOM/VEX integration**. - -Which is exactly aligned with your Stella Ops vision. - ---- - -## 3. For PHP & JavaScript specifically: can we beat them? - -For **graph quality and expressiveness**, yes, we can. - -### JavaScript / TypeScript - -Existing tools face these pain points: - -* Highly dynamic imports (`require(...)`, `import()`, bundlers). -* Multiple module systems (CJS, ESM, UMD), tree-shaking, dead code elimination. -* Framework magic (Next.js, React SSR, Express middlewares, serverless handlers). - -Public info shows Snyk builds a call graph and analyzes execution paths, but details on how they handle all JS edge cases are abstracted away. ([Snyk][1]) - -What we can do better in Stella Ops graphs: - -* **First-class “resolution nodes”**: - - * Represent module resolution, bundler steps, and dynamic import decisions as explicit nodes/edges in the graph. - * This makes ambiguity *visible* instead of hidden inside a heuristic. -* **Framework contracts**: - - * Have pluggable “route/handler mappers” per framework (Express, Nest, Next, Fastify, serverless wrappers) so entry points are explicit graph roots, not magic. -* **Multiple call-graph layers**: - - * Source-level graph (TS/JS). - * Bundled output graph (Webpack/Vite/Rollup). - * Runtime-inferred hints (if we later choose to add traces), all merged into a unified reachability graph with provenance tags. - -If we design our graph format to preserve all uncertainty explicitly (e.g., edges tagged as “over-approximate”, “dynamic-guess”, “runtime-confirmed”), we will have *better analytical quality* even if raw coverage is comparable. - -### PHP - -Semgrep now has PHP reachability GA in Supply Chain, but again we only see the outcomes, not the internal graph model. ([DEV Community][7]) - -We can exploit known pain points in PHP: - -* Dynamic includes / autoloaders. -* Magic methods, dynamic dispatch, frameworks like Laravel/Symfony/WordPress/Drupal. -* Templating / view layers that act as “hidden” entry points. - -Improvements in the Stella Ops model: - -* **Autoloader-aware graph layer**: - - * Model Composer autoloading rules explicitly; edges from `composer.json` and PSR-4/PSR-0 rules into the graph. 
-* **Framework profiles**: - - * For Laravel/Symfony/etc., we ship profiles that define how controllers, routes, middlewares, commands, and events are wired. Those profiles become graph generators, not just regex signatures. -* **Source-to-SBOM linkage**: - - * Nodes are annotated with PURLs and SBOM component IDs, so you get reachability graph edges directly against SBOM + VEX. - -Again, even without their internals, we can design a **richer, more transparent graph representation**. - ---- - -## 4. How Stella Ops can clearly surpass them (graph-wise) - -Given your existing roadmap (SBOM spine, deterministic replay, lattice policies), we can deliberately design a reachability graph system that outclasses them in these axes: - -1. **Open, documented graph spec** - - * Define a “Reachability Graph Manifest”: - - * Nodes: functions/methods/routes/files/modules + dependency components (PURLs). - * Edges: call edges, data-flow edges, dependency edges, “resolution” edges. - * Metadata: language, framework, hashes, provenance, SBOM linkage. - * Publish it so others can generate/consume the same graphs. - -2. **Deterministic, replayable scans** - - * Every scan is defined by: - - * Exact sources (hashes). - * Analyzer version. - * Ruleset + framework profiles. - * Result: any reachability verdict can be re-computed bit-for-bit later. - -3. **PURL-level edges for supply chain** - - * Reachability graph includes direct edges: - - * `app:function` → `package:function` → `CVE`. - * This is exactly what most tools conceptually do, but we make it explicit and exportable. - -4. **Rich status model beyond “reachable / not”** - - * Adopt and extend Semgrep-like statuses: `always_reachable`, `conditionally_reachable`, `unreachable`, `not_analyzed`, `ambiguous`. - * Add: - - * Confidence levels. - * Types of ambiguity (dynamic dispatch, reflection, unresolved include, etc.). - -5. **Integration with lattice / policy engine** - - * You already plan a Trust Algebra / lattice engine. - * The reachability graph becomes a first-class input: - - * Policies like “treat `conditional_reachable` through untrusted input as High Risk; treat `always_reachable` behind auth as Medium; treat `unreachable` but exploitable via RCE as Special Case”. - -6. **Offline, sovereign, binary-aware** - - * Most commercial tools are SaaS and source-centric. - * We can: - - * Run completely offline. - * Attach reachability graphs to container images, SBOMs, and in-toto attestations. - * Extend to compiled artifacts later (ELF/PE/Mach-O, which you already plan). - -All of this is achievable with publicly available theory + our own engineering, independent of Snyk/Semgrep proprietary code. - ---- - -## 5. Practical conclusion - -So, to answer your question precisely: - -* We do **not** have enough information to duplicate Snyk or Semgrep’s *internal implementations*. -* We absolutely have enough information (from their docs + the broader reachability literature + open tools) to: - - * Design a **more expressive, auditable reachability graph model** for PHP and JavaScript. - * Implement our own call-graph construction and reachability analysis that is: - - * Deterministic, - * SBOM/VEX-native, - * Framework-aware, - * And aligned with Stella Ops’ offline/attested architecture. - -If you want, next step I can sketch a concrete “Reachability Graph v1” JSON schema for PHP/JS with PURL edges and confidence annotations that can plug directly into your Scanner/WebService and SBOM spine. 
- -[1]: https://snyk.io/blog/optimizing-prioritization-with-deep-application-level-context/?utm_source=chatgpt.com "Optimizing prioritization with deep application-level context" -[2]: https://docs.snyk.io/snyk-data-and-governance/how-snyk-handles-your-data?utm_source=chatgpt.com "How Snyk handles your data | Snyk User Docs" -[3]: https://docs.snyk.io/manage-risk/prioritize-issues-for-fixing/reachability-analysis?utm_source=chatgpt.com "Reachability analysis - Snyk User Docs" -[4]: https://semgrep.dev/blog/2024/sca-reachability-analysis-methods?utm_source=chatgpt.com "Comparing Reachability Analysis methods" -[5]: https://semgrep.dev/blog/2024/less-effort-more-insight-introducing-dependency-graph-for-supply-chain?utm_source=chatgpt.com "Less effort, more insight: Introducing Dependency Graph ..." -[6]: https://www.endorlabs.com/learn/what-is-reachability-based-dependency-analysis?utm_source=chatgpt.com "What is Reachability-Based Dependency Analysis? | Blog" -[7]: https://dev.to/semgrep/ai-memories-php-reachability-cve-policies-and-benchmarking-3naj?utm_source=chatgpt.com "AI Code Assistant Memories, PHP Reachability, CVE ..." -Good, let’s turn this into something a mid-level engineer can actually implement for PHP without guesswork. - -Below is a **concrete, implementation-ready reachability spec** for PHP v1, structured so you can give it directly to an engineer as requirements. - ---- - -## 1. Scope of PHP Reachability v1 - -**Goal** -Given: - -* A PHP project (source code), -* `composer.json` + `composer.lock`, -* A list of vulnerable symbols (e.g., FQNs from a vulnerability DB, each tied to a PURL), - -produce: - -1. A **call graph** of PHP functions/methods (with nodes and edges). -2. A **mapping** between nodes and dependency components (PURLs). -3. A **reachability report** per vulnerable symbol: - - * `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed` - * With example call paths when reachable. - -**Deliberate limitations of v1 (explicit, to keep it doable):** - -* Supported code: - - * Plain PHP functions. - * Class methods (normal, static). -* Supported calls: - - * Direct function calls: `foo()` - * Method calls: `$obj->bar()`, `Foo::bar()` -* Supported resolution features: - - * Namespaces + `use` imports. - * Composer autoload mapping (PSR-4/0, classmap) from `composer.json`. -* Not fully supported (treated conservatively as “maybe”): - - * Dynamic function names (`$fn()`). - * Dynamic method calls (`$obj->$name()`). - * Heavy reflection magic. - * Complex framework containers (Laravel, Symfony DI) – reserved for v2. - ---- - -## 2. Reachability Graph Document (JSON) - -The main artifact is a **graph document**. One file per scan: - -```json -{ - "schemaVersion": "1.0.0", - "language": "php", - "project": { - "projectId": "my-app", - "rootDir": "/src/app", - "hash": "sha256:..." - }, - "components": [ - { - "id": "comp-1", - "purl": "pkg:composer/vendor/lib-a@1.2.3", - "name": "vendor/lib-a", - "version": "1.2.3" - } - ], - "nodes": [], - "edges": [], - "vulnerabilities": [], - "reachabilityResults": [] -} -``` - -### 2.1 Node model - -Every node is a **callable** (function or method) or an **entry point**. 
- -```json -{ - "id": "node-uuid-or-hash", - "kind": "function | method | entrypoint", - "name": "index", - "fqn": "\\App\\Controller\\HomeController::index", - "file": "src/Controller/HomeController.php", - "line": 42, - "componentId": "comp-1", - "purl": "pkg:composer/vendor/lib-a@1.2.3", - "entryPointType": "http_route | cli | unknown | null", - "extras": { - "namespace": "\\App\\Controller", - "className": "HomeController", - "visibility": "public | protected | private | null" - } -} -``` - -**Rules for node creation** - -* **Function node** - - * `kind = "function"` - * `fqn` = `\Namespace\functionName` -* **Method node** - - * `kind = "method"` - * `fqn` = `\Namespace\ClassName::methodName` -* **Entrypoint node** - - * `kind = "entrypoint"` - * `entryPointType` set accordingly (may be `unknown` initially). - * Typically represents: - - * `public/index.php` - * `bin/console` commands, etc. - * Entrypoints can either: - - * Be separate nodes that **call** real functions/methods, or - * Be the same node as a method/function flagged as `entrypoint`. - For v1, keep it simple: **separate entrypoint nodes** that call “real” nodes. - -### 2.2 Edge model - -Edges capture relationships in the graph. - -```json -{ - "id": "edge-uuid-or-hash", - "from": "node-id-1", - "to": "node-id-2", - "type": "call | include | autoload | entry_call", - "confidence": "high | medium | low", - "extras": { - "callExpression": "Foo::bar($x)", - "file": "src/Controller/HomeController.php", - "line": 50 - } -} -``` - -**Edge types (v1)** - -* `call` - From a function/method to another function/method (resolved). -* `include` - From a file-level node or entrypoint to nodes defined in included file (optional for v1; can be “expanded” by treating all included definitions as reachable). -* `autoload` - From usage site to class definition when resolved via Composer autoload (optional to expose as a separate edge type; good for debug). -* `entry_call` - From an entrypoint node to the first callable(s) it invokes. - -For v1, an engineer can implement **only `call` + `entry_call`** and treat `include`/`autoload` as internal mechanics that result in `call` edges. - -### 2.3 Vulnerabilities model - -Input from your vulnerability database (or later from VEX) mapped into the graph: - -```json -{ - "id": "CVE-2020-1234", - "source": "internal-db-or-nvd-id", - "componentPurl": "pkg:composer/vendor/lib-a@1.2.3", - "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod", - "symbolKind": "method | function", - "severity": "critical | high | medium | low", - "extras": { - "description": "RCE in Foo::dangerousMethod", - "range": ">=1.0.0,<1.2.5" - } -} -``` - -At graph build time, you **pre-resolve** `symbolFqn` to `node.id` where possible and record it in `extras`. - ---- - -## 3. 
Reachability Results Structure
-
-Once you have the graph and the vulnerability list, you run reachability and produce:
-
-```json
-{
-  "vulnerabilityId": "CVE-2020-1234",
-  "componentPurl": "pkg:composer/vendor/lib-a@1.2.3",
-  "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod",
-  "targetNodeId": "node-123",
-  "status": "reachable | maybe_reachable | unreachable | not_analyzed",
-  "reason": "short explanation string",
-  "paths": [
-    ["entry-node-1", "node-10", "node-20", "node-123"]
-  ],
-  "analysisMeta": {
-    "algorithmVersion": "1.0.0",
-    "maxDepth": 100,
-    "timestamp": "2025-11-20T19:30:00Z"
-  }
-}
-```
-
-**Status semantics:**
-
-* `reachable`
-  There exists at least one **concrete call path** from an entrypoint node to `targetNodeId` using only `confidence = high` edges.
-* `maybe_reachable`
-  A path exists but at least one edge along any path has `confidence = medium | low` (dynamic call, unresolved class alias, etc.).
-* `unreachable`
-  No path exists from any entrypoint to the target node in the constructed graph.
-* `not_analyzed`
-  We failed to build a node for the symbol or failed the analysis (parse errors, missing source, etc.).
-
----
-
-## 4. Analysis Pipeline Spec (Step-by-Step)
-
-This is the part a mid-level engineer can follow as tasks.
-
-### 4.1 Inputs
-
-* Directory with PHP code (`/app`).
-* `composer.json`, `composer.lock`.
-* List of vulnerabilities (as above).
-* Optional SBOM mapping PURLs to file paths (if you have it; otherwise use Composer metadata only).
-
----
-
-### 4.2 Step 1 – Parse Composer Metadata & Build Components
-
-1. Read `composer.lock`.
-2. For each package in `"packages"`:
-
-   * Build `purl` like:
-     `pkg:composer/<vendor>/<package>@<version>`
-   * Create `components[]` entry (with generated `componentId`).
-3. For the root project, create one component (e.g., `app`) with `purl = null` or a synthetic one (`pkg:composer/mycompany/myapp@dev`).
-
-**Output:**
-
-* `components[]` array.
-* `componentIndex`: map from package name to `componentId`.
-
----
-
-### 4.3 Step 2 – PHP AST & Symbol Table
-
-Use a standard AST library (e.g., `nikic/php-parser`) – explicitly allowed and expected.
-
-For each PHP file in:
-
-* application source dirs (e.g. `src/`, `app/`),
-* vendor dirs (if you choose to parse vendor code; v1 may do that only for needed components):
-
-Perform:
-
-1. Parse file → AST.
-2. Extract:
-
-   * File namespace.
-   * `use` imports (class aliases).
-   * Function definitions: name, line.
-   * Class definitions: name, namespace, methods.
-3. Build **symbol table**:
-
-```php
-// conceptual structure:
-class SymbolTable {
-    // Fully qualified class or function name → node meta
-    public array $functionsByFqn;
-    public array $methodsByFqn; // "\Ns\Class::method"
-}
-```
-
-4. Determine `componentId` for each file:
-
-   * If path under `vendor/vendor-name/package-name/` → map to that Composer package → `componentId`.
-   * Else → root app component.
-
-5. Create **nodes**:
-
-* For each function:
-
-  * Node `kind = "function"`.
-* For each method:
-
-  * Node `kind = "method"`.
-
-Assign `id`, `file`, `line`, `fqn`, `componentId`, `purl`.
-
-**Output:**
-
-* `nodes[]` with all functions/methods.
-* `symbolTable` (for resolving calls).
-
----
-
-### 4.4 Step 3 – Entrypoint Detection
-
-v1 simple rules:
-
-1. Any of:
-
-   * `public/index.php`
-   * `index.php` in project root
-   * Files under `bin/` or `cli/` with `#!/usr/bin/env php` shebang
-
-   are considered **entrypoint files**.
-
-2. 
For each entrypoint file: - - * Create an `entrypoint` node with: - - * `file` = that file - * `entryPointType` = `"http_route"` (for `public/index.php`) or `"cli"` (for `bin/*`) or `"unknown"`. - * Add to `nodes[]`. - -3. Later, when scanning each entrypoint file’s AST, you will create `entry_call` edges from the entrypoint node to the first layer of call targets inside that file. - -**Output:** - -* Additional `entrypoint` nodes. - ---- - -### 4.5 Step 4 – Call Graph Construction - -For each parsed file: - -1. Traverse AST for call expressions: - - * `foo()` → candidate function call. - * `$obj->bar()` → instance method call. - * `Foo::bar()` → static method call. - -2. **Resolve function calls**: - - Given: - - * Called name (may be qualified, relative, or unqualified). - * Current file namespace. - - Resolution rules: - - * If fully qualified (starts with `\`): use directly as FQN. - * Else: - - * Check `use` imports for alias match. - * If no alias, prepend current namespace. - * Look up FQN in `symbolTable.functionsByFqn` or `methodsByFqn`. - * If found → **resolved call** with `confidence = "high"`. - * If not found → mark `confidence = "low"` and set `to` to a synthetic node id like `unknown` or skip creating an edge in v1 (implementation choice – recommended: create edge to special `unknown` node). - -3. **Resolve method calls `$obj->bar()`** (v1 simplified): - - * Assume dynamic instance type is not known statically → resolution is ambiguous. - * For v1, treat these as: - - * `confidence = "medium"` and: - - * If `$obj` variable has a clear `new ClassName` assignment in the same function, try to infer class and use same resolution rules as static calls. - * Otherwise, create edges from calling node to all methods named `bar` in **any class inside the same component**. - * This is over-approximate but conservative. - -4. **Resolve static method calls `Foo::bar()`**: - - * Resolve `Foo` to FQN using namespace + imports (same as functions). - * Build FQN `\Ns\Foo::bar`. - * Look up in `symbolTable.methodsByFqn`. - * Mark `confidence = "high"` when resolved. - -5. **Connect entrypoints**: - - * For each entrypoint file: - - * Identify top-level calls in that file (same rules as above). - * Edges: - - * `type = "entry_call"` - * `from = entrypointNodeId` - * `to = resolved callee node` - -**Output:** - -* `edges[]` with `call` and `entry_call` edges. - ---- - -### 4.6 Step 5 – Map Vulnerabilities to Nodes - -For each vulnerability: - -1. If `symbolFqn` is not null: - - * If `symbolKind == "method"` → look into `symbolTable.methodsByFqn`. - * If `symbolKind == "function"` → `symbolTable.functionsByFqn`. - -2. If found → record `targetNodeId` in a lookup: `vulnId → nodeId`. - -3. If not found → `status` will later become `not_analyzed`. - ---- - -### 4.7 Step 6 – Reachability Algorithm - -Core logic: multiple BFS (or DFS) from entrypoints over the call graph. - -**Pre-compute entry roots:** - -* `entryNodes` = ids of all nodes with `kind = "entrypoint"`. 
- -**Algorithm (BFS from all entrypoints):** - -Pseudo-code (language-agnostic): - -```php -function computeReachability(Graph $graph, array $entryNodes): ReachabilityContext { - $queue = new SplQueue(); - $visited = []; // nodeId => true - $predecessor = []; // nodeId => parent nodeId (for path reconstruction) - $edgeConfidenceOnPath = []; // nodeId => "high" | "medium" | "low" - - foreach ($entryNodes as $entryId) { - $queue->enqueue($entryId); - $visited[$entryId] = true; - $edgeConfidenceOnPath[$entryId] = "high"; - } - - while (!$queue->isEmpty()) { - $current = $queue->dequeue(); - - foreach ($graph->outEdges($current) as $edge) { - if ($edge->type !== 'call' && $edge->type !== 'entry_call') { - continue; - } - - $next = $edge->to; - if (isset($visited[$next])) { - continue; - } - - $visited[$next] = true; - $predecessor[$next] = $current; - - // propagate confidence (lowest on the path wins) - $prevConf = $edgeConfidenceOnPath[$current] ?? "high"; - $edgeConf = $edge->confidence; // "high"/"medium"/"low" - $edgeConfidenceOnPath[$next] = minConfidence($prevConf, $edgeConf); - - $queue->enqueue($next); - } - } - - return new ReachabilityContext($visited, $predecessor, $edgeConfidenceOnPath); -} - -function minConfidence(string $a, string $b): string { - $order = ["high" => 3, "medium" => 2, "low" => 1]; - return ($order[$a] <= $order[$b]) ? $a : $b; -} -``` - -**Classify each vulnerability:** - -For each vulnerability with `targetNodeId`: - -1. If `targetNodeId` is missing → `status = "not_analyzed"`. -2. Else if `targetNodeId` is **not** in `visited` → `status = "unreachable"`. -3. Else: - - * Let `conf = edgeConfidenceOnPath[targetNodeId]`. - * If `conf == "high"` → `status = "reachable"`. - * If `conf == "medium" or "low"` → `status = "maybe_reachable"`. - -**Path reconstruction:** - -To generate one example path: - -```php -function reconstructPath(array $predecessor, string $targetId): array { - $path = []; - $current = $targetId; - while (isset($predecessor[$current])) { - array_unshift($path, $current); - $current = $predecessor[$current]; - } - array_unshift($path, $current); // entrypoint at start - return $path; -} -``` - -Store that `path` array in `reachabilityResults[].paths[]`. - ---- - -## 5. Handling PHP “messy bits” (v1 rules) - -This is where we mark things as `maybe` instead of pretending we know. - -1. **Dynamic function names** `$fn()`: - - * Create **no edges** by default in v1. - * Optionally, if `$fn` is a constant string literal obvious in the same function, treat as a normal call. - * Otherwise: leave it out and accept that some cases will be missed → vulnerability may be marked `unreachable` but flagged with `analysisMeta.dynamicCallsIgnored = true`. - -2. **Dynamic methods** `$obj->$method()`: - - * Same principle as above. - -3. **Reflection / `call_user_func` / `call_user_func_array`**: - - * v1: do not try to resolve. - * Optional: track the call sites; mark their outgoing edges as `confidence = "low"` and connect to **all** functions/methods of that name when the name is a string literal. - -4. **Includes** (`include`, `require`, `require_once`, `include_once`): - - * v1 simplest rule: - - * Treat the included file as **fully reachable** from the including file. - * Pseudo-implementation: when building symbol table, everything defined in the included file is considered potentially called by the including file’s entrypoint logic. 
-   * Implementation shortcut:
-
-     * For the first version, you can even skip modeling edges, and instead mark all nodes in included files as “reachable from the entrypoint” if included directly by an entrypoint file. Later refine.
-
----
-
-## 6. What the engineer actually builds (modules & tasks)
-
-You can frame it to them like this:
-
-1. **Module `PhpProjectLoader`**
-
-   * Reads project root, finds `composer.json`, `composer.lock`.
-   * Produces `components[]` and mapping from file-path → componentId.
-
-2. **Module `PhpAstIndexer`**
-
-   * Uses `nikic/php-parser`.
-   * For each `.php` file:
-
-     * Produces entries in `symbolTable`.
-     * Produces base `nodes[]` (functions, methods).
-   * Creates `entrypoint` nodes based on known file patterns.
-
-3. **Module `PhpCallGraphBuilder`**
-
-   * Walks AST again:
-
-     * For each callable body, finds call expressions.
-     * Resolves calls via `symbolTable`.
-   * Produces `edges[]`.
-
-4. **Module `PhpReachabilityEngine`**
-
-   * Runs BFS from entrypoints.
-   * Classifies per-vulnerability reachability.
-
-5. **Module `GraphSerializer`**
-
-   * Assembles everything into the JSON schema described in sections 2–3.
-
-Each module is testable with small sample projects.
-
----
-
-## 7. Minimal working example (very small)
-
-Project:
-
-```php
-// public/index.php
-<?php
-require __DIR__ . '/../vendor/autoload.php';
-
-$controller = new \App\Controller\HomeController();
-$controller->index();
-
-// src/Controller/HomeController.php
-<?php
-namespace App\Controller;
-
-use Vendor\LibA\Foo;
-
-class HomeController
-{
-    public function index(): void
-    {
-        $foo = new Foo();
-        $foo->dangerousMethod();
-    }
-}
-```
-
-Vulnerability:
-
-```json
-{
-  "id": "CVE-2020-1234",
-  "componentPurl": "pkg:composer/vendor/lib-a@1.2.3",
-  "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod",
-  "symbolKind": "method"
-}
-```
-
-Expected reachability path (conceptually):
-
-```json
-[
-  "entry:public/index.php",
-  "\\App\\Controller\\HomeController::index",
-  "\\Vendor\\LibA\\Foo::dangerousMethod"
-]
-```
-
-Status: `reachable` with `confidence = high`.
-
----
-
-If you’d like, next step I can:
-
-* Strip this into a **formal JSON Schema** file (`reachability-php-graph.schema.json`) and
-* Propose a **directory layout + interfaces** in C#/.NET 10 for `StellaOps.Scanner.Php` so you can drop it straight into the repo.
-Here is a JavaScript/TypeScript reachability spec that a mid-level engineer can actually implement, but which is still “best in class” in terms of clarity, determinism, and extensibility.
-
-I’ll mirror the PHP structure you already have so Scanner/WebService and Sbomer can treat them uniformly.
-
----
-
-## 1. Scope of JS Reachability v1
-
-**Goal**
-
-Given:
-
-* A JS/TS project (Node-centric),
-* `package.json` + lockfile (`package-lock.json` / `yarn.lock` / `pnpm-lock.yaml`),
-* A list of vulnerable symbols (tied to npm PURLs),
-
-produce:
-
-1. A **function-level call graph** (nodes + edges).
-2. Mapping of nodes to **components** (`pkg:npm/...` PURLs).
-3. A **reachability verdict** for each vulnerable symbol:
-
-   * `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed`
-   * With at least one example call path when reachable/maybe_reachable.
-
-**Deliberate v1 constraints**
-
-To keep it very implementable:
-
-* Target runtime: **Node.js** (server-side).
-* Source: **TypeScript + JavaScript** in one unified analysis.
-
-  * Use TypeScript compiler with `allowJs: true` so JS and TS share the same Program.
-* Modules:
-
-  * ES Modules (`import`/`export`).
-  * CommonJS (`require`, `module.exports`, `exports`).
-* Supported calls:
-
-  * Direct calls: `foo()`.
-  * Method calls: `obj.method()`, `Class.method()`.
-* Bundlers (Webpack, Vite, etc.): **out of scope v1** (treat source before bundling).
-* Dynamic features (handled conservatively, see below):
-
-  * `eval`, `Function` constructor, dynamic imports, `obj[methodName]()`, etc.
-
----
-
-## 2. Reachability Graph Document (JSON)
-
-Same high-level shape as PHP, but annotated for JS/TS.
-
-```json
-{
-  "schemaVersion": "1.0.0",
-  "language": "javascript",
-  "project": {
-    "projectId": "my-node-app",
-    "rootDir": "/app",
-    "hash": "sha256:..."
-  },
-  "components": [],
-  "nodes": [],
-  "edges": [],
-  "vulnerabilities": [],
-  "reachabilityResults": []
-}
-```
-
-### 2.1 Components
-
-Each npm package (including the root app) is a component.
-
-```json
-{
-  "id": "comp-1",
-  "purl": "pkg:npm/express@4.19.2",
-  "name": "express",
-  "version": "4.19.2",
-  "isRoot": false,
-  "extras": {
-    "resolvedPath": "node_modules/express"
-  }
-}
-```
-
-For the root project, you can use:
-
-```json
-{
-  "id": "comp-root",
-  "purl": "pkg:npm/my-company-my-app@1.0.0",
-  "name": "my-company-my-app",
-  "version": "1.0.0",
-  "isRoot": true
-}
-```
-
-A mid-level engineer can easily build this from `package.json` + the chosen lockfile.
-
----
-
-### 2.2 Nodes (callables & entrypoints)
-
-Every node is a callable or an entrypoint.
-
-```json
-{
-  "id": "node-uuid-or-hash",
-  "kind": "function | method | arrow | class_constructor | entrypoint",
-  "name": "handleRequest",
-  "fqn": "src/controllers/userController.ts::handleRequest",
-  "file": "src/controllers/userController.ts",
-  "line": 42,
-  "componentId": "comp-root",
-  "purl": "pkg:npm/my-company-my-app@1.0.0",
-  "exportName": "handleRequest",
-  "exportKind": "named | default | none",
-  "className": "UserController",
-  "entryPointType": "http_route | cli | worker | unknown | null",
-  "extras": {
-    "isAsync": true,
-    "isRouteHandler": true
-  }
-}
-```
-
-**Rules for node creation**
-
-* **Function node**
-
-  * `kind = "function"` for `function foo() {}` and `export function foo() {}`.
-  * `fqn` = `<relative-file-path>::foo`.
-* **Arrow function node**
-
-  * `kind = "arrow"` when it is used as a callback that matters (e.g. Express handler).
-  * Option: generate synthetic name `file.ts::<line>:<col>`.
-* **Method node**
-
-  * `kind = "method"` for class methods.
-  * `fqn` = `<relative-file-path>::ClassName.methodName`.
-* **Class constructor node**
-
-  * `kind = "class_constructor"` for `constructor()` if you want constructor-level analysis.
-* **Entrypoint node**
-
-  * `kind = "entrypoint"`.
-  * `entryPointType` according to detection rules (see §4).
-  * `fqn` = `<relative-file-path>::<entry-name>`, e.g. `src/server.ts::node-entry`.
-
-You don’t need to over-engineer FQNs; they just need to be stable and unique.
-
----
-
-### 2.3 Edges
-
-Edges model function/method/module relationships.
-
-```json
-{
-  "id": "edge-uuid-or-hash",
-  "from": "node-id-1",
-  "to": "node-id-2",
-  "type": "call | entry_call | import | export",
-  "confidence": "high | medium | low",
-  "extras": {
-    "callExpression": "userController.handleRequest(req, res)",
-    "file": "src/routes/userRoutes.ts",
-    "line": 30
-  }
-}
-```
-
-For reachability v1, **only `call` and `entry_call` are required**. `import`/`export` edges are useful for debugging but not strictly necessary for BFS reachability.
-
----
-
-### 2.4 Vulnerabilities
-
-Library-level vulnerabilities are described in terms of npm PURL and symbol.
- -```json -{ - "id": "CVE-2020-1234", - "source": "internal-db-or-nvd-id", - "componentPurl": "pkg:npm/some-lib@1.2.3", - "packageName": "some-lib", - "symbolExportName": "dangerousFunction", - "symbolKind": "function | method", - "severity": "critical", - "extras": { - "description": "Prototype pollution in dangerousFunction", - "range": ">=1.0.0 <1.2.5" - } -} -``` - -At graph-build time, you pre-resolve `symbolExportName` → `node.id` where possible. - ---- - -### 2.5 Reachability Results - -Exactly the same shape as for PHP. - -```json -{ - "vulnerabilityId": "CVE-2020-1234", - "componentPurl": "pkg:npm/some-lib@1.2.3", - "symbolExportName": "dangerousFunction", - "targetNodeId": "node-123", - "status": "reachable | maybe_reachable | unreachable | not_analyzed", - "reason": "short explanation", - "paths": [ - ["entry-node-1", "node-20", "node-50", "node-123"] - ], - "analysisMeta": { - "algorithmVersion": "1.0.0", - "maxDepth": 200, - "timestamp": "2025-11-20T19:30:00Z" - } -} -``` - ---- - -## 3. Module & Symbol Resolution (JS/TS specifics) - -Backend: **TypeScript compiler API** with `allowJs: true`. - -### 3.1 Build TS Program - -1. Generate a `tsconfig.reachability.json` with: - - * `allowJs: true` - * `checkJs: true` - * `moduleResolution: "node"` or `"bundler"` depending on project. - * `rootDir` set to project root. -2. Use TS API to create `Program`. -3. Use `TypeChecker` to resolve symbols where possible. - -This gives you: - -* File list (including JS/TS). -* Symbols for exports/imports. -* Class and function definitions. - -### 3.2 Export indexing per module - -For each source file: - -* Enumerate: - - * `export function foo() {}` - * `export const bar = () => {}` - * `export default function () {}` / `export default class {}`. - * `export { foo }` statements. - * `module.exports = ...` / `exports.foo = ...` (handle as CommonJS exports). - -Build an index: - -```ts -interface ExportedSymbol { - moduleFile: string; // relative path - exportName: string; // "foo", "default" - nodeId: string; // ID in nodes[] -} -``` - -### 3.3 Import resolution - -For each `ImportDeclaration`: - -* `import { foo as localFoo } from 'some-lib'` - - * Map `localFoo` → `(module='some-lib', exportName='foo')`. - -* `import foo from 'some-lib'` - - * Map `foo` → `(module='some-lib', exportName='default')`. - -* `import * as lib from 'some-lib'` - - * Map namespace `lib` → `(module='some-lib', exportName='*')`. - -For CommonJS: - -* `const x = require('some-lib')` - - * Map `x` → `(module='some-lib', exportName='*')`. - -* `const { dangerousFunction } = require('some-lib')` - - * Map `dangerousFunction` → `(module='some-lib', exportName='dangerousFunction')`. - -Later, when you see calls, you use this mapping. - ---- - -## 4. Entrypoint Detection (Node-centric) - -v1 rules that are easy to implement: - -1. **CLI entrypoints** - - * Files listed in `bin` section of `package.json`. - * Files with `#!/usr/bin/env node` shebang. - * Node: - - * `kind = "entrypoint"`, - * `entryPointType = "cli"`. - -2. **Server entrypoints** - - * Heuristic: look for `src/server.ts`, `src/index.ts`, `index.js` at project root. - * Mark them as `entrypoint` with `entryPointType = "http_route"`. - -3. **Framework routes (Express v1)** - - * Pattern: `const app = express(); app.get('/path', handler)`: - - * `handler` can be: - - * Identifier (function name), - * Arrow function, - * Function expression. 
For each such route:
-
-    * Create an `entrypoint` node per route or mark handler callable as reachable from server entrypoint:
-
-      * Easiest v1: create **`entry_call` edge**:
-
-        * From server entrypoint node (e.g., file `src/server.ts`) to handler node.
-      * Mark handler node `extras.isRouteHandler = true`.
-
-You do not have to model individual HTTP methods or paths semantically in v1; just treat each handler as a reachable entrypoint into business logic.
-
----
-
-## 5. Call Graph Construction
-
-This is the heart of the algorithm.
-
-### 5.1 Node creation (summary)
-
-While visiting AST:
-
-* For each:
-
-  * `FunctionDeclaration`
-  * `MethodDeclaration`
-  * `ArrowFunction` (that is:
-
-    * exported, or
-    * assigned to a variable that is used as a callback/handler)
-* Create a `node`.
-
-Tie each node to:
-
-* `file` (relative path),
-* `line` (start line),
-* `componentId` (from mapping file path → package),
-* optional `exportName` (if exported from module).
-
-### 5.2 Call extraction rules
-
-For each function/method body (i.e., node):
-
-#### 5.2.1 Direct calls: `foo()`
-
-* If callee is an identifier `foo`:
-
-  1. Check if `foo` is a **local function** in the same file.
-  2. If not, check import alias table:
-
-     * If `foo` maps to `(module='pkg', exportName='bar')`, then:
-
-       * Resolve to exported symbol for `pkg` + `bar` if you have its sources.
-       * If library source not indexed, create a synthetic node for that library export (optional).
-  3. If resolved, add edge:
-
-     * `type = "call"`,
-     * `confidence = "high"`.
-
-#### 5.2.2 Property calls: `obj.method()`
-
-* If callee is `obj.method(...)`:
-
-  1. If `obj` is an imported namespace:
-
-     * e.g. `import * as lib from 'some-lib'; lib.dangerousFunction()`.
-     * Then treat:
-
-       * `module='some-lib'`, `exportName='dangerousFunction'`.
-       * Edge `confidence = "high"`.
-
-  2. If `obj` is created via `new ClassName()` where `ClassName` is known:
-
-     * Use TypeScript type checker or simple pattern:
-
-       * Look for `const obj = new ClassName(...)` in same function.
-     * Map to method `ClassName.method`.
-     * Edge `confidence = "high"`.
-
-  3. Else:
-
-     * As a v1 heuristic, you **do not** spread to everything; instead:
-
-       * Either:
-
-         * Skip edge and lose some coverage, or
-         * Add `confidence = "medium"` edge from current node to **all methods called `method`** in the same component.
-       * Recommended: medium-confidence to all same-name methods in same component (conservative, but safe).
-
-#### 5.2.3 CommonJS require patterns
-
-* `const x = require('some-lib'); x.dangerousFunction()`:
-
-  * Track variable → module mapping from `require`.
-  * When you see `x.something()`:
-
-    * `module='some-lib'`, `exportName='something'`.
-    * `confidence = "medium"` (less structured than ES import).
-
-#### 5.2.4 Dynamic imports & very dynamic calls
-
-* `await import('some-lib')`, `obj[methodName]()`, `eval`, `Function`, etc.:
-
-  v1 policy (simple and honest):
-
-  * Do **not** create specific edges unless:
-
-    * The target module name is a **string literal** and the method name is a **string literal** in same expression.
-  * Otherwise:
-
-    * Optionally create a single edge from current node to a special `node-unknown` with `confidence = "low"`.
-    * This preserves a record that “something dynamic happens here” without lying.
-
----
-
-## 6. Mapping Nodes to Components (PURLs)
-
-Using the filesystem:
-
-* If file path begins with `node_modules/<pkgName>/...`:
-
-  * Map that file to component with `name = pkgName` and the version from lockfile.
-
-* All other files belong to the root component (the app) or to a local “workspace” package if you support monorepos later.
-
-Each node inherits `componentId` from its file. Each component has a `purl`:
-
-* `pkg:npm/<name>@<version>`.
-
-This is how you connect reachability to SBOM/VEX later.
-
----
-
-## 7. Vulnerability → Node mapping
-
-Given a vulnerability:
-
-```json
-{
-  "componentPurl": "pkg:npm/some-lib@1.2.3",
-  "packageName": "some-lib",
-  "symbolExportName": "dangerousFunction"
-}
-```
-
-Steps:
-
-1. Find `componentId` by matching `componentPurl` or `packageName`.
-2. In that component, find node(s) where:
-
-   * `exportName == "dangerousFunction"`, or
-   * For CommonJS, any top-level function marked as part of the module’s exports under that name.
-3. If found:
-
-   * `targetNodeId = node.id`.
-4. If not:
-
-   * Mark `not_analyzed` later.
-
----
-
-## 8. Reachability Algorithm (BFS)
-
-Exactly like PHP v1, but now over JS nodes.
-
-**Pre-compute:**
-
-* `entryNodes` = all nodes where `kind = "entrypoint"`.
-
-**Compute reachable set:**
-
-```ts
-function computeReachability(graph: Graph, entryNodes: string[]): ReachabilityContext {
-  const queue: string[] = [];
-  const visited: Record<string, boolean> = {};
-  const predecessor: Record<string, string> = {};
-  const edgeConfidenceOnPath: Record<string, "high" | "medium" | "low"> = {};
-
-  for (const entry of entryNodes) {
-    queue.push(entry);
-    visited[entry] = true;
-    edgeConfidenceOnPath[entry] = "high";
-  }
-
-  while (queue.length > 0) {
-    const current = queue.shift()!;
-
-    for (const edge of graph.outEdges(current)) {
-      if (edge.type !== "call" && edge.type !== "entry_call") continue;
-
-      const next = edge.to;
-      if (visited[next]) continue;
-
-      visited[next] = true;
-      predecessor[next] = current;
-
-      const prevConf = edgeConfidenceOnPath[current] ?? "high";
-      const edgeConf = edge.confidence;
-      edgeConfidenceOnPath[next] = minConfidence(prevConf, edgeConf);
-
-      queue.push(next);
-    }
-  }
-
-  return { visited, predecessor, edgeConfidenceOnPath };
-}
-
-function minConfidence(a: "high" | "medium" | "low",
-                       b: "high" | "medium" | "low"): "high" | "medium" | "low" {
-  const order: Record<"high" | "medium" | "low", number> = { high: 3, medium: 2, low: 1 };
-  return order[a] <= order[b] ? a : b;
-}
-```
-
-**Classify per vulnerability:**
-
-For each vulnerability with `targetNodeId`:
-
-1. If missing → `status = "not_analyzed"`.
-2. If `targetNodeId` not in `visited` → `status = "unreachable"`.
-3. Otherwise:
-
-   * `conf = edgeConfidenceOnPath[targetNodeId]`.
-   * If `conf == "high"` → `status = "reachable"`.
-   * Else (`medium` or `low`) → `status = "maybe_reachable"`.
-
-**Path reconstruction:**
-
-Same as PHP:
-
-```ts
-function reconstructPath(predecessor: Record<string, string>,
-                         targetId: string): string[] {
-  const path: string[] = [];
-  let current: string | undefined = targetId;
-
-  while (current !== undefined) {
-    path.unshift(current);
-    current = predecessor[current];
-  }
-
-  return path;
-}
-```
-
-Store at least one path in `paths[]`.
-
----
-
-## 9. Handling JS “messy bits” (v1 rules)
-
-You want to be honest, not magical. So:
-
-1. **eval, new Function, dynamic import with non-literal arguments**
-
-   * Do not pretend you know where control goes.
-   * Either:
-
-     * Ignore for graph (recommended v1), or
-     * Edge to `node-unknown` with `confidence="low"`.
-   * Mark in `analysisMeta` that dynamic features were detected.
-
-2. **obj[methodName]() with unknown methodName**
-
-   * If `methodName` is string literal and `obj` is clearly typed, you can resolve.
-   * Otherwise: no edges (or low-confidence to `node-unknown`).
-
-3. 
**No source for library** - - * If you do not index `node_modules`, you cannot trace inside vulnerable library. - * Still useful: we just need the library’s exported symbol node as “synthetic”: - - * Create a synthetic node representing `some-lib::dangerousFunction` and attach all calls to it. - * That node gets `componentId` for `some-lib`. - * Reachability is still valid (we do not need the internal implementation for SCA). - ---- - -## 10. Implementation plan for a mid-level engineer - -Assume this runs in a **Node.js/TypeScript container** that Scanner calls, returning JSON. - -### 10.1 Modules to build - -1. `JsProjectLoader` - - * Reads `package.json` + lockfile. - * Builds `components[]` (npm packages + root app). - * Maps file paths → `componentId`. - -2. `TsProgramBuilder` - - * Generates `tsconfig.reachability.json`. - * Creates TS Program with `allowJs: true`. - * Exposes `sourceFiles` and `typeChecker`. - -3. `JsSymbolIndexer` - - * Walks all source files. - * Indexes: - - * Exported functions/classes. - * Imported bindings / requires. - * Creates base `nodes[]` and export index. - -4. `JsEntrypointDetector` - - * Reads `package.json` for `bin` and main entry. - * Applies server/Express heuristics. - * Adds `entrypoint` nodes. - -5. `JsCallGraphBuilder` - - * For each function/method node: - - * Traverses its AST. - * Emits `call` edges as per §5. - * Emits `entry_call` edges for server/route wiring. - -6. `VulnerabilityMapper` - - * Takes vulnerability input (package + symbolExportName). - * Resolves them to `targetNodeId`. - -7. `ReachabilityEngine` - - * Implements BFS algorithm. - * Produces `reachabilityResults[]` with paths. - -8. `GraphSerializer` - - * Packs everything into the JSON structure described in §§2–3. - -### 10.2 Minimal example (to sanity check) - -Project snippet: - -```ts -// src/server.ts -import express from 'express'; -import { dangerousFunction } from 'some-lib'; - -const app = express(); - -app.get('/test', (req, res) => { - dangerousFunction(req.query.input); - res.send('ok'); -}); - -app.listen(3000); -``` - -Vulnerability: - -```json -{ - "id": "CVE-2020-1234", - "componentPurl": "pkg:npm/some-lib@1.2.3", - "packageName": "some-lib", - "symbolExportName": "dangerousFunction" -} -``` - -Expected graph & result: - -* Entrypoint node: - - * `src/server.ts::node-entry` (or similar). -* Handler arrow function node for `/test`. -* Node for `some-lib::dangerousFunction` (real or synthetic). - -Reachability path: - -```json -[ - "entry:src/server.ts::node-entry", - "node-handler-/test", - "node-some-lib::dangerousFunction" -] -``` - -Status: `reachable`, `confidence = high`. - ---- - -If you want, next iteration I can: - -* Turn this into a formal JSON Schema (`reachability-js-graph.schema.json`), -* And sketch the exact Node/TS code structure (file layout, TS interfaces) for the container that StellaOps.Scanner would call. -You already have a very solid v1 for both PHP and JS. The next step is not to add “more stuff” randomly, but to make the specs: - -* More **uniform** (one core reachability model, multiple languages). -* More **honest** about uncertainty. -* More **useful** for scoring, policy, and SBOM/VEX. - -Here is what I would improve. - ---- - -## 1. Cross‑language improvements (applies to both PHP & JS) - -### 1.1 Unify under a single core schema - -Right now PHP and JS are parallel but not explicitly unified. I would define: - -* A **language‑agnostic core**: - - * `Node` (id, kind, file, line, componentId, purl, tags). 
- * `Edge` (id, from, to, type, confidence, tags). - * `Vulnerability` (id, componentPurl, symbolId or symbolFqn, severity, tags). - * `ReachabilityResult` (vulnId, targetNodeId, status, paths[], analysisMeta). -* A **language extension block**: - - * `phpExtras` (namespace, className, visibility, etc.). - * `jsExtras` (exportName, exportKind, isAsync, etc.). - -This gives you one “Reachability Graph 1.x” spec with per‑language specialisation instead of two separate specs. - -### 1.2 Stronger identity & hashing rules - -Make node and edge IDs deterministic and explicitly specified: - -* Node ID derived from: - - * `language`, `componentId`, `file`, `fqn`, `kind` → `sha256` truncated. -* Edge ID derived from: - - * `from`, `to`, `type`, `file`, `line`. - -Benefits: - -* Stable IDs across runs for the same code → easy diffing, caching, incremental scans. -* Downstream tools (policy engine, UI) can key on IDs confidently. - -### 1.3 Multi‑axis confidence instead of a single label - -Replace the single `confidence` enum with **multi‑axis confidence**: - -```json -"confidence": { - "resolution": "high|medium|low", // how well we resolved the callee - "typeInference": "high|medium|low", - "controlFlow": "high|medium|low" -} -``` - -And define: - -* `pathConfidence` = min of all axes along the path. -* `status` still uses `reachable` / `maybe_reachable` / etc., but you retain the underlying breakdown for scoring and debugging. - -### 1.4 Path conditions and guards (lightweight) - -Introduce optional **path condition annotations** on edges: - -```json -"extras": { - "guard": "if ($userIsLoggedIn)", - "guardType": "auth | feature_flag | input_validation | unknown" -} -``` - -You do not need full symbolic execution. A simple heuristic suffices: - -* Detect `if (...)` around the call and capture the textual condition. -* Categorize by simple patterns (presence of `isAdmin`, `feature`, `flag`, etc.). - -Later, the Trust Algebra can say: “reachable only under feature flag + behind auth → downgrade risk.” - -### 1.5 Partial coverage & truncation flags - -Make the graph self‑describing about its **limitations**: - -At graph level: - -```json -"analysisMeta": { - "languages": ["php"], - "vendorCodeParsed": true, - "dynamicFeaturesHandled": ["dynamic-includes-partial", "reflection-ignored"], - "maxNodes": 500000, - "truncated": false -} -``` - -Per‑node or per‑file: - -```json -"extras": { - "parseErrors": false, - "analysisSkippedReason": null -} -``` - -Per‑vulnerability: - -* Add `coverageStatus`: `full`, `partial`, `unknown` to complement `status`. - -This avoids a common trap: tools silently dropping edges/nodes and still reporting “unreachable.” - -### 1.6 First‑class SBOM/VEX linkage - -You already include PURLs. Go one step further: - -* `componentId` links to: - - * `bomRef` (CycloneDX) or `componentId` (SPDX) if available. -* `vulnerabilityId` links to: - - * `vexRef` in any existing VEX document. - -This allows: - -* A VEX producer to say “not affected / affected but not exploited” with **explicit reference** to the reachability graph and specific `targetNodeId`s. - ---- - -## 2. PHP‑specific improvements - -### 2.1 Autoloader‑aware edges as first‑class concept - -Right now autoload is mostly implicit. Make it explicit and deterministic: - -* During Composer metadata processing, build: - - * **Autoload map**: `FQN class → file`. -* Add `autoload` edges: - - * From “usage site” node (where `new ClassName()` first appears) to a **file‑level node** representing the defining file. 
- -Why it helps: - -* Clarifies how classes were resolved (or not). -* Easier to debug “class not found” vs “we never parsed vendor code.” - -### 2.2 More precise includes / requires - -Upgrade the naive rule “everything in included file is reachable”: - -1. Represent each file as a special node `kind="file"`. -2. `include` / `require` statements produce `include` edges from current node/file to the file node. -3. Then: - - * All functions/methods defined in that file get `define_in` edges from file node. - * A separate simple pass marks them reachable from that file’s callers. - -Add a nuance: - -* If the include path is static and resolved at scan time → `resolution.high`. -* If dynamic (e.g., `include $baseDir.'/file.php';`) → `resolution.medium` or `low`. - -### 2.3 Better dynamic dispatch handling for methods - -Current v1 rule (“connect to all methods with that name in the component”) is safe but noisy. - -Refinement: - -* Use **local type inference** in the same function/method: - - * `$x = new Foo(); $x->bar();` → high resolution. - * `$x = factory(); $x->bar();`: - - * If factory returns a union of known types, edges to those types with `resolution.medium`. -* Introduce a tag on edges: - - * `extras.dispatchKind = "static" | "local-new" | "factory-heuristic" | "unknown"`. - -This preserves the safety of your current design but cuts down false positives for common patterns. - -### 2.4 Framework‑aware entrypoints (v2, but spec‑ready now) - -Extend `entryPointType` with framework flavors, even if initial implementation is shallow: - -* `laravel_http`, `symfony_http`, `wordpress_hook`, `drupal_hook`, etc. - -And allow: - -```json -"extras": { - "framework": "laravel", - "route": "GET /users", - "hookName": "init" -} -``` - -You do not have to implement every framework in v1, but the spec should allow these so you can ship small, incremental framework profiles without changing the schema. - ---- - -## 3. JavaScript/TypeScript‑specific improvements - -### 3.1 Explicit async / event‑loop edges - -Today all calls are treated uniformly. For JS/TS, you should model: - -* `setTimeout`, `setInterval`, `setImmediate`, `queueMicrotask`, `process.nextTick`, `Promise.then/catch/finally`, event emitters. - -Two improvements: - -1. Additional edge types: - - * `async_call`, `event_callback`, `timer_callback`. -2. Node extras: - - * `extras.trigger = "timer" | "promise" | "event" | "unknown"`. - -This lets you later express policies like: “reachable only via a rarely used cron‑like timer” vs “reachable via normal HTTP request.” - -### 3.2 Bundler awareness (but spec‑only in v1) - -Even if v1 implementation ignores bundlers, the spec should anticipate them: - -* Allow a **bundle mapping block**: - -```json -"bundles": [ - { - "id": "bundle-main", - "tool": "webpack", - "inputFiles": ["src/index.ts", "src/server.ts"], - "outputFiles": ["dist/main.js"] - } -] -``` - -* Optionally, allow edges: - - * `type = "bundle_map"` from source file nodes to bundled file nodes. - -You can attach reachability graphs to either pre‑bundle or post‑bundle views later, without breaking the schema. - -### 3.3 Stronger TypeScript‑based resolution - -Encode the fact that a call was resolved using TS type information vs heuristic: - -* On edges, add: - -```json -"extras": { - "resolutionStrategy": "ts-typechecker | local-scope | require-heuristic | unresolved" -} -``` - -This provides a clear line between “hard” and “soft” links for the scoring engine and for debugging why something is `maybe_reachable`. 
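-
-As a minimal sketch only (not part of the spec), assuming the TypeScript compiler API and the edge shape from §2.3, a graph builder could classify `resolutionStrategy` roughly like this; the function name `classifyResolution` and the fallback order are illustrative:
-
-```ts
-import * as ts from "typescript";
-
-type ResolutionStrategy =
-  "ts-typechecker" | "local-scope" | "require-heuristic" | "unresolved";
-
-// Decide which strategy resolved the callee of a call expression.
-// `checker` comes from program.getTypeChecker().
-function classifyResolution(
-  call: ts.CallExpression,
-  checker: ts.TypeChecker
-): ResolutionStrategy {
-  const callee = call.expression;
-  const symbol = checker.getSymbolAtLocation(callee);
-  if (symbol) {
-    // Follow import aliases (`import { x } from '...'`) back to their declaration.
-    const resolved =
-      symbol.flags & ts.SymbolFlags.Alias
-        ? checker.getAliasedSymbol(symbol)
-        : symbol;
-    if (resolved.declarations?.length) {
-      return "ts-typechecker";
-    }
-  }
-  // A bare identifier may still match a declaration in the same file.
-  if (ts.isIdentifier(callee)) {
-    return "local-scope";
-  }
-  return "unresolved";
-}
-```
-
-A real builder would also detect `require`-bound variables before falling back to `unresolved`; the returned value is stored in `extras.resolutionStrategy` and mapped to an edge `confidence` (e.g., `ts-typechecker` → `high`, `require-heuristic` → `medium`).
-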
- -### 3.4 Workspace / monorepo semantics - -Support Yarn / pnpm / npm workspaces at the schema level: - -* Allow components to have: - -```json -"extras": { - "workspace": "packages/service-a", - "isWorkspaceRoot": false -} -``` - -And support edges: - -* `type = "workspace_dep"` for internal package imports. - -This makes it straightforward to see when a vulnerable library is pulled via an internal package boundary, which is common in large JS monorepos. - ---- - -## 4. Operational & lifecycle improvements - -### 4.1 Explicit incremental scan support - -Add an optional **delta section** so a scanner can emit only changes: - -```json -"delta": { - "baseGraphHash": "sha256:...", - "addedNodes": [...], - "removedNodeIds": [...], - "addedEdges": [...], - "removedEdgeIds": [...] -} -``` - -This is particularly valuable for large repos where full graphs are costly and CI needs fast turnaround. - -### 4.2 Test / non‑prod code classification - -Mark nodes/edges originating from tests or non‑prod code: - -* `extras.codeRole = "prod | test | devtool | unknown"`. - -Entry points from test runners (e.g., PHPUnit, Jest, Mocha) should either be: - -* Ignored (default), or -* Explicitly flagged as `entryPointType = "test"` so policies can decide whether to count that reachability. - -### 4.3 Normative definitions of statuses - -You already use `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed`. Make the semantics **normative** in the spec: - -* Tie `reachable` / `maybe_reachable` to: - - * Existence of a path from **at least one recognized entrypoint**. - * Minimum `pathConfidence` thresholds. -* Require that tools distinguish: - - * “No path in the graph” vs “graph incomplete here.” - -This allows multiple tools to implement the spec and still produce comparable, auditable results. - ---- - -If you want, the next concrete step could be: - -* A **“Reachability Graph 1.1”** document that: - - * Extracts the shared core, - * Adds multi‑axis confidence, - * Adds partial‑coverage metadata, - * Extends the enums for edge types and entrypoint types for PHP/JS. - -That gives your team a clean target for implementation without materially increasing complexity for a mid‑level engineer. +Short answer: +Yes, we have more than enough public information and ecosystem knowledge to design *better* reachability graphs for PHP and JavaScript than what Snyk/Semgrep expose today—especially in terms of openness, precision, and determinism. What we do **not** have is their exact proprietary heuristics, but we don’t need those to surpass them architecturally. + +Let me break it down in the dimensions that matter for Stella Ops. + +--- + +## 1. What we concretely know from Snyk & Semgrep + +From public material we can infer the *shape* of their systems: + +* **Snyk** + + * Builds a call graph of the application + dependencies to decide if vulnerable functions are on an execution path from “entry points” (e.g., HTTP handlers, CLI entry, etc.). ([Snyk][1]) + * For its “Reachable Vulnerabilities” feature, Snyk explicitly states that it ingests your repo, builds a call graph, then discards source and keeps only the graph + function names. ([docs.snyk.io][2]) + * Combines SCA with static analysis and uses reachability as a factor in “risk score” / prioritization. ([docs.snyk.io][3]) + +* **Semgrep (Supply Chain)** + + * Reachability is computed by correlating manifests/lockfiles with static analysis of the code to see whether vulnerable components are actually used. 
([semgrep.dev][4]) + * Uses a dependency graph for supply chain (including transitive deps) and classifies findings as “always reachable / conditionally reachable / needs review / no reachability analysis.” ([semgrep.dev][5]) + * For PHP specifically, they now advertise reachability as GA in Supply Chain (we saw that in your earlier search). This tells us they do at least basic call-graph level reasoning + data flow for PHP. + +Conceptually, that already gives us the core primitives: + +* Call graphs (application + dependencies). +* Entry point modeling. +* Mapping vulnerable symbols (functions/methods/routes) to nodes in that graph. +* Reachability classification at the level of “reachable / no-path / conditional / not analyzed”. + +We also have additional public references (Endor Labs, Coana, GitLab, GitHub, etc.) that all describe more or less the same model: build call graphs or code property graphs and do forward/backward reachability over them. ([endorlabs.com][6]) + +So: the algorithmic *space* is well-documented. The secret sauce is mostly heuristics and engineering, not unknown math. + +--- + +## 2. Where the gaps actually are + +What we **do not** get from Snyk/Semgrep publicly: + +* Concrete internal call-graph algorithms and framework models (how they resolve dynamic imports, reflection, magic in PHP, complex JS bundler semantics). +* Their framework-specific “entry point catalogs” (e.g., mapping Express/Koa/NestJS/Next.js routes, Laravel/Symfony/WordPress hooks, etc.). +* Their internal tuning of false-positive / false-negative trade-offs per language and framework. +* Their private benchmarks and labeled datasets. + +That means we cannot “clone Snyk’s reachability,” but we absolutely can design: + +1. A **better graph spec**. +2. A **more transparent and deterministic pipeline**. +3. Stronger **binary + container + SBOM/VEX integration**. + +Which is exactly aligned with your Stella Ops vision. + +--- + +## 3. For PHP & JavaScript specifically: can we beat them? + +For **graph quality and expressiveness**, yes, we can. + +### JavaScript / TypeScript + +Existing tools face these pain points: + +* Highly dynamic imports (`require(...)`, `import()`, bundlers). +* Multiple module systems (CJS, ESM, UMD), tree-shaking, dead code elimination. +* Framework magic (Next.js, React SSR, Express middlewares, serverless handlers). + +Public info shows Snyk builds a call graph and analyzes execution paths, but details on how they handle all JS edge cases are abstracted away. ([Snyk][1]) + +What we can do better in Stella Ops graphs: + +* **First-class “resolution nodes”**: + + * Represent module resolution, bundler steps, and dynamic import decisions as explicit nodes/edges in the graph. + * This makes ambiguity *visible* instead of hidden inside a heuristic. +* **Framework contracts**: + + * Have pluggable “route/handler mappers” per framework (Express, Nest, Next, Fastify, serverless wrappers) so entry points are explicit graph roots, not magic. +* **Multiple call-graph layers**: + + * Source-level graph (TS/JS). + * Bundled output graph (Webpack/Vite/Rollup). + * Runtime-inferred hints (if we later choose to add traces), all merged into a unified reachability graph with provenance tags. + +If we design our graph format to preserve all uncertainty explicitly (e.g., edges tagged as “over-approximate”, “dynamic-guess”, “runtime-confirmed”), we will have *better analytical quality* even if raw coverage is comparable. 
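+
+As an illustration only (the `provenance` field name is hypothetical, not a committed schema), a single edge that carries its own uncertainty could look like:
+
+```json
+{
+  "from": "src/app.ts::startServer",
+  "to": "pkg:npm/some-lib@1.2.3::dangerousFunction",
+  "type": "call",
+  "provenance": "dynamic-guess",
+  "confidence": "medium"
+}
+```
+
+A consumer can then score or filter on `provenance` instead of trusting a flat reachable/not-reachable boolean.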
+ +### PHP + +Semgrep now has PHP reachability GA in Supply Chain, but again we only see the outcomes, not the internal graph model. ([DEV Community][7]) + +We can exploit known pain points in PHP: + +* Dynamic includes / autoloaders. +* Magic methods, dynamic dispatch, frameworks like Laravel/Symfony/WordPress/Drupal. +* Templating / view layers that act as “hidden” entry points. + +Improvements in the Stella Ops model: + +* **Autoloader-aware graph layer**: + + * Model Composer autoloading rules explicitly; edges from `composer.json` and PSR-4/PSR-0 rules into the graph. +* **Framework profiles**: + + * For Laravel/Symfony/etc., we ship profiles that define how controllers, routes, middlewares, commands, and events are wired. Those profiles become graph generators, not just regex signatures. +* **Source-to-SBOM linkage**: + + * Nodes are annotated with PURLs and SBOM component IDs, so you get reachability graph edges directly against SBOM + VEX. + +Again, even without their internals, we can design a **richer, more transparent graph representation**. + +--- + +## 4. How Stella Ops can clearly surpass them (graph-wise) + +Given your existing roadmap (SBOM spine, deterministic replay, lattice policies), we can deliberately design a reachability graph system that outclasses them in these axes: + +1. **Open, documented graph spec** + + * Define a “Reachability Graph Manifest”: + + * Nodes: functions/methods/routes/files/modules + dependency components (PURLs). + * Edges: call edges, data-flow edges, dependency edges, “resolution” edges. + * Metadata: language, framework, hashes, provenance, SBOM linkage. + * Publish it so others can generate/consume the same graphs. + +2. **Deterministic, replayable scans** + + * Every scan is defined by: + + * Exact sources (hashes). + * Analyzer version. + * Ruleset + framework profiles. + * Result: any reachability verdict can be re-computed bit-for-bit later. + +3. **PURL-level edges for supply chain** + + * Reachability graph includes direct edges: + + * `app:function` → `package:function` → `CVE`. + * This is exactly what most tools conceptually do, but we make it explicit and exportable. + +4. **Rich status model beyond “reachable / not”** + + * Adopt and extend Semgrep-like statuses: `always_reachable`, `conditionally_reachable`, `unreachable`, `not_analyzed`, `ambiguous`. + * Add: + + * Confidence levels. + * Types of ambiguity (dynamic dispatch, reflection, unresolved include, etc.). + +5. **Integration with lattice / policy engine** + + * You already plan a Trust Algebra / lattice engine. + * The reachability graph becomes a first-class input: + + * Policies like “treat `conditional_reachable` through untrusted input as High Risk; treat `always_reachable` behind auth as Medium; treat `unreachable` but exploitable via RCE as Special Case”. + +6. **Offline, sovereign, binary-aware** + + * Most commercial tools are SaaS and source-centric. + * We can: + + * Run completely offline. + * Attach reachability graphs to container images, SBOMs, and in-toto attestations. + * Extend to compiled artifacts later (ELF/PE/Mach-O, which you already plan). + +All of this is achievable with publicly available theory + our own engineering, independent of Snyk/Semgrep proprietary code. + +--- + +## 5. Practical conclusion + +So, to answer your question precisely: + +* We do **not** have enough information to duplicate Snyk or Semgrep’s *internal implementations*. 
+* We absolutely have enough information (from their docs + the broader reachability literature + open tools) to: + + * Design a **more expressive, auditable reachability graph model** for PHP and JavaScript. + * Implement our own call-graph construction and reachability analysis that is: + + * Deterministic, + * SBOM/VEX-native, + * Framework-aware, + * And aligned with Stella Ops’ offline/attested architecture. + +If you want, next step I can sketch a concrete “Reachability Graph v1” JSON schema for PHP/JS with PURL edges and confidence annotations that can plug directly into your Scanner/WebService and SBOM spine. + +[1]: https://snyk.io/blog/optimizing-prioritization-with-deep-application-level-context/?utm_source=chatgpt.com "Optimizing prioritization with deep application-level context" +[2]: https://docs.snyk.io/snyk-data-and-governance/how-snyk-handles-your-data?utm_source=chatgpt.com "How Snyk handles your data | Snyk User Docs" +[3]: https://docs.snyk.io/manage-risk/prioritize-issues-for-fixing/reachability-analysis?utm_source=chatgpt.com "Reachability analysis - Snyk User Docs" +[4]: https://semgrep.dev/blog/2024/sca-reachability-analysis-methods?utm_source=chatgpt.com "Comparing Reachability Analysis methods" +[5]: https://semgrep.dev/blog/2024/less-effort-more-insight-introducing-dependency-graph-for-supply-chain?utm_source=chatgpt.com "Less effort, more insight: Introducing Dependency Graph ..." +[6]: https://www.endorlabs.com/learn/what-is-reachability-based-dependency-analysis?utm_source=chatgpt.com "What is Reachability-Based Dependency Analysis? | Blog" +[7]: https://dev.to/semgrep/ai-memories-php-reachability-cve-policies-and-benchmarking-3naj?utm_source=chatgpt.com "AI Code Assistant Memories, PHP Reachability, CVE ..." +Good, let’s turn this into something a mid-level engineer can actually implement for PHP without guesswork. + +Below is a **concrete, implementation-ready reachability spec** for PHP v1, structured so you can give it directly to an engineer as requirements. + +--- + +## 1. Scope of PHP Reachability v1 + +**Goal** +Given: + +* A PHP project (source code), +* `composer.json` + `composer.lock`, +* A list of vulnerable symbols (e.g., FQNs from a vulnerability DB, each tied to a PURL), + +produce: + +1. A **call graph** of PHP functions/methods (with nodes and edges). +2. A **mapping** between nodes and dependency components (PURLs). +3. A **reachability report** per vulnerable symbol: + + * `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed` + * With example call paths when reachable. + +**Deliberate limitations of v1 (explicit, to keep it doable):** + +* Supported code: + + * Plain PHP functions. + * Class methods (normal, static). +* Supported calls: + + * Direct function calls: `foo()` + * Method calls: `$obj->bar()`, `Foo::bar()` +* Supported resolution features: + + * Namespaces + `use` imports. + * Composer autoload mapping (PSR-4/0, classmap) from `composer.json`. +* Not fully supported (treated conservatively as “maybe”): + + * Dynamic function names (`$fn()`). + * Dynamic method calls (`$obj->$name()`). + * Heavy reflection magic. + * Complex framework containers (Laravel, Symfony DI) – reserved for v2. + +--- + +## 2. Reachability Graph Document (JSON) + +The main artifact is a **graph document**. One file per scan: + +```json +{ + "schemaVersion": "1.0.0", + "language": "php", + "project": { + "projectId": "my-app", + "rootDir": "/src/app", + "hash": "sha256:..." 
+ }, + "components": [ + { + "id": "comp-1", + "purl": "pkg:composer/vendor/lib-a@1.2.3", + "name": "vendor/lib-a", + "version": "1.2.3" + } + ], + "nodes": [], + "edges": [], + "vulnerabilities": [], + "reachabilityResults": [] +} +``` + +### 2.1 Node model + +Every node is a **callable** (function or method) or an **entry point**. + +```json +{ + "id": "node-uuid-or-hash", + "kind": "function | method | entrypoint", + "name": "index", + "fqn": "\\App\\Controller\\HomeController::index", + "file": "src/Controller/HomeController.php", + "line": 42, + "componentId": "comp-1", + "purl": "pkg:composer/vendor/lib-a@1.2.3", + "entryPointType": "http_route | cli | unknown | null", + "extras": { + "namespace": "\\App\\Controller", + "className": "HomeController", + "visibility": "public | protected | private | null" + } +} +``` + +**Rules for node creation** + +* **Function node** + + * `kind = "function"` + * `fqn` = `\Namespace\functionName` +* **Method node** + + * `kind = "method"` + * `fqn` = `\Namespace\ClassName::methodName` +* **Entrypoint node** + + * `kind = "entrypoint"` + * `entryPointType` set accordingly (may be `unknown` initially). + * Typically represents: + + * `public/index.php` + * `bin/console` commands, etc. + * Entrypoints can either: + + * Be separate nodes that **call** real functions/methods, or + * Be the same node as a method/function flagged as `entrypoint`. + For v1, keep it simple: **separate entrypoint nodes** that call “real” nodes. + +### 2.2 Edge model + +Edges capture relationships in the graph. + +```json +{ + "id": "edge-uuid-or-hash", + "from": "node-id-1", + "to": "node-id-2", + "type": "call | include | autoload | entry_call", + "confidence": "high | medium | low", + "extras": { + "callExpression": "Foo::bar($x)", + "file": "src/Controller/HomeController.php", + "line": 50 + } +} +``` + +**Edge types (v1)** + +* `call` + From a function/method to another function/method (resolved). +* `include` + From a file-level node or entrypoint to nodes defined in included file (optional for v1; can be “expanded” by treating all included definitions as reachable). +* `autoload` + From usage site to class definition when resolved via Composer autoload (optional to expose as a separate edge type; good for debug). +* `entry_call` + From an entrypoint node to the first callable(s) it invokes. + +For v1, an engineer can implement **only `call` + `entry_call`** and treat `include`/`autoload` as internal mechanics that result in `call` edges. + +### 2.3 Vulnerabilities model + +Input from your vulnerability database (or later from VEX) mapped into the graph: + +```json +{ + "id": "CVE-2020-1234", + "source": "internal-db-or-nvd-id", + "componentPurl": "pkg:composer/vendor/lib-a@1.2.3", + "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod", + "symbolKind": "method | function", + "severity": "critical | high | medium | low", + "extras": { + "description": "RCE in Foo::dangerousMethod", + "range": ">=1.0.0,<1.2.5" + } +} +``` + +At graph build time, you **pre-resolve** `symbolFqn` to `node.id` where possible and record it in `extras`. + +--- + +## 3. 
Reachability Results Structure
+
+Once you have the graph and the vulnerability list, you run reachability and produce:
+
+```json
+{
+  "vulnerabilityId": "CVE-2020-1234",
+  "componentPurl": "pkg:composer/vendor/lib-a@1.2.3",
+  "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod",
+  "targetNodeId": "node-123",
+  "status": "reachable | maybe_reachable | unreachable | not_analyzed",
+  "reason": "short explanation string",
+  "paths": [
+    ["entry-node-1", "node-10", "node-20", "node-123"]
+  ],
+  "analysisMeta": {
+    "algorithmVersion": "1.0.0",
+    "maxDepth": 100,
+    "timestamp": "2025-11-20T19:30:00Z"
+  }
+}
+```
+
+**Status semantics:**
+
+* `reachable`
+  There exists at least one **concrete call path** from an entrypoint node to `targetNodeId` using only `confidence = high` edges.
+* `maybe_reachable`
+  A path exists but at least one edge along any path has `confidence = medium | low` (dynamic call, unresolved class alias, etc.).
+* `unreachable`
+  No path exists from any entrypoint to the target node in the constructed graph.
+* `not_analyzed`
+  We failed to build a node for the symbol or failed the analysis (parse errors, missing source, etc.).
+
+---
+
+## 4. Analysis Pipeline Spec (Step-by-Step)
+
+This is the part a mid-level engineer can follow as tasks.
+
+### 4.1 Inputs
+
+* Directory with PHP code (`/app`).
+* `composer.json`, `composer.lock`.
+* List of vulnerabilities (as above).
+* Optional SBOM mapping PURLs to file paths (if you have it; otherwise use Composer metadata only).
+
+---
+
+### 4.2 Step 1 – Parse Composer Metadata & Build Components
+
+1. Read `composer.lock`.
+2. For each package in `"packages"`:
+
+   * Build `purl` like:
+     `pkg:composer/<vendor>/<package>@<version>`
+   * Create `components[]` entry (with generated `componentId`).
+3. For the root project, create one component (e.g., `app`) with `purl = null` or a synthetic one (`pkg:composer/mycompany/myapp@dev`).
+
+**Output:**
+
+* `components[]` array.
+* `componentIndex`: map from package name to `componentId`.
+
+---
+
+### 4.3 Step 2 – PHP AST & Symbol Table
+
+Use a standard AST library (e.g., `nikic/php-parser`) – explicitly allowed and expected.
+
+For each PHP file in:
+
+* application source dirs (e.g. `src/`, `app/`),
+* vendor dirs (if you choose to parse vendor code; v1 may do that only for needed components):
+
+Perform:
+
+1. Parse file → AST.
+2. Extract:
+
+   * File namespace.
+   * `use` imports (class aliases).
+   * Function definitions: name, line.
+   * Class definitions: name, namespace, methods.
+3. Build **symbol table**:
+
+```php
+// conceptual structure:
+class SymbolTable {
+    // Fully qualified class or function name → node meta
+    public array $functionsByFqn;
+    public array $methodsByFqn; // "\Ns\Class::method"
+}
+```
+
+4. Determine `componentId` for each file:
+
+   * If path under `vendor/vendor-name/package-name/` → map to that Composer package → `componentId`.
+   * Else → root app component.
+
+5. Create **nodes**:
+
+* For each function:
+
+  * Node `kind = "function"`.
+* For each method:
+
+  * Node `kind = "method"`.
+
+Assign `id`, `file`, `line`, `fqn`, `componentId`, `purl`.
+
+**Output:**
+
+* `nodes[]` with all functions/methods.
+* `symbolTable` (for resolving calls).
+
+---
+
+### 4.4 Step 3 – Entrypoint Detection
+
+v1 simple rules:
+
+1. Any of:
+
+   * `public/index.php`
+   * `index.php` in project root
+   * Files under `bin/` or `cli/` with `#!/usr/bin/env php` shebang
+
+   are considered **entrypoint files**.
+
+2.
For each entrypoint file: + + * Create an `entrypoint` node with: + + * `file` = that file + * `entryPointType` = `"http_route"` (for `public/index.php`) or `"cli"` (for `bin/*`) or `"unknown"`. + * Add to `nodes[]`. + +3. Later, when scanning each entrypoint file’s AST, you will create `entry_call` edges from the entrypoint node to the first layer of call targets inside that file. + +**Output:** + +* Additional `entrypoint` nodes. + +--- + +### 4.5 Step 4 – Call Graph Construction + +For each parsed file: + +1. Traverse AST for call expressions: + + * `foo()` → candidate function call. + * `$obj->bar()` → instance method call. + * `Foo::bar()` → static method call. + +2. **Resolve function calls**: + + Given: + + * Called name (may be qualified, relative, or unqualified). + * Current file namespace. + + Resolution rules: + + * If fully qualified (starts with `\`): use directly as FQN. + * Else: + + * Check `use` imports for alias match. + * If no alias, prepend current namespace. + * Look up FQN in `symbolTable.functionsByFqn` or `methodsByFqn`. + * If found → **resolved call** with `confidence = "high"`. + * If not found → mark `confidence = "low"` and set `to` to a synthetic node id like `unknown` or skip creating an edge in v1 (implementation choice – recommended: create edge to special `unknown` node). + +3. **Resolve method calls `$obj->bar()`** (v1 simplified): + + * Assume dynamic instance type is not known statically → resolution is ambiguous. + * For v1, treat these as: + + * `confidence = "medium"` and: + + * If `$obj` variable has a clear `new ClassName` assignment in the same function, try to infer class and use same resolution rules as static calls. + * Otherwise, create edges from calling node to all methods named `bar` in **any class inside the same component**. + * This is over-approximate but conservative. + +4. **Resolve static method calls `Foo::bar()`**: + + * Resolve `Foo` to FQN using namespace + imports (same as functions). + * Build FQN `\Ns\Foo::bar`. + * Look up in `symbolTable.methodsByFqn`. + * Mark `confidence = "high"` when resolved. + +5. **Connect entrypoints**: + + * For each entrypoint file: + + * Identify top-level calls in that file (same rules as above). + * Edges: + + * `type = "entry_call"` + * `from = entrypointNodeId` + * `to = resolved callee node` + +**Output:** + +* `edges[]` with `call` and `entry_call` edges. + +--- + +### 4.6 Step 5 – Map Vulnerabilities to Nodes + +For each vulnerability: + +1. If `symbolFqn` is not null: + + * If `symbolKind == "method"` → look into `symbolTable.methodsByFqn`. + * If `symbolKind == "function"` → `symbolTable.functionsByFqn`. + +2. If found → record `targetNodeId` in a lookup: `vulnId → nodeId`. + +3. If not found → `status` will later become `not_analyzed`. + +--- + +### 4.7 Step 6 – Reachability Algorithm + +Core logic: multiple BFS (or DFS) from entrypoints over the call graph. + +**Pre-compute entry roots:** + +* `entryNodes` = ids of all nodes with `kind = "entrypoint"`. 
+ +**Algorithm (BFS from all entrypoints):** + +Pseudo-code (language-agnostic): + +```php +function computeReachability(Graph $graph, array $entryNodes): ReachabilityContext { + $queue = new SplQueue(); + $visited = []; // nodeId => true + $predecessor = []; // nodeId => parent nodeId (for path reconstruction) + $edgeConfidenceOnPath = []; // nodeId => "high" | "medium" | "low" + + foreach ($entryNodes as $entryId) { + $queue->enqueue($entryId); + $visited[$entryId] = true; + $edgeConfidenceOnPath[$entryId] = "high"; + } + + while (!$queue->isEmpty()) { + $current = $queue->dequeue(); + + foreach ($graph->outEdges($current) as $edge) { + if ($edge->type !== 'call' && $edge->type !== 'entry_call') { + continue; + } + + $next = $edge->to; + if (isset($visited[$next])) { + continue; + } + + $visited[$next] = true; + $predecessor[$next] = $current; + + // propagate confidence (lowest on the path wins) + $prevConf = $edgeConfidenceOnPath[$current] ?? "high"; + $edgeConf = $edge->confidence; // "high"/"medium"/"low" + $edgeConfidenceOnPath[$next] = minConfidence($prevConf, $edgeConf); + + $queue->enqueue($next); + } + } + + return new ReachabilityContext($visited, $predecessor, $edgeConfidenceOnPath); +} + +function minConfidence(string $a, string $b): string { + $order = ["high" => 3, "medium" => 2, "low" => 1]; + return ($order[$a] <= $order[$b]) ? $a : $b; +} +``` + +**Classify each vulnerability:** + +For each vulnerability with `targetNodeId`: + +1. If `targetNodeId` is missing → `status = "not_analyzed"`. +2. Else if `targetNodeId` is **not** in `visited` → `status = "unreachable"`. +3. Else: + + * Let `conf = edgeConfidenceOnPath[targetNodeId]`. + * If `conf == "high"` → `status = "reachable"`. + * If `conf == "medium" or "low"` → `status = "maybe_reachable"`. + +**Path reconstruction:** + +To generate one example path: + +```php +function reconstructPath(array $predecessor, string $targetId): array { + $path = []; + $current = $targetId; + while (isset($predecessor[$current])) { + array_unshift($path, $current); + $current = $predecessor[$current]; + } + array_unshift($path, $current); // entrypoint at start + return $path; +} +``` + +Store that `path` array in `reachabilityResults[].paths[]`. + +--- + +## 5. Handling PHP “messy bits” (v1 rules) + +This is where we mark things as `maybe` instead of pretending we know. + +1. **Dynamic function names** `$fn()`: + + * Create **no edges** by default in v1. + * Optionally, if `$fn` is a constant string literal obvious in the same function, treat as a normal call. + * Otherwise: leave it out and accept that some cases will be missed → vulnerability may be marked `unreachable` but flagged with `analysisMeta.dynamicCallsIgnored = true`. + +2. **Dynamic methods** `$obj->$method()`: + + * Same principle as above. + +3. **Reflection / `call_user_func` / `call_user_func_array`**: + + * v1: do not try to resolve. + * Optional: track the call sites; mark their outgoing edges as `confidence = "low"` and connect to **all** functions/methods of that name when the name is a string literal. + +4. **Includes** (`include`, `require`, `require_once`, `include_once`): + + * v1 simplest rule: + + * Treat the included file as **fully reachable** from the including file. + * Pseudo-implementation: when building symbol table, everything defined in the included file is considered potentially called by the including file’s entrypoint logic. 
+  * Implementation shortcut:
+
+    * For the first version, you can even skip modeling edges, and instead mark all nodes in included files as "reachable from the entrypoint" if included directly by an entrypoint file. Later refine.
+
+---
+
+## 6. What the engineer actually builds (modules & tasks)
+
+You can frame it to them like this:
+
+1. **Module `PhpProjectLoader`**
+
+   * Reads project root, finds `composer.json`, `composer.lock`.
+   * Produces `components[]` and mapping from file-path → componentId.
+
+2. **Module `PhpAstIndexer`**
+
+   * Uses `nikic/php-parser`.
+   * For each `.php` file:
+
+     * Produces entries in `symbolTable`.
+     * Produces base `nodes[]` (functions, methods).
+   * Creates `entrypoint` nodes based on known file patterns.
+
+3. **Module `PhpCallGraphBuilder`**
+
+   * Walks AST again:
+
+     * For each callable body, finds call expressions.
+     * Resolves calls via `symbolTable`.
+   * Produces `edges[]`.
+
+4. **Module `PhpReachabilityEngine`**
+
+   * Runs BFS from entrypoints.
+   * Classifies per-vulnerability reachability.
+
+5. **Module `GraphSerializer`**
+
+   * Assembles everything into the JSON schema described in sections 2–3.
+
+Each module is testable with small sample projects.
+
+---
+
+## 7. Minimal working example (very small)
+
+Project:
+
+```php
+<?php
+// public/index.php
+require __DIR__ . '/../vendor/autoload.php';
+
+(new \App\Controller\HomeController())->index();
+
+<?php
+// src/Controller/HomeController.php
+namespace App\Controller;
+
+use Vendor\LibA\Foo;
+
+class HomeController
+{
+    public function index(): void
+    {
+        (new Foo())->dangerousMethod();
+    }
+}
+```
+
+Vulnerability:
+
+```json
+{
+  "id": "CVE-2020-1234",
+  "componentPurl": "pkg:composer/vendor/lib-a@1.2.3",
+  "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod",
+  "symbolKind": "method"
+}
+```
+
+Expected reachability path (conceptually):
+
+```json
+[
+  "entry:public/index.php",
+  "\\App\\Controller\\HomeController::index",
+  "\\Vendor\\LibA\\Foo::dangerousMethod"
+]
+```
+
+Status: `reachable` with `confidence = high`.
+
+---
+
+If you’d like, next step I can:
+
+* Strip this into a **formal JSON Schema** file (`reachability-php-graph.schema.json`) and
+* Propose a **directory layout + interfaces** in C#/.NET 10 for `StellaOps.Scanner.Php` so you can drop it straight into the repo.
+Here is a JavaScript/TypeScript reachability spec that a mid-level engineer can actually implement, but which is still “best in class” in terms of clarity, determinism, and extensibility.
+
+I’ll mirror the PHP structure you already have so Scanner/WebService and Sbomer can treat them uniformly.
+
+---
+
+## 1. Scope of JS Reachability v1
+
+**Goal**
+
+Given:
+
+* A JS/TS project (Node-centric),
+* `package.json` + lockfile (`package-lock.json` / `yarn.lock` / `pnpm-lock.yaml`),
+* A list of vulnerable symbols (tied to npm PURLs),
+
+produce:
+
+1. A **function-level call graph** (nodes + edges).
+2. Mapping of nodes to **components** (`pkg:npm/...` PURLs).
+3. A **reachability verdict** for each vulnerable symbol:
+
+   * `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed`
+   * With at least one example call path when reachable/maybe_reachable.
+
+**Deliberate v1 constraints**
+
+To keep it very implementable:
+
+* Target runtime: **Node.js** (server-side).
+* Source: **TypeScript + JavaScript** in one unified analysis.
+
+  * Use TypeScript compiler with `allowJs: true` so JS and TS share the same Program.
+* Modules:
+
+  * ES Modules (`import`/`export`).
+  * CommonJS (`require`, `module.exports`, `exports`).
+* Supported calls:
+
+  * Direct calls: `foo()`.
+  * Method calls: `obj.method()`, `Class.method()`.
+* Bundlers (Webpack, Vite, etc.): **out of scope v1** (treat source before bundling).
+
+* Dynamic features (handled conservatively, see below):
+
+  * `eval`, `Function` constructor, dynamic imports, `obj[methodName]()`, etc.
+
+---
+
+## 2. Reachability Graph Document (JSON)
+
+Same high-level shape as PHP, but annotated for JS/TS.
+
+```json
+{
+  "schemaVersion": "1.0.0",
+  "language": "javascript",
+  "project": {
+    "projectId": "my-node-app",
+    "rootDir": "/app",
+    "hash": "sha256:..."
+  },
+  "components": [],
+  "nodes": [],
+  "edges": [],
+  "vulnerabilities": [],
+  "reachabilityResults": []
+}
+```
+
+### 2.1 Components
+
+Each npm package (including the root app) is a component.
+
+```json
+{
+  "id": "comp-1",
+  "purl": "pkg:npm/express@4.19.2",
+  "name": "express",
+  "version": "4.19.2",
+  "isRoot": false,
+  "extras": {
+    "resolvedPath": "node_modules/express"
+  }
+}
+```
+
+For the root project, you can use:
+
+```json
+{
+  "id": "comp-root",
+  "purl": "pkg:npm/my-company-my-app@1.0.0",
+  "name": "my-company-my-app",
+  "version": "1.0.0",
+  "isRoot": true
+}
+```
+
+A mid-level engineer can easily build this from `package.json` + the chosen lockfile.
+
+---
+
+### 2.2 Nodes (callables & entrypoints)
+
+Every node is a callable or an entrypoint.
+
+```json
+{
+  "id": "node-uuid-or-hash",
+  "kind": "function | method | arrow | class_constructor | entrypoint",
+  "name": "handleRequest",
+  "fqn": "src/controllers/userController.ts::handleRequest",
+  "file": "src/controllers/userController.ts",
+  "line": 42,
+  "componentId": "comp-root",
+  "purl": "pkg:npm/my-company-my-app@1.0.0",
+  "exportName": "handleRequest",
+  "exportKind": "named | default | none",
+  "className": "UserController",
+  "entryPointType": "http_route | cli | worker | unknown | null",
+  "extras": {
+    "isAsync": true,
+    "isRouteHandler": true
+  }
+}
+```
+
+**Rules for node creation**
+
+* **Function node**
+
+  * `kind = "function"` for `function foo() {}` and `export function foo() {}`.
+  * `fqn` = `<file>::foo`.
+* **Arrow function node**
+
+  * `kind = "arrow"` when it is used as a callback that matters (e.g. Express handler).
+  * Option: generate a synthetic name like `file.ts::<line>:<col>`.
+* **Method node**
+
+  * `kind = "method"` for class methods.
+  * `fqn` = `<file>::ClassName.methodName`.
+* **Class constructor node**
+
+  * `kind = "class_constructor"` for `constructor()` if you want constructor-level analysis.
+* **Entrypoint node**
+
+  * `kind = "entrypoint"`.
+  * `entryPointType` according to detection rules (see §4).
+  * `fqn` = `<file>::<entry-name>`, e.g. `src/server.ts::node-entry`.
+
+You don’t need to over-engineer FQNs; they just need to be stable and unique.
+
+---
+
+### 2.3 Edges
+
+Edges model function/method/module relationships.
+
+```json
+{
+  "id": "edge-uuid-or-hash",
+  "from": "node-id-1",
+  "to": "node-id-2",
+  "type": "call | entry_call | import | export",
+  "confidence": "high | medium | low",
+  "extras": {
+    "callExpression": "userController.handleRequest(req, res)",
+    "file": "src/routes/userRoutes.ts",
+    "line": 30
+  }
+}
+```
+
+For reachability v1, **only `call` and `entry_call` are required**. `import`/`export` edges are useful for debugging but not strictly necessary for BFS reachability.
+
+---
+
+### 2.4 Vulnerabilities
+
+Library-level vulnerabilities are described in terms of npm PURL and symbol.
+ +```json +{ + "id": "CVE-2020-1234", + "source": "internal-db-or-nvd-id", + "componentPurl": "pkg:npm/some-lib@1.2.3", + "packageName": "some-lib", + "symbolExportName": "dangerousFunction", + "symbolKind": "function | method", + "severity": "critical", + "extras": { + "description": "Prototype pollution in dangerousFunction", + "range": ">=1.0.0 <1.2.5" + } +} +``` + +At graph-build time, you pre-resolve `symbolExportName` → `node.id` where possible. + +--- + +### 2.5 Reachability Results + +Exactly the same shape as for PHP. + +```json +{ + "vulnerabilityId": "CVE-2020-1234", + "componentPurl": "pkg:npm/some-lib@1.2.3", + "symbolExportName": "dangerousFunction", + "targetNodeId": "node-123", + "status": "reachable | maybe_reachable | unreachable | not_analyzed", + "reason": "short explanation", + "paths": [ + ["entry-node-1", "node-20", "node-50", "node-123"] + ], + "analysisMeta": { + "algorithmVersion": "1.0.0", + "maxDepth": 200, + "timestamp": "2025-11-20T19:30:00Z" + } +} +``` + +--- + +## 3. Module & Symbol Resolution (JS/TS specifics) + +Backend: **TypeScript compiler API** with `allowJs: true`. + +### 3.1 Build TS Program + +1. Generate a `tsconfig.reachability.json` with: + + * `allowJs: true` + * `checkJs: true` + * `moduleResolution: "node"` or `"bundler"` depending on project. + * `rootDir` set to project root. +2. Use TS API to create `Program`. +3. Use `TypeChecker` to resolve symbols where possible. + +This gives you: + +* File list (including JS/TS). +* Symbols for exports/imports. +* Class and function definitions. + +### 3.2 Export indexing per module + +For each source file: + +* Enumerate: + + * `export function foo() {}` + * `export const bar = () => {}` + * `export default function () {}` / `export default class {}`. + * `export { foo }` statements. + * `module.exports = ...` / `exports.foo = ...` (handle as CommonJS exports). + +Build an index: + +```ts +interface ExportedSymbol { + moduleFile: string; // relative path + exportName: string; // "foo", "default" + nodeId: string; // ID in nodes[] +} +``` + +### 3.3 Import resolution + +For each `ImportDeclaration`: + +* `import { foo as localFoo } from 'some-lib'` + + * Map `localFoo` → `(module='some-lib', exportName='foo')`. + +* `import foo from 'some-lib'` + + * Map `foo` → `(module='some-lib', exportName='default')`. + +* `import * as lib from 'some-lib'` + + * Map namespace `lib` → `(module='some-lib', exportName='*')`. + +For CommonJS: + +* `const x = require('some-lib')` + + * Map `x` → `(module='some-lib', exportName='*')`. + +* `const { dangerousFunction } = require('some-lib')` + + * Map `dangerousFunction` → `(module='some-lib', exportName='dangerousFunction')`. + +Later, when you see calls, you use this mapping. + +--- + +## 4. Entrypoint Detection (Node-centric) + +v1 rules that are easy to implement: + +1. **CLI entrypoints** + + * Files listed in `bin` section of `package.json`. + * Files with `#!/usr/bin/env node` shebang. + * Node: + + * `kind = "entrypoint"`, + * `entryPointType = "cli"`. + +2. **Server entrypoints** + + * Heuristic: look for `src/server.ts`, `src/index.ts`, `index.js` at project root. + * Mark them as `entrypoint` with `entryPointType = "http_route"`. + +3. **Framework routes (Express v1)** + + * Pattern: `const app = express(); app.get('/path', handler)`: + + * `handler` can be: + + * Identifier (function name), + * Arrow function, + * Function expression. 
For each such route:
+
+   * Create an `entrypoint` node per route or mark the handler callable as reachable from the server entrypoint:
+
+     * Easiest v1: create an **`entry_call` edge**:
+
+       * From server entrypoint node (e.g., file `src/server.ts`) to handler node.
+     * Mark handler node `extras.isRouteHandler = true`.
+
+You do not have to model individual HTTP methods or paths semantically in v1; just treat each handler as a reachable entrypoint into business logic.
+
+---
+
+## 5. Call Graph Construction
+
+This is the heart of the algorithm.
+
+### 5.1 Node creation (summary)
+
+While visiting AST:
+
+* For each:
+
+  * `FunctionDeclaration`
+  * `MethodDeclaration`
+  * `ArrowFunction` (that is:
+
+    * exported, or
+    * assigned to a variable that is used as a callback/handler)
+* Create a `node`.
+
+Tie each node to:
+
+* `file` (relative path),
+* `line` (start line),
+* `componentId` (from mapping file path → package),
+* optional `exportName` (if exported from module).
+
+### 5.2 Call extraction rules
+
+For each function/method body (i.e., node):
+
+#### 5.2.1 Direct calls: `foo()`
+
+* If callee is an identifier `foo`:
+
+  1. Check if `foo` is a **local function** in the same file.
+  2. If not, check import alias table:
+
+     * If `foo` maps to `(module='pkg', exportName='bar')`, then:
+
+       * Resolve to exported symbol for `pkg` + `bar` if you have its sources.
+       * If library source not indexed, create a synthetic node for that library export (optional).
+  3. If resolved, add edge:
+
+     * `type = "call"`,
+     * `confidence = "high"`.
+
+#### 5.2.2 Property calls: `obj.method()`
+
+* If callee is `obj.method(...)`:
+
+  1. If `obj` is an imported namespace:
+
+     * e.g. `import * as lib from 'some-lib'; lib.dangerousFunction()`.
+     * Then treat:
+
+       * `module='some-lib'`, `exportName='dangerousFunction'`.
+       * Edge `confidence = "high"`.
+
+  2. If `obj` is created via `new ClassName()` where `ClassName` is known:
+
+     * Use TypeScript type checker or simple pattern:
+
+       * Look for `const obj = new ClassName(...)` in same function.
+     * Map to method `ClassName.method`.
+     * Edge `confidence = "high"`.
+
+  3. Else:
+
+     * As a v1 heuristic, you **do not** spread to everything; instead:
+
+       * Either:
+
+         * Skip the edge and lose some coverage, or
+         * Add a `confidence = "medium"` edge from the current node to **all methods called `method`** in the same component.
+     * Recommended: medium-confidence to all same-name methods in same component (conservative, but safe).
+
+#### 5.2.3 CommonJS require patterns
+
+* `const x = require('some-lib'); x.dangerousFunction()`:
+
+  * Track variable → module mapping from `require`.
+  * When you see `x.something()`:
+
+    * `module='some-lib'`, `exportName='something'`.
+    * `confidence = "medium"` (less structured than ES import).
+
+#### 5.2.4 Dynamic imports & very dynamic calls
+
+* `await import('some-lib')`, `obj[methodName]()`, `eval`, `Function`, etc.:
+
+  v1 policy (simple and honest):
+
+  * Do **not** create specific edges unless:
+
+    * The target module name is a **string literal** and the method name is a **string literal** in the same expression.
+  * Otherwise:
+
+    * Optionally create a single edge from the current node to a special `node-unknown` with `confidence = "low"`.
+    * This preserves a record that “something dynamic happens here” without lying.
+
+---
+
+## 6. Mapping Nodes to Components (PURLs)
+
+Using the filesystem:
+
+* If file path begins with `node_modules/<pkgName>/...`:
+
+  * Map that file to component with `name = pkgName` and the version from lockfile.
+
+* All other files belong to the root component (the app) or to a local “workspace” package if you support monorepos later.
+
+Each node inherits `componentId` from its file. Each component has a `purl`:
+
+* `pkg:npm/<name>@<version>`.
+
+This is how you connect reachability to SBOM/VEX later.
+
+---
+
+## 7. Vulnerability → Node mapping
+
+Given a vulnerability:
+
+```json
+{
+  "componentPurl": "pkg:npm/some-lib@1.2.3",
+  "packageName": "some-lib",
+  "symbolExportName": "dangerousFunction"
+}
+```
+
+Steps:
+
+1. Find `componentId` by matching `componentPurl` or `packageName`.
+2. In that component, find node(s) where:
+
+   * `exportName == "dangerousFunction"`, or
+   * For CommonJS, any top-level function marked as part of the module’s exports under that name.
+3. If found:
+
+   * `targetNodeId = node.id`.
+4. If not:
+
+   * Mark `not_analyzed` later.
+
+---
+
+## 8. Reachability Algorithm (BFS)
+
+Exactly like PHP v1, but now over JS nodes.
+
+**Pre-compute:**
+
+* `entryNodes` = all nodes where `kind = "entrypoint"`.
+
+**Compute reachable set:**
+
+```ts
+function computeReachability(graph: Graph, entryNodes: string[]): ReachabilityContext {
+  const queue: string[] = [];
+  const visited: Record<string, boolean> = {};
+  const predecessor: Record<string, string> = {};
+  const edgeConfidenceOnPath: Record<string, "high" | "medium" | "low"> = {};
+
+  for (const entry of entryNodes) {
+    queue.push(entry);
+    visited[entry] = true;
+    edgeConfidenceOnPath[entry] = "high";
+  }
+
+  while (queue.length > 0) {
+    const current = queue.shift()!;
+
+    for (const edge of graph.outEdges(current)) {
+      if (edge.type !== "call" && edge.type !== "entry_call") continue;
+
+      const next = edge.to;
+      if (visited[next]) continue;
+
+      visited[next] = true;
+      predecessor[next] = current;
+
+      const prevConf = edgeConfidenceOnPath[current] ?? "high";
+      const edgeConf = edge.confidence;
+      edgeConfidenceOnPath[next] = minConfidence(prevConf, edgeConf);
+
+      queue.push(next);
+    }
+  }
+
+  return { visited, predecessor, edgeConfidenceOnPath };
+}
+
+function minConfidence(a: "high" | "medium" | "low",
+                       b: "high" | "medium" | "low"): "high" | "medium" | "low" {
+  const order: Record<"high" | "medium" | "low", number> = { high: 3, medium: 2, low: 1 };
+  return order[a] <= order[b] ? a : b;
+}
+```
+
+**Classify per vulnerability:**
+
+For each vulnerability with `targetNodeId`:
+
+1. If missing → `status = "not_analyzed"`.
+2. If `targetNodeId` not in `visited` → `status = "unreachable"`.
+3. Otherwise:
+
+   * `conf = edgeConfidenceOnPath[targetNodeId]`.
+   * If `conf == "high"` → `status = "reachable"`.
+   * Else (`medium` or `low`) → `status = "maybe_reachable"`.
+
+**Path reconstruction:**
+
+Same as PHP:
+
+```ts
+function reconstructPath(predecessor: Record<string, string>,
+                         targetId: string): string[] {
+  const path: string[] = [];
+  let current: string | undefined = targetId;
+
+  while (current !== undefined) {
+    path.unshift(current);
+    current = predecessor[current];
+  }
+
+  return path;
+}
+```
+
+Store at least one path in `paths[]`.
+
+---
+
+## 9. Handling JS “messy bits” (v1 rules)
+
+You want to be honest, not magical. So:
+
+1. **eval, new Function, dynamic import with non-literal arguments**
+
+   * Do not pretend you know where control goes.
+   * Either:
+
+     * Ignore for graph (recommended v1), or
+     * Edge to `node-unknown` with `confidence="low"`.
+   * Mark in `analysisMeta` that dynamic features were detected.
+
+2. **obj[methodName]() with unknown methodName**
+
+   * If `methodName` is a string literal and `obj` is clearly typed, you can resolve.
+   * Otherwise: no edges (or low-confidence to `node-unknown`).
+
+3.
**No source for library** + + * If you do not index `node_modules`, you cannot trace inside vulnerable library. + * Still useful: we just need the library’s exported symbol node as “synthetic”: + + * Create a synthetic node representing `some-lib::dangerousFunction` and attach all calls to it. + * That node gets `componentId` for `some-lib`. + * Reachability is still valid (we do not need the internal implementation for SCA). + +--- + +## 10. Implementation plan for a mid-level engineer + +Assume this runs in a **Node.js/TypeScript container** that Scanner calls, returning JSON. + +### 10.1 Modules to build + +1. `JsProjectLoader` + + * Reads `package.json` + lockfile. + * Builds `components[]` (npm packages + root app). + * Maps file paths → `componentId`. + +2. `TsProgramBuilder` + + * Generates `tsconfig.reachability.json`. + * Creates TS Program with `allowJs: true`. + * Exposes `sourceFiles` and `typeChecker`. + +3. `JsSymbolIndexer` + + * Walks all source files. + * Indexes: + + * Exported functions/classes. + * Imported bindings / requires. + * Creates base `nodes[]` and export index. + +4. `JsEntrypointDetector` + + * Reads `package.json` for `bin` and main entry. + * Applies server/Express heuristics. + * Adds `entrypoint` nodes. + +5. `JsCallGraphBuilder` + + * For each function/method node: + + * Traverses its AST. + * Emits `call` edges as per §5. + * Emits `entry_call` edges for server/route wiring. + +6. `VulnerabilityMapper` + + * Takes vulnerability input (package + symbolExportName). + * Resolves them to `targetNodeId`. + +7. `ReachabilityEngine` + + * Implements BFS algorithm. + * Produces `reachabilityResults[]` with paths. + +8. `GraphSerializer` + + * Packs everything into the JSON structure described in §§2–3. + +### 10.2 Minimal example (to sanity check) + +Project snippet: + +```ts +// src/server.ts +import express from 'express'; +import { dangerousFunction } from 'some-lib'; + +const app = express(); + +app.get('/test', (req, res) => { + dangerousFunction(req.query.input); + res.send('ok'); +}); + +app.listen(3000); +``` + +Vulnerability: + +```json +{ + "id": "CVE-2020-1234", + "componentPurl": "pkg:npm/some-lib@1.2.3", + "packageName": "some-lib", + "symbolExportName": "dangerousFunction" +} +``` + +Expected graph & result: + +* Entrypoint node: + + * `src/server.ts::node-entry` (or similar). +* Handler arrow function node for `/test`. +* Node for `some-lib::dangerousFunction` (real or synthetic). + +Reachability path: + +```json +[ + "entry:src/server.ts::node-entry", + "node-handler-/test", + "node-some-lib::dangerousFunction" +] +``` + +Status: `reachable`, `confidence = high`. + +--- + +If you want, next iteration I can: + +* Turn this into a formal JSON Schema (`reachability-js-graph.schema.json`), +* And sketch the exact Node/TS code structure (file layout, TS interfaces) for the container that StellaOps.Scanner would call. +You already have a very solid v1 for both PHP and JS. The next step is not to add “more stuff” randomly, but to make the specs: + +* More **uniform** (one core reachability model, multiple languages). +* More **honest** about uncertainty. +* More **useful** for scoring, policy, and SBOM/VEX. + +Here is what I would improve. + +--- + +## 1. Cross‑language improvements (applies to both PHP & JS) + +### 1.1 Unify under a single core schema + +Right now PHP and JS are parallel but not explicitly unified. I would define: + +* A **language‑agnostic core**: + + * `Node` (id, kind, file, line, componentId, purl, tags). 
+ * `Edge` (id, from, to, type, confidence, tags). + * `Vulnerability` (id, componentPurl, symbolId or symbolFqn, severity, tags). + * `ReachabilityResult` (vulnId, targetNodeId, status, paths[], analysisMeta). +* A **language extension block**: + + * `phpExtras` (namespace, className, visibility, etc.). + * `jsExtras` (exportName, exportKind, isAsync, etc.). + +This gives you one “Reachability Graph 1.x” spec with per‑language specialisation instead of two separate specs. + +### 1.2 Stronger identity & hashing rules + +Make node and edge IDs deterministic and explicitly specified: + +* Node ID derived from: + + * `language`, `componentId`, `file`, `fqn`, `kind` → `sha256` truncated. +* Edge ID derived from: + + * `from`, `to`, `type`, `file`, `line`. + +Benefits: + +* Stable IDs across runs for the same code → easy diffing, caching, incremental scans. +* Downstream tools (policy engine, UI) can key on IDs confidently. + +### 1.3 Multi‑axis confidence instead of a single label + +Replace the single `confidence` enum with **multi‑axis confidence**: + +```json +"confidence": { + "resolution": "high|medium|low", // how well we resolved the callee + "typeInference": "high|medium|low", + "controlFlow": "high|medium|low" +} +``` + +And define: + +* `pathConfidence` = min of all axes along the path. +* `status` still uses `reachable` / `maybe_reachable` / etc., but you retain the underlying breakdown for scoring and debugging. + +### 1.4 Path conditions and guards (lightweight) + +Introduce optional **path condition annotations** on edges: + +```json +"extras": { + "guard": "if ($userIsLoggedIn)", + "guardType": "auth | feature_flag | input_validation | unknown" +} +``` + +You do not need full symbolic execution. A simple heuristic suffices: + +* Detect `if (...)` around the call and capture the textual condition. +* Categorize by simple patterns (presence of `isAdmin`, `feature`, `flag`, etc.). + +Later, the Trust Algebra can say: “reachable only under feature flag + behind auth → downgrade risk.” + +### 1.5 Partial coverage & truncation flags + +Make the graph self‑describing about its **limitations**: + +At graph level: + +```json +"analysisMeta": { + "languages": ["php"], + "vendorCodeParsed": true, + "dynamicFeaturesHandled": ["dynamic-includes-partial", "reflection-ignored"], + "maxNodes": 500000, + "truncated": false +} +``` + +Per‑node or per‑file: + +```json +"extras": { + "parseErrors": false, + "analysisSkippedReason": null +} +``` + +Per‑vulnerability: + +* Add `coverageStatus`: `full`, `partial`, `unknown` to complement `status`. + +This avoids a common trap: tools silently dropping edges/nodes and still reporting “unreachable.” + +### 1.6 First‑class SBOM/VEX linkage + +You already include PURLs. Go one step further: + +* `componentId` links to: + + * `bomRef` (CycloneDX) or `componentId` (SPDX) if available. +* `vulnerabilityId` links to: + + * `vexRef` in any existing VEX document. + +This allows: + +* A VEX producer to say “not affected / affected but not exploited” with **explicit reference** to the reachability graph and specific `targetNodeId`s. + +--- + +## 2. PHP‑specific improvements + +### 2.1 Autoloader‑aware edges as first‑class concept + +Right now autoload is mostly implicit. Make it explicit and deterministic: + +* During Composer metadata processing, build: + + * **Autoload map**: `FQN class → file`. +* Add `autoload` edges: + + * From “usage site” node (where `new ClassName()` first appears) to a **file‑level node** representing the defining file. 
+ +Why it helps: + +* Clarifies how classes were resolved (or not). +* Easier to debug “class not found” vs “we never parsed vendor code.” + +### 2.2 More precise includes / requires + +Upgrade the naive rule “everything in included file is reachable”: + +1. Represent each file as a special node `kind="file"`. +2. `include` / `require` statements produce `include` edges from current node/file to the file node. +3. Then: + + * All functions/methods defined in that file get `define_in` edges from file node. + * A separate simple pass marks them reachable from that file’s callers. + +Add a nuance: + +* If the include path is static and resolved at scan time → `resolution.high`. +* If dynamic (e.g., `include $baseDir.'/file.php';`) → `resolution.medium` or `low`. + +### 2.3 Better dynamic dispatch handling for methods + +Current v1 rule (“connect to all methods with that name in the component”) is safe but noisy. + +Refinement: + +* Use **local type inference** in the same function/method: + + * `$x = new Foo(); $x->bar();` → high resolution. + * `$x = factory(); $x->bar();`: + + * If factory returns a union of known types, edges to those types with `resolution.medium`. +* Introduce a tag on edges: + + * `extras.dispatchKind = "static" | "local-new" | "factory-heuristic" | "unknown"`. + +This preserves the safety of your current design but cuts down false positives for common patterns. + +### 2.4 Framework‑aware entrypoints (v2, but spec‑ready now) + +Extend `entryPointType` with framework flavors, even if initial implementation is shallow: + +* `laravel_http`, `symfony_http`, `wordpress_hook`, `drupal_hook`, etc. + +And allow: + +```json +"extras": { + "framework": "laravel", + "route": "GET /users", + "hookName": "init" +} +``` + +You do not have to implement every framework in v1, but the spec should allow these so you can ship small, incremental framework profiles without changing the schema. + +--- + +## 3. JavaScript/TypeScript‑specific improvements + +### 3.1 Explicit async / event‑loop edges + +Today all calls are treated uniformly. For JS/TS, you should model: + +* `setTimeout`, `setInterval`, `setImmediate`, `queueMicrotask`, `process.nextTick`, `Promise.then/catch/finally`, event emitters. + +Two improvements: + +1. Additional edge types: + + * `async_call`, `event_callback`, `timer_callback`. +2. Node extras: + + * `extras.trigger = "timer" | "promise" | "event" | "unknown"`. + +This lets you later express policies like: “reachable only via a rarely used cron‑like timer” vs “reachable via normal HTTP request.” + +### 3.2 Bundler awareness (but spec‑only in v1) + +Even if v1 implementation ignores bundlers, the spec should anticipate them: + +* Allow a **bundle mapping block**: + +```json +"bundles": [ + { + "id": "bundle-main", + "tool": "webpack", + "inputFiles": ["src/index.ts", "src/server.ts"], + "outputFiles": ["dist/main.js"] + } +] +``` + +* Optionally, allow edges: + + * `type = "bundle_map"` from source file nodes to bundled file nodes. + +You can attach reachability graphs to either pre‑bundle or post‑bundle views later, without breaking the schema. + +### 3.3 Stronger TypeScript‑based resolution + +Encode the fact that a call was resolved using TS type information vs heuristic: + +* On edges, add: + +```json +"extras": { + "resolutionStrategy": "ts-typechecker | local-scope | require-heuristic | unresolved" +} +``` + +This provides a clear line between “hard” and “soft” links for the scoring engine and for debugging why something is `maybe_reachable`. 
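+
+As a hedged sketch of how an analyzer might populate `resolutionStrategy` (the classification rules below are illustrative assumptions, not a prescribed algorithm), using the public TypeScript compiler API:
+
+```ts
+import * as ts from "typescript";
+
+type ResolutionStrategy = "ts-typechecker" | "local-scope" | "require-heuristic" | "unresolved";
+
+// Classify a call expression by how confidently we could name its callee.
+function classifyCall(call: ts.CallExpression, checker: ts.TypeChecker): ResolutionStrategy {
+  const callee = call.expression;
+
+  // The type checker resolved the callee to a declared symbol: a "hard" link.
+  const symbol = checker.getSymbolAtLocation(callee);
+  if (symbol?.declarations?.length) {
+    return "ts-typechecker";
+  }
+
+  // Bare identifier with no checker symbol: fall back to our own scope tracking.
+  if (ts.isIdentifier(callee)) {
+    return "local-scope";
+  }
+
+  // x.foo() where x is only known from a require(...) pattern: a "soft" link.
+  if (ts.isPropertyAccessExpression(callee)) {
+    return "require-heuristic";
+  }
+
+  return "unresolved";
+}
+```
+
+The point is not the exact rules but that every edge records which rung of this ladder produced it, so a `maybe_reachable` verdict can be explained mechanically.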
+ +### 3.4 Workspace / monorepo semantics + +Support Yarn / pnpm / npm workspaces at the schema level: + +* Allow components to have: + +```json +"extras": { + "workspace": "packages/service-a", + "isWorkspaceRoot": false +} +``` + +And support edges: + +* `type = "workspace_dep"` for internal package imports. + +This makes it straightforward to see when a vulnerable library is pulled via an internal package boundary, which is common in large JS monorepos. + +--- + +## 4. Operational & lifecycle improvements + +### 4.1 Explicit incremental scan support + +Add an optional **delta section** so a scanner can emit only changes: + +```json +"delta": { + "baseGraphHash": "sha256:...", + "addedNodes": [...], + "removedNodeIds": [...], + "addedEdges": [...], + "removedEdgeIds": [...] +} +``` + +This is particularly valuable for large repos where full graphs are costly and CI needs fast turnaround. + +### 4.2 Test / non‑prod code classification + +Mark nodes/edges originating from tests or non‑prod code: + +* `extras.codeRole = "prod | test | devtool | unknown"`. + +Entry points from test runners (e.g., PHPUnit, Jest, Mocha) should either be: + +* Ignored (default), or +* Explicitly flagged as `entryPointType = "test"` so policies can decide whether to count that reachability. + +### 4.3 Normative definitions of statuses + +You already use `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed`. Make the semantics **normative** in the spec: + +* Tie `reachable` / `maybe_reachable` to: + + * Existence of a path from **at least one recognized entrypoint**. + * Minimum `pathConfidence` thresholds. +* Require that tools distinguish: + + * “No path in the graph” vs “graph incomplete here.” + +This allows multiple tools to implement the spec and still produce comparable, auditable results. + +--- + +If you want, the next concrete step could be: + +* A **“Reachability Graph 1.1”** document that: + + * Extracts the shared core, + * Adds multi‑axis confidence, + * Adds partial‑coverage metadata, + * Extends the enums for edge types and entrypoint types for PHP/JS. + +That gives your team a clean target for implementation without materially increasing complexity for a mid‑level engineer. diff --git a/docs/product-advisories/archived/20-Nov-2026 - Encoding Binary Reachability with PURL‑Resolved Edges.md b/docs/product-advisories/archived/20-Nov-2025 - Encoding Binary Reachability with PURL‑Resolved Edges.md similarity index 96% rename from docs/product-advisories/archived/20-Nov-2026 - Encoding Binary Reachability with PURL‑Resolved Edges.md rename to docs/product-advisories/archived/20-Nov-2025 - Encoding Binary Reachability with PURL‑Resolved Edges.md index 8630e0976..c069d38d6 100644 --- a/docs/product-advisories/archived/20-Nov-2026 - Encoding Binary Reachability with PURL‑Resolved Edges.md +++ b/docs/product-advisories/archived/20-Nov-2025 - Encoding Binary Reachability with PURL‑Resolved Edges.md @@ -1,1088 +1,1088 @@ - - - - -Here’s a simple, practical way to think about **binary reachability** that cleanly joins call graphs with SBOMs—without reusing external tools. - ---- - -### The big idea (plain English) - -* Each **function call edge** in a binary’s call graph is annotated with: - - * a **purl** (package URL) identifying which component the callee belongs to, and - * a **symbol digest** (stable hash of the callee’s normalized symbol signature). 
-* With those two tags, call graphs from **PE/ELF/Mach‑O** can be merged across binaries and mapped onto your **SBOM components**, giving a **single vulnerability graph** that answers: *“Is this vulnerable function reachable in my deployment?”* - ---- - -### Why this matters for Stella Ops - -* **One graph to rule them all:** Libraries used by multiple services merge naturally via the same purl, so you see cross‑service blast radius instantly. -* **Deterministic & auditable:** Digests + purls make edges reproducible (great for “replayable scans” and audit trails). -* **Zero tool reuse required:** You can implement PE/ELF/Mach‑O parsing once in C# and still interoperate with SBOM/VEX ecosystems via purls. - ---- - -### Minimal data model - -```json -{ - "nodes": [ - {"id":"sym:hash:callee","kind":"symbol","purl":"pkg:nuget/Newtonsoft.Json@13.0.3","sig":"Newtonsoft.Json.JsonConvert::DeserializeObject(string)"}, - {"id":"bin:hash:myapi","kind":"binary","format":"pe","name":"MyApi.exe","build":"sha256:..."} - ], - "edges": [ - { - "from":"sym:hash:caller", - "to":"sym:hash:callee", - "etype":"calls", - "purl":"pkg:nuget/Newtonsoft.Json@13.0.3", - "sym_digest":"sha256:SYM_CALLEE", - "site":{"binary":"bin:hash:myapi","offset":"0x0041AFD0"} - } - ], - "sbom": [ - {"purl":"pkg:nuget/Newtonsoft.Json@13.0.3","component_id":"c-123","files":["/app/MyApi.exe"] } - ] -} -``` - ---- - -### How to build it (C#‑centric, binary‑first) - -1. **Lift symbols per format** - - * **PE**: parse COFF + PDB (if present), fallback to export tables; normalize “namespace.type::method(sig)”. - * **ELF**: `.dynsym`/`.symtab` + DWARF (if present); demangle (Itanium/LLVM rules). - * **Mach‑O**: LC_SYMTAB + DWARF; demangle. -2. **Compute `symbol digests`** - - * Hash of normalized signature + (optionally) instruction fingerprint for resilience to addresses. -3. **Build intra‑binary call graph** - - * Conservative static: function→function edges from **import thunks**, relocation targets, and lightweight disassembly of direct calls. - * Optional dynamic refinement: PERF/eBPF or ETW traces to mark *observed* edges. -4. **Resolve each callee to a `purl`** - - * Map import/segment to owning file → map file to SBOM component → emit its purl. - * If multiple candidates, emit edge with a small `candidates[]` set; policy later can prune. -5. **Merge graphs across binaries** - - * Union by `(purl, sym_digest)` for callees; keep multiple `site` locations. -6. **Attach vulnerabilities** - - * From VEX/CVE → affected package purls → mark reachable if any path exists from entrypoints to a vulnerable `(purl, sym_digest)`. - ---- - -### Practical policies that work well - -* **Entrypoints:** ASP.NET controller actions, `Main`, exported handlers, cron entry shims. -* **Edge confidence:** tag edges as `import`, `reloc`, `disasm`, or `runtime`; prefer runtime in prioritization. -* **Unknowns registry:** if symbol can’t be resolved, record `purl:"pkg:unknown"` with reason (stripped, obfuscated, thunk), so it’s visible—not silently dropped. - ---- - -### Quick win you can ship first - -* Start with **imports-only reachability** (no disassembly). For most CVEs in popular packages, imports + SBOM mapping already highlights real risk. -* Add **light disassembly** for direct `call` opcodes later to improve precision. - -If you want, I can turn this into a ready‑to‑drop **.NET 10 library skeleton**: parsers (PE/ELF/Mach‑O), symbol normalizer, digestor, graph model, and SBOM mapper with purl resolvers. 
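-
-Before the full spec, here is a tiny sketch of the merge rule from step 5 (TypeScript for brevity, mirroring the JS examples elsewhere in these advisories; field names are illustrative assumptions): callee identities are unioned by `(purl, sym_digest)` while every observed call site is kept.
-
-```ts
-// Sketch only: merge callee observations from many binaries by (purl, symDigest).
-interface CallSite { binaryId: string; offset: string; }
-interface Observation { purl: string; symDigest: string; site: CallSite; }
-interface MergedCallee { purl: string; symDigest: string; sites: CallSite[]; }
-
-function mergeCallees(observations: Observation[]): MergedCallee[] {
-  const byKey = new Map<string, MergedCallee>();
-  for (const o of observations) {
-    const key = `${o.purl}|${o.symDigest}`;  // stable composite identity
-    const entry = byKey.get(key) ?? { purl: o.purl, symDigest: o.symDigest, sites: [] };
-    entry.sites.push(o.site);                // keep every call site location
-    byKey.set(key, entry);
-  }
-  return [...byKey.values()];
-}
-```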
- -Below is a concrete, implementation-ready specification aimed at a solid, “average” C# developer. The goal is that they can build this module without knowing all of StellaOps context. - ---- - -## 1. Purpose and Scope - -Implement a reusable .NET library that: - -1. Reads binaries (PE, ELF, Mach-O). -2. Extracts **functions/symbols** and their **call relationships** (call graph). -3. Annotates each call edge with: - - * The **callee’s purl** (package URL / SBOM component). - * A **symbol digest** (stable function identifier). -4. Produces a **reachability graph** in memory and as JSON. - -This will be used by other StellaOps services (Scanner / Sbomer / Vexer) to answer: -“Is this vulnerable function from package X reachable in my environment?” - -Non-goals for v1: - -* No dynamic tracing (no eBPF, no ETW). Static only. -* No external CLI tools (no `objdump`, `llvm-nm`, etc.). Everything in-process and in C#. - ---- - -## 2. Project Structure - -Create a new class library: - -* Project: `StellaOps.Scanner.BinaryReachability` -* TargetFramework: `net10.0` -* Nullable: `enable` -* Language: latest C# available for .NET 10 - -Recommended namespaces: - -* `StellaOps.Scanner.BinaryReachability` -* `StellaOps.Scanner.BinaryReachability.Model` -* `StellaOps.Scanner.BinaryReachability.Parsing` -* `StellaOps.Scanner.BinaryReachability.Parsing.Pe` -* `StellaOps.Scanner.BinaryReachability.Parsing.Elf` -* `StellaOps.Scanner.BinaryReachability.Parsing.MachO` -* `StellaOps.Scanner.BinaryReachability.Sbom` -* `StellaOps.Scanner.BinaryReachability.Graph` - ---- - -## 3. Core Domain Model - -### 3.1 Enumerations - -```csharp -namespace StellaOps.Scanner.BinaryReachability.Model; - -public enum BinaryFormat -{ - Pe, - Elf, - MachO -} - -public enum SymbolKind -{ - Function, - Method, - Constructor, - Destructor, - ImportStub, - Thunk, - Unknown -} - -public enum EdgeKind -{ - DirectCall, - IndirectCall, - ImportCall, - ConstructorInit, // e.g. .init_array - Other -} - -public enum EdgeConfidence -{ - High, // import, relocation, clear direct call - Medium, // best-effort disassembly - Low // heuristics, fallback -} -``` - -### 3.2 Node and Edge Records - -```csharp -namespace StellaOps.Scanner.BinaryReachability.Model; - -public sealed record BinaryNode( - string BinaryId, // e.g. "bin:sha256:..." - string FilePath, // path in image or filesystem - BinaryFormat Format, - string? BuildId, // ELF build-id, Mach-O UUID, PE pdb-signature (optional) - string FileHash // sha256 of binary bytes -); - -public sealed record SymbolNode( - string SymbolId, // stable within this graph: "sym:{digest}" - string NormalizedName, // normalized signature/name - SymbolKind Kind, - string? Purl, // nullable: may be unknown - string SymbolDigest // sha256 of normalized name -); -``` - -### 3.3 Call Edge and Call Site - -```csharp -namespace StellaOps.Scanner.BinaryReachability.Model; - -public sealed record CallSite( - string BinaryId, - ulong Offset, // RVA / file offset - string? SourceFile, // Optional, if we can resolve - int? SourceLine // Optional -); - -public sealed record CallEdge( - string FromSymbolId, - string ToSymbolId, - EdgeKind EdgeKind, - EdgeConfidence Confidence, - string? 
### 3.3 Call Edge and Call Site

```csharp
namespace StellaOps.Scanner.BinaryReachability.Model;

public sealed record CallSite(
    string BinaryId,
    ulong Offset,       // RVA / file offset
    string? SourceFile, // optional, if we can resolve it
    int? SourceLine     // optional
);

public sealed record CallEdge(
    string FromSymbolId,
    string ToSymbolId,
    EdgeKind EdgeKind,
    EdgeConfidence Confidence,
    string? CalleePurl,        // resolved package of callee
    string CalleeSymbolDigest, // same as target SymbolDigest
    CallSite Site
);
```

### 3.4 Graph Container

```csharp
namespace StellaOps.Scanner.BinaryReachability.Graph;

using StellaOps.Scanner.BinaryReachability.Model;

public sealed class ReachabilityGraph
{
    public Dictionary<string, BinaryNode> Binaries { get; } = new();
    public Dictionary<string, SymbolNode> Symbols { get; } = new();
    public List<CallEdge> Edges { get; } = new();

    public void AddBinary(BinaryNode binary) => Binaries[binary.BinaryId] = binary;
    public void AddSymbol(SymbolNode symbol) => Symbols[symbol.SymbolId] = symbol;
    public void AddEdge(CallEdge edge) => Edges.Add(edge);
}
```

---

## 4. Public API (what other modules call)

Define a simple facade service that other StellaOps components use.

```csharp
namespace StellaOps.Scanner.BinaryReachability;

using StellaOps.Scanner.BinaryReachability.Graph;
using StellaOps.Scanner.BinaryReachability.Model;
using StellaOps.Scanner.BinaryReachability.Sbom;

public interface IBinaryReachabilityService
{
    /// <summary>
    /// Builds a reachability graph for all binaries in the given directory (e.g. an unpacked container filesystem),
    /// using SBOM data to resolve PURLs.
    /// </summary>
    ReachabilityGraph BuildGraph(
        string rootDirectory,
        ISbomComponentResolver sbomResolver);

    /// <summary>
    /// Serializes the graph to JSON for persistence / later replay.
    /// </summary>
    string SerializeGraph(ReachabilityGraph graph);
}
```

Implementation class:

```csharp
public sealed class BinaryReachabilityService : IBinaryReachabilityService
{
    // Will compose format-specific parsers and the SBOM resolver inside.
}
```

---

## 5. SBOM Component Resolver

We need only a minimal interface to attach PURLs to binaries and symbols.

```csharp
namespace StellaOps.Scanner.BinaryReachability.Sbom;

public interface ISbomComponentResolver
{
    /// <summary>
    /// Resolves the purl for a binary file (by path, build-id, or hash).
    /// Returns null if not found.
    /// </summary>
    string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash);

    /// <summary>
    /// Optional: resolves a purl by library name only (e.g. "libssl.so.3", "libcrypto.so.3").
    /// Used when we have imports but not a full path.
    /// </summary>
    string? ResolvePurlByLibraryName(string libraryName);
}
```

For the C# dev:

* The implementation will consume **CycloneDX/SPDX SBOMs** that already map files (hash/path/buildId) to components and purls.
* For v1, a simple resolver that:

  * Loads SBOM JSON.
  * Indexes components by:

    * File path (normalized).
    * File hash.
    * BuildId where available.
  * Implements the two methods above using dictionary lookups.
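As a rough illustration of that v1 resolver, here is a minimal sketch. It assumes the SBOM has already been flattened into a simple index (one JSON array of `{purl, path, hash, buildId, libraryName}` entries), so the `SbomEntry` shape and file layout are assumptions for this example, not a CycloneDX/SPDX schema.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using StellaOps.Scanner.BinaryReachability.Sbom;

// Assumed, simplified SBOM index entry -- not a CycloneDX/SPDX shape.
public sealed record SbomEntry(string Purl, string? Path, string? Hash, string? BuildId, string? LibraryName);

public sealed class JsonSbomComponentResolver : ISbomComponentResolver
{
    private readonly Dictionary<string, string> _byHash = new();
    private readonly Dictionary<string, string> _byBuildId = new();
    private readonly Dictionary<string, string> _byPath = new();
    private readonly Dictionary<string, string> _byLibraryName = new();

    public JsonSbomComponentResolver(string indexJsonPath)
    {
        var entries = JsonSerializer.Deserialize<List<SbomEntry>>(
            File.ReadAllText(indexJsonPath),
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true }) ?? new();

        // Build the lookup dictionaries once, up front.
        foreach (var e in entries)
        {
            if (e.Hash is not null) _byHash[e.Hash] = e.Purl;
            if (e.BuildId is not null) _byBuildId[e.BuildId] = e.Purl;
            if (e.Path is not null) _byPath[Normalize(e.Path)] = e.Purl;
            if (e.LibraryName is not null) _byLibraryName[e.LibraryName] = e.Purl;
        }
    }

    public string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash)
    {
        // Prefer strong identity (build-id, then content hash) over path hints.
        if (buildId is not null && _byBuildId.TryGetValue(buildId, out var byBuildId)) return byBuildId;
        if (_byHash.TryGetValue(fileHash, out var byHash)) return byHash;
        return _byPath.TryGetValue(Normalize(filePath), out var byPath) ? byPath : null;
    }

    public string? ResolvePurlByLibraryName(string libraryName)
        => _byLibraryName.TryGetValue(libraryName, out var purl) ? purl : null;

    private static string Normalize(string path) => path.Replace('\\', '/').TrimEnd('/');
}
```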
---

## 6. Binary Parsing Abstractions

### 6.1 Common Interface

```csharp
namespace StellaOps.Scanner.BinaryReachability.Parsing;

using StellaOps.Scanner.BinaryReachability.Model;

public interface IBinaryParser
{
    bool CanParse(string filePath, ReadOnlySpan<byte> header);

    /// <summary>
    /// Parses basic binary metadata: format and build-id; the file hash is already computed by the caller.
    /// </summary>
    BinaryNode ParseBinaryMetadata(string filePath, string fileHash);

    /// <summary>
    /// Parses functions/symbols from this binary.
    /// Returns a list of SymbolNode with Purl left null (it will be set later).
    /// </summary>
    IReadOnlyList<SymbolNode> ParseSymbols(BinaryNode binary);

    /// <summary>
    /// Builds intra-binary call edges (from this binary's functions to others), without PURL info.
    /// ToSymbolId should be based on SymbolDigest; PURL will be attached later.
    /// </summary>
    IReadOnlyList<CallEdge> ParseCallGraph(BinaryNode binary, IReadOnlyList<SymbolNode> symbols);
}
```

### 6.2 Parser Implementations

Create three concrete parsers:

* `PeBinaryParser` in `Parsing.Pe`
* `ElfBinaryParser` in `Parsing.Elf`
* `MachOBinaryParser` in `Parsing.MachO`

And a small factory:

```csharp
public sealed class BinaryParserFactory
{
    private readonly List<IBinaryParser> _parsers;

    public BinaryParserFactory()
    {
        _parsers = new List<IBinaryParser>
        {
            new Pe.PeBinaryParser(),
            new Elf.ElfBinaryParser(),
            new MachO.MachOBinaryParser()
        };
    }

    // A plain loop rather than LINQ: ReadOnlySpan<byte> cannot be captured by a lambda.
    public IBinaryParser? GetParser(string filePath, ReadOnlySpan<byte> header)
    {
        foreach (var parser in _parsers)
        {
            if (parser.CanParse(filePath, header))
                return parser;
        }

        return null;
    }
}
```

---

## 7. Symbol Normalization and Digesting

Create a small helper for consistent symbol IDs.

```csharp
namespace StellaOps.Scanner.BinaryReachability.Model;

public static class SymbolIdFactory
{
    public static string ComputeNormalizedName(string rawName)
        => rawName.Trim(); // v1: minimal; later we can extend (demangling, etc.)

    public static string ComputeSymbolDigest(string normalizedName)
    {
        using var sha = System.Security.Cryptography.SHA256.Create();
        var bytes = System.Text.Encoding.UTF8.GetBytes(normalizedName);
        var hash = sha.ComputeHash(bytes);
        var hex = Convert.ToHexString(hash).ToLowerInvariant();
        return hex;
    }

    public static string CreateSymbolId(string symbolDigest)
        => $"sym:{symbolDigest}";
}
```

Usage in parsers:

* For each function name the parser finds:

  * `normalizedName = SymbolIdFactory.ComputeNormalizedName(rawName);`
  * `digest = SymbolIdFactory.ComputeSymbolDigest(normalizedName);`
  * `symbolId = SymbolIdFactory.CreateSymbolId(digest);`
  * Create the `SymbolNode`.

Notes for the developer:

* Do not include file path or address in the digest (we want determinism across builds).
* In the future we can expand normalization to include demangled signatures and parameter types.

---

## 8. Building the Graph (step-by-step)

The implementation of `BinaryReachabilityService.BuildGraph` should follow this algorithm.

### 8.1 Scan Files

1. Recursively enumerate all files under `rootDirectory`.
2. For each file:

   * Open it as a stream.
   * Read the first 4–8 bytes as a header.
   * Try `BinaryParserFactory.GetParser`.
   * If no parser matches, skip the file.

### 8.2 Parse Binary Metadata and Symbols

For each parseable file:

1. Compute the SHA256 of the file content → `fileHash`.
2. `parser.ParseBinaryMetadata(filePath, fileHash)` → `BinaryNode`.
3. Add the `BinaryNode` to `ReachabilityGraph.Binaries`.
4. `parser.ParseSymbols(binary)` → list of `SymbolNode`.
5. For each symbol:

   * Add it to `ReachabilityGraph.Symbols` if not already present:

     * Key: `SymbolId`.
     * If it exists, keep the first or merge (for v1: keep the first).

Maintain an in-memory index:

```csharp
// symbolDigest -> SymbolNode
Dictionary<string, SymbolNode> symbolsByDigest;
```

### 8.3 Parse Call Graph per Binary

For each binary:

1. `parser.ParseCallGraph(binary, itsSymbols)` → edges (without PURLs attached).
2. For each edge:

   * Ensure `FromSymbolId` and `ToSymbolId` correspond to known `SymbolNode`s:

     * `ToSymbolId` should be `sym:{digest}` for the callee.
   * Add the edge to `ReachabilityGraph.Edges`.

At this point, edges know only `FromSymbolId`, `ToSymbolId`, kind, confidence, and `CallSite`.
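Before the PURL pass (8.4, next), steps 8.1–8.3 translate into a fairly mechanical loop. The sketch below is one possible shape, with error handling and parallelism omitted; the helper name is assumed, and the logic would live inside `BinaryReachabilityService.BuildGraph`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class GraphBuildSteps
{
    // Sketch of steps 8.1-8.3: scan, parse metadata/symbols, then edges.
    public static ReachabilityGraph BuildRawGraph(string rootDirectory, BinaryParserFactory parserFactory)
    {
        var graph = new ReachabilityGraph();
        // Remember which parser produced which binary so we can ask it for edges later.
        var parsed = new List<(IBinaryParser Parser, BinaryNode Binary, IReadOnlyList<SymbolNode> Symbols)>();

        // 8.1 + 8.2: discover files, sniff the header, parse metadata and symbols.
        foreach (var filePath in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))
        {
            using var stream = File.OpenRead(filePath);
            Span<byte> header = stackalloc byte[8];
            int read = stream.Read(header);
            var parser = parserFactory.GetParser(filePath, header[..read]);
            if (parser is null)
                continue; // not a recognized binary format

            stream.Position = 0;
            var fileHash = Convert.ToHexString(SHA256.HashData(stream)).ToLowerInvariant();

            var binary = parser.ParseBinaryMetadata(filePath, fileHash);
            graph.AddBinary(binary);

            var symbols = parser.ParseSymbols(binary);
            foreach (var symbol in symbols)
            {
                if (!graph.Symbols.ContainsKey(symbol.SymbolId))
                    graph.AddSymbol(symbol); // v1: keep the first occurrence
            }

            parsed.Add((parser, binary, symbols));
        }

        // 8.3: per-binary call edges (PURLs are attached in a later pass).
        foreach (var (parser, binary, symbols) in parsed)
        {
            foreach (var edge in parser.ParseCallGraph(binary, symbols))
                graph.AddEdge(edge);
        }

        return graph;
    }
}
```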
### 8.4 Attach PURLs

Now run a second pass to attach PURLs to symbols and edges. (Because the model types are immutable records, “assign” and “set” below mean replacing the stored instance via a `with` expression.)

1. For each `BinaryNode`:

   * Call `sbomResolver.ResolvePurlForBinary(binary.FilePath, binary.BuildId, binary.FileHash)`.
   * If not null, this is the **binary’s own purl** (used for “who owns these functions”).
2. Maintain:

```csharp
// BinaryId -> purl (may be null when the SBOM has no match)
Dictionary<string, string?> binaryPurlsById;
```

3. For each `CallEdge`:

   * Get the callee symbol:

     * `var symbol = graph.Symbols[edge.ToSymbolId];`
   * If `symbol.Purl` is null:

     * If the callee is local (same binary – the parser may mark it via metadata or `CallSite.BinaryId`):

       * Assign `symbol.Purl = binaryPurlsById[edge.Site.BinaryId]` (can be null).
     * If the callee is imported from an external library:

       * The parser should provide the library name in `NormalizedName` or additional metadata (for v1, you can store the library in a separate structure).
       * Use `sbomResolver.ResolvePurlByLibraryName(libraryName)` to find the purl.
       * Set `symbol.Purl` to that value (even if null).
   * Set `edge.CalleePurl = symbol.Purl`.
   * Set `edge.CalleeSymbolDigest = symbol.SymbolDigest`.

Note: For v1 you can simplify:

* Assume all callees in this binary belong to the binary’s purl.
* Later, extend to per-library mapping.

---

## 9. Format-Specific Minimum Requirements

For each parser, aim for this minimum.

### 9.1 PE Parser (Windows)

Tasks:

1. Identify PE by `MZ` + the PE header.
2. Extract:

   * Machine type.
   * Optional: PDB signature / age (for a potential BuildId in the future).
3. Symbols:

   * Use the export table for exported functions.
   * Use the import table for imported functions (these represent edges from this binary to others).
4. Call graph:

   * For v1: edges from each local function to imported functions via the import table.
   * Later: add simple disassembly of the `.text` section to detect intra-binary calls.

Practical approach:

* Use `System.Reflection.PortableExecutable` if possible, or a small custom PE reader.
* Represent imported function names as `"<dll>!<function>"` (e.g. `"USER32.dll!MessageBoxA"`) in `NormalizedName`.

### 9.2 ELF Parser (Linux)

Tasks:

1. Detect ELF by the magic `0x7F 'E' 'L' 'F'`.
2. Extract:

   * BuildId (from `.note.gnu.build-id` if present).
   * Architecture.
3. Symbols:

   * Read `.dynsym` (dynamic symbols) and `.symtab` if present.
   * Functions only (symbol type FUNC).
4. Call graph (minimum):

   * Imports via PLT/GOT entries (function calls to shared libs).
   * Map symbol names to `SymbolNode` as above.

Implementation:

* Write a simple ELF reader: parse the header and section headers; locate `.dynsym`, `.strtab`, `.symtab`, `.note.gnu.build-id`.

### 9.3 Mach-O Parser (macOS)

Tasks:

1. Detect Mach-O via magic (`0xFEEDFACE`, `0xFEEDFACF`, etc.).
2. Extract:

   * UUID (LC_UUID) as the BuildId equivalent.
3. Symbols:

   * Use LC_SYMTAB and the associated string table.
4. Call graph:

   * Similar approach as ELF for imports; minimum: cross-binary call edges via import stubs.

Implementation:

* Minimal Mach-O parser: read load commands, find LC_SYMTAB and LC_UUID.
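To make the build-id step in 9.2 less abstract, here is a minimal sketch of extraction via `PT_NOTE` program headers. It assumes a 64-bit little-endian ELF and well-formed input; real code must also handle 32-bit, big-endian, and malformed files.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class ElfBuildId
{
    // Minimal sketch: extract the GNU build-id from a 64-bit little-endian ELF.
    public static string? TryRead(string path)
    {
        var bytes = File.ReadAllBytes(path);
        if (bytes.Length < 64 || bytes[0] != 0x7F || bytes[1] != (byte)'E' ||
            bytes[2] != (byte)'L' || bytes[3] != (byte)'F')
            return null;
        if (bytes[4] != 2 || bytes[5] != 1)
            return null; // ELFCLASS64, little-endian only in this sketch

        // 64-bit ELF header: e_phoff @ 0x20, e_phentsize @ 0x36, e_phnum @ 0x38.
        ulong phOff = BinaryPrimitives.ReadUInt64LittleEndian(bytes.AsSpan(0x20));
        ushort phEntSize = BinaryPrimitives.ReadUInt16LittleEndian(bytes.AsSpan(0x36));
        ushort phNum = BinaryPrimitives.ReadUInt16LittleEndian(bytes.AsSpan(0x38));

        for (int i = 0; i < phNum; i++)
        {
            var ph = bytes.AsSpan((int)phOff + i * phEntSize);
            if (BinaryPrimitives.ReadUInt32LittleEndian(ph) != 4)
                continue; // PT_NOTE == 4

            ulong offset = BinaryPrimitives.ReadUInt64LittleEndian(ph.Slice(0x08)); // p_offset
            ulong size = BinaryPrimitives.ReadUInt64LittleEndian(ph.Slice(0x20));   // p_filesz
            var notes = bytes.AsSpan((int)offset, (int)size);

            // Iterate 4-byte-aligned note records: namesz, descsz, type, name, desc.
            int pos = 0;
            while (pos + 12 <= notes.Length)
            {
                uint nameSz = BinaryPrimitives.ReadUInt32LittleEndian(notes.Slice(pos));
                uint descSz = BinaryPrimitives.ReadUInt32LittleEndian(notes.Slice(pos + 4));
                uint type = BinaryPrimitives.ReadUInt32LittleEndian(notes.Slice(pos + 8));
                int namePos = pos + 12;
                int descPos = namePos + (int)((nameSz + 3) & ~3u);

                // NT_GNU_BUILD_ID (3) with owner "GNU\0".
                if (type == 3 && nameSz == 4 && notes.Slice(namePos, 3).SequenceEqual("GNU"u8))
                    return Convert.ToHexString(notes.Slice(descPos, (int)descSz)).ToLowerInvariant();

                pos = descPos + (int)((descSz + 3) & ~3u);
            }
        }

        return null;
    }
}
```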
---

## 10. JSON Serialization Format

Use System.Text.Json with simple DTOs mirroring `ReachabilityGraph`. For v1, you can serialize the domain model directly.

Example structure (for reference only):

```json
{
  "nodes": {
    "binaries": [
      { "binaryId": "bin:sha256:...", "filePath": "/app/MyApi.exe", "format": "Pe", "buildId": null, "fileHash": "..." }
    ],
    "symbols": [
      { "symbolId": "sym:...", "normalizedName": "MyNamespace.MyType::MyMethod()", "kind": "Function", "purl": "pkg:nuget/MyLib@1.2.3", "symbolDigest": "..." }
    ]
  },
  "edges": [
    {
      "fromSymbolId": "sym:...",
      "toSymbolId": "sym:...",
      "edgeKind": "ImportCall",
      "confidence": "High",
      "calleePurl": "pkg:nuget/MyLib@1.2.3",
      "calleeSymbolDigest": "...",
      "site": { "binaryId": "bin:sha256:...", "offset": 0, "sourceFile": null, "sourceLine": null }
    }
  ]
}
```

---

## 11. Error Handling & Logging

* For unreadable or unsupported binaries:

  * Log a warning and continue.
* For parsing errors:

  * Catch exceptions, log them with file path and format, and continue with the other files.
* For SBOM resolution failures:

  * Not an error; leave the Purl as null.

Logs should at least include:

* Number of binaries discovered, parsed successfully, and failed.
* Number of symbols and edges created.
* Number of edges with `CalleePurl` null vs non-null.

---

## 12. Test Plan (high-level)

1. **Unit tests** for:

   * `SymbolIdFactory` (deterministic digests).
   * `BinaryReachabilityService` with mocked parsers & SBOM resolver.
2. **Integration tests** (per platform) using small sample binaries:

   * A PE with one import (e.g. `MessageBoxA`).
   * An ELF binary calling `printf`.
   * A Mach-O binary with a simple imported function.
3. Check that:

   * The graph contains the expected binaries and symbols.
   * Call edges exist and have the correct `FromSymbolId` / `ToSymbolId`.
   * PURLs are attached when the SBOM resolver is provided with matching entries.
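As one concrete instance of these tests, a determinism check might look like this (xUnit assumed, along with a parameterless service constructor and an on-disk fixture; `NullSbomResolver` is a hypothetical test double):

```csharp
using System.Linq;
using Xunit;

public class DeterminismTests
{
    [Fact]
    public void SymbolDigest_IsStable()
    {
        var a = SymbolIdFactory.ComputeSymbolDigest("Newtonsoft.Json.JsonConvert::DeserializeObject(string)");
        var b = SymbolIdFactory.ComputeSymbolDigest("Newtonsoft.Json.JsonConvert::DeserializeObject(string)");
        Assert.Equal(a, b); // same normalized name, same digest, always
    }

    [Fact]
    public void BuildGraph_TwiceOnSameTree_YieldsSameGraph()
    {
        var service = new BinaryReachabilityService();
        var resolver = new NullSbomResolver();

        var first = service.BuildGraph("testdata/rootfs", resolver);
        var second = service.BuildGraph("testdata/rootfs", resolver);

        // Order-independent structural comparison.
        Assert.Equal(
            first.Edges.Select(e => (e.FromSymbolId, e.ToSymbolId)).OrderBy(x => x),
            second.Edges.Select(e => (e.FromSymbolId, e.ToSymbolId)).OrderBy(x => x));
        Assert.Equal(first.Symbols.Keys.OrderBy(k => k), second.Symbols.Keys.OrderBy(k => k));
    }
}

// Hypothetical test double: resolves nothing.
file sealed class NullSbomResolver : ISbomComponentResolver
{
    public string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash) => null;
    public string? ResolvePurlByLibraryName(string libraryName) => null;
}
```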
---

If you want, as a next step I can break this into a concrete task list (Jira-style) for a single mid-level C# developer over 1–2 sprints, including approximate order and dependencies.

You can push these specs significantly closer to “best in class” by tightening a few correctness details, making some implicit assumptions explicit, and adding non‑functional and quality gates. Below is a structured list of concrete improvements you can fold back into the spec.

I’ll focus on changes that materially affect correctness, extensibility, and implementation quality for an “average C# dev” without exploding complexity.

---

## 1. Clarify Non‑Functional Requirements

Right now the spec is almost entirely functional. Add a short NFR section so the developer has explicit targets:

**Add a “Non‑Functional Requirements” section:**

* **Performance**

  * Target scanning throughput, e.g. “On commodity hardware, aim for at least 50–100 MB/s of binaries scanned in static mode.”
  * Specify acceptable complexity: “All parsing operations must be linear in file size where possible; avoid quadratic algorithms over symbol tables.”

* **Memory**

  * Provide a rough upper bound, e.g. “Graph building must not exceed 512 MB RAM for 10k binaries in typical Linux container images.”

* **Thread safety**

  * Clarify: “All parser implementations must be stateless and thread‑safe; `BinaryReachabilityService.BuildGraph` may scan binaries in parallel.”

* **Portability**

  * Minimum supported OS set (Windows, Linux, macOS) and CPU architectures (x86_64, ARM64); important because ELF/Mach‑O vary.

This keeps the implementation from being “correct but unusably slow” and tells the dev what “good enough” looks like.

---

## 2. Fix and Strengthen Symbol Identity (Very Important)

The current spec uses `SymbolId = "sym:{digest}"` where the digest is based only on the normalized name. That will collapse distinct functions that happen to share the same name/signature across different libraries/packages, which is unacceptable once you care about cross‑component reachability.

**Improve the spec as follows:**

1. **Split “symbol node identity” from “canonical symbol key”:**

   * Keep a local identity that is always unique per binary:

     ```csharp
     public sealed record SymbolNode(
         string SymbolId,     // e.g. "sym:{binaryId}:{localIndex}"
         string NormalizedName,
         SymbolKind Kind,
         string? Purl,
         string SymbolDigest  // stable digest of NormalizedName
     );
     ```

   * Define a **canonical symbol key** struct for cross‑binary grouping:

     ```csharp
     public readonly record struct CanonicalSymbolKey(
         string SymbolDigest, // sha256(normalizedName)
         string? Purl         // null for unknown package
     );
     ```

   * Inside `ReachabilityGraph`, add:

     ```csharp
     // Canonical key -> SymbolIds of all nodes sharing it.
     public Dictionary<CanonicalSymbolKey, List<string>> CanonicalSymbolIndex { get; } = new();
     ```

2. **Clarify behavior:**

   * Never merge two `SymbolNode`s just because they share the same digest.
   * For “global reasoning” (e.g. “all call sites to the vulnerable function X from package Y”), use `CanonicalSymbolKey(SymbolDigest, Purl)`.

3. **Update `CallEdge`:**

   * Keep `FromSymbolId` and `ToSymbolId` as node IDs.
   * Include the canonical key in a dedicated field:

     ```csharp
     public sealed record CallEdge(
         string FromSymbolId,
         string ToSymbolId,
         EdgeKind EdgeKind,
         EdgeConfidence Confidence,
         CanonicalSymbolKey? CalleeKey,
         CallSite Site
     );
     ```

This single change prevents subtle and serious misattribution across libraries with overlapping APIs.

---

## 3. Explicit Build Identity Semantics (PE/ELF/Mach‑O)

The spec currently says `BuildId` is “optional” and format‑specific, but does not define **how** to compute it per format. Best‑in‑class means this is deterministic and documented.

**Extend the spec with a “Binary Identity” section:**

* **PE (Windows)**

  * `BuildId` = PDB GUID + Age if available (from the CodeView debug directory).
  * If PDB info is missing, set `BuildId = null` and rely on `FileHash`.
* **ELF (Linux)**

  * `BuildId` = contents of `.note.gnu.build-id` if present.
* **Mach‑O (macOS)**

  * `BuildId` = UUID from the `LC_UUID` load command.

Also specify:

* **Primary identity order**: `(BuildId, FileHash)`; if `BuildId` is null, use `FileHash` only.
* SBOM resolvers MUST treat `(BuildId, FileHash)` as the canonical key to map binaries to components, with file path only as a hint.

This gives you robust correlation between SBOM entries and binaries, across containers and file renames.
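The PE variant of this identity is easy to sketch with the BCL’s `System.Reflection.Metadata` reader (a minimal sketch; production code should tolerate missing or multiple CodeView entries):

```csharp
using System.IO;
using System.Linq;
using System.Reflection.PortableExecutable;

public static class PeBuildId
{
    // BuildId for PE = PDB GUID + Age from the CodeView debug directory entry.
    public static string? TryRead(string path)
    {
        using var stream = File.OpenRead(path);
        using var pe = new PEReader(stream);

        var codeView = pe.ReadDebugDirectory()
            .FirstOrDefault(e => e.Type == DebugDirectoryEntryType.CodeView);
        if (codeView.DataSize == 0)
            return null; // no CodeView entry -> caller falls back to FileHash

        var data = pe.ReadCodeViewDebugDirectoryData(codeView);
        return $"{data.Guid:N}{data.Age}";
    }
}
```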
---

## 4. Enrich the Edge Model and Call Site Semantics

For precision and debuggability, specify what edges mean more rigorously.

**Add fields and definitions:**

1. **Direction and type:**

   Add a small discriminator describing the origin of the edge:

   ```csharp
   public enum EdgeSource
   {
       ImportTable, // import thunk / PLT / stub
       Relocation,  // relocation to symbol
       Disassembly, // decoded CALL / BL / JAL
       Metadata,    // .NET metadata, DWARF, etc.
       Other
   }
   ```

   Extend `CallEdge`:

   ```csharp
   public sealed record CallEdge(
       string FromSymbolId,
       string ToSymbolId,
       EdgeKind EdgeKind,
       EdgeConfidence Confidence,
       EdgeSource Source,
       CanonicalSymbolKey? CalleeKey,
       CallSite Site
   );
   ```

2. **Intra‑ vs inter‑binary**

   * Define: `Site.BinaryId` always refers to the binary containing the call instruction.
   * Intra‑binary edge: `FromSymbol` and `ToSymbol` share the same `BinaryId`.
   * Inter‑binary edge: otherwise.

3. **Unknown or unresolved callees**

   * Do not drop unresolved calls; add a special `UnknownSymbolNode` per binary:

     * `NormalizedName = "<unknown>"`, `Kind = SymbolKind.Unknown`, `Purl = null`.
   * Edges to unknown must have `Confidence = EdgeConfidence.Low`.

This lets downstream consumers distinguish “we are sure this is a call to libX.Y” from “we saw a call but do not know to where”.

---

## 5. Strengthen Symbol Normalization Rules (Demangling etc.)

For best‑in‑class results, you want reproducible signatures independent of compiler version, and you want to unify mangled C++/Rust/etc. names.

**Extend the `SymbolIdFactory` spec with clear rules:**

1. **Language‑agnostic core**

   * Always:

     * Demangle if possible.
     * Normalize whitespace.
     * Normalize namespace separators to `.` and the member separator to `::`.
     * Remove address/offset suffixes embedded in names.

2. **Format‑ and language‑specific guidance**

   * For C/C++ (MSVC / Itanium ABI):

     * Use a demangler (your own or a library) to get `retType namespace.Type::Func(paramTypes...)`.
     * Omit the return type in normalization to make signatures more stable: `namespace.Type::Func(paramTypes...)`.
   * For Rust:

     * Strip hash suffixes from the symbol name.
     * Use the `crate::module::Type::func(params...)` pattern where possible.
   * For Go:

     * Normalize from `runtime.main_main` → `runtime.main.main` etc.
   * For .NET (if/when you add managed parsing later):

     * Use fully qualified CLR names: `Namespace.Type::Method(ParamType1,ParamType2)`.

3. **Document stability guarantees**

   * Given identical source (function name + parameter list), the `SymbolDigest` must remain stable across builds, architectures, optimization levels, and link addresses.
   * If demangling fails, fall back to the raw name but strip obvious hashes where safe.

Specify this in prose and keep the implementation flexible, but the rules must be clear enough that two developers implementing the parser will produce the same digest for the same symbol.
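A sketch of the language-agnostic core plus the Rust rule, to show how mechanical these rules are (the regexes are illustrative; legacy Rust mangling appends `::h` plus 16 hex digits to the demangled path):

```csharp
using System.Text.RegularExpressions;

public static class SymbolNormalizer
{
    // Legacy Rust mangling appends a disambiguating hash: "...::h1a2b3c4d5e6f7a8b".
    private static readonly Regex RustHashSuffix = new(@"::h[0-9a-f]{16}$", RegexOptions.Compiled);
    private static readonly Regex Whitespace = new(@"\s+", RegexOptions.Compiled);

    public static string Normalize(string demangledName)
    {
        var name = demangledName.Trim();
        name = RustHashSuffix.Replace(name, ""); // strip Rust hash suffixes
        name = Whitespace.Replace(name, " ");    // collapse runs of whitespace
        return name;
    }
}
```

For example, `Normalize("core::ptr::drop_in_place::h0f6c1d2e3a4b5c6d")` drops the trailing hash before digesting, so two builds of the same crate yield the same `SymbolDigest`.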
---

## 6. More Precise SBOM & PURL Resolution Behavior

The SBOM integration is crucial to StellaOps; push this further so it is deterministic and auditable.

**Extend `ISbomComponentResolver` behavior:**

1. **Resolution order**

   Document a strict order:

   1. `(BuildId, FileHash)` match.
   2. `FileHash` only.
   3. Normalized file path if the SBOM has an explicit path mapping.
   4. Library name fallback via `ResolvePurlByLibraryName`.

2. **Multiple SBOMs and conflicts**

   * Allow multiple SBOM sources; if two SBOMs claim different purls for the same `(BuildId, FileHash)`, define a policy:

     * e.g. fail fast with a “conflicting SBOM” error, or choose a deterministic priority order.

3. **Library name mapping contract**

   Add a small DTO to make the mapping explicit:

   ```csharp
   public sealed record LibraryReference(
       string BinaryId,
       string LibraryName,  // "libssl.so.3" / "KERNEL32.dll"
       string? ResolvedPath // if the loader path is known
   );
   ```

   Extend `IBinaryParser` with:

   ```csharp
   IReadOnlyList<LibraryReference> ParseLibraryReferences(BinaryNode binary);
   ```

   Then describe how `BinaryReachabilityService` uses those to call `ResolvePurlByLibraryName`.

4. **Unknown purls**

   * Require that unknowns are explicit:

     * When `ResolvePurlForBinary` returns null, store `Purl = null` and flag this in the logs: “No SBOM component for binary X (BuildId=..., Hash=...)”.

This ensures SBOM resolution remains a traceable, deterministic step rather than a best‑effort guess.

---

## 7. Explicit JSON Schema & Versioning

For replayability and compatibility, define a clear JSON schema and version.

**Add:**

* A top‑level metadata section:

  ```json
  {
    "schemaVersion": "1.0.0",
    "generatedAt": "2025-11-20T12:34:56Z",
    "tool": "StellaOps.Scanner.BinaryReachability",
    "toolVersion": "1.0.0",
    "graph": { ... }
  }
  ```

* Commit to:

  * Only additive changes in minor versions.
  * Backwards‑compatible changes within the same major version.
  * If you change anything structural (e.g. how symbol IDs work), bump the `schemaVersion` major.

Optionally, provide a compact JSON schema file (or at least a documented shape) so other teams can implement readers in other languages.

---

## 8. Concurrency, Streaming, and Large Images

For best‑in‑class scalability, specify how large images are handled.

**Clarify in the spec:**

1. **Parallelization**

   * `BinaryReachabilityService.BuildGraph`:

     * May scan binaries in parallel using `Parallel.ForEach`.
     * All parsers must be thread‑safe and not rely on shared mutable state.

2. **Streaming option (optional but recommended)**

   * Provide a second API for very large repositories:

     ```csharp
     public interface IGraphSink
     {
         void OnBinary(BinaryNode binary);
         void OnSymbol(SymbolNode symbol);
         void OnEdge(CallEdge edge);
     }

     // On IBinaryReachabilityService:
     void BuildGraphStreaming(string rootDirectory, ISbomComponentResolver sbomResolver, IGraphSink sink);
     ```

   * This allows building graphs into a database or message bus without keeping everything in memory.

Even if you do not implement streaming immediately, designing the interface now keeps the architecture future‑proof.
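For example, a sink that streams each element straight to disk as one JSON line, so nothing accumulates in memory (a sketch against the proposed `IGraphSink` above; the class name is an assumption):

```csharp
using System;
using System.IO;
using System.Text.Json;

// Streams each graph element as one JSON line (JSONL); nothing is held in memory.
public sealed class JsonlGraphSink : IGraphSink, IDisposable
{
    private readonly StreamWriter _writer;

    public JsonlGraphSink(string outputPath) => _writer = new StreamWriter(outputPath);

    public void OnBinary(BinaryNode binary) => Write("binary", binary);
    public void OnSymbol(SymbolNode symbol) => Write("symbol", symbol);
    public void OnEdge(CallEdge edge) => Write("edge", edge);

    private void Write(string kind, object payload)
        => _writer.WriteLine(JsonSerializer.Serialize(new { kind, payload }));

    public void Dispose() => _writer.Dispose();
}
```

The same sink shape also works for a database or message-bus backend; only `Write` changes.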
---

## 9. Observability and Diagnostics

A best‑in‑class implementation requires good introspection for debugging wrong reachability conclusions.

**Specify minimal observability requirements:**

* **Logging**

  * At least:

    * Info: number of binaries, symbols, edges, time taken.
    * Warning: unsupported binary formats, SBOM resolution failures, demangling failures.
    * Error: parser exceptions per file (with file path and format).

* **Debug artifacts**

  * An optional environment variable or flag that dumps per‑binary debug info:

    * Raw symbol table (names + addresses).
    * Normalized names and digests.
    * Library references.
    * Call edges for that binary.

* **Metrics hooks**

  * Provide a simple interface for metrics:

    ```csharp
    public interface IReachabilityMetrics
    {
        void IncrementCounter(string name, long value = 1);
        void ObserveDuration(string name, TimeSpan duration);
    }
    ```

    And allow `BinaryReachabilityService` to be constructed with an optional metrics implementation.

---

## 10. Expanded Test Strategy and Quality Gates

Your test plan is decent but can be made more systematic.

**Extend the test plan:**

1. **Golden corpus**

   * Maintain a small but curated set of PE/ELF/Mach‑O binaries (checked in or generated) where:

     * Expected symbols and edges are stored as JSON.
     * CI compares current output with the golden graph byte‑for‑byte (or structurally).

2. **Cross‑compiler coverage**

   * At least:

     * C/C++ built by different toolchains (MSVC, clang, gcc).
     * Different optimization levels (`-O0`, `-O2`) to ensure stability of parsing.

3. **Fuzzing / robustness**

   * Create tests with truncated / corrupted binaries to ensure:

     * No crashes.
     * Meaningful, bounded error behavior.

4. **SBOM integration tests**

   * For a test root directory:

     * Synthetic SBOMs mapping files to binaries.
     * Validate correct purl assignment and conflict handling.

5. **Determinism tests**

   * Run `BuildGraph` twice on the same directory and assert that:

     * The graph is structurally identical (including an order‑independent comparison).

This makes it much harder for regressions to slip in when you extend parsers or normalization.

---

## 11. Clear Extension Points and Roadmap Notes

Finally, add a short “Future Extensions” section so the dev knows what to keep in mind when structuring code:

* Support for:

  * Inlined function tracking (via DWARF/PDB).
  * Managed .NET assemblies’ metadata (C# IL call graph).
  * Dynamic edge sources (runtime traces) merged into the same graph.
* The spec should instruct: “Design parsers and the graph model so they can accept additional `EdgeSource` types and symbol metadata without breaking existing consumers.”

That gives the current implementation a clear direction and prevents design dead ends.

---

If you want, I can now take your current spec and rewrite it into a single, consolidated “v1.0 Best‑in‑Class Spec” document that incorporates all of the above changes, so you can hand it directly to an implementation team.