feat: Add new provenance and crypto registry documentation

- Introduced attestation inventory and subject-rekor mapping files for tracking Docker packages.
- Added a comprehensive crypto registry decision document outlining defaults and required follow-ups.
- Created an offline feeds manifest for bundling air-gap resources.
- Implemented a script to generate and update binary manifests for curated binaries.
- Added a verification script to ensure binary artefacts are located in approved directories.
- Defined new schemas for AdvisoryEvidenceBundle, OrchestratorEnvelope, ScannerReportReadyPayload, and ScannerScanCompletedPayload.
- Established project files for StellaOps.Orchestrator.Schemas and StellaOps.PolicyAuthoritySignals.Contracts.
- Updated vendor manifest to track pinned binaries for integrity.
This commit is contained in:
master
2025-11-18 23:47:13 +02:00
parent d3ecd7f8e6
commit e91da22836
44 changed files with 6793 additions and 99 deletions


@@ -0,0 +1,927 @@
Here's a crisp idea that could give StellaOps a real moat: **binary-level reachability**, which links CVEs directly to the exact functions and offsets inside compiled artifacts (ELF/PE/Mach-O), not just to packages.
---
### Why this matters (quick background)
* **Package-level flags are noisy.** Most scanners say “vuln in `libX v1.2`,” but that library might be present and never executed.
* **Language-level call graphs help** (when you have source or rich metadata), but containers often ship only **stripped binaries**.
* **Binary reachability** answers: *Is the vulnerable function actually in this image? Is its code path reachable from the entrypoints we observed or can construct?*
---
### The missing layer: Symbolization
Build a **symbolization layer** that normalizes debug and symbol info across platforms:
* **Inputs**: DWARF (ELF/Mach-O), PDB (PE/Windows), symtabs, exported symbols, `.eh_frame`, and (when stripped) heuristic signatures (e.g., function byte-hashes, CFG fingerprints).
* **Outputs**: a source-agnostic map: `{binary → sections → functions → (addresses, ranges, hashes, demangled names, inlined frames)}`.
* **Normalization**: Put everything into a common schema (e.g., `Stella.Symbolix.v1`) so higher layers don't care whether it came from DWARF or PDB.
---
### End-to-end reachability (binary-first, source-agnostic)
1. **Acquire & parse**
* Detect format (ELF/PE/Mach-O), parse headers, sections, symbol tables.
* If debug info present: parse DWARF/PDB; else fall back to disassembly + function boundary recovery.
2. **Function catalog**
* Assign stable IDs per function: `(imageHash, textSectionHash, startVA, size, fnHashXX)`.
* Record xrefs (calls/jumps), imports/exports, PLT/IAT edges.
3. **Entrypoint discovery**
* Docker entry, process launch args, service scripts; infer likely mains (Go `main.main`, .NET hostfxr path, JVM launcher, etc.).
4. **Callgraph build (binary CFG)**
* Build inter/intra-procedural graph (direct + resolved indirect via IAT/PLT). Keep “unknown-target” edges for conservative safety.
5. **CVE→function linking**
* Maintain a **signature bank** per CVE advisory: vulnerable function names, file paths, and, crucially, **byte-sequence or basic-block fingerprints** for patched vs vulnerable versions (works even when stripped).
6. **Reachability analysis**
* Is the vulnerable function present? Is there a path from any entrypoint to it (under conservative assumptions)? Tag as `Present+Reachable`, `Present+Uncertain`, or `Absent`.
7. **Runtime confirmation (optional, when users allow)**
* Lightweight probes (eBPF on Linux, ETW on Windows, perf/JFR/EventPipe) capture function hits; cross-check with the static result to upgrade confidence.
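
The tagging in step 6 can be sketched as a breadth-first walk over resolved edges. This is a minimal illustration rather than the engine itself: `BinEdge`, `ReachTag`, and `BinaryReachability.Tag` are hypothetical names, and a `null` target stands in for the “unknown-target” edges kept in step 4. Because such edges may hide a real path, a present function with no resolved path is tagged `PresentUncertain` instead of being declared unreachable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical image: main -> parse -> <unknown indirect target>.
// vuln_fn exists in the binary but has no resolved caller.
var present = new HashSet<string> { "main", "parse", "vuln_fn" };
var edges = new List<BinEdge>
{
    new("main", "parse"),
    new("parse", null), // indirect call whose target could not be resolved
};

Console.WriteLine(BinaryReachability.Tag("vuln_fn", present, edges, new[] { "main" })); // PresentUncertain
Console.WriteLine(BinaryReachability.Tag("parse", present, edges, new[] { "main" }));   // PresentReachable
Console.WriteLine(BinaryReachability.Tag("gone_fn", present, edges, new[] { "main" })); // Absent

public enum ReachTag { PresentReachable, PresentUncertain, Absent }

// To == null models an edge whose target is unknown (illustrative convention).
public record BinEdge(string From, string? To);

public static class BinaryReachability
{
    public static ReachTag Tag(string target, ISet<string> presentFunctions,
        IReadOnlyList<BinEdge> edges, IEnumerable<string> entrypoints)
    {
        if (!presentFunctions.Contains(target)) return ReachTag.Absent;

        // Breadth-first search over edges with a known target.
        var seen = new HashSet<string>(entrypoints);
        var queue = new Queue<string>(seen);
        while (queue.Count > 0)
        {
            var fn = queue.Dequeue();
            foreach (var edge in edges.Where(e => e.From == fn && e.To is not null))
            {
                if (seen.Add(edge.To!)) queue.Enqueue(edge.To!);
            }
        }

        // A missing resolved path is not proof of unreachability, so stay conservative.
        return seen.Contains(target) ? ReachTag.PresentReachable : ReachTag.PresentUncertain;
    }
}
```

A real implementation would index edges by source function instead of scanning the list on every step; the linear scan keeps the sketch short.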
---
### Minimal component plan (drop into StellaOps)
* **Scanner.Symbolizer**
Parsers: ELF/DWARF (libdw or pure-managed reader), PE/PDB (DIA/LLVM PDB), Mach-O/dSYM.
Output: `Symbolix.v1` blobs stored in OCI layer cache.
* **Scanner.CFG**
Lifts functions to a normalized IR (Capstone/iced-x86 for decode) → builds CFG & call graph.
* **Advisory.FingerprintBank**
Ingests CSAF/OpenVEX plus curated fingerprints (fn names, block hashes, patch diff markers). Versioned, signed, air-gap-syncable.
* **Reachability.Engine**
Joins (`Symbolix` + `CFG` + `FingerprintBank`) → emits `ReachabilityEvidence` with lattice states for VEX.
* **VEXer.Adapter**
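Emits **OpenVEX** statements with `status: affected/not_affected` and, for `not_affected`, one of the justification labels OpenVEX defines: `justification: vulnerable_code_not_present | vulnerable_code_not_in_execute_path | inline_mitigations_already_exist`, attaching Evidence URIs.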
* **Console UX**
“Why not affected?” panel showing entrypoint→…→function path (or absence), with byte-hash proof.
---
### Data model sketch (concise)
* `ImageFunction { id, name?, startVA, size, fnHash, sectionHash, demangled?, provenance:{DWARF|PDB|Heuristic} }`
* `Edge { srcFnId, dstFnId, kind:{direct|plt|iat|indirect?} }`
* `CveSignature { cveId, fnName?, libHints[], blockFingerprints[], versionRanges }`
* `Evidence { cveId, imageId, functionMatches[], reachable: bool?, confidence:[low|med|high], method:[static|runtime|hybrid] }`
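
The four shapes above translate fairly directly into C# records. The following is one possible rendering, not a settled contract: the enum names (`SymbolProvenance`, `EdgeKind`, `ConfidenceLevel`, `EvidenceMethod`) are invented here, `versionRanges` is assumed to be a plain string, and all values in the usage lines are placeholders.

```csharp
using System;
using System.Collections.Generic;

// Placeholder values only; none of these identify a real function or advisory.
var fn = new ImageFunction("fn-0001", "parse_header", 0x401000, 128,
    "fnhash-aa", "sechash-bb", "parse_header", SymbolProvenance.Dwarf);
var evidence = new Evidence("CVE-2023-12345", "sha256:image", new[] { fn.Id },
    null, ConfidenceLevel.Low, EvidenceMethod.Static);

Console.WriteLine(evidence.Reachable is null ? "reachability undetermined" : "reachability decided");

public enum SymbolProvenance { Dwarf, Pdb, Heuristic }
public enum EdgeKind { Direct, Plt, Iat, Indirect }
public enum ConfidenceLevel { Low, Med, High }
public enum EvidenceMethod { Static, Runtime, Hybrid }

// Optional members of the sketch (name?, demangled?) become nullable strings.
public record ImageFunction(string Id, string? Name, ulong StartVa, uint Size,
    string FnHash, string SectionHash, string? Demangled, SymbolProvenance Provenance);

public record Edge(string SrcFnId, string DstFnId, EdgeKind Kind);

public record CveSignature(string CveId, string? FnName, IReadOnlyList<string> LibHints,
    IReadOnlyList<string> BlockFingerprints, string VersionRanges);

// Reachable is bool? so that "not yet determined" stays distinct from "false".
public record Evidence(string CveId, string ImageId, IReadOnlyList<string> FunctionMatches,
    bool? Reachable, ConfidenceLevel Confidence, EvidenceMethod Method);
```

Keeping `Reachable` nullable preserves the difference between "analysis says no" and "analysis could not decide", which the lattice states mentioned earlier depend on.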
---
### Practical phases (8–10 weeks of focused work)
1. **P0**: ELF/DWARF symbolizer + basic function catalog; link a handful of CVEs via name-only matching; emit OpenVEX `not_affected` statements with the `vulnerable_code_not_present` justification.
2. **P1**: CFG builder (direct calls) + PLT/IAT resolution; simple reachability; first fingerprints for top 50 CVEs in glibc, openssl, curl, zlib.
3. **P2**: Stripped-binary heuristics (block hashing) + Go/Rust name demangling; Windows PDB ingestion for PE.
4. **P3**: Runtime probes (opt-in) + confidence upgrade logic; Console path explorer; evidence signing (DSSE).
---
### KPIs to prove the moat
* **Noise cut**: % reduction in “affected” flags after reachability (target 40–70% on typical containers).
* **Precision**: Ground-truth validation vs PoC images (TP/FP/FN on presence & reachability).
* **Coverage**: % images where we can make a determination without source (goal: >80%).
* **Latency**: Added scan time per image (<15s typical with caches).
---
### Risks & how to handle them
* **Stripped binaries** → mitigate with block-hash fingerprints & library-version heuristics.
* **Obfuscated/packed code** → mark `Uncertain`; allow user-supplied hints; prefer runtime confirmation.
* **Advisory inconsistency** → keep our own curated CVE→function fingerprint bank; sign & version it.
* **Platform spread** → start Linux/ELF, then Windows/PDB, then Mach-O.
---
### Why competitors struggle
Most tools stop at packages because binary CFG + fingerprint curation is hard and expensive. Shipping a **source-agnostic reachability engine** tied to signed evidence in VEX would set StellaOps apart, especially in the offline/air-gapped and sovereign contexts you already target.
If you want, I can draft:
* the `Symbolix.v1` protobuf,
* a tiny PoC (ELF → functions → match CVE with a block fingerprint),
* and the OpenVEX emission snippet your VEXer can produce.
Below is a detailed architecture plan for implementing reachability and call-graph analysis in Stella Ops, covering JavaScript, Python, PHP, and binaries, and integrating with your existing Scanner / Concelier / VEXer stack.
I will assume:
* .NET 10 for core services.
* Scanner is the place where all trust algebra / lattice runs (per your standing rule).
* Concelier and VEXer remain preserve/prune layers and do not run lattice logic.
* Output must be JSON-centric with PURLs and OpenVEX.
---
## 1. Scope & Objectives
### 1.1 Primary goals
1. From an OCI image, build:
* A **library-level usage graph** (which libraries are used by which entrypoints).
* A **function-level call graph** for JS / Python / PHP / binaries.
2. Map CVEs (from Concelier) to:
* Concrete **components** (PURLs) in the SBOM.
* Concrete **functions / entrypoints / code regions** inside those components.
3. Perform **reachability analysis** to classify each vulnerability as:
* `present + reachable`
* `present + not_reachable`
* `function_not_present` (no vulnerable symbol)
* `uncertain` (dynamic features, unresolved calls)
4. Emit:
* **Structured JSON** with PURLs and call-graph nodes/edges (“reachability evidence”).
* **OpenVEX** documents with appropriate `status`/`justification`.
### 1.2 Non-goals (for now)
* Full dynamic analysis of the running container (eBPF, ptrace, etc.); left as an optional Phase 3+ add-on.
* Perfect call graph precision for dynamic languages (aim for safe, conservative approximations).
* Automatic fix recommendations (handled by other Stella Ops agents later).
---
## 2. High-Level Architecture
### 2.1 Major components
Within Stella Ops:
* **Scanner.WebService**
* User-facing API.
* Orchestrates full scan (SBOM, CVEs, reachability).
* Hosts the **Lattice/Policy engine** that merges evidence and produces decisions.
* **Scanner.Worker**
* Runs per-image analysis jobs.
* Invokes analyzers (JS, Python, PHP, Binary) inside its own container context.
* **Scanner.Reachability Core Library**
* Unified IR for call graphs and reachability evidence.
* Interfaces for language and binary analyzers.
* Graph algorithms (BFS/DFS, lattice evaluation, entrypoint expansion).
* **Language Analyzers**
* `Scanner.Analyzers.JavaScript`
* `Scanner.Analyzers.Python`
* `Scanner.Analyzers.Php`
* `Scanner.Analyzers.Binary`
* **Symbolization & CFG (for binaries)**
* `Scanner.Symbolization` (ELF, PE, Mach-O parsers, DWARF/PDB)
* `Scanner.Cfg` (CFG + call graph for binaries)
* **Vulnerability Signature Bank**
* `Concelier.Signatures` (curated CVE→function/library fingerprints).
* Exposed to Scanner as **offline bundle**.
* **VEXer**
* `Vexer.Adapter.Reachability` transforms reachability evidence into OpenVEX.
### 2.2 Data flow (logical)
```mermaid
flowchart LR
A[OCI Image / Tar] --> B[Scanner.Worker: Extract FS]
B --> C["SBOM Engine (CycloneDX/SPDX)"]
C --> D[Vuln Match (Concelier feeds)]
B --> E1[JS Analyzer]
B --> E2[Python Analyzer]
B --> E3[PHP Analyzer]
B --> E4[Binary Analyzer + Symbolizer/CFG]
D --> F[Reachability Orchestrator]
E1 --> F
E2 --> F
E3 --> F
E4 --> F
F --> G["Lattice/Policy Engine (Scanner.WebService)"]
G --> H[Reachability Evidence JSON]
G --> I[VEXer: OpenVEX]
G --> J["Graph/Cartographer (optional)"]
```
---
## 3. Data Model & JSON Contracts
### 3.1 Core IR types (Scanner.Reachability)
Define in a central assembly, e.g. `StellaOps.Scanner.Reachability`:
```csharp
public record ComponentRef(
string Purl,
string? BomRef,
string? Name,
string? Version);
public enum SymbolKind { Function, Method, Constructor, Lambda, Import, Export }
public record SymbolId(
string Language, // "js", "python", "php", "binary"
string ComponentPurl, // SBOM component PURL or "" for app code
string LogicalName, // e.g., "server.js:handleLogin"
string? FilePath,
int? Line);
public record CallGraphNode(
string Id, // stable id, e.g., hash(SymbolId)
SymbolId Symbol,
SymbolKind Kind,
bool IsEntrypoint);
public enum CallEdgeKind { Direct, Indirect, Dynamic, External, Ffi }
public record CallGraphEdge(
string FromNodeId,
string ToNodeId,
CallEdgeKind Kind);
public record CallGraph(
string GraphId,
IReadOnlyList<CallGraphNode> Nodes,
IReadOnlyList<CallGraphEdge> Edges);
```
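
The `hash(SymbolId)` mentioned for `CallGraphNode.Id` could be derived as below. This is a sketch under assumptions: the `NodeIds` helper and the `node:` prefix are made up for illustration, SHA-256 is one reasonable choice among several, and `Line` is deliberately left out of the hash so that an edit shifting a function down the file does not change its id. Whether that trade-off is wanted is a design decision the text above leaves open.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

var id = NodeIds.For(new SymbolId("js", "pkg:npm/express@4.18.0",
    "lib/router/layer.js:handle_request", "lib/router/layer.js", 95));
Console.WriteLine(id); // "node:" followed by 64 lowercase hex characters

// Same shape as the SymbolId record above, repeated so the sketch compiles on its own.
public record SymbolId(string Language, string ComponentPurl, string LogicalName,
    string? FilePath, int? Line);

public static class NodeIds
{
    public static string For(SymbolId symbol)
    {
        // Join the identity-bearing fields with a separator that should not occur in them.
        // Line is excluded on purpose (assumption): ids then survive pure line shifts.
        var canonical = string.Join("\u001f",
            symbol.Language, symbol.ComponentPurl, symbol.LogicalName, symbol.FilePath ?? "");
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return "node:" + Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```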
### 3.2 Vulnerability mapping
```csharp
public record VulnerabilitySignature(
string Source, // "csaf", "nvd", "vendor"
string Id, // "CVE-2023-12345"
IReadOnlyList<string> Purls,
IReadOnlyList<string> TargetSymbolPatterns, // glob-like or regex
IReadOnlyList<string>? FilePathPatterns,
IReadOnlyList<string>? BlockFingerprints // for binaries, optional
);
```
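
For the "glob-like" flavour of `TargetSymbolPatterns`, one simple reading is `*` for any run of characters and `?` for a single character, matched against `SymbolId.LogicalName`. The helper below assumes exactly that; `SymbolPatterns` is a hypothetical name and the real pattern grammar may differ.

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(SymbolPatterns.Matches(
    "express/lib/router/*.js:handle_*",
    "express/lib/router/layer.js:handle_request")); // True

public static class SymbolPatterns
{
    public static bool Matches(string globPattern, string logicalName)
    {
        // Escape everything, then re-enable the two glob wildcards.
        // Regex.Escape turns '*' into "\*" and '?' into "\?".
        var regex = "^"
            + Regex.Escape(globPattern).Replace("\\*", ".*").Replace("\\?", ".")
            + "$";
        return Regex.IsMatch(logicalName, regex, RegexOptions.CultureInvariant);
    }
}
```

Escaping first means characters such as `.` in file names are matched literally, so `layer.js` does not accidentally match `layerXjs`.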
### 3.3 Reachability evidence
```csharp
public enum ReachabilityStatus
{
PresentReachable,
PresentNotReachable,
FunctionNotPresent,
Unknown
}
public record ReachabilityEvidence
(
string ImageRef,
string VulnId, // CVE or advisory id
ComponentRef Component,
ReachabilityStatus Status,
double Confidence, // 0..1
string Method, // "static-callgraph", "binary-fingerprint", etc.
IReadOnlyList<string> EntrypointNodeIds,
IReadOnlyList<IReadOnlyList<string>>? ExamplePaths // optional list of node-paths
);
```
### 3.4 JSON structure (external)
Minimal external JSON (what you store / expose):
```json
{
"image": "registry.example.com/app:1.2.3",
"components": [
{
"purl": "pkg:npm/express@4.18.0",
"bomRef": "component-1"
}
],
"callGraphs": [
{
"graphId": "js-main",
"language": "js",
"nodes": [ /* CallGraphNode */ ],
"edges": [ /* CallGraphEdge */ ]
}
],
"reachability": [
{
"vulnId": "CVE-2023-12345",
"componentPurl": "pkg:npm/express@4.18.0",
"status": "PresentReachable",
"confidence": 0.92,
"entrypoints": [ "node:..." ],
"paths": [
["node:entry", "node:routeHandler", "node:vulnFn"]
]
}
]
}
```
---
## 4. Scanner-Side Architecture
### 4.1 Project layout (suggested)
```text
src/
Scanner/
StellaOps.Scanner.WebService/
StellaOps.Scanner.Worker/
StellaOps.Scanner.Core/ # shared scan domain
StellaOps.Scanner.Reachability/
StellaOps.Scanner.Symbolization/
StellaOps.Scanner.Cfg/
StellaOps.Scanner.Analyzers.JavaScript/
StellaOps.Scanner.Analyzers.Python/
StellaOps.Scanner.Analyzers.Php/
StellaOps.Scanner.Analyzers.Binary/
```
### 4.2 API surface (Scanner.WebService)
* `POST /api/scan/image`
* Request: `{ "imageRef": "...", "profile": { "reachability": true, ... } }`
* Returns: scan id.
* `GET /api/scan/{id}/reachability`
* Returns: `ReachabilityEvidence[]`, plus call graph summary (optional).
* `GET /api/scan/{id}/vex`
* Returns: OpenVEX with statuses based on reachability lattice.
### 4.3 Worker orchestration
`StellaOps.Scanner.Worker`:
1. Receives scan job with `imageRef`.
2. Extracts filesystem (layered rootfs) under `/mnt/scans/{scanId}/rootfs`.
3. Invokes SBOM generator (CycloneDX/SPDX).
4. Invokes Concelier via offline feeds to get:
* Component vulnerabilities (CVE list per PURL).
* Vulnerability signatures (fingerprints).
5. Builds a `ReachabilityPlan`:
```csharp
public record ReachabilityPlan(
IReadOnlyList<ComponentRef> Components,
IReadOnlyList<VulnerabilitySignature> Vulns,
IReadOnlyList<AnalyzerTarget> AnalyzerTargets // files/dirs grouped by language
);
```
6. For each language target, dispatch analyzer:
* JavaScript: `IReachabilityAnalyzer` implementation for JS.
* Python: likewise.
* PHP: likewise.
* Binary: symbolizer + CFG.
7. Collects call graphs from each analyzer and merges them into a single IR (or separate per-language graphs with shared IDs).
8. Sends merged graphs + vuln list to **Reachability Engine** (Scanner.Reachability).
---
## 5. Language Analyzers (JS / Python / PHP)
All analyzers implement a common interface:
```csharp
public interface IReachabilityAnalyzer
{
string Language { get; } // "js", "python", "php"
Task<CallGraph> AnalyzeAsync(AnalyzerContext context, CancellationToken ct);
}
public record AnalyzerContext(
string RootFsPath,
IReadOnlyList<ComponentRef> Components,
IReadOnlyList<VulnerabilitySignature> Vulnerabilities,
IReadOnlyDictionary<string, string> Env, // container env, entrypoint, etc.
string? EntrypointCommand // container CMD/ENTRYPOINT
);
```
### 5.1 JavaScript (Node.js focus)
**Inputs:**
* `/app` tree inside container (or discovered via SBOM).
* `package.json` files.
* Container entrypoint (e.g., `["node", "server.js"]`).
**Core steps:**
1. Identify **app root**:
* Heuristics: directory containing `package.json` that owns the entry script.
2. Parse:
* All `.js`, `.mjs`, `.cjs` in app and `node_modules` for vulnerable PURLs.
* Use a parsing frontend (e.g., Tree-sitter via .NET binding, or Node+AST-as-JSON).
3. Build module graph:
* `require`, `import`, `export`.
4. Function-level graph:
* For each function/method, create `CallGraphNode`.
* For each `callExpression`, create `CallGraphEdge` (try to resolve callee).
5. Entrypoints:
* Main script in CMD/ENTRYPOINT.
* HTTP route handlers (for express/koa) detected by patterns (e.g., `app.get("/...")`).
6. Map vulnerable symbols:
* From `VulnerabilitySignature.TargetSymbolPatterns` (e.g., `express/lib/router/layer.js:handle_request`).
* Identify nodes whose `SymbolId` matches patterns.
**Output:**
* `CallGraph` for JS with:
* `IsEntrypoint = true` for main and detected handlers.
* Node attributes include file path, line, component PURL.
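
To make the handler-detection step concrete, here is a deliberately crude stand-in that finds `app.get("/...")`-style registrations with a regular expression. The plan above calls for a real parser (Tree-sitter or a Node AST), so treat this only as an illustration of what "detected by patterns" means; `ExpressRoutes` and the list of HTTP verbs are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var source = "app.get(\"/login\", handleLogin);\nrouter.post('/upload', upload);";
foreach (var (verb, path) in ExpressRoutes.Find(source))
{
    Console.WriteLine($"{verb} {path}");
}

public static class ExpressRoutes
{
    // Matches app.get("/path", ...) or router.post('/path', ...) registrations.
    private static readonly Regex Route = new(
        @"\b(?:app|router)\.(get|post|put|delete|patch)\(\s*[""']([^""']+)[""']",
        RegexOptions.Compiled);

    public static IReadOnlyList<(string Verb, string Path)> Find(string source)
    {
        var found = new List<(string Verb, string Path)>();
        foreach (Match match in Route.Matches(source))
        {
            found.Add((match.Groups[1].Value.ToUpperInvariant(), match.Groups[2].Value));
        }
        return found;
    }
}
```

A regex cannot see which function is passed as the handler when it is built dynamically, which is one reason an AST-based frontend is the better long-term choice.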
### 5.2 Python
**Inputs:**
* Site-packages paths from SBOM.
* Entrypoint script (CMD/ENTRYPOINT).
* Framework heuristics (Django, Flask) from environment variables or common entrypoints.
**Core steps:**
1. Discover Python interpreter chain: not needed for pure static, but useful for heuristics.
2. Parse `.py` files of:
* App code.
* Vulnerable packages (per PURL).
3. Build module import graph (`import`, `from x import y`).
4. Function-level graph:
* Nodes for functions, methods, class constructors.
* Edges for call expressions; conservative for dynamic calls.
5. Entrypoints:
* Main script.
* WSGI callable (e.g., `application` in `wsgi.py`).
* Django URLconf -> view functions.
6. Map vulnerable symbols using `TargetSymbolPatterns` like `django.middleware.security.SecurityMiddleware.__call__`.
### 5.3 PHP
**Inputs:**
* Web root (from container image or conventional paths `/var/www/html`, `/app/public`, etc.).
* Composer metadata (`composer.json`, `vendor/`).
* Web server config if present (optional).
**Core steps:**
1. Discover front controllers (e.g., `index.php`, `public/index.php`).
2. Parse PHP files (again, via Tree-sitter or any suitable parser).
3. Resolve include/require chains to build file-level inclusion graph.
4. Build function/method graph:
* Functions, methods, class constructors.
* Calls with best-effort resolution for namespaced functions.
5. Entrypoints:
* Front controllers and router entrypoints (e.g., Symfony, Laravel detection).
6. Map vulnerable symbols (e.g., functions in certain vendor packages, particular methods).
---
## 6. Binary Analyzer & Symbolizer
Project: `StellaOps.Scanner.Analyzers.Binary` + `Symbolization` + `Cfg`.
### 6.1 Inputs
* All binaries and shared libraries in:
* `/usr/lib`, `/lib`, `/app/bin`, etc.
* SBOM link: each binary mapped to its component PURL when possible.
* Vulnerability signatures for native libs: function names, symbol names, fingerprints.
### 6.2 Symbolization
Module: `StellaOps.Scanner.Symbolization`
* Detect format: ELF, PE, Mach-O.
* For ELF/Mach-O:
* Parse symbol tables (`.symtab`, `.dynsym`).
* Parse DWARF (if present) to map functions to source files/lines.
* For PE:
* Parse PDB (if present) or export table.
* For stripped binaries:
* Run function boundary recovery (linear sweep + heuristic).
* Compute block/fn-level hashes for fingerprinting.
Output:
```csharp
public record ImageFunction(
string ImageId, // e.g., SHA256 of file
ulong StartVa,
uint Size,
string? SymbolName, // demangled if possible
string FnHash, // stable hash of bytes / CFG
string? SourceFile,
int? SourceLine);
```
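
One simple way to fill `FnHash` is to hash the raw bytes of the function body, located by its offset inside the `.text` section. The sketch below assumes that approach; `FunctionHashing` is a hypothetical helper and the `sha256:` prefix is only a convention chosen here. Raw-byte hashes change whenever relocated addresses or padding differ, so a production fingerprint would likely normalize operands or hash CFG structure instead.

```csharp
using System;
using System.Security.Cryptography;

// Two copies of the same five-byte body at different addresses (made-up bytes).
var text = new byte[] { 0x90, 0x55, 0x48, 0x89, 0xE5, 0xC3, 0x55, 0x48, 0x89, 0xE5, 0xC3 };
var first = FunctionHashing.FnHash(text, 0x1000, 0x1001, 5);
var second = FunctionHashing.FnHash(text, 0x1000, 0x1006, 5);
Console.WriteLine(first == second); // True: the hash depends on bytes, not on the address

public static class FunctionHashing
{
    // text: contents of the section; sectionVa: its load address;
    // startVa/size: the function range as recorded in ImageFunction.
    public static string FnHash(ReadOnlySpan<byte> text, ulong sectionVa, ulong startVa, uint size)
    {
        var offset = checked((int)(startVa - sectionVa));
        var body = text.Slice(offset, checked((int)size)); // throws if the range is out of bounds
        return "sha256:" + Convert.ToHexString(SHA256.HashData(body)).ToLowerInvariant();
    }
}
```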
### 6.3 CFG + Call graph
Module: `StellaOps.Scanner.Cfg`
* Disassemble `.text` using Capstone/Iced.x86.
* Build basic blocks and CFG.
* Identify:
* Direct calls (resolved).
* PLT/IAT indirections to shared libraries.
* Build `CallGraph` for binary functions:
* Entrypoints: `main`, exported functions, Go `main.main`, etc.
* Map application functions to library functions via PLT/IAT edges.
### 6.4 Linking vulnerabilities
* For each vulnerability affecting a native library (e.g., OpenSSL):
* Map to candidate binaries via SBOM + PURL.
* Within library image, find `ImageFunction`s matching:
* `SymbolName` patterns.
* `FnHash` / `BlockFingerprints` (for precise detection).
* Determine reachability:
* Starting from application entrypoints, traverse call graph to see if calls to vulnerable library function occur.
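
The matching step can be sketched as follows, preferring a fingerprint hit over a name hit because a name alone cannot tell a patched function from a vulnerable one. `NativeSignature`, `FunctionMatch`, and `VulnLinker` are hypothetical names, symbol names are compared exactly rather than as patterns, and the `Method` strings merely echo the examples used for `ReachabilityEvidence.Method`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Placeholder functions and fingerprints; none refer to a real library or advisory.
var functions = new List<ImageFunction>
{
    new("sha256:lib", 0x1000, 64, "parse_header", "fp-vulnerable"),
    new("sha256:lib", 0x2000, 64, null, "fp-vulnerable"),         // stripped copy, matched by bytes
    new("sha256:lib", 0x3000, 48, "parse_header", "fp-patched"),  // name matches, bytes do not
    new("sha256:lib", 0x4000, 32, "unrelated", "fp-other"),
};
var signature = new NativeSignature("CVE-2023-12345",
    new[] { "parse_header" }, new[] { "fp-vulnerable" });

foreach (var match in VulnLinker.Match(functions, signature))
{
    Console.WriteLine($"0x{match.Function.StartVa:x} via {match.Method}");
}

public record ImageFunction(string ImageId, ulong StartVa, uint Size, string? SymbolName, string FnHash);
public record NativeSignature(string VulnId, IReadOnlyList<string> SymbolNames,
    IReadOnlyList<string> BlockFingerprints);
public record FunctionMatch(ImageFunction Function, string Method);

public static class VulnLinker
{
    public static IReadOnlyList<FunctionMatch> Match(IEnumerable<ImageFunction> functions,
        NativeSignature signature)
    {
        var matches = new List<FunctionMatch>();
        foreach (var fn in functions)
        {
            if (signature.BlockFingerprints.Contains(fn.FnHash))
                matches.Add(new FunctionMatch(fn, "binary-fingerprint"));
            else if (fn.SymbolName is not null && signature.SymbolNames.Contains(fn.SymbolName))
                matches.Add(new FunctionMatch(fn, "symbol-name")); // weaker: may be a patched build
        }
        return matches;
    }
}
```

Recording which method produced each match lets the confidence scoring downstream treat name-only hits more cautiously.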
---
## 7. Reachability Engine & Lattice (Scanner.WebService)
Project: `StellaOps.Scanner.Reachability`
### 7.1 Inputs to engine
* Combined `CallGraph[]` (per language + binary).
* Vulnerability list (CVE, GHSA, etc.) with affected PURLs.
* Vulnerability signatures.
* Entrypoint hints:
* Container CMD/ENTRYPOINT.
* Detected HTTP handlers, WSGI/PSGI entrypoints, etc.
### 7.2 Algorithm steps
1. **Entrypoint expansion**
* Identify all `CallGraphNode` with `IsEntrypoint=true`.
* Add language-specific “framework entrypoints” (e.g., Express route dispatch, Django URL dispatch) when detected.
2. **Graph traversal**
* For each entrypoint node:
* BFS/DFS through edges.
* Maintain `reachable` bit on each node.
* For dynamic edges:
* Conservative: if target cannot be resolved, mark affected path as partially unknown and downgrade confidence.
3. **Vuln symbol resolution**
* For each vulnerability:
* For each vulnerable component PURL found in SBOM:
* Find candidate nodes whose `SymbolId` matches `TargetSymbolPatterns` / binary fingerprints.
* If none found:
* `FunctionNotPresent` (if component version range indicates vulnerable but we cannot find symbol → low confidence).
* If found:
* Check `reachable` bit:
* If reachable by at least one entrypoint, `PresentReachable`.
* Else, `PresentNotReachable`.
4. **Confidence computation**
* Start from:
* `1.0` for direct match with explicit function name & static call.
* Lower for:
* Heuristic framework entrypoints.
* Dynamic calls.
* Fingerprint-only matches on stripped binaries.
* Example rule-of-thumb:
* direct static path only: 0.95–1.0.
* dynamic edges but symbol found: 0.7–0.9.
* symbol not found but version says vulnerable: 0.4–0.6.
5. **Lattice merge**
* Represent each CVE+component pair as a lattice element with states: `{affected, not_affected, unknown}`.
* Reachability engine produces a **local state**:
* `PresentReachable` → candidate `affected`.
* `PresentNotReachable` or `FunctionNotPresent` → candidate `not_affected`.
* `Unknown` → `unknown`.
* Merge with:
* Upstream vendor VEX (from Concelier).
* Policy overrides (e.g., “treat certain CVEs as affected unless vendor says otherwise”).
* Final state computed here (Scanner.WebService), not in Concelier or VEXer.
6. **Evidence output**
* For each vulnerability:
* Emit `ReachabilityEvidence` with:
* Status.
* Confidence.
* Method.
* Example entrypoint paths (for UX and audit).
* Persist this evidence alongside regular scan results.
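
Steps 2 to 4 can be condensed into a small sketch: one breadth-first pass restricted to `Direct` edges, a second pass over all edges, and a status plus confidence chosen from whichever pass reaches a vulnerable node. The `Decision` record and `ReachabilityEngine.Classify` are hypothetical, and the confidence numbers are single illustrative points picked from the rule-of-thumb bands above (the value used for `PresentNotReachable` is a pure placeholder, since no band is given for it). The `Unknown` state and the lattice merge are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var edges = new List<CallGraphEdge>
{
    new("entry", "routeHandler", CallEdgeKind.Direct),
    new("routeHandler", "vulnFn", CallEdgeKind.Dynamic),
    new("helper", "deadFn", CallEdgeKind.Direct), // "helper" is never reached from "entry"
};
var entrypoints = new[] { "entry" };

Console.WriteLine(ReachabilityEngine.Classify(entrypoints, edges, new[] { "vulnFn" }).Status);        // PresentReachable
Console.WriteLine(ReachabilityEngine.Classify(entrypoints, edges, new[] { "deadFn" }).Status);        // PresentNotReachable
Console.WriteLine(ReachabilityEngine.Classify(entrypoints, edges, Array.Empty<string>()).Status);     // FunctionNotPresent

public enum CallEdgeKind { Direct, Indirect, Dynamic, External, Ffi }
public enum ReachabilityStatus { PresentReachable, PresentNotReachable, FunctionNotPresent, Unknown }
public record CallGraphEdge(string FromNodeId, string ToNodeId, CallEdgeKind Kind);
public record Decision(ReachabilityStatus Status, double Confidence);

public static class ReachabilityEngine
{
    // Breadth-first search from the entrypoints over the edges accepted by `follow`.
    private static HashSet<string> Reach(IEnumerable<string> entrypoints,
        ILookup<string, CallGraphEdge> outgoing, Func<CallGraphEdge, bool> follow)
    {
        var seen = new HashSet<string>(entrypoints);
        var queue = new Queue<string>(seen);
        while (queue.Count > 0)
        {
            foreach (var edge in outgoing[queue.Dequeue()].Where(follow))
            {
                if (seen.Add(edge.ToNodeId)) queue.Enqueue(edge.ToNodeId);
            }
        }
        return seen;
    }

    public static Decision Classify(IReadOnlyCollection<string> entrypoints,
        IReadOnlyList<CallGraphEdge> edges, IReadOnlyCollection<string> vulnerableNodeIds)
    {
        // No candidate node matched the signature: symbol not found.
        if (vulnerableNodeIds.Count == 0)
            return new Decision(ReachabilityStatus.FunctionNotPresent, 0.5);

        var outgoing = edges.ToLookup(e => e.FromNodeId);

        // Pass 1: only direct static calls, the highest-confidence evidence.
        var viaDirect = Reach(entrypoints, outgoing, e => e.Kind == CallEdgeKind.Direct);
        if (vulnerableNodeIds.Any(viaDirect.Contains))
            return new Decision(ReachabilityStatus.PresentReachable, 0.95);

        // Pass 2: allow every edge kind; a hit here relies on non-direct edges.
        var viaAny = Reach(entrypoints, outgoing, _ => true);
        return vulnerableNodeIds.Any(viaAny.Contains)
            ? new Decision(ReachabilityStatus.PresentReachable, 0.8)
            : new Decision(ReachabilityStatus.PresentNotReachable, 0.8); // placeholder confidence
    }
}
```

Running two passes avoids a subtle bug of single-pass marking, where a node first reached through a dynamic edge would keep the lower confidence even if a direct-only path also exists.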
---
## 8. Integration with SBOM & VEX
### 8.1 SBOM annotation
* Extend SBOM documents (CycloneDX / SPDX) with extra properties:
* CycloneDX:
* `component.properties`:
* `stellaops:reachability:status` = `present_reachable|present_not_reachable|function_not_present|unknown`
* `stellaops:reachability:confidence` = `0.0-1.0`
* SPDX:
* `Annotation` or `ExternalRef` with similar metadata.
### 8.2 OpenVEX generation
Module: `StellaOps.Vexer.Adapter.Reachability`
* For each `(vuln, component)` pair:
* Map to VEX statement:
* If `PresentReachable`:
* `status: affected`
* No `justification` (OpenVEX defines justifications only for `not_affected`); provide an `action_statement` instead.
* If `PresentNotReachable`:
* `status: not_affected`
* `justification: vulnerable_code_not_in_execute_path`
* If `FunctionNotPresent`:
* `status: not_affected`
* `justification: vulnerable_code_not_present` (or `component_not_present` when the component itself is absent)
* If `Unknown`:
* `status: under_investigation` (configurable).
* Attach evidence via:
* `impact_statement` / `status_notes` fields (link to internal evidence JSON or audit link).
* VEXer does not recalculate reachability; it uses the already computed decision + evidence.
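
The mapping is small enough to express as a single switch. The sketch below uses the status and justification labels defined by the OpenVEX specification; `VexDecision` and `VexMapping` are hypothetical names, and the fallback to `under_investigation` mirrors the configurable default described above.

```csharp
using System;

Console.WriteLine(VexMapping.ToOpenVex(ReachabilityStatus.PresentNotReachable));

public enum ReachabilityStatus { PresentReachable, PresentNotReachable, FunctionNotPresent, Unknown }

// Justification is null whenever the status is not "not_affected".
public record VexDecision(string Status, string? Justification);

public static class VexMapping
{
    public static VexDecision ToOpenVex(ReachabilityStatus status) => status switch
    {
        ReachabilityStatus.PresentReachable =>
            new VexDecision("affected", null),
        ReachabilityStatus.PresentNotReachable =>
            new VexDecision("not_affected", "vulnerable_code_not_in_execute_path"),
        ReachabilityStatus.FunctionNotPresent =>
            new VexDecision("not_affected", "vulnerable_code_not_present"),
        _ =>
            new VexDecision("under_investigation", null),
    };
}
```

Because the adapter only translates an already computed decision, it stays consistent with the rule that VEXer never recalculates reachability.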
---
## 9. Executable Containers & Offline Operation
### 9.1 Executable containers
* Analyzers run inside a dedicated Scanner worker container that has:
* .NET 10 runtime.
* Language runtimes if needed for parsing (Node, Python, PHP), or Tree-sitter-based parsing.
* Target image filesystem is mounted read-only under `/mnt/rootfs`.
* No network access (offline/air-gap).
* This satisfies “we will use executable containers” while keeping separation between:
* Target image (mount only).
* Analyzer container (StellaOps code).
### 9.2 Offline signature bundles
* Concelier periodically exports:
* Vulnerability database (CSAF/NVD).
* Vulnerability Signature Bank.
* Bundles are:
* DSSE-signed.
* Versioned (e.g., `signatures-2025-11-01.tar.zst`).
* Scanner uses:
* The bundle digest as part of the **Scan Manifest** for deterministic replay.
---
## 10. Determinism & Caching
### 10.1 Layer-level caching
* Key: `layerDigest + analyzerVersion + signatureBundleVersion`.
* Cache artifacts:
* CallGraph(s) per layer (for JS/Python/PHP code present in that layer).
* Symbolization results per binary file hash.
* For images sharing layers:
* Merge cached graphs instead of re-analyzing.
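
A minimal in-memory version of this cache might look like the following. `LayerCacheKey` and `LayerCache<T>` are hypothetical names, the key is assumed to be a SHA-256 over the three inputs joined with a separator, and a real deployment would persist entries rather than hold them in a dictionary.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;

var cache = new LayerCache<string>();
var key = LayerCacheKey.For("sha256:layer-a", "1.0.0", "signatures-2025-11-01");

var analyses = 0;
for (var i = 0; i < 2; i++)
{
    // The second image sharing this layer reuses the cached result.
    cache.GetOrAnalyze(key, () => { analyses++; return "call-graph-for-layer-a"; });
}
Console.WriteLine(analyses); // 1

public static class LayerCacheKey
{
    // Any change to the layer, the analyzer, or the signature bundle yields a new key.
    public static string For(string layerDigest, string analyzerVersion, string signatureBundleVersion)
    {
        var material = string.Join("\n", layerDigest, analyzerVersion, signatureBundleVersion);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(material))).ToLowerInvariant();
    }
}

public sealed class LayerCache<T>
{
    private readonly ConcurrentDictionary<string, T> _entries = new();

    // Runs `analyze` only when no artifact is stored under `key`.
    public T GetOrAnalyze(string key, Func<T> analyze) => _entries.GetOrAdd(key, _ => analyze());
}
```

Including the analyzer and bundle versions in the key is what keeps a cached graph from being reused after the analysis logic or the fingerprints change.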
### 10.2 Deterministic scan manifest
For each scan, produce:
```json
{
"imageRef": "registry/app:1.2.3",
"imageDigest": "sha256:...",
"scannerVersion": "1.4.0",
"analyzerVersions": {
"js": "1.0.0",
"python": "1.0.0",
"php": "1.0.0",
"binary": "1.0.0"
},
"signatureBundleDigest": "sha256:...",
"callGraphDigest": "sha256:...",
"reachabilityEvidenceDigest": "sha256:..."
}
```
This manifest can be signed (Authority module) and used for audits and replay. `callGraphDigest` is the hash of the call graph's canonical JSON.
---
## 11. Implementation Roadmap (Phased)
### Phase 0 – Infrastructure & Binary presence
**Duration:** 1 sprint
* Set up `Scanner.Reachability` core types and interfaces.
* Implement:
* Basic Symbolizer for ELF + DWARF.
* Binary function catalog without CFG.
* Link a small set of CVEs to binary function presence via `SymbolName`.
* Expose minimal evidence:
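* `PresentReachable`/`FunctionNotPresent` based only on presence (no call graph yet, so a present function is conservatively treated as reachable).
* Integrate with VEXer to emit `vulnerable_code_not_present` justifications (the OpenVEX label for a missing function).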
**Success criteria:**
* For selected demo images with known vulnerable/patched OpenSSL, scanner can:
* Distinguish images where vulnerable function is present vs. absent.
* Emit OpenVEX with correct `not_affected` when patched.
---
### Phase 1 – JS/Python/PHP call graphs & basic reachability
**Duration:** 1–2 sprints
* Implement:
* `Scanner.Analyzers.JavaScript` with module + function call graph.
* `Scanner.Analyzers.Python` and `Scanner.Analyzers.Php` with basic graphs.
* Entrypoint detection:
* JS: main script from CMD, basic HTTP handlers.
* Python: main script + Django/Flask heuristics.
* PHP: front controllers.
* Implement core reachability algorithm (BFS/DFS).
* Implement simple `VulnerabilitySignature` that uses function names and file paths.
* Hook lattice engine in Scanner.WebService and integrate with:
* Concelier vulnerability feeds.
* VEXer.
**Success criteria:**
* For demo apps (Node, Django, Laravel):
* Identify vulnerable functions and mark them reachable/unreachable.
* Demonstrate noise reduction (some CVEs flagged as `not_affected`).
---
### Phase 2 – Binary CFG & Fingerprinting, Improved Confidence
**Duration:** 1–2 sprints
* Extend Symbolizer & CFG for:
* Stripped binaries (function hashing).
* Shared libraries (PLT/IAT resolution).
* Implement `VulnerabilitySignature.BlockFingerprints` to distinguish patched vs vulnerable binary functions.
* Refine confidence scoring:
* Use fingerprint match quality.
* Consider presence/absence of debug info.
* Expand coverage:
* glibc, curl, zlib, OpenSSL, libxml2, etc.
**Success criteria:**
* For curated images:
* Confirm ability to differentiate patched vs vulnerable versions even when binaries are stripped.
* Reachability reflects true call paths across app→lib boundaries.
---
### Phase 3 – Runtime hooks (optional), UX, and Hardening
**Duration:** 2+ sprints
* Add opt-in runtime confirmation:
* eBPF probes for function hits (Linux).
* Map runtime addresses back to `ImageFunction` via symbolization.
* Enhance console UX:
* Path explorer UI: show entrypoint → … → vulnerable function path.
* Evidence view with hash-based proofs.
* Hardening:
* Performance optimization for large images (parallel analysis, caching).
* Conservative fallbacks for dynamic language features.
**Success criteria:**
* For selected environments where runtime is allowed:
* Static reachability is confirmed by runtime traces in majority of cases.
* No significant performance regression on typical images.
---
## 12. How this satisfies your initial bullets
From your initial requirements:
1. **JavaScript, Python, PHP, binary**
→ Dedicated analyzers per language + binary symbolization/CFG, unified in `Scanner.Reachability`.
2. **Executable containers**
→ Analyzers run inside Scanner's worker container, mounting the target image rootfs; no network access.
3. **Libraries usage call graph**
→ Call graphs map from entrypoints → app code → library functions; SBOM + PURLs tie functions to libraries.
4. **Reachability analysis**
→ BFS/DFS from entrypoints over per-language and binary graphs, with lattice-based merging in `Scanner.WebService`.
5. **JSON + PURLs**
→ All evidence is JSON with PURL-tagged components; SBOM is annotated, and VEX statements reference those PURLs.
---
If you like, next step can be: I draft concrete C# interface definitions (including some initial Tree-sitter integration stubs for JS/Python/PHP) and a skeleton of the `ReachabilityPlan` and `ReachabilityEngine` classes that you can drop into the monorepo.