git.stella-ops.org/18-Nov-2026 - Binary-Reachability-Engine.md at 65b15992292c4b4b3100430a80d289a461e7d16c - git.stella-ops.org


feat: Add comprehensive documentation for binary reachability with PURL-resolved edges

- Introduced a detailed specification for encoding binary reachability that integrates call graphs with SBOMs.
- Defined a minimal data model including nodes, edges, and SBOM components.
- Outlined a step-by-step guide for building the reachability graph in a C#-centric manner.
- Established core domain models, including enumerations for binary formats and symbol kinds.
- Created a public API for the binary reachability service, including methods for graph building and serialization.
- Specified SBOM component resolution and binary parsing abstractions for PE, ELF, and Mach-O formats.
- Enhanced symbol normalization and digesting processes to ensure deterministic signatures.
- Included error handling, logging, and a high-level test plan to ensure robustness and correctness.
- Added non-functional requirements to guide performance, memory usage, and thread safety.

2025-11-20 23:16:02 +02:00


Here’s a crisp idea that could give Stella Ops a real moat: binary‑level reachability—linking CVEs directly to the exact functions and offsets inside compiled artifacts (ELF/PE/Mach‑O), not just to packages.

Why this matters (quick background)

Package‑level flags are noisy. Most scanners say “vuln in libX v1.2,” but that library might be present and never executed.
Language‑level call graphs help (when you have source or rich metadata), but containers often ship only stripped binaries.
Binary reachability answers: Is the vulnerable function actually in this image? Is its code path reachable from the entrypoints we observed or can construct?

The missing layer: Symbolization

Build a symbolization layer that normalizes debug and symbol info across platforms:

Inputs: DWARF (ELF/Mach‑O), PDB (PE/Windows), symtabs, exported symbols, .eh_frame, and (when stripped) heuristic signatures (e.g., function byte‑hashes, CFG fingerprints).
Outputs: a source‑agnostic map: {binary → sections → functions → (addresses, ranges, hashes, demangled names, inlined frames)}.
Normalization: Put everything into a common schema (e.g., Stella.Symbolix.v1) so higher layers don’t care if it came from DWARF or PDB.

End‑to‑end reachability (binary‑first, source‑agnostic)

Acquire & parse
- Detect format (ELF/PE/Mach‑O), parse headers, sections, symbol tables.
- If debug info present: parse DWARF/PDB; else fall back to disassembly + function boundary recovery.
Function catalog
- Assign stable IDs per function: (imageHash, textSectionHash, startVA, size, fnHashXX).
- Record x‑refs (calls/jumps), imports/exports, PLT/IAT edges.
Entrypoint discovery
- Docker entry, process launch args, service scripts; infer likely mains (Go main.main, .NET hostfxr path, JVM launcher, etc.).
Call‑graph build (binary CFG)
- Build inter/intra‑procedural graph (direct + resolved indirect via IAT/PLT). Keep “unknown‑target” edges for conservative safety.
CVE→function linking
- Maintain a signature bank per CVE advisory: vulnerable function names, file paths, and—crucially—byte‑sequence or basic‑block fingerprints for patched vs vulnerable versions (works even when stripped).
Reachability analysis
- Is the vulnerable function present? Is there a path from any entrypoint to it (under conservative assumptions)? Tag as Present+Reachable, Present+Uncertain, or Absent.
Runtime confirmation (optional, when users allow)
- Lightweight probes (eBPF on Linux, ETW on Windows, perf/JFR/EventPipe) capture function hits; cross‑check with the static result to upgrade confidence.
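As a rough illustration of the "stable IDs per function" idea in the function catalog step, the sketch below folds the tuple (imageHash, textSectionHash, startVA, size, fnHash) into a single identifier. The helper name `StableFunctionId`, the `|` separator, and the choice of SHA-256 are assumptions made for this example; the text does not say which hash sits behind `fnHashXX`.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: derives a deterministic function ID from the tuple named in the text.
static string StableFunctionId(string imageHash, string textSectionHash,
                               ulong startVa, uint size, string fnHash)
{
    // A fixed field order, fixed-width hex numbers, and an explicit separator
    // keep the encoding unambiguous across platforms.
    string canonical = string.Join("|", imageHash, textSectionHash,
                                   startVa.ToString("x16"), size.ToString("x8"), fnHash);
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
    return "fn:" + Convert.ToHexString(digest).ToLowerInvariant();
}

// Placeholder inputs, not real digests.
string a = StableFunctionId("sha256:aaaa", "sha256:bbbb", 0x401000, 64, "f00d");
string b = StableFunctionId("sha256:aaaa", "sha256:bbbb", 0x401000, 64, "f00d");
string c = StableFunctionId("sha256:aaaa", "sha256:bbbb", 0x401040, 64, "f00d");

Console.WriteLine(a == b); // identical inputs give the same ID
Console.WriteLine(a == c); // a different start address gives a different ID
```

Because startVA is part of the tuple, the ID is stable for one image but changes when a rebuild moves the function; cross-build matching would rely on fnHash alone.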

Minimal component plan (drop into Stella Ops)

Scanner.Symbolizer: Parsers for ELF/DWARF (libdw or a pure‑managed reader), PE/PDB (DIA/LLVM PDB), and Mach‑O/dSYM. Output: Symbolix.v1 blobs stored in the OCI layer cache.
Scanner.CFG: Lifts functions to a normalized IR (Capstone/iced‑x86 for decoding), then builds the CFG and call graph.
Advisory.FingerprintBank: Ingests CSAF/OpenVEX plus curated fingerprints (fn names, block hashes, patch diff markers). Versioned, signed, air‑gap‑syncable.
Reachability.Engine: Joins (Symbolix + CFG + FingerprintBank) and emits ReachabilityEvidence with lattice states for VEX.
VEXer.Adapter: Emits OpenVEX statements with status: affected/not_affected. For not_affected, the justification comes from the OpenVEX vocabulary (vulnerable_code_not_present | vulnerable_code_not_in_execute_path | inline_mitigations_already_exist), corresponding to "function not present", "function not reachable", and "mitigated at runtime". Evidence URIs are attached.
Console UX: "Why not affected?" panel showing the entrypoint→…→function path (or its absence), with byte‑hash proof.

Data model sketch (concise)

ImageFunction { id, name?, startVA, size, fnHash, sectionHash, demangled?, provenance:{DWARF|PDB|Heuristic} }
Edge { srcFnId, dstFnId, kind:{direct|plt|iat|indirect?} }
CveSignature { cveId, fnName?, libHints[], blockFingerprints[], versionRanges }
Evidence { cveId, imageId, functionMatches[], reachable: bool?, confidence:[low|med|high], method:[static|runtime|hybrid] }

Practical phases (8–10 weeks of focused work)

P0: ELF/DWARF symbolizer + basic function catalog; link a handful of CVEs via name‑only; emit OpenVEX function_not_present.
P1: CFG builder (direct calls) + PLT/IAT resolution; simple reachability; first fingerprints for top 50 CVEs in glibc, openssl, curl, zlib.
P2: Stripped‑binary heuristics (block hashing) + Go/Rust name demangling; Windows PDB ingestion for PE.
P3: Runtime probes (opt‑in) + confidence upgrade logic; Console path explorer; evidence signing (DSSE).

KPIs to prove the moat

Noise cut: % reduction in “affected” flags after reachability (target 40–70% on typical containers).
Precision: Ground‑truth validation vs PoC images (TP/FP/FN on presence & reachability).
Coverage: % images where we can make a determination without source (goal: >80%).
Latency: Added scan time per image (<15s typical with caches).

Risks & how to handle them

Stripped binaries → mitigate with block‑hash fingerprints & library‑version heuristics.
Obfuscated/packed code → mark Uncertain; allow user‑supplied hints; prefer runtime confirmation.
Advisory inconsistency → keep our own curated CVE→function fingerprint bank; sign & version it.
Platform spread → start Linux/ELF, then Windows/PDB, then Mach‑O.

Why competitors struggle

Most tools stop at packages because binary CFG + fingerprint curation is hard and expensive. Shipping a source‑agnostic reachability engine tied to signed evidence in VEX would set Stella Ops apart—especially in offline/air‑gapped and sovereign contexts you already target.

If you want, I can draft:

the Symbolix.v1 protobuf,
a tiny PoC (ELF→functions→match CVE with a block fingerprint),
and the OpenVEX emission snippet your VEXer can produce.

Below is a detailed architecture plan for implementing reachability and call-graph analysis in Stella Ops, covering JavaScript, Python, PHP, and binaries, and integrating with your existing Scanner / Concelier / VEXer stack.

I will assume:

.NET 10 for core services.
Scanner is the place where all “trust algebra / lattice” runs (per your standing rule).
Concelier and VEXer remain “preserve/prune” layers and do not run lattice logic.
Output must be JSON-centric with PURLs and OpenVEX.

1. Scope & Objectives

1.1 Primary goals

From an OCI image, build:
- A library-level usage graph (which libraries are used by which entrypoints).
- A function-level call graph for JS / Python / PHP / binaries.
Map CVEs (from Concelier) to:
- Concrete components (PURLs) in the SBOM.
- Concrete functions / entrypoints / code regions inside those components.
Perform reachability analysis to classify each vulnerability as:
- present + reachable
- present + not_reachable
- function_not_present (no vulnerable symbol)
- uncertain (dynamic features, unresolved calls)
Emit:
- Structured JSON with PURLs and call-graph nodes/edges (“reachability evidence”).
- OpenVEX documents with appropriate status/justification.

1.2 Non-goals (for now)

Full dynamic analysis of the running container (eBPF, ptrace, etc.) – leave as Phase 3+ optional add-on.
Perfect call graph precision for dynamic languages (aim for safe, conservative approximations).
Automatic “fix recommendations” (handled by other Stella Ops agents later).

2. High-Level Architecture

2.1 Major components

Within Stella Ops:

Scanner.WebService
- User-facing API.
- Orchestrates full scan (SBOM, CVEs, reachability).
- Hosts the Lattice/Policy engine that merges evidence and produces decisions.
Scanner.Worker
- Runs per-image analysis jobs.
- Invokes analyzers (JS, Python, PHP, Binary) inside its own container context.
Scanner.Reachability Core Library
- Unified IR for call graphs and reachability evidence.
- Interfaces for language and binary analyzers.
- Graph algorithms (BFS/DFS, lattice evaluation, entrypoint expansion).
Language Analyzers
- Scanner.Analyzers.JavaScript
- Scanner.Analyzers.Python
- Scanner.Analyzers.Php
- Scanner.Analyzers.Binary
Symbolization & CFG (for binaries)
- Scanner.Symbolization (ELF, PE, Mach-O parsers, DWARF/PDB)
- Scanner.Cfg (CFG + call graph for binaries)
Vulnerability Signature Bank
- Concelier.Signatures (curated CVE→function/library fingerprints).
- Exposed to Scanner as offline bundle.
VEXer
- Vexer.Adapter.Reachability – transforms reachability evidence into OpenVEX.

2.2 Data flow (logical)

flowchart LR
  A[OCI Image / Tar] --> B[Scanner.Worker: Extract FS]
  B --> C["SBOM Engine (CycloneDX/SPDX)"]
  C --> D["Vuln Match (Concelier feeds)"]
  B --> E1[JS Analyzer]
  B --> E2[Python Analyzer]
  B --> E3[PHP Analyzer]
  B --> E4[Binary Analyzer + Symbolizer/CFG]

  D --> F[Reachability Orchestrator]
  E1 --> F
  E2 --> F
  E3 --> F
  E4 --> F
  F --> G["Lattice/Policy Engine (Scanner.WebService)"]
  G --> H[Reachability Evidence JSON]
  G --> I[VEXer: OpenVEX]
  G --> J["Graph/Cartographer (optional)"]

3. Data Model & JSON Contracts

3.1 Core IR types (Scanner.Reachability)

Define in a central assembly, e.g. StellaOps.Scanner.Reachability:

public record ComponentRef(
    string Purl,
    string? BomRef,
    string? Name,
    string? Version);

public enum SymbolKind { Function, Method, Constructor, Lambda, Import, Export }

public record SymbolId(
    string Language,       // "js", "python", "php", "binary"
    string ComponentPurl,  // SBOM component PURL or "" for app code
    string LogicalName,    // e.g., "server.js:handleLogin"
    string? FilePath,
    int? Line);

public record CallGraphNode(
    string Id,                 // stable id, e.g., hash(SymbolId)
    SymbolId Symbol,
    SymbolKind Kind,
    bool IsEntrypoint);

public enum CallEdgeKind { Direct, Indirect, Dynamic, External, Ffi }

public record CallGraphEdge(
    string FromNodeId,
    string ToNodeId,
    CallEdgeKind Kind);

public record CallGraph(
    string GraphId,
    IReadOnlyList<CallGraphNode> Nodes,
    IReadOnlyList<CallGraphEdge> Edges);
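To show how these records fit together, here is a minimal sketch that builds a two-node graph and derives node IDs in the spirit of the "hash(SymbolId)" comment. The `NodeIds` helper, the fields it hashes, and the 16-character truncation are assumptions for illustration only; the records are repeated so the sample compiles on its own.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

var entry = new SymbolId("js", "", "server.js:main", "server.js", 1);
var handler = new SymbolId("js", "", "server.js:handleLogin", "server.js", 42);

var nodes = new List<CallGraphNode>
{
    new(NodeIds.For(entry), entry, SymbolKind.Function, IsEntrypoint: true),
    new(NodeIds.For(handler), handler, SymbolKind.Function, IsEntrypoint: false),
};
var edges = new List<CallGraphEdge> { new(nodes[0].Id, nodes[1].Id, CallEdgeKind.Direct) };
var graph = new CallGraph("js-main", nodes, edges);

Console.WriteLine($"{graph.GraphId}: {graph.Nodes.Count} nodes, {graph.Edges.Count} edge(s)");

// Hypothetical helper (not part of the spec): one way to realise "hash(SymbolId)".
static class NodeIds
{
    public static string For(SymbolId s)
    {
        // Assumption: the line number is left out so the ID survives unrelated edits to the file.
        string canonical = $"{s.Language}\n{s.ComponentPurl}\n{s.LogicalName}\n{s.FilePath}";
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return "node:" + Convert.ToHexString(digest)[..16].ToLowerInvariant();
    }
}

// Records repeated from the IR definitions so the sketch is self-contained.
public enum SymbolKind { Function, Method, Constructor, Lambda, Import, Export }
public enum CallEdgeKind { Direct, Indirect, Dynamic, External, Ffi }
public record SymbolId(string Language, string ComponentPurl, string LogicalName, string? FilePath, int? Line);
public record CallGraphNode(string Id, SymbolId Symbol, SymbolKind Kind, bool IsEntrypoint);
public record CallGraphEdge(string FromNodeId, string ToNodeId, CallEdgeKind Kind);
public record CallGraph(string GraphId, IReadOnlyList<CallGraphNode> Nodes, IReadOnlyList<CallGraphEdge> Edges);
```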

3.2 Vulnerability mapping

public record VulnerabilitySignature(
    string Source,             // "csaf", "nvd", "vendor"
    string Id,                 // "CVE-2023-12345"
    IReadOnlyList<string> Purls,
    IReadOnlyList<string> TargetSymbolPatterns, // glob-like or regex
    IReadOnlyList<string>? FilePathPatterns,
    IReadOnlyList<string>? BlockFingerprints    // for binaries, optional
);
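TargetSymbolPatterns is described as "glob-like or regex". A minimal sketch of the glob variant follows, assuming `*` matches any run of characters (including `/`) and `?` matches one character. `GlobToRegex` is a hypothetical helper, and the sample pattern is made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: turns a glob into an anchored regular expression.
static Regex GlobToRegex(string glob) =>
    new Regex("^" + Regex.Escape(glob).Replace(@"\*", ".*").Replace(@"\?", ".") + "$",
              RegexOptions.CultureInvariant);

string[] patterns = { "express/lib/router/*.js:handle_*" };
string[] symbols =
{
    "express/lib/router/layer.js:handle_request",
    "express/lib/router/route.js:dispatch",
};

var matchers = patterns.Select(p => GlobToRegex(p)).ToArray();
var hits = symbols.Where(s => matchers.Any(m => m.IsMatch(s))).ToArray();

Console.WriteLine(string.Join(", ", hits)); // express/lib/router/layer.js:handle_request
```

Escaping the pattern first and only then re-enabling the wildcards keeps characters such as `.` in file names literal.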

3.3 Reachability evidence

public enum ReachabilityStatus
{
    PresentReachable,
    PresentNotReachable,
    FunctionNotPresent,
    Unknown
}

public record ReachabilityEvidence
(
    string ImageRef,
    string VulnId,               // CVE or advisory id
    ComponentRef Component,
    ReachabilityStatus Status,
    double Confidence,           // 0..1
    string Method,               // "static-callgraph", "binary-fingerprint", etc.
    IReadOnlyList<string> EntrypointNodeIds,
    IReadOnlyList<IReadOnlyList<string>>? ExamplePaths // optional list of node-paths
);

3.4 JSON structure (external)

Minimal external JSON (what you store / expose):

{
  "image": "registry.example.com/app:1.2.3",
  "components": [
    {
      "purl": "pkg:npm/express@4.18.0",
      "bomRef": "component-1"
    }
  ],
  "callGraphs": [
    {
      "graphId": "js-main",
      "language": "js",
      "nodes": [ /* CallGraphNode */ ],
      "edges": [ /* CallGraphEdge */ ]
    }
  ],
  "reachability": [
    {
      "vulnId": "CVE-2023-12345",
      "componentPurl": "pkg:npm/express@4.18.0",
      "status": "PresentReachable",
      "confidence": 0.92,
      "entrypoints": [ "node:..." ],
      "paths": [
        ["node:entry", "node:routeHandler", "node:vulnFn"]
      ]
    }
  ]
}

4. Scanner-Side Architecture

4.1 Project layout (suggested)

src/
  Scanner/
    StellaOps.Scanner.WebService/
    StellaOps.Scanner.Worker/
    StellaOps.Scanner.Core/        # shared scan domain
    StellaOps.Scanner.Reachability/
    StellaOps.Scanner.Symbolization/
    StellaOps.Scanner.Cfg/
    StellaOps.Scanner.Analyzers.JavaScript/
    StellaOps.Scanner.Analyzers.Python/
    StellaOps.Scanner.Analyzers.Php/
    StellaOps.Scanner.Analyzers.Binary/

4.2 API surface (Scanner.WebService)

POST /api/scan/image
- Request: { "imageRef": "...", "profile": { "reachability": true, ... } }
- Returns: scan id.
GET /api/scan/{id}/reachability
- Returns: ReachabilityEvidence[], plus call graph summary (optional).
GET /api/scan/{id}/vex
- Returns: OpenVEX with statuses based on reachability lattice.

4.3 Worker orchestration

StellaOps.Scanner.Worker:

Receives scan job with imageRef.
Extracts filesystem (layered rootfs) under /mnt/scans/{scanId}/rootfs.
Invokes SBOM generator (CycloneDX/SPDX).
Invokes Concelier via offline feeds to get:
- Component vulnerabilities (CVE list per PURL).
- Vulnerability signatures (fingerprints).

Builds a ReachabilityPlan:

public record ReachabilityPlan(
    IReadOnlyList<ComponentRef> Components,
    IReadOnlyList<VulnerabilitySignature> Vulns,
    IReadOnlyList<AnalyzerTarget> AnalyzerTargets // files/dirs grouped by language
);

For each language target, dispatch analyzer:
- JavaScript: IReachabilityAnalyzer implementation for JS.
- Python: likewise.
- PHP: likewise.
- Binary: symbolizer + CFG.
Collects call graphs from each analyzer and merges them into a single IR (or separate per-language graphs with shared IDs).
Sends merged graphs + vuln list to Reachability Engine (Scanner.Reachability).

5. Language Analyzers (JS / Python / PHP)

All analyzers implement a common interface:

public interface IReachabilityAnalyzer
{
    string Language { get; } // "js", "python", "php"

    Task<CallGraph> AnalyzeAsync(AnalyzerContext context, CancellationToken ct);
}

public record AnalyzerContext(
    string RootFsPath,
    IReadOnlyList<ComponentRef> Components,
    IReadOnlyList<VulnerabilitySignature> Vulnerabilities,
    IReadOnlyDictionary<string, string> Env,   // container env, entrypoint, etc.
    string? EntrypointCommand                  // container CMD/ENTRYPOINT
);
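The sketch below shows how a worker might fan out to several analyzers through this interface and collect one graph per language. It is a simplified stand-in: `StubAnalyzer` is hypothetical, and `AnalyzerContext` and `CallGraph` are trimmed to a few fields so the sample stays self-contained.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

IReachabilityAnalyzer[] analyzers = { new StubAnalyzer("js"), new StubAnalyzer("python") };
var context = new AnalyzerContext(
    RootFsPath: "/mnt/scans/demo/rootfs",
    Env: new Dictionary<string, string>(),
    EntrypointCommand: "node server.js");

// Run all analyzers concurrently; Task.WhenAll returns results in input order.
CallGraph[] graphs = await Task.WhenAll(
    analyzers.Select(a => a.AnalyzeAsync(context, CancellationToken.None)));

foreach (var g in graphs)
    Console.WriteLine($"{g.GraphId}: {g.NodeIds.Count} nodes");

public interface IReachabilityAnalyzer
{
    string Language { get; }
    Task<CallGraph> AnalyzeAsync(AnalyzerContext context, CancellationToken ct);
}

// Trimmed stand-ins for the full records (assumption: only these fields matter here).
public record AnalyzerContext(string RootFsPath, IReadOnlyDictionary<string, string> Env, string? EntrypointCommand);
public record CallGraph(string GraphId, IReadOnlyList<string> NodeIds);

// Hypothetical stub: a real analyzer would parse files under context.RootFsPath.
public sealed class StubAnalyzer : IReachabilityAnalyzer
{
    public StubAnalyzer(string language) => Language = language;
    public string Language { get; }

    public Task<CallGraph> AnalyzeAsync(AnalyzerContext context, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        return Task.FromResult(new CallGraph($"{Language}-main", Array.Empty<string>()));
    }
}
```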

5.1 JavaScript (Node.js focus)

Inputs:

/app tree inside container (or discovered via SBOM).
package.json files.
Container entrypoint (e.g., ["node", "server.js"]).

Core steps:

Identify app root:
- Heuristics: directory containing package.json that owns the entry script.
Parse:
- All .js, .mjs, .cjs in app and node_modules for vulnerable PURLs.
- Use a parsing frontend (e.g., Tree-sitter via .NET binding, or Node+AST-as-JSON).
Build module graph:
- require, import, export.
Function-level graph:
- For each function/method, create CallGraphNode.
- For each callExpression, create CallGraphEdge (try to resolve callee).
Entrypoints:
- Main script in CMD/ENTRYPOINT.
- HTTP route handlers (for express/koa) detected by patterns (e.g., app.get("/...")).
Map vulnerable symbols:
- From VulnerabilitySignature.TargetSymbolPatterns (e.g., express/lib/router/layer.js:handle_request).
- Identify nodes whose SymbolId matches patterns.

Output:

CallGraph for JS with:
- IsEntrypoint = true for main and detected handlers.
- Node attributes include file path, line, component PURL.

5.2 Python

Inputs:

Site-packages paths from SBOM.
Entrypoint script (CMD/ENTRYPOINT).
Framework heuristics (Django, Flask) from environment variables or common entrypoints.

Core steps:

Discover Python interpreter chain: not needed for pure static, but useful for heuristics.
Parse .py files of:
- App code.
- Vulnerable packages (per PURL).
Build module import graph (import, from x import y).
Function-level graph:
- Nodes for functions, methods, class constructors.
- Edges for call expressions; conservative for dynamic calls.
Entrypoints:
- Main script.
- WSGI callable (e.g., application in wsgi.py).
- Django URLconf -> view functions.
Map vulnerable symbols using TargetSymbolPatterns like django.middleware.security.SecurityMiddleware.__call__.

5.3 PHP

Inputs:

Web root (from container image or conventional paths /var/www/html, /app/public, etc.).
Composer metadata (composer.json, vendor/).
Web server config if present (optional).

Core steps:

Discover front controllers (e.g., index.php, public/index.php).
Parse PHP files (again, via Tree-sitter or any suitable parser).
Resolve include/require chains to build file-level inclusion graph.
Build function/method graph:
- Functions, methods, class constructors.
- Calls with best-effort resolution for namespaced functions.
Entrypoints:
- Front controllers and router entrypoints (e.g., Symfony, Laravel detection).
Map vulnerable symbols (e.g., functions in certain vendor packages, particular methods).

6. Binary Analyzer & Symbolizer

Project: StellaOps.Scanner.Analyzers.Binary + Symbolization + Cfg.

6.1 Inputs

All binaries and shared libraries in:
- /usr/lib, /lib, /app/bin, etc.
SBOM link: each binary mapped to its component PURL when possible.
Vulnerability signatures for native libs: function names, symbol names, fingerprints.

6.2 Symbolization

Module: StellaOps.Scanner.Symbolization

Detect format: ELF, PE, Mach-O.
For ELF/Mach-O:
- Parse symbol tables (.symtab, .dynsym).
- Parse DWARF (if present) to map functions to source files/lines.
For PE:
- Parse PDB (if present) or export table.
For stripped binaries:
- Run function boundary recovery (linear sweep + heuristic).
- Compute block/fn-level hashes for fingerprinting.

Output:

public record ImageFunction(
    string ImageId,      // e.g., SHA256 of file
    ulong StartVa,
    uint Size,
    string? SymbolName,  // demangled if possible
    string FnHash,       // stable hash of bytes / CFG
    string? SourceFile,
    int? SourceLine);

6.3 CFG + Call graph

Module: StellaOps.Scanner.Cfg

Disassemble .text using Capstone/Iced.x86.
Build basic blocks and CFG.
Identify:
- Direct calls (resolved).
- PLT/IAT indirections to shared libraries.
Build CallGraph for binary functions:
- Entrypoints: main, exported functions, Go main.main, etc.
- Map application functions to library functions via PLT/IAT edges.

6.4 Linking vulnerabilities

For each vulnerability affecting a native library (e.g., OpenSSL):
- Map to candidate binaries via SBOM + PURL.
- Within library image, find ImageFunctions matching:
  - SymbolName patterns.
  - FnHash / BlockFingerprints (for precise detection).
Determine reachability:
- Starting from application entrypoints, traverse call graph to see if calls to vulnerable library function occur.
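A small sketch of the matching step in 6.4: a function is treated as a candidate when its symbol name or its function hash appears in the signature. The `Signature` record is a reduced, hypothetical stand-in for VulnerabilitySignature, `ImageFunction` is trimmed to the fields used here, and the hashes are placeholders.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var functions = new List<ImageFunction>
{
    new("sha256:lib", 0x1000, 120, "vuln_parse", "h-aaa"),
    new("sha256:lib", 0x2000, 80, null, "h-bbb"),          // stripped: no name, hash only
    new("sha256:lib", 0x3000, 40, "safe_helper", "h-ccc"),
};
var signature = new Signature("CVE-2023-12345", new[] { "vuln_parse" }, new[] { "h-bbb" });

// Name match works when symbols survive; hash match covers stripped binaries.
var matches = functions
    .Where(f => (f.SymbolName is not null && signature.SymbolNames.Contains(f.SymbolName))
                || signature.FnHashes.Contains(f.FnHash))
    .ToList();

foreach (var m in matches)
{
    string label = m.SymbolName ?? "<stripped>";
    Console.WriteLine($"0x{m.StartVa:x} {label}");
}

// Reduced copies of the records, for illustration only.
record ImageFunction(string ImageId, ulong StartVa, uint Size, string? SymbolName, string FnHash);
record Signature(string VulnId, IReadOnlyList<string> SymbolNames, IReadOnlyList<string> FnHashes);
```

The matched functions then become the target nodes for the call-graph traversal from the application entrypoints.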

7. Reachability Engine & Lattice (Scanner.WebService)

Project: StellaOps.Scanner.Reachability

7.1 Inputs to engine

Combined CallGraph[] (per language + binary).
Vulnerability list (CVE, GHSA, etc.) with affected PURLs.
Vulnerability signatures.
Entrypoint hints:
- Container CMD/ENTRYPOINT.
- Detected HTTP handlers, WSGI/PSGI entrypoints, etc.

7.2 Algorithm steps

Entrypoint expansion
- Identify all CallGraphNode with IsEntrypoint=true.
- Add language-specific “framework entrypoints” (e.g., Express route dispatch, Django URL dispatch) when detected.
Graph traversal
- For each entrypoint node:
  - BFS/DFS through edges.
  - Maintain reachable bit on each node.
- For dynamic edges:
  - Conservative: if target cannot be resolved, mark affected path as partially unknown and downgrade confidence.
Vuln symbol resolution
- For each vulnerability:
  - For each vulnerable component PURL found in SBOM:
    - Find candidate nodes whose SymbolId matches TargetSymbolPatterns / binary fingerprints.
- If none found:
  - FunctionNotPresent (if component version range indicates vulnerable but we cannot find symbol – low confidence).
- If found:
  - Check reachable bit:
    - If reachable by at least one entrypoint, PresentReachable.
    - Else, PresentNotReachable.
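The traversal and status rules above can be sketched on a toy graph as follows. The node names are invented, edges are plain tuples rather than CallGraphEdge, and unresolved dynamic edges (which would yield Unknown or a confidence downgrade) are deliberately not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy graph: entry -> handler -> vulnFn; deadFn exists but nothing calls it.
var edges = new (string From, string To)[]
{
    ("entry", "handler"),
    ("handler", "vulnFn"),
};
var entrypoints = new[] { "entry" };

HashSet<string> reachable = Reach(entrypoints, edges);

Console.WriteLine(Classify(new[] { "vulnFn" }, reachable));       // PresentReachable
Console.WriteLine(Classify(new[] { "deadFn" }, reachable));       // PresentNotReachable
Console.WriteLine(Classify(Array.Empty<string>(), reachable));    // FunctionNotPresent

// Breadth-first search from all entrypoints; the returned set plays the role of the "reachable bit".
static HashSet<string> Reach(IEnumerable<string> entrypoints, IEnumerable<(string From, string To)> edges)
{
    var adjacency = edges.ToLookup(e => e.From, e => e.To);
    var seen = new HashSet<string>(entrypoints);
    var queue = new Queue<string>(seen);
    while (queue.Count > 0)
    {
        foreach (var next in adjacency[queue.Dequeue()])
            if (seen.Add(next)) queue.Enqueue(next);
    }
    return seen;
}

// candidateNodeIds: nodes that matched the vulnerability signature (empty if none matched).
static ReachabilityStatus Classify(IReadOnlyCollection<string> candidateNodeIds, HashSet<string> reachable)
{
    if (candidateNodeIds.Count == 0) return ReachabilityStatus.FunctionNotPresent;
    return candidateNodeIds.Any(id => reachable.Contains(id))
        ? ReachabilityStatus.PresentReachable
        : ReachabilityStatus.PresentNotReachable;
}

enum ReachabilityStatus { PresentReachable, PresentNotReachable, FunctionNotPresent, Unknown }
```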
Confidence computation
- Start from:
  - 1.0 for direct match with explicit function name & static call.
  - Lower for:
    - Heuristic framework entrypoints.
    - Dynamic calls.
    - Fingerprint-only matches on stripped binaries.
- Example rule-of-thumb:
  - direct static path only: 0.95–1.0.
  - dynamic edges but symbol found: 0.7–0.9.
  - symbol not found but version says vulnerable: 0.4–0.6.
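One possible way to turn these rules of thumb into a number is sketched below. The `Confidence` helper and its penalty values are assumptions chosen only so that results land inside the bands listed above; they are not calibrated figures.

```csharp
using System;

// Hypothetical scoring helper; every constant here is illustrative.
static double Confidence(bool symbolFound, bool hasDynamicEdge,
                         bool heuristicEntrypoint, bool fingerprintOnly)
{
    if (!symbolFound) return 0.5;            // version says vulnerable, symbol missing: 0.4-0.6 band
    double score = 1.0;                      // explicit function name and a direct static path
    if (hasDynamicEdge) score -= 0.15;       // moves the result into the 0.7-0.9 band
    if (heuristicEntrypoint) score -= 0.05;  // entrypoint inferred from a framework pattern
    if (fingerprintOnly) score -= 0.05;      // matched by fingerprint on a stripped binary
    return Math.Round(Math.Clamp(score, 0.0, 1.0), 2);
}

Console.WriteLine(Confidence(symbolFound: true, hasDynamicEdge: false, heuristicEntrypoint: false, fingerprintOnly: false));
Console.WriteLine(Confidence(symbolFound: true, hasDynamicEdge: true, heuristicEntrypoint: true, fingerprintOnly: true));
Console.WriteLine(Confidence(symbolFound: false, hasDynamicEdge: false, heuristicEntrypoint: false, fingerprintOnly: false));
```

Keeping the penalties additive makes each downgrade easy to explain in the evidence record; a multiplicative scheme would be an equally valid choice.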
Lattice merge
- Represent each CVE+component pair as a lattice element with states: {affected, not_affected, unknown}.
- Reachability engine produces a local state:
  - PresentReachable → candidate affected.
  - PresentNotReachable or FunctionNotPresent → candidate not_affected.
  - Unknown → unknown.
- Merge with:
  - Upstream vendor VEX (from Concelier).
  - Policy overrides (e.g., “treat certain CVEs as affected unless vendor says otherwise”).
- Final state computed here (Scanner.WebService), not in Concelier or VEXer.
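The merge step can be pictured with a small function over the three lattice states. The precedence used here is an assumption, since the text leaves it open: an explicit vendor statement beats the "affected unless vendor says otherwise" override, and when vendor and local results are combined, "affected" from either side wins.

```csharp
using System;

Console.WriteLine(Merge(VexState.NotAffected, vendor: null, affectedUnlessVendorDisagrees: false));
Console.WriteLine(Merge(VexState.Unknown, vendor: VexState.NotAffected, affectedUnlessVendorDisagrees: false));
Console.WriteLine(Merge(VexState.NotAffected, vendor: null, affectedUnlessVendorDisagrees: true));

// Hypothetical merge rule for one CVE+component pair.
static VexState Merge(VexState local, VexState? vendor, bool affectedUnlessVendorDisagrees)
{
    if (vendor is VexState v && v != VexState.Unknown)
    {
        // The vendor made an explicit statement.
        if (affectedUnlessVendorDisagrees) return v;
        // Conservative join: "affected" from either source is kept.
        return (local == VexState.Affected || v == VexState.Affected)
            ? VexState.Affected
            : VexState.NotAffected;
    }
    // No usable vendor statement: apply the policy override, else keep the local result.
    return affectedUnlessVendorDisagrees ? VexState.Affected : local;
}

enum VexState { Unknown, NotAffected, Affected }
```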
Evidence output
- For each vulnerability:
  - Emit ReachabilityEvidence with:
    - Status.
    - Confidence.
    - Method.
    - Example entrypoint paths (for UX and audit).
- Persist this evidence alongside regular scan results.

8. Integration with SBOM & VEX

8.1 SBOM annotation

Extend SBOM documents (CycloneDX / SPDX) with extra properties:
- CycloneDX:
  - component.properties:
    - stellaops:reachability:status = present_reachable|present_not_reachable|function_not_present|unknown
    - stellaops:reachability:confidence = 0.0-1.0
- SPDX:
  - Annotation or ExternalRef with similar metadata.

8.2 OpenVEX generation

Module: StellaOps.Vexer.Adapter.Reachability

For each (vuln, component) pair:
- Map to VEX statement:
  - If PresentReachable:
    - status: affected
    - action_statement: remediation or mitigation guidance (OpenVEX defines justification only for not_affected).
  - If PresentNotReachable:
    - status: not_affected
    - justification: vulnerable_code_not_in_execute_path
  - If FunctionNotPresent:
    - status: not_affected
    - justification: vulnerable_code_not_present (or component_not_present when the component itself is absent)
  - If Unknown:
    - status: under_investigation (configurable).
Attach evidence via:
- status_notes / impact_statement fields (link to internal evidence JSON or audit link).
VEXer does not recalculate reachability; it uses the already computed decision + evidence.
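A minimal sketch of the mapping from a reachability status to one OpenVEX statement, using the justification labels defined by the OpenVEX specification (vulnerable_code_not_in_execute_path, vulnerable_code_not_present). `ToStatement` is a hypothetical helper, the action_statement text is a placeholder, and the enclosing document fields (@context, @id, author, timestamp) are omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: builds a single statement as a plain dictionary.
static Dictionary<string, object> ToStatement(string vulnId, string purl, string reachabilityStatus)
{
    var statement = new Dictionary<string, object>
    {
        ["vulnerability"] = new Dictionary<string, object> { ["name"] = vulnId },
        ["products"] = new[] { new Dictionary<string, object> { ["@id"] = purl } },
    };
    switch (reachabilityStatus)
    {
        case "PresentReachable":
            statement["status"] = "affected";
            statement["action_statement"] = "See reachability evidence; upgrade or mitigate."; // placeholder text
            break;
        case "PresentNotReachable":
            statement["status"] = "not_affected";
            statement["justification"] = "vulnerable_code_not_in_execute_path";
            break;
        case "FunctionNotPresent":
            statement["status"] = "not_affected";
            statement["justification"] = "vulnerable_code_not_present";
            break;
        default: // Unknown
            statement["status"] = "under_investigation";
            break;
    }
    return statement;
}

var s = ToStatement("CVE-2023-12345", "pkg:npm/express@4.18.0", "PresentNotReachable");
Console.WriteLine(JsonSerializer.Serialize(s, new JsonSerializerOptions { WriteIndented = true }));
```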

9. Executable Containers & Offline Operation

9.1 Executable containers

Analyzers run inside a dedicated Scanner worker container that has:
- .NET 10 runtime.
- Language runtimes if needed for parsing (Node, Python, PHP), or Tree-sitter-based parsing.
Target image filesystem is mounted read-only under /mnt/rootfs.
No network access (offline/air-gap).
This satisfies “we will use executable containers” while keeping separation between:
- Target image (mount only).
- Analyzer container (StellaOps code).

9.2 Offline signature bundles

Concelier periodically exports:
- Vulnerability database (CSAF/NVD).
- Vulnerability Signature Bank.
Bundles are:
- DSSE-signed.
- Versioned (e.g., signatures-2025-11-01.tar.zst).
Scanner uses:
- The bundle digest as part of the Scan Manifest for deterministic replay.

10. Determinism & Caching

10.1 Layer-level caching

Key: layerDigest + analyzerVersion + signatureBundleVersion.
Cache artifacts:
- CallGraph(s) per layer (for JS/Python/PHP code present in that layer).
- Symbolization results per binary file hash.
For images sharing layers:
- Merge cached graphs instead of re-analyzing.
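For illustration, the cache key could be derived by hashing the three inputs together. `LayerCacheKey`, the newline separator, and SHA-256 are assumptions for this sketch; the digests and versions are placeholders.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: one key per (layer, analyzer version, signature bundle version).
static string LayerCacheKey(string layerDigest, string analyzerVersion, string signatureBundleVersion)
{
    // The newline separator prevents two different triples from concatenating to the same string.
    string material = $"{layerDigest}\n{analyzerVersion}\n{signatureBundleVersion}";
    return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(material))).ToLowerInvariant();
}

string k1 = LayerCacheKey("sha256:1111", "js-1.0.0", "signatures-2025-11-01");
string k2 = LayerCacheKey("sha256:1111", "js-1.0.1", "signatures-2025-11-01");

Console.WriteLine(k1 == k2); // False: a new analyzer version invalidates the cached entry
```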

10.2 Deterministic scan manifest

For each scan, produce:

{
  "imageRef": "registry/app:1.2.3",
  "imageDigest": "sha256:...",
  "scannerVersion": "1.4.0",
  "analyzerVersions": {
    "js": "1.0.0",
    "python": "1.0.0",
    "php": "1.0.0",
    "binary": "1.0.0"
  },
  "signatureBundleDigest": "sha256:...",
  "callGraphDigest": "sha256:...",    // canonical JSON hash
  "reachabilityEvidenceDigest": "sha256:..."
}

This manifest can be signed (Authority module) and used for audits and replay.
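The "canonical JSON hash" used for callGraphDigest could be computed along the following lines: sort object keys recursively, serialize without whitespace, then hash. This is a simplified sketch; `Canonicalize` and `Digest` are hypothetical helpers, it assumes .NET 8 or later for `DeepClone`, and it does not normalize number formatting the way a full canonicalization scheme would.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json.Nodes;

// Rebuilds the tree with object keys in ordinal order; array order is preserved.
static JsonNode? Canonicalize(JsonNode? node) => node switch
{
    JsonObject o => new JsonObject(o
        .OrderBy(p => p.Key, StringComparer.Ordinal)
        .Select(p => KeyValuePair.Create(p.Key, Canonicalize(p.Value)))),
    JsonArray a => new JsonArray(a.Select(item => Canonicalize(item)).ToArray()),
    null => null,
    _ => node.DeepClone(), // leaf value: clone so it can be attached to the new parent
};

static string Digest(string json)
{
    string canonical = Canonicalize(JsonNode.Parse(json))!.ToJsonString(); // compact, sorted
    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
    return "sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
}

string x = Digest("{\"b\":1,\"a\":{\"d\":2,\"c\":3}}");
string y = Digest("{ \"a\": { \"c\": 3, \"d\": 2 }, \"b\": 1 }");

Console.WriteLine(x == y); // True: key order and whitespace do not affect the digest
```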

11. Implementation Roadmap (Phased)

Phase 0 – Infrastructure & Binary presence

Duration: 1 sprint

Set up Scanner.Reachability core types and interfaces.
Implement:
- Basic Symbolizer for ELF + DWARF.
- Binary function catalog without CFG.
Link a small set of CVEs to binary function presence via SymbolName.
Expose minimal evidence:
- Presence-only statuses (no call graph yet): a present function is conservatively reported as PresentReachable, an absent one as FunctionNotPresent.
Integrate with VEXer to emit not_affected statements for the FunctionNotPresent case (OpenVEX justification: vulnerable_code_not_present).

Success criteria:

For selected demo images with known vulnerable/patched OpenSSL, scanner can:
- Distinguish images where vulnerable function is present vs. absent.
- Emit OpenVEX with correct not_affected when patched.

Phase 1 – JS/Python/PHP call graphs & basic reachability

Duration: 1–2 sprints

Implement:
- Scanner.Analyzers.JavaScript with module + function call graph.
- Scanner.Analyzers.Python and Scanner.Analyzers.Php with basic graphs.
Entrypoint detection:
- JS: main script from CMD, basic HTTP handlers.
- Python: main script + Django/Flask heuristics.
- PHP: front controllers.
Implement core reachability algorithm (BFS/DFS).
Implement simple VulnerabilitySignature that uses function names and file paths.
Hook lattice engine in Scanner.WebService and integrate with:
- Concelier vulnerability feeds.
- VEXer.

Success criteria:

For demo apps (Node, Django, Laravel):
- Identify vulnerable functions and mark them reachable/unreachable.
- Demonstrate noise reduction (some CVEs flagged as not_affected).

Phase 2 – Binary CFG & Fingerprinting, Improved Confidence

Duration: 1–2 sprints

Extend Symbolizer & CFG for:
- Stripped binaries (function hashing).
- Shared libraries (PLT/IAT resolution).
Implement VulnerabilitySignature.BlockFingerprints to distinguish patched vs vulnerable binary functions.
Refine confidence scoring:
- Use fingerprint match quality.
- Consider presence/absence of debug info.
Expand coverage:
- glibc, curl, zlib, OpenSSL, libxml2, etc.

Success criteria:

For curated images:
- Confirm ability to differentiate patched vs vulnerable versions even when binaries are stripped.
- Reachability reflects true call paths across app→lib boundaries.

Phase 3 – Runtime hooks (optional), UX, and Hardening

Duration: 2+ sprints

Add opt-in runtime confirmation:
- eBPF probes for function hits (Linux).
- Map runtime addresses back to ImageFunction via symbolization.
Enhance console UX:
- Path explorer UI: show entrypoint → … → vulnerable function path.
- Evidence view with hash-based proofs.
Hardening:
- Performance optimization for large images (parallel analysis, caching).
- Conservative fallbacks for dynamic language features.

Success criteria:

For selected environments where runtime is allowed:
- Static reachability is confirmed by runtime traces in majority of cases.
- No significant performance regression on typical images.

12. How this satisfies your initial bullets

From your initial requirements:

JavaScript, Python, PHP, binary → Dedicated analyzers per language + binary symbolization/CFG, unified in Scanner.Reachability.
Executable containers → Analyzers run inside Scanner’s worker container, mounting the target image rootfs; no network access.
Libraries usage call graph → Call graphs map from entrypoints → app code → library functions; SBOM + PURLs tie functions to libraries.
Reachability analysis → BFS/DFS from entrypoints over per-language and binary graphs, with lattice-based merging in Scanner.WebService.
JSON + PURLs → All evidence is JSON with PURL-tagged components; SBOM is annotated, and VEX statements reference those PURLs.

If you like, next step can be: I draft concrete C# interface definitions (including some initial Tree-sitter integration stubs for JS/Python/PHP) and a skeleton of the ReachabilityPlan and ReachabilityEngine classes that you can drop into the monorepo.
