git.stella-ops.org/docs/product-advisories/18-Nov-2026 - Binary-Reachability-Engine.md
2025-11-20 23:16:02 +02:00


Here's a crisp idea that could give StellaOps a real moat: binary-level reachability, which links CVEs directly to the exact functions and offsets inside compiled artifacts (ELF/PE/Mach-O), not just to packages.


Why this matters (quick background)

  • Package-level flags are noisy. Most scanners say “vuln in libX v1.2,” but that library might be present and never executed.
  • Language-level call graphs help (when you have source or rich metadata), but containers often ship only stripped binaries.
  • Binary reachability answers: Is the vulnerable function actually in this image? Is its code path reachable from the entrypoints we observed or can construct?

The missing layer: Symbolization

Build a symbolization layer that normalizes debug and symbol info across platforms:

  • Inputs: DWARF (ELF/Mach-O), PDB (PE/Windows), symtabs, exported symbols, .eh_frame, and (when stripped) heuristic signatures (e.g., function byte-hashes, CFG fingerprints).
  • Outputs: a source-agnostic map: {binary → sections → functions → (addresses, ranges, hashes, demangled names, inlined frames)}.
  • Normalization: Put everything into a common schema (e.g., Stella.Symbolix.v1) so higher layers don't care whether it came from DWARF or PDB.

End-to-end reachability (binary-first, source-agnostic)

  1. Acquire & parse

    • Detect format (ELF/PE/Mach-O), parse headers, sections, symbol tables.
    • If debug info present: parse DWARF/PDB; else fall back to disassembly + function boundary recovery.
  2. Function catalog

    • Assign stable IDs per function: (imageHash, textSectionHash, startVA, size, fnHashXX).
    • Record xrefs (calls/jumps), imports/exports, PLT/IAT edges.
  3. Entrypoint discovery

    • Docker entry, process launch args, service scripts; infer likely mains (Go main.main, .NET hostfxr path, JVM launcher, etc.).
  4. Call-graph build (binary CFG)

    • Build inter/intra-procedural graph (direct + resolved indirect via IAT/PLT). Keep “unknown-target” edges for conservative safety.
  5. CVE→function linking

    • Maintain a signature bank per CVE advisory: vulnerable function names, file paths, and, crucially, byte-sequence or basic-block fingerprints for patched vs vulnerable versions (works even when stripped).
  6. Reachability analysis

    • Is the vulnerable function present? Is there a path from any entrypoint to it (under conservative assumptions)? Tag as Present+Reachable, Present+Uncertain, or Absent.
  7. Runtime confirmation (optional, when users allow)

    • Lightweight probes (eBPF on Linux, ETW on Windows, perf/JFR/EventPipe) capture function hits; cross-check with the static result to upgrade confidence.
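
To make step 2 concrete, here is a minimal sketch of deriving a stable function ID from the tuple (imageHash, textSectionHash, startVA, size, fnHash). The `FunctionIds` class, the `fn:` prefix, the `|` separator, and the choice of SHA-256 are all assumptions for illustration; nothing here is an existing StellaOps API.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of any existing StellaOps assembly.
public static class FunctionIds
{
    // Derives a stable ID from the tuple named in step 2.
    // Hash inputs are expected as hex strings computed by earlier stages.
    public static string Compute(
        string imageHash, string textSectionHash,
        ulong startVa, uint size, string fnHash)
    {
        // Fixed field order, fixed-width hex numbers, and a separator keep
        // the encoding unambiguous across runs and platforms.
        string canonical = string.Join("|",
            imageHash, textSectionHash,
            startVa.ToString("x16"), size.ToString("x8"), fnHash);

        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return "fn:" + Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Calling `FunctionIds.Compute` twice with the same inputs yields the same ID, which is what makes the IDs usable as cache keys and as evidence references.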

Minimal component plan (drop into StellaOps)

  • Scanner.Symbolizer: Parsers for ELF/DWARF (libdw or a pure-managed reader), PE/PDB (DIA/LLVM PDB), Mach-O/dSYM. Output: Symbolix.v1 blobs stored in the OCI layer cache.
  • Scanner.CFG: Lifts functions to a normalized IR (Capstone/iced-x86 for decode), then builds the CFG and call graph.
  • Advisory.FingerprintBank: Ingests CSAF/OpenVEX plus curated fingerprints (fn names, block hashes, patch diff markers). Versioned, signed, air-gap-syncable.
  • Reachability.Engine: Joins (Symbolix + CFG + FingerprintBank) and emits ReachabilityEvidence with lattice states for VEX.
  • VEXer.Adapter: Emits OpenVEX statements with status: affected/not_affected and an internal justification (function_not_present | function_not_reachable | mitigated_at_runtime) mapped onto OpenVEX's standard justification labels, attaching Evidence URIs.
  • Console UX: “Why not affected?” panel showing entrypoint→…→function path (or absence), with byte-hash proof.

Data model sketch (concise)

  • ImageFunction { id, name?, startVA, size, fnHash, sectionHash, demangled?, provenance:{DWARF|PDB|Heuristic} }
  • Edge { srcFnId, dstFnId, kind:{direct|plt|iat|indirect?} }
  • CveSignature { cveId, fnName?, libHints[], blockFingerprints[], versionRanges }
  • Evidence { cveId, imageId, functionMatches[], reachable: bool?, confidence:[low|med|high], method:[static|runtime|hybrid] }

Practical phases (8-10 weeks of focused work)

  1. P0: ELF/DWARF symbolizer + basic function catalog; link a handful of CVEs via name-only; emit OpenVEX function_not_present.
  2. P1: CFG builder (direct calls) + PLT/IAT resolution; simple reachability; first fingerprints for top 50 CVEs in glibc, openssl, curl, zlib.
  3. P2: Stripped-binary heuristics (block hashing) + Go/Rust name demangling; Windows PDB ingestion for PE.
  4. P3: Runtime probes (opt-in) + confidence upgrade logic; Console path explorer; evidence signing (DSSE).

KPIs to prove the moat

  • Noise cut: % reduction in “affected” flags after reachability (target 40-70% on typical containers).
  • Precision: Ground-truth validation vs PoC images (TP/FP/FN on presence & reachability).
  • Coverage: % images where we can make a determination without source (goal: >80%).
  • Latency: Added scan time per image (<15s typical with caches).

Risks & how to handle them

  • Stripped binaries → mitigate with block-hash fingerprints & library-version heuristics.
  • Obfuscated/packed code → mark Uncertain; allow user-supplied hints; prefer runtime confirmation.
  • Advisory inconsistency → keep our own curated CVE→function fingerprint bank; sign & version it.
  • Platform spread → start Linux/ELF, then Windows/PDB, then Mach-O.

Why competitors struggle

Most tools stop at packages because binary CFG + fingerprint curation is hard and expensive. Shipping a source-agnostic reachability engine tied to signed evidence in VEX would set StellaOps apart, especially in the offline/air-gapped and sovereign contexts you already target.

If you want, I can draft:

  • the Symbolix.v1 protobuf,
  • a tiny PoC (ELF→functions→match CVE with a block fingerprint),
  • and the OpenVEX emission snippet your VEXer can produce.

Below is a detailed architecture plan for implementing reachability and call-graph analysis in Stella Ops, covering JavaScript, Python, PHP, and binaries, and integrating with your existing Scanner / Concelier / VEXer stack.

I will assume:

  • .NET 10 for core services.
  • Scanner is the place where all “trust algebra / lattice” runs (per your standing rule).
  • Concelier and VEXer remain “preserve/prune” layers and do not run lattice logic.
  • Output must be JSON-centric with PURLs and OpenVEX.

1. Scope & Objectives

1.1 Primary goals

  1. From an OCI image, build:

    • A library-level usage graph (which libraries are used by which entrypoints).
    • A function-level call graph for JS / Python / PHP / binaries.
  2. Map CVEs (from Concelier) to:

    • Concrete components (PURLs) in the SBOM.
    • Concrete functions / entrypoints / code regions inside those components.
  3. Perform reachability analysis to classify each vulnerability as:

    • present + reachable
    • present + not_reachable
    • function_not_present (no vulnerable symbol)
    • uncertain (dynamic features, unresolved calls)
  4. Emit:

    • Structured JSON with PURLs and call-graph nodes/edges (“reachability evidence”).
    • OpenVEX documents with appropriate status/justification.

1.2 Non-goals (for now)

  • Full dynamic analysis of the running container (eBPF, ptrace, etc.); leave as an optional Phase 3+ add-on.
  • Perfect call graph precision for dynamic languages (aim for safe, conservative approximations).
  • Automatic “fix recommendations” (handled by other Stella Ops agents later).

2. High-Level Architecture

2.1 Major components

Within Stella Ops:

  • Scanner.WebService

    • User-facing API.
    • Orchestrates full scan (SBOM, CVEs, reachability).
    • Hosts the Lattice/Policy engine that merges evidence and produces decisions.
  • Scanner.Worker

    • Runs per-image analysis jobs.
    • Invokes analyzers (JS, Python, PHP, Binary) inside its own container context.
  • Scanner.Reachability Core Library

    • Unified IR for call graphs and reachability evidence.
    • Interfaces for language and binary analyzers.
    • Graph algorithms (BFS/DFS, lattice evaluation, entrypoint expansion).
  • Language Analyzers

    • Scanner.Analyzers.JavaScript
    • Scanner.Analyzers.Python
    • Scanner.Analyzers.Php
    • Scanner.Analyzers.Binary
  • Symbolization & CFG (for binaries)

    • Scanner.Symbolization (ELF, PE, Mach-O parsers, DWARF/PDB)
    • Scanner.Cfg (CFG + call graph for binaries)
  • Vulnerability Signature Bank

    • Concelier.Signatures (curated CVE→function/library fingerprints).
    • Exposed to Scanner as offline bundle.
  • VEXer

    • Vexer.Adapter.Reachability transforms reachability evidence into OpenVEX.

2.2 Data flow (logical)

flowchart LR
  A[OCI Image / Tar] --> B[Scanner.Worker: Extract FS]
  B --> C["SBOM Engine (CycloneDX/SPDX)"]
  C --> D["Vuln Match (Concelier feeds)"]
  B --> E1[JS Analyzer]
  B --> E2[Python Analyzer]
  B --> E3[PHP Analyzer]
  B --> E4[Binary Analyzer + Symbolizer/CFG]

  D --> F[Reachability Orchestrator]
  E1 --> F
  E2 --> F
  E3 --> F
  E4 --> F
  F --> G["Lattice/Policy Engine (Scanner.WebService)"]
  G --> H[Reachability Evidence JSON]
  G --> I[VEXer: OpenVEX]
  G --> J["Graph/Cartographer (optional)"]

3. Data Model & JSON Contracts

3.1 Core IR types (Scanner.Reachability)

Define in a central assembly, e.g. StellaOps.Scanner.Reachability:

public record ComponentRef(
    string Purl,
    string? BomRef,
    string? Name,
    string? Version);

public enum SymbolKind { Function, Method, Constructor, Lambda, Import, Export }

public record SymbolId(
    string Language,       // "js", "python", "php", "binary"
    string ComponentPurl,  // SBOM component PURL or "" for app code
    string LogicalName,    // e.g., "server.js:handleLogin"
    string? FilePath,
    int? Line);

public record CallGraphNode(
    string Id,                 // stable id, e.g., hash(SymbolId)
    SymbolId Symbol,
    SymbolKind Kind,
    bool IsEntrypoint);

public enum CallEdgeKind { Direct, Indirect, Dynamic, External, Ffi }

public record CallGraphEdge(
    string FromNodeId,
    string ToNodeId,
    CallEdgeKind Kind);

public record CallGraph(
    string GraphId,
    IReadOnlyList<CallGraphNode> Nodes,
    IReadOnlyList<CallGraphEdge> Edges);
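
The `hash(SymbolId)` comment on CallGraphNode.Id can be read as something like the sketch below. The `NodeIds` helper, the `node:` prefix (borrowed from the JSON example later in this plan), the unit-separator delimiter, and SHA-256 are assumptions; `SymbolId` is re-declared only so the snippet compiles on its own.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

// Re-declared from section 3.1 for self-containment.
public record SymbolId(
    string Language, string ComponentPurl, string LogicalName,
    string? FilePath, int? Line);

// Hypothetical helper illustrating "stable id, e.g., hash(SymbolId)".
public static class NodeIds
{
    public static string For(SymbolId s)
    {
        // U+001F (unit separator) is unlikely to occur in names or paths,
        // so joined fields cannot be confused with one another.
        string canonical = string.Join('\u001f',
            s.Language, s.ComponentPurl, s.LogicalName,
            s.FilePath ?? "",
            s.Line?.ToString(CultureInfo.InvariantCulture) ?? "");

        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return "node:" + Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Because the ID depends only on the symbol's fields, two analyzers that describe the same symbol identically produce the same node ID, which is what lets per-language graphs share IDs when merged.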

3.2 Vulnerability mapping

public record VulnerabilitySignature(
    string Source,             // "csaf", "nvd", "vendor"
    string Id,                 // "CVE-2023-12345"
    IReadOnlyList<string> Purls,
    IReadOnlyList<string> TargetSymbolPatterns, // glob-like or regex
    IReadOnlyList<string>? FilePathPatterns,
    IReadOnlyList<string>? BlockFingerprints    // for binaries, optional
);
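
As a rough illustration of how TargetSymbolPatterns might be matched against SymbolId.LogicalName, the sketch below treats each pattern as a simple glob (`*` and `?` only). That is an assumption: the record's comment allows "glob-like or regex", and the `SymbolPatterns` helper is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical matcher: supports only '*' (any run) and '?' (one char).
public static class SymbolPatterns
{
    // Escapes the glob, then re-enables the two wildcard characters.
    public static Regex ToRegex(string glob) =>
        new Regex(
            "^" + Regex.Escape(glob).Replace(@"\*", ".*").Replace(@"\?", ".") + "$",
            RegexOptions.CultureInvariant);

    // True when the logical name matches at least one pattern.
    public static bool MatchesAny(string logicalName, IEnumerable<string> globs) =>
        globs.Any(g => ToRegex(g).IsMatch(logicalName));
}
```

For example, the pattern `express/lib/router/*:handle_*` matches the logical name `express/lib/router/layer.js:handle_request` but not `express/lib/application.js:listen`.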

3.3 Reachability evidence

public enum ReachabilityStatus
{
    PresentReachable,
    PresentNotReachable,
    FunctionNotPresent,
    Unknown
}

public record ReachabilityEvidence
(
    string ImageRef,
    string VulnId,               // CVE or advisory id
    ComponentRef Component,
    ReachabilityStatus Status,
    double Confidence,           // 0..1
    string Method,               // "static-callgraph", "binary-fingerprint", etc.
    IReadOnlyList<string> EntrypointNodeIds,
    IReadOnlyList<IReadOnlyList<string>>? ExamplePaths // optional list of node-paths
);

3.4 JSON structure (external)

Minimal external JSON (what you store / expose):

{
  "image": "registry.example.com/app:1.2.3",
  "components": [
    {
      "purl": "pkg:npm/express@4.18.0",
      "bomRef": "component-1"
    }
  ],
  "callGraphs": [
    {
      "graphId": "js-main",
      "language": "js",
      "nodes": [ /* CallGraphNode */ ],
      "edges": [ /* CallGraphEdge */ ]
    }
  ],
  "reachability": [
    {
      "vulnId": "CVE-2023-12345",
      "componentPurl": "pkg:npm/express@4.18.0",
      "status": "PresentReachable",
      "confidence": 0.92,
      "entrypoints": [ "node:..." ],
      "paths": [
        ["node:entry", "node:routeHandler", "node:vulnFn"]
      ]
    }
  ]
}

4. Scanner-Side Architecture

4.1 Project layout (suggested)

src/
  Scanner/
    StellaOps.Scanner.WebService/
    StellaOps.Scanner.Worker/
    StellaOps.Scanner.Core/        # shared scan domain
    StellaOps.Scanner.Reachability/
    StellaOps.Scanner.Symbolization/
    StellaOps.Scanner.Cfg/
    StellaOps.Scanner.Analyzers.JavaScript/
    StellaOps.Scanner.Analyzers.Python/
    StellaOps.Scanner.Analyzers.Php/
    StellaOps.Scanner.Analyzers.Binary/

4.2 API surface (Scanner.WebService)

  • POST /api/scan/image

    • Request: { "imageRef": "...", "profile": { "reachability": true, ... } }
    • Returns: scan id.
  • GET /api/scan/{id}/reachability

    • Returns: ReachabilityEvidence[], plus call graph summary (optional).
  • GET /api/scan/{id}/vex

    • Returns: OpenVEX with statuses based on reachability lattice.

4.3 Worker orchestration

StellaOps.Scanner.Worker:

  1. Receives scan job with imageRef.

  2. Extracts filesystem (layered rootfs) under /mnt/scans/{scanId}/rootfs.

  3. Invokes SBOM generator (CycloneDX/SPDX).

  4. Invokes Concelier via offline feeds to get:

    • Component vulnerabilities (CVE list per PURL).
    • Vulnerability signatures (fingerprints).
  5. Builds a ReachabilityPlan:

    public record ReachabilityPlan(
        IReadOnlyList<ComponentRef> Components,
        IReadOnlyList<VulnerabilitySignature> Vulns,
        IReadOnlyList<AnalyzerTarget> AnalyzerTargets // files/dirs grouped by language
    );
    
  6. For each language target, dispatch analyzer:

    • JavaScript: IReachabilityAnalyzer implementation for JS.
    • Python: likewise.
    • PHP: likewise.
    • Binary: symbolizer + CFG.
  7. Collects call graphs from each analyzer and merges them into a single IR (or separate per-language graphs with shared IDs).

  8. Sends merged graphs + vuln list to Reachability Engine (Scanner.Reachability).


5. Language Analyzers (JS / Python / PHP)

All analyzers implement a common interface:

public interface IReachabilityAnalyzer
{
    string Language { get; } // "js", "python", "php"

    Task<CallGraph> AnalyzeAsync(AnalyzerContext context, CancellationToken ct);
}

public record AnalyzerContext(
    string RootFsPath,
    IReadOnlyList<ComponentRef> Components,
    IReadOnlyList<VulnerabilitySignature> Vulnerabilities,
    IReadOnlyDictionary<string, string> Env,   // container env, entrypoint, etc.
    string? EntrypointCommand                  // container CMD/ENTRYPOINT
);

5.1 JavaScript (Node.js focus)

Inputs:

  • /app tree inside container (or discovered via SBOM).
  • package.json files.
  • Container entrypoint (e.g., ["node", "server.js"]).

Core steps:

  1. Identify app root:

    • Heuristics: directory containing package.json that owns the entry script.
  2. Parse:

    • All .js, .mjs, .cjs in app and node_modules for vulnerable PURLs.
    • Use a parsing frontend (e.g., Tree-sitter via .NET binding, or Node+AST-as-JSON).
  3. Build module graph:

    • require, import, export.
  4. Function-level graph:

    • For each function/method, create CallGraphNode.
    • For each callExpression, create CallGraphEdge (try to resolve callee).
  5. Entrypoints:

    • Main script in CMD/ENTRYPOINT.
    • HTTP route handlers (for express/koa) detected by patterns (e.g., app.get("/...")).
  6. Map vulnerable symbols:

    • From VulnerabilitySignature.TargetSymbolPatterns (e.g., express/lib/router/layer.js:handle_request).
    • Identify nodes whose SymbolId matches patterns.

Output:

  • CallGraph for JS with:

    • IsEntrypoint = true for main and detected handlers.
    • Node attributes include file path, line, component PURL.
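
For step 5, a deliberately crude sketch of detecting Express-style route registrations follows. It uses a regular expression as a stand-in for the AST matching a real analyzer would do (e.g., via Tree-sitter), and it assumes the Express instance is literally named `app` and that paths are plain string literals. `ExpressRoutes` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Heuristic only: a production analyzer would match on the parsed AST.
public static class ExpressRoutes
{
    // Matches calls such as app.get("/login", ...) or app.post('/upload', ...).
    private static readonly Regex RouteCall = new Regex(
        @"\bapp\.(get|post|put|delete|patch)\(\s*[""']([^""']+)[""']",
        RegexOptions.CultureInvariant);

    // Yields (HTTP method, route path) for every registration found.
    public static IEnumerable<(string Method, string Path)> Find(string source)
    {
        foreach (Match m in RouteCall.Matches(source))
            yield return (m.Groups[1].Value, m.Groups[2].Value);
    }
}
```

Each hit would become a CallGraphNode with IsEntrypoint = true for the handler passed to the call; resolving that handler argument to a function node is the part that genuinely needs the AST.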

5.2 Python

Inputs:

  • Site-packages paths from SBOM.
  • Entrypoint script (CMD/ENTRYPOINT).
  • Framework heuristics (Django, Flask) from environment variables or common entrypoints.

Core steps:

  1. Discover Python interpreter chain: not needed for pure static, but useful for heuristics.

  2. Parse .py files of:

    • App code.
    • Vulnerable packages (per PURL).
  3. Build module import graph (import, from x import y).

  4. Function-level graph:

    • Nodes for functions, methods, class constructors.
    • Edges for call expressions; conservative for dynamic calls.
  5. Entrypoints:

    • Main script.
    • WSGI callable (e.g., application in wsgi.py).
    • Django URLconf -> view functions.
  6. Map vulnerable symbols using TargetSymbolPatterns like django.middleware.security.SecurityMiddleware.__call__.

5.3 PHP

Inputs:

  • Web root (from container image or conventional paths /var/www/html, /app/public, etc.).
  • Composer metadata (composer.json, vendor/).
  • Web server config if present (optional).

Core steps:

  1. Discover front controllers (e.g., index.php, public/index.php).

  2. Parse PHP files (again, via Tree-sitter or any suitable parser).

  3. Resolve include/require chains to build file-level inclusion graph.

  4. Build function/method graph:

    • Functions, methods, class constructors.
    • Calls with best-effort resolution for namespaced functions.
  5. Entrypoints:

    • Front controllers and router entrypoints (e.g., Symfony, Laravel detection).
  6. Map vulnerable symbols (e.g., functions in certain vendor packages, particular methods).


6. Binary Analyzer & Symbolizer

Project: StellaOps.Scanner.Analyzers.Binary + Symbolization + Cfg.

6.1 Inputs

  • All binaries and shared libraries in:

    • /usr/lib, /lib, /app/bin, etc.
  • SBOM link: each binary mapped to its component PURL when possible.

  • Vulnerability signatures for native libs: function names, symbol names, fingerprints.

6.2 Symbolization

Module: StellaOps.Scanner.Symbolization

  • Detect format: ELF, PE, Mach-O.

  • For ELF/Mach-O:

    • Parse symbol tables (.symtab, .dynsym).
    • Parse DWARF (if present) to map functions to source files/lines.
  • For PE:

    • Parse PDB (if present) or export table.
  • For stripped binaries:

    • Run function boundary recovery (linear sweep + heuristic).
    • Compute block/fn-level hashes for fingerprinting.

Output:

public record ImageFunction(
    string ImageId,      // e.g., SHA256 of file
    ulong StartVa,
    uint Size,
    string? SymbolName,  // demangled if possible
    string FnHash,       // stable hash of bytes / CFG
    string? SourceFile,
    int? SourceLine);
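
One simple way to fill FnHash is to hash the raw bytes of the function body, as sketched below. This is an assumption rather than the planned algorithm: raw-byte hashes change whenever embedded addresses change, so a real fingerprint would likely normalize operands or hash the CFG shape instead. `FnHashing` is a hypothetical helper.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper: hashes the raw bytes of one function body.
public static class FnHashing
{
    // image:      full contents of the binary file
    // fileOffset: file offset where the function starts (not the VA)
    // size:       function length in bytes
    public static string HashBytes(ReadOnlySpan<byte> image, int fileOffset, int size)
    {
        ReadOnlySpan<byte> body = image.Slice(fileOffset, size);
        return Convert.ToHexString(SHA256.HashData(body)).ToLowerInvariant();
    }
}
```

The same function bytes hash identically regardless of where they sit in the file, which is the property needed to recognise a known function inside a stripped binary.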

6.3 CFG + Call graph

Module: StellaOps.Scanner.Cfg

  • Disassemble .text using Capstone/Iced.x86.

  • Build basic blocks and CFG.

  • Identify:

    • Direct calls (resolved).
    • PLT/IAT indirections to shared libraries.
  • Build CallGraph for binary functions:

    • Entrypoints: main, exported functions, Go main.main, etc.
    • Map application functions to library functions via PLT/IAT edges.

6.4 Linking vulnerabilities

  • For each vulnerability affecting a native library (e.g., OpenSSL):

    • Map to candidate binaries via SBOM + PURL.

    • Within library image, find ImageFunctions matching:

      • SymbolName patterns.
      • FnHash / BlockFingerprints (for precise detection).
  • Determine reachability:

    • Starting from application entrypoints, traverse call graph to see if calls to vulnerable library function occur.

7. Reachability Engine & Lattice (Scanner.WebService)

Project: StellaOps.Scanner.Reachability

7.1 Inputs to engine

  • Combined CallGraph[] (per language + binary).

  • Vulnerability list (CVE, GHSA, etc.) with affected PURLs.

  • Vulnerability signatures.

  • Entrypoint hints:

    • Container CMD/ENTRYPOINT.
    • Detected HTTP handlers, WSGI/PSGI entrypoints, etc.

7.2 Algorithm steps

  1. Entrypoint expansion

    • Identify all CallGraphNode with IsEntrypoint=true.
    • Add language-specific “framework entrypoints” (e.g., Express route dispatch, Django URL dispatch) when detected.
  2. Graph traversal

    • For each entrypoint node:

      • BFS/DFS through edges.
      • Maintain reachable bit on each node.
    • For dynamic edges:

      • Conservative: if target cannot be resolved, mark affected path as partially unknown and downgrade confidence.
  3. Vuln symbol resolution

    • For each vulnerability:

      • For each vulnerable component PURL found in SBOM:

        • Find candidate nodes whose SymbolId matches TargetSymbolPatterns / binary fingerprints.
    • If none found:

      • FunctionNotPresent, with low confidence when the component version range indicates vulnerable but no matching symbol can be found.
    • If found:

      • Check reachable bit:

        • If reachable by at least one entrypoint, PresentReachable.
        • Else, PresentNotReachable.
  4. Confidence computation

    • Start from:

      • 1.0 for direct match with explicit function name & static call.

      • Lower for:

        • Heuristic framework entrypoints.
        • Dynamic calls.
        • Fingerprint-only matches on stripped binaries.
    • Example rule-of-thumb:

      • direct static path only: 0.95-1.0.
      • dynamic edges but symbol found: 0.7-0.9.
      • symbol not found but version says vulnerable: 0.4-0.6.
  5. Lattice merge

    • Represent each CVE+component pair as a lattice element with states: {affected, not_affected, unknown}.

    • Reachability engine produces a local state:

      • PresentReachable → candidate affected.
      • PresentNotReachable or FunctionNotPresent → candidate not_affected.
      • Unknown → unknown.
    • Merge with:

      • Upstream vendor VEX (from Concelier).
      • Policy overrides (e.g., “treat certain CVEs as affected unless vendor says otherwise”).
    • Final state computed here (Scanner.WebService), not in Concelier or VEXer.

  6. Evidence output

    • For each vulnerability:

      • Emit ReachabilityEvidence with:

        • Status.
        • Confidence.
        • Method.
        • Example entrypoint paths (for UX and audit).
    • Persist this evidence alongside regular scan results.
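
Steps 2 and 3 above can be sketched as a breadth-first search followed by a three-way classification. This is a minimal illustration under simplifying assumptions: edges carry no Kind, unresolved dynamic edges are ignored (so `Unknown` is never produced), and confidence is not computed. `ReachabilitySketch` is a hypothetical name, and the types are trimmed re-declarations of those in section 3.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum ReachabilityStatus
{
    PresentReachable, PresentNotReachable, FunctionNotPresent, Unknown
}

// Trimmed version of the section 3.1 record (Kind omitted for brevity).
public record CallGraphEdge(string FromNodeId, string ToNodeId);

public static class ReachabilitySketch
{
    // Breadth-first search from all entrypoints; returns every reachable node id.
    public static HashSet<string> Reachable(
        IEnumerable<string> entrypoints, IEnumerable<CallGraphEdge> edges)
    {
        var adjacency = edges.ToLookup(e => e.FromNodeId, e => e.ToNodeId);
        var seen = new HashSet<string>(entrypoints);
        var queue = new Queue<string>(seen);
        while (queue.Count > 0)
        {
            // A lookup returns an empty sequence for nodes with no out-edges.
            foreach (var next in adjacency[queue.Dequeue()])
                if (seen.Add(next)) queue.Enqueue(next);
        }
        return seen;
    }

    // candidateNodeIds: nodes that matched the vulnerability signature.
    public static ReachabilityStatus Classify(
        IReadOnlyCollection<string> candidateNodeIds, ISet<string> reachable)
    {
        if (candidateNodeIds.Count == 0)
            return ReachabilityStatus.FunctionNotPresent;
        return candidateNodeIds.Any(reachable.Contains)
            ? ReachabilityStatus.PresentReachable
            : ReachabilityStatus.PresentNotReachable;
    }
}
```

Running the search once per scan and reusing the reachable set for every vulnerability keeps the cost linear in the graph size instead of repeating a traversal per CVE.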


8. Integration with SBOM & VEX

8.1 SBOM annotation

  • Extend SBOM documents (CycloneDX / SPDX) with extra properties:

    • CycloneDX:

      • component.properties:

        • stellaops:reachability:status = present_reachable|present_not_reachable|function_not_present|unknown
        • stellaops:reachability:confidence = 0.0-1.0
    • SPDX:

      • Annotation or ExternalRef with similar metadata.

8.2 OpenVEX generation

Module: StellaOps.Vexer.Adapter.Reachability

  • For each (vuln, component) pair:

    • Map to VEX statement:

      • If PresentReachable:

        • status: affected
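        • action_statement: required by OpenVEX for affected statements (affected carries no justification).
      • If PresentNotReachable:

        • status: not_affected
        • justification: vulnerable_code_not_in_execute_path (the OpenVEX label for function_not_reachable)
      • If FunctionNotPresent:

        • status: not_affected
        • justification: vulnerable_code_not_present (the OpenVEX label for function_not_present), or component_not_present when the component itself is missing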
      • If Unknown:

        • status: under_investigation (configurable).
  • Attach evidence via:

    • analysis / details fields (link to internal evidence JSON or audit link).
  • VEXer does not recalculate reachability; it uses the already computed decision + evidence.
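
The mapping above is small enough to express as a single switch. The sketch below uses OpenVEX's standard status and justification labels; the `VexDecision` record and `VexMapping` class are hypothetical, and the `Unknown` → `under_investigation` choice is hard-coded here although the plan calls for it to be configurable.

```csharp
#nullable enable

public enum ReachabilityStatus
{
    PresentReachable, PresentNotReachable, FunctionNotPresent, Unknown
}

// Hypothetical carrier for the two OpenVEX fields this adapter decides.
public record VexDecision(string Status, string? Justification);

public static class VexMapping
{
    public static VexDecision ToVex(ReachabilityStatus status) => status switch
    {
        // "affected" takes an action_statement rather than a justification.
        ReachabilityStatus.PresentReachable =>
            new VexDecision("affected", null),
        ReachabilityStatus.PresentNotReachable =>
            new VexDecision("not_affected", "vulnerable_code_not_in_execute_path"),
        ReachabilityStatus.FunctionNotPresent =>
            new VexDecision("not_affected", "vulnerable_code_not_present"),
        _ => new VexDecision("under_investigation", null),
    };
}
```

Keeping this as a pure function of the already computed status reflects the rule that VEXer translates decisions and never recomputes them.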


9. Executable Containers & Offline Operation

9.1 Executable containers

  • Analyzers run inside a dedicated Scanner worker container that has:

    • .NET 10 runtime.
    • Language runtimes if needed for parsing (Node, Python, PHP), or Tree-sitter-based parsing.
  • Target image filesystem is mounted read-only under /mnt/rootfs.

  • No network access (offline/air-gap).

  • This satisfies “we will use executable containers” while keeping separation between:

    • Target image (mount only).
    • Analyzer container (StellaOps code).

9.2 Offline signature bundles

  • Concelier periodically exports:

    • Vulnerability database (CSAF/NVD).
    • Vulnerability Signature Bank.
  • Bundles are:

    • DSSE-signed.
    • Versioned (e.g., signatures-2025-11-01.tar.zst).
  • Scanner uses:

    • The bundle digest as part of the Scan Manifest for deterministic replay.

10. Determinism & Caching

10.1 Layer-level caching

  • Key: layerDigest + analyzerVersion + signatureBundleVersion.

  • Cache artifacts:

    • CallGraph(s) per layer (for JS/Python/PHP code present in that layer).
    • Symbolization results per binary file hash.
  • For images sharing layers:

    • Merge cached graphs instead of re-analyzing.

10.2 Deterministic scan manifest

For each scan, produce:

{
  "imageRef": "registry/app:1.2.3",
  "imageDigest": "sha256:...",
  "scannerVersion": "1.4.0",
  "analyzerVersions": {
    "js": "1.0.0",
    "python": "1.0.0",
    "php": "1.0.0",
    "binary": "1.0.0"
  },
  "signatureBundleDigest": "sha256:...",
  "callGraphDigest": "sha256:...",
  "reachabilityEvidenceDigest": "sha256:..."
}

callGraphDigest is the hash of the call graph's canonical JSON form. This manifest can be signed (Authority module) and used for audits and replay.
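
For the digests in the manifest to be replayable, the bytes being hashed must not depend on property order or formatting. A minimal sketch of that idea for a flat string-to-string document follows; it sorts keys ordinally and serializes compactly. This is a simplification and not a full canonical-JSON scheme such as RFC 8785, and `ManifestDigest` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Hypothetical helper: order-independent digest of a flat JSON object.
public static class ManifestDigest
{
    public static string Compute(IDictionary<string, string> fields)
    {
        // Ordinal key order avoids culture-dependent sorting differences.
        var sorted = new SortedDictionary<string, string>(fields, StringComparer.Ordinal);

        // Default serializer options emit compact JSON with no whitespace.
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(sorted);
        return "sha256:" + Convert.ToHexString(SHA256.HashData(json)).ToLowerInvariant();
    }
}
```

Two manifests with the same fields inserted in a different order produce the same digest, so a replayed scan can be compared byte-for-byte against the signed original.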


11. Implementation Roadmap (Phased)

Phase 0: Infrastructure & Binary presence

Duration: 1 sprint

  • Set up Scanner.Reachability core types and interfaces.

  • Implement:

    • Basic Symbolizer for ELF + DWARF.
    • Binary function catalog without CFG.
  • Link a small set of CVEs to binary function presence via SymbolName.

  • Expose minimal evidence:

    • PresentReachable/FunctionNotPresent based only on presence (no call graph).
  • Integrate with VEXer to emit function_not_present justifications.

Success criteria:

  • For selected demo images with known vulnerable/patched OpenSSL, scanner can:

    • Distinguish images where vulnerable function is present vs. absent.
    • Emit OpenVEX with correct not_affected when patched.

Phase 1: JS/Python/PHP call graphs & basic reachability

Duration: 1-2 sprints

  • Implement:

    • Scanner.Analyzers.JavaScript with module + function call graph.
    • Scanner.Analyzers.Python and Scanner.Analyzers.Php with basic graphs.
  • Entrypoint detection:

    • JS: main script from CMD, basic HTTP handlers.
    • Python: main script + Django/Flask heuristics.
    • PHP: front controllers.
  • Implement core reachability algorithm (BFS/DFS).

  • Implement simple VulnerabilitySignature that uses function names and file paths.

  • Hook lattice engine in Scanner.WebService and integrate with:

    • Concelier vulnerability feeds.
    • VEXer.

Success criteria:

  • For demo apps (Node, Django, Laravel):

    • Identify vulnerable functions and mark them reachable/unreachable.
    • Demonstrate noise reduction (some CVEs flagged as not_affected).

Phase 2: Binary CFG & Fingerprinting, Improved Confidence

Duration: 1-2 sprints

  • Extend Symbolizer & CFG for:

    • Stripped binaries (function hashing).
    • Shared libraries (PLT/IAT resolution).
  • Implement VulnerabilitySignature.BlockFingerprints to distinguish patched vs vulnerable binary functions.

  • Refine confidence scoring:

    • Use fingerprint match quality.
    • Consider presence/absence of debug info.
  • Expand coverage:

    • glibc, curl, zlib, OpenSSL, libxml2, etc.

Success criteria:

  • For curated images:

    • Confirm ability to differentiate patched vs vulnerable versions even when binaries are stripped.
    • Reachability reflects true call paths across app→lib boundaries.

Phase 3: Runtime hooks (optional), UX, and Hardening

Duration: 2+ sprints

  • Add opt-in runtime confirmation:

    • eBPF probes for function hits (Linux).
    • Map runtime addresses back to ImageFunction via symbolization.
  • Enhance console UX:

    • Path explorer UI: show entrypoint → … → vulnerable function path.
    • Evidence view with hash-based proofs.
  • Hardening:

    • Performance optimization for large images (parallel analysis, caching).
    • Conservative fallbacks for dynamic language features.

Success criteria:

  • For selected environments where runtime is allowed:

    • Static reachability is confirmed by runtime traces in majority of cases.
    • No significant performance regression on typical images.

12. How this satisfies your initial bullets

From your initial requirements:

  1. JavaScript, Python, PHP, binary → Dedicated analyzers per language + binary symbolization/CFG, unified in Scanner.Reachability.

  2. Executable containers → Analyzers run inside Scanner's worker container, mounting the target image rootfs; no network access.

  3. Libraries usage call graph → Call graphs map from entrypoints → app code → library functions; SBOM + PURLs tie functions to libraries.

  4. Reachability analysis → BFS/DFS from entrypoints over per-language and binary graphs, with lattice-based merging in Scanner.WebService.

  5. JSON + PURLs → All evidence is JSON with PURL-tagged components; SBOM is annotated, and VEX statements reference those PURLs.


If you like, next step can be: I draft concrete C# interface definitions (including some initial Tree-sitter integration stubs for JS/Python/PHP) and a skeleton of the ReachabilityPlan and ReachabilityEngine classes that you can drop into the monorepo.