Building Better Binary Mapping and Call-Stack Reachability

Here are two practical ways to make your software supply-chain evidence both useful and verifiable—with enough background to get you shipping.


1) Binary SBOMs that still work when there's no package manager

Why this matters: Container images built FROM scratch or “distroless” often lack package metadata, so typical SBOMs go blank. A binary SBOM extracts facts directly from executables—so you still know “what's inside,” even in bare images.

Core idea (plain English):

  • Parse binaries (ELF on Linux, PE on Windows, Mach-O on macOS).
  • Record file paths, cryptographic hashes, import tables, compiler/linker hints, and for ELF also the .note.gnu.build-id (a unique ID most linkers embed).
  • Map these fingerprints to known packages/versions (vendor fingerprints, distro databases, your own allowlists).
  • Sign the result as an attestation so others can trust it without rerunning your scanner.

Minimal pipeline sketch:

  • Extract: readelf -n (ELF notes), objdump/otool for imports; compute SHA256 for every binary (see the sketch below).
  • Normalize: Emit CycloneDX or SPDX components for binaries, not just packages.
  • Map: Use BuildID → package hints (e.g., glibc, OpenSSL), symbol/version patterns, and path heuristics.
  • Attest: Wrap the SBOM in DSSE + in-toto and push to your registry alongside the image digest.
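
To make the extract step concrete, here is a minimal sketch (C#, matching the later examples) that pulls the GNU build-id note out of an ELF binary and hashes the file. It assumes a well-formed ELF64 little-endian input that fits in memory; a production extractor would also handle ELF32, big-endian targets, PT_NOTE program headers, and hostile inputs.

using System;
using System.IO;
using System.Security.Cryptography;

static class ElfIdentity
{
    // Returns the lowercase hex GNU build-id, or null if absent/unsupported.
    public static string? ReadBuildId(string path)
    {
        byte[] b = File.ReadAllBytes(path);
        if (b.Length < 0x40 || b[0] != 0x7F || b[1] != 'E' || b[2] != 'L' || b[3] != 'F')
            return null;                         // not an ELF file
        if (b[4] != 2 || b[5] != 1)
            return null;                         // only ELF64 little-endian in this sketch

        long shoff = BitConverter.ToInt64(b, 0x28);   // section header table offset
        int shentsize = BitConverter.ToUInt16(b, 0x3A);
        int shnum = BitConverter.ToUInt16(b, 0x3C);

        for (int i = 0; i < shnum; i++)
        {
            int sh = (int)(shoff + (long)i * shentsize);
            if (BitConverter.ToUInt32(b, sh + 4) != 7) continue;  // SHT_NOTE only
            int off = (int)BitConverter.ToInt64(b, sh + 24);      // sh_offset
            int end = off + (int)BitConverter.ToInt64(b, sh + 32); // + sh_size

            // Note entries: namesz, descsz, type, then name and desc, 4-byte aligned.
            for (int p = off; p + 12 <= end; )
            {
                int namesz = (int)BitConverter.ToUInt32(b, p);
                int descsz = (int)BitConverter.ToUInt32(b, p + 4);
                uint type = BitConverter.ToUInt32(b, p + 8);
                int name = p + 12;
                int desc = name + ((namesz + 3) & ~3);
                if (type == 3 && namesz == 4 &&              // NT_GNU_BUILD_ID, owner "GNU\0"
                    b[name] == 'G' && b[name + 1] == 'N' && b[name + 2] == 'U')
                    return Convert.ToHexString(b, desc, descsz).ToLowerInvariant();
                p = desc + ((descsz + 3) & ~3);
            }
        }
        return null;
    }

    public static string Sha256Hex(string path) =>
        Convert.ToHexString(SHA256.HashData(File.ReadAllBytes(path))).ToLowerInvariant();
}

Pairing ReadBuildId with Sha256Hex yields the two identity fields everything else in this note leans on.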

Pragmatic spec for developers:

  • Inputs: OCI image digest.

  • Outputs:

    • binary-sbom.cdx.json (CycloneDX) or binary-sbom.spdx.json.
    • attestation.intoto.jsonl (DSSE envelope referencing the SBOM's SHA256 and the image digest).
  • Data fields to capture per artifact:

    • algorithm: sha256, digest: <hex>, type: elf|pe|macho, path, size,
    • elf.build_id (if present), imports[], compiler[], arch, endian.
  • Verification:

    • cosign verify-attestation --type sbom --digest <image-digest> ...

Why the ELF BuildID is gold: it's a stable, linker-emitted identifier that helps correlate stripped binaries to upstream packages—critical when filenames and symbols lie.


2) Reachability analysis so you only page people for real risk

Why this matters: Not every CVE in your deps can actually be hit by your app. If you can show “no call path reaches the vulnerable sink,” you can denoise alerts and ship faster.

Core idea (plain English):

  • Build an interprocedural call graph of your app (across modules/packages).
  • Mark known “sinks” from vulnerability advisories (e.g., dangerous API + version range).
  • Compute graph reachability from your entrypoints (HTTP handlers, CLI main, background jobs).
  • The intersection of {reachable nodes} × {vulnerable sinks} = “actionable” findings.
  • Emit a signed witness (attestation) that states which sinks are reachable/unreachable and why.
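
The reachability computation in the third and fourth bullets is an ordinary breadth-first search; a minimal sketch, assuming functions are keyed by stable string IDs:

using System;
using System.Collections.Generic;

static class Reachability
{
    // callGraph: caller -> callees. Returns the vulnerable sinks reachable
    // from any entrypoint, i.e. the "actionable" findings.
    public static HashSet<string> ReachableSinks(
        IReadOnlyDictionary<string, string[]> callGraph,
        IEnumerable<string> entrypoints,
        IReadOnlySet<string> sinks)
    {
        var visited = new HashSet<string>(entrypoints);
        var queue = new Queue<string>(visited);
        while (queue.Count > 0)
        {
            var fn = queue.Dequeue();
            if (!callGraph.TryGetValue(fn, out var callees)) continue;
            foreach (var callee in callees)
                if (visited.Add(callee))       // enqueue each function once
                    queue.Enqueue(callee);
        }
        visited.IntersectWith(sinks);          // {reachable nodes} × {vulnerable sinks}
        return visited;
    }
}

All the hard work lives in building the graph and the sink set; the decision itself stays this simple and deterministic.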

Minimal pipeline sketch:

  • Ingest code/bytecode: language-specific frontends (e.g., .NET IL, JVM bytecode, Python AST, Go SSA).

  • Build graph: nodes = functions/methods; edges = call sites (include dynamic edges conservatively).

  • Mark entrypoints: web routes, message handlers, cron jobs, exported CLIs.

  • Mark sinks: from your vuln DB (API signature + version).

  • Decide: run graph search from entrypoints → is any sink reachable?

  • Attest: DSSE witness with:

    • artifact digest (commit SHA / image digest),
    • tool version + rule set hash,
    • list of reachable sinks with at least one example call path,
    • list of proven unreachable sinks (under stated assumptions).
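
An illustrative witness payload before DSSE wrapping; every field name here is hypothetical, not a fixed schema:

{
  "subject": { "image_digest": "sha256:...", "commit": "..." },
  "tool": { "name": "reachability-analyzer", "version": "0.1.0", "ruleset_sha256": "..." },
  "reachable": [
    {
      "sink": "Example.Crypto.LegacyDecrypt(byte[])",
      "advisory": "CVE-XXXX-YYYY",
      "example_path": ["Program.Main", "Api.HandleUpload", "Example.Crypto.LegacyDecrypt"]
    }
  ],
  "unreachable": [
    {
      "sink": "Example.Xml.UnsafeLoad(string)",
      "advisory": "CVE-XXXX-ZZZZ",
      "assumptions": ["entrypoint manifest is complete", "no runtime code loading"]
    }
  ]
}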

Developer contract (portable across languages):

  • Inputs: source/bytecode zip + manifest of entrypoints.

  • Outputs:

    • reachability.witness.json (DSSE envelope),
    • optional paths/ folder with top-N call paths as compact JSON (for UX rendering).
  • Verification:

    • Recompute call graph deterministically given the same inputs + tool version,
    • cosign verify-attestation --type reachability ...

How these two pieces fit together

  • Binary SBOM = “What exactly is in the artifact?” (even in bare images)

  • Reachability witness = “Which vulns actually matter to this app build?”

  • Sign both as DSSE/in-toto attestations and attach to the image/release. Your CI can enforce:

    • “Block if high-severity + reachable,”
    • “Warn (don't block) if high-severity but unreachable with a fresh witness.”

Quick starter checklist (copy/paste to a task board)

  • Binary extractors: ELF/PE/Mach-O parsers; hash & BuildID capture.
  • Mapping rules: BuildID → known package DB; symbol/version heuristics.
  • Emit CycloneDX/SPDX; add file-level components for binaries.
  • DSSE signing and cosign/rekor publish for SBOM attestation.
  • Language frontends for reachability (pick your top 1-2 first).
  • Call-graph builder + entrypoint detector.
  • Sink catalog normalizer (map CVE → API signature).
  • Reachability engine + example path extractor.
  • DSSE witness for reachability; attach to build.
  • CI policy: block on “reachable high/critical”; surface paths in UI.

If you want, I can turn this into concrete .NET-first tasks with sample code scaffolds and a tiny demo repo that builds an image, extracts a binary SBOM, runs reachability on a toy service, and emits both attestations.

Below is a concrete, “do-this-then-this” implementation plan for a layered binary→PURL mapping system that fits StellaOps' constraints: offline, deterministic, SBOM-first, and with unknowns recorded instead of guessing.

I'm going to assume your target is the common pain case StellaOps itself calls out: when package metadata is missing, Scanner falls back to binary identity (bin:{sha256}), and you want to deterministically “lift” those binaries into stable package identities (PURLs) without turning the core SBOM into fuzzy guesswork. StellaOps' own Scanner docs emphasize deterministic analyzers, no fuzzy identity in core, and keeping heuristics as opt-in add-ons. (Stella Ops)


0) What “binary mapping” means in StellaOps terms

In Scanner's architecture, the component key is:

  • PURL when present
  • otherwise bin:{sha256} (Stella Ops)

So “better binary mapping” = systematically converting more of those bin:* components into PURLs (or at least producing actionable mapping evidence + Unknowns) while preserving:

  • deterministic replay (same inputs ⇒ same output)
  • offline operation (airgapped kits)
  • policy safety (don't hide false negatives behind fuzzy IDs)

Also, StellaOps already has the concept of “gaps” being first-class via the Unknowns Registry (identity gaps, missing build-id, version conflicts, missing edges, etc.). (Gitea: Git with a cup of tea) Your binary mapping work should feed this system.


1) Design constraints you must keep (or you'll fight the platform)

1.1 Determinism rules

StellaOps Scanner architecture is explicit: core analyzers are deterministic; heuristic plugins must not contaminate the core SBOM unless explicitly enabled. (Stella Ops)

That implies:

  • No probabilistic “best guess” PURL in the default mapping path.

  • If you do fuzzy inference, it must be emitted as:

    • “hints” attached to Unknowns, or
    • a separate heuristic artifact gated by flags.

1.2 Offline kit + debug store is already a hook you can exploit

Offline kits already bundle:

  • scanner plugins (OS + language analyzers packaged under plugins/scanner/analyzers/**)
  • a debug store layout: debug/.build-id/<aa>/<rest>.debug
  • a debug-manifest.json that maps build-ids → originating images (for symbol retrieval) (Stella Ops)

This is perfect for building a BuildID→PURL index that remains offline and signed.

1.3 Scanner Worker already loads analyzers via directory catalogs

The Worker loads OS and language analyzer plugins from default directories (unless overridden), using deterministic directory normalization and a “seal” concept on the last directory. (Gitea: Git with a cup of tea)

So you can add a third catalog for native/binary mapping that behaves the same way.


2) Layering strategy: what to implement (and in what order)

You want a resolver pipeline with strict ordering from “hard evidence” → “soft evidence”.

Layer 0 — In-image authoritative mapping (highest confidence)

These sources are authoritative because they come from within the artifact:

  1. OS package DB present (dpkg/rpm/apk):
  • Map path → package using file ownership lists.
  • If you can also compute file hashes/build-ids, store them as evidence.
  2. Language ecosystem metadata present (already handled by language analyzers):
  • For example, a Python wheel RECORD or a Go buildinfo section can directly imply module versions.
Decision rule: If a binary file is owned by an OS package, prefer that over any external mapping index.

Layer 1 — “Build provenance” mapping via build IDs / UUIDs (strong, portable)

When package DB is missing (distroless/scratch), use compiler/linker stable IDs:

  • ELF: .note.gnu.build-id
  • MachO: LC_UUID
  • PE: CodeView (PDB GUID+Age) / build signature

This should be your primary fallback because it survives stripping and renaming.

Layer 2 — Hash mapping for curated or vendor-pinned binaries (strong but brittle across rebuilds)

Use SHA256 → PURL mapping when:

  • binaries are redistributed unchanged (busybox, chromium, embedded runtimes)
  • you maintain a curated “known binaries” manifest

StellaOps already has “curated binary manifest generation” mentioned in its repo history, and a vendor/manifest.json concept exists (for pinned artifacts / binaries in the system). (Gitea: Git with a cup of tea) For your ops environment you'll create a similar manifest for your fleet.

Layer 3 — Dependency closure constraints (helpful as a disambiguator, not a primary mapper)

If the binary's DT_NEEDED / imports point to libs you can identify, you can use that to disambiguate multiple possible candidates (“this openssl build-id matches, but only one candidate has the required glibc baseline”).

This must remain deterministic and rules-based.

Layer 4 — Heuristic hints (never change the core SBOM by default)

Examples:

  • symbol version patterns (GLIBC_2.28, etc.)
  • embedded version strings
  • import tables
  • compiler metadata

These produce Unknown evidence/hints, not a resolved identity, unless a special “heuristics allowed” flag is turned on.

Layer 5 — Unknowns Registry output (mandatory when you can't decide)

If a mapping can't be made decisively:

  • emit Unknowns (identity_gap, missing_build_id, version_conflict, etc.) (Gitea: Git with a cup of tea) This is not optional; it's how you prevent silent false negatives.

3) Concrete data model you should implement

3.1 Binary identity record

Create a single canonical identity structure that every layer uses:

public enum BinaryFormat { Elf, Pe, MachO, Unknown }

public sealed record BinaryIdentity(
    BinaryFormat Format,
    string Path,              // normalized (posix style), rooted at image root
    string Sha256,            // always present
    string? BuildId,          // ELF
    string? MachOUuid,        // Mach-O
    string? PeCodeViewGuid,   // PE/PDB
    string? Arch,             // amd64/arm64/...
    long SizeBytes
);

Determinism tip: normalize Path to a single separator and collapse //, ./, etc.
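
A minimal normalization sketch for that tip (posix separators, collapsed duplicate and no-op segments, always rooted at the image root; ".." clamps at the root):

static string NormalizePath(string path)
{
    var parts = path.Replace('\\', '/').Split('/', StringSplitOptions.RemoveEmptyEntries);
    var stack = new List<string>();
    foreach (var part in parts)
    {
        if (part == ".") continue;                                  // drop no-op segments
        if (part == "..")
        {
            if (stack.Count > 0) stack.RemoveAt(stack.Count - 1);   // never climb above root
            continue;
        }
        stack.Add(part);
    }
    return "/" + string.Join('/', stack);
}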

3.2 Mapping candidate

Each resolver layer returns candidates like:

public enum MappingVerdict { Resolved, Unresolved, Ambiguous }

public sealed record BinaryMappingCandidate(
    string Purl,
    double Confidence,          // 0..1 but deterministic
    string ResolverId,          // e.g. "os.fileowner", "buildid.index.v1"
    IReadOnlyList<string> Evidence, // stable ordering
    IReadOnlyDictionary<string,string> Properties // stable ordering
);

3.3 Final mapping result

public sealed record BinaryMappingResult(
    MappingVerdict Verdict,
    BinaryIdentity Subject,
    BinaryMappingCandidate? Winner,
    IReadOnlyList<BinaryMappingCandidate> Alternatives,
    string MappingIndexDigest // sha256 of index snapshot used (or "none")
);

4) Build the “Binary Map Index” that makes Layers 1 and 2 work offline

4.1 Where it lives in StellaOps

Put it in the Offline Kit as a signed artifact, next to other feeds and plug-ins. Offline kit packaging already includes plug-ins and a debug store with a deterministic layout. (Stella Ops)

Recommended layout:

offline-kit/
  feeds/
    binary-map/
      v1/
        buildid.map.zst
        sha256.map.zst
        index.manifest.json
        index.manifest.json.sig   (DSSE or JWS, consistent with your kit)

4.2 Index record schema (v1)

Make each record explicit and replayable:

{
  "schema": "stellaops.binary-map.v1",
  "records": [
    {
      "key": { "kind": "elf.build_id", "value": "2f3a..."},
      "purl": "pkg:deb/debian/openssl@3.0.11-1~deb12u2?arch=amd64",
      "evidence": {
        "source": "os.dpkg.fileowner",
        "source_image": "sha256:....",
        "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
        "package": "openssl",
        "package_version": "3.0.11-1~deb12u2"
      }
    }
  ]
}

Key points:

  • key.kind is one of elf.build_id, macho.uuid, pe.codeview, file.sha256
  • include evidence with enough detail to justify mapping

4.3 How to generate the index (deterministically)

You need an offline index builder pipeline. In StellaOps terms, this is best treated like a feed exporter step (build-time), then shipped in the Offline Kit.

Input set options (choose one or mix):

  1. “Golden base images” list (your fleet's base images)
  2. Distro repositories mirrored into the airgap (Deb/RPM/APK archives)
  3. Previously scanned images that are allowed into the kit

Generation steps:

  1. For each input image:

    • Extract rootfs in a deterministic path order.
    • Run OS analyzers (dpkg/rpm/apk) + native identity collection (ELF/PE/Mach-O).
  2. Produce raw tuples:

    • (build_id | uuid | codeview | sha256) → (purl, evidence)
  3. Deduplicate:

    • Canonicalize PURLs (normalize qualifiers order, lowercasing rules).
    • If the same key maps to multiple distinct PURLs, keep them all and mark as conflict (do not pick one).
  4. Sort:

    • Sort by (key.kind, key.value, purl) lexicographically.
  5. Serialize:

    • Emit line-delimited JSON or a simple binary format.
    • Compress (zstd).
  6. Compute digests:

    • sha256 of each artifact.
    • sha256 of concatenated (artifact name + sha) for a manifest hash.
  7. Sign:

    • include in kit manifest and sign with the same process you use for other offline kit elements. Offline kit import in StellaOps validates digests and signatures. (Stella Ops)
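
Steps 4-6 reduce to a few lines once the artifacts exist; a sketch, assuming the built artifacts sit in one directory (ordinal sort matches the lexicographic requirement in step 4):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class IndexManifest
{
    // Manifest hash = sha256 over sorted (artifact name + sha256) pairs.
    public static string ManifestHash(string dir)
    {
        var sb = new StringBuilder();
        foreach (var file in Directory.GetFiles(dir).OrderBy(f => f, StringComparer.Ordinal))
        {
            var sha = Convert.ToHexString(SHA256.HashData(File.ReadAllBytes(file))).ToLowerInvariant();
            sb.Append(Path.GetFileName(file)).Append(':').Append(sha).Append('\n');
        }
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString()))).ToLowerInvariant();
    }
}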

5) Runtime side: implement the layered resolver in Scanner Worker

5.1 Where to hook in

You want this to run after OS + language analyzers have produced fragments, and after native identity collection has produced binary identities.

Scanner Worker already executes analyzers and appends fragments to context.Analysis. (Gitea: Git with a cup of tea)

Scanner module responsibilities explicitly include OS, language, and native ecosystems as restart-only plug-ins. (Gitea: Git with a cup of tea) So implement binary mapping as either:

  • part of the native ecosystem analyzer output stage, or
  • a post-analyzer enrichment stage that runs before SBOM composition.

I recommend: post-analyzer enrichment stage, because it can consult OS+lang analyzer results and unify decisions.

5.2 Add new ScanAnalysis keys

Store collected binary identities in analysis:

  • ScanAnalysisKeys.NativeBinaryIdentities → ImmutableArray<BinaryIdentity>

And store mapping results:

  • ScanAnalysisKeys.NativeBinaryMappings → ImmutableArray<BinaryMappingResult>

5.3 Implement the resolver pipeline (deterministic ordering)

public interface IBinaryMappingResolver
{
    string Id { get; }      // stable ID
    int Order { get; }      // deterministic
    BinaryMappingCandidate? TryResolve(BinaryIdentity identity, MappingContext ctx);
}

Pipeline:

  1. Sort resolvers by (Order, Id) (Ordinal comparison).

  2. For each resolver:

    • if it returns a candidate, add it to candidates list.
    • if the resolver is “authoritative” (Layer 0), you can short-circuit on first hit.
  3. Decide:

    • If 0 candidates ⇒ Unresolved

    • If 1 candidate ⇒ Resolved

    • If >1:

      • If candidates have different PURLs ⇒ Ambiguous unless a deterministic “dominates” rule exists
      • If candidates have same PURL (from multiple sources) ⇒ merge evidence
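
A sketch of the decide step using the records from section 3; it omits the Layer 0 short-circuit and any “dominates” rules:

using System;
using System.Collections.Generic;
using System.Linq;

static class MappingDecision
{
    // Candidates that agree on a PURL merge; disagreement yields Ambiguous,
    // never a guess.
    public static BinaryMappingResult Decide(
        BinaryIdentity subject,
        IReadOnlyList<BinaryMappingCandidate> candidates,
        string indexDigest)
    {
        if (candidates.Count == 0)
            return new(MappingVerdict.Unresolved, subject, null, candidates, indexDigest);

        var distinctPurls = candidates.Select(c => c.Purl).Distinct(StringComparer.Ordinal).ToList();
        if (distinctPurls.Count > 1)
            return new(MappingVerdict.Ambiguous, subject, null, candidates, indexDigest);

        // Same PURL from one or more resolvers: pick the first by deterministic
        // (ResolverId) ordering as the winner, keep the rest as corroboration.
        var ordered = candidates.OrderBy(c => c.ResolverId, StringComparer.Ordinal).ToList();
        return new(MappingVerdict.Resolved, subject, ordered[0], ordered.Skip(1).ToList(), indexDigest);
    }
}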

5.4 Implement each layer as a resolver

Resolver A: OS file owner (Layer 0)

Inputs:

  • OS analyzer results in context.Analysis (they're already stored in ScanAnalysisKeys.OsPackageAnalyzers). (Gitea: Git with a cup of tea)
  • You need OS analyzers to expose file ownership mapping.

Implementation options:

  • Extend OS analyzers to produce path → packageId maps.
  • Or load that from dpkg/rpm DB at mapping time (fast enough if you only query per binary path).

Candidate:

  • Purl = pkg:<ecosystem>/<name>@<version>?arch=...

  • Confidence = 1.0

  • Evidence includes:

    • analyzer id
    • package name/version
    • file path

Resolver B: BuildID index (Layer 1)

Inputs:

  • identity.BuildId (or uuid/codeview)
  • BinaryMapIndex loaded from Offline Kit feeds/binary-map/v1/buildid.map.zst

Implementation:

  • On worker startup: load and parse index into an immutable structure:

    • FrozenDictionary<string, BuildIdEntry[]> (or sorted arrays + binary search)
  • If key maps to multiple PURLs:

    • return multiple candidates (same resolver id), forcing Ambiguous verdict upstream

Candidate:

  • Confidence = 0.95 (still deterministic)
  • Evidence includes index manifest digest + record evidence
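
A loading sketch for the worker-startup step, assuming the zstd layer has already been decompressed to line-delimited JSON; BuildIdLine is a simplification of the section 4.2 record (evidence elided):

using System.Collections.Frozen;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public sealed record BuildIdLine(string Key, string Purl);

public static class BuildIdIndex
{
    private static readonly JsonSerializerOptions Opts = new() { PropertyNameCaseInsensitive = true };

    // A key mapping to several PURLs keeps them all, so the pipeline can
    // report Ambiguous instead of silently picking one.
    public static FrozenDictionary<string, string[]> Load(string path) =>
        File.ReadLines(path)
            .Select(line => JsonSerializer.Deserialize<BuildIdLine>(line, Opts)!)
            .GroupBy(r => r.Key, StringComparer.Ordinal)
            .ToFrozenDictionary(
                g => g.Key,
                g => g.Select(r => r.Purl)
                      .Distinct(StringComparer.Ordinal)
                      .OrderBy(p => p, StringComparer.Ordinal)   // stable candidate order
                      .ToArray(),
                StringComparer.Ordinal);
}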

Resolver C: SHA256 index (Layer 2)

Inputs:

  • identity.Sha256
  • feeds/binary-map/v1/sha256.map.zst OR your ops “curated binaries” manifest

Candidate:

  • Confidence:

    • 0.9 if from signed curated manifest
    • 0.7 if from “observed in previous scan cache” (I'd avoid this unless you version and sign the cache)

Resolver D: Dependency closure constraints (Layer 3)

Only run if you have native dependency parsing output (DT_NEEDED / imports). The resolver does not return a mapping on its own; instead, it can:

  • bump confidence for existing candidates
  • or rule out candidates deterministically (e.g., glibc baseline mismatch)

Make this a “candidate rewriter” stage:

public interface ICandidateRefiner
{
    string Id { get; }
    int Order { get; }
    IReadOnlyList<BinaryMappingCandidate> Refine(BinaryIdentity id, IReadOnlyList<BinaryMappingCandidate> cands, MappingContext ctx);
}
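
For illustration, one deterministic refiner under this interface; the glibc.baseline property key and the MappingContext.RequiredGlibc helper are hypothetical names, not existing StellaOps APIs:

using System.Collections.Generic;
using System.Linq;

public sealed class GlibcBaselineRefiner : ICandidateRefiner
{
    public string Id => "refiner.glibc.baseline";
    public int Order => 100;

    // Keep candidates whose recorded glibc baseline matches what the binary's
    // GLIBC_* symbol versions require (or that record no baseline at all).
    public IReadOnlyList<BinaryMappingCandidate> Refine(
        BinaryIdentity id,
        IReadOnlyList<BinaryMappingCandidate> cands,
        MappingContext ctx)
    {
        var required = ctx.RequiredGlibc(id);  // e.g. "2.28", or null if no evidence
        if (required is null) return cands;    // no evidence: refine nothing

        return cands.Where(c =>
                !c.Properties.TryGetValue("glibc.baseline", out var baseline) ||
                baseline == required)
            .ToList();
    }
}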

Resolver E: Heuristic hints (Layer 4)

Never resolves to a PURL by default. It just produces an Unknown evidence payload:

  • extracted strings (“OpenSSL 3.0.11”)
  • imported symbol names
  • SONAME
  • symbol version requirements

6) SBOM composition behavior: how to “lift” bin components safely

6.1 Don't break the component key rules

Scanner uses:

  • key = PURL when present, else bin:{sha256} (Stella Ops)

When you resolve a binary identity to a PURL, you have two clean options:

Option 1 (recommended): replace the component key with the PURL

  • This makes downstream policy/advisory matching work naturally.
  • It's deterministic as long as the mapping index is versioned and shipped with the kit.

Option 2: keep bin:{sha256} as the component key and attach resolved_purl

  • Lower disruption to diffing, but policy now has to understand the “resolved_purl” field.
  • If StellaOps policy assumes component.purl is the canonical key, this will cause pain.

Given StellaOps emphasizes PURLs as the canonical key for identity, I'd implement Option 1, but record robust evidence + index digest.

6.2 Preserve file-level evidence

Even after lifting to PURL, keep evidence that ties the package identity back to file bytes:

  • file path(s)
  • sha256
  • build-id/uuid
  • mapping resolver id + index digest

This is what makes attestations verifiable and helps operators debug.
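
Concretely, a lifted component might carry its file-level evidence as properties (CycloneDX-style; the property names are illustrative, not a fixed StellaOps schema):

{
  "type": "library",
  "purl": "pkg:deb/debian/openssl@3.0.11-1~deb12u2?arch=amd64",
  "properties": [
    { "name": "stellaops:binary:path", "value": "/usr/lib/x86_64-linux-gnu/libssl.so.3" },
    { "name": "stellaops:binary:sha256", "value": "..." },
    { "name": "stellaops:binary:build_id", "value": "2f3a..." },
    { "name": "stellaops:mapping:resolver", "value": "buildid.index.v1" },
    { "name": "stellaops:mapping:index_digest", "value": "sha256:..." }
  ]
}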


7) Unknowns integration: emit Unknowns whenever mapping isn't decisive

The Unknowns Registry exists precisely for “unresolved symbol → package mapping”, “missing build-id”, “ambiguous purl”, etc. (Gitea: Git with a cup of tea)

7.1 When to emit Unknowns

Emit Unknowns for:

  1. identity.BuildId == null for ELF

    • unknown_type = missing_build_id
    • evidence: “ELF missing .note.gnu.build-id; using sha256 only”
  2. Multiple candidates with different PURLs

    • unknown_type = version_conflict (or identity_gap)
    • evidence: list candidates + their evidence
  3. Heuristic hints found but no authoritative mapping

    • unknown_type = identity_gap
    • evidence: imported symbols, strings, SONAME

7.2 How to compute unknown_id deterministically

The Unknowns schema suggests deriving unknown_id from the record's own content. Do:

  • stable JSON canonicalization of scope + unknown_type + primary evidence fields
  • sha256
  • prefix with unk:sha256:<...>

This guarantees idempotent ingestion behavior (POST /unknowns/ingest upsert). (Gitea: Git with a cup of tea)
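
A minimal sketch of that derivation; the exact canonicalization scheme just has to be fixed and documented:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

static class UnknownIds
{
    public static string UnknownId(string scope, string unknownType,
        IReadOnlyDictionary<string, string> evidence)
    {
        // Sorted keys + default serializer options => stable bytes for equal input.
        var canonical = new SortedDictionary<string, string>(StringComparer.Ordinal)
        {
            ["scope"] = scope,
            ["unknown_type"] = unknownType,
        };
        foreach (var (key, value) in evidence)
            canonical["evidence." + key] = value;

        var json = JsonSerializer.Serialize(canonical);
        var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(json))).ToLowerInvariant();
        return $"unk:sha256:{hash}";
    }
}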


8) Packaging as a StellaOps plug-in (so ops can upgrade it offline)

8.1 Plug-in manifest

Scanner plug-ins use a manifest.json with schemaVersion, id, entryPoint (dotnet assembly + typeName), etc. (Gitea: Git with a cup of tea)

Create something like:

{
  "schemaVersion": "1.0",
  "id": "stellaops.analyzer.native.binarymap",
  "displayName": "StellaOps Native Binary Mapper",
  "version": "0.1.0",
  "requiresRestart": true,
  "entryPoint": {
    "type": "dotnet",
    "assembly": "StellaOps.Scanner.Analyzers.Native.BinaryMap.dll",
    "typeName": "StellaOps.Scanner.Analyzers.Native.BinaryMap.BinaryMapPlugin"
  },
  "capabilities": [
    "native-analyzer",
    "binary-mapper",
    "elf",
    "pe",
    "macho"
  ],
  "metadata": {
    "org.stellaops.analyzer.kind": "native",
    "org.stellaops.restart.required": "true"
  }
}

8.2 Worker loading

Mirror the pattern in CompositeScanAnalyzerDispatcher:

  • add a catalog INativeAnalyzerPluginCatalog
  • default directory: plugins/scanner/analyzers/native
  • load directories with the same “seal last directory” behavior (Gitea: Git with a cup of tea)

9) Tests and performance gates (what “done” looks like)

StellaOps has determinism tests and golden fixtures for analyzers; follow that style. (Gitea: Git with a cup of tea)

9.1 Determinism tests

Create fixtures with:

  • same binaries in different file order
  • same binaries hardlinked/symlinked
  • stripped ELF missing build-id
  • multi-arch variants

Assert:

  • mapping output JSON byte-for-byte stable
  • unknown ids stable
  • candidate ordering stable
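
A determinism test sketch in that style (xUnit); Fixtures, BinaryMapper, and CanonicalJson are hypothetical stand-ins for your fixture loader, resolver pipeline, and canonical serializer:

using System.Linq;
using Xunit;

public sealed class BinaryMappingDeterminismTests
{
    [Fact]
    public void Output_IsByteStable_AcrossInputOrder()
    {
        var identities = Fixtures.LoadBinaryIdentities("golden/busybox");

        // Same identities, opposite order: serialized output must not change.
        var first = CanonicalJson.Serialize(BinaryMapper.MapAll(identities));
        var second = CanonicalJson.Serialize(BinaryMapper.MapAll(identities.Reverse().ToList()));

        Assert.Equal(first, second);
    }
}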

9.2 “No fuzzy identity” guardrail tests

Add tests that:

  • heuristic resolver never emits a Resolved verdict unless a feature flag is enabled
  • ambiguous candidates never auto-select a winner

9.3 Performance budgets

For ops, you care about scan wall time. Adopt budgets like:

  • identity extraction < 25ms / binary (native parsing)
  • mapping lookup O(1) / binary (frozen dict) or O(log n) with sorted arrays
  • index load time bounded (lazy load per worker start)

Track metrics:

  • count resolved per layer
  • count ambiguous/unresolved
  • unknown density (ties into Unknowns Registry scoring later) (Gitea: Git with a cup of tea)

10) Practical “ops” workflow: how to keep improving mapping safely

10.1 Add a feedback loop from Unknowns → index builder

Unknowns are your backlog:

  • “missing build-id”
  • “ambiguous mapping”
  • “hash seen but not in index”

For each Unknown:

  1. decide if it should be mapped in core (needs authoritative source)
  2. if yes: add reference artifact to your index builder input set
  3. rebuild the BinaryMap index
  4. ship via Offline Kit update (signed)

10.2 Don't let your index silently drift

Because determinism matters, treat the BinaryMap index like a feed:

  • version it (v1, v2)
  • sign it
  • store index digest in scan evidence

That way you can explain: “This binary was mapped using binary-map/v1 digest XYZ”.


11) Minimal implementation checklist (if you want the shortest path to value)

If you only do 3 things, do these:

  1. BuildID extraction everywhere (ELF/Mach-O/PE) and always store it in evidence (also emit an Unknown when missing, as StellaOps expects) (Gitea: Git with a cup of tea)

  2. Offline BuildID → PURL index shipped in Offline Kit (fits perfectly with the existing debug-store + kit pattern) (Stella Ops)

  3. Deterministic resolver pipeline + Unknowns emission (so you improve mapping without introducing silent risk) (Gitea: Git with a cup of tea)


If you tell me whether your main pain is distroless, FROM scratch, or vendor-bundled runtimes (chromium/node/openssl/etc.), I can give you the best “Layer 1 index builder” recipe for that category (what to use as authoritative sources and how to avoid collisions) — but the plan above is already safe and implementable without further assumptions.