Here are two practical ways to make your software supply‑chain evidence both useful and verifiable—with enough background to get you shipping.
1) Binary SBOMs that still work when there’s no package manager
Why this matters: Container images built FROM scratch or “distroless” often lack package metadata, so typical SBOMs go blank. A binary SBOM extracts facts directly from executables—so you still know “what’s inside,” even in bare images.
Core idea (plain English):
- Parse binaries (ELF on Linux, PE on Windows, Mach‑O on macOS).
- Record file paths, cryptographic hashes, import tables, compiler/linker hints, and for ELF also the .note.gnu.build-id (a unique ID most linkers embed).
- Map these fingerprints to known packages/versions (vendor fingerprints, distro databases, your own allowlists).
- Sign the result as an attestation so others can trust it without re‑running your scanner.
Minimal pipeline sketch:
- Extract: readelf -n (ELF notes), objdump/otool for imports; compute SHA‑256 for every binary.
- Normalize: Emit CycloneDX or SPDX components for binaries, not just packages (a minimal hash-and-emit sketch follows this list).
- Map: Use Build‑ID → package hints (e.g., glibc, OpenSSL), symbol/version patterns, and path heuristics.
- Attest: Wrap the SBOM in DSSE + in‑toto and push to your registry alongside the image digest.
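To make the Extract/Normalize steps concrete, here is a minimal C# sketch that walks an extracted rootfs, hashes every file, and emits CycloneDX "file" components. The rootfs path, output filename, and the choice to emit every file (rather than only executables) are illustrative assumptions, not the only way to do it.

```csharp
// Minimal sketch: hash every regular file under an extracted rootfs and emit
// CycloneDX "file" components. Paths and output name are assumptions; a real
// implementation would also parse ELF/PE/Mach-O headers per file.
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text.Json;

class BinarySbomSketch
{
    static void Main(string[] args)
    {
        string rootfs = args.Length > 0 ? args[0] : "./rootfs";   // hypothetical extracted image root

        var components = Directory.EnumerateFiles(rootfs, "*", SearchOption.AllDirectories)
            .OrderBy(p => p, StringComparer.Ordinal)               // deterministic ordering
            .Select(path =>
            {
                using var stream = File.OpenRead(path);
                string digest = Convert.ToHexString(SHA256.HashData(stream)).ToLowerInvariant();
                return new
                {
                    type = "file",
                    name = Path.GetRelativePath(rootfs, path).Replace('\\', '/'),
                    hashes = new[] { new { alg = "SHA-256", content = digest } }
                };
            })
            .ToArray();

        var bom = new { bomFormat = "CycloneDX", specVersion = "1.5", components };
        File.WriteAllText("binary-sbom.cdx.json",
            JsonSerializer.Serialize(bom, new JsonSerializerOptions { WriteIndented = true }));
    }
}
```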
Pragmatic spec for developers:
- Inputs: OCI image digest.
- Outputs:
  - binary-sbom.cdx.json (CycloneDX) or binary-sbom.spdx.json
  - attestation.intoto.jsonl (DSSE envelope referencing the SBOM's SHA‑256 and the image digest)
- Data fields to capture per artifact: algorithm: sha256, digest: <hex>, type: elf|pe|macho, path, size, elf.build_id (if present), imports[], compiler[], arch, endian.
- Verification: cosign verify-attestation --type sbom --digest <image-digest> ...
Why the ELF Build‑ID is gold: it’s a stable, linker‑emitted identifier that helps correlate stripped binaries to upstream packages—critical when filenames and symbols lie.
2) Reachability analysis so you only page people for real risk
Why this matters: Not every CVE in your deps can actually be hit by your app. If you can show “no call path reaches the vulnerable sink,” you can de‑noise alerts and ship faster.
Core idea (plain English):
- Build an interprocedural call graph of your app (across modules/packages).
- Mark known “sinks” from vulnerability advisories (e.g., dangerous API + version range).
- Compute graph reachability from your entrypoints (HTTP handlers, CLI main, background jobs); see the sketch after this list.
- The intersection of {reachable nodes} × {vulnerable sinks} = "actionable" findings.
- Emit a signed witness (attestation) that states which sinks are reachable/unreachable and why.
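The reachability decision itself is just a graph search. A minimal sketch, assuming the call graph has already been built as an adjacency map of fully qualified method names (the hard part is the language frontends, not this step):

```csharp
// Minimal sketch: given a call graph, entrypoints, and vulnerable sinks,
// compute which sinks are reachable. Types and inputs here are hypothetical.
using System.Collections.Generic;

static class ReachabilitySketch
{
    // callGraph: caller -> list of callees (fully qualified method names)
    public static ISet<string> ReachableSinks(
        IReadOnlyDictionary<string, IReadOnlyList<string>> callGraph,
        IEnumerable<string> entrypoints,
        ISet<string> sinks)
    {
        var visited = new HashSet<string>();
        var queue = new Queue<string>(entrypoints);

        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            if (!visited.Add(node)) continue;                   // already explored
            if (!callGraph.TryGetValue(node, out var callees)) continue;
            foreach (var callee in callees) queue.Enqueue(callee);
        }

        var reachable = new HashSet<string>(visited);
        reachable.IntersectWith(sinks);                          // {reachable} ∩ {vulnerable sinks}
        return reachable;
    }
}
```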
Minimal pipeline sketch:
- Ingest code/bytecode: language‑specific frontends (e.g., .NET IL, JVM bytecode, Python AST, Go SSA).
- Build graph: nodes = functions/methods; edges = call sites (include dynamic edges conservatively).
- Mark entrypoints: web routes, message handlers, cron jobs, exported CLIs.
- Mark sinks: from your vuln DB (API signature + version).
- Decide: run graph search from entrypoints → is any sink reachable?
- Attest: DSSE witness with:
- artifact digest (commit SHA / image digest),
- tool version + rule set hash,
- list of reachable sinks with at least one example call path,
- list of proven unreachable sinks (under stated assumptions).
Developer contract (portable across languages):
- Inputs: source/bytecode zip + manifest of entrypoints.
- Outputs:
  - reachability.witness.json (DSSE envelope),
  - optional paths/ folder with top‑N call paths as compact JSON (for UX rendering).
- Verification:
  - Recompute the call graph deterministically given the same inputs + tool version,
  - cosign verify-attestation --type reachability ...
How these two pieces fit together
- Binary SBOM = "What exactly is in the artifact?" (even in bare images)
- Reachability witness = "Which vulns actually matter to this app build?"
- Sign both as DSSE/in‑toto attestations and attach them to the image/release. Your CI can then enforce:
  - "Block if high‑severity + reachable,"
  - "Warn (don't block) if high‑severity but unreachable with a fresh witness."
Quick starter checklist (copy/paste to a task board)
- Binary extractors: ELF/PE/Mach‑O parsers; hash & Build‑ID capture.
- Mapping rules: Build‑ID → known package DB; symbol/version heuristics.
- Emit CycloneDX/SPDX; add file‑level components for binaries.
- DSSE signing and cosign/rekor publish for SBOM attestation.
- Language frontends for reachability (pick your top 1–2 first).
- Call‑graph builder + entrypoint detector.
- Sink catalog normalizer (map CVE → API signature).
- Reachability engine + example path extractor.
- DSSE witness for reachability; attach to build.
- CI policy: block on “reachable high/critical”; surface paths in UI.
If you want, I can turn this into concrete .NET‑first tasks with sample code scaffolds and a tiny demo repo that builds an image, extracts a binary SBOM, runs reachability on a toy service, and emits both attestations.

Below is a concrete, "do‑this‑then‑this" implementation plan for a layered binary→PURL mapping system that fits StellaOps' constraints: offline, deterministic, SBOM‑first, and with unknowns recorded instead of guessing.
I’m going to assume your target is the common pain case StellaOps itself calls out: when package metadata is missing, Scanner falls back to binary identity (bin:{sha256}) and you want to deterministically “lift” those binaries into stable package identities (PURLs) without turning the core SBOM into fuzzy guesswork. StellaOps’ own Scanner docs emphasize deterministic analyzers, no fuzzy identity in core, and keeping heuristics as opt‑in add‑ons. (Stella Ops)
0) What “binary mapping” means in StellaOps terms
In Scanner’s architecture, the component key is:
- PURL when present
- otherwise bin:{sha256} (Stella Ops)
So “better binary mapping” = systematically converting more of those bin:* components into PURLs (or at least producing actionable mapping evidence + Unknowns) while preserving:
- deterministic replay (same inputs ⇒ same output)
- offline operation (air‑gapped kits)
- policy safety (don’t hide false negatives behind fuzzy IDs)
Also, StellaOps already has the concept of “gaps” being first‑class via the Unknowns Registry (identity gaps, missing build‑id, version conflicts, missing edges, etc.). (Gitea: Git with a cup of tea) Your binary mapping work should feed this system.
1) Design constraints you must keep (or you’ll fight the platform)
1.1 Determinism rules
StellaOps’ Scanner architecture is explicit: core analyzers are deterministic; heuristic plug‑ins must not contaminate the core SBOM unless explicitly enabled. (Stella Ops)
That implies:
- No probabilistic "best guess" PURL in the default mapping path.
- If you do fuzzy inference, it must be emitted as:
  - "hints" attached to Unknowns, or
  - a separate heuristic artifact gated by flags.
1.2 Offline kit + debug store is already a hook you can exploit
Offline kits already bundle:
- scanner plug‑ins (OS + language analyzers packaged under plugins/scanner/analyzers/**)
- a debug store layout: debug/.build-id/<aa>/<rest>.debug
- a debug-manifest.json that maps build‑ids → originating images (for symbol retrieval) (Stella Ops)
This is perfect for building a Build‑ID→PURL index that remains offline and signed.
1.3 Scanner Worker already loads analyzers via directory catalogs
The Worker loads OS and language analyzer plug‑ins from default directories (unless overridden), using deterministic directory normalization and a “seal” concept on the last directory. (Gitea: Git with a cup of tea)
So you can add a third catalog for native/binary mapping that behaves the same way.
2) Layering strategy: what to implement (and in what order)
You want a resolver pipeline with strict ordering from “hard evidence” → “soft evidence”.
Layer 0 — In‑image authoritative mapping (highest confidence)
These sources are authoritative because they come from within the artifact:
- OS package DB present (dpkg/rpm/apk):
  - Map path → package using file ownership lists.
  - If you can also compute file hashes/build‑ids, store them as evidence.
- Language ecosystem metadata present (already handled by language analyzers):
- For example, a Python wheel RECORD or a Go buildinfo section can directly imply module versions.
Decision rule: If a binary file is owned by an OS package, prefer that over any external mapping index.
Layer 1 — “Build provenance” mapping via build IDs / UUIDs (strong, portable)
When package DB is missing (distroless/scratch), use compiler/linker stable IDs:
- ELF: .note.gnu.build-id
- Mach‑O: LC_UUID
- PE: CodeView (PDB GUID+Age) / build signature
This should be your primary fallback because it survives stripping and renaming.
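To show how little is needed here, a minimal sketch of reading the GNU build-id, assuming a 64-bit little-endian ELF whose section headers are still present; a production extractor would also handle 32-bit, big-endian, and section-stripped binaries via PT_NOTE segments:

```csharp
// Minimal sketch: scan SHT_NOTE sections of a 64-bit little-endian ELF for the
// NT_GNU_BUILD_ID note and return its payload as lowercase hex.
using System;
using System.IO;

static class ElfBuildId
{
    const uint ShtNote = 7;        // SHT_NOTE section type
    const uint NtGnuBuildId = 3;   // NT_GNU_BUILD_ID note type

    public static string? TryRead(string path)
    {
        byte[] f = File.ReadAllBytes(path);
        if (f.Length < 64 || f[0] != 0x7F || f[1] != (byte)'E' || f[2] != (byte)'L' || f[3] != (byte)'F')
            return null;                                    // not an ELF file
        if (f[4] != 2 || f[5] != 1)
            return null;                                    // this sketch: ELFCLASS64 + little-endian only

        ulong shoff    = BitConverter.ToUInt64(f, 0x28);    // e_shoff
        ushort shentsz = BitConverter.ToUInt16(f, 0x3A);    // e_shentsize
        ushort shnum   = BitConverter.ToUInt16(f, 0x3C);    // e_shnum

        for (int i = 0; i < shnum; i++)
        {
            int sh = checked((int)shoff + i * shentsz);
            if (BitConverter.ToUInt32(f, sh + 0x04) != ShtNote) continue;        // sh_type

            int off = checked((int)BitConverter.ToUInt64(f, sh + 0x18));         // sh_offset
            int end = off + checked((int)BitConverter.ToUInt64(f, sh + 0x20));   // + sh_size

            while (off + 12 <= end)                          // walk note records in this section
            {
                uint nameSz = BitConverter.ToUInt32(f, off);
                uint descSz = BitConverter.ToUInt32(f, off + 4);
                uint type   = BitConverter.ToUInt32(f, off + 8);
                int nameOff = off + 12;
                int descOff = nameOff + (int)((nameSz + 3) & ~3u);               // names are 4-byte padded

                if (type == NtGnuBuildId && nameSz == 4 && f[nameOff] == (byte)'G'
                    && f[nameOff + 1] == (byte)'N' && f[nameOff + 2] == (byte)'U')
                    return Convert.ToHexString(f, descOff, (int)descSz).ToLowerInvariant();

                off = descOff + (int)((descSz + 3) & ~3u);                       // descriptors are 4-byte padded
            }
        }
        return null;                                         // no build-id note found
    }
}
```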
Layer 2 — Hash mapping for curated or vendor‑pinned binaries (strong but brittle across rebuilds)
Use SHA‑256 → PURL mapping when:
- binaries are redistributed unchanged (busybox, chromium, embedded runtimes)
- you maintain a curated “known binaries” manifest
StellaOps already has “curated binary manifest generation” mentioned in its repo history, and a vendor/manifest.json concept exists (for pinned artifacts / binaries in the system). (Gitea: Git with a cup of tea)
For your ops environment you’ll create a similar manifest for your fleet.
Layer 3 — Dependency closure constraints (helpful as a disambiguator, not a primary mapper)
If the binary’s DT_NEEDED / imports point to libs you can identify, you can use that to disambiguate multiple possible candidates (“this openssl build-id matches, but only one candidate has the required glibc baseline”).
This must remain deterministic and rules‑based.
Layer 4 — Heuristic hints (never change the core SBOM by default)
Examples:
- symbol version patterns (GLIBC_2.28, etc.)
- embedded version strings
- import tables
- compiler metadata
These produce Unknown evidence/hints, not a resolved identity, unless a special “heuristics allowed” flag is turned on.
Layer 5 — Unknowns Registry output (mandatory when you can’t decide)
If a mapping can’t be made decisively:
- emit Unknowns (identity_gap, missing_build_id, version_conflict, etc.) (Gitea: Git with a cup of tea)
This is not optional; it's how you prevent silent false negatives.
3) Concrete data model you should implement
3.1 Binary identity record
Create a single canonical identity structure that every layer uses:
public enum BinaryFormat { Elf, Pe, MachO, Unknown }
public sealed record BinaryIdentity(
BinaryFormat Format,
string Path, // normalized (posix style), rooted at image root
string Sha256, // always present
string? BuildId, // ELF
string? MachOUuid, // Mach-O
string? PeCodeViewGuid, // PE/PDB
string? Arch, // amd64/arm64/...
long SizeBytes
);
Determinism tip: normalize Path to a single separator and collapse //, ./, etc.
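A minimal normalization sketch along those lines (the exact rules, e.g. how ".." is treated, are yours to pin down, but whatever you choose must be applied identically everywhere):

```csharp
using System.Collections.Generic;

static class PathNormalizer
{
    // Collapse "//", "./" and ".." segments and force posix separators, so the same
    // file always yields the same Path value regardless of how the layer was walked.
    public static string Normalize(string path)
    {
        var stack = new List<string>();
        foreach (var part in path.Replace('\\', '/').Split('/'))
        {
            if (part.Length == 0 || part == ".") continue;
            if (part == "..") { if (stack.Count > 0) stack.RemoveAt(stack.Count - 1); continue; }
            stack.Add(part);
        }
        return "/" + string.Join('/', stack);
    }
}
```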
3.2 Mapping candidate
Each resolver layer returns candidates like:
public enum MappingVerdict { Resolved, Unresolved, Ambiguous }
public sealed record BinaryMappingCandidate(
string Purl,
double Confidence, // 0..1 but deterministic
string ResolverId, // e.g. "os.fileowner", "buildid.index.v1"
IReadOnlyList<string> Evidence, // stable ordering
IReadOnlyDictionary<string,string> Properties // stable ordering
);
3.3 Final mapping result
public sealed record BinaryMappingResult(
MappingVerdict Verdict,
BinaryIdentity Subject,
BinaryMappingCandidate? Winner,
IReadOnlyList<BinaryMappingCandidate> Alternatives,
string MappingIndexDigest // sha256 of index snapshot used (or "none")
);
4) Build the “Binary Map Index” that makes Layer 1 and 2 work offline
4.1 Where it lives in StellaOps
Put it in the Offline Kit as a signed artifact, next to other feeds and plug-ins. Offline kit packaging already includes plug-ins and a debug store with a deterministic layout. (Stella Ops)
Recommended layout:
offline-kit/
  feeds/
    binary-map/
      v1/
        buildid.map.zst
        sha256.map.zst
        index.manifest.json
        index.manifest.json.sig   (DSSE or JWS, consistent with your kit)
4.2 Index record schema (v1)
Make each record explicit and replayable:
{
"schema": "stellaops.binary-map.v1",
"records": [
{
"key": { "kind": "elf.build_id", "value": "2f3a..."},
"purl": "pkg:deb/debian/openssl@3.0.11-1~deb12u2?arch=amd64",
"evidence": {
"source": "os.dpkg.fileowner",
"source_image": "sha256:....",
"path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
"package": "openssl",
"package_version": "3.0.11-1~deb12u2"
}
}
]
}
Key points:
- key.kind is one of elf.build_id, macho.uuid, pe.codeview, file.sha256
- include evidence with enough detail to justify the mapping
4.3 How to generate the index (deterministically)
You need an offline index builder pipeline. In StellaOps terms, this is best treated like a feed exporter step (build-time), then shipped in the Offline Kit.
Input set options (choose one or mix):
- “Golden base images” list (your fleet’s base images)
- Distro repositories mirrored into the airgap (Deb/RPM/APK archives)
- Previously scanned images that are allowed into the kit
Generation steps:
- For each input image:
  - Extract rootfs in a deterministic path order.
  - Run OS analyzers (dpkg/rpm/apk) + native identity collection (ELF/PE/MachO).
- Produce raw tuples: (build_id | uuid | codeview | sha256) → (purl, evidence)
- Deduplicate:
  - Canonicalize PURLs (normalize qualifiers order, lowercasing rules).
  - If the same key maps to multiple distinct PURLs, keep them all and mark as conflict (do not pick one).
- Sort:
  - Sort by (key.kind, key.value, purl) lexicographically.
- Serialize:
  - Emit line‑delimited JSON or a simple binary format.
  - Compress (zstd).
- Compute digests (see the sketch after this list):
  - sha256 of each artifact.
  - sha256 of concatenated (artifact name + sha) for a manifest hash.
- Sign:
  - include in the kit manifest and sign with the same process you use for other offline kit elements. Offline kit import in StellaOps validates digests and signatures. (Stella Ops)
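A sketch of the sort + digest steps under an assumed flat record shape; zstd compression and signing are omitted, and the file names are illustrative:

```csharp
// Sketch of the "sort + digest" steps: stable ordering, line-delimited JSON output,
// then a manifest hash over (artifact name + sha256). Record shape is an assumption.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

record IndexRecord(string Kind, string Value, string Purl);

static class IndexBuilderSketch
{
    public static void Write(IEnumerable<IndexRecord> records, string outDir)
    {
        var sorted = records
            .Distinct()
            .OrderBy(r => r.Kind, StringComparer.Ordinal)        // (key.kind, key.value, purl)
            .ThenBy(r => r.Value, StringComparer.Ordinal)
            .ThenBy(r => r.Purl, StringComparer.Ordinal);

        // One record per line, no indentation => byte-stable artifact.
        var lines = sorted.Select(r => JsonSerializer.Serialize(r));
        string artifact = Path.Combine(outDir, "buildid.map.jsonl");
        File.WriteAllText(artifact, string.Join("\n", lines) + "\n");

        // Manifest hash over (artifact name + sha256), concatenated in name order.
        string sha = Convert.ToHexString(SHA256.HashData(File.ReadAllBytes(artifact))).ToLowerInvariant();
        string manifestHash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(Path.GetFileName(artifact) + sha))).ToLowerInvariant();

        File.WriteAllText(Path.Combine(outDir, "index.manifest.json"),
            JsonSerializer.Serialize(new
            {
                artifacts = new[] { new { name = Path.GetFileName(artifact), sha256 = sha } },
                manifestHash
            }));
    }
}
```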
5) Runtime side: implement the layered resolver in Scanner Worker
5.1 Where to hook in
You want this to run after OS + language analyzers have produced fragments, and after native identity collection has produced binary identities.
Scanner Worker already executes analyzers and appends fragments to context.Analysis. (Gitea: Git with a cup of tea)
Scanner module responsibilities explicitly include OS, language, and native ecosystems as restart-only plug-ins. (Gitea: Git with a cup of tea) So implement binary mapping as either:
- part of the native ecosystem analyzer output stage, or
- a post-analyzer enrichment stage that runs before SBOM composition.
I recommend: post-analyzer enrichment stage, because it can consult OS+lang analyzer results and unify decisions.
5.2 Add a new ScanAnalysis key
Store collected binary identities in analysis:
ScanAnalysisKeys.NativeBinaryIdentities → ImmutableArray<BinaryIdentity>
And store mapping results:
ScanAnalysisKeys.NativeBinaryMappings → ImmutableArray<BinaryMappingResult>
5.3 Implement the resolver pipeline (deterministic ordering)
public interface IBinaryMappingResolver
{
string Id { get; } // stable ID
int Order { get; } // deterministic
BinaryMappingCandidate? TryResolve(BinaryIdentity identity, MappingContext ctx);
}
Pipeline:
- Sort resolvers by (Order, Id) (Ordinal comparison).
- For each resolver:
  - if it returns a candidate, add it to the candidates list.
  - if the resolver is "authoritative" (Layer 0), you can short‑circuit on the first hit.
- Decide (see the sketch after this list):
  - If 0 candidates ⇒ Unresolved
  - If 1 candidate ⇒ Resolved
  - If >1:
    - If candidates have different PURLs ⇒ Ambiguous, unless a deterministic "dominates" rule exists
    - If candidates have the same PURL (from multiple sources) ⇒ merge evidence
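A sketch of that decide step, reusing the records from section 3; the "dominates" rule and evidence merging are left out, so treat this as a shape, not a finished implementation:

```csharp
// Sketch of the deterministic "decide" step over resolver candidates.
using System;
using System.Collections.Generic;
using System.Linq;

static class MappingDecider
{
    public static BinaryMappingResult Decide(
        BinaryIdentity subject,
        IReadOnlyList<BinaryMappingCandidate> candidates,
        string mappingIndexDigest)
    {
        if (candidates.Count == 0)
            return new BinaryMappingResult(MappingVerdict.Unresolved, subject, null, candidates, mappingIndexDigest);

        // Group by PURL so the same identity reported by several resolvers merges instead of conflicting.
        var byPurl = candidates
            .GroupBy(c => c.Purl)
            .OrderBy(g => g.Key, StringComparer.Ordinal)       // deterministic ordering
            .ToList();

        if (byPurl.Count == 1)
        {
            // Single PURL: pick a representative winner in stable order (real code would merge evidence).
            var winner = byPurl[0].OrderBy(c => c.ResolverId, StringComparer.Ordinal).First();
            return new BinaryMappingResult(MappingVerdict.Resolved, subject, winner, candidates, mappingIndexDigest);
        }

        // Different PURLs: Ambiguous unless a deterministic "dominates" rule picks one (not shown here).
        return new BinaryMappingResult(MappingVerdict.Ambiguous, subject, null, candidates, mappingIndexDigest);
    }
}
```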
5.4 Implement each layer as a resolver
Resolver A: OS file owner (Layer 0)
Inputs:
- OS analyzer results in context.Analysis (they're already stored in ScanAnalysisKeys.OsPackageAnalyzers). (Gitea: Git with a cup of tea)
- You need OS analyzers to expose file ownership mapping.
Implementation options:
- Extend OS analyzers to produce path → packageId maps.
- Or load that from the dpkg/rpm DB at mapping time (fast enough if you only query per binary path); a dpkg sketch follows.
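As a sketch of the second option for dpkg-based images: the rootfs's /var/lib/dpkg/info/*.list files already give you a path → package-name map; versions and architecture still have to be joined from the dpkg status database before you can emit a full PURL:

```csharp
// Sketch only: build a path -> package-name map from /var/lib/dpkg/info/*.list
// under the extracted rootfs. Versions/arch must still come from /var/lib/dpkg/status.
using System;
using System.Collections.Generic;
using System.IO;

static class DpkgFileOwner
{
    public static IReadOnlyDictionary<string, string> BuildPathOwnerMap(string rootfs)
    {
        var owners = new Dictionary<string, string>(StringComparer.Ordinal);
        string infoDir = Path.Combine(rootfs, "var/lib/dpkg/info");
        if (!Directory.Exists(infoDir)) return owners;

        foreach (var listFile in Directory.EnumerateFiles(infoDir, "*.list"))
        {
            // "openssl.list" or "libssl3:amd64.list" -> package name portion
            string package = Path.GetFileNameWithoutExtension(listFile).Split(':')[0];
            foreach (var line in File.ReadLines(listFile))
            {
                if (line.Length > 1) owners[line] = package;   // lines are absolute paths like /usr/lib/...
            }
        }
        return owners;
    }
}
```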
Candidate:
- Purl = pkg:<ecosystem>/<name>@<version>?arch=...
- Confidence = 1.0
- Evidence includes:
  - analyzer id
  - package name/version
  - file path
Resolver B: Build‑ID index (Layer 1)
Inputs:
- identity.BuildId (or uuid/codeview)
- BinaryMapIndex loaded from Offline Kit feeds/binary-map/v1/buildid.map.zst
Implementation:
- On worker startup: load and parse the index into an immutable structure: FrozenDictionary<string, BuildIdEntry[]> (or sorted arrays + binary search); see the sketch below.
- If a key maps to multiple PURLs:
  - return multiple candidates (same resolver id), forcing an Ambiguous verdict upstream
Candidate:
- Confidence = 0.95 (still deterministic)
- Evidence includes the index manifest digest + record evidence
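A sketch of Resolver B against the interface above; BuildIdEntry and the index-loading step are assumptions, and because TryResolve returns a single candidate, a real implementation would need a list-returning variant (or repeated candidates) to surface multi-PURL conflicts as Ambiguous:

```csharp
// Sketch of Resolver B: frozen build-id -> entries index, loaded once at worker startup.
using System;
using System.Collections.Frozen;
using System.Collections.Generic;
using System.Linq;

sealed record BuildIdEntry(string Purl, IReadOnlyList<string> Evidence);

sealed class BuildIdIndexResolver : IBinaryMappingResolver
{
    private readonly FrozenDictionary<string, BuildIdEntry[]> _index;
    private readonly string _indexDigest;

    public BuildIdIndexResolver(FrozenDictionary<string, BuildIdEntry[]> index, string indexDigest)
        => (_index, _indexDigest) = (index, indexDigest);

    public string Id => "buildid.index.v1";
    public int Order => 100;                        // after Layer 0 (OS file owner)

    public BinaryMappingCandidate? TryResolve(BinaryIdentity identity, MappingContext ctx)
    {
        if (identity.BuildId is null || !_index.TryGetValue(identity.BuildId, out var entries))
            return null;

        // Pick the first entry in stable order; a real resolver would surface all of them
        // so the pipeline can mark multi-PURL keys as Ambiguous.
        var entry = entries.OrderBy(e => e.Purl, StringComparer.Ordinal).First();
        return new BinaryMappingCandidate(
            entry.Purl,
            Confidence: 0.95,
            ResolverId: Id,
            Evidence: entry.Evidence.Append($"index:{_indexDigest}").ToArray(),
            Properties: new Dictionary<string, string>());
    }
}
```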
Resolver C: SHA‑256 index (Layer 2)
Inputs:
- identity.Sha256
- feeds/binary-map/v1/sha256.map.zst OR your ops "curated binaries" manifest
Candidate:
- Confidence:
  - 0.9 if from a signed curated manifest
  - 0.7 if from "observed in previous scan cache" (I'd avoid this unless you version and sign the cache)
Resolver D: Dependency closure constraints (Layer 3)
Only run if you have native dependency parsing output (DT_NEEDED / imports). The resolver does not return a mapping on its own; instead, it can:
- bump confidence for existing candidates
- or rule out candidates deterministically (e.g., glibc baseline mismatch)
Make this a “candidate rewriter” stage:
public interface ICandidateRefiner
{
string Id { get; }
int Order { get; }
IReadOnlyList<BinaryMappingCandidate> Refine(BinaryIdentity id, IReadOnlyList<BinaryMappingCandidate> cands, MappingContext ctx);
}
Resolver E: Heuristic hints (Layer 4)
Never resolves to a PURL by default. It just produces Unknown evidence payload:
- extracted strings (“OpenSSL 3.0.11”)
- imported symbol names
- SONAME
- symbol version requirements
6) SBOM composition behavior: how to “lift” bin components safely
6.1 Don’t break the component key rules
Scanner uses:
- key = PURL when present, else bin:{sha256} (Stella Ops)
When you resolve a binary identity to a PURL, you have two clean options:
Option 1 (recommended): replace the component key with the PURL
- This makes downstream policy/advisory matching work naturally.
- It’s deterministic as long as the mapping index is versioned and shipped with the kit.
Option 2: keep bin:{sha256} as the component key and attach resolved_purl
- Lower disruption to diffing, but policy now has to understand the “resolved_purl” field.
- If StellaOps policy assumes component.purl is the canonical key, this will cause pain.
Given StellaOps emphasizes PURLs as the canonical key for identity, I’d implement Option 1, but record robust evidence + index digest.
6.2 Preserve file-level evidence
Even after lifting to PURL, keep evidence that ties the package identity back to file bytes:
- file path(s)
- sha256
- build-id/uuid
- mapping resolver id + index digest
This is what makes attestations verifiable and helps operators debug.
7) Unknowns integration: emit Unknowns whenever mapping isn’t decisive
The Unknowns Registry exists precisely for “unresolved symbol → package mapping”, “missing build-id”, “ambiguous purl”, etc. (Gitea: Git with a cup of tea)
7.1 When to emit Unknowns
Emit Unknowns for:
- identity.BuildId == null for ELF
  - unknown_type = missing_build_id
  - evidence: "ELF missing .note.gnu.build-id; using sha256 only"
- Multiple candidates with different PURLs
  - unknown_type = version_conflict (or identity_gap)
  - evidence: list candidates + their evidence
- Heuristic hints found but no authoritative mapping
  - unknown_type = identity_gap
  - evidence: imported symbols, strings, SONAME
7.2 How to compute unknown_id deterministically
Unknowns schema suggests:
unknown_id is derived from sha256 over (type + scope + evidence) (Gitea: Git with a cup of tea)
Do:
- stable JSON canonicalization of scope + unknown_type + primary evidence fields
- sha256 the canonical bytes
- prefix with unk:sha256:<...>
This guarantees idempotent ingestion behavior (POST /unknowns/ingest upsert). (Gitea: Git with a cup of tea)
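A sketch of that derivation; the exact field set and canonicalization are assumptions to be aligned with the Unknowns schema:

```csharp
// Sketch: canonical JSON over the identifying fields, sha256, then the "unk:sha256:" prefix.
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

static class UnknownId
{
    public static string Compute(string unknownType, string scope, string primaryEvidence)
    {
        // Fixed property order + no whitespace gives a stable byte representation.
        string canonical = JsonSerializer.Serialize(new
        {
            type = unknownType,
            scope,
            evidence = primaryEvidence
        });

        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return "unk:sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}

// Example: Compute("missing_build_id", "image:sha256:.../usr/bin/foo",
//                  "ELF missing .note.gnu.build-id; using sha256 only")
// always yields the same id, so repeated POST /unknowns/ingest calls upsert cleanly.
```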
8) Packaging as a StellaOps plug-in (so ops can upgrade it offline)
8.1 Plug-in manifest
Scanner plug-ins use a manifest.json with schemaVersion, id, entryPoint (dotnet assembly + typeName), etc. (Gitea: Git with a cup of tea)
Create something like:
{
"schemaVersion": "1.0",
"id": "stellaops.analyzer.native.binarymap",
"displayName": "StellaOps Native Binary Mapper",
"version": "0.1.0",
"requiresRestart": true,
"entryPoint": {
"type": "dotnet",
"assembly": "StellaOps.Scanner.Analyzers.Native.BinaryMap.dll",
"typeName": "StellaOps.Scanner.Analyzers.Native.BinaryMap.BinaryMapPlugin"
},
"capabilities": [
"native-analyzer",
"binary-mapper",
"elf",
"pe",
"macho"
],
"metadata": {
"org.stellaops.analyzer.kind": "native",
"org.stellaops.restart.required": "true"
}
}
8.2 Worker loading
Mirror the pattern in CompositeScanAnalyzerDispatcher:
- add a catalog INativeAnalyzerPluginCatalog
- default directory: plugins/scanner/analyzers/native
- load directories with the same "seal last directory" behavior (Gitea: Git with a cup of tea)
9) Tests and performance gates (what “done” looks like)
StellaOps has determinism tests and golden fixtures for analyzers; follow that style. (Gitea: Git with a cup of tea)
9.1 Determinism tests
Create fixtures with:
- same binaries in different file order
- same binaries hardlinked/symlinked
- stripped ELF missing build-id
- multi-arch variants
Assert:
- mapping output JSON is byte-for-byte stable (see the test sketch below)
- unknown ids are stable
- candidate ordering is stable
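A test sketch in that style; TestFixtures.LoadBinaryIdentities and BinaryMapper.MapToJson are hypothetical stand-ins for your fixture loader and the pipeline-plus-canonical-serializer entry point:

```csharp
// Sketch: the same binaries presented in a different order must produce
// byte-identical mapping output.
using System.Linq;
using Xunit;

public class BinaryMappingDeterminismTests
{
    [Fact]
    public void Same_inputs_in_any_order_produce_identical_output()
    {
        var identities = TestFixtures.LoadBinaryIdentities("fixtures/distroless-sample"); // hypothetical fixture loader

        string first = BinaryMapper.MapToJson(identities);
        string second = BinaryMapper.MapToJson(identities.Reverse().ToArray());

        Assert.Equal(first, second);   // byte-for-byte stable, including candidate ordering
    }
}
```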
9.2 “No fuzzy identity” guardrail tests
Add tests that:
- the heuristic resolver never emits a Resolved verdict unless a feature flag is enabled
- ambiguous candidates never auto-select a winner
9.3 Performance budgets
For ops, you care about scan wall time. Adopt budgets like:
- identity extraction < 25ms / binary (native parsing)
- mapping lookup O(1) / binary (frozen dict) or O(log n) with sorted arrays
- index load time bounded (lazy load per worker start)
Track metrics:
- count resolved per layer
- count ambiguous/unresolved
- unknown density (ties into Unknowns Registry scoring later) (Gitea: Git with a cup of tea)
10) Practical “ops” workflow: how to keep improving mapping safely
10.1 Add a feedback loop from Unknowns → index builder
Unknowns are your backlog:
- “missing build-id”
- “ambiguous mapping”
- “hash seen but not in index”
For each Unknown:
- decide if it should be mapped in core (needs authoritative source)
- if yes: add reference artifact to your index builder input set
- rebuild the BinaryMap index
- ship via Offline Kit update (signed)
10.2 Don’t let your index silently drift
Because determinism matters, treat the BinaryMap index like a feed:
- version it (v1, v2)
- sign it
- store index digest in scan evidence
That way you can explain: “This binary was mapped using binary-map/v1 digest XYZ”.
11) Minimal implementation checklist (if you want the shortest path to value)
If you only do 3 things, do these:
- Build‑ID extraction everywhere (ELF/Mach‑O/PE) and always store it in evidence (also emit an Unknown when missing, as StellaOps expects) (Gitea: Git with a cup of tea)
- Offline Build‑ID → PURL index shipped in the Offline Kit (fits perfectly with the existing debug-store + kit pattern) (Stella Ops)
- Deterministic resolver pipeline + Unknowns emission (so you improve mapping without introducing silent risk) (Gitea: Git with a cup of tea)
If you tell me whether your main pain is distroless, FROM scratch, or vendor‑bundled runtimes (chromium/node/openssl/etc.), I can give you the best “Layer 1 index builder” recipe for that category (what to use as authoritative sources and how to avoid collisions) — but the plan above is already safe and implementable without further assumptions.