Here are two practical ways to make your software supply‑chain evidence both *useful* and *verifiable*—with enough background to get you shipping.

---

# 1) Binary SBOMs that still work when there’s no package manager

**Why this matters:** Container images built `FROM scratch` or “distroless” often lack package metadata, so typical SBOMs go blank. A *binary SBOM* extracts facts directly from executables—so you still know “what’s inside,” even in bare images.

**Core idea (plain English):**

* Parse binaries (ELF on Linux, PE on Windows, Mach‑O on macOS).
* Record file paths, cryptographic hashes, import tables, compiler/linker hints, and for ELF also the `.note.gnu.build-id` (a unique ID most linkers embed).
* Map these fingerprints to known packages/versions (vendor fingerprints, distro databases, your own allowlists).
* Sign the result as an attestation so others can trust it without re‑running your scanner.

**Minimal pipeline sketch** (an extraction sketch follows the list):

* **Extract:** `readelf -n` (ELF notes), `objdump`/`otool` for imports; compute SHA‑256 for every binary.
* **Normalize:** Emit CycloneDX or SPDX components for *binaries*, not just packages.
* **Map:** Use Build‑ID → package hints (e.g., glibc, OpenSSL), symbol/version patterns, and path heuristics.
* **Attest:** Wrap the SBOM in DSSE + in‑toto and push to your registry alongside the image digest.
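
A minimal C# sketch of the **Extract** step, assuming binutils’ `readelf` is on `PATH` (a production scanner would parse the ELF note section itself; `BinaryExtractor` and `ExtractIdentity` are names invented for this sketch):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Security.Cryptography;
using System.Text.RegularExpressions;

static class BinaryExtractor
{
    // Hash one binary and shell out to `readelf -n` for the GNU build-id.
    public static (string Sha256, string? BuildId) ExtractIdentity(string path)
    {
        using var stream = File.OpenRead(path);
        var sha256 = Convert.ToHexString(SHA256.HashData(stream)).ToLowerInvariant();

        var psi = new ProcessStartInfo("readelf", $"-n \"{path}\"") { RedirectStandardOutput = true };
        using var process = Process.Start(psi)!;
        var output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();

        // readelf prints a line like "Build ID: 2f3a9c..." for .note.gnu.build-id.
        var match = Regex.Match(output, @"Build ID:\s*([0-9a-f]+)");
        return (sha256, match.Success ? match.Groups[1].Value : null);
    }
}
```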
**Pragmatic spec for developers:**

* Inputs: OCI image digest.
* Outputs:

  * `binary-sbom.cdx.json` (CycloneDX) or `binary-sbom.spdx.json`.
  * `attestation.intoto.jsonl` (DSSE envelope referencing the SBOM’s SHA‑256 and the *image digest*).
* Data fields to capture per artifact:

  * `algorithm: sha256`, `digest: <hex>`, `type: elf|pe|macho`, `path`, `size`,
  * `elf.build_id` (if present), `imports[]`, `compiler[]`, `arch`, `endian`.
* Verification:

  * `cosign verify-attestation --type cyclonedx <image>@<image-digest> ...`

**Why the ELF Build‑ID is gold:** it’s a stable, linker‑emitted identifier that helps correlate stripped binaries to upstream packages—critical when filenames and symbols lie.

---

# 2) Reachability analysis so you only page people for *real* risk

**Why this matters:** Not every CVE in your deps can actually be hit by your app. If you can show “no call path reaches the vulnerable sink,” you can *de‑noise* alerts and ship faster.

**Core idea (plain English):**

* Build an *interprocedural call graph* of your app (across modules/packages).
* Mark known “sinks” from vulnerability advisories (e.g., dangerous API + version range).
* Compute graph reachability from your entrypoints (HTTP handlers, CLI `main`, background jobs).
* The intersection of {reachable nodes} × {vulnerable sinks} = “actionable” findings.
* Emit a signed *witness* (attestation) that states which sinks are reachable/unreachable and why.

**Minimal pipeline sketch:**

* **Ingest code/bytecode:** language‑specific frontends (e.g., .NET IL, JVM bytecode, Python AST, Go SSA).
* **Build graph:** nodes = functions/methods; edges = call sites (include dynamic edges conservatively).
* **Mark entrypoints:** web routes, message handlers, cron jobs, exported CLIs.
* **Mark sinks:** from your vuln DB (API signature + version).
* **Decide:** run graph search from entrypoints → is any sink reachable? (see the sketch after this list)
* **Attest:** DSSE witness with:

  * artifact digest (commit SHA / image digest),
  * tool version + rule set hash,
  * list of reachable sinks with at least one example call path,
  * list of *proven* unreachable sinks (under stated assumptions).
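
The decide step is ordinary graph search. A toy sketch, where the call graph is a plain adjacency map (`FindReachableSinks` and its shapes are invented for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Reachability
{
    // BFS from the entrypoints; actionable findings = reachable nodes ∩ vulnerable sinks.
    public static HashSet<string> FindReachableSinks(
        IReadOnlyDictionary<string, IReadOnlyList<string>> callGraph,
        IEnumerable<string> entrypoints,
        IReadOnlySet<string> vulnerableSinks)
    {
        var visited = new HashSet<string>();
        var queue = new Queue<string>(entrypoints);
        while (queue.Count > 0)
        {
            var fn = queue.Dequeue();
            if (!visited.Add(fn)) continue;                        // already explored
            if (!callGraph.TryGetValue(fn, out var callees)) continue;
            foreach (var callee in callees) queue.Enqueue(callee); // conservative: every edge counts
        }
        return visited.Intersect(vulnerableSinks).ToHashSet();
    }
}
```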
**Developer contract (portable across languages):**

* Inputs: source/bytecode zip + manifest of entrypoints.
* Outputs:

  * `reachability.witness.json` (DSSE envelope),
  * optional `paths/` folder with top‑N call paths as compact JSON (for UX rendering).
* Verification:

  * Recompute call graph deterministically given the same inputs + tool version,
  * `cosign verify-attestation --type <reachability-predicate-URI> ...`

---

# How these two pieces fit together

* **Binary SBOM** = “What exactly is in the artifact?” (even in bare images)
* **Reachability witness** = “Which vulns actually matter to *this* app build?”
* Sign both as **DSSE/in‑toto attestations** and attach to the image/release. Your CI can enforce (a gate sketch follows):

  * “Block if high‑severity + *reachable*,”
  * “Warn (don’t block) if high‑severity but *unreachable* with a fresh witness.”
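
A hedged sketch of that gate; `Finding` and its fields are invented here rather than taken from any existing tool:

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record Finding(string Cve, string Severity, bool Reachable, bool WitnessFresh);

public static class ReleaseGate
{
    // Block when a high/critical finding is reachable, or when its
    // "unreachable" witness is stale (stale ⇒ treat as reachable).
    public static bool ShouldBlock(IEnumerable<Finding> findings) =>
        findings.Any(f =>
            (f.Severity is "high" or "critical") &&
            (f.Reachable || !f.WitnessFresh));
}
```

Treating a stale witness as reachable keeps the gate fail‑closed.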
---

# Quick starter checklist (copy/paste to a task board)

* [ ] Binary extractors: ELF/PE/Mach‑O parsers; hash & Build‑ID capture.
* [ ] Mapping rules: Build‑ID → known package DB; symbol/version heuristics.
* [ ] Emit CycloneDX/SPDX; add file‑level components for binaries.
* [ ] DSSE signing and `cosign`/`rekor` publish for SBOM attestation.
* [ ] Language frontends for reachability (pick your top 1–2 first).
* [ ] Call‑graph builder + entrypoint detector.
* [ ] Sink catalog normalizer (map CVE → API signature).
* [ ] Reachability engine + example path extractor.
* [ ] DSSE witness for reachability; attach to build.
* [ ] CI policy: block on “reachable high/critical”; surface paths in UI.

If you want, I can turn this into concrete .NET‑first tasks with sample code scaffolds and a tiny demo repo that builds an image, extracts a binary SBOM, runs reachability on a toy service, and emits both attestations.

---

Below is a concrete, “do‑this‑then‑this” implementation plan for a **layered binary→PURL mapping system** that fits StellaOps’ constraints: **offline**, **deterministic**, **SBOM‑first**, and with **unknowns recorded instead of guessing**.

I’m going to assume your target is the common pain case StellaOps itself calls out: when package metadata is missing, Scanner falls back to binary identity (`bin:{sha256}`) and you want to deterministically “lift” those binaries into stable package identities (PURLs) without turning the core SBOM into fuzzy guesswork. StellaOps’ own Scanner docs emphasize **deterministic analyzers**, **no fuzzy identity in core**, and keeping heuristics as opt‑in add‑ons. ([Stella Ops][1])

---
## 0) What “binary mapping” means in StellaOps terms

In Scanner’s architecture, the **component key** is:

* **PURL when present**
* otherwise `bin:{sha256}` ([Stella Ops][1])

So “better binary mapping” = systematically converting more of those `bin:*` components into **PURLs** (or at least producing **actionable mapping evidence + Unknowns**) while preserving:

* deterministic replay (same inputs ⇒ same output)
* offline operation (air‑gapped kits)
* policy safety (don’t hide false negatives behind fuzzy IDs)

Also, StellaOps already treats “gaps” as first‑class via the **Unknowns Registry** (identity gaps, missing build‑ids, version conflicts, missing edges, etc.). ([Gitea: Git with a cup of tea][2]) Your binary mapping work should *feed* this system.

---

## 1) Design constraints you must keep (or you’ll fight the platform)

### 1.1 Determinism rules

StellaOps’ Scanner architecture is explicit: core analyzers are deterministic; heuristic plug‑ins must not contaminate the core SBOM unless explicitly enabled. ([Stella Ops][1])

That implies:

* **No probabilistic “best guess” PURL** in the default mapping path.
* If you do fuzzy inference, it must be emitted as:

  * “hints” attached to Unknowns, or
  * a separate heuristic artifact gated by flags.

### 1.2 Offline kit + debug store is already a hook you can exploit

Offline kits already bundle:

* scanner plug‑ins (OS + language analyzers packaged under `plugins/scanner/analyzers/**`)
* a **debug store** layout: `debug/.build-id/<aa>/<rest>.debug`
* a `debug-manifest.json` that maps build‑ids → originating images (for symbol retrieval) ([Stella Ops][3])

This is perfect for building a **Build‑ID→PURL index** that stays offline and signed.

### 1.3 Scanner Worker already loads analyzers via directory catalogs

The Worker loads OS and language analyzer plug‑ins from default directories (unless overridden), using deterministic directory normalization and a “seal” concept on the last directory. ([Gitea: Git with a cup of tea][4])

So you can add a third catalog for **native/binary mapping** that behaves the same way.

---
## 2) Layering strategy: what to implement (and in what order)

You want a **resolver pipeline** with strict ordering from “hard evidence” → “soft evidence”.

### Layer 0 — In‑image authoritative mapping (highest confidence)

These sources are authoritative because they come from within the artifact:

1. **OS package DB present** (dpkg/rpm/apk):

   * Map `path → package` using file ownership lists.
   * If you can also compute file hashes/build‑ids, store them as evidence.

2. **Language ecosystem metadata present** (already handled by language analyzers):

   * For example, a Python wheel RECORD or a Go buildinfo section can directly imply module versions.

**Decision rule**: If a binary file is owned by an OS package, **prefer that** over any external mapping index.

### Layer 1 — “Build provenance” mapping via build IDs / UUIDs (strong, portable)

When the package DB is missing (distroless/scratch), use **compiler/linker stable IDs**:

* ELF: `.note.gnu.build-id`
* Mach‑O: `LC_UUID`
* PE: CodeView (PDB GUID+Age) / build signature

This should be your primary fallback because it survives stripping and renaming.

### Layer 2 — Hash mapping for curated or vendor‑pinned binaries (strong but brittle across rebuilds)

Use SHA‑256 → PURL mapping when:

* binaries are redistributed unchanged (busybox, chromium, embedded runtimes)
* you maintain a curated “known binaries” manifest

StellaOps already mentions “curated binary manifest generation” in its repo history, and a `vendor/manifest.json` concept exists (for pinned artifacts / binaries in the system). ([Gitea: Git with a cup of tea][5]) For your ops environment you’ll create a similar manifest **for your fleet**.

### Layer 3 — Dependency closure constraints (helpful as a disambiguator, not a primary mapper)

If the binary’s DT_NEEDED / imports point to libs you *can* identify, you can use that to disambiguate multiple possible candidates (“this openssl build‑id matches, but only one candidate has the required glibc baseline”).

This must remain deterministic and rules‑based.

### Layer 4 — Heuristic hints (never change the core SBOM by default)

Examples:

* symbol version patterns (`GLIBC_2.28`, etc.)
* embedded version strings
* import tables
* compiler metadata

These produce **Unknown evidence/hints**, not a resolved identity, unless a special “heuristics allowed” flag is turned on.

### Layer 5 — Unknowns Registry output (mandatory when you can’t decide)

If a mapping can’t be made decisively, emit Unknowns (`identity_gap`, `missing_build_id`, `version_conflict`, etc.). ([Gitea: Git with a cup of tea][2]) This is not optional; it’s how you prevent silent false negatives.

---
## 3) Concrete data model you should implement

### 3.1 Binary identity record

Create a single canonical identity structure that *every layer* uses:

```csharp
public enum BinaryFormat { Elf, Pe, MachO, Unknown }

public sealed record BinaryIdentity(
    BinaryFormat Format,
    string Path,            // normalized (posix style), rooted at image root
    string Sha256,          // always present
    string? BuildId,        // ELF
    string? MachOUuid,      // Mach-O
    string? PeCodeViewGuid, // PE/PDB
    string? Arch,           // amd64/arm64/...
    long SizeBytes
);
```

**Determinism tip**: normalize `Path` to a single separator and collapse `//`, `./`, etc.
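
For example, a minimal normalization helper (one possible implementation; `..` segments would need an explicit policy and are left out here):

```csharp
using System;
using System.Linq;

static class PathNormalizer
{
    // Forward slashes only, collapse "//" and "./", keep the path rooted
    // at the image root.
    public static string Normalize(string path)
    {
        var parts = path.Replace('\\', '/')
                        .Split('/', StringSplitOptions.RemoveEmptyEntries)
                        .Where(segment => segment != ".");
        return "/" + string.Join('/', parts);
    }
}
```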
### 3.2 Mapping candidate

Each resolver layer returns candidates like:

```csharp
public enum MappingVerdict { Resolved, Unresolved, Ambiguous }

public sealed record BinaryMappingCandidate(
    string Purl,
    double Confidence,                             // 0..1 but deterministic
    string ResolverId,                             // e.g. "os.fileowner", "buildid.index.v1"
    IReadOnlyList<string> Evidence,                // stable ordering
    IReadOnlyDictionary<string, string> Properties // stable ordering
);
```

### 3.3 Final mapping result

```csharp
public sealed record BinaryMappingResult(
    MappingVerdict Verdict,
    BinaryIdentity Subject,
    BinaryMappingCandidate? Winner,
    IReadOnlyList<BinaryMappingCandidate> Alternatives,
    string MappingIndexDigest // sha256 of index snapshot used (or "none")
);
```

---

## 4) Build the “Binary Map Index” that makes Layers 1 and 2 work offline

### 4.1 Where it lives in StellaOps

Put it in the Offline Kit as a signed artifact, next to other feeds and plug-ins. Offline kit packaging already includes plug-ins and a debug store with a deterministic layout. ([Stella Ops][3])

Recommended layout:

```
offline-kit/
  feeds/
    binary-map/
      v1/
        buildid.map.zst
        sha256.map.zst
        index.manifest.json
        index.manifest.json.sig   (DSSE or JWS, consistent with your kit)
```
### 4.2 Index record schema (v1)

Make each record explicit and replayable:

```json
{
  "schema": "stellaops.binary-map.v1",
  "records": [
    {
      "key": { "kind": "elf.build_id", "value": "2f3a..." },
      "purl": "pkg:deb/debian/openssl@3.0.11-1~deb12u2?arch=amd64",
      "evidence": {
        "source": "os.dpkg.fileowner",
        "source_image": "sha256:....",
        "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
        "package": "openssl",
        "package_version": "3.0.11-1~deb12u2"
      }
    }
  ]
}
```

Key points:

* `key.kind` is one of `elf.build_id`, `macho.uuid`, `pe.codeview`, `file.sha256`
* include evidence with enough detail to justify the mapping

### 4.3 How to *generate* the index (deterministically)

You need an **offline index builder** pipeline. In StellaOps terms, this is best treated like a feed exporter step (build-time), then shipped in the Offline Kit.

**Input set options** (choose one or mix):

1. “Golden base images” list (your fleet’s base images)
2. Distro repositories mirrored into the airgap (Deb/RPM/APK archives)
3. Previously scanned images that are allowed into the kit

**Generation steps** (a sketch of steps 4 and 6 follows the list):

1. For each input image:

   * Extract rootfs in a deterministic path order.
   * Run OS analyzers (dpkg/rpm/apk) + native identity collection (ELF/PE/MachO).
2. Produce raw tuples:

   * `(build_id | uuid | codeview | sha256) → (purl, evidence)`
3. Deduplicate:

   * Canonicalize PURLs (normalize qualifier order, lowercasing rules).
   * If the same key maps to **multiple distinct PURLs**, keep them all and mark the conflict (do not pick one).
4. Sort:

   * Sort by `(key.kind, key.value, purl)` lexicographically.
5. Serialize:

   * Emit line‑delimited JSON or a simple binary format.
   * Compress (zstd).
6. Compute digests:

   * `sha256` of each artifact.
   * `sha256` over the concatenated `(artifact name + sha)` pairs for a manifest hash.
7. Sign:

   * Include the index in the kit manifest and sign it with the same process you use for other offline kit elements. Offline kit import in StellaOps validates digests and signatures. ([Stella Ops][3])
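
A hedged sketch of steps 4 and 6 (the record shape and names are invented here; ordinal comparison keeps the ordering culture-invariant):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed record IndexRecord(string KeyKind, string KeyValue, string Purl);

public static class IndexBuilder
{
    // Step 4: deterministic lexicographic ordering by (key.kind, key.value, purl).
    public static List<IndexRecord> Sort(IEnumerable<IndexRecord> records) =>
        records
            .OrderBy(r => r.KeyKind, StringComparer.Ordinal)
            .ThenBy(r => r.KeyValue, StringComparer.Ordinal)
            .ThenBy(r => r.Purl, StringComparer.Ordinal)
            .ToList();

    // Step 6: sha256 over the concatenated (artifact name + sha256) pairs.
    public static string ManifestHash(IEnumerable<(string Name, string Sha256)> artifacts)
    {
        var concatenated = string.Concat(
            artifacts.OrderBy(a => a.Name, StringComparer.Ordinal)
                     .Select(a => a.Name + a.Sha256));
        return Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(concatenated))).ToLowerInvariant();
    }
}
```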
---

## 5) Runtime side: implement the layered resolver in Scanner Worker

### 5.1 Where to hook in

You want this to run after the OS + language analyzers have produced fragments, and after native identity collection has produced binary identities.

Scanner Worker already executes analyzers and appends fragments to `context.Analysis`. ([Gitea: Git with a cup of tea][4])

Scanner module responsibilities explicitly include OS, language, and native ecosystems as restart-only plug-ins. ([Gitea: Git with a cup of tea][6])

So implement binary mapping as either:

* part of the **native ecosystem analyzer output stage**, or
* a **post-analyzer enrichment stage** that runs before SBOM composition.

I recommend the **post-analyzer enrichment stage**, because it can consult OS + language analyzer results and unify decisions.

### 5.2 Add a new ScanAnalysis key

Store collected binary identities in analysis:

* `ScanAnalysisKeys.NativeBinaryIdentities` → `ImmutableArray<BinaryIdentity>`

And store mapping results:

* `ScanAnalysisKeys.NativeBinaryMappings` → `ImmutableArray<BinaryMappingResult>`

### 5.3 Implement the resolver pipeline (deterministic ordering)

```csharp
public interface IBinaryMappingResolver
{
    string Id { get; }   // stable ID
    int Order { get; }   // deterministic
    BinaryMappingCandidate? TryResolve(BinaryIdentity identity, MappingContext ctx);
}
```

Pipeline (the decide step is sketched after the list):

1. Sort resolvers by `(Order, Id)` (Ordinal comparison).
2. For each resolver:

   * if it returns a candidate, add it to the candidates list.
   * if the resolver is “authoritative” (Layer 0), you can short‑circuit on the first hit.
3. Decide:

   * If 0 candidates ⇒ `Unresolved`
   * If 1 candidate ⇒ `Resolved`
   * If >1:

     * If candidates have different PURLs ⇒ `Ambiguous`, unless a deterministic “dominates” rule exists
     * If candidates have the same PURL (from multiple sources) ⇒ merge evidence
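
A sketch of that decide step over the records from §3 (`MergeEvidence` is a placeholder; a real merge would union evidence lists deterministically):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static BinaryMappingResult Decide(
    BinaryIdentity subject,
    IReadOnlyList<BinaryMappingCandidate> candidates,
    string indexDigest)
{
    if (candidates.Count == 0)
        return new(MappingVerdict.Unresolved, subject, null, candidates, indexDigest);

    // Conflicting identities: never auto-pick a winner.
    var distinctPurls = candidates.Select(c => c.Purl).Distinct(StringComparer.Ordinal).Count();
    if (distinctPurls > 1)
        return new(MappingVerdict.Ambiguous, subject, null, candidates, indexDigest);

    // Same PURL from one or more sources: merge evidence into a single winner.
    return new(MappingVerdict.Resolved, subject, MergeEvidence(candidates), candidates, indexDigest);
}

static BinaryMappingCandidate MergeEvidence(IReadOnlyList<BinaryMappingCandidate> sameIdentity) =>
    sameIdentity[0]; // placeholder: union Evidence/Properties with stable ordering
```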
### 5.4 Implement each layer as a resolver

#### Resolver A: OS file owner (Layer 0)

Inputs:

* OS analyzer results in `context.Analysis` (they’re already stored under `ScanAnalysisKeys.OsPackageAnalyzers`). ([Gitea: Git with a cup of tea][4])
* You need OS analyzers to expose a file ownership mapping.

Implementation options:

* Extend OS analyzers to produce `path → packageId` maps.
* Or load that from the dpkg/rpm DB at mapping time (fast enough if you only query per binary path).

Candidate:

* `Purl = pkg:<ecosystem>/<name>@<version>?arch=...`
* Confidence = `1.0`
* Evidence includes:

  * analyzer id
  * package name/version
  * file path

#### Resolver B: Build‑ID index (Layer 1)

Inputs:

* `identity.BuildId` (or uuid/codeview)
* `BinaryMapIndex` loaded from the Offline Kit `feeds/binary-map/v1/buildid.map.zst`

Implementation:

* On worker startup, load and parse the index into an immutable structure:

  * `FrozenDictionary<string, BuildIdEntry[]>` (or sorted arrays + binary search)
* If a key maps to multiple PURLs:

  * return multiple candidates (same resolver id), forcing an `Ambiguous` verdict upstream

Candidate (lookup sketched below):

* Confidence = `0.95` (still deterministic)
* Evidence includes the index manifest digest + record evidence
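
A sketch of the build-id lookup, assuming the index was parsed into a `FrozenDictionary` at worker startup (`BuildIdEntry` and the multi-candidate `Resolve` shape are illustrative, not existing StellaOps types):

```csharp
using System;
using System.Collections.Frozen;
using System.Collections.Generic;
using System.Linq;

public sealed record BuildIdEntry(string Purl, IReadOnlyList<string> Evidence);

public sealed class BuildIdIndexResolver
{
    private readonly FrozenDictionary<string, BuildIdEntry[]> _index;

    public BuildIdIndexResolver(FrozenDictionary<string, BuildIdEntry[]> index) => _index = index;

    public string Id => "buildid.index.v1";

    // One candidate per index entry; multiple distinct PURLs force `Ambiguous` upstream.
    public IReadOnlyList<BinaryMappingCandidate> Resolve(BinaryIdentity identity) =>
        identity.BuildId is { } buildId && _index.TryGetValue(buildId, out var entries)
            ? entries.Select(e => new BinaryMappingCandidate(
                  e.Purl, 0.95, Id, e.Evidence, new Dictionary<string, string>())).ToList()
            : Array.Empty<BinaryMappingCandidate>();
}
```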
#### Resolver C: SHA‑256 index (Layer 2)

Inputs:

* `identity.Sha256`
* `feeds/binary-map/v1/sha256.map.zst` OR your ops “curated binaries” manifest

Candidate:

* Confidence:

  * `0.9` if from a signed curated manifest
  * `0.7` if from an “observed in a previous scan” cache (I’d avoid this unless you version and sign the cache)

#### Resolver D: Dependency closure constraints (Layer 3)

Only run this if you have native dependency parsing output (DT_NEEDED / imports). The resolver does **not** return a mapping on its own; instead, it can:

* bump confidence for existing candidates
* or rule out candidates deterministically (e.g., glibc baseline mismatch)

Make this a “candidate rewriter” stage:

```csharp
public interface ICandidateRefiner
{
    string Id { get; }
    int Order { get; }
    IReadOnlyList<BinaryMappingCandidate> Refine(
        BinaryIdentity id,
        IReadOnlyList<BinaryMappingCandidate> cands,
        MappingContext ctx);
}
```

#### Resolver E: Heuristic hints (Layer 4)

Never resolves to a PURL by default. It just produces an Unknown evidence payload:

* extracted strings (“OpenSSL 3.0.11”)
* imported symbol names
* SONAME
* symbol version requirements

---
## 6) SBOM composition behavior: how to “lift” bin components safely

### 6.1 Don’t break the component key rules

Scanner uses:

* key = PURL when present, else `bin:{sha256}` ([Stella Ops][1])

When you resolve a binary identity to a PURL, you have two clean options:

**Option 1 (recommended): replace the component key with the PURL**

* This makes downstream policy/advisory matching work naturally.
* It’s deterministic as long as the mapping index is versioned and shipped with the kit.

**Option 2: keep `bin:{sha256}` as the component key and attach `resolved_purl`**

* Lower disruption to diffing, but policy now has to understand the `resolved_purl` field.
* If StellaOps policy assumes `component.purl` is the canonical key, this will cause pain.

Given StellaOps emphasizes PURLs as the canonical key for identity, I’d implement **Option 1**, but record robust evidence + the index digest.

### 6.2 Preserve file-level evidence

Even after lifting to a PURL, keep evidence that ties the package identity back to file bytes:

* file path(s)
* sha256
* build-id/uuid
* mapping resolver id + index digest

This is what makes attestations verifiable and helps operators debug.

---

## 7) Unknowns integration: emit Unknowns whenever mapping isn’t decisive

The Unknowns Registry exists precisely for “unresolved symbol → package mapping”, “missing build-id”, “ambiguous purl”, etc. ([Gitea: Git with a cup of tea][2])

### 7.1 When to emit Unknowns

Emit Unknowns for:

1. `identity.BuildId == null` for ELF

   * `unknown_type = missing_build_id`
   * evidence: “ELF missing .note.gnu.build-id; using sha256 only”

2. Multiple candidates with different PURLs

   * `unknown_type = version_conflict` (or `identity_gap`)
   * evidence: list the candidates + their evidence

3. Heuristic hints found but no authoritative mapping

   * `unknown_type = identity_gap`
   * evidence: imported symbols, strings, SONAME

### 7.2 How to compute `unknown_id` deterministically

The Unknowns schema suggests:

* `unknown_id` is derived from sha256 over `(type + scope + evidence)` ([Gitea: Git with a cup of tea][2])

Do (a sketch follows):

* stable JSON canonicalization of `scope` + `unknown_type` + primary evidence fields
* sha256
* prefix with `unk:sha256:<...>`
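
A minimal sketch of that derivation, assuming sorted-key canonicalization is acceptable (a production implementation might use RFC 8785 JSON canonicalization; the field names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

static class UnknownIds
{
    // Sorted dictionaries give deterministic key order; the default serializer
    // emits no whitespace, so equal inputs yield byte-identical JSON.
    public static string Compute(
        string unknownType,
        SortedDictionary<string, string> scope,
        SortedDictionary<string, string> evidence)
    {
        var canonical = JsonSerializer.Serialize(new { type = unknownType, scope, evidence });
        var digest = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonical)))
                            .ToLowerInvariant();
        return $"unk:sha256:{digest}";
    }
}
```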
This guarantees idempotent ingestion behavior (`POST /unknowns/ingest` upsert). ([Gitea: Git with a cup of tea][2])

---

## 8) Packaging as a StellaOps plug-in (so ops can upgrade it offline)

### 8.1 Plug-in manifest

Scanner plug-ins use a `manifest.json` with `schemaVersion`, `id`, `entryPoint` (dotnet assembly + typeName), etc. ([Gitea: Git with a cup of tea][7])

Create something like:

```json
{
  "schemaVersion": "1.0",
  "id": "stellaops.analyzer.native.binarymap",
  "displayName": "StellaOps Native Binary Mapper",
  "version": "0.1.0",
  "requiresRestart": true,
  "entryPoint": {
    "type": "dotnet",
    "assembly": "StellaOps.Scanner.Analyzers.Native.BinaryMap.dll",
    "typeName": "StellaOps.Scanner.Analyzers.Native.BinaryMap.BinaryMapPlugin"
  },
  "capabilities": [
    "native-analyzer",
    "binary-mapper",
    "elf",
    "pe",
    "macho"
  ],
  "metadata": {
    "org.stellaops.analyzer.kind": "native",
    "org.stellaops.restart.required": "true"
  }
}
```

### 8.2 Worker loading

Mirror the pattern in `CompositeScanAnalyzerDispatcher`:

* add a catalog `INativeAnalyzerPluginCatalog`
* default directory: `plugins/scanner/analyzers/native`
* load directories with the same “seal last directory” behavior ([Gitea: Git with a cup of tea][4])

---
## 9) Tests and performance gates (what “done” looks like)

StellaOps has determinism tests and golden fixtures for analyzers; follow that style. ([Gitea: Git with a cup of tea][6])

### 9.1 Determinism tests

Create fixtures with:

* the same binaries in different file order
* the same binaries hardlinked/symlinked
* a stripped ELF missing its build-id
* multi-arch variants

Assert (a test sketch follows):

* mapping output JSON is byte-for-byte stable
* unknown ids are stable
* candidate ordering is stable
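
An illustrative xunit-style shape for the byte-for-byte check (`RunMapperAndSerialize` is a stand-in you would wire to your pipeline; the fixture path is hypothetical):

```csharp
using System;
using Xunit;

public class BinaryMappingDeterminismTests
{
    [Fact]
    public void SameFixture_ProducesByteIdenticalOutput()
    {
        const string fixture = "fixtures/stripped-elf-no-buildid"; // hypothetical fixture
        byte[] first = RunMapperAndSerialize(fixture);
        byte[] second = RunMapperAndSerialize(fixture);
        Assert.Equal(first, second); // byte-for-byte stable
    }

    private static byte[] RunMapperAndSerialize(string fixture) =>
        throw new NotImplementedException("wire to the mapping pipeline under test");
}
```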
### 9.2 “No fuzzy identity” guardrail tests

Add tests that assert:

* the heuristic resolver never emits a `Resolved` verdict unless a feature flag is enabled
* ambiguous candidates never auto-select a winner

### 9.3 Performance budgets

For ops, you care about scan wall time. Adopt budgets like:

* identity extraction < 25 ms per binary (native parsing)
* mapping lookup O(1) per binary (frozen dict) or O(log n) with sorted arrays
* index load time bounded (lazy load at worker start)

Track metrics:

* count resolved per layer
* count ambiguous/unresolved
* unknown density (ties into Unknowns Registry scoring later) ([Gitea: Git with a cup of tea][2])

---

## 10) Practical “ops” workflow: how to keep improving mapping safely

### 10.1 Add a feedback loop from Unknowns → index builder

Unknowns are your backlog:

* “missing build-id”
* “ambiguous mapping”
* “hash seen but not in index”

For each Unknown:

1. decide if it should be mapped in core (needs an authoritative source)
2. if yes: add the reference artifact to your **index builder input set**
3. rebuild the BinaryMap index
4. ship via a signed Offline Kit update

### 10.2 Don’t let your index silently drift

Because determinism matters, treat the BinaryMap index like a feed:

* version it (`v1`, `v2`)
* sign it
* store the index digest in scan evidence

That way you can explain: “This binary was mapped using binary-map/v1 digest XYZ”.

---

## 11) Minimal implementation checklist (if you want the shortest path to value)

If you only do three things, do these:

1. **Build‑ID extraction everywhere** (ELF/Mach‑O/PE), always stored in evidence
   (also emit an Unknown when it’s missing, as StellaOps expects) ([Gitea: Git with a cup of tea][8])

2. **Offline Build‑ID → PURL index** shipped in the Offline Kit
   (fits perfectly with the existing debug-store + kit pattern) ([Stella Ops][3])

3. **Deterministic resolver pipeline + Unknowns emission**
   (so you improve mapping without introducing silent risk) ([Gitea: Git with a cup of tea][2])

---

If you tell me whether your main pain is **distroless**, **FROM scratch**, or **vendor‑bundled runtimes** (chromium/node/openssl/etc.), I can give you the best “Layer 1 index builder” recipe for that category (what to use as authoritative sources and how to avoid collisions) — but the plan above is already safe and implementable without further assumptions.

[1]: https://stella-ops.org/docs/modules/scanner/architecture/ "Stella Ops – Signed Reachability · Deterministic Replay · Sovereign Crypto"
[2]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d519782a8f0b30f425c9b6ae0f316b19259972a2/docs/signals/unknowns-registry.md "git.stella-ops.org/unknowns-registry.md at d519782a8f0b30f425c9b6ae0f316b19259972a2"
[3]: https://stella-ops.org/docs/24_offline_kit/index.html "Stella Ops – Signed Reachability · Deterministic Replay · Sovereign Crypto"
[4]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/18f28168f022c73736bfd29033c71daef5e11044/src/Scanner/StellaOps.Scanner.Worker/Processing/CompositeScanAnalyzerDispatcher.cs "git.stella-ops.org/CompositeScanAnalyzerDispatcher.cs at 18f28168f022c73736bfd29033c71daef5e11044"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/8d78dd219b5e44c835e511491a4750f4a3ee3640/vendor/manifest.json "git.stella-ops.org/manifest.json at 8d78dd219b5e44c835e511491a4750f4a3ee3640"
[6]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/bc0762e97d251723854b9c4e482b218c8efb1e04/docs/modules/scanner "git.stella-ops.org/scanner at bc0762e97d251723854b9c4e482b218c8efb1e04"
[7]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/c37722993137dac4b3a4104045826ca33b9dc289/plugins/scanner/analyzers/lang/StellaOps.Scanner.Analyzers.Lang.Go/manifest.json "git.stella-ops.org/manifest.json at c37722993137dac4b3a4104045826ca33b9dc289"
[8]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d519782a8f0b30f425c9b6ae0f316b19259972a2/docs/reachability/evidence-schema.md "git.stella-ops.org/evidence-schema.md at d519782a8f0b30f425c9b6ae0f316b19259972a2"