Here are two practical ways to make your software supply‑chain evidence both *useful* and *verifiable*—with enough background to get you shipping.

---

# 1) Binary SBOMs that still work when there’s no package manager

**Why this matters:** Container images built `FROM scratch` or “distroless” often lack package metadata, so typical SBOMs go blank. A *binary SBOM* extracts facts directly from executables—so you still know “what’s inside,” even in bare images.

**Core idea (plain English):**

* Parse binaries (ELF on Linux, PE on Windows, Mach‑O on macOS).
* Record file paths, cryptographic hashes, import tables, compiler/linker hints, and for ELF also the `.note.gnu.build-id` (a unique ID most linkers embed).
* Map these fingerprints to known packages/versions (vendor fingerprints, distro databases, your own allowlists).
* Sign the result as an attestation so others can trust it without re‑running your scanner.

**Minimal pipeline sketch:**

* **Extract:** `readelf -n` (ELF notes), `objdump`/`otool` for imports; compute SHA‑256 for every binary.
* **Normalize:** Emit CycloneDX or SPDX components for *binaries*, not just packages.
* **Map:** Use Build‑ID → package hints (e.g., glibc, OpenSSL), symbol/version patterns, and path heuristics.
* **Attest:** Wrap the SBOM in DSSE + in‑toto and push to your registry alongside the image digest.

**Pragmatic spec for developers:**

* Inputs: OCI image digest.
* Outputs:

  * `binary-sbom.cdx.json` (CycloneDX) or `binary-sbom.spdx.json`.
  * `attestation.intoto.jsonl` (DSSE envelope referencing the SBOM’s SHA‑256 and the *image digest*).
* Data fields to capture per artifact:

  * `algorithm: sha256`, `digest: <hex>`, `type: elf|pe|macho`, `path`, `size`,
  * `elf.build_id` (if present), `imports[]`, `compiler[]`, `arch`, `endian`.
* Verification:

  * `cosign verify-attestation --type sbom <image>@<image-digest> ...`

**Why the ELF Build‑ID is gold:** it’s a stable, linker‑emitted identifier that helps correlate stripped binaries to upstream packages—critical when filenames and symbols lie.
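
A minimal sketch of the extract step in C#, assuming binutils’ `readelf` is on `PATH`; the helper name is hypothetical:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Fingerprint one ELF binary: SHA-256 of the file bytes plus the GNU Build-ID
// as printed by `readelf -n` (the ".note.gnu.build-id" note).
static (string Sha256Hex, string? BuildId) FingerprintElf(string path)
{
    using var stream = File.OpenRead(path);
    string sha = Convert.ToHexString(SHA256.HashData(stream)).ToLowerInvariant();

    var psi = new ProcessStartInfo("readelf", $"-n \"{path}\"") { RedirectStandardOutput = true };
    using var proc = Process.Start(psi)!;
    string output = proc.StandardOutput.ReadToEnd();
    proc.WaitForExit();

    // readelf prints a line like "    Build ID: 2f3a9c..." when the note exists.
    string? buildId = output.Split('\n')
        .Select(line => line.Trim())
        .Where(line => line.StartsWith("Build ID:", StringComparison.Ordinal))
        .Select(line => line["Build ID:".Length..].Trim())
        .FirstOrDefault();

    return (sha, buildId);
}
```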

---

# 2) Reachability analysis so you only page people for *real* risk

**Why this matters:** Not every CVE in your deps can actually be hit by your app. If you can show “no call path reaches the vulnerable sink,” you can *de‑noise* alerts and ship faster.

**Core idea (plain English):**

* Build an *interprocedural call graph* of your app (across modules/packages).
* Mark known “sinks” from vulnerability advisories (e.g., dangerous API + version range).
* Compute graph reachability from your entrypoints (HTTP handlers, CLI `main`, background jobs).
* The intersection {reachable nodes} ∩ {vulnerable sinks} = “actionable” findings.
* Emit a signed *witness* (attestation) that states which sinks are reachable/unreachable and why.

**Minimal pipeline sketch:**

* **Ingest code/bytecode:** language‑specific frontends (e.g., .NET IL, JVM bytecode, Python AST, Go SSA).
* **Build graph:** nodes = functions/methods; edges = call sites (include dynamic edges conservatively).
* **Mark entrypoints:** web routes, message handlers, cron jobs, exported CLIs.
* **Mark sinks:** from your vuln DB (API signature + version).
* **Decide:** run graph search from entrypoints → is any sink reachable? (a minimal sketch follows this list)
* **Attest:** DSSE witness with:

  * artifact digest (commit SHA / image digest),
  * tool version + rule set hash,
  * list of reachable sinks with at least one example call path,
  * list of *proven* unreachable sinks (under stated assumptions).
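
A minimal sketch of the decide step, assuming the call graph is an integer adjacency list; node ids, the graph, and the sink set are placeholders:

```csharp
using System.Collections.Generic;

// Forward BFS from all entrypoints; returns one example path to the first
// vulnerable sink reached, or null if no sink is reachable.
static List<int>? FindPathToSink(
    IReadOnlyDictionary<int, int[]> callGraph,
    IEnumerable<int> entrypoints,
    IReadOnlySet<int> sinks)
{
    var parent = new Dictionary<int, int>(); // child -> predecessor, for path reconstruction
    var queue = new Queue<int>();
    foreach (var entry in entrypoints) { parent[entry] = entry; queue.Enqueue(entry); }

    while (queue.Count > 0)
    {
        int node = queue.Dequeue();
        if (sinks.Contains(node))
        {
            var path = new List<int> { node }; // walk predecessors back to the entrypoint
            while (parent[node] != node) { node = parent[node]; path.Add(node); }
            path.Reverse();
            return path;
        }
        if (!callGraph.TryGetValue(node, out var callees)) continue;
        foreach (var callee in callees)
            if (parent.TryAdd(callee, node)) queue.Enqueue(callee); // first visit wins => shortest path
    }
    return null;
}
```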

**Developer contract (portable across languages):**

* Inputs: source/bytecode zip + manifest of entrypoints.
* Outputs:

  * `reachability.witness.json` (DSSE envelope),
  * optional `paths/` folder with top‑N call paths as compact JSON (for UX rendering).
* Verification:

  * Recompute call graph deterministically given the same inputs + tool version,
  * `cosign verify-attestation --type reachability ...`

---

# How these two pieces fit together

* **Binary SBOM** = “What exactly is in the artifact?” (even in bare images)
* **Reachability witness** = “Which vulns actually matter to *this* app build?”
* Sign both as **DSSE/in‑toto attestations** and attach to the image/release. Your CI can enforce (sketched after this list):

  * “Block if high‑severity + *reachable*,”
  * “Warn (don’t block) if high‑severity but *unreachable* with a fresh witness.”
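
A tiny sketch of that gate; the `Finding` shape is a placeholder, and “fresh” means the witness was issued for the current artifact digest:

```csharp
// CI policy: block on reachable high/critical, downgrade to a warning when a
// fresh unreachability witness exists, block when unreachability is unproven.
record Finding(string Severity, bool Reachable, bool HasFreshWitness);

static string Gate(Finding f) => f switch
{
    { Severity: not ("critical" or "high") } => "pass",
    { Reachable: false, HasFreshWitness: true } => "warn",
    _ => "block" // reachable, or unreachability not backed by a fresh witness
};
```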

---

# Quick starter checklist (copy/paste to a task board)

* [ ] Binary extractors: ELF/PE/Mach‑O parsers; hash & Build‑ID capture.
* [ ] Mapping rules: Build‑ID → known package DB; symbol/version heuristics.
* [ ] Emit CycloneDX/SPDX; add file‑level components for binaries.
* [ ] DSSE signing and `cosign`/`rekor` publish for SBOM attestation.
* [ ] Language frontends for reachability (pick your top 1–2 first).
* [ ] Call‑graph builder + entrypoint detector.
* [ ] Sink catalog normalizer (map CVE → API signature).
* [ ] Reachability engine + example path extractor.
* [ ] DSSE witness for reachability; attach to build.
* [ ] CI policy: block on “reachable high/critical”; surface paths in UI.

If you want, I can turn this into concrete .NET‑first tasks with sample code scaffolds and a tiny demo repo that builds an image, extracts a binary SBOM, runs reachability on a toy service, and emits both attestations.

---

Below is a concrete, “do‑this‑then‑this” implementation plan for a **layered binary→PURL mapping system** that fits StellaOps’ constraints: **offline**, **deterministic**, **SBOM‑first**, and with **unknowns recorded instead of guessing**.

I’m going to assume your target is the common pain case StellaOps itself calls out: when package metadata is missing, Scanner falls back to binary identity (`bin:{sha256}`) and you want to deterministically “lift” those binaries into stable package identities (PURLs) without turning the core SBOM into fuzzy guesswork. StellaOps’ own Scanner docs emphasize **deterministic analyzers**, **no fuzzy identity in core**, and keeping heuristics as opt‑in add‑ons. ([Stella Ops][1])

---

## 0) What “binary mapping” means in StellaOps terms

In Scanner’s architecture, the **component key** is:

* **PURL when present**
* otherwise `bin:{sha256}` ([Stella Ops][1])

So “better binary mapping” = systematically converting more of those `bin:*` components into **PURLs** (or at least producing **actionable mapping evidence + Unknowns**) while preserving:

* deterministic replay (same inputs ⇒ same output)
* offline operation (air‑gapped kits)
* policy safety (don’t hide false negatives behind fuzzy IDs)

Also, StellaOps already has the concept of “gaps” being first‑class via the **Unknowns Registry** (identity gaps, missing build‑id, version conflicts, missing edges, etc.). ([Gitea: Git with a cup of tea][2]) Your binary mapping work should *feed* this system.

---

## 1) Design constraints you must keep (or you’ll fight the platform)

### 1.1 Determinism rules

StellaOps’ Scanner architecture is explicit: core analyzers are deterministic; heuristic plug‑ins must not contaminate the core SBOM unless explicitly enabled. ([Stella Ops][1])

That implies:

* **No probabilistic “best guess” PURL** in the default mapping path.
* If you do fuzzy inference, it must be emitted as:

  * “hints” attached to Unknowns, or
  * a separate heuristic artifact gated by flags.

### 1.2 Offline kit + debug store is already a hook you can exploit

Offline kits already bundle:

* scanner plug‑ins (OS + language analyzers packaged under `plugins/scanner/analyzers/**`)
* a **debug store** layout: `debug/.build-id/<aa>/<rest>.debug`
* a `debug-manifest.json` that maps build‑ids → originating images (for symbol retrieval) ([Stella Ops][3])

This is perfect for building a **Build‑ID→PURL index** that remains offline and signed.

### 1.3 Scanner Worker already loads analyzers via directory catalogs

The Worker loads OS and language analyzer plug‑ins from default directories (unless overridden), using deterministic directory normalization and a “seal” concept on the last directory. ([Gitea: Git with a cup of tea][4])

So you can add a third catalog for **native/binary mapping** that behaves the same way.

---

## 2) Layering strategy: what to implement (and in what order)

You want a **resolver pipeline** with strict ordering from “hard evidence” → “soft evidence”.

### Layer 0 — In‑image authoritative mapping (highest confidence)

These sources are authoritative because they come from within the artifact:

1. **OS package DB present** (dpkg/rpm/apk):

   * Map `path → package` using file ownership lists.
   * If you can also compute file hashes/build‑ids, store them as evidence.

2. **Language ecosystem metadata present** (already handled by language analyzers):

   * For example, a Python wheel RECORD or a Go buildinfo section can directly imply module versions.

**Decision rule**: If a binary file is owned by an OS package, **prefer that** over any external mapping index.

### Layer 1 — “Build provenance” mapping via build IDs / UUIDs (strong, portable)

When package DB is missing (distroless/scratch), use **compiler/linker stable IDs**:

* ELF: `.note.gnu.build-id`
* Mach‑O: `LC_UUID`
* PE: CodeView (PDB GUID+Age) / build signature

This should be your primary fallback because it survives stripping and renaming.

### Layer 2 — Hash mapping for curated or vendor‑pinned binaries (strong but brittle across rebuilds)

Use SHA‑256 → PURL mapping when:

* binaries are redistributed unchanged (busybox, chromium, embedded runtimes)
* you maintain a curated “known binaries” manifest

StellaOps already has “curated binary manifest generation” mentioned in its repo history, and a `vendor/manifest.json` concept exists (for pinned artifacts / binaries in the system). ([Gitea: Git with a cup of tea][5])
For your ops environment you’ll create a similar manifest **for your fleet**.

### Layer 3 — Dependency closure constraints (helpful as a disambiguator, not a primary mapper)

If the binary’s DT_NEEDED / imports point to libs you *can* identify, you can use that to disambiguate multiple possible candidates (“this openssl build‑id matches, but only one candidate has the required glibc baseline”).

This must remain deterministic and rules‑based.

### Layer 4 — Heuristic hints (never change the core SBOM by default)

Examples:

* symbol version patterns (`GLIBC_2.28`, etc.)
* embedded version strings
* import tables
* compiler metadata

These produce **Unknown evidence/hints**, not a resolved identity, unless a special “heuristics allowed” flag is turned on.

### Layer 5 — Unknowns Registry output (mandatory when you can’t decide)

If a mapping can’t be made decisively:

* emit Unknowns (identity_gap, missing_build_id, version_conflict, etc.) ([Gitea: Git with a cup of tea][2])

This is not optional; it’s how you prevent silent false negatives.

---

## 3) Concrete data model you should implement

### 3.1 Binary identity record

Create a single canonical identity structure that *every layer* uses:

```csharp
public enum BinaryFormat { Elf, Pe, MachO, Unknown }

public sealed record BinaryIdentity(
    BinaryFormat Format,
    string Path,             // normalized (posix style), rooted at image root
    string Sha256,           // always present
    string? BuildId,         // ELF
    string? MachOUuid,       // Mach-O
    string? PeCodeViewGuid,  // PE/PDB
    string? Arch,            // amd64/arm64/...
    long SizeBytes
);
```

**Determinism tip**: normalize `Path` to a single separator and collapse `//`, `./`, etc.
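
A minimal sketch of that normalization, resolving `.`/`..` textually without touching the filesystem:

```csharp
using System.Collections.Generic;
using System.Linq;

// Deterministic path normalization for BinaryIdentity.Path: posix separators,
// duplicate separators and "." collapsed, ".." resolved lexically.
static string NormalizePath(string path)
{
    var stack = new Stack<string>();
    foreach (var segment in path.Replace('\\', '/').Split('/'))
    {
        if (segment.Length == 0 || segment == ".") continue;           // handles "//" and "./"
        if (segment == "..") { if (stack.Count > 0) stack.Pop(); continue; }
        stack.Push(segment);
    }
    return "/" + string.Join('/', stack.Reverse());                    // Reverse() restores insertion order
}

// NormalizePath("usr//lib/./x86_64-linux-gnu/../libssl.so.3") => "/usr/lib/libssl.so.3"
```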

### 3.2 Mapping candidate

Each resolver layer returns candidates like:

```csharp
public enum MappingVerdict { Resolved, Unresolved, Ambiguous }

public sealed record BinaryMappingCandidate(
    string Purl,
    double Confidence,                            // 0..1 but deterministic
    string ResolverId,                            // e.g. "os.fileowner", "buildid.index.v1"
    IReadOnlyList<string> Evidence,               // stable ordering
    IReadOnlyDictionary<string,string> Properties // stable ordering
);
```

### 3.3 Final mapping result

```csharp
public sealed record BinaryMappingResult(
    MappingVerdict Verdict,
    BinaryIdentity Subject,
    BinaryMappingCandidate? Winner,
    IReadOnlyList<BinaryMappingCandidate> Alternatives,
    string MappingIndexDigest // sha256 of index snapshot used (or "none")
);
```

---

## 4) Build the “Binary Map Index” that makes Layer 1 and 2 work offline

### 4.1 Where it lives in StellaOps

Put it in the Offline Kit as a signed artifact, next to other feeds and plug-ins. Offline kit packaging already includes plug-ins and a debug store with a deterministic layout. ([Stella Ops][3])

Recommended layout:

```
offline-kit/
  feeds/
    binary-map/
      v1/
        buildid.map.zst
        sha256.map.zst
        index.manifest.json
        index.manifest.json.sig   (DSSE or JWS, consistent with your kit)
```

### 4.2 Index record schema (v1)

Make each record explicit and replayable:

```json
{
  "schema": "stellaops.binary-map.v1",
  "records": [
    {
      "key": { "kind": "elf.build_id", "value": "2f3a..." },
      "purl": "pkg:deb/debian/openssl@3.0.11-1~deb12u2?arch=amd64",
      "evidence": {
        "source": "os.dpkg.fileowner",
        "source_image": "sha256:....",
        "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
        "package": "openssl",
        "package_version": "3.0.11-1~deb12u2"
      }
    }
  ]
}
```

Key points:

* `key.kind` is one of `elf.build_id`, `macho.uuid`, `pe.codeview`, `file.sha256`
* include evidence with enough detail to justify mapping

### 4.3 How to *generate* the index (deterministically)

You need an **offline index builder** pipeline. In StellaOps terms, this is best treated like a feed exporter step (build-time), then shipped in the Offline Kit.

**Input set options** (choose one or mix):

1. “Golden base images” list (your fleet’s base images)
2. Distro repositories mirrored into the airgap (Deb/RPM/APK archives)
3. Previously scanned images that are allowed into the kit

**Generation steps** (steps 6–7 are sketched after this list):

1. For each input image:

   * Extract rootfs in a deterministic path order.
   * Run OS analyzers (dpkg/rpm/apk) + native identity collection (ELF/PE/MachO).
2. Produce raw tuples:

   * `(build_id | uuid | codeview | sha256) → (purl, evidence)`
3. Deduplicate:

   * Canonicalize PURLs (normalize qualifiers order, lowercasing rules).
   * If the same key maps to **multiple distinct PURLs**, keep them all and mark as conflict (do not pick one).
4. Sort:

   * Sort by `(key.kind, key.value, purl)` lexicographically.
5. Serialize:

   * Emit line‑delimited JSON or a simple binary format.
   * Compress (zstd).
6. Compute digests:

   * `sha256` of each artifact.
   * `sha256` of concatenated `(artifact name + sha)` for a manifest hash.
7. Sign:

   * include in kit manifest and sign with the same process you use for other offline kit elements. Offline kit import in StellaOps validates digests and signatures. ([Stella Ops][3])
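
A minimal sketch of the digest step, assuming the artifacts are files on disk; naming is illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Per-artifact sha256 digests plus one manifest hash over the ordinal-sorted
// "(artifact name, sha)" pairs, so the result is independent of input order.
static string ComputeManifestHash(IEnumerable<string> artifactPaths)
{
    var entries = artifactPaths
        .Select(p => (Name: Path.GetFileName(p),
                      Sha: Convert.ToHexString(SHA256.HashData(File.ReadAllBytes(p))).ToLowerInvariant()))
        .OrderBy(e => e.Name, StringComparer.Ordinal)
        .ToList();

    var concatenated = new StringBuilder();
    foreach (var (name, sha) in entries)
        concatenated.Append(name).Append('\n').Append(sha).Append('\n');

    return Convert.ToHexString(
        SHA256.HashData(Encoding.UTF8.GetBytes(concatenated.ToString()))).ToLowerInvariant();
}
```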

---

## 5) Runtime side: implement the layered resolver in Scanner Worker

### 5.1 Where to hook in

You want this to run after OS + language analyzers have produced fragments, and after native identity collection has produced binary identities.

Scanner Worker already executes analyzers and appends fragments to `context.Analysis`. ([Gitea: Git with a cup of tea][4])

Scanner module responsibilities explicitly include OS, language, and native ecosystems as restart-only plug-ins. ([Gitea: Git with a cup of tea][6])
So implement binary mapping as either:

* part of the **native ecosystem analyzer output stage**, or
* a **post-analyzer enrichment stage** that runs before SBOM composition.

I recommend: **post-analyzer enrichment stage**, because it can consult OS+lang analyzer results and unify decisions.

### 5.2 Add a new ScanAnalysis key

Store collected binary identities in analysis:

* `ScanAnalysisKeys.NativeBinaryIdentities` → `ImmutableArray<BinaryIdentity>`

And store mapping results:

* `ScanAnalysisKeys.NativeBinaryMappings` → `ImmutableArray<BinaryMappingResult>`

### 5.3 Implement the resolver pipeline (deterministic ordering)

```csharp
public interface IBinaryMappingResolver
{
    string Id { get; }    // stable ID
    int Order { get; }    // deterministic
    BinaryMappingCandidate? TryResolve(BinaryIdentity identity, MappingContext ctx);
}
```

Pipeline (the decision step is sketched after this list):

1. Sort resolvers by `(Order, Id)` (Ordinal comparison).
2. For each resolver:

   * if it returns a candidate, add it to the candidates list.
   * if the resolver is “authoritative” (Layer 0), you can short‑circuit on first hit.
3. Decide:

   * If 0 candidates ⇒ `Unresolved`
   * If 1 candidate ⇒ `Resolved`
   * If >1:

     * If candidates have different PURLs ⇒ `Ambiguous` unless a deterministic “dominates” rule exists
     * If candidates have the same PURL (from multiple sources) ⇒ merge evidence
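
A minimal sketch of that decision using the records from §3; the ordinal tie-break on `ResolverId` is an illustrative deterministic rule:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static BinaryMappingResult Decide(
    BinaryIdentity subject,
    IReadOnlyList<BinaryMappingCandidate> candidates,
    string indexDigest)
{
    if (candidates.Count == 0)
        return new(MappingVerdict.Unresolved, subject, null,
                   Array.Empty<BinaryMappingCandidate>(), indexDigest);

    // Different PURLs: surface the ambiguity, never auto-pick by confidence.
    if (candidates.Select(c => c.Purl).Distinct(StringComparer.Ordinal).Count() > 1)
        return new(MappingVerdict.Ambiguous, subject, null, candidates, indexDigest);

    // Same PURL from one or more resolvers: deterministic winner, rest kept as evidence.
    var ordered = candidates.OrderBy(c => c.ResolverId, StringComparer.Ordinal).ToList();
    return new(MappingVerdict.Resolved, subject, ordered[0], ordered.Skip(1).ToList(), indexDigest);
}
```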

### 5.4 Implement each layer as a resolver

#### Resolver A: OS file owner (Layer 0)

Inputs:

* OS analyzer results in `context.Analysis` (they’re already stored in `ScanAnalysisKeys.OsPackageAnalyzers`). ([Gitea: Git with a cup of tea][4])
* You need OS analyzers to expose file ownership mapping.

Implementation options:

* Extend OS analyzers to produce `path → packageId` maps.
* Or load that from dpkg/rpm DB at mapping time (fast enough if you only query per binary path).

Candidate:

* `Purl = pkg:<ecosystem>/<name>@<version>?arch=...`
* Confidence = `1.0`
* Evidence includes:

  * analyzer id
  * package name/version
  * file path

#### Resolver B: Build‑ID index (Layer 1)

Inputs:

* `identity.BuildId` (or uuid/codeview)
* `BinaryMapIndex` loaded from Offline Kit `feeds/binary-map/v1/buildid.map.zst`

Implementation:

* On worker startup: load and parse index into an immutable structure:

  * `FrozenDictionary<string, BuildIdEntry[]>` (or sorted arrays + binary search)
* If key maps to multiple PURLs:

  * return multiple candidates (same resolver id), forcing `Ambiguous` verdict upstream

Candidate:

* Confidence = `0.95` (still deterministic)
* Evidence includes index manifest digest + record evidence

#### Resolver C: SHA‑256 index (Layer 2)

Inputs:

* `identity.Sha256`
* `feeds/binary-map/v1/sha256.map.zst` OR your ops “curated binaries” manifest

Candidate:

* Confidence:

  * `0.9` if from signed curated manifest
  * `0.7` if from “observed in previous scan cache” (I’d avoid this unless you version and sign the cache)

#### Resolver D: Dependency closure constraints (Layer 3)

Only run if you have native dependency parsing output (DT_NEEDED / imports). The resolver does **not** return a mapping on its own; instead, it can:

* bump confidence for existing candidates
* or rule out candidates deterministically (e.g., glibc baseline mismatch)

Make this a “candidate rewriter” stage:

```csharp
public interface ICandidateRefiner
{
    string Id { get; }
    int Order { get; }
    IReadOnlyList<BinaryMappingCandidate> Refine(BinaryIdentity id, IReadOnlyList<BinaryMappingCandidate> cands, MappingContext ctx);
}
```

#### Resolver E: Heuristic hints (Layer 4)

Never resolves to a PURL by default. It just produces Unknown evidence payload:

* extracted strings (“OpenSSL 3.0.11”)
* imported symbol names
* SONAME
* symbol version requirements

---

## 6) SBOM composition behavior: how to “lift” bin components safely

### 6.1 Don’t break the component key rules

Scanner uses:

* key = PURL when present, else `bin:{sha256}` ([Stella Ops][1])

When you resolve a binary identity to a PURL, you have two clean options:

**Option 1 (recommended): replace the component key with the PURL**

* This makes downstream policy/advisory matching work naturally.
* It’s deterministic as long as the mapping index is versioned and shipped with the kit.

**Option 2: keep `bin:{sha256}` as the component key and attach `resolved_purl`**

* Lower disruption to diffing, but policy now has to understand the “resolved_purl” field.
* If StellaOps policy assumes `component.purl` is the canonical key, this will cause pain.

Given StellaOps emphasizes PURLs as the canonical key for identity, I’d implement **Option 1**, but record robust evidence + index digest.

### 6.2 Preserve file-level evidence

Even after lifting to PURL, keep evidence that ties the package identity back to file bytes:

* file path(s)
* sha256
* build-id/uuid
* mapping resolver id + index digest

This is what makes attestations verifiable and helps operators debug.

---

## 7) Unknowns integration: emit Unknowns whenever mapping isn’t decisive

The Unknowns Registry exists precisely for “unresolved symbol → package mapping”, “missing build-id”, “ambiguous purl”, etc. ([Gitea: Git with a cup of tea][2])

### 7.1 When to emit Unknowns

Emit Unknowns for:

1. `identity.BuildId == null` for ELF

   * `unknown_type = missing_build_id`
   * evidence: “ELF missing .note.gnu.build-id; using sha256 only”

2. Multiple candidates with different PURLs

   * `unknown_type = version_conflict` (or `identity_gap`)
   * evidence: list candidates + their evidence

3. Heuristic hints found but no authoritative mapping

   * `unknown_type = identity_gap`
   * evidence: imported symbols, strings, SONAME

### 7.2 How to compute `unknown_id` deterministically

Unknowns schema suggests:

* `unknown_id` is derived from sha256 over `(type + scope + evidence)` ([Gitea: Git with a cup of tea][2])

Do:

* stable JSON canonicalization of `scope` + `unknown_type` + `primary evidence fields`
* sha256
* prefix with `unk:sha256:<...>`

This guarantees idempotent ingestion behavior (`POST /unknowns/ingest` upsert). ([Gitea: Git with a cup of tea][2])
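
A minimal sketch of that derivation; the exact canonicalization the Unknowns schema mandates may differ, so treat the field layout here as an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text.Json;

// Deterministic unknown_id: canonical JSON (sorted keys, no extra whitespace)
// over type + scope + primary evidence, hashed with sha256.
static string ComputeUnknownId(string unknownType, string scope, SortedDictionary<string, string> evidence)
{
    var canonical = JsonSerializer.SerializeToUtf8Bytes(new SortedDictionary<string, object>
    {
        ["evidence"] = evidence,   // SortedDictionary gives a stable key order
        ["scope"] = scope,
        ["type"] = unknownType,
    });
    return "unk:sha256:" + Convert.ToHexString(SHA256.HashData(canonical)).ToLowerInvariant();
}
```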

---

## 8) Packaging as a StellaOps plug-in (so ops can upgrade it offline)

### 8.1 Plug-in manifest

Scanner plug-ins use a `manifest.json` with `schemaVersion`, `id`, `entryPoint` (dotnet assembly + typeName), etc. ([Gitea: Git with a cup of tea][7])

Create something like:

```json
{
  "schemaVersion": "1.0",
  "id": "stellaops.analyzer.native.binarymap",
  "displayName": "StellaOps Native Binary Mapper",
  "version": "0.1.0",
  "requiresRestart": true,
  "entryPoint": {
    "type": "dotnet",
    "assembly": "StellaOps.Scanner.Analyzers.Native.BinaryMap.dll",
    "typeName": "StellaOps.Scanner.Analyzers.Native.BinaryMap.BinaryMapPlugin"
  },
  "capabilities": [
    "native-analyzer",
    "binary-mapper",
    "elf",
    "pe",
    "macho"
  ],
  "metadata": {
    "org.stellaops.analyzer.kind": "native",
    "org.stellaops.restart.required": "true"
  }
}
```

### 8.2 Worker loading

Mirror the pattern in `CompositeScanAnalyzerDispatcher`:

* add a catalog `INativeAnalyzerPluginCatalog`
* default directory: `plugins/scanner/analyzers/native`
* load directories with the same “seal last directory” behavior ([Gitea: Git with a cup of tea][4])

---

## 9) Tests and performance gates (what “done” looks like)

StellaOps has determinism tests and golden fixtures for analyzers; follow that style. ([Gitea: Git with a cup of tea][6])

### 9.1 Determinism tests

Create fixtures with:

* same binaries in different file order
* same binaries hardlinked/symlinked
* stripped ELF missing build-id
* multi-arch variants

Assert:

* mapping output JSON byte-for-byte stable
* unknown ids stable
* candidate ordering stable

### 9.2 “No fuzzy identity” guardrail tests

Add tests that:

* heuristic resolver never emits a `Resolved` verdict unless a feature flag is enabled
* ambiguous candidates never auto-select a winner

### 9.3 Performance budgets

For ops, you care about scan wall time. Adopt budgets like:

* identity extraction < 25ms / binary (native parsing)
* mapping lookup O(1) / binary (frozen dict) or O(log n) with sorted arrays
* index load time bounded (lazy load per worker start)

Track metrics:

* count resolved per layer
* count ambiguous/unresolved
* unknown density (ties into Unknowns Registry scoring later) ([Gitea: Git with a cup of tea][2])

---

## 10) Practical “ops” workflow: how to keep improving mapping safely

### 10.1 Add a feedback loop from Unknowns → index builder

Unknowns are your backlog:

* “missing build-id”
* “ambiguous mapping”
* “hash seen but not in index”

For each Unknown:

1. decide if it should be mapped in core (needs authoritative source)
2. if yes: add reference artifact to your **index builder input set**
3. rebuild the BinaryMap index
4. ship via Offline Kit update (signed)

### 10.2 Don’t let your index silently drift

Because determinism matters, treat the BinaryMap index like a feed:

* version it (`v1`, `v2`)
* sign it
* store index digest in scan evidence

That way you can explain: “This binary was mapped using binary-map/v1 digest XYZ”.

---

## 11) Minimal implementation checklist (if you want the shortest path to value)

If you only do 3 things, do these:

1. **Build‑ID extraction everywhere** (ELF/Mach‑O/PE) and always store it in evidence
   (also emit Unknown when missing, as StellaOps expects) ([Gitea: Git with a cup of tea][8])

2. **Offline Build‑ID → PURL index** shipped in Offline Kit
   (fits perfectly with the existing debug-store + kit pattern) ([Stella Ops][3])

3. **Deterministic resolver pipeline + Unknowns emission**
   (so you improve mapping without introducing silent risk) ([Gitea: Git with a cup of tea][2])

---

If you tell me whether your main pain is **distroless**, **FROM scratch**, or **vendor‑bundled runtimes** (chromium/node/openssl/etc.), I can give you the best “Layer 1 index builder” recipe for that category (what to use as authoritative sources and how to avoid collisions) — but the plan above is already safe and implementable without further assumptions.

[1]: https://stella-ops.org/docs/modules/scanner/architecture/ "Stella Ops – Signed Reachability · Deterministic Replay · Sovereign Crypto"
[2]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d519782a8f0b30f425c9b6ae0f316b19259972a2/docs/signals/unknowns-registry.md "git.stella-ops.org/unknowns-registry.md at d519782a8f0b30f425c9b6ae0f316b19259972a2 - git.stella-ops.org - Gitea: Git with a cup of tea"
[3]: https://stella-ops.org/docs/24_offline_kit/index.html "Stella Ops – Signed Reachability · Deterministic Replay · Sovereign Crypto"
[4]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/18f28168f022c73736bfd29033c71daef5e11044/src/Scanner/StellaOps.Scanner.Worker/Processing/CompositeScanAnalyzerDispatcher.cs "git.stella-ops.org/CompositeScanAnalyzerDispatcher.cs at 18f28168f022c73736bfd29033c71daef5e11044 - git.stella-ops.org - Gitea: Git with a cup of tea"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/8d78dd219b5e44c835e511491a4750f4a3ee3640/vendor/manifest.json?utm_source=chatgpt.com "git.stella-ops.org/manifest.json at ..."
[6]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/bc0762e97d251723854b9c4e482b218c8efb1e04/docs/modules/scanner "git.stella-ops.org/scanner at bc0762e97d251723854b9c4e482b218c8efb1e04 - git.stella-ops.org - Gitea: Git with a cup of tea"
[7]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/c37722993137dac4b3a4104045826ca33b9dc289/plugins/scanner/analyzers/lang/StellaOps.Scanner.Analyzers.Lang.Go/manifest.json "git.stella-ops.org/manifest.json at c37722993137dac4b3a4104045826ca33b9dc289 - git.stella-ops.org - Gitea: Git with a cup of tea"
[8]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d519782a8f0b30f425c9b6ae0f316b19259972a2/docs/reachability/evidence-schema.md?utm_source=chatgpt.com "git.stella-ops.org/evidence-schema.md at ..."

---

# ARCHIVED ADVISORY

> **Status:** Archived
> **Archived Date:** 2025-12-18
> **Implementation Sprints:**
> - `SPRINT_3700_0001_0001_witness_foundation.md` - BLAKE3 + Witness Schema
> - `SPRINT_3700_0002_0001_vuln_surfaces_core.md` - Vuln Surface Builder
> - `SPRINT_3700_0003_0001_trigger_extraction.md` - Trigger Method Extraction
> - `SPRINT_3700_0004_0001_reachability_integration.md` - Reachability Integration
> - `SPRINT_3700_0005_0001_witness_ui_cli.md` - Witness UI/CLI
> - `SPRINT_3700_0006_0001_incremental_cache.md` - Incremental Cache
>
> **Gap Analysis:** See `C:\Users\vlindos\.claude\plans\lexical-knitting-map.md`

---

Here's a compact, practical way to add two high-leverage capabilities to your scanner, **DSSE-signed path witnesses** and **Smart-Diff × Reachability**, covering what they are, why they matter, and exactly how to implement them in Stella Ops without ceremony.

---

# 1) DSSE-signed path witnesses (entrypoint -> calls -> sink)

**What it is (in plain terms):**
When you flag a CVE as "reachable," also emit a tiny, human-readable proof: the **exact path** from a real entrypoint (e.g., HTTP route, CLI verb, cron) through functions/methods to the **vulnerable sink**. Wrap that proof in a **DSSE** envelope and sign it. Anyone can verify the witness later, offline, without rerunning analysis.

**Why it matters:**

* Turns red flags into **auditable evidence** (quiet-by-design).
* Lets CI/CD, auditors, and customers **verify** findings independently.
* Enables **deterministic replay** and provenance chains (ties nicely to in-toto/SLSA).

**Minimal JSON witness (stable, vendor-neutral):**

```json
{
  "witness_schema": "stellaops.witness.v1",
  "artifact": { "sbom_digest": "sha256:...", "component_purl": "pkg:nuget/Example@1.2.3" },
  "vuln": { "id": "CVE-2024-XXXX", "source": "NVD", "range": "<=1.2.3" },
  "entrypoint": { "kind": "http", "name": "GET /billing/pay" },
  "path": [
    {"symbol": "BillingController.Pay()", "file": "BillingController.cs", "line": 42},
    {"symbol": "PaymentsService.Authorize()", "file": "PaymentsService.cs", "line": 88},
    {"symbol": "LibXYZ.Parser.Parse()", "file": "Parser.cs", "line": 17}
  ],
  "sink": { "symbol": "LibXYZ.Parser.Parse()", "type": "deserialization" },
  "evidence": {
    "callgraph_digest": "sha256:...",
    "build_id": "dotnet:RID:linux-x64:sha256:...",
    "analysis_config_digest": "sha256:..."
  },
  "observed_at": "2025-12-18T00:00:00Z"
}
```

**Wrap in DSSE (payloadType & payload are required)**

```json
{
  "payloadType": "application/vnd.stellaops.witness+json",
  "payload": "base64(JSON_above)",
  "signatures": [{ "keyid": "attestor-stellaops-ed25519", "sig": "base64(...)" }]
}
```

**.NET 10 signing/verifying (Ed25519)**

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.Json;

// witnessJsonObj, keyId, and signer are ambient here; signer is whatever
// Ed25519 implementation you inject (System.Security.Cryptography has no
// built-in Ed25519 signer, so wrap NSec, BouncyCastle, or libsodium).
var payloadBytes = JsonSerializer.SerializeToUtf8Bytes(witnessJsonObj);
const string payloadType = "application/vnd.stellaops.witness+json";

var dsse = new {
    payloadType,
    payload = Convert.ToBase64String(payloadBytes),
    signatures = new[] { new { keyid = keyId, sig = Convert.ToBase64String(signer.Sign(Pae(payloadType, payloadBytes))) } }
};

// DSSE signs the PAE encoding, not the raw payload:
// "DSSEv1" SP LEN(type) SP type SP LEN(payload) SP payload
static byte[] Pae(string type, byte[] payload)
{
    var header = Encoding.UTF8.GetBytes($"DSSEv1 {Encoding.UTF8.GetByteCount(type)} {type} {payload.Length} ");
    return header.Concat(payload).ToArray();
}

public interface IEd25519Signer { byte[] Sign(byte[] data); }
```

**Where to emit:**

* **Scanner.Worker**: after reachability confirms `reachable=true`, emit witness -> **Attestor** signs -> **Authority** stores (Postgres) -> optional Rekor-style mirror.
* Expose `/witness/{findingId}` for download & independent verification.

---

# 2) Smart-Diff × Reachability (incremental, low-noise updates)

**What it is:**
On **SBOM/VEX/dependency** deltas, don't rescan everything. Update only **affected regions** of the call graph and recompute reachability **just for changed nodes/edges**.

**Why it matters:**

* **Order-of-magnitude faster** incremental scans.
* Fewer flaky diffs; triage stays focused on **meaningful risk change**.
* Perfect for PR gating: "what changed" -> "what became reachable/unreachable."

**Core idea (graph-reachability):**

* Maintain a per-service **call graph** `G = (V, E)` with **entrypoint set** `S`.
* On diff: compute the changed node/edge sets ΔV/ΔE.
* Run **incremental BFS/DFS** from impacted nodes to sinks (forward or backward), reusing memoized results.
* Recompute only **frontiers** touched by Δ.

**Minimal tables (Postgres):**

```sql
-- Nodes (functions/methods)
CREATE TABLE cg_nodes(
  id BIGSERIAL PRIMARY KEY,
  service TEXT, symbol TEXT, file TEXT, line INT,
  hash TEXT, UNIQUE(service, hash)
);
-- Edges (calls)
CREATE TABLE cg_edges(
  src BIGINT REFERENCES cg_nodes(id),
  dst BIGINT REFERENCES cg_nodes(id),
  kind TEXT, PRIMARY KEY(src, dst)
);
-- Entrypoints & Sinks
CREATE TABLE cg_entrypoints(node_id BIGINT REFERENCES cg_nodes(id) PRIMARY KEY);
CREATE TABLE cg_sinks(node_id BIGINT REFERENCES cg_nodes(id) PRIMARY KEY, sink_type TEXT);

-- Memoized reachability cache
CREATE TABLE cg_reach_cache(
  entry_id BIGINT, sink_id BIGINT,
  path JSONB, reachable BOOLEAN,
  updated_at TIMESTAMPTZ,
  PRIMARY KEY(entry_id, sink_id)
);
```

**Incremental algorithm (pseudocode):**

```text
Input: ΔSBOM, ΔDeps, ΔCode -> ΔNodes, ΔEdges
1) Apply Δ to cg_nodes/cg_edges
2) ImpactSet = neighbors(ΔNodes ∪ endpoints(ΔEdges))
3) For each e in Entrypoints ∩ ancestors(ImpactSet):
     Recompute forward search to affected sinks, stop early on unchanged subgraphs
     Update cg_reach_cache; if state flips, emit new/updated DSSE witness
```

**.NET 10 reachability sketch (fast & local):**

```csharp
// ComputeImpact, Intersect, Ancestors, BoundedReach, graph, and cache are the
// incremental-engine helpers described above, not a finished API.
HashSet<int> ImpactSet = ComputeImpact(deltaNodes, deltaEdges);
foreach (var e in Intersect(Entrypoints, Ancestors(ImpactSet)))
{
    // Recompute only entry->sink pairs whose subgraph was touched by the delta.
    var res = BoundedReach(e, affectedSinks, graph, cache);
    foreach (var r in res.Changed)
    {
        cache.Upsert(e, r.Sink, r.Path, r.Reachable);
        if (r.Reachable) EmitDsseWitness(e, r.Sink, r.Path);
    }
}
```

**CI/PR flow:**

1. Build -> SBOM diff -> Dependency diff -> Call-graph delta.
2. Run incremental reachability.
3. If any `unreachable->reachable` transitions: **fail gate**, attach DSSE witnesses.
4. If `reachable->unreachable`: auto-close prior findings (and archive prior witness).

---

# UX hooks (quick wins)

* In findings list, add a **"Show Witness"** button -> modal renders the signed path (entrypoint->...->sink) + **"Verify Signature"** one-click.
* In PR checks, summarize only **state flips** with tiny links: "+2 reachable (view witness)" / "-1 (now unreachable)".

---

# Minimal tasks to get this live

* **Scanner.Worker**: build call-graph extraction (per language), add incremental graph store, reachability cache.
* **Attestor**: DSSE signing endpoint + key management (Ed25519 by default; PQC mode later).
* **Authority**: tables above + witness storage + retrieval API.
* **Router/CI plugin**: PR annotation with **state flips** and links to witnesses.
* **UI**: witness modal + signature verify.

If you want, I can draft the exact Postgres migrations, the C# repositories, and a tiny verifier CLI that checks DSSE signatures and prints the call path.

---

Below is a concrete, buildable blueprint for an **advanced reachability analysis engine** inside Stella Ops. I'm going to assume your "Stella Ops" components are roughly:

* **Scanner.Worker**: runs analyses in CI / on artifacts
* **Authority**: stores graphs/findings/witnesses
* **Attestor**: signs DSSE envelopes (Ed25519)
* (optional) **SurfaceBuilder**: background worker that computes "vuln surfaces" for packages

The key advance is: **don't treat a CVE as "a package"**. Treat it as a **set of trigger methods** (public API) that can reach the vulnerable code inside the dependency, computed by "Smart-Diff" once and reused everywhere.

---

## 0) Define the contract (precision/soundness) up front

If you don't write this down, you'll fight false positives/negatives forever.

### What Stella Ops will guarantee (first release)

* **Whole-program static call graph** (app + selected dependency assemblies)
* **Context-insensitive** (fast), **path witness** extracted (shortest path)
* **Dynamic dispatch handled** with CHA/RTA (+ DI hints), with explicit uncertainty flags
* **Reflection handled best-effort** (constant-string resolution), otherwise "unknown edge"

### What it will NOT guarantee (first release)

* Perfect handling of reflection / `dynamic` / runtime codegen
* Perfect delegate/event resolution across complex flows
* Full taint/dataflow reachability (you can add later)

This is fine. The major value is: "**we can show you the call path**" and "**we can prove the vuln is triggered by calling these library APIs**".

---

## 1) The big idea: "Vuln surfaces" (Smart-Diff -> triggers)

### Problem

CVE feeds typically say "package X version range Y is vulnerable" but rarely say *which methods*. If you only do package-level reachability, noise is huge.

### Solution

For each CVE+package, compute a **vulnerability surface**:

* **Candidate sinks** = methods changed between vulnerable and fixed versions (diff at IL level)
* **Trigger methods** = *public/exported* methods in the vulnerable version that can reach those changed methods internally

Then your service scan becomes:

> "Can any entrypoint reach any trigger method?"

This is both faster and more precise.
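
A minimal sketch of the trigger computation: reverse BFS inside the library's own call graph, from changed (sink) methods back to the public surface. The graph shape and the public-method set are assumptions:

```csharp
using System;
using System.Collections.Generic;

// callersOf: callee MethodKey -> caller MethodKeys (reverse adjacency).
// Every public method that can transitively reach a changed method is a trigger.
static HashSet<string> ComputeTriggers(
    IReadOnlyDictionary<string, string[]> callersOf,
    IEnumerable<string> changedSinks,
    IReadOnlySet<string> publicMethods)
{
    var triggers = new HashSet<string>(StringComparer.Ordinal);
    var seen = new HashSet<string>(changedSinks, StringComparer.Ordinal);
    var queue = new Queue<string>(seen);

    while (queue.Count > 0)
    {
        var method = queue.Dequeue();
        if (publicMethods.Contains(method)) triggers.Add(method);
        if (!callersOf.TryGetValue(method, out var callers)) continue;
        foreach (var caller in callers)
            if (seen.Add(caller)) queue.Enqueue(caller);
    }
    return triggers;
}
```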

---

## 2) Data model (Authority / Postgres)

You already had call graph tables; here's a concrete schema that supports:

* graph snapshots
* incremental updates
* vuln surfaces
* reachability cache
* DSSE witnesses

### 2.1 Graph tables

```sql
CREATE TABLE cg_snapshots (
  snapshot_id BIGSERIAL PRIMARY KEY,
  service TEXT NOT NULL,
  build_id TEXT NOT NULL,
  graph_digest TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE(service, build_id)
);

CREATE TABLE cg_nodes (
  node_id BIGSERIAL PRIMARY KEY,
  snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
  method_key TEXT NOT NULL,       -- stable key (see below)
  asm_name TEXT,
  type_name TEXT,
  method_name TEXT,
  file_path TEXT,
  line_start INT,
  il_hash TEXT,                   -- normalized IL hash for diffing
  flags INT NOT NULL DEFAULT 0,   -- bitflags: has_reflection, compiler_generated, etc.
  UNIQUE(snapshot_id, method_key)
);

CREATE TABLE cg_edges (
  snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
  src_node_id BIGINT REFERENCES cg_nodes(node_id) ON DELETE CASCADE,
  dst_node_id BIGINT REFERENCES cg_nodes(node_id) ON DELETE CASCADE,
  kind SMALLINT NOT NULL,         -- 0=call,1=newobj,2=dispatch,3=delegate,4=reflection_guess,...
  PRIMARY KEY(snapshot_id, src_node_id, dst_node_id, kind)
);

CREATE TABLE cg_entrypoints (
  snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
  node_id BIGINT REFERENCES cg_nodes(node_id) ON DELETE CASCADE,
  kind TEXT NOT NULL,             -- http, grpc, cli, job, etc.
  name TEXT NOT NULL,             -- GET /foo, "Main", etc.
  PRIMARY KEY(snapshot_id, node_id, kind, name)
);
```

### 2.2 Vuln surface tables (Smart-Diff artifacts)

```sql
CREATE TABLE vuln_surfaces (
  surface_id BIGSERIAL PRIMARY KEY,
  ecosystem TEXT NOT NULL,        -- nuget
  package TEXT NOT NULL,
  cve_id TEXT NOT NULL,
  vuln_version TEXT NOT NULL,     -- a representative vulnerable version
  fixed_version TEXT NOT NULL,
  surface_digest TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE(ecosystem, package, cve_id, vuln_version, fixed_version)
);

CREATE TABLE vuln_surface_sinks (
  surface_id BIGINT REFERENCES vuln_surfaces(surface_id) ON DELETE CASCADE,
  sink_method_key TEXT NOT NULL,
  reason TEXT NOT NULL,           -- changed|added|removed|heuristic
  PRIMARY KEY(surface_id, sink_method_key)
);

CREATE TABLE vuln_surface_triggers (
  surface_id BIGINT REFERENCES vuln_surfaces(surface_id) ON DELETE CASCADE,
  trigger_method_key TEXT NOT NULL,
  sink_method_key TEXT NOT NULL,
  internal_path JSONB,            -- optional: library internal witness path
  PRIMARY KEY(surface_id, trigger_method_key, sink_method_key)
);
```

### 2.3 Reachability cache & witnesses

```sql
CREATE TABLE reach_findings (
  finding_id BIGSERIAL PRIMARY KEY,
  snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
  cve_id TEXT NOT NULL,
  ecosystem TEXT NOT NULL,
  package TEXT NOT NULL,
  package_version TEXT NOT NULL,
  reachable BOOLEAN NOT NULL,
  reachable_entrypoints INT NOT NULL DEFAULT 0,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE(snapshot_id, cve_id, package, package_version)
);

CREATE TABLE reach_witnesses (
  witness_id BIGSERIAL PRIMARY KEY,
  finding_id BIGINT REFERENCES reach_findings(finding_id) ON DELETE CASCADE,
  entry_node_id BIGINT REFERENCES cg_nodes(node_id),
  dsse_envelope JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

---

## 3) Stable identity: MethodKey + IL hash

### 3.1 MethodKey (must be stable across builds)

Use a normalized string like:

```
{AssemblyName}|{DeclaringTypeFullName}|{MethodName}`{GenericArity}({ParamType1},{ParamType2},...)
```

Examples:

* `MyApp|BillingController|Pay(System.String)`
* `LibXYZ|LibXYZ.Parser|Parse(System.ReadOnlySpan<System.Byte>)`

### 3.2 Normalized IL hash (for smart-diff + incremental graph updates)

Raw IL bytes aren't stable (metadata tokens change). Normalize:

* opcode names
* branch targets by *instruction index*, not offset
* method operands by **resolved MethodKey**
* string operands by literal or hashed literal
* type operands by full name

Then hash `SHA256(normalized_bytes)`.
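
A minimal sketch with Mono.Cecil (the extraction library named in §12); operand handling is simplified, e.g. switch tables are glossed over:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using Mono.Cecil;
using Mono.Cecil.Cil;

static string NormalizedIlHash(MethodDefinition method)
{
    if (!method.HasBody) return "nobody";
    var body = method.Body;
    var sb = new StringBuilder();

    foreach (var ins in body.Instructions)
    {
        sb.Append(ins.OpCode.Name).Append(' ');
        sb.Append(ins.Operand switch
        {
            Instruction target => $"@{body.Instructions.IndexOf(target)}", // branch by index, not offset
            MethodReference m  => m.FullName,                              // stable method identity
            TypeReference t    => t.FullName,
            FieldReference f   => f.FullName,
            string s           => "\"" + s + "\"",
            null               => "",
            var other          => other.ToString()                         // constants etc.
        });
        sb.Append('\n');
    }
    return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString()))).ToLowerInvariant();
}
```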

---

*[Remainder of advisory truncated for brevity - see original file for full content]*

---

## 12) What to implement first (in the order that produces value fastest)

### Week 1-2 scope (realistic, shippable)

1. Cecil call graph extraction (direct calls)
2. MVC + Minimal API entrypoints
3. Reverse BFS reachability with path witnesses
4. DSSE witness signing + storage
5. SurfaceBuilder v1:

   * IL hash per method
   * changed methods as sinks
   * triggers via internal reverse BFS
6. UI: "Show Witness" + "Verify Signature"

### Next increment (precision upgrades)

7. async/await mapping to original methods
8. RTA + DI registration hints
9. delegate tracking for Minimal API handlers (if not already)
10. interface override triggers in surface builder

### Later (if you want "attackability", not just "reachability")

11. taint/dataflow for top sink classes (deserialization, path traversal, SQL, command exec)
12. sanitizer modeling & parameter constraints

---

## 13) Common failure modes and how to harden

### MethodKey mismatches (surface vs app call)

* Ensure both are generated from the same normalization rules
* For generic methods, prefer **definition** keys (strip instantiation)
* Store both "exact" and "erased generic" variants if needed

### Multi-target frameworks

* SurfaceBuilder: compute triggers for each TFM, union them
* App scan: choose TFM closest to build RID, but allow fallback to union

### Huge graphs

* Drop `System.*` nodes/edges unless the vuln is in System.* itself (rare, but handle separately)
* Deduplicate nodes by MethodKey across assemblies where safe
* Use CSR arrays + pooled queues

### Reflection heavy projects

* Mark analysis confidence lower
* Include "unknown edges present" in finding metadata
* Still produce a witness path up to the reflective callsite

---

If you want, I can also paste a **complete Cecil-based CallGraphBuilder class** (nodes+edges+PDB lines), plus the **SurfaceBuilder** that downloads NuGet packages and generates `vuln_surface_triggers` end-to-end.

---

# ARCHIVED ADVISORY

> **Archived**: 2025-12-18
> **Status**: IMPLEMENTED
> **Analysis**: Plan file `C:\Users\vlindos\.claude\plans\quizzical-hugging-hearth.md`
>
> ## Implementation Summary
>
> This advisory was analyzed and merged into the existing EPSS implementation plan:
>
> - **Master Plan**: `IMPL_3410_epss_v4_integration_master_plan.md` updated with raw + signal layer schemas
> - **Sprint**: `SPRINT_3413_0001_0001_epss_live_enrichment.md` created with 30 tasks (original 14 + 16 from advisory)
> - **Migrations Created**:
>   - `011_epss_raw_layer.sql` - Full JSONB payload storage (~5GB/year)
>   - `012_epss_signal_layer.sql` - Tenant-scoped signals with dedupe_key and explain_hash
>
> ## Gap Analysis Result
>
> | Advisory Proposal | Decision | Rationale |
> |-------------------|----------|-----------|
> | Raw feed layer (Layer 1) | IMPLEMENTED | Full JSONB storage for deterministic replay |
> | Normalized layer (Layer 2) | ALIGNED | Already existed in IMPL_3410 |
> | Signal-ready layer (Layer 3) | IMPLEMENTED | Tenant-scoped signals, model change detection |
> | Multi-model support | DEFERRED | No customer demand |
> | Meta-predictor training | SKIPPED | Out of scope (ML complexity) |
> | A/B testing | SKIPPED | Infrastructure overhead |
>
> ## Key Enhancements Implemented
>
> 1. **Raw Feed Layer** (`epss_raw` table) - Stores full CSV payload as JSONB for replay
> 2. **Signal-Ready Layer** (`epss_signal` table) - Tenant-scoped actionable events
> 3. **Model Version Change Detection** - Suppresses noisy deltas on model updates
> 4. **Explain Hash** - Deterministic SHA-256 for audit trail
> 5. **Risk Band Mapping** - CRITICAL/HIGH/MEDIUM/LOW based on percentile

---

# Original Advisory Content

Here's a compact, practical blueprint for bringing **EPSS** into your stack without chaos: a **3-layer ingestion model** that keeps raw data, produces clean probabilities, and emits "signal-ready" events your risk engine can use immediately.

---

# Why this matters (super short)

* **EPSS** = predicted probability a vuln will be exploited soon.
* Mixing "raw EPSS feed" directly into decisions makes audits, rollbacks, and model upgrades painful.
* A **layered model** lets you **version probability evolution**, compare vendors, and train **meta-predictors on deltas** (how risk changes over time), not just on snapshots.

---

# The three layers (and how they map to Stella Ops)

1. **Raw feed layer (immutable)**

   * **Goal:** Store exactly what the provider sent (EPSS v4 CSV/JSON, schema drift and all).
   * **Stella modules:** `Concelier` (preserve-prune source) writes; `Authority` handles signatures/hashes.
   * **Storage:** `postgres.epss_raw` (partitioned by day); blob column for the untouched payload; SHA-256 of source file.
   * **Why:** Full provenance + deterministic replay.

2. **Normalized probabilistic layer**

   * **Goal:** Clean, typed tables keyed by `cve_id`, with **probability, percentile, model_version, asof_ts**.
   * **Stella modules:** `Excititor` (transform); `Policy Engine` reads.
   * **Storage:** `postgres.epss_prob` with a **surrogate key** `(cve_id, model_version, asof_ts)` and computed **delta fields** vs previous `asof_ts`.
   * **Extras:** Keep optional vendor columns (e.g., FIRST, custom regressors) to compare models side-by-side.

3. **Signal-ready layer (risk engine contracts)**

   * **Goal:** Pre-chewed "events" your **Signals/Router** can route instantly.
   * **What's inside:** Only the fields needed for gating and UI: `cve_id`, `prob_now`, `prob_delta`, `percentile`, `risk_band`, `explain_hash`.
   * **Emit:** `first_signal`, `risk_increase`, `risk_decrease`, `quieted` with **idempotent event keys**.
   * **Stella modules:** `Signals` publishes, `Router` fan-outs, `Timeline` records; `Notify` handles subscriptions.

---

# Minimal Postgres schema (ready to paste)

```sql
-- 1) Raw (immutable)
create table epss_raw (
  id bigserial primary key,
  source_uri text not null,
  ingestion_ts timestamptz not null default now(),
  asof_date date not null,
  payload jsonb not null,
  payload_sha256 bytea not null
);
create index on epss_raw (asof_date);

-- 2) Normalized
create table epss_prob (
  id bigserial primary key,
  cve_id text not null,
  model_version text not null,
  asof_ts timestamptz not null,
  probability double precision not null,
  percentile double precision,
  features jsonb,
  unique (cve_id, model_version, asof_ts)
);

-- 3) Signal-ready
create table epss_signal (
  signal_id bigserial primary key,
  cve_id text not null,
  asof_ts timestamptz not null,
  probability double precision not null,
  prob_delta double precision,
  risk_band text not null,
  model_version text not null,
  explain_hash bytea not null,
  unique (cve_id, model_version, asof_ts)
);
```

---

# C# ingestion skeleton (StellaOps.Scanner.Worker.DotNet style)

```csharp
// Assumes Dapper-style extension methods on an Npgsql connection (`pg`), an
// injected HttpClient (`http`), a message bus (`bus`), and helpers
// ParseCsvOrJson / Band / DeterministicExplainHash defined elsewhere.

// 1) Fetch & store raw (Concelier)
public async Task IngestRawAsync(Uri src, DateOnly asOfDate) {
  var bytes = await http.GetByteArrayAsync(src);
  var sha = SHA256.HashData(bytes);
  await pg.ExecuteAsync(
    "insert into epss_raw(source_uri, asof_date, payload, payload_sha256) values (@u,@d,@p::jsonb,@s)",
    new { u = src.ToString(), d = asOfDate, p = Encoding.UTF8.GetString(bytes), s = sha });
}

// 2) Normalize (Excititor)
public async Task NormalizeAsync(DateOnly asOfDate, string modelVersion) {
  var raws = await pg.QueryAsync<string>("select payload from epss_raw where asof_date=@d", new { d = asOfDate });
  foreach (var payload in raws) {
    foreach (var row in ParseCsvOrJson(payload)) {
      await pg.ExecuteAsync(
        @"insert into epss_prob(cve_id, model_version, asof_ts, probability, percentile, features)
          values (@cve,@mv,@ts,@prob,@pct,@feat)
          on conflict do nothing",
        new { cve = row.Cve, mv = modelVersion, ts = row.AsOf, prob = row.Prob, pct = row.Pctl, feat = row.Features });
    }
  }
}

// 3) Emit signal-ready (Signals)
public async Task EmitSignalsAsync(string modelVersion, double deltaThreshold) {
  // Dapper's non-generic QueryAsync returns dynamic rows.
  var rows = await pg.QueryAsync(@"select cve_id, asof_ts, probability,
      probability - lag(probability) over (partition by cve_id, model_version order by asof_ts) as prob_delta
    from epss_prob where model_version=@mv", new { mv = modelVersion });

  foreach (var r in rows) {
    var band = Band(r.probability);
    if (Math.Abs(r.prob_delta ?? 0) >= deltaThreshold) {
      var explainHash = DeterministicExplainHash(r);
      await pg.ExecuteAsync(@"insert into epss_signal
          (cve_id, asof_ts, probability, prob_delta, risk_band, model_version, explain_hash)
        values (@c,@t,@p,@d,@b,@mv,@h)
        on conflict do nothing",
        new { c = r.cve_id, t = r.asof_ts, p = r.probability, d = r.prob_delta, b = band, mv = modelVersion, h = explainHash });

      await bus.PublishAsync("risk.epss.delta", new {
        cve = r.cve_id, ts = r.asof_ts, prob = r.probability, delta = r.prob_delta, band, model = modelVersion, explain = Convert.ToHexString(explainHash)
      });
    }
  }
}
```
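
A minimal sketch of the two helpers the skeleton assumes (shown field-wise rather than row-wise); the band thresholds are illustrative policy choices, not part of EPSS:

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

static string Band(double probability) => probability switch
{
    >= 0.5  => "CRITICAL",
    >= 0.2  => "HIGH",
    >= 0.05 => "MEDIUM",
    _       => "LOW"
};

// Culture-invariant canonical rendering so the hash replays byte-for-byte.
static byte[] DeterministicExplainHash(string cveId, DateTimeOffset asOfTs, double probability, string modelVersion)
{
    var canonical = string.Join('|', cveId, asOfTs.ToUnixTimeSeconds(),
        probability.ToString("R", CultureInfo.InvariantCulture), modelVersion);
    return SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
}
```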

---

# Versioning & experiments (the secret sauce)

* **Model namespace:** `EPSS-4.0-<regressor-name>-<date>` so you can run multiple variants in parallel.
* **Delta-training:** Train a small meta-predictor on **delta-probability** to forecast **"risk jumps in next N days."**
* **A/B in production:** Route `model_version=x` to 50% of projects; compare **MTTA to patch** and **false-alarm rate**.

---

# Policy & UI wiring (quick contracts)

**Policy gates** (OPA/Rego or internal rules):

* Block if `risk_band in {HIGH, CRITICAL}` **AND** `prob_delta >= 0.1` in last 72h.
* Soften if asset not reachable or mitigated by VEX.

**UI (Evidence pane):**

* Show **sparkline of EPSS over time**, highlight last delta.
* "Why now?" button reveals **explain_hash** -> deterministic evidence payload.

---

# Ops & reliability

* Daily ingestion with **idempotent** runs (raw SHA guard).
* Backfills: re-normalize from `epss_raw` for any new model without re-downloading.
* **Deterministic replay:** export `(raw, transform code hash, model_version)` alongside results.