save progress

This commit is contained in:
StellaOps Bot
2025-12-18 09:53:46 +02:00
parent 28823a8960
commit 7d5250238c
87 changed files with 9750 additions and 2026 deletions


@@ -0,0 +1,721 @@
Here are two practical ways to make your software supply-chain evidence both *useful* and *verifiable*, with enough background to get you shipping.
---
# 1) Binary SBOMs that still work when there's no package manager
**Why this matters:** Container images built `FROM scratch` or “distroless” often lack package metadata, so typical SBOMs go blank. A *binary SBOM* extracts facts directly from executables, so you still know “what's inside,” even in bare images.
**Core idea (plain English):**
* Parse binaries (ELF on Linux, PE on Windows, Mach-O on macOS).
* Record file paths, cryptographic hashes, import tables, compiler/linker hints, and for ELF also the `.note.gnu.build-id` (a unique ID most linkers embed).
* Map these fingerprints to known packages/versions (vendor fingerprints, distro databases, your own allowlists).
* Sign the result as an attestation so others can trust it without rerunning your scanner.
**Minimal pipeline sketch:**
* **Extract:** `readelf -n` (ELF notes), `objdump`/`otool` for imports; compute SHA256 for every binary.
* **Normalize:** Emit CycloneDX or SPDX components for *binaries*, not just packages.
* **Map:** Use BuildID → package hints (e.g., glibc, OpenSSL), symbol/version patterns, and path heuristics.
* **Attest:** Wrap the SBOM in DSSE + in-toto and push to your registry alongside the image digest.
**Pragmatic spec for developers:**
* Inputs: OCI image digest.
* Outputs:
* `binary-sbom.cdx.json` (CycloneDX) or `binary-sbom.spdx.json`.
* `attestation.intoto.jsonl` (DSSE envelope referencing the SBOM's SHA-256 and the *image digest*).
* Data fields to capture per artifact:
* `algorithm: sha256`, `digest: <hex>`, `type: elf|pe|macho`, `path`, `size`,
* `elf.build_id` (if present), `imports[]`, `compiler[]`, `arch`, `endian`.
* Verification:
* `cosign verify-attestation --type sbom --digest <image-digest> ...`
**Why the ELF BuildID is gold:** it's a stable, linker-emitted identifier that helps correlate stripped binaries to upstream packages, which is critical when filenames and symbols lie.
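A minimal extraction sketch, assuming GNU binutils' `readelf` is available on the scan host (you can also parse the ELF note section directly, which avoids the external process):
```csharp
using System.Diagnostics;
using System.Text.RegularExpressions;

// Shell out to `readelf -n` (the Extract step above) and pull the GNU build-id
// out of its note output. Returns null when the binary has no build-id note.
static string? TryReadBuildId(string elfPath)
{
    var psi = new ProcessStartInfo("readelf", $"-n \"{elfPath}\"")
    {
        RedirectStandardOutput = true,
        RedirectStandardError = true,
    };
    using var proc = Process.Start(psi);
    if (proc is null) return null;
    string output = proc.StandardOutput.ReadToEnd();
    proc.WaitForExit();

    // GNU readelf prints a line like: "    Build ID: 2f3a9c..."
    var match = Regex.Match(output, @"Build ID:\s*([0-9a-f]+)", RegexOptions.IgnoreCase);
    return match.Success ? match.Groups[1].Value.ToLowerInvariant() : null;
}
```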
---
# 2) Reachability analysis so you only page people for *real* risk
**Why this matters:** Not every CVE in your deps can actually be hit by your app. If you can show “no call path reaches the vulnerable sink,” you can *denoise* alerts and ship faster.
**Core idea (plain English):**
* Build an *interprocedural call graph* of your app (across modules/packages).
* Mark known “sinks” from vulnerability advisories (e.g., dangerous API + version range).
* Compute graph reachability from your entrypoints (HTTP handlers, CLI `main`, background jobs).
* The intersection of {reachable nodes} × {vulnerable sinks} = “actionable” findings.
* Emit a signed *witness* (attestation) that states which sinks are reachable/unreachable and why.
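As a concrete illustration of the reachability step, a plain breadth-first search over the call graph is enough for the context-insensitive case; the shapes below (`callGraph`, `entrypoints`, `sinks`) are illustrative, not an existing API:
```csharp
using System.Collections.Generic;

// BFS from entrypoints over an adjacency list keyed by function id.
// Result = {reachable nodes} ∩ {vulnerable sinks}, i.e. the actionable findings.
static HashSet<string> ReachableSinks(
    IReadOnlyDictionary<string, IReadOnlyList<string>> callGraph,
    IEnumerable<string> entrypoints,
    IReadOnlySet<string> sinks)
{
    var visited = new HashSet<string>(entrypoints);
    var queue = new Queue<string>(visited);
    var hits = new HashSet<string>();

    while (queue.Count > 0)
    {
        var node = queue.Dequeue();
        if (sinks.Contains(node)) hits.Add(node);
        if (!callGraph.TryGetValue(node, out var callees)) continue;
        foreach (var callee in callees)
            if (visited.Add(callee)) queue.Enqueue(callee);   // visit each function once
    }
    return hits;
}
```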
**Minimal pipeline sketch:**
* **Ingest code/bytecode:** language-specific frontends (e.g., .NET IL, JVM bytecode, Python AST, Go SSA).
* **Build graph:** nodes = functions/methods; edges = call sites (include dynamic edges conservatively).
* **Mark entrypoints:** web routes, message handlers, cron jobs, exported CLIs.
* **Mark sinks:** from your vuln DB (API signature + version).
* **Decide:** run graph search from entrypoints → is any sink reachable?
* **Attest:** DSSE witness with:
* artifact digest (commit SHA / image digest),
* tool version + rule set hash,
* list of reachable sinks with at least one example call path,
* list of *proven* unreachable sinks (under stated assumptions).
**Developer contract (portable across languages):**
* Inputs: source/bytecode zip + manifest of entrypoints.
* Outputs:
* `reachability.witness.json` (DSSE envelope),
* optional `paths/` folder with top-N call paths as compact JSON (for UX rendering).
* Verification:
* Recompute call graph deterministically given the same inputs + tool version,
* `cosign verify-attestation --type reachability ...`
---
# How these two pieces fit together
* **Binary SBOM** = “What exactly is in the artifact?” (even in bare images)
* **Reachability witness** = “Which vulns actually matter to *this* app build?”
* Sign both as **DSSE/intoto attestations** and attach to the image/release. Your CI can enforce:
* “Block if high-severity + *reachable*,”
* “Warn (don't block) if high-severity but *unreachable* with a fresh witness.”
---
# Quick starter checklist (copy/paste to a task board)
* [ ] Binary extractors: ELF/PE/MachO parsers; hash & BuildID capture.
* [ ] Mapping rules: BuildID → known package DB; symbol/version heuristics.
* [ ] Emit CycloneDX/SPDX; add filelevel components for binaries.
* [ ] DSSE signing and `cosign`/`rekor` publish for SBOM attestation.
* [ ] Language frontends for reachability (pick your top 1-2 first).
* [ ] Call-graph builder + entrypoint detector.
* [ ] Sink catalog normalizer (map CVE → API signature).
* [ ] Reachability engine + example path extractor.
* [ ] DSSE witness for reachability; attach to build.
* [ ] CI policy: block on “reachable high/critical”; surface paths in UI.
If you want, I can turn this into concrete .NET-first tasks with sample code scaffolds and a tiny demo repo that builds an image, extracts a binary SBOM, runs reachability on a toy service, and emits both attestations.
Below is a concrete, “do-this-then-this” implementation plan for a **layered binary→PURL mapping system** that fits StellaOps' constraints: **offline**, **deterministic**, **SBOM-first**, and with **unknowns recorded instead of guessing**.
I'm going to assume your target is the common pain case StellaOps itself calls out: when package metadata is missing, Scanner falls back to binary identity (`bin:{sha256}`) and you want to deterministically “lift” those binaries into stable package identities (PURLs) without turning the core SBOM into fuzzy guesswork. StellaOps' own Scanner docs emphasize **deterministic analyzers**, **no fuzzy identity in core**, and keeping heuristics as opt-in add-ons. ([Stella Ops][1])
---
## 0) What “binary mapping” means in StellaOps terms
In Scanner's architecture, the **component key** is:
* **PURL when present**
* otherwise `bin:{sha256}` ([Stella Ops][1])
So “better binary mapping” = systematically converting more of those `bin:*` components into **PURLs** (or at least producing **actionable mapping evidence + Unknowns**) while preserving:
* deterministic replay (same inputs ⇒ same output)
* offline operation (airgapped kits)
* policy safety (don't hide false negatives behind fuzzy IDs)
Also, StellaOps already has the concept of “gaps” being first-class via the **Unknowns Registry** (identity gaps, missing build-id, version conflicts, missing edges, etc.). ([Gitea: Git with a cup of tea][2]) Your binary mapping work should *feed* this system.
---
## 1) Design constraints you must keep (or you'll fight the platform)
### 1.1 Determinism rules
StellaOps Scanner architecture is explicit: core analyzers are deterministic; heuristic plugins must not contaminate the core SBOM unless explicitly enabled. ([Stella Ops][1])
That implies:
* **No probabilistic “best guess” PURL** in the default mapping path.
* If you do fuzzy inference, it must be emitted as:
* “hints” attached to Unknowns, or
* a separate heuristic artifact gated by flags.
### 1.2 Offline kit + debug store is already a hook you can exploit
Offline kits already bundle:
* scanner plugins (OS + language analyzers packaged under `plugins/scanner/analyzers/**`)
* a **debug store** layout: `debug/.build-id/<aa>/<rest>.debug`
* a `debug-manifest.json` that maps build-ids → originating images (for symbol retrieval) ([Stella Ops][3])
This is perfect for building a **BuildID→PURL index** that remains offline and signed.
### 1.3 Scanner Worker already loads analyzers via directory catalogs
The Worker loads OS and language analyzer plugins from default directories (unless overridden), using deterministic directory normalization and a “seal” concept on the last directory. ([Gitea: Git with a cup of tea][4])
So you can add a third catalog for **native/binary mapping** that behaves the same way.
---
## 2) Layering strategy: what to implement (and in what order)
You want a **resolver pipeline** with strict ordering from “hard evidence” → “soft evidence”.
### Layer 0 — In-image authoritative mapping (highest confidence)
These sources are authoritative because they come from within the artifact:
1. **OS package DB present** (dpkg/rpm/apk):
* Map `path → package` using file ownership lists.
* If you can also compute file hashes/buildids, store them as evidence.
2. **Language ecosystem metadata present** (already handled by language analyzers):
* For example, a Python wheel RECORD or a Go buildinfo section can directly imply module versions.
**Decision rule**: If a binary file is owned by an OS package, **prefer that** over any external mapping index.
### Layer 1 — “Build provenance” mapping via build IDs / UUIDs (strong, portable)
When package DB is missing (distroless/scratch), use **compiler/linker stable IDs**:
* ELF: `.note.gnu.build-id`
* MachO: `LC_UUID`
* PE: CodeView (PDB GUID+Age) / build signature
This should be your primary fallback because it survives stripping and renaming.
### Layer 2 — Hash mapping for curated or vendor-pinned binaries (strong but brittle across rebuilds)
Use SHA256 → PURL mapping when:
* binaries are redistributed unchanged (busybox, chromium, embedded runtimes)
* you maintain a curated “known binaries” manifest
StellaOps already has “curated binary manifest generation” mentioned in its repo history, and a `vendor/manifest.json` concept exists (for pinned artifacts / binaries in the system). ([Gitea: Git with a cup of tea][5])
For your ops environment you'll create a similar manifest **for your fleet**.
### Layer 3 — Dependency closure constraints (helpful as a disambiguator, not a primary mapper)
If the binary's DT_NEEDED / imports point to libs you *can* identify, you can use that to disambiguate multiple possible candidates (“this openssl build-id matches, but only one candidate has the required glibc baseline”).
This must remain deterministic and rules-based.
### Layer 4 — Heuristic hints (never change the core SBOM by default)
Examples:
* symbol version patterns (`GLIBC_2.28`, etc.)
* embedded version strings
* import tables
* compiler metadata
These produce **Unknown evidence/hints**, not a resolved identity, unless a special “heuristics allowed” flag is turned on.
### Layer 5 — Unknowns Registry output (mandatory when you cant decide)
If a mapping can't be made decisively:
* emit Unknowns (identity_gap, missing_build_id, version_conflict, etc.) ([Gitea: Git with a cup of tea][2])
This is not optional; it's how you prevent silent false negatives.
---
## 3) Concrete data model you should implement
### 3.1 Binary identity record
Create a single canonical identity structure that *every layer* uses:
```csharp
public enum BinaryFormat { Elf, Pe, MachO, Unknown }
public sealed record BinaryIdentity(
BinaryFormat Format,
string Path, // normalized (posix style), rooted at image root
string Sha256, // always present
string? BuildId, // ELF
string? MachOUuid, // Mach-O
string? PeCodeViewGuid, // PE/PDB
string? Arch, // amd64/arm64/...
long SizeBytes
);
```
**Determinism tip**: normalize `Path` to a single separator and collapse `//`, `./`, etc.
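One way to apply that tip, as a small sketch (collapse separators and dot segments so the same file always yields the same `Path`):
```csharp
using System.Collections.Generic;

// Normalize image-relative paths before they enter BinaryIdentity:
// posix separators, collapsed "//" and "./", resolved "..", always rooted at "/".
static string NormalizePath(string rawPath)
{
    var parts = rawPath.Replace('\\', '/').Split('/', StringSplitOptions.RemoveEmptyEntries);
    var segments = new List<string>();
    foreach (var part in parts)
    {
        if (part == ".") continue;
        if (part == "..")
        {
            if (segments.Count > 0) segments.RemoveAt(segments.Count - 1); // never escape the image root
            continue;
        }
        segments.Add(part);
    }
    return "/" + string.Join('/', segments);
}
```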
### 3.2 Mapping candidate
Each resolver layer returns candidates like:
```csharp
public enum MappingVerdict { Resolved, Unresolved, Ambiguous }
public sealed record BinaryMappingCandidate(
string Purl,
double Confidence, // 0..1 but deterministic
string ResolverId, // e.g. "os.fileowner", "buildid.index.v1"
IReadOnlyList<string> Evidence, // stable ordering
IReadOnlyDictionary<string,string> Properties // stable ordering
);
```
### 3.3 Final mapping result
```csharp
public sealed record BinaryMappingResult(
MappingVerdict Verdict,
BinaryIdentity Subject,
BinaryMappingCandidate? Winner,
IReadOnlyList<BinaryMappingCandidate> Alternatives,
string MappingIndexDigest // sha256 of index snapshot used (or "none")
);
```
---
## 4) Build the “Binary Map Index” that makes Layer 1 and 2 work offline
### 4.1 Where it lives in StellaOps
Put it in the Offline Kit as a signed artifact, next to other feeds and plug-ins. Offline kit packaging already includes plug-ins and a debug store with a deterministic layout. ([Stella Ops][3])
Recommended layout:
```
offline-kit/
feeds/
binary-map/
v1/
buildid.map.zst
sha256.map.zst
index.manifest.json
index.manifest.json.sig (DSSE or JWS, consistent with your kit)
```
### 4.2 Index record schema (v1)
Make each record explicit and replayable:
```json
{
"schema": "stellaops.binary-map.v1",
"records": [
{
"key": { "kind": "elf.build_id", "value": "2f3a..."},
"purl": "pkg:deb/debian/openssl@3.0.11-1~deb12u2?arch=amd64",
"evidence": {
"source": "os.dpkg.fileowner",
"source_image": "sha256:....",
"path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
"package": "openssl",
"package_version": "3.0.11-1~deb12u2"
}
}
]
}
```
Key points:
* `key.kind` is one of `elf.build_id`, `macho.uuid`, `pe.codeview`, `file.sha256`
* include evidence with enough detail to justify mapping
### 4.3 How to *generate* the index (deterministically)
You need an **offline index builder** pipeline. In StellaOps terms, this is best treated like a feed exporter step (build-time), then shipped in the Offline Kit.
**Input set options** (choose one or mix):
1. “Golden base images” list (your fleets base images)
2. Distro repositories mirrored into the airgap (Deb/RPM/APK archives)
3. Previously scanned images that are allowed into the kit
**Generation steps**:
1. For each input image:
* Extract rootfs in a deterministic path order.
* Run OS analyzers (dpkg/rpm/apk) + native identity collection (ELF/PE/MachO).
2. Produce raw tuples:
* `(build_id | uuid | codeview | sha256) → (purl, evidence)`
3. Deduplicate:
* Canonicalize PURLs (normalize qualifiers order, lowercasing rules).
* If the same key maps to **multiple distinct PURLs**, keep them all and mark as conflict (do not pick one).
4. Sort:
* Sort by `(key.kind, key.value, purl)` lexicographically.
5. Serialize:
* Emit line-delimited JSON or a simple binary format.
* Compress (zstd).
6. Compute digests:
* `sha256` of each artifact.
* `sha256` of concatenated `(artifact name + sha)` for a manifest hash.
7. Sign:
* include in kit manifest and sign with the same process you use for other offline kit elements. Offline kit import in StellaOps validates digests and signatures. ([Stella Ops][3])
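A small sketch of step 6, assuming the manifest hash is simply SHA-256 over the sorted `(artifact name + sha)` pairs as described above:
```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Per-artifact SHA-256 plus a manifest hash over the sorted "name + sha" pairs,
// so the same artifact set always yields the same index.manifest.json digest.
static string ComputeManifestHash(IEnumerable<string> artifactPaths)
{
    var entries = artifactPaths
        .Select(p => (Name: Path.GetFileName(p),
                      Sha: Convert.ToHexString(SHA256.HashData(File.ReadAllBytes(p))).ToLowerInvariant()))
        .OrderBy(e => e.Name, StringComparer.Ordinal)   // deterministic ordering
        .ToList();

    var concatenated = new StringBuilder();
    foreach (var (name, sha) in entries)
        concatenated.Append(name).Append(sha);

    return Convert.ToHexString(
        SHA256.HashData(Encoding.UTF8.GetBytes(concatenated.ToString()))).ToLowerInvariant();
}
```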
---
## 5) Runtime side: implement the layered resolver in Scanner Worker
### 5.1 Where to hook in
You want this to run after OS + language analyzers have produced fragments, and after native identity collection has produced binary identities.
Scanner Worker already executes analyzers and appends fragments to `context.Analysis`. ([Gitea: Git with a cup of tea][4])
Scanner module responsibilities explicitly include OS, language, and native ecosystems as restart-only plug-ins. ([Gitea: Git with a cup of tea][6])
So implement binary mapping as either:
* part of the **native ecosystem analyzer output stage**, or
* a **post-analyzer enrichment stage** that runs before SBOM composition.
I recommend: **post-analyzer enrichment stage**, because it can consult OS+lang analyzer results and unify decisions.
### 5.2 Add a new ScanAnalysis key
Store collected binary identities in analysis:
* `ScanAnalysisKeys.NativeBinaryIdentities` → `ImmutableArray<BinaryIdentity>`
And store mapping results:
* `ScanAnalysisKeys.NativeBinaryMappings` → `ImmutableArray<BinaryMappingResult>`
### 5.3 Implement the resolver pipeline (deterministic ordering)
```csharp
public interface IBinaryMappingResolver
{
string Id { get; } // stable ID
int Order { get; } // deterministic
BinaryMappingCandidate? TryResolve(BinaryIdentity identity, MappingContext ctx);
}
```
Pipeline:
1. Sort resolvers by `(Order, Id)` (Ordinal comparison).
2. For each resolver:
* if it returns a candidate, add it to the candidates list.
* if the resolver is “authoritative” (Layer 0), you can short-circuit on first hit.
3. Decide:
* If 0 candidates ⇒ `Unresolved`
* If 1 candidate ⇒ `Resolved`
* If >1:
* If candidates have different PURLs ⇒ `Ambiguous` unless a deterministic “dominates” rule exists
* If candidates have same PURL (from multiple sources) ⇒ merge evidence
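A minimal sketch of this decision step, reusing the records from section 3; treating `Order == 0` as the authoritative layer is an assumption of this plan, and evidence merging is elided:
```csharp
using System.Collections.Generic;
using System.Linq;

static BinaryMappingResult Decide(
    BinaryIdentity identity,
    IReadOnlyList<IBinaryMappingResolver> resolvers,
    MappingContext ctx,
    string indexDigest)
{
    var candidates = new List<BinaryMappingCandidate>();
    foreach (var resolver in resolvers
        .OrderBy(r => r.Order)
        .ThenBy(r => r.Id, StringComparer.Ordinal))          // deterministic resolver order
    {
        var candidate = resolver.TryResolve(identity, ctx);
        if (candidate is null) continue;
        candidates.Add(candidate);
        if (resolver.Order == 0) break;                      // Layer 0 short-circuits
    }

    var distinctPurls = candidates.Select(c => c.Purl).Distinct(StringComparer.Ordinal).Count();
    var verdict = candidates.Count == 0 ? MappingVerdict.Unresolved
                : distinctPurls == 1 ? MappingVerdict.Resolved
                : MappingVerdict.Ambiguous;

    // Deterministic winner selection when all candidates agree on the PURL.
    var winner = verdict == MappingVerdict.Resolved
        ? candidates.OrderByDescending(c => c.Confidence)
                    .ThenBy(c => c.ResolverId, StringComparer.Ordinal).First()
        : null;

    return new BinaryMappingResult(verdict, identity, winner, candidates, indexDigest);
}
```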
### 5.4 Implement each layer as a resolver
#### Resolver A: OS file owner (Layer 0)
Inputs:
* OS analyzer results in `context.Analysis` (they're already stored in `ScanAnalysisKeys.OsPackageAnalyzers`). ([Gitea: Git with a cup of tea][4])
* You need OS analyzers to expose file ownership mapping.
Implementation options:
* Extend OS analyzers to produce `path → packageId` maps.
* Or load that from dpkg/rpm DB at mapping time (fast enough if you only query per binary path; see the sketch at the end of this resolver).
Candidate:
* `Purl = pkg:<ecosystem>/<name>@<version>?arch=...`
* Confidence = `1.0`
* Evidence includes:
* analyzer id
* package name/version
* file path
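A minimal sketch of the dpkg option, assuming the standard `/var/lib/dpkg/info/<package>[:<arch>].list` layout (one absolute path per line); the package version still comes from the dpkg status data the OS analyzer already parses:
```csharp
using System.Collections.Generic;
using System.IO;

// Build a "normalized path -> owning package" map from the extracted rootfs,
// good enough to answer "which package owns this binary?" fully offline.
static Dictionary<string, string> LoadDpkgFileOwners(string imageRoot)
{
    var owners = new Dictionary<string, string>(StringComparer.Ordinal);
    var infoDir = Path.Combine(imageRoot, "var/lib/dpkg/info");
    if (!Directory.Exists(infoDir)) return owners;

    foreach (var listFile in Directory.EnumerateFiles(infoDir, "*.list"))
    {
        // "openssl:amd64.list" -> package name "openssl"
        var package = Path.GetFileNameWithoutExtension(listFile).Split(':')[0];
        foreach (var line in File.ReadLines(listFile))
        {
            if (line.Length > 1) owners.TryAdd(line, package);   // skip the "/." root entry
        }
    }
    return owners;
}
```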
#### Resolver B: BuildID index (Layer 1)
Inputs:
* `identity.BuildId` (or uuid/codeview)
* `BinaryMapIndex` loaded from Offline Kit `feeds/binary-map/v1/buildid.map.zst`
Implementation:
* On worker startup: load and parse index into an immutable structure:
* `FrozenDictionary<string, BuildIdEntry[]>` (or sorted arrays + binary search)
* If key maps to multiple PURLs:
* return multiple candidates (same resolver id), forcing `Ambiguous` verdict upstream
Candidate:
* Confidence = `0.95` (still deterministic)
* Evidence includes index manifest digest + record evidence
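A small sketch of what that in-memory structure could look like; names like `BuildIdEntry` are illustrative, mirroring the record schema from section 4.2:
```csharp
using System.Collections.Frozen;
using System.Collections.Generic;
using System.Linq;

public sealed record BuildIdEntry(string BuildId, string Purl, string Evidence);

// Group records by build-id so an ambiguous key naturally yields several candidates.
public sealed class BuildIdIndex
{
    private readonly FrozenDictionary<string, BuildIdEntry[]> entries;

    public BuildIdIndex(IEnumerable<BuildIdEntry> records) =>
        entries = records
            .GroupBy(r => r.BuildId, StringComparer.Ordinal)
            .ToFrozenDictionary(g => g.Key, g => g.ToArray(), StringComparer.Ordinal);

    public IReadOnlyList<BuildIdEntry> Lookup(string buildId) =>
        entries.TryGetValue(buildId, out var hits) ? hits : Array.Empty<BuildIdEntry>();
}
```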
#### Resolver C: SHA256 index (Layer 2)
Inputs:
* `identity.Sha256`
* `feeds/binary-map/v1/sha256.map.zst` OR your ops “curated binaries” manifest
Candidate:
* Confidence:
* `0.9` if from signed curated manifest
* `0.7` if from “observed in previous scan cache” (I'd avoid this unless you version and sign the cache)
#### Resolver D: Dependency closure constraints (Layer 3)
Only run if you have native dependency parsing output (DT_NEEDED / imports). The resolver does **not** return a mapping on its own; instead, it can:
* bump confidence for existing candidates
* or rule out candidates deterministically (e.g., glibc baseline mismatch)
Make this a “candidate rewriter” stage:
```csharp
public interface ICandidateRefiner
{
string Id { get; }
int Order { get; }
IReadOnlyList<BinaryMappingCandidate> Refine(BinaryIdentity id, IReadOnlyList<BinaryMappingCandidate> cands, MappingContext ctx);
}
```
#### Resolver E: Heuristic hints (Layer 4)
Never resolves to a PURL by default. It just produces Unknown evidence payload:
* extracted strings (“OpenSSL 3.0.11”)
* imported symbol names
* SONAME
* symbol version requirements
---
## 6) SBOM composition behavior: how to “lift” bin components safely
### 6.1 Dont break the component key rules
Scanner uses:
* key = PURL when present, else `bin:{sha256}` ([Stella Ops][1])
When you resolve a binary identity to a PURL, you have two clean options:
**Option 1 (recommended): replace the component key with the PURL**
* This makes downstream policy/advisory matching work naturally.
* It's deterministic as long as the mapping index is versioned and shipped with the kit.
**Option 2: keep `bin:{sha256}` as the component key and attach `resolved_purl`**
* Lower disruption to diffing, but policy now has to understand the “resolved_purl” field.
* If StellaOps policy assumes `component.purl` is the canonical key, this will cause pain.
Given StellaOps emphasizes PURLs as the canonical key for identity, I'd implement **Option 1**, but record robust evidence + index digest.
### 6.2 Preserve file-level evidence
Even after lifting to PURL, keep evidence that ties the package identity back to file bytes:
* file path(s)
* sha256
* build-id/uuid
* mapping resolver id + index digest
This is what makes attestations verifiable and helps operators debug.
---
## 7) Unknowns integration: emit Unknowns whenever mapping isn't decisive
The Unknowns Registry exists precisely for “unresolved symbol → package mapping”, “missing build-id”, “ambiguous purl”, etc. ([Gitea: Git with a cup of tea][2])
### 7.1 When to emit Unknowns
Emit Unknowns for:
1. `identity.BuildId == null` for ELF
* `unknown_type = missing_build_id`
* evidence: “ELF missing .note.gnu.build-id; using sha256 only”
2. Multiple candidates with different PURLs
* `unknown_type = version_conflict` (or `identity_gap`)
* evidence: list candidates + their evidence
3. Heuristic hints found but no authoritative mapping
* `unknown_type = identity_gap`
* evidence: imported symbols, strings, SONAME
### 7.2 How to compute `unknown_id` deterministically
Unknowns schema suggests:
* `unknown_id` is derived from sha256 over `(type + scope + evidence)` ([Gitea: Git with a cup of tea][2])
Do:
* stable JSON canonicalization of `scope` + `unknown_type` + `primary evidence fields`
* sha256
* prefix with `unk:sha256:<...>`
This guarantees idempotent ingestion behavior (`POST /unknowns/ingest` upsert). ([Gitea: Git with a cup of tea][2])
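A minimal sketch of that derivation; the exact field set is whatever the Unknowns schema defines as primary evidence, and the helper name is illustrative:
```csharp
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Canonical JSON (sorted keys, no whitespace) over the identifying fields,
// hashed and prefixed so repeated ingestion of the same gap upserts cleanly.
static string ComputeUnknownId(string unknownType, string scope, IReadOnlyDictionary<string, string> evidence)
{
    var canonical = new SortedDictionary<string, string>(StringComparer.Ordinal)
    {
        ["unknown_type"] = unknownType,
        ["scope"] = scope,
    };
    foreach (var (key, value) in evidence)
        canonical[$"evidence.{key}"] = value;

    var json = JsonSerializer.Serialize(canonical);   // keys already sorted, default compact output
    var digest = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(json))).ToLowerInvariant();
    return $"unk:sha256:{digest}";
}
```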
---
## 8) Packaging as a StellaOps plug-in (so ops can upgrade it offline)
### 8.1 Plug-in manifest
Scanner plug-ins use a `manifest.json` with `schemaVersion`, `id`, `entryPoint` (dotnet assembly + typeName), etc. ([Gitea: Git with a cup of tea][7])
Create something like:
```json
{
"schemaVersion": "1.0",
"id": "stellaops.analyzer.native.binarymap",
"displayName": "StellaOps Native Binary Mapper",
"version": "0.1.0",
"requiresRestart": true,
"entryPoint": {
"type": "dotnet",
"assembly": "StellaOps.Scanner.Analyzers.Native.BinaryMap.dll",
"typeName": "StellaOps.Scanner.Analyzers.Native.BinaryMap.BinaryMapPlugin"
},
"capabilities": [
"native-analyzer",
"binary-mapper",
"elf",
"pe",
"macho"
],
"metadata": {
"org.stellaops.analyzer.kind": "native",
"org.stellaops.restart.required": "true"
}
}
```
### 8.2 Worker loading
Mirror the pattern in `CompositeScanAnalyzerDispatcher`:
* add a catalog `INativeAnalyzerPluginCatalog`
* default directory: `plugins/scanner/analyzers/native`
* load directories with the same “seal last directory” behavior ([Gitea: Git with a cup of tea][4])
---
## 9) Tests and performance gates (what “done” looks like)
StellaOps has determinism tests and golden fixtures for analyzers; follow that style. ([Gitea: Git with a cup of tea][6])
### 9.1 Determinism tests
Create fixtures with:
* same binaries in different file order
* same binaries hardlinked/symlinked
* stripped ELF missing build-id
* multi-arch variants
Assert:
* mapping output JSON byte-for-byte stable
* unknown ids stable
* candidate ordering stable
### 9.2 “No fuzzy identity” guardrail tests
Add tests that:
* heuristic resolver never emits a `Resolved` verdict unless a feature flag is enabled
* ambiguous candidates never auto-select a winner
### 9.3 Performance budgets
For ops, you care about scan wall time. Adopt budgets like:
* identity extraction < 25ms / binary (native parsing)
* mapping lookup O(1) / binary (frozen dict) or O(log n) with sorted arrays
* index load time bounded (lazy load per worker start)
Track metrics:
* count resolved per layer
* count ambiguous/unresolved
* unknown density (ties into Unknowns Registry scoring later) ([Gitea: Git with a cup of tea][2])
---
## 10) Practical “ops” workflow: how to keep improving mapping safely
### 10.1 Add a feedback loop from Unknowns → index builder
Unknowns are your backlog:
* missing build-id
* ambiguous mapping
* hash seen but not in index
For each Unknown:
1. decide if it should be mapped in core (needs authoritative source)
2. if yes: add reference artifact to your **index builder input set**
3. rebuild the BinaryMap index
4. ship via Offline Kit update (signed)
### 10.2 Dont let your index silently drift
Because determinism matters, treat the BinaryMap index like a feed:
* version it (`v1`, `v2`)
* sign it
* store index digest in scan evidence
That way you can explain: “This binary was mapped using binary-map/v1 digest XYZ”.
---
## 11) Minimal implementation checklist (if you want the shortest path to value)
If you only do 3 things, do these:
1. **BuildID extraction everywhere** (ELF/MachO/PE) and always store it in evidence
(also emit Unknown when missing, as StellaOps expects) ([Gitea: Git with a cup of tea][8])
2. **Offline BuildID → PURL index** shipped in Offline Kit
(fits perfectly with the existing debug-store + kit pattern) ([Stella Ops][3])
3. **Deterministic resolver pipeline + Unknowns emission**
(so you improve mapping without introducing silent risk) ([Gitea: Git with a cup of tea][2])
---
If you tell me whether your main pain is **distroless**, **FROM scratch**, or **vendor-bundled runtimes** (chromium/node/openssl/etc.), I can give you the best Layer 1 index builder recipe for that category (what to use as authoritative sources and how to avoid collisions), but the plan above is already safe and implementable without further assumptions.
[1]: https://stella-ops.org/docs/modules/scanner/architecture/ "Stella Ops Signed Reachability · Deterministic Replay · Sovereign Crypto"
[2]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d519782a8f0b30f425c9b6ae0f316b19259972a2/docs/signals/unknowns-registry.md "git.stella-ops.org/unknowns-registry.md at d519782a8f0b30f425c9b6ae0f316b19259972a2 - git.stella-ops.org - Gitea: Git with a cup of tea"
[3]: https://stella-ops.org/docs/24_offline_kit/index.html "Stella Ops Signed Reachability · Deterministic Replay · Sovereign Crypto"
[4]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/18f28168f022c73736bfd29033c71daef5e11044/src/Scanner/StellaOps.Scanner.Worker/Processing/CompositeScanAnalyzerDispatcher.cs "git.stella-ops.org/CompositeScanAnalyzerDispatcher.cs at 18f28168f022c73736bfd29033c71daef5e11044 - git.stella-ops.org - Gitea: Git with a cup of tea"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/8d78dd219b5e44c835e511491a4750f4a3ee3640/vendor/manifest.json?utm_source=chatgpt.com "git.stella-ops.org/manifest.json at ..."
[6]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/bc0762e97d251723854b9c4e482b218c8efb1e04/docs/modules/scanner "git.stella-ops.org/scanner at bc0762e97d251723854b9c4e482b218c8efb1e04 - git.stella-ops.org - Gitea: Git with a cup of tea"
[7]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/c37722993137dac4b3a4104045826ca33b9dc289/plugins/scanner/analyzers/lang/StellaOps.Scanner.Analyzers.Lang.Go/manifest.json "git.stella-ops.org/manifest.json at c37722993137dac4b3a4104045826ca33b9dc289 - git.stella-ops.org - Gitea: Git with a cup of tea"
[8]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d519782a8f0b30f425c9b6ae0f316b19259972a2/docs/reachability/evidence-schema.md?utm_source=chatgpt.com "git.stella-ops.org/evidence-schema.md at ..."


@@ -0,0 +1,444 @@
# ARCHIVED ADVISORY
> **Status:** Archived
> **Archived Date:** 2025-12-18
> **Implementation Sprints:**
> - `SPRINT_3700_0001_0001_witness_foundation.md` - BLAKE3 + Witness Schema
> - `SPRINT_3700_0002_0001_vuln_surfaces_core.md` - Vuln Surface Builder
> - `SPRINT_3700_0003_0001_trigger_extraction.md` - Trigger Method Extraction
> - `SPRINT_3700_0004_0001_reachability_integration.md` - Reachability Integration
> - `SPRINT_3700_0005_0001_witness_ui_cli.md` - Witness UI/CLI
> - `SPRINT_3700_0006_0001_incremental_cache.md` - Incremental Cache
>
> **Gap Analysis:** See `C:\Users\vlindos\.claude\plans\lexical-knitting-map.md`
---
Here's a compact, practical way to add two high-leverage capabilities to your scanner: **DSSE-signed path witnesses** and **Smart-Diff x Reachability**, covering what they are, why they matter, and exactly how to implement them in Stella Ops without ceremony.
---
# 1) DSSE-signed path witnesses (entrypoint -> calls -> sink)
**What it is (in plain terms):**
When you flag a CVE as "reachable," also emit a tiny, human-readable proof: the **exact path** from a real entrypoint (e.g., HTTP route, CLI verb, cron) through functions/methods to the **vulnerable sink**. Wrap that proof in a **DSSE** envelope and sign it. Anyone can verify the witness later, offline, without rerunning analysis.
**Why it matters:**
* Turns red flags into **auditable evidence** (quiet-by-design).
* Lets CI/CD, auditors, and customers **verify** findings independently.
* Enables **deterministic replay** and provenance chains (ties nicely to in-toto/SLSA).
**Minimal JSON witness (stable, vendor-neutral):**
```json
{
"witness_schema": "stellaops.witness.v1",
"artifact": { "sbom_digest": "sha256:...", "component_purl": "pkg:nuget/Example@1.2.3" },
"vuln": { "id": "CVE-2024-XXXX", "source": "NVD", "range": "<=1.2.3" },
"entrypoint": { "kind": "http", "name": "GET /billing/pay" },
"path": [
{"symbol": "BillingController.Pay()", "file": "BillingController.cs", "line": 42},
{"symbol": "PaymentsService.Authorize()", "file": "PaymentsService.cs", "line": 88},
{"symbol": "LibXYZ.Parser.Parse()", "file": "Parser.cs", "line": 17}
],
"sink": { "symbol": "LibXYZ.Parser.Parse()", "type": "deserialization" },
"evidence": {
"callgraph_digest": "sha256:...",
"build_id": "dotnet:RID:linux-x64:sha256:...",
"analysis_config_digest": "sha256:..."
},
"observed_at": "2025-12-18T00:00:00Z"
}
```
**Wrap in DSSE (payloadType & payload are required)**
```json
{
"payloadType": "application/vnd.stellaops.witness+json",
"payload": "base64(JSON_above)",
"signatures": [{ "keyid": "attestor-stellaops-ed25519", "sig": "base64(...)" }]
}
```
**.NET 10 signing/verifying (Ed25519)**
```csharp
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;
var payloadType = "application/vnd.stellaops.witness+json";
var payloadBytes = JsonSerializer.SerializeToUtf8Bytes(witnessJsonObj);
var dsse = new {
    payloadType,
    payload = Convert.ToBase64String(payloadBytes),
    // DSSE signatures cover PAE(payloadType, payload), not the raw payload bytes
    signatures = new [] { new { keyid = keyId, sig = Convert.ToBase64String(Sign(Pae(payloadType, payloadBytes), privateKey)) } }
};
// Pre-authentication encoding per the DSSE spec: "DSSEv1 <len(type)> <type> <len(body)> <body>"
byte[] Pae(string type, byte[] body)
{
    var header = Encoding.UTF8.GetBytes($"DSSEv1 {Encoding.UTF8.GetByteCount(type)} {type} {body.Length} ");
    return [.. header, .. body];
}
byte[] Sign(byte[] data, byte[] privateKey)
{
    // .NET has no built-in Ed25519 signer; NSec.Cryptography is one option for this helper
    var alg = NSec.Cryptography.SignatureAlgorithm.Ed25519;
    using var key = NSec.Cryptography.Key.Import(alg, privateKey, NSec.Cryptography.KeyBlobFormat.RawPrivateKey);
    return alg.Sign(key, data);
}
```
**Where to emit:**
* **Scanner.Worker**: after reachability confirms `reachable=true`, emit witness -> **Attestor** signs -> **Authority** stores (Postgres) -> optional Rekor-style mirror.
* Expose `/witness/{findingId}` for download & independent verification.
---
# 2) Smart-Diff x Reachability (incremental, low-noise updates)
**What it is:**
On **SBOM/VEX/dependency** deltas, don't rescan everything. Update only **affected regions** of the call graph and recompute reachability **just for changed nodes/edges**.
**Why it matters:**
* **Order-of-magnitude faster** incremental scans.
* Fewer flaky diffs; triage stays focused on **meaningful risk change**.
* Perfect for PR gating: "what changed" -> "what became reachable/unreachable."
**Core idea (graph-reachability):**
* Maintain a per-service **call graph** `G = (V, E)` with **entrypoint set** `S`.
* On diff: compute the changed nodes/edges ΔV/ΔE.
* Run **incremental BFS/DFS** from impacted nodes to sinks (forward or backward), reusing memoized results.
* Recompute only **frontiers** touched by Δ.
**Minimal tables (Postgres):**
```sql
-- Nodes (functions/methods)
CREATE TABLE cg_nodes(
id BIGSERIAL PRIMARY KEY,
service TEXT, symbol TEXT, file TEXT, line INT,
hash TEXT, UNIQUE(service, hash)
);
-- Edges (calls)
CREATE TABLE cg_edges(
src BIGINT REFERENCES cg_nodes(id),
dst BIGINT REFERENCES cg_nodes(id),
kind TEXT, PRIMARY KEY(src, dst)
);
-- Entrypoints & Sinks
CREATE TABLE cg_entrypoints(node_id BIGINT REFERENCES cg_nodes(id) PRIMARY KEY);
CREATE TABLE cg_sinks(node_id BIGINT REFERENCES cg_nodes(id) PRIMARY KEY, sink_type TEXT);
-- Memoized reachability cache
CREATE TABLE cg_reach_cache(
entry_id BIGINT, sink_id BIGINT,
path JSONB, reachable BOOLEAN,
updated_at TIMESTAMPTZ,
PRIMARY KEY(entry_id, sink_id)
);
```
**Incremental algorithm (pseudocode):**
```text
Input: ΔSBOM, ΔDeps, ΔCode -> ΔNodes, ΔEdges
1) Apply Δ to cg_nodes/cg_edges
2) ImpactSet = neighbors(ΔNodes U endpoints(ΔEdges))
3) For each e in Entrypoints intersect ancestors(ImpactSet):
Recompute forward search to affected sinks, stop early on unchanged subgraphs
Update cg_reach_cache; if state flips, emit new/updated DSSE witness
```
**.NET 10 reachability sketch (fast & local):**
```csharp
HashSet<int> ImpactSet = ComputeImpact(deltaNodes, deltaEdges);
foreach (var e in Intersect(Entrypoints, Ancestors(ImpactSet)))
{
var res = BoundedReach(e, affectedSinks, graph, cache);
foreach (var r in res.Changed)
{
cache.Upsert(e, r.Sink, r.Path, r.Reachable);
if (r.Reachable) EmitDsseWitness(e, r.Sink, r.Path);
}
}
```
**CI/PR flow:**
1. Build -> SBOM diff -> Dependency diff -> Call-graph delta.
2. Run incremental reachability.
3. If any `unreachable->reachable` transitions: **fail gate**, attach DSSE witnesses.
4. If `reachable->unreachable`: auto-close prior findings (and archive prior witness).
---
# UX hooks (quick wins)
* In findings list, add a **"Show Witness"** button -> modal renders the signed path (entrypoint->...->sink) + **"Verify Signature"** one-click.
* In PR checks, summarize only **state flips** with tiny links: "+2 reachable (view witness)" / "-1 (now unreachable)".
---
# Minimal tasks to get this live
* **Scanner.Worker**: build call-graph extraction (per language), add incremental graph store, reachability cache.
* **Attestor**: DSSE signing endpoint + key management (Ed25519 by default; PQC mode later).
* **Authority**: tables above + witness storage + retrieval API.
* **Router/CI plugin**: PR annotation with **state flips** and links to witnesses.
* **UI**: witness modal + signature verify.
If you want, I can draft the exact Postgres migrations, the C# repositories, and a tiny verifier CLI that checks DSSE signatures and prints the call path.
Below is a concrete, buildable blueprint for an **advanced reachability analysis engine** inside Stella Ops. I'm going to assume your "Stella Ops" components are roughly:
* **Scanner.Worker**: runs analyses in CI / on artifacts
* **Authority**: stores graphs/findings/witnesses
* **Attestor**: signs DSSE envelopes (Ed25519)
* (optional) **SurfaceBuilder**: background worker that computes "vuln surfaces" for packages
The key advance is: **don't treat a CVE as "a package"**. Treat it as a **set of trigger methods** (public API) that can reach the vulnerable code inside the dependency, computed by "Smart-Diff" once and reused everywhere.
---
## 0) Define the contract (precision/soundness) up front
If you don't write this down, you'll fight false positives/negatives forever.
### What Stella Ops will guarantee (first release)
* **Whole-program static call graph** (app + selected dependency assemblies)
* **Context-insensitive** (fast), **path witness** extracted (shortest path)
* **Dynamic dispatch handled** with CHA/RTA (+ DI hints), with explicit uncertainty flags
* **Reflection handled best-effort** (constant-string resolution), otherwise "unknown edge"
### What it will NOT guarantee (first release)
* Perfect handling of reflection / `dynamic` / runtime codegen
* Perfect delegate/event resolution across complex flows
* Full taint/dataflow reachability (you can add later)
This is fine. The major value is: "**we can show you the call path**" and "**we can prove the vuln is triggered by calling these library APIs**".
---
## 1) The big idea: "Vuln surfaces" (Smart-Diff -> triggers)
### Problem
CVE feeds typically say "package X version range Y is vulnerable" but rarely say *which methods*. If you only do package-level reachability, noise is huge.
### Solution
For each CVE+package, compute a **vulnerability surface**:
* **Candidate sinks** = methods changed between vulnerable and fixed versions (diff at IL level)
* **Trigger methods** = *public/exported* methods in the vulnerable version that can reach those changed methods internally
Then your service scan becomes:
> "Can any entrypoint reach any trigger method?"
This is both faster and more precise.
---
## 2) Data model (Authority / Postgres)
You already had call graph tables; here's a concrete schema that supports:
* graph snapshots
* incremental updates
* vuln surfaces
* reachability cache
* DSSE witnesses
### 2.1 Graph tables
```sql
CREATE TABLE cg_snapshots (
snapshot_id BIGSERIAL PRIMARY KEY,
service TEXT NOT NULL,
build_id TEXT NOT NULL,
graph_digest TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE(service, build_id)
);
CREATE TABLE cg_nodes (
node_id BIGSERIAL PRIMARY KEY,
snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
method_key TEXT NOT NULL, -- stable key (see below)
asm_name TEXT,
type_name TEXT,
method_name TEXT,
file_path TEXT,
line_start INT,
il_hash TEXT, -- normalized IL hash for diffing
flags INT NOT NULL DEFAULT 0, -- bitflags: has_reflection, compiler_generated, etc.
UNIQUE(snapshot_id, method_key)
);
CREATE TABLE cg_edges (
snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
src_node_id BIGINT REFERENCES cg_nodes(node_id) ON DELETE CASCADE,
dst_node_id BIGINT REFERENCES cg_nodes(node_id) ON DELETE CASCADE,
kind SMALLINT NOT NULL, -- 0=call,1=newobj,2=dispatch,3=delegate,4=reflection_guess,...
PRIMARY KEY(snapshot_id, src_node_id, dst_node_id, kind)
);
CREATE TABLE cg_entrypoints (
snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
node_id BIGINT REFERENCES cg_nodes(node_id) ON DELETE CASCADE,
kind TEXT NOT NULL, -- http, grpc, cli, job, etc.
name TEXT NOT NULL, -- GET /foo, "Main", etc.
PRIMARY KEY(snapshot_id, node_id, kind, name)
);
```
### 2.2 Vuln surface tables (Smart-Diff artifacts)
```sql
CREATE TABLE vuln_surfaces (
surface_id BIGSERIAL PRIMARY KEY,
ecosystem TEXT NOT NULL, -- nuget
package TEXT NOT NULL,
cve_id TEXT NOT NULL,
vuln_version TEXT NOT NULL, -- a representative vulnerable version
fixed_version TEXT NOT NULL,
surface_digest TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE(ecosystem, package, cve_id, vuln_version, fixed_version)
);
CREATE TABLE vuln_surface_sinks (
surface_id BIGINT REFERENCES vuln_surfaces(surface_id) ON DELETE CASCADE,
sink_method_key TEXT NOT NULL,
reason TEXT NOT NULL, -- changed|added|removed|heuristic
PRIMARY KEY(surface_id, sink_method_key)
);
CREATE TABLE vuln_surface_triggers (
surface_id BIGINT REFERENCES vuln_surfaces(surface_id) ON DELETE CASCADE,
trigger_method_key TEXT NOT NULL,
sink_method_key TEXT NOT NULL,
internal_path JSONB, -- optional: library internal witness path
PRIMARY KEY(surface_id, trigger_method_key, sink_method_key)
);
```
### 2.3 Reachability cache & witnesses
```sql
CREATE TABLE reach_findings (
finding_id BIGSERIAL PRIMARY KEY,
snapshot_id BIGINT REFERENCES cg_snapshots(snapshot_id) ON DELETE CASCADE,
cve_id TEXT NOT NULL,
ecosystem TEXT NOT NULL,
package TEXT NOT NULL,
package_version TEXT NOT NULL,
reachable BOOLEAN NOT NULL,
reachable_entrypoints INT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE(snapshot_id, cve_id, package, package_version)
);
CREATE TABLE reach_witnesses (
witness_id BIGSERIAL PRIMARY KEY,
finding_id BIGINT REFERENCES reach_findings(finding_id) ON DELETE CASCADE,
entry_node_id BIGINT REFERENCES cg_nodes(node_id),
dsse_envelope JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
---
## 3) Stable identity: MethodKey + IL hash
### 3.1 MethodKey (must be stable across builds)
Use a normalized string like:
```
{AssemblyName}|{DeclaringTypeFullName}|{MethodName}`{GenericArity}({ParamType1},{ParamType2},...)
```
Examples:
* `MyApp|BillingController|Pay(System.String)`
* `LibXYZ|LibXYZ.Parser|Parse(System.ReadOnlySpan<System.Byte>)`
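A small sketch of building that key from a Mono.Cecil `MethodDefinition` (Cecil is assumed here because the later sections use it for call-graph extraction; the examples above abbreviate the declaring type where it lives in the global namespace):
```csharp
using System.Linq;
using Mono.Cecil;

// MethodKey per the normalization rule above: assembly | declaring type | name`arity(params).
static string MethodKey(MethodDefinition method)
{
    var parameters = string.Join(",", method.Parameters.Select(p => p.ParameterType.FullName));
    var arity = method.HasGenericParameters ? $"`{method.GenericParameters.Count}" : "";
    return $"{method.Module.Assembly.Name.Name}|{method.DeclaringType.FullName}|{method.Name}{arity}({parameters})";
}
```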
### 3.2 Normalized IL hash (for smart-diff + incremental graph updates)
Raw IL bytes aren't stable (metadata tokens change). Normalize:
* opcode names
* branch targets by *instruction index*, not offset
* method operands by **resolved MethodKey**
* string operands by literal or hashed literal
* type operands by full name
Then hash `SHA256(normalized_bytes)`.
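A sketch of that normalization with Mono.Cecil; operand handling is trimmed to the common cases and would need to grow for switch tables, calli signatures, and numeric operands:
```csharp
using System.Security.Cryptography;
using System.Text;
using Mono.Cecil;
using Mono.Cecil.Cil;

// Normalize: opcode names, branch targets by instruction index, member operands by full name.
static string NormalizedIlHash(MethodDefinition method)
{
    if (!method.HasBody) return "sha256:empty";
    var instructions = method.Body.Instructions;
    var sb = new StringBuilder();
    foreach (var instruction in instructions)
    {
        sb.Append(instruction.OpCode.Name).Append(' ');
        sb.Append(instruction.Operand switch
        {
            Instruction target => $"@{instructions.IndexOf(target)}",   // branch by index, not byte offset
            MethodReference m  => m.FullName,                            // ideally the resolved MethodKey
            FieldReference f   => f.FullName,
            TypeReference t    => t.FullName,
            string s           => $"\"{s}\"",
            null               => "",
            var other          => other.ToString(),
        });
        sb.Append('\n');
    }
    return "sha256:" + Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString()))).ToLowerInvariant();
}
```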
---
*[Remainder of advisory truncated for brevity - see original file for full content]*
---
## 12) What to implement first (in the order that produces value fastest)
### Week 1-2 scope (realistic, shippable)
1. Cecil call graph extraction (direct calls)
2. MVC + Minimal API entrypoints
3. Reverse BFS reachability with path witnesses
4. DSSE witness signing + storage
5. SurfaceBuilder v1:
* IL hash per method
* changed methods as sinks
* triggers via internal reverse BFS
6. UI: "Show Witness" + "Verify Signature"
### Next increment (precision upgrades)
7. async/await mapping to original methods
8. RTA + DI registration hints
9. delegate tracking for Minimal API handlers (if not already)
10. interface override triggers in surface builder
### Later (if you want "attackability", not just "reachability")
11. taint/dataflow for top sink classes (deserialization, path traversal, SQL, command exec)
12. sanitizer modeling & parameter constraints
---
## 13) Common failure modes and how to harden
### MethodKey mismatches (surface vs app call)
* Ensure both are generated from the same normalization rules
* For generic methods, prefer **definition** keys (strip instantiation)
* Store both "exact" and "erased generic" variants if needed
### Multi-target frameworks
* SurfaceBuilder: compute triggers for each TFM, union them
* App scan: choose TFM closest to build RID, but allow fallback to union
### Huge graphs
* Drop `System.*` nodes/edges unless:
* the vuln is in System.* (rare, but handle separately)
* Deduplicate nodes by MethodKey across assemblies where safe
* Use CSR arrays + pooled queues
### Reflection heavy projects
* Mark analysis confidence lower
* Include "unknown edges present" in finding metadata
* Still produce a witness path up to the reflective callsite
---
If you want, I can also paste a **complete Cecil-based CallGraphBuilder class** (nodes+edges+PDB lines), plus the **SurfaceBuilder** that downloads NuGet packages and generates `vuln_surface_triggers` end-to-end.


@@ -0,0 +1,197 @@
# ARCHIVED ADVISORY
> **Archived**: 2025-12-18
> **Status**: IMPLEMENTED
> **Analysis**: Plan file `C:\Users\vlindos\.claude\plans\quizzical-hugging-hearth.md`
>
> ## Implementation Summary
>
> This advisory was analyzed and merged into the existing EPSS implementation plan:
>
> - **Master Plan**: `IMPL_3410_epss_v4_integration_master_plan.md` updated with raw + signal layer schemas
> - **Sprint**: `SPRINT_3413_0001_0001_epss_live_enrichment.md` created with 30 tasks (original 14 + 16 from advisory)
> - **Migrations Created**:
> - `011_epss_raw_layer.sql` - Full JSONB payload storage (~5GB/year)
> - `012_epss_signal_layer.sql` - Tenant-scoped signals with dedupe_key and explain_hash
>
> ## Gap Analysis Result
>
> | Advisory Proposal | Decision | Rationale |
> |-------------------|----------|-----------|
> | Raw feed layer (Layer 1) | IMPLEMENTED | Full JSONB storage for deterministic replay |
> | Normalized layer (Layer 2) | ALIGNED | Already existed in IMPL_3410 |
> | Signal-ready layer (Layer 3) | IMPLEMENTED | Tenant-scoped signals, model change detection |
> | Multi-model support | DEFERRED | No customer demand |
> | Meta-predictor training | SKIPPED | Out of scope (ML complexity) |
> | A/B testing | SKIPPED | Infrastructure overhead |
>
> ## Key Enhancements Implemented
>
> 1. **Raw Feed Layer** (`epss_raw` table) - Stores full CSV payload as JSONB for replay
> 2. **Signal-Ready Layer** (`epss_signal` table) - Tenant-scoped actionable events
> 3. **Model Version Change Detection** - Suppresses noisy deltas on model updates
> 4. **Explain Hash** - Deterministic SHA-256 for audit trail
> 5. **Risk Band Mapping** - CRITICAL/HIGH/MEDIUM/LOW based on percentile
---
# Original Advisory Content
Here's a compact, practical blueprint for bringing **EPSS** into your stack without chaos: a **3-layer ingestion model** that keeps raw data, produces clean probabilities, and emits "signal-ready" events your risk engine can use immediately.
---
# Why this matters (super short)
* **EPSS** = predicted probability a vuln will be exploited soon.
* Mixing "raw EPSS feed" directly into decisions makes audits, rollbacks, and model upgrades painful.
* A **layered model** lets you **version probability evolution**, compare vendors, and train **meta-predictors on deltas** (how risk changes over time), not just on snapshots.
---
# The three layers (and how they map to Stella Ops)
1. **Raw feed layer (immutable)**
* **Goal:** Store exactly what the provider sent (EPSS v4 CSV/JSON, schema drift and all).
* **Stella modules:** `Concelier` (preserve-prune source) writes; `Authority` handles signatures/hashes.
* **Storage:** `postgres.epss_raw` (partitioned by day); blob column for the untouched payload; SHA-256 of source file.
* **Why:** Full provenance + deterministic replay.
2. **Normalized probabilistic layer**
* **Goal:** Clean, typed tables keyed by `cve_id`, with **probability, percentile, model_version, asof_ts**.
* **Stella modules:** `Excititor` (transform); `Policy Engine` reads.
* **Storage:** `postgres.epss_prob` with a **surrogate key** `(cve_id, model_version, asof_ts)` and computed **delta fields** vs previous `asof_ts`.
* **Extras:** Keep optional vendor columns (e.g., FIRST, custom regressors) to compare models side-by-side.
3. **Signal-ready layer (risk engine contracts)**
* **Goal:** Pre-chewed "events" your **Signals/Router** can route instantly.
* **What's inside:** Only the fields needed for gating and UI: `cve_id`, `prob_now`, `prob_delta`, `percentile`, `risk_band`, `explain_hash`.
* **Emit:** `first_signal`, `risk_increase`, `risk_decrease`, `quieted` with **idempotent event keys**.
* **Stella modules:** `Signals` publishes, `Router` fan-outs, `Timeline` records; `Notify` handles subscriptions.
---
# Minimal Postgres schema (ready to paste)
```sql
-- 1) Raw (immutable)
create table epss_raw (
id bigserial primary key,
source_uri text not null,
ingestion_ts timestamptz not null default now(),
asof_date date not null,
payload jsonb not null,
payload_sha256 bytea not null
);
create index on epss_raw (asof_date);
-- 2) Normalized
create table epss_prob (
id bigserial primary key,
cve_id text not null,
model_version text not null,
asof_ts timestamptz not null,
probability double precision not null,
percentile double precision,
features jsonb,
unique (cve_id, model_version, asof_ts)
);
-- 3) Signal-ready
create table epss_signal (
signal_id bigserial primary key,
cve_id text not null,
asof_ts timestamptz not null,
probability double precision not null,
prob_delta double precision,
risk_band text not null,
model_version text not null,
explain_hash bytea not null,
unique (cve_id, model_version, asof_ts)
);
```
---
# C# ingestion skeleton (StellaOps.Scanner.Worker.DotNet style)
```csharp
// 1) Fetch & store raw (Concelier)
public async Task IngestRawAsync(Uri src, DateOnly asOfDate) {
var bytes = await http.GetByteArrayAsync(src);
var sha = SHA256.HashData(bytes);
await pg.ExecuteAsync(
"insert into epss_raw(source_uri, asof_date, payload, payload_sha256) values (@u,@d,@p::jsonb,@s)",
new { u = src.ToString(), d = asOfDate, p = Encoding.UTF8.GetString(bytes), s = sha });
}
// 2) Normalize (Excititor)
public async Task NormalizeAsync(DateOnly asOfDate, string modelVersion) {
var raws = await pg.QueryAsync<(string Payload)>("select payload from epss_raw where asof_date=@d", new { d = asOfDate });
foreach (var r in raws) {
foreach (var row in ParseCsvOrJson(r.Payload)) {
await pg.ExecuteAsync(
@"insert into epss_prob(cve_id, model_version, asof_ts, probability, percentile, features)
values (@cve,@mv,@ts,@prob,@pct,@feat)
on conflict do nothing",
new { cve = row.Cve, mv = modelVersion, ts = row.AsOf, prob = row.Prob, pct = row.Pctl, feat = row.Features });
}
}
}
// 3) Emit signal-ready (Signals)
public async Task EmitSignalsAsync(string modelVersion, double deltaThreshold) {
var rows = await pg.QueryAsync(@"select cve_id, asof_ts, probability,
probability - lag(probability) over (partition by cve_id, model_version order by asof_ts) as prob_delta
from epss_prob where model_version=@mv", new { mv = modelVersion });
foreach (var r in rows) {
var band = Band(r.probability);
if (Math.Abs(r.prob_delta ?? 0) >= deltaThreshold) {
var explainHash = DeterministicExplainHash(r);
await pg.ExecuteAsync(@"insert into epss_signal
(cve_id, asof_ts, probability, prob_delta, risk_band, model_version, explain_hash)
values (@c,@t,@p,@d,@b,@mv,@h)
on conflict do nothing",
new { c = r.cve_id, t = r.asof_ts, p = r.probability, d = r.prob_delta, b = band, mv = modelVersion, h = explainHash });
await bus.PublishAsync("risk.epss.delta", new {
cve = r.cve_id, ts = r.asof_ts, prob = r.probability, delta = r.prob_delta, band, model = modelVersion, explain = Convert.ToHexString(explainHash)
});
}
}
}
```
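The skeleton above assumes two helpers, `Band` and `DeterministicExplainHash`. Minimal sketches follow; the band thresholds are illustrative placeholders (the archived plan keys bands off percentile, so swap the input accordingly), and the explain hash is shown with explicit parameters rather than the dynamic row:
```csharp
// Illustrative probability-to-band mapping; tune thresholds (or switch to percentile) per policy.
static string Band(double probability) => probability switch
{
    >= 0.7  => "CRITICAL",
    >= 0.3  => "HIGH",
    >= 0.05 => "MEDIUM",
    _       => "LOW",
};

// Culture-invariant canonical rendering so the same inputs always hash the same.
static byte[] DeterministicExplainHash(string cveId, DateTime asOfTs, double probability, double? probDelta, string modelVersion)
{
    var invariant = System.Globalization.CultureInfo.InvariantCulture;
    var canonical = string.Join("|",
        cveId,
        asOfTs.ToUniversalTime().ToString("O"),
        probability.ToString("R", invariant),
        (probDelta ?? 0).ToString("R", invariant),
        modelVersion);
    return System.Security.Cryptography.SHA256.HashData(System.Text.Encoding.UTF8.GetBytes(canonical));
}
```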
---
# Versioning & experiments (the secret sauce)
* **Model namespace:** `EPSS-4.0-<regressor-name>-<date>` so you can run multiple variants in parallel.
* **Delta-training:** Train a small meta-predictor on **delta-probability** to forecast **"risk jumps in next N days."**
* **A/B in production:** Route `model_version=x` to 50% of projects; compare **MTTA to patch** and **false-alarm rate**.
---
# Policy & UI wiring (quick contracts)
**Policy gates** (OPA/Rego or internal rules):
* Block if `risk_band in {HIGH, CRITICAL}` **AND** `prob_delta >= 0.1` in last 72h.
* Soften if asset not reachable or mitigated by VEX.
**UI (Evidence pane):**
* Show **sparkline of EPSS over time**, highlight last delta.
* "Why now?" button reveals **explain_hash** -> deterministic evidence payload.
---
# Ops & reliability
* Daily ingestion with **idempotent** runs (raw SHA guard).
* Backfills: re-normalize from `epss_raw` for any new model without re-downloading.
* **Deterministic replay:** export `(raw, transform code hash, model_version)` alongside results.