Add reference architecture and testing strategy documentation
- Created a new document for the Stella Ops Reference Architecture outlining the system's topology, trust boundaries, artifact association, and interfaces.
- Developed a comprehensive Testing Strategy document detailing the importance of offline readiness, interoperability, determinism, and operational guardrails.
- Introduced a README for the Testing Strategy, summarizing processing details and key concepts implemented.
- Added guidance for AI agents and developers in the tests directory, including directory structure, test categories, key patterns, and rules for test development.
Below is a practical, production-grade architecture for building a **vulnerable binaries database**. I’m going to be explicit about what “such a database” can mean, because there are two materially different products:

1. **Known-build catalog**: “These exact shipped binaries (Build-ID / hash) are affected or fixed for CVE X.”
2. **Binary fingerprint DB**: “Even if the binary is unpackaged / self-built, we can match vulnerable code patterns.”

You want both. The first gets you breadth fast; the second is the moat.

---
## 1) Core principle: treat “binary identity” as the primary key

For Linux ELF:

* Primary: `ELF Build-ID` (from `.note.gnu.build-id`)
* Fallback: `sha256(file_bytes)`
* Add: `sha256(.text)` and/or BLAKE3 for speed

This creates a stable identity that survives “package metadata lies.”

**BinaryKey = build_id || file_sha256**

---
## 2) High-level system diagram

```
┌──────────────────────────┐
│   Vulnerability Intel    │
│  OSV/NVD + distro advis. │
└───────────┬──────────────┘
            │ normalize
            v
┌──────────────────────────┐
│   Vuln Knowledge Store   │
│ CVE↔pkg ranges, patches  │
└───────────┬──────────────┘
            │
            │
┌───────────v─────────────────────────────────┐
│ Repo Snapshotter (per distro/arch/date)     │
│ - mirrors metadata + packages (+ debuginfo) │
│ - verifies signatures                       │
│ - emits signed snapshot manifest            │
└───────────┬──────────────────────┬──────────┘
            │                      │
            │ packages             │ debuginfo/sources
            v                      v
┌──────────────────────────┐  ┌───────────────────────────┐
│     Package Unpacker     │  │  Source/Buildinfo Mapper  │
│ - extract files          │  │ - pkg→source commit/patch │
└───────────┬──────────────┘  └─────────────┬─────────────┘
            │ binaries                      │
            v                               │
┌────────────────────────────┐              │
│ Binary Feature Extractor   │              │
│ - Build-ID, hashes         │              │
│ - dyn deps, symbols        │              │
│ - function boundaries (opt)│              │
└───────────┬────────────────┘              │
            │                               │
            v                               v
┌──────────────────────────────────────────────────┐
│          Vulnerable Binary Classifier            │
│ Tier A: pkg/version range                        │
│ Tier B: Build-ID→known shipped build             │
│ Tier C: code fingerprints (function/CFG hashes)  │
└───────────┬──────────────────────────┬───────────┘
            │                          │
            v                          v
┌──────────────────────────┐  ┌───────────────────────────┐
│   Vulnerable Binary DB   │  │  Evidence/Attestation DB  │
│ (indexed by BinaryKey)   │  │ (signed proofs, snapshots)│
└───────────┬──────────────┘  └─────────────┬─────────────┘
            │ publish signed snapshot       │
            v                               v
     Clients/Scanners            Explainable VEX outputs
```
---

## 3) Data stores you actually need

### A) Relational store (Postgres)

Use this for *indexes and joins*.

Key tables:

**`binary_identity`**

* `binary_key` (build_id or file_sha256) PK
* `build_id` (nullable)
* `file_sha256`, `text_sha256`
* `arch`, `osabi`, `type` (ET_DYN/EXEC), `stripped`
* `first_seen_snapshot`, `last_seen_snapshot`

**`binary_package_map`**

* `binary_key`
* `distro`, `pkg_name`, `pkg_version_release`, `arch`
* `file_path_in_pkg`, `snapshot_id`

**`snapshot_manifest`**

* `snapshot_id`
* `distro`, `arch`, `timestamp`
* `repo_metadata_digests`, `signing_key_id`, `dsse_envelope_ref`

**`cve_package_ranges`**

* `cve_id`, `ecosystem` (deb/rpm/apk), `pkg_name`
* `vulnerable_ranges`, `fixed_ranges`
* `advisory_ref`, `snapshot_id`

**`binary_vuln_assertion`**

* `binary_key`, `cve_id`
* `status` ∈ {affected, not_affected, fixed, unknown}
* `method` ∈ {range_match, buildid_catalog, fingerprint_match}
* `confidence` (0–1)
* `evidence_ref` (points to signed evidence)

### B) Object store (S3/MinIO)

Do not bloat Postgres with large blobs.

Store:

* extracted symbol lists, string tables
* function hash maps
* disassembly snippets for matched functions (small)
* DSSE envelopes / attestations
* optional: debug info extracts (or references to where they can be fetched)

### C) Optional search index (OpenSearch/Elastic)

If you want fast “find all binaries exporting `SSL_read`” style queries, index symbols/strings.

---
## 4) Building the database: pipelines

### Pipeline 1: Distro repo snapshots → Known-build catalog (breadth)

This is your fastest route to a “binaries DB.”

**Step 1 — Snapshot**

* Mirror repo metadata + packages for (distro, release, arch).
* Verify signatures (APT Release.gpg, RPM signatures, APK signatures).
* Emit **signed snapshot manifest** (DSSE) listing digests of everything mirrored.

**Step 2 — Extract binaries**
For each package:

* unpack (deb/rpm/apk)
* select ELF files (EXEC + shared libs)
* compute Build-ID, file hash, `.text` hash
* store identity + `binary_package_map`

**Step 3 — Assign CVE status (Tier A + Tier B)**

* Ingest distro advisories and/or OSV mappings into `cve_package_ranges`
* For each `binary_package_map`, apply range checks
* Create `binary_vuln_assertion` entries:
  * `method=range_match` (coarse)
* If you have a Build-ID mapping to exact shipped builds, you can tag:
  * `method=buildid_catalog` (stronger than pure version)

This yields a database where a scanner can do:

* “Given Build-ID, tell me all CVEs per the distro snapshot.”

This already reduces noise because the primary key is the **binary**.
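A sketch of the Step 3 range check, with names and the `ranges` shape chosen for illustration; the version comparator is injected because real pipelines should delegate ordering to the distro's own tooling (the toy comparator below exists only to make the example self-contained).

```python
def classify_tier_a(pkg_version: str, ranges: dict, vercmp) -> dict:
    """Coarse Tier A check for one (binary, CVE) pair.

    `ranges` holds distro version strings, e.g. {"fixed": "1.2.4-1"};
    `vercmp(a, b)` returns <0 / 0 / >0 like strcmp. Illustrative shape only.
    """
    fixed = ranges.get("fixed")
    if fixed is not None and vercmp(pkg_version, fixed) >= 0:
        status = "fixed"
    else:
        status = "affected"
    return {"status": status, "method": "range_match", "confidence": 0.6}

def toy_vercmp(a: str, b: str) -> int:
    # Toy numeric comparator for the example ONLY; do not use in production.
    pa = [int(x) for x in a.replace("-", ".").split(".") if x.isdigit()]
    pb = [int(x) for x in b.replace("-", ".").split(".") if x.isdigit()]
    return (pa > pb) - (pa < pb)
```

The low fixed confidence reflects that `range_match` is the coarsest method in `binary_vuln_assertion`.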
---

### Pipeline 2: Patch-aware classification (backports handled)

To handle “version says vulnerable but backport fixed” you must incorporate patch provenance.

**Step 1 — Build provenance mapping**
Per ecosystem:

* Debian/Ubuntu: parse `Sources`, changelogs, (ideally) `.buildinfo`, patch series.
* RPM distros: SRPM + changelog + patch list.
* Alpine: APKBUILD + patches.

**Step 2 — CVE ↔ patch linkage**
From advisories and patch metadata, store:

* “CVE fixed by patch set P in build B of pkg V-R”

**Step 3 — Apply to binaries**
Instead of version-only, decide:

* if the **specific build** includes the patch
  * mark as `fixed` even if upstream version looks vulnerable

This is still not “binary-only,” but it’s much closer to truth for distros.

---
### Pipeline 3: Binary fingerprint factory (the moat)

This is where you become independent of packaging claims.

You build fingerprints at the **function/CFG level** for high-impact CVEs.

#### 3.1 Select targets

You cannot fingerprint everything. Start with:

* top shared libs (openssl, glibc, zlib, expat, libxml2, curl, sqlite, ncurses, etc.)
* CVEs that are exploited in the wild / high-severity
* CVEs where distros backport heavily (version logic is unreliable)

#### 3.2 Identify “changed functions” from the fix

Input: upstream commit/patch or distro patch.

Process:

* diff the patch
* extract affected files + functions (tree-sitter/ctags + diff hunks)
* list candidate functions and key basic blocks

#### 3.3 Build vulnerable + fixed reference binaries

For each (arch, toolchain profile):

* compile “known vulnerable” and “known fixed”
* ensure reproducibility: record compiler version, flags, link mode
* store provenance (DSSE) for these reference builds

#### 3.4 Extract robust fingerprints

Avoid raw byte signatures (they break across compilers).

Better fingerprint types, from weakest to strongest:

* **symbol-level**: function name + versioned symbol + library SONAME
* **function normalized hash**:
  * disassemble function
  * normalize:
    * strip addresses/relocs
    * bucket registers
    * normalize immediates (where safe)
  * hash instruction sequence or basic-block sequence
* **basic-block multiset hash**:
  * build a set/multiset of block hashes; order-independent
* **lightweight CFG hash**:
  * nodes: block hashes
  * edges: control flow
  * hash canonical representation

Store fingerprints like:

**`vuln_fingerprint`**

* `cve_id`
* `component` (openssl/libssl)
* `arch`
* `fp_type` (func_norm_hash, bb_multiset, cfg_hash)
* `fp_value`
* `function_hint` (name if present; else pattern)
* `confidence`, `notes`
* `evidence_ref` (points to reference builds + patch)
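To make the “function normalized hash” idea concrete, here is a toy sketch over a pre-disassembled x86-64 instruction list. The input shape and the two normalization rules (mask immediates/addresses, bucket general-purpose registers) are deliberately simplified for illustration; a real extractor works on a proper disassembler's IR.

```python
import re
from hashlib import sha256

# Toy normalization rules; real pipelines normalize at the disassembler-IR level.
IMM_RE = re.compile(r"0x[0-9a-fA-F]+")
REG_RE = re.compile(r"\b(rax|rbx|rcx|rdx|rsi|rdi|rbp|rsp|r8|r9|r1[0-5])\b")

def normalize_insn(insn: str) -> str:
    """Strip addresses/immediates and bucket registers so the hash
    survives relocation and register-allocation differences."""
    insn = IMM_RE.sub("IMM", insn)
    insn = REG_RE.sub("REG", insn)
    return insn

def func_norm_hash(disassembly: list[str]) -> str:
    normalized = "\n".join(normalize_insn(i) for i in disassembly)
    return sha256(normalized.encode()).hexdigest()
```

Two compilations that differ only in addresses and register choice hash identically, while a genuinely different instruction sequence does not.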
#### 3.5 Validate fingerprints at scale

This is non-negotiable.

Validation loop:

* Test against:
  * known vulnerable builds (must match)
  * known fixed builds (must not match)
  * large “benign corpus” (estimate false positives)
* Maintain:
  * precision/recall metrics per fingerprint
  * confidence score

Only promote fingerprints to “production” when validation passes thresholds.

---
## 5) Query-time logic (how scanners use the DB)

Given a target binary, the scanner computes:

* `binary_key`
* basic features (arch, SONAME, symbols)
* optional function hashes (for targeted libs)

Then it queries in this precedence order:

1. **Exact match**: `binary_key` exists with explicit assertion (strong)
2. **Build catalog**: Build-ID→known distro build→CVE mapping (strong)
3. **Fingerprint match**: function/CFG hashes hit (strong, binary-only)
4. **Fallback**: package range matching (weakest)

Return result as a signed VEX with evidence references.
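The precedence order above can be wired as a simple first-hit resolver. All names here are illustrative: each lookup is a callable returning an assertion dict or `None`, so the same skeleton works whether the lookups hit Postgres, a cache, or a remote service.

```python
# Precedence: strongest evidence first, coarse range matching last.
METHOD_ORDER = ("exact_match", "buildid_catalog", "fingerprint_match", "range_match")

def resolve_verdict(binary: dict, lookups: dict) -> dict:
    """Walk the lookups in precedence order and return the first hit.

    `lookups` maps method name -> callable(binary) -> assertion dict or None.
    Illustrative wiring, not a fixed API.
    """
    for method in METHOD_ORDER:
        fn = lookups.get(method)
        if fn is None:
            continue
        hit = fn(binary)
        if hit is not None:
            return {"method": method, **hit}
    return {"method": "none", "status": "unknown"}
```

Because stronger methods short-circuit weaker ones, a fingerprint hit is never overridden by a noisy range match.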
---

## 6) Update model: “sealed knowledge snapshots”

To make this auditable and customer-friendly:

* Every repo snapshot is immutable and signed.
* Every fingerprint bundle is versioned and signed.
* Every “vulnerable binaries DB release” is a signed manifest pointing to:
  * which repo snapshots were used
  * which advisory snapshots were used
  * which fingerprint sets were included

This lets you prove:

* what you knew
* when you knew it
* exactly which data drove the verdict
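A sketch of the release-manifest side (field names are illustrative): canonicalize the manifest so the digest is deterministic, then sign those bytes — e.g. inside a DSSE envelope — with the key kept in your HSM/KMS. Signing itself is omitted here.

```python
import json
from hashlib import sha256

def release_manifest_digest(repo_snapshots, advisory_snapshots, fingerprint_sets) -> str:
    """Deterministic digest of a DB-release manifest.

    Inputs are sorted and JSON is canonicalized (sorted keys, no whitespace)
    so the same logical release always yields the same digest to sign.
    """
    manifest = {
        "repo_snapshots": sorted(repo_snapshots),
        "advisory_snapshots": sorted(advisory_snapshots),
        "fingerprint_sets": sorted(fingerprint_sets),
    }
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256(canonical.encode()).hexdigest()
```

The canonicalization step is what makes “what you knew, when” verifiable: two parties listing the same inputs in different order still agree on the digest.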
---

## 7) Scaling and cost control

Without control, fingerprinting explodes. Use these constraints:

* Only disassemble/hash functions for:
  * libraries in your “hot set”
  * binaries whose package indicates relevance to a targeted CVE family
* Deduplicate aggressively:
  * identical `.text_sha256` ⇒ reuse extracted functions
  * identical Build-ID across paths ⇒ reuse features
* Incremental snapshots:
  * process only new/changed packages per snapshot
  * store “already processed digest” cache (Valkey)
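The dedup/incremental rules reduce to one check-and-mark primitive. Here is an in-memory stand-in (class name illustrative); production would back the set with Valkey so workers share it.

```python
class ProcessedCache:
    """In-memory stand-in for the 'already processed digest' cache."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_process(self, digest: str) -> bool:
        """True exactly once per digest: skip packages/binaries already done."""
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Keying the cache on content digests (not package names) is what makes both rules work: a rebuilt snapshot that ships byte-identical packages costs nothing.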
---
|
||||
|
||||
## 8) Security and trust boundaries
|
||||
|
||||
A vulnerable binary DB is itself a high-value target. Hardening must be part of architecture:
|
||||
|
||||
* Verify upstream repo signatures before ingestion.
|
||||
* Run unpacking/extraction in sandboxes (namespaces/seccomp) because packages can be hostile.
|
||||
* Sign:
|
||||
|
||||
* snapshot manifests
|
||||
* fingerprint sets
|
||||
* DB releases
|
||||
* Keep signing keys in an HSM/KMS.
|
||||
* Maintain provenance chain: input digests → output digests.
|
||||
|
||||
---
|
||||
|
||||
## 9) Minimal viable build plan (no wishful thinking)

### MVP 1: Known-build binary catalog (fast, broad)

* Repo snapshot + Build-ID extraction
* CVE ranges + advisory snapshots
* `binary_key → (pkg@ver, CVEs)` mapping
* Signed snapshot releases

### MVP 2: Patch-aware backport handling (precision jump)

* Source/patch mapping
* Build-specific “fixed-by-backport” logic

### MVP 3: Fingerprints for top components (moat)

* Fingerprint factory for ~20–50 “most reused” libs
* Validation corpus + confidence scoring

---
## 10) What you can claim (and what you should not)

You can defensibly claim:

* “We can identify vulnerable **binaries** (not just packages) with verifiable evidence.”

You should not claim (until you have it):

* “We can detect every vulnerable binary on earth.”

No one can do that without unacceptable false positives/negatives.

---

If you want this aligned tightly to your Stella Ops services split (scanner webservice lattice algorithms + proof-of-integrity graph), I can map the above into:

* concrete microservices,
* queue/job boundaries,
* and the exact API surfaces (`/snapshots`, `/features`, `/fingerprints`, `/match`, `/vex`).

---
To code **MVP 2 (patch-aware backport handling)** without any human triage, you need one thing in your database that most scanners do not maintain:

> A normalized, distro-specific index:
> **(distro, release, source-package, CVE) → fixed-in distro version (or not-affected / wontfix / unknown), with evidence and confidence.**

Then every scan decision becomes a deterministic version comparison against that index, not against upstream semver.

Below is a concrete, buildable approach (pipelines + data model + code skeletons) that stays fully automated.

---
## 1) What MVP2 computes

### Output table you must build

**`cve_fix_index`**

* `distro` (e.g., debian, ubuntu, rhel, alpine)
* `release` (e.g., bookworm, jammy, 9, 3.19)
* `source_pkg` (not binary subpackage)
* `cve_id`
* `state` ∈ {`fixed`, `vulnerable`, `not_affected`, `wontfix`, `unknown`}
* `fixed_version` (nullable; distro version string, including revision)
* `method` ∈ {`security_feed`, `changelog`, `patch_header`, `upstream_patch_match`}
* `confidence` (float)
* `evidence` (JSON: references to advisory entry, changelog lines, patch names + digests)
* `snapshot_id` (your sealed snapshot identifier)

### Why “source package”?

Security trackers and patch sets are tracked at the **source** level (e.g., `openssl`), while runtime installs are often **binary subpackages** (e.g., `libssl3`). You need a stable join:
`binary_pkg -> source_pkg`.

---
## 2) No-human signals, in strict priority order

You can do this with **zero manual** work by using a tiered resolver:

### Tier 1 — Structured distro security feed (highest precision)

This is the authoritative “backport-aware” answer because it encodes:

* “fixed in 1.1.1n-0ubuntu2.4” (even if upstream says “fixed in 1.1.1o”)
* “not affected” cases
* sometimes arch-specific applicability

Your ingestor just parses and normalizes it.

### Tier 2 — Source package changelog CVE mentions

If a feed entry is missing/late, parse the source changelog:

* Debian/Ubuntu: `debian/changelog`
* RPM: `%changelog` in `.spec`
* Alpine: `secfixes` in `APKBUILD` (often present)

This is surprisingly effective because maintainers often include “CVE-XXXX-YYYY” in the entry that introduced the fix.

### Tier 3 — Patch metadata (DEP-3 headers / patch filenames)

Parse patches shipped with the source package:

* Debian: `debian/patches/*` + `debian/patches/series`
* RPM: patch files listed in the spec / SRPM
* Alpine: `patches/*.patch` in the aport

Search patch headers and filenames for CVE IDs, and store patch hashes.

### Tier 4 — Upstream patch equivalence (optional in MVP2, strong)

If you can map CVE→upstream fix commit (OSV often helps), you can match canonicalized patch hunks against distro patches.

MVP2 can ship without Tier 4; Tiers 1+2 already eliminate most backport false positives.

---
## 3) Architecture: the “Fix Index Builder” job

### Inputs

* Your sealed repo snapshot: Packages + Sources (or SRPM/aports)
* Distro security feed snapshot (OVAL/JSON/errata tracker) for the same release
* (Optional) OSV/NVD upstream ranges, for fallback only

### Processing graph

1. **Build `binary_pkg → source_pkg` map** from repo metadata
2. **Ingest security feed** → produce `FixRecord(method=security_feed, confidence=0.95)`
3. **For each source package in the snapshot**:
   * unpack source
   * parse changelog for CVE mentions → `FixRecord(method=changelog, confidence=0.75–0.85)`
   * parse patch headers → `FixRecord(method=patch_header, confidence=0.80–0.90)`
4. **Merge** records into a single best record per key (distro, release, source_pkg, cve)
5. Store into `cve_fix_index` with evidence
6. Sign the resulting snapshot manifest

---
## 4) Merge logic (no human, deterministic)

You need a deterministic rule for conflicts.

Recommended (conservative but still precision-improving):

1. If any record says `not_affected` with confidence ≥ 0.9 → choose `not_affected`
2. Else if any record says `fixed` with confidence ≥ 0.9 → choose `fixed` and `fixed_version = max_fixed_version_among_high_conf`
3. Else if any record says `fixed` at all → choose `fixed` with the best available `fixed_version`
4. Else if any says `wontfix` → choose `wontfix`
5. Else `unknown`

Additionally:

* Keep *all* evidence records in `evidence` so you can explain and audit.
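The five rules above can be implemented directly. This is a sketch: the record shape follows the `cve_fix_index` fields, and `vercmp` is injected so the distro's official comparator decides version ordering.

```python
def merge_fix_records(records: list[dict], vercmp) -> dict:
    """Deterministic merge for one (distro, release, source_pkg, cve) key.

    Implements the five rules; `vercmp(a, b)` returns <0 / 0 / >0 under the
    distro's version ordering. Sketch only: evidence aggregation is omitted.
    """
    def best_fixed_version(recs):
        ver = None
        for r in recs:
            v = r.get("fixed_version")
            if v and (ver is None or vercmp(v, ver) > 0):
                ver = v
        return ver

    high = [r for r in records if r["confidence"] >= 0.9]
    if any(r["state"] == "not_affected" for r in high):          # rule 1
        return {"state": "not_affected"}
    high_fixed = [r for r in high if r["state"] == "fixed"]
    if high_fixed:                                               # rule 2
        return {"state": "fixed", "fixed_version": best_fixed_version(high_fixed)}
    any_fixed = [r for r in records if r["state"] == "fixed"]
    if any_fixed:                                                # rule 3
        return {"state": "fixed", "fixed_version": best_fixed_version(any_fixed)}
    if any(r["state"] == "wontfix" for r in records):            # rule 4
        return {"state": "wontfix"}
    return {"state": "unknown"}                                  # rule 5
```

Because the rules are pure and ordered, reruns over the same evidence always produce the same index — which is what makes the pipeline auditable without humans.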
---

## 5) Version comparison: do not reinvent it

Backport handling lives or dies on correct version ordering.

### Practical approach (recommended for ingestion + server-side decisioning)

Use official tooling in containerized workers:

* Debian/Ubuntu: `dpkg --compare-versions`
* RPM distros: `rpmdev-vercmp` or the `rpm` library
* Alpine: `apk version -t`

This is reliable and avoids subtle comparator bugs.

If you must do it in-process, use well-tested libraries per ecosystem (but containerized official tools are the most robust).
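For Debian, the wrapper is a one-liner around `dpkg --compare-versions`, which exits 0 when the relation holds. The `run` parameter is an assumption added here so the wrapper can be exercised without dpkg installed:

```python
import subprocess

def dpkg_vercmp(a: str, op: str, b: str, run=subprocess.run) -> bool:
    """True if `a <op> b` holds under Debian version ordering.

    `op` is a dpkg operator string such as 'lt', 'le', 'eq', 'ge', 'gt'.
    Delegates to the official comparator instead of reimplementing
    epoch/~-tilde/revision rules.
    """
    result = run(["dpkg", "--compare-versions", a, op, b])
    return result.returncode == 0
```

In a worker container you would call it as `dpkg_vercmp("1.1.1n-0ubuntu2.4", "ge", "1.1.1n")`; the equivalents for RPM and Alpine wrap their respective tools the same way.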
---

## 6) Concrete code: Debian/Ubuntu changelog + patch parsing

This example shows **Tier 2 + Tier 3** inference for a single unpacked source tree. You would wrap this inside your snapshot processing loop.

### 6.1 CVE extractor

```python
import re
from pathlib import Path
from hashlib import sha256

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")

def extract_cves(text: str) -> set[str]:
    return set(CVE_RE.findall(text or ""))
```
### 6.2 Parse the *top* debian/changelog entry (for this version)

This works well because when you unpack a `.dsc` for version `V`, the top entry is for `V`.

```python
def parse_debian_changelog_top_entry(src_dir: Path) -> tuple[str, set[str], dict]:
    """
    Returns:
        version: str
        cves: set[str] found in the top entry
        evidence: dict with excerpt for explainability
    """
    changelog_path = src_dir / "debian" / "changelog"
    if not changelog_path.exists():
        return "", set(), {}

    lines = changelog_path.read_text(errors="replace").splitlines()
    if not lines:
        return "", set(), {}

    # First line: "pkgname (version) distro; urgency=..."
    m = re.match(r"^[^\s]+\s+\(([^)]+)\)\s+", lines[0])
    version = m.group(1) if m else ""

    entry_lines = [lines[0]]
    # Collect until the maintainer trailer line: " -- Name <email>  date"
    for line in lines[1:]:
        entry_lines.append(line)
        if line.startswith(" -- "):
            break

    entry_text = "\n".join(entry_lines)
    cves = extract_cves(entry_text)

    evidence = {
        "file": "debian/changelog",
        "version": version,
        "excerpt": entry_text[:2000],  # store a small excerpt, not the whole file
    }
    return version, cves, evidence
```
### 6.3 Parse CVEs from patch headers (DEP-3-ish)

```python
def parse_debian_patches_for_cves(src_dir: Path) -> tuple[dict[str, list[dict]], dict]:
    """
    Returns:
        cve_to_patches: {CVE: [ {path, sha256, header_excerpt}, ... ]}
        evidence_summary: dict
    """
    patches_dir = src_dir / "debian" / "patches"
    if not patches_dir.exists():
        return {}, {}

    cve_to_patches: dict[str, list[dict]] = {}

    for patch in patches_dir.glob("*"):
        if not patch.is_file():
            continue
        # Read only the first N lines to keep it cheap
        header = "\n".join(patch.read_text(errors="replace").splitlines()[:80])
        cves = extract_cves(header + "\n" + patch.name)
        if not cves:
            continue

        digest = sha256(patch.read_bytes()).hexdigest()
        rec = {
            "path": str(patch.relative_to(src_dir)),
            "sha256": digest,
            "header_excerpt": header[:1200],
        }
        for cve in cves:
            cve_to_patches.setdefault(cve, []).append(rec)

    evidence = {
        "dir": "debian/patches",
        "matched_cves": len(cve_to_patches),
    }
    return cve_to_patches, evidence
```
### 6.4 Produce FixRecords from the source tree

```python
def infer_fix_records_from_debian_source(
    src_dir: Path, distro: str, release: str, source_pkg: str, snapshot_id: str
):
    version, changelog_cves, changelog_ev = parse_debian_changelog_top_entry(src_dir)
    cve_to_patches, patch_ev = parse_debian_patches_for_cves(src_dir)

    records = []

    # Changelog-based: treat a CVE mentioned in the top entry as fixed in this version
    for cve in changelog_cves:
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "changelog",
            "confidence": 0.80,
            "evidence": {"changelog": changelog_ev},
            "snapshot_id": snapshot_id,
        })

    # Patch-header-based: treat CVE-tagged patches as fixed in this version
    for cve, patches in cve_to_patches.items():
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "patch_header",
            "confidence": 0.87,
            "evidence": {"patches": patches, "patch_summary": patch_ev},
            "snapshot_id": snapshot_id,
        })

    return records
```

That is the automated “patch-aware” signal generator.

---
## 7) Wiring this into your database build

### 7.1 Store raw evidence and merged result

Two-stage storage is worth it:

1. `cve_fix_evidence` (append-only)
2. `cve_fix_index` (merged best record)

So you can:

* rerun merge rules
* improve confidence scoring
* keep auditability

### 7.2 Merging “fixed_version” for a CVE

When multiple versions mention the same CVE, you usually want the **latest** mentioning version (highest by the distro comparator), because repeated mentions often indicate an earlier partial fix.

Pseudo:

```python
def choose_fixed_version(existing: str | None, candidate: str, vercmp) -> str:
    if not existing:
        return candidate
    return candidate if vercmp(candidate, existing) > 0 else existing
```

Where `vercmp` calls `dpkg --compare-versions` (Debian) or the equivalent for that distro.

---
## 8) Decisioning logic at scan time (what changes with MVP2)

Without MVP2, you likely do:

* an upstream range check (false positives for backports)

With MVP2, you do:

1. identify `distro+release` from the environment (or image base)
2. map `binary_pkg → source_pkg`
3. query `cve_fix_index(distro, release, source_pkg, cve)`
4. if `state=fixed` and `pkg_version >= fixed_version` → fixed
5. if `state=not_affected` → safe
6. else fall back to upstream ranges

That single substitution removes most backport noise.
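The six steps above can be sketched as one function. All collaborators are injected (names illustrative): `source_of` is the `binary_pkg → source_pkg` map, `fix_index` is the `cve_fix_index` lookup, and `upstream_check` is the old range fallback.

```python
def decide(distro, release, binary_pkg, pkg_version, cve,
           source_of, fix_index, vercmp, upstream_check):
    """MVP2 scan-time decision. `vercmp(a, b)` orders distro versions."""
    source_pkg = source_of(binary_pkg)                          # step 2
    rec = fix_index.get((distro, release, source_pkg, cve))     # step 3
    if rec:
        if rec["state"] == "fixed" and vercmp(pkg_version, rec["fixed_version"]) >= 0:
            return "fixed"                                      # step 4
        if rec["state"] == "not_affected":
            return "not_affected"                               # step 5
        if rec["state"] == "fixed":
            return "vulnerable"  # a fix exists but this build predates it
    return upstream_check(binary_pkg, pkg_version, cve)         # step 6
```

Note the asymmetry: a `fixed` index entry both clears newer builds and positively flags older ones, which is exactly the backport-aware behavior upstream ranges cannot give you.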
---

## 9) Practical notes so you don’t get trapped

### A) You must know the distro release

Backport reality is release-specific. The same package name/version can have different patching across releases.

### B) Arch-specific fixes exist

Your schema should allow `arch` on fix records (nullable). If the feed says “only amd64 affected,” store it.

### C) False positives in changelog parsing

Mitigation without humans:

* require a CVE mention + security-ish verbs (“fix”, “security”, “patch”) OR presence in patch headers
* otherwise lower confidence and let the feed override

### D) Keep evidence small

Store:

* excerpts + patch hashes

Not entire source tarballs.

---

## 10) Minimal “done definition” for MVP2

You have MVP2 when, for Debian/Ubuntu at least, you can demonstrate:

* A CVE that upstream marks vulnerable for version X
* The distro backported it in X-<revision>
* Your system classifies:
  * `X-older_revision` as vulnerable
  * `X-newer_revision` as fixed
* With evidence: fix feed record and/or changelog/patch proof

No human required.

---
If you want, I can provide the same “Tier 2/3 inference” module for RPM (SRPM/spec parsing) and Alpine (APKBUILD `secfixes` extraction), plus the exact Postgres DDL for `cve_fix_evidence` and `cve_fix_index`, and the merge SQL.