Below is a practical, production-grade architecture for building a **vulnerable binaries database**. I'm going to be explicit about what "such a database" can mean, because there are two materially different products:

1. **Known-build catalog**: "These exact shipped binaries (Build-ID / hash) are affected or fixed for CVE X."
2. **Binary fingerprint DB**: "Even if the binary is unpackaged / self-built, we can match vulnerable code patterns."

You want both. The first gets you breadth fast; the second is the moat.

---

## 1) Core principle: treat "binary identity" as the primary key

For Linux ELF:

* Primary: `ELF Build-ID` (from `.note.gnu.build-id`)
* Fallback: `sha256(file_bytes)`
* Add: `sha256(.text)` and/or BLAKE3 for speed

This creates a stable identity that survives "package metadata lies."

**BinaryKey = build_id if present, else file_sha256**
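As a concrete illustration, here is a minimal identity extractor, assuming `pyelftools` is available; `compute_binary_key` is our own helper name, not a library API:

```python
from hashlib import sha256
from elftools.elf.elffile import ELFFile  # assumes pyelftools is installed

def compute_binary_key(path: str) -> dict:
    with open(path, "rb") as f:
        file_sha256 = sha256(f.read()).hexdigest()

    build_id = None
    text_sha256 = None
    with open(path, "rb") as f:
        elf = ELFFile(f)
        note = elf.get_section_by_name(".note.gnu.build-id")
        if note is not None:
            for n in note.iter_notes():
                if n["n_type"] == "NT_GNU_BUILD_ID":
                    build_id = n["n_desc"]  # pyelftools exposes this as a hex string
        text = elf.get_section_by_name(".text")
        if text is not None:
            text_sha256 = sha256(text.data()).hexdigest()

    return {
        "binary_key": build_id or file_sha256,  # Build-ID preferred, hash fallback
        "build_id": build_id,
        "file_sha256": file_sha256,
        "text_sha256": text_sha256,
    }
```

On stripped binaries without a Build-ID, the key degrades to the file hash, which is exactly the fallback behavior described above.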
---

## 2) High-level system diagram

```
        ┌───────────────────────────┐
        │    Vulnerability Intel    │
        │  OSV/NVD + distro advis.  │
        └────────────┬──────────────┘
                     │ normalize
                     v
        ┌───────────────────────────┐
        │   Vuln Knowledge Store    │
        │  CVE↔pkg ranges, patches  │
        └────────────┬──────────────┘
                     │
┌────────────────────v─────────────────────────────┐
│ Repo Snapshotter (per distro/arch/date)          │
│  - mirrors metadata + packages (+ debuginfo)     │
│  - verifies signatures                           │
│  - emits signed snapshot manifest                │
└───────────┬────────────────────────┬─────────────┘
            │ packages               │ debuginfo/sources
            v                        v
┌───────────────────────────┐  ┌───────────────────────────┐
│     Package Unpacker      │  │  Source/Buildinfo Mapper  │
│  - extract files          │  │  - pkg→source commit/patch│
└───────────┬───────────────┘  └───────────┬───────────────┘
            │ binaries                     │
            v                              │
┌───────────────────────────┐              │
│ Binary Feature Extractor  │              │
│  - Build-ID, hashes       │              │
│  - dyn deps, symbols      │              │
│  - function boundaries    │              │
│    (optional)             │              │
└───────────┬───────────────┘              │
            │                              │
            v                              v
┌──────────────────────────────────────────────────┐
│          Vulnerable Binary Classifier            │
│ Tier A: pkg/version range                        │
│ Tier B: Build-ID→known shipped build             │
│ Tier C: code fingerprints (function/CFG hashes)  │
└───────────┬────────────────────────┬─────────────┘
            │                        │
            v                        v
┌───────────────────────────┐  ┌───────────────────────────┐
│   Vulnerable Binary DB    │  │  Evidence/Attestation DB  │
│  (indexed by BinaryKey)   │  │ (signed proofs, snapshots)│
└───────────┬───────────────┘  └───────────┬───────────────┘
            │ publish signed snapshot      │
            v                              v
       Clients/Scanners         Explainable VEX outputs
```

---

## 3) Data stores you actually need

### A) Relational store (Postgres)

Use this for *indexes and joins*. Key tables:

**`binary_identity`**

* `binary_key` (build_id or file_sha256) PK
* `build_id` (nullable)
* `file_sha256`, `text_sha256`
* `arch`, `osabi`, `type` (ET_DYN/ET_EXEC), `stripped`
* `first_seen_snapshot`, `last_seen_snapshot`

**`binary_package_map`**

* `binary_key`
* `distro`, `pkg_name`, `pkg_version_release`, `arch`
* `file_path_in_pkg`, `snapshot_id`

**`snapshot_manifest`**

* `snapshot_id`
* `distro`, `arch`, `timestamp`
* `repo_metadata_digests`, `signing_key_id`, `dsse_envelope_ref`

**`cve_package_ranges`**

* `cve_id`, `ecosystem` (deb/rpm/apk), `pkg_name`
* `vulnerable_ranges`, `fixed_ranges`
* `advisory_ref`, `snapshot_id`

**`binary_vuln_assertion`**

* `binary_key`, `cve_id`
* `status` ∈ {affected, not_affected, fixed, unknown}
* `method` ∈ {range_match, buildid_catalog, fingerprint_match}
* `confidence` (0–1)
* `evidence_ref` (points to signed evidence)

### B) Object store (S3/MinIO)

Do not bloat Postgres with large blobs. Store:

* extracted symbol lists, string tables
* function hash maps
* disassembly snippets for matched functions (small)
* DSSE envelopes / attestations
* optional: debug-info extracts (or references to where they can be fetched)

### C) Optional search index (OpenSearch/Elastic)

If you want fast "find all binaries exporting `SSL_read`"-style queries, index symbols/strings.

---

## 4) Building the database: pipelines

### Pipeline 1: Distro repo snapshots → Known-build catalog (breadth)

This is your fastest route to a "binaries DB."

**Step 1 — Snapshot**

* Mirror repo metadata + packages for (distro, release, arch).
* Verify signatures (APT `Release.gpg`, RPM signatures, APK signatures).
* Emit a **signed snapshot manifest** (DSSE) listing digests of everything mirrored.

**Step 2 — Extract binaries**

For each package:

* unpack (deb/rpm/apk)
* select ELF files (executables + shared libs)
* compute Build-ID, file hash, `.text` hash
* store identity + `binary_package_map`

**Step 3 — Assign CVE status (Tier A + Tier B)**

* Ingest distro advisories and/or OSV mappings into `cve_package_ranges`
* For each `binary_package_map` entry, apply range checks
* Create `binary_vuln_assertion` entries:
  * `method=range_match` (coarse)
* If you have a Build-ID mapping to exact shipped builds, you can tag:
  * `method=buildid_catalog` (stronger than pure version matching)

This yields a database where a scanner can do:

* "Given a Build-ID, tell me all CVEs per the distro snapshot."

This already reduces noise because the primary key is the **binary**.

---

### Pipeline 2: Patch-aware classification (backports handled)

To handle "the version says vulnerable, but a backport fixed it," you must incorporate patch provenance.

**Step 1 — Build provenance mapping**

Per ecosystem:

* Debian/Ubuntu: parse `Sources`, changelogs, (ideally) `.buildinfo`, patch series.
* RPM distros: SRPM + changelog + patch list.
* Alpine: APKBUILD + patches.

**Step 2 — CVE ↔ patch linkage**

From advisories and patch metadata, store:

* "CVE fixed by patch set P in build B of pkg V-R"

**Step 3 — Apply to binaries**

Instead of deciding by version alone, decide:

* whether the **specific build** includes the patch
* mark as `fixed` even if the upstream version looks vulnerable

This is still not "binary-only," but it's much closer to the truth for distros.

---

### Pipeline 3: Binary fingerprint factory (the moat)

This is where you become independent of packaging claims. You build fingerprints at the **function/CFG level** for high-impact CVEs.

#### 3.1 Select targets

You cannot fingerprint everything. Start with:

* top shared libs (openssl, glibc, zlib, expat, libxml2, curl, sqlite, ncurses, etc.)
* CVEs that are exploited in the wild / high-severity
* CVEs where distros backport heavily (version logic is unreliable)

#### 3.2 Identify "changed functions" from the fix

Input: the upstream commit/patch or the distro patch.

Process:

* diff the patch
* extract affected files + functions (tree-sitter/ctags + diff hunks)
* list candidate functions and key basic blocks

#### 3.3 Build vulnerable + fixed reference binaries

For each (arch, toolchain profile):

* compile "known vulnerable" and "known fixed"
* ensure reproducibility: record compiler version, flags, link mode
* store provenance (DSSE) for these reference builds

#### 3.4 Extract robust fingerprints

Avoid raw byte signatures (they break across compilers). Better fingerprint types, from weakest to strongest:

* **symbol-level**: function name + versioned symbol + library SONAME
* **function-normalized hash**:
  * disassemble the function
  * normalize:
    * strip addresses/relocs
    * bucket registers
    * normalize immediates (where safe)
  * hash the instruction sequence or basic-block sequence
* **basic-block multiset hash**:
  * build a set/multiset of block hashes; order-independent
* **lightweight CFG hash**:
  * nodes: block hashes
  * edges: control flow
  * hash a canonical representation

Store fingerprints like:

**`vuln_fingerprint`**

* `cve_id`
* `component` (openssl/libssl)
* `arch`
* `fp_type` (func_norm_hash, bb_multiset, cfg_hash)
* `fp_value`
* `function_hint` (name if present; else pattern)
* `confidence`, `notes`
* `evidence_ref` (points to reference builds + patch)
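To make the "function-normalized hash" flavor concrete, here is a minimal sketch, assuming Capstone for disassembly and x86-64 code; the regex-based normalization is a deliberately crude placeholder (production pipelines would normalize at the operand level, per architecture), and `function_norm_hash` is our own name:

```python
import re
from hashlib import sha256
import capstone

def function_norm_hash(code: bytes, addr: int = 0x1000) -> str:
    """Hash a function body after stripping compiler-variable details."""
    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    tokens = []
    for insn in md.disasm(code, addr):
        ops = insn.op_str
        # Bucket immediates and addresses so relocation noise disappears.
        ops = re.sub(r"0x[0-9a-f]+", "IMM", ops)
        # Crude register bucketing (x86-64 GPRs only; a real pipeline
        # would use operand metadata instead of regexes).
        ops = re.sub(r"\b[re]?[a-ds][xpi]\b|\br\d+[dwb]?\b", "REG", ops)
        tokens.append(f"{insn.mnemonic} {ops}".strip())
    return sha256("\n".join(tokens).encode()).hexdigest()
```

The same tokenized instruction stream can feed the basic-block multiset and CFG hashes: hash per-block token runs instead of the whole sequence, then combine them order-independently or over a canonical edge list.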
#### 3.5 Validate fingerprints at scale

This is non-negotiable. The validation loop:

* Test against:
  * known vulnerable builds (must match)
  * known fixed builds (must not match)
  * a large "benign corpus" (to estimate false positives)
* Maintain:
  * precision/recall metrics per fingerprint
  * a confidence score

Only promote fingerprints to "production" when validation passes thresholds.

---

## 5) Query-time logic (how scanners use the DB)

Given a target binary, the scanner computes:

* `binary_key`
* basic features (arch, SONAME, symbols)
* optional function hashes (for targeted libs)

Then it queries in this precedence order:

1. **Exact match**: `binary_key` exists with an explicit assertion (strong)
2. **Build catalog**: Build-ID → known distro build → CVE mapping (strong)
3. **Fingerprint match**: function/CFG hashes hit (strong, binary-only)
4. **Fallback**: package range matching (weakest)

Return the result as a signed VEX with evidence references.
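A minimal sketch of that precedence logic; the `db.*` lookup helpers here are hypothetical placeholders for queries against the stores described above, not real APIs:

```python
def resolve_binary(db, binary_key: str, build_id: str | None,
                   func_hashes: list[str]) -> dict:
    """Try each evidence source in order of strength; stop at the first hit."""
    lookups = (
        ("exact_assertion",   lambda: db.exact_assertions(binary_key)),
        ("buildid_catalog",   lambda: db.buildid_catalog(build_id) if build_id else []),
        ("fingerprint_match", lambda: db.fingerprint_hits(func_hashes)),
        ("range_match",       lambda: db.range_matches(binary_key)),  # weakest
    )
    for method, lookup in lookups:
        hits = lookup()
        if hits:
            return {"method": method, "assertions": hits}
    return {"method": "none", "assertions": []}
```

Evaluating the lookups lazily matters: the weak range-match fallback only runs when every stronger, binary-grounded method has missed.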
---

## 6) Update model: "sealed knowledge snapshots"

To make this auditable and customer-friendly:

* Every repo snapshot is immutable and signed.
* Every fingerprint bundle is versioned and signed.
* Every "vulnerable binaries DB release" is a signed manifest pointing to:
  * which repo snapshots were used
  * which advisory snapshots were used
  * which fingerprint sets were included

This lets you prove:

* what you knew
* when you knew it
* exactly which data drove the verdict

---

## 7) Scaling and cost control

Without control, fingerprinting explodes. Use these constraints:

* Only disassemble/hash functions for:
  * libraries in your "hot set"
  * binaries whose package indicates relevance to a targeted CVE family
* Deduplicate aggressively:
  * identical `text_sha256` ⇒ reuse extracted functions
  * identical Build-ID across paths ⇒ reuse features
* Incremental snapshots:
  * process only new/changed packages per snapshot
  * keep an "already processed digest" cache (Valkey)

---

## 8) Security and trust boundaries

A vulnerable-binary DB is itself a high-value target. Hardening must be part of the architecture:

* Verify upstream repo signatures before ingestion.
* Run unpacking/extraction in sandboxes (namespaces/seccomp), because packages can be hostile.
* Sign:
  * snapshot manifests
  * fingerprint sets
  * DB releases
* Keep signing keys in an HSM/KMS.
* Maintain the provenance chain: input digests → output digests.

---

## 9) Minimal viable build plan (no wishful thinking)

### MVP 1: Known-build binary catalog (fast, broad)

* Repo snapshot + Build-ID extraction
* CVE ranges + advisory snapshots
* `binary_key → (pkg@ver, CVEs)` mapping
* Signed snapshot releases

### MVP 2: Patch-aware backport handling (precision jump)

* Source/patch mapping
* Build-specific "fixed-by-backport" logic

### MVP 3: Fingerprints for top components (moat)

* Fingerprint factory for ~20–50 "most reused" libs
* Validation corpus + confidence scoring

---

## 10) What you can claim (and what you should not)

You can defensibly claim:

* "We can identify vulnerable **binaries** (not just packages) with verifiable evidence."

You should not claim (until you have it):

* "We can detect every vulnerable binary on earth."

No one can do that without unacceptable false positives/negatives.

---

If you want this aligned tightly to your Stella Ops services split (scanner webservice lattice algorithms + proof-of-integrity graph), I can map the above into:

* concrete microservices,
* queue/job boundaries,
* and the exact API surfaces (`/snapshots`, `/features`, `/fingerprints`, `/match`, `/vex`).

To build **MVP 2 (patch-aware backport handling)** without any human triage, you need one thing in your database that most scanners do not maintain:

> A normalized, distro-specific index:
> **(distro, release, source-package, CVE) → fixed-in distro version (or not-affected / wontfix / unknown), with evidence and confidence.**

Then every scan decision becomes a deterministic version comparison against that index, not against upstream semver.

Below is a concrete, buildable approach (pipelines + data model + code skeletons) that stays fully automated.

---

## 1) What MVP2 computes

### Output table you must build

**`cve_fix_index`**

* `distro` (e.g., debian, ubuntu, rhel, alpine)
* `release` (e.g., bookworm, jammy, 9, 3.19)
* `source_pkg` (not the binary subpackage)
* `cve_id`
* `state` ∈ {`fixed`, `vulnerable`, `not_affected`, `wontfix`, `unknown`}
* `fixed_version` (nullable; distro version string, including revision)
* `method` ∈ {`security_feed`, `changelog`, `patch_header`, `upstream_patch_match`}
* `confidence` (float)
* `evidence` (JSON: references to the advisory entry, changelog lines, patch names + digests)
* `snapshot_id` (your sealed snapshot identifier)

### Why "source package"?

Security trackers and patch sets are tracked at the **source** level (e.g., `openssl`), while runtime installs are often **binary subpackages** (e.g., `libssl3`). You need a stable join: `binary_pkg -> source_pkg`.
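For Debian-style repos, that join can be derived directly from the `Source:` field of the `Packages` index. A minimal sketch, assuming an uncompressed index and simplified field parsing (real indexes are usually gz/xz compressed, with multi-line fields handled properly):

```python
import re

def binary_to_source_map(packages_text: str) -> dict[str, str]:
    """Map binary package name -> source package name from a Packages index."""
    mapping: dict[str, str] = {}
    for stanza in packages_text.split("\n\n"):
        fields = dict(
            line.split(": ", 1)
            for line in stanza.splitlines()
            if ": " in line and not line.startswith(" ")  # skip continuation lines
        )
        pkg = fields.get("Package")
        if not pkg:
            continue
        # "Source: openssl (3.0.11-1)" -> "openssl"; a missing Source field
        # means the source package has the same name as the binary package.
        src = re.sub(r"\s*\(.*\)$", "", fields.get("Source", pkg)).strip()
        mapping[pkg] = src
    return mapping
```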
---

## 2) No-human signals, in strict priority order

You can do this with **zero manual** work by using a tiered resolver:

### Tier 1 — Structured distro security feed (highest precision)

This is the authoritative "backport-aware" answer because it encodes:

* "fixed in 1.1.1n-0ubuntu2.4" (even if upstream says "fixed in 1.1.1o")
* "not affected" cases
* sometimes arch-specific applicability

Your ingestor just parses and normalizes it.

### Tier 2 — Source-package changelog CVE mentions

If a feed entry is missing or late, parse the source changelog:

* Debian/Ubuntu: `debian/changelog`
* RPM: `%changelog` in the `.spec`
* Alpine: `secfixes` in the `APKBUILD` (often present)

This is surprisingly effective because maintainers often include "CVE-XXXX-YYYY" in the entry that introduced the fix.

### Tier 3 — Patch metadata (DEP-3 headers / patch filenames)

Parse the patches shipped with the source package:

* Debian: `debian/patches/*` + `debian/patches/series`
* RPM: patch files listed in the spec / SRPM
* Alpine: `patches/*.patch` in the aport

Search patch headers and filenames for CVE IDs; store patch hashes.

### Tier 4 — Upstream patch equivalence (optional in MVP2, strong)

If you can map CVE → upstream fix commit (OSV often helps), you can match canonicalized patch hunks against distro patches.

MVP2 can ship without Tier 4; Tiers 1+2 already eliminate most backport false positives.

---

## 3) Architecture: the "Fix Index Builder" job

### Inputs

* Your sealed repo snapshot: Packages + Sources (or SRPM/aports)
* A distro security feed snapshot (OVAL/JSON/errata tracker) for the same release
* (Optional) OSV/NVD upstream ranges, for fallback only

### Processing graph

1. **Build the `binary_pkg → source_pkg` map** from repo metadata
2. **Ingest the security feed** → produce `FixRecord(method=security_feed, confidence=0.95)`
3. **For each source package in the snapshot**:
   * unpack the source
   * parse the changelog for CVE mentions → `FixRecord(method=changelog, confidence=0.75–0.85)`
   * parse patch headers → `FixRecord(method=patch_header, confidence=0.80–0.90)`
4. **Merge** records into a single best record per key (distro, release, source_pkg, cve)
5. Store into `cve_fix_index` with evidence
6. Sign the resulting snapshot manifest

---

## 4) Merge logic (no human, deterministic)

You need a deterministic rule for conflicts. Recommended (conservative but still precision-improving):

1. If any record says `not_affected` with confidence ≥ 0.9 → choose `not_affected`
2. Else if any record says `fixed` with confidence ≥ 0.9 → choose `fixed` and `fixed_version = max_fixed_version_among_high_conf`
3. Else if any record says `fixed` at all → choose `fixed` with the best available `fixed_version`
4. Else if any says `wontfix` → choose `wontfix`
5. Else `unknown`

Additionally:

* Keep *all* evidence records in `evidence` so you can explain and audit.
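These rules transcribe almost mechanically into code. A minimal sketch, where `records` are FixRecord dicts as produced by the section 6 code below and `vercmp` is a distro comparator as discussed in section 5:

```python
def merge_fix_records(records: list[dict], vercmp) -> dict:
    """Deterministically merge FixRecords for one (distro, release, src, cve) key."""
    def matching(state: str, min_conf: float = 0.0) -> list[dict]:
        return [r for r in records
                if r["state"] == state and r["confidence"] >= min_conf]

    if matching("not_affected", 0.9):
        chosen, state = matching("not_affected", 0.9), "not_affected"
    elif matching("fixed", 0.9):
        chosen, state = matching("fixed", 0.9), "fixed"
    elif matching("fixed"):
        chosen, state = matching("fixed"), "fixed"
    elif matching("wontfix"):
        chosen, state = matching("wontfix"), "wontfix"
    else:
        chosen, state = records, "unknown"

    # Rules 2/3: highest fixed_version among chosen records, per distro comparator.
    fixed_version = None
    for r in chosen:
        v = r.get("fixed_version")
        if v and (fixed_version is None or vercmp(v, fixed_version) > 0):
            fixed_version = v

    return {
        "state": state,
        "fixed_version": fixed_version if state == "fixed" else None,
        "confidence": max((r["confidence"] for r in chosen), default=0.0),
        "evidence": [r["evidence"] for r in records],  # keep ALL evidence for audit
    }
```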
---

## 5) Version comparison: do not reinvent it

Backport handling lives or dies on correct version ordering.

### Practical approach (recommended for ingestion + server-side decisioning)

Use the official tooling in containerized workers:

* Debian/Ubuntu: `dpkg --compare-versions`
* RPM distros: `rpmdev-vercmp` or the `rpm` library
* Alpine: `apk version -t`

This is reliable and avoids subtle comparator bugs. If you must do it in-process, use well-tested libraries per ecosystem (but the containerized official tools are the most robust).
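For Debian, a minimal wrapper around the official comparator looks like this (a sketch, assuming `dpkg` is present in the worker container); it returns -1/0/1 like a classic cmp function:

```python
import subprocess

def dpkg_vercmp(a: str, b: str) -> int:
    """Compare two Debian version strings via dpkg; -1 if a<b, 0 if equal, 1 if a>b."""
    for op, result in (("lt", -1), ("eq", 0)):
        # dpkg exits 0 when the relation holds, non-zero otherwise.
        if subprocess.run(["dpkg", "--compare-versions", a, op, b],
                          check=False).returncode == 0:
            return result
    return 1
```

A wrapper like this can be passed directly as the `vercmp` argument used by `merge_fix_records` above and by `choose_fixed_version` in section 7.2 below.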
---

## 6) Concrete code: Debian/Ubuntu changelog + patch parsing

This example shows **Tier 2 + Tier 3** inference for a single unpacked source tree. You would wrap this inside your snapshot processing loop.

### 6.1 CVE extractor

```python
import re
from pathlib import Path
from hashlib import sha256

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")

def extract_cves(text: str) -> set[str]:
    return set(CVE_RE.findall(text or ""))
```

### 6.2 Parse the *top* `debian/changelog` entry (for this version)

This works well because when you unpack a `.dsc` for version `V`, the top entry is for `V`.

```python
def parse_debian_changelog_top_entry(src_dir: Path) -> tuple[str, set[str], dict]:
    """
    Returns:
      version: str
      cves: set[str] found in the top entry
      evidence: dict with an excerpt for explainability
    """
    changelog_path = src_dir / "debian" / "changelog"
    if not changelog_path.exists():
        return "", set(), {}

    lines = changelog_path.read_text(errors="replace").splitlines()
    if not lines:
        return "", set(), {}

    # First line: "pkgname (version) distro; urgency=..."
    m = re.match(r"^[^\s]+\s+\(([^)]+)\)\s+", lines[0])
    version = m.group(1) if m else ""

    entry_lines = [lines[0]]
    # Collect until the maintainer trailer line (" -- Name <email>  date"),
    # or until the next entry header if the trailer is missing.
    for line in lines[1:]:
        if re.match(r"^\S+\s+\(", line):
            break
        entry_lines.append(line)
        if line.startswith(" -- "):
            break

    entry_text = "\n".join(entry_lines)
    cves = extract_cves(entry_text)

    evidence = {
        "file": "debian/changelog",
        "version": version,
        "excerpt": entry_text[:2000],  # store a small excerpt, not the whole file
    }
    return version, cves, evidence
```

### 6.3 Parse CVEs from patch headers (DEP-3-ish)

```python
def parse_debian_patches_for_cves(src_dir: Path) -> tuple[dict[str, list[dict]], dict]:
    """
    Returns:
      cve_to_patches: {CVE: [ {path, sha256, header_excerpt}, ... ]}
      evidence_summary: dict
    """
    patches_dir = src_dir / "debian" / "patches"
    if not patches_dir.exists():
        return {}, {}

    cve_to_patches: dict[str, list[dict]] = {}

    for patch in patches_dir.glob("*"):
        if not patch.is_file():
            continue
        # Keep only the first 80 lines: DEP-3 headers live at the top of a patch.
        header = "\n".join(patch.read_text(errors="replace").splitlines()[:80])
        cves = extract_cves(header + "\n" + patch.name)
        if not cves:
            continue
        digest = sha256(patch.read_bytes()).hexdigest()
        rec = {
            "path": str(patch.relative_to(src_dir)),
            "sha256": digest,
            "header_excerpt": header[:1200],
        }
        for cve in cves:
            cve_to_patches.setdefault(cve, []).append(rec)

    evidence = {
        "dir": "debian/patches",
        "matched_cves": len(cve_to_patches),
    }
    return cve_to_patches, evidence
```

### 6.4 Produce FixRecords from the source tree

```python
def infer_fix_records_from_debian_source(src_dir: Path, distro: str, release: str,
                                         source_pkg: str, snapshot_id: str):
    version, changelog_cves, changelog_ev = parse_debian_changelog_top_entry(src_dir)
    cve_to_patches, patch_ev = parse_debian_patches_for_cves(src_dir)

    records = []

    # Changelog-based: treat a CVE mentioned in the top entry as fixed in this version
    for cve in changelog_cves:
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "changelog",
            "confidence": 0.80,
            "evidence": {"changelog": changelog_ev},
            "snapshot_id": snapshot_id,
        })

    # Patch-header-based: treat CVE-tagged patches as fixed in this version
    for cve, patches in cve_to_patches.items():
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "patch_header",
            "confidence": 0.87,
            "evidence": {"patches": patches, "patch_summary": patch_ev},
            "snapshot_id": snapshot_id,
        })

    return records
```

That is the automated "patch-aware" signal generator.

---

## 7) Wiring this into your database build

### 7.1 Store raw evidence and the merged result

Two-stage storage is worth it:

1. `cve_fix_evidence` (append-only)
2. `cve_fix_index` (merged best record)

So you can:

* rerun merge rules
* improve confidence scoring
* keep auditability

### 7.2 Merging `fixed_version` for a CVE

When multiple versions mention the same CVE, you usually want the **latest** mentioning version (highest by the distro comparator), because repeated mentions often indicate an earlier partial fix.

Pseudo:

```python
def choose_fixed_version(existing: str | None, candidate: str, vercmp) -> str:
    if not existing:
        return candidate
    return candidate if vercmp(candidate, existing) > 0 else existing
```

Where `vercmp` calls `dpkg --compare-versions` (Debian) or the equivalent for that distro.

---

## 8) Decisioning logic at scan time (what changes with MVP2)

Without MVP2, you likely do:

* an upstream range check (false positives for backports)

With MVP2, you do:

1. identify `distro+release` from the environment (or the image base)
2. map `binary_pkg → source_pkg`
3. query `cve_fix_index(distro, release, source_pkg, cve)`
4. if `state=fixed` and `pkg_version >= fixed_version` → fixed
5. if `state=not_affected` → safe
6. else fall back to upstream ranges

That single substitution removes most backport noise.

---

## 9) Practical notes so you don't get trapped

### A) You must know the distro release

Backport reality is release-specific. The same package name/version can have different patching across releases.

### B) Arch-specific fixes exist

Your schema should allow `arch` on fix records (nullable). If the feed says "only amd64 affected," store it.

### C) False positives in changelog parsing

Mitigation without humans:

* require a CVE mention *plus* security-ish verbs ("fix", "security", "patch") OR presence in patch headers
* otherwise lower the confidence and let the feed override

### D) Keep evidence small

Store:

* excerpts + patch hashes

Not entire source tarballs.

---

## 10) Minimal "done definition" for MVP2

You have MVP2 when, for Debian/Ubuntu at least, you can demonstrate:

* A CVE that upstream marks as vulnerable for version X
* The distro backported the fix in a revision of X
* Your system classifies:
  * `X-older_revision` as vulnerable
  * `X-newer_revision` as fixed
* With evidence: a fix-feed record and/or changelog/patch proof

No human required.

---

If you want, I can provide the same "Tier 2/3 inference" module for RPM (SRPM/spec parsing) and Alpine (APKBUILD `secfixes` extraction), plus the exact Postgres DDL for `cve_fix_evidence` and `cve_fix_index`, and the merge SQL.