Below is a practical, production-grade architecture for building a vulnerable binaries database. I’m going to be explicit about what “such a database” can mean, because there are two materially different products:
- Known-build catalog: “These exact shipped binaries (Build-ID / hash) are affected or fixed for CVE X.”
- Binary fingerprint DB: “Even if the binary is unpackaged / self-built, we can match vulnerable code patterns.”
You want both. The first gets you breadth fast; the second is the moat.
1) Core principle: treat “binary identity” as the primary key
For Linux ELF:
- Primary: ELF Build-ID (from .note.gnu.build-id)
- Fallback: sha256(file_bytes)
- Add: sha256(.text) and/or BLAKE3 for speed
This creates a stable identity that survives “package metadata lies.”
BinaryKey = build_id || file_sha256
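A minimal identity-extraction sketch, assuming pyelftools (an assumption; no library is named above) for the ELF parsing; field names mirror the scheme just described:

import hashlib

from elftools.elf.elffile import ELFFile
from elftools.elf.sections import NoteSection

def compute_binary_identity(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()
    file_sha256 = hashlib.sha256(data).hexdigest()

    build_id = None
    text_sha256 = None
    with open(path, "rb") as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            if isinstance(section, NoteSection):
                for note in section.iter_notes():
                    if note["n_type"] == "NT_GNU_BUILD_ID":
                        build_id = note["n_desc"]  # pyelftools decodes this note to a hex string
            elif section.name == ".text":
                text_sha256 = hashlib.sha256(section.data()).hexdigest()

    return {
        "binary_key": build_id or file_sha256,  # Build-ID primary, file hash fallback
        "build_id": build_id,
        "file_sha256": file_sha256,
        "text_sha256": text_sha256,
    }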
2) High-level system diagram
┌──────────────────────────┐
│ Vulnerability Intel │
│ OSV/NVD + distro advis. │
└───────────┬──────────────┘
│ normalize
v
┌──────────────────────────┐
│ Vuln Knowledge Store │
│ CVE↔pkg ranges, patches │
└───────────┬──────────────┘
│
│
┌───────────────────────v─────────────────────────┐
│ Repo Snapshotter (per distro/arch/date) │
│ - mirrors metadata + packages (+ debuginfo) │
│ - verifies signatures │
│ - emits signed snapshot manifest │
└───────────┬───────────────────────────┬─────────┘
│ │
│ packages │ debuginfo/sources
v v
┌──────────────────────────┐ ┌──────────────────────────┐
│ Package Unpacker │ │ Source/Buildinfo Mapper │
│ - extract files │ │ - pkg→source commit/patch │
└───────────┬──────────────┘ └───────────┬──────────────┘
│ binaries │
v │
┌──────────────────────────┐ │
│ Binary Feature Extractor │ │
│ - Build-ID, hashes │ │
│ - dyn deps, symbols │ │
│ - function boundaries (opt)│ │
└───────────┬──────────────┘ │
│ │
v v
┌──────────────────────────────────────────────────┐
│ Vulnerable Binary Classifier │
│ Tier A: pkg/version range │
│ Tier B: Build-ID→known shipped build │
│ Tier C: code fingerprints (function/CFG hashes) │
└───────────┬───────────────────────────┬──────────┘
│ │
v v
┌──────────────────────────┐ ┌──────────────────────────┐
│ Vulnerable Binary DB │ │ Evidence/Attestation DB │
│ (indexed by BinaryKey) │ │ (signed proofs, snapshots)│
└───────────┬──────────────┘ └───────────┬──────────────┘
│ publish signed snapshot │
v v
Clients/Scanners Explainable VEX outputs
3) Data stores you actually need
A) Relational store (Postgres)
Use this for indexes and joins.
Key tables:
binary_identity
- binary_key (build_id or file_sha256), PK
- build_id (nullable)
- file_sha256, text_sha256
- arch, osabi, type (ET_DYN/EXEC), stripped
- first_seen_snapshot, last_seen_snapshot
binary_package_map
- binary_key
- distro, pkg_name, pkg_version_release, arch
- file_path_in_pkg, snapshot_id
snapshot_manifest
- snapshot_id
- distro, arch, timestamp
- repo_metadata_digests, signing_key_id, dsse_envelope_ref
cve_package_ranges
- cve_id, ecosystem (deb/rpm/apk), pkg_name
- vulnerable_ranges, fixed_ranges
- advisory_ref, snapshot_id
binary_vuln_assertion
- binary_key, cve_id
- status ∈ {affected, not_affected, fixed, unknown}
- method ∈ {range_match, buildid_catalog, fingerprint_match}
- confidence (0–1)
- evidence_ref (points to signed evidence)
B) Object store (S3/MinIO)
Do not bloat Postgres with large blobs.
Store:
- extracted symbol lists, string tables
- function hash maps
- disassembly snippets for matched functions (small)
- DSSE envelopes / attestations
- optional: debug info extracts (or references to where they can be fetched)
C) Optional search index (OpenSearch/Elastic)
If you want fast “find all binaries exporting SSL_read” style queries, index symbols/strings.
4) Building the database: pipelines
Pipeline 1: Distro repo snapshots → Known-build catalog (breadth)
This is your fastest route to a “binaries DB.”
Step 1 — Snapshot
- Mirror repo metadata + packages for (distro, release, arch).
- Verify signatures (APT Release.gpg, RPM signatures, APK signatures).
- Emit signed snapshot manifest (DSSE) listing digests of everything mirrored.
Step 2 — Extract binaries
For each package:
- unpack (deb/rpm/apk)
- select ELF files (EXEC + shared libs)
- compute Build-ID, file hash, .text hash
- store identity + binary_package_map row
Step 3 — Assign CVE status (Tier A + Tier B)
- Ingest distro advisories and/or OSV mappings into cve_package_ranges
- For each binary_package_map row, apply range checks
- Create binary_vuln_assertion entries: method=range_match (coarse)
- If you have a Build-ID mapping to exact shipped builds, you can tag: method=buildid_catalog (stronger than pure version)
This yields a database where a scanner can do:
- “Given Build-ID, tell me all CVEs per the distro snapshot.”
This already reduces noise because the primary key is the binary.
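A hedged sketch of the Step 3 range check; it assumes vulnerable_ranges is stored as (introduced, fixed) pairs and that vercmp is a distro-aware comparator (official tooling for that appears in the MVP2 half below). Row shapes and helper names are illustrative:

def assert_range_match(binary_row: dict, ranges_row: dict, vercmp) -> dict:
    """Emit one binary_vuln_assertion for a (binary, CVE) pair via Tier A."""
    v = binary_row["pkg_version_release"]
    affected = any(
        vercmp(v, introduced) >= 0 and (fixed is None or vercmp(v, fixed) < 0)
        for introduced, fixed in ranges_row["vulnerable_ranges"]
    )
    return {
        "binary_key": binary_row["binary_key"],
        "cve_id": ranges_row["cve_id"],
        "status": "affected" if affected else "fixed",
        "method": "range_match",
        "confidence": 0.6,  # coarse: version logic only, no backport awareness
        "evidence_ref": ranges_row["advisory_ref"],
    }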
Pipeline 2: Patch-aware classification (backports handled)
To handle “version says vulnerable but backport fixed” you must incorporate patch provenance.
Step 1 — Build provenance mapping
Per ecosystem:
- Debian/Ubuntu: parse Sources, changelogs, (ideally) .buildinfo, patch series.
- RPM distros: SRPM + changelog + patch list.
- Alpine: APKBUILD + patches.
Step 2 — CVE ↔ patch linkage
From advisories and patch metadata, store:
- “CVE fixed by patch set P in build B of pkg V-R”
Step 3 — Apply to binaries
Instead of version-only, decide:
- if the specific build includes the patch
- mark as fixed even if the upstream version looks vulnerable
This is still not “binary-only,” but it’s much closer to truth for distros.
Pipeline 3: Binary fingerprint factory (the moat)
This is where you become independent of packaging claims.
You build fingerprints at the function/CFG level for high-impact CVEs.
3.1 Select targets
You cannot fingerprint everything. Start with:
- top shared libs (openssl, glibc, zlib, expat, libxml2, curl, sqlite, ncurses, etc.)
- CVEs that are exploited in the wild / high-severity
- CVEs where distros backport heavily (version logic is unreliable)
3.2 Identify “changed functions” from the fix
Input: upstream commit/patch or distro patch.
Process:
- diff the patch
- extract affected files + functions (tree-sitter/ctags + diff hunks)
- list candidate functions and key basic blocks
3.3 Build vulnerable + fixed reference binaries
For each (arch, toolchain profile):
- compile “known vulnerable” and “known fixed”
- ensure reproducibility: record compiler version, flags, link mode
- store provenance (DSSE) for these reference builds
3.4 Extract robust fingerprints
Avoid raw byte signatures (they break across compilers).
Better fingerprint types, from weakest to strongest:
- symbol-level: function name + versioned symbol + library SONAME
- function normalized hash:
  - disassemble the function
  - normalize:
    - strip addresses/relocs
    - bucket registers
    - normalize immediates (where safe)
  - hash the instruction sequence or basic-block sequence
- basic-block multiset hash:
  - build a set/multiset of block hashes; order-independent
- lightweight CFG hash:
  - nodes: block hashes
  - edges: control flow
  - hash a canonical representation
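To make the “function normalized hash” concrete, here is a minimal sketch assuming Capstone for disassembly; the register/immediate normalization is deliberately crude and only illustrative:

import hashlib
import re

import capstone

md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)

def func_norm_hash(code: bytes, addr: int = 0x1000) -> str:
    tokens = []
    for insn in md.disasm(code, addr):
        # Bucket registers and immediates so the hash survives register
        # allocation and address-layout differences across builds.
        ops = re.sub(r"\b(r[a-z0-9]+|e[a-z]{2}|[a-d][lh])\b", "REG", insn.op_str)
        ops = re.sub(r"0x[0-9a-fA-F]+|\b\d+\b", "IMM", ops)
        tokens.append(f"{insn.mnemonic} {ops}")
    return hashlib.sha256("\n".join(tokens).encode()).hexdigest()

A basic-block multiset hash is the same idea applied per block, hashing the sorted multiset of block hashes instead of the instruction sequence.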
Store fingerprints like:
vuln_fingerprint
cve_idcomponent(openssl/libssl)archfp_type(func_norm_hash, bb_multiset, cfg_hash)fp_valuefunction_hint(name if present; else pattern)confidence,notesevidence_ref(points to reference builds + patch)
3.5 Validate fingerprints at scale
This is non-negotiable.
Validation loop:
- Test against:
  - known vulnerable builds (must match)
  - known fixed builds (must not match)
  - a large “benign corpus” (estimate false positives)
- Maintain:
  - precision/recall metrics per fingerprint
  - a confidence score
Only promote fingerprints to “production” when validation passes thresholds.
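A small sketch of the promotion metrics, with fp_match standing in for “does this fingerprint hit this build” (illustrative names):

def fingerprint_metrics(fp_match, vuln_builds, fixed_builds, benign_corpus):
    tp = sum(1 for b in vuln_builds if fp_match(b))  # must match
    fn = len(vuln_builds) - tp
    fp = (sum(1 for b in fixed_builds if fp_match(b))
          + sum(1 for b in benign_corpus if fp_match(b)))  # must not match
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall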
5) Query-time logic (how scanners use the DB)
Given a target binary, the scanner computes:
- binary_key
- basic features (arch, SONAME, symbols)
- optional function hashes (for targeted libs)
Then it queries in this precedence order:
- Exact match: binary_key exists with an explicit assertion (strong)
- Build catalog: Build-ID → known distro build → CVE mapping (strong)
- Fingerprint match: function/CFG hashes hit (strong, binary-only)
- Fallback: package range matching (weakest)
Return result as a signed VEX with evidence references.
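Sketched as code, with the db accessors standing in for the queries above (an illustrative API surface, not a fixed one):

def classify(db, target: dict):
    """Apply the precedence order above; the first strong hit wins."""
    if a := db.assertion_for(target["binary_key"]):
        return a  # explicit, pre-computed assertion
    if target.get("build_id") and (b := db.buildid_catalog(target["build_id"])):
        return b  # known distro build
    if hits := db.fingerprint_hits(target.get("func_hashes", [])):
        return hits  # binary-only evidence
    return db.range_match(target["pkg_name"], target["pkg_version"])  # weakest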
6) Update model: “sealed knowledge snapshots”
To make this auditable and customer-friendly:
- Every repo snapshot is immutable and signed.
- Every fingerprint bundle is versioned and signed.
- Every “vulnerable binaries DB release” is a signed manifest pointing to:
  - which repo snapshots were used
  - which advisory snapshots were used
  - which fingerprint sets were included
This lets you prove:
- what you knew
- when you knew it
- exactly which data drove the verdict
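For illustration, one possible shape for that release manifest (field names are assumptions, not a fixed schema); a DSSE envelope would wrap this JSON as its payload:

db_release_manifest = {
    "db_release": "2025-06-01.1",
    "repo_snapshots": ["sha256:..."],      # sealed repo snapshot manifests
    "advisory_snapshots": ["sha256:..."],  # advisory feed snapshots
    "fingerprint_sets": ["sha256:..."],    # versioned fingerprint bundles
}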
7) Scaling and cost control
Without control, fingerprinting explodes. Use these constraints:
- Only disassemble/hash functions for:
  - libraries in your “hot set”
  - binaries whose package indicates relevance to a targeted CVE family
- Deduplicate aggressively:
  - identical .text_sha256 ⇒ reuse extracted functions
  - identical Build-ID across paths ⇒ reuse features
- Incremental snapshots:
  - process only new/changed packages per snapshot
  - store an “already processed digest” cache (Valkey)
8) Security and trust boundaries
A vulnerable binary DB is itself a high-value target. Hardening must be part of architecture:
- Verify upstream repo signatures before ingestion.
- Run unpacking/extraction in sandboxes (namespaces/seccomp) because packages can be hostile.
- Sign:
  - snapshot manifests
  - fingerprint sets
  - DB releases
- Keep signing keys in an HSM/KMS.
- Maintain the provenance chain: input digests → output digests.
9) Minimal viable build plan (no wishful thinking)
MVP 1: Known-build binary catalog (fast, broad)
- Repo snapshot + Build-ID extraction
- CVE ranges + advisory snapshots
- binary_key → (pkg@ver, CVEs) mapping
- Signed snapshot releases
MVP 2: Patch-aware backport handling (precision jump)
- Source/patch mapping
- Build-specific “fixed-by-backport” logic
MVP 3: Fingerprints for top components (moat)
- Fingerprint factory for ~20–50 “most reused” libs
- Validation corpus + confidence scoring
10) What you can claim (and what you should not)
You can defensibly claim:
- “We can identify vulnerable binaries (not just packages) with verifiable evidence.”
You should not claim (until you have it):
- “We can detect every vulnerable binary on earth.” No one can do that without unacceptable false positives/negatives.
If you want this aligned tightly to your Stella Ops services split (scanner webservice lattice algorithms + proof-of-integrity graph), I can map the above into:
- concrete microservices,
- queue/job boundaries,
- and the exact API surfaces (/snapshots, /features, /fingerprints, /match, /vex).
To code MVP 2 (patch-aware backport handling) without any human triage, you need one thing in your database that most scanners do not maintain:
A normalized, distro-specific index: (distro, release, source-package, CVE) → fixed-in distro version (or not-affected / wontfix / unknown), with evidence and confidence.
Then every scan decision becomes a deterministic version comparison against that index, not against upstream semver.
Below is a concrete, buildable approach (pipelines + data model + code skeletons) that stays fully automated.
1) What MVP2 computes
Output table you must build
cve_fix_index
- distro (e.g., debian, ubuntu, rhel, alpine)
- release (e.g., bookworm, jammy, 9, 3.19)
- source_pkg (not binary subpackage)
- cve_id
- state ∈ {fixed, vulnerable, not_affected, wontfix, unknown}
- fixed_version (nullable; distro version string, including revision)
- method ∈ {security_feed, changelog, patch_header, upstream_patch_match}
- confidence (float)
- evidence (JSON: references to advisory entry, changelog lines, patch names + digests)
- snapshot_id (your sealed snapshot identifier)
Why “source package”?
Security trackers and patch sets are tracked at the source level (e.g., openssl), while runtime installs are often binary subpackages (e.g., libssl3). You need a stable join:
binary_pkg -> source_pkg.
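For Debian-family metadata that join is one field away; a small sketch (the Source: field in a Packages stanza is optional and may carry a “(version)” annotation):

def source_pkg_of(pkg_stanza: dict) -> str:
    # Fall back to the binary package name when Source: is absent.
    src = pkg_stanza.get("Source", pkg_stanza["Package"])
    return src.split(" ")[0]  # strip the optional "(version)" suffix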
2) No-human signals, in strict priority order
You can do this with zero manual work by using a tiered resolver:
Tier 1 — Structured distro security feed (highest precision)
This is the authoritative “backport-aware” answer because it encodes:
- “fixed in 1.1.1n-0ubuntu2.4” (even if upstream says “fixed in 1.1.1o”)
- “not affected” cases
- sometimes arch-specific applicability
Your ingestor just parses and normalizes it.
Tier 2 — Source package changelog CVE mentions
If a feed entry is missing/late, parse source changelog:
- Debian/Ubuntu: debian/changelog
- RPM: %changelog in .spec
- Alpine: secfixes in APKBUILD (often present)
This is surprisingly effective because maintainers often include “CVE-XXXX-YYYY” in the entry that introduced the fix.
Tier 3 — Patch metadata (DEP-3 headers / patch filenames)
Parse patches shipped with the source package:
- Debian: debian/patches/* + debian/patches/series
- RPM: patch files listed in the spec / SRPM
- Alpine: patches/*.patch in the aport
Search patch headers and filenames for CVE IDs, store patch hashes.
Tier 4 — Upstream patch equivalence (optional in MVP2, strong)
If you can map CVE→upstream fix commit (OSV often helps), you can match canonicalized patch hunks against distro patches.
MVP2 can ship without Tier 4; Tier 1+2 already eliminates most backport false positives.
3) Architecture: the “Fix Index Builder” job
Inputs
- Your sealed repo snapshot: Packages + Sources (or SRPM/aports)
- Distro security feed snapshot (OVAL/JSON/errata tracker) for same release
- (Optional) OSV/NVD upstream ranges for fallback only
Processing graph
- Build the binary_pkg → source_pkg map from repo metadata
- Ingest the security feed → produce FixRecord (method=security_feed, confidence=0.95)
- For source packages in the snapshot:
  - unpack the source
  - parse the changelog for CVE mentions → FixRecord (method=changelog, confidence=0.75–0.85)
  - parse patch headers → FixRecord (method=patch_header, confidence=0.80–0.90)
- Merge records into a single best record per key (distro, release, source_pkg, cve)
- Store into cve_fix_index with evidence
- Sign the resulting snapshot manifest
4) Merge logic (no human, deterministic)
You need a deterministic rule for conflicts.
Recommended (conservative but still precision-improving):
- If any record says not_affected with confidence ≥ 0.9 → choose not_affected
- Else if any record says fixed with confidence ≥ 0.9 → choose fixed and fixed_version = max_fixed_version_among_high_conf
- Else if any record says fixed at all → choose fixed with the best available fixed_version
- Else if any says wontfix → choose wontfix
- Else unknown
Additionally:
- Keep all evidence records in evidence so you can explain and audit.
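A hedged sketch of that merge; records are FixRecord dicts for one (distro, release, source_pkg, cve), and vercmp is the distro comparator covered in the next section:

import functools

def merge_fix_records(records: list[dict], vercmp) -> dict:
    def matching(state: str, min_conf: float = 0.0) -> list[dict]:
        return [r for r in records
                if r["state"] == state and r["confidence"] >= min_conf]

    evidence = [r["evidence"] for r in records]  # keep everything for audit
    if na := matching("not_affected", 0.9):
        chosen = dict(max(na, key=lambda r: r["confidence"]))
    elif hi := matching("fixed", 0.9):
        chosen = dict(max(hi, key=lambda r: r["confidence"]))
        # Highest fixed_version among high-confidence records, per distro comparator
        chosen["fixed_version"] = max(
            (r["fixed_version"] for r in hi if r["fixed_version"]),
            key=functools.cmp_to_key(vercmp), default=None,
        )
    elif fx := matching("fixed"):
        chosen = dict(max(fx, key=lambda r: r["confidence"]))
    elif wf := matching("wontfix"):
        chosen = dict(max(wf, key=lambda r: r["confidence"]))
    else:
        chosen = {"state": "unknown", "confidence": 0.0}
    chosen["evidence"] = evidence
    return chosen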
5) Version comparison: do not reinvent it
Backport handling lives or dies on correct version ordering.
Practical approach (recommended for ingestion + server-side decisioning)
Use official tooling in containerized workers:
- Debian/Ubuntu: dpkg --compare-versions
- RPM distros: rpmdev-vercmp or the rpm library
- Alpine: apk version -t
This is reliable and avoids subtle comparator bugs.
If you must do it in-process, use well-tested libraries per ecosystem (but containerized official tools are the most robust).
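For example, a thin Debian/Ubuntu wrapper (dpkg --compare-versions exits 0 when the stated relation holds):

import subprocess

def dpkg_vercmp(a: str, b: str) -> int:
    """Order two Debian version strings; returns -1, 0, or 1."""
    for op, result in (("lt", -1), ("eq", 0)):
        rc = subprocess.run(
            ["dpkg", "--compare-versions", a, op, b],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode
        if rc == 0:
            return result
    return 1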
6) Concrete code: Debian/Ubuntu changelog + patch parsing
This example shows Tier 2 + Tier 3 inference for a single unpacked source tree. You would wrap this inside your snapshot processing loop.
6.1 CVE extractor
import re
from pathlib import Path
from hashlib import sha256

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")

def extract_cves(text: str) -> set[str]:
    return set(CVE_RE.findall(text or ""))
6.2 Parse the top debian/changelog entry (for this version)
This works well because when you unpack a .dsc for version V, the top entry is for V.
def parse_debian_changelog_top_entry(src_dir: Path) -> tuple[str, set[str], dict]:
    """
    Returns:
      version: str
      cves: set[str] found in the top entry
      evidence: dict with excerpt for explainability
    """
    changelog_path = src_dir / "debian" / "changelog"
    if not changelog_path.exists():
        return "", set(), {}

    lines = changelog_path.read_text(errors="replace").splitlines()
    if not lines:
        return "", set(), {}

    # First line: "pkgname (version) distro; urgency=..."
    m = re.match(r"^[^\s]+\s+\(([^)]+)\)\s+", lines[0])
    version = m.group(1) if m else ""

    entry_lines = [lines[0]]
    # Collect until the maintainer trailer line: " -- Name <email> date"
    for line in lines[1:]:
        entry_lines.append(line)
        if line.startswith(" -- "):
            break

    entry_text = "\n".join(entry_lines)
    cves = extract_cves(entry_text)
    evidence = {
        "file": "debian/changelog",
        "version": version,
        "excerpt": entry_text[:2000],  # store a small excerpt, not the whole file
    }
    return version, cves, evidence
6.3 Parse CVEs from patch headers (DEP-3-ish)
def parse_debian_patches_for_cves(src_dir: Path) -> tuple[dict[str, list[dict]], dict]:
    """
    Returns:
      cve_to_patches: {CVE: [ {path, sha256, header_excerpt}, ... ]}
      evidence_summary: dict
    """
    patches_dir = src_dir / "debian" / "patches"
    if not patches_dir.exists():
        return {}, {}

    cve_to_patches: dict[str, list[dict]] = {}
    for patch in patches_dir.glob("*"):
        # Skip the series index itself: it lists patch filenames, which may
        # contain CVE IDs and would create bogus matches.
        if not patch.is_file() or patch.name == "series":
            continue
        # Read only the first N lines to keep it cheap
        header = "\n".join(patch.read_text(errors="replace").splitlines()[:80])
        cves = extract_cves(header + "\n" + patch.name)
        if not cves:
            continue
        digest = sha256(patch.read_bytes()).hexdigest()
        rec = {
            "path": str(patch.relative_to(src_dir)),
            "sha256": digest,
            "header_excerpt": header[:1200],
        }
        for cve in cves:
            cve_to_patches.setdefault(cve, []).append(rec)

    evidence = {
        "dir": "debian/patches",
        "matched_cves": len(cve_to_patches),
    }
    return cve_to_patches, evidence
6.4 Produce FixRecords from the source tree
def infer_fix_records_from_debian_source(src_dir: Path, distro: str, release: str,
                                         source_pkg: str, snapshot_id: str):
    version, changelog_cves, changelog_ev = parse_debian_changelog_top_entry(src_dir)
    cve_to_patches, patch_ev = parse_debian_patches_for_cves(src_dir)

    records = []

    # Changelog-based: treat a CVE mentioned in the top entry as fixed in this version
    for cve in changelog_cves:
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "changelog",
            "confidence": 0.80,
            "evidence": {"changelog": changelog_ev},
            "snapshot_id": snapshot_id,
        })

    # Patch-header-based: treat CVE-tagged patches as fixed in this version
    for cve, patches in cve_to_patches.items():
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "patch_header",
            "confidence": 0.87,
            "evidence": {"patches": patches, "patch_summary": patch_ev},
            "snapshot_id": snapshot_id,
        })

    return records
That is the automated “patch-aware” signal generator.
7) Wiring this into your database build
7.1 Store raw evidence and merged result
Two-stage storage is worth it:
- cve_fix_evidence (append-only)
- cve_fix_index (merged best record)
So you can:
- rerun merge rules
- improve confidence scoring
- keep auditability
7.2 Merging “fixed_version” for a CVE
When multiple versions mention the same CVE, you usually want the latest mentioning version (highest by the distro comparator), because repeated mentions often indicate an earlier partial fix.
Pseudo:
def choose_fixed_version(existing: str | None, candidate: str, vercmp) -> str:
    if not existing:
        return candidate
    return candidate if vercmp(candidate, existing) > 0 else existing
Where vercmp calls dpkg --compare-versions (Debian) or equivalent for that distro.
8) Decisioning logic at scan time (what changes with MVP2)
Without MVP2, you likely do:
- upstream range check (false positives for backports)
With MVP2, you do:
- identify distro + release from the environment (or image base)
- map binary_pkg → source_pkg
- query cve_fix_index (distro, release, source_pkg, cve)
- if state=fixed and pkg_version >= fixed_version → fixed
- if state=not_affected → safe
- else fall back to upstream ranges
That single substitution removes most backport noise.
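As a sketch, with lookup wrapping the cve_fix_index query, vercmp the distro comparator, and upstream_check the pre-MVP2 range fallback (all names illustrative):

def decide(lookup, vercmp, upstream_check,
           distro: str, release: str, source_pkg: str,
           cve: str, installed_version: str) -> str:
    row = lookup(distro, release, source_pkg, cve)
    if row is None or row["state"] == "unknown":
        return upstream_check(cve, installed_version)  # weakest signal
    if row["state"] == "not_affected":
        return "safe"
    if row["state"] == "fixed" and row["fixed_version"]:
        if vercmp(installed_version, row["fixed_version"]) >= 0:
            return "fixed"
        return "vulnerable"
    return upstream_check(cve, installed_version)  # e.g., wontfix states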
9) Practical notes so you don’t get trapped
A) You must know the distro release
Backport reality is release-specific. The same package name/version can have different patching across releases.
B) Arch-specific fixes exist
Your schema should allow arch on fix records (nullable). If the feed says “only amd64 affected,” store it.
C) False positives in changelog parsing
Mitigation without humans:
- require CVE mention + security-ish verbs (“fix”, “security”, “patch”) OR presence in patch headers
- otherwise lower confidence and let feed override
D) Keep evidence small
Store:
- excerpts + patch hashes, not entire source tarballs.
10) Minimal “done definition” for MVP2
You have MVP2 when, for Debian/Ubuntu at least, you can demonstrate:
- A CVE that upstream marks vulnerable for version X
- The distro backported the fix in a revision of X (X-<rev>)
- Your system classifies:
  - X-older_revision as vulnerable
  - X-newer_revision as fixed
- With evidence: a fix feed record and/or changelog/patch proof
No human required.
If you want, I can provide the same “Tier 2/3 inference” module for RPM (SRPM/spec parsing) and Alpine (APKBUILD secfixes extraction), plus the exact Postgres DDL for cve_fix_evidence and cve_fix_index, and the merge SQL.