Mapping Evidence Within Compiled Binaries

Below is a practical, production-grade architecture for building a vulnerable binaries database. I'm going to be explicit about what “such a database” can mean, because there are two materially different products:

  1. Known-build catalog: “These exact shipped binaries (Build-ID / hash) are affected or fixed for CVE X.”
  2. Binary fingerprint DB: “Even if the binary is unpackaged / self-built, we can match vulnerable code patterns.”

You want both. The first gets you breadth fast; the second is the moat.


1) Core principle: treat “binary identity” as the primary key

For Linux ELF:

  • Primary: ELF Build-ID (from .note.gnu.build-id)
  • Fallback: sha256(file_bytes)
  • Add: sha256(.text) and/or BLAKE3 for speed

This creates a stable identity that survives “package metadata lies.”

BinaryKey = build_id || file_sha256
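
A minimal identity-extraction sketch, assuming pyelftools is available; the helper name and the Build-ID note decoding (a hex string in recent pyelftools versions) are illustrative, not a fixed API:

import hashlib
from pathlib import Path
from elftools.elf.elffile import ELFFile  # pip install pyelftools

def compute_binary_key(path: Path) -> dict:
    """Compute the identity fields above for one ELF file."""
    data = path.read_bytes()
    file_sha256 = hashlib.sha256(data).hexdigest()

    build_id = None
    text_sha256 = None
    with path.open("rb") as f:
        elf = ELFFile(f)
        note = elf.get_section_by_name(".note.gnu.build-id")
        if note is not None:
            for n in note.iter_notes():
                if n["n_type"] == "NT_GNU_BUILD_ID":
                    build_id = n["n_desc"]  # build-id as hex in recent pyelftools
        text = elf.get_section_by_name(".text")
        if text is not None:
            text_sha256 = hashlib.sha256(text.data()).hexdigest()

    return {
        "binary_key": build_id or file_sha256,  # BinaryKey = build_id || file_sha256
        "build_id": build_id,
        "file_sha256": file_sha256,
        "text_sha256": text_sha256,
    }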


2) High-level system diagram

            ┌──────────────────────────┐
            │ Vulnerability Intel      │
            │ OSV/NVD + distro advis.  │
            └───────────┬──────────────┘
                        │ normalize
                        v
            ┌──────────────────────────┐
            │ Vuln Knowledge Store     │
            │ CVE↔pkg ranges, patches  │
            └───────────┬──────────────┘
                        │
                        │
┌───────────────────────v─────────────────────────┐
│ Repo Snapshotter (per distro/arch/date)          │
│ - mirrors metadata + packages (+ debuginfo)      │
│ - verifies signatures                            │
│ - emits signed snapshot manifest                 │
└───────────┬───────────────────────────┬─────────┘
            │                           │
            │ packages                  │ debuginfo/sources
            v                           v
┌──────────────────────────┐   ┌──────────────────────────┐
│ Package Unpacker          │   │ Source/Buildinfo Mapper   │
│ - extract files           │   │ - pkg→source commit/patch │
└───────────┬──────────────┘   └───────────┬──────────────┘
            │ binaries                      │
            v                               │
┌──────────────────────────┐               │
│ Binary Feature Extractor  │               │
│ - Build-ID, hashes         │               │
│ - dyn deps, symbols        │               │
│ - function boundaries (opt)│               │
└───────────┬──────────────┘               │
            │                               │
            v                               v
┌──────────────────────────────────────────────────┐
│ Vulnerable Binary Classifier                      │
│ Tier A: pkg/version range                         │
│ Tier B: Build-ID→known shipped build              │
│ Tier C: code fingerprints (function/CFG hashes)   │
└───────────┬───────────────────────────┬──────────┘
            │                           │
            v                           v
┌──────────────────────────┐   ┌──────────────────────────┐
│ Vulnerable Binary DB      │   │ Evidence/Attestation DB   │
│ (indexed by BinaryKey)    │   │ (signed proofs, snapshots)│
└───────────┬──────────────┘   └───────────┬──────────────┘
            │ publish signed snapshot       │
            v                               v
        Clients/Scanners             Explainable VEX outputs

3) Data stores you actually need

A) Relational store (Postgres)

Use this for indexes and joins.

Key tables:

binary_identity

  • binary_key (build_id or file_sha256) PK
  • build_id (nullable)
  • file_sha256, text_sha256
  • arch, osabi, type (ET_DYN/EXEC), stripped
  • first_seen_snapshot, last_seen_snapshot

binary_package_map

  • binary_key
  • distro, pkg_name, pkg_version_release, arch
  • file_path_in_pkg, snapshot_id

snapshot_manifest

  • snapshot_id
  • distro, arch, timestamp
  • repo_metadata_digests, signing_key_id, dsse_envelope_ref

cve_package_ranges

  • cve_id, ecosystem (deb/rpm/apk), pkg_name
  • vulnerable_ranges, fixed_ranges
  • advisory_ref, snapshot_id

binary_vuln_assertion

  • binary_key, cve_id
  • status ∈ {affected, not_affected, fixed, unknown}
  • method ∈ {range_match, buildid_catalog, fingerprint_match}
  • confidence (0–1)
  • evidence_ref (points to signed evidence)

B) Object store (S3/MinIO)

Do not bloat Postgres with large blobs.

Store:

  • extracted symbol lists, string tables
  • function hash maps
  • disassembly snippets for matched functions (small)
  • DSSE envelopes / attestations
  • optional: debug info extracts (or references to where they can be fetched)

C) Optional search index (OpenSearch/Elastic)

If you want fast “find all binaries exporting SSL_read” style queries, index symbols/strings.


4) Building the database: pipelines

Pipeline 1: Distro repo snapshots → Known-build catalog (breadth)

This is your fastest route to a “binaries DB.”

Step 1 — Snapshot

  • Mirror repo metadata + packages for (distro, release, arch).
  • Verify signatures (APT Release.gpg, RPM signatures, APK signatures).
  • Emit signed snapshot manifest (DSSE) listing digests of everything mirrored.

Step 2 — Extract binaries. For each package:

  • unpack (deb/rpm/apk)
  • select ELF files (EXEC + shared libs)
  • compute Build-ID, file hash, .text hash
  • store identity + binary_package_map
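
A hedged sketch of this step for Debian packages; dpkg-deb is assumed to be on PATH, and compute_binary_key is the identity helper sketched in section 1:

import subprocess
import tempfile
from pathlib import Path

ELF_MAGIC = b"\x7fELF"

def extract_elf_identities(deb_path: Path) -> list[dict]:
    """Unpack one .deb and return identity records for every ELF file inside."""
    records = []
    with tempfile.TemporaryDirectory() as tmp:
        # dpkg-deb -x extracts the data archive without running maintainer scripts
        subprocess.run(["dpkg-deb", "-x", str(deb_path), tmp], check=True)
        for f in Path(tmp).rglob("*"):
            if f.is_symlink() or not f.is_file():
                continue
            with f.open("rb") as fh:
                if fh.read(4) != ELF_MAGIC:
                    continue
            rec = compute_binary_key(f)  # identity helper from section 1
            rec["file_path_in_pkg"] = "/" + str(f.relative_to(tmp))
            records.append(rec)
    return records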

Step 3 — Assign CVE status (Tier A + Tier B)

  • Ingest distro advisories and/or OSV mappings into cve_package_ranges

  • For each binary_package_map, apply range checks

  • Create binary_vuln_assertion entries:

    • method=range_match (coarse)
  • If you have a Build-ID mapping to exact shipped builds, you can tag:

    • method=buildid_catalog (stronger than pure version)

This yields a database where a scanner can do:

  • “Given Build-ID, tell me all CVEs per the distro snapshot.”

This already reduces noise because the primary key is the binary.


Pipeline 2: Patch-aware classification (backports handled)

To handle “version says vulnerable but backport fixed” you must incorporate patch provenance.

Step 1 — Build provenance mapping. Per ecosystem:

  • Debian/Ubuntu: parse Sources, changelogs, (ideally) .buildinfo, patch series.
  • RPM distros: SRPM + changelog + patch list.
  • Alpine: APKBUILD + patches.

Step 2 — CVE ↔ patch linkage. From advisories and patch metadata, store:

  • “CVE fixed by patch set P in build B of pkg V-R”

Step 3 — Apply to binaries. Instead of relying on version alone, decide:

  • if the specific build includes the patch
  • mark as fixed even if upstream version looks vulnerable

This is still not “binary-only,” but it's much closer to the truth for distros.


Pipeline 3: Binary fingerprint factory (the moat)

This is where you become independent of packaging claims.

You build fingerprints at the function/CFG level for high-impact CVEs.

3.1 Select targets

You cannot fingerprint everything. Start with:

  • top shared libs (openssl, glibc, zlib, expat, libxml2, curl, sqlite, ncurses, etc.)
  • CVEs that are exploited in the wild / high-severity
  • CVEs where distros backport heavily (version logic is unreliable)

3.2 Identify “changed functions” from the fix

Input: upstream commit/patch or distro patch.

Process:

  • diff the patch
  • extract affected files + functions (tree-sitter/ctags + diff hunks)
  • list candidate functions and key basic blocks
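
A minimal stdlib sketch of the “extract affected files” part; mapping the line ranges to function names (via ctags/tree-sitter) is intentionally left out:

import re
from collections import defaultdict

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_regions(patch_text: str) -> dict[str, list[tuple[int, int]]]:
    """Map each patched file to the (start, length) line ranges its hunks touch."""
    regions: dict[str, list[tuple[int, int]]] = defaultdict(list)
    current_file = None
    for line in patch_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/crypto/foo.c" -> "crypto/foo.c"
            current_file = line[4:].strip().removeprefix("b/")
        elif current_file and (m := HUNK_RE.match(line)):
            start = int(m.group(1))
            length = int(m.group(2) or "1")
            regions[current_file].append((start, length))
    return dict(regions)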

3.3 Build vulnerable + fixed reference binaries

For each (arch, toolchain profile):

  • compile “known vulnerable” and “known fixed”
  • ensure reproducibility: record compiler version, flags, link mode
  • store provenance (DSSE) for these reference builds

3.4 Extract robust fingerprints

Avoid raw byte signatures (they break across compilers).

Better fingerprint types, from weakest to strongest:

  • symbol-level: function name + versioned symbol + library SONAME

  • function normalized hash:

    • disassemble function

    • normalize:

      • strip addresses/relocs
      • bucket registers
      • normalize immediates (where safe)
    • hash instruction sequence or basic-block sequence

  • basic-block multiset hash:

    • build a set/multiset of block hashes; order-independent
  • lightweight CFG hash:

    • nodes: block hashes
    • edges: control flow
    • hash canonical representation
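
A sketch of the “function normalized hash” idea using Capstone (assumed installed); the normalization here is deliberately crude (mnemonics plus coarse operand classes), and real fingerprints would tune it per architecture:

import hashlib
import re
from capstone import Cs, CS_ARCH_X86, CS_MODE_64  # pip install capstone

IMM_RE = re.compile(r"\b0x[0-9a-fA-F]+\b|\b\d+\b")
REG_RE = re.compile(r"\b[re]?(ax|bx|cx|dx|si|di|bp|sp|8|9|1[0-5])[bwd]?\b")

def normalized_function_hash(code: bytes, addr: int = 0) -> str:
    """Hash a function body after stripping addresses, immediates and register names."""
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    tokens = []
    for insn in md.disasm(code, addr):
        ops = IMM_RE.sub("IMM", insn.op_str)  # normalize immediates/addresses
        ops = REG_RE.sub("REG", ops)          # bucket general-purpose registers
        tokens.append(f"{insn.mnemonic} {ops}".strip())
    return hashlib.sha256("\n".join(tokens).encode()).hexdigest()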

Store fingerprints like:

vuln_fingerprint

  • cve_id
  • component (openssl/libssl)
  • arch
  • fp_type (func_norm_hash, bb_multiset, cfg_hash)
  • fp_value
  • function_hint (name if present; else pattern)
  • confidence, notes
  • evidence_ref (points to reference builds + patch)

3.5 Validate fingerprints at scale

This is non-negotiable.

Validation loop:

  • Test against:

    • known vulnerable builds (must match)
    • known fixed builds (must not match)
    • large “benign corpus” (estimate false positives)
  • Maintain:

    • precision/recall metrics per fingerprint
    • confidence score

Only promote fingerprints to “production” when validation passes thresholds.
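
A minimal validation-metrics sketch; match_fingerprint is a hypothetical predicate that runs one fingerprint against one build's extracted features, and the promotion thresholds are examples only:

def validate_fingerprint(fingerprint, vulnerable_builds, fixed_builds, benign_corpus, match_fingerprint):
    """Compute precision/recall for one fingerprint against labeled corpora."""
    tp = sum(1 for b in vulnerable_builds if match_fingerprint(fingerprint, b))
    fn = len(vulnerable_builds) - tp
    fp_fixed = sum(1 for b in fixed_builds if match_fingerprint(fingerprint, b))
    fp_benign = sum(1 for b in benign_corpus if match_fingerprint(fingerprint, b))

    flagged = tp + fp_fixed + fp_benign
    precision = tp / flagged if flagged else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        # example thresholds: never match a fixed build, keep recall high
        "promote": fp_fixed == 0 and precision >= 0.99 and recall >= 0.95,
    }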


5) Query-time logic (how scanners use the DB)

Given a target binary, the scanner computes:

  • binary_key
  • basic features (arch, SONAME, symbols)
  • optional function hashes (for targeted libs)

Then it queries in this precedence order:

  1. Exact match: binary_key exists with explicit assertion (strong)
  2. Build catalog: Build-ID→known distro build→CVE mapping (strong)
  3. Fingerprint match: function/CFG hashes hit (strong, binary-only)
  4. Fallback: package range matching (weakest)

Return result as a signed VEX with evidence references.
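
A sketch of that precedence logic; the lookup_* helpers are hypothetical DB accessors, not an existing API:

def classify_binary(binary_key, features, db):
    """Resolve CVE assertions for one binary, strongest evidence first."""
    # 1. Exact match: explicit assertion for this binary_key
    hits = db.lookup_assertions(binary_key)
    if hits:
        return {"method": "exact_assertion", "assertions": hits}

    # 2. Build catalog: Build-ID -> known distro build -> CVEs
    if features.get("build_id"):
        hits = db.lookup_build_catalog(features["build_id"])
        if hits:
            return {"method": "buildid_catalog", "assertions": hits}

    # 3. Fingerprint match on function/CFG hashes (binary-only evidence)
    hits = db.match_fingerprints(features.get("function_hashes", []))
    if hits:
        return {"method": "fingerprint_match", "assertions": hits}

    # 4. Weakest fallback: package version range matching
    return {"method": "range_match", "assertions": db.lookup_package_ranges(features)}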


6) Update model: “sealed knowledge snapshots”

To make this auditable and customer-friendly:

  • Every repo snapshot is immutable and signed.

  • Every fingerprint bundle is versioned and signed.

  • Every “vulnerable binaries DB release” is a signed manifest pointing to:

    • which repo snapshots were used
    • which advisory snapshots were used
    • which fingerprint sets were included

This lets you prove:

  • what you knew
  • when you knew it
  • exactly which data drove the verdict

7) Scaling and cost control

Without control, fingerprinting explodes. Use these constraints:

  • Only disassemble/hash functions for:

    • libraries in your “hot set”
    • binaries whose package indicates relevance to a targeted CVE family
  • Deduplicate aggressively:

    • identical .text_sha256 ⇒ reuse extracted functions
    • identical Build-ID across paths ⇒ reuse features
  • Incremental snapshots:

    • process only new/changed packages per snapshot
    • store “already processed digest” cache (Valkey)
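
A sketch of the processed-digest cache, assuming the standard redis-py client (Valkey speaks the same protocol); the key naming is illustrative:

import redis  # pip install redis; works against Valkey unchanged

r = redis.Redis(host="valkey", port=6379, decode_responses=True)

def should_process(snapshot_id: str, file_sha256: str) -> bool:
    """True the first time a digest is seen; record it so reruns skip it."""
    # SADD returns 1 if the member was newly added, 0 if it was already present
    return r.sadd(f"processed:{snapshot_id}", file_sha256) == 1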

8) Security and trust boundaries

A vulnerable binary DB is itself a high-value target. Hardening must be part of architecture:

  • Verify upstream repo signatures before ingestion.

  • Run unpacking/extraction in sandboxes (namespaces/seccomp) because packages can be hostile.

  • Sign:

    • snapshot manifests
    • fingerprint sets
    • DB releases
  • Keep signing keys in an HSM/KMS.

  • Maintain provenance chain: input digests → output digests.


9) Minimal viable build plan (no wishful thinking)

MVP 1: Known-build binary catalog (fast, broad)

  • Repo snapshot + Build-ID extraction
  • CVE ranges + advisory snapshots
  • binary_key → (pkg@ver, CVEs) mapping
  • Signed snapshot releases

MVP 2: Patch-aware backport handling (precision jump)

  • Source/patch mapping
  • Build-specific “fixed-by-backport” logic

MVP 3: Fingerprints for top components (moat)

  • Fingerprint factory for ~20–50 “most reused” libs
  • Validation corpus + confidence scoring

10) What you can claim (and what you should not)

You can defensibly claim:

  • “We can identify vulnerable binaries (not just packages) with verifiable evidence.”

You should not claim (until you have it):

  • “We can detect every vulnerable binary on earth.” No one can do that without unacceptable false positives/negatives.

If you want this aligned tightly to your Stella Ops services split (scanner webservice's lattice algorithms + proof-of-integrity graph), I can map the above into:

  • concrete microservices,
  • queue/job boundaries,
  • and the exact API surfaces (/snapshots, /features, /fingerprints, /match, /vex).


To code MVP 2 (patch-aware backport handling) without any human triage, you need one thing in your database that most scanners do not maintain:

A normalized, distro-specific index: (distro, release, source-package, CVE) → fixed-in distro version (or not-affected / wontfix / unknown), with evidence and confidence.

Then every scan decision becomes a deterministic version comparison against that index, not against upstream semver.

Below is a concrete, buildable approach (pipelines + data model + code skeletons) that stays fully automated.


1) What MVP2 computes

Output table you must build

cve_fix_index

  • distro (e.g., debian, ubuntu, rhel, alpine)
  • release (e.g., bookworm, jammy, 9, 3.19)
  • source_pkg (not binary subpackage)
  • cve_id
  • state ∈ {fixed, vulnerable, not_affected, wontfix, unknown}
  • fixed_version (nullable; distro version string, including revision)
  • method ∈ {security_feed, changelog, patch_header, upstream_patch_match}
  • confidence (float)
  • evidence (JSON: references to advisory entry, changelog lines, patch names + digests)
  • snapshot_id (your sealed snapshot identifier)

Why “source package”?

Security trackers and patch sets are tracked at the source level (e.g., openssl), while runtime installs are often binary subpackages (e.g., libssl3). You need a stable join: binary_pkg -> source_pkg.
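
A sketch of building that join from a Debian-style Packages index; when the Source: field is absent the source name equals the binary package name, and a “src (version)” form is trimmed to the name:

def binary_to_source_map(packages_index_text: str) -> dict[str, str]:
    """Parse an uncompressed Packages file into {binary_pkg: source_pkg}."""
    mapping: dict[str, str] = {}
    pkg, src = None, None
    for line in packages_index_text.splitlines() + [""]:
        if line.startswith("Package:"):
            pkg = line.split(":", 1)[1].strip()
        elif line.startswith("Source:"):
            # "Source: openssl (3.0.11-1)" -> "openssl"
            src = line.split(":", 1)[1].strip().split()[0]
        elif line == "":  # blank line ends a stanza
            if pkg:
                mapping[pkg] = src or pkg
            pkg, src = None, None
    return mapping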


2) No-human signals, in strict priority order

You can do this with zero manual work by using a tiered resolver:

Tier 1 — Structured distro security feed (highest precision)

This is the authoritative “backport-aware” answer because it encodes:

  • “fixed in 1.1.1n-0ubuntu2.4” (even if upstream says “fixed in 1.1.1o”)
  • “not affected” cases
  • sometimes arch-specific applicability

Your ingestor just parses and normalizes it.

Tier 2 — Source package changelog CVE mentions

If a feed entry is missing/late, parse source changelog:

  • Debian/Ubuntu: debian/changelog
  • RPM: %changelog in .spec
  • Alpine: secfixes in APKBUILD (often present)

This is surprisingly effective because maintainers often include “CVE-XXXX-YYYY” in the entry that introduced the fix.

Tier 3 — Patch metadata (DEP-3 headers / patch filenames)

Parse patches shipped with the source package:

  • Debian: debian/patches/* + debian/patches/series
  • RPM: patch files listed in spec / SRPM
  • Alpine: patches/*.patch in the aport

Search patch headers and filenames for CVE IDs, store patch hashes.

Tier 4 — Upstream patch equivalence (optional in MVP2, strong)

If you can map CVE→upstream fix commit (OSV often helps), you can match canonicalized patch hunks against distro patches.

MVP2 can ship without Tier 4; Tiers 1+2 already eliminate most backport false positives.


3) Architecture: the “Fix Index Builder” job

Inputs

  • Your sealed repo snapshot: Packages + Sources (or SRPM/aports)
  • Distro security feed snapshot (OVAL/JSON/errata tracker) for same release
  • (Optional) OSV/NVD upstream ranges for fallback only

Processing graph

  1. Build binary_pkg → source_pkg map from repo metadata

  2. Ingest security feed → produce FixRecord(method=security_feed, confidence=0.95)

  3. For source packages in snapshot:

    • unpack source
    • parse changelog for CVE mentions → FixRecord(method=changelog, confidence=0.75–0.85)
    • parse patch headers → FixRecord(method=patch_header, confidence=0.80–0.90)
  4. Merge records into a single best record per key (distro, release, source_pkg, cve)

  5. Store into cve_fix_index with evidence

  6. Sign the resulting snapshot manifest


4) Merge logic (no human, deterministic)

You need a deterministic rule for conflicts.

Recommended (conservative but still precision-improving):

  1. If any record says not_affected with confidence ≥ 0.9 → choose not_affected
  2. Else if any record says fixed with confidence ≥ 0.9 → choose fixed and fixed_version = max_fixed_version_among_high_conf
  3. Else if any record says fixed at all → choose fixed with best available fixed_version
  4. Else if any says wontfix → choose wontfix
  5. Else unknown

Additionally:

  • Keep all evidence records in evidence so you can explain and audit.
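
A sketch of these rules as a pure merge function over all evidence records for one (distro, release, source_pkg, cve) key; choose_fixed_version is assumed to already have the distro comparator bound in (see 7.2 below):

def merge_fix_records(records: list[dict], choose_fixed_version) -> dict:
    """Collapse evidence records for one key into a single deterministic verdict."""
    def pick(state, min_conf=0.0):
        return [r for r in records if r["state"] == state and r["confidence"] >= min_conf]

    if pick("not_affected", 0.9):
        best = {"state": "not_affected"}
    elif pick("fixed"):
        chosen = pick("fixed", 0.9) or pick("fixed")   # prefer high-confidence records
        fixed_version = None
        for r in chosen:
            if r.get("fixed_version"):
                fixed_version = choose_fixed_version(fixed_version, r["fixed_version"])
        best = {"state": "fixed", "fixed_version": fixed_version}
    elif pick("wontfix"):
        best = {"state": "wontfix"}
    else:
        best = {"state": "unknown"}

    best["confidence"] = max((r["confidence"] for r in records), default=0.0)
    best["evidence"] = [r["evidence"] for r in records]   # keep everything for audit
    return best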

5) Version comparison: do not reinvent it

Backport handling lives or dies on correct version ordering.

Use official tooling in containerized workers:

  • Debian/Ubuntu: dpkg --compare-versions
  • RPM distros: rpmdev-vercmp or rpm library
  • Alpine: apk version -t

This is reliable and avoids subtle comparator bugs.

If you must do it in-process, use well-tested libraries per ecosystem (but containerized official tools are the most robust).
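
A thin wrapper sketch around dpkg's comparator; dpkg --compare-versions exits 0 when the stated relation holds and non-zero otherwise:

import subprocess

def deb_version_lt(a: str, b: str) -> bool:
    """True if Debian/Ubuntu version a sorts strictly before b."""
    result = subprocess.run(["dpkg", "--compare-versions", a, "lt", b], check=False)
    return result.returncode == 0

def deb_vercmp(a: str, b: str) -> int:
    """-1 / 0 / +1 comparator built on the same tool."""
    if deb_version_lt(a, b):
        return -1
    if deb_version_lt(b, a):
        return 1
    return 0

Usage: deb_vercmp("1.1.1n-0ubuntu2.4", "1.1.1n-0ubuntu2") returns 1, which is exactly the backport-aware ordering the fix index relies on.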


6) Concrete code: Debian/Ubuntu changelog + patch parsing

This example shows Tier 2 + Tier 3 inference for a single unpacked source tree. You would wrap this inside your snapshot processing loop.

6.1 CVE extractor

import re
from pathlib import Path
from hashlib import sha256

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")

def extract_cves(text: str) -> set[str]:
    return set(CVE_RE.findall(text or ""))

6.2 Parse the top debian/changelog entry (for this version)

This works well because when you unpack a .dsc for version V, the top entry is for V.

def parse_debian_changelog_top_entry(src_dir: Path) -> tuple[str, set[str], dict]:
    """
    Returns:
      version: str
      cves: set[str] found in the top entry
      evidence: dict with excerpt for explainability
    """
    changelog_path = src_dir / "debian" / "changelog"
    if not changelog_path.exists():
        return "", set(), {}

    lines = changelog_path.read_text(errors="replace").splitlines()
    if not lines:
        return "", set(), {}

    # First line: "pkgname (version) distro; urgency=..."
    m = re.match(r"^[^\s]+\s+\(([^)]+)\)\s+", lines[0])
    version = m.group(1) if m else ""

    entry_lines = [lines[0]]
    # Collect until maintainer trailer line: " -- Name <email>  date"
    for line in lines[1:]:
        entry_lines.append(line)
        if line.startswith(" -- "):
            break

    entry_text = "\n".join(entry_lines)
    cves = extract_cves(entry_text)

    evidence = {
        "file": "debian/changelog",
        "version": version,
        "excerpt": entry_text[:2000],  # store small excerpt, not whole file
    }
    return version, cves, evidence

6.3 Parse CVEs from patch headers (DEP-3-ish)

def parse_debian_patches_for_cves(src_dir: Path) -> tuple[dict[str, list[dict]], dict]:
    """
    Returns:
      cve_to_patches: {CVE: [ {path, sha256, header_excerpt}, ... ]}
      evidence_summary: dict
    """
    patches_dir = src_dir / "debian" / "patches"
    if not patches_dir.exists():
        return {}, {}

    cve_to_patches: dict[str, list[dict]] = {}

    for patch in patches_dir.glob("*"):
        if not patch.is_file() or patch.name == "series":
            continue
        # Use only the first ~80 lines of each patch as the DEP-3 header region
        header = "\n".join(patch.read_text(errors="replace").splitlines()[:80])
        cves = extract_cves(header + "\n" + patch.name)
        if not cves:
            continue

        digest = sha256(patch.read_bytes()).hexdigest()
        rec = {
            "path": str(patch.relative_to(src_dir)),
            "sha256": digest,
            "header_excerpt": header[:1200],
        }
        for cve in cves:
            cve_to_patches.setdefault(cve, []).append(rec)

    evidence = {
        "dir": "debian/patches",
        "matched_cves": len(cve_to_patches),
    }
    return cve_to_patches, evidence

6.4 Produce FixRecords from the source tree

def infer_fix_records_from_debian_source(src_dir: Path, distro: str, release: str, source_pkg: str, snapshot_id: str):
    version, changelog_cves, changelog_ev = parse_debian_changelog_top_entry(src_dir)
    cve_to_patches, patch_ev = parse_debian_patches_for_cves(src_dir)

    records = []

    # Changelog-based: treat CVE mentioned in top entry as fixed in this version
    for cve in changelog_cves:
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "changelog",
            "confidence": 0.80,
            "evidence": {"changelog": changelog_ev},
            "snapshot_id": snapshot_id,
        })

    # Patch-header-based: treat CVE-tagged patches as fixed in this version
    for cve, patches in cve_to_patches.items():
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "patch_header",
            "confidence": 0.87,
            "evidence": {"patches": patches, "patch_summary": patch_ev},
            "snapshot_id": snapshot_id,
        })

    return records

That is the automated “patch-aware” signal generator.


7) Wiring this into your database build

7.1 Store raw evidence and merged result

Two-stage storage is worth it:

  1. cve_fix_evidence (append-only)
  2. cve_fix_index (merged best record)

So you can:

  • rerun merge rules
  • improve confidence scoring
  • keep auditability

7.2 Merging “fixed_version” for a CVE

When multiple versions mention the same CVE, you usually want the latest mentioning version (highest by the distro comparator), because repeated mentions often indicate an earlier partial fix.

Pseudo:

def choose_fixed_version(existing: str | None, candidate: str, vercmp) -> str:
    if not existing:
        return candidate
    return candidate if vercmp(candidate, existing) > 0 else existing

Where vercmp calls dpkg --compare-versions (Debian) or equivalent for that distro.


8) Decisioning logic at scan time (what changes with MVP2)

Without MVP2, you likely do:

  • upstream range check (false positives for backports)

With MVP2, you do:

  1. identify distro+release from environment (or image base)
  2. map binary_pkg → source_pkg
  3. query cve_fix_index(distro, release, source_pkg, cve)
  4. if state=fixed and pkg_version >= fixed_version → fixed
  5. if state=not_affected → safe
  6. else fallback to upstream ranges

That single substitution removes most backport noise.
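
A sketch of that decision path; query_fix_index and upstream_says_vulnerable are hypothetical lookups, and vercmp is the distro comparator from section 5:

def decide(distro, release, binary_pkg, pkg_version, cve_id,
           src_map, query_fix_index, vercmp, upstream_says_vulnerable):
    """MVP2 scan-time verdict for one installed package and one CVE."""
    source_pkg = src_map.get(binary_pkg, binary_pkg)
    rec = query_fix_index(distro, release, source_pkg, cve_id)  # cve_fix_index row or None

    if rec and rec["state"] == "not_affected":
        return "not_affected"
    if rec and rec["state"] == "fixed" and rec.get("fixed_version"):
        return "fixed" if vercmp(pkg_version, rec["fixed_version"]) >= 0 else "affected"
    # No distro-specific knowledge: fall back to upstream ranges (weakest signal)
    return "affected" if upstream_says_vulnerable(cve_id, pkg_version) else "unknown"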


9) Practical notes so you don't get trapped

A) You must know the distro release

Backport reality is release-specific. The same package name/version can have different patching across releases.

B) Arch-specific fixes exist

Your schema should allow arch on fix records (nullable). If the feed says “only amd64 affected,” store it.

C) False positives in changelog parsing

Mitigation without humans:

  • require CVE mention + security-ish verbs (“fix”, “security”, “patch”) OR presence in patch headers
  • otherwise lower confidence and let feed override
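
A sketch of that gate; the verb list and the confidence values are illustrative:

import re

SECURITY_VERB_RE = re.compile(r"\b(fix(es|ed)?|security|patch(es|ed)?|backport(s|ed)?)\b", re.I)

def changelog_confidence(entry_text: str, cve_in_patch_headers: bool) -> float:
    """Downgrade changelog-only CVE mentions that lack fix/security wording."""
    if cve_in_patch_headers or SECURITY_VERB_RE.search(entry_text):
        return 0.80
    return 0.50  # bare mention; keep it low so the security feed overrides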

D) Keep evidence small

Store:

  • excerpts and patch hashes, not entire source tarballs.

10) Minimal “done definition” for MVP2

You have MVP2 when, for Debian/Ubuntu at least, you can demonstrate:

  • A CVE that upstream marks vulnerable for version X

  • The distro backported the fix in a later revision of X

  • Your system classifies:

    • X-older_revision as vulnerable
    • X-newer_revision as fixed
  • With evidence: fix feed record and/or changelog/patch proof

No human required.


If you want, I can provide the same “Tier 2/3 inference” module for RPM (SRPM/spec parsing) and Alpine (APKBUILD secfixes extraction), plus the exact Postgres DDL for cve_fix_evidence and cve_fix_index, and the merge SQL.