Below is a practical, production-grade architecture for building a vulnerable binaries database. I’m going to be explicit about what “such a database” can mean, because there are two materially different products:
- Known-build catalog: “These exact shipped binaries (Build-ID / hash) are affected or fixed for CVE X.”
- Binary fingerprint DB: “Even if the binary is unpackaged / self-built, we can match vulnerable code patterns.”
You want both. The first gets you breadth fast; the second is the moat.
1) Core principle: treat “binary identity” as the primary key
For Linux ELF:
- Primary: ELF Build-ID (from .note.gnu.build-id)
- Fallback: sha256(file_bytes)
- Add: sha256(.text) and/or BLAKE3 for speed
This creates a stable identity that survives “package metadata lies.”
BinaryKey = build_id || file_sha256
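A minimal identity-extraction sketch, assuming pyelftools (an assumption; no library is named above) for the ELF parsing; field names mirror the scheme just described:

import hashlib

from elftools.elf.elffile import ELFFile
from elftools.elf.sections import NoteSection

def compute_binary_identity(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()
    file_sha256 = hashlib.sha256(data).hexdigest()

    build_id = None
    text_sha256 = None
    with open(path, "rb") as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            if isinstance(section, NoteSection):
                for note in section.iter_notes():
                    if note["n_type"] == "NT_GNU_BUILD_ID":
                        build_id = note["n_desc"]  # pyelftools decodes this note to a hex string
            elif section.name == ".text":
                text_sha256 = hashlib.sha256(section.data()).hexdigest()

    return {
        "binary_key": build_id or file_sha256,  # Build-ID primary, file hash fallback
        "build_id": build_id,
        "file_sha256": file_sha256,
        "text_sha256": text_sha256,
    }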
2) High-level system diagram
┌──────────────────────────┐
│ Vulnerability Intel │
│ OSV/NVD + distro advis. │
└───────────┬──────────────┘
│ normalize
v
┌──────────────────────────┐
│ Vuln Knowledge Store │
│ CVE↔pkg ranges, patches │
└───────────┬──────────────┘
│
│
┌───────────────────────v─────────────────────────┐
│ Repo Snapshotter (per distro/arch/date) │
│ - mirrors metadata + packages (+ debuginfo) │
│ - verifies signatures │
│ - emits signed snapshot manifest │
└───────────┬───────────────────────────┬─────────┘
│ │
│ packages │ debuginfo/sources
v v
┌──────────────────────────┐ ┌──────────────────────────┐
│ Package Unpacker │ │ Source/Buildinfo Mapper │
│ - extract files │ │ - pkg→source commit/patch │
└───────────┬──────────────┘ └───────────┬──────────────┘
│ binaries │
v │
┌──────────────────────────┐ │
│ Binary Feature Extractor │ │
│ - Build-ID, hashes │ │
│ - dyn deps, symbols │ │
│ - function boundaries (opt)│ │
└───────────┬──────────────┘ │
│ │
v v
┌──────────────────────────────────────────────────┐
│ Vulnerable Binary Classifier │
│ Tier A: pkg/version range │
│ Tier B: Build-ID→known shipped build │
│ Tier C: code fingerprints (function/CFG hashes) │
└───────────┬───────────────────────────┬──────────┘
│ │
v v
┌──────────────────────────┐ ┌──────────────────────────┐
│ Vulnerable Binary DB │ │ Evidence/Attestation DB │
│ (indexed by BinaryKey) │ │ (signed proofs, snapshots)│
└───────────┬──────────────┘ └───────────┬──────────────┘
│ publish signed snapshot │
v v
Clients/Scanners Explainable VEX outputs
3) Data stores you actually need
A) Relational store (Postgres)
Use this for indexes and joins.
Key tables:
binary_identity
- binary_key (build_id or file_sha256), PK
- build_id (nullable)
- file_sha256, text_sha256
- arch, osabi, type (ET_DYN/EXEC), stripped
- first_seen_snapshot, last_seen_snapshot
binary_package_map
- binary_key
- distro, pkg_name, pkg_version_release, arch
- file_path_in_pkg, snapshot_id
snapshot_manifest
- snapshot_id
- distro, arch, timestamp
- repo_metadata_digests, signing_key_id, dsse_envelope_ref
cve_package_ranges
- cve_id, ecosystem (deb/rpm/apk), pkg_name
- vulnerable_ranges, fixed_ranges
- advisory_ref, snapshot_id
binary_vuln_assertion
- binary_key, cve_id
- status ∈ {affected, not_affected, fixed, unknown}
- method ∈ {range_match, buildid_catalog, fingerprint_match}
- confidence (0–1)
- evidence_ref (points to signed evidence)
B) Object store (S3/MinIO)
Do not bloat Postgres with large blobs.
Store:
- extracted symbol lists, string tables
- function hash maps
- disassembly snippets for matched functions (small)
- DSSE envelopes / attestations
- optional: debug info extracts (or references to where they can be fetched)
C) Optional search index (OpenSearch/Elastic)
If you want fast “find all binaries exporting SSL_read” style queries, index symbols/strings.
4) Building the database: pipelines
Pipeline 1: Distro repo snapshots → Known-build catalog (breadth)
This is your fastest route to a “binaries DB.”
Step 1 — Snapshot
- Mirror repo metadata + packages for (distro, release, arch).
- Verify signatures (APT Release.gpg, RPM signatures, APK signatures).
- Emit signed snapshot manifest (DSSE) listing digests of everything mirrored.
Step 2 — Extract binaries
For each package:
- unpack (deb/rpm/apk)
- select ELF files (EXEC + shared libs)
- compute Build-ID, file hash, .text hash
- store identity + binary_package_map row
Step 3 — Assign CVE status (Tier A + Tier B)
- Ingest distro advisories and/or OSV mappings into cve_package_ranges
- For each binary_package_map row, apply range checks
- Create binary_vuln_assertion entries: method=range_match (coarse)
- If you have a Build-ID mapping to exact shipped builds, you can tag: method=buildid_catalog (stronger than pure version)
This yields a database where a scanner can do:
- “Given Build-ID, tell me all CVEs per the distro snapshot.”
This already reduces noise because the primary key is the binary.
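A hedged sketch of the Step 3 range check; it assumes vulnerable_ranges is stored as (introduced, fixed) pairs and that vercmp is a distro-aware comparator (official tooling for that appears in the MVP2 half below). Row shapes and helper names are illustrative:

def assert_range_match(binary_row: dict, ranges_row: dict, vercmp) -> dict:
    """Emit one binary_vuln_assertion for a (binary, CVE) pair via Tier A."""
    v = binary_row["pkg_version_release"]
    affected = any(
        vercmp(v, introduced) >= 0 and (fixed is None or vercmp(v, fixed) < 0)
        for introduced, fixed in ranges_row["vulnerable_ranges"]
    )
    return {
        "binary_key": binary_row["binary_key"],
        "cve_id": ranges_row["cve_id"],
        "status": "affected" if affected else "fixed",
        "method": "range_match",
        "confidence": 0.6,  # coarse: version logic only, no backport awareness
        "evidence_ref": ranges_row["advisory_ref"],
    }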
Pipeline 2: Patch-aware classification (backports handled)
To handle “version says vulnerable but backport fixed” you must incorporate patch provenance.
Step 1 — Build provenance mapping
Per ecosystem:
- Debian/Ubuntu: parse Sources, changelogs, (ideally) .buildinfo, patch series.
- RPM distros: SRPM + changelog + patch list.
- Alpine: APKBUILD + patches.
Step 2 — CVE ↔ patch linkage
From advisories and patch metadata, store:
- “CVE fixed by patch set P in build B of pkg V-R”
Step 3 — Apply to binaries
Instead of version-only, decide:
- if the specific build includes the patch
- mark as fixed even if the upstream version looks vulnerable
This is still not “binary-only,” but it’s much closer to truth for distros.
Pipeline 3: Binary fingerprint factory (the moat)
This is where you become independent of packaging claims.
You build fingerprints at the function/CFG level for high-impact CVEs.
3.1 Select targets
You cannot fingerprint everything. Start with:
- top shared libs (openssl, glibc, zlib, expat, libxml2, curl, sqlite, ncurses, etc.)
- CVEs that are exploited in the wild / high-severity
- CVEs where distros backport heavily (version logic is unreliable)
3.2 Identify “changed functions” from the fix
Input: upstream commit/patch or distro patch.
Process:
- diff the patch
- extract affected files + functions (tree-sitter/ctags + diff hunks)
- list candidate functions and key basic blocks
3.3 Build vulnerable + fixed reference binaries
For each (arch, toolchain profile):
- compile “known vulnerable” and “known fixed”
- ensure reproducibility: record compiler version, flags, link mode
- store provenance (DSSE) for these reference builds
3.4 Extract robust fingerprints
Avoid raw byte signatures (they break across compilers).
Better fingerprint types, from weakest to strongest:
- symbol-level: function name + versioned symbol + library SONAME
- function normalized hash:
  - disassemble the function
  - normalize:
    - strip addresses/relocs
    - bucket registers
    - normalize immediates (where safe)
  - hash the instruction sequence or basic-block sequence
- basic-block multiset hash:
  - build a set/multiset of block hashes; order-independent
- lightweight CFG hash:
  - nodes: block hashes
  - edges: control flow
  - hash a canonical representation
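To make the “function normalized hash” concrete, here is a minimal sketch assuming Capstone for disassembly; the register/immediate normalization is deliberately crude and only illustrative:

import hashlib
import re

import capstone

md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)

def func_norm_hash(code: bytes, addr: int = 0x1000) -> str:
    tokens = []
    for insn in md.disasm(code, addr):
        # Bucket registers and immediates so the hash survives register
        # allocation and address-layout differences across builds.
        ops = re.sub(r"\b(r[a-z0-9]+|e[a-z]{2}|[a-d][lh])\b", "REG", insn.op_str)
        ops = re.sub(r"0x[0-9a-fA-F]+|\b\d+\b", "IMM", ops)
        tokens.append(f"{insn.mnemonic} {ops}")
    return hashlib.sha256("\n".join(tokens).encode()).hexdigest()

A basic-block multiset hash is the same idea applied per block, hashing the sorted multiset of block hashes instead of the instruction sequence.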
Store fingerprints like:
vuln_fingerprint
cve_idcomponent(openssl/libssl)archfp_type(func_norm_hash, bb_multiset, cfg_hash)fp_valuefunction_hint(name if present; else pattern)confidence,notesevidence_ref(points to reference builds + patch)
3.5 Validate fingerprints at scale
This is non-negotiable.
Validation loop:
- Test against:
  - known vulnerable builds (must match)
  - known fixed builds (must not match)
  - a large “benign corpus” (estimate false positives)
- Maintain:
  - precision/recall metrics per fingerprint
  - a confidence score
Only promote fingerprints to “production” when validation passes thresholds.
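A small sketch of the promotion metrics, with fp_match standing in for “does this fingerprint hit this build” (illustrative names):

def fingerprint_metrics(fp_match, vuln_builds, fixed_builds, benign_corpus):
    tp = sum(1 for b in vuln_builds if fp_match(b))  # must match
    fn = len(vuln_builds) - tp
    fp = (sum(1 for b in fixed_builds if fp_match(b))
          + sum(1 for b in benign_corpus if fp_match(b)))  # must not match
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall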
5) Query-time logic (how scanners use the DB)
Given a target binary, the scanner computes:
- binary_key
- basic features (arch, SONAME, symbols)
- optional function hashes (for targeted libs)
Then it queries in this precedence order:
- Exact match: binary_key exists with an explicit assertion (strong)
- Build catalog: Build-ID → known distro build → CVE mapping (strong)
- Fingerprint match: function/CFG hashes hit (strong, binary-only)
- Fallback: package range matching (weakest)
Return result as a signed VEX with evidence references.
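Sketched as code, with the db accessors standing in for the queries above (an illustrative API surface, not a fixed one):

def classify(db, target: dict):
    """Apply the precedence order above; the first strong hit wins."""
    if a := db.assertion_for(target["binary_key"]):
        return a  # explicit, pre-computed assertion
    if target.get("build_id") and (b := db.buildid_catalog(target["build_id"])):
        return b  # known distro build
    if hits := db.fingerprint_hits(target.get("func_hashes", [])):
        return hits  # binary-only evidence
    return db.range_match(target["pkg_name"], target["pkg_version"])  # weakest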
6) Update model: “sealed knowledge snapshots”
To make this auditable and customer-friendly:
- Every repo snapshot is immutable and signed.
- Every fingerprint bundle is versioned and signed.
- Every “vulnerable binaries DB release” is a signed manifest pointing to:
  - which repo snapshots were used
  - which advisory snapshots were used
  - which fingerprint sets were included
This lets you prove:
- what you knew
- when you knew it
- exactly which data drove the verdict
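For illustration, one possible shape for that release manifest (field names are assumptions, not a fixed schema); a DSSE envelope would wrap this JSON as its payload:

db_release_manifest = {
    "db_release": "2025-06-01.1",
    "repo_snapshots": ["sha256:..."],      # sealed repo snapshot manifests
    "advisory_snapshots": ["sha256:..."],  # advisory feed snapshots
    "fingerprint_sets": ["sha256:..."],    # versioned fingerprint bundles
}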
7) Scaling and cost control
Without control, fingerprinting explodes. Use these constraints:
- Only disassemble/hash functions for:
  - libraries in your “hot set”
  - binaries whose package indicates relevance to a targeted CVE family
- Deduplicate aggressively:
  - identical .text_sha256 ⇒ reuse extracted functions
  - identical Build-ID across paths ⇒ reuse features
- Incremental snapshots:
  - process only new/changed packages per snapshot
  - store an “already processed digest” cache (Valkey)
8) Security and trust boundaries
A vulnerable binary DB is itself a high-value target. Hardening must be part of architecture:
- Verify upstream repo signatures before ingestion.
- Run unpacking/extraction in sandboxes (namespaces/seccomp) because packages can be hostile.
- Sign:
  - snapshot manifests
  - fingerprint sets
  - DB releases
- Keep signing keys in an HSM/KMS.
- Maintain the provenance chain: input digests → output digests.
9) Minimal viable build plan (no wishful thinking)
MVP 1: Known-build binary catalog (fast, broad)
- Repo snapshot + Build-ID extraction
- CVE ranges + advisory snapshots
- binary_key → (pkg@ver, CVEs) mapping
- Signed snapshot releases
MVP 2: Patch-aware backport handling (precision jump)
- Source/patch mapping
- Build-specific “fixed-by-backport” logic
MVP 3: Fingerprints for top components (moat)
- Fingerprint factory for ~20–50 “most reused” libs
- Validation corpus + confidence scoring
10) What you can claim (and what you should not)
You can defensibly claim:
- “We can identify vulnerable binaries (not just packages) with verifiable evidence.”
You should not claim (until you have it):
- “We can detect every vulnerable binary on earth.” No one can do that without unacceptable false positives/negatives.
If you want this aligned tightly to your Stella Ops services split (scanner webservice lattice algorithms + proof-of-integrity graph), I can map the above into:
- concrete microservices,
- queue/job boundaries,
- and the exact API surfaces (/snapshots, /features, /fingerprints, /match, /vex).
To code MVP 2 (patch-aware backport handling) without any human triage, you need one thing in your database that most scanners do not maintain:
A normalized, distro-specific index: (distro, release, source-package, CVE) → fixed-in distro version (or not-affected / wontfix / unknown), with evidence and confidence.
Then every scan decision becomes a deterministic version comparison against that index, not against upstream semver.
Below is a concrete, buildable approach (pipelines + data model + code skeletons) that stays fully automated.
1) What MVP2 computes
Output table you must build
cve_fix_index
- distro (e.g., debian, ubuntu, rhel, alpine)
- release (e.g., bookworm, jammy, 9, 3.19)
- source_pkg (not binary subpackage)
- cve_id
- state ∈ {fixed, vulnerable, not_affected, wontfix, unknown}
- fixed_version (nullable; distro version string, including revision)
- method ∈ {security_feed, changelog, patch_header, upstream_patch_match}
- confidence (float)
- evidence (JSON: references to advisory entry, changelog lines, patch names + digests)
- snapshot_id (your sealed snapshot identifier)
Why “source package”?
Security trackers and patch sets are tracked at the source level (e.g., openssl), while runtime installs are often binary subpackages (e.g., libssl3). You need a stable join:
binary_pkg -> source_pkg.
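For Debian-family metadata that join is one field away; a small sketch (the Source: field in a Packages stanza is optional and may carry a “(version)” annotation):

def source_pkg_of(pkg_stanza: dict) -> str:
    # Fall back to the binary package name when Source: is absent.
    src = pkg_stanza.get("Source", pkg_stanza["Package"])
    return src.split(" ")[0]  # strip the optional "(version)" suffix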
2) No-human signals, in strict priority order
You can do this with zero manual work by using a tiered resolver:
Tier 1 — Structured distro security feed (highest precision)
This is the authoritative “backport-aware” answer because it encodes:
- “fixed in 1.1.1n-0ubuntu2.4” (even if upstream says “fixed in 1.1.1o”)
- “not affected” cases
- sometimes arch-specific applicability
Your ingestor just parses and normalizes it.
Tier 2 — Source package changelog CVE mentions
If a feed entry is missing/late, parse source changelog:
- Debian/Ubuntu: debian/changelog
- RPM: %changelog in .spec
- Alpine: secfixes in APKBUILD (often present)
This is surprisingly effective because maintainers often include “CVE-XXXX-YYYY” in the entry that introduced the fix.
Tier 3 — Patch metadata (DEP-3 headers / patch filenames)
Parse patches shipped with the source package:
- Debian: debian/patches/* + debian/patches/series
- RPM: patch files listed in the spec / SRPM
- Alpine: patches/*.patch in the aport
Search patch headers and filenames for CVE IDs, store patch hashes.
Tier 4 — Upstream patch equivalence (optional in MVP2, strong)
If you can map CVE→upstream fix commit (OSV often helps), you can match canonicalized patch hunks against distro patches.
MVP2 can ship without Tier 4; Tier 1+2 already eliminates most backport false positives.
3) Architecture: the “Fix Index Builder” job
Inputs
- Your sealed repo snapshot: Packages + Sources (or SRPM/aports)
- Distro security feed snapshot (OVAL/JSON/errata tracker) for same release
- (Optional) OSV/NVD upstream ranges for fallback only
Processing graph
- Build the binary_pkg → source_pkg map from repo metadata
- Ingest the security feed → produce FixRecord (method=security_feed, confidence=0.95)
- For source packages in the snapshot:
  - unpack the source
  - parse the changelog for CVE mentions → FixRecord (method=changelog, confidence=0.75–0.85)
  - parse patch headers → FixRecord (method=patch_header, confidence=0.80–0.90)
- Merge records into a single best record per key (distro, release, source_pkg, cve)
- Store into cve_fix_index with evidence
- Sign the resulting snapshot manifest
4) Merge logic (no human, deterministic)
You need a deterministic rule for conflicts.
Recommended (conservative but still precision-improving):
- If any record says not_affected with confidence ≥ 0.9 → choose not_affected
- Else if any record says fixed with confidence ≥ 0.9 → choose fixed and fixed_version = max_fixed_version_among_high_conf
- Else if any record says fixed at all → choose fixed with the best available fixed_version
- Else if any says wontfix → choose wontfix
- Else unknown
Additionally:
- Keep all evidence records in evidence so you can explain and audit.
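A hedged sketch of that merge; records are FixRecord dicts for one (distro, release, source_pkg, cve), and vercmp is the distro comparator covered in the next section:

import functools

def merge_fix_records(records: list[dict], vercmp) -> dict:
    def matching(state: str, min_conf: float = 0.0) -> list[dict]:
        return [r for r in records
                if r["state"] == state and r["confidence"] >= min_conf]

    evidence = [r["evidence"] for r in records]  # keep everything for audit
    if na := matching("not_affected", 0.9):
        chosen = dict(max(na, key=lambda r: r["confidence"]))
    elif hi := matching("fixed", 0.9):
        chosen = dict(max(hi, key=lambda r: r["confidence"]))
        # Highest fixed_version among high-confidence records, per distro comparator
        chosen["fixed_version"] = max(
            (r["fixed_version"] for r in hi if r["fixed_version"]),
            key=functools.cmp_to_key(vercmp), default=None,
        )
    elif fx := matching("fixed"):
        chosen = dict(max(fx, key=lambda r: r["confidence"]))
    elif wf := matching("wontfix"):
        chosen = dict(max(wf, key=lambda r: r["confidence"]))
    else:
        chosen = {"state": "unknown", "confidence": 0.0}
    chosen["evidence"] = evidence
    return chosen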
5) Version comparison: do not reinvent it
Backport handling lives or dies on correct version ordering.
Practical approach (recommended for ingestion + server-side decisioning)
Use official tooling in containerized workers:
- Debian/Ubuntu: dpkg --compare-versions
- RPM distros: rpmdev-vercmp or the rpm library
- Alpine: apk version -t
This is reliable and avoids subtle comparator bugs.
If you must do it in-process, use well-tested libraries per ecosystem (but containerized official tools are the most robust).
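For example, a thin Debian/Ubuntu wrapper (dpkg --compare-versions exits 0 when the stated relation holds):

import subprocess

def dpkg_vercmp(a: str, b: str) -> int:
    """Order two Debian version strings; returns -1, 0, or 1."""
    for op, result in (("lt", -1), ("eq", 0)):
        rc = subprocess.run(
            ["dpkg", "--compare-versions", a, op, b],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode
        if rc == 0:
            return result
    return 1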
6) Concrete code: Debian/Ubuntu changelog + patch parsing
This example shows Tier 2 + Tier 3 inference for a single unpacked source tree. You would wrap this inside your snapshot processing loop.
6.1 CVE extractor
import re
from pathlib import Path
from hashlib import sha256

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")

def extract_cves(text: str) -> set[str]:
    return set(CVE_RE.findall(text or ""))
6.2 Parse the top debian/changelog entry (for this version)
This works well because when you unpack a .dsc for version V, the top entry is for V.
def parse_debian_changelog_top_entry(src_dir: Path) -> tuple[str, set[str], dict]:
    """
    Returns:
      version: str
      cves: set[str] found in the top entry
      evidence: dict with excerpt for explainability
    """
    changelog_path = src_dir / "debian" / "changelog"
    if not changelog_path.exists():
        return "", set(), {}

    lines = changelog_path.read_text(errors="replace").splitlines()
    if not lines:
        return "", set(), {}

    # First line: "pkgname (version) distro; urgency=..."
    m = re.match(r"^[^\s]+\s+\(([^)]+)\)\s+", lines[0])
    version = m.group(1) if m else ""

    entry_lines = [lines[0]]
    # Collect until the maintainer trailer line: " -- Name <email> date"
    for line in lines[1:]:
        entry_lines.append(line)
        if line.startswith(" -- "):
            break

    entry_text = "\n".join(entry_lines)
    cves = extract_cves(entry_text)
    evidence = {
        "file": "debian/changelog",
        "version": version,
        "excerpt": entry_text[:2000],  # store a small excerpt, not the whole file
    }
    return version, cves, evidence
6.3 Parse CVEs from patch headers (DEP-3-ish)
def parse_debian_patches_for_cves(src_dir: Path) -> tuple[dict[str, list[dict]], dict]:
    """
    Returns:
      cve_to_patches: {CVE: [ {path, sha256, header_excerpt}, ... ]}
      evidence_summary: dict
    """
    patches_dir = src_dir / "debian" / "patches"
    if not patches_dir.exists():
        return {}, {}

    cve_to_patches: dict[str, list[dict]] = {}
    for patch in patches_dir.glob("*"):
        # Skip the series index itself: it lists patch filenames, which may
        # contain CVE IDs and would create bogus matches.
        if not patch.is_file() or patch.name == "series":
            continue
        # Read only the first N lines to keep it cheap
        header = "\n".join(patch.read_text(errors="replace").splitlines()[:80])
        cves = extract_cves(header + "\n" + patch.name)
        if not cves:
            continue
        digest = sha256(patch.read_bytes()).hexdigest()
        rec = {
            "path": str(patch.relative_to(src_dir)),
            "sha256": digest,
            "header_excerpt": header[:1200],
        }
        for cve in cves:
            cve_to_patches.setdefault(cve, []).append(rec)

    evidence = {
        "dir": "debian/patches",
        "matched_cves": len(cve_to_patches),
    }
    return cve_to_patches, evidence
6.4 Produce FixRecords from the source tree
def infer_fix_records_from_debian_source(src_dir: Path, distro: str, release: str,
                                         source_pkg: str, snapshot_id: str):
    version, changelog_cves, changelog_ev = parse_debian_changelog_top_entry(src_dir)
    cve_to_patches, patch_ev = parse_debian_patches_for_cves(src_dir)

    records = []

    # Changelog-based: treat a CVE mentioned in the top entry as fixed in this version
    for cve in changelog_cves:
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "changelog",
            "confidence": 0.80,
            "evidence": {"changelog": changelog_ev},
            "snapshot_id": snapshot_id,
        })

    # Patch-header-based: treat CVE-tagged patches as fixed in this version
    for cve, patches in cve_to_patches.items():
        records.append({
            "distro": distro,
            "release": release,
            "source_pkg": source_pkg,
            "cve_id": cve,
            "state": "fixed",
            "fixed_version": version,
            "method": "patch_header",
            "confidence": 0.87,
            "evidence": {"patches": patches, "patch_summary": patch_ev},
            "snapshot_id": snapshot_id,
        })

    return records
That is the automated “patch-aware” signal generator.
7) Wiring this into your database build
7.1 Store raw evidence and merged result
Two-stage storage is worth it:
- cve_fix_evidence (append-only)
- cve_fix_index (merged best record)
So you can:
- rerun merge rules
- improve confidence scoring
- keep auditability
7.2 Merging “fixed_version” for a CVE
When multiple versions mention the same CVE, you usually want the latest mentioning version (highest by the distro comparator), because repeated mentions often indicate an earlier partial fix.
Pseudo:
def choose_fixed_version(existing: str | None, candidate: str, vercmp) -> str:
    if not existing:
        return candidate
    return candidate if vercmp(candidate, existing) > 0 else existing
Where vercmp calls dpkg --compare-versions (Debian) or equivalent for that distro.
8) Decisioning logic at scan time (what changes with MVP2)
Without MVP2, you likely do:
- upstream range check (false positives for backports)
With MVP2, you do:
- identify distro + release from the environment (or image base)
- map binary_pkg → source_pkg
- query cve_fix_index (distro, release, source_pkg, cve)
- if state=fixed and pkg_version >= fixed_version → fixed
- if state=not_affected → safe
- else fall back to upstream ranges
That single substitution removes most backport noise.
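As a sketch, with lookup wrapping the cve_fix_index query, vercmp the distro comparator, and upstream_check the pre-MVP2 range fallback (all names illustrative):

def decide(lookup, vercmp, upstream_check,
           distro: str, release: str, source_pkg: str,
           cve: str, installed_version: str) -> str:
    row = lookup(distro, release, source_pkg, cve)
    if row is None or row["state"] == "unknown":
        return upstream_check(cve, installed_version)  # weakest signal
    if row["state"] == "not_affected":
        return "safe"
    if row["state"] == "fixed" and row["fixed_version"]:
        if vercmp(installed_version, row["fixed_version"]) >= 0:
            return "fixed"
        return "vulnerable"
    return upstream_check(cve, installed_version)  # e.g., wontfix states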
9) Practical notes so you don’t get trapped
A) You must know the distro release
Backport reality is release-specific. The same package name/version can have different patching across releases.
B) Arch-specific fixes exist
Your schema should allow arch on fix records (nullable). If the feed says “only amd64 affected,” store it.
C) False positives in changelog parsing
Mitigation without humans:
- require CVE mention + security-ish verbs (“fix”, “security”, “patch”) OR presence in patch headers
- otherwise lower confidence and let feed override
D) Keep evidence small
Store:
- excerpts + patch hashes, not entire source tarballs.
10) Minimal “done definition” for MVP2
You have MVP2 when, for Debian/Ubuntu at least, you can demonstrate:
- A CVE that upstream marks vulnerable for version X
- The distro backported the fix in a revision of X (X-<rev>)
- Your system classifies:
  - X-older_revision as vulnerable
  - X-newer_revision as fixed
- With evidence: a fix feed record and/or changelog/patch proof
No human required.
If you want, I can provide the same “Tier 2/3 inference” module for RPM (SRPM/spec parsing) and Alpine (APKBUILD secfixes extraction), plus the exact Postgres DDL for cve_fix_evidence and cve_fix_index, and the merge SQL.