git.stella-ops.org/docs-archived/product/advisories/03-Dec-2026 - Building a Binary Fingerprint Database.md
2026-01-08 09:06:03 +02:00

Here's a compact, practical blueprint for a binary-fingerprint store + trust-scoring engine that lets you quickly tell whether a system binary is patched, backported, or risky, even fully offline.

Why this matters (plain English)

Package versions lie (backports!). Instead of trusting names like libssl 1.1.1k, we trust what's inside: build IDs, section hashes, compiler metadata, and signed provenance. With that, we can answer: is this exact binary known-good, known-bad, or unknown on this distro, on this date, with these patches?


Core concept

  • Binary Fingerprint = tuple of:

    • BuildID (ELF/PE), if present.
    • Section-level hashes (e.g., .text, .rodata, selected function ranges).
    • Compiler/Linker metadata (vendor/version, LTO flags, PIE/RELRO, sanitizer bits).
    • Symbol graph sketch (optional, MinHash of exported symbol names + sizes).
    • Feature toggles (FIPS mode, CET/CFI present, Fortify level, RELRO type, SSP).
  • Provenance Chain (who built it): Upstream → Distro vendor (with patchset) → Local rebuild.

  • Trust Score: combines provenance weight + cryptographic attestations + “golden set” matches + observed patch deltas.
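The optional symbol sketch can be illustrated with a small MinHash. This is a minimal sketch, not a reference implementation: the function names, the `name:size` token format, the slot count `k=16`, and the choice of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def symbol_sketch(symbols, k=16):
    """Return a k-slot MinHash sketch over (name, size) pairs of exported symbols.

    The token format and k are illustrative assumptions, not a fixed spec.
    """
    tokens = [f"{name}:{size}".encode() for name, size in symbols]
    sketch = []
    for i in range(k):
        salt = i.to_bytes(4, "little")  # one salted hash function per slot
        slot = min(
            (
                int.from_bytes(
                    hashlib.blake2b(t, digest_size=8, salt=salt).digest(), "big"
                )
                for t in tokens
            ),
            default=0,  # empty symbol table: all slots are 0
        )
        sketch.append(slot)
    return tuple(sketch)


def sketch_similarity(a, b):
    """Estimate Jaccard similarity as the fraction of slots that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Because each slot is a minimum over the whole token set, the sketch does not depend on symbol order, and two binaries that share most exported symbols agree on most slots. That is what makes it usable for narrowing candidates before any expensive function hashing.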


Minimal architecture (fits StellaOps style)

  1. Ingesters

    • ingester.distro: walks repo mirrors or local systems, extracts ELF/PE, computes fingerprints, captures package→file mapping, vendor patch metadata (changelog, source SRPM diffs).
    • ingester.upstream: indexes upstream releases, commit tags, and official build artifacts.
    • ingester.local: indexes CI outputs (your own builds), in-toto/DSSE attestations if available.
  2. Fingerprint Store (offline-ready)

    • Primary DB: PostgreSQL (authoritative).
    • Accelerator: Valkey (ephemeral) for fast lookup by BuildID and section hash prefixes.
    • Bundle Export: signed, chunked SQLite/Parquet packs for air-gapped sites.
  3. Trust Engine

    • Scores (0–100) per binary instance using:

      • Provenance weight (Upstream signed > Distro signed > Local unsigned).
      • Attestation presence/quality (in-toto/DSSE, reproducible build stamp).
      • Patch alignment vs Golden Set (reference fingerprints for “fixed” and “vulnerable” builds).
      • Hardening baseline (RELRO/PIE/SSP/CET/CFI).
      • Divergence penalty (unexpected section deltas vs vendor-declared patch).
    • Emits Verdict: Patched, Likely Patched (Backport), Unpatched, Unknown, with rationale.

  4. Query APIs

    • /lookup/by-buildid/{id}
    • /lookup/by-hash/{algo}/{prefix}
    • /classify (batch): accepts an SBOM file list or live filesystem scan.
    • /explain/{fingerprint}: returns diff vs Golden Set and the proof trail.

Data model (tables you can lift into Postgres)

  • artifact (artifact_id PK, file_sha256, size, mime, elf_machine, pe_machine, ts, signers[])
  • fingerprint (fp_id PK, artifact_id, build_id, text_hash, rodata_hash, sym_sketch, compiler_vendor, compiler_ver, lto, pie, relro, ssp, cfi, cet, flags jsonb)
  • provenance (prov_id PK, fp_id, origin ENUM('upstream','distro','local'), vendor, distro, release, package, version, source_commit, patchset jsonb, attestation_hash, attestation_quality_score)
  • golden_set (golden_id PK, package, cve, status ENUM('fixed','vulnerable'), fp_ref, method ENUM('vendor-advisory','diff-sig','function-patch'), notes)
  • trust_score (fp_id, score int, verdict, reasons jsonb, computed_at)

Indexes: (build_id), (text_hash), (rodata_hash), (package, version), GIN on patchset, reasons.
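To make the table relationships concrete, here is a cut-down sketch of the fingerprint and golden_set tables with the BuildID lookup path. It uses an in-memory SQLite database purely as a stand-in so the example is self-contained; the authoritative store described above is PostgreSQL, and the column subset, index names, and helper function names are assumptions.

```python
import sqlite3

# Subset of the columns listed above; SQLite types stand in for Postgres ones.
SCHEMA = """
CREATE TABLE fingerprint (
    fp_id       INTEGER PRIMARY KEY,
    build_id    TEXT,
    text_hash   TEXT,
    rodata_hash TEXT
);
CREATE TABLE golden_set (
    golden_id INTEGER PRIMARY KEY,
    package   TEXT,
    cve       TEXT,
    status    TEXT CHECK (status IN ('fixed', 'vulnerable')),
    fp_ref    INTEGER REFERENCES fingerprint(fp_id),
    method    TEXT
);
CREATE INDEX idx_fp_build_id  ON fingerprint(build_id);
CREATE INDEX idx_fp_text_hash ON fingerprint(text_hash);
"""


def open_store():
    """Create an empty in-memory store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def lookup_by_build_id(conn, build_id):
    """Return Golden Set rows whose reference fingerprint has this BuildID."""
    rows = conn.execute(
        "SELECT g.package, g.cve, g.status, g.method "
        "FROM fingerprint f JOIN golden_set g ON g.fp_ref = f.fp_id "
        "WHERE f.build_id = ?",
        (build_id,),
    )
    return rows.fetchall()
```

The same join is what a /lookup/by-buildid/{id} handler would run; in Postgres the status column would be the ENUM shown in the table list rather than a CHECK constraint.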


How detection works (fast path)

  1. Exact match: BuildID hit → join golden_set → return verdict + reason.

  2. Near match (backport mode): no BuildID match → compare .text/.rodata and function-range hashes against the “fixed” and “vulnerable” Golden Set entries:

    • If patched function ranges match, mark Likely Patched (Backport).
    • If vulnerable function ranges match, mark Unpatched.
  3. Heuristic fallback: symbol sketch + compiler metadata + hardening flags narrow the candidate set; compute targeted function hashes only (don't hash the whole file).
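The first two steps of the fast path can be sketched as a small classifier. Everything here is illustrative: the `GoldenEntry` and `Candidate` types are hypothetical, function-range hashes are modelled as plain string sets, and mapping an exact BuildID hit to Patched or Unpatched from the Golden Set status is an assumption about what "return verdict" means. The heuristic fallback (step 3) is left out.

```python
from dataclasses import dataclass


@dataclass
class GoldenEntry:
    status: str                          # "fixed" or "vulnerable"
    build_id: str = ""
    func_hashes: frozenset = frozenset() # hashes of the patched/vulnerable ranges


@dataclass
class Candidate:
    build_id: str = ""
    func_hashes: frozenset = frozenset() # hashes computed from the scanned binary


def classify(candidate, golden):
    """Return (verdict, reason) following the exact-match then near-match order."""
    # 1. Exact match: a BuildID hit decides immediately.
    for g in golden:
        if candidate.build_id and candidate.build_id == g.build_id:
            verdict = "Patched" if g.status == "fixed" else "Unpatched"
            return verdict, f"exact BuildID match ({g.status})"

    # 2. Near match: every function-range hash of a Golden Set entry must be present.
    for status, verdict in (
        ("fixed", "Likely Patched (Backport)"),
        ("vulnerable", "Unpatched"),
    ):
        for g in golden:
            if (
                g.status == status
                and g.func_hashes
                and g.func_hashes <= candidate.func_hashes
            ):
                return verdict, f"function ranges match {status} entry"

    return "Unknown", "no Golden Set match"
```

Checking "fixed" before "vulnerable" mirrors the order of the bullets above; a more conservative deployment might reverse it so that any vulnerable match wins.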


Building the “Golden Set”

  • Sources:

    • Vendor advisories (per-CVE “fixed in” builds).
    • Upstream tags containing the fix commit.
    • Distro SRPM diffs for backports (extract exact hunk regions; compute function-range hashes pre/post).
  • Store both:

    • “Fixed” fingerprints (post-patch).
    • “Vulnerable” fingerprints (pre-patch).
  • Annotate evidence method:

    • vendor-advisory (strong), diff-sig (strong if clean hunk), function-patch (targeted).

Trust scoring (example)

  • Base by provenance:

    • Upstream + signed + reproducible: +40
    • Distro signed with changelog & SRPM diff: +30
    • Local unsigned: +10
  • Attestations:

    • Valid DSSE + in-toto chain: +20
    • Reproducible build proof: +10
  • Golden Set alignment:

    • Matches “fixed”: +20
    • Matches “vulnerable”: −40
    • Partial (patched functions match, rest differs): +10
  • Hardening:

    • PIE/RELRO/SSP/CET/CFI each +2 (cap +10)
  • Divergence penalties:

    • Unexplained text-section drift: −10
    • Suspicious toolchain fingerprint: −5

Verdict bands: ≥80 Patched, 65–79 Likely Patched (Backport), 35–64 Unknown, <35 Unpatched.
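As a worked example of the arithmetic, the weights and bands can be wired together as below. This is a sketch under assumptions: the "vulnerable" and divergence values are treated as penalties (negative), the result is clamped to 0–100, and the function names and string keys are invented for illustration.

```python
PROVENANCE = {"upstream": 40, "distro": 30, "local": 10}
ATTESTATION = {"dsse_intoto": 20, "reproducible": 10}
GOLDEN = {"fixed": 20, "vulnerable": -40, "partial": 10, None: 0}
PENALTY = {"text_drift": -10, "toolchain": -5}


def trust_score(provenance, attestations=(), golden=None, hardening=(), penalties=()):
    """Sum the example weights; hardening is +2 per flag, capped at +10."""
    score = PROVENANCE[provenance]
    score += sum(ATTESTATION[a] for a in attestations)
    score += GOLDEN[golden]
    score += min(10, 2 * len(set(hardening)))
    score += sum(PENALTY[p] for p in penalties)
    return max(0, min(100, score))  # clamping is an assumption, not from the weights


def verdict(score):
    """Map a score onto the verdict bands."""
    if score >= 80:
        return "Patched"
    if score >= 65:
        return "Likely Patched (Backport)"
    if score >= 35:
        return "Unknown"
    return "Unpatched"
```

For instance, a distro-signed build (+30) with a valid DSSE + in-toto chain (+20) that matches a "fixed" fingerprint (+20) scores 70 and lands in Likely Patched (Backport); adding all five hardening flags (+10) pushes it to 80 and Patched.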


CLI outline (StellaOps-style)

# Index a filesystem or package repo
stella-fp index /usr/bin /lib --out fp.db --bundle out.bundle.parquet

# Score a host (offline)
stella-fp classify --fp-store fp.db --golden golden.db --out verdicts.json

# Explain a result
stella-fp explain --fp <fp_id> --golden golden.db

# Maintain Golden Set
stella-fp golden add --package openssl --cve CVE-2023-XXXX --status fixed --from-srpm path.src.rpm
stella-fp golden add --package openssl --cve CVE-2023-XXXX --status vulnerable --from-upstream v1.1.1k

Implementation notes (ELF/PE)

  • ELF: read BuildID from .note.gnu.build-id; hash .text and selected function ranges (use DWARF/eh_frame or symbol table when present; otherwise lightweight linear-sweep with sanity checks). Record RELRO/PIE from program headers.
  • PE: use Debug Directory (GUID/age) and Section Table; capture CFG/ASLR/NX/GS flags.
  • Function-range hashing: normalize NOPs/padding, zero relocation slots, mask address-relative operands (keeps hashes stable across vendor rebuilds).
  • Performance: cache per-section hash; only compute function hashes when near-match needs confirmation.
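The function-range hashing idea can be sketched in a few lines. This is a simplified illustration, not a full normalizer: the function name is hypothetical, relocation slots are passed in as (offset, size) pairs that a real tool would read from the relocation tables, only trailing x86 padding bytes (NOP 0x90 and INT3 0xCC) are stripped, and masking address-relative operands is omitted because it needs a disassembler.

```python
import hashlib

# Single-byte x86 padding commonly found between functions (assumption:
# multi-byte NOP forms are not handled in this sketch).
X86_PADDING = {0x90, 0xCC}


def normalized_function_hash(code, reloc_slots=()):
    """SHA-256 of a function's bytes with relocation slots zeroed and tail padding removed."""
    buf = bytearray(code)

    # Zero every relocation slot so link-time addresses do not affect the hash.
    for offset, size in reloc_slots:
        if offset < 0 or offset + size > len(buf):
            raise ValueError("relocation slot outside function range")
        buf[offset:offset + size] = bytes(size)

    # Drop trailing padding so alignment differences between builds do not matter.
    end = len(buf)
    while end > 0 and buf[end - 1] in X86_PADDING:
        end -= 1

    return hashlib.sha256(bytes(buf[:end])).hexdigest()
```

Two rebuilds of the same function that differ only in a call target and in alignment padding then hash identically, while any change to the instruction bytes themselves still produces a different hash. Stripping by byte value can in principle eat a genuine trailing 0x90 or 0xCC operand byte, which is one reason a production version would work from decoded instructions.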

How this plugs into your world

  • Sbomer/Vexer: attach trust scores & verdicts to components in CycloneDX/SPDX; emit VEX statements like “Fixed by backport: evidence=diff-sig, source=Astra/RedHat SRPM.”
  • Feedser: when CVE feed says “vulnerable by version,” override with binary proof from Golden Set.
  • Policy Engine: gate deployments on verdict ∈ {Patched, Likely Patched} OR score ≥ 65.

Next steps you can action today

  1. Create the schemas above in Postgres; scaffold a small stella-fp Go/.NET tool to compute fingerprints for /bin, /lib* on a reference host per distro (e.g., Debian and Alpine).
  2. Hand-curate a pilot Golden Set for 3 noisy CVEs (OpenSSL, glibc, curl). Store both pre/post patch fingerprints and 2–3 backported vendor builds each.
  3. Wire a classify step into your CI/CD and surface the verdict + rationale in your VEX output.

If you want, I can drop in starter code (C#/.NET 10) for the fingerprint extractor and the Postgres schema migration, plus a tiny “function-range hasher” that masks relocations and normalizes padding.