Add reference architecture and testing strategy documentation

- Created a new document for the Stella Ops Reference Architecture outlining the system's topology, trust boundaries, artifact association, and interfaces.
- Developed a comprehensive Testing Strategy document detailing the importance of offline readiness, interoperability, determinism, and operational guardrails.
- Introduced a README for the Testing Strategy, summarizing processing details and key concepts implemented.
- Added guidance for AI agents and developers in the tests directory, including directory structure, test categories, key patterns, and rules for test development.
2025-12-22 07:59:15 +02:00
parent 5d398ec442
commit 53503cb407
96 changed files with 37565 additions and 71 deletions
--- a/docs/product-advisories/archived/2025-12-22-binaryindex/21-Dec-2025
+++ b/docs/product-advisories/archived/2025-12-22-binaryindex/21-Dec-2025
@@ -0,0 +1,783 @@
+Below is a practical, production-grade architecture for building a **vulnerable binaries database**. I’m going to be explicit about what “such a database” can mean, because there are two materially different products:
+
+1. **Known-build catalog**: “These exact shipped binaries (Build-ID / hash) are affected or fixed for CVE X.”
+2. **Binary fingerprint DB**: “Even if the binary is unpackaged / self-built, we can match vulnerable code patterns.”
+
+You want both. The first gets you breadth fast; the second is the moat.
+
+---
+
+## 1) Core principle: treat “binary identity” as the primary key
+
+For Linux ELF:
+
+* Primary: `ELF Build-ID` (from `.note.gnu.build-id`)
+* Fallback: `sha256(file_bytes)`
+* Add: `sha256(.text)` and/or BLAKE3 for speed
+
+This creates a stable identity that survives “package metadata lies.”
+
+**BinaryKey = build_id if present, otherwise file_sha256**
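
A minimal sketch of the key derivation, assuming the caller has already located the raw bytes of the `.note.gnu.build-id` section (for example with an ELF library) and that the file is little-endian unless stated otherwise. The function names and the `build-id:` / `sha256:` prefixes are illustrative and not part of the design above.

```python
import hashlib
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # ELF note type that GNU toolchains use for the Build-ID


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (note fields are 4-byte aligned)."""
    return (n + 3) & ~3


def parse_build_id(note_bytes: bytes, little_endian: bool = True) -> Optional[str]:
    """Walk the notes in a note section and return the Build-ID as hex, if any."""
    fmt = "<III" if little_endian else ">III"
    off = 0
    while off + 12 <= len(note_bytes):
        namesz, descsz, ntype = struct.unpack_from(fmt, note_bytes, off)
        off += 12
        name = note_bytes[off:off + namesz]
        off += _align4(namesz)
        desc = note_bytes[off:off + descsz]
        off += _align4(descsz)
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00" and len(desc) == descsz:
            return desc.hex()
    return None


def binary_key(file_bytes: bytes, note_bytes: Optional[bytes] = None) -> str:
    """Prefer the Build-ID; fall back to the SHA-256 of the whole file."""
    build_id = parse_build_id(note_bytes) if note_bytes else None
    if build_id:
        return f"build-id:{build_id}"  # prefix is an assumption of this sketch
    return f"sha256:{hashlib.sha256(file_bytes).hexdigest()}"
```

Prefixing the key with its kind keeps the two identity spaces from colliding and lets a lookup tell which tier produced the match.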
+
+---
+
+## 2) High-level system diagram
+
+```
+            ┌──────────────────────────┐
+            │ Vulnerability Intel      │
+            │ OSV/NVD + distro advis.  │
+            └───────────┬──────────────┘
+                        │ normalize
+                        v
+            ┌──────────────────────────┐
+            │ Vuln Knowledge Store     │
+            │ CVE↔pkg ranges, patches  │
+            └───────────┬──────────────┘
+                        │
+                        │
+┌───────────────────────v─────────────────────────┐
+│ Repo Snapshotter (per distro/arch/date)          │
+│ - mirrors metadata + packages (+ debuginfo)      │
+│ - verifies signatures                            │
+│ - emits signed snapshot manifest                 │
+└───────────┬───────────────────────────┬─────────┘
+            │                           │
+            │ packages                  │ debuginfo/sources
+            v                           v
+┌──────────────────────────┐   ┌──────────────────────────┐
+│ Package Unpacker          │   │ Source/Buildinfo Mapper   │
+│ - extract files           │   │ - pkg→source commit/patch │
+└───────────┬──────────────┘   └───────────┬──────────────┘
+            │ binaries                      │
+            v                               │
+┌──────────────────────────┐               │
+│ Binary Feature Extractor  │               │
+│ - Build-ID, hashes         │               │
+│ - dyn deps, symbols        │               │
+│ - function boundaries (opt)│               │
+└───────────┬──────────────┘               │
+            │                               │
+            v                               v
+┌──────────────────────────────────────────────────┐
+│ Vulnerable Binary Classifier                      │
+│ Tier A: pkg/version range                         │
+│ Tier B: Build-ID→known shipped build              │
+│ Tier C: code fingerprints (function/CFG hashes)   │
+└───────────┬───────────────────────────┬──────────┘
+            │                           │
+            v                           v
+┌──────────────────────────┐   ┌──────────────────────────┐
+│ Vulnerable Binary DB      │   │ Evidence/Attestation DB   │
+│ (indexed by BinaryKey)    │   │ (signed proofs, snapshots)│
+└───────────┬──────────────┘   └───────────┬──────────────┘
+            │ publish signed snapshot       │
+            v                               v
+        Clients/Scanners             Explainable VEX outputs
+```
+
+---
+
+## 3) Data stores you actually need
+
+### A) Relational store (Postgres)
+
+Use this for *indexes and joins*.
+
+Key tables:
+
+**`binary_identity`**
+
+* `binary_key` (build_id or file_sha256) PK
+* `build_id` (nullable)
+* `file_sha256`, `text_sha256`
+* `arch`, `osabi`, `type` (`ET_DYN`/`ET_EXEC`), `stripped`
+* `first_seen_snapshot`, `last_seen_snapshot`
+
+**`binary_package_map`**
+
+* `binary_key`
+* `distro`, `pkg_name`, `pkg_version_release`, `arch`
+* `file_path_in_pkg`, `snapshot_id`
+
+**`snapshot_manifest`**
+
+* `snapshot_id`
+* `distro`, `arch`, `timestamp`
+* `repo_metadata_digests`, `signing_key_id`, `dsse_envelope_ref`
+
+**`cve_package_ranges`**
+
+* `cve_id`, `ecosystem` (deb/rpm/apk), `pkg_name`
+* `vulnerable_ranges`, `fixed_ranges`
+* `advisory_ref`, `snapshot_id`
+
+**`binary_vuln_assertion`**
+
+* `binary_key`, `cve_id`
+* `status` ∈ {affected, not_affected, fixed, unknown}
+* `method` ∈ {range_match, buildid_catalog, fingerprint_match}
+* `confidence` (0–1)
+* `evidence_ref` (points to signed evidence)
+
+### B) Object store (S3/MinIO)
+
+Do not bloat Postgres with large blobs.
+
+Store:
+
+* extracted symbol lists, string tables
+* function hash maps
+* disassembly snippets for matched functions (small)
+* DSSE envelopes / attestations
+* optional: debug info extracts (or references to where they can be fetched)
+
+### C) Optional search index (OpenSearch/Elastic)
+
+If you want fast “find all binaries exporting `SSL_read`” style queries, index symbols/strings.
+
+---
+
+## 4) Building the database: pipelines
+
+### Pipeline 1: Distro repo snapshots → Known-build catalog (breadth)
+
+This is your fastest route to a “binaries DB.”
+
+**Step 1 — Snapshot**
+
+* Mirror repo metadata + packages for (distro, release, arch).
+* Verify signatures (APT Release.gpg, RPM signatures, APK signatures).
+* Emit **signed snapshot manifest** (DSSE) listing digests of everything mirrored.
+
+**Step 2 — Extract binaries**
+For each package:
+
+* unpack (deb/rpm/apk)
+* select ELF files (EXEC + shared libs)
+* compute Build-ID, file hash, `.text` hash
+* store identity + `binary_package_map`
+
+**Step 3 — Assign CVE status (Tier A + Tier B)**
+
+* Ingest distro advisories and/or OSV mappings into `cve_package_ranges`
+* For each `binary_package_map`, apply range checks
+* Create `binary_vuln_assertion` entries:
+
+  * `method=range_match` (coarse)
+* If you have a Build-ID mapping to exact shipped builds, you can tag:
+
+  * `method=buildid_catalog` (stronger than pure version)
+
+This yields a database where a scanner can do:
+
+* “Given Build-ID, tell me all CVEs per the distro snapshot.”
+
+This already reduces noise because the primary key is the **binary**.
+
+---
+
+### Pipeline 2: Patch-aware classification (backports handled)
+
+To handle “version says vulnerable but backport fixed” you must incorporate patch provenance.
+
+**Step 1 — Build provenance mapping**
+Per ecosystem:
+
+* Debian/Ubuntu: parse `Sources`, changelogs, (ideally) `.buildinfo`, patch series.
+* RPM distros: SRPM + changelog + patch list.
+* Alpine: APKBUILD + patches.
+
+**Step 2 — CVE ↔ patch linkage**
+From advisories and patch metadata, store:
+
+* “CVE fixed by patch set P in build B of pkg V-R”
+
+**Step 3 — Apply to binaries**
+Instead of version-only, decide:
+
+* if the **specific build** includes the patch
+* mark as `fixed` even if upstream version looks vulnerable
+
+This is still not “binary-only,” but it’s much closer to truth for distros.
+
+---
+
+### Pipeline 3: Binary fingerprint factory (the moat)
+
+This is where you become independent of packaging claims.
+
+You build fingerprints at the **function/CFG level** for high-impact CVEs.
+
+#### 3.1 Select targets
+
+You cannot fingerprint everything. Start with:
+
+* top shared libs (openssl, glibc, zlib, expat, libxml2, curl, sqlite, ncurses, etc.)
+* CVEs that are exploited in the wild / high-severity
+* CVEs where distros backport heavily (version logic is unreliable)
+
+#### 3.2 Identify “changed functions” from the fix
+
+Input: upstream commit/patch or distro patch.
+
+Process:
+
+* diff the patch
+* extract affected files + functions (tree-sitter/ctags + diff hunks)
+* list candidate functions and key basic blocks
+
+#### 3.3 Build vulnerable + fixed reference binaries
+
+For each (arch, toolchain profile):
+
+* compile “known vulnerable” and “known fixed”
+* ensure reproducibility: record compiler version, flags, link mode
+* store provenance (DSSE) for these reference builds
+
+#### 3.4 Extract robust fingerprints
+
+Avoid raw byte signatures (they break across compilers).
+
+Better fingerprint types, from weakest to strongest:
+
+* **symbol-level**: function name + versioned symbol + library SONAME
+* **function normalized hash**:
+
+  * disassemble function
+  * normalize:
+
+    * strip addresses/relocs
+    * bucket registers
+    * normalize immediates (where safe)
+  * hash instruction sequence or basic-block sequence
+* **basic-block multiset hash**:
+
+  * build a set/multiset of block hashes; order-independent
+* **lightweight CFG hash**:
+
+  * nodes: block hashes
+  * edges: control flow
+  * hash canonical representation
+
+Store fingerprints like:
+
+**`vuln_fingerprint`**
+
+* `cve_id`
+* `component` (openssl/libssl)
+* `arch`
+* `fp_type` (func_norm_hash, bb_multiset, cfg_hash)
+* `fp_value`
+* `function_hint` (name if present; else pattern)
+* `confidence`, `notes`
+* `evidence_ref` (points to reference builds + patch)
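
A rough sketch of the normalized block hash and the order-independent basic-block multiset hash described above. It assumes each basic block arrives as a list of textual instructions from whatever disassembler is in use; the `IMM` placeholder and the function names are made up for illustration, and register bucketing is left out because it is architecture-specific.

```python
import hashlib
import re
from collections import Counter

# Hex literals cover absolute addresses and immediates in most textual disassembly
_HEX = re.compile(r"0x[0-9a-fA-F]+")


def normalize_insn(insn: str) -> str:
    """Lowercase, collapse whitespace, and replace hex literals with a placeholder."""
    return _HEX.sub("IMM", " ".join(insn.lower().split()))


def block_hash(block: list[str]) -> str:
    """Hash one basic block after normalizing every instruction in it."""
    normalized = "\n".join(normalize_insn(i) for i in block)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def bb_multiset_hash(blocks: list[list[str]]) -> str:
    """Order-independent hash: count block hashes, then hash the sorted counts."""
    counts = Counter(block_hash(b) for b in blocks)
    canonical = "\n".join(f"{h}:{n}" for h, n in sorted(counts.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Blanket immediate normalization can erase a fix that only changes a constant (a bounds check, for instance), which is why the list above says "where safe"; a real extractor would need a per-CVE allowlist of immediates to keep.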
+
+#### 3.5 Validate fingerprints at scale
+
+This is non-negotiable.
+
+Validation loop:
+
+* Test against:
+
+  * known vulnerable builds (must match)
+  * known fixed builds (must not match)
+  * large “benign corpus” (estimate false positives)
+* Maintain:
+
+  * precision/recall metrics per fingerprint
+  * confidence score
+
+Only promote fingerprints to “production” when validation passes thresholds.
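
The validation loop can be sketched as follows, assuming a fingerprint is exposed as a predicate over binary keys. The threshold values and function names are placeholders for this example, not recommended settings.

```python
from typing import Callable, Iterable


def validate_fingerprint(matches: Callable[[str], bool],
                         vulnerable: Iterable[str],
                         fixed: Iterable[str],
                         benign: Iterable[str]) -> dict:
    """Score one fingerprint against the three corpora described above."""
    vulnerable, fixed, benign = list(vulnerable), list(fixed), list(benign)
    tp = sum(matches(b) for b in vulnerable)           # must match
    fp = sum(matches(b) for b in fixed) + sum(matches(b) for b in benign)
    fn = len(vulnerable) - tp                          # vulnerable builds missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "false_positives": fp}


def promote(metrics: dict, min_precision: float = 0.99, min_recall: float = 0.90) -> bool:
    """Gate promotion to production on both thresholds (values are assumptions)."""
    return metrics["precision"] >= min_precision and metrics["recall"] >= min_recall
```

Counting hits on known fixed builds as false positives makes a fingerprint that cannot tell vulnerable from fixed code fail the gate, even if it never fires on the benign corpus.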
+
+---
+
+## 5) Query-time logic (how scanners use the DB)
+
+Given a target binary, the scanner computes:
+
+* `binary_key`
+* basic features (arch, SONAME, symbols)
+* optional function hashes (for targeted libs)
+
+Then it queries in this precedence order:
+
+1. **Exact match**: `binary_key` exists with explicit assertion (strong)
+2. **Build catalog**: Build-ID→known distro build→CVE mapping (strong)
+3. **Fingerprint match**: function/CFG hashes hit (strong, binary-only)
+4. **Fallback**: package range matching (weakest)
+
+Return result as a signed VEX with evidence references.
+
+---
+
+## 6) Update model: “sealed knowledge snapshots”
+
+To make this auditable and customer-friendly:
+
+* Every repo snapshot is immutable and signed.
+* Every fingerprint bundle is versioned and signed.
+* Every “vulnerable binaries DB release” is a signed manifest pointing to:
+
+  * which repo snapshots were used
+  * which advisory snapshots were used
+  * which fingerprint sets were included
+
+This lets you prove:
+
+* what you knew
+* when you knew it
+* exactly which data drove the verdict
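
One way to make a DB release reproducible is to serialize its manifest canonically and sign the DSSE pre-authentication encoding of those bytes. The sketch below stops short of signing, since the key lives in an HSM/KMS; the manifest field names and helper names are assumptions, while the PAE layout follows the DSSE specification.

```python
import hashlib
import json


def release_manifest(repo_snapshots, advisory_snapshots, fingerprint_sets) -> dict:
    """List the inputs of a DB release; sorting makes the manifest order-independent."""
    return {
        "repo_snapshots": sorted(repo_snapshots),
        "advisory_snapshots": sorted(advisory_snapshots),
        "fingerprint_sets": sorted(fingerprint_sets),
    }


def canonical_json(obj: dict) -> bytes:
    """Stable byte form: sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over the canonical bytes, usable as the release identifier."""
    return hashlib.sha256(canonical_json(manifest)).hexdigest()


def dsse_pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)
```

Signing `dsse_pae(...)` rather than the raw payload binds the payload type into the signature, so a manifest cannot be replayed as a different kind of document.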
+
+---
+
+## 7) Scaling and cost control
+
+Without control, fingerprinting explodes. Use these constraints:
+
+* Only disassemble/hash functions for:
+
+  * libraries in your “hot set”
+  * binaries whose package indicates relevance to a targeted CVE family
+* Deduplicate aggressively:
+
+  * identical `text_sha256` ⇒ reuse extracted functions
+  * identical Build-ID across paths ⇒ reuse features
+* Incremental snapshots:
+
+  * process only new/changed packages per snapshot
+  * store “already processed digest” cache (Valkey)
+
+---
+
+## 8) Security and trust boundaries
+
+A vulnerable binary DB is itself a high-value target. Hardening must be part of architecture:
+
+* Verify upstream repo signatures before ingestion.
+* Run unpacking/extraction in sandboxes (namespaces/seccomp) because packages can be hostile.
+* Sign:
+
+  * snapshot manifests
+  * fingerprint sets
+  * DB releases
+* Keep signing keys in an HSM/KMS.
+* Maintain provenance chain: input digests → output digests.
+
+---
+
+## 9) Minimal viable build plan (no wishful thinking)
+
+### MVP 1: Known-build binary catalog (fast, broad)
+
+* Repo snapshot + Build-ID extraction
+* CVE ranges + advisory snapshots
+* `binary_key → (pkg@ver, CVEs)` mapping
+* Signed snapshot releases
+
+### MVP 2: Patch-aware backport handling (precision jump)
+
+* Source/patch mapping
+* Build-specific “fixed-by-backport” logic
+
+### MVP 3: Fingerprints for top components (moat)
+
+* Fingerprint factory for ~20–50 “most reused” libs
+* Validation corpus + confidence scoring
+
+---
+
+## 10) What you can claim (and what you should not)
+
+You can defensibly claim:
+
+* “We can identify vulnerable **binaries** (not just packages) with verifiable evidence.”
+
+You should not claim (until you have it):
+
+* “We can detect every vulnerable binary on earth.”
+
+No one can do that without unacceptable false positives/negatives.
+
+---
+
+If you want this aligned tightly to your Stella Ops services split (scanner webservice lattice algorithms + proof-of-integrity graph), I can map the above into:
+
+* concrete microservices,
+* queue/job boundaries,
+* and the exact API surfaces (`/snapshots`, `/features`, `/fingerprints`, `/match`, `/vex`).
+
+---
+
+To code **MVP 2 (patch-aware backport handling)** without any human triage, you need one thing in your database that most scanners do not maintain:
+
+> A normalized, distro-specific index:
+> **(distro, release, source-package, CVE) → fixed-in distro version (or not-affected / wontfix / unknown), with evidence and confidence.**
+
+Then every scan decision becomes a deterministic version comparison against that index, not against upstream semver.
+
+Below is a concrete, buildable approach (pipelines + data model + code skeletons) that stays fully automated.
+
+---
+
+## 1) What MVP2 computes
+
+### Output table you must build
+
+**`cve_fix_index`**
+
+* `distro` (e.g., debian, ubuntu, rhel, alpine)
+* `release` (e.g., bookworm, jammy, 9, 3.19)
+* `source_pkg` (not binary subpackage)
+* `cve_id`
+* `state` ∈ {`fixed`, `vulnerable`, `not_affected`, `wontfix`, `unknown`}
+* `fixed_version` (nullable; distro version string, including revision)
+* `method` ∈ {`security_feed`, `changelog`, `patch_header`, `upstream_patch_match`}
+* `confidence` (float)
+* `evidence` (JSON: references to advisory entry, changelog lines, patch names + digests)
+* `snapshot_id` (your sealed snapshot identifier)
+
+### Why “source package”?
+
+Security trackers and patch sets are tracked at the **source** level (e.g., `openssl`), while runtime installs are often **binary subpackages** (e.g., `libssl3`). You need a stable join:
+`binary_pkg -> source_pkg`.
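
For Debian-style repositories this join can be read straight from the `Packages` index: a stanza's `Source:` field names the source package (optionally followed by a version in parentheses), and when the field is absent the source name equals the binary package name. A minimal sketch, with a hypothetical function name and no handling of compressed indexes:

```python
import re

# "Source: zlib" or "Source: zlib (1:1.2.13.dfsg-1)"
_SOURCE_RE = re.compile(r"^(?P<name>\S+)(?:\s+\((?P<version>[^)]+)\))?$")


def binary_to_source_map(packages_text: str) -> dict[str, str]:
    """Map each binary package in a Packages index to its source package name."""
    mapping: dict[str, str] = {}
    for stanza in re.split(r"\n\s*\n", packages_text.strip()):
        fields: dict[str, str] = {}
        for line in stanza.splitlines():
            # Lines starting with whitespace continue the previous field
            if line[:1] in (" ", "\t") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        pkg = fields.get("Package")
        if not pkg:
            continue
        m = _SOURCE_RE.match(fields.get("Source", pkg))
        mapping[pkg] = m.group("name") if m else pkg
    return mapping
```

RPM and Alpine expose the same relationship through different metadata (the source RPM name and the aport origin), so each ecosystem would need its own small parser feeding the same map.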
+
+---
+
+## 2) No-human signals, in strict priority order
+
+You can do this with **zero manual work** by using a tiered resolver:
+
+### Tier 1 — Structured distro security feed (highest precision)
+
+This is the authoritative “backport-aware” answer because it encodes:
+
+* “fixed in 1.1.1n-0ubuntu2.4” (even if upstream says “fixed in 1.1.1o”)
+* “not affected” cases
+* sometimes arch-specific applicability
+
+Your ingestor just parses and normalizes it.
+
+### Tier 2 — Source package changelog CVE mentions
+
+If a feed entry is missing/late, parse source changelog:
+
+* Debian/Ubuntu: `debian/changelog`
+* RPM: `%changelog` in `.spec`
+* Alpine: `secfixes` in `APKBUILD` (often present)
+
+This is surprisingly effective because maintainers often include “CVE-XXXX-YYYY” in the entry that introduced the fix.
+
+### Tier 3 — Patch metadata (DEP-3 headers / patch filenames)
+
+Parse patches shipped with the source package:
+
+* Debian: `debian/patches/*` + `debian/patches/series`
+* RPM: patch files listed in spec / SRPM
+* Alpine: `patches/*.patch` in the aport
+
+Search patch headers and filenames for CVE IDs, store patch hashes.
+
+### Tier 4 — Upstream patch equivalence (optional in MVP2, strong)
+
+If you can map CVE→upstream fix commit (OSV often helps), you can match canonicalized patch hunks against distro patches.
+
+MVP2 can ship without Tier 4; Tier 1+2 already eliminates most backport false positives.
+
+---
+
+## 3) Architecture: the “Fix Index Builder” job
+
+### Inputs
+
+* Your sealed repo snapshot: Packages + Sources (or SRPM/aports)
+* Distro security feed snapshot (OVAL/JSON/errata tracker) for same release
+* (Optional) OSV/NVD upstream ranges for fallback only
+
+### Processing graph
+
+1. **Build `binary_pkg → source_pkg` map** from repo metadata
+2. **Ingest security feed** → produce `FixRecord(method=security_feed, confidence=0.95)`
+3. **For source packages in snapshot**:
+
+   * unpack source
+   * parse changelog for CVE mentions → `FixRecord(method=changelog, confidence=0.75–0.85)`
+   * parse patch headers → `FixRecord(method=patch_header, confidence=0.80–0.90)`
+4. **Merge** records into a single best record per key (distro, release, source_pkg, cve)
+5. Store into `cve_fix_index` with evidence
+6. Sign the resulting snapshot manifest
+
+---
+
+## 4) Merge logic (no human, deterministic)
+
+You need a deterministic rule for conflicts.
+
+Recommended (conservative but still precision-improving):
+
+1. If any record says `not_affected` with confidence ≥ 0.9 → choose `not_affected`
+2. Else if any record says `fixed` with confidence ≥ 0.9 → choose `fixed` and `fixed_version = max_fixed_version_among_high_conf`
+3. Else if any record says `fixed` at all → choose `fixed` with best available `fixed_version`
+4. Else if any says `wontfix` → choose `wontfix`
+5. Else `unknown`
+
+Additionally:
+
+* Keep *all* evidence records in `evidence` so you can explain and audit.
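
The rules above could be implemented roughly as follows. This is a sketch under two assumptions: records are dicts shaped like the `FixRecord` fields, and "best available" in rule 3 means the `fixed_version` of the highest-confidence record. `vercmp` stands for the distro-specific comparator and is supplied by the caller.

```python
from functools import cmp_to_key
from typing import Callable, Optional

HIGH_CONF = 0.9  # threshold used by rules 1 and 2


def merge_fix_records(records: list[dict],
                      vercmp: Callable[[str, str], int]) -> dict:
    """Collapse all records for one (distro, release, source_pkg, cve) key."""
    evidence = [r.get("evidence") for r in records]  # keep everything for audit

    def result(state: str, fixed_version: Optional[str] = None) -> dict:
        return {"state": state, "fixed_version": fixed_version, "evidence": evidence}

    # Rule 1: a high-confidence not_affected wins outright
    if any(r["state"] == "not_affected" and r["confidence"] >= HIGH_CONF for r in records):
        return result("not_affected")

    fixed = [r for r in records if r["state"] == "fixed"]
    high = [r for r in fixed if r["confidence"] >= HIGH_CONF]
    if high:
        # Rule 2: highest fixed_version among high-confidence records
        versions = sorted((r["fixed_version"] for r in high if r.get("fixed_version")),
                          key=cmp_to_key(vercmp))
        return result("fixed", versions[-1] if versions else None)
    if fixed:
        # Rule 3: fall back to the most confident fixed record (an assumption)
        best = max(fixed, key=lambda r: r["confidence"])
        return result("fixed", best.get("fixed_version"))
    if any(r["state"] == "wontfix" for r in records):
        return result("wontfix")  # Rule 4
    return result("unknown")      # Rule 5
```

Because the function depends only on its inputs and a total-order comparator, rerunning it over the same append-only evidence always reproduces the same merged record.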
+
+---
+
+## 5) Version comparison: do not reinvent it
+
+Backport handling lives or dies on correct version ordering.
+
+### Practical approach (recommended for ingestion + server-side decisioning)
+
+Use official tooling in containerized workers:
+
+* Debian/Ubuntu: `dpkg --compare-versions`
+* RPM distros: `rpmdev-vercmp` or `rpm` library
+* Alpine: `apk version -t`
+
+This is reliable and avoids subtle comparator bugs.
+
+If you must do it in-process, use well-tested libraries per ecosystem (but containerized official tools are the most robust).
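
As an illustration for the Debian/Ubuntu case, a comparator can shell out to `dpkg --compare-versions`, which exits with status 0 when the stated relation holds. The wrapper name and the injectable `run` parameter (there only so the function can be exercised without `dpkg` installed) are assumptions of this sketch.

```python
import subprocess
from typing import Callable


def dpkg_vercmp(a: str, b: str, run: Callable = subprocess.run) -> int:
    """Return -1, 0, or 1 using dpkg's own version ordering.

    Requires the dpkg binary in the worker image when the default runner is used.
    """
    def holds(op: str) -> bool:
        # dpkg exits 0 when "a op b" is true and non-zero otherwise
        return run(["dpkg", "--compare-versions", a, op, b]).returncode == 0

    if holds("lt"):
        return -1
    if holds("gt"):
        return 1
    return 0
```

Each comparison spawns up to two processes, so a batch job would probably want to cache results per `(a, b)` pair or sort once per package rather than compare repeatedly.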
+
+---
+
+## 6) Concrete code: Debian/Ubuntu changelog + patch parsing
+
+This example shows **Tier 2 + Tier 3** inference for a single unpacked source tree. You would wrap this inside your snapshot processing loop.
+
+### 6.1 CVE extractor
+
+```python
+import re
+from pathlib import Path
+from hashlib import sha256
+
+CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")
+
+def extract_cves(text: str) -> set[str]:
+    return set(CVE_RE.findall(text or ""))
+```
+
+### 6.2 Parse the *top* debian/changelog entry (for this version)
+
+This works well because when you unpack a `.dsc` for version `V`, the top entry is for `V`.
+
+```python
+def parse_debian_changelog_top_entry(src_dir: Path) -> tuple[str, set[str], dict]:
+    """
+    Returns:
+      version: str
+      cves: set[str] found in the top entry
+      evidence: dict with excerpt for explainability
+    """
+    changelog_path = src_dir / "debian" / "changelog"
+    if not changelog_path.exists():
+        return "", set(), {}
+
+    # Debian changelogs are UTF-8; be explicit instead of relying on the locale
+    lines = changelog_path.read_text(encoding="utf-8", errors="replace").splitlines()
+    if not lines:
+        return "", set(), {}
+
+    # First line: "pkgname (version) distro; urgency=..."
+    m = re.match(r"^[^\s]+\s+\(([^)]+)\)\s+", lines[0])
+    version = m.group(1) if m else ""
+
+    entry_lines = [lines[0]]
+    # Collect until maintainer trailer line: " -- Name <email>  date"
+    for line in lines[1:]:
+        entry_lines.append(line)
+        if line.startswith(" -- "):
+            break
+
+    entry_text = "\n".join(entry_lines)
+    cves = extract_cves(entry_text)
+
+    evidence = {
+        "file": "debian/changelog",
+        "version": version,
+        "excerpt": entry_text[:2000],  # store small excerpt, not whole file
+    }
+    return version, cves, evidence
+```
+
+### 6.3 Parse CVEs from patch headers (DEP-3-ish)
+
+```python
+def parse_debian_patches_for_cves(src_dir: Path) -> tuple[dict[str, list[dict]], dict]:
+    """
+    Returns:
+      cve_to_patches: {CVE: [ {path, sha256, header_excerpt}, ... ]}
+      evidence_summary: dict
+    """
+    patches_dir = src_dir / "debian" / "patches"
+    if not patches_dir.exists():
+        return {}, {}
+
+    cve_to_patches: dict[str, list[dict]] = {}
+
+    # Sort so the per-CVE patch lists come out in a deterministic order
+    for patch in sorted(patches_dir.glob("*")):
+        if not patch.is_file():
+            continue
+        raw = patch.read_bytes()  # read once; reused for the digest below
+        # Inspect only the first N lines to keep it cheap
+        header = "\n".join(raw.decode("utf-8", errors="replace").splitlines()[:80])
+        cves = extract_cves(header + "\n" + patch.name)
+        if not cves:
+            continue
+
+        digest = sha256(raw).hexdigest()
+        rec = {
+            "path": str(patch.relative_to(src_dir)),
+            "sha256": digest,
+            "header_excerpt": header[:1200],
+        }
+        for cve in cves:
+            cve_to_patches.setdefault(cve, []).append(rec)
+
+    evidence = {
+        "dir": "debian/patches",
+        "matched_cves": len(cve_to_patches),
+    }
+    return cve_to_patches, evidence
+```
+
+### 6.4 Produce FixRecords from the source tree
+
+```python
+def infer_fix_records_from_debian_source(src_dir: Path, distro: str, release: str, source_pkg: str, snapshot_id: str):
+    version, changelog_cves, changelog_ev = parse_debian_changelog_top_entry(src_dir)
+    cve_to_patches, patch_ev = parse_debian_patches_for_cves(src_dir)
+
+    records = []
+
+    # Changelog-based: treat CVE mentioned in top entry as fixed in this version
+    for cve in changelog_cves:
+        records.append({
+            "distro": distro,
+            "release": release,
+            "source_pkg": source_pkg,
+            "cve_id": cve,
+            "state": "fixed",
+            "fixed_version": version,
+            "method": "changelog",
+            "confidence": 0.80,
+            "evidence": {"changelog": changelog_ev},
+            "snapshot_id": snapshot_id,
+        })
+
+    # Patch-header-based: a CVE-tagged patch is present in this version, so the fix
+    # landed in this version or an earlier one; `version` is an upper bound
+    for cve, patches in cve_to_patches.items():
+        records.append({
+            "distro": distro,
+            "release": release,
+            "source_pkg": source_pkg,
+            "cve_id": cve,
+            "state": "fixed",
+            "fixed_version": version,
+            "method": "patch_header",
+            "confidence": 0.87,
+            "evidence": {"patches": patches, "patch_summary": patch_ev},
+            "snapshot_id": snapshot_id,
+        })
+
+    return records
+```
+
+That is the automated “patch-aware” signal generator.
+
+---
+
+## 7) Wiring this into your database build
+
+### 7.1 Store raw evidence and merged result
+
+Two-stage storage is worth it:
+
+1. `cve_fix_evidence` (append-only)
+2. `cve_fix_index` (merged best record)
+
+So you can:
+
+* rerun merge rules
+* improve confidence scoring
+* keep auditability
+
+### 7.2 Merging “fixed_version” for a CVE
+
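+When the changelog entries of multiple versions mention the same CVE, you usually want the **latest** mentioning version (highest by distro comparator), because repeated mentions often indicate that an earlier fix was partial. Apply this rule to `changelog` records only: a CVE-tagged patch stays in `debian/patches` for every later version, so for `patch_header` records the fix version is the **earliest** version that carries the patch.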
+
+Pseudo:
+
+```python
+def choose_fixed_version(existing: str | None, candidate: str, vercmp) -> str:
+    if not existing:
+        return candidate
+    return candidate if vercmp(candidate, existing) > 0 else existing
+```
+
+Where `vercmp` calls `dpkg --compare-versions` (Debian) or equivalent for that distro.
+
+---
+
+## 8) Decisioning logic at scan time (what changes with MVP2)
+
+Without MVP2, you likely do:
+
+* upstream range check (false positives for backports)
+
+With MVP2, you do:
+
+1. identify `distro+release` from environment (or image base)
+2. map `binary_pkg → source_pkg`
+3. query `cve_fix_index(distro, release, source_pkg, cve)`
+4. if `state=fixed`: `pkg_version >= fixed_version` (by the distro comparator) → fixed; otherwise → vulnerable
+5. if `state=not_affected` → safe
+6. else fallback to upstream ranges
+
+That single substitution removes most backport noise.
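
A compact sketch of that decision, assuming the fix record has already been looked up for the right `(distro, release, source_pkg, cve)` key and that installed versions below `fixed_version` are reported as `affected`, in line with the done-definition in section 10. The function name and the `upstream_fallback` callback are hypothetical.

```python
from typing import Callable, Optional


def decide(pkg_version: str,
           fix: Optional[dict],
           vercmp: Callable[[str, str], int],
           upstream_fallback: Callable[[], str]) -> str:
    """Return a status for one (package, CVE) pair using the fix index first."""
    if fix is not None:
        if fix["state"] == "fixed" and fix.get("fixed_version"):
            # Compare with the distro comparator, never with upstream semver
            if vercmp(pkg_version, fix["fixed_version"]) >= 0:
                return "fixed"
            return "affected"
        if fix["state"] == "not_affected":
            return "not_affected"
    # No usable distro record: fall back to upstream range matching
    return upstream_fallback()
```

Passing the fallback as a callback keeps the upstream range check lazy, so it only runs when the distro index has nothing decisive to say.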
+
+---
+
+## 9) Practical notes so you don’t get trapped
+
+### A) You must know the distro release
+
+Backport reality is release-specific. The same package name/version can have different patching across releases.
+
+### B) Arch-specific fixes exist
+
+Your schema should allow `arch` on fix records (nullable). If the feed says “only amd64 affected,” store it.
+
+### C) False positives in changelog parsing
+
+Mitigation without humans:
+
+* require CVE mention + security-ish verbs (“fix”, “security”, “patch”) OR presence in patch headers
+* otherwise lower confidence and let feed override
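
This mitigation might look like the sketch below, which scores a changelog mention higher only when the entry also contains a security-related word or the CVE is corroborated by a patch header. The word list, the function name, and the lower value `0.50` are assumptions; `0.80` mirrors the changelog confidence used in section 6.4.

```python
import re

# Words that suggest the entry describes a fix rather than a passing mention
SECURITY_WORDS = re.compile(r"\b(fix\w*|security|patch\w*)\b", re.IGNORECASE)


def changelog_confidence(entry_text: str, cve: str, in_patch_headers: bool) -> float:
    """Confidence that a changelog entry mentioning `cve` really fixes it."""
    if cve not in entry_text:
        return 0.0  # nothing to score
    corroborated = in_patch_headers or bool(SECURITY_WORDS.search(entry_text))
    # Uncorroborated mentions keep a low score so a feed record can override them
    return 0.80 if corroborated else 0.50
```

The check is deliberately crude: an entry such as "CVE-X is not fixed yet" would still score high, so the security feed should remain the higher-priority signal.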
+
+### D) Keep evidence small
+
+Store:
+
+* excerpt + patch hashes
+
+Do not store entire source tarballs.
+
+---
+
+## 10) Minimal “done definition” for MVP2
+
+You have MVP2 when, for Debian/Ubuntu at least, you can demonstrate:
+
+* A CVE that upstream marks vulnerable for version X
+* The distro backported it in X-<revision>
+* Your system classifies:
+
+  * `X-older_revision` as vulnerable
+  * `X-newer_revision` as fixed
+* With evidence: fix feed record and/or changelog/patch proof
+
+No human required.
+
+---
+
+If you want, I can provide the same “Tier 2/3 inference” module for RPM (SRPM/spec parsing) and Alpine (APKBUILD `secfixes` extraction), plus the exact Postgres DDL for `cve_fix_evidence` and `cve_fix_index`, and the merge SQL.