git.stella-ops.org/docs/features/checked/concelier/deterministic-semantic-merge-hash-for-advisory-deduplication.md
2026-02-13 02:04:55 +02:00

Deterministic Semantic Merge Hash for Advisory Deduplication

Module

Concelier

Status

VERIFIED

Description

Computes an identity-based semantic hash from the tuple (CVE + PURL/CPE + version-range + CWE + patch_lineage) for cross-distro advisory deduplication. The feature includes input normalizers (PURL, CPE, version range, CWE, patch lineage), golden corpus validation (Debian/RHEL/SUSE/Alpine), fuzzing tests (1000 random inputs), a shadow-write migration mode, and a backfill service. It is distinct from "Advisory Ingestion with Canonical Deduplication", which covers the overall dedup concept; this feature is the specific merge_hash identity algorithm.
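The idea of normalizing each identity field and then hashing the canonical tuple can be illustrated with a minimal Python sketch. It is not the MergeHashCalculator implementation: the normalization rules, the newline separator, the SHA-256 choice, the `sha256:` prefix, and every function name here are assumptions made for illustration, and the sample values are made up.

```python
import hashlib
from typing import Iterable, Optional


def normalize_cve(cve: str) -> str:
    # Assumed rule: trim whitespace and uppercase the identifier.
    return cve.strip().upper()


def normalize_purl(purl: str) -> str:
    # Assumed, simplified rule: drop qualifiers ("?...") and subpath ("#...")
    # and lowercase the rest. Real PURL case rules depend on the package type.
    base = purl.strip().split("#", 1)[0].split("?", 1)[0]
    return base.lower()


def normalize_version_range(version_range: str) -> str:
    # Assumed rule: ignore whitespace inside the range expression.
    return "".join(version_range.split())


def normalize_cwes(cwes: Iterable[str]) -> str:
    # Assumed rule: uppercase, de-duplicate, and sort so input order is irrelevant.
    return ",".join(sorted({c.strip().upper() for c in cwes}))


def merge_hash(cve: str, purl: str, version_range: str,
               cwes: Iterable[str], patch_lineage: Optional[str] = None) -> str:
    # Build the canonical identity tuple in a fixed field order, then hash it.
    parts = [
        normalize_cve(cve),
        normalize_purl(purl),
        normalize_version_range(version_range),
        normalize_cwes(cwes),
        (patch_lineage or "").strip().lower(),
    ]
    canonical = "\n".join(parts)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two differently formatted descriptions of the same (made-up) advisory.
a = merge_hash("cve-2024-0001", "pkg:deb/debian/openssl@1.1.1?arch=amd64",
               "< 1.1.1n", ["cwe-79", "CWE-20"])
b = merge_hash("CVE-2024-0001 ", "pkg:deb/debian/OpenSSL@1.1.1",
               "<1.1.1n", ["CWE-20", "CWE-79", "CWE-79"])
print(a == b)
```

Because every field is normalized before hashing and the field order is fixed, formatting differences between sources do not change the resulting hash in this sketch.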

Implementation Details

  • Modules: src/Concelier/__Libraries/StellaOps.Concelier.Merge/Identity/, src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/, src/Concelier/__Libraries/StellaOps.Concelier.Merge/Jobs/
  • Key Classes:
    • MergeHashCalculator (src/Concelier/__Libraries/StellaOps.Concelier.Merge/Identity/MergeHashCalculator.cs) - computes deterministic semantic hash from (CVE + PURL/CPE + version-range + CWE + patch_lineage) with input normalizers
    • MergeHashShadowWriteService (src/Concelier/__Libraries/StellaOps.Concelier.Merge/Identity/MergeHashShadowWriteService.cs) - shadow-write mode for migration validation
    • MergeHashBackfillService (src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/MergeHashBackfillService.cs) - retroactive backfill of merge hashes for existing advisories
    • MergeHashBackfillJob (src/Concelier/__Libraries/StellaOps.Concelier.Merge/Jobs/MergeHashBackfillJob.cs) - scheduled IJob for backfill execution
  • Interfaces: IMergeHashCalculator
  • Source: SPRINT_8200_0012_0001_CONCEL_merge_hash_library.md

Verification Evidence

  • Run: run-002 (2026-02-13)
  • Test project: StellaOps.Concelier.Merge.Tests (731/731 pass)
  • Baseline: 687 existing tests + 44 new tests
  • New test files:
    • MergeHashShadowWriteServiceTests.cs (16 tests): backfill-all, backfill-one, skip-if-hash-exists, force recompute, error resilience, cancellation, field preservation
    • MergeHashBackfillServiceTests.cs (18 tests): dry-run mode, skip-if-hash-exists, error counting, cancellation, duration, SuccessRate/AvgTimePerAdvisoryMs metrics
    • MergeHashBackfillJobTests.cs (10 tests): IJob parameter parsing (seed/force routing, empty seed fallback, type-safe force)
  • Existing coverage: MergeHashCalculatorTests (20), GoldenCorpusTests (10), FuzzingTests (5); all assertions were reviewed and confirmed to be meaningful
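The backfill behaviors named in the test list above (skip-if-hash-exists, force recompute, dry-run, error counting, a success-rate metric) can be sketched as a small loop. This is a hypothetical Python model, not the MergeHashBackfillService code: the `Advisory` and `BackfillResult` types, the `backfill` function, its parameters, and the success-rate formula are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Advisory:
    # Hypothetical minimal advisory record.
    id: str
    payload: str
    merge_hash: Optional[str] = None


@dataclass
class BackfillResult:
    processed: int = 0
    computed: int = 0
    skipped: int = 0
    errors: int = 0

    @property
    def success_rate(self) -> float:
        # Assumed definition: percentage of processed advisories that did not fail.
        if self.processed == 0:
            return 0.0
        return 100.0 * (self.processed - self.errors) / self.processed


def backfill(advisories: Iterable[Advisory],
             compute: Callable[[Advisory], str],
             force: bool = False,
             dry_run: bool = False) -> BackfillResult:
    result = BackfillResult()
    for adv in advisories:
        result.processed += 1
        # Skip advisories that already carry a hash unless a recompute is forced.
        if adv.merge_hash is not None and not force:
            result.skipped += 1
            continue
        try:
            new_hash = compute(adv)
        except Exception:
            # Count the failure and keep going so one bad record does not stop the run.
            result.errors += 1
            continue
        result.computed += 1
        if not dry_run:
            # A dry run computes hashes but leaves stored records untouched.
            adv.merge_hash = new_hash
    return result


def fake_compute(adv: Advisory) -> str:
    # Stand-in for the real calculator; fails on one marker payload.
    if adv.payload == "boom":
        raise ValueError("cannot hash")
    return "h:" + adv.payload


advisories = [Advisory("a", "x"), Advisory("b", "y", "old"), Advisory("c", "boom")]
result = backfill(advisories, fake_compute)
print(result)
```

Catching the exception per advisory is what gives the error-resilience property in this model: the run finishes and reports the error count instead of aborting.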

E2E Test Plan

  • Compute merge hash for two semantically identical advisories from different sources (e.g., Debian and RHEL for same CVE) and verify identical hash output
  • Verify PURL normalization: different PURL formats for the same package produce the same merge hash
  • Verify CPE normalization: equivalent CPE strings produce identical hashes
  • Verify determinism: same input produces the same hash across 1000 repeated computations
  • Verify golden corpus: validate merge hash against the golden corpus of known Debian/RHEL/SUSE/Alpine advisories
  • Verify shadow-write mode: enable shadow writes and confirm both old and new hashes are persisted for comparison
  • Verify backfill: run MergeHashBackfillJob and confirm pre-existing advisories receive computed merge hashes
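Two of the checks in this plan, determinism over 1000 repeated computations and shadow-write persistence of both hashes, can be sketched as a small harness. The Python below uses a stand-in hash function in place of the real calculator; `stand_in_merge_hash`, `is_deterministic`, `shadow_write`, and the `legacy_hash`/`merge_hash` field names are hypothetical.

```python
import hashlib
import itertools
from typing import Callable


def stand_in_merge_hash(advisory: dict) -> str:
    # Stand-in for the real calculator: hash the fields in sorted key order
    # so the result does not depend on dictionary insertion order.
    canonical = "|".join(f"{key}={advisory[key]}" for key in sorted(advisory))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(hash_fn: Callable[[dict], str], advisory: dict,
                     runs: int = 1000) -> bool:
    # Compute the hash `runs` times and require every result to match the first.
    first = hash_fn(advisory)
    return all(hash_fn(advisory) == first for _ in range(runs - 1))


def shadow_write(advisory: dict,
                 legacy_fn: Callable[[dict], str],
                 new_fn: Callable[[dict], str]) -> dict:
    # Return a copy that carries both hashes so they can be compared later;
    # the input record is left unchanged.
    stored = dict(advisory)
    stored["legacy_hash"] = legacy_fn(advisory)
    stored["merge_hash"] = new_fn(advisory)
    return stored


sample = {"cve": "CVE-2024-0001", "purl": "pkg:deb/debian/openssl", "range": "<1.1.1n"}
print(is_deterministic(stand_in_merge_hash, sample))

# Negative control: a function that returns a new value on every call must fail.
counter = itertools.count()
print(is_deterministic(lambda adv: str(next(counter)), sample))

stored = shadow_write(sample, lambda adv: "legacy:" + adv["cve"], stand_in_merge_hash)
print(sorted(stored))
```

The negative control is there so that the determinism check is itself shown to be able to fail; a check that can only pass proves nothing.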