ADR-001: Linkset Correlation Algorithm V2

Status: Accepted
Date: 2026-01-25
Sprint: SPRINT_20260125_001_Concelier_linkset_correlation_v2

Context

The Concelier module's linkset correlation algorithm determines whether multiple vulnerability observations (from different sources like NVD, GitHub Advisories, vendor feeds) refer to the same underlying vulnerability. The V1 algorithm had several critical failure modes:

  1. Alias intersection transitivity failure: Sources A (CVE-X), B (CVE-X + GHSA-Y), and C (GHSA-Y) produced an empty intersection even though B's shared aliases link all three transitively.

  2. Thin source penalty: A source with zero packages collapsed the entire group's package score to 0, even when other sources shared packages.

  3. False reference conflicts: Zero reference overlap was treated as a conflict rather than neutral evidence.

  4. Uniform conflict penalties: All conflicts applied the same -0.1 penalty regardless of severity.

These issues caused both false negatives (failing to link related advisories) and false positives (emitting spurious conflicts).
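The first two failure modes can be reproduced with plain set intersection. The sketch below is illustrative only: `intersect_all` and the package key `pkg:npm/example` are hypothetical, and the code does not reproduce the actual V1 implementation.

```python
from functools import reduce


def intersect_all(sets):
    """Intersect every set in the list (intersection-across-all, as V1 is described)."""
    return reduce(lambda a, b: a & b, sets) if sets else set()


# Failure mode 1: A and C are linked only through B, so the global intersection is empty.
aliases = [{"CVE-X"}, {"CVE-X", "GHSA-Y"}, {"GHSA-Y"}]
alias_overlap = intersect_all(aliases)

# Failure mode 2: one source without packages empties the intersection for the whole group.
packages = [{"pkg:npm/example"}, {"pkg:npm/example"}, set()]
package_overlap = intersect_all(packages)

print(alias_overlap, package_overlap)  # both are empty sets
```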

Decision

We will replace the V1 intersection-based correlation algorithm with a V2 graph-based approach, made up of the following changes:

1. Graph-Based Alias Connectivity

Replace intersection-across-all with union-find graph connectivity. Build a bipartite graph (observation ↔ alias nodes) and compute the Largest Connected Component (LCC) ratio.

Rationale: Transitive relationships are naturally captured by graph connectivity. Three sources with partial alias overlap can still achieve high correlation if they form a connected component.
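A minimal union-find sketch of this idea follows. It assumes the LCC ratio is the number of observations in the largest component divided by the total number of observations. The ADR does not spell out that definition, and `lcc_ratio` is a hypothetical name rather than a function from `LinksetCorrelationV2.cs`.

```python
def lcc_ratio(observations):
    """Return the share of observations in the largest alias-connected component.

    `observations` maps an observation id to its set of aliases.
    Assumed definition: largest component size / total observations.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Bipartite graph: each observation node is joined to its alias nodes.
    for obs_id, aliases in observations.items():
        obs_node = ("obs", obs_id)
        find(obs_node)  # register observations that carry no aliases
        for alias in aliases:
            union(obs_node, ("alias", alias))

    if not observations:
        return 0.0

    # Count observations (not alias nodes) per component.
    sizes = {}
    for obs_id in observations:
        root = find(("obs", obs_id))
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(observations)


chain = {"A": {"CVE-X"}, "B": {"CVE-X", "GHSA-Y"}, "C": {"GHSA-Y"}}
print(lcc_ratio(chain))  # 1.0: A and C are connected through B
```

Adding a fourth observation with an unrelated alias would lower the ratio to 0.75 under this definition, since three of the four observations remain connected.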

2. Pairwise Package Coverage

Replace intersection-across-all with pairwise coverage scoring. The score is positive when any pair of sources shares a package key, even if some sources have no packages.

Rationale: "Thin" sources (e.g., vendor advisories with only CVE IDs) should not penalize correlation when other sources provide package evidence.
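One way to express pairwise coverage is sketched below. The exact formula is an assumption: here the score is the fraction of source pairs sharing at least one package key, counting only pairs in which both sources list packages. `pairwise_package_coverage` and the package keys are hypothetical.

```python
from itertools import combinations


def pairwise_package_coverage(package_sets):
    """Fraction of comparable source pairs that share at least one package key.

    Pairs involving a source with no packages are skipped (assumed behavior),
    so a thin source neither helps nor hurts the score.
    """
    comparable = 0
    covered = 0
    for left, right in combinations(package_sets, 2):
        if not left or not right:
            continue  # thin source: no package evidence either way
        comparable += 1
        if left & right:
            covered += 1
    return covered / comparable if comparable else 0.0


sources = [{"pkg:npm/example"}, {"pkg:npm/example"}, set()]
print(pairwise_package_coverage(sources))  # 1.0: the empty source is ignored
```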

3. Neutral Reference Scoring

Zero reference overlap returns 0.5 (neutral) instead of emitting a conflict. Reserve conflicts for true contradictions.

Rationale: Disjoint reference sets indicate lack of supporting evidence, not contradiction.
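The neutral rule can be sketched as follows. Only the zero-overlap value of 0.5 comes from this decision; the scaling above 0.5 by best pairwise Jaccard overlap is an assumption made for illustration, and `reference_score` is a hypothetical name.

```python
from itertools import combinations


def reference_score(reference_sets):
    """Score reference overlap; zero overlap is neutral (0.5), never a conflict.

    Assumption: with some overlap, the score rises from 0.5 toward 1.0
    in proportion to the best pairwise Jaccard similarity.
    """
    populated = [refs for refs in reference_sets if refs]
    best = 0.0
    for left, right in combinations(populated, 2):
        jaccard = len(left & right) / len(left | right)
        best = max(best, jaccard)
    if best == 0.0:
        return 0.5  # no supporting evidence, but no contradiction either
    return 0.5 + 0.5 * best


disjoint = [{"https://a.example/advisory"}, {"https://b.example/bulletin"}]
print(reference_score(disjoint))  # 0.5
```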

4. Typed Conflict Severities

Replace uniform -0.1 penalty with severity-based penalties:

| Conflict Type                | Severity | Penalty |
|------------------------------|----------|---------|
| Distinct CVEs in cluster     | Hard     | -0.4    |
| Disjoint version ranges      | Hard     | -0.3    |
| Overlapping divergent ranges | Soft     | -0.05   |
| CVSS/severity mismatch       | Soft     | -0.05   |
| Alias inconsistency          | Soft     | -0.1    |
| Zero reference overlap       | None     | 0       |

Rationale: Hard conflicts (distinct identities) should heavily penalize confidence. Soft conflicts (metadata differences) may indicate data quality issues but not identity mismatch.
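The penalty table can be read as a lookup keyed by conflict type. In the sketch below, the severities and penalty values are taken from the table, while the string keys, the `apply_penalties` helper, and the choice to sum penalties and clamp the result to [0, 1] are assumptions.

```python
from enum import Enum


class ConflictSeverity(Enum):
    HARD = "hard"
    SOFT = "soft"
    NONE = "none"


# Conflict type -> (severity, penalty). Keys are hypothetical identifiers.
PENALTIES = {
    "distinct-cves": (ConflictSeverity.HARD, -0.4),
    "disjoint-version-ranges": (ConflictSeverity.HARD, -0.3),
    "overlapping-divergent-ranges": (ConflictSeverity.SOFT, -0.05),
    "severity-mismatch": (ConflictSeverity.SOFT, -0.05),
    "alias-inconsistency": (ConflictSeverity.SOFT, -0.1),
    "zero-reference-overlap": (ConflictSeverity.NONE, 0.0),
}


def apply_penalties(base_confidence, conflict_types):
    """Add each conflict's penalty to the base confidence and clamp to [0, 1].

    Summing and clamping is an assumed combination rule, not one stated in the ADR.
    """
    total = sum(PENALTIES[conflict][1] for conflict in conflict_types)
    return min(1.0, max(0.0, base_confidence + total))


print(apply_penalties(0.9, ["zero-reference-overlap"]))  # 0.9: no penalty applied
```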

5. Additional Correlation Signals

Add highly discriminative signals:

  • Patch lineage (0.10 weight): A shared commit SHA indicates the same fix
  • Version compatibility (0.10 weight): Classify range relationships
  • IDF weighting: Rare package matches are weighted higher than matches on common packages
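To make the IDF weighting concrete, the sketch below uses a common smoothed inverse-document-frequency form. The formula and the `package_idf` name are assumptions; the ADR does not state which IDF variant Concelier uses.

```python
import math


def package_idf(total_observations, observations_with_package):
    """Smoothed IDF: log((1 + N) / (1 + df)).

    A package seen in few observations gets a large weight; a package seen
    in every observation gets a weight of zero.
    """
    return math.log((1 + total_observations) / (1 + observations_with_package))


rare = package_idf(1000, 2)      # package appears in 2 of 1000 observations
common = package_idf(1000, 900)  # package appears in 900 of 1000 observations
print(rare > common)  # True: a match on the rare package counts for more
```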

6. V1/V2 Switchable Interface

Provide ILinksetCorrelationService with configurable version selection to enable gradual rollout and A/B testing.
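The version switch might look like the following dispatcher. This is a Python illustration of the idea only: the members of `ILinksetCorrelationService` are not listed in this ADR, so the constructor, the `correlate` method, and both stub algorithms are hypothetical stand-ins.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Set

# Hypothetical signature: a correlator maps per-source alias sets to a confidence.
Correlator = Callable[[Sequence[Set[str]]], float]


class LinksetCorrelationService:
    """Dispatch to one registered algorithm version, chosen by configuration."""

    def __init__(self, algorithms: Dict[str, Correlator], version: str) -> None:
        if version not in algorithms:
            raise ValueError(f"unknown correlation version: {version!r}")
        self.version = version
        self._correlate = algorithms[version]

    def correlate(self, alias_sets: Sequence[Set[str]]) -> float:
        return self._correlate(alias_sets)


def v1_stub(alias_sets):
    """Stand-in for V1: 1.0 only when every source shares a common alias."""
    return 1.0 if alias_sets and set.intersection(*alias_sets) else 0.0


def v2_stub(alias_sets):
    """Stand-in for V2: 1.0 when any two sources share an alias."""
    return 1.0 if any(a & b for a, b in combinations(alias_sets, 2)) else 0.0


algorithms = {"v1": v1_stub, "v2": v2_stub}
observations = [{"CVE-X"}, {"CVE-X", "GHSA-Y"}, {"GHSA-Y"}]

for version in ("v1", "v2"):
    service = LinksetCorrelationService(algorithms, version)
    print(version, service.correlate(observations))
```

Keeping both versions behind one entry point lets a deployment flip the configured version, or run both side by side for comparison, without touching callers.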

Consequences

Positive

  • Eliminates false negatives from transitive alias chains
  • Eliminates false negatives from thin sources
  • Reduces false positive conflicts from disjoint references
  • Enables fine-grained conflict severity handling by downstream policy
  • Adds discriminative signals (such as patch lineage) that differentiate Concelier from commodity linkers

Negative

  • Changes correlation weights, affecting existing linkset confidence scores
  • Requires recomputation of existing linksets during migration
  • Adds Valkey dependency for IDF caching (mitigated by graceful fallback)

Neutral

  • Algorithm complexity increases but remains O(n²) in observations
  • Determinism preserved through fixed scorer order and tie-breakers

Implementation

  • Core algorithm: LinksetCorrelationV2.cs
  • Service interface: ILinksetCorrelationService.cs
  • Service implementation: LinksetCorrelationService.cs
  • Model extension: ConflictSeverity enum in AdvisoryLinkset.cs
  • IDF caching: ValkeyPackageIdfService.cs
  • Tests: 27 V2 tests + 18 IDF tests

References

  • Sprint: docs/implplan/SPRINT_20260125_001_Concelier_linkset_correlation_v2.md
  • Algorithm documentation: docs/modules/concelier/linkset-correlation-v2.md
  • Architecture section: docs/modules/concelier/architecture.md § 5.2
  • Conflict resolution runbook: docs/modules/concelier/operations/conflict-resolution.md § 5.1