# ADR-001: Linkset Correlation Algorithm V2

**Status:** Accepted
**Date:** 2026-01-25
**Sprint:** SPRINT_20260125_001_Concelier_linkset_correlation_v2

## Context

The Concelier module's linkset correlation algorithm determines whether multiple vulnerability observations (from different sources such as NVD, GitHub Advisories, and vendor feeds) refer to the same underlying vulnerability. The V1 algorithm had several critical failure modes:

1. **Alias intersection transitivity failure**: Sources A (CVE-X), B (CVE-X + GHSA-Y), and C (GHSA-Y) produced an empty intersection despite being transitively identical through shared aliases.

2. **Thin source penalty**: A source with zero packages collapsed the entire group's package score to 0, even when other sources shared packages.

3. **False reference conflicts**: Zero reference overlap was treated as a conflict rather than neutral evidence.

4. **Uniform conflict penalties**: All conflicts applied the same -0.1 penalty regardless of severity.

These issues caused both false negatives (failing to link related advisories) and false positives (emitting spurious conflicts).

## Decision

We will replace the V1 intersection-based correlation algorithm with a V2 graph-based approach that makes the following changes.

### 1. Graph-Based Alias Connectivity
Replace intersection-across-all with union-find graph connectivity. Build a bipartite graph (observation ↔ alias nodes) and compute the Largest Connected Component (LCC) ratio.

**Rationale**: Transitive relationships are naturally captured by graph connectivity. Three sources with partial alias overlap can still achieve high correlation if they form a connected component.
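As a rough illustration of the connectivity idea, the sketch below builds the observation ↔ alias graph with union-find and reports the share of observations in the largest component. The function name `lcc_ratio`, the input shape, and the choice to count observations (rather than all nodes) in the ratio are assumptions made for this example, not the contents of `LinksetCorrelationV2.cs`.

```python
from collections import Counter


def lcc_ratio(observations: dict[str, set[str]]) -> float:
    """Share of observations in the largest alias-connected component.

    `observations` maps a (hypothetical) observation id to its alias set.
    """
    if not observations:
        return 0.0

    parent: dict[tuple[str, str], tuple[str, str]] = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Bipartite edges: each observation node is linked to its alias nodes.
    for obs_id, aliases in observations.items():
        obs_node = ("obs", obs_id)
        find(obs_node)  # register observations that carry no aliases
        for alias in aliases:
            union(obs_node, ("alias", alias))

    # Count observation nodes per component root.
    sizes = Counter(find(("obs", obs_id)) for obs_id in observations)
    return max(sizes.values()) / len(observations)


chain = {"A": {"CVE-X"}, "B": {"CVE-X", "GHSA-Y"}, "C": {"GHSA-Y"}}
print(lcc_ratio(chain))                          # 1.0
print(lcc_ratio({**chain, "D": {"CVE-Z"}}))      # 0.75
```

The A/B/C chain from the Context section has no alias common to all three sources, yet all three observations land in one component.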

### 2. Pairwise Package Coverage
Replace intersection-across-all with pairwise coverage scoring. Score is positive when any pair shares a package key, even if some sources have no packages.

**Rationale**: "Thin" sources (e.g., vendor advisories with only CVE IDs) should not penalize correlation when other sources provide package evidence.
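One way to picture pairwise coverage is shown below. The exact formula is an assumption: this sketch skips sources with no packages and returns the fraction of remaining source pairs that share at least one package key. The function name and package keys are hypothetical.

```python
from itertools import combinations


def pairwise_package_coverage(packages_by_source: dict[str, set[str]]) -> float:
    """Fraction of source pairs (among sources that list packages) sharing a key."""
    # Assumption: thin sources (no packages) are ignored rather than penalized.
    non_empty = [pkgs for pkgs in packages_by_source.values() if pkgs]
    if len(non_empty) < 2:
        return 0.0  # no pair can provide package evidence
    pairs = list(combinations(non_empty, 2))
    sharing = sum(1 for a, b in pairs if a & b)
    return sharing / len(pairs)


sources = {
    "nvd": {"pkg:npm/example"},
    "ghsa": {"pkg:npm/example", "pkg:npm/other"},
    "vendor": set(),  # thin source: CVE IDs only
}
print(pairwise_package_coverage(sources))  # 1.0
```

An intersection across all three sources would be empty here because of the thin vendor source, which is the V1 failure mode described in the Context section.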

### 3. Neutral Reference Scoring
Zero reference overlap returns 0.5 (neutral) instead of emitting a conflict. Reserve conflicts for true contradictions.

**Rationale**: Disjoint reference sets indicate lack of supporting evidence, not contradiction.
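A minimal sketch of the neutral-scoring rule follows. Only the 0.5 value for zero overlap comes from this ADR; the mapping of positive overlap onto the range above 0.5 (via Jaccard similarity) is an assumption chosen for illustration, as is the function name.

```python
def reference_score(refs_a: set[str], refs_b: set[str]) -> float:
    """Score reference overlap; disjoint or missing references are neutral."""
    overlap = refs_a & refs_b
    if not overlap:
        return 0.5  # neutral: absence of shared references is not a conflict
    # Assumption: scale Jaccard similarity into (0.5, 1.0].
    jaccard = len(overlap) / len(refs_a | refs_b)
    return 0.5 + 0.5 * jaccard


print(reference_score({"https://a.example/1"}, {"https://b.example/2"}))  # 0.5
print(reference_score({"https://a.example/1"}, {"https://a.example/1"}))  # 1.0
```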

### 4. Typed Conflict Severities
Replace the uniform -0.1 penalty with severity-based penalties:

| Conflict Type | Severity | Penalty |
|---------------|----------|---------|
| Distinct CVEs in cluster | Hard | -0.4 |
| Disjoint version ranges | Hard | -0.3 |
| Overlapping divergent ranges | Soft | -0.05 |
| CVSS/severity mismatch | Soft | -0.05 |
| Alias inconsistency | Soft | -0.1 |
| Zero reference overlap | None | 0 |

**Rationale**: Hard conflicts (distinct identities) should heavily penalize confidence. Soft conflicts (metadata differences) may indicate data quality issues but not identity mismatch.
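The table above can be encoded as a lookup, sketched below. The penalty values and severities are taken from the table; the conflict keys, the additive combination of penalties, and the clamping to [0, 1] are assumptions for this example and may differ from the actual scorer.

```python
from enum import Enum


class ConflictSeverity(Enum):
    NONE = "none"
    SOFT = "soft"
    HARD = "hard"


# Conflict keys are hypothetical; severities and penalties mirror the table.
CONFLICT_PENALTIES: dict[str, tuple[ConflictSeverity, float]] = {
    "distinct-cves": (ConflictSeverity.HARD, 0.4),
    "disjoint-version-ranges": (ConflictSeverity.HARD, 0.3),
    "overlapping-divergent-ranges": (ConflictSeverity.SOFT, 0.05),
    "severity-mismatch": (ConflictSeverity.SOFT, 0.05),
    "alias-inconsistency": (ConflictSeverity.SOFT, 0.1),
    "zero-reference-overlap": (ConflictSeverity.NONE, 0.0),
}


def apply_conflict_penalties(base_confidence: float, conflicts: list[str]) -> float:
    """Subtract each conflict's penalty, then clamp to [0, 1] (assumed behavior)."""
    total = sum(CONFLICT_PENALTIES[key][1] for key in conflicts)
    return max(0.0, min(1.0, base_confidence - total))


def has_hard_conflict(conflicts: list[str]) -> bool:
    """Lets downstream policy treat hard conflicts differently from soft ones."""
    return any(CONFLICT_PENALTIES[key][0] is ConflictSeverity.HARD for key in conflicts)


score = apply_conflict_penalties(0.9, ["distinct-cves", "severity-mismatch"])
print(round(score, 2), has_hard_conflict(["distinct-cves", "severity-mismatch"]))
```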

### 5. Additional Correlation Signals
Add highly discriminative signals:
- **Patch lineage** (0.10 weight): Shared commit SHA indicates same fix
- **Version compatibility** (0.10 weight): Classify range relationships
- **IDF weighting**: Rare package matches weighted higher than common packages
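To make the IDF idea concrete, the sketch below weights shared packages by a smoothed inverse document frequency, so a match on a rarely seen package counts for more than a match on a ubiquitous one. The ADR does not specify the formula; the smoothing, the weighted-overlap ratio, the function names, and the sample frequencies are all assumptions.

```python
import math


def package_idf(package_key: str, doc_freq: dict[str, int], total_observations: int) -> float:
    """Smoothed IDF: packages seen in fewer observations get a larger weight."""
    df = doc_freq.get(package_key, 0)
    return math.log((1 + total_observations) / (1 + df)) + 1.0


def idf_weighted_overlap(
    a: set[str], b: set[str], doc_freq: dict[str, int], total_observations: int
) -> float:
    """IDF-weighted share of the combined package set that both sides agree on."""
    union = a | b
    if not union:
        return 0.0
    shared = sum(package_idf(k, doc_freq, total_observations) for k in a & b)
    combined = sum(package_idf(k, doc_freq, total_observations) for k in union)
    return shared / combined


# Hypothetical observation counts per package key.
doc_freq = {"pkg:npm/common-a": 900, "pkg:npm/common-b": 800,
            "pkg:npm/rare-a": 2, "pkg:npm/rare-b": 3}

rare_match = idf_weighted_overlap(
    {"pkg:npm/rare-a", "pkg:npm/common-a"},
    {"pkg:npm/rare-a", "pkg:npm/common-b"}, doc_freq, 1000)
common_match = idf_weighted_overlap(
    {"pkg:npm/common-a", "pkg:npm/rare-a"},
    {"pkg:npm/common-a", "pkg:npm/rare-b"}, doc_freq, 1000)
print(rare_match > common_match)  # True
```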

### 6. V1/V2 Switchable Interface
Provide `ILinksetCorrelationService` with configurable version selection to enable gradual rollout and A/B testing.
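The selection pattern can be sketched as follows. This is a Python illustration only, not the `ILinksetCorrelationService` contract: the option name `version`, the class names, and both scorers are hypothetical, and the V2 scorer here is a deliberately simple stand-in rather than the graph-based algorithm.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Mapping

Observations = Mapping[str, set[str]]
Scorer = Callable[[Observations], float]


def v1_intersection(observations: Observations) -> float:
    """V1-style: 1.0 only if some alias is common to every observation."""
    sets = list(observations.values())
    if not sets:
        return 0.0
    return 1.0 if set(sets[0]).intersection(*sets[1:]) else 0.0


def v2_stand_in(observations: Observations) -> float:
    """Stand-in for V2 (not the real scorer): 1.0 if any pair shares an alias."""
    sets = list(observations.values())
    return 1.0 if any(a & b for a, b in combinations(sets, 2)) else 0.0


@dataclass(frozen=True)
class CorrelationOptions:
    version: str = "v2"  # hypothetical configuration key


class LinksetCorrelationService:
    """Dispatches to the scorer selected by configuration."""

    def __init__(self, scorers: Mapping[str, Scorer], options: CorrelationOptions):
        if options.version not in scorers:
            raise ValueError(f"unknown correlation version: {options.version}")
        self.version = options.version
        self._score = scorers[options.version]

    def correlate(self, observations: Observations) -> float:
        return self._score(observations)


scorers = {"v1": v1_intersection, "v2": v2_stand_in}
chain = {"A": {"CVE-X"}, "B": {"CVE-X", "GHSA-Y"}, "C": {"GHSA-Y"}}
for version in ("v1", "v2"):
    service = LinksetCorrelationService(scorers, CorrelationOptions(version))
    print(version, service.correlate(chain))
```

Keeping both scorers behind one entry point lets the same observations be scored under either version, which is what a gradual rollout or A/B comparison needs.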

## Consequences

### Positive
- Eliminates false negatives from transitive alias chains
- Eliminates false negatives from thin sources
- Reduces false positive conflicts from disjoint references
- Enables fine-grained conflict severity handling by downstream policy
- Adds discriminative signals (such as patch lineage) that set this linker apart from commodity linkers

### Negative
- Changes correlation weights, affecting existing linkset confidence scores
- Requires recomputation of existing linksets during migration
- Adds Valkey dependency for IDF caching (mitigated by graceful fallback)

### Neutral
- Algorithm complexity increases but remains O(n²) in the number of observations
- Determinism preserved through fixed scorer order and tie-breakers

## Implementation

- **Core algorithm**: `LinksetCorrelationV2.cs`
- **Service interface**: `ILinksetCorrelationService.cs`
- **Service implementation**: `LinksetCorrelationService.cs`
- **Model extension**: `ConflictSeverity` enum in `AdvisoryLinkset.cs`
- **IDF caching**: `ValkeyPackageIdfService.cs`
- **Tests**: 27 V2 tests + 18 IDF tests

## References

- Sprint: `docs/implplan/SPRINT_20260125_001_Concelier_linkset_correlation_v2.md`
- Algorithm documentation: `docs/modules/concelier/linkset-correlation-v2.md`
- Architecture section: `docs/modules/concelier/architecture.md` § 5.2
- Conflict resolution runbook: `docs/modules/concelier/operations/conflict-resolution.md` § 5.1