# CONCELIER-LNM-26-001 · Linkset Correlation Rules (v2)

> Supersedes `linkset-correlation-21-002.md` for new linkset builds.
> V1 linksets remain valid; a migration job recomputes their confidence using the v2 algorithm.

Purpose: Address critical failure modes in v1 correlation (loss of transitive links under intersection matching, false conflict emission) and introduce more discriminative signals (patch lineage, version compatibility, IDF-weighted package matching).

---

## Scope

- Applies to linksets produced from `advisory_observations` (LNM v2).
- Correlation is aggregation-only: no value synthesis or merge; emit conflicts instead of collapsing fields.
- Output persists in `advisory_linksets` and drives `advisory.linkset.updated@1` events.
- Maintains determinism, offline posture, and LNM/AOC contracts.

---

## Key Changes from v1

| Aspect | v1 Behavior | v2 Behavior |
|--------|-------------|-------------|
| Alias matching | Intersection across all inputs | Graph connectivity (LCC ratio) |
| PURL matching | Intersection across all inputs | Pairwise coverage + IDF weighting |
| Reference clash | Emitted on zero overlap | Only on true URL contradictions |
| Conflict penalty | Single -0.1 for any conflict | Typed severities with per-reason penalties |
| Patch lineage | Not used | New signal (weight 0.10; scores 1.0 on an exact commit SHA match) |
| Version ranges | Divergence noted only | Classified (Equivalent/Overlapping/Disjoint) |

---

## Deterministic Confidence Calculation (0-1)

### Signal Weights

```
base_confidence =
  0.30 * alias_connectivity +
  0.10 * alias_authority +
  0.20 * package_coverage +
  0.10 * version_compatibility +
  0.10 * cpe_match +
  0.10 * patch_lineage +
  0.05 * reference_overlap +
  0.05 * freshness_score

confidence = max(0.1, base_confidence - typed_penalty)
```
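
The weighted sum can be sketched in Python as follows. This is a minimal illustration, not the shipped implementation: the names `WEIGHTS` and `combine_signals` are hypothetical, and the 0.1 floor is taken from the Penalty Calculation section and the pseudocode later in this document.

```python
# Signal weights as listed above; they sum to 1.0.
WEIGHTS = {
    "alias_connectivity": 0.30,
    "alias_authority": 0.10,
    "package_coverage": 0.20,
    "version_compatibility": 0.10,
    "cpe_match": 0.10,
    "patch_lineage": 0.10,
    "reference_overlap": 0.05,
    "freshness_score": 0.05,
}


def combine_signals(signals, typed_penalty=0.0):
    """Combine per-signal scores (each expected in [0, 1]) into a confidence."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    base = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    base = min(1.0, max(0.0, base))  # guard against float drift or odd inputs
    return max(0.1, base - typed_penalty)  # floor per the penalty rules


# Example: everything agrees except overlapping version ranges (score 0.6),
# which also contributes a soft penalty of 0.05.
example = {name: 1.0 for name in WEIGHTS}
example["version_compatibility"] = 0.6
print(round(combine_signals(example, typed_penalty=0.05), 4))
```

With these inputs the base is 1.0 - 0.10 * 0.4 = 0.96, and the soft penalty brings the result to 0.91.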

### Signal Definitions

#### `alias_connectivity` (weight: 0.30)

**Graph-based scoring** replacing intersection-across-all.

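1. Build a bipartite graph: observation nodes ↔ alias nodes
2. Observations that share any alias become connected, including transitively through intermediate observations
3. Compute the LCC (largest connected component) ratio: `|LCC| / N`, where `|LCC|` counts the observations in the largest component and `N` is the total number of observations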

| Scenario | Score |
|----------|-------|
| All observations in single connected component | 1.0 |
| 80% of observations connected | 0.8 |
| No alias overlap at all | 0.0 |

**Why this matters**: Sources A (CVE-X), B (CVE-X + GHSA-Y), C (GHSA-Y) now correctly correlate via transitive bridging, whereas v1 produced score = 0.
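
A minimal sketch of the LCC ratio using union-find, in Python. The function name `alias_connectivity` and the input shape (one alias set per observation) are assumptions for illustration. Returning 0.0 when no two observations share an alias is also an assumption, chosen to match the table above rather than the raw `1 / N` ratio.

```python
from collections import Counter


def alias_connectivity(observation_aliases):
    """Share of observations in the largest alias-connected component.

    observation_aliases: list of alias sets, one per observation.
    """
    n = len(observation_aliases)
    if n == 0:
        return 0.0  # assumption: no observations means no connectivity signal
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # alias -> index of the first observation carrying it
    for idx, aliases in enumerate(observation_aliases):
        for alias in aliases:
            if alias in first_seen:
                # Shared alias: merge this observation into the earlier one's set.
                parent[find(idx)] = find(first_seen[alias])
            else:
                first_seen[alias] = idx

    largest = max(Counter(find(i) for i in range(n)).values())
    if n > 1 and largest == 1:
        return 0.0  # assumption: no overlap at all scores 0.0, as in the table
    return largest / n


# Sources A, B, C from the example above: B bridges A and C.
print(alias_connectivity([{"CVE-X"}, {"CVE-X", "GHSA-Y"}, {"GHSA-Y"}]))  # 1.0
```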

#### `alias_authority` (weight: 0.10)

Scope-based weighting using existing canonical key prefixes. The signal takes the highest authority score among the aliases present in the cluster:

| Alias Type | Authority Score |
|------------|-----------------|
| CVE-* (global) | 1.0 |
| GHSA-* (ecosystem) | 0.8 |
| Vendor IDs (RHSA, MSRC, CISCO, VMSA) | 0.6 |
| Distribution IDs (DSA, USN, SUSE) | 0.4 |
| Unknown scheme | 0.2 |
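
The lookup can be sketched as a prefix table in Python. The exact prefix strings (for example `MSRC-` or `CISCO-`) are assumptions; the canonical key prefixes used by the real normalizers may differ. Taking the maximum mirrors `maxAuthorityScore` in the pseudocode below.

```python
# (prefixes, score) pairs in descending authority; prefix spellings are assumed.
AUTHORITY_BY_PREFIX = [
    (("CVE-",), 1.0),
    (("GHSA-",), 0.8),
    (("RHSA-", "MSRC-", "CISCO-", "VMSA-"), 0.6),
    (("DSA-", "USN-", "SUSE-"), 0.4),
]
UNKNOWN_SCHEME_SCORE = 0.2


def authority_score(alias):
    """Score a single alias by its identifier scheme."""
    upper = alias.strip().upper()
    for prefixes, score in AUTHORITY_BY_PREFIX:
        if upper.startswith(prefixes):  # str.startswith accepts a tuple
            return score
    return UNKNOWN_SCHEME_SCORE


def alias_authority(aliases):
    """Highest authority score among the aliases; 0.0 if there are none (assumed)."""
    return max((authority_score(a) for a in aliases), default=0.0)


print(alias_authority(["USN-1234-1", "GHSA-abcd-efgh-ijkl"]))  # 0.8
```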

#### `package_coverage` (weight: 0.20)

**Pairwise + IDF weighting** replacing intersection-across-all.

1. Extract package keys (PURL without version) from each observation
2. For each package key, compute IDF weight: `log(N / (1 + df))` where N = corpus size, df = observations containing package
3. Score = weighted overlap ratio across pairs

| Scenario | Score |
|----------|-------|
| All sources share same rare package | ~1.0 |
| All sources share common package (lodash) | ~0.6 |
| One "thin" source with no packages | Other sources still score > 0 |
| No package overlap | 0.0 |

**IDF fallback**: When the IDF cache is unavailable, uniform weights (1.0) are used.
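
The sketch below shows one possible reading of the steps above, in Python: the IDF weight from step 2, and a pairwise IDF-weighted Jaccard overlap as a stand-in for the "weighted overlap ratio". The Jaccard choice, the zero floor on the IDF weight, the skipping of observations without packages, and all function names are assumptions. The sketch is not calibrated to reproduce the approximate scores in the table.

```python
import math
from itertools import combinations


def idf_weight(corpus_size, doc_freq):
    """IDF weight log(N / (1 + df)); floored at 0 (the floor is an assumption)."""
    return max(0.0, math.log(corpus_size / (1 + doc_freq)))


def package_coverage(observation_packages, doc_freqs=None, corpus_size=None):
    """Mean weighted Jaccard overlap across pairs of observations with packages."""

    def weight(key):
        if doc_freqs is None or corpus_size is None:
            return 1.0  # IDF fallback: uniform weights
        return idf_weight(corpus_size, doc_freqs.get(key, 0))

    # "Thin" observations without packages are skipped, not counted as mismatches.
    populated = [set(p) for p in observation_packages if p]
    if len(populated) < 2:
        return 0.0
    ratios = []
    for left, right in combinations(populated, 2):
        union = sum(weight(k) for k in left | right)
        shared = sum(weight(k) for k in left & right)
        ratios.append(shared / union if union > 0 else 0.0)
    return sum(ratios) / len(ratios)


# A rare package outweighs a common one.
print(idf_weight(1000, 1) > idf_weight(1000, 500))  # True
```

Under this reading, two observations that share a rare package but differ on common ones score higher with IDF weights than with uniform weights.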

#### `version_compatibility` (weight: 0.10)

Classifies version relationships per shared package:

| Relation | Score | Conflict |
|----------|-------|----------|
| **Equivalent**: ranges normalize identically | 1.0 | None |
| **Overlapping**: non-empty intersection | 0.6 | Soft (`affected-range-divergence`) |
| **Disjoint**: no intersection | 0.0 | Hard (`disjoint-version-ranges`) |
| **Unknown**: parse failure | 0.5 | None |

Uses `SemanticVersionRangeResolver` for semver; delegates to ecosystem-specific comparers for rpm EVR, dpkg, apk.
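
To illustrate the classification, here is a deliberately simplified Python sketch that only understands ranges of the form `>=A,<B` with dotted numeric versions. It is not `SemanticVersionRangeResolver`; the helper names and the restricted range syntax are assumptions, and ecosystem comparers (rpm EVR, dpkg, apk) are out of scope.

```python
RELATION_SCORES = {
    "Equivalent": 1.0,
    "Overlapping": 0.6,
    "Disjoint": 0.0,
    "Unknown": 0.5,
}


def parse_version(text):
    """Parse 'X.Y.Z' into an int tuple, padding short versions with zeros."""
    parts = tuple(int(p) for p in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def parse_range(text):
    """Parse '>=A,<B' into a half-open (low, high) pair; None on parse failure."""
    try:
        low, high = (part.strip() for part in text.split(","))
        if not low.startswith(">=") or not high.startswith("<"):
            return None
        return parse_version(low[2:]), parse_version(high[1:])
    except ValueError:
        return None


def classify(range_a, range_b):
    """Classify two ranges for the same package key."""
    a, b = parse_range(range_a), parse_range(range_b)
    if a is None or b is None:
        return "Unknown"
    if a == b:
        return "Equivalent"
    # Half-open intervals intersect when the larger low is below the smaller high.
    if max(a[0], b[0]) < min(a[1], b[1]):
        return "Overlapping"
    return "Disjoint"


relation = classify(">=4.0.0,<4.17.21", ">=4.0.0,<4.18.0")
print(relation, RELATION_SCORES[relation])  # Overlapping 0.6
```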

#### `cpe_match` (weight: 0.10)

Unchanged from v1:
- Exact CPE overlap: 1.0
- Same vendor/product: 0.5
- No match: 0.0
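
A rough Python sketch of the three tiers for a pair of observations, assuming CPE 2.3 formatted strings. Splitting on `:` ignores escaped colons inside components, and how pairwise results are aggregated across more than two observations is not specified here, so treat the function names and both simplifications as assumptions.

```python
def cpe_vendor_product(cpe):
    """Return (vendor, product) from a CPE 2.3 formatted string, or None."""
    # Layout: cpe:2.3:<part>:<vendor>:<product>:<version>:...
    parts = cpe.lower().split(":")
    if len(parts) < 5 or parts[0] != "cpe" or parts[1] != "2.3":
        return None
    return parts[3], parts[4]


def cpe_match(cpes_a, cpes_b):
    """Score CPE agreement between two observations."""
    set_a = {c.lower() for c in cpes_a}
    set_b = {c.lower() for c in cpes_b}
    if set_a & set_b:
        return 1.0  # exact CPE overlap
    products_a = {cpe_vendor_product(c) for c in set_a} - {None}
    products_b = {cpe_vendor_product(c) for c in set_b} - {None}
    if products_a & products_b:
        return 0.5  # same vendor/product, different version or attributes
    return 0.0


print(cpe_match(
    ["cpe:2.3:a:lodash:lodash:4.17.20:*:*:*:*:node.js:*:*"],
    ["cpe:2.3:a:lodash:lodash:4.17.19:*:*:*:*:node.js:*:*"],
))  # 0.5
```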

#### `patch_lineage` (weight: 0.10)

**New signal**: correlation via shared fix commits.

1. Extract patch references from observation references (type: `patch`, `fix`, `commit`)
2. Normalize to commit SHAs using `PatchLineageNormalizer`
3. Any pairwise SHA match: 1.0; otherwise 0.0

**Why this matters**: "These advisories fix the same code" is high-confidence evidence most platforms lack.
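
The steps above can be sketched in Python as follows. This is not `PatchLineageNormalizer`: the regex, the restriction to full 40-character SHAs in `/commit/<sha>` URLs, and the function names are assumptions, and the input is assumed to be already filtered to patch-type references.

```python
import re
from itertools import combinations

# Matches a full 40-hex-digit SHA after "/commit/" (GitHub- and GitLab-style URLs).
_COMMIT_URL = re.compile(r"/commit/([0-9a-fA-F]{40})(?![0-9a-fA-F])")


def commit_shas(reference_urls):
    """Extract lowercase commit SHAs from patch/fix/commit reference URLs."""
    shas = set()
    for url in reference_urls:
        match = _COMMIT_URL.search(url)
        if match:
            shas.add(match.group(1).lower())
    return shas


def patch_lineage(observation_refs):
    """1.0 if any two observations reference the same commit SHA, else 0.0."""
    sha_sets = [commit_shas(refs) for refs in observation_refs]
    shared = any(a & b for a, b in combinations(sha_sets, 2))
    return 1.0 if shared else 0.0


sha = "0123456789abcdef0123456789abcdef01234567"
print(patch_lineage([
    ["https://github.com/example/repo/commit/" + sha],
    ["https://gitlab.example/group/repo/-/commit/" + sha + ".patch"],
]))  # 1.0
```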

#### `reference_overlap` (weight: 0.05)

**Positive-only** (no conflict on zero overlap):

1. Normalize URLs (lowercase, strip tracking parameters, force the `https://` scheme)
2. Compute max pairwise overlap ratio
3. Map to score: `0.5 + (overlap * 0.5)`

| Scenario | Score |
|----------|-------|
| 100% URL overlap | 1.0 |
| 50% URL overlap | 0.75 |
| Zero URL overlap | 0.5 (neutral) |

**No `reference-clash` emission** for simple disjoint sets.
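
A minimal Python sketch of the scoring, using only the standard library. Treating "overlap ratio" as Jaccard similarity and treating `utm_*` as the tracking parameters are both assumptions, as are the function names.

```python
from itertools import combinations
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumed set of tracking parameters


def normalize_url(url):
    """Lowercase, drop tracking query parameters, and force the https scheme."""
    parts = urlsplit(url.strip().lower())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        ("https", parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


def reference_overlap(observation_urls):
    """Map the best pairwise URL overlap to a score in [0.5, 1.0]."""
    url_sets = [{normalize_url(u) for u in urls} for urls in observation_urls]
    best = 0.0
    for left, right in combinations(url_sets, 2):
        union = left | right
        if union:
            best = max(best, len(left & right) / len(union))
    return 0.5 + best * 0.5  # zero overlap stays neutral at 0.5


print(reference_overlap([
    ["HTTP://Example.com/a?utm_source=x", "https://example.com/b"],
    ["https://example.com/a"],
]))  # 0.75
```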

#### `freshness_score` (weight: 0.05)

Unchanged from v1:
- Spread ≤ 48h: 1.0
- Spread ≥ 14d: 0.0
- Linear decay between
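
As a sketch of the decay, assuming "spread" means the gap between the earliest and latest observation timestamps (the choice of timestamp field and the function name are assumptions):

```python
from datetime import datetime, timedelta, timezone

FULL_SCORE_SPREAD = timedelta(hours=48)
ZERO_SCORE_SPREAD = timedelta(days=14)


def freshness_score(timestamps):
    """Score 1.0 up to a 48h spread, 0.0 from 14d, linear in between."""
    if not timestamps:
        raise ValueError("at least one timestamp is required")
    spread = max(timestamps) - min(timestamps)
    if spread <= FULL_SCORE_SPREAD:
        return 1.0
    if spread >= ZERO_SCORE_SPREAD:
        return 0.0
    # Dividing two timedeltas yields a float.
    return (ZERO_SCORE_SPREAD - spread) / (ZERO_SCORE_SPREAD - FULL_SCORE_SPREAD)


first = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(freshness_score([first, first + timedelta(days=8)]))  # 0.5
```

An 8-day spread sits halfway between 48 hours and 14 days, so it scores 0.5.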

---

## Conflict Emission (Typed Severities)

### Severity Levels

| Severity | Penalty Range | Meaning |
|----------|---------------|---------|
| **Hard** | 0.30 - 0.40 | Significant disagreement; likely prevents high-confidence linking |
| **Soft** | 0.05 - 0.10 | Minor disagreement; link with reduced confidence |
| **Info** | 0.00 | Informational; no penalty |

### Conflict Types and Penalties

| Conflict Reason | Severity | Penalty | Trigger Condition |
|-----------------|----------|---------|-------------------|
| `distinct-cves` | Hard | 0.40 | Two different CVE-* identifiers in cluster |
| `disjoint-version-ranges` | Hard | 0.30 | Same package key, ranges have no intersection |
| `alias-inconsistency` | Soft | 0.10 | Disconnected alias graph (but no CVE conflict) |
| `affected-range-divergence` | Soft | 0.05 | Ranges overlap but differ |
| `severity-mismatch` | Soft | 0.05 | CVSS base score delta > 1.0 |
| `reference-clash` | Info | 0.00 | Reserved for true contradictions only |
| `metadata-gap` | Info | 0.00 | Required provenance missing |

### Penalty Calculation

```
typed_penalty = min(0.6, sum(penalty_per_conflict))
```

The total penalty saturates at 0.6 so that stacked conflicts cannot collapse the score entirely. After the penalty is subtracted, confidence is floored at 0.1 whenever any evidence exists.
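
A short Python sketch of the saturation rule, with penalties expressed as positive magnitudes that are later subtracted from the base confidence. The names `PENALTIES` and `typed_penalty` are illustrative, not part of the codebase.

```python
# Penalty magnitude per conflict reason, from the table above.
PENALTIES = {
    "distinct-cves": 0.40,
    "disjoint-version-ranges": 0.30,
    "alias-inconsistency": 0.10,
    "affected-range-divergence": 0.05,
    "severity-mismatch": 0.05,
    "reference-clash": 0.00,
    "metadata-gap": 0.00,
}
MAX_PENALTY = 0.6


def typed_penalty(conflict_reasons):
    """Sum per-conflict penalties, saturating at MAX_PENALTY.

    Raises KeyError for a reason that is not in the table.
    """
    return min(MAX_PENALTY, sum(PENALTIES[reason] for reason in conflict_reasons))


# Two hard conflicts would sum to 0.70 but saturate at 0.6.
print(typed_penalty(["distinct-cves", "disjoint-version-ranges"]))  # 0.6
```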

### Conflict Record Shape

```json
{
  "field": "aliases",
  "reason": "distinct-cves",
  "severity": "Hard",
  "values": ["nvd:CVE-2025-1234", "ghsa:CVE-2025-5678"],
  "sourceIds": ["nvd", "ghsa"]
}
```

---

## Linkset Output Shape

Additions relative to v1:

```json
{
  "key": {
    "vulnerabilityId": "CVE-2025-1234",
    "productKey": "pkg:npm/lodash",
    "confidence": 0.82
  },
  "conflicts": [
    {
      "field": "affected.versions[pkg:npm/lodash]",
      "reason": "affected-range-divergence",
      "severity": "Soft",
      "values": ["nvd:>=4.0.0,<4.17.21", "ghsa:>=4.0.0,<4.18.0"],
      "sourceIds": ["nvd", "ghsa"]
    }
  ],
  "signalScores": {
    "aliasConnectivity": 1.0,
    "aliasAuthority": 1.0,
    "packageCoverage": 0.85,
    "versionCompatibility": 0.6,
    "cpeMatch": 0.5,
    "patchLineage": 1.0,
    "referenceOverlap": 0.75,
    "freshness": 1.0
  },
  "provenance": {
    "observationHashes": ["sha256:abc...", "sha256:def..."],
    "toolVersion": "concelier/2.0.0",
    "correlationVersion": "v2"
  }
}
```

---

## Algorithm Pseudocode

```
function Compute(observations):
    if observations.empty:
        return (confidence=1.0, conflicts=[])

    conflicts = []

    # 1. Alias connectivity (graph-based)
    aliasGraph = buildBipartiteGraph(observations)
    aliasConnectivity = LCC(aliasGraph) / observations.count
    if hasDistinctCVEs(aliasGraph):
        conflicts.add(HardConflict("distinct-cves"))
    elif aliasConnectivity < 1.0:
        conflicts.add(SoftConflict("alias-inconsistency"))

    # 2. Alias authority
    aliasAuthority = maxAuthorityScore(observations)

    # 3. Package coverage (pairwise + IDF)
    packageCoverage = computeIDFWeightedCoverage(observations)

    # 4. Version compatibility
    for sharedPackage in findSharedPackages(observations):
        relation = classifyVersionRelation(observations, sharedPackage)
        if relation == Disjoint:
            conflicts.add(HardConflict("disjoint-version-ranges"))
        elif relation == Overlapping:
            conflicts.add(SoftConflict("affected-range-divergence"))
    versionScore = averageRelationScore(observations)

    # 5. CPE match
    cpeScore = computeCpeOverlap(observations)

    # 6. Patch lineage
    patchScore = 1.0 if anyPairSharesCommitSHA(observations) else 0.0

    # 7. Reference overlap (positive-only)
    referenceScore = 0.5 + (maxPairwiseURLOverlap(observations) * 0.5)

    # 8. Freshness
    freshnessScore = computeFreshness(observations)

    # Calculate weighted sum
    baseConfidence = (
        0.30 * aliasConnectivity +
        0.10 * aliasAuthority +
        0.20 * packageCoverage +
        0.10 * versionScore +
        0.10 * cpeScore +
        0.10 * patchScore +
        0.05 * referenceScore +
        0.05 * freshnessScore
    )

    # Apply typed penalties
    penalty = min(0.6, sum(conflict.penalty for conflict in conflicts))
    finalConfidence = max(0.1, baseConfidence - penalty)

    return (confidence=finalConfidence, conflicts=dedupe(conflicts))
```

---

## Implementation

### Code Locations

| Component | Path |
|-----------|------|
| V2 Algorithm | `src/Concelier/__Libraries/StellaOps.Concelier.Core/Linksets/LinksetCorrelationV2.cs` |
| Conflict Model | `src/Concelier/__Libraries/StellaOps.Concelier.Core/Linksets/AdvisoryLinkset.cs` |
| Patch Normalizer | `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Identity/Normalizers/PatchLineageNormalizer.cs` |
| Version Resolver | `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Comparers/SemanticVersionRangeResolver.cs` |

### Configuration

```yaml
concelier:
  correlation:
    version: v2  # v1 | v2
    weights:
      aliasConnectivity: 0.30
      aliasAuthority: 0.10
      packageCoverage: 0.20
      versionCompatibility: 0.10
      cpeMatch: 0.10
      patchLineage: 0.10
      referenceOverlap: 0.05
      freshness: 0.05
    idf:
      enabled: true
      cacheKey: "concelier:package:idf"
      refreshIntervalMinutes: 60
    textSimilarity:
      enabled: false  # Phase 3
```

---

## Telemetry

| Instrument | Type | Tags | Purpose |
|------------|------|------|---------|
| `concelier.linkset.confidence` | Histogram | `version` | Confidence score distribution |
| `concelier.linkset.conflicts_total` | Counter | `reason`, `severity` | Conflict counts by type |
| `concelier.linkset.signal_score` | Histogram | `signal` | Per-signal score distribution |
| `concelier.linkset.patch_lineage_hits` | Counter | - | Patch SHA matches found |
| `concelier.linkset.idf_cache_hit` | Counter | `hit` | IDF cache effectiveness |

---

## Migration

### Recompute Job

```bash
stella db linksets recompute --correlation-version v2 --batch-size 1000
```

Recomputes confidence for existing linksets using the v2 algorithm. Does not modify observation data.

### Rollback

Set `concelier:correlation:version` to `v1` to revert to intersection-based scoring.

---

## Fixtures

- `docs/modules/concelier/samples/linkset-v2-transitive-bridge.json`: Three-source transitive bridging (A↔B↔C) demonstrating graph connectivity.
- `docs/modules/concelier/samples/linkset-v2-patch-match.json`: Two-source correlation via shared commit SHA.
- `docs/modules/concelier/samples/linkset-v2-hard-conflict.json`: Distinct CVEs in cluster triggering hard penalty.

All fixtures use ASCII ordering and ISO-8601 UTC timestamps.

---

## Change Control

- V2 is add-only relative to the v1 output schema.
- Signal weight adjustments require a sprint note but not a schema version bump.
- New conflict reasons require the `advisory.linkset.updated@2` event schema and a doc update.
- Removal of a signal requires deprecation period and migration guidance.