git.stella-ops.org/docs/modules/concelier/linkset-correlation-v2.md
2026-01-25 23:39:14 +02:00


CONCELIER-LNM-26-001 · Linkset Correlation Rules (v2)

Supersedes linkset-correlation-21-002.md for new linkset builds. V1 linksets remain valid; a migration job will recompute their confidence using the v2 algorithm.

Purpose: Address critical failure modes in v1 correlation (alias intersection that ignores transitive links, false conflict emission) and introduce more discriminative signals (patch lineage, version compatibility, IDF-weighted package matching).


Scope

  • Applies to linksets produced from advisory_observations (LNM v2).
  • Correlation is aggregation-only: no value synthesis or merge; emit conflicts instead of collapsing fields.
  • Output persists in advisory_linksets and drives advisory.linkset.updated@1 events.
  • Maintains determinism, offline posture, and LNM/AOC contracts.

Key Changes from v1

| Aspect | v1 Behavior | v2 Behavior |
|---|---|---|
| Alias matching | Intersection across all inputs | Graph connectivity (LCC ratio) |
| PURL matching | Intersection across all inputs | Pairwise coverage + IDF weighting |
| Reference clash | Emitted on zero overlap | Only on true URL contradictions |
| Conflict penalty | Single -0.1 for any conflict | Typed severities with per-reason penalties |
| Patch lineage | Not used | New signal (weight 0.10; scores 1.0 on an exact SHA match) |
| Version ranges | Divergence noted only | Classified (Equivalent/Overlapping/Disjoint) |

Deterministic Confidence Calculation (0-1)

Signal Weights

confidence = max(0.1, clamp(
  0.30 * alias_connectivity +
  0.10 * alias_authority +
  0.20 * package_coverage +
  0.10 * version_compatibility +
  0.10 * cpe_match +
  0.10 * patch_lineage +
  0.05 * reference_overlap +
  0.05 * freshness_score,
  0, 1
) - typed_penalty)
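As a worked illustration of the weighted sum, the following Python sketch combines eight signal scores with the weights listed above. The function name `combine_signals` and the sample signal values are hypothetical; the 0.1 floor follows the algorithm pseudocode later in this document.

```python
# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {
    "alias_connectivity": 0.30,
    "alias_authority": 0.10,
    "package_coverage": 0.20,
    "version_compatibility": 0.10,
    "cpe_match": 0.10,
    "patch_lineage": 0.10,
    "reference_overlap": 0.05,
    "freshness_score": 0.05,
}


def combine_signals(signals: dict[str, float], typed_penalty: float) -> float:
    """Weighted sum of signal scores, clamped to [0, 1], minus the penalty."""
    base = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    base = min(1.0, max(0.0, base))
    # Floor of 0.1, as in the pseudocode (max(0.1, baseConfidence - penalty)).
    return max(0.1, base - typed_penalty)


# Hypothetical signal values, chosen only to exercise the arithmetic.
sample = {
    "alias_connectivity": 1.0,
    "alias_authority": 1.0,
    "package_coverage": 0.6,
    "version_compatibility": 1.0,
    "cpe_match": 0.5,
    "patch_lineage": 0.0,
    "reference_overlap": 0.5,
    "freshness_score": 1.0,
}

# Base sum is 0.745; one Soft conflict at 0.05 leaves 0.695.
print(round(combine_signals(sample, typed_penalty=0.05), 3))
```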

Signal Definitions

alias_connectivity (weight: 0.30)

Graph-based scoring replacing intersection-across-all.

  1. Build bipartite graph: observation nodes ↔ alias nodes
  2. Connect observations that share any alias (transitive bridging)
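  3. Compute LCC (largest connected component) ratio: |LCC| / N, where |LCC| is the number of observations in the largest component and N is the total number of observations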

| Scenario | Score |
|---|---|
| All observations in single connected component | 1.0 |
| 80% of observations connected | 0.8 |
| No alias overlap at all | 0.0 |

Why this matters: Sources A (CVE-X), B (CVE-X + GHSA-Y), C (GHSA-Y) now correctly correlate via transitive bridging, whereas v1 produced score = 0.
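The three steps above can be sketched with a union-find structure over observations. This is a minimal illustration, not the code in LinksetCorrelationV2.cs: the function name and input shape (observation ID mapped to its alias set) are assumptions, as is the rule that an all-singleton graph scores 0.0, which is added here only to match the "No alias overlap at all" row.

```python
from collections import Counter


def alias_connectivity(observations: dict[str, set[str]]) -> float:
    """Return |LCC| / N, counting observation nodes only."""
    n = len(observations)
    if n == 0:
        return 0.0  # assumption: no observations means no connectivity
    parent = {obs_id: obs_id for obs_id in observations}

    def find(node: str) -> str:
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Union every observation with the first observation seen for each alias.
    first_holder: dict[str, str] = {}
    for obs_id in sorted(observations):  # sorted for deterministic traversal
        for alias in sorted(observations[obs_id]):
            if alias in first_holder:
                parent[find(obs_id)] = find(first_holder[alias])
            else:
                first_holder[alias] = obs_id

    largest = max(Counter(find(o) for o in observations).values())
    if largest == 1 and n > 1:
        return 0.0  # assumption: no shared alias anywhere scores 0.0
    return largest / n


bridged = {"A": {"CVE-X"}, "B": {"CVE-X", "GHSA-Y"}, "C": {"GHSA-Y"}}
print(alias_connectivity(bridged))  # 1.0
```

A and C share no alias directly, yet both join B's component, which is the transitive bridging that the v1 intersection could not express.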

alias_authority (weight: 0.10)

Scope-based weighting using existing canonical key prefixes:

| Alias Type | Authority Score |
|---|---|
| CVE-* (global) | 1.0 |
| GHSA-* (ecosystem) | 0.8 |
| Vendor IDs (RHSA, MSRC, CISCO, VMSA) | 0.6 |
| Distribution IDs (DSA, USN, SUSE) | 0.4 |
| Unknown scheme | 0.2 |
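A prefix lookup is one way to express this table. In the sketch below the exact prefix strings (for example `RHSA-` or `CISCO-`) and both function names are assumptions; the real canonical key prefixes may differ. Taking the maximum across aliases mirrors `maxAuthorityScore` in the pseudocode.

```python
# (prefixes, score) pairs taken from the table above; prefix spellings are assumed.
AUTHORITY_BY_PREFIX = [
    (("CVE-",), 1.0),
    (("GHSA-",), 0.8),
    (("RHSA-", "MSRC-", "CISCO-", "VMSA-"), 0.6),
    (("DSA-", "USN-", "SUSE-"), 0.4),
]
UNKNOWN_SCHEME_SCORE = 0.2


def authority_score(alias: str) -> float:
    """Score a single alias by its scheme prefix (case-insensitive)."""
    upper = alias.upper()
    for prefixes, score in AUTHORITY_BY_PREFIX:
        if upper.startswith(prefixes):
            return score
    return UNKNOWN_SCHEME_SCORE


def max_authority_score(aliases: list[str]) -> float:
    """Highest authority among the aliases; 0.0 for an empty list (assumption)."""
    return max((authority_score(a) for a in aliases), default=0.0)


print(max_authority_score(["USN-1234-1", "GHSA-abcd-efgh-ijkl"]))  # 0.8
```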

package_coverage (weight: 0.20)

Pairwise + IDF weighting replacing intersection-across-all.

  1. Extract package keys (PURL without version) from each observation
  2. For each package key, compute IDF weight: log(N / (1 + df)) where N = corpus size, df = observations containing package
  3. Score = weighted overlap ratio across pairs

| Scenario | Score |
|---|---|
| All sources share same rare package | ~1.0 |
| All sources share common package (lodash) | ~0.6 |
| One "thin" source with no packages | Other sources still score > 0 |
| No package overlap | 0.0 |

IDF fallback: When cache unavailable, uniform weights (1.0) are used.
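Steps 1 and 2 and the fallback can be sketched as follows. The helper names are hypothetical, and dropping PURL qualifiers and subpaths from the package key is an assumption. The pairwise scoring of step 3 is not reproduced because the document does not specify its normalization. Note that the formula as written yields zero or negative weights once a package appears in nearly every observation (df >= N - 1).

```python
import math
from typing import Optional


def package_key(purl: str) -> str:
    """PURL without version; qualifiers and subpath are also dropped (assumption)."""
    base = purl.split("#", 1)[0].split("?", 1)[0]
    name, sep, _version = base.rpartition("@")
    return name if sep else base


def idf_weight(corpus_size: int, doc_freq: int) -> float:
    """log(N / (1 + df)), exactly as given in step 2; requires corpus_size > 0."""
    return math.log(corpus_size / (1 + doc_freq))


def package_weights(
    keys: list[str],
    corpus_size: int,
    doc_freqs: Optional[dict[str, int]] = None,
) -> dict[str, float]:
    """IDF weight per package key, or uniform 1.0 when no IDF data is available."""
    if doc_freqs is None:  # models the "cache unavailable" fallback
        return {key: 1.0 for key in keys}
    return {key: idf_weight(corpus_size, doc_freqs.get(key, 0)) for key in keys}


keys = [package_key("pkg:npm/lodash@4.17.21"), package_key("pkg:npm/rare-lib@1.0.0")]
freqs = {"pkg:npm/lodash": 499, "pkg:npm/rare-lib": 1}  # hypothetical frequencies
weights = package_weights(keys, corpus_size=1000, doc_freqs=freqs)
print(weights["pkg:npm/rare-lib"] > weights["pkg:npm/lodash"])  # True
```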

version_compatibility (weight: 0.10)

Classifies version relationships per shared package:

| Relation | Score | Conflict |
|---|---|---|
| Equivalent: ranges normalize identically | 1.0 | None |
| Overlapping: non-empty intersection | 0.6 | Soft (affected-range-divergence) |
| Disjoint: no intersection | 0.0 | Hard (disjoint-version-ranges) |
| Unknown: parse failure | 0.5 | None |

Uses SemanticVersionRangeResolver for semver; delegates to ecosystem-specific comparers for rpm EVR, dpkg, apk.
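The classification can be illustrated on a deliberately simplified model: each range is a half-open interval `[low, high)` of dotted numeric versions. This is only a sketch under that assumption; it does not reproduce SemanticVersionRangeResolver or the rpm, dpkg, and apk comparers, and the function names are hypothetical.

```python
RELATION_SCORES = {"Equivalent": 1.0, "Overlapping": 0.6, "Disjoint": 0.0, "Unknown": 0.5}


def parse_version(text: str) -> tuple[int, ...]:
    """Parse '4.17.21' into (4, 17, 21); raises ValueError on non-numeric parts."""
    return tuple(int(part) for part in text.split("."))


def classify(range_a: tuple[str, str], range_b: tuple[str, str]) -> str:
    """Classify two (low_inclusive, high_exclusive) version ranges."""
    try:
        a_lo, a_hi = (parse_version(v) for v in range_a)
        b_lo, b_hi = (parse_version(v) for v in range_b)
    except ValueError:
        return "Unknown"  # parse failure is neutral, not a conflict
    if (a_lo, a_hi) == (b_lo, b_hi):
        return "Equivalent"
    # Half-open intervals intersect when the larger low is below the smaller high.
    if max(a_lo, b_lo) < min(a_hi, b_hi):
        return "Overlapping"
    return "Disjoint"


relation = classify(("4.0.0", "4.17.21"), ("4.0.0", "4.18.0"))
print(relation, RELATION_SCORES[relation])  # Overlapping 0.6
```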

cpe_match (weight: 0.10)

Unchanged from v1:

  • Exact CPE overlap: 1.0
  • Same vendor/product: 0.5
  • No match: 0.0
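A minimal sketch of the three tiers, assuming CPE 2.3 formatted strings (`cpe:2.3:part:vendor:product:version:...`) and comparing two observations at a time. The function names are hypothetical, and the naive split on `:` ignores escaped colons inside components.

```python
from typing import Optional


def vendor_product(cpe: str) -> Optional[tuple[str, str]]:
    """Extract (vendor, product) from a CPE 2.3 formatted string, else None."""
    parts = cpe.split(":")
    if len(parts) < 5 or parts[0] != "cpe" or parts[1] != "2.3":
        return None
    return parts[3].lower(), parts[4].lower()


def cpe_match(cpes_a: list[str], cpes_b: list[str]) -> float:
    """1.0 on any exact CPE in common, 0.5 on a shared vendor/product, else 0.0."""
    if set(cpes_a) & set(cpes_b):
        return 1.0
    pairs_a = {vendor_product(c) for c in cpes_a} - {None}
    pairs_b = {vendor_product(c) for c in cpes_b} - {None}
    return 0.5 if pairs_a & pairs_b else 0.0


a = ["cpe:2.3:a:lodash:lodash:4.17.20:*:*:*:*:node.js:*:*"]
b = ["cpe:2.3:a:lodash:lodash:4.17.19:*:*:*:*:node.js:*:*"]
print(cpe_match(a, b))  # 0.5
```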

patch_lineage (weight: 0.10)

New signal: correlation via shared fix commits.

  1. Extract patch references from observation references (type: patch, fix, commit)
  2. Normalize to commit SHAs using PatchLineageNormalizer
  3. Any pairwise SHA match: 1.0; otherwise 0.0

Why this matters: "These advisories fix the same code" is high-confidence evidence most platforms lack.
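The sketch below illustrates steps 2 and 3 under a strong simplification: it treats any `/commit/<hex>` URL segment of 7 to 40 hex characters as a commit SHA and compares SHAs by exact string equality, so an abbreviated SHA will not match its full form. The actual rules in PatchLineageNormalizer are not described here and may differ; all names are hypothetical.

```python
import re
from itertools import combinations

# Assumed pattern: a hex run of 7-40 characters directly after "/commit/".
COMMIT_SHA = re.compile(r"/commit/([0-9a-fA-F]{7,40})(?![0-9a-fA-F])")


def commit_shas(urls: list[str]) -> set[str]:
    """Collect lowercase commit SHAs found in a list of patch reference URLs."""
    found = set()
    for url in urls:
        match = COMMIT_SHA.search(url)
        if match:
            found.add(match.group(1).lower())
    return found


def patch_lineage(observations: list[list[str]]) -> float:
    """1.0 if any pair of observations shares a commit SHA, otherwise 0.0."""
    sha_sets = [commit_shas(urls) for urls in observations]
    return 1.0 if any(a & b for a, b in combinations(sha_sets, 2)) else 0.0


sha = "0123456789abcdef0123456789abcdef01234567"  # placeholder SHA
obs_a = [f"https://github.com/example/repo/commit/{sha}"]
obs_b = [f"https://github.com/example/repo/commit/{sha.upper()}"]
print(patch_lineage([obs_a, obs_b]))  # 1.0
```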

reference_overlap (weight: 0.05)

Positive-only (no conflict on zero overlap):

  1. Normalize URLs (lowercase, strip tracking params, normalize the scheme to https://)
  2. Compute max pairwise overlap ratio
  3. Map to score: 0.5 + (overlap * 0.5)

| Scenario | Score |
|---|---|
| 100% URL overlap | 1.0 |
| 50% URL overlap | 0.75 |
| Zero URL overlap | 0.5 (neutral) |

No reference-clash emission for simple disjoint sets.
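The three steps can be sketched with the standard library. Several details are assumptions: only `utm_*` query parameters are treated as tracking parameters, trailing slashes and fragments are dropped, and the overlap ratio is taken to be the Jaccard index of two URL sets. The function names are hypothetical.

```python
from itertools import combinations
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumption: the real list is not given here


def normalize_url(url: str) -> str:
    """Lowercase, drop tracking params and fragment, force the https scheme."""
    parts = urlsplit(url.strip().lower())
    kept = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit(("https", parts.netloc, parts.path.rstrip("/"), urlencode(kept), ""))


def reference_overlap(observations: list[list[str]]) -> float:
    """0.5 + 0.5 * best pairwise overlap; never drops below the neutral 0.5."""
    url_sets = [{normalize_url(u) for u in urls} for urls in observations]
    best = 0.0
    for a, b in combinations(url_sets, 2):
        if a and b:
            best = max(best, len(a & b) / len(a | b))  # Jaccard index (assumed)
    return 0.5 + best * 0.5


obs_a = ["HTTP://Example.com/Advisory/?utm_source=feed", "https://example.com/patch"]
obs_b = ["https://example.com/advisory"]
print(reference_overlap([obs_a, obs_b]))  # 0.75
```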

freshness_score (weight: 0.05)

Unchanged from v1:

  • Spread ≤ 48h: 1.0
  • Spread ≥ 14d: 0.0
  • Linear decay between
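The decay can be written as a short piecewise function. This sketch assumes that "spread" means the gap between the earliest and latest observation timestamps and that at least one timestamp is supplied; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

FULL_SCORE_SPREAD = timedelta(hours=48)
ZERO_SCORE_SPREAD = timedelta(days=14)


def freshness_score(timestamps: list[datetime]) -> float:
    """1.0 up to 48h of spread, 0.0 from 14d, linear in between."""
    spread = max(timestamps) - min(timestamps)
    if spread <= FULL_SCORE_SPREAD:
        return 1.0
    if spread >= ZERO_SCORE_SPREAD:
        return 0.0
    # Dividing two timedeltas yields a float in (0, 1).
    return (ZERO_SCORE_SPREAD - spread) / (ZERO_SCORE_SPREAD - FULL_SCORE_SPREAD)


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(freshness_score([start, start + timedelta(days=8)]))  # 0.5
```

Eight days sits halfway between the 2-day and 14-day thresholds, so the score is 0.5.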

Conflict Emission (Typed Severities)

Severity Levels

| Severity | Penalty Range | Meaning |
|---|---|---|
| Hard | 0.30 - 0.40 | Significant disagreement; likely prevents high-confidence linking |
| Soft | 0.05 - 0.10 | Minor disagreement; link with reduced confidence |
| Info | 0.00 | Informational; no penalty |

Conflict Types and Penalties

| Conflict Reason | Severity | Penalty | Trigger Condition |
|---|---|---|---|
| distinct-cves | Hard | -0.40 | Two different CVE-* identifiers in cluster |
| disjoint-version-ranges | Hard | -0.30 | Same package key, ranges have no intersection |
| alias-inconsistency | Soft | -0.10 | Disconnected alias graph (but no CVE conflict) |
| affected-range-divergence | Soft | -0.05 | Ranges overlap but differ |
| severity-mismatch | Soft | -0.05 | CVSS base score delta > 1.0 |
| reference-clash | Info | 0.00 | Reserved for true contradictions only |
| metadata-gap | Info | 0.00 | Required provenance missing |

Penalty Calculation

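typed_penalty = min(0.6, sum(penalty_per_conflict))

Here penalty_per_conflict is the magnitude of the penalty listed above (for example, 0.40 for distinct-cves), so typed_penalty is non-negative. The sum saturates at 0.6 to prevent complete collapse, and the final confidence is floored at 0.1 when any evidence exists.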
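A small sketch of the saturation, using the penalty magnitudes from the conflict table. The function name is hypothetical, and conflicts are summed one entry per emitted conflict, as in the algorithm pseudocode.

```python
# Penalty magnitudes per conflict reason, taken from the table above.
PENALTIES = {
    "distinct-cves": 0.40,
    "disjoint-version-ranges": 0.30,
    "alias-inconsistency": 0.10,
    "affected-range-divergence": 0.05,
    "severity-mismatch": 0.05,
    "reference-clash": 0.00,
    "metadata-gap": 0.00,
}
PENALTY_CAP = 0.6


def typed_penalty(reasons: list[str]) -> float:
    """Sum the per-conflict penalties and saturate at the cap."""
    return min(PENALTY_CAP, sum(PENALTIES[reason] for reason in reasons))


# Two Hard conflicts would sum to 0.70, so the cap applies.
print(typed_penalty(["distinct-cves", "disjoint-version-ranges"]))  # 0.6
```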

Conflict Record Shape

{
  "field": "aliases",
  "reason": "distinct-cves",
  "severity": "Hard",
  "values": ["nvd:CVE-2025-1234", "ghsa:CVE-2025-5678"],
  "sourceIds": ["nvd", "ghsa"]
}

Linkset Output Shape

Additions relative to v1:

{
  "key": {
    "vulnerabilityId": "CVE-2025-1234",
    "productKey": "pkg:npm/lodash",
    "confidence": 0.85
  },
  "conflicts": [
    {
      "field": "affected.versions[pkg:npm/lodash]",
      "reason": "affected-range-divergence",
      "severity": "Soft",
      "values": ["nvd:>=4.0.0,<4.17.21", "ghsa:>=4.0.0,<4.18.0"],
      "sourceIds": ["nvd", "ghsa"]
    }
  ],
  "signalScores": {
    "aliasConnectivity": 1.0,
    "aliasAuthority": 1.0,
    "packageCoverage": 0.85,
    "versionCompatibility": 0.6,
    "cpeMatch": 0.5,
    "patchLineage": 1.0,
    "referenceOverlap": 0.75,
    "freshness": 1.0
  },
  "provenance": {
    "observationHashes": ["sha256:abc...", "sha256:def..."],
    "toolVersion": "concelier/2.0.0",
    "correlationVersion": "v2"
  }
}

Algorithm Pseudocode

function Compute(observations):
    if observations.empty:
        return (confidence=1.0, conflicts=[])

    conflicts = []

    # 1. Alias connectivity (graph-based)
    aliasGraph = buildBipartiteGraph(observations)
    aliasConnectivity = LCC(aliasGraph) / observations.count
    if hasDistinctCVEs(aliasGraph):
        conflicts.add(HardConflict("distinct-cves"))
    elif aliasConnectivity < 1.0:
        conflicts.add(SoftConflict("alias-inconsistency"))

    # 2. Alias authority
    aliasAuthority = maxAuthorityScore(observations)

    # 3. Package coverage (pairwise + IDF)
    packageCoverage = computeIDFWeightedCoverage(observations)

    # 4. Version compatibility
    for sharedPackage in findSharedPackages(observations):
        relation = classifyVersionRelation(observations, sharedPackage)
        if relation == Disjoint:
            conflicts.add(HardConflict("disjoint-version-ranges"))
        elif relation == Overlapping:
            conflicts.add(SoftConflict("affected-range-divergence"))
    versionScore = averageRelationScore(observations)

    # 5. CPE match
    cpeScore = computeCpeOverlap(observations)

    # 6. Patch lineage
    patchScore = 1.0 if anyPairSharesCommitSHA(observations) else 0.0

    # 7. Reference overlap (positive-only)
    referenceScore = 0.5 + (maxPairwiseURLOverlap(observations) * 0.5)

    # 8. Freshness
    freshnessScore = computeFreshness(observations)

    # Calculate weighted sum
    baseConfidence = (
        0.30 * aliasConnectivity +
        0.10 * aliasAuthority +
        0.20 * packageCoverage +
        0.10 * versionScore +
        0.10 * cpeScore +
        0.10 * patchScore +
        0.05 * referenceScore +
        0.05 * freshnessScore
    )

    # Apply typed penalties
    penalty = min(0.6, sum(conflict.penalty for conflict in conflicts))
    finalConfidence = max(0.1, baseConfidence - penalty)

    return (confidence=finalConfidence, conflicts=dedupe(conflicts))

Implementation

Code Locations

| Component | Path |
|---|---|
| V2 Algorithm | src/Concelier/__Libraries/StellaOps.Concelier.Core/Linksets/LinksetCorrelationV2.cs |
| Conflict Model | src/Concelier/__Libraries/StellaOps.Concelier.Core/Linksets/AdvisoryLinkset.cs |
| Patch Normalizer | src/Concelier/__Libraries/StellaOps.Concelier.Merge/Identity/Normalizers/PatchLineageNormalizer.cs |
| Version Resolver | src/Concelier/__Libraries/StellaOps.Concelier.Merge/Comparers/SemanticVersionRangeResolver.cs |

Configuration

concelier:
  correlation:
    version: v2  # v1 | v2
    weights:
      aliasConnectivity: 0.30
      aliasAuthority: 0.10
      packageCoverage: 0.20
      versionCompatibility: 0.10
      cpeMatch: 0.10
      patchLineage: 0.10
      referenceOverlap: 0.05
      freshness: 0.05
    idf:
      enabled: true
      cacheKey: "concelier:package:idf"
      refreshIntervalMinutes: 60
    textSimilarity:
      enabled: false  # Phase 3

Telemetry

| Instrument | Type | Tags | Purpose |
|---|---|---|---|
| concelier.linkset.confidence | Histogram | version | Confidence score distribution |
| concelier.linkset.conflicts_total | Counter | reason, severity | Conflict counts by type |
| concelier.linkset.signal_score | Histogram | signal | Per-signal score distribution |
| concelier.linkset.patch_lineage_hits | Counter | - | Patch SHA matches found |
| concelier.linkset.idf_cache_hit | Counter | hit | IDF cache effectiveness |

Migration

Recompute Job

stella db linksets recompute --correlation-version v2 --batch-size 1000

Recomputes confidence for existing linksets using the v2 algorithm. Does not modify observation data.

Rollback

Set concelier:correlation:version to v1 to revert to intersection-based scoring.


Fixtures

  • docs/modules/concelier/samples/linkset-v2-transitive-bridge.json: Three-source transitive bridging (A↔B↔C) demonstrating graph connectivity.
  • docs/modules/concelier/samples/linkset-v2-patch-match.json: Two-source correlation via shared commit SHA.
  • docs/modules/concelier/samples/linkset-v2-hard-conflict.json: Distinct CVEs in cluster triggering hard penalty.

All fixtures use ASCII ordering and ISO-8601 UTC timestamps.


Change Control

  • V2 is add-only relative to v1 output schema.
  • Signal weight adjustments require a sprint note but not a schema version bump.
  • New conflict reasons require an advisory.linkset.updated@2 event schema and a doc update.
  • Removal of a signal requires deprecation period and migration guidance.