git.stella-ops.org/docs/signals/unknowns-ranking.md
master 5a480a3c2a
2025-12-16 10:44:24 +02:00


Unknowns Ranking Algorithm Reference

This document describes the multi-factor scoring algorithm used to rank and triage unknowns in the StellaOps Signals module.

Purpose

When reachability analysis encounters unresolved symbols, edges, or package identities, these are recorded as unknowns. The ranking algorithm prioritizes unknowns by computing a composite score from five factors, then assigns each to a triage band (HOT/WARM/COLD) that determines rescan scheduling and escalation policies.

Scoring Formula

The composite score is computed as:

Score = wP × P + wE × E + wU × U + wC × C + wS × S

Where:

  • P = Popularity (deployment impact)
  • E = Exploit potential (CVE severity)
  • U = Uncertainty density (flag accumulation)
  • C = Centrality (graph position importance)
  • S = Staleness (evidence age)

All factors are normalized to [0.0, 1.0] before weighting. The final score is clamped to [0.0, 1.0].

Default Weights

| Factor | Weight | Description |
|--------|--------|-------------|
| wP | 0.25 | Popularity weight |
| wE | 0.25 | Exploit potential weight |
| wU | 0.25 | Uncertainty density weight |
| wC | 0.15 | Centrality weight |
| wS | 0.10 | Staleness weight |

Weights must sum to 1.0 and are configurable via Signals:UnknownsScoring settings.
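
As a minimal sketch of the formula above (not the StellaOps implementation; the function and dictionary names are illustrative assumptions), the weighted sum and both clamping steps could be written in Python as:

```python
from math import isclose

# Default weights from the table above, keyed by factor letter.
DEFAULT_WEIGHTS = {"P": 0.25, "E": 0.25, "U": 0.25, "C": 0.15, "S": 0.10}


def clamp01(value: float) -> float:
    """Clamp a value to the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def composite_score(factors: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the five normalized factors, clamped to [0, 1]."""
    if not isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    # Each factor is clamped before weighting, then the total is clamped again.
    total = sum(w * clamp01(factors[name]) for name, w in weights.items())
    return clamp01(total)
```

Rejecting weight sets that do not sum to 1.0 is one way to enforce the constraint stated above; how the real configuration loader reacts to invalid weights is not specified here.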

Factor Details

Factor P: Popularity (Deployment Impact)

Measures how widely the unknown's package is deployed across monitored environments.

Formula:

P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))

Parameters:

  • deploymentCount: Number of deployments referencing the package (from deploy_refs table)
  • maxDeployments: Normalization ceiling (default: 100)

Rationale: Logarithmic scaling prevents a single highly-deployed package from dominating scores while still prioritizing widely-used dependencies.
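
The popularity formula can be sketched as follows. The function name is hypothetical, and treating negative counts as zero is an assumption made only to keep the example total:

```python
import math


def popularity(deployment_count: int, max_deployments: int = 100) -> float:
    """Log-scaled deployment impact, normalized to [0, 1]."""
    if max_deployments <= 0:
        raise ValueError("max_deployments must be positive")
    count = max(0, deployment_count)  # assumption: negative counts count as zero
    return min(1.0, math.log10(1 + count) / math.log10(1 + max_deployments))
```

With the default ceiling of 100, `popularity(10)` is roughly 0.52: a package at one tenth of the ceiling already earns about half of the maximum factor value, which is the flattening effect described in the rationale.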

Factor E: Exploit Potential (CVE Severity)

Estimates the consequence severity if the unknown resolves to a vulnerable component.

Current Implementation:

  • Returns 0.5 (medium potential) when no CVE association exists
  • Future: Integrate KEV lookup, EPSS scores, and exploit database references

Planned Enhancements:

  • CVE severity mapping (Critical=1.0, High=0.8, Medium=0.5, Low=0.2)
  • KEV (Known Exploited Vulnerabilities) flag boost
  • EPSS (Exploit Prediction Scoring System) integration
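
A hedged sketch of how the planned severity mapping might combine with the current 0.5 default. This illustrates the list above and is not shipped behaviour; the function name and the fallback for unrecognized severity labels are assumptions, and the KEV and EPSS adjustments are left out because their magnitudes are not specified:

```python
from typing import Optional

# Planned severity mapping from the list above (not yet implemented).
SEVERITY_SCORES = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}
NO_CVE_DEFAULT = 0.5  # current behaviour when no CVE association exists


def exploit_potential(severity: Optional[str]) -> float:
    """Map a CVE severity label to factor E, falling back to the 0.5 default."""
    if severity is None:
        return NO_CVE_DEFAULT
    # Assumption: unknown labels are treated like "no CVE association".
    return SEVERITY_SCORES.get(severity.lower(), NO_CVE_DEFAULT)
```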

Factor U: Uncertainty Density (Flag Accumulation)

Aggregates uncertainty signals from multiple sources. Each flag contributes a weighted penalty.

Flag Weights:

| Flag | Weight | Description |
|------|--------|-------------|
| NoProvenanceAnchor | 0.30 | Cannot verify package source |
| VersionRange | 0.25 | Version specified as range, not exact |
| DynamicCallTarget | 0.25 | Reflection, eval, or dynamic dispatch |
| ConflictingFeeds | 0.20 | Contradictory info from different feeds |
| ExternalAssembly | 0.20 | Assembly outside analysis scope |
| MissingVector | 0.15 | No CVSS vector for severity assessment |
| UnreachableSourceAdvisory | 0.10 | Source advisory URL unreachable |

Formula:

U = min(1.0, sum(activeFlags × flagWeight))

Example:

  • NoProvenanceAnchor (0.30) + VersionRange (0.25) + MissingVector (0.15) = 0.70
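
The flag accumulation can be sketched as a capped sum over the active flags. The weights are the defaults from the table above; the function name and the choice to count duplicate flags once are assumptions:

```python
# Default flag weights from the table above.
FLAG_WEIGHTS = {
    "NoProvenanceAnchor": 0.30,
    "VersionRange": 0.25,
    "DynamicCallTarget": 0.25,
    "ConflictingFeeds": 0.20,
    "ExternalAssembly": 0.20,
    "MissingVector": 0.15,
    "UnreachableSourceAdvisory": 0.10,
}


def uncertainty(active_flags, flag_weights=FLAG_WEIGHTS) -> float:
    """Sum the weights of the active flags, capped at 1.0.

    Duplicate flags count once; an unrecognized flag raises KeyError.
    """
    return min(1.0, sum(flag_weights[flag] for flag in set(active_flags)))
```

The default weights add up to 1.45, so the cap at 1.0 only takes effect when most flags are set at once.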

Factor C: Centrality (Graph Position Importance)

Measures the unknown's position importance in the call graph using betweenness centrality.

Formula:

C = min(1.0, betweenness / maxBetweenness)

Parameters:

  • betweenness: Raw betweenness centrality from graph analysis
  • maxBetweenness: Normalization ceiling (default: 1000)

Rationale: High-betweenness nodes appear on many shortest paths, meaning they're likely to be reached regardless of entry point.

Related Metrics:

  • DegreeCentrality: Number of incoming + outgoing edges (stored but not used in score)
  • BetweennessCentrality: Raw betweenness value (stored for debugging)
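
For completeness, the centrality normalization is a plain ratio with a cap. In this sketch the function name is hypothetical and negative inputs are treated as zero by assumption:

```python
def centrality(betweenness: float, max_betweenness: float = 1000.0) -> float:
    """Normalize raw betweenness centrality to [0, 1] against a fixed ceiling."""
    if max_betweenness <= 0:
        raise ValueError("max_betweenness must be positive")
    return min(1.0, max(0.0, betweenness) / max_betweenness)
```

A raw betweenness of 250 maps to 0.25 under the default ceiling, and any value at or above 1000 saturates at 1.0.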

Factor S: Staleness (Evidence Age)

Measures how much time has passed since the last successful analysis of the unknown.

Formula:

S = min(1.0, daysSinceLastAnalysis / maxDays)

Optional exponential variant, which approaches 1.0 asymptotically instead of reaching it at maxDays:

S = 1 - exp(-daysSinceLastAnalysis / tau)

Parameters:

  • daysSinceLastAnalysis: Days since LastAnalyzedAt timestamp
  • maxDays: Staleness ceiling (default: 14 days)
  • tau: Decay constant for exponential model (default: 14)

Special Cases:

  • Never analyzed (LastAnalyzedAt is null): S = 1.0 (maximum staleness)
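
Both staleness models and the never-analyzed case can be sketched in one function. The signature is an assumption; in particular, using whole elapsed days (matching the integer days_since_last_analysis column) and selecting the exponential model by passing tau are choices made for this example only:

```python
import math
from datetime import datetime, timezone
from typing import Optional


def staleness(last_analyzed_at: Optional[datetime], now: datetime,
              max_days: int = 14, tau: Optional[float] = None) -> float:
    """Evidence age normalized to [0, 1]; linear unless tau is given."""
    if last_analyzed_at is None:
        return 1.0  # never analyzed: maximum staleness
    days = max(0, (now - last_analyzed_at).days)  # whole days elapsed
    if tau is not None:
        return 1.0 - math.exp(-days / tau)  # optional exponential model
    return min(1.0, days / max_days)


now = datetime(2025, 12, 15, tzinfo=timezone.utc)
week_ago = datetime(2025, 12, 8, tzinfo=timezone.utc)
print(staleness(week_ago, now))          # linear model: 7 / 14 = 0.5
print(staleness(week_ago, now, tau=14))  # exponential model: 1 - exp(-0.5)
```

Passing `now` explicitly rather than reading the system clock keeps the function repeatable for a given pair of timestamps.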

Band Assignment

Based on the composite score, unknowns are assigned to triage bands:

| Band | Threshold | Rescan Policy | Description |
|------|-----------|---------------|-------------|
| HOT | Score >= 0.70 | 15 minutes | Immediate rescan + VEX escalation |
| WARM | 0.40 <= Score < 0.70 | 24 hours | Scheduled rescan within 12-72h |
| COLD | Score < 0.40 | 7 days | Weekly batch processing |

Thresholds are configurable:

Signals:
  UnknownsScoring:
    HotThreshold: 0.70
    WarmThreshold: 0.40
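
The threshold comparison can be sketched as below. The function and constant names are illustrative, and the ordering check on the thresholds is an assumption about sensible validation rather than documented behaviour:

```python
from datetime import timedelta

# Default rescan policy per band, taken from the band table.
RESCAN_INTERVAL = {
    "HOT": timedelta(minutes=15),
    "WARM": timedelta(hours=24),
    "COLD": timedelta(days=7),
}


def assign_band(score: float, hot_threshold: float = 0.70,
                warm_threshold: float = 0.40) -> str:
    """Map a composite score to HOT, WARM, or COLD (lower bounds inclusive)."""
    if not warm_threshold < hot_threshold:
        raise ValueError("WarmThreshold must be below HotThreshold")
    if score >= hot_threshold:
        return "HOT"
    if score >= warm_threshold:
        return "WARM"
    return "COLD"
```

A score of exactly 0.70 lands in HOT and exactly 0.40 lands in WARM, since both thresholds are inclusive lower bounds.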

Scheduler Integration

The UnknownsRescanWorker processes unknowns based on their band:

HOT Band Processing

  • Poll interval: 1 minute
  • Batch size: 10 items
  • Action: Trigger immediate rescan via IRescanOrchestrator
  • On failure: Exponential backoff, max 3 retries before demotion to WARM

WARM Band Processing

  • Poll interval: 5 minutes
  • Batch size: 50 items
  • Scheduled window: 12-72 hours based on score within band
  • On failure: Increment RescanAttempts, re-queue with delay

COLD Band Processing

  • Schedule: Weekly on configurable day (default: Sunday)
  • Batch size: 500 items
  • Action: Batch rescan job submission
  • On failure: Log and retry next week
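
The WARM window is described as 12-72 hours "based on score within band", but the exact mapping is not given. One plausible reading, shown here purely as an assumption, is linear interpolation in which a score near the HOT threshold gets the shortest delay:

```python
def warm_rescan_delay_hours(score: float, warm_threshold: float = 0.40,
                            hot_threshold: float = 0.70,
                            min_hours: float = 12.0,
                            max_hours: float = 72.0) -> float:
    """Hypothetical mapping: interpolate the rescan delay across the WARM band."""
    if not warm_threshold <= score < hot_threshold:
        raise ValueError("score is outside the WARM band")
    # position is 0.0 at the bottom of the band and approaches 1.0 at the top.
    position = (score - warm_threshold) / (hot_threshold - warm_threshold)
    return max_hours - position * (max_hours - min_hours)
```

Under this reading, a score at the bottom of the band (0.40) waits the full 72 hours and a mid-band score of 0.55 waits about 42 hours.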

Normalization Trace

Each scored unknown includes a NormalizationTrace for debugging and replay:

{
  "rawPopularity": 42,
  "normalizedPopularity": 0.81,
  "popularityFormula": "min(1, log10(1 + 42) / log10(1 + 100))",

  "rawExploitPotential": 0.5,
  "normalizedExploitPotential": 0.5,

  "rawUncertainty": 0.55,
  "normalizedUncertainty": 0.55,
  "activeFlags": ["NoProvenanceAnchor", "VersionRange"],

  "rawCentrality": 250.0,
  "normalizedCentrality": 0.25,

  "rawStaleness": 7,
  "normalizedStaleness": 0.5,

  "weights": {
    "wP": 0.25,
    "wE": 0.25,
    "wU": 0.25,
    "wC": 0.15,
    "wS": 0.10
  },
  "finalScore": 0.55,
  "assignedBand": "Warm",
  "computedAt": "2025-12-15T10:00:00Z"
}

Replay Capability: Given the trace, the score can be recomputed from the normalized values and weights:

Score = 0.25×0.81 + 0.25×0.5 + 0.25×0.55 + 0.15×0.25 + 0.10×0.5
      = 0.2025 + 0.125 + 0.1375 + 0.0375 + 0.05
      = 0.5525 ≈ 0.55
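
A replay check could look like the following sketch. It assumes the trace is available as a parsed dictionary with the field names shown in the example above, and that finalScore is stored rounded to two decimals (hence the 0.01 tolerance); both function names are hypothetical:

```python
def replay_score(trace: dict) -> float:
    """Recompute the composite score from a normalization trace."""
    w = trace["weights"]
    total = (
        w["wP"] * trace["normalizedPopularity"]
        + w["wE"] * trace["normalizedExploitPotential"]
        + w["wU"] * trace["normalizedUncertainty"]
        + w["wC"] * trace["normalizedCentrality"]
        + w["wS"] * trace["normalizedStaleness"]
    )
    return max(0.0, min(1.0, total))  # same clamp as the scoring formula


def trace_is_consistent(trace: dict, tolerance: float = 0.01) -> bool:
    """True when the stored finalScore matches the replayed score."""
    return abs(replay_score(trace) - trace["finalScore"]) <= tolerance
```

Because the trace carries its own weights, the check stays valid even after the configured weights have changed.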

API Endpoints

Query Unknowns by Band

GET /api/signals/unknowns?band=hot&limit=50&offset=0

Response:

{
  "items": [
    {
      "id": "unk-123",
      "subjectKey": "myapp|1.0.0",
      "purl": "pkg:npm/lodash@4.17.21",
      "score": 0.82,
      "band": "Hot",
      "flags": { "noProvenanceAnchor": true, "versionRange": true },
      "nextScheduledRescan": "2025-12-15T10:15:00Z"
    }
  ],
  "total": 15,
  "hasMore": false
}
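
A client that walks every page of a band might be sketched like this. `fetch_page` is a hypothetical callable that wraps the GET request above and returns the parsed JSON body; the assumption that `offset` counts items (not pages) is inferred from the query parameters and not confirmed here:

```python
from typing import Callable, Iterator


def iter_unknowns(fetch_page: Callable[[str, int, int], dict],
                  band: str, limit: int = 50) -> Iterator[dict]:
    """Yield every unknown in a band, following hasMore across pages."""
    offset = 0
    while True:
        page = fetch_page(band, limit, offset)
        items = page["items"]
        yield from items
        # Stop on the last page, or defensively if a page comes back empty.
        if not page.get("hasMore") or not items:
            break
        offset += len(items)
```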

Get Score Explanation

GET /api/signals/unknowns/{id}/explain

Response:

{
  "unknown": { /* full UnknownSymbolDocument */ },
  "normalizationTrace": { /* trace object */ },
  "factorBreakdown": {
    "popularity": { "raw": 42, "normalized": 0.81, "weighted": 0.2025 },
    "exploitPotential": { "raw": 0.5, "normalized": 0.5, "weighted": 0.125 },
    "uncertainty": { "raw": 0.55, "normalized": 0.55, "weighted": 0.1375 },
    "centrality": { "raw": 250, "normalized": 0.25, "weighted": 0.0375 },
    "staleness": { "raw": 7, "normalized": 0.5, "weighted": 0.05 }
  },
  "bandThresholds": { "hot": 0.70, "warm": 0.40 }
}

Configuration Reference

Signals:
  UnknownsScoring:
    # Factor weights (must sum to 1.0)
    WeightPopularity: 0.25
    WeightExploitPotential: 0.25
    WeightUncertainty: 0.25
    WeightCentrality: 0.15
    WeightStaleness: 0.10

    # Popularity normalization
    PopularityMaxDeployments: 100

    # Uncertainty flag weights
    FlagWeightNoProvenance: 0.30
    FlagWeightVersionRange: 0.25
    FlagWeightConflictingFeeds: 0.20
    FlagWeightMissingVector: 0.15
    FlagWeightUnreachableSource: 0.10
    FlagWeightDynamicTarget: 0.25
    FlagWeightExternalAssembly: 0.20

    # Centrality normalization
    CentralityMaxBetweenness: 1000.0

    # Staleness normalization
    StalenessMaxDays: 14
    StalenessTau: 14  # For exponential decay

    # Band thresholds
    HotThreshold: 0.70
    WarmThreshold: 0.40

    # Rescan scheduling
    HotRescanMinutes: 15
    WarmRescanHours: 24
    ColdRescanDays: 7

  UnknownsDecay:
    # Nightly batch decay
    BatchEnabled: true
    MaxSubjectsPerBatch: 1000
    ColdBatchDay: Sunday

Determinism Requirements

The scoring algorithm is fully deterministic:

  1. Same inputs produce identical scores - Given identical UnknownSymbolDocument, deployment counts, and graph metrics, the score will always be the same
  2. Normalization trace enables replay - The trace contains all raw values and weights needed to reproduce the score
  3. Timestamps use UTC ISO 8601 - All ComputedAt, LastAnalyzedAt, and NextScheduledRescan timestamps are UTC
  4. Weights logged per computation - The trace includes the exact weights used, allowing audit of configuration changes

Database Schema

-- Unknowns table (enhanced)
CREATE TABLE signals.unknowns (
    id UUID PRIMARY KEY,
    subject_key TEXT NOT NULL,
    purl TEXT,
    symbol_id TEXT,
    callgraph_id TEXT,

    -- Scoring factors
    popularity_score FLOAT DEFAULT 0,
    deployment_count INT DEFAULT 0,
    exploit_potential_score FLOAT DEFAULT 0,
    uncertainty_score FLOAT DEFAULT 0,
    centrality_score FLOAT DEFAULT 0,
    degree_centrality INT DEFAULT 0,
    betweenness_centrality FLOAT DEFAULT 0,
    staleness_score FLOAT DEFAULT 0,
    days_since_last_analysis INT DEFAULT 0,

    -- Composite score and band
    score FLOAT DEFAULT 0,
    band TEXT DEFAULT 'cold' CHECK (band IN ('hot', 'warm', 'cold')),

    -- Metadata
    flags JSONB DEFAULT '{}',
    normalization_trace JSONB,
    rescan_attempts INT DEFAULT 0,
    last_rescan_result TEXT,

    -- Timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_analyzed_at TIMESTAMPTZ,
    next_scheduled_rescan TIMESTAMPTZ
);

-- Indexes for band-based queries
CREATE INDEX idx_unknowns_band ON signals.unknowns(band);
CREATE INDEX idx_unknowns_score ON signals.unknowns(score DESC);
CREATE INDEX idx_unknowns_next_rescan ON signals.unknowns(next_scheduled_rescan)
    WHERE next_scheduled_rescan IS NOT NULL;
CREATE INDEX idx_unknowns_subject ON signals.unknowns(subject_key);

Metrics and Observability

The following metrics are exposed for monitoring:

| Metric | Type | Description |
|--------|------|-------------|
| signals_unknowns_total | Gauge | Total unknowns by band |
| signals_unknowns_rescans_total | Counter | Rescans triggered by band |
| signals_unknowns_scoring_duration_seconds | Histogram | Scoring computation time |
| signals_unknowns_band_transitions_total | Counter | Band changes (e.g., WARM->HOT) |