# Unknowns Ranking Algorithm Reference

This document describes the multi-factor scoring algorithm used to rank and triage unknowns in the StellaOps Signals module.

## Purpose

When reachability analysis encounters unresolved symbols, edges, or package identities, these are recorded as **unknowns**. The ranking algorithm prioritizes unknowns by computing a composite score from five factors, then assigns each to a triage band (HOT/WARM/COLD) that determines rescan scheduling and escalation policies.

## Scoring Formula

The composite score is computed as:

```
Score = wP × P + wE × E + wU × U + wC × C + wS × S
```

Where:

- **P** = Popularity (deployment impact)
- **E** = Exploit potential (CVE severity)
- **U** = Uncertainty density (flag accumulation)
- **C** = Centrality (graph position importance)
- **S** = Staleness (evidence age)

All factors are normalized to [0.0, 1.0] before weighting. The final score is clamped to [0.0, 1.0].

### Default Weights

| Factor | Weight | Description |
|--------|--------|-------------|
| wP | 0.25 | Popularity weight |
| wE | 0.25 | Exploit potential weight |
| wU | 0.25 | Uncertainty density weight |
| wC | 0.15 | Centrality weight |
| wS | 0.10 | Staleness weight |

Weights must sum to 1.0 and are configurable via `Signals:UnknownsScoring` settings.

## Factor Details

### Factor P: Popularity (Deployment Impact)

Measures how widely the unknown's package is deployed across monitored environments.

**Formula:**

```
P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))
```

**Parameters:**

- `deploymentCount`: Number of deployments referencing the package (from `deploy_refs` table)
- `maxDeployments`: Normalization ceiling (default: 100)

**Rationale:** Logarithmic scaling prevents a single highly-deployed package from dominating scores while still prioritizing widely-used dependencies.

### Factor E: Exploit Potential (CVE Severity)

Estimates the consequence severity if the unknown resolves to a vulnerable component.


**Current Implementation:**

- Returns 0.5 (medium potential) when no CVE association exists
- Future: Integrate KEV lookup, EPSS scores, and exploit database references

**Planned Enhancements:**

- CVE severity mapping (Critical=1.0, High=0.8, Medium=0.5, Low=0.2)
- KEV (Known Exploited Vulnerabilities) flag boost
- EPSS (Exploit Prediction Scoring System) integration

### Factor U: Uncertainty Density (Flag Accumulation)

Aggregates uncertainty signals from multiple sources. Each flag contributes a weighted penalty.

**Flag Weights:**

| Flag | Weight | Description |
|------|--------|-------------|
| `NoProvenanceAnchor` | 0.30 | Cannot verify package source |
| `VersionRange` | 0.25 | Version specified as range, not exact |
| `DynamicCallTarget` | 0.25 | Reflection, eval, or dynamic dispatch |
| `ConflictingFeeds` | 0.20 | Contradictory info from different feeds |
| `ExternalAssembly` | 0.20 | Assembly outside analysis scope |
| `MissingVector` | 0.15 | No CVSS vector for severity assessment |
| `UnreachableSourceAdvisory` | 0.10 | Source advisory URL unreachable |

**Formula:**

```
U = min(1.0, sum(activeFlags × flagWeight))
```

**Example:**

- NoProvenanceAnchor (0.30) + VersionRange (0.25) + MissingVector (0.15) = 0.70

### Factor C: Centrality (Graph Position Importance)

Measures the unknown's position importance in the call graph using betweenness centrality.

**Formula:**

```
C = min(1.0, betweenness / maxBetweenness)
```

**Parameters:**

- `betweenness`: Raw betweenness centrality from graph analysis
- `maxBetweenness`: Normalization ceiling (default: 1000)

**Rationale:** High-betweenness nodes appear on many shortest paths, meaning they're likely to be reached regardless of entry point.

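
The normalization formulas for P, U, and C given so far can be sketched in a few lines of Python. This is a minimal illustration only, not the Signals module's implementation: the function names and the `FLAG_WEIGHTS` dictionary are hypothetical, the defaults are copied from the tables and parameter lists above, and inputs are assumed to be non-negative.

```python
import math
from typing import Iterable

# Default flag weights copied from the Flag Weights table (hypothetical constant name).
FLAG_WEIGHTS = {
    "NoProvenanceAnchor": 0.30,
    "VersionRange": 0.25,
    "DynamicCallTarget": 0.25,
    "ConflictingFeeds": 0.20,
    "ExternalAssembly": 0.20,
    "MissingVector": 0.15,
    "UnreachableSourceAdvisory": 0.10,
}


def popularity(deployment_count: int, max_deployments: int = 100) -> float:
    """P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))."""
    ratio = math.log10(1 + deployment_count) / math.log10(1 + max_deployments)
    return min(1.0, ratio)


def uncertainty(active_flags: Iterable[str]) -> float:
    """U = min(1.0, sum of the weights of all active flags)."""
    return min(1.0, sum(FLAG_WEIGHTS[flag] for flag in active_flags))


def centrality(betweenness: float, max_betweenness: float = 1000.0) -> float:
    """C = min(1.0, betweenness / maxBetweenness)."""
    return min(1.0, betweenness / max_betweenness)
```

Because the flag weights sum to more than 1.0, the `min` clamp in `uncertainty` matters in practice: an unknown with every flag set saturates at 1.0 rather than exceeding the normalized range.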

**Related Metrics:**

- `DegreeCentrality`: Number of incoming + outgoing edges (stored but not used in score)
- `BetweennessCentrality`: Raw betweenness value (stored for debugging)

### Factor S: Staleness (Evidence Age)

Measures how old the evidence is since the last successful analysis attempt.

**Formula:**

```
S = min(1.0, daysSinceLastAnalysis / maxDays)
```

With exponential decay enhancement (optional):

```
S = 1 - exp(-daysSinceLastAnalysis / tau)
```

**Parameters:**

- `daysSinceLastAnalysis`: Days since `LastAnalyzedAt` timestamp
- `maxDays`: Staleness ceiling (default: 14 days)
- `tau`: Decay constant for exponential model (default: 14)

**Special Cases:**

- Never analyzed (`LastAnalyzedAt` is null): S = 1.0 (maximum staleness)

## Band Assignment

Based on the composite score, unknowns are assigned to triage bands:

| Band | Threshold | Rescan Policy | Description |
|------|-----------|---------------|-------------|
| **HOT** | Score >= 0.70 | 15 minutes | Immediate rescan + VEX escalation |
| **WARM** | 0.40 <= Score < 0.70 | 24 hours | Scheduled rescan within 12-72h |
| **COLD** | Score < 0.40 | 7 days | Weekly batch processing |

Thresholds are configurable:

```yaml
Signals:
  UnknownsScoring:
    HotThreshold: 0.70
    WarmThreshold: 0.40
```

## Scheduler Integration

The `UnknownsRescanWorker` processes unknowns based on their band:

### HOT Band Processing

- Poll interval: 1 minute
- Batch size: 10 items
- Action: Trigger immediate rescan via `IRescanOrchestrator`
- On failure: Exponential backoff, max 3 retries before demotion to WARM

### WARM Band Processing

- Poll interval: 5 minutes
- Batch size: 50 items
- Scheduled window: 12-72 hours based on score within band
- On failure: Increment `RescanAttempts`, re-queue with delay

### COLD Band Processing

- Schedule: Weekly on configurable day (default: Sunday)
- Batch size: 500 items
- Action: Batch rescan job submission
- On failure: Log and retry next week

## Normalization Trace

Each scored unknown includes a
`NormalizationTrace` for debugging and replay:

```json
{
  "rawPopularity": 42,
  "normalizedPopularity": 0.81,
  "popularityFormula": "min(1, log10(1 + 42) / log10(1 + 100))",
  "rawExploitPotential": 0.5,
  "normalizedExploitPotential": 0.5,
  "rawUncertainty": 0.55,
  "normalizedUncertainty": 0.55,
  "activeFlags": ["NoProvenanceAnchor", "VersionRange"],
  "rawCentrality": 250.0,
  "normalizedCentrality": 0.25,
  "rawStaleness": 7,
  "normalizedStaleness": 0.5,
  "weights": {
    "wP": 0.25,
    "wE": 0.25,
    "wU": 0.25,
    "wC": 0.15,
    "wS": 0.10
  },
  "finalScore": 0.55,
  "assignedBand": "Warm",
  "computedAt": "2025-12-15T10:00:00Z"
}
```

**Replay Capability:** Given the trace, the score can be recomputed from the recorded values (shown here rounded to two decimals):

```
Score = 0.25×0.81 + 0.25×0.5 + 0.25×0.55 + 0.15×0.25 + 0.10×0.5
      = 0.2025 + 0.125 + 0.1375 + 0.0375 + 0.05
      = 0.5525 ≈ 0.55
```

## API Endpoints

### Query Unknowns by Band

```
GET /api/signals/unknowns?band=hot&limit=50&offset=0
```

Response:

```json
{
  "items": [
    {
      "id": "unk-123",
      "subjectKey": "myapp|1.0.0",
      "purl": "pkg:npm/lodash@4.17.21",
      "score": 0.82,
      "band": "Hot",
      "flags": {
        "noProvenanceAnchor": true,
        "versionRange": true
      },
      "nextScheduledRescan": "2025-12-15T10:15:00Z"
    }
  ],
  "total": 15,
  "hasMore": false
}
```

### Get Score Explanation

```
GET /api/signals/unknowns/{id}/explain
```

Response:

```json
{
  "unknown": { /* full UnknownSymbolDocument */ },
  "normalizationTrace": { /* trace object */ },
  "factorBreakdown": {
    "popularity": { "raw": 42, "normalized": 0.81, "weighted": 0.2025 },
    "exploitPotential": { "raw": 0.5, "normalized": 0.5, "weighted": 0.125 },
    "uncertainty": { "raw": 0.55, "normalized": 0.55, "weighted": 0.1375 },
    "centrality": { "raw": 250, "normalized": 0.25, "weighted": 0.0375 },
    "staleness": { "raw": 7, "normalized": 0.5, "weighted": 0.05 }
  },
  "bandThresholds": { "hot": 0.70, "warm": 0.40 }
}
```

## Configuration Reference

```yaml
Signals:
  UnknownsScoring:
    # Factor weights (must sum to 1.0)
    WeightPopularity: 0.25
    WeightExploitPotential: 0.25
    WeightUncertainty: 0.25
    WeightCentrality: 0.15
    WeightStaleness: 0.10

    # Popularity normalization
    PopularityMaxDeployments: 100

    # Uncertainty flag weights
    FlagWeightNoProvenance: 0.30
    FlagWeightVersionRange: 0.25
    FlagWeightConflictingFeeds: 0.20
    FlagWeightMissingVector: 0.15
    FlagWeightUnreachableSource: 0.10
    FlagWeightDynamicTarget: 0.25
    FlagWeightExternalAssembly: 0.20

    # Centrality normalization
    CentralityMaxBetweenness: 1000.0

    # Staleness normalization
    StalenessMaxDays: 14
    StalenessTau: 14  # For exponential decay

    # Band thresholds
    HotThreshold: 0.70
    WarmThreshold: 0.40

    # Rescan scheduling
    HotRescanMinutes: 15
    WarmRescanHours: 24
    ColdRescanDays: 7

  UnknownsDecay:
    # Nightly batch decay
    BatchEnabled: true
    MaxSubjectsPerBatch: 1000
    ColdBatchDay: Sunday
```

## Determinism Requirements

The scoring algorithm is fully deterministic:

1. **Same inputs produce identical scores** - Given identical `UnknownSymbolDocument`, deployment counts, and graph metrics, the score will always be the same
2. **Normalization trace enables replay** - The trace contains all raw values and weights needed to reproduce the score
3. **Timestamps use UTC ISO 8601** - All `ComputedAt`, `LastAnalyzedAt`, and `NextScheduledRescan` timestamps are UTC
4. **Weights logged per computation** - The trace includes the exact weights used, allowing audit of configuration changes

## Database Schema

```sql
-- Unknowns table (enhanced)
CREATE TABLE signals.unknowns (
    id UUID PRIMARY KEY,
    subject_key TEXT NOT NULL,
    purl TEXT,
    symbol_id TEXT,
    callgraph_id TEXT,

    -- Scoring factors
    popularity_score FLOAT DEFAULT 0,
    deployment_count INT DEFAULT 0,
    exploit_potential_score FLOAT DEFAULT 0,
    uncertainty_score FLOAT DEFAULT 0,
    centrality_score FLOAT DEFAULT 0,
    degree_centrality INT DEFAULT 0,
    betweenness_centrality FLOAT DEFAULT 0,
    staleness_score FLOAT DEFAULT 0,
    days_since_last_analysis INT DEFAULT 0,

    -- Composite score and band
    score FLOAT DEFAULT 0,
    band TEXT DEFAULT 'cold' CHECK (band IN ('hot', 'warm', 'cold')),

    -- Metadata
    flags JSONB DEFAULT '{}',
    normalization_trace JSONB,
    rescan_attempts INT DEFAULT 0,
    last_rescan_result TEXT,

    -- Timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_analyzed_at TIMESTAMPTZ,
    next_scheduled_rescan TIMESTAMPTZ
);

-- Indexes for band-based queries
CREATE INDEX idx_unknowns_band ON signals.unknowns(band);
CREATE INDEX idx_unknowns_score ON signals.unknowns(score DESC);
CREATE INDEX idx_unknowns_next_rescan ON signals.unknowns(next_scheduled_rescan)
    WHERE next_scheduled_rescan IS NOT NULL;
CREATE INDEX idx_unknowns_subject ON signals.unknowns(subject_key);
```

## Metrics and Observability

The following metrics are exposed for monitoring:

| Metric | Type | Description |
|--------|------|-------------|
| `signals_unknowns_total` | Gauge | Total unknowns by band |
| `signals_unknowns_rescans_total` | Counter | Rescans triggered by band |
| `signals_unknowns_scoring_duration_seconds` | Histogram | Scoring computation time |
| `signals_unknowns_band_transitions_total` | Counter | Band changes (e.g., WARM->HOT) |

## Related Documentation

- [Unknowns Registry](./unknowns-registry.md) - Data model and API for unknowns
- [Reachability Analysis](./reachability.md) - Reachability scoring integration
- [Callgraph Schema](./callgraph-formats.md) - Graph structure for centrality computation