# Unknowns Ranking Algorithm Reference

This document describes the multi-factor scoring algorithm used to rank and triage unknowns in the StellaOps Signals module.

## Purpose

When reachability analysis encounters unresolved symbols, edges, or package identities, these are recorded as **unknowns**. The ranking algorithm prioritizes unknowns by computing a composite score from five factors, then assigns each to a triage band (HOT/WARM/COLD) that determines rescan scheduling and escalation policies.

## Scoring Formula

The composite score is computed as:

```
Score = wP × P + wE × E + wU × U + wC × C + wS × S
```

Where:
- **P** = Popularity (deployment impact)
- **E** = Exploit potential (CVE severity)
- **U** = Uncertainty density (flag accumulation)
- **C** = Centrality (graph position importance)
- **S** = Staleness (evidence age)

All factors are normalized to [0.0, 1.0] before weighting. The final score is clamped to [0.0, 1.0].

### Default Weights

| Factor | Weight | Description |
|--------|--------|-------------|
| wP | 0.25 | Popularity weight |
| wE | 0.25 | Exploit potential weight |
| wU | 0.25 | Uncertainty density weight |
| wC | 0.15 | Centrality weight |
| wS | 0.10 | Staleness weight |

Weights must sum to 1.0 and are configurable via `Signals:UnknownsScoring` settings.
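
The weighted sum can be sketched in a few lines of Python. This is an illustration only, not the Signals implementation: the class and function names (`ScoringWeights`, `composite_score`, `clamp01`) are hypothetical, and rejecting weights that do not sum to 1.0 is one possible way to enforce the rule stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringWeights:
    """Default weights from the table above; field names are illustrative."""
    popularity: float = 0.25
    exploit_potential: float = 0.25
    uncertainty: float = 0.25
    centrality: float = 0.15
    staleness: float = 0.10

    def total(self) -> float:
        return (self.popularity + self.exploit_potential + self.uncertainty
                + self.centrality + self.staleness)


def clamp01(value: float) -> float:
    """Clamp a value to the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def composite_score(p: float, e: float, u: float, c: float, s: float,
                    weights: ScoringWeights = ScoringWeights()) -> float:
    """Score = wP*P + wE*E + wU*U + wC*C + wS*S, clamped to [0, 1]."""
    # Assumption: invalid weight sets are rejected rather than renormalized.
    if abs(weights.total() - 1.0) > 1e-9:
        raise ValueError("factor weights must sum to 1.0")
    raw = (weights.popularity * clamp01(p)
           + weights.exploit_potential * clamp01(e)
           + weights.uncertainty * clamp01(u)
           + weights.centrality * clamp01(c)
           + weights.staleness * clamp01(s))
    return clamp01(raw)
```

Because every factor is clamped before weighting and the weights sum to 1.0, the final clamp only guards against floating-point drift.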

## Factor Details

### Factor P: Popularity (Deployment Impact)

Measures how widely the unknown's package is deployed across monitored environments.

**Formula:**
```
P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))
```

**Parameters:**
- `deploymentCount`: Number of deployments referencing the package (from `deploy_refs` table)
- `maxDeployments`: Normalization ceiling (default: 100)

**Rationale:** Logarithmic scaling prevents a single highly-deployed package from dominating scores while still prioritizing widely-used dependencies.
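
A minimal Python sketch of the popularity formula follows. The function name `popularity` and the handling of non-positive inputs are assumptions made for illustration.

```python
import math


def popularity(deployment_count: int, max_deployments: int = 100) -> float:
    """P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))."""
    if max_deployments <= 0:
        raise ValueError("max_deployments must be positive")
    if deployment_count <= 0:
        # Assumption: a package with no recorded deployments scores 0.
        return 0.0
    return min(1.0, math.log10(1 + deployment_count) / math.log10(1 + max_deployments))
```

With the default ceiling of 100, a package with 10 deployments already scores roughly 0.52, which shows how the logarithm compresses the upper range: going from 10 to 100 deployments adds less than the first 10 did.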

### Factor E: Exploit Potential (CVE Severity)

Estimates how severe the consequences would be if the unknown resolves to a vulnerable component.

**Current Implementation:**
- Returns 0.5 (medium potential) when no CVE association exists

**Planned Enhancements:**
- CVE severity mapping (Critical=1.0, High=0.8, Medium=0.5, Low=0.2)
- KEV (Known Exploited Vulnerabilities) lookup and flag boost
- EPSS (Exploit Prediction Scoring System) score integration
- Exploit database references

### Factor U: Uncertainty Density (Flag Accumulation)

Aggregates uncertainty signals from multiple sources. Each flag contributes a weighted penalty.

**Flag Weights:**

| Flag | Weight | Description |
|------|--------|-------------|
| `NoProvenanceAnchor` | 0.30 | Cannot verify package source |
| `VersionRange` | 0.25 | Version specified as range, not exact |
| `DynamicCallTarget` | 0.25 | Reflection, eval, or dynamic dispatch |
| `ConflictingFeeds` | 0.20 | Contradictory info from different feeds |
| `ExternalAssembly` | 0.20 | Assembly outside analysis scope |
| `MissingVector` | 0.15 | No CVSS vector for severity assessment |
| `UnreachableSourceAdvisory` | 0.10 | Source advisory URL unreachable |

**Formula:**
```
U = min(1.0, sum(flagWeight[f] for each active flag f))
```

**Example:**
- NoProvenanceAnchor (0.30) + VersionRange (0.25) + MissingVector (0.15) = 0.70
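
The flag accumulation can be sketched as follows. The flag names and weights come from the table above; the `FLAG_WEIGHTS` mapping and the `uncertainty` function are hypothetical names used for illustration.

```python
from typing import Iterable

# Default flag weights, copied from the table above.
FLAG_WEIGHTS = {
    "NoProvenanceAnchor": 0.30,
    "VersionRange": 0.25,
    "DynamicCallTarget": 0.25,
    "ConflictingFeeds": 0.20,
    "ExternalAssembly": 0.20,
    "MissingVector": 0.15,
    "UnreachableSourceAdvisory": 0.10,
}


def uncertainty(active_flags: Iterable[str]) -> float:
    """U = min(1.0, sum of the weights of all active flags)."""
    # Deduplicate and sort so the floating-point sum is order-independent.
    unique_flags = sorted(set(active_flags))
    return min(1.0, sum(FLAG_WEIGHTS[flag] for flag in unique_flags))
```

The seven default weights add up to 1.45, so the `min` matters in practice: an unknown with most flags set saturates at 1.0.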

### Factor C: Centrality (Graph Position Importance)

Measures how important the unknown's position is in the call graph, using betweenness centrality.

**Formula:**
```
C = min(1.0, betweenness / maxBetweenness)
```

**Parameters:**
- `betweenness`: Raw betweenness centrality from graph analysis
- `maxBetweenness`: Normalization ceiling (default: 1000)

**Rationale:** High-betweenness nodes appear on many shortest paths, meaning they're likely to be reached regardless of entry point.

**Related Metrics:**
- `DegreeCentrality`: Number of incoming + outgoing edges (stored but not used in score)
- `BetweennessCentrality`: Raw betweenness value (stored for debugging)
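
The normalization step can be sketched as below. Computing the raw betweenness value itself is out of scope here; the function name `centrality` and the treatment of negative inputs are assumptions.

```python
def centrality(betweenness: float, max_betweenness: float = 1000.0) -> float:
    """C = min(1.0, betweenness / maxBetweenness)."""
    if max_betweenness <= 0:
        raise ValueError("max_betweenness must be positive")
    # Assumption: negative raw values are treated as 0 rather than rejected.
    return min(1.0, max(0.0, betweenness) / max_betweenness)
```

With the default ceiling, a raw betweenness of 250 normalizes to 0.25, and anything at or above 1000 saturates at 1.0.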

### Factor S: Staleness (Evidence Age)

Measures how much time has passed since the last successful analysis of the unknown.

**Formula:**
```
S = min(1.0, daysSinceLastAnalysis / maxDays)
```

An optional exponential variant approaches 1.0 asymptotically instead of clamping at `maxDays`:
```
S = 1 - exp(-daysSinceLastAnalysis / tau)
```

**Parameters:**
- `daysSinceLastAnalysis`: Days since `LastAnalyzedAt` timestamp
- `maxDays`: Staleness ceiling (default: 14 days)
- `tau`: Decay constant for exponential model (default: 14)

**Special Cases:**
- Never analyzed (`LastAnalyzedAt` is null): S = 1.0 (maximum staleness)
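
Both staleness models can be sketched side by side. The function names are hypothetical, and counting whole days (rather than fractional days) is an assumption made here for illustration.

```python
import math
from datetime import datetime, timezone
from typing import Optional


def days_since(last_analyzed_at: Optional[datetime], now: datetime) -> Optional[int]:
    """Whole days elapsed since the last analysis, or None if never analyzed."""
    if last_analyzed_at is None:
        return None
    return max(0, (now - last_analyzed_at).days)


def staleness_linear(last_analyzed_at: Optional[datetime], now: datetime,
                     max_days: int = 14) -> float:
    """S = min(1.0, daysSinceLastAnalysis / maxDays); 1.0 if never analyzed."""
    days = days_since(last_analyzed_at, now)
    if days is None:
        return 1.0
    return min(1.0, days / max_days)


def staleness_exponential(last_analyzed_at: Optional[datetime], now: datetime,
                          tau: float = 14.0) -> float:
    """S = 1 - exp(-daysSinceLastAnalysis / tau); 1.0 if never analyzed."""
    days = days_since(last_analyzed_at, now)
    if days is None:
        return 1.0
    return 1.0 - math.exp(-days / tau)
```

The two models diverge noticeably: at 14 days with the defaults, the linear model has already reached 1.0 while the exponential model yields about 0.63. Passing `now` explicitly, instead of reading the clock inside the function, keeps the computation reproducible.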

## Band Assignment

Based on the composite score, unknowns are assigned to triage bands:

| Band | Threshold | Rescan Policy | Description |
|------|-----------|---------------|-------------|
| **HOT** | Score >= 0.70 | 15 minutes | Immediate rescan + VEX escalation |
| **WARM** | 0.40 <= Score < 0.70 | 24 hours | Scheduled rescan within 12-72h |
| **COLD** | Score < 0.40 | 7 days | Weekly batch processing |

Thresholds are configurable:
```yaml
Signals:
  UnknownsScoring:
    HotThreshold: 0.70
    WarmThreshold: 0.40
```
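
The threshold comparison can be sketched as a small function. The name `assign_band` is hypothetical; the band labels follow the casing used in the API responses.

```python
def assign_band(score: float, hot_threshold: float = 0.70,
                warm_threshold: float = 0.40) -> str:
    """Map a composite score to a triage band using inclusive lower bounds."""
    if score >= hot_threshold:
        return "Hot"
    if score >= warm_threshold:
        return "Warm"
    return "Cold"
```

Both thresholds are inclusive on the lower side, so a score of exactly 0.70 is HOT and exactly 0.40 is WARM.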

## Scheduler Integration

The `UnknownsRescanWorker` processes unknowns based on their band:

### HOT Band Processing
- Poll interval: 1 minute
- Batch size: 10 items
- Action: Trigger immediate rescan via `IRescanOrchestrator`
- On failure: Exponential backoff, max 3 retries before demotion to WARM

### WARM Band Processing
- Poll interval: 5 minutes
- Batch size: 50 items
- Scheduled window: 12-72 hours based on score within band
- On failure: Increment `RescanAttempts`, re-queue with delay

### COLD Band Processing
- Schedule: Weekly on configurable day (default: Sunday)
- Batch size: 500 items
- Action: Batch rescan job submission
- On failure: Log and retry next week
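
As a rough sketch of how a band translates into a next rescan time, the snippet below uses the default intervals from the configuration (15 minutes, 24 hours, 7 days). It is a simplification under stated assumptions: it does not model the score-dependent 12-72 hour WARM window, the weekly COLD batch day, or retry backoff, and the names `RESCAN_INTERVALS` and `next_scheduled_rescan` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default rescan intervals per band (HotRescanMinutes, WarmRescanHours, ColdRescanDays).
RESCAN_INTERVALS = {
    "Hot": timedelta(minutes=15),
    "Warm": timedelta(hours=24),
    "Cold": timedelta(days=7),
}


def next_scheduled_rescan(band: str, now: datetime) -> datetime:
    """Return the next rescan time for a band, relative to a UTC 'now'."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware (UTC expected)")
    return now.astimezone(timezone.utc) + RESCAN_INTERVALS[band]
```

For example, a HOT unknown scored at `2025-12-15T10:00:00Z` would be due again at `2025-12-15T10:15:00Z`.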

## Normalization Trace

Each scored unknown includes a `NormalizationTrace` for debugging and replay:

```json
{
  "rawPopularity": 42,
  "normalizedPopularity": 0.81,
  "popularityFormula": "min(1, log10(1 + 42) / log10(1 + 100))",

  "rawExploitPotential": 0.5,
  "normalizedExploitPotential": 0.5,

  "rawUncertainty": 0.55,
  "normalizedUncertainty": 0.55,
  "activeFlags": ["NoProvenanceAnchor", "VersionRange"],

  "rawCentrality": 250.0,
  "normalizedCentrality": 0.25,

  "rawStaleness": 7,
  "normalizedStaleness": 0.5,

  "weights": {
    "wP": 0.25,
    "wE": 0.25,
    "wU": 0.25,
    "wC": 0.15,
    "wS": 0.10
  },
  "finalScore": 0.55,
  "assignedBand": "Warm",
  "computedAt": "2025-12-15T10:00:00Z"
}
```

**Replay Capability:** Given the trace, the score can be recomputed from the normalized factors and the recorded weights:
```
Score = 0.25×0.81 + 0.25×0.5 + 0.25×0.55 + 0.15×0.25 + 0.10×0.5
      = 0.2025 + 0.125 + 0.1375 + 0.0375 + 0.05
      = 0.5525 ≈ 0.55
```
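
A replay check over a parsed trace could look like the sketch below. The key names mirror the trace JSON shown above; the function names and the rounding tolerance are assumptions, chosen because the example trace stores values rounded to two decimals.

```python
from typing import Any, Dict

# Pairs of (weight key, normalized factor key) as they appear in the trace.
FACTOR_KEYS = [
    ("wP", "normalizedPopularity"),
    ("wE", "normalizedExploitPotential"),
    ("wU", "normalizedUncertainty"),
    ("wC", "normalizedCentrality"),
    ("wS", "normalizedStaleness"),
]


def replay_score(trace: Dict[str, Any]) -> float:
    """Recompute the composite score from a NormalizationTrace dictionary."""
    weights = trace["weights"]
    raw = sum(weights[w_key] * trace[f_key] for w_key, f_key in FACTOR_KEYS)
    return max(0.0, min(1.0, raw))


def matches_recorded_score(trace: Dict[str, Any], tolerance: float = 0.005) -> bool:
    """True if the replayed score agrees with finalScore within the tolerance."""
    return abs(replay_score(trace) - trace["finalScore"]) <= tolerance
```

A tolerance of half a unit in the last stored decimal place accepts ordinary rounding while still flagging a trace whose recorded score does not follow from its own factors.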

## API Endpoints

### Query Unknowns by Band

```
GET /api/signals/unknowns?band=hot&limit=50&offset=0
```

Response:
```json
{
  "items": [
    {
      "id": "unk-123",
      "subjectKey": "myapp|1.0.0",
      "purl": "pkg:npm/lodash@4.17.21",
      "score": 0.82,
      "band": "Hot",
      "flags": { "noProvenanceAnchor": true, "versionRange": true },
      "nextScheduledRescan": "2025-12-15T10:15:00Z"
    }
  ],
  "total": 15,
  "hasMore": false
}
```
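
For clients, building the query and walking through pages might be sketched as follows. The host name is a placeholder, the helper names are hypothetical, and the pagination logic assumes that `hasMore` signals another page at `offset + limit`, which the response shape suggests but this document does not state explicitly.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode

VALID_BANDS = ("hot", "warm", "cold")


def unknowns_query_url(base_url: str, band: str, limit: int = 50, offset: int = 0) -> str:
    """Build the GET /api/signals/unknowns URL for one page of results."""
    if band not in VALID_BANDS:
        raise ValueError(f"band must be one of {VALID_BANDS}")
    query = urlencode({"band": band, "limit": limit, "offset": offset})
    return f"{base_url.rstrip('/')}/api/signals/unknowns?{query}"


def next_offset(response: Dict[str, Any], offset: int, limit: int) -> Optional[int]:
    """Return the offset of the next page, or None when hasMore is false."""
    # Assumption: pages are contiguous, so the next page starts at offset + limit.
    return offset + limit if response.get("hasMore") else None
```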

### Get Score Explanation

```
GET /api/signals/unknowns/{id}/explain
```

Response:
```json
{
  "unknown": { /* full UnknownSymbolDocument */ },
  "normalizationTrace": { /* trace object */ },
  "factorBreakdown": {
    "popularity": { "raw": 42, "normalized": 0.81, "weighted": 0.2025 },
    "exploitPotential": { "raw": 0.5, "normalized": 0.5, "weighted": 0.125 },
    "uncertainty": { "raw": 0.55, "normalized": 0.55, "weighted": 0.1375 },
    "centrality": { "raw": 250, "normalized": 0.25, "weighted": 0.0375 },
    "staleness": { "raw": 7, "normalized": 0.5, "weighted": 0.05 }
  },
  "bandThresholds": { "hot": 0.70, "warm": 0.40 }
}
```

## Configuration Reference

```yaml
Signals:
  UnknownsScoring:
    # Factor weights (must sum to 1.0)
    WeightPopularity: 0.25
    WeightExploitPotential: 0.25
    WeightUncertainty: 0.25
    WeightCentrality: 0.15
    WeightStaleness: 0.10

    # Popularity normalization
    PopularityMaxDeployments: 100

    # Uncertainty flag weights
    FlagWeightNoProvenance: 0.30
    FlagWeightVersionRange: 0.25
    FlagWeightConflictingFeeds: 0.20
    FlagWeightMissingVector: 0.15
    FlagWeightUnreachableSource: 0.10
    FlagWeightDynamicTarget: 0.25
    FlagWeightExternalAssembly: 0.20

    # Centrality normalization
    CentralityMaxBetweenness: 1000.0

    # Staleness normalization
    StalenessMaxDays: 14
    StalenessTau: 14  # For exponential decay

    # Band thresholds
    HotThreshold: 0.70
    WarmThreshold: 0.40

    # Rescan scheduling
    HotRescanMinutes: 15
    WarmRescanHours: 24
    ColdRescanDays: 7

  UnknownsDecay:
    # Nightly batch decay
    BatchEnabled: true
    MaxSubjectsPerBatch: 1000
    ColdBatchDay: Sunday
```

## Determinism Requirements

The scoring algorithm is fully deterministic:

1. **Same inputs produce identical scores** - Given identical `UnknownSymbolDocument`, deployment counts, and graph metrics, the score will always be the same
2. **Normalization trace enables replay** - The trace contains all raw values and weights needed to reproduce the score
3. **Timestamps use UTC ISO 8601** - All `ComputedAt`, `LastAnalyzedAt`, and `NextScheduledRescan` timestamps are UTC
4. **Weights logged per computation** - The trace includes the exact weights used, allowing audit of configuration changes

## Database Schema

```sql
-- Unknowns table (enhanced)
CREATE TABLE signals.unknowns (
    id UUID PRIMARY KEY,
    subject_key TEXT NOT NULL,
    purl TEXT,
    symbol_id TEXT,
    callgraph_id TEXT,

    -- Scoring factors
    popularity_score FLOAT DEFAULT 0,
    deployment_count INT DEFAULT 0,
    exploit_potential_score FLOAT DEFAULT 0,
    uncertainty_score FLOAT DEFAULT 0,
    centrality_score FLOAT DEFAULT 0,
    degree_centrality INT DEFAULT 0,
    betweenness_centrality FLOAT DEFAULT 0,
    staleness_score FLOAT DEFAULT 0,
    days_since_last_analysis INT DEFAULT 0,

    -- Composite score and band
    score FLOAT DEFAULT 0,
    band TEXT DEFAULT 'cold' CHECK (band IN ('hot', 'warm', 'cold')),

    -- Metadata
    flags JSONB DEFAULT '{}',
    normalization_trace JSONB,
    rescan_attempts INT DEFAULT 0,
    last_rescan_result TEXT,

    -- Timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_analyzed_at TIMESTAMPTZ,
    next_scheduled_rescan TIMESTAMPTZ
);

-- Indexes for band-based queries
CREATE INDEX idx_unknowns_band ON signals.unknowns(band);
CREATE INDEX idx_unknowns_score ON signals.unknowns(score DESC);
CREATE INDEX idx_unknowns_next_rescan ON signals.unknowns(next_scheduled_rescan)
    WHERE next_scheduled_rescan IS NOT NULL;
CREATE INDEX idx_unknowns_subject ON signals.unknowns(subject_key);
```

## Metrics and Observability

The following metrics are exposed for monitoring:

| Metric | Type | Description |
|--------|------|-------------|
| `signals_unknowns_total` | Gauge | Total unknowns by band |
| `signals_unknowns_rescans_total` | Counter | Rescans triggered by band |
| `signals_unknowns_scoring_duration_seconds` | Histogram | Scoring computation time |
| `signals_unknowns_band_transitions_total` | Counter | Band changes (e.g., WARM->HOT) |

## Related Documentation

- [Unknowns Registry](./unknowns-registry.md) - Data model and API for unknowns
- [Reachability Analysis](./reachability.md) - Reachability scoring integration
- [Callgraph Schema](./callgraph-formats.md) - Graph structure for centrality computation