Add call graph fixtures for various languages and scenarios

- Introduced `all-edge-reasons.json` to test edge resolution reasons in .NET.
- Added `all-visibility-levels.json` to validate method visibility levels in .NET.
- Created `dotnet-aspnetcore-minimal.json` for a minimal ASP.NET Core application.
- Included `go-gin-api.json` for a Go Gin API application structure.
- Added `java-spring-boot.json` for the Spring PetClinic application in Java.
- Introduced `legacy-no-schema.json` for legacy application structure without schema.
- Created `node-express-api.json` for an Express.js API application structure.
2025-12-16 10:44:24 +02:00
parent 4391f35d8a
commit 5a480a3c2a
223 changed files with 19367 additions and 727 deletions
--- a/docs/signals/unknowns-ranking.md
+++ b/docs/signals/unknowns-ranking.md
@@ -0,0 +1,383 @@
+# Unknowns Ranking Algorithm Reference
+
+This document describes the multi-factor scoring algorithm used to rank and triage unknowns in the StellaOps Signals module.
+
+## Purpose
+
+When reachability analysis encounters unresolved symbols, edges, or package identities, these are recorded as **unknowns**. The ranking algorithm prioritizes unknowns by computing a composite score from five factors, then assigns each to a triage band (HOT/WARM/COLD) that determines rescan scheduling and escalation policies.
+
+## Scoring Formula
+
+The composite score is computed as:
+
+```
+Score = wP × P + wE × E + wU × U + wC × C + wS × S
+```
+
+Where:
+- **P** = Popularity (deployment impact)
+- **E** = Exploit potential (CVE severity)
+- **U** = Uncertainty density (flag accumulation)
+- **C** = Centrality (graph position importance)
+- **S** = Staleness (evidence age)
+
+All factors are normalized to [0.0, 1.0] before weighting. The final score is clamped to [0.0, 1.0].
+
+### Default Weights
+
+| Factor | Weight | Description |
+|--------|--------|-------------|
+| wP | 0.25 | Popularity weight |
+| wE | 0.25 | Exploit potential weight |
+| wU | 0.25 | Uncertainty density weight |
+| wC | 0.15 | Centrality weight |
+| wS | 0.10 | Staleness weight |
+
+Weights must sum to 1.0 and are configurable via `Signals:UnknownsScoring` settings.
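The formula and the sum-to-1.0 constraint can be sketched in Python as follows. This is a minimal illustration, not the module's implementation: the name `composite_score`, the dictionary keys, and the tolerance on the weight check are assumptions made for the example.

```python
import math

# Default weights from the table above; the short key names are illustrative.
DEFAULT_WEIGHTS = {"P": 0.25, "E": 0.25, "U": 0.25, "C": 0.15, "S": 0.10}


def composite_score(factors, weights=DEFAULT_WEIGHTS):
    """Weighted sum of already-normalized factors, clamped to [0.0, 1.0]."""
    # Assumed validation: reject weight sets that do not sum to 1.0.
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    total = sum(weights[name] * factors[name] for name in weights)
    return max(0.0, min(1.0, total))


# Hypothetical factor values, each already normalized to [0.0, 1.0].
example = {"P": 1.0, "E": 0.5, "U": 0.70, "C": 0.25, "S": 1.0}
print(round(composite_score(example), 4))
```

Rejecting an invalid weight set, rather than silently renormalizing it, keeps misconfiguration visible; whether the Signals module does the same is not stated here.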
+
+## Factor Details
+
+### Factor P: Popularity (Deployment Impact)
+
+Measures how widely the unknown's package is deployed across monitored environments.
+
+**Formula:**
+```
+P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))
+```
+
+**Parameters:**
+- `deploymentCount`: Number of deployments referencing the package (from `deploy_refs` table)
+- `maxDeployments`: Normalization ceiling (default: 100)
+
+**Rationale:** Logarithmic scaling prevents a single highly-deployed package from dominating scores while still prioritizing widely-used dependencies.
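To make the effect of the logarithmic scaling concrete, the sketch below evaluates the formula for a few deployment counts. The function name `popularity` and the input validation are assumptions of the example.

```python
import math


def popularity(deployment_count, max_deployments=100):
    """P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))."""
    if deployment_count < 0 or max_deployments < 1:
        raise ValueError("count must be non-negative and the ceiling at least 1")
    ratio = math.log10(1 + deployment_count) / math.log10(1 + max_deployments)
    return min(1.0, ratio)


# Print P for a spread of deployment counts with the default ceiling of 100.
for count in (0, 1, 10, 100, 1000):
    print(count, round(popularity(count), 3))
```

With the default ceiling, ten deployments already yield roughly 0.52, and any count at or above the ceiling saturates at 1.0.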
+
+### Factor E: Exploit Potential (CVE Severity)
+
+Estimates the consequence severity if the unknown resolves to a vulnerable component.
+
+**Current Implementation:**
+- Returns 0.5 (medium potential) when no CVE association exists
+- Future: Integrate KEV lookup, EPSS scores, and exploit database references
+
+**Planned Enhancements:**
+- CVE severity mapping (Critical=1.0, High=0.8, Medium=0.5, Low=0.2)
+- KEV (Known Exploited Vulnerabilities) flag boost
+- EPSS (Exploit Prediction Scoring System) integration
+
+### Factor U: Uncertainty Density (Flag Accumulation)
+
+Aggregates uncertainty signals from multiple sources. Each flag contributes a weighted penalty.
+
+**Flag Weights:**
+
+| Flag | Weight | Description |
+|------|--------|-------------|
+| `NoProvenanceAnchor` | 0.30 | Cannot verify package source |
+| `VersionRange` | 0.25 | Version specified as range, not exact |
+| `DynamicCallTarget` | 0.25 | Reflection, eval, or dynamic dispatch |
+| `ConflictingFeeds` | 0.20 | Contradictory info from different feeds |
+| `ExternalAssembly` | 0.20 | Assembly outside analysis scope |
+| `MissingVector` | 0.15 | No CVSS vector for severity assessment |
+| `UnreachableSourceAdvisory` | 0.10 | Source advisory URL unreachable |
+
+**Formula:**
+```
+U = min(1.0, sum(flagWeight for each active flag))
+```
+
+**Example:**
+- NoProvenanceAnchor (0.30) + VersionRange (0.25) + MissingVector (0.15) = 0.70
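The flag accumulation can be sketched as a capped sum over the weight table. The names `FLAG_WEIGHTS` and `uncertainty` are illustrative, and de-duplicating flags with a set is an assumption of this example.

```python
# Flag weights copied from the table above.
FLAG_WEIGHTS = {
    "NoProvenanceAnchor": 0.30,
    "VersionRange": 0.25,
    "DynamicCallTarget": 0.25,
    "ConflictingFeeds": 0.20,
    "ExternalAssembly": 0.20,
    "MissingVector": 0.15,
    "UnreachableSourceAdvisory": 0.10,
}


def uncertainty(active_flags, flag_weights=FLAG_WEIGHTS):
    """Sum the weights of the distinct active flags, capped at 1.0."""
    # An unknown flag name raises KeyError instead of being ignored.
    return min(1.0, sum(flag_weights[flag] for flag in set(active_flags)))


print(round(uncertainty(["NoProvenanceAnchor", "VersionRange", "MissingVector"]), 2))
print(uncertainty(FLAG_WEIGHTS))  # every flag active: the cap applies
```

The weights in the table add up to 1.45, so the cap matters once enough flags are active at the same time.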
+
+### Factor C: Centrality (Graph Position Importance)
+
+Measures the unknown's position importance in the call graph using betweenness centrality.
+
+**Formula:**
+```
+C = min(1.0, betweenness / maxBetweenness)
+```
+
+**Parameters:**
+- `betweenness`: Raw betweenness centrality from graph analysis
+- `maxBetweenness`: Normalization ceiling (default: 1000)
+
+**Rationale:** High-betweenness nodes appear on many shortest paths, meaning they're likely to be reached regardless of entry point.
+
+**Related Metrics:**
+- `DegreeCentrality`: Number of incoming + outgoing edges (stored but not used in score)
+- `BetweennessCentrality`: Raw betweenness value (stored for debugging)
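The centrality normalization is a simple capped ratio. A minimal sketch follows; the function name `centrality` and the handling of non-positive inputs are assumptions.

```python
def centrality(betweenness, max_betweenness=1000.0):
    """C = min(1.0, betweenness / maxBetweenness)."""
    if max_betweenness <= 0:
        raise ValueError("max_betweenness must be positive")
    # Negative raw values are treated as 0 (an assumption of this sketch).
    return min(1.0, max(0.0, betweenness) / max_betweenness)


print(centrality(250.0))   # a quarter of the default ceiling
print(centrality(5000.0))  # above the ceiling, so the cap applies
```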
+
+### Factor S: Staleness (Evidence Age)
+
+Measures how much time has passed since the last successful analysis of the unknown.
+
+**Formula:**
+```
+S = min(1.0, daysSinceLastAnalysis / maxDays)
+```
+
+Optionally, an exponential model can replace the linear ramp; it approaches 1.0 asymptotically instead of reaching it at `maxDays`:
+```
+S = 1 - exp(-daysSinceLastAnalysis / tau)
+```
+
+**Parameters:**
+- `daysSinceLastAnalysis`: Days since `LastAnalyzedAt` timestamp
+- `maxDays`: Staleness ceiling (default: 14 days)
+- `tau`: Decay constant for exponential model (default: 14)
+
+**Special Cases:**
+- Never analyzed (`LastAnalyzedAt` is null): S = 1.0 (maximum staleness)
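Both staleness variants and the never-analyzed case can be sketched together. The function name `staleness`, the choice to count whole days, and the use of `tau=None` to select the linear form are assumptions of this example.

```python
import math
from datetime import datetime, timezone


def staleness(last_analyzed_at, now, max_days=14, tau=None):
    """Return S in [0.0, 1.0]; pass tau to select the exponential variant."""
    if last_analyzed_at is None:
        return 1.0  # never analyzed: maximum staleness
    # Whole elapsed days; clock skew into the future is treated as 0 days.
    days = max(0, (now - last_analyzed_at).days)
    if tau is not None:
        return 1.0 - math.exp(-days / tau)
    return min(1.0, days / max_days)


now = datetime(2025, 12, 15, 10, 0, tzinfo=timezone.utc)
week_ago = datetime(2025, 12, 8, 10, 0, tzinfo=timezone.utc)
print(staleness(week_ago, now))          # linear: 7 / 14
print(staleness(week_ago, now, tau=14))  # exponential: 1 - exp(-7 / 14)
print(staleness(None, now))              # never analyzed
```

For the same seven days the linear form gives 0.5 while the exponential form gives about 0.39, so switching models shifts scores even when `tau` equals `maxDays`.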
+
+## Band Assignment
+
+Based on the composite score, unknowns are assigned to triage bands:
+
+| Band | Threshold | Rescan Policy | Description |
+|------|-----------|---------------|-------------|
+| **HOT** | Score >= 0.70 | 15 minutes | Immediate rescan + VEX escalation |
+| **WARM** | 0.40 <= Score < 0.70 | 24 hours | Scheduled rescan within 12-72h |
+| **COLD** | Score < 0.40 | 7 days | Weekly batch processing |
+
+Thresholds are configurable:
+```yaml
+Signals:
+  UnknownsScoring:
+    HotThreshold: 0.70
+    WarmThreshold: 0.40
+```
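The threshold comparison in the table can be expressed in a few lines. The function name `assign_band` and the upper-case return values are illustrative choices.

```python
def assign_band(score, hot_threshold=0.70, warm_threshold=0.40):
    """Map a composite score to a triage band using inclusive lower bounds."""
    if hot_threshold <= warm_threshold:
        raise ValueError("hot_threshold must be greater than warm_threshold")
    if score >= hot_threshold:
        return "HOT"
    if score >= warm_threshold:
        return "WARM"
    return "COLD"


# Scores chosen to sit on and around both default thresholds.
for score in (0.82, 0.70, 0.55, 0.40, 0.39):
    print(score, assign_band(score))
```

Both thresholds are inclusive on the lower side, so a score of exactly 0.70 is HOT and exactly 0.40 is WARM.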
+
+## Scheduler Integration
+
+The `UnknownsRescanWorker` processes unknowns based on their band:
+
+### HOT Band Processing
+- Poll interval: 1 minute
+- Batch size: 10 items
+- Action: Trigger immediate rescan via `IRescanOrchestrator`
+- On failure: Exponential backoff, max 3 retries before demotion to WARM
+
+### WARM Band Processing
+- Poll interval: 5 minutes
+- Batch size: 50 items
+- Scheduled window: 12-72 hours based on score within band
+- On failure: Increment `RescanAttempts`, re-queue with delay
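How the 12-72 hour window depends on the score is not specified here. One plausible reading, shown purely as an assumption, is a linear interpolation in which higher scores within the band are rescanned sooner; the function name `warm_rescan_delay_hours` is hypothetical.

```python
def warm_rescan_delay_hours(score, warm_threshold=0.40, hot_threshold=0.70,
                            min_hours=12.0, max_hours=72.0):
    """Assumed linear mapping: bottom of the band -> 72 h, top -> near 12 h."""
    if not warm_threshold <= score < hot_threshold:
        raise ValueError("score is outside the WARM band")
    # Position of the score within the band, from 0.0 (bottom) toward 1.0 (top).
    position = (score - warm_threshold) / (hot_threshold - warm_threshold)
    return max_hours - position * (max_hours - min_hours)


print(warm_rescan_delay_hours(0.40))  # bottom of the band
print(warm_rescan_delay_hours(0.55))  # middle of the band
```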
+
+### COLD Band Processing
+- Schedule: Weekly on configurable day (default: Sunday)
+- Batch size: 500 items
+- Action: Batch rescan job submission
+- On failure: Log and retry next week
+
+## Normalization Trace
+
+Each scored unknown includes a `NormalizationTrace` for debugging and replay:
+
+```json
+{
+  "rawPopularity": 19,
+  "normalizedPopularity": 0.65,
+  "popularityFormula": "min(1, log10(1 + 19) / log10(1 + 100))",
+
+  "rawExploitPotential": 0.5,
+  "normalizedExploitPotential": 0.5,
+
+  "rawUncertainty": 0.55,
+  "normalizedUncertainty": 0.55,
+  "activeFlags": ["NoProvenanceAnchor", "VersionRange"],
+
+  "rawCentrality": 250.0,
+  "normalizedCentrality": 0.25,
+
+  "rawStaleness": 7,
+  "normalizedStaleness": 0.5,
+
+  "weights": {
+    "wP": 0.25,
+    "wE": 0.25,
+    "wU": 0.25,
+    "wC": 0.15,
+    "wS": 0.10
+  },
+  "finalScore": 0.51,
+  "assignedBand": "Warm",
+  "computedAt": "2025-12-15T10:00:00Z"
+}
+```
+
+**Replay Capability:** Given the trace, the score can be recomputed from the normalized factors and the recorded weights:
+```
+Score = 0.25×0.65 + 0.25×0.5 + 0.25×0.55 + 0.15×0.25 + 0.10×0.5
+      = 0.1625 + 0.125 + 0.1375 + 0.0375 + 0.05
+      = 0.5125 ≈ 0.51
+```
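A replay can also be scripted directly against a stored trace. The sketch below parses a subset of the trace fields shown above and recomputes the weighted sum; the names `WEIGHT_TO_FIELD` and `replay_score` are assumptions, as is the pairing of each weight key with its `normalized*` field.

```python
import json

# Subset of the trace shown above: normalized factors and the weights used.
TRACE_JSON = """
{
  "normalizedPopularity": 0.65,
  "normalizedExploitPotential": 0.5,
  "normalizedUncertainty": 0.55,
  "normalizedCentrality": 0.25,
  "normalizedStaleness": 0.5,
  "weights": {"wP": 0.25, "wE": 0.25, "wU": 0.25, "wC": 0.15, "wS": 0.10}
}
"""

# Assumed mapping from each weight key to the trace field it multiplies.
WEIGHT_TO_FIELD = {
    "wP": "normalizedPopularity",
    "wE": "normalizedExploitPotential",
    "wU": "normalizedUncertainty",
    "wC": "normalizedCentrality",
    "wS": "normalizedStaleness",
}


def replay_score(trace):
    """Recompute the clamped weighted sum from a NormalizationTrace dict."""
    total = sum(weight * trace[WEIGHT_TO_FIELD[key]]
                for key, weight in trace["weights"].items())
    return max(0.0, min(1.0, total))


print(round(replay_score(json.loads(TRACE_JSON)), 4))
```

Because the trace carries its own weights, the replay does not depend on the configuration currently deployed.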
+
+## API Endpoints
+
+### Query Unknowns by Band
+
+```
+GET /api/signals/unknowns?band=hot&limit=50&offset=0
+```
+
+Response:
+```json
+{
+  "items": [
+    {
+      "id": "unk-123",
+      "subjectKey": "myapp|1.0.0",
+      "purl": "pkg:npm/lodash@4.17.21",
+      "score": 0.82,
+      "band": "Hot",
+      "flags": { "noProvenanceAnchor": true, "versionRange": true },
+      "nextScheduledRescan": "2025-12-15T10:15:00Z"
+    }
+  ],
+  "total": 15,
+  "hasMore": false
+}
+```
+
+### Get Score Explanation
+
+```
+GET /api/signals/unknowns/{id}/explain
+```
+
+Response:
+```json
+{
+  "unknown": { /* full UnknownSymbolDocument */ },
+  "normalizationTrace": { /* trace object */ },
+  "factorBreakdown": {
+    "popularity": { "raw": 19, "normalized": 0.65, "weighted": 0.1625 },
+    "exploitPotential": { "raw": 0.5, "normalized": 0.5, "weighted": 0.125 },
+    "uncertainty": { "raw": 0.55, "normalized": 0.55, "weighted": 0.1375 },
+    "centrality": { "raw": 250, "normalized": 0.25, "weighted": 0.0375 },
+    "staleness": { "raw": 7, "normalized": 0.5, "weighted": 0.05 }
+  },
+  "bandThresholds": { "hot": 0.70, "warm": 0.40 }
+}
+```
+
+## Configuration Reference
+
+```yaml
+Signals:
+  UnknownsScoring:
+    # Factor weights (must sum to 1.0)
+    WeightPopularity: 0.25
+    WeightExploitPotential: 0.25
+    WeightUncertainty: 0.25
+    WeightCentrality: 0.15
+    WeightStaleness: 0.10
+
+    # Popularity normalization
+    PopularityMaxDeployments: 100
+
+    # Uncertainty flag weights
+    FlagWeightNoProvenance: 0.30
+    FlagWeightVersionRange: 0.25
+    FlagWeightConflictingFeeds: 0.20
+    FlagWeightMissingVector: 0.15
+    FlagWeightUnreachableSource: 0.10
+    FlagWeightDynamicTarget: 0.25
+    FlagWeightExternalAssembly: 0.20
+
+    # Centrality normalization
+    CentralityMaxBetweenness: 1000.0
+
+    # Staleness normalization
+    StalenessMaxDays: 14
+    StalenessTau: 14  # For exponential decay
+
+    # Band thresholds
+    HotThreshold: 0.70
+    WarmThreshold: 0.40
+
+    # Rescan scheduling
+    HotRescanMinutes: 15
+    WarmRescanHours: 24
+    ColdRescanDays: 7
+
+  UnknownsDecay:
+    # Nightly batch decay
+    BatchEnabled: true
+    MaxSubjectsPerBatch: 1000
+    ColdBatchDay: Sunday
+```
+
+## Determinism Requirements
+
+The scoring algorithm is fully deterministic:
+
+1. **Same inputs produce identical scores** - Given an identical `UnknownSymbolDocument`, deployment counts, and graph metrics, the score is always the same
+2. **Normalization trace enables replay** - The trace contains all raw values and weights needed to reproduce the score
+3. **Timestamps use UTC ISO 8601** - All `ComputedAt`, `LastAnalyzedAt`, and `NextScheduledRescan` timestamps are UTC
+4. **Weights logged per computation** - The trace includes the exact weights used, allowing audit of configuration changes
+
+## Database Schema
+
+```sql
+-- Unknowns table (enhanced)
+CREATE TABLE signals.unknowns (
+    id UUID PRIMARY KEY,
+    subject_key TEXT NOT NULL,
+    purl TEXT,
+    symbol_id TEXT,
+    callgraph_id TEXT,
+
+    -- Scoring factors
+    popularity_score FLOAT DEFAULT 0,
+    deployment_count INT DEFAULT 0,
+    exploit_potential_score FLOAT DEFAULT 0,
+    uncertainty_score FLOAT DEFAULT 0,
+    centrality_score FLOAT DEFAULT 0,
+    degree_centrality INT DEFAULT 0,
+    betweenness_centrality FLOAT DEFAULT 0,
+    staleness_score FLOAT DEFAULT 0,
+    days_since_last_analysis INT DEFAULT 0,
+
+    -- Composite score and band
+    score FLOAT DEFAULT 0,
+    band TEXT DEFAULT 'cold' CHECK (band IN ('hot', 'warm', 'cold')),
+
+    -- Metadata
+    flags JSONB DEFAULT '{}',
+    normalization_trace JSONB,
+    rescan_attempts INT DEFAULT 0,
+    last_rescan_result TEXT,
+
+    -- Timestamps
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    last_analyzed_at TIMESTAMPTZ,
+    next_scheduled_rescan TIMESTAMPTZ
+);
+
+-- Indexes for band-based queries
+CREATE INDEX idx_unknowns_band ON signals.unknowns(band);
+CREATE INDEX idx_unknowns_score ON signals.unknowns(score DESC);
+CREATE INDEX idx_unknowns_next_rescan ON signals.unknowns(next_scheduled_rescan)
+    WHERE next_scheduled_rescan IS NOT NULL;
+CREATE INDEX idx_unknowns_subject ON signals.unknowns(subject_key);
+```
+
+## Metrics and Observability
+
+The following metrics are exposed for monitoring:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `signals_unknowns_total` | Gauge | Total unknowns by band |
+| `signals_unknowns_rescans_total` | Counter | Rescans triggered by band |
+| `signals_unknowns_scoring_duration_seconds` | Histogram | Scoring computation time |
+| `signals_unknowns_band_transitions_total` | Counter | Band changes (e.g., WARM->HOT) |
+
+## Related Documentation
+
+- [Unknowns Registry](./unknowns-registry.md) - Data model and API for unknowns
+- [Reachability Analysis](./reachability.md) - Reachability scoring integration
+- [Callgraph Schema](./callgraph-formats.md) - Graph structure for centrality computation