Add call graph fixtures for various languages and scenarios

- Introduced `all-edge-reasons.json` to test edge resolution reasons in .NET.
- Added `all-visibility-levels.json` to validate method visibility levels in .NET.
- Created `dotnet-aspnetcore-minimal.json` for a minimal ASP.NET Core application.
- Included `go-gin-api.json` for a Go Gin API application structure.
- Added `java-spring-boot.json` for the Spring PetClinic application in Java.
- Introduced `legacy-no-schema.json` for legacy application structure without schema.
- Created `node-express-api.json` for an Express.js API application structure.
This commit is contained in:
master
2025-12-16 10:44:24 +02:00
parent 4391f35d8a
commit 5a480a3c2a
223 changed files with 19367 additions and 727 deletions


@@ -1,15 +1,355 @@
# Callgraph Formats (outline)
# Callgraph Schema Reference
## Pending Inputs
- See sprint SPRINT_0309_0001_0009_docs_tasks_md_ix action tracker; inputs due 2025-12-09..12 from owning guilds.
This document describes the `stella.callgraph.v1` schema used for representing call graphs in StellaOps.
## Determinism Checklist
- [ ] Hash any inbound assets/payloads; place sums alongside artifacts (e.g., SHA256SUMS in this folder).
- [ ] Keep examples offline-friendly and deterministic (fixed seeds, pinned versions, stable ordering).
- [ ] Note source/approver for any provided captures or schemas.
## Schema Version
## Sections to fill (once inputs arrive)
- Supported callgraph schema versions and shapes.
- Field definitions and validation rules.
- Common validation errors with deterministic examples.
- Hashes for any sample graphs provided.
**Current Version:** `stella.callgraph.v1`
All call graphs should include the `schema` field set to `stella.callgraph.v1`. Legacy call graphs without this field are automatically migrated on ingestion.
## Document Structure
A `CallgraphDocument` contains the following top-level fields:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `schema` | string | Yes | Schema identifier: `stella.callgraph.v1` |
| `scanKey` | string | No | Scan context identifier |
| `language` | CallgraphLanguage | No | Primary language of the call graph |
| `artifacts` | CallgraphArtifact[] | No | Artifacts included in the graph |
| `nodes` | CallgraphNode[] | Yes | Graph nodes representing symbols |
| `edges` | CallgraphEdge[] | Yes | Call edges between nodes |
| `entrypoints` | CallgraphEntrypoint[] | No | Discovered entrypoints |
| `metadata` | CallgraphMetadata | No | Graph-level metadata |
| `id` | string | Yes | Unique graph identifier |
| `component` | string | No | Component name |
| `version` | string | No | Component version |
| `ingestedAt` | DateTimeOffset | No | Ingestion timestamp (ISO 8601) |
| `graphHash` | string | No | Content hash for deduplication |
### Legacy Fields
These fields are preserved for backward compatibility:
| Field | Type | Description |
|-------|------|-------------|
| `languageString` | string | Legacy language string |
| `roots` | CallgraphRoot[] | Legacy root/entrypoint representation |
| `schemaVersion` | string | Legacy schema version field |
## Enumerations
### CallgraphLanguage
Supported languages for call graph analysis:
| Value | Description |
|-------|-------------|
| `Unknown` | Language not determined |
| `DotNet` | .NET (C#, F#, VB.NET) |
| `Java` | Java and JVM languages |
| `Node` | Node.js / JavaScript / TypeScript |
| `Python` | Python |
| `Go` | Go |
| `Rust` | Rust |
| `Ruby` | Ruby |
| `Php` | PHP |
| `Binary` | Native binary (ELF, PE) |
| `Swift` | Swift |
| `Kotlin` | Kotlin |
### SymbolVisibility
Access visibility levels for symbols:
| Value | Description |
|-------|-------------|
| `Unknown` | Visibility not determined |
| `Public` | Publicly accessible |
| `Internal` | Internal to assembly/module |
| `Protected` | Protected (subclass accessible) |
| `Private` | Private to containing type |
### EdgeKind
Edge classification based on analysis confidence:
| Value | Description | Confidence |
|-------|-------------|------------|
| `Static` | Statically determined call | High |
| `Heuristic` | Heuristically inferred | Medium |
| `Runtime` | Runtime-observed edge | Highest |
### EdgeReason
Reason codes explaining why an edge exists (critical for explainability):
| Value | Description | Typical Kind |
|-------|-------------|--------------|
| `DirectCall` | Direct method/function call | Static |
| `VirtualCall` | Virtual/interface dispatch | Static |
| `ReflectionString` | Reflection-based invocation | Heuristic |
| `DiBinding` | Dependency injection binding | Heuristic |
| `DynamicImport` | Dynamic import/require | Heuristic |
| `NewObj` | Constructor/object instantiation | Static |
| `DelegateCreate` | Delegate/function pointer creation | Static |
| `AsyncContinuation` | Async/await continuation | Static |
| `EventHandler` | Event handler subscription | Heuristic |
| `GenericInstantiation` | Generic type instantiation | Static |
| `NativeInterop` | Native interop (P/Invoke, JNI, FFI) | Static |
| `RuntimeMinted` | Runtime-minted edge from execution | Runtime |
| `Unknown` | Reason could not be determined | - |
### EntrypointKind
Types of entrypoints:
| Value | Description |
|-------|-------------|
| `Unknown` | Type not determined |
| `Http` | HTTP endpoint |
| `Grpc` | gRPC endpoint |
| `Cli` | CLI command handler |
| `Job` | Background job |
| `Event` | Event handler |
| `MessageQueue` | Message queue consumer |
| `Timer` | Timer/scheduled task |
| `Test` | Test method |
| `Main` | Main entry point |
| `ModuleInit` | Module initializer |
| `StaticConstructor` | Static constructor |
### EntrypointFramework
Frameworks that expose entrypoints:
| Value | Description | Language |
|-------|-------------|----------|
| `Unknown` | Framework not determined | - |
| `AspNetCore` | ASP.NET Core | DotNet |
| `MinimalApi` | ASP.NET Core Minimal APIs | DotNet |
| `Spring` | Spring Framework | Java |
| `SpringBoot` | Spring Boot | Java |
| `Express` | Express.js | Node |
| `Fastify` | Fastify | Node |
| `NestJs` | NestJS | Node |
| `FastApi` | FastAPI | Python |
| `Flask` | Flask | Python |
| `Django` | Django | Python |
| `Rails` | Ruby on Rails | Ruby |
| `Gin` | Gin | Go |
| `Echo` | Echo | Go |
| `Actix` | Actix Web | Rust |
| `Rocket` | Rocket | Rust |
| `AzureFunctions` | Azure Functions | Multi |
| `AwsLambda` | AWS Lambda | Multi |
| `CloudFunctions` | Google Cloud Functions | Multi |
### EntrypointPhase
Execution phase for entrypoints:
| Value | Description |
|-------|-------------|
| `ModuleInit` | Module/assembly initialization |
| `AppStart` | Application startup (Main) |
| `Runtime` | Runtime request handling |
| `Shutdown` | Shutdown/cleanup handlers |
## Node Structure
A `CallgraphNode` represents a symbol (method, function, type) in the call graph:
```json
{
  "id": "n001",
  "nodeId": "n001",
  "name": "GetWeatherForecast",
  "kind": "method",
  "namespace": "SampleApi.Controllers",
  "file": "WeatherForecastController.cs",
  "line": 15,
  "symbolKey": "SampleApi.Controllers.WeatherForecastController::GetWeatherForecast()",
  "artifactKey": "SampleApi.dll",
  "visibility": "Public",
  "isEntrypointCandidate": true,
  "attributes": {
    "returnType": "IEnumerable<WeatherForecast>",
    "httpMethod": "GET",
    "route": "/weatherforecast"
  },
  "flags": 3
}
```
### Node Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | Unique identifier within the graph |
| `nodeId` | string | No | Alias for id (v1 schema convention) |
| `name` | string | Yes | Human-readable symbol name |
| `kind` | string | Yes | Symbol kind (method, function, class) |
| `namespace` | string | No | Namespace or module path |
| `file` | string | No | Source file path |
| `line` | int | No | Source line number |
| `symbolKey` | string | No | Canonical symbol key (v1) |
| `artifactKey` | string | No | Reference to containing artifact |
| `visibility` | SymbolVisibility | No | Access visibility |
| `isEntrypointCandidate` | bool | No | Whether node is an entrypoint candidate |
| `purl` | string | No | Package URL for external packages |
| `symbolDigest` | string | No | Content-addressed symbol digest |
| `attributes` | object | No | Additional attributes |
| `flags` | int | No | Bitmask for efficient filtering |
### Symbol Key Format
The `symbolKey` follows a canonical format:
```
{Namespace}.{Type}[`Arity][+Nested]::{Method}[`Arity]({ParamTypes})
```
Examples:
- `System.String::Concat(string, string)`
- `MyApp.Controllers.UserController::GetUser(int)`
- ``System.Collections.Generic.List`1::Add(T)``
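
The format above can be illustrated with a small builder. This is a minimal sketch, not the analyzer's actual implementation: the function name `build_symbol_key` and its parameters are hypothetical, and the `, ` separator between parameter types is inferred from the examples.

```python
def build_symbol_key(namespace, type_name, method, param_types,
                     type_arity=0, nested=(), method_arity=0):
    """Assemble a symbolKey string in the documented canonical layout.

    Hypothetical helper: arity is appended with a backtick and nested
    types with '+', in the order the format string lists them.
    """
    type_part = f"{namespace}.{type_name}" if namespace else type_name
    if type_arity:
        type_part += f"`{type_arity}"
    for inner in nested:
        type_part += f"+{inner}"
    method_part = method
    if method_arity:
        method_part += f"`{method_arity}"
    return f"{type_part}::{method_part}({', '.join(param_types)})"


print(build_symbol_key("System", "String", "Concat", ["string", "string"]))
print(build_symbol_key("System.Collections.Generic", "List", "Add", ["T"],
                       type_arity=1))
```

The two calls reproduce the first and third examples listed above.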
## Edge Structure
A `CallgraphEdge` represents a call relationship between two symbols:
```json
{
  "sourceId": "n001",
  "targetId": "n002",
  "from": "n001",
  "to": "n002",
  "type": "call",
  "kind": "Static",
  "reason": "DirectCall",
  "weight": 1.0,
  "offset": 42,
  "isResolved": true,
  "provenance": "static-analysis"
}
```
### Edge Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `sourceId` | string | Yes | Source node ID (caller) |
| `targetId` | string | Yes | Target node ID (callee) |
| `from` | string | No | Alias for sourceId (v1) |
| `to` | string | No | Alias for targetId (v1) |
| `type` | string | No | Legacy edge type |
| `kind` | EdgeKind | No | Edge classification |
| `reason` | EdgeReason | No | Reason for edge existence |
| `weight` | double | No | Confidence weight (0.0-1.0) |
| `offset` | int | No | IL/bytecode offset |
| `isResolved` | bool | No | Whether target was fully resolved |
| `provenance` | string | No | Provenance information |
| `candidates` | string[] | No | Virtual dispatch candidates |
## Entrypoint Structure
A `CallgraphEntrypoint` represents a discovered entrypoint:
```json
{
  "nodeId": "n001",
  "kind": "Http",
  "route": "/api/users/{id}",
  "httpMethod": "GET",
  "framework": "AspNetCore",
  "source": "attribute",
  "phase": "Runtime",
  "order": 0
}
```
### Entrypoint Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `nodeId` | string | Yes | Reference to the node |
| `kind` | EntrypointKind | Yes | Type of entrypoint |
| `route` | string | No | HTTP route pattern |
| `httpMethod` | string | No | HTTP method (GET, POST, etc.) |
| `framework` | EntrypointFramework | No | Framework exposing the entrypoint |
| `source` | string | No | Discovery source |
| `phase` | EntrypointPhase | No | Execution phase |
| `order` | int | No | Deterministic ordering |
## Determinism Requirements
For reproducible analysis, call graphs must be deterministic:
1. **Stable Ordering**
   - Nodes must be sorted by `id` (ordinal string comparison)
   - Edges must be sorted by `sourceId`, then `targetId`
   - Entrypoints must be sorted by `order`
2. **Enum Serialization**
   - All enums serialize as camelCase strings
   - Example: `EdgeReason.DirectCall` → `"directCall"`
3. **Timestamps**
   - All timestamps must be UTC ISO 8601 format
   - Example: `2025-01-15T10:00:00Z`
4. **Content Hashing**
   - The `graphHash` field should contain a stable content hash
   - Hash algorithm: SHA-256
   - Format: `sha256:{hex-digest}`
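
The ordering and hashing requirements can be sketched as follows. This is an illustration under stated assumptions, not the StellaOps implementation: the canonical byte form (compact JSON with sorted keys) and the exclusion of `graphHash` and `ingestedAt` from the hashed payload are assumptions, and `canonicalize` and `graph_hash` are hypothetical names.

```python
import hashlib
import json


def canonicalize(graph):
    """Return a shallow copy of the graph with the documented stable ordering.

    Python compares strings by code point, which coincides with ordinal
    comparison for ASCII identifiers such as "n001".
    """
    ordered = dict(graph)
    sort_keys = {
        "nodes": lambda n: n["id"],
        "edges": lambda e: (e["sourceId"], e["targetId"]),
        "entrypoints": lambda p: p.get("order", 0),
    }
    for field, key in sort_keys.items():
        if field in ordered:
            ordered[field] = sorted(ordered[field], key=key)
    return ordered


def graph_hash(graph):
    """Compute a `sha256:{hex-digest}` value over the canonical form.

    Assumption: volatile fields are left out so that re-ingesting the
    same content yields the same hash.
    """
    body = {k: v for k, v in canonicalize(graph).items()
            if k not in ("graphHash", "ingestedAt")}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()
```

Two documents that differ only in node or edge order hash to the same value under this scheme, which is the property the deduplication use of `graphHash` relies on.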
## Schema Migration
Legacy call graphs without the `schema` field are automatically migrated:
1. **Schema Field**: Set to `stella.callgraph.v1`
2. **Language Parsing**: String language converted to `CallgraphLanguage` enum
3. **Visibility Inference**: Inferred from symbol key patterns:
   - Contains `.Internal.` → `Internal`
   - Contains `._` or `<` → `Private`
   - Default → `Public`
4. **Edge Reason Inference**: Based on legacy `type` field:
   - `call`, `direct` → `DirectCall`
   - `virtual`, `callvirt` → `VirtualCall`
   - `newobj` → `NewObj`
   - etc.
5. **Entrypoint Inference**: Built from legacy `roots` and candidate nodes
6. **Symbol Key Generation**: Built from namespace and name if missing
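
Steps 3 and 4 can be sketched as two small lookups. This is a hedged illustration rather than the code in `CallgraphSchemaMigrator.cs`: the order in which the visibility patterns are tested, the case-insensitive matching, and the fallback to `Unknown` for unlisted legacy types are all assumptions.

```python
def infer_visibility(symbol_key):
    """Infer a SymbolVisibility value from symbol key patterns."""
    if ".Internal." in symbol_key:
        return "Internal"
    if "._" in symbol_key or "<" in symbol_key:
        return "Private"
    return "Public"


# Only the mappings listed in the document; the real table is longer ("etc.").
EDGE_REASON_BY_LEGACY_TYPE = {
    "call": "DirectCall",
    "direct": "DirectCall",
    "virtual": "VirtualCall",
    "callvirt": "VirtualCall",
    "newobj": "NewObj",
}


def infer_edge_reason(legacy_type):
    """Map a legacy edge `type` to an EdgeReason (assumed default: Unknown)."""
    return EDGE_REASON_BY_LEGACY_TYPE.get((legacy_type or "").lower(), "Unknown")


print(infer_visibility("MyApp.Internal.Cache::Get(string)"))
print(infer_edge_reason("callvirt"))
```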
## Validation Rules
Call graphs are validated against these rules:
1. All node `id` values must be unique
2. All edge `sourceId` and `targetId` must reference existing nodes
3. All entrypoint `nodeId` must reference existing nodes
4. Edge `weight` must be between 0.0 and 1.0
5. Artifacts referenced by nodes must exist in the `artifacts` list
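
The five rules translate into a straightforward check. The sketch below is one possible reading, not the shipped validator: `validate_callgraph` is a hypothetical name, and it assumes each entry in `artifacts` carries an `artifactKey` field matching the node-level `artifactKey`, which this document does not specify.

```python
def validate_callgraph(graph):
    """Return a list of human-readable violations of the rules above."""
    errors = []
    nodes = graph.get("nodes", [])

    # Rule 1: node ids must be unique.
    known = set()
    for node in nodes:
        if node["id"] in known:
            errors.append(f"duplicate node id: {node['id']}")
        known.add(node["id"])

    # Rules 2 and 4: edge endpoints must exist, weight must be in [0.0, 1.0].
    for index, edge in enumerate(graph.get("edges", [])):
        for field in ("sourceId", "targetId"):
            if edge.get(field) not in known:
                errors.append(f"edge {index}: unknown {field} {edge.get(field)!r}")
        weight = edge.get("weight")
        if weight is not None and not 0.0 <= weight <= 1.0:
            errors.append(f"edge {index}: weight {weight} outside [0.0, 1.0]")

    # Rule 3: entrypoints must reference existing nodes.
    for index, entry in enumerate(graph.get("entrypoints", [])):
        if entry.get("nodeId") not in known:
            errors.append(f"entrypoint {index}: unknown nodeId {entry.get('nodeId')!r}")

    # Rule 5: assumed artifact identifier field is "artifactKey".
    artifact_keys = {a.get("artifactKey") for a in graph.get("artifacts", [])}
    for node in nodes:
        key = node.get("artifactKey")
        if key is not None and key not in artifact_keys:
            errors.append(f"node {node['id']}: unknown artifact {key!r}")

    return errors
```

Returning every violation instead of raising on the first one keeps the output stable across runs, which fits the deterministic-examples goal of the fixtures.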
## Golden Fixtures
Reference fixtures for testing are located at:
`tests/reachability/fixtures/callgraph-schema-v1/`
| Fixture | Description |
|---------|-------------|
| `dotnet-aspnetcore-minimal.json` | ASP.NET Core application |
| `java-spring-boot.json` | Spring Boot application |
| `node-express-api.json` | Express.js API |
| `go-gin-api.json` | Go Gin API |
| `legacy-no-schema.json` | Legacy format for migration testing |
| `all-edge-reasons.json` | All 13 edge reason codes |
| `all-visibility-levels.json` | All 5 visibility levels |
## Related Documentation
- [Reachability Analysis Technical Reference](../reachability/README.md)
- [Schema Migration Implementation](../../src/Signals/StellaOps.Signals/Parsing/CallgraphSchemaMigrator.cs)
- [SPRINT_1100: CallGraph Schema Enhancement](../implplan/SPRINT_1100_0001_0001_callgraph_schema_enhancement.md)


@@ -0,0 +1,383 @@
# Unknowns Ranking Algorithm Reference
This document describes the multi-factor scoring algorithm used to rank and triage unknowns in the StellaOps Signals module.
## Purpose
When reachability analysis encounters unresolved symbols, edges, or package identities, these are recorded as **unknowns**. The ranking algorithm prioritizes unknowns by computing a composite score from five factors, then assigns each to a triage band (HOT/WARM/COLD) that determines rescan scheduling and escalation policies.
## Scoring Formula
The composite score is computed as:
```
Score = wP × P + wE × E + wU × U + wC × C + wS × S
```
Where:
- **P** = Popularity (deployment impact)
- **E** = Exploit potential (CVE severity)
- **U** = Uncertainty density (flag accumulation)
- **C** = Centrality (graph position importance)
- **S** = Staleness (evidence age)
All factors are normalized to [0.0, 1.0] before weighting. The final score is clamped to [0.0, 1.0].
### Default Weights
| Factor | Weight | Description |
|--------|--------|-------------|
| wP | 0.25 | Popularity weight |
| wE | 0.25 | Exploit potential weight |
| wU | 0.25 | Uncertainty density weight |
| wC | 0.15 | Centrality weight |
| wS | 0.10 | Staleness weight |
Weights must sum to 1.0 and are configurable via `Signals:UnknownsScoring` settings.
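
As a worked illustration of the formula and default weights, the following sketch computes the composite score. The names `DEFAULT_WEIGHTS` and `composite_score` are hypothetical, and clamping each factor inside the function is a defensive assumption; the document only states that factors arrive normalized.

```python
DEFAULT_WEIGHTS = {"P": 0.25, "E": 0.25, "U": 0.25, "C": 0.15, "S": 0.10}


def clamp01(value):
    """Clamp a number to the closed interval [0.0, 1.0]."""
    return min(1.0, max(0.0, value))


def composite_score(factors, weights=DEFAULT_WEIGHTS):
    """Score = wP*P + wE*E + wU*U + wC*C + wS*S, clamped to [0.0, 1.0]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    total = sum(weights[name] * clamp01(factors[name]) for name in weights)
    return clamp01(total)


# Example factor values chosen for illustration only.
score = composite_score({"P": 0.8, "E": 0.5, "U": 0.55, "C": 0.25, "S": 0.5})
print(round(score, 4))
```

With these inputs the weighted terms are 0.2, 0.125, 0.1375, 0.0375 and 0.05, which add up to 0.55.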
## Factor Details
### Factor P: Popularity (Deployment Impact)
Measures how widely the unknown's package is deployed across monitored environments.
**Formula:**
```
P = min(1, log10(1 + deploymentCount) / log10(1 + maxDeployments))
```
**Parameters:**
- `deploymentCount`: Number of deployments referencing the package (from `deploy_refs` table)
- `maxDeployments`: Normalization ceiling (default: 100)
**Rationale:** Logarithmic scaling prevents a single highly-deployed package from dominating scores while still prioritizing widely-used dependencies.
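
A minimal sketch of the popularity formula, assuming a non-negative integer deployment count; the function name `popularity` is illustrative.

```python
import math


def popularity(deployment_count, max_deployments=100):
    """P = min(1, log10(1 + count) / log10(1 + max)), with max > 0."""
    if deployment_count <= 0:
        return 0.0
    ratio = math.log10(1 + deployment_count) / math.log10(1 + max_deployments)
    return min(1.0, ratio)


for count in (0, 9, 100, 5000):
    print(count, round(popularity(count), 3))
```

With the default ceiling, 9 deployments already score about 0.50, and anything at or above 100 saturates at 1.0, which shows how the logarithm compresses large counts.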
### Factor E: Exploit Potential (CVE Severity)
Estimates the consequence severity if the unknown resolves to a vulnerable component.
**Current Implementation:**
- Returns 0.5 (medium potential) when no CVE association exists
- Future: Integrate KEV lookup, EPSS scores, and exploit database references
**Planned Enhancements:**
- CVE severity mapping (Critical=1.0, High=0.8, Medium=0.5, Low=0.2)
- KEV (Known Exploited Vulnerabilities) flag boost
- EPSS (Exploit Prediction Scoring System) integration
### Factor U: Uncertainty Density (Flag Accumulation)
Aggregates uncertainty signals from multiple sources. Each flag contributes a weighted penalty.
**Flag Weights:**
| Flag | Weight | Description |
|------|--------|-------------|
| `NoProvenanceAnchor` | 0.30 | Cannot verify package source |
| `VersionRange` | 0.25 | Version specified as range, not exact |
| `DynamicCallTarget` | 0.25 | Reflection, eval, or dynamic dispatch |
| `ConflictingFeeds` | 0.20 | Contradictory info from different feeds |
| `ExternalAssembly` | 0.20 | Assembly outside analysis scope |
| `MissingVector` | 0.15 | No CVSS vector for severity assessment |
| `UnreachableSourceAdvisory` | 0.10 | Source advisory URL unreachable |
**Formula:**
```
U = min(1.0, sum(activeFlags × flagWeight))
```
**Example:**
- NoProvenanceAnchor (0.30) + VersionRange (0.25) + MissingVector (0.15) = 0.70
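
The flag accumulation can be sketched as a capped sum over the active flags. The weights are copied from the table above; the function name and the choice to count a repeated flag only once are assumptions.

```python
FLAG_WEIGHTS = {
    "NoProvenanceAnchor": 0.30,
    "VersionRange": 0.25,
    "DynamicCallTarget": 0.25,
    "ConflictingFeeds": 0.20,
    "ExternalAssembly": 0.20,
    "MissingVector": 0.15,
    "UnreachableSourceAdvisory": 0.10,
}


def uncertainty(active_flags):
    """U = min(1.0, sum of weights of the active flags).

    Flags are de-duplicated and summed in sorted order so the floating
    point result does not depend on input order.
    """
    return min(1.0, sum(FLAG_WEIGHTS[flag] for flag in sorted(set(active_flags))))


print(round(uncertainty(["NoProvenanceAnchor", "VersionRange", "MissingVector"]), 2))
```

Because the seven weights total 1.45, the cap matters: an unknown carrying every flag scores 1.0 rather than exceeding the normalized range.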
### Factor C: Centrality (Graph Position Importance)
Measures how important the unknown's position is in the call graph, using betweenness centrality.
**Formula:**
```
C = min(1.0, betweenness / maxBetweenness)
```
**Parameters:**
- `betweenness`: Raw betweenness centrality from graph analysis
- `maxBetweenness`: Normalization ceiling (default: 1000)
**Rationale:** High-betweenness nodes appear on many shortest paths, meaning they're likely to be reached regardless of entry point.
**Related Metrics:**
- `DegreeCentrality`: Number of incoming + outgoing edges (stored but not used in score)
- `BetweennessCentrality`: Raw betweenness value (stored for debugging)
### Factor S: Staleness (Evidence Age)
Measures how much time has passed since the last successful analysis of the unknown.
**Formula:**
```
S = min(1.0, daysSinceLastAnalysis / maxDays)
```
With exponential decay enhancement (optional):
```
S = 1 - exp(-daysSinceLastAnalysis / tau)
```
**Parameters:**
- `daysSinceLastAnalysis`: Days since `LastAnalyzedAt` timestamp
- `maxDays`: Staleness ceiling (default: 14 days)
- `tau`: Decay constant for exponential model (default: 14)
**Special Cases:**
- Never analyzed (`LastAnalyzedAt` is null): S = 1.0 (maximum staleness)
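
Both staleness variants and the null special case fit in one function. This is a sketch under assumptions: the function signature is hypothetical, elapsed time is truncated to whole days, and passing `tau` is how the caller opts into the exponential model.

```python
import math
from datetime import datetime, timedelta, timezone


def staleness(last_analyzed_at, now, max_days=14, tau=None):
    """Linear S = min(1, days / max_days), or 1 - exp(-days / tau) if tau is set."""
    if last_analyzed_at is None:
        return 1.0  # never analyzed: maximum staleness
    days = max(0, (now - last_analyzed_at).days)
    if tau is not None:
        return 1.0 - math.exp(-days / tau)
    return min(1.0, days / max_days)


now = datetime(2025, 12, 15, 10, 0, tzinfo=timezone.utc)
week_ago = now - timedelta(days=7)
print(staleness(week_ago, now))
print(round(staleness(week_ago, now, tau=14), 3))
```

The linear model reaches 1.0 at `max_days` and stays there, while the exponential model approaches 1.0 asymptotically, so very old evidence is never fully saturated.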
## Band Assignment
Based on the composite score, unknowns are assigned to triage bands:
| Band | Threshold | Rescan Policy | Description |
|------|-----------|---------------|-------------|
| **HOT** | Score >= 0.70 | 15 minutes | Immediate rescan + VEX escalation |
| **WARM** | 0.40 <= Score < 0.70 | 24 hours | Scheduled rescan within 12-72h |
| **COLD** | Score < 0.40 | 7 days | Weekly batch processing |
Thresholds are configurable:
```yaml
Signals:
  UnknownsScoring:
    HotThreshold: 0.70
    WarmThreshold: 0.40
```
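
The threshold table maps to a two-step comparison. A minimal sketch, with a hypothetical function name and band labels spelled as in the trace example (`Hot`, `Warm`, `Cold`):

```python
def assign_band(score, hot_threshold=0.70, warm_threshold=0.40):
    """Map a composite score to a triage band; lower bounds are inclusive."""
    if score >= hot_threshold:
        return "Hot"
    if score >= warm_threshold:
        return "Warm"
    return "Cold"


for value in (0.82, 0.70, 0.55, 0.40, 0.39):
    print(value, assign_band(value))
```

Scores exactly on a threshold fall into the higher band, matching the `>=` comparisons in the table.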
## Scheduler Integration
The `UnknownsRescanWorker` processes unknowns based on their band:
### HOT Band Processing
- Poll interval: 1 minute
- Batch size: 10 items
- Action: Trigger immediate rescan via `IRescanOrchestrator`
- On failure: Exponential backoff, max 3 retries before demotion to WARM
### WARM Band Processing
- Poll interval: 5 minutes
- Batch size: 50 items
- Scheduled window: 12-72 hours based on score within band
- On failure: Increment `RescanAttempts`, re-queue with delay
### COLD Band Processing
- Schedule: Weekly on configurable day (default: Sunday)
- Batch size: 500 items
- Action: Batch rescan job submission
- On failure: Log and retry next week
## Normalization Trace
Each scored unknown includes a `NormalizationTrace` for debugging and replay:
```json
{
  "rawPopularity": 42,
  "normalizedPopularity": 0.815,
  "popularityFormula": "min(1, log10(1 + 42) / log10(1 + 100))",
  "rawExploitPotential": 0.5,
  "normalizedExploitPotential": 0.5,
  "rawUncertainty": 0.55,
  "normalizedUncertainty": 0.55,
  "activeFlags": ["NoProvenanceAnchor", "VersionRange"],
  "rawCentrality": 250.0,
  "normalizedCentrality": 0.25,
  "rawStaleness": 7,
  "normalizedStaleness": 0.5,
  "weights": {
    "wP": 0.25,
    "wE": 0.25,
    "wU": 0.25,
    "wC": 0.15,
    "wS": 0.10
  },
  "finalScore": 0.55,
  "assignedBand": "Warm",
  "computedAt": "2025-12-15T10:00:00Z"
}
```
**Replay Capability:** Given the trace, the exact score can be recomputed:
```
Score = 0.25×0.815 + 0.25×0.5 + 0.25×0.55 + 0.15×0.25 + 0.10×0.5
      = 0.20375 + 0.125 + 0.1375 + 0.0375 + 0.05
      = 0.55375 ≈ 0.55
```
## API Endpoints
### Query Unknowns by Band
```
GET /api/signals/unknowns?band=hot&limit=50&offset=0
```
Response:
```json
{
  "items": [
    {
      "id": "unk-123",
      "subjectKey": "myapp|1.0.0",
      "purl": "pkg:npm/lodash@4.17.21",
      "score": 0.82,
      "band": "Hot",
      "flags": { "noProvenanceAnchor": true, "versionRange": true },
      "nextScheduledRescan": "2025-12-15T10:15:00Z"
    }
  ],
  "total": 15,
  "hasMore": false
}
```
### Get Score Explanation
```
GET /api/signals/unknowns/{id}/explain
```
Response:
```json
{
  "unknown": { /* full UnknownSymbolDocument */ },
  "normalizationTrace": { /* trace object */ },
  "factorBreakdown": {
    "popularity": { "raw": 42, "normalized": 0.815, "weighted": 0.20375 },
    "exploitPotential": { "raw": 0.5, "normalized": 0.5, "weighted": 0.125 },
    "uncertainty": { "raw": 0.55, "normalized": 0.55, "weighted": 0.1375 },
    "centrality": { "raw": 250, "normalized": 0.25, "weighted": 0.0375 },
    "staleness": { "raw": 7, "normalized": 0.5, "weighted": 0.05 }
  },
  "bandThresholds": { "hot": 0.70, "warm": 0.40 }
}
```
## Configuration Reference
```yaml
Signals:
  UnknownsScoring:
    # Factor weights (must sum to 1.0)
    WeightPopularity: 0.25
    WeightExploitPotential: 0.25
    WeightUncertainty: 0.25
    WeightCentrality: 0.15
    WeightStaleness: 0.10
    # Popularity normalization
    PopularityMaxDeployments: 100
    # Uncertainty flag weights
    FlagWeightNoProvenance: 0.30
    FlagWeightVersionRange: 0.25
    FlagWeightConflictingFeeds: 0.20
    FlagWeightMissingVector: 0.15
    FlagWeightUnreachableSource: 0.10
    FlagWeightDynamicTarget: 0.25
    FlagWeightExternalAssembly: 0.20
    # Centrality normalization
    CentralityMaxBetweenness: 1000.0
    # Staleness normalization
    StalenessMaxDays: 14
    StalenessTau: 14 # For exponential decay
    # Band thresholds
    HotThreshold: 0.70
    WarmThreshold: 0.40
    # Rescan scheduling
    HotRescanMinutes: 15
    WarmRescanHours: 24
    ColdRescanDays: 7
  UnknownsDecay:
    # Nightly batch decay
    BatchEnabled: true
    MaxSubjectsPerBatch: 1000
    ColdBatchDay: Sunday
```
## Determinism Requirements
The scoring algorithm is fully deterministic:
1. **Same inputs produce identical scores** - Given identical `UnknownSymbolDocument`, deployment counts, and graph metrics, the score will always be the same
2. **Normalization trace enables replay** - The trace contains all raw values and weights needed to reproduce the score
3. **Timestamps use UTC ISO 8601** - All `ComputedAt`, `LastAnalyzedAt`, and `NextScheduledRescan` timestamps are UTC
4. **Weights logged per computation** - The trace includes the exact weights used, allowing audit of configuration changes
## Database Schema
```sql
-- Unknowns table (enhanced)
CREATE TABLE signals.unknowns (
    id UUID PRIMARY KEY,
    subject_key TEXT NOT NULL,
    purl TEXT,
    symbol_id TEXT,
    callgraph_id TEXT,
    -- Scoring factors
    popularity_score FLOAT DEFAULT 0,
    deployment_count INT DEFAULT 0,
    exploit_potential_score FLOAT DEFAULT 0,
    uncertainty_score FLOAT DEFAULT 0,
    centrality_score FLOAT DEFAULT 0,
    degree_centrality INT DEFAULT 0,
    betweenness_centrality FLOAT DEFAULT 0,
    staleness_score FLOAT DEFAULT 0,
    days_since_last_analysis INT DEFAULT 0,
    -- Composite score and band
    score FLOAT DEFAULT 0,
    band TEXT DEFAULT 'cold' CHECK (band IN ('hot', 'warm', 'cold')),
    -- Metadata
    flags JSONB DEFAULT '{}',
    normalization_trace JSONB,
    rescan_attempts INT DEFAULT 0,
    last_rescan_result TEXT,
    -- Timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_analyzed_at TIMESTAMPTZ,
    next_scheduled_rescan TIMESTAMPTZ
);
-- Indexes for band-based queries
CREATE INDEX idx_unknowns_band ON signals.unknowns(band);
CREATE INDEX idx_unknowns_score ON signals.unknowns(score DESC);
CREATE INDEX idx_unknowns_next_rescan ON signals.unknowns(next_scheduled_rescan)
    WHERE next_scheduled_rescan IS NOT NULL;
CREATE INDEX idx_unknowns_subject ON signals.unknowns(subject_key);
```
## Metrics and Observability
The following metrics are exposed for monitoring:
| Metric | Type | Description |
|--------|------|-------------|
| `signals_unknowns_total` | Gauge | Total unknowns by band |
| `signals_unknowns_rescans_total` | Counter | Rescans triggered by band |
| `signals_unknowns_scoring_duration_seconds` | Histogram | Scoring computation time |
| `signals_unknowns_band_transitions_total` | Counter | Band changes (e.g., WARM->HOT) |
## Related Documentation
- [Unknowns Registry](./unknowns-registry.md) - Data model and API for unknowns
- [Reachability Analysis](./reachability.md) - Reachability scoring integration
- [Callgraph Schema](./callgraph-formats.md) - Graph structure for centrality computation


@@ -46,6 +46,22 @@ All endpoints are additive; no hard deletes. Payloads must include tenant bindin
- Policy can block `not_affected` claims when `unknowns_pressure` exceeds thresholds.
- UI/CLI show unknown chips with reason and depth; operators can triage or suppress.
### 5.1 Multi-Factor Ranking
Unknowns are ranked using a 5-factor scoring algorithm that computes a composite score from:
- **Popularity (P)** - Deployment impact based on usage count
- **Exploit Potential (E)** - CVE severity if known
- **Uncertainty (U)** - Accumulated flag weights
- **Centrality (C)** - Graph position importance (betweenness)
- **Staleness (S)** - Evidence age since last analysis
Based on the composite score, unknowns are assigned to triage bands:
- **HOT** (score >= 0.70): Immediate rescan, 15-minute scheduling
- **WARM** (0.40 <= score < 0.70): Scheduled rescan within 12-72h
- **COLD** (score < 0.40): Weekly batch processing
See [Unknowns Ranking Algorithm](./unknowns-ranking.md) for the complete formula reference.
## 6. Storage & CAS
- Primary store: append-only KV/graph in Mongo (collections `unknowns`, `unknown_metrics`).