save progress

2025-12-26 22:03:32 +02:00
parent 9a4cd2e0f7
commit e6c47c8f50
3634 changed files with 253222 additions and 56632 deletions
--- a/docs/modules/riskengine/architecture.md
+++ b/docs/modules/riskengine/architecture.md
@@ -0,0 +1,325 @@
+# component_architecture_riskengine.md - **Stella Ops RiskEngine** (2025Q4)
+
+> Risk scoring runtime with pluggable providers and explainability.
+
+> **Scope.** Implementation-ready architecture for **RiskEngine**: the scoring runtime that computes Risk Scoring Profiles across deployments while preserving provenance and explainability. Covers scoring workers, providers, caching, and integration with Policy Engine.
+
+---
+
+## 0) Mission & boundaries
+
+**Mission.** Compute **deterministic, explainable risk scores** for vulnerabilities by aggregating signals from multiple data sources (EPSS, CVSS, KEV, VEX, reachability). Produce audit trails and explainability payloads for every scoring decision.
+
+**Boundaries.**
+
+* RiskEngine **does not** make PASS/FAIL decisions. It provides scores to the Policy Engine.
+* RiskEngine **does not** own vulnerability data. It consumes from Concelier, Excititor, and Signals.
+* Scoring is **deterministic**: same inputs produce identical scores.
+* RiskEngine **supports offline/air-gapped** operation via factor bundles.
+
+---
+
+## 1) Solution & project layout
+
+```
+src/RiskEngine/StellaOps.RiskEngine/
+ ├─ StellaOps.RiskEngine.Core/           # Scoring orchestrators, provider contracts
+ │   ├─ Providers/
+ │   │   ├─ IRiskScoreProvider.cs        # Provider interface
+ │   │   ├─ EpssProvider.cs              # EPSS score provider
+ │   │   ├─ CvssKevProvider.cs           # CVSS + KEV provider
+ │   │   ├─ VexGateProvider.cs           # VEX status provider
+ │   │   ├─ FixExposureProvider.cs       # Fix availability provider
+ │   │   └─ DefaultTransformsProvider.cs # Score transformations
+ │   ├─ Contracts/
+ │   │   ├─ ScoreRequest.cs              # Scoring request DTO
+ │   │   └─ RiskScoreResult.cs           # Scoring result with explanation
+ │   └─ Services/
+ │       ├─ RiskScoreWorker.cs           # Scoring job executor
+ │       └─ RiskScoreQueue.cs            # Job queue management
+ │
+ ├─ StellaOps.RiskEngine.Infrastructure/ # Persistence, caching, connectors
+ │   └─ Stores/
+ │       └─ InMemoryRiskScoreResultStore.cs
+ │
+ ├─ StellaOps.RiskEngine.WebService/     # REST API for jobs and results
+ │   └─ Program.cs
+ │
+ ├─ StellaOps.RiskEngine.Worker/         # Background scoring workers
+ │   ├─ Program.cs
+ │   └─ Worker.cs
+ │
+ └─ StellaOps.RiskEngine.Tests/          # Unit and integration tests
+```
+
+---
+
+## 2) External dependencies
+
+* **PostgreSQL** - Score persistence and job state
+* **Concelier** - Vulnerability advisory data, EPSS scores
+* **Excititor** - VEX statements
+* **Signals** - Reachability and runtime signals
+* **Policy Engine** - Consumes risk scores for decision-making
+* **Authority** - Authentication and authorization
+* **Valkey/Redis** - Score caching (optional)
+
+---
+
+## 3) Contracts & data model
+
+### 3.1 ScoreRequest
+
+```csharp
+public sealed record ScoreRequest
+{
+    public required string VulnerabilityId { get; init; }   // CVE or vuln ID
+    public required string ArtifactId { get; init; }        // PURL or component ID
+    public string? TenantId { get; init; }
+    public string? ContextId { get; init; }                 // Scan or assessment ID
+    public IReadOnlyList<string>? EnabledProviders { get; init; }
+}
+```
+
+### 3.2 RiskScoreResult
+
+```csharp
+public sealed record RiskScoreResult
+{
+    public required string RequestId { get; init; }
+    public required decimal FinalScore { get; init; }       // 0.0-10.0
+    public required string Tier { get; init; }              // Critical/High/Medium/Low/Info
+    public required DateTimeOffset ComputedAt { get; init; }
+    public required IReadOnlyList<ProviderContribution> Contributions { get; init; }
+    public required ExplainabilityPayload Explanation { get; init; }
+}
+
+public sealed record ProviderContribution
+{
+    public required string ProviderId { get; init; }
+    public required decimal RawScore { get; init; }
+    public required decimal Weight { get; init; }
+    public required decimal WeightedScore { get; init; }
+    public string? FactorSource { get; init; }              // Where data came from
+    public DateTimeOffset? FactorTimestamp { get; init; }   // When factor was computed
+}
+```
+
+### 3.3 Provider Interface
+
+```csharp
+public interface IRiskScoreProvider
+{
+    string ProviderId { get; }
+    decimal DefaultWeight { get; }
+    TimeSpan CacheTtl { get; }
+
+    Task<ProviderResult> ComputeAsync(
+        ScoreRequest request,
+        CancellationToken ct);
+
+    Task<bool> IsHealthyAsync(CancellationToken ct);
+}
+```
+
+---
+
+## 4) Score Providers
+
+### 4.1 Built-in Providers
+
+| Provider | Data Source | Weight | Description |
+|----------|-------------|--------|-------------|
+| `epss` | Concelier/EPSS | 0.25 | EPSS probability score (0-1 → 0-10) |
+| `cvss-kev` | Concelier | 0.30 | CVSS base + KEV boost |
+| `vex-gate` | Excititor | 0.20 | VEX status (affected/not_affected) |
+| `fix-exposure` | Concelier | 0.15 | Fix availability window |
+| `reachability` | Signals | 0.10 | Code path reachability |
+
+### 4.2 Score Computation
+
+```
+FinalScore = Σ(provider.weight × provider.score) / Σ(provider.weight)
+
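+Tier mapping (lower bound inclusive, so no score falls between bands):
+  9.0 ≤ score ≤ 10.0 → Critical
+  7.0 ≤ score < 9.0  → High
+  4.0 ≤ score < 7.0  → Medium
+  1.0 ≤ score < 4.0  → Low
+  0.0 ≤ score < 1.0  → Info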
+```
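+
+The weighted mean and tier lookup above can be sketched as follows. This is a minimal illustration, not the RiskEngine implementation: `ScoreAggregationSketch`, `Aggregate`, and `ToTier` are hypothetical names, and treating each band's lower bound as inclusive (so that a value such as 8.95 maps to High) is an assumption.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical helper; not part of the documented RiskEngine API.
+public static class ScoreAggregationSketch
+{
+    // Weighted mean: Σ(weight × score) / Σ(weight) over the contributing providers.
+    public static decimal Aggregate(IReadOnlyList<(decimal Weight, decimal Score)> contributions)
+    {
+        decimal totalWeight = contributions.Sum(c => c.Weight);
+        if (totalWeight <= 0m)
+        {
+            throw new ArgumentException(
+                "At least one provider with a positive weight is required.",
+                nameof(contributions));
+        }
+
+        decimal weightedSum = contributions.Sum(c => c.Weight * c.Score);
+        return weightedSum / totalWeight;
+    }
+
+    // Assumption: band lower bounds are inclusive, so 8.95 maps to High.
+    public static string ToTier(decimal score) => score switch
+    {
+        >= 9.0m => "Critical",
+        >= 7.0m => "High",
+        >= 4.0m => "Medium",
+        >= 1.0m => "Low",
+        _ => "Info",
+    };
+}
+```
+
+With the weights from 4.1 and illustrative raw scores of 6.0, 9.0, 10.0, 5.0, and 2.0 (in table order), the weighted sum is 7.15 over a total weight of 1.00, which lands in the High tier. Because the divisor is the sum of the weights actually supplied, leaving a provider out rescales the remaining weights rather than lowering the score.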
+
+### 4.3 Provider Data Sources
+
+```csharp
+public interface IEpssSources
+{
+    Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct);
+}
+
+public interface ICvssKevSources
+{
+    Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct);
+    Task<bool> IsKevAsync(string cveId, CancellationToken ct);
+}
+```
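+
+As an illustration of how a provider might sit on top of one of these source interfaces, the sketch below scales an EPSS probability (0-1) onto the 0-10 range noted in 4.1. The shapes of `EpssScore` and `ProviderResult` are not defined in this document, so the record definitions here are assumptions, as are `EpssProviderSketch`, `InMemoryEpssSources`, the `"concelier/epss"` source label, and the choice to return `null` when no EPSS data exists. For brevity the method takes a CVE ID rather than a full `ScoreRequest`.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical shapes: the document does not define EpssScore or ProviderResult.
+public sealed record EpssScore(decimal Probability, DateTimeOffset PublishedAt);
+public sealed record ProviderResult(decimal RawScore, string? FactorSource, DateTimeOffset? FactorTimestamp);
+
+// Repeated from section 4.3 so the sketch compiles on its own.
+public interface IEpssSources
+{
+    Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct);
+}
+
+// In-memory source used only for demonstration.
+public sealed class InMemoryEpssSources : IEpssSources
+{
+    private readonly IReadOnlyDictionary<string, EpssScore> _scores;
+
+    public InMemoryEpssSources(IReadOnlyDictionary<string, EpssScore> scores) => _scores = scores;
+
+    public Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct) =>
+        Task.FromResult<EpssScore?>(_scores.TryGetValue(cveId, out var score) ? score : null);
+}
+
+public sealed class EpssProviderSketch
+{
+    private readonly IEpssSources _sources;
+
+    public EpssProviderSketch(IEpssSources sources) => _sources = sources;
+
+    public string ProviderId => "epss";
+    public decimal DefaultWeight => 0.25m;
+
+    // Returns null when no EPSS data exists, so the caller can leave this
+    // provider out of the weighted mean (an assumption, not documented behavior).
+    public async Task<ProviderResult?> ComputeAsync(string cveId, CancellationToken ct)
+    {
+        EpssScore? epss = await _sources.GetScoreAsync(cveId, ct);
+        if (epss is null)
+        {
+            return null;
+        }
+
+        // Clamp defensively, then scale the 0-1 probability onto the 0-10 range.
+        decimal probability = Math.Clamp(epss.Probability, 0m, 1m);
+        return new ProviderResult(probability * 10m, "concelier/epss", epss.PublishedAt);
+    }
+}
+```
+
+Keeping the data access behind `IEpssSources` means the same provider logic can be driven by a Concelier-backed source online and by a factor-bundle-backed source offline, which also makes fixture-based provider tests straightforward.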
+
+---
+
+## 5) REST API (RiskEngine.WebService)
+
+All endpoints are under `/api/v1/risk`. Auth: **OpTok**.
+
+```
+POST /scores                    { request: ScoreRequest } → { jobId }
+GET  /scores/{jobId}            → { result: RiskScoreResult, status }
+GET  /scores/{jobId}/explain    → { explanation: ExplainabilityPayload }
+
+POST /batch                     { requests: ScoreRequest[] } → { batchId }
+GET  /batch/{batchId}           → { results: RiskScoreResult[], status }
+
+GET  /providers                 → { providers: ProviderInfo[] }
+GET  /providers/{id}/health     → { healthy: bool, lastCheck }
+
+GET  /healthz | /readyz | /metrics
+```
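+
+A client-side sketch of the request body for `POST /scores` is shown below. It only builds the JSON envelope and sends nothing over the network. The camelCase property names and the omission of null fields are assumptions about the wire format, and `ScoreRequestPayloadSketch` is a hypothetical helper.
+
+```csharp
+using System.Text.Json;
+using System.Text.Json.Serialization;
+
+// Hypothetical helper: builds the body for POST /api/v1/risk/scores.
+public static class ScoreRequestPayloadSketch
+{
+    // Assumption: camelCase property names and omitted nulls on the wire.
+    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web)
+    {
+        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
+    };
+
+    public static string Build(string vulnerabilityId, string artifactId, string? tenantId = null)
+    {
+        // Mirrors the `{ request: ScoreRequest }` envelope from the endpoint list.
+        var envelope = new
+        {
+            Request = new
+            {
+                VulnerabilityId = vulnerabilityId,
+                ArtifactId = artifactId,
+                TenantId = tenantId,
+            },
+        };
+        return JsonSerializer.Serialize(envelope, Options);
+    }
+}
+```
+
+The returned `jobId` would then be used with `GET /scores/{jobId}` until `status` reports completion; the set of status values is not specified here.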
+
+---
+
+## 6) Configuration (YAML)
+
+```yaml
+RiskEngine:
+  Postgres:
+    ConnectionString: "Host=postgres;Database=risk;..."
+
+  Cache:
+    Enabled: true
+    Provider: "valkey"
+    ConnectionString: "redis://valkey:6379"
+    DefaultTtl: "00:15:00"
+
+  Providers:
+    Epss:
+      Enabled: true
+      Weight: 0.25
+      CacheTtl: "01:00:00"
+      Source: "concelier"
+
+    CvssKev:
+      Enabled: true
+      Weight: 0.30
+      KevBoost: 2.0
+
+    VexGate:
+      Enabled: true
+      Weight: 0.20
+      NotAffectedScore: 0.0
+      AffectedScore: 10.0
+
+    FixExposure:
+      Enabled: true
+      Weight: 0.15
+      NoFixPenalty: 1.5
+
+    Reachability:
+      Enabled: true
+      Weight: 0.10
+      UnreachableDiscount: 0.5
+
+  Worker:
+    Concurrency: 4
+    BatchSize: 100
+    PollInterval: "00:00:05"
+
+  Offline:
+    FactorBundlePath: "/data/risk-factors"
+    AllowStaleData: true
+    MaxStalenessHours: 168
+```
+
+---
+
+## 7) Security & compliance
+
+* **AuthN/Z**: Authority-issued OpToks with `risk.score` scope
+* **Tenant isolation**: Scores scoped by tenant ID
+* **Audit trail**: All scoring decisions logged with inputs and factors
+* **No PII**: Only vulnerability and artifact identifiers processed
+
+---
+
+## 8) Performance targets
+
+* **Single score**: < 100 ms P95 (cached factors)
+* **Batch scoring**: < 500 ms P95 for 100 items
+* **Provider health check**: < 1 s timeout
+* **Cache hit rate**: > 80% for repeated CVEs
+
+---
+
+## 9) Observability
+
+**Metrics:**
+* `risk.scores.computed_total{tier,provider}`
+* `risk.scores.duration_seconds`
+* `risk.providers.health{provider,status}`
+* `risk.cache.hits_total` / `risk.cache.misses_total`
+* `risk.batch.size_histogram`
+
+**Tracing:** Spans for each provider contribution, cache operations, and aggregation.
+
+**Logs:** Structured logs with `cve_id`, `artifact_id`, `tenant`, `final_score`.
+
+---
+
+## 10) Testing matrix
+
+* **Provider tests**: Each provider returns expected scores for fixture data
+* **Aggregation tests**: Weighted combination produces correct final score
+* **Determinism tests**: Same inputs produce identical scores
+* **Cache tests**: Cache hits and misses behave as expected
+* **Offline tests**: Factor bundles load and score correctly
+* **Integration tests**: Full scoring pipeline with mocked data sources
+
+---
+
+## 11) Offline/Air-Gap Support
+
+### Factor Bundles
+
+Pre-computed factor data for offline operation:
+
+```
+/data/risk-factors/
+  ├─ epss/
+  │   └─ epss-2025-01-15.json.gz
+  ├─ cvss/
+  │   └─ cvss-2025-01-15.json.gz
+  ├─ kev/
+  │   └─ kev-2025-01-15.json
+  └─ manifest.json
+```
+
+### Staleness Handling
+
+When operating offline, scores include staleness indicators:
+
+```json
+{
+  "finalScore": 7.2,
+  "dataFreshness": {
+    "epss": { "age": "48h", "stale": false },
+    "kev": { "age": "24h", "stale": false }
+  }
+}
+```
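+
+One way the `age` and `stale` fields could be derived from a factor's timestamp and the `MaxStalenessHours` setting is sketched below. `FreshnessIndicator` and `StalenessSketch` are hypothetical names, and two behaviors are assumptions: age is reported in whole hours, and a factor exactly at the limit is still treated as fresh.
+
+```csharp
+using System;
+
+// Hypothetical shape mirroring one entry of the "dataFreshness" object.
+public sealed record FreshnessIndicator(string Age, bool Stale);
+
+public static class StalenessSketch
+{
+    public static FreshnessIndicator Evaluate(
+        DateTimeOffset factorTimestamp,
+        DateTimeOffset now,
+        int maxStalenessHours)
+    {
+        TimeSpan age = now - factorTimestamp;
+        if (age < TimeSpan.Zero)
+        {
+            age = TimeSpan.Zero; // guard against clock skew
+        }
+
+        // Assumption: report whole hours, rounded down.
+        long wholeHours = (long)Math.Floor(age.TotalHours);
+
+        // Assumption: a factor exactly at the limit is still fresh.
+        bool stale = age > TimeSpan.FromHours(maxStalenessHours);
+
+        return new FreshnessIndicator($"{wholeHours}h", stale);
+    }
+}
+```
+
+Passing `now` in explicitly, rather than reading the system clock inside the method, keeps the calculation reproducible for identical inputs, in line with the determinism requirement in section 0.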
+
+---
+
+## Related Documentation
+
+* Policy scoring: `../policy/architecture.md`
+* Concelier feeds: `../concelier/architecture.md`
+* Excititor VEX: `../excititor/architecture.md`
+* Signals reachability: `../signals/architecture.md`