# component_architecture_riskengine.md - **Stella Ops RiskEngine** (2025Q4)
> Risk scoring runtime with pluggable providers and explainability.
> **Scope.** Implementation-ready architecture for **RiskEngine**: the scoring runtime that computes Risk Scoring Profiles across deployments while preserving provenance and explainability. Covers scoring workers, providers, caching, and integration with Policy Engine.
---
## 0) Mission & boundaries
**Mission.** Compute **deterministic, explainable risk scores** for vulnerabilities by aggregating signals from multiple data sources (EPSS, CVSS, KEV, VEX, reachability). Produce audit trails and explainability payloads for every scoring decision.

**Boundaries.**
* RiskEngine **does not** make PASS/FAIL decisions. It provides scores to the Policy Engine.
* RiskEngine **does not** own vulnerability data. It consumes from Concelier, Excititor, and Signals.
* Scoring is **deterministic**: same inputs produce identical scores.
* Supports **offline/air-gapped** operation via factor bundles.
---
## 1) Solution & project layout
```
src/RiskEngine/StellaOps.RiskEngine/
├─ StellaOps.RiskEngine.Core/              # Scoring orchestrators, provider contracts
│  ├─ Providers/
│  │  ├─ IRiskScoreProvider.cs             # Provider interface
│  │  ├─ EpssProvider.cs                   # EPSS score provider
│  │  ├─ CvssKevProvider.cs                # CVSS + KEV provider
│  │  ├─ VexGateProvider.cs                # VEX status provider
│  │  ├─ FixExposureProvider.cs            # Fix availability provider
│  │  └─ DefaultTransformsProvider.cs      # Score transformations
│  ├─ Contracts/
│  │  ├─ ScoreRequest.cs                   # Scoring request DTO
│  │  └─ RiskScoreResult.cs                # Scoring result with explanation
│  └─ Services/
│     ├─ RiskScoreWorker.cs                # Scoring job executor
│     └─ RiskScoreQueue.cs                 # Job queue management
├─ StellaOps.RiskEngine.Infrastructure/    # Persistence, caching, connectors
│  └─ Stores/
│     └─ InMemoryRiskScoreResultStore.cs
├─ StellaOps.RiskEngine.WebService/        # REST API for jobs and results
│  └─ Program.cs
├─ StellaOps.RiskEngine.Worker/            # Background scoring workers
│  ├─ Program.cs
│  └─ Worker.cs
└─ StellaOps.RiskEngine.Tests/             # Unit and integration tests
```
---
## 2) External dependencies
* **PostgreSQL** - Score persistence and job state
* **Concelier** - Vulnerability advisory data, EPSS scores
* **Excititor** - VEX statements
* **Signals** - Reachability and runtime signals
* **Policy Engine** - Consumes risk scores for decision-making
* **Authority** - Authentication and authorization
* **Valkey/Redis** - Score caching (optional)
---
## 3) Contracts & data model
### 3.1 ScoreRequest
```csharp
public sealed record ScoreRequest
{
    public required string VulnerabilityId { get; init; }  // CVE or vuln ID
    public required string ArtifactId { get; init; }       // PURL or component ID
    public string? TenantId { get; init; }
    public string? ContextId { get; init; }                // Scan or assessment ID
    public IReadOnlyList<string>? EnabledProviders { get; init; }
}
```
### 3.2 RiskScoreResult
```csharp
public sealed record RiskScoreResult
{
    public required string RequestId { get; init; }
    public required decimal FinalScore { get; init; }      // 0.0-10.0
    public required string Tier { get; init; }             // Critical/High/Medium/Low/Info
    public required DateTimeOffset ComputedAt { get; init; }
    public required IReadOnlyList<ProviderContribution> Contributions { get; init; }
    public required ExplainabilityPayload Explanation { get; init; }
}

public sealed record ProviderContribution
{
    public required string ProviderId { get; init; }
    public required decimal RawScore { get; init; }
    public required decimal Weight { get; init; }
    public required decimal WeightedScore { get; init; }
    public string? FactorSource { get; init; }             // Where data came from
    public DateTimeOffset? FactorTimestamp { get; init; }  // When factor was computed
}
```
### 3.3 Provider Interface
```csharp
public interface IRiskScoreProvider
{
    string ProviderId { get; }
    decimal DefaultWeight { get; }
    TimeSpan CacheTtl { get; }

    Task<ProviderResult> ComputeAsync(
        ScoreRequest request,
        CancellationToken ct);

    Task<bool> IsHealthyAsync(CancellationToken ct);
}
```
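As an illustration of the contract above, here is a minimal sketch of an EPSS-style provider. It is not the actual `EpssProvider`: the shapes of `ProviderResult` and `EpssScore` are not defined in this document, so the versions below are hypothetical stand-ins, as are `SketchEpssProvider`, `FixedEpssSource`, the `"concelier/epss"` source label, and the choice to score a missing EPSS entry as 0. `ScoreRequest` is trimmed to its required members so the block compiles on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Trimmed copy of the section 3.1 contract (optional members omitted).
public sealed record ScoreRequest
{
    public required string VulnerabilityId { get; init; }
    public required string ArtifactId { get; init; }
}

// Hypothetical shapes: the document names these types but does not define them.
public sealed record ProviderResult(string ProviderId, decimal RawScore, string? FactorSource);
public sealed record EpssScore(decimal Probability);

public interface IEpssSources
{
    Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct);
}

public interface IRiskScoreProvider
{
    string ProviderId { get; }
    decimal DefaultWeight { get; }
    TimeSpan CacheTtl { get; }
    Task<ProviderResult> ComputeAsync(ScoreRequest request, CancellationToken ct);
    Task<bool> IsHealthyAsync(CancellationToken ct);
}

// Illustrative provider: maps an EPSS probability (0-1) onto the 0-10 score scale.
public sealed class SketchEpssProvider : IRiskScoreProvider
{
    private readonly IEpssSources _sources;

    public SketchEpssProvider(IEpssSources sources) => _sources = sources;

    public string ProviderId => "epss";
    public decimal DefaultWeight => 0.25m;             // weight from the section 4.1 table
    public TimeSpan CacheTtl => TimeSpan.FromHours(1); // matches CacheTtl "01:00:00" in section 6

    public async Task<ProviderResult> ComputeAsync(ScoreRequest request, CancellationToken ct)
    {
        EpssScore? epss = await _sources.GetScoreAsync(request.VulnerabilityId, ct);

        // Assumption: a missing EPSS entry contributes 0 instead of failing the job.
        decimal probability = Math.Clamp(epss?.Probability ?? 0m, 0m, 1m);

        return new ProviderResult(ProviderId, probability * 10m, epss is null ? null : "concelier/epss");
    }

    // Assumption: a real provider would probe its upstream source here.
    public Task<bool> IsHealthyAsync(CancellationToken ct) => Task.FromResult(true);
}

// In-memory stand-in for the EPSS source, for tests and demos only.
public sealed class FixedEpssSource : IEpssSources
{
    private readonly decimal _probability;

    public FixedEpssSource(decimal probability) => _probability = probability;

    public Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct) =>
        Task.FromResult<EpssScore?>(new EpssScore(_probability));
}
```

With `FixedEpssSource(0.42m)`, `ComputeAsync` yields a raw score of 4.2. The provider returns only the raw score; weighting is left to the aggregation step in section 4.2.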
---
## 4) Score Providers
### 4.1 Built-in Providers
| Provider | Data Source | Weight | Description |
|----------|-------------|--------|-------------|
| `epss` | Concelier/EPSS | 0.25 | EPSS probability score (0-1 → 0-10) |
| `cvss-kev` | Concelier | 0.30 | CVSS base + KEV boost |
| `vex-gate` | Excititor | 0.20 | VEX status (affected/not_affected) |
| `fix-exposure` | Concelier | 0.15 | Fix availability window |
| `reachability` | Signals | 0.10 | Code path reachability |
### 4.2 Score Computation
```
FinalScore = Σ(provider.weight × provider.score) / Σ(provider.weight)
Tier mapping:
9.0-10.0 → Critical
7.0-8.9 → High
4.0-6.9 → Medium
1.0-3.9 → Low
0.0-0.9 → Info
```
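The formula and tier table can be sketched as two small functions. This is an illustrative sketch, not the engine's code. `ScoreAggregationSketch` is a hypothetical name. Rounding the final score to one decimal place before tier mapping is an assumption, made so that the one-decimal ranges in the table cover every value. Returning 0 when no provider contributes is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScoreAggregationSketch
{
    // Weighted average of provider scores, normalized by the sum of weights.
    public static decimal Aggregate(IReadOnlyCollection<(decimal Weight, decimal Score)> contributions)
    {
        decimal totalWeight = contributions.Sum(c => c.Weight);

        // Assumption: with no weighted contributions the score is 0 rather than an error.
        if (totalWeight == 0m)
        {
            return 0m;
        }

        decimal weightedSum = contributions.Sum(c => c.Weight * c.Score);

        // Assumption: round to one decimal so the tier ranges below are exhaustive.
        return Math.Round(weightedSum / totalWeight, 1, MidpointRounding.AwayFromZero);
    }

    // Tier mapping from the table above, applied to a one-decimal score.
    public static string ToTier(decimal score) => score switch
    {
        >= 9.0m => "Critical",
        >= 7.0m => "High",
        >= 4.0m => "Medium",
        >= 1.0m => "Low",
        _ => "Info",
    };
}
```

Worked example with the default weights from section 4.1 and made-up raw scores: `epss` 3.0, `cvss-kev` 9.8, `vex-gate` 10.0, `fix-exposure` 6.0, `reachability` 5.0. The weighted sum is 0.75 + 2.94 + 2.00 + 0.90 + 0.50 = 7.09, the weights sum to 1.00, and the result rounds to 7.1, which maps to High. Because the formula divides by the sum of weights, disabling a provider renormalizes the remaining weights instead of lowering the score.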
### 4.3 Provider Data Sources
```csharp
public interface IEpssSources
{
    Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct);
}

public interface ICvssKevSources
{
    Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct);
    Task<bool> IsKevAsync(string cveId, CancellationToken ct);
}
```
```
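To show how a provider might combine the two `ICvssKevSources` calls, here is a hedged sketch of the "CVSS base + KEV boost" idea. The document does not say how the boost is applied. This sketch assumes it is added to the CVSS base score and the result is capped at 10. `CvssData` is not defined in this document, so the single-property record below is a hypothetical stand-in, as are `CvssKevSketch` and `InMemoryCvssKevSources`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape: the document names CvssData but does not define it.
public sealed record CvssData(decimal BaseScore);

public interface ICvssKevSources
{
    Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct);
    Task<bool> IsKevAsync(string cveId, CancellationToken ct);
}

public static class CvssKevSketch
{
    // Returns null when no CVSS data is known for the CVE.
    public static async Task<decimal?> ScoreAsync(
        ICvssKevSources sources, string cveId, decimal kevBoost, CancellationToken ct)
    {
        CvssData? cvss = await sources.GetCvssAsync(cveId, ct);
        if (cvss is null)
        {
            return null;
        }

        bool isKev = await sources.IsKevAsync(cveId, ct);

        // Assumption: the boost is additive and the result stays on the 0-10 scale.
        decimal boosted = cvss.BaseScore + (isKev ? kevBoost : 0m);
        return Math.Min(boosted, 10m);
    }
}

// In-memory stand-in for the Concelier-backed source, for tests and demos only.
public sealed class InMemoryCvssKevSources : ICvssKevSources
{
    private readonly IReadOnlyDictionary<string, decimal> _baseScores;
    private readonly HashSet<string> _kev;

    public InMemoryCvssKevSources(IReadOnlyDictionary<string, decimal> baseScores, IEnumerable<string> kev)
    {
        _baseScores = baseScores;
        _kev = new HashSet<string>(kev, StringComparer.OrdinalIgnoreCase);
    }

    public Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct) =>
        Task.FromResult<CvssData?>(_baseScores.TryGetValue(cveId, out decimal score) ? new CvssData(score) : null);

    public Task<bool> IsKevAsync(string cveId, CancellationToken ct) =>
        Task.FromResult(_kev.Contains(cveId));
}
```

Under these assumptions and a boost of 2.0 (the `KevBoost` value in section 6), a KEV-listed CVE with base score 7.5 scores 9.5, and one with base score 9.1 is capped at 10.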
---
## 5) REST API (RiskEngine.WebService)
All under `/api/v1/risk`. Auth: **OpTok**.
```
POST /scores                  { request: ScoreRequest }     → { jobId }
GET  /scores/{jobId}                                        → { result: RiskScoreResult, status }
GET  /scores/{jobId}/explain                                → { explanation: ExplainabilityPayload }
POST /batch                   { requests: ScoreRequest[] }  → { batchId }
GET  /batch/{batchId}                                       → { results: RiskScoreResult[], status }
GET  /providers                                             → { providers: ProviderInfo[] }
GET  /providers/{id}/health                                 → { healthy: bool, lastCheck }
GET  /healthz | /readyz | /metrics
```
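A client-side sketch of the submit-then-poll flow implied by the endpoints above, using only `HttpClient` types from the .NET base class library. Several details are assumptions because the document does not specify them: that the OpTok is sent as a `Bearer` token, that JSON property names are camelCase, and the host name in the usage note. `RiskApiRequestSketch` is a hypothetical helper, not part of the service.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Net.Http.Json;

public static class RiskApiRequestSketch
{
    private const string BasePath = "/api/v1/risk";

    // POST /scores with a { request: ScoreRequest } body; the response carries { jobId }.
    public static HttpRequestMessage BuildSubmit(
        Uri baseAddress, string opTok, string vulnerabilityId, string artifactId)
    {
        var message = new HttpRequestMessage(HttpMethod.Post, new Uri(baseAddress, BasePath + "/scores"))
        {
            // Assumption: camelCase property names on the wire.
            Content = JsonContent.Create(new { request = new { vulnerabilityId, artifactId } }),
        };
        message.Headers.Authorization = new AuthenticationHeaderValue("Bearer", opTok); // assumed scheme
        return message;
    }

    // GET /scores/{jobId}; the response carries { result, status }.
    public static HttpRequestMessage BuildPoll(Uri baseAddress, string opTok, string jobId)
    {
        var uri = new Uri(baseAddress, BasePath + "/scores/" + Uri.EscapeDataString(jobId));
        var message = new HttpRequestMessage(HttpMethod.Get, uri);
        message.Headers.Authorization = new AuthenticationHeaderValue("Bearer", opTok); // assumed scheme
        return message;
    }
}
```

A caller would pass each message to `HttpClient.SendAsync`, read `jobId` from the submit response, then poll `GET /scores/{jobId}` until `status` reports completion. Building requests separately from sending them keeps the URL and header logic testable without a running service.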
---
## 6) Configuration (YAML)
```yaml
RiskEngine:
  Postgres:
    ConnectionString: "Host=postgres;Database=risk;..."
  Cache:
    Enabled: true
    Provider: "valkey"
    ConnectionString: "redis://valkey:6379"
    DefaultTtl: "00:15:00"
  Providers:
    Epss:
      Enabled: true
      Weight: 0.25
      CacheTtl: "01:00:00"
      Source: "concelier"
    CvssKev:
      Enabled: true
      Weight: 0.30
      KevBoost: 2.0
    VexGate:
      Enabled: true
      Weight: 0.20
      NotAffectedScore: 0.0
      AffectedScore: 10.0
    FixExposure:
      Enabled: true
      Weight: 0.15
      NoFixPenalty: 1.5
    Reachability:
      Enabled: true
      Weight: 0.10
      UnreachableDiscount: 0.5
  Worker:
    Concurrency: 4
    BatchSize: 100
    PollInterval: "00:00:05"
  Offline:
    FactorBundlePath: "/data/risk-factors"
    AllowStaleData: true
    MaxStalenessHours: 168
```
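The sketch below shows one way the duration strings and provider weights in this configuration could be parsed and sanity-checked at startup. It is an assumption-laden illustration: `ProviderOptionsSketch` and `RiskEngineConfigSketch` are hypothetical names, the document does not describe how the service binds or validates its configuration, and the two validation rules are inferred only from the scoring formula in section 4.2, which divides by the sum of weights.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical options shape covering the two keys every provider entry shares.
public sealed class ProviderOptionsSketch
{
    public bool Enabled { get; set; } = true;
    public decimal Weight { get; set; }
}

public static class RiskEngineConfigSketch
{
    // Durations such as "00:15:00" use the constant ("c") TimeSpan format: hh:mm:ss.
    public static TimeSpan ParseDuration(string value) =>
        TimeSpan.ParseExact(value, "c", CultureInfo.InvariantCulture);

    // Returns human-readable problems; an empty list means the weights are usable.
    public static IReadOnlyList<string> Validate(
        IReadOnlyDictionary<string, ProviderOptionsSketch> providers)
    {
        var errors = new List<string>();

        foreach (var (name, options) in providers)
        {
            if (options.Weight < 0m)
            {
                errors.Add($"Providers:{name}:Weight must not be negative.");
            }
        }

        // The formula divides by the sum of weights, so that sum must be positive.
        decimal enabledWeight = providers.Values.Where(p => p.Enabled).Sum(p => p.Weight);
        if (enabledWeight <= 0m)
        {
            errors.Add("At least one enabled provider needs a positive weight.");
        }

        return errors;
    }
}
```

With the values above, `DefaultTtl` parses to 15 minutes, `CacheTtl` to 1 hour, and `PollInterval` to 5 seconds. The five default weights sum to 1.00, though the formula normalizes by the sum, so they are not required to.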
---
## 7) Security & compliance
* **AuthN/Z**: Authority-issued OpToks with `risk.score` scope
* **Tenant isolation**: Scores scoped by tenant ID
* **Audit trail**: All scoring decisions logged with inputs and factors
* **No PII**: Only vulnerability and artifact identifiers processed
---
## 8) Performance targets
* **Single score**: < 100ms P95 (cached factors)
* **Batch scoring**: < 500ms P95 for 100 items
* **Provider health check**: < 1s timeout
* **Cache hit rate**: > 80% for repeated CVEs
---
## 9) Observability
**Metrics:**
* `risk.scores.computed_total{tier,provider}`
* `risk.scores.duration_seconds`
* `risk.providers.health{provider,status}`
* `risk.cache.hits_total` / `risk.cache.misses_total`
* `risk.batch.size_histogram`

**Tracing:** Spans for each provider contribution, cache operations, and aggregation.

**Logs:** Structured logs with `cve_id`, `artifact_id`, `tenant`, `final_score`.
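As a rough illustration of how two of the metrics listed above could be emitted, the sketch below uses the `System.Diagnostics.Metrics` API from the .NET base class library. The meter name `StellaOps.RiskEngine`, the class name `RiskMetricsSketch`, and the choice of `long` and `double` instrument types are assumptions. The document specifies only the metric names and labels.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class RiskMetricsSketch : IDisposable
{
    // Assumption: the meter name is not given in the document.
    private readonly Meter _meter = new Meter("StellaOps.RiskEngine");
    private readonly Counter<long> _computed;
    private readonly Histogram<double> _duration;

    public RiskMetricsSketch()
    {
        _computed = _meter.CreateCounter<long>("risk.scores.computed_total");
        _duration = _meter.CreateHistogram<double>("risk.scores.duration_seconds", unit: "s");
    }

    // Records one computed score with the {tier, provider} labels and its duration.
    public void RecordScore(string tier, string provider, double durationSeconds)
    {
        _computed.Add(
            1,
            new KeyValuePair<string, object?>("tier", tier),
            new KeyValuePair<string, object?>("provider", provider));
        _duration.Record(durationSeconds);
    }

    public void Dispose() => _meter.Dispose();
}
```

An exporter or a `MeterListener` subscribed to the meter receives these measurements. Which exporter the service uses is not covered here.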
---
## 10) Testing matrix
* **Provider tests**: Each provider returns expected scores for fixture data
* **Aggregation tests**: Weighted combination produces correct final score
* **Determinism tests**: Same inputs produce identical scores
* **Cache tests**: Cache hit/miss behavior is correct
* **Offline tests**: Factor bundles load and score correctly
* **Integration tests**: Full scoring pipeline with mocked data sources
---
## 11) Offline/Air-Gap Support
### Factor Bundles
Pre-computed factor data for offline operation:
```
/data/risk-factors/
├─ epss/
│ └─ epss-2025-01-15.json.gz
├─ cvss/
│ └─ cvss-2025-01-15.json.gz
├─ kev/
│ └─ kev-2025-01-15.json
└─ manifest.json
```
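Given the `<kind>-<yyyy-MM-dd>.json[.gz]` naming visible in the layout above, a loader could pick the newest bundle per factor kind as sketched below. This is a hypothetical helper: the document does not describe the loader or the contents of `manifest.json`, and a real implementation might resolve bundles from the manifest instead of from file names. `FactorBundleSketch` and `LatestBundle` are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FactorBundleSketch
{
    // Matches names such as "epss-2025-01-15.json.gz" or "kev-2025-01-15.json".
    private static readonly Regex NamePattern = new Regex(
        @"^(?<kind>[a-z]+)-(?<date>\d{4}-\d{2}-\d{2})\.json(\.gz)?$",
        RegexOptions.CultureInvariant);

    // Returns the file name with the latest date for the given kind, or null if none match.
    public static string? LatestBundle(IEnumerable<string> fileNames, string kind)
    {
        return fileNames
            .Select(name => (name, match: NamePattern.Match(name)))
            .Where(x => x.match.Success && x.match.Groups["kind"].Value == kind)
            // ISO dates sort chronologically under ordinal string comparison.
            .OrderByDescending(x => x.match.Groups["date"].Value, StringComparer.Ordinal)
            .Select(x => x.name)
            .FirstOrDefault();
    }
}
```

Files that do not follow the naming pattern, such as `manifest.json`, are ignored by the match.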
### Staleness Handling
When operating offline, scores include staleness indicators:
```json
{
"finalScore": 7.2,
"dataFreshness": {
"epss": { "age": "48h", "stale": false },
"kev": { "age": "24h", "stale": false }
}
}
```
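The `dataFreshness` entries could be derived as in the following sketch. It assumes a factor is stale once its age exceeds `Offline:MaxStalenessHours` (168 in section 6) and that `age` is reported in whole hours. The document shows the output shape but not the rule, so both are assumptions, and `Freshness` and `StalenessSketch` are hypothetical names.

```csharp
using System;

// Mirrors one "dataFreshness" entry from the JSON above.
public sealed record Freshness(string Age, bool Stale);

public static class StalenessSketch
{
    public static Freshness Evaluate(DateTimeOffset factorTimestamp, DateTimeOffset now, int maxStalenessHours)
    {
        TimeSpan age = now - factorTimestamp;

        // Guard against clock skew: a factor stamped in the future has age zero.
        if (age < TimeSpan.Zero)
        {
            age = TimeSpan.Zero;
        }

        int wholeHours = (int)Math.Floor(age.TotalHours);

        // Assumption: stale means strictly older than the configured maximum.
        bool stale = age > TimeSpan.FromHours(maxStalenessHours);

        return new Freshness($"{wholeHours}h", stale);
    }
}
```

Passing `now` explicitly, instead of reading the system clock inside the function, keeps the result reproducible, in line with the determinism requirement in section 0. With a 168-hour limit, a factor that is 48 hours old yields `{ "age": "48h", "stale": false }`, matching the example.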
---
## Related Documentation
* Policy scoring: `../policy/architecture.md`
* Concelier feeds: `../concelier/architecture.md`
* Excititor VEX: `../excititor/architecture.md`
* Signals reachability: `../signals/architecture.md`