git.stella-ops.org/docs/modules/riskengine/architecture.md
StellaOps Bot e6c47c8f50 save progress
2025-12-28 23:49:56 +02:00


component_architecture_riskengine.md - Stella Ops RiskEngine (2025Q4)

Risk scoring runtime with pluggable providers and explainability.

Scope. Implementation-ready architecture for RiskEngine: the scoring runtime that computes Risk Scoring Profiles across deployments while preserving provenance and explainability. Covers scoring workers, providers, caching, and integration with Policy Engine.


0) Mission & boundaries

Mission. Compute deterministic, explainable risk scores for vulnerabilities by aggregating signals from multiple data sources (EPSS, CVSS, KEV, VEX, reachability). Produce audit trails and explainability payloads for every scoring decision.

Boundaries.

  • RiskEngine does not make PASS/FAIL decisions. It provides scores to the Policy Engine.
  • RiskEngine does not own vulnerability data. It consumes from Concelier, Excititor, and Signals.
  • Scoring is deterministic: same inputs produce identical scores.
  • Supports offline/air-gapped operation via factor bundles.

1) Solution & project layout

src/RiskEngine/StellaOps.RiskEngine/
 ├─ StellaOps.RiskEngine.Core/           # Scoring orchestrators, provider contracts
 │   ├─ Providers/
 │   │   ├─ IRiskScoreProvider.cs        # Provider interface
 │   │   ├─ EpssProvider.cs              # EPSS score provider
 │   │   ├─ CvssKevProvider.cs           # CVSS + KEV provider
 │   │   ├─ VexGateProvider.cs           # VEX status provider
 │   │   ├─ FixExposureProvider.cs       # Fix availability provider
 │   │   └─ DefaultTransformsProvider.cs # Score transformations
 │   ├─ Contracts/
 │   │   ├─ ScoreRequest.cs              # Scoring request DTO
 │   │   └─ RiskScoreResult.cs           # Scoring result with explanation
 │   └─ Services/
 │       ├─ RiskScoreWorker.cs           # Scoring job executor
 │       └─ RiskScoreQueue.cs            # Job queue management
 │
 ├─ StellaOps.RiskEngine.Infrastructure/ # Persistence, caching, connectors
 │   └─ Stores/
 │       └─ InMemoryRiskScoreResultStore.cs
 │
 ├─ StellaOps.RiskEngine.WebService/     # REST API for jobs and results
 │   └─ Program.cs
 │
 ├─ StellaOps.RiskEngine.Worker/         # Background scoring workers
 │   ├─ Program.cs
 │   └─ Worker.cs
 │
 └─ StellaOps.RiskEngine.Tests/          # Unit and integration tests

2) External dependencies

  • PostgreSQL - Score persistence and job state
  • Concelier - Vulnerability advisory data, EPSS scores
  • Excititor - VEX statements
  • Signals - Reachability and runtime signals
  • Policy Engine - Consumes risk scores for decision-making
  • Authority - Authentication and authorization
  • Valkey/Redis - Score caching (optional)

3) Contracts & data model

3.1 ScoreRequest

public sealed record ScoreRequest
{
    public required string VulnerabilityId { get; init; }   // CVE or vuln ID
    public required string ArtifactId { get; init; }        // PURL or component ID
    public string? TenantId { get; init; }
    public string? ContextId { get; init; }                 // Scan or assessment ID
    public IReadOnlyList<string>? EnabledProviders { get; init; }
}
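
Because scoring is deterministic and results may be cached, it helps to derive a stable key from a request. The sketch below is one way to do that. `ScoreRequestKeys.CacheKey` is a hypothetical helper, not part of the documented API, and the choice to leave `ContextId` out of the key is an assumption. Record equality alone is not enough here, since `EnabledProviders` is a list and lists compare by reference.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Copy of the ScoreRequest contract from section 3.1 so the sketch compiles on its own.
public sealed record ScoreRequest
{
    public required string VulnerabilityId { get; init; }
    public required string ArtifactId { get; init; }
    public string? TenantId { get; init; }
    public string? ContextId { get; init; }
    public IReadOnlyList<string>? EnabledProviders { get; init; }
}

// Hypothetical helper: not part of the documented RiskEngine API.
public static class ScoreRequestKeys
{
    public static string CacheKey(ScoreRequest request)
    {
        // Sort provider IDs with an ordinal comparer so the key does not
        // depend on list order or on the current culture.
        var providers = request.EnabledProviders is null
            ? "*"
            : string.Join(",", request.EnabledProviders.OrderBy(p => p, StringComparer.Ordinal));

        // ASSUMPTION: ContextId does not influence the score, so it is not part of the key.
        return string.Join("|",
            request.TenantId ?? "-",
            request.VulnerabilityId,
            request.ArtifactId,
            providers);
    }
}
```

Two requests that differ only in the order of `EnabledProviders` map to the same key. A production key would also need to escape the `|` separator if identifiers can contain it.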

3.2 RiskScoreResult

public sealed record RiskScoreResult
{
    public required string RequestId { get; init; }
    public required decimal FinalScore { get; init; }       // 0.0-10.0
    public required string Tier { get; init; }              // Critical/High/Medium/Low/Info
    public required DateTimeOffset ComputedAt { get; init; }
    public required IReadOnlyList<ProviderContribution> Contributions { get; init; }
    public required ExplainabilityPayload Explanation { get; init; }
}

public sealed record ProviderContribution
{
    public required string ProviderId { get; init; }
    public required decimal RawScore { get; init; }
    public required decimal Weight { get; init; }
    public required decimal WeightedScore { get; init; }
    public string? FactorSource { get; init; }              // Where data came from
    public DateTimeOffset? FactorTimestamp { get; init; }   // When factor was computed
}

3.3 Provider Interface

public interface IRiskScoreProvider
{
    string ProviderId { get; }
    decimal DefaultWeight { get; }
    TimeSpan CacheTtl { get; }

    Task<ProviderResult> ComputeAsync(
        ScoreRequest request,
        CancellationToken ct);

    Task<bool> IsHealthyAsync(CancellationToken ct);
}
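
A minimal provider against this interface might look like the sketch below. `ProviderResult` is not defined in this document, so its shape here is an assumption, and `InMemoryEpssProvider` is a hypothetical class for illustration (the real `EpssProvider` reads from Concelier). The weight and TTL mirror the `epss` values used elsewhere in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Trimmed copy of ScoreRequest (section 3.1); only the required fields.
public sealed record ScoreRequest
{
    public required string VulnerabilityId { get; init; }
    public required string ArtifactId { get; init; }
}

// ASSUMPTION: the document does not define ProviderResult; this shape is hypothetical.
public sealed record ProviderResult(decimal Score, string? FactorSource);

public interface IRiskScoreProvider
{
    string ProviderId { get; }
    decimal DefaultWeight { get; }
    TimeSpan CacheTtl { get; }
    Task<ProviderResult> ComputeAsync(ScoreRequest request, CancellationToken ct);
    Task<bool> IsHealthyAsync(CancellationToken ct);
}

// Hypothetical provider backed by an in-memory table of EPSS probabilities (0-1).
public sealed class InMemoryEpssProvider : IRiskScoreProvider
{
    private readonly IReadOnlyDictionary<string, decimal> _probabilities;

    public InMemoryEpssProvider(IReadOnlyDictionary<string, decimal> probabilities)
        => _probabilities = probabilities;

    public string ProviderId => "epss";
    public decimal DefaultWeight => 0.25m;
    public TimeSpan CacheTtl => TimeSpan.FromHours(1);

    public Task<ProviderResult> ComputeAsync(ScoreRequest request, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();

        // Unknown CVEs score 0 in this sketch; a real provider might report "no data" instead.
        var probability = _probabilities.TryGetValue(request.VulnerabilityId, out var p) ? p : 0m;

        // Map the 0-1 probability onto the 0-10 score scale.
        var score = Math.Clamp(probability, 0m, 1m) * 10m;
        return Task.FromResult(new ProviderResult(score, "in-memory"));
    }

    // An in-memory table has no external dependency, so it is always healthy.
    public Task<bool> IsHealthyAsync(CancellationToken ct) => Task.FromResult(true);
}
```

Keeping `ComputeAsync` free of wall-clock reads and random values is what lets the same request produce the same score on every run.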

4) Score Providers

4.1 Built-in Providers

Provider      Data Source     Weight  Description
epss          Concelier/EPSS  0.25    EPSS probability score (0-1 → 0-10)
cvss-kev      Concelier       0.30    CVSS base + KEV boost
vex-gate      Excititor       0.20    VEX status (affected/not_affected)
fix-exposure  Concelier       0.15    Fix availability window
reachability  Signals         0.10    Code path reachability

4.2 Score Computation

FinalScore = Σ(provider.weight × provider.score) / Σ(provider.weight)

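Tier mapping (lower bound inclusive, upper bound exclusive except at 10.0):
  9.0 ≤ score ≤ 10.0 → Critical
  7.0 ≤ score < 9.0  → High
  4.0 ≤ score < 7.0  → Medium
  1.0 ≤ score < 4.0  → Low
  0.0 ≤ score < 1.0  → Info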
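
The formula and tier mapping can be sketched as follows. `RiskAggregation` is a hypothetical name, and treating each tier's lower bound as inclusive (so that a value such as 8.95 falls into High) is an assumption about how fractional scores are bucketed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating section 4.2; not the engine's actual code.
public static class RiskAggregation
{
    // Weighted mean of provider scores: Σ(weight × score) / Σ(weight).
    public static decimal FinalScore(IEnumerable<(decimal Weight, decimal Score)> contributions)
    {
        var items = contributions.ToList();
        var totalWeight = items.Sum(c => c.Weight);
        if (totalWeight <= 0m)
        {
            throw new ArgumentException(
                "At least one provider with a positive weight is required.", nameof(contributions));
        }

        return items.Sum(c => c.Weight * c.Score) / totalWeight;
    }

    // ASSUMPTION: lower bounds are inclusive, so no score falls between two tiers.
    public static string Tier(decimal score) => score switch
    {
        >= 9.0m => "Critical",
        >= 7.0m => "High",
        >= 4.0m => "Medium",
        >= 1.0m => "Low",
        _ => "Info",
    };
}
```

With the default weights and illustrative provider scores of 4.2 (epss), 9.8 (cvss-kev), 10.0 (vex-gate), 5.0 (fix-exposure), and 5.0 (reachability), the weighted sum is 1.05 + 2.94 + 2.00 + 0.75 + 0.50 = 7.24. The weights sum to 1.00, so the final score is 7.24, which maps to High. Dividing by Σ(weight) also means that disabling a provider renormalises the remaining weights rather than lowering the score.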

4.3 Provider Data Sources

public interface IEpssSources
{
    Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct);
}

public interface ICvssKevSources
{
    Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct);
    Task<bool> IsKevAsync(string cveId, CancellationToken ct);
}
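
As an illustration of how a provider might combine these two lookups, the sketch below adds a KEV boost to the CVSS base score. `CvssData` is not defined in this document, so only a base score is modelled. Treating `KevBoost` as an additive bump capped at 10.0 is an assumption, since the document does not say whether the boost is additive or multiplicative. `FixedCvssKevSources` and `CvssKevScoring` are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// ASSUMPTION: CvssData is not defined in this document; only a base score is modelled.
public sealed record CvssData(decimal BaseScore);

// Copied from section 4.3.
public interface ICvssKevSources
{
    Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct);
    Task<bool> IsKevAsync(string cveId, CancellationToken ct);
}

// Hypothetical fixed-answer source, useful for fixture-based provider tests.
public sealed class FixedCvssKevSources : ICvssKevSources
{
    private readonly decimal? _baseScore;
    private readonly bool _isKev;

    public FixedCvssKevSources(decimal? baseScore, bool isKev)
    {
        _baseScore = baseScore;
        _isKev = isKev;
    }

    public Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct)
        => Task.FromResult<CvssData?>(_baseScore is decimal s ? new CvssData(s) : null);

    public Task<bool> IsKevAsync(string cveId, CancellationToken ct)
        => Task.FromResult(_isKev);
}

public static class CvssKevScoring
{
    public static async Task<decimal> ScoreAsync(
        ICvssKevSources sources, string cveId, decimal kevBoost, CancellationToken ct)
    {
        var cvss = await sources.GetCvssAsync(cveId, ct);
        var isKev = await sources.IsKevAsync(cveId, ct);

        // A missing CVSS record contributes a base score of 0 in this sketch.
        var baseScore = cvss?.BaseScore ?? 0m;

        // ASSUMPTION: the boost is added to the base score, then capped to the 0-10 scale.
        var boosted = isKev ? baseScore + kevBoost : baseScore;
        return Math.Clamp(boosted, 0m, 10m);
    }
}
```

Under these assumptions a KEV-listed CVE with a base score of 8.8 and a boost of 2.0 is capped at 10.0, while a non-KEV CVE keeps its base score.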

5) REST API (RiskEngine.WebService)

All under /api/v1/risk. Auth: OpTok.

POST /scores                    { request: ScoreRequest } → { jobId }
GET  /scores/{jobId}            → { result: RiskScoreResult, status }
GET  /scores/{jobId}/explain    → { explanation: ExplainabilityPayload }

POST /batch                     { requests: ScoreRequest[] } → { batchId }
GET  /batch/{batchId}           → { results: RiskScoreResult[], status }

GET  /providers                 → { providers: ProviderInfo[] }
GET  /providers/{id}/health     → { healthy: bool, lastCheck }

GET  /healthz | /readyz | /metrics
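
A client could build requests for the scoring endpoints roughly as follows. This is a sketch under assumptions: the document only states "Auth: OpTok", so sending the token as a `Bearer` authorization header is a guess, and `RiskApiRequests` is a hypothetical helper. The camelCase body follows the `{ request: ScoreRequest }` shape listed above.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical request builders; they create messages but do not send them.
public static class RiskApiRequests
{
    // Web defaults give camelCase property names.
    private static readonly JsonSerializerOptions Json = new(JsonSerializerDefaults.Web);

    // POST /api/v1/risk/scores with body { "request": { ... } }.
    public static HttpRequestMessage SubmitScore(
        Uri baseAddress, string opTok, string vulnerabilityId, string artifactId)
    {
        var body = JsonSerializer.Serialize(
            new { request = new { vulnerabilityId, artifactId } }, Json);

        var message = new HttpRequestMessage(
            HttpMethod.Post, new Uri(baseAddress, "/api/v1/risk/scores"))
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json"),
        };

        // ASSUMPTION: the OpTok travels as a Bearer token.
        message.Headers.Authorization = new AuthenticationHeaderValue("Bearer", opTok);
        return message;
    }

    // GET /api/v1/risk/scores/{jobId}.
    public static HttpRequestMessage GetScore(Uri baseAddress, string opTok, string jobId)
    {
        var message = new HttpRequestMessage(
            HttpMethod.Get,
            new Uri(baseAddress, "/api/v1/risk/scores/" + Uri.EscapeDataString(jobId)));

        message.Headers.Authorization = new AuthenticationHeaderValue("Bearer", opTok);
        return message;
    }
}
```

Scoring is asynchronous: the POST returns a `jobId`, and the caller polls the GET endpoint until `status` indicates completion. The messages can be passed to `HttpClient.SendAsync`.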

6) Configuration (YAML)

RiskEngine:
  Postgres:
    ConnectionString: "Host=postgres;Database=risk;..."

  Cache:
    Enabled: true
    Provider: "valkey"
    ConnectionString: "redis://valkey:6379"
    DefaultTtl: "00:15:00"

  Providers:
    Epss:
      Enabled: true
      Weight: 0.25
      CacheTtl: "01:00:00"
      Source: "concelier"

    CvssKev:
      Enabled: true
      Weight: 0.30
      KevBoost: 2.0

    VexGate:
      Enabled: true
      Weight: 0.20
      NotAffectedScore: 0.0
      AffectedScore: 10.0

    FixExposure:
      Enabled: true
      Weight: 0.15
      NoFixPenalty: 1.5

    Reachability:
      Enabled: true
      Weight: 0.10
      UnreachableDiscount: 0.5

  Worker:
    Concurrency: 4
    BatchSize: 100
    PollInterval: "00:00:05"

  Offline:
    FactorBundlePath: "/data/risk-factors"
    AllowStaleData: true
    MaxStalenessHours: 168
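
The duration values above use the `hh:mm:ss` form. A small validation layer for the `Worker` block could be sketched as below. `WorkerOptions` and `RiskEngineConfig` are hypothetical names, and the range checks are assumptions, not documented constraints.

```csharp
using System;
using System.Globalization;

// Hypothetical options type mirroring the Worker block of the YAML above.
public sealed record WorkerOptions(int Concurrency, int BatchSize, TimeSpan PollInterval);

public static class RiskEngineConfig
{
    // Parses the "hh:mm:ss" strings used for DefaultTtl, CacheTtl, and PollInterval.
    // This exact format rejects durations of 24 hours or more.
    public static TimeSpan ParseDuration(string value)
        => TimeSpan.ParseExact(value, @"hh\:mm\:ss", CultureInfo.InvariantCulture);

    // ASSUMPTION: all three worker settings must be strictly positive.
    public static WorkerOptions CreateWorkerOptions(int concurrency, int batchSize, string pollInterval)
    {
        if (concurrency < 1)
            throw new ArgumentOutOfRangeException(nameof(concurrency), "Must be at least 1.");
        if (batchSize < 1)
            throw new ArgumentOutOfRangeException(nameof(batchSize), "Must be at least 1.");

        var interval = ParseDuration(pollInterval);
        if (interval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(pollInterval), "Must be positive.");

        return new WorkerOptions(concurrency, batchSize, interval);
    }
}
```

Failing fast on invalid settings at startup keeps a misconfigured worker from polling in a tight loop or silently processing nothing.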

7) Security & compliance

  • AuthN/Z: Authority-issued OpToks with risk.score scope
  • Tenant isolation: Scores scoped by tenant ID
  • Audit trail: All scoring decisions logged with inputs and factors
  • No PII: Only vulnerability and artifact identifiers processed

8) Performance targets

  • Single score: < 100ms P95 (cached factors)
  • Batch scoring: < 500ms P95 for 100 items
  • Provider health check: < 1s timeout
  • Cache hit rate: > 80% for repeated CVEs

9) Observability

Metrics:

  • risk.scores.computed_total{tier,provider}
  • risk.scores.duration_seconds
  • risk.providers.health{provider,status}
  • risk.cache.hits_total / risk.cache.misses_total
  • risk.batch.size_histogram

Tracing: Spans for each provider contribution, cache operations, and aggregation.

Logs: Structured logs with cve_id, artifact_id, tenant, final_score.
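
Two of the metrics above could be emitted with the built-in `System.Diagnostics.Metrics` API, as in the sketch below. The meter name `StellaOps.RiskEngine` and the `RiskMetrics` class are assumptions for illustration. The instrument names and tag keys are taken from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical wrapper around the instruments listed in section 9.
public sealed class RiskMetrics : IDisposable
{
    // ASSUMPTION: the meter name is illustrative; the document does not specify one.
    private readonly Meter _meter = new("StellaOps.RiskEngine");
    private readonly Counter<long> _computed;
    private readonly Histogram<double> _duration;

    public RiskMetrics()
    {
        _computed = _meter.CreateCounter<long>("risk.scores.computed_total");
        _duration = _meter.CreateHistogram<double>("risk.scores.duration_seconds", unit: "s");
    }

    // Records one computed score with its tier and provider tags, plus its duration.
    public void RecordScore(string tier, string provider, double seconds)
    {
        _computed.Add(
            1,
            new KeyValuePair<string, object?>("tier", tier),
            new KeyValuePair<string, object?>("provider", provider));
        _duration.Record(seconds);
    }

    public void Dispose() => _meter.Dispose();
}
```

An exporter or a `MeterListener` subscribed to the meter receives these measurements. Keeping tag values to a small fixed set (tier names, provider IDs) avoids unbounded metric cardinality.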


10) Testing matrix

  • Provider tests: Each provider returns expected scores for fixture data
  • Aggregation tests: Weighted combination produces correct final score
  • Determinism tests: Same inputs produce identical scores
  • Cache tests: Cache hit/miss behavior correct
  • Offline tests: Factor bundles load and score correctly
  • Integration tests: Full scoring pipeline with mocked data sources

11) Offline/Air-Gap Support

Factor Bundles

Pre-computed factor data for offline operation:

/data/risk-factors/
  ├─ epss/
  │   └─ epss-2025-01-15.json.gz
  ├─ cvss/
  │   └─ cvss-2025-01-15.json.gz
  ├─ kev/
  │   └─ kev-2025-01-15.json
  └─ manifest.json
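
Loading a compressed factor file could look like the sketch below. The on-disk schema is not described in this document, so the sketch assumes a flat JSON object mapping CVE IDs to EPSS probabilities. `FactorBundles.LoadEpss` is a hypothetical helper, and it does not consult `manifest.json`.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text.Json;

// Hypothetical loader for a gzip-compressed EPSS factor file.
public static class FactorBundles
{
    // ASSUMPTION: the file holds a flat JSON object such as { "CVE-2024-0001": 0.42 }.
    public static Dictionary<string, decimal> LoadEpss(string path)
    {
        using var file = File.OpenRead(path);
        using var gzip = new GZipStream(file, CompressionMode.Decompress);

        // Deserialize straight from the decompression stream to avoid buffering the whole file as a string.
        return JsonSerializer.Deserialize<Dictionary<string, decimal>>(gzip)
               ?? new Dictionary<string, decimal>();
    }
}
```

A real loader would likely verify the file against the manifest before trusting it, which matters for air-gapped deployments where bundles arrive on removable media.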

Staleness Handling

When operating offline, scores include staleness indicators:

{
  "finalScore": 7.2,
  "dataFreshness": {
    "epss": { "age": "48h", "stale": false },
    "kev": { "age": "24h", "stale": false }
  }
}
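
One way to derive these indicators is sketched below. It assumes a factor is stale once its age exceeds `MaxStalenessHours` and that `age` is reported in whole hours. Both are assumptions, and `Freshness` and `Staleness` are hypothetical names. How `AllowStaleData` changes scoring when a factor is stale is not modelled here.

```csharp
using System;

// Hypothetical shape for one entry of the dataFreshness object.
public sealed record Freshness(string Age, bool Stale);

public static class Staleness
{
    // The caller supplies "now" so the result stays deterministic and testable.
    public static Freshness Evaluate(
        DateTimeOffset factorTimestamp, DateTimeOffset now, int maxStalenessHours)
    {
        var age = now - factorTimestamp;

        // Clock skew can make a factor look newer than "now"; treat that as age zero.
        if (age < TimeSpan.Zero)
        {
            age = TimeSpan.Zero;
        }

        var wholeHours = (long)Math.Floor(age.TotalHours);

        // ASSUMPTION: stale means strictly older than the configured maximum.
        var stale = age > TimeSpan.FromHours(maxStalenessHours);
        return new Freshness($"{wholeHours}h", stale);
    }
}
```

With `MaxStalenessHours: 168`, a factor that is 48 hours old is reported as `"48h"` and not stale, matching the EPSS entry in the example payload.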

Related documentation

  • Policy scoring: ../policy/architecture.md
  • Concelier feeds: ../concelier/architecture.md
  • Excititor VEX: ../excititor/architecture.md
  • Signals reachability: ../signals/architecture.md