StellaOps Bot e6c47c8f50 save progress

2025-12-28 23:49:56 +02:00

component_architecture_riskengine.md - Stella Ops RiskEngine (2025Q4)

Risk scoring runtime with pluggable providers and explainability.

Scope. Implementation-ready architecture for RiskEngine: the scoring runtime that computes Risk Scoring Profiles across deployments while preserving provenance and explainability. Covers scoring workers, providers, caching, and integration with Policy Engine.

0) Mission & boundaries

Mission. Compute deterministic, explainable risk scores for vulnerabilities by aggregating signals from multiple data sources (EPSS, CVSS, KEV, VEX, reachability). Produce audit trails and explainability payloads for every scoring decision.

Boundaries.

RiskEngine does not make PASS/FAIL decisions. It provides scores to the Policy Engine.
RiskEngine does not own vulnerability data. It consumes from Concelier, Excititor, and Signals.
Scoring is deterministic: same inputs produce identical scores.
RiskEngine supports offline/air-gapped operation through pre-computed factor bundles (see section 11).

1) Solution & project layout

src/RiskEngine/StellaOps.RiskEngine/
 ├─ StellaOps.RiskEngine.Core/           # Scoring orchestrators, provider contracts
 │   ├─ Providers/
 │   │   ├─ IRiskScoreProvider.cs        # Provider interface
 │   │   ├─ EpssProvider.cs              # EPSS score provider
 │   │   ├─ CvssKevProvider.cs           # CVSS + KEV provider
 │   │   ├─ VexGateProvider.cs           # VEX status provider
 │   │   ├─ FixExposureProvider.cs       # Fix availability provider
 │   │   └─ DefaultTransformsProvider.cs # Score transformations
 │   ├─ Contracts/
 │   │   ├─ ScoreRequest.cs              # Scoring request DTO
 │   │   └─ RiskScoreResult.cs           # Scoring result with explanation
 │   └─ Services/
 │       ├─ RiskScoreWorker.cs           # Scoring job executor
 │       └─ RiskScoreQueue.cs            # Job queue management
 │
 ├─ StellaOps.RiskEngine.Infrastructure/ # Persistence, caching, connectors
 │   └─ Stores/
 │       └─ InMemoryRiskScoreResultStore.cs
 │
 ├─ StellaOps.RiskEngine.WebService/     # REST API for jobs and results
 │   └─ Program.cs
 │
 ├─ StellaOps.RiskEngine.Worker/         # Background scoring workers
 │   ├─ Program.cs
 │   └─ Worker.cs
 │
 └─ StellaOps.RiskEngine.Tests/          # Unit and integration tests

2) External dependencies

PostgreSQL - Score persistence and job state
Concelier - Vulnerability advisory data, EPSS scores
Excititor - VEX statements
Signals - Reachability and runtime signals
Policy Engine - Consumes risk scores for decision-making
Authority - Authentication and authorization
Valkey/Redis - Score caching (optional)

3) Contracts & data model

3.1 ScoreRequest

public sealed record ScoreRequest
{
    public required string VulnerabilityId { get; init; }   // CVE or vuln ID
    public required string ArtifactId { get; init; }        // PURL or component ID
    public string? TenantId { get; init; }
    public string? ContextId { get; init; }                 // Scan or assessment ID
    public IReadOnlyList<string>? EnabledProviders { get; init; }
}

3.2 RiskScoreResult

public sealed record RiskScoreResult
{
    public required string RequestId { get; init; }
    public required decimal FinalScore { get; init; }       // 0.0-10.0
    public required string Tier { get; init; }              // Critical/High/Medium/Low/Info
    public required DateTimeOffset ComputedAt { get; init; }
    public required IReadOnlyList<ProviderContribution> Contributions { get; init; }
    public required ExplainabilityPayload Explanation { get; init; }
}

public sealed record ProviderContribution
{
    public required string ProviderId { get; init; }
    public required decimal RawScore { get; init; }
    public required decimal Weight { get; init; }
    public required decimal WeightedScore { get; init; }
    public string? FactorSource { get; init; }              // Where data came from
    public DateTimeOffset? FactorTimestamp { get; init; }   // When factor was computed
}

3.3 Provider Interface

public interface IRiskScoreProvider
{
    string ProviderId { get; }
    decimal DefaultWeight { get; }
    TimeSpan CacheTtl { get; }

    Task<ProviderResult> ComputeAsync(
        ScoreRequest request,
        CancellationToken ct);

    Task<bool> IsHealthyAsync(CancellationToken ct);
}

4) Score Providers

4.1 Built-in Providers

Provider	Data Source	Weight	Description
`epss`	Concelier/EPSS	0.25	EPSS probability score (0-1 → 0-10)
`cvss-kev`	Concelier	0.30	CVSS base + KEV boost
`vex-gate`	Excititor	0.20	VEX status (affected/not_affected)
`fix-exposure`	Concelier	0.15	Fix availability window
`reachability`	Signals	0.10	Code path reachability

4.2 Score Computation

FinalScore = Σ(provider.weight × provider.score) / Σ(provider.weight)

Tier mapping:
  9.0-10.0 → Critical
  7.0-8.9  → High
  4.0-6.9  → Medium
  1.0-3.9  → Low
  0.0-0.9  → Info
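As a worked illustration of the formula and tier bands above, the following is a minimal sketch, not the shipped aggregator. `RiskAggregationSketch` and its method names are hypothetical. It assumes that only participating providers enter both sums, and that each band's lower bound is an inclusive threshold, so a value such as 8.95 maps to High rather than falling between bands.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; only the formula and the tier bands come from this section.
public static class RiskAggregationSketch
{
    // FinalScore = Σ(weight × score) / Σ(weight), over the providers that contributed.
    public static decimal Aggregate(IReadOnlyList<(decimal Weight, decimal Score)> contributions)
    {
        decimal totalWeight = contributions.Sum(c => c.Weight);
        if (totalWeight == 0m)
        {
            throw new ArgumentException(
                "At least one provider with a non-zero weight is required.", nameof(contributions));
        }

        decimal weightedSum = contributions.Sum(c => c.Weight * c.Score);
        return weightedSum / totalWeight;
    }

    // Assumption: lower bounds are inclusive thresholds (>= 9.0 Critical, >= 7.0 High, ...).
    public static string ToTier(decimal score) => score switch
    {
        >= 9.0m => "Critical",
        >= 7.0m => "High",
        >= 4.0m => "Medium",
        >= 1.0m => "Low",
        _ => "Info",
    };
}
```

With the default weights from 4.1 and illustrative raw scores of 3.0 (epss), 9.0 (cvss-kev), 10.0 (vex-gate), 5.0 (fix-exposure) and 2.0 (reachability), the weighted sum is 0.75 + 2.70 + 2.00 + 0.75 + 0.20 = 6.40 over a total weight of 1.00, giving 6.4 (Medium). If reachability were left out, the same inputs would give 6.2 / 0.90 ≈ 6.89, still Medium.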

4.3 Provider Data Sources

public interface IEpssSources
{
    Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct);
}

public interface ICvssKevSources
{
    Task<CvssData?> GetCvssAsync(string cveId, CancellationToken ct);
    Task<bool> IsKevAsync(string cveId, CancellationToken ct);
}
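To show how a provider might sit on top of one of these source interfaces, here is a hedged sketch of the `epss` provider's 0-1 → 0-10 mapping. The shapes of `EpssScore` and `ProviderResult` are not defined in this document, so the ones below are assumptions. `EpssProviderSketch` and `FixedEpssSource` are hypothetical names, and the method takes the CVE ID directly instead of a full `ScoreRequest` to stay short.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Assumed shapes: this document names these types but does not define them.
public sealed record EpssScore(decimal Probability);   // EPSS probability in [0, 1]
public sealed record ProviderResult(string ProviderId, decimal RawScore, string? FactorSource);

// Repeated from 4.3 so the sketch compiles on its own.
public interface IEpssSources
{
    Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct);
}

public sealed class EpssProviderSketch
{
    private readonly IEpssSources _sources;

    public EpssProviderSketch(IEpssSources sources) => _sources = sources;

    public string ProviderId => "epss";

    public async Task<ProviderResult?> ComputeAsync(string vulnerabilityId, CancellationToken ct)
    {
        EpssScore? epss = await _sources.GetScoreAsync(vulnerabilityId, ct);
        if (epss is null)
        {
            return null; // Assumption: no EPSS data means no contribution from this provider.
        }

        // Rescale the 0-1 probability onto the 0-10 range, clamping defensively.
        decimal raw = Math.Clamp(epss.Probability, 0m, 1m) * 10m;
        return new ProviderResult(ProviderId, raw, "concelier");
    }
}

// In-memory stand-in for the Concelier-backed source, for illustration and tests only.
public sealed class FixedEpssSource : IEpssSources
{
    private readonly decimal? _probability;

    public FixedEpssSource(decimal? probability) => _probability = probability;

    public Task<EpssScore?> GetScoreAsync(string cveId, CancellationToken ct) =>
        Task.FromResult<EpssScore?>(_probability is { } p ? new EpssScore(p) : null);
}
```

Keeping the data source behind an interface lets the same provider logic run against a live feed or an offline factor bundle, which is what the determinism and offline requirements call for.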

5) REST API (RiskEngine.WebService)

All under /api/v1/risk. Auth: OpTok.

POST /scores                    { request: ScoreRequest } → { jobId }
GET  /scores/{jobId}            → { result: RiskScoreResult, status }
GET  /scores/{jobId}/explain    → { explanation: ExplainabilityPayload }

POST /batch                     { requests: ScoreRequest[] } → { batchId }
GET  /batch/{batchId}           → { results: RiskScoreResult[], status }

GET  /providers                 → { providers: ProviderInfo[] }
GET  /providers/{id}/health     → { healthy: bool, lastCheck }

GET  /healthz | /readyz | /metrics

6) Configuration (YAML)

RiskEngine:
  Postgres:
    ConnectionString: "Host=postgres;Database=risk;..."

  Cache:
    Enabled: true
    Provider: "valkey"
    ConnectionString: "redis://valkey:6379"
    DefaultTtl: "00:15:00"

  Providers:
    Epss:
      Enabled: true
      Weight: 0.25
      CacheTtl: "01:00:00"
      Source: "concelier"

    CvssKev:
      Enabled: true
      Weight: 0.30
      KevBoost: 2.0

    VexGate:
      Enabled: true
      Weight: 0.20
      NotAffectedScore: 0.0
      AffectedScore: 10.0

    FixExposure:
      Enabled: true
      Weight: 0.15
      NoFixPenalty: 1.5

    Reachability:
      Enabled: true
      Weight: 0.10
      UnreachableDiscount: 0.5

  Worker:
    Concurrency: 4
    BatchSize: 100
    PollInterval: "00:00:05"

  Offline:
    FactorBundlePath: "/data/risk-factors"
    AllowStaleData: true
    MaxStalenessHours: 168
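The `Cache.DefaultTtl` and per-provider `CacheTtl` values above imply time-bounded factor caching. The sketch below shows that behaviour with an in-process dictionary standing in for Valkey/Redis. `FactorCacheSketch` is a hypothetical name and the `provider:cve` key layout is an assumption, not the documented key scheme.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical in-process cache; a deployment would use the configured Valkey/Redis backend.
public sealed class FactorCacheSketch
{
    private readonly ConcurrentDictionary<string, (decimal Score, DateTimeOffset ExpiresAt)> _entries = new();
    private readonly Func<DateTimeOffset> _clock;

    // The clock is injected so expiry can be exercised without sleeping.
    public FactorCacheSketch(Func<DateTimeOffset> clock) => _clock = clock;

    // Assumed key layout: one entry per provider and CVE.
    public static string Key(string providerId, string cveId) => $"{providerId}:{cveId}";

    public void Set(string key, decimal score, TimeSpan ttl) =>
        _entries[key] = (score, _clock() + ttl);

    public bool TryGet(string key, out decimal score)
    {
        if (_entries.TryGetValue(key, out var entry) && entry.ExpiresAt > _clock())
        {
            score = entry.Score;
            return true;
        }

        _entries.TryRemove(key, out _); // Drop an expired entry (no-op when the key is absent).
        score = default;
        return false;
    }
}
```

TTL strings in the `hh:mm:ss` form used by the configuration can be read with `TimeSpan.Parse("01:00:00")`, which yields one hour.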

7) Security & compliance

AuthN/Z: Authority-issued OpToks with risk.score scope
Tenant isolation: Scores scoped by tenant ID
Audit trail: All scoring decisions logged with inputs and factors
No PII: Only vulnerability and artifact identifiers processed

8) Performance targets

Single score: < 100ms P95 (cached factors)
Batch scoring: < 500ms P95 for 100 items
Provider health check: < 1s timeout
Cache hit rate: > 80% for repeated CVEs

9) Observability

Metrics:

risk.scores.computed_total{tier,provider}
risk.scores.duration_seconds
risk.providers.health{provider,status}
risk.cache.hits_total / risk.cache.misses_total
risk.batch.size_histogram
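One way to emit the first two instruments is through `System.Diagnostics.Metrics`, sketched below. The meter name `StellaOps.RiskEngine` and the `RiskMetricsSketch` class are assumptions; only the instrument and tag names are taken from the list above.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class RiskMetricsSketch
{
    // Meter name is an assumption; instrument names follow the list in this section.
    private static readonly Meter RiskMeter = new("StellaOps.RiskEngine");

    private static readonly Counter<long> ScoresComputed =
        RiskMeter.CreateCounter<long>("risk.scores.computed_total");

    private static readonly Histogram<double> ScoreDuration =
        RiskMeter.CreateHistogram<double>("risk.scores.duration_seconds", unit: "s");

    // Records one computed score with its tier and provider tags, plus the elapsed time.
    public static void RecordScore(string tier, string provider, double seconds)
    {
        ScoresComputed.Add(1,
            new KeyValuePair<string, object?>("tier", tier),
            new KeyValuePair<string, object?>("provider", provider));
        ScoreDuration.Record(seconds);
    }
}
```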

Tracing: Spans for each provider contribution, cache operations, and aggregation.

Logs: Structured logs with cve_id, artifact_id, tenant, final_score.

10) Testing matrix

Provider tests: Each provider returns expected scores for fixture data
Aggregation tests: Weighted combination produces correct final score
Determinism tests: Same inputs produce identical scores
Cache tests: Cache hit/miss behavior correct
Offline tests: Factor bundles load and score correctly
Integration tests: Full scoring pipeline with mocked data sources

11) Offline/Air-Gap Support

Factor Bundles

Pre-computed factor data for offline operation:

/data/risk-factors/
  ├─ epss/
  │   └─ epss-2025-01-15.json.gz
  ├─ cvss/
  │   └─ cvss-2025-01-15.json.gz
  ├─ kev/
  │   └─ kev-2025-01-15.json
  └─ manifest.json
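The layout above suggests dated bundle files per factor kind. Since the format of `manifest.json` is not described here, the following sketch selects the newest bundle purely from the file-name date. `FactorBundleSketch` is a hypothetical helper, and the `<kind>-yyyy-MM-dd.json[.gz]` naming rule is inferred from the three example files only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class FactorBundleSketch
{
    // Matches names such as "epss-2025-01-15.json.gz" or "kev-2025-01-15.json".
    private static readonly Regex DatedName = new(
        @"^(?<kind>[a-z]+)-(?<date>\d{4}-\d{2}-\d{2})\.json(\.gz)?$",
        RegexOptions.CultureInvariant);

    // Returns the newest file name for the given factor kind, or null when none matches.
    public static string? SelectLatest(IEnumerable<string> fileNames, string kind)
    {
        string? best = null;
        DateTime bestDate = DateTime.MinValue;

        foreach (string name in fileNames)
        {
            Match match = DatedName.Match(name);
            if (!match.Success || match.Groups["kind"].Value != kind)
            {
                continue; // Different factor kind, or not a dated bundle (e.g. manifest.json).
            }

            if (!DateTime.TryParseExact(match.Groups["date"].Value, "yyyy-MM-dd",
                    CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime date))
            {
                continue; // Well-formed pattern but not a real calendar date.
            }

            if (best is null || date > bestDate)
            {
                best = name;
                bestDate = date;
            }
        }

        return best;
    }
}
```

Choosing by parsed date rather than by string order or file modification time keeps the selection reproducible across hosts, in line with the determinism requirement.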

Staleness Handling

When operating offline, scores include staleness indicators:

{
  "finalScore": 7.2,
  "dataFreshness": {
    "epss": { "age": "48h", "stale": false },
    "kev": { "age": "24h", "stale": false }
  }
}
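The `age` and `stale` fields above could be derived from a factor's timestamp and the `MaxStalenessHours` setting (168 in the sample configuration). This is a sketch under the assumption that a factor becomes stale once its age exceeds that limit. `Freshness` and `StalenessSketch` are hypothetical names, and the effect of `AllowStaleData` is not modelled.

```csharp
using System;

// Mirrors one entry of the "dataFreshness" object shown above.
public sealed record Freshness(string Age, bool Stale);

public static class StalenessSketch
{
    // Assumption: stale means strictly older than maxStalenessHours.
    public static Freshness Evaluate(DateTimeOffset factorTimestamp, DateTimeOffset now, int maxStalenessHours)
    {
        TimeSpan age = now - factorTimestamp;
        int wholeHours = (int)Math.Floor(age.TotalHours);
        bool stale = age > TimeSpan.FromHours(maxStalenessHours);
        return new Freshness($"{wholeHours}h", stale);
    }
}
```

For a factor computed 48 hours before evaluation with a limit of 168 hours, this yields an age of "48h" and `stale: false`, matching the EPSS entry in the example payload.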

12) Related documentation

Policy scoring: ../policy/architecture.md
Concelier feeds: ../concelier/architecture.md
Excititor VEX: ../excititor/architecture.md
Signals reachability: ../signals/architecture.md
