Policy Engine Architecture (v2)

Ownership: Policy Guild • Platform Guild
Services: StellaOps.Policy.Engine (Minimal API + worker host)
Data Stores: MongoDB (policies, policy_runs, effective_finding_*), Object storage (explain bundles), optional NATS/Mongo queue
Related docs: Policy overview, DSL, Lifecycle, Runs, REST API, Policy CLI, Architecture overview, AOC reference

This dossier describes the internal structure of the Policy Engine service delivered in Epic 2. It focuses on module boundaries, deterministic evaluation, orchestration, and integration contracts with Concelier, Excititor, SBOM Service, Authority, Scheduler, and Observability stacks.

The service operates strictly downstream of the Aggregation-Only Contract (AOC). It consumes immutable advisory_raw and vex_raw documents emitted by Concelier and Excititor, derives findings inside Policy-owned collections, and never mutates ingestion stores. Refer to the architecture overview and AOC reference for system-wide guardrails and provenance obligations.


1·Responsibilities & Constraints

  • Compile and evaluate stella-dsl@1 policy packs into deterministic verdicts.
  • Join SBOM inventory, Concelier advisories, and Excititor VEX evidence via canonical linksets and equivalence tables.
  • Materialise effective findings (effective_finding_{policyId}) with append-only history and produce explain traces.
  • Operate incrementally: react to change streams (advisory/VEX/SBOM deltas) with ≤ 5 min SLA.
  • Provide simulations with diff summaries for UI/CLI workflows without modifying state.
  • Enforce a strict determinism guard (no wall-clock, RNG, or network access beyond allow-listed services) and RBAC + tenancy via Authority scopes.
  • Support sealed/air-gapped deployments with offline bundles and sealed-mode hints.

Non-goals: policy authoring UI (handled by Console), ingestion or advisory normalisation (Concelier), VEX consensus (Excititor), runtime enforcement (Zastava).


2·High-Level Architecture

graph TD
    subgraph Clients
        CLI[stella CLI]
        UI[Console Policy Editor]
        CI[CI Pipelines]
    end
    subgraph PolicyEngine["StellaOps.Policy.Engine"]
        API[Minimal API Host]
        Orchestrator[Run Orchestrator]
        WorkerPool[Evaluation Workers]
        Compiler[DSL Compiler Cache]
        Materializer[Effective Findings Writer]
    end
    subgraph RawStores["Raw Stores (AOC)"]
        AdvisoryRaw[(MongoDB<br/>advisory_raw)]
        VexRaw[(MongoDB<br/>vex_raw)]
    end
    subgraph Derived["Derived Stores"]
        Mongo[(MongoDB<br/>policies / policy_runs / effective_finding_*)]
        Blob[(Object Store / Evidence Locker)]
        Queue[(Mongo Queue / NATS)]
    end
    Concelier[(Concelier APIs)]
    Excititor[(Excititor APIs)]
    SBOM[(SBOM Service)]
    Authority[(Authority / DPoP Gateway)]

    CLI --> API
    UI --> API
    CI --> API
    API --> Compiler
    API --> Orchestrator
    Orchestrator --> Queue
    Queue --> WorkerPool
    Concelier --> AdvisoryRaw
    Excititor --> VexRaw
    WorkerPool --> AdvisoryRaw
    WorkerPool --> VexRaw
    WorkerPool --> SBOM
    WorkerPool --> Materializer
    Materializer --> Mongo
    WorkerPool --> Blob
    API --> Mongo
    API --> Blob
    API --> Authority
    Orchestrator --> Mongo
    Authority --> API

Key notes:

  • API host exposes lifecycle, run, simulate, findings endpoints with DPoP-bound OAuth enforcement.
  • Orchestrator manages run scheduling/fairness; writes run tickets to queue, leases jobs to worker pool.
  • Workers evaluate policies using cached IR; join external services via tenant-scoped clients; pull immutable advisories/VEX from the raw stores; write derived overlays to Mongo and optional explain bundles to blob storage.
  • Observability (metrics/traces/logs) integrated via OpenTelemetry (not shown).

2.1·AOC inputs & immutability

  • Raw-only reads. Evaluation workers access advisory_raw / vex_raw via tenant-scoped Mongo clients or the Concelier/Excititor raw APIs. No Policy Engine component is permitted to mutate these collections.
  • Guarded ingestion. AOCWriteGuard rejects forbidden fields before data reaches the raw stores. Policy tests replay known ERR_AOC_00x violations to confirm ingestion compliance.
  • Change streams as contract. Run orchestration stores resumable cursors for raw change streams. Replays of these cursors (e.g., after failover) must yield identical materialisation outcomes.
  • Derived stores only. All severity, consensus, and suppression state lives in effective_finding_* collections and explain bundles owned by Policy Engine. Provenance fields link back to raw document IDs so auditors can trace every verdict.
  • Authority scopes. Only the Policy Engine service identity holds effective:write. Ingestion identities retain advisory:*/vex:* scopes, ensuring separation of duties enforced by Authority and the API Gateway.
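
The raw-only read rule can be illustrated with a small facade. This is a minimal sketch, not the service's actual client: `ReadOnlyRawCollection`, `RawStoreWriteError`, and the in-memory document list are hypothetical names introduced here, standing in for a tenant-scoped Mongo client.

```python
from types import MappingProxyType
from typing import Any, Iterable, Iterator, Mapping


class RawStoreWriteError(PermissionError):
    """Raised when Policy Engine code attempts to mutate a raw store."""


class ReadOnlyRawCollection:
    """Hypothetical tenant-scoped, read-only view over advisory_raw or vex_raw."""

    def __init__(self, name: str, tenant: str, documents: Iterable[Mapping[str, Any]]):
        self.name = name
        self.tenant = tenant
        # Keep only this tenant's documents; the proxy is a shallow freeze,
        # so top-level fields cannot be reassigned through this view.
        self._docs = tuple(
            MappingProxyType(dict(doc)) for doc in documents if doc.get("tenant") == tenant
        )

    def find(self, **filters: Any) -> Iterator[Mapping[str, Any]]:
        """Yield documents whose fields equal every filter value."""
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in filters.items()):
                yield doc

    def _reject(self, *args: Any, **kwargs: Any) -> None:
        raise RawStoreWriteError(f"{self.name} is read-only for Policy Engine")

    # Every mutating entry point funnels into the same rejection.
    insert_one = update_one = replace_one = delete_one = _reject
```

Routing every mutating method to one rejection keeps the "no writes" property easy to verify in a test, which is the kind of check the read-only client verification in the compliance checklist implies.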

3·Module Breakdown

| Module | Responsibility | Notes |
|---|---|---|
| Configuration (Configuration/) | Bind settings (Mongo URIs, queue options, service URLs, sealed mode), validate on start. | Strict schema; fails fast on missing secrets. |
| Authority Client (Authority/) | Acquire tokens, enforce scopes, perform DPoP key rotation. | Only service identity uses effective:write. |
| DSL Compiler (Dsl/) | Parse, canonicalise, IR generation, checksum caching. | Uses Roslyn-like pipeline; caches by policyId+version+hash. |
| Selection Layer (Selection/) | Batch SBOM ↔ advisory ↔ VEX joiners; apply equivalence tables; support incremental cursors. | Deterministic ordering (SBOM → advisory → VEX). |
| Evaluator (Evaluation/) | Execute IR with first-match semantics, compute severity/trust/reachability weights, record rule hits. | Stateless; all inputs provided by selection layer. |
| Materialiser (Materialization/) | Upsert effective findings, append history, manage explain bundle exports. | Mongo transactions per SBOM chunk. |
| Orchestrator (Runs/) | Change-stream ingestion, fairness, retry/backoff, queue writer. | Works with Scheduler Models DTOs. |
| API (Api/) | Minimal API endpoints, DTO validation, problem responses, idempotency. | Generated clients for CLI/UI. |
| Observability (Telemetry/) | Metrics (policy_run_seconds, rules_fired_total), traces, structured logs. | Sampled rule-hit logs with redaction. |
| Offline Adapter (Offline/) | Bundle export/import (policies, simulations, runs), sealed-mode enforcement. | Uses DSSE signing via Signer service. |
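
The DSL Compiler row mentions caching by policyId+version+hash. The sketch below shows one way such a cache key could behave. It is illustrative only: `CompilerCache`, `CompiledPolicy`, and the toy `canonicalise` function are hypothetical, and SHA-256 is an assumed hash choice that the document does not specify.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CompiledPolicy:
    policy_id: str
    version: int
    digest: str
    ir: tuple  # placeholder for the real intermediate representation


def canonicalise(source: str) -> str:
    """Toy canonical form: strip trailing whitespace and drop blank lines."""
    lines = [line.rstrip() for line in source.splitlines()]
    return "\n".join(line for line in lines if line)


class CompilerCache:
    """Hypothetical cache keyed by (policyId, version, digest of canonical source)."""

    def __init__(self, compile_fn: Callable[[str], tuple]):
        self._compile_fn = compile_fn
        self._entries: Dict[Tuple[str, int, str], CompiledPolicy] = {}
        self.compilations = 0  # counts cache misses

    def load(self, policy_id: str, version: int, source: str) -> CompiledPolicy:
        canonical = canonicalise(source)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        key = (policy_id, version, digest)
        cached = self._entries.get(key)
        if cached is None:
            # Miss: compile once and remember the result under the full key.
            self.compilations += 1
            cached = CompiledPolicy(policy_id, version, digest, self._compile_fn(canonical))
            self._entries[key] = cached
        return cached
```

Hashing the canonical form rather than the raw text means whitespace-only edits reuse the cached IR, while any semantic change produces a new digest and a fresh compilation.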

4·Data Model & Persistence

4.1 Collections

  • policies: policy versions, metadata, lifecycle states, simulation artefact references.
  • policy_runs: run records, inputs (cursors, env), stats, determinism hash, run status.
  • policy_run_events: append-only log (queued, leased, completed, failed, canceled, replay).
  • effective_finding_{policyId}: current verdict snapshot per finding.
  • effective_finding_{policyId}_history: append-only history (previous verdicts, timestamps, runId).
  • policy_reviews: review comments/decisions.

4.2 Schema Highlights

  • Run records include changeDigests (hash of advisory/VEX inputs) for replay verification.
  • Effective findings store provenance references (advisory_raw_ids, vex_raw_ids, sbom_component_id).
  • All collections include tenant, policyId, version, createdAt, updatedAt, traceId for audit.
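
As an illustration of how a changeDigests entry might be derived, the sketch below hashes a set of raw documents so that the result does not depend on arrival order. The function name `change_digest`, the canonical-JSON encoding, and SHA-256 are all assumptions made for this example; the document only states that the digest is a hash of the advisory/VEX inputs.

```python
import hashlib
import json
from typing import Any, Iterable, Mapping


def change_digest(documents: Iterable[Mapping[str, Any]]) -> str:
    """Hash a set of raw documents independently of the order they arrive in."""
    # Canonical JSON: sorted keys and no insignificant whitespace.
    encoded = sorted(
        json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for doc in documents
    )
    hasher = hashlib.sha256()
    for item in encoded:
        data = item.encode("utf-8")
        # Length-prefix each document so concatenation is unambiguous.
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()
```

A replay can recompute the digest from the documents referenced by the stored cursors and compare it with the recorded value before trusting the run's outputs.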

4.3 Indexing

  • Compound indexes: {tenant, policyId, status} on policies; {tenant, policyId, status, startedAt} on policy_runs; {policyId, sbomId, findingKey} on findings.
  • TTL indexes on transient explain bundle references (configurable).

5·Evaluation Pipeline

sequenceDiagram
    autonumber
    participant Worker as EvaluationWorker
    participant Compiler as CompilerCache
    participant Selector as SelectionLayer
    participant Eval as Evaluator
    participant Mat as Materialiser
    participant Expl as ExplainStore

    Worker->>Compiler: Load IR (policyId, version, digest)
    Compiler-->>Worker: CompiledPolicy (cached or compiled)
    Worker->>Selector: Fetch tuple batches (sbom, advisory, vex)
    Selector-->>Worker: Deterministic batches (1024 tuples)
    loop For each batch
        Worker->>Eval: Execute rules (batch, env)
        Eval-->>Worker: Verdicts + rule hits
        Worker->>Mat: Upsert effective findings
        Mat-->>Worker: Success
        Worker->>Expl: Persist sampled explain traces (optional)
    end
    Worker->>Mat: Append history + run stats
    Worker-->>Worker: Compute determinism hash
    Worker->>+Mat: Finalize transaction
    Mat-->>Worker: Ack

Determinism guard instrumentation wraps the evaluator, rejecting access to forbidden APIs and ensuring batch ordering remains stable.
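
The batch loop above can be sketched in a few lines. This is a simplified model under stated assumptions, not the engine's implementation: `Rule`, `JoinTuple`, `deterministic_batches`, `evaluate`, and `run` are hypothetical names, tuples are reduced to three string IDs, and the determinism hash is assumed to be a SHA-256 over verdicts and rule hits in batch order.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

# (sbom_component_id, advisory_id, vex_id); an empty vex_id means "no VEX statement".
JoinTuple = Tuple[str, str, str]


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[JoinTuple], bool]
    verdict: str


def deterministic_batches(tuples: Iterable[JoinTuple], size: int = 1024) -> Iterator[List[JoinTuple]]:
    """Sort tuples (SBOM, then advisory, then VEX) and slice into fixed-size batches."""
    ordered = sorted(tuples)
    for start in range(0, len(ordered), size):
        yield ordered[start:start + size]


def evaluate(rules: Sequence[Rule], item: JoinTuple, default: str = "unknown") -> Tuple[str, str]:
    """First-match semantics: the first rule whose predicate holds decides the verdict."""
    for rule in rules:
        if rule.matches(item):
            return rule.verdict, rule.name
    return default, ""


def run(rules: Sequence[Rule], tuples: Iterable[JoinTuple], size: int = 1024) -> Tuple[Dict[JoinTuple, str], str]:
    """Evaluate every batch and fold verdicts plus rule hits into a determinism hash."""
    findings: Dict[JoinTuple, str] = {}
    hasher = hashlib.sha256()
    for batch in deterministic_batches(tuples, size):
        for item in batch:
            verdict, rule_name = evaluate(rules, item)
            findings[item] = verdict
            hasher.update(json.dumps([list(item), verdict, rule_name]).encode("utf-8"))
            hasher.update(b"\n")
    return findings, hasher.hexdigest()
```

Because the input is sorted before batching, two runs over the same tuples in different arrival orders produce the same findings and the same hash, which is the property golden tests can compare.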


6·Run Orchestration & Incremental Flow

  • Change streams: Concelier and Excititor publish document changes to the scheduler queue (policy.trigger.delta). Payload includes tenant, source, linkset digests, cursor.
  • Orchestrator: Maintains a per-tenant backlog; merges deltas until time/size thresholds are met, then enqueues a PolicyRunRequest.
  • Queue: Mongo queue with lease; each job assigned leaseDuration, maxAttempts.
  • Workers: Lease jobs, execute evaluation pipeline, report status (success/failure/canceled). Failures with recoverable errors requeue with backoff; determinism or schema violations mark job failed and raise incident event.
  • Fairness: Round-robin per {tenant, policyId}; emergency jobs (priority=emergency) jump the queue but are limited by a circuit breaker.
  • Replay: On demand, orchestrator rehydrates run via stored cursors and exports sealed bundle for audit/CI determinism checks.
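
The fairness rule can be modelled as one lane per {tenant, policyId} plus a capped emergency lane. The sketch below is a guess at the shape of such a scheduler, not the orchestrator's code: `FairQueue`, the in-memory lanes, and the "at most N consecutive emergency jobs while normal work is waiting" circuit-breaker rule are assumptions. Only the PolicyRunRequest name and priority=emergency come from the text.

```python
from collections import OrderedDict, deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple


@dataclass(frozen=True)
class PolicyRunRequest:
    tenant: str
    policy_id: str
    run_id: str
    priority: str = "normal"


class FairQueue:
    """Hypothetical round-robin queue with a capped emergency lane."""

    def __init__(self, max_consecutive_emergency: int = 2):
        self._lanes: "OrderedDict[Tuple[str, str], Deque[PolicyRunRequest]]" = OrderedDict()
        self._emergency: Deque[PolicyRunRequest] = deque()
        self._max_emergency = max_consecutive_emergency
        self._emergency_streak = 0

    def enqueue(self, request: PolicyRunRequest) -> None:
        if request.priority == "emergency":
            self._emergency.append(request)
            return
        key = (request.tenant, request.policy_id)
        self._lanes.setdefault(key, deque()).append(request)

    def lease(self) -> Optional[PolicyRunRequest]:
        # Assumed breaker rule: after N emergency jobs in a row, serve one
        # normal job (if any is waiting) before the next emergency job.
        breaker_open = self._emergency_streak >= self._max_emergency and bool(self._lanes)
        if self._emergency and not breaker_open:
            self._emergency_streak += 1
            return self._emergency.popleft()
        self._emergency_streak = 0
        if not self._lanes:
            return None
        # Take the lane at the front and rotate it to the back if work remains.
        key, lane = self._lanes.popitem(last=False)
        request = lane.popleft()
        if lane:
            self._lanes[key] = lane
        return request
```

Rotating a lane to the back after each lease prevents a tenant with a deep backlog from starving others, and the streak counter stops emergency traffic from doing the same.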

7·Security & Tenancy

  • Auth: All API calls pass through Authority gateway; DPoP tokens enforced for service-to-service (Policy Engine service principal). CLI/UI tokens include scope claims.
  • Scopes: Mutations require policy:* scopes corresponding to action; effective:write restricted to service identity.
  • Tenancy: All queries filter by tenant. Service identity uses tenant-global for shared policies; cross-tenant reads prohibited unless policy:tenant-admin scope present.
  • Secrets: Configuration loaded via environment variables or sealed secrets; runtime avoids writing secrets to logs.
  • Determinism guard: Static analyzer prevents referencing forbidden namespaces; runtime guard intercepts DateTime.Now, Random, Guid, HTTP clients beyond allow-list.
  • Sealed mode: Global flag disables outbound network except allow-listed internal hosts; watchers fail fast if unexpected egress attempted.
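
The runtime half of the determinism guard can be approximated as follows. The service itself targets .NET APIs such as DateTime.Now; this Python sketch only mirrors the idea with standard-library equivalents, and `determinism_guard` and `DeterminismViolation` are hypothetical names.

```python
import contextlib
import random
import time
import uuid
from unittest import mock


class DeterminismViolation(RuntimeError):
    """Raised when guarded code reaches for a non-deterministic source."""


def _forbidden(name: str):
    def _raise(*args, **kwargs):
        raise DeterminismViolation(f"{name} is not allowed during policy evaluation")
    return _raise


@contextlib.contextmanager
def determinism_guard():
    """Temporarily replace wall-clock, RNG, and UUID entry points with raisers.

    Limitation: only lookups through the module attribute are intercepted;
    references captured earlier (e.g. `from time import time`) bypass the guard,
    which is why a static analyzer is still needed alongside it.
    """
    targets = ((time, "time"), (random, "random"), (uuid, "uuid4"))
    with contextlib.ExitStack() as stack:
        for module, attr in targets:
            label = f"{module.__name__}.{attr}"
            stack.enter_context(mock.patch.object(module, attr, _forbidden(label)))
        yield
```

Evaluation code would then receive any timestamp it needs as an explicit input (for example through the run's env) instead of reading the clock.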

8·Observability

  • Metrics:
    • policy_run_seconds{mode,tenant,policy} (histogram)
    • policy_run_queue_depth{tenant}
    • policy_rules_fired_total{policy,rule}
    • policy_vex_overrides_total{policy,vendor}
  • Logs: Structured JSON with traceId, policyId, version, runId, tenant, phase. Guard ensures no sensitive data leakage.
  • Traces: Spans policy.select, policy.evaluate, policy.materialize, policy.simulate. Trace IDs surfaced to CLI/UI.
  • Incident mode toggles 100% sampling and extended retention windows.
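
A structured log line carrying the fields listed above might be assembled as in this sketch. The `format_log` helper and the `SENSITIVE_KEYS` list are assumptions for illustration; the document names the required fields but not the redaction rules.

```python
import json
from typing import Any, Mapping

# Correlation fields named in the Logs bullet above.
REQUIRED_FIELDS = ("traceId", "policyId", "version", "runId", "tenant", "phase")
# Assumed redaction list; the real guard's rules are not described in the text.
SENSITIVE_KEYS = frozenset({"token", "secret", "password", "authorization"})


def format_log(message: str, context: Mapping[str, Any], **extra: Any) -> str:
    """Build one JSON log line, refusing incomplete context and redacting secrets."""
    missing = [field for field in REQUIRED_FIELDS if field not in context]
    if missing:
        raise ValueError(f"missing log context fields: {missing}")
    record = {"message": message}
    record.update({field: context[field] for field in REQUIRED_FIELDS})
    for key, value in extra.items():
        if key in record:
            raise ValueError(f"extra field {key!r} would overwrite a reserved field")
        record[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
    return json.dumps(record, sort_keys=True)
```

Rejecting records with missing correlation fields at the call site keeps every line joinable with its trace and run.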

9·Offline / Bundle Integration

  • Imports: Offline Kit delivers policy packs, advisory/VEX snapshots, SBOM updates. Policy Engine ingests bundles via offline import.
  • Exports: stella policy bundle export packages policy, IR digest, simulations, run metadata; UI provides export triggers.
  • Sealed hints: Explain traces annotate when cached values used (EPSS, KEV). Run records mark env.sealed=true.
  • Sync cadence: Operators perform monthly bundle sync; Policy Engine warns when snapshots are older than the configured staleness threshold (default 14 days).
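
The staleness warning reduces to a simple age comparison. In this sketch `staleness_warning` is a hypothetical helper; only the 14-day default comes from the text. The current time is passed in explicitly so the check stays deterministic.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_MAX_STALENESS = timedelta(days=14)


def staleness_warning(snapshot_time: datetime,
                      now: datetime,
                      max_staleness: timedelta = DEFAULT_MAX_STALENESS) -> Optional[str]:
    """Return a warning message when the snapshot is older than the threshold."""
    age = now - snapshot_time
    if age > max_staleness:
        return (f"snapshot is {age.days} days old; "
                f"exceeds configured staleness of {max_staleness.days} days")
    return None


# Example: a snapshot taken 30 days before the evaluation time triggers a warning.
example_now = datetime(2025, 10, 26, tzinfo=timezone.utc)
example_warning = staleness_warning(example_now - timedelta(days=30), example_now)
```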

10·Testing & Quality

  • Unit tests: DSL parsing, evaluator semantics, guard enforcement.
  • Integration tests: Joiners with sample SBOM/advisory/VEX data; materialisation with deterministic ordering; API contract tests generated from OpenAPI.
  • Property tests: Ensure rule evaluation deterministic across permutations.
  • Golden tests: Replay recorded runs, compare determinism hash.
  • Performance tests: Evaluate 100k component / 1M advisory dataset under warmed caches (< 30 s full run).
  • Chaos hooks: Optional toggles to simulate upstream latency/failures; used in staging.
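
The property test for permutation determinism can be sketched with the standard library alone. The helper `assert_permutation_invariant` and both toy evaluators are hypothetical; a real suite would exercise the actual evaluator, likely through a property-testing framework.

```python
import itertools
from typing import Callable, List, Sequence, Tuple

Evidence = Tuple[str, str]  # (finding_key, status); hypothetical shape


def assert_permutation_invariant(fn: Callable[[List[Evidence]], object],
                                 inputs: Sequence[Evidence],
                                 max_permutations: int = 720) -> object:
    """Call fn on permutations of inputs and fail if any result differs from the first."""
    baseline = fn(list(inputs))
    for perm in itertools.islice(itertools.permutations(inputs), max_permutations):
        result = fn(list(perm))
        if result != baseline:
            raise AssertionError(f"order-dependent result for {perm!r}")
    return baseline


def stable_verdicts(evidence: List[Evidence]) -> dict:
    """Toy evaluator: sort first, then let the first entry per finding decide."""
    verdicts: dict = {}
    for key, status in sorted(evidence):
        verdicts.setdefault(key, status)
    return verdicts


def unstable_verdicts(evidence: List[Evidence]) -> dict:
    """Counter-example: same logic without sorting, so arrival order leaks in."""
    verdicts: dict = {}
    for key, status in evidence:
        verdicts.setdefault(key, status)
    return verdicts
```

Running the helper against `unstable_verdicts` raises an AssertionError, which shows the check does detect an evaluator whose output depends on input order.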

11·Compliance Checklist

  • Determinism guard enforced: Static analyzer + runtime guard block wall-clock, RNG, unauthorized network calls.
  • Incremental correctness: Change-stream cursors stored and replayed during tests; unit/integration coverage for dedupe.
  • RBAC validated: Endpoint scope requirements match Authority configuration; integration tests cover deny/allow.
  • AOC separation enforced: No code path writes to advisory_raw / vex_raw; integration tests capture ERR_AOC_00x handling; read-only clients verified.
  • Effective findings ownership: Only Policy Engine identity holds effective:write; unauthorized callers receive ERR_AOC_006.
  • Observability wired: Metrics/traces/logs exported with correlation IDs; dashboards include aoc_violation_total and ingest latency panels.
  • Offline parity: Sealed-mode tests executed; bundle import/export flows documented and validated.
  • Schema docs synced: DTOs match Scheduler Models (SCHED-MODELS-20-001); JSON schemas committed.
  • Security reviews complete: Threat model (including queue poisoning, determinism bypass, data exfiltration) documented; mitigations in place.
  • Disaster recovery rehearsed: Run replay and rollback procedures tested and recorded.

Last updated: 2025-10-26 (Sprint 19).