
Policy Runs & Orchestration

Audience: Policy Engine operators, Scheduler team, DevOps, and tooling engineers planning CI integrations.
Scope: Run modes (full, incremental, simulate), orchestration pipeline, cursor management, replay/determinism guarantees, monitoring, and recovery procedures.

Policies only generate value when they execute deterministically against current SBOM, advisory, and VEX inputs. This guide explains how runs are triggered, how the orchestrator scopes work, and what artefacts you should expect at each stage.


1·Run Modes at a Glance

| Mode | Trigger sources | Scope | Persistence | Primary use |
|------|-----------------|-------|-------------|-------------|
| Full | Manual CLI (stella policy run), Console “Run now”, scheduled nightly job | Entire tenant (all registered SBOMs) | Writes effective_finding_{policyId} and policy_runs record | Baseline after policy approval, quarterly attestation, post-incident rechecks |
| Incremental | Change streams (Concelier advisories, Excititor VEX, SBOM imports), orchestrator cron | Only affected (sbom, advisory) tuples | Writes diffs to effective findings and run record | Continuous upkeep meeting ≤5min SLA from input change |
| Simulate | Console review workspace, CLI (stella policy simulate), CI pipeline | Selected SBOM sample set (provided or golden set) | No materialisation; captures diff summary + explain traces | Authoring validation, regression safeguards, sealed-mode rehearsals |

All modes record their status in policy_runs with deterministic metadata:

{
  "_id": "run:P-7:2025-10-26T14:05:11Z:3f9a",
  "policy_id": "P-7",
  "policy_version": 4,
  "mode": "incremental",
  "status": "succeeded",     // queued | running | succeeded | failed | canceled | replay_pending
  "inputs": {
    "sbom_set": ["sbom:S-42","sbom:S-318"],
    "advisory_cursor": "2025-10-26T13:59:00Z",
    "vex_cursor": "2025-10-26T13:58:30Z",
    "env": {"exposure":"internet"}
  },
  "stats": {
    "components": 1742,
    "rules_fired": 68023,
    "findings_written": 4321,
    "vex_overrides": 210
  },
  "determinism_hash": "sha256:…",
  "started_at": "2025-10-26T14:05:11Z",
  "finished_at": "2025-10-26T14:06:01Z",
  "tenant": "default"
}

Schemas & samples: see src/StellaOps.Scheduler.Models/docs/SCHED-MODELS-20-001-POLICY-RUNS.md and the fixtures in samples/api/scheduler/policy-*.json for canonical payloads consumed by CLI/UI/worker integrations.
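
As a rough illustration of how tooling might consume such a record, the sketch below checks the status against the values listed in the sample's inline comment and derives the run duration. The helper names (`parse_ts`, `summarize_run`) and the summary shape are assumptions made for this example, not part of the Scheduler Models contract.

```python
from datetime import datetime

# Status values taken from the inline comment in the sample record.
KNOWN_STATUSES = {
    "queued", "running", "succeeded", "failed", "canceled", "replay_pending",
}


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp that ends in 'Z' (UTC)."""
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat accepts it on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_run(record: dict) -> dict:
    """Hypothetical helper: validate the status and compute the duration."""
    status = record["status"]
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    duration = parse_ts(record["finished_at"]) - parse_ts(record["started_at"])
    return {
        "run_id": record["_id"],
        "mode": record["mode"],
        "status": status,
        "duration_seconds": duration.total_seconds(),
        "sbom_count": len(record["inputs"]["sbom_set"]),
    }


# Subset of the sample record shown above.
sample = {
    "_id": "run:P-7:2025-10-26T14:05:11Z:3f9a",
    "mode": "incremental",
    "status": "succeeded",
    "inputs": {"sbom_set": ["sbom:S-42", "sbom:S-318"]},
    "started_at": "2025-10-26T14:05:11Z",
    "finished_at": "2025-10-26T14:06:01Z",
}
summary = summarize_run(sample)
```

For the sample record, the derived duration is 50 seconds, well inside the incremental SLA described later in this guide.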


2·Pipeline Overview

sequenceDiagram
    autonumber
    participant Trigger as Trigger (CLI / Console / Change Stream)
    participant Orchestrator as Policy Orchestrator
    participant Queue as Scheduler Queue (Mongo/NATS)
    participant Engine as Policy Engine Workers
    participant Concelier as Concelier Service
    participant Excititor as Excititor Service
    participant SBOM as SBOM Service
    participant Store as Mongo (policy_runs & effective_finding_*)
    participant Observability as Metrics/Events

    Trigger->>Orchestrator: Run request (mode, scope, env)
    Orchestrator->>Queue: Enqueue PolicyRunRequest (idempotent key)
    Queue->>Engine: Lease job (fairness window)
    Engine->>Concelier: Fetch advisories + linksets (cursor-aware)
    Engine->>Excititor: Fetch VEX statements (cursor-aware)
    Engine->>SBOM: Fetch SBOM segments / BOM-Index
    Engine->>Engine: Evaluate policy (deterministic batches)
    Engine->>Store: Upsert effective findings + append history
    Engine->>Store: Persist policy_runs record + determinism hash
    Engine->>Observability: Emit metrics, traces, rule-hit logs
    Engine->>Orchestrator: Ack completion / failure
    Orchestrator->>Trigger: Notify (webhook, CLI, Console update)

  • Trigger: CLI, Console, or automated change stream publishes a PolicyRunRequest.
  • Orchestrator: Runs inside StellaOps.Policy.Engine worker host; applies fairness (tenant + policy quotas) and idempotency using run keys.
  • Queue: Backed by Mongo + optional NATS for fan-out; supports leases and replay on crash.
  • Engine: Stateless worker executing the deterministic evaluator.
  • Store: Mongo collections: policy_runs, effective_finding_{policyId}, policy_run_events (append-only history), optional object storage for explain traces.
  • Observability: Prometheus metrics (policy_run_seconds), OTLP traces, structured logs.
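
The idempotent enqueue step can be sketched as follows. This is a minimal in-memory illustration: the way the run key is derived (a SHA-256 over tenant, policy, version, mode, and sorted scope) and the `InMemoryQueue` class are assumptions for the example, since this guide does not specify the actual key format or queue API.

```python
import hashlib
import json
from collections import OrderedDict


def run_key(tenant, policy_id, policy_version, mode, sbom_ids):
    """Hypothetical run key: hash of the fields that define 'the same run'."""
    payload = json.dumps(
        {
            "tenant": tenant,
            "policy_id": policy_id,
            "policy_version": policy_version,
            "mode": mode,
            # Sort the scope so that list order does not change the key.
            "sbom_ids": sorted(sbom_ids),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryQueue:
    """Stand-in for the scheduler queue; keeps one job per run key."""

    def __init__(self):
        self._jobs = OrderedDict()

    def enqueue(self, key, request) -> bool:
        # A repeated key is treated as a duplicate request and ignored.
        if key in self._jobs:
            return False
        self._jobs[key] = request
        return True

    def __len__(self):
        return len(self._jobs)


queue = InMemoryQueue()
key = run_key("default", "P-7", 4, "incremental", ["sbom:S-42", "sbom:S-318"])
first = queue.enqueue(key, {"mode": "incremental"})
second = queue.enqueue(key, {"mode": "incremental"})  # duplicate trigger
```

With this shape, a trigger that fires twice for the same scope results in a single queued job.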

3·Input Scoping & Cursors

3.1 Advisory & VEX Cursors

  • Each run records the latest Concelier change stream timestamp (advisory_cursor) and Excititor timestamp (vex_cursor).
  • Incremental runs receive change batches (feedId, lastOffset); orchestrator deduplicates using change_digest.
  • Full runs set both cursors to the current read time, which effectively resets the incremental baseline.
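
Deduplication by change_digest can be sketched as below. The batch dictionaries reuse the field names mentioned above (feedId, lastOffset, change_digest); how the digest itself is computed is not covered here, so the sketch treats it as an opaque string, and `dedupe_batches` is a hypothetical helper.

```python
def dedupe_batches(batches, seen=None):
    """Return batches whose change_digest has not been seen before.

    `seen` may be passed in to carry state across calls; it is updated
    in place. Input order is preserved.
    """
    seen = set() if seen is None else seen
    unique = []
    for batch in batches:
        digest = batch["change_digest"]
        if digest in seen:
            continue  # same change delivered twice; skip it
        seen.add(digest)
        unique.append(batch)
    return unique


incoming = [
    {"feedId": "feed-a", "lastOffset": 10, "change_digest": "d1"},
    {"feedId": "feed-a", "lastOffset": 11, "change_digest": "d2"},
    {"feedId": "feed-a", "lastOffset": 10, "change_digest": "d1"},  # redelivery
]
seen_digests = set()
fresh = dedupe_batches(incoming, seen_digests)
```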

3.2 SBOM Selection

  • Full runs enumerate all SBOM records declared active for the tenant.
  • Incremental runs derive the SBOM set by intersecting advisory/VEX changes with BOM-Index lookups (component → SBOM mapping).
  • Simulations accept explicit SBOM list; if omitted, CLI uses etc/policy/golden-sboms.json.
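
The incremental selection rule amounts to a lookup from changed components to the SBOMs that contain them. A minimal sketch, assuming the BOM-Index can be represented as a plain dictionary (the real index structure and the component identifiers below are illustrative only):

```python
def affected_sboms(changed_components, bom_index):
    """Map changed components to the SBOMs that include them.

    `bom_index` is a hypothetical in-memory form of the BOM-Index:
    component identifier -> set of SBOM identifiers.
    """
    affected = set()
    for component in changed_components:
        # Components that appear in no SBOM contribute nothing.
        affected.update(bom_index.get(component, ()))
    # Sort so that the resulting scope is stable between runs.
    return sorted(affected)


bom_index = {
    "pkg:npm/lodash": {"sbom:S-42", "sbom:S-318"},
    "pkg:npm/express": {"sbom:S-42"},
    "pkg:pypi/requests": {"sbom:S-7"},
}
scope = affected_sboms(["pkg:npm/lodash", "pkg:npm/unknown"], bom_index)
```

Returning a sorted list keeps the scope deterministic, which matters because the SBOM set is recorded in policy_runs.inputs.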

3.3 Environment Metadata

  • The env block (free-form key/value pairs) allows scenario-specific evaluation (e.g., env.exposure=internet).
  • Stored verbatim in policy_runs.inputs.env for replay; orchestrator hashes environment data to avoid cache collisions.
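
One way to hash environment data so that equivalent blocks produce the same digest is to serialise them canonically first. The sketch below assumes canonical JSON (sorted keys, compact separators) and SHA-256; the orchestrator's actual canonicalisation and hash choice are not specified in this guide, and `env_digest` is a hypothetical name.

```python
import hashlib
import json


def env_digest(env: dict) -> str:
    """Hash an env block independently of key order."""
    canonical = json.dumps(
        env,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = env_digest({"exposure": "internet", "region": "eu"})
b = env_digest({"region": "eu", "exposure": "internet"})
c = env_digest({"exposure": "internal", "region": "eu"})
```

Here `a` and `b` are identical while `c` differs, so cached results for one scenario are not reused for another.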

4·Execution Semantics

  1. Preparation: Worker loads compiled IR for target policy version (cached by digest).
  2. Batching: Candidate tuples are grouped by SBOM, then by advisory to maintain deterministic order; page size defaults to 1024 tuples.
  3. Evaluation: Rules execute with first-match semantics; results captured as PolicyVerdict.
  4. Materialisation:
    • Upserts into effective_finding_{policyId} using {policyId, sbomId, findingKey}.
    • Previous versions stored in effective_finding_{policyId}_history.
  5. Explain storage: Full explain trees stored in blob store when captureExplain=true; incremental runs keep sampled traces (configurable).
  6. Completion: Worker writes final status, stats, determinism hash (combination of policy digest + ordered input digests), and emits policy.run.completed event.
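
The batching and evaluation steps above can be sketched as follows. This is an illustrative model only: the tuple shape `(sbom_id, advisory_id)`, the rule representation (name, predicate, verdict), and the fallback verdict are assumptions, and the real evaluator works on compiled IR rather than Python callables.

```python
from itertools import islice

PAGE_SIZE = 1024  # default page size stated in step 2


def ordered_batches(tuples, page_size=PAGE_SIZE):
    """Yield pages of (sbom_id, advisory_id) tuples in a stable order."""
    # Sorting by SBOM first, then advisory, gives the same page
    # boundaries for the same input regardless of arrival order.
    iterator = iter(sorted(tuples))
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


def evaluate(candidate, rules, default="no_match"):
    """First-match semantics: the first rule whose predicate holds wins."""
    for name, predicate, verdict in rules:
        if predicate(candidate):
            return {"rule": name, "verdict": verdict}
    return {"rule": None, "verdict": default}


# Hypothetical rules; both match an internet-exposed critical finding,
# but only the first one listed decides the verdict.
rules = [
    ("block-critical", lambda c: c["severity"] == "critical", "block"),
    ("warn-exposed", lambda c: c["exposure"] == "internet", "warn"),
]
verdict = evaluate({"severity": "critical", "exposure": "internet"}, rules)

candidates = [(f"sbom:S-{i % 5}", f"ADV-{i}") for i in range(2500)]
page_sizes = [len(page) for page in ordered_batches(candidates)]
```

With 2500 candidate tuples and the default page size, the sketch produces pages of 1024, 1024, and 452 tuples.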

5·Retry, Replay & Determinism

  • Retries: Failures (network, validation) mark the run status=failed and enqueue a retry with exponential backoff, capped at 3 attempts. A manual re-run via CLI resets the counters.
  • Replay:
    • Use policy_runs record to assemble input snapshot (policy version, cursors, env).
    • Fetch associated SBOM/advisory/VEX data via stella policy replay --run <id> which rehydrates data into a sealed bundle.
    • Determinism hash mismatches between replay and recorded run indicate drift; CI job DEVOPS-POLICY-20-003 compares successive runs to guard this.
  • Cancellation: Manual stella policy run cancel <runId> or orchestrator TTL triggers status=canceled; partial changes roll back via history append (no destructive delete).
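
The drift check during replay can be illustrated with a small sketch. Section 4 describes the determinism hash as a combination of the policy digest and ordered input digests; the exact encoding is not given, so the version below (SHA-256 over the policy digest followed by the sorted input digests, newline-separated) is an assumption chosen only to show the idea.

```python
import hashlib


def determinism_hash(policy_digest: str, input_digests) -> str:
    """Hypothetical combination of policy digest + ordered input digests."""
    hasher = hashlib.sha256()
    hasher.update(policy_digest.encode("utf-8"))
    # Sorting makes the result independent of the order in which
    # inputs were fetched.
    for digest in sorted(input_digests):
        hasher.update(b"\n")
        hasher.update(digest.encode("utf-8"))
    return "sha256:" + hasher.hexdigest()


def has_drifted(recorded_hash: str, replayed_hash: str) -> bool:
    """A mismatch between recorded and replayed hashes indicates drift."""
    return recorded_hash != replayed_hash


recorded = determinism_hash("policy:abc", ["sbom:1", "adv:9", "vex:3"])
replayed = determinism_hash("policy:abc", ["vex:3", "sbom:1", "adv:9"])
changed = determinism_hash("policy:abc", ["vex:3", "sbom:1", "adv:10"])
```

In this example the replay with reordered but identical inputs matches the recorded hash, while the replay with one changed advisory digest does not.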

6·Trigger Sources & Scheduling

| Source | Description | SLAs |
|--------|-------------|------|
| Nightly full run | Default schedule per tenant; ensures baseline alignment. | Finish before 07:00 UTC |
| Change stream | Concelier (advisory_raw), Excititor (vex_raw), SBOM imports emit policy.trigger.delta events. | Start within 60s; complete within 5min |
| Manual CLI/Console | Operators run ad-hoc evaluations. | No SLA; warns if warm path > target |
| CI | stella policy simulate runs in pipelines referencing golden SBOMs. | Must complete under 10min to avoid pipeline timeout |

The orchestrator enforces max concurrency per tenant (maxActiveRuns), queue depth alarms, and fairness (round-robin per policy).
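
A round-robin pick under a concurrency cap can be sketched like this. The function name, the per-policy dictionary of pending runs, and the alphabetical starting order are assumptions for illustration; only the maxActiveRuns limit and the round-robin-per-policy idea come from the text above.

```python
from collections import deque


def pick_next_runs(pending_by_policy, active_count, max_active_runs):
    """Pick runs round-robin across policies until the cap is reached."""
    queues = {
        policy: deque(runs)
        for policy, runs in pending_by_policy.items()
        if runs
    }
    # Fixed starting order keeps the choice reproducible.
    order = deque(sorted(queues))
    free_slots = max(0, max_active_runs - active_count)
    picked = []
    while order and len(picked) < free_slots:
        policy = order.popleft()
        picked.append(queues[policy].popleft())
        if queues[policy]:
            # Policy still has work: send it to the back of the line.
            order.append(policy)
    return picked


pending = {"P-1": ["a1", "a2", "a3"], "P-2": ["b1"]}
chosen = pick_next_runs(pending, active_count=0, max_active_runs=3)
```

Here the busy policy P-1 does not take all three slots; P-2 gets its single run scheduled in the second position.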


7·Monitoring & Alerts

  • Metrics: policy_run_seconds, policy_run_queue_depth, policy_run_failures_total, policy_run_incremental_backlog, policy_rules_fired_total.
  • Dashboards: Highlight pending approvals, incremental backlog age, top failing policies, VEX override ratios (tie-in with /docs/observability/policy.md once published).
  • Alerts:
    • Incremental backlog > 3 cycles.
    • Determinism hash mismatch.
    • Failure rate > 5% over rolling hour.
    • Run duration > SLA (full > 30min, incremental > 5min).
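
The alert conditions above can be expressed as a small evaluation function. This is a sketch for reasoning about thresholds, not a replacement for real alerting rules: the run dictionary shape, the `hash_mismatch` flag, and the alert names are assumptions.

```python
# Thresholds taken from the alert list above.
SLA_SECONDS = {"full": 30 * 60, "incremental": 5 * 60}
MAX_BACKLOG_CYCLES = 3
MAX_FAILURE_RATE = 0.05


def evaluate_alerts(runs_last_hour, backlog_cycles):
    """Return the names of alerts that would fire for the given window."""
    alerts = []
    if backlog_cycles > MAX_BACKLOG_CYCLES:
        alerts.append("incremental_backlog")
    if any(run.get("hash_mismatch") for run in runs_last_hour):
        alerts.append("determinism_hash_mismatch")
    if runs_last_hour:
        failed = sum(1 for run in runs_last_hour if run["status"] == "failed")
        if failed / len(runs_last_hour) > MAX_FAILURE_RATE:
            alerts.append("failure_rate")
    for run in runs_last_hour:
        limit = SLA_SECONDS.get(run["mode"])  # simulate runs have no limit here
        if limit is not None and run["duration_seconds"] > limit:
            alerts.append(f"sla_breach:{run['id']}")
    return alerts


runs = [
    {"id": "r1", "mode": "incremental", "status": "succeeded", "duration_seconds": 50},
    {"id": "r2", "mode": "full", "status": "failed", "duration_seconds": 1900},
]
fired = evaluate_alerts(runs, backlog_cycles=4)
```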

8·Failure Handling & Rollback

  • Soft failures: Worker retries; after final failure, orchestrator emits policy.run.failed with diagnostics and recommended actions (e.g., missing SBOM segment).
  • Hard failures: A schema mismatch or determinism guard violation (ERR_POL_004) blocks further runs until resolved.
  • Rollback: Operators can activate the previous policy version (see Lifecycle guide) and schedule a full run to restore prior state.
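
The soft-failure path (retry with exponential backoff, capped at 3 attempts, then a policy.run.failed event) can be sketched as below. The base delay of one second, the result dictionary, and the injectable `sleep` parameter are assumptions made so the example is testable; the guide only states the backoff style and the attempt cap.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `job`, retrying with exponential backoff on any exception."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
            return {"status": "succeeded", "result": result, "attempts": attempt}
        except Exception as exc:  # sketch only: real code would be narrower
            last_error = exc
            if attempt < max_attempts:
                # Delays double each time: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # All attempts used: report the failure event named in the text.
    return {
        "status": "failed",
        "event": "policy.run.failed",
        "error": str(last_error),
        "attempts": max_attempts,
    }


def always_fails():
    raise RuntimeError("missing SBOM segment")


delays = []
outcome = run_with_retries(always_fails, sleep=delays.append)
```

Passing `delays.append` in place of `time.sleep` records the waits instead of blocking, which makes the schedule easy to inspect: two waits (1 s, then 2 s) separate the three attempts.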

9·Offline / Sealed Mode

  • Change streams originate from offline bundle imports; orchestrator processes delta manifests.
  • Runs execute with sealed=true, blocking any external lookups; policy_runs.inputs.env.sealed is set for auditing.
  • Explain traces annotate cached data usage to prompt bundle refresh.
  • Offline Kit exports include latest policy_runs snapshot and determinism hashes for evidence lockers.

10·Compliance Checklist

  • Run schemas validated: PolicyRunRequest / PolicyRunStatus DTOs from Scheduler Models (SCHED-MODELS-20-001) serialise deterministically; schema samples up to date.
  • Cursor integrity: Incremental runs persist advisory & VEX cursors; replay verifies identical input digests.
  • Queue fairness configured: Tenant-level concurrency limits and lease timeouts applied; no starvation of lower-volume policies.
  • Determinism guard active: CI replay job (DEVOPS-POLICY-20-003) green; determinism hash recorded on each run.
  • Observability wired: Metrics exported, alerts configured, and run events flowing to Notifier/Timeline.
  • Offline tested: stella policy run --sealed executed in air-gapped environment; explain traces flag cached evidence usage.
  • Recovery plan rehearsed: Failure and rollback drill documented; incident checklist aligned with Lifecycle guide.

Last updated: 2025-10-26 (Sprint 20).