Policy Runs & Orchestration
Audience: Policy Engine operators, Scheduler team, DevOps, and tooling engineers planning CI integrations.
Scope: Run modes (full, incremental, simulate), orchestration pipeline, cursor management, replay/determinism guarantees, monitoring, and recovery procedures.
Policies only generate value when they execute deterministically against current SBOM, advisory, and VEX inputs. This guide explains how runs are triggered, how the orchestrator scopes work, and what artefacts you should expect at each stage.
1 · Run Modes at a Glance
| Mode | Trigger sources | Scope | Persistence | Primary use |
|---|---|---|---|---|
| Full | Manual CLI (stella policy run), Console “Run now”, scheduled nightly job | Entire tenant (all registered SBOMs) | Writes effective_finding_{policyId} and policy_runs record | Baseline after policy approval, quarterly attestation, post-incident rechecks |
| Incremental | Change streams (Concelier advisories, Excititor VEX, SBOM imports), orchestrator cron | Only affected (sbom, advisory) tuples | Writes diffs to effective findings and run record | Continuous upkeep meeting ≤ 5 min SLA from input change |
| Simulate | Console review workspace, CLI (stella policy simulate), CI pipeline | Selected SBOM sample set (provided or golden set) | No materialisation; captures diff summary + explain traces | Authoring validation, regression safeguards, sealed-mode rehearsals |
All modes record their status in policy_runs with deterministic metadata:
{
  "_id": "run:P-7:2025-10-26T14:05:11Z:3f9a",
  "policy_id": "P-7",
  "policy_version": 4,
  "mode": "incremental",
  "status": "succeeded", // queued | running | succeeded | failed | canceled | replay_pending
  "inputs": {
    "sbom_set": ["sbom:S-42", "sbom:S-318"],
    "advisory_cursor": "2025-10-26T13:59:00Z",
    "vex_cursor": "2025-10-26T13:58:30Z",
    "env": {"exposure": "internet"}
  },
  "stats": {
    "components": 1742,
    "rules_fired": 68023,
    "findings_written": 4321,
    "vex_overrides": 210
  },
  "determinism_hash": "sha256:…",
  "started_at": "2025-10-26T14:05:11Z",
  "finished_at": "2025-10-26T14:06:01Z",
  "tenant": "default"
}
Schemas & samples: see src/StellaOps.Scheduler.Models/docs/SCHED-MODELS-20-001-POLICY-RUNS.md and the fixtures in samples/api/scheduler/policy-*.json for canonical payloads consumed by CLI/UI/worker integrations.
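As a rough illustration of consuming a policy_runs record, the sketch below checks the status value and computes the run duration. It assumes only the field names shown in the sample record; the helper names (parse_utc, run_duration_seconds) are hypothetical and not part of the Scheduler Models.

```python
from datetime import datetime

# Status values taken from the comment in the sample record.
KNOWN_STATUSES = {
    "queued", "running", "succeeded", "failed", "canceled", "replay_pending",
}


def parse_utc(ts: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def run_duration_seconds(record: dict) -> float:
    """Wall-clock duration of a finished run (hypothetical helper)."""
    if record["status"] not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {record['status']}")
    started = parse_utc(record["started_at"])
    finished = parse_utc(record["finished_at"])
    return (finished - started).total_seconds()


record = {
    "status": "succeeded",
    "started_at": "2025-10-26T14:05:11Z",
    "finished_at": "2025-10-26T14:06:01Z",
}
duration = run_duration_seconds(record)  # 50 seconds for the sample record
```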
2 · Pipeline Overview
sequenceDiagram
autonumber
participant Trigger as Trigger (CLI / Console / Change Stream)
participant Orchestrator as Policy Orchestrator
participant Queue as Scheduler Queue (Mongo/NATS)
participant Engine as Policy Engine Workers
participant Concelier as Concelier Service
participant Excititor as Excititor Service
participant SBOM as SBOM Service
participant Store as Mongo (policy_runs & effective_finding_*)
participant Observability as Metrics/Events
Trigger->>Orchestrator: Run request (mode, scope, env)
Orchestrator->>Queue: Enqueue PolicyRunRequest (idempotent key)
Queue->>Engine: Lease job (fairness window)
Engine->>Concelier: Fetch advisories + linksets (cursor-aware)
Engine->>Excititor: Fetch VEX statements (cursor-aware)
Engine->>SBOM: Fetch SBOM segments / BOM-Index
Engine->>Engine: Evaluate policy (deterministic batches)
Engine->>Store: Upsert effective findings + append history
Engine->>Store: Persist policy_runs record + determinism hash
Engine->>Observability: Emit metrics, traces, rule-hit logs
Engine->>Orchestrator: Ack completion / failure
Orchestrator->>Trigger: Notify (webhook, CLI, Console update)
- Trigger – CLI, Console, or automated change stream publishes a PolicyRunRequest.
- Orchestrator – Runs inside StellaOps.Policy.Engine worker host; applies fairness (tenant + policy quotas) and idempotency using run keys.
- Queue – Backed by Mongo + optional NATS for fan-out; supports leases and replay on crash.
- Engine – Stateless worker executing the deterministic evaluator.
- Store – Mongo collections: policy_runs, effective_finding_{policyId}, policy_run_events (append-only history), optional object storage for explain traces.
- Observability – Prometheus metrics (policy_run_seconds), OTLP traces, structured logs.
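The idempotent enqueue step can be pictured with a small in-memory sketch. How the orchestrator actually derives its run keys is not specified here, so the key derivation below (a SHA-256 over canonical JSON of the request) and the InMemoryQueue class are assumptions for illustration only.

```python
import hashlib
import json


def run_key(tenant: str, policy_id: str, policy_version: int,
            mode: str, inputs: dict) -> str:
    """Derive a stable key from request contents (assumed derivation)."""
    canonical = json.dumps(
        {
            "tenant": tenant,
            "policy_id": policy_id,
            "policy_version": policy_version,
            "mode": mode,
            "inputs": inputs,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryQueue:
    """Stand-in for the scheduler queue; not the real Mongo/NATS backend."""

    def __init__(self) -> None:
        self._jobs: dict = {}

    def enqueue(self, key: str, request: dict) -> bool:
        # A repeated key is a no-op, so a re-sent trigger creates no new job.
        if key in self._jobs:
            return False
        self._jobs[key] = request
        return True

    def __len__(self) -> int:
        return len(self._jobs)
```

Because the key depends only on the request contents, a trigger that is retried after a timeout maps to the same job instead of producing a duplicate run.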
3 · Input Scoping & Cursors
3.1 Advisory & VEX Cursors
- Each run records the latest Concelier change stream timestamp (advisory_cursor) and Excititor timestamp (vex_cursor).
- Incremental runs receive change batches (feedId, lastOffset); orchestrator deduplicates using change_digest.
- Full runs set cursors to “current read time”, effectively resetting incremental baseline.
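Deduplication by change_digest could look roughly like the following. The batch shape (a dict carrying feedId, lastOffset, and change_digest) is assumed from the field names above, and dedupe_batches is a hypothetical helper, not the orchestrator's code.

```python
def dedupe_batches(batches: list, seen_digests: set) -> list:
    """Return only batches whose change_digest has not been seen before.

    seen_digests is updated in place so later calls skip replays.
    """
    fresh = []
    for batch in batches:
        digest = batch["change_digest"]
        if digest in seen_digests:
            continue  # already processed, e.g. a redelivered change event
        seen_digests.add(digest)
        fresh.append(batch)
    return fresh


seen: set = set()
incoming = [
    {"feedId": "feed-a", "lastOffset": 10, "change_digest": "d1"},
    {"feedId": "feed-a", "lastOffset": 10, "change_digest": "d1"},  # duplicate
    {"feedId": "feed-b", "lastOffset": 3, "change_digest": "d2"},
]
unique = dedupe_batches(incoming, seen)
```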
3.2 SBOM Selection
- Full runs enumerate all SBOM records declared active for the tenant.
- Incremental runs derive SBOM set by intersecting advisory/VEX changes with BOM-Index lookups (component → SBOM mapping).
- Simulations accept explicit SBOM list; if omitted, CLI uses etc/policy/golden-sboms.json.
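The incremental selection amounts to a lookup from changed components to the SBOMs that contain them. The sketch below assumes the BOM-Index can be viewed as a plain mapping of component identifier to SBOM identifiers; the real index format and the affected_sboms helper are illustrative assumptions.

```python
def affected_sboms(changed_components: list, bom_index: dict) -> list:
    """Collect SBOM ids touched by changed components, in sorted order."""
    affected = set()
    for component in changed_components:
        # Components absent from the index affect no SBOM.
        affected.update(bom_index.get(component, ()))
    return sorted(affected)


# Hypothetical component identifiers and index contents.
bom_index = {
    "pkg:npm/lodash": ["sbom:S-42", "sbom:S-318"],
    "pkg:npm/express": ["sbom:S-42"],
    "pkg:pypi/requests": ["sbom:S-7"],
}
scope = affected_sboms(["pkg:npm/lodash", "pkg:npm/unknown"], bom_index)
```

Returning a sorted list keeps the derived scope stable across runs regardless of the order in which changes arrive.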
3.3 Environment Metadata
- env block (free-form key/values) allows scenario-specific evaluation (e.g., env.exposure=internet).
- Stored verbatim in policy_runs.inputs.env for replay; orchestrator hashes environment data to avoid cache collisions.
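One way to hash free-form environment data so that key order does not matter is to serialise it canonically first. The exact canonicalisation used by the orchestrator is not documented here; the sorted-key JSON form and the env_hash name below are assumptions.

```python
import hashlib
import json


def env_hash(env: dict) -> str:
    """Hash env key/values independent of insertion order (assumed scheme)."""
    canonical = json.dumps(
        env, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = env_hash({"exposure": "internet", "sealed": True})
b = env_hash({"sealed": True, "exposure": "internet"})
c = env_hash({"exposure": "internal", "sealed": True})
# a == b (same data, different key order); a != c (different exposure)
```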
4 · Execution Semantics
- Preparation: Worker loads compiled IR for target policy version (cached by digest).
- Batching: Candidate tuples are grouped by SBOM, then by advisory to maintain deterministic order; page size defaults to 1024 tuples.
- Evaluation: Rules execute with first-match semantics; results captured as PolicyVerdict.
- Materialisation:
  - Upserts into effective_finding_{policyId} using {policyId, sbomId, findingKey}.
  - Previous versions stored in effective_finding_{policyId}_history.
- Explain storage: Full explain trees stored in blob store when captureExplain=true; incremental runs keep sampled traces (configurable).
- Completion: Worker writes final status, stats, determinism hash (combination of policy digest + ordered input digests), and emits policy.run.completed event.
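The batching and evaluation steps can be sketched as follows. This is a simplified model, not the engine's evaluator: tuples are plain (sbom_id, advisory_id) pairs, rules are (name, predicate, verdict) triples, and the function names are hypothetical.

```python
from itertools import islice


def ordered_batches(tuples, page_size=1024):
    """Yield pages of (sbom_id, advisory_id) pairs in a stable order.

    Sorting the pairs orders them by SBOM first, then by advisory.
    """
    iterator = iter(sorted(tuples))
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


def first_match(rules, candidate):
    """Return (rule name, verdict) of the first rule that matches."""
    for name, predicate, verdict in rules:
        if predicate(candidate):
            return name, verdict
    return None, "no_match"


# Illustrative rules; later rules never run once an earlier one matches.
rules = [
    ("block_internet_critical",
     lambda c: c["severity"] == "critical" and c["exposure"] == "internet",
     "blocked"),
    ("warn_critical", lambda c: c["severity"] == "critical", "warn"),
]
```

Because the input is sorted before paging, two workers given the same tuple set produce identical pages in identical order, which is what makes the downstream results comparable across runs.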
5 · Retry, Replay & Determinism
- Retries: Failures (network, validation) mark run status=failed and enqueue retry with exponential backoff capped at 3 attempts. Manual re-run via CLI resets counters.
- Replay:
  - Use policy_runs record to assemble input snapshot (policy version, cursors, env).
  - Fetch associated SBOM/advisory/VEX data via stella policy replay --run <id> which rehydrates data into a sealed bundle.
  - Determinism hash mismatches between replay and recorded run indicate drift; CI job DEVOPS-POLICY-20-003 compares successive runs to guard this.
- Cancellation: Manual stella policy run cancel <runId> or orchestrator TTL triggers status=canceled; partial changes roll back via history append (no destructive delete).
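The retry schedule and the drift check can be illustrated with a short sketch. The base delay of 2 seconds, the way digests are combined, and the reading of “ordered” as lexicographically sorted are all assumptions; only the 3-attempt cap and the “policy digest + ordered input digests” composition come from the text above.

```python
import hashlib


def backoff_delays(base_seconds: float = 2.0, max_attempts: int = 3) -> list:
    """Delays before each retry, doubling each time, capped at max_attempts."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_attempts)]


def determinism_hash(policy_digest: str, input_digests: list) -> str:
    """Combine policy digest with input digests in sorted order (assumed)."""
    hasher = hashlib.sha256()
    hasher.update(policy_digest.encode("utf-8"))
    for digest in sorted(input_digests):
        hasher.update(b"\n")  # separator avoids ambiguous concatenation
        hasher.update(digest.encode("utf-8"))
    return "sha256:" + hasher.hexdigest()


def has_drift(recorded_hash: str, replayed_hash: str) -> bool:
    """A replay whose hash differs from the recorded one indicates drift."""
    return recorded_hash != replayed_hash


recorded = determinism_hash("policy:abc", ["sbom:1", "adv:9", "vex:4"])
replayed = determinism_hash("policy:abc", ["vex:4", "sbom:1", "adv:9"])
```

In this sketch the replayed hash equals the recorded one because only the arrival order of the inputs differs; changing any digest or the policy digest would flag drift.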
6 · Trigger Sources & Scheduling
| Source | Description | SLAs |
|---|---|---|
| Nightly full run | Default schedule per tenant; ensures baseline alignment. | Finish before 07:00 UTC |
| Change stream | Concelier (advisory_raw), Excititor (vex_raw), SBOM imports emit policy.trigger.delta events. | Start within 60 s; complete within 5 min |
| Manual CLI/Console | Operators run ad-hoc evaluations. | No SLA; warns if warm path > target |
| CI | stella policy simulate runs in pipelines referencing golden SBOMs. | Must complete under 10 min to avoid pipeline timeout |
The orchestrator enforces max concurrency per tenant (maxActiveRuns), queue depth alarms, and fairness (round-robin per policy).
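A minimal model of round-robin fairness under a concurrency cap is sketched below. It assumes pending runs are grouped per policy in arrival order and that maxActiveRuns is a simple integer limit per tenant; the pick_runs function and its data shapes are hypothetical.

```python
from collections import deque


def pick_runs(pending: dict, active_count: int, max_active_runs: int) -> list:
    """Choose runs to start, one per policy in turn, within free slots.

    pending maps policy id -> list of run ids in arrival order.
    """
    slots = max(0, max_active_runs - active_count)
    queues = deque(
        (policy_id, deque(runs))
        for policy_id, runs in sorted(pending.items())
        if runs
    )
    started = []
    while queues and len(started) < slots:
        policy_id, runs = queues.popleft()
        started.append(runs.popleft())
        if runs:
            # Policy goes to the back of the line after getting one slot.
            queues.append((policy_id, runs))
    return started


pending = {"P-1": ["run-a", "run-b", "run-c"], "P-2": ["run-x"]}
chosen = pick_runs(pending, active_count=0, max_active_runs=3)
```

With three free slots, the low-volume policy P-2 gets its single run started before P-1 receives a second slot, which is the starvation-avoidance property round-robin is meant to provide.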
7 · Monitoring & Alerts
- Metrics: policy_run_seconds, policy_run_queue_depth, policy_run_failures_total, policy_run_incremental_backlog, policy_rules_fired_total.
- Dashboards: Highlight pending approvals, incremental backlog age, top failing policies, VEX override ratios (tie-in with /docs/observability/policy.md once published).
- Alerts:
  - Incremental backlog > 3 cycles.
  - Determinism hash mismatch.
  - Failure rate > 5 % over rolling hour.
  - Run duration > SLA (full > 30 min, incremental > 5 min).
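The alert thresholds above can be expressed as a small evaluation function. In practice these would more likely be alerting rules in the monitoring stack; the snapshot field names and alert labels below are hypothetical and used only to make the thresholds concrete.

```python
# Duration limits from the alert list: full > 30 min, incremental > 5 min.
DURATION_SLA_SECONDS = {"full": 30 * 60, "incremental": 5 * 60}


def evaluate_alerts(snapshot: dict) -> list:
    """Return names of alerts that fire for one metrics snapshot."""
    alerts = []
    if snapshot["backlog_cycles"] > 3:
        alerts.append("incremental_backlog")
    if snapshot["determinism_mismatch"]:
        alerts.append("determinism_mismatch")
    total = snapshot["runs_last_hour"]
    if total and snapshot["failures_last_hour"] / total > 0.05:
        alerts.append("failure_rate")
    limit = DURATION_SLA_SECONDS.get(snapshot["mode"])
    if limit is not None and snapshot["duration_seconds"] > limit:
        alerts.append("run_duration")
    return alerts


snapshot = {
    "backlog_cycles": 4,
    "determinism_mismatch": False,
    "runs_last_hour": 100,
    "failures_last_hour": 6,
    "mode": "full",
    "duration_seconds": 1000,
}
fired = evaluate_alerts(snapshot)
```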
8 · Failure Handling & Rollback
- Soft failures: Worker retries; after final failure, orchestrator emits policy.run.failed with diagnostics and recommended actions (e.g., missing SBOM segment).
- Hard failures: Schema mismatch, determinism guard violation (ERR_POL_004) blocks further runs until resolved.
- Rollback: Operators can activate previous policy version (see Lifecycle guide) and schedule full run to restore prior state.
9 · Offline / Sealed Mode
- Change streams originate from offline bundle imports; orchestrator processes delta manifests.
- Runs execute with sealed=true, blocking any external lookups; policy_runs.inputs.env.sealed set for auditing.
- Explain traces annotate cached data usage to prompt bundle refresh.
- Offline Kit exports include latest policy_runs snapshot and determinism hashes for evidence lockers.
10 · Compliance Checklist
- Run schemas validated: PolicyRunRequest / PolicyRunStatus DTOs from Scheduler Models (SCHED-MODELS-20-001) serialise deterministically; schema samples up to date.
- Cursor integrity: Incremental runs persist advisory & VEX cursors; replay verifies identical input digests.
- Queue fairness configured: Tenant-level concurrency limits and lease timeouts applied; no starvation of lower-volume policies.
- Determinism guard active: CI replay job (DEVOPS-POLICY-20-003) green; determinism hash recorded on each run.
- Observability wired: Metrics exported, alerts configured, and run events flowing to Notifier/Timeline.
- Offline tested: stella policy run --sealed executed in air-gapped environment; explain traces flag cached evidence usage.
- Recovery plan rehearsed: Failure and rollback drill documented; incident checklist aligned with Lifecycle guide.
Last updated: 2025-10-26 (Sprint 20).