
Policy Runs & Orchestration

Audience: Policy Engine operators, Scheduler team, DevOps, and tooling engineers planning CI integrations.
Scope: Run modes (full, incremental, simulate), orchestration pipeline, cursor management, replay/determinism guarantees, monitoring, and recovery procedures.

Policies only generate value when they execute deterministically against current SBOM, advisory, and VEX inputs. This guide explains how runs are triggered, how the orchestrator scopes work, and what artefacts you should expect at each stage.

1 · Run Modes at a Glance

| Mode | Trigger sources | Scope | Persistence | Primary use |
| --- | --- | --- | --- | --- |
| Full | Manual CLI (`stella policy run`), Console “Run now”, scheduled nightly job | Entire tenant (all registered SBOMs) | Writes `effective_finding_{policyId}` and `policy_runs` record | Baseline after policy approval, quarterly attestation, post-incident rechecks |
| Incremental | Change streams (Concelier advisories, Excititor VEX, SBOM imports), orchestrator cron | Only affected `(sbom, advisory)` tuples | Writes diffs to effective findings and run record | Continuous upkeep meeting ≤ 5 min SLA from input change |
| Simulate | Console review workspace, CLI (`stella policy simulate`), CI pipeline | Selected SBOM sample set (provided or golden set) | No materialisation; captures diff summary + explain traces | Authoring validation, regression safeguards, sealed-mode rehearsals |

All modes record their status in policy_runs with deterministic metadata:

{
  "_id": "run:P-7:2025-10-26T14:05:11Z:3f9a",
  "policy_id": "P-7",
  "policy_version": 4,
  "mode": "incremental",
  "status": "succeeded",     // queued | running | succeeded | failed | canceled | replay_pending
  "inputs": {
    "sbom_set": ["sbom:S-42","sbom:S-318"],
    "advisory_cursor": "2025-10-26T13:59:00Z",
    "vex_cursor": "2025-10-26T13:58:30Z",
    "env": {"exposure":"internet"}
  },
  "stats": {
    "components": 1742,
    "rules_fired": 68023,
    "findings_written": 4321,
    "vex_overrides": 210
  },
  "determinism_hash": "sha256:…",
  "started_at": "2025-10-26T14:05:11Z",
  "finished_at": "2025-10-26T14:06:01Z",
  "tenant": "default"
}

Schemas & samples: see src/StellaOps.Scheduler.Models/docs/SCHED-MODELS-20-001-POLICY-RUNS.md and the fixtures in samples/api/scheduler/policy-*.json for canonical payloads consumed by CLI/UI/worker integrations.
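
For tooling that reads `policy_runs` records, a minimal Python sketch of a consumer is shown here. It checks the `status` field against the values listed in the sample's comment and derives the run duration from `started_at` and `finished_at`. The helper names are hypothetical; the canonical schema lives in the Scheduler Models document referenced above.

```python
from datetime import datetime

# Status values taken from the comment in the sample record.
RUN_STATUSES = {"queued", "running", "succeeded", "failed", "canceled", "replay_pending"}


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def run_duration_seconds(record: dict) -> float:
    """Validate the status field and return finished_at - started_at in seconds."""
    if record["status"] not in RUN_STATUSES:
        raise ValueError(f"unknown status: {record['status']!r}")
    started = parse_utc(record["started_at"])
    finished = parse_utc(record["finished_at"])
    return (finished - started).total_seconds()


# Timestamps copied from the sample record.
sample = {
    "status": "succeeded",
    "started_at": "2025-10-26T14:05:11Z",
    "finished_at": "2025-10-26T14:06:01Z",
}
duration = run_duration_seconds(sample)  # 50.0 seconds for the sample
```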

2 · Pipeline Overview

sequenceDiagram
    autonumber
    participant Trigger as Trigger (CLI / Console / Change Stream)
    participant Orchestrator as Policy Orchestrator
    participant Queue as Scheduler Queue (Mongo/NATS)
    participant Engine as Policy Engine Workers
    participant Concelier as Concelier Service
    participant Excititor as Excititor Service
    participant SBOM as SBOM Service
    participant Store as Mongo (policy_runs & effective_finding_*)
    participant Observability as Metrics/Events

    Trigger->>Orchestrator: Run request (mode, scope, env)
    Orchestrator->>Queue: Enqueue PolicyRunRequest (idempotent key)
    Queue->>Engine: Lease job (fairness window)
    Engine->>Concelier: Fetch advisories + linksets (cursor-aware)
    Engine->>Excititor: Fetch VEX statements (cursor-aware)
    Engine->>SBOM: Fetch SBOM segments / BOM-Index
    Engine->>Engine: Evaluate policy (deterministic batches)
    Engine->>Store: Upsert effective findings + append history
    Engine->>Store: Persist policy_runs record + determinism hash
    Engine->>Observability: Emit metrics, traces, rule-hit logs
    Engine->>Orchestrator: Ack completion / failure
    Orchestrator->>Trigger: Notify (webhook, CLI, Console update)

Trigger – CLI, Console, or automated change stream publishes a PolicyRunRequest.
Orchestrator – Runs inside StellaOps.Policy.Engine worker host; applies fairness (tenant + policy quotas) and idempotency using run keys.
Queue – Backed by Mongo + optional NATS for fan-out; supports leases and replay on crash.
Engine – Stateless worker executing the deterministic evaluator.
Store – Mongo collections: policy_runs, effective_finding_{policyId}, policy_run_events (append-only history), optional object storage for explain traces.
Observability – Prometheus metrics (policy_run_seconds), OTLP traces, structured logs.
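
The idempotent enqueue step can be illustrated with a small in-memory stand-in for the queue. This is only a sketch: the actual run key format is not specified in this guide, so the key below (tenant, policy, mode, and both cursors) is an assumption, and `InMemoryQueue` is a hypothetical class rather than the Mongo/NATS-backed implementation.

```python
def run_key(request: dict) -> str:
    """Build a stable key from fields that identify a run (assumed composition)."""
    inputs = request["inputs"]
    parts = [
        request["tenant"],
        request["policy_id"],
        request["mode"],
        inputs["advisory_cursor"],
        inputs["vex_cursor"],
    ]
    # '|' avoids ambiguity with the colons inside the cursor timestamps.
    return "|".join(parts)


class InMemoryQueue:
    """Hypothetical stand-in for the Scheduler Queue: one job per run key."""

    def __init__(self) -> None:
        self.jobs = {}

    def enqueue(self, request: dict) -> bool:
        """Return True if the job was queued, False if it was a duplicate."""
        key = run_key(request)
        if key in self.jobs:
            return False
        self.jobs[key] = request
        return True


request = {
    "tenant": "default",
    "policy_id": "P-7",
    "mode": "incremental",
    "inputs": {
        "advisory_cursor": "2025-10-26T13:59:00Z",
        "vex_cursor": "2025-10-26T13:58:30Z",
    },
}
queue = InMemoryQueue()
first = queue.enqueue(request)         # queued
second = queue.enqueue(dict(request))  # same key, so rejected as a duplicate
```

Submitting the same request twice leaves a single job in the queue, which is the property a crash-and-replay trigger relies on.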

3 · Input Scoping & Cursors

3.1 Advisory & VEX Cursors

Each run records the latest Concelier change stream timestamp (advisory_cursor) and Excititor timestamp (vex_cursor).
Incremental runs receive change batches (feedId, lastOffset); orchestrator deduplicates using change_digest.
Full runs set both cursors to the current read time, which resets the baseline for subsequent incremental runs.
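
Deduplication of change batches might look like the sketch below. It assumes each batch already carries a `change_digest` field next to `feedId` and `lastOffset`; how the digest is computed is not covered here, and `dedupe_batches` is a hypothetical helper.

```python
def dedupe_batches(batches: list, seen_digests: set) -> list:
    """Return batches whose change_digest is new, recording each digest as seen."""
    fresh = []
    for batch in batches:
        digest = batch["change_digest"]
        if digest in seen_digests:
            continue  # already processed, skip the duplicate delivery
        seen_digests.add(digest)
        fresh.append(batch)
    return fresh


incoming = [
    {"feedId": "feed-a", "lastOffset": 10, "change_digest": "d1"},
    {"feedId": "feed-a", "lastOffset": 10, "change_digest": "d1"},  # redelivery
    {"feedId": "feed-b", "lastOffset": 3, "change_digest": "d2"},
]
seen = set()
fresh_batches = dedupe_batches(incoming, seen)
```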

3.2 SBOM Selection

Full runs enumerate all SBOM records declared active for the tenant.
Incremental runs derive SBOM set by intersecting advisory/VEX changes with BOM-Index lookups (component → SBOM mapping).
Simulations accept explicit SBOM list; if omitted, CLI uses etc/policy/golden-sboms.json.
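
The incremental selection rule can be sketched as a lookup followed by an intersection. The example assumes advisory/VEX changes have already been resolved to component identifiers and that the BOM-Index is available as a component-to-SBOM mapping; the function name and the sample identifiers are illustrative only.

```python
def affected_sboms(changed_components: list, bom_index: dict, active_sboms: set) -> list:
    """Map changed components to SBOMs through the index, keeping only active SBOMs."""
    hit = set()
    for component in changed_components:
        hit.update(bom_index.get(component, ()))
    # Sorted output keeps the resulting scope deterministic.
    return sorted(hit & set(active_sboms))


bom_index = {
    "pkg:npm/a": ["sbom:S-42", "sbom:S-7"],
    "pkg:npm/b": ["sbom:S-318"],
}
active = {"sbom:S-42", "sbom:S-318"}
scope = affected_sboms(["pkg:npm/a", "pkg:npm/c"], bom_index, active)
```

Here `sbom:S-7` is dropped because it is not active, and `pkg:npm/c` contributes nothing because no SBOM contains it.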

3.3 Environment Metadata

env block (free-form key/values) allows scenario-specific evaluation (e.g., env.exposure=internet).
Stored verbatim in policy_runs.inputs.env for replay; orchestrator hashes environment data to avoid cache collisions.
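
One way to hash environment data so that key order does not affect the result is to serialise it canonically first. The sketch below uses sorted-key JSON and SHA-256; the orchestrator's actual canonicalisation is not documented here, so treat `env_digest` as an assumption.

```python
import hashlib
import json


def env_digest(env: dict) -> str:
    """Hash an env block via canonical JSON (sorted keys, no extra whitespace)."""
    canonical = json.dumps(env, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = env_digest({"exposure": "internet", "sealed": False})
b = env_digest({"sealed": False, "exposure": "internet"})  # same data, different key order
c = env_digest({"exposure": "internal", "sealed": False})  # different scenario
```

`a` and `b` are equal, while `c` differs, so two scenarios never share a cache entry just because their keys were written in a different order.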

4 · Execution Semantics

Preparation: Worker loads compiled IR for target policy version (cached by digest).
Batching: Candidate tuples are grouped by SBOM, then by advisory to maintain deterministic order; page size defaults to 1024 tuples.
Evaluation: Rules execute with first-match semantics; results captured as PolicyVerdict.
Materialisation:
- Upserts into effective_finding_{policyId} using {policyId, sbomId, findingKey}.
- Previous versions stored in effective_finding_{policyId}_history.
Explain storage: Full explain trees stored in blob store when captureExplain=true; incremental runs keep sampled traces (configurable).
Completion: Worker writes final status, stats, determinism hash (combination of policy digest + ordered input digests), and emits policy.run.completed event.
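
The batching, first-match, and hashing steps above can be sketched together in a few lines. This is an illustration under assumptions, not the evaluator itself: tuples are modelled as `(sbom_id, advisory_id)` pairs, rules as `(name, predicate, verdict)` triples, and "ordered input digests" is interpreted as lexicographic sorting.

```python
import hashlib

PAGE_SIZE = 1024  # default page size stated above


def ordered_batches(tuples, page_size=PAGE_SIZE):
    """Sort (sbom_id, advisory_id) tuples by SBOM, then advisory, and yield pages."""
    ordered = sorted(tuples)
    for start in range(0, len(ordered), page_size):
        yield ordered[start:start + page_size]


def first_match(rules, candidate):
    """Return (rule name, verdict) for the first matching rule, or None."""
    for name, predicate, verdict in rules:
        if predicate(candidate):
            return name, verdict
    return None


def determinism_hash(policy_digest, input_digests):
    """Combine the policy digest with input digests in sorted order (assumed ordering)."""
    h = hashlib.sha256()
    h.update(policy_digest.encode("utf-8"))
    for digest in sorted(input_digests):
        h.update(b"\n")
        h.update(digest.encode("utf-8"))
    return "sha256:" + h.hexdigest()


pages = list(ordered_batches([("S-2", "A-1"), ("S-1", "A-2"), ("S-1", "A-1")], page_size=2))
rules = [
    ("critical-internet", lambda c: c["severity"] == "critical" and c["exposure"] == "internet", "block"),
    ("default", lambda c: True, "warn"),
]
verdict = first_match(rules, {"severity": "critical", "exposure": "internet"})
```

Because the tuples are sorted before paging and the digests are sorted before hashing, the same inputs produce the same pages and the same hash regardless of arrival order.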

5 · Retry, Replay & Determinism

Retries: Failures (network, validation) mark the run status=failed and enqueue a retry with exponential backoff, capped at 3 attempts. A manual re-run via the CLI resets the counters.
Replay:
- Use policy_runs record to assemble input snapshot (policy version, cursors, env).
- Fetch associated SBOM/advisory/VEX data via stella policy replay --run <id> which rehydrates data into a sealed bundle.
- Determinism hash mismatches between replay and recorded run indicate drift; CI job DEVOPS-POLICY-20-003 compares successive runs to guard this.
Cancellation: Manual stella policy run cancel <runId> or orchestrator TTL triggers status=canceled; partial changes roll back via history append (no destructive delete).
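
The retry schedule described above can be sketched as a delay function. Only the cap of 3 attempts comes from this guide; the 30-second base delay and the doubling factor are assumptions chosen for illustration.

```python
def retry_delay_seconds(attempt: int, base: float = 30.0, max_attempts: int = 3):
    """Delay before retry number `attempt` (1-based), or None when attempts are exhausted."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > max_attempts:
        return None  # give up; the run stays failed
    return base * 2 ** (attempt - 1)


schedule = [retry_delay_seconds(n) for n in (1, 2, 3, 4)]
```

With the assumed defaults the schedule is 30 s, 60 s, 120 s, and then no further retry.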

6 · Trigger Sources & Scheduling

| Source | Description | SLAs |
| --- | --- | --- |
| Nightly full run | Default schedule per tenant; ensures baseline alignment. | Finish before 07:00 UTC |
| Change stream | Concelier (`advisory_raw`), Excititor (`vex_raw`), SBOM imports emit `policy.trigger.delta` events. | Start within 60 s; complete within 5 min |
| Manual CLI/Console | Operators run ad-hoc evaluations. | No SLA; warns if warm path > target |
| CI | `stella policy simulate` runs in pipelines referencing golden SBOMs. | Must complete under 10 min to avoid pipeline timeout |

The orchestrator enforces max concurrency per tenant (maxActiveRuns), queue depth alarms, and fairness (round-robin per policy).
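
A rough sketch of how a per-tenant concurrency cap and round-robin fairness interact is given below. `lease_order` is a hypothetical helper; the real orchestrator also accounts for leases and quotas that are not modelled here.

```python
from collections import deque


def lease_order(pending_by_policy: dict, active: int, max_active_runs: int) -> list:
    """Pick pending runs round-robin across policies until the tenant cap is reached."""
    queues = deque(
        (policy, deque(runs))
        for policy, runs in sorted(pending_by_policy.items())
        if runs  # skip policies with nothing pending
    )
    leased = []
    while queues and active + len(leased) < max_active_runs:
        policy, runs = queues.popleft()
        leased.append(runs.popleft())
        if runs:
            queues.append((policy, runs))  # back of the line for the next turn
    return leased


pending = {"P-7": ["a1", "a2", "a3"], "P-9": ["b1"]}
leased = lease_order(pending, active=1, max_active_runs=4)
```

With one run already active and a cap of four, three runs are leased, and the low-volume policy `P-9` gets its turn before `P-7` takes a second slot.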

7 · Monitoring & Alerts

Metrics: policy_run_seconds, policy_run_queue_depth, policy_run_failures_total, policy_run_incremental_backlog, policy_rules_fired_total.
Dashboards: Highlight pending approvals, incremental backlog age, top failing policies, VEX override ratios (tie-in with /docs/observability/policy.md once published).
Alerts:
- Incremental backlog > 3 cycles.
- Determinism hash mismatch.
- Failure rate > 5 % over rolling hour.
- Run duration > SLA (full > 30 min, incremental > 5 min).
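
The alert thresholds listed above can be expressed as a small evaluation function, for example when prototyping alert logic before wiring it into a monitoring stack. The function signature and alert names are hypothetical; the thresholds are the ones from the list.

```python
# Duration SLAs from the alert list, in seconds.
SLA_SECONDS = {"full": 30 * 60, "incremental": 5 * 60}


def evaluate_alerts(backlog_cycles, hash_mismatch, failures, total_runs, mode, duration_seconds):
    """Return the names of alerts that fire for one observation window."""
    alerts = []
    if backlog_cycles > 3:
        alerts.append("incremental_backlog")
    if hash_mismatch:
        alerts.append("determinism_mismatch")
    if total_runs > 0 and failures / total_runs > 0.05:
        alerts.append("failure_rate")
    sla = SLA_SECONDS.get(mode)
    if sla is not None and duration_seconds > sla:
        alerts.append("run_duration")
    return alerts


fired = evaluate_alerts(
    backlog_cycles=4, hash_mismatch=False,
    failures=1, total_runs=10,
    mode="incremental", duration_seconds=200,
)
```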

8 · Failure Handling & Rollback

Soft failures: Worker retries; after final failure, orchestrator emits policy.run.failed with diagnostics and recommended actions (e.g., missing SBOM segment).
Hard failures: A schema mismatch or determinism guard violation (ERR_POL_004) blocks further runs until it is resolved.
Rollback: Operators can activate previous policy version (see Lifecycle guide) and schedule full run to restore prior state.

9 · Offline / Sealed Mode

Change streams originate from offline bundle imports; orchestrator processes delta manifests.
Runs execute with sealed=true, blocking any external lookups; policy_runs.inputs.env.sealed set for auditing.
Explain traces annotate cached data usage to prompt bundle refresh.
Offline Kit exports include latest policy_runs snapshot and determinism hashes for evidence lockers.
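
The sealed-mode behaviour can be pictured as a guard around data lookups: cached data is served and noted, while anything that would require an external call is refused. The class and exception names below are hypothetical and only illustrate the idea.

```python
class SealedModeError(RuntimeError):
    """Raised when a sealed run would need an external lookup."""


class LookupGuard:
    """Hypothetical wrapper that serves cached data and blocks remote fetches when sealed."""

    def __init__(self, sealed: bool, cache: dict) -> None:
        self.sealed = sealed
        self.cache = cache
        self.cached_hits = []  # keys served from cache, for explain-trace annotation

    def fetch(self, key, remote_fetch):
        if key in self.cache:
            self.cached_hits.append(key)
            return self.cache[key]
        if self.sealed:
            raise SealedModeError(f"external lookup blocked for {key!r}")
        value = remote_fetch(key)
        self.cache[key] = value
        return value


guard = LookupGuard(sealed=True, cache={"adv:1": "cached advisory"})
hit = guard.fetch("adv:1", remote_fetch=lambda key: "remote advisory")
```

The `cached_hits` list is the kind of information an explain trace could surface to prompt a bundle refresh.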

10 · Compliance Checklist

Run schemas validated: PolicyRunRequest / PolicyRunStatus DTOs from Scheduler Models (SCHED-MODELS-20-001) serialise deterministically; schema samples up to date.
Cursor integrity: Incremental runs persist advisory & VEX cursors; replay verifies identical input digests.
Queue fairness configured: Tenant-level concurrency limits and lease timeouts applied; no starvation of lower-volume policies.
Determinism guard active: CI replay job (DEVOPS-POLICY-20-003) green; determinism hash recorded on each run.
Observability wired: Metrics exported, alerts configured, and run events flowing to Notifier/Timeline.
Offline tested: stella policy run --sealed executed in air-gapped environment; explain traces flag cached evidence usage.
Recovery plan rehearsed: Failure and rollback drill documented; incident checklist aligned with Lifecycle guide.

Last updated: 2025-10-26 (Sprint 20).
