# Policy Runs & Orchestration

> **Audience:** Policy Engine operators, Scheduler team, DevOps, and tooling engineers planning CI integrations.  
> **Scope:** Run modes (`full`, `incremental`, `simulate`), orchestration pipeline, cursor management, replay/determinism guarantees, monitoring, and recovery procedures.

Policies only generate value when they execute deterministically against current SBOM, advisory, and VEX inputs. This guide explains how runs are triggered, how the orchestrator scopes work, and what artefacts you should expect at each stage.

---

## 1 · Run Modes at a Glance

| Mode | Trigger sources | Scope | Persistence | Primary use |
|------|-----------------|-------|-------------|-------------|
| **Full** | Manual CLI (`stella policy run`), Console “Run now”, scheduled nightly job | Entire tenant (all registered SBOMs) | Writes `effective_finding_{policyId}` and `policy_runs` record | Baseline after policy approval, quarterly attestation, post-incident rechecks |
| **Incremental** | Change streams (Concelier advisories, Excititor VEX, SBOM imports), orchestrator cron | Only affected `(sbom, advisory)` tuples | Writes diffs to effective findings and run record | Continuous upkeep within the ≤ 5 min SLA from input change |
| **Simulate** | Console review workspace, CLI (`stella policy simulate`), CI pipeline | Selected SBOM sample set (provided or golden set) | No materialisation; captures diff summary + explain traces | Authoring validation, regression safeguards, sealed-mode rehearsals |

All modes record their status in `policy_runs` with deterministic metadata:

```jsonc
{
  "_id": "run:P-7:2025-10-26T14:05:11Z:3f9a",
  "policy_id": "P-7",
  "policy_version": 4,
  "mode": "incremental",
  "status": "succeeded",     // queued | running | succeeded | failed | canceled | replay_pending
  "inputs": {
    "sbom_set": ["sbom:S-42","sbom:S-318"],
    "advisory_cursor": "2025-10-26T13:59:00Z",
    "vex_cursor": "2025-10-26T13:58:30Z",
    "env": {"exposure":"internet"}
  },
  "stats": {
    "components": 1742,
    "rules_fired": 68023,
    "findings_written": 4321,
    "vex_overrides": 210
  },
  "determinism_hash": "sha256:…",
  "started_at": "2025-10-26T14:05:11Z",
  "finished_at": "2025-10-26T14:06:01Z",
  "tenant": "default"
}
```

> **Schemas & samples:** see `src/Scheduler/__Libraries/StellaOps.Scheduler.Models/docs/SCHED-MODELS-20-001-POLICY-RUNS.md` and the fixtures in `samples/api/scheduler/policy-*.json` for canonical payloads consumed by CLI/UI/worker integrations.
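
As a minimal sketch of how tooling might consume such a record, the snippet below checks `status` against the values listed in the sample and derives the run duration from `started_at` and `finished_at`. The helper names are illustrative and not part of any StellaOps SDK; the canonical schema remains the one referenced above.

```python
from datetime import datetime

# Status values copied from the inline comment in the sample record.
KNOWN_STATUSES = {
    "queued", "running", "succeeded", "failed", "canceled", "replay_pending",
}


def parse_timestamp(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def run_duration_seconds(run: dict) -> float:
    """Return the wall-clock duration of a finished run record."""
    if run["status"] not in KNOWN_STATUSES:
        raise ValueError(f"unknown run status: {run['status']!r}")
    started = parse_timestamp(run["started_at"])
    finished = parse_timestamp(run["finished_at"])
    return (finished - started).total_seconds()


sample = {
    "status": "succeeded",
    "started_at": "2025-10-26T14:05:11Z",
    "finished_at": "2025-10-26T14:06:01Z",
}
print(run_duration_seconds(sample))  # 50.0
```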

---

## 2 · Pipeline Overview

```mermaid
sequenceDiagram
    autonumber
    participant Trigger as Trigger (CLI / Console / Change Stream)
    participant Orchestrator as Policy Orchestrator
    participant Queue as Scheduler Queue (Mongo/NATS)
    participant Engine as Policy Engine Workers
    participant Concelier as Concelier Service
    participant Excititor as Excititor Service
    participant SBOM as SBOM Service
    participant Store as Mongo (policy_runs & effective_finding_*)
    participant Observability as Metrics/Events

    Trigger->>Orchestrator: Run request (mode, scope, env)
    Orchestrator->>Queue: Enqueue PolicyRunRequest (idempotent key)
    Queue->>Engine: Lease job (fairness window)
    Engine->>Concelier: Fetch advisories + linksets (cursor-aware)
    Engine->>Excititor: Fetch VEX statements (cursor-aware)
    Engine->>SBOM: Fetch SBOM segments / BOM-Index
    Engine->>Engine: Evaluate policy (deterministic batches)
    Engine->>Store: Upsert effective findings + append history
    Engine->>Store: Persist policy_runs record + determinism hash
    Engine->>Observability: Emit metrics, traces, rule-hit logs
    Engine->>Orchestrator: Ack completion / failure
    Orchestrator->>Trigger: Notify (webhook, CLI, Console update)
```

- **Trigger** – CLI, Console, or automated change stream publishes a `PolicyRunRequest`.
- **Orchestrator** – Runs inside `StellaOps.Policy.Engine` worker host; applies fairness (tenant + policy quotas) and idempotency using run keys.
- **Queue** – Backed by Mongo + optional NATS for fan-out; supports leases and replay on crash.
- **Engine** – Stateless worker executing the deterministic evaluator.
- **Store** – Mongo collections: `policy_runs`, `effective_finding_{policyId}`, `policy_run_events` (append-only history), optional object storage for explain traces.
- **Observability** – Prometheus metrics (`policy_run_seconds`), OTLP traces, structured logs.
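
The queue behaviour described above (idempotent enqueue, leases, replay after a crash) can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions: the class, its method names, the run-key strings, and the lease duration are all hypothetical, and the real queue is backed by Mongo/NATS rather than a Python dict.

```python
class LeaseQueue:
    """In-memory stand-in for a queue with idempotent enqueue and leases."""

    def __init__(self, lease_seconds: float = 60.0):
        self.lease_seconds = lease_seconds  # hypothetical default
        self._jobs = {}    # run key -> payload, kept in insertion order
        self._leases = {}  # run key -> lease expiry (seconds)

    def enqueue(self, run_key: str, payload: dict) -> bool:
        # A repeated key is ignored, so a re-published trigger is harmless.
        if run_key in self._jobs:
            return False
        self._jobs[run_key] = payload
        return True

    def lease(self, now: float):
        # Hand out the first job that is unleased or whose lease has expired
        # (for example because the previous worker crashed before acking).
        for run_key, payload in self._jobs.items():
            if self._leases.get(run_key, float("-inf")) <= now:
                self._leases[run_key] = now + self.lease_seconds
                return run_key, payload
        return None

    def ack(self, run_key: str) -> None:
        # Completion removes the job so it is never leased again.
        self._jobs.pop(run_key, None)
        self._leases.pop(run_key, None)


queue = LeaseQueue(lease_seconds=10)
queue.enqueue("tenant-a:P-7:incremental:batch-1", {"mode": "incremental"})
queue.enqueue("tenant-a:P-7:incremental:batch-1", {"mode": "incremental"})  # ignored
```

An expired lease makes the job visible to the next worker, which is one simple way to get replay on crash without a separate recovery path.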

---

## 3 · Input Scoping & Cursors

### 3.1 Advisory & VEX Cursors

- Each run records the latest Concelier change stream timestamp (`advisory_cursor`) and Excititor timestamp (`vex_cursor`).
- Incremental runs receive change batches `(feedId, lastOffset)`; orchestrator deduplicates using `change_digest`.
- Full runs set cursors to “current read time”, effectively resetting the incremental baseline.
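
Deduplication by `change_digest` might look roughly like the following. The `ChangeBatch` type and its field names are assumptions made for illustration; only `feedId`, `lastOffset`, and `change_digest` come from the text above, and how the digest itself is computed is not specified here.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class ChangeBatch(NamedTuple):
    feed_id: str        # corresponds to feedId
    last_offset: int    # corresponds to lastOffset
    change_digest: str  # digest used for deduplication


def deduplicate(batches: Iterable[ChangeBatch],
                seen: Optional[Set[str]] = None) -> List[ChangeBatch]:
    """Keep the first batch for each change_digest, preserving arrival order."""
    seen = set() if seen is None else seen
    unique = []
    for batch in batches:
        if batch.change_digest in seen:
            continue  # already processed (e.g. a redelivered change event)
        seen.add(batch.change_digest)
        unique.append(batch)
    return unique


incoming = [
    ChangeBatch("feed-a", 101, "d1"),
    ChangeBatch("feed-a", 101, "d1"),  # redelivery of the same change
    ChangeBatch("feed-b", 7, "d2"),
]
print(len(deduplicate(incoming)))  # 2
```

Passing a persistent `seen` set lets the caller carry deduplication state across successive deliveries.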

### 3.2 SBOM Selection

- Full runs enumerate all SBOM records declared active for the tenant.
- Incremental runs derive the SBOM set by intersecting advisory/VEX changes with BOM-Index lookups (component → SBOM mapping).
- Simulations accept explicit SBOM list; if omitted, CLI uses `etc/policy/golden-sboms.json`.
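
To make the incremental scoping concrete, here is a hedged sketch of the intersection step. It assumes the BOM-Index can be viewed as a mapping from component identifier to the set of SBOMs containing it, and that each changed advisory lists the components it touches; the identifiers and the function name are invented for the example.

```python
from typing import Dict, List, Set, Tuple


def affected_tuples(changed_advisories: Dict[str, Set[str]],
                    bom_index: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return sorted (sbom, advisory) tuples touched by the changed advisories."""
    tuples = set()
    for advisory_id, components in changed_advisories.items():
        for component in components:
            # Components absent from the index belong to no registered SBOM.
            for sbom_id in bom_index.get(component, ()):
                tuples.add((sbom_id, advisory_id))
    return sorted(tuples)


changed = {
    "ADV-1": {"pkg:example/lib-a"},
    "ADV-2": {"pkg:example/lib-b", "pkg:example/unused"},
}
index = {
    "pkg:example/lib-a": {"sbom:S-42", "sbom:S-318"},
    "pkg:example/lib-b": {"sbom:S-42"},
}
for pair in affected_tuples(changed, index):
    print(pair)
```

Only SBOMs that actually contain a changed component appear in the result, which is what keeps incremental runs small relative to a full run.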

### 3.3 Environment Metadata

- `env` block (free-form key/values) allows scenario-specific evaluation (e.g., `env.exposure=internet`).
- Stored verbatim in `policy_runs.inputs.env` for replay; orchestrator hashes environment data to avoid cache collisions.
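
One common way to hash free-form key/values so that equal environments always yield the same cache key is to serialise them canonically first. The sketch below assumes canonical JSON plus SHA-256; the orchestrator's actual hashing scheme is not documented here.

```python
import hashlib
import json


def env_hash(env: dict) -> str:
    """Hash an env block independently of key order."""
    canonical = json.dumps(
        env,
        sort_keys=True,         # key order must not affect the hash
        separators=(",", ":"),  # no insignificant whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = env_hash({"exposure": "internet", "sealed": True})
b = env_hash({"sealed": True, "exposure": "internet"})
print(a == b)  # True
```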

---

## 4 · Execution Semantics

1. **Preparation:** Worker loads compiled IR for target policy version (cached by digest).
2. **Batching:** Candidate tuples are grouped by SBOM, then by advisory to maintain deterministic order; page size defaults to 1024 tuples.
3. **Evaluation:** Rules execute with first-match semantics; results captured as `PolicyVerdict`.
4. **Materialisation:** 
   - Upserts into `effective_finding_{policyId}` using `{policyId, sbomId, findingKey}`.
   - Previous versions stored in `effective_finding_{policyId}_history`.
5. **Explain storage:** Full explain trees stored in blob store when `captureExplain=true`; incremental runs keep sampled traces (configurable).
6. **Completion:** Worker writes final status, stats, determinism hash (combination of policy digest + ordered input digests), and emits `policy.run.completed` event.
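
The batching, first-match, and hashing steps can be sketched as follows. This is an illustration under assumptions, not the engine's implementation: rules are modelled as `(predicate, verdict)` pairs, the fallback verdict name is invented, and the way the policy digest and input digests are combined into the determinism hash is a guess (here the input digests are sorted and fed to SHA-256 after the policy digest).

```python
import hashlib
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

PAGE_SIZE = 1024  # default page size stated above

Candidate = Tuple[str, str]  # (sbom_id, advisory_id)


def batches(tuples: Iterable[Candidate],
            page_size: int = PAGE_SIZE) -> Iterator[List[Candidate]]:
    # Sorting (sbom_id, advisory_id) pairs groups by SBOM, then by advisory.
    ordered = sorted(tuples)
    for start in range(0, len(ordered), page_size):
        yield ordered[start:start + page_size]


def first_match(rules: Sequence[Tuple[Callable[[dict], bool], str]],
                finding: dict, default: str = "unmatched") -> str:
    # The first rule whose predicate holds decides the verdict.
    for predicate, verdict in rules:
        if predicate(finding):
            return verdict
    return default


def determinism_hash(policy_digest: str, input_digests: Iterable[str]) -> str:
    # Assumed combination: policy digest followed by sorted input digests.
    hasher = hashlib.sha256(policy_digest.encode("utf-8"))
    for digest in sorted(input_digests):
        hasher.update(b"\n" + digest.encode("utf-8"))
    return "sha256:" + hasher.hexdigest()


rules = [
    (lambda f: f["vex"] == "not_affected", "suppressed"),
    (lambda f: f["severity"] == "critical", "block"),
]
print(first_match(rules, {"vex": "not_affected", "severity": "critical"}))  # suppressed
```

Because both the batch order and the digest order are fixed before any work starts, two workers given the same inputs would produce the same hash regardless of arrival order.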

---

## 5 · Retry, Replay & Determinism

- **Retries:** Failures (network, validation) mark the run `status=failed` and enqueue a retry with exponential backoff, for at most 3 attempts. A manual re-run via the CLI resets the counters.
- **Replay:**
  - Use the `policy_runs` record to assemble the input snapshot (policy version, cursors, env).
  - Fetch the associated SBOM/advisory/VEX data via `stella policy replay --run <id>`, which rehydrates the data into a sealed bundle.
  - A determinism hash mismatch between the replay and the recorded run indicates drift; CI job `DEVOPS-POLICY-20-003` compares successive runs to guard against this.
- **Cancellation:** Manual `stella policy run cancel <runId>` or orchestrator TTL triggers `status=canceled`; partial changes roll back via history append (no destructive delete).
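
A hedged sketch of the retry schedule and the replay drift check follows. The base delay and growth factor are assumptions (the text only states exponential backoff with a cap of 3 attempts), and whether the cap includes the initial attempt is not specified, so the example treats it as three retries.

```python
from typing import Optional

MAX_ATTEMPTS = 3  # cap stated above


def retry_delay_seconds(attempt: int, base: float = 30.0,
                        factor: float = 2.0) -> Optional[float]:
    """Delay before retry number `attempt` (1-based); None once exhausted."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > MAX_ATTEMPTS:
        return None  # give up; the run stays failed
    return base * factor ** (attempt - 1)


def has_drifted(recorded_run: dict, replay_hash: str) -> bool:
    # Drift means the replay produced a different determinism hash.
    return recorded_run["determinism_hash"] != replay_hash


print([retry_delay_seconds(n) for n in range(1, 5)])  # [30.0, 60.0, 120.0, None]
```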

---

## 6 · Trigger Sources & Scheduling

| Source | Description | SLAs |
|--------|-------------|------|
| **Nightly full run** | Default schedule per tenant; ensures baseline alignment. | Finish before 07:00 UTC |
| **Change stream** | Concelier (`advisory_raw`), Excititor (`vex_raw`), SBOM imports emit `policy.trigger.delta` events. | Start within 60 s; complete within 5 min |
| **Manual CLI/Console** | Operators run ad-hoc evaluations. | No SLA; warns if the warm path exceeds its target |
| **CI** | `stella policy simulate` runs in pipelines referencing golden SBOMs. | Must complete under 10 min to avoid pipeline timeout |

The orchestrator enforces max concurrency per tenant (`maxActiveRuns`), queue depth alarms, and fairness (round-robin per policy).
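
As an illustration of how a per-tenant concurrency cap and per-policy round-robin interact, consider the sketch below. Only `maxActiveRuns` is taken from the text; the function, its inputs, and the choice to visit policies in sorted order are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List


def pick_next_runs(pending: Dict[str, List[str]], active_runs: int,
                   max_active_runs: int) -> List[str]:
    """Choose runs round-robin across policies until the tenant cap is reached."""
    # One FIFO per policy, visited in a stable (sorted) policy order.
    queues = [deque(runs) for _, runs in sorted(pending.items()) if runs]
    slots = max(0, max_active_runs - active_runs)
    picked = []
    while queues and len(picked) < slots:
        queue = queues.pop(0)
        picked.append(queue.popleft())
        if queue:
            queues.append(queue)  # policy goes to the back of the rotation
    return picked


pending = {"P-7": ["a1", "a2", "a3"], "P-9": ["b1"]}
print(pick_next_runs(pending, active_runs=0, max_active_runs=3))  # ['a1', 'b1', 'a2']
```

The low-volume policy `P-9` gets a slot before `P-7` takes its second, which is the starvation guarantee that round-robin is meant to provide.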

---

## 7 · Monitoring & Alerts

- **Metrics:** `policy_run_seconds`, `policy_run_queue_depth`, `policy_run_failures_total`, `policy_run_incremental_backlog`, `policy_rules_fired_total`.
- **Dashboards:** Highlight pending approvals, incremental backlog age, top failing policies, VEX override ratios (tie-in with `/docs/observability/policy.md` once published).
- **Alerts:**
  - Incremental backlog > 3 cycles.
  - Determinism hash mismatch.
  - Failure rate > 5 % over rolling hour.
  - Run duration > SLA (full > 30 min, incremental > 5 min).
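
The alert thresholds above can be expressed as a small evaluation function, shown here only to make the conditions unambiguous. In practice these would more likely live in the alerting system's own rule language; the function name, parameters, and alert labels are hypothetical.

```python
from typing import List

# Duration SLAs stated above, in seconds.
SLA_SECONDS = {"full": 30 * 60, "incremental": 5 * 60}


def evaluate_alerts(*, backlog_cycles: int, failures_last_hour: int,
                    runs_last_hour: int, mode: str, duration_seconds: float,
                    hash_mismatch: bool) -> List[str]:
    alerts = []
    if backlog_cycles > 3:
        alerts.append("incremental_backlog")
    if hash_mismatch:
        alerts.append("determinism_mismatch")
    # Guard against division by zero when no runs happened in the window.
    if runs_last_hour and failures_last_hour / runs_last_hour > 0.05:
        alerts.append("failure_rate")
    sla = SLA_SECONDS.get(mode)  # no duration SLA is listed for simulate
    if sla is not None and duration_seconds > sla:
        alerts.append("duration_sla")
    return alerts


print(evaluate_alerts(backlog_cycles=4, failures_last_hour=1, runs_last_hour=10,
                      mode="incremental", duration_seconds=200,
                      hash_mismatch=False))
```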

---

## 8 · Failure Handling & Rollback

- **Soft failures:** Worker retries; after final failure, orchestrator emits `policy.run.failed` with diagnostics and recommended actions (e.g., missing SBOM segment).
- **Hard failures:** A schema mismatch or a determinism guard violation (`ERR_POL_004`) blocks further runs until resolved.
- **Rollback:** Operators can activate the previous policy version (see [Lifecycle guide](lifecycle.md)) and schedule a full run to restore the prior state.

---

## 9 · Offline / Sealed Mode

- Change streams originate from offline bundle imports; orchestrator processes delta manifests.
- Runs execute with `sealed=true`, blocking any external lookups; `policy_runs.inputs.env.sealed` set for auditing.
- Explain traces annotate cached data usage to prompt bundle refresh.
- Offline Kit exports include latest `policy_runs` snapshot and determinism hashes for evidence lockers.

---

## 10 · Compliance Checklist

- [ ] **Run schemas validated:** `PolicyRunRequest` / `PolicyRunStatus` DTOs from Scheduler Models (`SCHED-MODELS-20-001`) serialise deterministically; schema samples up to date.
- [ ] **Cursor integrity:** Incremental runs persist advisory & VEX cursors; replay verifies identical input digests.
- [ ] **Queue fairness configured:** Tenant-level concurrency limits and lease timeouts applied; no starvation of lower-volume policies.
- [ ] **Determinism guard active:** CI replay job (`DEVOPS-POLICY-20-003`) green; determinism hash recorded on each run.
- [ ] **Observability wired:** Metrics exported, alerts configured, and run events flowing to Notifier/Timeline.
- [ ] **Offline tested:** `stella policy run --sealed` executed in air-gapped environment; explain traces flag cached evidence usage.
- [ ] **Recovery plan rehearsed:** Failure and rollback drill documented; incident checklist aligned with Lifecycle guide.

---

*Last updated: 2025-10-26 (Sprint 20).*