Fine. Here’s the next epic, written so you can paste it straight into the repo without having to babysit me. Same structure as before, maximum detail, zero hand‑waving.

---

# Epic 2: Policy Engine & Policy Editor (VEX + Advisory Application Rules)

> Short name: **Policy Engine v2**
> Services touched: **Policy Engine, Web API, Console (Policy Editor), CLI, Conseiller, Excitator, SBOM Service, Authority, Workers/Scheduler**
> Data stores: **MongoDB (policies, runs, effective findings), optional Redis/NATS for jobs**

---

## 1) What it is

This epic delivers the **organization‑specific decision layer** for Stella.

Ingestion is now AOC‑compliant (Epic 1). That means advisories and VEX arrive as immutable raw facts. This epic builds the place where those facts become **effective findings** under policies you control.

Core deliverables:

* **Policy Engine**: deterministic evaluator that applies rule sets to inputs:

  * Inputs: `advisory_raw`, `vex_raw`, SBOMs, optional telemetry hooks (reachability stubs), org metadata.
  * Outputs: `effective_finding_{policyId}` materializations, with full explanation traces.
* **Policy Editor (Console + CLI)**: versioned policy authoring, simulation, review/approval workflow, and change diffs.
* **Rules DSL v1**: safe, declarative language for VEX application, advisory normalization, and risk scoring. No arbitrary code execution, no network calls.
* **Run Orchestrator**: incremental re‑evaluation when new raw facts or SBOM changes arrive; efficient partial updates.

The philosophy is boring on purpose: policy is a **pure function of inputs**. Same inputs and same policy yield the same outputs, every time, on every machine. If you want drama, watch reality TV, not your risk pipeline.

---

## 2) Why

* Vendors disagree, contexts differ, and your tolerance for risk is not universal.
* VEX means nothing until you decide **how** to apply it to **your** assets.
* Auditors care about the “why.” You’ll need consistent, replayable answers, with traces.
* Security teams need **simulation** before rollouts, and **diffs** after.

---

## 3) How it should work (deep details)

### 3.1 Data model

#### 3.1.1 Policy documents (Mongo: `policies`)

```jsonc
{
  "_id": "policy:P-7:v3",
  "policy_id": "P-7",
  "version": 3,
  "name": "Default Org Policy",
  "status": "approved",            // draft | submitted | approved | archived
  "owned_by": "team:sec-plat",
  "valid_from": "2025-01-15T00:00:00Z",
  "valid_to": null,
  "dsl": {
    "syntax": "stella-dsl@1",
    "source": "rule-set text or compiled IR ref"
  },
  "metadata": {
    "description": "Baseline scoring + VEX precedence",
    "tags": ["baseline","vex","cvss"]
  },
  "provenance": {
    "created_by": "user:ali",
    "created_at": "2025-01-15T08:00:00Z",
    "submitted_by": "user:kay",
    "approved_by": "user:root",
    "approval_at": "2025-01-16T10:00:00Z",
    "checksum": "sha256:..."
  },
  "tenant": "default"
}
```

Constraints:

* `status=approved` is required to run in production.
* Version increments are append‑only. Old versions remain runnable for replay.

#### 3.1.2 Policy runs (Mongo: `policy_runs`)

```jsonc
{
  "_id": "run:P-7:2025-02-20T12:34:56Z:abcd",
  "policy_id": "P-7",
  "policy_version": 3,
  "inputs": {
    "sbom_set": ["sbom:S-42"],
    "advisory_cursor": "2025-02-20T00:00:00Z",
    "vex_cursor": "2025-02-20T00:00:00Z"
  },
  "mode": "incremental",           // full | incremental | simulate
  "stats": {
    "components": 1742,
    "advisories_considered": 9210,
    "vex_considered": 1187,
    "rules_fired": 68023,
    "findings_out": 4321
  },
  "trace": {
    "location": "blob://traces/run-.../index.json",
    "sampling": "smart-10pct"
  },
  "status": "succeeded",           // queued | running | failed | succeeded | canceled
  "started_at": "2025-02-20T12:34:56Z",
  "finished_at": "2025-02-20T12:35:41Z",
  "tenant": "default"
}
```

#### 3.1.3 Effective findings (Mongo: `effective_finding_P-7`)

```jsonc
{
  "_id": "P-7:S-42:pkg:npm/lodash@4.17.21:CVE-2021-23337",
  "policy_id": "P-7",
  "policy_version": 3,
  "sbom_id": "S-42",
  "component_purl": "pkg:npm/lodash@4.17.21",
  "advisory_ids": ["CVE-2021-23337", "GHSA-..."],
  "status": "affected",            // affected | not_affected | fixed | under_investigation | suppressed
  "severity": {
    "normalized": "High",
    "score": 7.5,
    "vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N",
    "rationale": "cvss_base(OSV) + vendor_weighting + env_modifiers"
  },
  "rationale": [
    {"rule":"vex.precedence","detail":"VendorX not_affected justified=component_not_present wins"},
    {"rule":"advisory.cvss.normalization","detail":"mapped GHSA severity to CVSS 3.1 = 7.5"}
  ],
  "references": {
    "advisory_raw_ids": ["advisory_raw:osv:GHSA-...:v3"],
    "vex_raw_ids": ["vex_raw:VendorX:doc-123:v4"]
  },
  "run_id": "run:P-7:2025-02-20T12:34:56Z:abcd",
  "tenant": "default"
}
```

Write protection: only the **Policy Engine** service identity may write any `effective_finding_*` collection.

---

### 3.2 Rules DSL v1 (stella‑dsl@1)

**Design goals**

* Declarative, composable, deterministic.
* No loops, no network IO, no non‑deterministic time.
* Policy authors see readable text; the engine compiles to a safe IR.

**Concepts**

* **WHEN** condition matches a tuple `(sbom_component, advisory, optional vex_statements)`
* **THEN** actions set `status`, compute `severity`, attach `rationale`, or `suppress` with reason.
* **Profiles** for severity and scoring; **Maps** for vendor weighting; **Guards** for VEX justification.
**Mini‑grammar (subset)**

```
policy "Default Org Policy" syntax "stella-dsl@1" {

  profile severity {
    map vendor_weight {
      source "GHSA" => +0.5
      source "OSV" => +0.0
      source "VendorX" => -0.2
    }
    env base_cvss {
      if env.runtime == "serverless" then -0.5
      if env.exposure == "internal-only" then -1.0
    }
  }

  rule vex_precedence {
    when vex.any(status in ["not_affected","fixed"])
      and vex.justification in ["component_not_present","vulnerable_code_not_present"]
    then status := vex.status
    because "VEX strong justification prevails";
  }

  rule advisory_to_cvss {
    when advisory.source in ["GHSA","OSV"]
    then severity := normalize_cvss(advisory)
    because "Map vendor severity or CVSS vector";
  }

  rule reachability_soft_suppress {
    when severity.normalized <= "Medium"
      and telemetry.reachability == "none"
    then status := "suppressed"
    because "not reachable and low severity";
  }
}
```

**Built‑ins** (non‑exhaustive)

* `normalize_cvss(advisory)` maps GHSA/OSV/CSAF severity fields to CVSS v3.1 numbers when possible; otherwise vendor‑to‑numeric mapping table in policy.
* `vex.any(...)` tests across matching VEX statements for the same `(component, advisory)`.
* `telemetry.*` is an optional input namespace reserved for future reachability data; if absent, expressions evaluate to `unknown` (no effect).

**Determinism**

* Rules are evaluated in **stable order**: explicit `priority` attribute or lexical order.
* **First‑match** semantics for conflicting status unless `combine` is used.
* Severity computations are pure; numeric maps are part of policy document.

---

### 3.3 Evaluation model

1. **Selection**

   * For each SBOM component PURL, find candidate advisories from `advisory_raw` via linkset PURLs or identifiers.
   * For each pair `(component, advisory)`, load all matching VEX facts from `vex_raw`.
2. **Context assembly**

   * Build an evaluation context from:

     * `sbom_component`: PURL, licenses, relationships.
     * `advisory`: source, identifiers, references, embedded vendor severity (kept in `content.raw`).
     * `vex`: list of statements with status and justification.
     * `env`: org‑specific env vars configured per policy run (e.g., exposure).
     * Optional `telemetry` if available.
3. **Rule execution**

   * Compile DSL to IR once per policy version; cache.
   * Execute rules per tuple; record which rules fired and the order.
   * If no rule sets status, default is `affected`.
   * If no rule sets severity, default severity uses `normalize_cvss(advisory)` with vendor defaults.
4. **Materialization**

   * Write to `effective_finding_{policyId}` with `rationale` chain and references to raw docs.
   * Emit per‑tuple trace events; sample and store full traces per run.
5. **Incremental updates**

   * A watch job observes new `advisory_raw` and `vex_raw` inserts and SBOM deltas.
   * The orchestrator computes the affected tuples and re‑evaluates only those.
6. **Replay**

   * Any `policy_run` is fully reproducible by `(policy_id, version, input set, cursors)`.

---

### 3.4 VEX application semantics

* **Precedence**: a `not_affected` with strong justification (`component_not_present`, `vulnerable_code_not_present`, `fix_not_required`) wins unless another rule explicitly overrides by environment context.
* **Scoping**: VEX statements often specify product/component scope. Matching uses PURL equivalence and version ranges extracted during ingestion linkset generation.
* **Conflicts**: If multiple VEX statements conflict, the default is **most‑specific scope wins** (component > product > vendor), then newest `document_version`. Policies can override with explicit rules.
* **Explainability**: Every VEX‑driven decision records which statement IDs were considered and which one won.

---

### 3.5 Advisory normalization rules

* **Vendor severity mapping**: Map GHSA levels or CSAF product‑tree severities to CVSS‑like numeric bands via policy maps.
* **CVSS vector use**: If a valid vector exists in `content.raw`, parse and compute; apply policy modifiers from `profile severity`.
* **Temporal/environment modifiers**: Optional reductions for network exposure, isolation, or compensating controls, all encoded in policy.

---

### 3.6 Performance and scale

* Partition evaluation by SBOM ID and hash ranges of PURLs.
* Pre‑index `advisory_raw.linkset.purls` and `vex_raw.linkset.purls` (already in Epic 1).
* Use streaming iterators; avoid loading entire SBOM or advisory sets into memory.
* Materialize only changed findings (diff‑aware writes).
* Target: 100k components, 1M advisories considered, 5 minutes incremental SLA on commodity hardware.

---

### 3.7 Error codes

| Code          | Meaning                                               | HTTP |
| ------------- | ----------------------------------------------------- | ---- |
| `ERR_POL_001` | Policy syntax error                                   | 400  |
| `ERR_POL_002` | Policy not approved for run                           | 403  |
| `ERR_POL_003` | Missing inputs (SBOM/advisory/vex fetch failed)       | 424  |
| `ERR_POL_004` | Determinism guard triggered (non‑pure function usage) | 500  |
| `ERR_POL_005` | Write denied to effective findings (caller invalid)   | 403  |
| `ERR_POL_006` | Run canceled or timed out                             | 408  |

---

### 3.8 Observability

* Metrics:

  * `policy_compile_seconds`, `policy_run_seconds{mode=...}`, `rules_fired_total`, `findings_written_total`, `vex_overrides_total`, `simulate_diff_total{delta=up|down|unchanged}`.
* Tracing:

  * Spans: `policy.compile`, `policy.select`, `policy.eval`, `policy.materialize`.
* Logs:

  * Include `policy_id`, `version`, `run_id`, `sbom_id`, `component_purl`, `advisory_id`, `vex_count`, `rule_hits`.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

### 3.9 Security and tenancy

* Only users with `policy:write` can create/modify policies.
* `policy:approve` is a separate privileged role.
* Only Policy Engine service identity has `effective:write`.
* Tenancy is explicit on all documents and queries.
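
The scope and tenancy rules above can be sketched as a small guard. This is a minimal illustration, assuming a hypothetical `Caller` record that carries a tenant and a set of scopes; `is_allowed` and `guard_effective_write` are illustrative names, not part of the Authority or Policy Engine API. Only the scope strings and `ERR_POL_005` come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    """Hypothetical identity as it might be derived from a validated token."""
    subject: str
    tenant: str
    scopes: frozenset


def is_allowed(caller: Caller, scope: str, doc_tenant: str) -> bool:
    """Tenant must match the document's tenant AND the caller must hold the scope."""
    return caller.tenant == doc_tenant and scope in caller.scopes


def guard_effective_write(caller: Caller, doc_tenant: str) -> None:
    """Reject any write to effective_finding_* that lacks effective:write."""
    if not is_allowed(caller, "effective:write", doc_tenant):
        raise PermissionError("ERR_POL_005: write denied to effective findings")


if __name__ == "__main__":
    engine = Caller("svc:policy-engine", "default", frozenset({"effective:write"}))
    author = Caller("user:ali", "default", frozenset({"policy:write"}))

    guard_effective_write(engine, "default")  # passes silently
    print(is_allowed(author, "policy:write", "default"))    # author may edit policies
    print(is_allowed(author, "policy:approve", "default"))  # approval is a separate scope
    try:
        guard_effective_write(author, "default")
    except PermissionError as exc:
        print(exc)
```

Checking tenant and scope in one predicate keeps the two conditions from drifting apart: a caller with the right scope in the wrong tenant is denied the same way as a caller with no scope at all.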

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 4) API surface

### 4.1 Policy CRUD and lifecycle

* `POST /policies` create draft
* `GET /policies?status=...` list
* `GET /policies/{policyId}/versions/{v}` fetch
* `POST /policies/{policyId}/submit` move draft to submitted
* `POST /policies/{policyId}/approve` approve version
* `POST /policies/{policyId}/archive` archive version

### 4.2 Compilation and validation

* `POST /policies/{policyId}/versions/{v}/compile`

  * Returns IR checksum, syntax diagnostics, rule stats.

### 4.3 Runs

* `POST /policies/{policyId}/runs` body: `{mode, sbom_set, advisory_cursor?, vex_cursor?, env?}`
* `GET /policies/{policyId}/runs/{runId}` status + stats
* `POST /policies/{policyId}/simulate` returns **diff** vs current approved version on a sample SBOM set.

### 4.4 Findings and explanations

* `GET /findings/{policyId}?sbom_id=S-42&status=affected&severity=High+Critical`
* `GET /findings/{policyId}/{findingId}/explain` returns ordered rule hits and linked raw IDs.

All endpoints require tenant scoping and appropriate `policy:*` or `findings:*` roles.

---

## 5) Console (Policy Editor) and CLI behavior

**Console**

* Monaco‑style editor with DSL syntax highlighting, lint, quick docs.
* Side‑by‑side **Simulation** panel: show count of affected findings before/after.
* Approval workflow: submit, review comments, approve with rationale.
* Diffs: show rule‑wise changes and estimated impact.
* Read‑only run viewer: heatmap of rules fired, top suppressions, VEX wins.

**CLI**

* `stella policy new --name "Default Org Policy"`
* `stella policy edit P-7` opens local editor -> `submit`
* `stella policy approve P-7 --version 3`
* `stella policy simulate P-7 --sbom S-42 --env exposure=internal-only`
* `stella findings ls --policy P-7 --sbom S-42 --status affected`

Exit codes map to `ERR_POL_*`.
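
One way to keep API statuses and CLI exit codes aligned is a single table keyed by error code. The sketch below is illustrative: the meanings and HTTP statuses are copied from the error-code table in section 3.7, while the numeric exit codes (the numeric suffix of each code, with 0 for success) are an assumption, since this document does not fix them. `PolicyError` and `exit_code_for` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class PolicyError(Enum):
    # (meaning, HTTP status) taken from the error-code table in section 3.7
    ERR_POL_001 = ("Policy syntax error", 400)
    ERR_POL_002 = ("Policy not approved for run", 403)
    ERR_POL_003 = ("Missing inputs (SBOM/advisory/vex fetch failed)", 424)
    ERR_POL_004 = ("Determinism guard triggered (non-pure function usage)", 500)
    ERR_POL_005 = ("Write denied to effective findings (caller invalid)", 403)
    ERR_POL_006 = ("Run canceled or timed out", 408)

    @property
    def meaning(self) -> str:
        return self.value[0]

    @property
    def http_status(self) -> int:
        return self.value[1]

    @property
    def exit_code(self) -> int:
        # Assumption: the CLI exit code is the numeric suffix (ERR_POL_003 -> 3).
        return int(self.name.rsplit("_", 1)[1])


def exit_code_for(code: Optional[str]) -> int:
    """Return 0 when there is no error, otherwise the mapped exit code."""
    if code is None:
        return 0
    return PolicyError[code].exit_code


if __name__ == "__main__":
    for err in PolicyError:
        print(err.name, err.http_status, err.exit_code, err.meaning)
```

Deriving both values from one enum means a new `ERR_POL_*` entry cannot be added to the API without also getting a CLI exit code.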

---

## 6) Implementation tasks

### 6.1 Policy Engine service

* [ ] Implement DSL parser and IR compiler (`stella-dsl@1`).
* [ ] Build evaluator with stable ordering and first‑match semantics.
* [ ] Implement selection joiners for SBOM↔advisory↔vex using linksets.
* [ ] Materialization writer with upsert‑only semantics to `effective_finding_{policyId}`.
* [ ] Determinism guard (ban wall‑clock, network, and RNG during eval).
* [ ] Incremental orchestrator listening to advisory/vex/SBOM change streams.
* [ ] Trace emitter with rule‑hit sampling.
* [ ] Unit tests, property tests, golden fixtures; perf tests to target SLA.

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 6.2 Web API

* [ ] Policy CRUD, compile, run, simulate, findings, explain endpoints.
* [ ] Pagination, filters, and tenant enforcement on all list endpoints.
* [ ] Error mapping to `ERR_POL_*`.
* [ ] Rate limits on simulate endpoints.

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 6.3 Console (Policy Editor)

* [ ] Editor with DSL syntax highlighting and inline diagnostics.
* [ ] Simulation UI with pre/post counts and top deltas.
* [ ] Approval workflow UI with audit trail.
* [ ] Run viewer dashboards (rule heatmap, VEX wins, suppressions).

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 6.4 CLI

* [ ] New commands: `policy new|edit|submit|approve|simulate`, `findings ls|get`.
* [ ] JSON/YAML output formats for CI consumption.
* [ ] Non‑zero exits on syntax errors or simulation failures; map to `ERR_POL_*`.

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 6.5 Conseiller & Excitator integration

* [ ] Provide search endpoints optimized for policy selection (batch by PURLs and IDs).
* [ ] Harden linkset extraction to maximize join recall.
* [ ] Add cursors for incremental selection windows per run.

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 6.6 SBOM Service

* [ ] Ensure fast PURL index and component metadata projection for policy queries.
* [ ] Provide relationship graph API for future transitive logic.
* [ ] Emit change events on SBOM updates.

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 6.7 Authority

* [ ] Define scopes: `policy:write`, `policy:approve`, `policy:run`, `findings:read`, `effective:write`.
* [ ] Issue service identity for Policy Engine with `effective:write` only.
* [ ] Enforce tenant claims at gateway.

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 6.8 CI/CD

* [ ] Lint policy DSL in PRs; block invalid syntax.
* [ ] Run `simulate` against golden SBOMs to detect explosive deltas.
* [ ] Determinism CI: two runs with identical seeds produce identical outputs.

**Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 7) Documentation changes (create/update these files)

1. **`/docs/policy/overview.md`**

   * What the Policy Engine is, high‑level concepts, inputs, outputs, determinism.
2. **`/docs/policy/dsl.md`**

   * Full grammar, built‑ins, examples, best practices, anti‑patterns.
3. **`/docs/policy/lifecycle.md`**

   * Draft → submitted → approved → archived, roles, and audit trail.
4. **`/docs/policy/runs.md`**

   * Run modes, incremental mechanics, cursors, replay.
5. **`/docs/api/policy.md`**

   * Endpoints, request/response schemas, error codes.
6. **`/docs/cli/policy.md`**

   * Command usage, examples, exit codes, JSON output contracts.
7. **`/docs/ui/policy-editor.md`**

   * Screens, workflows, simulation, diffs, approvals.
8. **`/docs/architecture/policy-engine.md`**

   * Detailed sequence diagrams, selection/join strategy, materialization schema.
9. **`/docs/observability/policy.md`**

   * Metrics, tracing, logs, sample dashboards.
10. **`/docs/security/policy-governance.md`**

    * Scopes, approvals, tenancy, least privilege.
11. **`/docs/examples/policies/`**

    * `baseline.pol`, `serverless.pol`, `internal-only.pol`, each with commentary.
12. **`/docs/faq/policy-faq.md`**

    * Common pitfalls, VEX conflict handling, determinism gotchas.

Each file includes a **Compliance checklist** for authors and reviewers.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 8) Acceptance criteria

* Policies are versioned, approvable, and compilable; invalid DSL blocks merges.
* Engine produces deterministic outputs with full rationale chains.
* VEX precedence rules work per spec and are overridable by policy.
* Simulation yields accurate pre/post deltas and diffs.
* Only Policy Engine can write to `effective_finding_*`.
* Incremental runs pick up new advisories/VEX/SBOM changes without full re‑runs.
* Console and CLI cover authoring, simulation, approval, and retrieval.
* Observability dashboards show rule hits, VEX wins, and run timings.

---

## 9) Risks and mitigations

* **Policy sprawl**: too many similar policies.

  * Mitigation: templates, policy inheritance in v1.1, tagging, ownership metadata.
* **Non‑determinism creep**: someone sneaks wall‑clock or network into evaluation.

  * Mitigation: determinism guard, static analyzer, and CI replay check.
* **Join miss‑rate**: weak linksets cause under‑matching.

  * Mitigation: linkset strengthening in ingestion, PURL equivalence tables, monitoring for “zero‑hit” rates.
* **Approval bottlenecks**: blocked rollouts.

  * Mitigation: RBAC with delegated approvers and time‑boxed SLAs.

---

## 10) Test plan

* **Unit**: parser, compiler, evaluator; conflict resolution; precedence.
* **Property**: random policies over synthetic inputs; ensure no panics and stable outputs.
* **Golden**: fixed SBOM + curated advisories/VEX → expected findings; compare every run.
* **Performance**: large SBOMs with heavy rule sets; assert run times and memory ceilings.
* **Integration**: end‑to‑end simulate → approve → run → diff; verify write protections.
* **Chaos**: inject malformed VEX, missing advisories; ensure graceful degradation and clear errors.

---

## 11) Developer checklists

**Definition of Ready**

* Policy grammar finalized; examples prepared.
* Linkset join queries benchmarked.
* Owner and approvers assigned.

**Definition of Done**

* All APIs live with RBAC.
* CLI and Console features shipped.
* Determinism and golden tests green.
* Observability dashboards deployed.
* Docs in section 7 merged.
* Two real org policies migrated and in production.

---

## 12) Glossary

* **Policy**: versioned rule set controlling status and severity.
* **DSL**: domain‑specific language used to express rules.
* **Run**: a single evaluation execution with defined inputs and outputs.
* **Simulation**: a run that doesn’t write findings; returns diffs.
* **Materialization**: persisted effective findings for fast queries.
* **Determinism**: same inputs + same policy = same outputs. Always.

---

### Final imposed reminder

**Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.**