Here’s Epic 4 in the same paste‑into‑repo, implementation‑ready style as the prior epics. It’s exhaustive, formal, and slots directly into the existing AOC model, Policy Engine, and Console.

---

# Epic 4: Policy Studio (author, version, simulate)

> Short name: **Policy Studio**
> Services touched: **Policy Engine**, **Policy Registry** (new), **Web API Gateway**, **Authority** (authN/Z), **Scheduler/Workers**, **SBOM Service**, **Conseiller (Feedser)**, **Excitator (Vexer)**, **Telemetry**
> Surfaces: **Console (Web UI)** feature module, **CLI**, **CI hooks**
> Deliverables: Authoring workspace, policy versioning, static checks, simulation at scale, reviews/approvals, signing/publishing, promotion

---

## 1) What it is

**Policy Studio** is the end‑to‑end system for creating, evolving, and safely rolling out the rules that turn AOC facts (SBOM, advisories, VEX) into **effective findings**. It provides:

* A **workspace** where authors write policies in the DSL (Epic 2), with linting, autocompletion, snippets, and templates.
* A **Policy Registry** that stores immutable versions, compiled artifacts, metadata, provenance, and signatures.
* **Simulation** at two levels: quick local samples and large batch simulations across real SBOM inventories with full deltas.
* A **review/approval** workflow with comments, diffs, required approvers, and promotion to environments (dev/test/prod).
* **Publishing** semantics: signed, immutable versions bound to tenants; rollback and deprecation.
* Tight integration with **Explain** traces so any change can show exactly which rules fired and why outcomes shifted.

The Studio respects **AOC enforcement**: policies never edit or merge facts. They only interpret facts and produce determinations consistent with precedence rules defined in the DSL.

---

## 2) Why

* Policy errors are expensive. Authors need safe sandboxes, deterministic builds, and evidence before rollout.
* Auditors require immutability, provenance, and reproducibility from “source policy” to “effective finding.”
* Teams want gradual rollout: simulate, canary, promote, observe, rollback.
* Policy knowledge should be modular, reusable, and testable, not tribal.

---

## 3) How it should work (maximum detail)

### 3.1 Domain model

* **PolicyPackage**: `{name, tenant, description, owners[], tags[], created_at}`
* **PolicyVersion** (immutable): `{package, semver, source_sha, compiled_sha, status: draft|review|approved|published|deprecated|archived, created_by, created_at, signatures[], changelog, metadata{}}`
* **Workspace**: mutable working area for authors; holds unversioned edits until compiled.
* **CompilationArtifact**: `{policy_version, compiler_version, diagnostics[], rule_index[], symbol_table}`
* **SimulationSpec**: `{policy_version|workspace, sbom_selector, time_window?, environment?, sample_size?, severity_floor?, includes{advisories?, vex?}}`
* **SimulationRun**: `{run_id, spec, started_at, finished_at, result{counts_before, counts_after, top_deltas[], by_rule_hit[], sample_explains[]}}`
* **Review**: `{policy_version, required_approvers[], votes[], comments[], files_changed[], diffs[]}`
* **Promotion**: `{policy_version, environment: dev|test|prod, promoted_by, promoted_at, rollout_strategy: All|Percent|TenantSubset}`
* **Attestation**: OIDC‑backed signature metadata binding `source_sha` and `compiled_sha` to an actor and time.

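
To make the record shapes above more concrete, here is a minimal Python sketch of `PolicyVersion`. The field names come from the list; the Python types, the `Status` enum, and the use of a frozen dataclass to mirror "immutable" are assumptions made for illustration, not the Registry's actual schema.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from datetime import datetime, timezone
from enum import Enum


class Status(str, Enum):
    """Lifecycle states named in the PolicyVersion definition."""
    DRAFT = "draft"
    REVIEW = "review"
    APPROVED = "approved"
    PUBLISHED = "published"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"


@dataclass(frozen=True)  # frozen: attribute assignment raises after construction
class PolicyVersion:
    package: str
    semver: str
    source_sha: str
    compiled_sha: str
    status: Status
    created_by: str
    created_at: datetime
    signatures: tuple = ()          # assumed: opaque signature records
    changelog: str = ""
    metadata: dict = field(default_factory=dict)


# Example record (all values are made up for the sketch).
example = PolicyVersion(
    package="default-risk",
    semver="1.3.0",
    source_sha="a" * 64,
    compiled_sha="b" * 64,
    status=Status.REVIEW,
    created_by="alice@example.com",
    created_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```

A status change would then be modelled as creating a new record (for example with `dataclasses.replace`) rather than mutating the existing one.
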
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 3.2 Authoring workflow

1. **Create** a workspace from a template (e.g., “Default Risk Model,” “License Tilted,” “Cloud‑Native SBOM”).
2. **Edit** in the Studio: Monaco editor with DSL grammar, intelligent completion for predicates, policies, attributes.
3. **Lint & compile** locally: semantic checks, forbidden rules detection, policy size limits, constant‑folding.
4. **Unit tests**: run policy test cases on bundled fixtures and golden expectations.
5. **Quick simulate** on selected SBOMs (10–50 items) to preview counts, examples, and rule heatmap.
6. **Propose version**: bump semver, enter changelog; create a **PolicyVersion** in `review` with compiled artifacts.
7. **Review & approval**: side‑by‑side diff, comments, required approvers enforced by RBAC.
8. **Batch simulation**: run at scale across tenant inventory; produce deltas, sample explainer evidence.
9. **Publish**: sign and move to `published`; optional **Promotion** to target environment(s).
10. **Run** evaluation with the selected policy version; verify outcomes; optionally promote to default.
11. **Rollback**: select an older version; promotion updates references without mutating older versions.

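
The workflow above implies a small state machine over the `status` field. The sketch below is one way to encode it; the exact set of allowed transitions (for example, `review` back to `draft` when changes are requested) is an assumption inferred from the steps, not something this document specifies.

```python
# Assumed transition table; keys and values are PolicyVersion status names.
ALLOWED_TRANSITIONS = {
    "draft": {"review"},
    "review": {"approved", "draft"},   # "draft" models a changes-requested outcome
    "approved": {"published"},
    "published": {"deprecated"},       # published versions are never edited
    "deprecated": {"archived"},
    "archived": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Rollback does not appear in the table on purpose: per step 11 it re-points a promotion at an older version and leaves every version's status untouched.
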
### 3.3 Editing experience (Console)

* **Three‑pane layout**: file tree, editor, diagnostics/simulation.
* **Features**: autocomplete from symbol table, in‑editor docs on hover, go‑to definition, rule references, rename symbols across files, snippet library, policy templates.
* **Validations**:

  * AOC guardrails: no edit/merge actions on source facts, only interpretation.
  * Precedence correctness: if rules conflict, studio shows explicit order and effective winner.
  * Severity floor and normalization mapping validated against registry configuration.
* **Diagnostics panel**: errors, warnings, performance hints (e.g., “predicate X loads N advisories per component; consider indexing”).
* **Rule heatmap**: during simulation, bar chart of rule firings and the objects they impact.
* **Explain sampler**: click any delta bucket to open a sampled finding with full trace.

### 3.4 Simulation

* **Quick Sim**: synchronous; runs as a browser‑orchestrated job against the API, constrained by `sample_size`.
* **Batch Sim**: asynchronous; runs in workers:

  * Input selection: all SBOMs, labels, artifact regex, last N ingests, or a curated set.
  * Outputs: counts by severity before/after, by status, top deltas by component and advisory, rule heatmap, top K affected artifacts.
  * Evidence: NDJSON of sampled findings with traces; CSV summary; signed result manifest.
  * Guardrails: a version cannot be published if batch sim drift exceeds a configurable threshold, unless an override justification is recorded.

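
The publish guardrail can be sketched as follows. This document does not define how drift is measured, so the formula here (total absolute change in per-severity counts, divided by the total count before) is an assumption chosen for illustration, as are the function names.

```python
def drift_ratio(counts_before: dict, counts_after: dict) -> float:
    """Assumed drift metric: sum of |after - before| per severity / total before."""
    severities = set(counts_before) | set(counts_after)
    changed = sum(
        abs(counts_after.get(s, 0) - counts_before.get(s, 0)) for s in severities
    )
    total = sum(counts_before.values())
    if total == 0:
        # No baseline findings: any new finding counts as full drift.
        return 1.0 if changed else 0.0
    return changed / total


def can_publish(drift: float, threshold: float, override_justification=None) -> bool:
    """Allow publish under the threshold, or over it with a non-empty justification."""
    if drift <= threshold:
        return True
    return bool(override_justification and override_justification.strip())
```

With `counts_before = {"critical": 10, "high": 40, "medium": 50}` and two findings moving from `high` to `critical`, four counts change out of 100, so a 5% threshold would still allow publishing.
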
### 3.5 Versioning & promotion

* Semver enforced: `major` implies compatibility break (e.g., precedence changes), `minor` adds rules, `patch` fixes.
* **Immutable**: after `published`, the version cannot change; deprecate instead.
* **Environment bindings**: dev/test/prod mapping per tenant; default policy per environment.
* **Canary**: promote to a subset of tenants or artifacts; the Runs page displays A/B comparisons.

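
Two pieces of this section lend themselves to a short sketch: the semver bump and a percentage-based canary. The bump follows standard semantic versioning. The canary selection by hashing is an assumed technique (the document only says "subset of tenants or artifacts"), picked because it is deterministic and needs no stored assignment table.

```python
import hashlib


def bump(semver: str, level: str) -> str:
    """Bump a plain MAJOR.MINOR.PATCH string; pre-release tags are not handled."""
    major, minor, patch = (int(part) for part in semver.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level}")


def in_canary(subject_id: str, policy_version: str, percent: int) -> bool:
    """Deterministically place a tenant or artifact id into one of 100 buckets."""
    key = f"{policy_version}:{subject_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < percent
```

Including the policy version in the hash key means a different slice of subjects is drawn for each version, so the same tenants are not always the first to receive changes.
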
### 3.6 Review & approval

* Require N approvers by role; self‑approval optionally prohibited.
* Line and file comments; overall decision with justification.
* Review snapshot captures: diffs, diagnostics, simulation summary.
* Webhooks to notify external systems of review events.

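
The approver requirement can be expressed as a small quorum check. The vote tuple shape, the role names, and the rule that any single rejection blocks the review are assumptions for this sketch; the document only states "N approvers by role" and optional prohibition of self-approval.

```python
def review_passes(votes, required_by_role, author, allow_self_approval=False):
    """Check a review against per-role approval counts.

    votes: iterable of (user, role, decision) with decision in
           {"approve", "reject", "comment"} (assumed vocabulary).
    required_by_role: e.g. {"reviewer": 1, "approver": 1}.
    """
    approvers_by_role = {}
    for user, role, decision in votes:
        if decision == "reject":
            return False  # assumed: one rejection blocks the review
        if decision != "approve":
            continue
        if user == author and not allow_self_approval:
            continue  # the author's own approval is ignored
        approvers_by_role.setdefault(role, set()).add(user)
    return all(
        len(approvers_by_role.get(role, ())) >= needed
        for role, needed in required_by_role.items()
    )
```

Counting distinct users per role (a set rather than a tally) keeps a single person from satisfying a two-approver requirement by voting twice.
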
### 3.7 RBAC (Authority)

Roles per tenant:

* **Policy Author**: create/edit workspace, quick sim, propose versions.
* **Policy Reviewer**: comment, request changes, approve/reject.
* **Policy Approver**: final approve, publish.
* **Policy Operator**: promote, rollback, schedule runs.
* **Read‑only Auditor**: view everything, download evidence.

All actions are checked server‑side; the UI only hides affordances.

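
A deny-by-default check over the roles above might look like the sketch below. The role keys and action strings are hypothetical names invented for the example; only the grouping of capabilities per role follows the list.

```python
# Hypothetical action names; the grouping mirrors the role list above.
ROLE_ACTIONS = {
    "policy-author": {"workspace.create", "workspace.edit", "simulate.quick", "version.propose"},
    "policy-reviewer": {"review.comment", "review.request-changes", "review.approve", "review.reject"},
    "policy-approver": {"version.approve", "version.publish"},
    "policy-operator": {"version.promote", "version.rollback", "run.schedule"},
    "auditor": {"read", "evidence.download"},
}


def is_allowed(roles_by_tenant: dict, tenant: str, action: str) -> bool:
    """Deny by default: unknown tenants, roles, or actions all yield False."""
    roles = roles_by_tenant.get(tenant, ())
    return any(action in ROLE_ACTIONS.get(role, ()) for role in roles)
```

Because roles are looked up per tenant, a user who is an approver in one tenant gains nothing in another, which matches the "roles per tenant" wording.
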
### 3.8 CLI + CI integration

CLI verbs (examples):

```
stella policy init --template default
stella policy lint
stella policy compile
stella policy test --golden ./tests
stella policy simulate --sboms label:prod --sample 1000
stella policy version bump --level minor --changelog "Normalize GHSA CVSS"
stella policy submit --reviewers alice@example.com,bob@example.com
stella policy approve --version 1.3.0
stella policy publish --version 1.3.0 --sign
stella policy promote --version 1.3.0 --env test --percent 20
stella policy rollback --env prod --to 1.2.1
```

CI usage:

* Lint, compile, and run unit tests on PRs that modify `/policies/**`.
* Optionally trigger **Batch Sim** against a staging inventory and post a Markdown report to the PR.
* Block merge if diagnostics include errors or drift exceeds thresholds.

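
The merge-blocking rule in the last bullet could be implemented as a small gate that turns a report into an exit code. The JSON shape used here (`diagnostics[].severity` and a top-level `drift` number) is purely hypothetical, since this document does not define the CLI's JSON output; treat the field names and exit codes as placeholders.

```python
import json


def gate(report_json: str, drift_threshold: float) -> int:
    """Map a (hypothetical) policy report to a CI exit code.

    0 = pass, 1 = diagnostics contain errors, 2 = drift above threshold.
    """
    report = json.loads(report_json)
    has_errors = any(
        d.get("severity") == "error" for d in report.get("diagnostics", [])
    )
    if has_errors:
        return 1
    if report.get("drift", 0.0) > drift_threshold:
        return 2
    return 0
```

A CI step would pass the result to `sys.exit`, so warnings alone leave the build green while errors or excessive drift fail it with distinguishable codes.
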
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

### 3.9 APIs (representative)

* `POST /policies/workspaces` create from template
* `PUT /policies/workspaces/{id}/files` edit source files
* `POST /policies/workspaces/{id}/compile` get diagnostics + compiled artifact
* `POST /policies/workspaces/{id}/simulate` quick sim
* `POST /policies/versions` create version from workspace with semver + changelog
* `GET /policies/versions/{id}` fetch version + diagnostics + sim summary
* `POST /policies/versions/{id}/reviews` open review
* `POST /policies/versions/{id}/approve` record approval
* `POST /policies/versions/{id}/publish` sign + publish
* `POST /policies/versions/{id}/promote` bind to env/canary
* `POST /policies/versions/{id}/simulate-batch` start batch sim (async)
* `GET /policies/simulations/{run_id}` get sim results and artifacts
* `GET /policies/registry` list packages/versions, status and bindings

All calls require tenant scoping and RBAC.

### 3.10 Storage & data

* **Policy Registry DB** (MongoDB): packages, versions, workspaces, metadata.
* **Object storage**: source bundles, compiled artifacts, simulation result bundles, evidence.
* **Indexing**: compound indexes by `{tenant, package}`, `{tenant, status}`, `{tenant, environment}`.
* **Retention**: configurable retention for workspaces and simulation artifacts; versions never deleted, only archived.

### 3.11 Evidence & provenance

* Every published version has:

  * `source_sha` (content digest of the policy source bundle)
  * `compiled_sha` (digest of compiled artifact)
  * Attestation: signed envelope binding digests to an identity, time, and tenant.
  * Links to the exact compiler version, inputs, and environment.

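
A content digest such as `source_sha` is only reproducible if the bundle is hashed in a canonical order. The sketch below shows one way to do that with SHA-256; the canonicalisation (sorted paths, length-prefixed path and content) and the choice of SHA-256 are assumptions, since this document does not fix the digest format.

```python
import hashlib


def bundle_digest(files: dict) -> str:
    """Hash a {path: bytes} bundle independently of dict insertion order."""
    hasher = hashlib.sha256()
    for path in sorted(files):
        encoded_path = path.encode("utf-8")
        content = files[path]
        # Length prefixes keep ("a", b"bc") distinct from ("ab", b"c").
        hasher.update(len(encoded_path).to_bytes(8, "big"))
        hasher.update(encoded_path)
        hasher.update(len(content).to_bytes(8, "big"))
        hasher.update(content)
    return hasher.hexdigest()
```

Two checkouts of the same policy source then yield the same digest regardless of file enumeration order, which is the property the determinism tests in section 9 rely on.
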
### 3.12 Observability

* Metrics: compile time, diagnostics rate, simulation queue depth, delta magnitude distribution, approval latencies.
* Logs: structured events for lifecycle transitions.
* Traces: long simulations emit span per shard.

### 3.13 Performance & scale

* Compilation should complete in under 3 seconds for typical policies; warn at 10 seconds.
* Batch sim uses workers with partitioning by SBOM id; results are reduced by the API.
* Memory guardrails on rule execution; deny policies that exceed configured complexity limits.

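
Partitioning by SBOM id and reducing partial results can be sketched in a few lines. Hash-based shard assignment and `Counter`-based merging are assumptions chosen for the example; the document only says that work is partitioned by SBOM id and reduced afterwards.

```python
import hashlib
from collections import Counter


def shard_for(sbom_id: str, shard_count: int) -> int:
    """Stable shard index for an SBOM id (independent of process hash seed)."""
    digest = hashlib.sha256(sbom_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(sbom_ids, shard_count: int):
    """Group SBOM ids into shard_count lists, one per worker shard."""
    shards = [[] for _ in range(shard_count)]
    for sbom_id in sbom_ids:
        shards[shard_for(sbom_id, shard_count)].append(sbom_id)
    return shards


def reduce_partials(partials):
    """Merge per-shard severity counts into one summary dict."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)
```

Using a cryptographic hash instead of Python's built-in `hash()` keeps shard assignment identical across worker processes, which supports the determinism goal for simulation summaries.
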
### 3.14 Security

* OIDC‑backed signing and attestation.
* Policy sources are scanned on upload for secrets; blocked if found.
* Strict CSP in Studio pages; tokens stored in memory, not localStorage.
* Tenant isolation in buckets and DB collections.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 4) Implementation plan

### 4.1 Services

* **Policy Registry (new microservice)**

  * REST API and background workers for batch simulation orchestration.
  * Stores workspaces, versions, metadata, bindings, reviews.
  * Generates signed attestations at publish time.
  * Coordinates with **Policy Engine** for compile/simulate invocations.

* **Policy Engine (existing)**

  * Expose compile and simulate endpoints with deterministic outputs.
  * Provide rule coverage, symbol table, and explain traces for samples.

* **Web API Gateway**

  * Routes requests; injects tenant context; enforces RBAC.

### 4.2 Console (Web UI) feature module

* `packages/features/policies` (shared with Epic 3):

  * **Studio** routes: `/policies/studio`, `/policies/:id/versions/:v/edit`, `/simulate`, `/review`.
  * Monaco editor wrapper for DSL with hover docs, autocomplete.
  * Diff viewer, diagnostics, heatmap, explain sampler, review UI.

### 4.3 CLI

* New commands under `stella policy *`; typed client generated from OpenAPI.
* Outputs machine‑readable JSON and pretty tables.

### 4.4 Workers

* **Simulation workers**: pull shards of SBOMs, run policy, emit partials, reduce into result bundle.
* **Notification worker**: sends webhooks on review, approval, publish, promote.

---

## 5) Documentation changes (create/update)

1. **`/docs/policy/studio-overview.md`**

   * Concepts, roles, lifecycle, glossary.
2. **`/docs/policy/authoring.md`**

   * Workspace, templates, snippets, lint rules, best practices.
3. **`/docs/policy/versioning-and-publishing.md`**

   * Semver, immutability, deprecation, rollback, attestations.
4. **`/docs/policy/simulation.md`**

   * Quick vs batch sim, selection strategies, thresholds, evidence artifacts.
5. **`/docs/policy/review-and-approval.md`**

   * Required approvers, comments, webhooks, audit trail.
6. **`/docs/policy/promotion.md`**

   * Environments, canary, default policy binding, rollback.
7. **`/docs/policy/cli.md`**

   * Command reference with examples and JSON outputs.
8. **`/docs/policy/api.md`**

   * REST endpoints, request/response schemas, error codes.
9. **`/docs/security/policy-attestations.md`**

   * Signatures, digests, verifier steps.
10. **`/docs/architecture/policy-registry.md`**

    * Service design, schemas, queues, failure modes.
11. **`/docs/observability/policy-telemetry.md`**

    * Metrics, logs, tracing, dashboards.
12. **`/docs/runbooks/policy-incident.md`**

    * Rolling back a bad policy, freezing publishes, forensic steps.
13. **`/docs/examples/policy-templates.md`**

    * Ready‑made templates and snippet catalog.
14. **`/docs/aoc/aoc-guardrails.md`**

    * How Studio enforces AOC in authoring and review.

Each doc ends with a “Compliance checklist.”

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 6) Tasks

### 6.1 Backend: Policy Registry

* [ ] Define OpenAPI spec for Registry (workspaces, versions, reviews, sim).
* [ ] Implement workspace storage and file CRUD.
* [ ] Integrate with Policy Engine compile endpoint; return diagnostics, symbol table.
* [ ] Implement quick simulation with request limits.
* [ ] Implement batch simulation orchestration: enqueue shards, collect results, reduce deltas, store artifacts.
* [ ] Implement review model: comments, required approvers, decisions.
* [ ] Implement publish: sign, persist attestation, set status=published.
* [ ] Implement promotion bindings per tenant/environment; canary subsets.
* [ ] RBAC checks for all endpoints.
* [ ] Unit/integration tests; load tests for batch sim.

### 6.2 Policy Engine enhancements

* [ ] Return rule coverage and firing counts with compile/simulate.
* [ ] Return symbol table and inline docs for editor autocomplete.
* [ ] Expose deterministic Explain traces for sampled findings.
* [ ] Enforce complexity/time limits and report breaches.

### 6.3 Console (Web UI)

* [ ] Build Studio editor wrapper with Monaco + DSL language server hooks.
* [ ] Implement file tree, snippets, templates, hotkeys, search/replace.
* [ ] Diagnostics panel with jump‑to‑line, quick fixes.
* [ ] Simulation panel: quick sim UI, charts, heatmap, sample explains.
* [ ] Review UI: diff, comments, approvals, status badges.
* [ ] Publish & Promote flows with confirmation and post‑actions.
* [ ] Batch sim results pages with export buttons.
* [ ] Accessibility audits and keyboard‑only authoring flow.

### 6.4 CLI

* [ ] Implement commands listed in 3.8 with rich help and examples.
* [ ] Add `--json` flag for machine consumption; emit stable schemas.
* [ ] Exit codes aligned with CI usage (lint errors → non‑zero).

### 6.5 CI/CD & Security

* [ ] Add CI job that runs `stella policy lint/compile/test` on PRs.
* [ ] Optional job that triggers batch sim against staging inventory; post summary to PR.
* [ ] Policy source secret scanning; block on findings.
* [ ] Signing keys configuration; verify pipeline for attestation on publish.

### 6.6 Docs

* [ ] Write all docs in section 5 with screenshots and CLI transcripts.
* [ ] Add cookbook examples and templates in `/docs/examples/policy-templates.md`.
* [ ] Wire contextual Help links from Studio to relevant docs.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 7) Acceptance criteria

* Authors can create, edit, lint, compile policies with inline diagnostics and autocomplete.
* Quick simulation produces counts, rule heatmap, and sample explains within UI.
* Batch simulation scales across large SBOM sets, producing deltas and downloadable evidence.
* Review requires configured approvers; comments and diffs are preserved.
* Publish generates immutable, signed versions with attestations.
* Promotion binds versions to environments and supports canary and rollback.
* CLI supports full lifecycle and is usable in CI.
* All actions are tenant‑scoped, RBAC‑enforced, and logged.
* AOC guardrails prevent any mutation of raw facts.
* Documentation shipped and linked contextually from the Studio.

---

## 8) Risks & mitigations

* **Policy complexity causes timeouts** → compile‑time complexity scoring, execution limits, early diagnostics.
* **Simulation cost at scale** → sharding and streaming reducers; sampling; configurable quotas.
* **RBAC misconfiguration** → server‑enforced checks, defense‑in‑depth tests, deny‑by‑default.
* **Attestation key management** → OIDC‑backed signatures; auditable verifier tool; time‑boxed credentials.
* **Editor usability** → language server with accurate completions; docs on hover; snippet library.

---

## 9) Test plan

* **Unit**: compiler adapters, registry models, review workflow, CLI options.
* **Integration**: compile→simulate→publish→promote on seeded data.
* **E2E**: Playwright flows for author→review→batch sim→publish→promote→rollback.
* **Performance**: load test batch simulation with 100k components spread across SBOMs.
* **Security**: RBAC matrix tests; secret scanning; signing and verification.
* **Determinism**: same inputs produce identical `compiled_sha` and simulation summaries.

---

## 10) Feature flags

* `policy.studio` (enables editor and quick sim)
* `policy.batch-sim`
* `policy.canary-promotion`
* `policy.signature-required` (enforce signing on publish)

Flags documented in `/docs/observability/policy-telemetry.md`.

---

## 11) Non‑goals (this epic)

* Building a general IDE for arbitrary languages; the editor is purpose‑built for the DSL.
* Auto‑generated policies from AI without human approval.
* Cross‑tenant policies; all policies are tenant‑scoped.

---

## 12) Philosophy

* **Safety first**: it’s cheaper to prevent a bad policy than to fix its fallout.
* **Determinism**: same inputs, same outputs, verifiably.
* **Immutability**: versions and evidence are forever; we deprecate, not mutate.
* **Transparency**: every change is explainable with traces and proofs.
* **Reusability**: templates, snippets, and tests turn policy from art into engineering.

> Final reminder: **Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.**