feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions
--- a/docs/modules/policy/AGENTS.md
+++ b/docs/modules/policy/AGENTS.md
@@ -0,0 +1,22 @@
+# Policy Engine agent guide
+
+## Mission
+Policy Engine compiles and evaluates Stella DSL policies deterministically, producing explainable findings with full provenance.
+
+## Key docs
+- [Module README](./README.md)
+- [Architecture](./architecture.md)
+- [Implementation plan](./implementation_plan.md)
+- [Task board](./TASKS.md)
+
+## How to get started
+1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
+2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
+3. Read the architecture and README for domain context before editing code or docs.
+4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
+
+## Guardrails
+- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
+- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
+- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
+- Update runbooks/observability assets when operational characteristics change.
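The determinism guardrail in the agent guide above (sort outputs, normalise timestamps to UTC ISO-8601) can be illustrated with a minimal sketch. The field names `findingKey` and `observedAt` and both helper functions are hypothetical and chosen only for illustration; the guide does not prescribe a serialisation format.

```python
import json
from datetime import datetime, timedelta, timezone


def normalise_timestamp(ts: datetime) -> str:
    """Convert an aware datetime to UTC and render it as ISO-8601 with a 'Z' suffix."""
    if ts.tzinfo is None:
        # A naive timestamp depends on the machine's local zone, so reject it.
        raise ValueError("naive timestamps are ambiguous; attach a timezone first")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def canonical_output(findings: list[dict]) -> str:
    """Serialise findings with stable record order and stable key order."""
    normalised = [
        {**f, "observedAt": normalise_timestamp(f["observedAt"])}
        for f in findings
    ]
    # Sort on explicit fields so the result never depends on input order.
    normalised.sort(key=lambda f: (f["findingKey"], f["observedAt"]))
    return json.dumps(normalised, sort_keys=True, separators=(",", ":"))
```

Feeding the same findings in a different order, or with timestamps expressed in a different offset, should then yield byte-identical output.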
--- a/docs/modules/policy/README.md
+++ b/docs/modules/policy/README.md
@@ -0,0 +1,31 @@
+# StellaOps Policy Engine
+
+Policy Engine compiles and evaluates Stella DSL policies deterministically, producing explainable findings with full provenance.
+
+## Responsibilities
+- Compile `stella-dsl@1` packs into executable graphs.
+- Join advisories, VEX evidence, and SBOM inventories to derive effective findings.
+- Expose simulation and diff APIs for UI/CLI workflows.
+- Emit change-stream driven events for Notify/Scheduler integrations.
+
+## Key components
+- `StellaOps.Policy.Engine` service host.
+- Shared libraries under `StellaOps.Policy.*` for evaluation, storage, DSL tooling.
+
+## Integrations & dependencies
+- MongoDB findings collections, RustFS explain bundles.
+- Scheduler for incremental re-evaluation triggers.
+- CLI/UI for policy authoring and runs.
+
+## Operational notes
+- DSL grammar and lifecycle docs in ../../policy/.
+- Observability guidance in ../../observability/policy.md.
+- Governance and scope mapping in ../../security/policy-governance.md.
+
+## Backlog references
+- DOCS-POLICY-20-001 … DOCS-POLICY-20-012 (completed baseline).
+- DOCS-POLICY-23-007 (upcoming command updates).
+
+## Epic alignment
+- **Epic 2 – Policy Engine & Editor:** deliver deterministic evaluation, DSL infrastructure, explain traces, and incremental runs.
+- **Epic 4 – Policy Studio:** integrate registry workflows, simulation at scale, approvals, and promotion semantics.
--- a/docs/modules/policy/TASKS.md
+++ b/docs/modules/policy/TASKS.md
@@ -0,0 +1,9 @@
+# Task board — Policy Engine
+
+> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
+
+| ID | Status | Owner(s) | Description | Notes |
+|----|--------|----------|-------------|-------|
+| POLICY ENGINE-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
+| POLICY ENGINE-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
+| POLICY ENGINE-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |
--- a/docs/modules/policy/architecture.md
+++ b/docs/modules/policy/architecture.md
@@ -0,0 +1,245 @@
+# Policy Engine Architecture (v2)
+
+> Derived from Epic 2 – Policy Engine & Policy Editor and Epic 4 – Policy Studio.
+
+> **Ownership:** Policy Guild • Platform Guild  
+> **Services:** `StellaOps.Policy.Engine` (Minimal API + worker host)  
+> **Data Stores:** MongoDB (`policies`, `policy_runs`, `effective_finding_*`), Object storage (explain bundles), optional NATS/Mongo queue  
+> **Related docs:** [Policy overview](../../policy/overview.md), [DSL](../../policy/dsl.md), [Lifecycle](../../policy/lifecycle.md), [Runs](../../policy/runs.md), [REST API](../../api/policy.md), [Policy CLI](../cli/guides/policy.md), [Architecture overview](../platform/architecture-overview.md), [AOC reference](../../ingestion/aggregation-only-contract.md)
+
+This dossier describes the internal structure of the Policy Engine service delivered in Epic 2. It focuses on module boundaries, deterministic evaluation, orchestration, and integration contracts with Concelier, Excititor, SBOM Service, Authority, Scheduler, and Observability stacks.
+
+The service operates strictly downstream of the **Aggregation-Only Contract (AOC)**. It consumes immutable `advisory_raw` and `vex_raw` documents emitted by Concelier and Excititor, derives findings inside Policy-owned collections, and never mutates ingestion stores. Refer to the architecture overview and AOC reference for system-wide guardrails and provenance obligations.
+
+---
+
+## 1 · Responsibilities & Constraints
+
+- Compile and evaluate `stella-dsl@1` policy packs into deterministic verdicts.
+- Join SBOM inventory, Concelier advisories, and Excititor VEX evidence via canonical linksets and equivalence tables.
+- Materialise effective findings (`effective_finding_{policyId}`) with append-only history and produce explain traces.
+- Operate incrementally: react to change streams (advisory/VEX/SBOM deltas) within a ≤ 5 min SLA.
+- Provide simulations with diff summaries for UI/CLI workflows without modifying state.
+- Enforce a strict determinism guard (no wall-clock reads, RNG, or network access beyond allow-listed services), plus RBAC and tenancy via Authority scopes.
+- Support sealed/air-gapped deployments with offline bundles and sealed-mode hints.
+
+Non-goals: policy authoring UI (handled by Console), ingestion or advisory normalisation (Concelier), VEX consensus (Excititor), runtime enforcement (Zastava).
+
+---
+
+## 2 · High-Level Architecture
+
+```mermaid
+graph TD
+    subgraph Clients
+        CLI[stella CLI]
+        UI[Console Policy Editor]
+        CI[CI Pipelines]
+    end
+    subgraph PolicyEngine["StellaOps.Policy.Engine"]
+        API[Minimal API Host]
+        Orchestrator[Run Orchestrator]
+        WorkerPool[Evaluation Workers]
+        Compiler[DSL Compiler Cache]
+        Materializer[Effective Findings Writer]
+    end
+    subgraph RawStores["Raw Stores (AOC)"]
+        AdvisoryRaw[(MongoDB<br/>advisory_raw)]
+        VexRaw[(MongoDB<br/>vex_raw)]
+    end
+    subgraph Derived["Derived Stores"]
+        Mongo[(MongoDB<br/>policies / policy_runs / effective_finding_*)]
+        Blob[(Object Store / Evidence Locker)]
+        Queue[(Mongo Queue / NATS)]
+    end
+    Concelier[(Concelier APIs)]
+    Excititor[(Excititor APIs)]
+    SBOM[(SBOM Service)]
+    Authority[(Authority / DPoP Gateway)]
+
+    CLI --> API
+    UI --> API
+    CI --> API
+    API --> Compiler
+    API --> Orchestrator
+    Orchestrator --> Queue
+    Queue --> WorkerPool
+    Concelier --> AdvisoryRaw
+    Excititor --> VexRaw
+    WorkerPool --> AdvisoryRaw
+    WorkerPool --> VexRaw
+    WorkerPool --> SBOM
+    WorkerPool --> Materializer
+    Materializer --> Mongo
+    WorkerPool --> Blob
+    API --> Mongo
+    API --> Blob
+    API --> Authority
+    Orchestrator --> Mongo
+    Authority --> API
+```
+
+Key notes:
+
+- API host exposes lifecycle, run, simulate, and findings endpoints with DPoP-bound OAuth enforcement.
+- Orchestrator manages run scheduling/fairness; writes run tickets to queue, leases jobs to worker pool.
+- Workers evaluate policies using cached IR; join external services via tenant-scoped clients; pull immutable advisories/VEX from the raw stores; write derived overlays to Mongo and optional explain bundles to blob storage.
+- Observability (metrics/traces/logs) integrated via OpenTelemetry (not shown).
+
+---
+
+### 2.1 · AOC inputs & immutability
+
+- **Raw-only reads.** Evaluation workers access `advisory_raw` / `vex_raw` via tenant-scoped Mongo clients or the Concelier/Excititor raw APIs. No Policy Engine component is permitted to mutate these collections.
+- **Guarded ingestion.** `AOCWriteGuard` rejects forbidden fields before data reaches the raw stores. Policy tests replay known `ERR_AOC_00x` violations to confirm ingestion compliance.
+- **Change streams as contract.** Run orchestration stores resumable cursors for raw change streams. Replays of these cursors (e.g., after failover) must yield identical materialisation outcomes.
+- **Derived stores only.** All severity, consensus, and suppression state lives in `effective_finding_*` collections and explain bundles owned by Policy Engine. Provenance fields link back to raw document IDs so auditors can trace every verdict.
+- **Authority scopes.** Only the Policy Engine service identity holds `effective:write`. Ingestion identities retain `advisory:*`/`vex:*` scopes, ensuring separation of duties enforced by Authority and the API Gateway.
+
+---
+
+## 3 · Module Breakdown
+
+| Module | Responsibility | Notes |
+|--------|----------------|-------|
+| **Configuration** (`Configuration/`) | Bind settings (Mongo URIs, queue options, service URLs, sealed mode), validate on start. | Strict schema; fails fast on missing secrets. |
+| **Authority Client** (`Authority/`) | Acquire tokens, enforce scopes, perform DPoP key rotation. | Only service identity uses `effective:write`. |
+| **DSL Compiler** (`Dsl/`) | Parse and canonicalise policies, generate IR, cache compiled output by checksum. | Uses a Roslyn-like pipeline; caches by `policyId+version+hash`. |
+| **Selection Layer** (`Selection/`) | Batch SBOM ↔ advisory ↔ VEX joiners; apply equivalence tables; support incremental cursors. | Deterministic ordering (SBOM → advisory → VEX). |
+| **Evaluator** (`Evaluation/`) | Execute IR with first-match semantics, compute severity/trust/reachability weights, record rule hits. | Stateless; all inputs provided by selection layer. |
+| **Materialiser** (`Materialization/`) | Upsert effective findings, append history, manage explain bundle exports. | Mongo transactions per SBOM chunk. |
+| **Orchestrator** (`Runs/`) | Change-stream ingestion, fairness, retry/backoff, queue writer. | Works with Scheduler Models DTOs. |
+| **API** (`Api/`) | Minimal API endpoints, DTO validation, problem responses, idempotency. | Generated clients for CLI/UI. |
+| **Observability** (`Telemetry/`) | Metrics (`policy_run_seconds`, `policy_rules_fired_total`), traces, structured logs. | Sampled rule-hit logs with redaction. |
+| **Offline Adapter** (`Offline/`) | Bundle export/import (policies, simulations, runs), sealed-mode enforcement. | Uses DSSE signing via Signer service. |
+
+---
+
+## 4 · Data Model & Persistence
+
+### 4.1 Collections
+
+- `policies` – policy versions, metadata, lifecycle states, simulation artefact references.
+- `policy_runs` – run records, inputs (cursors, env), stats, determinism hash, run status.
+- `policy_run_events` – append-only log (queued, leased, completed, failed, canceled, replay).
+- `effective_finding_{policyId}` – current verdict snapshot per finding.
+- `effective_finding_{policyId}_history` – append-only history (previous verdicts, timestamps, runId).
+- `policy_reviews` – review comments/decisions.
+
+### 4.2 Schema Highlights
+
+- Run records include `changeDigests` (hash of advisory/VEX inputs) for replay verification.
+- Effective findings store provenance references (`advisory_raw_ids`, `vex_raw_ids`, `sbom_component_id`).
+- All collections include `tenant`, `policyId`, `version`, `createdAt`, `updatedAt`, `traceId` for audit.
+
+### 4.3 Indexing
+
+- Compound indexes: `{tenant, policyId, status}` on `policies`; `{tenant, policyId, status, startedAt}` on `policy_runs`; `{policyId, sbomId, findingKey}` on findings.
+- TTL indexes on transient explain bundle references (configurable).
+
+---
+
+## 5 · Evaluation Pipeline
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant Worker as EvaluationWorker
+    participant Compiler as CompilerCache
+    participant Selector as SelectionLayer
+    participant Eval as Evaluator
+    participant Mat as Materialiser
+    participant Expl as ExplainStore
+
+    Worker->>Compiler: Load IR (policyId, version, digest)
+    Compiler-->>Worker: CompiledPolicy (cached or compiled)
+    Worker->>Selector: Fetch tuple batches (sbom, advisory, vex)
+    Selector-->>Worker: Deterministic batches (1024 tuples)
+    loop For each batch
+        Worker->>Eval: Execute rules (batch, env)
+        Eval-->>Worker: Verdicts + rule hits
+        Worker->>Mat: Upsert effective findings
+        Mat-->>Worker: Success
+        Worker->>Expl: Persist sampled explain traces (optional)
+    end
+    Worker->>Mat: Append history + run stats
+    Worker-->>Worker: Compute determinism hash
+    Worker->>+Mat: Finalize transaction
+    Mat-->>-Worker: Ack
+```
+
+Determinism guard instrumentation wraps the evaluator, rejecting access to forbidden APIs and ensuring batch ordering remains stable.
+
+---
+
+## 6 · Run Orchestration & Incremental Flow
+
+- **Change streams:** Concelier and Excititor publish document changes to the scheduler queue (`policy.trigger.delta`). Payload includes `tenant`, `source`, `linkset digests`, `cursor`.
+- **Orchestrator:** Maintains per-tenant backlog; merges deltas until time/size thresholds met, then enqueues `PolicyRunRequest`.
+- **Queue:** Mongo queue with leases; each job is assigned a `leaseDuration` and `maxAttempts`.
+- **Workers:** Lease jobs, execute the evaluation pipeline, and report status (success/failure/canceled). Recoverable failures are requeued with backoff; determinism or schema violations mark the job `failed` and raise an incident event.
+- **Fairness:** Round-robin per `{tenant, policyId}`; emergency jobs (`priority=emergency`) jump the queue but are limited by a circuit breaker.
+- **Replay:** On demand, orchestrator rehydrates run via stored cursors and exports sealed bundle for audit/CI determinism checks.
+
+---
+
+## 7 · Security & Tenancy
+
+- **Auth:** All API calls pass through Authority gateway; DPoP tokens enforced for service-to-service (Policy Engine service principal). CLI/UI tokens include scope claims.
+- **Scopes:** Mutations require `policy:*` scopes corresponding to action; `effective:write` restricted to service identity.
+- **Tenancy:** All queries filter by `tenant`. Service identity uses `tenant-global` for shared policies; cross-tenant reads prohibited unless `policy:tenant-admin` scope present.
+- **Secrets:** Configuration loaded via environment variables or sealed secrets; runtime avoids writing secrets to logs.
+- **Determinism guard:** Static analyzer prevents referencing forbidden namespaces; runtime guard intercepts `DateTime.Now`, `Random`, `Guid`, HTTP clients beyond allow-list.
+- **Sealed mode:** Global flag disables outbound network except allow-listed internal hosts; watchers fail fast if unexpected egress is attempted.
+
+---
+
+## 8 · Observability
+
+- Metrics:
+  - `policy_run_seconds{mode,tenant,policy}` (histogram)
+  - `policy_run_queue_depth{tenant}`
+  - `policy_rules_fired_total{policy,rule}`
+  - `policy_vex_overrides_total{policy,vendor}`
+- Logs: Structured JSON with `traceId`, `policyId`, `version`, `runId`, `tenant`, `phase`. Guard ensures no sensitive data leakage.
+- Traces: Spans `policy.select`, `policy.evaluate`, `policy.materialize`, `policy.simulate`. Trace IDs surfaced to CLI/UI.
+- Incident mode toggles 100 % sampling and extended retention windows.
+
+---
+
+## 9 · Offline / Bundle Integration
+
+- **Imports:** Offline Kit delivers policy packs, advisory/VEX snapshots, SBOM updates. Policy Engine ingests bundles via `offline import`.
+- **Exports:** `stella policy bundle export` packages policy, IR digest, simulations, run metadata; UI provides export triggers.
+- **Sealed hints:** Explain traces annotate when cached values used (EPSS, KEV). Run records mark `env.sealed=true`.
+- **Sync cadence:** Operators perform a monthly bundle sync; Policy Engine warns when snapshots exceed the configured staleness threshold (default 14 days).
+
+---
+
+## 10 · Testing & Quality
+
+- **Unit tests:** DSL parsing, evaluator semantics, guard enforcement.
+- **Integration tests:** Joiners with sample SBOM/advisory/VEX data; materialisation with deterministic ordering; API contract tests generated from OpenAPI.
+- **Property tests:** Ensure rule evaluation is deterministic across input permutations.
+- **Golden tests:** Replay recorded runs, compare determinism hash.
+- **Performance tests:** Evaluate 100k component / 1M advisory dataset under warmed caches (<30 s full run).
+- **Chaos hooks:** Optional toggles to simulate upstream latency/failures; used in staging.
+
+---
+
+## 11 · Compliance Checklist
+
+- [ ] **Determinism guard enforced:** Static analyzer + runtime guard block wall-clock, RNG, unauthorized network calls.
+- [ ] **Incremental correctness:** Change-stream cursors stored and replayed during tests; unit/integration coverage for dedupe.
+- [ ] **RBAC validated:** Endpoint scope requirements match Authority configuration; integration tests cover deny/allow.
+- [ ] **AOC separation enforced:** No code path writes to `advisory_raw` / `vex_raw`; integration tests capture `ERR_AOC_00x` handling; read-only clients verified.
+- [ ] **Effective findings ownership:** Only Policy Engine identity holds `effective:write`; unauthorized callers receive `ERR_AOC_006`.
+- [ ] **Observability wired:** Metrics/traces/logs exported with correlation IDs; dashboards include `aoc_violation_total` and ingest latency panels.
+- [ ] **Offline parity:** Sealed-mode tests executed; bundle import/export flows documented and validated.
+- [ ] **Schema docs synced:** DTOs match Scheduler Models (`SCHED-MODELS-20-001`); JSON schemas committed.
+- [ ] **Security reviews complete:** Threat model (including queue poisoning, determinism bypass, data exfiltration) documented; mitigations in place.
+- [ ] **Disaster recovery rehearsed:** Run replay+rollback procedures tested and recorded.
+
+---
+
+*Last updated: 2025-10-26 (Sprint 19).* 
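Sections 5 and 10 of the architecture dossier above refer to a determinism hash that golden tests compare across replays. The dossier does not say how the hash is computed, so the following is only a sketch under stated assumptions: SHA-256 over canonical JSON of the verdicts, sorted by a hypothetical `findingKey` field. The function names are illustrative and not part of `StellaOps.Policy.Engine`.

```python
import hashlib
import json


def determinism_hash(verdicts: list[dict]) -> str:
    """Hash verdicts so the digest is independent of the order they were produced in."""
    # Assumption: each verdict carries a unique 'findingKey' usable as a sort key.
    ordered = sorted(verdicts, key=lambda v: v["findingKey"])
    payload = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replay_matches(recorded_hash: str, replayed_verdicts: list[dict]) -> bool:
    """Golden-test style check: a replayed run must reproduce the recorded digest."""
    return determinism_hash(replayed_verdicts) == recorded_hash
```

A replay that processes batches in a different order should still match, while any change to a verdict's content should produce a different digest and fail the check.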
--- a/docs/modules/policy/implementation_plan.md
+++ b/docs/modules/policy/implementation_plan.md
@@ -0,0 +1,67 @@
+# Implementation plan — Policy Engine
+
+## Delivery phases
+- **Phase 1 – Deterministic evaluation core**  
+  Finalise DSL compiler, runtime guardrails, evaluation workers, change-stream integration (advisories, VEX, SBOM), and append-only effective findings.
+- **Phase 2 – Orchestration & incremental runs**  
+  Implement run scheduler, incremental deltas, change-stream replay, simulation hooks, and determinism hashing.
+- **Phase 3 – Policy Studio workflows**  
+  Deliver policy registry, versioning, approvals, explain trace API, client editor integration, and signed promotion pipelines.
+- **Phase 4 – Simulation & approvals**  
+  Provide diff/simulation APIs, approval queues, change management, and integration with CLI/Console.
+- **Phase 5 – Exports & offline parity**  
+  Produce policy bundles, explain archives, Offline Kit assets, and deterministic manifests; integrate with Export Center.
+- **Phase 6 – Observability & hardening**  
+  Ship metrics, logs, traces, incident response runbooks, guardrail analyzers, and compliance attestations.
+
+## Work breakdown
+- **Evaluation engine**
+  - DSL compiler with caching, static analysis, and guard rails (no wall-clock/random/network outside allowlist).
+  - Batch evaluator with deterministic ordering, change-stream inputs, policy IR caching.
+  - Explain trace generation, evidence linking, storage in object store.
+- **Run orchestration**
+  - Scheduler for incremental runs, job leasing, fair-share per tenant/policy.
+  - Determinism hash + replay verification, time-travel snapshots, resume cursors.
+  - Simulation endpoints returning diff summaries, rationale breakdown, exit codes.
+- **Policy Studio**
+  - Policy registry (draft→review→approved), signed promotion pipeline, approvals workflow (multi-step).
+  - Console integration (editor, simulation, approvals, explain viewer) and CLI parity.
+- **Integrations**
+  - Inputs: Concelier, Excititor, SBOM Service, VEX Lens, runtime signals.
+  - Outputs: Findings ledger, Vuln Explorer, Notify (policy events), Export Center (policy bundles).
+  - Authority scopes, tenancy enforcement, RBAC for policy author/reviewer/operator.
+- **Observability & compliance**
+  - Metrics: run duration, evaluation verdict counts, simulation latency, guard violations.
+  - Logs/traces with trace ID propagation, policy version references, tenant scoping.
+  - Guard analyzers (static + runtime), unit/property tests, compliance reports.
+- **Docs & tooling**
+  - Update DSL guide, policy lifecycle/runbooks, simulation manual, CLI reference, Offline Kit instructions.
+  - Provide sample policies, fixtures, and analyzer rules.
+
+## Acceptance criteria
+- Evaluation engine deterministic across runs; effective findings materialised only by Policy Engine; guardrails prevent forbidden IO.
+- Incremental runs handle advisory/VEX/SBOM deltas within the ≤ 5 min SLA; determinism hash and replay verification succeed.
+- Policy Studio supports draft/review/approval, signed promotions, simulation diffing, and explain traces in UI/CLI.
+- Exports (policy bundles, explain archives) reproducible with signed manifests; Offline Kit packages deliver same tooling.
+- Observability dashboards show run metrics, guard violations, simulation usage; alerts trigger on determinism hash mismatch or backlog.
+- CLI/Console parity for policy management, simulation, approvals, and export workflows.
+
+## Risks & mitigations
+- **Non-determinism:** strict static analysis, runtime guard, determinism hash, replay tests.
+- **Policy drift vs reality:** simulation diff previews, approval workflow, history/audit trail.
+- **Scaling evaluations:** sharded workers, incremental deltas, caching, job queue fairness.
+- **Guard bypass:** analyzers integrated into CI, runtime guard rejects forbidden operations.
+- **Offline compliance:** deterministic exports, manifest verification, documentation for sealed-mode deployments.
+
+## Test strategy
+- **Unit:** DSL parsing, guard analyzer, evaluation pipeline, simulation diff calculations.
+- **Property:** randomised policy inputs verifying determinism and guard enforcement.
+- **Integration:** Concelier/Excititor/SBOM feeds → Policy Engine → findings ledger, simulation, approvals.
+- **Performance:** evaluation throughput, change-stream backlog recovery, simulation under load.
+- **Security/compliance:** RBAC/tenancy, analyzer enforcement, audit logging, signed promotions.
+- **Offline:** export/import of policy bundles, explain archives, CLI verification.
+
+## Definition of done
+- Policy Engine core, orchestration, Policy Studio workflows, exports, and observability delivered with runbooks and Offline Kit parity.
+- Documentation suite (overview, architecture, DSL, lifecycle, Studio, simulation, CLI) updated with imposed rule statements.
+- ./TASKS.md and ../../TASKS.md reflect status; analyzers integrated into CI; compliance evidence captured.
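The work breakdown above calls for fair-share scheduling per tenant/policy, which the architecture dossier describes as round-robin per `{tenant, policyId}`. A minimal sketch of that ordering follows. The job dictionary shape and the `fair_share_order` helper are assumptions made for illustration, and emergency priority and circuit breaking are deliberately left out.

```python
from collections import deque
from typing import Iterator


def fair_share_order(jobs: list[dict]) -> Iterator[dict]:
    """Yield jobs round-robin across (tenant, policyId) groups.

    Arrival order is preserved within each group, and groups are visited
    in the order they first appeared.
    """
    queues: dict[tuple[str, str], deque] = {}
    for job in jobs:
        queues.setdefault((job["tenant"], job["policyId"]), deque()).append(job)

    while queues:
        # Iterate over a snapshot of the keys so exhausted groups can be dropped safely.
        for key in list(queues):
            yield queues[key].popleft()
            if not queues[key]:
                del queues[key]
```

With three queued jobs for one tenant and one for another, the second tenant's job is served second rather than last, which is the behaviour the fairness requirement is after.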