docs consolidation work

This commit is contained in:
StellaOps Bot
2025-12-25 10:53:53 +02:00
parent b9f71fc7e9
commit deb82b4f03
117 changed files with 852 additions and 847 deletions

View File

@@ -17,7 +17,7 @@ Attestor moves signed evidence through the trust chain by accepting DSSE bundles
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -17,7 +17,7 @@ Authority is the platform OIDC/OAuth2 control plane that mints short-lived, send
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -16,7 +16,7 @@ CI module collects reproducible pipeline recipes for builds, tests, and release
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -16,7 +16,7 @@ The `stella` CLI is the operator-facing Swiss army knife for scans, exports, pol
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -3,7 +3,7 @@
> **Audience:** DevEx engineers, operators, and CI authors integrating the `stella` CLI with Aggregation-Only Contract (AOC) workflows.
> **Scope:** Command synopsis, options, exit codes, and offline considerations for `stella sources ingest --dry-run` and `stella aoc verify` as introduced in Sprint19.
Both commands are designed to enforce the AOC guardrails documented in the [aggregation-only reference](../../../ingestion/aggregation-only-contract.md) and the [architecture overview](../architecture.md). They consume Authority-issued tokens with tenant scopes and never mutate ingestion stores.
Both commands are designed to enforce the AOC guardrails documented in the [aggregation-only reference](../../../aoc/aggregation-only-contract.md) and the [architecture overview](../architecture.md). They consume Authority-issued tokens with tenant scopes and never mutate ingestion stores.
---
@@ -416,7 +416,7 @@ Additional notes:
## 5·Related references
- [Aggregation-Only Contract reference](../../../ingestion/aggregation-only-contract.md)
- [Aggregation-Only Contract reference](../../../aoc/aggregation-only-contract.md)
- [Architecture overview](../../platform/architecture-overview.md)
- [Console operator guide](../../../15_UI_GUIDE.md)
- [Authority scopes](../../authority/architecture.md)

View File

@@ -103,7 +103,7 @@ See the [CLI Consolidation Migration Guide](../../../../cli/cli-consolidation-mi
## Related Documentation
- [Aggregation-Only Contract Reference](../../../../ingestion/aggregation-only-contract.md)
- [Aggregation-Only Contract Reference](../../../../aoc/aggregation-only-contract.md)
- [CLI Reference](../cli-reference.md)
- [Container Deployment Guide](../../../../deploy/containers.md)

View File

@@ -16,7 +16,7 @@ Concelier ingests signed advisories from dozens of sources and converts them int
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -15,7 +15,7 @@ Concelier ingests signed advisories from dozens of sources and converts them int
- Exporter packages (`StellaOps.Concelier.Exporter.*`).
## Recent updates
- **2025-11-07:** Paragraph-anchored `/advisories/{advisoryKey}/chunks` endpoint shipped for Advisory AI paragraph retrieval. Details and rollout notes live in [`../../updates/2025-11-07-concelier-advisory-chunks.md`](../../updates/2025-11-07-concelier-advisory-chunks.md).
- **2025-11-07:** Paragraph-anchored `/advisories/{advisoryKey}/chunks` endpoint shipped for Advisory AI paragraph retrieval. Details and rollout notes live in [`../../implplan/archived/updates/2025-11-07-concelier-advisory-chunks.md`](../../implplan/archived/updates/2025-11-07-concelier-advisory-chunks.md).
## Integrations & dependencies
- PostgreSQL (schema `vuln`) for canonical observations and schedules.

View File

@@ -1,8 +1,8 @@
# Implementation plan — Concelier
## Delivery timeline
- **Phase 1 — Guardrails & schema**
Stand up Mongo JSON validators for `advisory_raw` and `vex_raw`, wire the `AOCWriteGuard` repository interceptor, and seed deterministic linkset builders. Freeze legacy normalisation paths and migrate callers to the new raw schema.
- **Phase 1 — Guardrails & schema**
Stand up PostgreSQL JSON schema validators for `advisory_raw` and `vex_raw`, wire the `AOCWriteGuard` repository interceptor, and seed deterministic linkset builders. Freeze legacy normalisation paths and migrate callers to the new raw schema.
- **Phase 2 — API & observability**
Publish ingestion and verification endpoints (`POST /ingest/*`, `GET /advisories.raw`, `POST /aoc/verify`) with Authority scopes, expose telemetry (`aoc_violation_total`, guard spans, structured logs), and ensure Offline Kit packaging captures validator deployment steps.
- **Phase 3 — Experience polish**
@@ -10,7 +10,7 @@
## Work breakdown by component
- **Concelier WebService & worker**
- Add Mongo validators and unique indexes over `(tenant, source.vendor, upstream.upstream_id, upstream.content_hash)`.
- Add PostgreSQL validators and unique indexes over `(tenant, source.vendor, upstream.upstream_id, upstream.content_hash)`.
- Implement write interceptors rejecting forbidden fields, missing provenance, or merge attempts.
- Deterministically compute linksets and persist canonical JSON payloads.
- Introduce `/ingest/advisory`, `/advisories/raw*`, and `/aoc/verify` surfaces guarded by `advisory:*` and `aoc:verify` scopes.
@@ -34,13 +34,13 @@
- Seed fixtures and run `stella aoc verify` against snapshots in pipeline gating.
## Documentation deliverables
- Update `docs/ingestion/aggregation-only-contract.md` with guard invariants, schemas, error codes, and migration guidance.
- Update `docs/aoc/aggregation-only-contract.md` with guard invariants, schemas, error codes, and migration guidance.
- Refresh `docs/modules/concelier/operations/*.md` (mirror, conflict-resolution, authority audit) with validator rollouts and observability dashboards.
- Cross-link Authority scope definitions, CLI reference, Console sources guide, and observability runbooks to the AOC guard changes.
- Ensure Offline Kit documentation captures validator bootstrap and verify workflows.
## Acceptance criteria
- Mongo validators and runtime guards reject forbidden fields and missing provenance with the documented `ERR_AOC_00x` codes.
- PostgreSQL validators and runtime guards reject forbidden fields and missing provenance with the documented `ERR_AOC_00x` codes.
- Linksets and supersedes chains are deterministic; rerunning ingestion over identical payloads yields byte-identical documents.
- CLI `stella aoc verify` exits non-zero on seeded violations and zero on clean datasets; Console dashboards show real-time guard status.
- Export Center consumes advisory datasets without relying on legacy normalised fields.
@@ -56,18 +56,18 @@
- **Unit**: guard rejection paths, provenance enforcement, idempotent insertions, linkset determinism.
- **Property**: fuzz upstream payloads to guarantee no forbidden fields emerge.
- **Integration**: batch ingest (50k advisories, mixed VEX fixtures), verifying zero guard violations and consistent supersedes.
- **Contract**: Policy Engine consumers verify raw-only reads; Export Center consumes canonical datasets.
- **End-to-end**: ingest/verify flow with CLI + Console actions to confirm observability and guard reporting.
## Definition of done
- Validators deployed and verified in staging/offline environments.
- Runtime guards, CLI/Console workflows, and CI linting all active.
- Observability dashboards and runbooks updated; metrics visible.
- Documentation updates merged; Offline Kit instructions published.
- ./TASKS.md reflects status transitions; cross-module dependencies acknowledged in ../../TASKS.md.
## Readiness checkpoints (2025-11-25)
- Sprint 110 attestation chain validated: `/internal/attestations/verify` endpoint and evidence bundle tests green (`TestResults/concelier-attestation/web.trx`, `core.trx`).
- Link-Not-Merge cache + console consumption docs frozen (see `operations/lnm-cache-plan.md`, `operations/console-lnm-consumption.md`); cache headers remain deterministic.
- Observation events transport reviewed; backlog guardrails and NATS/air-gap guidance updated in `operations/observation-events.md`.
- Next gating dependency: TaskRunner contract drop (sprint 0157 blockers) before wiring approvals/pack ingest flows into Concelier.
- **Contract**: Policy Engine consumers verify raw-only reads; Export Center consumes canonical datasets.
- **End-to-end**: ingest/verify flow with CLI + Console actions to confirm observability and guard reporting.
## Definition of done
- Validators deployed and verified in staging/offline environments.
- Runtime guards, CLI/Console workflows, and CI linting all active.
- Observability dashboards and runbooks updated; metrics visible.
- Documentation updates merged; Offline Kit instructions published.
- ./TASKS.md reflects status transitions; cross-module dependencies acknowledged in ../../TASKS.md.
## Readiness checkpoints (2025-11-25)
- Sprint 110 attestation chain validated: `/internal/attestations/verify` endpoint and evidence bundle tests green (`TestResults/concelier-attestation/web.trx`, `core.trx`).
- Link-Not-Merge cache + console consumption docs frozen (see `operations/lnm-cache-plan.md`, `operations/console-lnm-consumption.md`); cache headers remain deterministic.
- Observation events transport reviewed; backlog guardrails and NATS/air-gap guidance updated in `operations/observation-events.md`.
- Next gating dependency: TaskRunner contract drop (sprint 0157 blockers) before wiring approvals/pack ingest flows into Concelier.

View File

@@ -11,7 +11,7 @@ _Frozen v1 (add-only) — approved 2025-11-17 for CONCELIER-LNM-21-001/002/101._
- Frozen v1 as of 2025-11-17; further schema changes must go through ADR + sprint gating (CONCELIER-LNM-22x+).
- Canonical JSON Schemas + signed manifest live in `docs/modules/concelier/schemas/` (advisory observation, linkset, offline bundle). Verify with `openssl dgst -sha256 -verify schema-signing-pub.pem -signature schema.manifest.sig schema.manifest.json`.
## Observation document (Mongo JSON Schema excerpt)
## Observation document (PostgreSQL JSON Schema excerpt)
```json
{
"bsonType": "object",
@@ -152,11 +152,11 @@ When an advisory source publishes a revised version of an advisory:
- Deterministic sort: observations sorted by `source, advisoryId, fetchedAt` before hashing.
- Conflicts are additive only and now carry optional `sourceIds[]` to trace which upstream sources produced divergent values.
## Indexes (Mongo)
## Indexes (PostgreSQL)
- Observations: `{ tenantId:1, source:1, advisoryId:1, provenance.fetchedAt:-1 }` (compound for ingest); `{ provenance.sourceArtifactSha:1 }` unique to avoid dup writes.
- Linksets: `{ tenantId:1, advisoryId:1, source:1 }` unique; `{ observations:1 }` sparse for reverse lookups.
## Collections
## Tables
- `advisory_observations` — raw per-source docs (immutable).
- `advisory_linksets` — derived normalized aggregates with observation pointers and hashes.
@@ -170,7 +170,7 @@ See `docs/samples/lnm/observation-ghsa.json` and `docs/samples/lnm/linkset-ghsa.
## Approval path
1) Architecture + Concelier Core review this document.
2) If accepted, freeze JSON Schema and roll into `src/Concelier/__Libraries/StellaOps.Concelier.Storage.Mongo` migrations.
2) If accepted, freeze JSON Schema and roll into `src/Concelier/__Libraries/StellaOps.Concelier.Storage.Postgres` migrations.
3) Update consumers (policy/CLI/export) to read from linksets only; deprecate Merge endpoints.
---

View File

@@ -1,14 +1,14 @@
# DevOps agent guide
## Mission
The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments.
## Advisory Handling
- Any new/updated advisory triggers immediate doc + sprint updates (no approval).
- Update high-level + detailed docs; inline only short snippets; put runnable/long code in `docs/benchmarks/**` or `tests/**` (deterministic/offline) and link.
- Add tasks + Execution Log entries in relevant `SPRINT_*.md` with doc paths/owners; add risks if schema/feed/transparency caps apply.
- Check archived advisories; mark supersedes/extends if overlapping.
- Defaults: hybrid reachability, deterministic/frozen feeds; act first, report after.
## Mission
The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments.
## Advisory Handling
- Any new/updated advisory triggers immediate doc + sprint updates (no approval).
- Update high-level + detailed docs; inline only short snippets; put runnable/long code in `docs/benchmarks/**` or `tests/**` (deterministic/offline) and link.
- Add tasks + Execution Log entries in relevant `SPRINT_*.md` with doc paths/owners; add risks if schema/feed/transparency caps apply.
- Check archived advisories; mark supersedes/extends if overlapping.
- Defaults: hybrid reachability, deterministic/frozen feeds; act first, report after.
## Key docs
- [Module README](./README.md)
@@ -24,7 +24,7 @@ The DevOps module captures release, deployment, and migration playbooks that kee
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -0,0 +1,24 @@
# DevOps Governance Rules Anchor (Sprint33)
> **Scope** · Exit deliverable for `DEVOPS-RULES-33-001`
> **Audience** · DevOps Guild, Platform leads, service owners
> **Related** · `ops/devops/TASKS.md`, `docs/backlog/2025-10-cleanup.md`, `docs/modules/platform/architecture-overview.md`
This note consolidates the platform governance rules ratified on 30October2025.
Each rule captures intent, affected surfaces, enforcement actions, and references to the
source-of-truth backlogs so that subsequent sprints do not reintroduce conflicting work.
| Rule | Intent & Rationale | Enforcement & Ownership | Follow-ups |
|------|--------------------|-------------------------|------------|
| **Gateway is a proxy only; Policy Engine owns overlays/simulations.** | Keep Gateway thin and deterministic: it authenticates, authorises, and forwards requests. All overlay composition, simulation, and policy evaluation stays inside Policy Engine so we avoid duplicated logic and time-of-check drift. | *Owners:* BEBase Platform Guild + Policy Engine Guild. <br/>*Enforcement:* Gateway PR reviews block embedded overlay code, new endpoints require `Policy Engine` contracts, CI parity checks compare Gateway ↔ Policy overlay schemas. | - Update open tasks referencing “gateway overlay” work to point at `POLICY-ENGINE-20-00x`.<br/>- Close or rewrite backlog items `WEB-POLICY-20-00x` that attempted to compute overlays in Gateway. |
| **AOC ingestion is canonical-only; no merges at ingest.** | Concelier/Excititor persist upstream truth plus provenance. Derived severity, merges, or dedupe belong to downstream Policy workflows. This keeps ingestion auditable and replayable. | *Owners:* Concelier & Excititor guilds, DevOps Guild for CI pipelines. <br/>*Enforcement:* `StellaOps.Aoc` guard library, Mongo validators, Roslyn analyzer backlog (`WEB-AOC-19-003`), CI job `stella aoc verify`. | - Ensure ingestion tasks reference the guard library (`StellaOps.Aoc`).<br/>- Retire legacy tasks that still mention merge-at-ingest (see backlog cleanup note). |
| **Single graph platform: Graph Indexer + Graph API (Cartographer retired).** | Replace the historical Cartographer service with the Graph Indexer + Graph API pairing so graph storage, overlays, and explorer flows share one platform. | *Owners:* Graph Platform Guild, Scheduler Guild, DevOps Guild. <br/>*Enforcement:* New graph work lands in `docs/modules/graph/**` and `src/Graph/**`. Gateway/UI/CLI tickets reference the Graph API endpoints only. | - Archive Cartographer handshake docs and mark Cartographer backlog items as historical.<br/>- Update Scheduler/SBOM/Console tickets to depend on `GRAPH-*` IDs instead of `CARTO-*`. |
## Tracking & documentation
- ✅ Rules recorded in correspoding sprint file `/docs/implplan/SPRINT_*.md` (Sprint33) and `/docs/ops/devops/TASKS.md`.
- ✅ Repository-wide references to “Cartographer as active platform” updated (see backlog note amendment and doc banner).
- ✅ Changelog entry (`docs/updates/2025-10-30-devops-governance.md`) captures reviewer acknowledgement.
Future adjustments to these rules must update this file and reference `DEVOPS-RULES-33-001`
when proposing changes so the DevOps Guild can track history.

View File

@@ -2,6 +2,8 @@
_Last updated: 2025-10-11_
> **Note (2025-12):** This runbook is obsolete. MongoDB was fully removed in Sprint 4400 and replaced with PostgreSQL. The migration functionality described here was executed during the transition period and is no longer applicable. Retained for historical reference only.
## Overview
The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures

View File

@@ -0,0 +1,27 @@
# Policy Schema Export Automation
This utility generates JSON Schema documents for the Policy Engine run contracts.
## Command
```
scripts/export-policy-schemas.sh [output-directory]
```
When no output directory is supplied, schemas are written to `docs/schemas/`.
The exporter builds against `StellaOps.Scheduler.Models` and emits:
- `policy-run-request.schema.json`
- `policy-run-status.schema.json`
- `policy-diff-summary.schema.json`
- `policy-explain-trace.schema.json`
The build pipeline (`.gitea/workflows/build-test-deploy.yml`, job **Export policy run schemas**) runs this script on every push and pull request. Exports land under `artifacts/policy-schemas/<commit>/`, are published as the `policy-schema-exports` artifact, and changes trigger a Slack post to `#policy-engine` via the `POLICY_ENGINE_SCHEMA_WEBHOOK` secret. A unified diff is stored alongside the exports for downstream consumers.
## CI integration checklist
- [x] Invoke the script in the DevOps pipeline (see `DEVOPS-POLICY-20-004`).
- [x] Publish the generated schemas as pipeline artifacts.
- [x] Notify downstream consumers when schemas change (Slack `#policy-engine`, changelog snippet).
- [ ] Gate CLI validation once schema artifacts are available.

View File

@@ -14,7 +14,7 @@ This runbook describes how to promote a new release across the supported deploym
| `stable` | `deploy/releases/2025.09-stable.yaml` | `deploy/helm/stellaops/values-stage.yaml`, `deploy/helm/stellaops/values-prod.yaml` | `deploy/compose/docker-compose.stage.yaml`, `deploy/compose/docker-compose.prod.yaml` |
| `airgap` | `deploy/releases/2025.09-airgap.yaml` | `deploy/helm/stellaops/values-airgap.yaml` | `deploy/compose/docker-compose.airgap.yaml` |
Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set.
Infrastructure components (PostgreSQL, Valkey, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set.
---
@@ -48,8 +48,8 @@ Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release man
```
Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle.
5. **Review compatibility matrix**
Confirm MongoDB, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `mongo@sha256:c258`, `minio@sha256:14ce`, `rustfs:2025.10.0-edge`.
5. **Review compatibility matrix**
Confirm PostgreSQL, Valkey, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `postgres:16-alpine`, `valkey:8.0`, `minio@sha256:14ce`, `rustfs:2025.10.0-edge`.
6. **Create a rollback bookmark**
Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes.

View File

@@ -1,8 +1,10 @@
# Launch Cutover Runbook - Stella Ops
_Document owner: DevOps Guild (2025-10-26)_
_Document owner: DevOps Guild (2025-10-26)_
_Scope:_ Full-platform launch from staging to production for release `2025.09.2`.
> **Note (2025-12):** This document reflects the state at initial launch. Since then, MongoDB has been fully removed (Sprint 4400) and replaced with PostgreSQL. MinIO references now use RustFS. Redis references now use Valkey. See current deployment docs in `deploy/` for up-to-date configuration.
## 1. Roles and Communication
| Role | Primary | Backup | Contact |

View File

@@ -16,7 +16,7 @@ Excititor converts heterogeneous VEX feeds into raw observations and linksets th
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -6,17 +6,17 @@ Excititor converts heterogeneous VEX feeds into raw observations and linksets th
- Chunk API documentation remains blocked until CI is green and a pinned OpenAPI spec + deterministic samples are available.
- Sprint tracker `docs/implplan/SPRINT_0333_0001_0001_docs_modules_excititor.md` and module `TASKS.md` mirror status.
- Observability/runbook assets remain in `operations/observability.md` and `observability/` (timeline, locker manifests); dashboards stay offline-import friendly.
- Prior updates (2025-11-05): Link-Not-Merge readiness and consensus beta note (`../../updates/2025-11-05-excitor-consensus-beta.md`), observability guide additions, DSSE packaging guidance, and Policy/CLI follow-ups tracked in SPRINT_200.
- Link-Not-Merge readiness: release note [Excitor consensus beta](../../updates/2025-11-05-excitor-consensus-beta.md) captures how Excititor feeds power the Excititor consensus beta (sample payload in [consensus JSON](../../vex/consensus-json.md)).
- Prior updates (2025-11-05): Link-Not-Merge readiness and consensus beta note (`../../implplan/archived/updates/2025-11-05-excitor-consensus-beta.md`), observability guide additions, DSSE packaging guidance, and Policy/CLI follow-ups tracked in SPRINT_200.
- Link-Not-Merge readiness: release note [Excitor consensus beta](../../implplan/archived/updates/2025-11-05-excitor-consensus-beta.md) captures how Excititor feeds power the Excititor consensus beta (sample payload in [consensus JSON](../../vex/consensus-json.md)).
- Added [observability guide](operations/observability.md) describing the evidence metrics emitted by `EXCITITOR-AIAI-31-003` (request counters, statement histogram, signature status, guard violations) so Ops/Lens can alert on misuse.
- README now points policy/UI teams to the upcoming consensus integration work.
- DSSE packaging for consensus bundles and Export Center hooks are documented in the [beta release note](../../updates/2025-11-05-excitor-consensus-beta.md); operators mirroring Excititor exports must verify detached JWS artefacts (`bundle.json.jws`) alongside each bundle.
- DSSE packaging for consensus bundles and Export Center hooks are documented in the [beta release note](../../implplan/archived/updates/2025-11-05-excitor-consensus-beta.md); operators mirroring Excititor exports must verify detached JWS artefacts (`bundle.json.jws`) alongside each bundle.
- Follow-ups called out in the release note (Policy weighting knobs `POLICY-ENGINE-30-101`, CLI verb `CLI-VEX-30-002`) remain in-flight and are tracked in `/docs/implplan/SPRINT_200_documentation_process.md`.
## Release references
- Consensus beta payload reference: [docs/vex/consensus-json.md](../../vex/consensus-json.md)
- Export Center offline packaging: [docs/modules/export-center/devportal-offline.md](../export-center/devportal-offline.md)
- Historical release log: [docs/updates/](../../updates/)
- Historical release log: [docs/implplan/archived/updates/](../../implplan/archived/updates/)
## Responsibilities
- Fetch OpenVEX/CSAF/CycloneDX statements via restart-only connectors.

View File

@@ -2,6 +2,7 @@
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://stellaops.dev/schemas/excititor/vex_raw.schema.json",
"title": "Excititor VEX Raw Document",
"$comment": "Note (2025-12): The gridFsObjectId field is legacy. Since Sprint 4400, all large content is stored in PostgreSQL with RustFS. This field exists only for backward compatibility with migrated data.",
"type": "object",
"additionalProperties": true,
"required": ["_id", "providerId", "format", "sourceUri", "retrievedAt", "digest"],

View File

@@ -1,6 +1,6 @@
# VEX Observation Model (`vex_observations`)
> Authored 2025-11-14 for Sprint 120 (`EXCITITOR-LNM-21-001`). This document is the canonical schema description for Excititors immutable observation records. It unblocks downstream documentation tasks (`DOCS-LNM-22-002`) and aligns the WebService/Worker data structures with Mongo persistence.
> Authored 2025-11-14 for Sprint 120 (`EXCITITOR-LNM-21-001`). This document is the canonical schema description for Excititor's immutable observation records. It unblocks downstream documentation tasks (`DOCS-LNM-22-002`) and aligns the WebService/Worker data structures with PostgreSQL persistence.
Excititor ingests heterogeneous VEX statements, normalizes them under the Aggregation-Only Contract (AOC), and persists each normalized statement as a **VEX observation**. These observations are the source of truth for:
@@ -15,7 +15,7 @@ All observation documents are immutable. New information creates a new observati
| Aspect | Value |
| --- | --- |
| Collection | `vex_observations` (Mongo) |
| Table | `vex_observations` (PostgreSQL) |
| Upstream generator | `VexObservationProjectionService` (WebService) and Worker normalization pipeline |
| Primary key | `{tenant, observationId}` |
| Required indexes | `{tenant, vulnerabilityId}`, `{tenant, productKey}`, `{tenant, document.digest}`, `{tenant, providerId, status}` |
@@ -114,7 +114,7 @@ All observation documents are immutable. New information creates a new observati
2. **Sorted collections** arrays (`anchors`, `purls`, `cpes`) are sorted lexicographically before persistence.
3. **Guard metadata** `aoc.guardVersion` records the guard library version (`docs/aoc/guard-library.md`), enabling audits.
4. **Signatures** only verification metadata proven by the Worker is stored; WebService never recomputes trust.
5. **Time normalization** all timestamps stored as UTC ISO-8601 strings (Mongo `DateTime`).
5. **Time normalization** all timestamps stored as UTC ISO-8601 strings (PostgreSQL `timestamptz`).
## API mapping

View File

@@ -17,7 +17,7 @@ Export Center packages reproducible evidence bundles (JSON, Trivy DB, mirror) wi
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -16,8 +16,8 @@ This reference describes the Export Center API introduced in Export Center Phase
- `export:download` for bundle downloads and manifests.
- **Tenant context:** Provide `X-Stella-Tenant` when the token carries multiple tenants; defaults to token tenant otherwise.
- **Idempotency:** Mutating endpoints accept `Idempotency-Key` (UUID). Retrying with the same key returns the original result.
- **Rate limits and quotas:** Responses include `X-Stella-Quota-Limit`, `X-Stella-Quota-Remaining`, and `X-Stella-Quota-Reset`. Exceeding quotas returns `429 Too Many Requests` with `ERR_EXPORT_QUOTA`.
- **Integrity headers (downloads):** `Digest: sha-256=<base64>`, `X-Stella-Signature: dsse-b64=<payload>`, and `X-Stella-Immutability: true` accompany bundle/manifest downloads; clients must validate before use.
- **Rate limits and quotas:** Responses include `X-Stella-Quota-Limit`, `X-Stella-Quota-Remaining`, and `X-Stella-Quota-Reset`. Exceeding quotas returns `429 Too Many Requests` with `ERR_EXPORT_QUOTA`.
- **Integrity headers (downloads):** `Digest: sha-256=<base64>`, `X-Stella-Signature: dsse-b64=<payload>`, and `X-Stella-Immutability: true` accompany bundle/manifest downloads; clients must validate before use.
- **Content negotiation:** Requests and responses use `application/json; charset=utf-8` unless otherwise stated. Downloads stream binary content with profile-specific media types.
- **SSE:** Event streams set `Content-Type: text/event-stream` and keep connections alive with comment heartbeats every 15 seconds.
@@ -101,29 +101,29 @@ Scopes: export:profile:manage
**Request**
```json
{
"profileId": "prof-airgap-mirror",
"name": "Airgap Mirror Weekly",
"kind": "mirror",
"variant": "full",
"include": ["advisories", "vex", "sboms", "policy"],
"distribution": ["http", "object"],
"encryption": {
"enabled": true,
"recipientKeys": ["age1tenantkey..."],
"strict": false
},
"retention": {"mode": "days", "value": 30},
"limits": {
"maxActiveRuns": 4,
"maxQueuedRuns": 50,
"backpressureMode": "reject"
},
"approval": {
"required": false
}
}
```
{
"profileId": "prof-airgap-mirror",
"name": "Airgap Mirror Weekly",
"kind": "mirror",
"variant": "full",
"include": ["advisories", "vex", "sboms", "policy"],
"distribution": ["http", "object"],
"encryption": {
"enabled": true,
"recipientKeys": ["age1tenantkey..."],
"strict": false
},
"retention": {"mode": "days", "value": 30},
"limits": {
"maxActiveRuns": 4,
"maxQueuedRuns": 50,
"backpressureMode": "reject"
},
"approval": {
"required": false
}
}
```
**Response 201**
@@ -192,24 +192,24 @@ Scopes: export:run
{
"runId": "run-20251029-01",
"status": "pending",
"profileId": "prof-json-raw",
"createdAt": "2025-10-29T12:12:11Z",
"createdBy": "user:ops",
"selectors": { "...": "..." },
"links": {
"self": "/api/export/runs/run-20251029-01",
"events": "/api/export/runs/run-20251029-01/events"
},
"quotas": {
"maxActiveRuns": 4,
"maxQueuedRuns": 50,
"backpressureMode": "reject"
},
"approval": {
"required": false
}
}
```
"profileId": "prof-json-raw",
"createdAt": "2025-10-29T12:12:11Z",
"createdBy": "user:ops",
"selectors": { "...": "..." },
"links": {
"self": "/api/export/runs/run-20251029-01",
"events": "/api/export/runs/run-20251029-01/events"
},
"quotas": {
"maxActiveRuns": 4,
"maxQueuedRuns": 50,
"backpressureMode": "reject"
},
"approval": {
"required": false
}
}
```
### 4.2 List runs
@@ -231,15 +231,15 @@ Response fields:
| Field | Description |
|-------|-------------|
| `status` | `pending`, `running`, `success`, `failed`, `canceled`. |
| `progress` | Object with `adapters`, `bytesWritten`, `recordsProcessed`. |
| `errorCode` | Populated when `status=failed` (`signing`, `distribution`, etc). |
| `policySnapshotId` | Returned for policy-aware profiles. |
| `distributions` | List of available distribution descriptors (type, location, sha256, expiresAt). |
| `rerunHash` | SHA-256 over sorted `contents[*].digest`; used for determinism checks. |
| `integrity` | Expected HTTP headers (`Digest`, `X-Stella-Signature`, `X-Stella-Immutability`) and OCI annotations (`io.stellaops.export.*`). |
| `quotas` | Active limits/backpressure settings returned with the run. |
| `approval` | Cross-tenant approval ticket when selectors span multiple tenants/wildcards. |
| `status` | `pending`, `running`, `success`, `failed`, `canceled`. |
| `progress` | Object with `adapters`, `bytesWritten`, `recordsProcessed`. |
| `errorCode` | Populated when `status=failed` (`signing`, `distribution`, etc). |
| `policySnapshotId` | Returned for policy-aware profiles. |
| `distributions` | List of available distribution descriptors (type, location, sha256, expiresAt). |
| `rerunHash` | SHA-256 over sorted `contents[*].digest`; used for determinism checks. |
| `integrity` | Expected HTTP headers (`Digest`, `X-Stella-Signature`, `X-Stella-Immutability`) and OCI annotations (`io.stellaops.export.*`). |
| `quotas` | Active limits/backpressure settings returned with the run. |
| `approval` | Cross-tenant approval ticket when selectors span multiple tenants/wildcards. |
### 4.4 Cancel a run
@@ -294,16 +294,16 @@ GET /api/export/runs/{runId}/download
Scopes: export:download
```
Streams the primary bundle (tarball, zip, or profile-specific layout). Headers:
- `Content-Disposition: attachment; filename="export-run-20251029-01.tar.zst"`
- `Digest: sha-256=<base64>` (EC5)
- `X-Stella-Signature: dsse-b64:<payload>` (EC3/EC5)
- `X-Stella-Immutability: true`
- `X-Export-Size: 73482019`
- `X-Export-Encryption: age` (when mirror encryption enabled)
Supports HTTP range requests for resume functionality. If no bundle exists yet, responds `409` with `ERR_EXPORT_007`.
Streams the primary bundle (tarball, zip, or profile-specific layout). Headers:
- `Content-Disposition: attachment; filename="export-run-20251029-01.tar.zst"`
- `Digest: sha-256=<base64>` (EC5)
- `X-Stella-Signature: dsse-b64:<payload>` (EC3/EC5)
- `X-Stella-Immutability: true`
- `X-Export-Size: 73482019`
- `X-Export-Encryption: age` (when mirror encryption enabled)
Supports HTTP range requests for resume functionality. If no bundle exists yet, responds `409` with `ERR_EXPORT_007`.
### 6.2 Manifest download
@@ -312,8 +312,8 @@ GET /api/export/runs/{runId}/manifest
Scopes: export:download
```
Returns signed `export.json`. To fetch the detached signature, append `?signature=true`.
- Integrity annotations are mirrored in response headers (`Digest`, `X-Stella-Signature`, `X-Stella-Immutability`) and in the manifest `integrity` block to keep rerun-hash deterministic.
Returns signed `export.json`. To fetch the detached signature, append `?signature=true`.
- Integrity annotations are mirrored in response headers (`Digest`, `X-Stella-Signature`, `X-Stella-Immutability`) and in the manifest `integrity` block to keep rerun-hash deterministic.
### 6.3 Provenance download
@@ -322,7 +322,7 @@ GET /api/export/runs/{runId}/provenance
Scopes: export:download
```
Returns signed `provenance.json`. Supports `?signature=true`. Provenance includes attestation subject digests, policy snapshot ids, adapter versions, and KMS key identifiers.
Returns signed `provenance.json`. Supports `?signature=true`. Provenance includes attestation subject digests, policy snapshot ids, adapter versions, and KMS key identifiers.
### 6.4 Distribution descriptors
@@ -356,6 +356,6 @@ Payload includes `targetUrl`, `events` (e.g., `run.succeeded`), and optional sec
- [Export Center Architecture](architecture.md)
- [Export Center Profiles](profiles.md)
- [Export Center CLI Guide](cli.md) *(companion document)*
- [Aggregation-Only Contract reference](../../ingestion/aggregation-only-contract.md)
- [Aggregation-Only Contract reference](../../aoc/aggregation-only-contract.md)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -144,20 +144,20 @@ stella export provenance run-20251029-01 --output manifests/provenance.json
Retrieves the signed provenance file. `--signature` behaves like the manifest command.
### 4.4 `stella export verify`
```
stella export verify run-20251029-01 \
--manifest manifests/export.json \
--provenance manifests/provenance.json \
--key keys/acme-export.pub
```
Wrapper around `cosign verify`. Returns exit `0` when signatures and digests validate. Exit `20` when verification fails.
Integrity and determinism checks (EC1EC10):
- `stella export manifest` and `provenance` commands emit `Digest`/`X-Stella-Signature` headers; cache them for rerun-hash validation.
- Offline kits: run `docs/modules/export-center/operations/verify-export-kit.sh <kit_dir>` to assert rerunHash, integrity headers vs OCI annotations, quotas/backpressure block, approvals, and log metadata in provenance.
### 4.4 `stella export verify`
```
stella export verify run-20251029-01 \
--manifest manifests/export.json \
--provenance manifests/provenance.json \
--key keys/acme-export.pub
```
Wrapper around `cosign verify`. Returns exit `0` when signatures and digests validate. Exit `20` when verification fails.
Integrity and determinism checks (EC1EC10):
- `stella export manifest` and `provenance` commands emit `Digest`/`X-Stella-Signature` headers; cache them for rerun-hash validation.
- Offline kits: run `docs/modules/export-center/operations/verify-export-kit.sh <kit_dir>` to assert rerunHash, integrity headers vs OCI annotations, quotas/backpressure block, approvals, and log metadata in provenance.
## 5. CI recipe (GitHub Actions example)
@@ -230,6 +230,6 @@ Exit codes above 100 are reserved for future profile-specific tooling.
- [Export Center Profiles](profiles.md)
- [Export Center API reference](api.md)
- [Export Center Architecture](architecture.md)
- [Aggregation-Only Contract reference](../../ingestion/aggregation-only-contract.md)
- [Aggregation-Only Contract reference](../../aoc/aggregation-only-contract.md)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -92,11 +92,11 @@ delta/
manifest.diff.json # summary of counts, hashes, base export metadata
```
- **Base lookup:** The worker verifies that the base export is reachable (download path or OCI reference). If missing, the run fails with `ERR_EXPORT_BASE_MISSING`.
- **Change detection:** Uses deterministic hashing of normalized records to compute additions/updates. Indexes are regenerated only for affected subjects.
- **Application order:** Consumers apply deltas sequentially. A `resetBaseline=true` flag instructs them to drop cached state and apply the bundle as a full refresh.
- **Tombstones required:** Every removal must emit a tombstone entry plus the `removed` list; deltas without tombstones fail verification (`verify-export-kit.sh`).
- **Integrity headers:** Each delta bundle exports `Digest`, `X-Stella-Signature`, and `X-Stella-Immutability` derived from the OCI annotation `io.stellaops.export.manifest-digest`. Consumers must validate before applying.
- **Base lookup:** The worker verifies that the base export is reachable (download path or OCI reference). If missing, the run fails with `ERR_EXPORT_BASE_MISSING`.
- **Change detection:** Uses deterministic hashing of normalized records to compute additions/updates. Indexes are regenerated only for affected subjects.
- **Application order:** Consumers apply deltas sequentially. A `resetBaseline=true` flag instructs them to drop cached state and apply the bundle as a full refresh.
- **Tombstones required:** Every removal must emit a tombstone entry plus the `removed` list; deltas without tombstones fail verification (`verify-export-kit.sh`).
- **Integrity headers:** Each delta bundle exports `Digest`, `X-Stella-Signature`, and `X-Stella-Immutability` derived from the OCI annotation `io.stellaops.export.manifest-digest`. Consumers must validate before applying.
Example `manifest.diff.json` (delta):
@@ -182,40 +182,40 @@ sequenceDiagram
3. Re-run integrity checks (`mirror verify <path>`).
- **Audit logging:** Export Center logs `mirror.bundle.created`, `mirror.delta.applied`, and `mirror.encryption.enabled` events. Consume them in the central observability pipeline.
## 7. Validation checklist (Trivy / mirror bundles)
- Download and verify:
- `stella export download <exportId> --format mirror`
- `stella export verify <exportId>`
- Delta ordering:
- Ensure `manifest.diff.json.baseExportId` exists locally before applying delta.
- Track applied order in `appliedExportIds.log`.
- Trivy adapter (if enabled):
- `stella export trivy-validate --bundle mirror-YYYYMMDD.tar.zst --policy ./policies/export-center.rego`
- Dry-run import:
- `stella export mirror-validate --bundle mirror-YYYYMMDD.tar.zst --dry-run`
- Post-import checks:
- Recompute SHA256 for `manifest.yaml` and a sample data file; compare to manifest.
- Run `mirror verify` (Offline Kit) and confirm zero mismatches.
- Confirm OCI annotations `io.stellaops.export.profile/run/manifest-digest/provenance-ref` match the bundle being applied.
## 8. Troubleshooting
| Symptom | Meaning | Action |
|---------|---------|--------|
| `ERR_EXPORT_BASE_MISSING` | Base export not available | Republish base bundle or rebuild as full export. |
| Delta applies but mirror misses entries | Deltas applied out of order | Rebuild from last full bundle and reapply in sequence. |
## 7. Validation checklist (Trivy / mirror bundles)
- Download and verify:
- `stella export download <exportId> --format mirror`
- `stella export verify <exportId>`
- Delta ordering:
- Ensure `manifest.diff.json.baseExportId` exists locally before applying delta.
- Track applied order in `appliedExportIds.log`.
- Trivy adapter (if enabled):
- `stella export trivy-validate --bundle mirror-YYYYMMDD.tar.zst --policy ./policies/export-center.rego`
- Dry-run import:
- `stella export mirror-validate --bundle mirror-YYYYMMDD.tar.zst --dry-run`
- Post-import checks:
- Recompute SHA256 for `manifest.yaml` and a sample data file; compare to manifest.
- Run `mirror verify` (Offline Kit) and confirm zero mismatches.
- Confirm OCI annotations `io.stellaops.export.profile/run/manifest-digest/provenance-ref` match the bundle being applied.
## 8. Troubleshooting
| Symptom | Meaning | Action |
|---------|---------|--------|
| `ERR_EXPORT_BASE_MISSING` | Base export not available | Republish base bundle or rebuild as full export. |
| Delta applies but mirror misses entries | Deltas applied out of order | Rebuild from last full bundle and reapply in sequence. |
| Decryption fails | Recipient key mismatch or corrupted bundle | Confirm key distribution and re-download bundle. |
| Verification errors | Signature mismatch | Do not import; regenerate bundle and investigate signing pipeline. |
| Manifest hash mismatch | Files changed after extraction | Re-extract bundle and re-run verification; check storage tampering. |
## 9. References
- [Export Center Overview](overview.md)
- [Export Center Architecture](architecture.md)
- [Export Center API reference](api.md)
- [Export Center CLI Guide](cli.md)
| Verification errors | Signature mismatch | Do not import; regenerate bundle and investigate signing pipeline. |
| Manifest hash mismatch | Files changed after extraction | Re-extract bundle and re-run verification; check storage tampering. |
## 9. References
- [Export Center Overview](overview.md)
- [Export Center Architecture](architecture.md)
- [Export Center API reference](api.md)
- [Export Center CLI Guide](cli.md)
- [Concelier mirror runbook](../concelier/operations/mirror.md)
- [Aggregation-Only Contract reference](../../ingestion/aggregation-only-contract.md)
- [Aggregation-Only Contract reference](../../aoc/aggregation-only-contract.md)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -11,17 +11,17 @@ The Export Center packages StellaOps evidence and policy overlays into reproduci
- Runbook execution for recovery, retention, and compliance.
- Coordination with DevOps validation (cosign + `trivy module db import` smoke tests).
Related documentation:
- `docs/modules/export-center/overview.md`
- `docs/modules/export-center/architecture.md`
- `docs/modules/export-center/profiles.md`
- `docs/modules/export-center/trivy-adapter.md`
- `docs/modules/export-center/mirror-bundles.md`
- `docs/modules/export-center/api.md`
- `docs/modules/export-center/cli.md`
- `docs/modules/export-center/operations/kms-envelope-pattern.md`
- `docs/modules/export-center/operations/risk-bundle-provider-matrix.md`
Related documentation:
- `docs/modules/export-center/overview.md`
- `docs/modules/export-center/architecture.md`
- `docs/modules/export-center/profiles.md`
- `docs/modules/export-center/trivy-adapter.md`
- `docs/modules/export-center/mirror-bundles.md`
- `docs/modules/export-center/api.md`
- `docs/modules/export-center/cli.md`
- `docs/modules/export-center/operations/kms-envelope-pattern.md`
- `docs/modules/export-center/operations/risk-bundle-provider-matrix.md`
## 2. Contacts & tooling
@@ -199,7 +199,7 @@ If encryption enabled, decrypt using age or AES key before verification.
- `docs/modules/export-center/trivy-adapter.md`
- `docs/modules/export-center/mirror-bundles.md`
- `ops/devops/TASKS.md` (`DEVOPS-EXPORT-36-001`, `DEVOPS-EXPORT-37-001`)
- `docs/ingestion/aggregation-only-contract.md`
- `docs/aoc/aggregation-only-contract.md`
- `docs/24_OFFLINE_KIT.md`
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -33,7 +33,7 @@ Refer to `docs/modules/export-center/architecture.md` (Sprint 35 task) for compo
- **Signing and encryption.** Manifests and payloads are signed using the platform KMS. Mirror profiles support optional in-bundle encryption (age/AES-GCM) with key wrapping.
- **Determinism.** Identical inputs yield identical bundles. Timestamps serialize in UTC ISO-8601; manifests include content hashes for audit replay.
See `docs/security/policy-governance.md` and `docs/ingestion/aggregation-only-contract.md` for broader guardrail context.
See `docs/security/policy-governance.md` and `docs/aoc/aggregation-only-contract.md` for broader guardrail context.
## Operating it offline
- **Offline Kit integration.** Air-gapped deployments receive pre-built export profiles and object storage layout templates through the Offline Kit bundles.

View File

@@ -23,7 +23,7 @@ Graph module (upcoming) will power graph-indexed queries for SBOM relationships,
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -17,7 +17,7 @@ Notify evaluates operator-defined rules against platform events and dispatches c
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -28,25 +28,25 @@ Notify (Notifications Studio) converts platform events into tenant-scoped alerts
Status for these items is tracked in `src/Notifier/StellaOps.Notifier/TASKS.md` and sprint plans; update this README once tasks merge.
## Key docs & release alignment
- [`docs/notifications/overview.md`](../../notifications/overview.md) — summary of capabilities, imposed rules, and customer journey.
- [`docs/notifications/architecture.md`](../../notifications/architecture.md) — Notifications Studio runtime view (published 2025-10-29).
- [`docs/notifications/rules.md`](../../notifications/rules.md) — declarative matcher syntax and evaluation order.
- [`docs/notifications/digests.md`](../../notifications/digests.md) — digest windows, coalescing logic, and delivery samples.
- [`docs/notifications/templates.md`](../../notifications/templates.md) — template helpers, localisation, and redaction guidelines.
- [`docs/updates/2025-10-29-notify-docs.md`](../../updates/2025-10-29-notify-docs.md) — latest release note; follow-ups remain to validate connector metadata, quiet-hours semantics, and simulation payloads once Sprint 39 drops land.
- [`overview.md`](overview.md) — summary of capabilities, imposed rules, and customer journey.
- [`architecture.md`](architecture.md) / [`architecture-detail.md`](architecture-detail.md) — Notifications Studio runtime view.
- [`rules.md`](rules.md) — declarative matcher syntax and evaluation order.
- [`digests.md`](digests.md) — digest windows, coalescing logic, and delivery samples.
- [`templates.md`](templates.md) — template helpers, localisation, and redaction guidelines.
- [`docs/implplan/archived/updates/2025-10-29-notify-docs.md`](../../implplan/archived/updates/2025-10-29-notify-docs.md) — latest release note; follow-ups remain to validate connector metadata, quiet-hours semantics, and simulation payloads once Sprint 39 drops land.
## Integrations & dependencies
- **Storage:** PostgreSQL (schema `notify`) for rules, channels, deliveries, digests, and throttles; Valkey for worker coordination.
- **Queues:** Valkey Streams or NATS JetStream for ingestion, throttling, and DLQs (`notify.dlq`).
- **Authority:** OpTok-protected APIs, DPoP-backed CLI/UI scopes (`notify.viewer`, `notify.operator`, `notify.admin`), and secret references for channel credentials.
- **Observability:** Prometheus metrics (`notify.sent_total`, `notify.failed_total`, `notify.digest_coalesced_total`, etc.), OTEL traces, and dashboards documented in `docs/notifications/architecture.md#12-observability-prometheus--otel`.
- **Observability:** Prometheus metrics (`notify.sent_total`, `notify.failed_total`, `notify.digest_coalesced_total`, etc.), OTEL traces, and dashboards documented in `architecture-detail.md`.
## Operational notes
- Schema fixtures live in `./resources/schemas`; event and delivery samples live in `./resources/samples` for contract tests and UI mocks.
- Offline Kit bundles ship plug-ins, default templates, and seed rules; update manifests under `ops/offline-kit/` when connectors change.
- Dashboards and alert references depend on `DEVOPS-NOTIFY-39-002`; coordinate before renaming metrics or labels.
- Observability assets: `operations/observability.md` and `operations/dashboards/notify-observability.json` (offline import).
- When releasing new rule or connector features, mirror guidance into `docs/notifications/*.md` and checklists in `docs/updates/2025-10-29-notify-docs.md` until the follow-ups are closed.
- When releasing new rule or connector features, update guidance in this directory and related checklists until the follow-ups are closed.
## Epic alignment
- **Epic 11 Notifications Studio:** notifications workspace, preview tooling, immutable delivery ledger, throttling/digest controls, and forthcoming correlation/simulation features.

View File

@@ -0,0 +1,42 @@
# Notifications API
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
All endpoints require `Authorization: Bearer <token>` and `X-Stella-Tenant` header. Responses use the common error envelope (`docs/api/overview.md`). Paths are rooted at `/api/v1/notify`.
## Channels
- `POST /channels` — create channel. Body matches `channels.md` schema. Returns `201` + channel.
- `GET /channels` — list channels (deterministic order: type ASC, id ASC). Supports `type` filter.
- `GET /channels/{id}` — fetch single channel.
- `DELETE /channels/{id}` — soft-delete; fails if referenced by active rules unless `force=true` query.
## Rules
- `POST /rules` — create/update rule; idempotency via `Idempotency-Key`.
- `GET /rules` — list rules with paging (`page_token`, `page_size`). Sorted by `name` ASC.
- `POST /rules:preview` — dry-run rule against sample event; returns matched actions and rendered templates.
## Policies & escalations
- `POST /policies/escalations` — create escalation policy (see `escalations.md`).
- `GET /policies/escalations` — list policies.
## Deliveries & digests
- `GET /deliveries` — query delivery ledger; filters: `status`, `channel`, `rule_id`, `from`, `to`. Sorted by `createdUtc` DESC then `id` ASC.
- `GET /deliveries/{id}` — single delivery with rendered payload hash and attempts.
- `POST /digests/preview` — preview digest rendering for a tenant/rule set; returns deterministic body/hash without sending.
## Acknowledgements
- `POST /acks/{token}` — acknowledge an escalation token. Validates DSSE signature, token expiry, and tenant. Returns `200` with cancellation summary.
## Simulations
- `POST /simulations/rules` — simulate a rule set for a supplied event payload; no side effects. Returns matched actions and throttling outcome.
## Health & metadata
- `GET /health` — liveness/readiness probes.
- `GET /metadata` — returns supported channel types, max payload sizes, and server version.
## Determinism notes
- All list endpoints are stable and include `next_page_token` when applicable.
- Templates render with fixed locale `en-US` unless `Accept-Language` provided; rendering is pure (no network calls).
- `bodyHash` uses SHA-256 over canonical JSON; repeated sends with identical inputs produce identical hashes and are de-duplicated.

View File

@@ -0,0 +1,120 @@
# Notifications Architecture
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
This dossier distils the Notify architecture into implementation-ready guidance for service owners, SREs, and integrators. It complements the high-level overview by detailing process boundaries, persistence models, and extensibility points.
---
## 1. Runtime shape
```
┌──────────────────┐
│ Authority (OpTok)│
└───────┬──────────┘
┌───────▼──────────┐ ┌───────────────┐
│ Notify.WebService│◀──────▶│ PostgreSQL │
Tenant API│ REST + gRPC WIP │ │ rules/channels│
└───────▲──────────┘ │ deliveries │
│ │ digests │
Internal bus │ └───────────────┘
(NATS/Valkey/etc)│
┌─────────▼─────────┐ ┌───────────────┐
│ Notify.Worker │◀────▶│ Valkey / Cache│
│ rule eval + render│ │ throttles/locks│
└─────────▲─────────┘ └───────▲───────┘
│ │
│ │
┌──────┴──────┐ ┌─────────┴────────┐
│ Connectors │──────▶│ Slack/Teams/... │
│ (plug-ins) │ │ External targets │
└─────────────┘ └──────────────────┘
```
- **2025-11-02 decision — module boundaries.** Keep `src/Notify/` as the shared notification toolkit (engine, storage, queue, connectors) that multiple hosts can consume. `src/Notifier/` remains the Notifications Studio runtime (WebService + Worker) composed from those libraries. Do not collapse the directories until a packaging RFC covers build impacts, offline kit parity, and imposed-rule propagation.
- **WebService** hosts REST endpoints (`/channels`, `/rules`, `/templates`, `/deliveries`, `/digests`, `/stats`) and handles schema normalisation, validation, and Authority enforcement.
- **Worker** subscribes to the platform event bus, evaluates rules per tenant, applies throttles/digests, renders payloads, writes ledger entries, and invokes connectors.
- **Plug-ins** live under `plugins/notify/` and are loaded deterministically at service start (`orderedPlugins` list). Each implements connector contracts and optional health/test-preview providers.
Both services share options via `notify.yaml` (see `etc/notify.yaml.sample`). For dev/test scenarios, an in-memory repository exists but production requires PostgreSQL + Valkey/NATS for durability and coordination.
---
## 2. Event ingestion and rule evaluation
1. **Subscription.** Workers attach to the internal bus (Valkey Streams or NATS JetStream). Each partition key is `tenantId|scope.digest|event.kind` to preserve order for a given artefact.
2. **Normalisation.** Incoming events are hydrated into `NotifyEvent` envelopes. Payload JSON is normalised (sorted object keys) to preserve determinism and enable hashing.
3. **Rule snapshot.** Per-tenant rule sets are cached in memory. PostgreSQL LISTEN/NOTIFY triggers snapshot refreshes without restart.
4. **Match pipeline.**
- Tenant check (`rule.tenantId` vs. event tenant).
- Kind/namespace/repository/digest filters.
- Severity and KEV gating based on event deltas.
- VEX gating using `NotifyRuleMatchVex`.
- Action iteration with throttle/digest decisions.
5. **Idempotency.** Each action computes `hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket)`; matches within throttle TTL record `status=Throttled` and stop.
6. **Dispatch.** If digest is `instant`, the renderer immediately processes the action. Otherwise the event is appended to the digest window for later flush.
Failures during evaluation are logged with correlation IDs and surfaced through `/stats` and worker metrics (`notify_rule_eval_failures_total`, `notify_digest_flush_errors_total`).
---
## 3. Rendering & connectors
- **Template resolution.** The renderer picks the template in this order: action template → channel default template → locale fallback → built-in minimal template. Locale negotiation reduces `en-US` to `en-us`.
- **Helpers & partials.** Exposed helpers mirror the list in [`templates.md`](templates.md#3-variables-helpers-and-context). Plug-ins may register additional helpers but must remain deterministic and side-effect free.
- **Attestation lifecycle suite.** Sprint171 introduced dedicated `tmpl-attest-*` templates for verification failures, expiring attestations, key rotations, and transparency anomalies (see [`templates.md` §7](templates.md#7-attestation--signing-lifecycle-templates-notify-attest-74-001)). Rule actions referencing those templates must populate the attestation context fields so channels stay consistent online/offline.
- **Rendering output.** `NotifyDeliveryRendered` captures:
- `channelType`, `format`, `locale`
- `title`, `body`, optional `summary`, `textBody`
- `target` (redacted where necessary)
- `attachments[]` (safe URLs or references)
- `bodyHash` (lowercase SHA-256) for audit parity
- **Connector contract.** Connectors implement `INotifyConnector` (send + health) and can implement `INotifyChannelTestProvider` for `/channels/{id}/test`. All plugs are single-tenant aware; secrets are pulled via references at send time and never persisted in the database.
- **Retries.** Workers track attempts with exponential jitter. On permanent failure, deliveries are marked `Failed` with `statusReason`, and optional DLQ fan-out is slated for Sprint 40.
---
## 4. Persistence model
| Table | Purpose | Key fields & indexes |
|-------|---------|----------------------|
| `rules` | Tenant rule definitions. | `id`, `tenant_id`, `enabled`; index on `(tenant_id, enabled)`. |
| `channels` | Channel metadata + config references. | `id`, `tenant_id`, `type`; index on `(tenant_id, type)`. |
| `templates` | Locale-specific render bodies. | `id`, `tenant_id`, `channel_type`, `key`; index on `(tenant_id, channel_type, key)`. |
| `deliveries` | Ledger of rendered notifications. | `id`, `tenant_id`, `sent_at`; compound index on `(tenant_id, sent_at DESC)` for history queries. |
| `digests` | Open digest windows per action. | `id` (`tenant_id:action_key:window`), `status`; index on `(tenant_id, action_key)`. |
| `throttles` | Short-lived throttle tokens (PostgreSQL or Valkey). | Key format `idem:<hash>` with TTL aligned to throttle duration. |
Records are stored using the canonical JSON serializer (`NotifyCanonicalJsonSerializer`) to preserve property ordering and casing. Schema migration helpers upgrade stored records when new versions ship.
---
## 5. Deployment & configuration
- **Configuration sources.** YAML files feed typed options (`NotifyPostgresOptions`, `NotifyWorkerOptions`, etc.). Environment variables can override connection strings and rate limits for production.
- **Authority integration.** Two OAuth clients (`notify-web`, `notify-web-dev`) with scopes `notify.viewer`, `notify.operator`, and (for dev/admin flows) `notify.admin` are required. Authority enforcement can be disabled for air-gapped dev use by providing `developmentSigningKey`.
- **Plug-in management.** `plugins.baseDirectory` and `orderedPlugins` guarantee deterministic loading. Offline Kits copy the plug-in tree verbatim; operations must keep the order aligned across environments.
- **Observability.** Workers expose structured logs (`ruleId`, `actionId`, `eventId`, `throttleKey`). Metrics include:
- `notify_rule_matches_total{tenant,eventKind}`
- `notify_delivery_attempts_total{channelType,status}`
- `notify_digest_open_windows{window}`
- Optional OpenTelemetry traces for rule evaluation and connector round-trips.
- **Scaling levers.** Increase worker replicas to cope with bus throughput; adjust `worker.prefetchCount` for Valkey Streams or `ackWait` for NATS JetStream. WebService remains stateless and scales horizontally behind the gateway.
---
## 6. Roadmap alignment
| Backlog | Architectural note |
|---------|--------------------|
| `NOTIFY-SVC-38-001` | Standardise event envelope publication (idempotency keys) ensure bus bindings use the documented key format. |
| `NOTIFY-SVC-38-002..004` | Introduce simulation endpoints and throttle dashboards expect additional `/internal/notify/simulate` routes and metrics; update once merged. |
| `NOTIFY-SVC-39-001..004` | Correlation engine, digests generator, simulation API, quiet hours anticipate new PostgreSQL tables (`quiet_hours`, correlation caches) and connector metadata (quiet mode hints). Review this guide when implementations land. |
Action: schedule a documentation sync with the Notifications Service Guild immediately after `NOTIFY-SVC-39-001..004` merge to confirm schema adjustments (e.g., correlation edge storage, quiet hour calendars) and add any new persistence or API details here.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -0,0 +1,51 @@
# Notification Channels
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
## Supported channel types
- **Slack / Teams**: webhook-based with optional slash-command ack URLs.
- **Email (SMTP/SMTPS)**: relay-only; secrets provided via `secretRef` in Authority.
- **Generic webhook**: signed (HMAC-SHA256) payloads with replay protection and allowlisted hosts.
- **Pager duty-style escalation webhooks**: same contract as generic webhooks but with escalation metadata.
- **Console in-app**: stored delivery rendered in UI; always enabled for each tenant.
## Channel resource schema (Notify API)
```json
{
"id": "uuid",
"tenant": "string",
"type": "slack|teams|email|webhook|pager|console",
"endpoint": "https://..." ,
"secretRef": "authority://secrets/notify/slack-hook", // optional per type
"labels": { "env": "prod", "team": "sre" },
"throttle": { "windowSeconds": 60, "max": 10 },
"quietHours": { "from": "22:00", "to": "06:00", "timezone": "UTC" },
"enabled": true,
"createdUtc": "2025-11-25T00:00:00Z"
}
```
- **Determinism**: channel ids are UUIDv5 seeded by `(tenant, type, endpoint)` when created via manifests; server generates new IDs for ad-hoc API calls.
- **Validation**: endpoints must be on the allowlist; secretRef must exist in Authority; quiet hours use 24h clock UTC.
## Connector rules
- No secrets are stored in Notify DB; only `secretRef` is persisted.
- Per-tenant allowlists control outbound hostnames/ports; defaults block public internet in air-gapped kits.
- Payload signing:
- Slack/Teams: bearer secret in URL (indirect via secretRef) plus optional HMAC header `X-Stella-Signature` for mirror validation.
- Webhook/Pager: HMAC `X-Stella-Signature` (hex) over body with nonce + timestamp; receivers must enforce 5minute skew.
## Offline posture
- Offline kits ship default channel manifests under `out/offline/notify/channels/*.json` with placeholder endpoints.
- Operators must replace endpoints and secretRefs before deploy; validation rejects placeholder values.
## Observability
- Emit `notify.channel.delivery` counter with tags: `channel_type`, `tenant`, `status` (success/fail/throttled/quiet_hours), `rule_id`.
- Store delivery attempt hashes in the delivery ledger; duplicate payloads are de-duplicated per `(channel, bodyHash)` for 24h.
## Safety checklist
- [ ] Endpoint on allowlist and TLS valid.
- [ ] `secretRef` exists in Authority and scoped to tenant.
- [ ] Quiet hours configured for non-critical alerts; throttles set for bursty rules.
- [ ] HMAC signing verified in downstream system (webhook/pager).

View File

@@ -0,0 +1,92 @@
# Notifications Digests
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Digests coalesce multiple matching events into a single notification when rules request batched delivery. They protect responders from alert storms while preserving a deterministic record of every input.
---
## 1. Digest lifecycle
1. **Window selection.** Rule actions opt into a digest cadence by setting `actions[].digest` (`instant`, `5m`, `15m`, `1h`, `1d`). `instant` skips digest logic entirely.
2. **Aggregation.** When an event matches, the worker appends it to the open digest window (`tenantId + actionId + window`). Events include the canonical scope, delta counts, and references.
3. **Flush.** When the window expires or hits the workers safety cap (configurable), the worker renders a digest template and emits a single delivery with status `Digested`.
4. **Audit.** The delivery ledger links back to the digest document so operators can inspect individual items and the aggregated summary.
---
## 2. Storage model
Digest state lives in PostgreSQL (`notify.digests` table) and mirrors the schema described in [modules/notify/architecture.md](../modules/notify/architecture.md#7-data-model):
```json
{
"id": "tenant-dev:act-email-compliance:1h",
"tenantId": "tenant-dev",
"actionKey": "act-email-compliance",
"window": "1h",
"openedAt": "2025-10-24T08:00:00Z",
"status": "open",
"items": [
{
"eventId": "00000000-0000-0000-0000-000000000001",
"scope": {
"namespace": "prod-payments",
"repo": "ghcr.io/acme/api",
"digest": "sha256:…"
},
"delta": {
"newCritical": 1,
"kev": 1
}
}
]
}
```
- `status` reflects whether the window is currently collecting (`open`) or has been completed (`closed`). Future revisions may introduce `flushing` for in-progress operations.
- `items[].delta` captures aggregated counts for reporting (e.g., new critical findings, KEV, quieted).
- Workers use optimistic concurrency on the document ID to avoid duplicate flushes across replicas.
---
## 3. Rendering and templates
- Digest deliveries use the same template engine as instant notifications. Templates receive an additional `digest` object with `window`, `openedAt`, `itemCount`, and `items` (findings grouped by namespace/repository when available).
- Provide digest-specific templates (e.g., `tmpl-digest-hourly`) so the body can enumerate top offenders, summarise totals, and link to detailed dashboards.
- When no template is specified, Notify falls back to channel defaults that emphasise summary counts and redirect to Console for detail.
---
## 4. API surface
| Endpoint | Description | Notes |
|----------|-------------|-------|
| `POST /digests` | Issues administrative commands (e.g., force flush, reopen) for a specific action/window. | Request body specifies the command target; requires `notify.admin`. |
| `GET /digests/{actionKey}` | Returns the currently open window (if any) for the referenced action. | Supports operators/CLI inspecting pending digests; requires `notify.viewer`. |
| `DELETE /digests/{actionKey}` | Drops the open window without notifying (emergency stop). | Emits an audit record; use sparingly. |
All routes honour the tenant header and reuse the standard Notify rate limits.
---
## 5. Worker behaviour and safety nets
- **Idempotency.** Flush operations generate a deterministic digest delivery ID (`digest:<tenant>:<actionId>:<window>:<openedAt>`). Retries reuse the same ID.
- **Throttles.** Digest generation respects action throttles; setting an aggressive throttle together with a digest window may result in deliberate skips (logged as `Throttled` in the delivery ledger).
- **Quiet hours.** Future sprint work (`NOTIFY-SVC-39-004`) integrates quiet-hour calendars. When enabled, flush timers pause during quiet windows and resume afterwards.
- **Back-pressure.** When the window reaches the configured item cap before the timer, the worker flushes early and starts a new window immediately.
- **Crash resilience.** Workers rebuild in-flight windows from PostgreSQL on startup; partially flushed windows remain closed after success or reopened if the flush fails.
---
## 6. Operator guidance
- Choose hourly digests for high-volume compliance events; daily digests suit executive reporting.
- Pair digests with incident-focused instant rules so critical items surface immediately while less urgent noise is summarised.
- Monitor `/stats` output for `openDigestCount` to ensure windows are flushing; spikes may indicate downstream connector failures.
- When testing new digest templates, open a small (`5m`) window, trigger sample events, then call `POST /digests/{actionId}/flush` to validate rendering before moving to longer cadences.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -0,0 +1,51 @@
# Escalations & Acknowledgements
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
## Model
- **Escalation policy**: ordered stages of channels with delays; stored per tenant.
- **Acknowledgement**: DSSE-signed token embedded in messages; acknowledger must present token to stop escalation.
- **Suppression**: rules may mark events as non-escalating (informational) while still sending single notifications.
## Policy schema (conceptual)
```json
{
"id": "uuid",
"tenant": "string",
"name": "pager-policy-prod",
"stages": [
{ "delaySeconds": 0, "channels": ["slack-prod", "email-oncall"] },
{ "delaySeconds": 900, "channels": ["pager-primary"] },
{ "delaySeconds": 1800,"channels": ["pager-management"] }
],
"autoCloseMinutes": 120,
"retry": { "maxAttempts": 3, "backoffSeconds": 60 }
}
```
- Stages execute sequentially until an **ack** is recorded.
- Deterministic ordering: channels within a stage are sorted lexicographically before dispatch.
## Ack tokens
- Token payload: `{ tenant, deliveryId, expiresUtc, ruleId, actionHash }`.
- Signed with Authority-issued DSSE key; verified by Notify WebService before accepting `POST /acks/{token}`.
- Expiry defaults to 24h; tokens are single-use and idempotent.
## Escalation flow
1) Rule fires → action references an escalation policy.
2) Stage 0 deliveries sent; ledger records attempts and ack URL.
3) If no ack by `delaySeconds`, next stage dispatches; repeats until ack or final stage.
4) On ack, remaining stages are cancelled; ledger entry marked `acknowledged` with timestamp and subject.
## Quiet hours & throttles
- Quiet hours suppress *new* escalations; in-flight escalations continue.
- Per-policy throttle prevents repeated escalation runs for identical `actionHash` within a configurable window (default 30m).
## Observability
- Counters: `notify.escalation.started`, `notify.escalation.stage_sent`, `notify.escalation.ack`, `notify.escalation.cancelled` tagged by `tenant`, `policy`, `stage`.
- Logs: structured `escalation.{started|stage_sent|ack|cancelled}` with delivery ids and rationale.
## Runbooks
- Update escalation policy safely: create new policy id, switch rules, then delete old policy to avoid mid-flight ambiguity.
- If a stage storms, set throttle higher or add quiet hours; do not delete the policy mid-flight—use `cancelEscalation` endpoint instead.

View File

@@ -0,0 +1,9 @@
{
"tenant_id": "tenant-123",
"delivery_id": "00000000-0000-4000-8000-000000000001",
"channel": "email",
"subject": "User signup",
"body": "User john@example.com joined",
"redacted_body": "User ***@example.com joined",
"pii_hash": "dd4eefc8dded5d6f46c832e959ba0eef95ee8b77f10ac0aae90f7c89ad42906c"
}

View File

@@ -0,0 +1 @@
{"template_id":"tmpl-incident-start","locale":"en-US","channel":"email","expected_hash":"05eb80e384eaf6edf0c44a655ca9064ca4e88b8ad7cefa1483eda5c9aaface00","body_sample_path":"tmpl-incident-start.email.en-US.json"}

View File

@@ -0,0 +1,6 @@
{
"subject": "Incident started: ${incident_id}",
"body": "Incident ${incident_id} started at ${started_at}. Severity: ${severity}.",
"merge_fields": ["incident_id", "started_at", "severity"],
"preview_hash": "05eb80e384eaf6edf0c44a655ca9064ca4e88b8ad7cefa1483eda5c9aaface00"
}

View File

@@ -0,0 +1,10 @@
{
"trace_id": "00000000000000000000000000000001",
"tenant_id": "tenant-123",
"rule_id": "RULE-INCIDENT",
"channel_id": "email-default",
"attributes": {
"delivery_id": "00000000-0000-4000-8000-000000000001",
"status": "sent"
}
}

View File

@@ -0,0 +1,32 @@
# Notify Gaps NR1NR10 — Remediation Blueprint (source: `docs/product-advisories/31-Nov-2025 FINDINGS.md`)
## Scope
Close NR1NR10 by defining contracts, evidence, and deterministic test hooks for the Notifier runtime (service + worker + offline kit). This doc is the detailed layer referenced by sprint `SPRINT_0171_0001_0001_notifier_i` and NOTIFY-GAPS-171-014.
## Gap requirements, evidence, and tests
| ID | Requirement | Evidence to publish | Deterministic tests/fixtures |
| --- | --- | --- | --- |
| NR1 | Versioned JSON Schemas for event envelopes, rules, templates, channels, receipts, and webhooks; DSSE-signed catalog with canonical hash recipe (BLAKE3-256 over normalized JSON). | `docs/modules/notify/schemas/notify-schemas-catalog.json` + `.dsse.json`; `docs/modules/notify/schemas/inputs.lock` capturing digests and canonicalization flags. | Golden canonicalization harness under `tests/notifications/Schemas/SchemaCanonicalizationTests.cs` using frozen inputs + hash assertions. |
| NR2 | Tenant scoping + approvals for high-impact rules (escalations, PII, cross-tenant fan-out). Every API and receipt carries `tenant_id`; RBAC/approvals enforced. | RBAC/approval matrix (`docs/modules/notify/security/tenant-approvals.md`) listing actions × roles × required approvals. | API contract tests in `StellaOps.Notifier.Tests/TenantScopeTests.cs` plus integration fixtures with mixed-tenant payloads (should reject). |
| NR3 | Deterministic rendering/localization: stable merge-field ordering, UTC ISO-8601 timestamps, locale whitelist, hashed previews recorded in ledger. | Rendering fixture pack `docs/modules/notify/fixtures/rendering/*.json`; hash ledger samples `docs/modules/notify/fixtures/rendering/index.ndjson` with BLAKE3 digests. | `StellaOps.Notifier.Tests/RenderingDeterminismTests.cs` compares golden bodies/subjects across locales/timezones; seeds fixed RNG/time. |
| NR4 | Quotas/backpressure/DLQ: per-tenant/channel quotas, burst budgets, enqueue gating, DLQ schema with redrive + idempotent keys; metrics/alerts for backlog/DLQ growth. | Quota policy `docs/modules/notify/operations/quotas.md`; DLQ schema `docs/modules/notify/schemas/dlq-notify.schema.json`. | Worker tests `StellaOps.Notifier.Tests/BackpressureAndDlqTests.cs` validating quota enforcement, DLQ insertion, redrive idempotency. |
| NR5 | Retry & idempotency: canonical `delivery_id` (UUIDv7) + dedupe key (event×rule×channel); bounded exponential backoff with jitter; idempotent connectors; ignore out-of-order acks. | Retry matrix `docs/modules/notify/operations/retries.md`; connector idempotency checklist. | `StellaOps.Notifier.Tests/RetryPolicyTests.cs` + connector harness fixtures demonstrating dedupe across duplicate events. |
| NR6 | Webhook/ack security: HMAC or mTLS/DPoP required; signed ack URLs/tokens with nonce, expiry, audience, single-use; per-tenant allowlists for domains/paths. | Security policy `docs/modules/notify/security/webhook-ack-hardening.md`; sample signed-ack token format + validation steps. | Negative-path tests `StellaOps.Notifier.Tests/WebhookSecurityTests.cs` covering wrong HMAC, replayed nonce, expired token, disallowed domain. |
| NR7 | Redaction & PII limits: classify template fields; redact secrets/PII in storage/logs; hash sensitive values; size/field allowlists; previews/logs default to redacted variant. | Redaction catalog `docs/modules/notify/security/redaction-catalog.md`; sample redacted payloads `docs/modules/notify/fixtures/redaction/*.json`. | `StellaOps.Notifier.Tests/RedactionTests.cs` asserting stored/preview payloads match redacted expectations. |
| NR8 | Observability SLO alerts: SLOs for delivery latency/success/backlog/DLQ age; standard metrics names; dashboards/alerts/runbooks; traces include tenant/rule/channel IDs with sampling rules. | Dashboard JSON `docs/modules/notify/operations/dashboards/notify-slo.json`; alert rules `docs/modules/notify/operations/alerts/notify-slo-alerts.yaml`; runbook link. | `StellaOps.Notifier.Tests/ObservabilityContractsTests.cs` verifying metric names/labels; trace exemplar fixture `docs/modules/notify/fixtures/traces/sample-trace.json`. |
| NR9 | Offline notify-kit with DSSE: bundle schemas, rules/templates, connector configs, verify script, hash list, time-anchor hook; deterministic packaging flags; tenant/env scoping; DSSE-signed manifest. | Manifest `offline/notifier/notify-kit.manifest.json`, DSSE `offline/notifier/notify-kit.manifest.dsse.json`, hash list `offline/notifier/artifact-hashes.json`, verify script `offline/notifier/verify_notify_kit.sh`. | Determinism check `tests/offline/NotifyKitDeterminismTests.sh` (shell) verifying hash list, DSSE, scope enforcement, packaging flags. |
| NR10 | Mandatory simulations & evidence before activation: dry-run against frozen fixtures; DSSE-signed simulation results attached to approvals; regression tests per high-impact rule/template change. | Simulation report `docs/modules/notify/simulations/<rule-id>-report.json` + DSSE; approval evidence log `docs/modules/notify/simulations/index.ndjson`. | `StellaOps.Notifier.Tests/SimulationGateTests.cs` enforcing simulation requirement and evidence linkage before `active=true`. |
## Delivery + governance hooks
- Add the above evidence paths to the NOTIFY-GAPS-171-014 task in `docs/implplan/SPRINT_0171_0001_0001_notifier_i.md` and mirror status in `src/Notifier/StellaOps.Notifier/TASKS.md`.
- When artifacts land, append TRX/fixture links in the sprint **Execution Log** and reference this doc under **Decisions & Risks**.
- Offline kit artefacts must mirror mirror/offline packaging rules (deterministic flags, time-anchor hook, PQ dual-sign toggle) already used by Mirror/Offline sprints.
- Simulation evidence lives in `docs/modules/notify/simulations/` (index.ndjson + per-rule reports) and is validated by contract tests under `Contracts/PolicyDocsCompletenessTests.cs`.
- Contract tests under `Contracts/` verify schema catalog ↔ DSSE alignment, fixture hashes, simulation index presence, and offline kit manifest/DSSE consistency.
## Next steps
1) Generate initial schema catalog (`notify-schemas-catalog.json`) with rule/template/channel/webhook/receipt definitions and run canonicalization harness.
2) Produce redaction catalog, quotas policy, retry matrix, and security hardening docs referenced above.
3) Add golden fixtures/tests outlined above and wire CI filters to run determinism + security suites for Notify.
4) Build notify-kit manifest + DSSE and publish `verify_notify_kit.sh` aligned with offline bundle policies.

View File

@@ -0,0 +1,27 @@
groups:
- name: notify-slo
rules:
- alert: NotifyDeliverySuccessSLO
expr: sum(rate(notify_delivery_success_total[5m])) / sum(rate(notify_delivery_total[5m])) < 0.98
for: 10m
labels:
severity: page
annotations:
summary: "Notify delivery success below SLO"
description: "Success ratio below 98% over 10m"
- alert: NotifyBacklogDepthHigh
expr: notify_backlog_depth > 5000
for: 5m
labels:
severity: page
annotations:
summary: "Notify backlog too high"
description: "Backlog depth exceeded 5000 messages"
- alert: NotifyDlqGrowth
expr: rate(notify_dlq_depth[10m]) > 50
for: 10m
labels:
severity: ticket
annotations:
summary: "Notify DLQ growth"
description: "Dead letter queue growing faster than threshold"

View File

@@ -0,0 +1,9 @@
{
"title": "Notify SLO",
"panels": [
{ "title": "Delivery success", "target": "sum(rate(notify_delivery_success_total[5m])) / sum(rate(notify_delivery_total[5m]))" },
{ "title": "Backlog depth", "target": "notify_backlog_depth" },
{ "title": "DLQ depth", "target": "notify_dlq_depth" },
{ "title": "Latency p95", "target": "histogram_quantile(0.95, rate(notify_delivery_latency_seconds_bucket[5m]))" }
]
}

View File

@@ -0,0 +1,7 @@
# Quotas, backpressure, and DLQ (NR4)
- Per-tenant quotas: 500 deliveries/minute default; channel overrides: webhook 200/min, email 120/min, chat 240/min.
- Burst budget: 2x quota for 60 seconds, then hard clamp.
- Backpressure: reject enqueue when backlog > quota*10 or DLQ growth > 5%/min.
- DLQ schema: `docs/modules/notify/schemas/dlq-notify.schema.json`; redrive requires idempotent `delivery_id`/`dedupe_key`.
- Metrics to alert: backlog depth, DLQ depth, redrive success rate, enqueue reject count.

View File

@@ -0,0 +1,7 @@
# Retry and idempotency policy (NR5)
- `delivery_id`: UUIDv7; `dedupe_key`: hash(event_id + rule_id + channel_id).
- Backoff: exponential with jitter; base 2s, factor 2, max 5 attempts, cap 5 minutes between attempts.
- Connectors must be idempotent; retries reuse the same `dedupe_key` and must not duplicate sends.
- Out-of-order acks ignored: only monotonic `attempt` accepted.
- Record retry outcomes in receipts and include attempt count + reason.

View File

@@ -0,0 +1,78 @@
# Notifications Overview
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Notifications Studio turns raw platform events into concise, tenant-scoped alerts that reach the right responders without overwhelming them. The service is sovereign/offline-first, follows the Aggregation-Only Contract (AOC), and produces deterministic outputs so the same configuration yields identical deliveries across environments.
---
## 1. Mission & value
- **Reduce noise.** Only materially new or high-impact changes reach chat, email, or webhooks thanks to rule filters, throttles, and digest windows.
- **Explainable results.** Every delivery is traceable back to a rule, action, and event payload stored in the delivery ledger; operators can audit what fired and why.
- **Safe by default.** Secrets remain in external stores, templates are sandboxed, quiet hours and throttles prevent storms, and idempotency guarantees protect downstream systems.
- **Offline-aligned.** All configuration, templates, and plug-ins ship with Offline Kits; no external SaaS is required to send notifications.
---
## 2. Core capabilities
| Capability | What it does | Key docs |
|------------|--------------|----------|
| Rules engine | Declarative matchers for event kinds, severities, namespaces, VEX context, KEV flags, and more. | [rules.md](rules.md) |
| Channel catalog | Slack, Teams, Email, Webhook connectors loaded via restart-time plug-ins; metadata stored without secrets. | [architecture.md](architecture.md) |
| Templates | Locale-aware, deterministic rendering via safe helpers; channel defaults plus tenant-specific overrides, including the attestation lifecycle suite (`tmpl-attest-*`). | [templates.md](templates.md#7-attestation--signing-lifecycle-templates-notify-attest-74-001) |
| Digests | Coalesce bursts into periodic summaries with deterministic IDs and audit trails. | [digests.md](digests.md) |
| Delivery ledger | Tracks rendered payload hashes, attempts, throttles, and outcomes for every action. | [architecture.md](architecture.md#7-data-model) |
| Ack tokens | DSSE-signed acknowledgement tokens with webhook allowlists and escalation guardrails enforced by Authority. | [architecture.md](architecture.md#81-ack-tokens--escalation-workflows) |
---
## 3. How it fits into StellaOps
1. **Producers emit events.** Scanner, Scheduler, VEX Lens, Attestor, and Zastava publish canonical envelopes (`NotifyEvent`) onto the internal bus.
2. **Notify.Worker evaluates rules.** For each tenant, the worker applies match filters, VEX gates, throttles, and digest policies before rendering the action.
3. **Connectors deliver.** Channel plug-ins send the rendered payload to Slack/Teams/Email/Webhook targets and report back attempts and outcomes.
4. **Consumers investigate.** Operators pivot from message links into Console dashboards, SBOM views, or policy overlays with correlation IDs preserved.
The Notify WebService fronts worker state with REST APIs used by the UI and CLI. Tenants authenticate via StellaOps Authority scopes `notify.viewer`, `notify.operator`, and (for escalated actions) `notify.admin`. All operations require the tenant header (`X-StellaOps-Tenant`) to preserve sovereignty boundaries.
---
## 4. Operating model
| Area | Guidance |
|------|----------|
| **Tenancy** | Each rule, channel, template, and delivery belongs to exactly one tenant. Cross-tenant sharing is intentionally unsupported. |
| **Determinism** | Configuration persistence normalises strings and sorts collections. Template rendering produces identical `bodyHash` values when inputs match; attestation events always reference the canonical `tmpl-attest-*` keys documented in the template guide. |
| **Scaling** | Workers scale horizontally; per-tenant rule snapshots are cached and refreshed from PostgreSQL change notifications. Valkey (or Redis-compatible) guards throttles and locks. |
| **Offline** | Offline Kits include plug-ins, default templates, and seed rules. Operators can edit YAML/JSON manifests before air-gapped deployment. |
| **Security** | Channel secrets use indirection (`secretRef`), Authority-protected OAuth clients secure API access, and delivery payloads are redacted before storage where required. |
| **Module boundaries** | 2025-11-02 decision: keep `src/Notify/` as the shared notification toolkit and `src/Notifier/` as the Notifications Studio runtime host until a packaging RFC covers the implications of merging. |
---
## 5. Getting started (first 30 minutes)
| Step | Goal | Reference |
|------|------|-----------|
| 1 | Deploy Notify WebService + Worker with PostgreSQL and Valkey | [`modules/notify/architecture.md`](../modules/notify/architecture.md#1-runtime-shape--projects) |
| 2 | Register OAuth clients/scopes in Authority | [`etc/authority.yaml.sample`](../../etc/authority.yaml.sample) |
| 3 | Install channel plug-ins and capture secret references | [`plugins/notify`](../../plugins) |
| 4 | Create a tenant rule and test preview | [`POST /channels/{id}/test`](../modules/notify/architecture.md#8-external-apis-webservice) |
| 5 | Inspect deliveries and digests | `/api/v1/notify/deliveries`, `/api/v1/notify/digests` |
---
## 6. Alignment with implementation work
| Backlog item | Impact on docs | Status |
|--------------|----------------|--------|
| `NOTIFY-SVC-38-001..004` | Foundational correlation, throttling, simulation hooks. | **In progress** align behaviour once services publish beta APIs. |
| `NOTIFY-SVC-39-001..004` | Adds correlation engine, digest generator, simulation API, quiet hours. | **Pending** revisit rule/digest sections when these tasks merge. |
Action: coordinate with the Notifications Service Guild when `NOTIFY-SVC-39-001..004` land to validate payload fields, quiet-hours semantics, and any new connector metadata that should be documented here and in the channel-specific guides.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -0,0 +1,259 @@
# Pack Approvals Notification Contract
> **Status:** Implemented (NOTIFY-SVC-37-001)
> **Last Updated:** 2025-11-27
> **OpenAPI Spec:** `src/Notifier/StellaOps.Notifier/StellaOps.Notifier.WebService/openapi/pack-approvals.yaml`
## Overview
This document defines the canonical contract for pack approval notifications between Task Runner and the Notifier service. It covers event payloads, resume token mechanics, error handling, and security requirements.
## Event Kinds
| Kind | Description | Trigger |
|------|-------------|---------|
| `pack.approval.requested` | Approval required for pack deployment | Task Runner initiates deployment requiring approval |
| `pack.approval.updated` | Approval state changed | Decision recorded or timeout |
| `pack.policy.hold` | Policy gate blocked deployment | Policy Engine rejects deployment |
| `pack.policy.released` | Policy hold lifted | Policy conditions satisfied |
## Canonical Event Schema
```json
{
"eventId": "550e8400-e29b-41d4-a716-446655440000",
"issuedAt": "2025-11-27T10:30:00Z",
"kind": "pack.approval.requested",
"packId": "pkg:oci/stellaops/scanner@v2.1.0",
"policy": {
"id": "policy-prod-deploy",
"version": "1.2.3"
},
"decision": "pending",
"actor": "ci-pipeline@stellaops.example.com",
"resumeToken": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9...",
"summary": "Deployment approval required for production scanner update",
"labels": {
"environment": "production",
"team": "security"
}
}
```
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `eventId` | UUID | Unique event identifier; used for deduplication |
| `issuedAt` | ISO 8601 | Event timestamp in UTC |
| `kind` | string | Event type (see Event Kinds table) |
| `packId` | string | Package identifier in PURL format |
| `decision` | string | Current state: `pending`, `approved`, `rejected`, `hold`, `expired` |
| `actor` | string | Identity that triggered the event |
### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `policy` | object | Policy metadata (`id`, `version`) |
| `resumeToken` | string | Opaque token for Task Runner resume flow |
| `summary` | string | Human-readable summary for notifications |
| `labels` | object | Custom key-value metadata |
## Resume Token Mechanics
### Token Flow
```
┌─────────────┐ POST /pack-approvals ┌──────────────┐
│ Task Runner │ ──────────────────────────────►│ Notifier │
│ │ { resumeToken: "abc123" } │ │
│ │◄──────────────────────────────│ │
│ │ X-Resume-After: "abc123" │ │
└─────────────┘ └──────────────┘
│ │
│ │
│ User acknowledges approval │
│ ▼
│ ┌──────────────────────────────┐
│ │ POST /pack-approvals/{id}/ack
│ │ { ackToken: "..." } │
│ └──────────────────────────────┘
│ │
│◄─────────────────────────────────────────────┤
│ Resume callback (webhook or message bus) │
```
### Token Properties
- **Format:** Opaque string; clients must not parse or modify
- **TTL:** 24 hours from `issuedAt`
- **Uniqueness:** Scoped to tenant + packId + eventId
- **Expiry Handling:** Expired tokens return `410 Gone`
### X-Resume-After Header
When `resumeToken` is present in the request, the server echoes it in the `X-Resume-After` response header. This enables cursor-based processing for Task Runner polling.
## Error Handling
### HTTP Status Codes
| Code | Meaning | Client Action |
|------|---------|---------------|
| `200` | Duplicate request (idempotent) | Treat as success |
| `202` | Accepted for processing | Continue normal flow |
| `204` | Acknowledgement recorded | Continue normal flow |
| `400` | Validation error | Fix request and retry |
| `401` | Authentication required | Refresh token and retry |
| `403` | Insufficient permissions | Check scope; contact admin |
| `404` | Resource not found | Verify packId; may have expired |
| `410` | Token expired | Re-initiate approval flow |
| `429` | Rate limited | Retry after `Retry-After` seconds |
| `5xx` | Server error | Retry with exponential backoff |
### Error Response Format
```json
{
"error": {
"code": "invalid_request",
"message": "eventId, packId, kind, decision, actor are required.",
"traceId": "00-abc123-def456-00"
}
}
```
### Retry Strategy
- **Transient errors (5xx, 429):** Exponential backoff starting at 1s, max 60s, max 5 retries
- **Validation errors (4xx except 429):** Do not retry; fix request
- **Idempotency:** Safe to retry any request with the same `Idempotency-Key`
## Security Requirements
### Authentication
All endpoints require a valid OAuth2 bearer token with one of these scopes:
- `packs.approve` — Full approval flow access
- `Notifier.Events:Write` — Event ingestion only
### Tenant Isolation
- `X-StellaOps-Tenant` header is **required** on all requests
- Server validates token tenant claim matches header
- Cross-tenant access returns `403 Forbidden`
### Idempotency
- `Idempotency-Key` header is **required** for POST endpoints
- Keys are scoped to tenant and expire after 15 minutes
- Duplicate requests within the window return `200 OK`
### HMAC Signature (Webhooks)
For webhook callbacks from Notifier to Task Runner:
```
X-StellaOps-Signature: sha256=<hex-encoded-signature>
X-StellaOps-Timestamp: <unix-timestamp>
```
Signature computed as:
```
HMAC-SHA256(secret, timestamp + "." + body)
```
Verification requirements:
- Reject if timestamp is >5 minutes old
- Reject if signature does not match
- Reject if body has been modified
### IP Allowlist
Configurable per environment in `notifier:security:ipAllowlist`:
```yaml
notifier:
security:
ipAllowlist:
- "10.0.0.0/8"
- "192.168.1.100"
```
### Sensitive Data Handling
- **Resume tokens:** Encrypted at rest; never logged in full
- **Ack tokens:** Signed with KMS; validated on acknowledgement
- **Labels:** Redacted if keys match `secret`, `password`, `token`, `key` patterns
## Audit Trail
All operations emit structured audit events:
| Event | Fields | Retention |
|-------|--------|-----------|
| `pack.approval.ingested` | packId, kind, decision, actor, eventId | 90 days |
| `pack.approval.acknowledged` | packId, ackToken, decision, actor | 90 days |
| `pack.policy.hold` | packId, policyId, reason | 90 days |
## Observability
### Metrics
| Metric | Type | Labels |
|--------|------|--------|
| `notifier_pack_approvals_total` | Counter | `kind`, `decision`, `tenant` |
| `notifier_pack_approvals_outstanding` | Gauge | `tenant` |
| `notifier_pack_approval_ack_latency_seconds` | Histogram | `decision` |
| `notifier_pack_approval_errors_total` | Counter | `code`, `tenant` |
### Structured Logs
All operations include:
- `traceId` — Distributed trace correlation
- `tenantId` — Tenant identifier
- `packId` — Package identifier
- `eventId` — Event identifier
## Integration Examples
### Task Runner → Notifier (Ingestion)
```bash
curl -X POST https://notifier.stellaops.example.com/api/v1/notify/pack-approvals \
-H "Authorization: Bearer $TOKEN" \
-H "X-StellaOps-Tenant: tenant-acme-corp" \
-H "Idempotency-Key: $(uuidgen)" \
-H "Content-Type: application/json" \
-d '{
"eventId": "550e8400-e29b-41d4-a716-446655440000",
"issuedAt": "2025-11-27T10:30:00Z",
"kind": "pack.approval.requested",
"packId": "pkg:oci/stellaops/scanner@v2.1.0",
"decision": "pending",
"actor": "ci-pipeline@stellaops.example.com",
"resumeToken": "abc123",
"summary": "Approval required for production deployment"
}'
```
### Console → Notifier (Acknowledgement)
```bash
curl -X POST https://notifier.stellaops.example.com/api/v1/notify/pack-approvals/pkg%3Aoci%2Fstellaops%2Fscanner%40v2.1.0/ack \
-H "Authorization: Bearer $TOKEN" \
-H "X-StellaOps-Tenant: tenant-acme-corp" \
-H "Content-Type: application/json" \
-d '{
"ackToken": "ack-token-xyz789",
"decision": "approved",
"comment": "Reviewed and approved"
}'
```
## Related Documents
- [Pack Approvals Integration Requirements](pack-approvals-integration.md)
- [Notifications Architecture](architecture.md)
- [Notifications API Reference](api.md)
- [Notification Templates](templates.md)

View File

@@ -0,0 +1,62 @@
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
# Pack Approval Notification Integration — Requirements
## Overview
Task Runner now produces pack plans with explicit approval and policy-gate metadata. The Notifications service must ingest those events, persist their state, and fan out actionable alerts (approvals requested, policy holds, resumptions). This document captures the requirements for the first Notifications sprint dedicated to the Task Runner bridge.
Deliverables feed Sprint 37 tasks (`NOTIFY-SVC-37-00x`) and unblock Task Runner sprint 43 (`TASKRUN-43-001`).
## Functional Requirements
### 1. Approval Event Contract
- Define a canonical schema for **PackApprovalRequested** and **PackApprovalUpdated** events.
- Fields must include `runId`, `approvalId`, tenant context, plan hash, required grants, step identifiers, message template, and resume callback metadata.
- Provide an OpenAPI fragment and x-go/x-cs models for Task Runner and CLI compatibility.
- Document error/acknowledgement semantics (success, retryable failure, validation failure).
### 2. Ingestion & Persistence
- Expose a secure Notifications API endpoint (`POST /notifications/pack-approvals`) receiving Task Runner events.
- Validate scope (`packs.approve`, `Notifier.Events:Write`) and tenant match.
- Persist approval state transitions in PostgreSQL (`notify.pack_approvals` table) with indexes on run/approval/tenant.
- Store outbound notification audit records with correlation IDs to support Task Runner resume flow.
### 3. Notification Routing
- Derive recipients from new rule predicates (`event.kind == "pack.approval"`).
- Render approval templates (email + webhook JSON) including plan metadata and approval links (resume token).
- Emit policy gate notifications as “hold” incidents with context (parameters, messages).
- Support localization fallback and redaction of secrets (never ship approval tokens unencrypted).
### 4. Resume & Ack Handshake
- Provide an approval ack endpoint (`POST /notifications/pack-approvals/{runId}/{approvalId}/ack`) that records decision metadata and forwards to Task Runner resume hook (HTTP callback + message bus placeholder).
- Return structured responses with resume token / status for CLI integration.
- Ensure idempotent updates (dedupe by runId + approvalId + decisionHash).
### 5. Observability & Security
- Emit metrics for approval notifications queued/sent, outstanding approvals, and acknowledgement latency.
- Log audit trail events (`pack.approval.requested`, `pack.approval.acknowledged`, `pack.policy.hold`).
- Enforce HMAC or mTLS for Task Runner -> Notifier ingestion; support configurable IP allowlist.
- Provide chaos-test plan for notification failure modes (channel outage, storage failure).
## Non-Functional Requirements
- Deterministic processing: identical approval events lead to identical outbound notifications (idempotent).
- Timeouts: ingestion endpoint must respond < 500ms under nominal load.
- Retry strategy: Task Runner expects 5xx/429 for transient errors; document backoff guidance.
- Data retention: approval records retained 90 days, purge job tracked under ops runbook.
## Sprint 37 Task Mapping
| Task ID | Scope |
| --- | --- |
| **NOTIFY-SVC-37-001** | Author this contract doc, OpenAPI fragment, and schema references. Coordinate with Task Runner/Authority guilds. |
| **NOTIFY-SVC-37-002** | Implement secure ingestion endpoint, PostgreSQL persistence, and audit hooks. Provide integration tests with sample events. |
| **NOTIFY-SVC-37-003** | Build approval/policy notification templates, routing rules, and channel dispatch (email + webhook). |
| **NOTIFY-SVC-37-004** | Ship acknowledgement endpoint + Task Runner callback client, resume token handling, and metrics/dashboards. |
## Open Questions
1. Who owns approval resume callback (Task Runner Worker vs Orchestrator)? Resolve before NOTIFY-SVC-37-004.
2. Should approvals generate incidents in existing incident schema or dedicated table? Decision impacts PostgreSQL schema design.
3. Authority scopes for approval ingestion/ack reuse `packs.approve` or introduce `packs.approve:notify`? Coordinate with Authority team.

View File

@@ -0,0 +1,160 @@
# Notifications Rules
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Rules decide which platform events deserve a notification, how aggressively they should be throttled, and which channels/actions should run. They are tenant-scoped contracts that guarantee deterministic routing across Notify.Worker replicas.
---
## 1. Rule lifecycle
1. **Authoring.** Operators create or update rules through the Notify WebService (`POST /rules`, `PATCH /rules/{id}`) or UI. Payloads are normalised to the current `NotifyRule` schema version.
2. **Evaluation.** Notify.Worker evaluates enabled rules per incoming event. Tenancy is enforced first, followed by match filters, VEX gates, throttles, and digest handling.
3. **Delivery.** Matching actions are enqueued with an idempotency key to prevent storm loops. Throttle rejections and digest coalescing are recorded in the delivery ledger.
4. **Audit.** Every change carries `createdBy`/`updatedBy` plus timestamps; the delivery ledger references `ruleId`/`actionId` for traceability.
---
## 2. Rule schema reference
| Field | Type | Notes |
|-------|------|-------|
| `ruleId` | string | Stable identifier; clients may provide UUID/slug. |
| `tenantId` | string | Must match the tenant header supplied when the rule is created. |
| `name` | string | Display label shown in UI and audits. |
| `description` | string? | Optional operator-facing note. |
| `enabled` | bool | Disabled rules remain stored but skipped during evaluation. |
| `labels` | map<string,string> | Sorted, trimmed key/value tags supporting filtering. |
| `metadata` | map<string,string> | Reserved for automation; stored verbatim (sorted). |
| `match` | [`NotifyRuleMatch`](#3-match-filters) | Declarative filters applied before actions execute. |
| `actions[]` | [`NotifyRuleAction`](#4-actions-throttles-and-digests) | Ordered set of channel dispatchers; minimum one. |
| `createdBy`/`createdAt` | string?, instant | Populated automatically when omitted. |
| `updatedBy`/`updatedAt` | string?, instant | Defaults to creation values when unspecified. |
| `schemaVersion` | string | Auto-upgraded during persistence; use for migrations. |
Rules are immutable snapshots; updates produce a full document write so workers observing change streams can refresh caches deterministically.
---
## 3. Match filters
`NotifyRuleMatch` narrows which events trigger the rule. All string collections are trimmed, deduplicated, and sorted to guarantee deterministic evaluation.
| Field | Type | Behaviour |
|-------|------|-----------|
| `eventKinds[]` | string | Lower-cased; supports any canonical Notify event (`scanner.report.ready`, `scheduler.rescan.delta`, `zastava.admission`, etc.). Empty list matches all kinds. |
| `namespaces[]` | string | Exact match against `event.scope.namespace`. Supports glob-style filters via upstream enrichment (planned). |
| `repositories[]` | string | Matches `event.scope.repo`. |
| `digests[]` | string | Lower-cased; matches `event.scope.digest`. |
| `labels[]` | string | Matches event attributes or delta labels (`kev`, `critical`, `license`, …). |
| `componentPurls[]` | string | Matches component identifiers inside the event payload when provided. |
| `minSeverity` | string? | Lower-cased severity gate (e.g., `medium`, `high`, `critical`). Evaluated on new findings inside event deltas; events lacking severity bypass this gate unless set. |
| `verdicts[]` | string | Accepts scan/report verdicts (`fail`, `warn`, `block`, `escalate`, `deny`). |
| `kevOnly` | bool? | When `true`, only KEV-tagged findings fire. |
| `vex` | object | Additional gating aligned with VEX consensus; see below. |
### 3.1 VEX gates
`NotifyRuleMatchVex` offers fine-grained control when VEX findings accompany events:
| Field | Default | Effect |
|-------|---------|--------|
| `includeAcceptedJustifications` | `true` | Include findings marked `not_affected`/`acceptable` in consensus. |
| `includeRejectedJustifications` | `false` | Surface findings the consensus rejected. |
| `includeUnknownJustifications` | `false` | Allow findings without explicit justification. |
| `justificationKinds[]` | `[]` | Optional allow-list of justification codes (e.g., `exploit_observed`, `component_not_present`). |
If the VEX block filters out every applicable finding, the rule is treated as a non-match and no actions run.
---
## 4. Actions, throttles, and digests
Each rule requires at least one action. Actions are deduplicated and sorted by `actionId`, so prefer deterministic identifiers.
| Field | Type | Notes |
|-------|------|-------|
| `actionId` | string | Stable identifier unique within the rule. |
| `channel` | string | Reference to a channel (`channelId`) configured in `/channels`. |
| `template` | string? | Template key to use for rendering; falls back to channel default when omitted. |
| `digest` | string? | Digest window key (`instant`, `5m`, `15m`, `1h`, `1d`). `instant` bypasses coalescing. |
| `throttle` | ISO8601 duration? | Optional throttle TTL (`PT300S`, `PT1H`). Prevents duplicate deliveries when the same idempotency hash appears before expiry. |
| `locale` | string? | BCP-47 tag (stored lower-case). Template lookup falls back to channel locale then `en-us`. |
| `enabled` | bool | Disabled actions skip rendering but remain stored. |
| `metadata` | map<string,string> | Connector-specific hints (priority, layout, etc.). |
### 4.0 Attestation lifecycle templates
Rules targeting attestation/signing events (`attestor.verification.failed`, `attestor.attestation.expiring`, `authority.keys.revoked`, `attestor.transparency.anomaly`) must reference the dedicated template keys documented in [`templates.md` §7](templates.md#7-attestation--signing-lifecycle-templates-notify-attest-74-001) so payloads remain deterministic across channels and Offline Kits:
| Event kind | Required template key | Notes |
| --- | --- | --- |
| `attestor.verification.failed` | `tmpl-attest-verify-fail` | Include failure code, Rekor UUID/index, last good attestation link. |
| `attestor.attestation.expiring` | `tmpl-attest-expiry-warning` | Surface issued/expires timestamps, time remaining, renewal instructions. |
| `authority.keys.revoked` / `authority.keys.rotated` | `tmpl-attest-key-rotation` | List rotation batch ID, impacted services, remediation steps. |
| `attestor.transparency.anomaly` | `tmpl-attest-transparency-anomaly` | Highlight Rekor/witness metadata and anomaly classification. |
Locale-specific variants keep the same template key while varying `locale`; rule actions shouldn't create ad-hoc templates for these events.
### 4.1 Evaluation order
1. Verify channel exists and is enabled; disabled channels mark the delivery as `Dropped`.
2. Apply throttle idempotency key: `hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket)`. Hits are logged as `Throttled`.
3. If the action defines a digest window other than `instant`, append the event to the open window and defer delivery until flush.
4. When delivery proceeds, the renderer resolves the template, locale, and metadata before invoking the connector.
---
## 5. Example rule payload
```json
{
"ruleId": "rule-critical-soc",
"tenantId": "tenant-dev",
"name": "Critical scanner verdicts",
"description": "Route KEV-tagged critical findings to SOC Slack with zero delay.",
"enabled": true,
"match": {
"eventKinds": ["scanner.report.ready"],
"labels": ["kev", "critical"],
"minSeverity": "critical",
"verdicts": ["fail", "block"],
"kevOnly": true
},
"actions": [
{
"actionId": "act-slack-critical",
"channel": "chn-slack-soc",
"template": "tmpl-critical",
"digest": "instant",
"throttle": "PT300S",
"locale": "en-us",
"metadata": {
"priority": "p1"
}
}
],
"labels": {
"owner": "soc"
},
"metadata": {
"revision": "12"
}
}
```
Dry-run calls (`POST /rules/{id}/test`) accept the same structure along with a sample Notify event payload to exercise match logic without invoking connectors.
---
## 6. Operational guidance
- Keep rule scopes narrow (namespace/repository) before relying on severity gates; this minimises noise and improves digest summarisation.
- Always configure a throttle window for instant actions to protect against repeated upstream retries.
- Use rule labels to organise dashboards and access control (e.g., `owner:soc`, `env:prod`).
- Prefer tenant-specific rule IDs so Offline Kit exports remain deterministic across environments.
- If a rule depends on derived metadata (e.g., policy verdict tags), list those dependencies in the rule description for audit readiness.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -0,0 +1,3 @@
# Notify Schemas Catalog
Placeholder for NR1 deliverables: versioned JSON Schemas for Notify event envelopes, rules, templates, channels, receipts, and webhooks. Publish `notify-schemas-catalog.json` + `.dsse.json` here with canonicalization recipe (BLAKE3-256 over normalized JSON) and `inputs.lock` capturing digests.

View File

@@ -0,0 +1,20 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/channel.schema.json",
"title": "Notify Channel Configuration",
"type": "object",
"required": ["schema_version", "tenant_id", "channel_id", "kind", "config"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"channel_id": { "type": "string", "pattern": "^[A-Z0-9_-]{4,64}$" },
"kind": { "type": "string", "enum": ["email", "slack", "teams", "webhook", "sms"] },
"config": { "type": "object" },
"secrets_ref": { "type": "object", "additionalProperties": { "type": "string" } },
"rate_limit": { "type": "object" },
"enabled": { "type": "boolean", "default": true },
"created_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,20 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/dlq-notify.schema.json",
"title": "Notify Dead Letter Entry",
"type": "object",
"required": ["schema_version", "tenant_id", "delivery_id", "reason", "payload", "first_failed_at"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"delivery_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"reason": { "type": "string" },
"payload": { "type": "object" },
"backoff_attempts": { "type": "integer", "minimum": 0 },
"dedupe_key": { "type": "string" },
"first_failed_at": { "type": "string", "format": "date-time" },
"last_failed_at": { "type": "string", "format": "date-time" },
"redrive_after": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,26 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/event-envelope.schema.json",
"title": "Notify Event Envelope",
"type": "object",
"required": [
"schema_version",
"tenant_id",
"event_id",
"occurred_at",
"kind",
"payload"
],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"event_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"occurred_at": { "type": "string", "format": "date-time" },
"kind": { "type": "string", "minLength": 1 },
"correlation_id": { "type": "string" },
"source": { "type": "string" },
"payload": { "type": "object" },
"attributes": { "type": "object", "additionalProperties": { "type": ["string", "number", "boolean", "null"] } }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,14 @@
{
"catalog": "notify-schemas-catalog.json",
"hash_algorithm": "blake3-256",
"canonicalization": "json-normalized-utf8",
"entries": [
{ "file": "event-envelope.schema.json", "digest": "0534e778a7e24dfdcbdc66cec2902f24684ec0bdf26d708ab9bca98e6674a318" },
{ "file": "rule.schema.json", "digest": "34d4f1c2ba97b76acf85ad61f4e8de4591664eefecbc7ebb6d168aa5a998ddd1" },
{ "file": "template.schema.json", "digest": "e0a8f9bb5e5f29a11b040e7cb0e7e9a8c5d42256f9a4bd72f79460eb613dac52" },
{ "file": "channel.schema.json", "digest": "bd9e2dfb4e6e7e7a38f26cc94ae8bcdf9b8c44b1e97bf78c146711783fe8fa2b" },
{ "file": "receipt.schema.json", "digest": "fb4431019b3803081983b215fc9ca2e7618c3cf91f8274baedf72cacad8dfe46" },
{ "file": "webhook.schema.json", "digest": "54a6e0d956fd6af7e88f6508bda78221ca04cfedea4112bfefc7fa5dbfa45c09" },
{ "file": "dlq-notify.schema.json", "digest": "1330e589245b923f6e1fea6af080b7b302a97effa360a90dbef4ba3b06021b2f" }
]
}

View File

@@ -0,0 +1,11 @@
{
"payloadType": "application/vnd.notify.schema-catalog+json",
"payload": "eyJjYW5vbmljYWxpemF0aW9uIjoianNvbi1ub3JtYWxpemVkLXV0ZjgiLCJjYXRhbG9nX3ZlcnNpb24iOiJ2MS4wIiwiZ2VuZXJhdGVkX2F0IjoiMjAyNS0xMi0wNFQwMDowMDowMFoiLCJoYXNoX2FsZ29yaXRobSI6ImJsYWtlMy0yNTYiLCJzY2hlbWFzIjpbeyJkaWdlc3QiOiIwNTM0ZTc3OGE3ZTI0ZGZkY2JkYzY2Y2VjMjkwMmYyNDY4NGVjMGJkZjI2ZDcwOGFiOWJjYTk4ZTY2NzRhMzE4IiwiZmlsZSI6ImV2ZW50LWVudmVsb3BlLnNjaGVtYS5qc29uIiwiaWQiOiJldmVudC1lbnZlbG9wZSIsInZlcnNpb24iOiJ2MS4wIn0seyJkaWdlc3QiOiIzNGQ0ZjFjMmJhOTdiNzZhY2Y4NWFkNjFmNGU4ZGU0NTkxNjY0ZWVmZWNiYzdlYmI2ZDE2OGFhNWE5OThkZGQxIiwiZmlsZSI6InJ1bGUuc2NoZW1hLmpzb24iLCJpZCI6InJ1bGUiLCJ2ZXJzaW9uIjoidjEuMCJ9LHsiZGlnZXN0IjoiZTBhOGY5YmI1ZTVmMjlhMTFiMDQwZTdjYjBlN2U5YThjNWQ0MjI1NmY5YTRiZDcyZjc5NDYwZWI2MTNkYWM1MiIsImZpbGUiOiJ0ZW1wbGF0ZS5zY2hlbWEuanNvbiIsImlkIjoidGVtcGxhdGUiLCJ2ZXJzaW9uIjoidjEuMCJ9LHsiZGlnZXN0IjoiYmQ5ZTJkZmI0ZTZlN2U3YTM4ZjI2Y2M5NGFlOGJjZGY5YjhjNDRiMWU5N2JmNzhjMTQ2NzExNzgzZmU4ZmEyYiIsImZpbGUiOiJjaGFubmVsLnNjaGVtYS5qc29uIiwiaWQiOiJjaGFubmVsIiwidmVyc2lvbiI6InYxLjAifSx7ImRpZ2VzdCI6ImZiNDQzMTAxOWIzODAzMDgxOTgzYjIxNWZjOWNhMmU3NjE4YzNjZjkxZjgyNzRiYWVkZjcyY2FjYWQ4ZGZlNDYiLCJmaWxlIjoicmVjZWlwdC5zY2hlbWEuanNvbiIsImlkIjoicmVjZWlwdCIsInZlcnNpb24iOiJ2MS4wIn0seyJkaWdlc3QiOiI1NGE2ZTBkOTU2ZmQ2YWY3ZTg4ZjY1MDhiZGE3ODIyMWNhMDRjZmVkZWE0MTEyYmZlZmM3ZmE1ZGJmYTQ1YzA5IiwiZmlsZSI6IndlYmhvb2suc2NoZW1hLmpzb24iLCJpZCI6IndlYmhvb2siLCJ2ZXJzaW9uIjoidjEuMCJ9LHsiZGlnZXN0IjoiMTMzMGU1ODkyNDViOTIzZjZlMWZlYTZhZjA4MGI3YjMwMmE5N2VmZmEzNjBhOTBkYmVmNGJhM2IwNjAyMWIyZiIsImZpbGUiOiJkbHEtbm90aWZ5LnNjaGVtYS5qc29uIiwiaWQiOiJkbHEiLCJ2ZXJzaW9uIjoidjEuMCJ9XX0=",
"signatures": [
{
"sig": "99WPzzc6sCaEQHXk2B15aLxtG/Ics6qsgHYa2oDTI1g=",
"keyid": "notify-dev-hmac-001",
"signedAt": "2025-12-04T21:12:53+00:00"
}
]
}

View File

@@ -0,0 +1,15 @@
{
"catalog_version": "v1.0",
"hash_algorithm": "blake3-256",
"canonicalization": "json-normalized-utf8",
"generated_at": "2025-12-04T00:00:00Z",
"schemas": [
{ "id": "event-envelope", "file": "event-envelope.schema.json", "version": "v1.0", "digest": "0534e778a7e24dfdcbdc66cec2902f24684ec0bdf26d708ab9bca98e6674a318" },
{ "id": "rule", "file": "rule.schema.json", "version": "v1.0", "digest": "34d4f1c2ba97b76acf85ad61f4e8de4591664eefecbc7ebb6d168aa5a998ddd1" },
{ "id": "template", "file": "template.schema.json", "version": "v1.0", "digest": "e0a8f9bb5e5f29a11b040e7cb0e7e9a8c5d42256f9a4bd72f79460eb613dac52" },
{ "id": "channel", "file": "channel.schema.json", "version": "v1.0", "digest": "bd9e2dfb4e6e7e7a38f26cc94ae8bcdf9b8c44b1e97bf78c146711783fe8fa2b" },
{ "id": "receipt", "file": "receipt.schema.json", "version": "v1.0", "digest": "fb4431019b3803081983b215fc9ca2e7618c3cf91f8274baedf72cacad8dfe46" },
{ "id": "webhook", "file": "webhook.schema.json", "version": "v1.0", "digest": "54a6e0d956fd6af7e88f6508bda78221ca04cfedea4112bfefc7fa5dbfa45c09" },
{ "id": "dlq", "file": "dlq-notify.schema.json", "version": "v1.0", "digest": "1330e589245b923f6e1fea6af080b7b302a97effa360a90dbef4ba3b06021b2f" }
]
}

View File

@@ -0,0 +1,21 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/receipt.schema.json",
"title": "Notify Delivery Receipt",
"type": "object",
"required": ["schema_version", "tenant_id", "delivery_id", "rule_id", "channel", "status", "sent_at"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"delivery_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"rule_id": { "type": "string" },
"channel": { "type": "string" },
"status": { "type": "string", "enum": ["sent", "delivered", "failed", "queued", "acknowledged"] },
"attempt": { "type": "integer", "minimum": 1 },
"sent_at": { "type": "string", "format": "date-time" },
"ack_url": { "type": "string", "format": "uri" },
"response": { "type": "object" },
"errors": { "type": "array", "items": { "type": "string" } }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,37 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/rule.schema.json",
"title": "Notify Rule",
"type": "object",
"required": [
"schema_version",
"tenant_id",
"rule_id",
"name",
"sources",
"predicates",
"actions",
"approvals_required"
],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"rule_id": { "type": "string", "pattern": "^[A-Z0-9_-]{4,64}$" },
"name": { "type": "string", "minLength": 1 },
"description": { "type": "string" },
"severity": { "type": "string", "enum": ["info", "low", "medium", "high", "critical"] },
"sources": { "type": "array", "items": { "type": "string" }, "minItems": 1 },
"predicates": { "type": "array", "items": { "type": "object" }, "minItems": 1 },
"actions": {
"type": "array",
"items": { "type": "object" },
"minItems": 1
},
"approvals_required": { "type": "integer", "minimum": 0, "maximum": 3 },
"quiet_hours": { "type": "array", "items": { "type": "string" } },
"simulation_required": { "type": "boolean", "default": true },
"created_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,22 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/template.schema.json",
"title": "Notify Template",
"type": "object",
"required": ["schema_version", "tenant_id", "template_id", "channel", "locale", "body"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"template_id": { "type": "string", "pattern": "^[A-Z0-9_-]{4,64}$" },
"channel": { "type": "string", "enum": ["email", "slack", "teams", "webhook", "sms"] },
"locale": { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
"subject": { "type": "string" },
"body": { "type": "string" },
"helpers": { "type": "object", "additionalProperties": { "type": "string" } },
"merge_fields": { "type": "array", "items": { "type": "string" } },
"preview_hash": { "type": "string" },
"created_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,20 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/webhook.schema.json",
"title": "Notify Webhook Payload",
"type": "object",
"required": ["schema_version", "tenant_id", "delivery_id", "signature", "body"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"delivery_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"signature": { "type": "string" },
"hmac_id": { "type": "string" },
"body": { "type": "object" },
"sent_at": { "type": "string", "format": "date-time" },
"nonce": { "type": "string" },
"audience": { "type": "string" },
"expires_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,3 @@
# Notify Security Notes
Holds NR2, NR6, and NR7 artefacts: tenant/RBAC approval matrix, webhook/ack hardening policy (HMAC/mTLS/DPoP + signed acks), and redaction/PII catalog with sanitized fixture samples.

View File

@@ -0,0 +1,6 @@
# Redaction and PII catalog (NR7)
- Classify merge fields: identifiers (hash), secrets (strip), PII (mask), operational metadata (retain).
- Storage and previews must use redacted forms by default; full bodies allowed only with `Notify.Audit` permission.
- Log payloads must omit secrets; hashes use BLAKE3-256 over UTF-8 normalized values.
- Fixtures under `docs/modules/notify/fixtures/redaction/` show expected redacted shapes for templates and receipts.

View File

@@ -0,0 +1,6 @@
# Tenant scoping and approvals (NR2)
- All Notify APIs require `tenant_id` in request and ledger records.
- High-impact actions (escalations, PII-bearing templates, cross-tenant fan-out) need N-of-M approvals: default 2 of 3 approvers with `Notify.Approver` role.
- Approvals captured as DSSE-signed records (future hook) and stored alongside rule change requests.
- Rejection reasons must be logged and returned in error payloads; audit log keeps requester, approver IDs, timestamps, and rule/template IDs.

View File

@@ -0,0 +1,6 @@
# Webhook and ack security (NR6)
- Webhooks must use HMAC-SHA256 with per-tenant rotating secrets or mTLS/DPoP. `hmac_id` maps to secret material.
- Ack URLs carry signed tokens (nonce, audience, tenant_id, delivery_id, expires_at) and are single-use. Reject replay or expired tokens.
- Enforce allowlists for domains and paths per tenant; deny wildcards.
- Capture failures in observability pipeline and DLQ with redrive after investigation.

View File

@@ -0,0 +1 @@
{"rule_id":"RULE-INCIDENT","tenant_id":"tenant-123","simulation_report":"sample-simulation-report.json","status":"passed"}

View File

@@ -0,0 +1,28 @@
{
"rule_id": "sample-rule",
"simulation_id": "sim-2025-12-04-001",
"executed_at": "2025-12-04T00:00:00Z",
"tenant_id": "test-tenant",
"fixtures_used": [
"docs/notifications/fixtures/rendering/tmpl-incident-start.email.en-US.json"
],
"channels_tested": ["email", "slack"],
"results": {
"events_processed": 10,
"deliveries_simulated": 20,
"delivery_success": 20,
"delivery_failure": 0,
"quota_blocked": 0,
"redaction_applied": true
},
"determinism_check": {
"hash_algorithm": "blake3-256",
"output_digest": "0000000000000000000000000000000000000000000000000000000000000000",
"verified": true
},
"approval_required": true,
"approval_status": "pending",
"evidence_links": [
"docs/notifications/fixtures/rendering/index.ndjson"
]
}

View File

@@ -0,0 +1,13 @@
{
"rule_id": "RULE-INCIDENT",
"tenant_id": "tenant-123",
"fixtures_version": "v1",
"result": "passed",
"evidence": [
{
"event_id": "evt-1",
"decision": "send",
"channel": "email"
}
]
}

View File

@@ -0,0 +1,86 @@
# Notifier Telemetry SLO Webhook Schema (1.0.0)
Purpose: define the payload emitted by Telemetry SLO evaluators toward Notifier so that NOTIFY-OBS-51-001 can consume alerts deterministically (online and offline).
## Delivery contract
- Content-Type: `application/json`
- Encoding: UTF-8
- Authentication: mTLS (service identity) or DPoP/JWT with `aud` = `notifier` and `scope` = `obs:slo:ingest`.
- Determinism: timestamps are UTC ISO-8601 with `Z`; field order stable for hashing (see canonical JSON below).
## Payload fields
```
{
"id": "uuid",
"tenant": "string", // required; aligns with orchestrator/telemetry tenant id
"service": "string", // logical service name
"host": "string", // optional; k8s node/hostname
"slo": {
"name": "string", // human-readable
"id": "string", // immutable key used for dedupe
"objective": {
"window": "PT5M", // ISO-8601 duration
"target": 0.995 // decimal between 0 and 1
}
},
"metric": {
"type": "latency|error|availability|custom",
"value": 0.0123, // double; units depend on type
"unit": "seconds|ratio|percent|count",
"labels": { // sanitized, deterministic ordering when serialized
"endpoint": "/api/jobs",
"method": "GET"
}
},
"window": {
"start": "2025-11-19T12:00:00Z",
"end": "2025-11-19T12:05:00Z"
},
"breach": {
"state": "breaching|warning|ok",
"reason": "p95 latency above objective",
"evidence": [
{
"type": "timeseries",
"href": "cas://telemetry/series/abc123",
"hash": "sha256:..."
}
]
},
"quietHours": {
"active": false,
"policyId": null
},
"trace": {
"trace_id": "optional-trace-id",
"span_id": "optional-span-id"
},
"version": "1.0.0",
"issued_at": "2025-11-19T12:05:07Z"
}
```
### Canonical JSON rules
- Sort object keys lexicographically before hashing/signing.
- Use lowercase for enum-like fields shown above.
- `version` is required for evolution; new fields must be add-only.
### Retry and idempotency
- `id` is the idempotency key; Notifier treats duplicates as no-op.
- Producers retry with exponential backoff up to 10 minutes; consumers respond 2xx only after persistence.
### Validation checklist (for tests/CI)
- Required fields: id, tenant, service, slo.id, slo.objective.window, slo.objective.target, metric.type, metric.value, window.start/end, breach.state, version, issued_at.
- Timestamps parse with `DateTimeStyles.RoundtripKind`.
- When `breach.state=ok`, `breach.reason` may be null but object must exist.
- `quietHours.active=true` must include `policyId`.
### Sample canonical JSON (minified)
```
{"breach":{"evidence":[],"reason":"p99 latency above objective","state":"breaching"},"host":"orchestrator-0","id":"8c1d58c4-b1de-4b3c-9c7b-40a6b0f8d4c1","issued_at":"2025-11-19T12:05:07Z","metric":{"labels":{"endpoint":"/api/jobs","method":"GET"},"type":"latency","unit":"seconds","value":1.234},"quietHours":{"active":false,"policyId":null},"service":"orchestrator","slo":{"id":"orch-api-latency-p99","name":"Orchestrator API p99","objective":{"target":0.99,"window":"PT5M"}},"tenant":"default","trace":{"span_id":null,"trace_id":null},"version":"1.0.0","window":{"end":"2025-11-19T12:05:00Z","start":"2025-11-19T12:00:00Z"}}
```
### Evidence to surface in sprint tasks
- File: `docs/modules/notify/slo-webhook-schema.md` (this document).
- Sample payload (canonical) and validation checklist above.
- Dependencies: upstream Telemetry evaluator must emit `metric.labels` sanitized; Notifier to persist `id` for idempotency.

View File

@@ -0,0 +1,239 @@
# Notifications Templates
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Templates shape the payload rendered for each channel when a rule action fires. They are deterministic, locale-aware artefacts stored alongside rules so Notify.Worker replicas can render identical messages regardless of environment.
---
## 1. Template lifecycle
1. **Authoring.** Operators create templates via the API (`POST /templates`) or UI. Each template binds to a channel type (`Slack`, `Teams`, `Email`, `Webhook`, `Custom`) and a locale.
2. **Reference.** Rule actions opt in by referencing the template key (`actions[].template`). Channel defaults apply when no template is specified.
3. **Rendering.** During delivery, the worker resolves the template (locale fallbacks included), executes it using the safe Handlebars-style engine, and passes the rendered payload plus metadata to the connector.
4. **Audit.** Rendered payloads stored in the delivery ledger include the `templateId` so operators can trace which text was used.
---
## 2. Template schema reference
| Field | Type | Notes |
|-------|------|-------|
| `templateId` | string | Stable identifier (UUID/slug). |
| `tenantId` | string | Must match the tenant header in API calls. |
| `channelType` | [`NotifyChannelType`](../modules/notify/architecture.md#5-channels--connectors-plug-ins) | Determines connector payload envelope. |
| `key` | string | Human-readable key referenced by rules (`tmpl-critical`). |
| `locale` | string | BCP-47 tag, stored lower-case (`en-us`, `bg-bg`). |
| `body` | string | Template body; rendered strictly without executing arbitrary code. |
| `renderMode` | enum | `Markdown`, `Html`, `AdaptiveCard`, `PlainText`, or `Json`. Guides connector sanitisation. |
| `format` | enum | `Slack`, `Teams`, `Email`, `Webhook`, or `Json`. Signals delivery payload structure. |
| `description` | string? | Optional operator note. |
| `metadata` | map<string,string> | Sorted map for automation (layout hints, fallback text). |
| `createdBy`/`createdAt` | string?, instant | Auto-populated. |
| `updatedBy`/`updatedAt` | string?, instant | Auto-populated. |
| `schemaVersion` | string | Auto-upgraded on persistence. |
Templates are normalised: string fields trimmed, locale lower-cased, metadata sorted to preserve determinism.
---
## 3. Variables, helpers, and context
Templates receive a structured context derived from the Notify event, rule match, and rendering metadata.
| Path | Description |
|------|-------------|
| `event.*` | Canonical event envelope (`kind`, `tenant`, `ts`, `actor`). |
| `event.scope.*` | Namespace, repository, digest, image, component identifiers, labels, attributes. |
| `payload.*` | Raw event payload (e.g., `payload.verdict`, `payload.delta.*`, `payload.links.*`). |
| `rule.*` | Rule descriptor (`ruleId`, `name`, `labels`, `metadata`). |
| `action.*` | Action descriptor (`actionId`, `channel`, `digest`, `throttle`, `metadata`). |
| `policy.*` | Policy metadata when supplied (`revisionId`, `name`). |
| `topFindings[]` | Top-N findings summarised for convenience (vulnerability ID, severity, reachability). |
| `digest.*` | When rendering digest flushes: `window`, `openedAt`, `itemCount`. |
Built-in helpers mirror the architecture dossier:
| Helper | Usage |
|--------|-------|
| `severity_icon severity` | Returns emoji/text badge representing severity. |
| `link text url` | Produces channel-safe hyperlink. |
| `pluralize count "finding"` | Adds plural suffix when `count != 1`. |
| `truncate text maxLength` | Cuts strings while preserving determinism. |
| `code text` | Formats inline code (Markdown/HTML aware). |
Connectors may expose additional helpers via partials, but must remain deterministic and side-effect free.
---
## 4. Sample templates
### 4.1 Slack (Markdown + block kit)
```hbs
{{#*inline "findingLine"}}
- {{severity_icon severity}} {{vulnId}} ({{severity}}) in `{{component}}`
{{/inline}}
*:rotating_light: {{payload.summary.total}} findings {{#if payload.delta.newCritical}}(new critical: {{payload.delta.newCritical}}){{/if}}*
{{#if topFindings.length}}
Top findings:
{{#each topFindings}}{{> findingLine}}{{/each}}
{{/if}}
{{link "Open report in Console" payload.links.ui}}
```
### 4.2 Email (HTML + text alternative)
```hbs
<h2>{{payload.verdict}} for {{event.scope.repo}}</h2>
<p>{{payload.summary.total}} findings ({{payload.summary.blocked}} blocked, {{payload.summary.warned}} warned)</p>
<table>
<thead><tr><th>Finding</th><th>Severity</th><th>Package</th></tr></thead>
<tbody>
{{#each topFindings}}
<tr>
<td>{{this.vulnId}}</td>
<td>{{this.severity}}</td>
<td>{{this.component}}</td>
</tr>
{{/each}}
</tbody>
</table>
<p>{{link "View full analysis" payload.links.ui}}</p>
```
When delivering via email, connectors automatically attach a plain-text alternative derived from the rendered content to preserve accessibility.
---
## 5. Preview and validation
- `POST /channels/{id}/test` accepts an optional `templateId` and sample payload to produce a rendered preview without dispatching the event. Results include channel type, target, title/summary, locale, body hash, and connector metadata.
- UI previews rely on the same API and highlight connector fallbacks (e.g., Teams adaptive card vs. text fallback).
- Offline Kit scenarios can call `/internal/notify/templates/normalize` to ensure bundled templates match the canonical schema before packaging.
---
## 6. Best practices
- Keep channel-specific limits in mind (Slack block/character quotas, Teams adaptive card size, email line length). Lean on digests to summarise long lists.
- Provide locale-specific versions for high-volume tenants; Notify selects the closest locale, falling back to `en-us`.
- Store connector-specific hints (`metadata.layout`, `metadata.emoji`) in template metadata rather than rules when they affect rendering.
- Version template bodies through metadata (e.g., `metadata.revision: "2025-10-28"`) so tenants can track changes over time.
- Run test previews whenever introducing new helpers to confirm body hashes remain stable across environments.
---
## 7. Attestation & signing lifecycle templates (NOTIFY-ATTEST-74-001)
Attestation lifecycle events (verification failures, expiring attestations, key revocations, transparency anomalies) reuse the same structural context so operators can differentiate urgency while reusing channels. Every template **must** surface:
- **Subject** (`payload.subject.digest`, `payload.subject.repository`, `payload.subject.tag`).
- **Attestation metadata** (`payload.attestation.kind`, `payload.attestation.id`, `payload.attestation.issuedAt`, `payload.attestation.expiresAt`).
- **Signer/Key fingerprint** (`payload.signer.kid`, `payload.signer.algorithm`, `payload.signer.rotationId`).
- **Traceability** (`payload.links.console`, `payload.links.rekor`, `payload.links.docs`).
### 7.1 Template keys & channels
| Event | Template key | Required channels | Optional channels | Notes |
| --- | --- | --- | --- | --- |
| Verification failure (`attestor.verification.failed`) | `tmpl-attest-verify-fail` | Slack `sec-alerts`, Email `supply-chain@`, Webhook (Pager/SOC) | Teams `risk-war-room`, Custom SIEM feed | Include failure code, Rekor UUID, last-known good attestation link. |
| Expiring attestation (`attestor.attestation.expiring`) | `tmpl-attest-expiry-warning` | Email summary, Slack reminder | Digest window (daily) | Provide expiration window, renewal instructions, `expiresIn` helper. |
| Key revocation/rotation (`authority.keys.revoked`, `authority.keys.rotated`) | `tmpl-attest-key-rotation` | Email + Webhook | Slack (if SOC watches channel) | Add rotation batch ID, impacted tenants/services, remediation steps. |
| Transparency anomaly (`attestor.transparency.anomaly`) | `tmpl-attest-transparency-anomaly` | Slack high-priority, Webhook, PagerDuty | Email follow-up | Show Rekor index delta, witness ID, anomaly classification, recommended actions. |
Assign these keys when creating templates so rule actions can reference them deterministically (`actions[].template: "tmpl-attest-verify-fail"`).
### 7.2 Context helpers
- `attestation_status_badge status`: renders ✅/⚠️/❌ depending on verdict (`valid`, `expiring`, `failed`).
- `expires_in expiresAt now`: returns human-readable duration, constrained to deterministic units (h/d).
- `fingerprint key`: shortens long key IDs/pems, exposing the last 10 characters.
### 7.3 Slack sample (verification failure)
```hbs
:rotating_light: {{attestation_status_badge payload.failure.status}} verification failed for `{{payload.subject.digest}}`
Signer: `{{fingerprint payload.signer.kid}}` ({{payload.signer.algorithm}})
Reason: `{{payload.failure.reasonCode}}` — {{payload.failure.reason}}
Last valid attestation: {{link "Console report" payload.links.console}}
Rekor entry: {{link "Transparency log" payload.links.rekor}}
```
### 7.4 Email sample (expiring attestation)
```hbs
<h2>Attestation expiry notice</h2>
<p>The attestation for <code>{{payload.subject.repository}}</code> (digest {{payload.subject.digest}}) expires on <strong>{{payload.attestation.expiresAt}}</strong>.</p>
<ul>
<li>Issued: {{payload.attestation.issuedAt}}</li>
<li>Signer: {{payload.signer.kid}} ({{payload.signer.algorithm}})</li>
<li>Time remaining: {{expires_in payload.attestation.expiresAt event.ts}}</li>
</ul>
<p>Please rotate the attestation before expiry. Reference <a href="{{payload.links.docs}}">renewal steps</a>.</p>
```
### 7.5 Webhook sample (transparency anomaly)
```json
{
"event": "attestor.transparency.anomaly",
"tenantId": "{{event.tenant}}",
"subjectDigest": "{{payload.subject.digest}}",
"rekorIndex": "{{payload.transparency.rekorIndex}}",
"witnessId": "{{payload.transparency.witnessId}}",
"anomaly": "{{payload.transparency.classification}}",
"detailsUrl": "{{payload.links.console}}",
"recommendation": "{{payload.recommendation}}"
}
```
### 7.6 Offline kit guidance
- Bundle these templates (JSON export) under `offline/notifier/templates/attestation/`.
- Baseline English templates for Slack, Email, and Webhook ship in the repository at `offline/notifier/templates/attestation/*.template.json`; copy and localise them per tenant as needed.
- Provide localized variants for `en-us` and `de-de` at minimum; additional locales can be appended per customer.
- Include preview fixtures in Offline Kit smoke tests to guarantee channel render parity when air-gapped.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
## 8. Incident mode templates (NOTIFY-OBS-55-001)
Incident toggles are high-noise events that must pierce quiet hours and include audit-ready context. Use dedicated templates so downstream tooling can distinguish activation vs. recovery and surface the required evidence.
**Required context keys**
- `payload.incidentId`, `payload.reason`, `payload.startedAt` / `payload.stoppedAt`.
- `payload.links.trace` (root cause trace/span), `payload.links.evidence` (timeline/export bundle), `payload.links.timeline`.
- `payload.retentionDays` (active) and `payload.retentionBaselineDays` (post-incident).
- `payload.quietHoursOverride` (boolean) to justify bypassing quiet hours.
- `payload.legal.jurisdiction`, `payload.legal.ticket`, `payload.legal.logPath` for compliance logging.
**Template keys**
- `tmpl-incident-start` — activation notice.
- `tmpl-incident-stop` — recovery/cleanup notice.
**Slack sample (start)**
```hbs
:rotating_light: Incident mode activated for {{payload.incidentId}}
Reason: {{payload.reason}}
Trace: {{link "root span" payload.links.trace}} · Evidence: {{link "bundle" payload.links.evidence}}
Retention extended to {{payload.retentionDays}} days (baseline {{payload.retentionBaselineDays}})
Quiet hours overridden: {{payload.quietHoursOverride}}
Legal: {{payload.legal.jurisdiction}} (ticket {{payload.legal.ticket}})
```
**Email sample (stop)**
```hbs
<h2>Incident mode cleared: {{payload.incidentId}}</h2>
<p>Stopped at {{payload.stoppedAt}} — retention reset to {{payload.retentionBaselineDays}} days.</p>
<p>Timeline: {{link "view timeline" payload.links.timeline}} · Audit log: {{payload.legal.logPath}}</p>
```
See `src/Notifier/StellaOps.Notifier/docs/incident-mode-rules.sample.json` for ready-to-import rules referencing these templates with quiet-hour overrides and legal logging metadata.

View File

@@ -17,7 +17,7 @@ Platform module describes cross-cutting architecture, contracts, and guardrails
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -2,7 +2,7 @@
> **Ownership:** Architecture Guild • Docs Guild
> **Audience:** Service owners, platform engineers, solution architects
> **Related:** [High-Level Architecture](../../07_HIGH_LEVEL_ARCHITECTURE.md), [Concelier Architecture](../concelier/architecture.md), [Policy Engine Architecture](../policy/architecture.md), [Aggregation-Only Contract](../../ingestion/aggregation-only-contract.md)
> **Related:** [High-Level Architecture](../../07_HIGH_LEVEL_ARCHITECTURE.md), [Concelier Architecture](../concelier/architecture.md), [Policy Engine Architecture](../policy/architecture.md), [Aggregation-Only Contract](../../aoc/aggregation-only-contract.md)
This dossier summarises the end-to-end runtime topology after the Aggregation-Only Contract (AOC) rollout. It highlights where raw facts live, how ingest services enforce guardrails, and how downstream components consume those facts to derive policy decisions and user-facing experiences.
@@ -158,13 +158,13 @@ sequenceDiagram
- **Offline Kit:** Packages raw PostgreSQL snapshots (`advisory_raw`, `vex_raw`) plus guard configuration and CLI verifier binaries so air-gapped sites can re-run AOC checks before promotion.
- **Recovery:** Supersedes chains allow rollback to prior revisions without mutating rows. Disaster exercises must rehearse restoring from snapshot, replaying logical replication into Policy Engine, and re-validating guard compliance.
- **Migration:** Legacy normalised fields are moved to temporary views during cutover; ingestion runtime removes writes once guard-enforced path is live (see [Migration playbook](../../ingestion/aggregation-only-contract.md#8-migration-playbook)).
- **Migration:** Legacy normalised fields are moved to temporary views during cutover; ingestion runtime removes writes once guard-enforced path is live (see [Migration playbook](../../aoc/aggregation-only-contract.md#8-migration-playbook)).
---
## 5·Replay CAS & deterministic bundles
- **Replay CAS:** Content-addressed storage lives under `cas://replay/<sha256-prefix>/<digest>.tar.zst`. Writers must use [StellaOps.Replay.Core](../../src/__Libraries/StellaOps.Replay.Core/AGENTS.md) helpers to ensure lexicographic file ordering, POSIX mode normalisation (0644/0755), LF newlines, zstd level19 compression, and shard-by-prefix CAS URIs (`BuildCasUri`). Bundle metadata (size, hash, created) feeds the platform-wide `replay_bundles` collection defined in `docs/data/replay_schema.md`.
- **Replay CAS:** Content-addressed storage lives under `cas://replay/<sha256-prefix>/<digest>.tar.zst`. Writers must use [StellaOps.Replay.Core](../../src/__Libraries/StellaOps.Replay.Core/AGENTS.md) helpers to ensure lexicographic file ordering, POSIX mode normalisation (0644/0755), LF newlines, zstd level19 compression, and shard-by-prefix CAS URIs (`BuildCasUri`). Bundle metadata (size, hash, created) feeds the platform-wide `replay_bundles` collection defined in `docs/db/replay-schema.md`.
- **Artifacts:** Each recorded scan stores three bundles:
1. `manifest.json` (canonical JSON, hashed and signed via DSSE).
2. `inputbundle.tar.zst` (feeds, policies, tools, environment snapshot).
@@ -179,14 +179,14 @@ sequenceDiagram
## 6·References
- [Aggregation-Only Contract reference](../../ingestion/aggregation-only-contract.md)
- [Aggregation-Only Contract reference](../../aoc/aggregation-only-contract.md)
- [Concelier architecture](../concelier/architecture.md)
- [Excititor architecture](../excititor/architecture.md)
- [Policy Engine architecture](../policy/architecture.md)
- [Authority service](../authority/architecture.md)
- [Replay specification](../../replay/DETERMINISTIC_REPLAY.md)
- [Replay developer guide](../../replay/DEVS_GUIDE_REPLAY.md)
- [Replay schema](../../data/replay_schema.md) *(pending)*
- [Replay schema](../../db/replay-schema.md)
- [Replay test strategy](../../replay/TEST_STRATEGY.md) *(draft)*
- [Observability standards (upcoming)](../../observability/policy.md) interim reference for telemetry naming.

View File

@@ -5,7 +5,7 @@ This module aggregates cross-cutting contracts and guardrails that every StellaO
## Anchors
- High-level system view: `../../07_HIGH_LEVEL_ARCHITECTURE.md`
- Platform overview: `architecture-overview.md`
- Aggregation-Only Contract: `../ingestion/aggregation-only-contract.md` (referenced across ingestion/observability docs)
- Aggregation-Only Contract: `../../aoc/aggregation-only-contract.md` (referenced across ingestion/observability docs)
## Scope
- **Identity & tenancy**: Authority-issued OpToks, tenant scoping, RBAC, short TTLs; see Authority module docs.

View File

@@ -18,7 +18,7 @@ Policy Engine compiles and evaluates Stella DSL policies deterministically, prod
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -5,7 +5,7 @@
> **Ownership:** Policy Guild • Platform Guild
> **Services:** `StellaOps.Policy.Engine` (Minimal API + worker host)
> **Data Stores:** PostgreSQL (`policy.*` schemas for packs, runs, exceptions, receipts), Object storage (explain bundles), optional queue
> **Related docs:** [Policy overview](../../policy/overview.md), [DSL](../../policy/dsl.md), [SPL v1](../../policy/spl-v1.md), [Lifecycle](../../policy/lifecycle.md), [Runtime](../../policy/runtime.md), [Governance](../../policy/governance.md), [REST API](../../policy/api.md), [Policy CLI](../cli/guides/policy.md), [Architecture overview](../platform/architecture-overview.md), [AOC reference](../../ingestion/aggregation-only-contract.md)
> **Related docs:** [Policy overview](../../policy/overview.md), [DSL](../../policy/dsl.md), [SPL v1](../../policy/spl-v1.md), [Lifecycle](../../policy/lifecycle.md), [Runtime](../../policy/runtime.md), [Governance](../../policy/governance.md), [REST API](../../policy/api.md), [Policy CLI](../cli/guides/policy.md), [Architecture overview](../platform/architecture-overview.md), [AOC reference](../../aoc/aggregation-only-contract.md)
This dossier describes the internal structure of the Policy Engine service delivered in Epic2. It focuses on module boundaries, deterministic evaluation, orchestration, and integration contracts with Concelier, Excititor, SBOM Service, Authority, Scheduler, and Observability stacks.
@@ -111,7 +111,7 @@ Key notes:
| **Authority Client** (`Authority/`) | Acquire tokens, enforce scopes, perform DPoP key rotation. | Only service identity uses `effective:write`. |
| **DSL Compiler** (`Dsl/`) | Parse, canonicalise, IR generation, checksum caching. | Uses Roslyn-like pipeline; caches by `policyId+version+hash`. |
| **Selection Layer** (`Selection/`) | Batch SBOM ↔ advisory ↔ VEX joiners; apply equivalence tables; support incremental cursors. | Deterministic ordering (SBOM → advisory → VEX). |
| **Evaluator** (`Evaluation/`) | Execute IR with first-match semantics, compute severity/trust/reachability weights, record rule hits, and emit a unified confidence score with factor breakdown (reachability/runtime/VEX/provenance/policy). | Stateless; all inputs provided by selection layer. |
| **Evaluator** (`Evaluation/`) | Execute IR with first-match semantics, compute severity/trust/reachability weights, record rule hits, and emit a unified confidence score with factor breakdown (reachability/runtime/VEX/provenance/policy). | Stateless; all inputs provided by selection layer. |
| **Signals** (`Signals/`) | Normalizes reachability, trust, entropy, uncertainty, runtime hits into a single dictionary passed to Evaluator; supplies default `unknown` values when signals missing. Entropy penalties are derived from Scanner `layer_summary.json`/`entropy.report.json` (K=0.5, cap=0.3, block at image opaque ratio &gt; 0.15 w/ unknown provenance) and exported via `policy_entropy_penalty_value` / `policy_entropy_image_opaque_ratio`; SPL scope `entropy.*` exposes `penalty`, `image_opaque_ratio`, `blocked`, `warned`, `capped`, `top_file_opaque_ratio`. | Aligns with `signals.*` namespace in DSL. |
| **Materialiser** (`Materialization/`) | Upsert effective findings, append history, manage explain bundle exports. | PostgreSQL transactions per SBOM chunk. |
| **Orchestrator** (`Runs/`) | Change-stream ingestion, fairness, retry/backoff, queue writer. | Works with Scheduler Models DTOs. |
@@ -203,150 +203,150 @@ Determinism guard instrumentation wraps the evaluator, rejecting access to forbi
All payloads are immutable and include analyzer fingerprints (`scanner.native@sha256:...`, `policyEngine@sha256:...`) so replay tooling can recompute identical digests. Determinism tests cover both the OpenVEX JSON and the DSSE payload bytes.
---
### 6.2 · Trust Lattice Policy Gates
The Policy Engine evaluates Trust Lattice gates after claim score merging to enforce trust-based constraints on VEX verdicts.
#### Gate Interface
```csharp
public interface IPolicyGate
{
Task<GateResult> EvaluateAsync(
MergeResult mergeResult,
PolicyGateContext context,
CancellationToken ct = default);
}
public sealed record GateResult
{
public required string GateName { get; init; }
public required bool Passed { get; init; }
public string? Reason { get; init; }
public ImmutableDictionary<string, object> Details { get; init; }
}
```
#### Available Gates
| Gate | Purpose | Configuration Key |
|------|---------|-------------------|
| **MinimumConfidenceGate** | Reject verdicts below confidence threshold per environment | `gates.minimumConfidence` |
| **UnknownsBudgetGate** | Fail scan if unknowns exceed budget | `gates.unknownsBudget` |
| **SourceQuotaGate** | Prevent single-source dominance without corroboration | `gates.sourceQuota` |
| **ReachabilityRequirementGate** | Require reachability proof for critical CVEs | `gates.reachabilityRequirement` |
| **EvidenceFreshnessGate** | Reject stale evidence below freshness threshold | `gates.evidenceFreshness` |
#### MinimumConfidenceGate
Requires minimum confidence threshold for suppression verdicts:
```yaml
gates:
minimumConfidence:
enabled: true
thresholds:
production: 0.75 # High bar for production
staging: 0.60 # Moderate for staging
development: 0.40 # Permissive for dev
applyToStatuses:
- not_affected
- fixed
```
- **Behavior**: `affected` status bypasses this gate (conservative default).
- **Result**: `confidence_below_threshold` when verdict confidence < environment threshold.
#### UnknownsBudgetGate
Limits exposure to unknown/unscored dependencies:
```yaml
gates:
unknownsBudget:
enabled: true
maxUnknownCount: 5
maxCumulativeUncertainty: 2.0
escalateOnExceed: true
```
- **Behavior**: Fails when unknowns exceed count limit OR cumulative uncertainty exceeds budget.
- **Cumulative uncertainty**: `sum(1 - ClaimScore)` across all verdicts.
#### SourceQuotaGate
Prevents single-source verdicts without corroboration:
```yaml
gates:
sourceQuota:
enabled: true
maxInfluencePercent: 60
corroborationDelta: 0.10
requireCorroboration: true
```
- **Behavior**: Fails when single source provides > 60% of verdict weight AND no second source is within delta (0.10).
- **Rationale**: Ensures critical decisions have multi-source validation.
#### ReachabilityRequirementGate
Requires reachability proof for high-severity vulnerabilities:
```yaml
gates:
reachabilityRequirement:
enabled: true
applySeverities:
- critical
- high
exemptStatuses:
- not_affected
bypassReasons:
- component_not_present
```
- **Behavior**: Fails when CRITICAL/HIGH CVE marked `not_affected` lacks reachability proof (unless bypass reason applies).
#### Gate Registry
Gates are registered via DI and evaluated in sequence:
```csharp
public interface IPolicyGateRegistry
{
IEnumerable<IPolicyGate> GetEnabledGates(string environment);
Task<GateEvaluationResult> EvaluateAllAsync(
MergeResult mergeResult,
PolicyGateContext context,
CancellationToken ct = default);
}
```
#### Gate Metrics
- `policy_gate_evaluations_total{gate,result}` — Count of gate evaluations by outcome
- `policy_gate_failures_total{gate,reason}` — Count of gate failures by reason
- `policy_gate_latency_seconds{gate}` — Gate evaluation latency histogram
#### Gate Implementation Reference
| Gate | Source File |
|------|-------------|
| MinimumConfidenceGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/MinimumConfidenceGate.cs` |
| UnknownsBudgetGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/UnknownsBudgetGate.cs` |
| SourceQuotaGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/SourceQuotaGate.cs` |
| ReachabilityRequirementGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/ReachabilityRequirementGate.cs` |
| EvidenceFreshnessGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/EvidenceFreshnessGate.cs` |
See `etc/policy-gates.yaml.sample` for complete gate configuration options.
**Related Documentation:**
- [Trust Lattice Specification](../excititor/trust-lattice.md)
- [Verdict Manifest Specification](../authority/verdict-manifest.md)
---
### 6.2 · Trust Lattice Policy Gates
The Policy Engine evaluates Trust Lattice gates after claim score merging to enforce trust-based constraints on VEX verdicts.
#### Gate Interface
```csharp
public interface IPolicyGate
{
Task<GateResult> EvaluateAsync(
MergeResult mergeResult,
PolicyGateContext context,
CancellationToken ct = default);
}
public sealed record GateResult
{
public required string GateName { get; init; }
public required bool Passed { get; init; }
public string? Reason { get; init; }
public ImmutableDictionary<string, object> Details { get; init; }
}
```
#### Available Gates
| Gate | Purpose | Configuration Key |
|------|---------|-------------------|
| **MinimumConfidenceGate** | Reject verdicts below confidence threshold per environment | `gates.minimumConfidence` |
| **UnknownsBudgetGate** | Fail scan if unknowns exceed budget | `gates.unknownsBudget` |
| **SourceQuotaGate** | Prevent single-source dominance without corroboration | `gates.sourceQuota` |
| **ReachabilityRequirementGate** | Require reachability proof for critical CVEs | `gates.reachabilityRequirement` |
| **EvidenceFreshnessGate** | Reject stale evidence below freshness threshold | `gates.evidenceFreshness` |
#### MinimumConfidenceGate
Requires minimum confidence threshold for suppression verdicts:
```yaml
gates:
minimumConfidence:
enabled: true
thresholds:
production: 0.75 # High bar for production
staging: 0.60 # Moderate for staging
development: 0.40 # Permissive for dev
applyToStatuses:
- not_affected
- fixed
```
- **Behavior**: `affected` status bypasses this gate (conservative default).
- **Result**: `confidence_below_threshold` when verdict confidence < environment threshold.
#### UnknownsBudgetGate
Limits exposure to unknown/unscored dependencies:
```yaml
gates:
unknownsBudget:
enabled: true
maxUnknownCount: 5
maxCumulativeUncertainty: 2.0
escalateOnExceed: true
```
- **Behavior**: Fails when unknowns exceed count limit OR cumulative uncertainty exceeds budget.
- **Cumulative uncertainty**: `sum(1 - ClaimScore)` across all verdicts.
#### SourceQuotaGate
Prevents single-source verdicts without corroboration:
```yaml
gates:
sourceQuota:
enabled: true
maxInfluencePercent: 60
corroborationDelta: 0.10
requireCorroboration: true
```
- **Behavior**: Fails when single source provides > 60% of verdict weight AND no second source is within delta (0.10).
- **Rationale**: Ensures critical decisions have multi-source validation.
#### ReachabilityRequirementGate
Requires reachability proof for high-severity vulnerabilities:
```yaml
gates:
reachabilityRequirement:
enabled: true
applySeverities:
- critical
- high
exemptStatuses:
- not_affected
bypassReasons:
- component_not_present
```
- **Behavior**: Fails when CRITICAL/HIGH CVE marked `not_affected` lacks reachability proof (unless bypass reason applies).
#### Gate Registry
Gates are registered via DI and evaluated in sequence:
```csharp
public interface IPolicyGateRegistry
{
IEnumerable<IPolicyGate> GetEnabledGates(string environment);
Task<GateEvaluationResult> EvaluateAllAsync(
MergeResult mergeResult,
PolicyGateContext context,
CancellationToken ct = default);
}
```
#### Gate Metrics
- `policy_gate_evaluations_total{gate,result}` — Count of gate evaluations by outcome
- `policy_gate_failures_total{gate,reason}` — Count of gate failures by reason
- `policy_gate_latency_seconds{gate}` — Gate evaluation latency histogram
#### Gate Implementation Reference
| Gate | Source File |
|------|-------------|
| MinimumConfidenceGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/MinimumConfidenceGate.cs` |
| UnknownsBudgetGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/UnknownsBudgetGate.cs` |
| SourceQuotaGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/SourceQuotaGate.cs` |
| ReachabilityRequirementGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/ReachabilityRequirementGate.cs` |
| EvidenceFreshnessGate | `src/Policy/__Libraries/StellaOps.Policy/Gates/EvidenceFreshnessGate.cs` |
See `etc/policy-gates.yaml.sample` for complete gate configuration options.
**Related Documentation:**
- [Trust Lattice Specification](../excititor/trust-lattice.md)
- [Verdict Manifest Specification](../authority/verdict-manifest.md)
---
## 7·Security & Tenancy

View File

@@ -1,5 +1,7 @@
# Provcache Module
> **Status: Planned** — This module is documented for upcoming implementation in Sprint 8200. The design is finalized but source code does not yet exist.
> Provenance Cache — Maximizing Trust Evidence Density
## Overview

View File

@@ -16,7 +16,7 @@ The registry module issues scoped pull tokens for mirrored container registries
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -22,7 +22,7 @@ Scanner analyses container images layer-by-layer, producing deterministic SBOM f
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -25,7 +25,7 @@ Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates
5. On completion, set status to `DONE` in both the sprint file and `TASKS.md`; if paused, revert to `TODO` and add a brief note.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see `../../ingestion/aggregation-only-contract.md`).
- Honour the Aggregation-Only Contract where applicable (see `../../aoc/aggregation-only-contract.md`).
- No undocumented schema or API contract changes; document deltas in architecture or implementation_plan.
- Keep Offline Kit parity—document air-gapped workflows for any new feature.
- Prefer deterministic fixtures and avoid machine-specific artefacts in examples.

View File

@@ -23,7 +23,7 @@ Signer validates callers, enforces Proof-of-Entitlement, and produces signed DSS
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -26,7 +26,7 @@ Telemetry module captures deployment and operations guidance for the shared obse
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -22,7 +22,7 @@ The Console presents operator dashboards for scans, policies, VEX evidence, runt
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -24,7 +24,7 @@ Zastava monitors running workloads, verifies supply chain posture, and enforces
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

View File

@@ -66,15 +66,15 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
"imageRef": "ghcr.io/acme/api@sha256:abcd…",
"owner": { "kind": "Deployment", "name": "api" }
},
"process": {
"pid": 12345,
"entrypoint": ["/entrypoint.sh", "--serve"],
"entryTrace": [
{"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
{"file":"<argv>","op":"python","target":"/opt/app/server.py"}
],
"buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
},
"process": {
"pid": 12345,
"entrypoint": ["/entrypoint.sh", "--serve"],
"entryTrace": [
{"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
{"file":"<argv>","op":"python","target":"/opt/app/server.py"}
],
"buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
},
"loadedLibs": [
{ "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
{ "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
@@ -116,35 +116,35 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
],
"decision": "Allow|Deny",
"ttlSeconds": 300
}
```
### 2.3 Schema negotiation & hashing guarantees
* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
---
## 3) Observer — node agent (DaemonSet)
}
```
### 2.3 Schema negotiation & hashing guarantees
* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
---
## 3) Observer — node agent (DaemonSet)
### 3.1 Responsibilities
* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC readonly) or `/var/log/containers/*.log` tail fallback.
* **Resolve** container → image digest, mount point rootfs.
* **Trace entrypoint**: attach **shortlived** nsenter/exec to PID 1 in container, parse shell for `exec` chain (bounded depth), record **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size).
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Sample loaded libs**: read `/proc/<pid>/maps` and `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size).
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):
* Image signature presence (if cosign policies are local; else ask backend).
* SBOM **referrers** presence (HEAD to registry, optional).
* Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.
### 3.2 Privileges & mounts (K8s)
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.
### 3.2 Privileges & mounts (K8s)
* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
* **Capabilities:** `CAP_SYS_PTRACE` (optional if using nsenter trace), `CAP_DAC_READ_SEARCH`.
@@ -154,22 +154,22 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
* `/run/containerd/containerd.sock` (or CRIO socket)
* `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** clusterinternal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.
### 3.3 Event batching
* Buffer NDJSON; flush by **N events** or **2s**.
* Backpressure: local disk ring buffer (50MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.
### 3.4 Build-id capture & validation workflow
1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope.
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:
* calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
* piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.
### 3.3 Event batching
* Buffer NDJSON; flush by **N events** or **2s**.
* Backpressure: local disk ring buffer (50MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.
### 3.4 Build-id capture & validation workflow
1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope.
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:
* calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
* piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.
---
@@ -221,20 +221,20 @@ sequenceDiagram
`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*
* Validates event schema; enforces rate caps by tenant/node; persists to **Mongo** (`runtime.events` capped collection or regular with TTL).
* Validates event schema; enforces rate caps by tenant/node; persists to **PostgreSQL** (`runtime.events` table with TTL-based retention).
* Performs **correlation**:
* Attach nearest **image SBOM** (inventory/usage) and **BOMIndex** if known.
* If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).
### 5.2 Policy decision API (for webhook)
`POST /api/v1/scanner/policy/runtime`
The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.
Request:
### 5.2 Policy decision API (for webhook)
`POST /api/v1/scanner/policy/runtime`
The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.
Request:
```json
{
@@ -273,44 +273,44 @@ Response:
```yaml
zastava:
mode:
observer: true
webhook: true
backend:
baseAddress: "https://scanner-web.internal"
policyPath: "/api/v1/scanner/policy/runtime"
requestTimeoutSeconds: 5
allowInsecureHttp: false
runtime:
authority:
issuer: "https://authority.internal"
clientId: "zastava-observer"
audience: ["scanner","zastava"]
scopes:
- "api:scanner.runtime.write"
refreshSkewSeconds: 120
requireDpop: true
requireMutualTls: true
allowStaticTokenFallback: false
staticTokenPath: null # Optional bootstrap secret
tenant: "tenant-01"
environment: "prod"
deployment: "cluster-a"
logging:
includeScopes: true
includeActivityTracking: true
staticScope:
plane: "runtime"
metrics:
meterName: "StellaOps.Zastava"
meterVersion: "1.0.0"
commonTags:
cluster: "prod-cluster"
engine: "auto" # containerd|cri-o|docker|auto
procfs: "/host/proc"
collect:
entryTrace: true
loadedLibs: true
mode:
observer: true
webhook: true
backend:
baseAddress: "https://scanner-web.internal"
policyPath: "/api/v1/scanner/policy/runtime"
requestTimeoutSeconds: 5
allowInsecureHttp: false
runtime:
authority:
issuer: "https://authority.internal"
clientId: "zastava-observer"
audience: ["scanner","zastava"]
scopes:
- "api:scanner.runtime.write"
refreshSkewSeconds: 120
requireDpop: true
requireMutualTls: true
allowStaticTokenFallback: false
staticTokenPath: null # Optional bootstrap secret
tenant: "tenant-01"
environment: "prod"
deployment: "cluster-a"
logging:
includeScopes: true
includeActivityTracking: true
staticScope:
plane: "runtime"
metrics:
meterName: "StellaOps.Zastava"
meterVersion: "1.0.0"
commonTags:
cluster: "prod-cluster"
engine: "auto" # containerd|cri-o|docker|auto
procfs: "/host/proc"
collect:
entryTrace: true
loadedLibs: true
maxLibs: 256
maxHashBytesPerContainer: 64_000_000
maxDepth: 48
@@ -327,49 +327,49 @@ zastava:
eventsPerSecond: 50
burst: 200
perNodeQueue: 10_000
security:
mounts:
containerdSock: "/run/containerd/containerd.sock:ro"
proc: "/proc:/host/proc:ro"
runtimeState: "/var/lib/containerd:ro"
```
> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
---
security:
mounts:
containerdSock: "/run/containerd/containerd.sock:ro"
proc: "/proc:/host/proc:ro"
runtimeState: "/var/lib/containerd:ro"
```
> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
---
## 7) Security posture
* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from API server (K8s handles).
* **Least privileges**: readonly host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: pernode caps; pertenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.
* **Least privileges**: readonly host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: pernode caps; pertenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.
---
## 8) Metrics, logs, tracing
**Observer**
* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`
**Webhook**
* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`
**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.
**Observer**
* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`
**Webhook**
* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`
**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.
---
@@ -486,20 +486,20 @@ webhooks:
---
## 15) Roadmap
* **eBPF** option for syscall/library load tracing (kernellevel, optin).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live **usedbyentrypoint** synthesis**: send compact bitset diff to backend to tighten Usage view.
* **Admission dryrun** dashboards (simulate block lists before enforcing).
---
## 16) Observability (stub)
- Runbook + dashboard placeholder for offline import: `operations/observability.md`, `operations/dashboards/zastava-observability.json`.
- Metrics to surface: admission latency p95/p99, allow/deny counts, Surface.Env miss rate, Surface.Secrets failures, Surface.FS cache freshness, drift events.
- Health endpoints: `/health/liveness`, `/health/readiness`, `/status`, `/surface/fs/cache/status` (see runbook).
- Alert hints: deny spikes, latency > 800ms p99, cache freshness lag > 10m, any secrets failure.
## 15) Roadmap
* **eBPF** option for syscall/library load tracing (kernellevel, optin).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live **usedbyentrypoint** synthesis**: send compact bitset diff to backend to tighten Usage view.
* **Admission dryrun** dashboards (simulate block lists before enforcing).
---
## 16) Observability (stub)
- Runbook + dashboard placeholder for offline import: `operations/observability.md`, `operations/dashboards/zastava-observability.json`.
- Metrics to surface: admission latency p95/p99, allow/deny counts, Surface.Env miss rate, Surface.Secrets failures, Surface.FS cache freshness, drift events.
- Health endpoints: `/health/liveness`, `/health/readiness`, `/status`, `/surface/fs/cache/status` (see runbook).
- Alert hints: deny spikes, latency > 800ms p99, cache freshness lag > 10m, any secrets failure.