Sprint 170 - Notifications & Telemetry
Active items only. Completed/historic work now resides in docs/implplan/archived/tasks.md (updated 2025-11-08).
This file now only tracks the notifications & telemetry status snapshot. Active backlog lives in Sprint 171+ files.
Wave coordination
| Wave | Guild owners | Shared prerequisites | Status | Notes |
|---|---|---|---|---|
| 170.A Notifier | Notifications Service Guild · Attestor Service Guild · Observability Guild | Sprint 150.A – Orchestrator | DOING (2025-11-12) | Scope confirmation + template/OAS prep underway; execution tracked in SPRINT_171_notifier_i.md (NOTIFY-ATTEST/OAS/OBS/RISK series). |
| 170.B Telemetry | Telemetry Core Guild · Observability Guild · Security Guild | Sprint 150.A – Orchestrator | DOING (2025-11-12) | Bootstrapping StellaOps.Telemetry.Core plus adoption runway in SPRINT_174_telemetry.md; waiting on Orchestrator/Policy hosts to consume new helpers. |
Wave 170.A – Notifier readiness
Scope & goals
- Deliver attestation/key-rotation alert templates plus routing so Attestor/Signer incidents surface immediately (NOTIFY-ATTEST-74-001/002).
- Refresh Notifier OpenAPI/SDK surface (`NOTIFY-OAS-61-001→NOTIFY-OAS-63-001`) so Console/CLI teams can self-serve the new endpoints.
- Wire SLO/incident inputs into rules (NOTIFY-OBS-51-001/55-001) and extend risk-profile routing (NOTIFY-RISK-66-001 → NOTIFY-RISK-68-001) without regressing quiet-hours/dedup.
- Preserve Offline Kit and documentation parity (NOTIFY-DOC-70-001 — done, NOTIFY-AIRGAP-56-002 — done) while adding the new rule surfaces.
Entry criteria
- Orchestrator job attest events flowing to Notify bus (Sprint 150.A dependency) with test fixtures approved by Attestor Guild.
- Quiet-hours/digest backlog reconciled (no pending blockers in `docs/notifications/*.md`).
- Observability Guild sign-off on telemetry fields reused by Notifier SLO webhooks.
Exit criteria
- All NOTIFY-ATTEST/OAS/OBS/RISK tasks in `SPRINT_171_notifier_i.md` moved to DONE with accompanying doc updates.
- Templates promoted to Offline Kit manifests and sample payloads stored under `docs/notifications/templates.md`.
- Incident mode notifications exercised in staging with audit logs + DSSE evidence attached.
Task clusters & owners
| Cluster | Linked tasks | Owners | Status snapshot | Notes |
|---|---|---|---|---|
| Attestation / key lifecycle alerts | NOTIFY-ATTEST-74-001/74-002 | Notifications Service Guild · Attestor Service Guild | TODO → DOING (prep) | Template scaffolding drafted; awaiting Rekor witness payload contract freeze. |
| API/OAS refresh & SDK parity | NOTIFY-OAS-61-001 → NOTIFY-OAS-63-001 | Notifications Service Guild · API Contracts Guild · SDK Generator Guild | TODO | Contract doc outline in review; SDK generator blocked on /notifications/rules schema finalize date (target 2025-11-15). |
| Observability-driven triggers | NOTIFY-OBS-51-001/55-001 | Notifications Service Guild · Observability Guild | TODO | Depends on Telemetry team exposing SLO webhook payload shape (see TELEMETRY-OBS-51-001). |
| Risk profile routing | NOTIFY-RISK-66-001 → NOTIFY-RISK-68-001 | Notifications Service Guild · Risk Engine Guild · Policy Guild | TODO | Requires Policy’s risk profile metadata (POLICY-RISK-40-002) export; follow up in Sprint 175. |
| Docs & offline parity | NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002 | Notifications Service Guild · DevOps Guild | DONE | Remains reference for GA checklists; keep untouched unless new surfaces appear. |
Observability checkpoints
- Align metric names/labels with `docs/notifications/architecture.md#12-observability-prometheus--otel` before promoting new dashboards.
- Ensure Notifier spans/logs include tenant, ruleId, actionId, and `attestation_event_id` for attestation-triggered templates.
- Capture incident notification smoke tests via `ops/devops/telemetry/tenant_isolation_smoke.py` once Telemetry wave lands.
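The span/log attribute checkpoint above can be verified mechanically during smoke runs. The sketch below is illustrative only: it assumes exported spans or log records are available as plain dictionaries with an `attributes` mapping, and the function name `missing_notifier_attributes` and the `trigger` key are hypothetical, not part of Notifier or `StellaOps.Telemetry.Core`.

```python
from typing import Any, List, Mapping

# Attribute names come from the checkpoint list; the record shape is an assumption.
BASE_ATTRIBUTES = ("tenant", "ruleId", "actionId")
ATTESTATION_ATTRIBUTE = "attestation_event_id"


def missing_notifier_attributes(record: Mapping[str, Any]) -> List[str]:
    """Return the required attribute names that are absent or empty."""
    attrs = record.get("attributes", {})
    required = list(BASE_ATTRIBUTES)
    # Hypothetical convention: attestation-triggered records set trigger == "attestation".
    if attrs.get("trigger") == "attestation":
        required.append(ATTESTATION_ATTRIBUTE)
    return [name for name in required if attrs.get(name) in (None, "")]


if __name__ == "__main__":
    span = {"attributes": {"tenant": "t-1", "ruleId": "r-9", "trigger": "attestation"}}
    print(missing_notifier_attributes(span))
```

A smoke test could fail the run whenever the returned list is non-empty for any sampled record.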
Wave 170.B – Telemetry bootstrap
Scope & goals
- Ship `StellaOps.Telemetry.Core` bootstrap + propagation helpers (TELEMETRY-OBS-50-001/50-002).
- Provide golden-signal helpers + scrubbing/PII safety nets (TELEMETRY-OBS-51-001/51-002) so service teams can onboard without bespoke plumbing.
- Implement incident + sealed-mode toggles (TELEMETRY-OBS-55-001/56-001) and document the integration contract for Orchestrator, Policy, Task Runner, Gateway (`WEB-OBS-50-001`).
Entry criteria
- Orchestrator + Policy hosts expose extension points for telemetry bootstrap (tracked via Sprint 150.A and IDs ORCH-OBS-50-001 / POLICY-OBS-50-001).
- Observability Guild reviewed storage footprint impacts for Prometheus/Tempo/Loki per module (docs/modules/telemetry/architecture.md §2).
- Security Guild signs off on redaction defaults + tenant override audit logging.
Exit criteria
- Core library published to `/local-nugets` and referenced by at least Orchestrator & Policy in integration branches.
- Context propagation middleware validated through HTTP/gRPC/job smoke tests with deterministic trace IDs.
- Incident/sealed-mode toggles wired into CLI + Notify hooks (NOTIFY-OBS-55-001) with runbooks updated under `docs/notifications/architecture.md`.
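One possible way to obtain deterministic trace IDs for the propagation smoke tests is to derive them from a fixed seed and inject them through a W3C Trace Context `traceparent` header. This is a minimal sketch under that assumption; the helper names and the seed value are hypothetical and are not taken from `StellaOps.Telemetry.Core`.

```python
import hashlib
import re

# traceparent layout per W3C Trace Context: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def deterministic_traceparent(seed: str, sampled: bool = True) -> str:
    """Build a traceparent header whose IDs depend only on `seed`."""
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()
    trace_id = digest[:32]     # 16 bytes, hex-encoded
    parent_id = digest[32:48]  # 8 bytes, hex-encoded
    # All-zero IDs are invalid in the spec; a SHA-256 prefix is not expected to hit that case.
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def trace_id_of(traceparent: str) -> str:
    """Extract the trace ID, raising ValueError on a malformed header."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    return match.group(1)


if __name__ == "__main__":
    header = deterministic_traceparent("smoke-http-default")  # hypothetical seed
    print(header, trace_id_of(header))
```

A smoke test can send the header on the inbound HTTP/gRPC/job request and assert that spans recorded downstream carry the same trace ID.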
Task clusters & owners
| Cluster | Linked tasks | Owners | Status snapshot | Notes |
|---|---|---|---|---|
| Bootstrap & propagation | TELEMETRY-OBS-50-001/50-002 | Telemetry Core Guild | TODO → DOING (scaffolding) | Collector profile templates staged; need service metadata detector + sample host integration PRs. |
| Metrics helpers + scrubbing | TELEMETRY-OBS-51-001/51-002 | Telemetry Core Guild · Observability Guild · Security Guild | TODO | Roslyn analyzer spec drafted; waiting on scrub policy from Security (POLICY-SEC-42-003). |
| Incident & sealed-mode controls | TELEMETRY-OBS-55-001/56-001 | Telemetry Core Guild · Observability Guild | TODO | Requires CLI toggle contract (CLI-OBS-12-001) and Notify incident payload spec (NOTIFY-OBS-55-001). |
Tooling & validation
- Smoke: `ops/devops/telemetry/smoke_otel_collector.py` + `tenant_isolation_smoke.py` to run for each profile (default/forensic/airgap).
- Offline bundle packaging: `ops/devops/telemetry/package_offline_bundle.py` to include updated collectors, dashboards, manifest digests.
- Incident simulation: reuse `ops/devops/telemetry/generate_dev_tls.sh` for local collector certs during sealed-mode testing.
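As a rough illustration of what "manifest digests" could mean for the offline bundle, the sketch below hashes every file under a bundle directory with SHA-256. It does not describe how `package_offline_bundle.py` actually works; the `build_manifest` function, the manifest shape, and the file names are assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(bundle_dir: Path) -> dict:
    """Map each file's POSIX-style relative path to its SHA-256 hex digest."""
    files = {}
    # Sorting keeps the manifest stable across runs and platforms.
    for path in sorted(p for p in bundle_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(bundle_dir).as_posix()
        files[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"algorithm": "sha256", "files": files}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical bundle content, for demonstration only.
        (root / "collector.yaml").write_text("receivers: {}\n", encoding="utf-8")
        print(json.dumps(build_manifest(root), indent=2))
```

Comparing a freshly computed manifest against the shipped one is a simple way to detect bundle drift in an air-gapped environment.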
Shared milestones & dependencies
| Target date | Milestone | Owners | Dependency notes |
|---|---|---|---|
| 2025-11-13 | Finalize attestation payload schema + template variables | Notifications Service Guild · Attestor Service Guild | Unblocks NOTIFY-ATTEST-74-001/002 + Telemetry incident span labels. |
| 2025-11-15 | Publish draft Notifier OAS + SDK snippets | Notifications Service Guild · API Contracts Guild | Required for CLI/UI adoption; prereq for NOTIFY-OAS-61/62 series. |
| 2025-11-18 | Land Telemetry.Core bootstrap sample in Orchestrator | Telemetry Core Guild · Orchestrator Guild | Demonstrates TELEMETRY-OBS-50-001 viability; prerequisite for Policy adoption + Notify SLO hooks. |
| 2025-11-20 | Incident/quiet-hour end-to-end rehearsal | Notifications Service Guild · Telemetry Core Guild · Observability Guild | Validates TELEMETRY-OBS-55-001 + NOTIFY-OBS-55-001 + CLI toggle contract. |
| 2025-11-22 | Offline kit bundle refresh (notifications + telemetry assets) | DevOps Guild · Notifications Service Guild · Telemetry Core Guild | Ensure docs/ops/offline-kit manifests reference new templates/configs. |
Risks & mitigations
- Telemetry data drift in sealed mode. Mitigate by enforcing `IEgressPolicy` checks (TELEMETRY-OBS-56-001) and documenting fallback exporters; schedule smoke runs after each config change.
- Template/API divergence. Maintain single source of truth in `SPRINT_171_notifier_i.md` tasks; require API Contracts review before merging SDK updates to avoid drift with UI consumers.
- Observability storage overhead. Coordinate with Ops Guild to project Prometheus/Tempo growth when SLO webhooks + incident toggles increase cardinality; adjust retention per docs/modules/telemetry/architecture.md §2.
- Cross-sprint dependency churn. Track ORCH-OBS-50-001, POLICY-OBS-50-001, WEB-OBS-50-001 weekly; if they slip, re-baseline Telemetry wave deliverables or gate Notifier observability triggers accordingly.
Task mirror snapshot (reference: Sprint 171 & 174 trackers)
Wave 170.A – Notifier (Sprint 171 mirror)
- Open tasks: 11 (NOTIFY-ATTEST/OAS/OBS/RISK series).
- Done tasks: 2 (NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002) – serve as baselines for doc/offline parity.
| Category | Task IDs | Current state | Notes |
|---|---|---|---|
| Attestation + key lifecycle | NOTIFY-ATTEST-74-001/002 | DOING / TODO | Template creation in progress (74-001) with doc updates in docs/notifications/templates.md; wiring (74-002) waiting on schema freeze & template hand-off. |
| API/OAS + SDK refresh | NOTIFY-OAS-61-001 → 63-001 | DOING / TODO | OAS doc updates underway (61-001); downstream endpoints/SDK items remain TODO until schema merged. |
| Observability-driven triggers | NOTIFY-OBS-51-001/55-001 | TODO | Depends on Telemetry SLO webhook schema + incident toggle contract. |
| Risk routing | NOTIFY-RISK-66-001 → 68-001 | TODO | Policy/Risk metadata export (POLICY-RISK-40-002) required before implementation. |
| Completed prerequisites | NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002 | DONE | Keep as reference for documentation/offline-kit parity. |
Wave 170.B – Telemetry (Sprint 174 mirror)
- Open tasks: 6 (TELEMETRY-OBS-50/51/55/56 series).
- Done tasks: 0 (wave not yet started in Sprint 174 beyond scaffolding work-in-progress).
| Category | Task IDs | Current state | Notes |
|---|---|---|---|
| Bootstrap & propagation | TELEMETRY-OBS-50-001/002 | DOING / TODO | Core bootstrap coding active (50-001); propagation adapters (50-002) queued pending package publication. |
| Metrics helpers & scrubbing | TELEMETRY-OBS-51-001/002 | TODO | Roslyn analyzer + scrub policy review pending Security Guild approval. |
| Incident & sealed-mode controls | TELEMETRY-OBS-55-001/56-001 | TODO | Requires CLI toggle contract (CLI-OBS-12-001) and Notify incident payload spec (NOTIFY-OBS-55-001). |
External dependency tracker
| Dependency | Source sprint / doc | Current state (as of 2025-11-12) | Impact on waves |
|---|---|---|---|
| Sprint 150.A – Orchestrator (wave table) | `SPRINT_150_scheduling_automation.md` | TODO | Blocks Notifier template wiring + Telemetry consumption of job events until orchestration telemetry lands. |
| `ORCH-OBS-50-001` orchestrator instrumentation | `docs/implplan/archived/tasks.md` excerpt / Sprint 150 backlog | TODO | Needed for Telemetry.Core sample + Notify SLO hooks; monitor for slip. |
| `POLICY-OBS-50-001` policy instrumentation | Sprint 150 backlog | TODO | Required before Telemetry helpers can be adopted by Policy + risk routing. |
| `WEB-OBS-50-001` gateway telemetry core adoption | Sprint 214/215 backlogs | TODO | Ensures web/gateway emits trace IDs that Notify incident payload references. |
| `POLICY-RISK-40-002` risk profile metadata export | Sprint 215+ (Policy) | TODO | Prerequisite for NOTIFY-RISK-66/67/68 payload enrichment. |
Coordination log
| Date (UTC) | Update | Owner |
|---|---|---|
| 2025-11-12 10:15 | Wave rows flipped to DOING; baseline scope/entry/exit criteria recorded for both waves. | Observability Guild · Notifications Service Guild |
| 2025-11-12 14:40 | Added task mirror + dependency tracker + milestone table to keep Sprint 170 snapshot aligned with Sprint 171/174 execution plans. | Observability Guild |
| 2025-11-12 18:05 | Marked NOTIFY-ATTEST-74-001, NOTIFY-OAS-61-001, and TELEMETRY-OBS-50-001 as DOING in their sprint trackers; added status notes reflecting in-flight work vs. gated follow-ups. | Notifications Service Guild · Telemetry Core Guild |
| 2025-11-12 19:20 | Documented attestation template suite (Section 7 in docs/notifications/templates.md) to unblock NOTIFY-ATTEST-74-001 deliverables and updated sprint mirrors accordingly. | Notifications Service Guild |
| 2025-11-12 19:32 | Synced notifications architecture doc to reference the new attestation template suite so downstream teams see the dependency in one place. | Notifications Service Guild |
| 2025-11-12 19:45 | Updated notifications overview + rules docs with tmpl-attest-* requirements so rule authors/operators share the same contract. | Notifications Service Guild |
| 2025-11-12 20:05 | Published baseline Offline Kit templates under offline/notifier/templates/attestation/ for Slack/Email/Webhook so NOTIFY-ATTEST-74-002 wiring has ready-made artefacts. | Notifications Service Guild |