
Sprint 170 - Notifications & Telemetry

Active items only. Completed/historic work now resides in docs/implplan/archived/tasks.md (updated 2025-11-08).

This file now only tracks the notifications & telemetry status snapshot. Active backlog lives in Sprint 171+ files.

Wave coordination

| Wave | Guild owners | Shared prerequisites | Status | Notes |
| --- | --- | --- | --- | --- |
| 170.A Notifier | Notifications Service Guild · Attestor Service Guild · Observability Guild | Sprint 150.A Orchestrator | DOING (2025-11-12) | Scope confirmation + template/OAS prep underway; execution tracked in SPRINT_171_notifier_i.md (NOTIFY-ATTEST/OAS/OBS/RISK series). |
| 170.B Telemetry | Telemetry Core Guild · Observability Guild · Security Guild | Sprint 150.A Orchestrator | DOING (2025-11-12) | Bootstrapping StellaOps.Telemetry.Core plus adoption runway in SPRINT_174_telemetry.md; waiting on Orchestrator/Policy hosts to consume new helpers. |

Wave 170.A Notifier readiness

Scope & goals

  • Deliver attestation/key-rotation alert templates plus routing so Attestor/Signer incidents surface immediately (NOTIFY-ATTEST-74-001/002).
  • Refresh Notifier OpenAPI/SDK surface (NOTIFY-OAS-61-001 → NOTIFY-OAS-63-001) so Console/CLI teams can self-serve the new endpoints.
  • Wire SLO/incident inputs into rules (NOTIFY-OBS-51-001/55-001) and extend risk-profile routing (NOTIFY-RISK-66-001 → NOTIFY-RISK-68-001) without regressing quiet-hours/dedup.
  • Preserve Offline Kit and documentation parity (NOTIFY-DOC-70-001 — done, NOTIFY-AIRGAP-56-002 — done) while adding the new rule surfaces.

Entry criteria

  • Orchestrator job attest events flowing to Notify bus (Sprint 150.A dependency) with test fixtures approved by Attestor Guild.
  • Quiet-hours/digest backlog reconciled (no pending blockers in docs/notifications/*.md).
  • Observability Guild sign-off on telemetry fields reused by Notifier SLO webhooks.

Exit criteria

  • All NOTIFY-ATTEST/OAS/OBS/RISK tasks in SPRINT_171_notifier_i.md moved to DONE with accompanying doc updates.
  • Templates promoted to Offline Kit manifests and sample payloads stored under docs/notifications/templates.md.
  • Incident mode notifications exercised in staging with audit logs + DSSE evidence attached.

Task clusters & owners

| Cluster | Linked tasks | Owners | Status snapshot | Notes |
| --- | --- | --- | --- | --- |
| Attestation / key lifecycle alerts | NOTIFY-ATTEST-74-001/74-002 | Notifications Service Guild · Attestor Service Guild | TODO → DOING (prep) | Template scaffolding drafted; awaiting Rekor witness payload contract freeze. |
| API/OAS refresh & SDK parity | NOTIFY-OAS-61-001 → NOTIFY-OAS-63-001 | Notifications Service Guild · API Contracts Guild · SDK Generator Guild | TODO | Contract doc outline in review; SDK generator blocked until the /notifications/rules schema is finalized (target 2025-11-15). |
| Observability-driven triggers | NOTIFY-OBS-51-001/55-001 | Notifications Service Guild · Observability Guild | TODO | Depends on Telemetry team exposing SLO webhook payload shape (see TELEMETRY-OBS-51-001). |
| Risk profile routing | NOTIFY-RISK-66-001 → NOTIFY-RISK-68-001 | Notifications Service Guild · Risk Engine Guild · Policy Guild | TODO | Requires Policy's risk profile metadata export (POLICY-RISK-40-002); follow up in Sprint 175. |
| Docs & offline parity | NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002 | Notifications Service Guild · DevOps Guild | DONE | Remains reference for GA checklists; keep untouched unless new surfaces appear. |

Observability checkpoints

  • Align metric names/labels with docs/notifications/architecture.md#12-observability-prometheus--otel before promoting new dashboards.
  • Ensure Notifier spans/logs include tenant, ruleId, actionId, and attestation_event_id for attestation-triggered templates.
  • Capture incident notification smoke tests via ops/devops/telemetry/tenant_isolation_smoke.py once Telemetry wave lands.
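
The span/log field requirement above can be checked mechanically in a smoke test. The following is a minimal sketch, assuming records are available as plain dictionaries; the helper name `missing_notifier_fields` and the record shape are hypothetical, and only the four field names come from this plan.

```python
# Field names are taken from the checkpoint above; everything else is illustrative.
REQUIRED_FIELDS = ("tenant", "ruleId", "actionId", "attestation_event_id")


def missing_notifier_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a log/span record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


if __name__ == "__main__":
    # Hypothetical record for an attestation-triggered template.
    sample = {"tenant": "tenant-a", "ruleId": "rule-1", "actionId": "act-9"}
    print(missing_notifier_fields(sample))
```

An empty list means the record carries every field the checkpoint asks for; anything else names the gaps to fix before dashboards are promoted.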

Wave 170.B Telemetry bootstrap

Scope & goals

  • Ship StellaOps.Telemetry.Core bootstrap + propagation helpers (TELEMETRY-OBS-50-001/50-002).
  • Provide golden-signal helpers + scrubbing/PII safety nets (TELEMETRY-OBS-51-001/51-002) so service teams can onboard without bespoke plumbing.
  • Implement incident + sealed-mode toggles (TELEMETRY-OBS-55-001/56-001) and document the integration contract for Orchestrator, Policy, Task Runner, Gateway (WEB-OBS-50-001).
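
To make the scrubbing goal concrete, here is a minimal sketch of an attribute scrubber in Python. It is not the StellaOps.Telemetry.Core API: the deny-list, the placeholder string, and the function name `scrub_attributes` are all assumptions, and the real redaction defaults depend on the Security Guild scrub policy referenced in this plan.

```python
REDACTED = "[REDACTED]"

# Hypothetical deny-list; the actual defaults are pending Security Guild sign-off.
SENSITIVE_KEYS = frozenset({"authorization", "password", "email", "token"})


def scrub_attributes(attributes: dict) -> dict:
    """Return a copy in which values of sensitive keys are replaced by a placeholder."""
    return {
        key: REDACTED if key.lower() in SENSITIVE_KEYS else value
        for key, value in attributes.items()
    }


if __name__ == "__main__":
    print(scrub_attributes({"Email": "user@example.com", "http.route": "/rules"}))
```

Matching on the lowercased key keeps the check case-insensitive while leaving the original key spelling intact, so downstream dashboards still see the attribute they expect.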

Entry criteria

  • Orchestrator + Policy hosts expose extension points for telemetry bootstrap (tracked via Sprint 150.A and IDs ORCH-OBS-50-001 / POLICY-OBS-50-001).
  • Observability Guild reviewed storage footprint impacts for Prometheus/Tempo/Loki per module (docs/modules/telemetry/architecture.md §2).
  • Security Guild signs off on redaction defaults + tenant override audit logging.

Exit criteria

  • Core library published to /local-nugets and referenced by at least Orchestrator & Policy in integration branches.
  • Context propagation middleware validated through HTTP/gRPC/job smoke tests with deterministic trace IDs.
  • Incident/sealed-mode toggles wired into CLI + Notify hooks (NOTIFY-OBS-55-001) with runbooks updated under docs/notifications/architecture.md.
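
One way to get deterministic trace IDs in smoke tests is to derive them from a fixed seed. The sketch below assumes the W3C Trace Context `traceparent` header format (version, 32-hex trace ID, 16-hex parent ID, flags); this plan does not state which propagation format the middleware uses, and the function name and seed scheme are hypothetical.

```python
import hashlib


def deterministic_traceparent(seed: str) -> str:
    """Build a W3C-style traceparent value whose IDs depend only on the seed."""
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()
    trace_id = digest[:32]     # 16 bytes, hex encoded
    parent_id = digest[32:48]  # 8 bytes, hex encoded
    # Version "00" and flags "01" (sampled). All-zero IDs are invalid under the
    # spec, but a SHA-256 prefix of all zeros is not a practical concern.
    return f"00-{trace_id}-{parent_id}-01"


if __name__ == "__main__":
    # Hypothetical seed naming a smoke-test case.
    print(deterministic_traceparent("smoke/http/case-1"))
```

Because the same seed always yields the same header, an HTTP, gRPC, or job smoke test can assert that the trace ID observed downstream matches the one it injected.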

Task clusters & owners

| Cluster | Linked tasks | Owners | Status snapshot | Notes |
| --- | --- | --- | --- | --- |
| Bootstrap & propagation | TELEMETRY-OBS-50-001/50-002 | Telemetry Core Guild | TODO → DOING (scaffolding) | Collector profile templates staged; need service metadata detector + sample host integration PRs. |
| Metrics helpers + scrubbing | TELEMETRY-OBS-51-001/51-002 | Telemetry Core Guild · Observability Guild · Security Guild | TODO | Roslyn analyzer spec drafted; waiting on scrub policy from Security (POLICY-SEC-42-003). |
| Incident & sealed-mode controls | TELEMETRY-OBS-55-001/56-001 | Telemetry Core Guild · Observability Guild | TODO | Requires CLI toggle contract (CLI-OBS-12-001) and Notify incident payload spec (NOTIFY-OBS-55-001). |

Tooling & validation

  • Smoke: run ops/devops/telemetry/smoke_otel_collector.py and tenant_isolation_smoke.py for each profile (default/forensic/airgap).
  • Offline bundle packaging: ops/devops/telemetry/package_offline_bundle.py to include updated collectors, dashboards, manifest digests.
  • Incident simulation: reuse ops/devops/telemetry/generate_dev_tls.sh for local collector certs during sealed-mode testing.

Shared milestones & dependencies

| Target date | Milestone | Owners | Dependency notes |
| --- | --- | --- | --- |
| 2025-11-13 | Finalize attestation payload schema + template variables | Notifications Service Guild · Attestor Service Guild | Unblocks NOTIFY-ATTEST-74-001/002 + Telemetry incident span labels. |
| 2025-11-15 | Publish draft Notifier OAS + SDK snippets | Notifications Service Guild · API Contracts Guild | Required for CLI/UI adoption; prereq for NOTIFY-OAS-61/62 series. |
| 2025-11-18 | Land Telemetry.Core bootstrap sample in Orchestrator | Telemetry Core Guild · Orchestrator Guild | Demonstrates TELEMETRY-OBS-50-001 viability; prerequisite for Policy adoption + Notify SLO hooks. |
| 2025-11-20 | Incident/quiet-hour end-to-end rehearsal | Notifications Service Guild · Telemetry Core Guild · Observability Guild | Validates TELEMETRY-OBS-55-001 + NOTIFY-OBS-55-001 + CLI toggle contract. |
| 2025-11-22 | Offline kit bundle refresh (notifications + telemetry assets) | DevOps Guild · Notifications Service Guild · Telemetry Core Guild | Ensure docs/ops/offline-kit manifests reference new templates/configs. |

Risks & mitigations

  • Telemetry data drift in sealed mode. Mitigate by enforcing IEgressPolicy checks (TELEMETRY-OBS-56-001) and documenting fallback exporters; schedule smoke runs after each config change.
  • Template/API divergence. Maintain single source of truth in SPRINT_171_notifier_i.md tasks; require API Contracts review before merging SDK updates to avoid drift with UI consumers.
  • Observability storage overhead. Coordinate with Ops Guild to project Prometheus/Tempo growth when SLO webhooks + incident toggles increase cardinality; adjust retention per docs/modules/telemetry/architecture.md §2.
  • Cross-sprint dependency churn. Track ORCH-OBS-50-001, POLICY-OBS-50-001, WEB-OBS-50-001 weekly; if they slip, re-baseline Telemetry wave deliverables or gate Notifier observability triggers accordingly.

Task mirror snapshot (reference: Sprint 171 & 174 trackers)

Wave 170.A Notifier (Sprint 171 mirror)

  • Open tasks: 11 (NOTIFY-ATTEST/OAS/OBS/RISK series).
  • Done tasks: 2 (NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002) serve as baselines for doc/offline parity.

| Category | Task IDs | Current state | Notes |
| --- | --- | --- | --- |
| Attestation + key lifecycle | NOTIFY-ATTEST-74-001/002 | DOING / TODO | Template creation in progress (74-001) with doc updates in docs/notifications/templates.md; wiring (74-002) waiting on schema freeze & template hand-off. |
| API/OAS + SDK refresh | NOTIFY-OAS-61-001 → 63-001 | DOING / TODO | OAS doc updates underway (61-001); downstream endpoints/SDK items remain TODO until schema merged. |
| Observability-driven triggers | NOTIFY-OBS-51-001/55-001 | TODO | Depends on Telemetry SLO webhook schema + incident toggle contract. |
| Risk routing | NOTIFY-RISK-66-001 → 68-001 | TODO | Policy/Risk metadata export (POLICY-RISK-40-002) required before implementation. |
| Completed prerequisites | NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002 | DONE | Keep as reference for documentation/offline-kit parity. |

Wave 170.B Telemetry (Sprint 174 mirror)

  • Open tasks: 6 (TELEMETRY-OBS-50/51/55/56 series).
  • Done tasks: 0 (wave not yet started in Sprint 174 beyond scaffolding work-in-progress).

| Category | Task IDs | Current state | Notes |
| --- | --- | --- | --- |
| Bootstrap & propagation | TELEMETRY-OBS-50-001/002 | DOING / TODO | Core bootstrap coding active (50-001); propagation adapters (50-002) queued pending package publication. |
| Metrics helpers & scrubbing | TELEMETRY-OBS-51-001/002 | TODO | Roslyn analyzer + scrub policy review pending Security Guild approval. |
| Incident & sealed-mode controls | TELEMETRY-OBS-55-001/56-001 | TODO | Requires CLI toggle contract (CLI-OBS-12-001) and Notify incident payload spec (NOTIFY-OBS-55-001). |

External dependency tracker

| Dependency | Source sprint / doc | Current state (as of 2025-11-12) | Impact on waves |
| --- | --- | --- | --- |
| Sprint 150.A Orchestrator (wave table) | SPRINT_150_scheduling_automation.md | TODO | Blocks Notifier template wiring + Telemetry consumption of job events until orchestration telemetry lands. |
| ORCH-OBS-50-001 orchestrator instrumentation | docs/implplan/archived/tasks.md excerpt / Sprint 150 backlog | TODO | Needed for Telemetry.Core sample + Notify SLO hooks; monitor for slip. |
| POLICY-OBS-50-001 policy instrumentation | Sprint 150 backlog | TODO | Required before Telemetry helpers can be adopted by Policy + risk routing. |
| WEB-OBS-50-001 gateway telemetry core adoption | Sprint 214/215 backlogs | TODO | Ensures web/gateway emits trace IDs that Notify incident payload references. |
| POLICY-RISK-40-002 risk profile metadata export | Sprint 215+ (Policy) | TODO | Prerequisite for NOTIFY-RISK-66/67/68 payload enrichment. |

Coordination log

| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-11-12 10:15 | Wave rows flipped to DOING; baseline scope/entry/exit criteria recorded for both waves. | Observability Guild · Notifications Service Guild |
| 2025-11-12 14:40 | Added task mirror + dependency tracker + milestone table to keep Sprint 170 snapshot aligned with Sprint 171/174 execution plans. | Observability Guild |
| 2025-11-12 18:05 | Marked NOTIFY-ATTEST-74-001, NOTIFY-OAS-61-001, and TELEMETRY-OBS-50-001 as DOING in their sprint trackers; added status notes reflecting in-flight work vs. gated follow-ups. | Notifications Service Guild · Telemetry Core Guild |
| 2025-11-12 19:20 | Documented attestation template suite (Section 7 in docs/notifications/templates.md) to unblock NOTIFY-ATTEST-74-001 deliverables and updated sprint mirrors accordingly. | Notifications Service Guild |
| 2025-11-12 19:32 | Synced notifications architecture doc to reference the new attestation template suite so downstream teams see the dependency in one place. | Notifications Service Guild |
| 2025-11-12 19:45 | Updated notifications overview + rules docs with tmpl-attest-* requirements so rule authors/operators share the same contract. | Notifications Service Guild |
| 2025-11-12 20:05 | Published baseline Offline Kit templates under offline/notifier/templates/attestation/ for Slack/Email/Webhook so NOTIFY-ATTEST-74-002 wiring has ready-made artefacts. | Notifications Service Guild |