Implement ledger metrics for observability and add tests for Ruby packages endpoints

- Added `LedgerMetrics` class to record write latency and total events for ledger operations.
- Created comprehensive tests for Ruby packages endpoints, covering scenarios for missing inventory, successful retrieval, and identifier handling.
- Introduced `TestSurfaceSecretsScope` for managing environment variables during tests.
- Developed `ProvenanceMongoExtensions` for attaching DSSE provenance and trust information to event documents.
- Implemented `EventProvenanceWriter` and `EventWriter` classes for managing event provenance in MongoDB.
- Established MongoDB indexes for efficient querying of events based on provenance and trust.
- Added models and JSON parsing logic for DSSE provenance and trust information.
2025-11-13 09:29:09 +02:00
parent 151f6b35cc
commit 61f963fd52
101 changed files with 5881 additions and 1776 deletions


@@ -8,7 +8,132 @@ This file now only tracks the notifications & telemetry status snapshot. Active
| Wave | Guild owners | Shared prerequisites | Status | Notes |
| --- | --- | --- | --- | --- |
| 170.A Notifier | Notifications Service Guild · Attestor Service Guild · Observability Guild | Sprint 150.A Orchestrator | TODO | Needs orchestrator job events/attest data; keep templates staged for when job attestations land. |
| 170.B Telemetry | Telemetry Core Guild · Observability Guild · Security Guild | Sprint 150.A Orchestrator | TODO | Library scaffolding is ready but should launch once orchestrator/Policy consumers can adopt shared helpers. |
| 170.A Notifier | Notifications Service Guild · Attestor Service Guild · Observability Guild | Sprint 150.A Orchestrator | **DOING (2025-11-12)** | Scope confirmation + template/OAS prep underway; execution tracked in `SPRINT_171_notifier_i.md` (NOTIFY-ATTEST/OAS/OBS/RISK series). |
| 170.B Telemetry | Telemetry Core Guild · Observability Guild · Security Guild | Sprint 150.A Orchestrator | **DOING (2025-11-12)** | Bootstrapping `StellaOps.Telemetry.Core` plus adoption runway in `SPRINT_174_telemetry.md`; waiting on Orchestrator/Policy hosts to consume new helpers. |
# Sprint 170 - Notifications & Telemetry
## Wave 170.A Notifier readiness
### Scope & goals
- Deliver attestation/key-rotation alert templates plus routing so Attestor/Signer incidents surface immediately (NOTIFY-ATTEST-74-001/002).
- Refresh Notifier OpenAPI/SDK surface (`NOTIFY-OAS-61-001` → `NOTIFY-OAS-63-001`) so Console/CLI teams can self-serve the new endpoints.
- Wire SLO/incident inputs into rules (NOTIFY-OBS-51-001/55-001) and extend risk-profile routing (NOTIFY-RISK-66-001 → NOTIFY-RISK-68-001) without regressing quiet-hours/dedup.
- Preserve Offline Kit and documentation parity (NOTIFY-DOC-70-001 — done, NOTIFY-AIRGAP-56-002 — done) while adding the new rule surfaces.
### Entry criteria
- Orchestrator job attest events flowing to Notify bus (Sprint 150.A dependency) with test fixtures approved by Attestor Guild.
- Quiet-hours/digest backlog reconciled (no pending blockers in `docs/notifications/*.md`).
- Observability Guild sign-off on telemetry fields reused by Notifier SLO webhooks.
### Exit criteria
- All NOTIFY-ATTEST/OAS/OBS/RISK tasks in `SPRINT_171_notifier_i.md` moved to DONE with accompanying doc updates.
- Templates promoted to Offline Kit manifests and sample payloads stored under `docs/notifications/templates.md`.
- Incident mode notifications exercised in staging with audit logs + DSSE evidence attached.
### Task clusters & owners
| Cluster | Linked tasks | Owners | Status snapshot | Notes |
| --- | --- | --- | --- | --- |
| Attestation / key lifecycle alerts | NOTIFY-ATTEST-74-001/74-002 | Notifications Service Guild · Attestor Service Guild | TODO → DOING (prep) | Template scaffolding drafted; awaiting Rekor witness payload contract freeze. |
| API/OAS refresh & SDK parity | NOTIFY-OAS-61-001 → NOTIFY-OAS-63-001 | Notifications Service Guild · API Contracts Guild · SDK Generator Guild | TODO | Contract doc outline in review; SDK generator blocked until the `/notifications/rules` schema is finalized (target 2025-11-15). |
| Observability-driven triggers | NOTIFY-OBS-51-001/55-001 | Notifications Service Guild · Observability Guild | TODO | Depends on Telemetry team exposing SLO webhook payload shape (see TELEMETRY-OBS-51-001). |
| Risk profile routing | NOTIFY-RISK-66-001 → NOTIFY-RISK-68-001 | Notifications Service Guild · Risk Engine Guild · Policy Guild | TODO | Requires Policy's risk profile metadata export (POLICY-RISK-40-002); follow up in Sprint 175. |
| Docs & offline parity | NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002 | Notifications Service Guild · DevOps Guild | DONE | Remains reference for GA checklists; keep untouched unless new surfaces appear. |
### Observability checkpoints
- Align metric names/labels with `docs/notifications/architecture.md#12-observability-prometheus--otel` before promoting new dashboards.
- Ensure Notifier spans/logs include tenant, ruleId, actionId, and `attestation_event_id` for attestation-triggered templates.
- Capture incident notification smoke tests via `ops/devops/telemetry/tenant_isolation_smoke.py` once Telemetry wave lands.
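
The field requirement in the checkpoints above can be sketched as a small check. This is a minimal illustration, assuming span or log attributes are available as a plain dictionary; `missing_fields` and the `attestation_triggered` flag are hypothetical names, not part of the Notifier codebase.

```python
# Hypothetical helper: field names come from the checkpoint list,
# the function and record shape are assumptions for illustration.
BASE_FIELDS = ("tenant", "ruleId", "actionId")
ATTESTATION_FIELDS = ("attestation_event_id",)


def missing_fields(attributes: dict, attestation_triggered: bool = False) -> list[str]:
    """Return required field names that are absent or empty in a span/log record."""
    required = BASE_FIELDS + (ATTESTATION_FIELDS if attestation_triggered else ())
    return [name for name in required if not attributes.get(name)]


if __name__ == "__main__":
    record = {"tenant": "acme", "ruleId": "rule-1", "actionId": "act-9"}
    print(missing_fields(record))                              # nothing missing
    print(missing_fields(record, attestation_triggered=True))  # attestation id missing
```

A smoke test could run a check like this against exported spans and fail when the returned list is non-empty.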
## Wave 170.B Telemetry bootstrap
### Scope & goals
- Ship `StellaOps.Telemetry.Core` bootstrap + propagation helpers (TELEMETRY-OBS-50-001/50-002).
- Provide golden-signal helpers + scrubbing/PII safety nets (TELEMETRY-OBS-51-001/51-002) so service teams can onboard without bespoke plumbing.
- Implement incident + sealed-mode toggles (TELEMETRY-OBS-55-001/56-001) and document the integration contract for Orchestrator, Policy, Task Runner, Gateway (`WEB-OBS-50-001`).
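
As a rough sketch of what a scrubbing safety net could look like, the snippet below redacts attribute values before export. The key deny list, the email pattern, and the `scrub` function are all assumptions made for illustration; the actual scrub policy is owned by Security and is not described here.

```python
import re

# Assumed deny list and pattern; the real policy is defined elsewhere.
SENSITIVE_KEYS = {"authorization", "password", "token", "api_key"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REDACTED = "[REDACTED]"


def scrub(attributes: dict) -> dict:
    """Return a copy of the attributes with sensitive values masked."""
    cleaned = {}
    for key, value in attributes.items():
        if key.lower() in SENSITIVE_KEYS:
            # Drop the whole value for keys that are sensitive by name.
            cleaned[key] = REDACTED
        elif isinstance(value, str):
            # Mask email-like substrings inside free-text values.
            cleaned[key] = EMAIL.sub(REDACTED, value)
        else:
            cleaned[key] = value
    return cleaned
```

Returning a copy rather than mutating the input keeps the original record available for tenant override audit logging, if that is needed.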
### Entry criteria
- Orchestrator + Policy hosts expose extension points for telemetry bootstrap (tracked via Sprint 150.A and IDs ORCH-OBS-50-001 / POLICY-OBS-50-001).
- Observability Guild reviewed storage footprint impacts for Prometheus/Tempo/Loki per module (docs/modules/telemetry/architecture.md §2).
- Security Guild signs off on redaction defaults + tenant override audit logging.
### Exit criteria
- Core library published to `/local-nugets` and referenced by at least Orchestrator & Policy in integration branches.
- Context propagation middleware validated through HTTP/gRPC/job smoke tests with deterministic trace IDs.
- Incident/sealed-mode toggles wired into CLI + Notify hooks (NOTIFY-OBS-55-001) with runbooks updated under `docs/notifications/architecture.md`.
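
One way a smoke test might produce deterministic trace IDs is to derive them from a fixed seed and carry them in a W3C Trace Context `traceparent` header. The sketch below assumes that header format is what the propagation middleware uses, which this document does not confirm; the function names are hypothetical.

```python
import hashlib
import re

# traceparent layout per W3C Trace Context: version-traceid-spanid-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def deterministic_ids(seed: str) -> tuple[str, str]:
    """Derive a repeatable (trace_id, span_id) pair from a test seed."""
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()
    return digest[:32], digest[32:48]


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format the IDs as a version-00 traceparent header value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Parse a version-00 traceparent value, rejecting malformed or all-zero IDs."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span id is invalid")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": bool(int(flags, 16) & 1)}
```

A smoke test could send `build_traceparent(*deterministic_ids("job-42"))` over HTTP, gRPC, and a job hop, then assert that each receiver reports the same `trace_id`.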
### Task clusters & owners
| Cluster | Linked tasks | Owners | Status snapshot | Notes |
| --- | --- | --- | --- | --- |
| Bootstrap & propagation | TELEMETRY-OBS-50-001/50-002 | Telemetry Core Guild | TODO → DOING (scaffolding) | Collector profile templates staged; need service metadata detector + sample host integration PRs. |
| Metrics helpers + scrubbing | TELEMETRY-OBS-51-001/51-002 | Telemetry Core Guild · Observability Guild · Security Guild | TODO | Roslyn analyzer spec drafted; waiting on scrub policy from Security (POLICY-SEC-42-003). |
| Incident & sealed-mode controls | TELEMETRY-OBS-55-001/56-001 | Telemetry Core Guild · Observability Guild | TODO | Requires CLI toggle contract (CLI-OBS-12-001) and Notify incident payload spec (NOTIFY-OBS-55-001). |
### Tooling & validation
- Smoke: `ops/devops/telemetry/smoke_otel_collector.py` + `tenant_isolation_smoke.py` to run for each profile (default/forensic/airgap).
- Offline bundle packaging: `ops/devops/telemetry/package_offline_bundle.py` to include updated collectors, dashboards, manifest digests.
- Incident simulation: reuse `ops/devops/telemetry/generate_dev_tls.sh` for local collector certs during sealed-mode testing.
## Shared milestones & dependencies
| Target date | Milestone | Owners | Dependency notes |
| --- | --- | --- | --- |
| 2025-11-13 | Finalize attestation payload schema + template variables | Notifications Service Guild · Attestor Service Guild | Unblocks NOTIFY-ATTEST-74-001/002 + Telemetry incident span labels. |
| 2025-11-15 | Publish draft Notifier OAS + SDK snippets | Notifications Service Guild · API Contracts Guild | Required for CLI/UI adoption; prereq for NOTIFY-OAS-61/62 series. |
| 2025-11-18 | Land Telemetry.Core bootstrap sample in Orchestrator | Telemetry Core Guild · Orchestrator Guild | Demonstrates TELEMETRY-OBS-50-001 viability; prerequisite for Policy adoption + Notify SLO hooks. |
| 2025-11-20 | Incident/quiet-hour end-to-end rehearsal | Notifications Service Guild · Telemetry Core Guild · Observability Guild | Validates TELEMETRY-OBS-55-001 + NOTIFY-OBS-55-001 + CLI toggle contract. |
| 2025-11-22 | Offline kit bundle refresh (notifications + telemetry assets) | DevOps Guild · Notifications Service Guild · Telemetry Core Guild | Ensure docs/ops/offline-kit manifests reference new templates/configs. |
## Risks & mitigations
- **Telemetry data drift in sealed mode.** Mitigate by enforcing `IEgressPolicy` checks (TELEMETRY-OBS-56-001) and documenting fallback exporters; schedule smoke runs after each config change.
- **Template/API divergence.** Maintain single source of truth in `SPRINT_171_notifier_i.md` tasks; require API Contracts review before merging SDK updates to avoid drift with UI consumers.
- **Observability storage overhead.** Coordinate with Ops Guild to project Prometheus/Tempo growth when SLO webhooks + incident toggles increase cardinality; adjust retention per docs/modules/telemetry/architecture.md §2.
- **Cross-sprint dependency churn.** Track ORCH-OBS-50-001, POLICY-OBS-50-001, WEB-OBS-50-001 weekly; if they slip, re-baseline Telemetry wave deliverables or gate Notifier observability triggers accordingly.
## Task mirror snapshot (reference: Sprint 171 & 174 trackers)
### Wave 170.A Notifier (Sprint 171 mirror)
- **Open tasks:** 11 (NOTIFY-ATTEST/OAS/OBS/RISK series).
- **Done tasks:** 2 (NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002) serve as baselines for doc/offline parity.
| Category | Task IDs | Current state | Notes |
| --- | --- | --- | --- |
| Attestation + key lifecycle | NOTIFY-ATTEST-74-001/002 | **DOING / TODO** | Template creation in progress (74-001) with doc updates in `docs/notifications/templates.md`; wiring (74-002) waiting on schema freeze & template hand-off. |
| API/OAS + SDK refresh | NOTIFY-OAS-61-001 → 63-001 | **DOING / TODO** | OAS doc updates underway (61-001); downstream endpoints/SDK items remain TODO until schema merged. |
| Observability-driven triggers | NOTIFY-OBS-51-001/55-001 | TODO | Depends on Telemetry SLO webhook schema + incident toggle contract. |
| Risk routing | NOTIFY-RISK-66-001 → 68-001 | TODO | Policy/Risk metadata export (POLICY-RISK-40-002) required before implementation. |
| Completed prerequisites | NOTIFY-DOC-70-001, NOTIFY-AIRGAP-56-002 | DONE | Keep as reference for documentation/offline-kit parity. |
### Wave 170.B Telemetry (Sprint 174 mirror)
- **Open tasks:** 6 (TELEMETRY-OBS-50/51/55/56 series).
- **Done tasks:** 0 (wave not yet started in Sprint 174 beyond scaffolding work-in-progress).
| Category | Task IDs | Current state | Notes |
| --- | --- | --- | --- |
| Bootstrap & propagation | TELEMETRY-OBS-50-001/002 | **DOING / TODO** | Core bootstrap coding active (50-001); propagation adapters (50-002) queued pending package publication. |
| Metrics helpers & scrubbing | TELEMETRY-OBS-51-001/002 | TODO | Roslyn analyzer + scrub policy review pending Security Guild approval. |
| Incident & sealed-mode controls | TELEMETRY-OBS-55-001/56-001 | TODO | Requires CLI toggle contract (CLI-OBS-12-001) and Notify incident payload spec (NOTIFY-OBS-55-001). |
## External dependency tracker
| Dependency | Source sprint / doc | Current state (as of 2025-11-12) | Impact on waves |
| --- | --- | --- | --- |
| Sprint 150.A Orchestrator (wave table) | `SPRINT_150_scheduling_automation.md` | TODO | Blocks Notifier template wiring + Telemetry consumption of job events until orchestration telemetry lands. |
| ORCH-OBS-50-001 `orchestrator instrumentation` | `docs/implplan/archived/tasks.md` excerpt / Sprint 150 backlog | TODO | Needed for Telemetry.Core sample + Notify SLO hooks; monitor for slip. |
| POLICY-OBS-50-001 `policy instrumentation` | Sprint 150 backlog | TODO | Required before Telemetry helpers can be adopted by Policy + risk routing. |
| WEB-OBS-50-001 `gateway telemetry core adoption` | Sprint 214/215 backlogs | TODO | Ensures web/gateway emits trace IDs that Notify incident payload references. |
| POLICY-RISK-40-002 `risk profile metadata export` | Sprint 215+ (Policy) | TODO | Prerequisite for NOTIFY-RISK-66/67/68 payload enrichment. |
## Coordination log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-11-12 10:15 | Wave rows flipped to DOING; baseline scope/entry/exit criteria recorded for both waves. | Observability Guild · Notifications Service Guild |
| 2025-11-12 14:40 | Added task mirror + dependency tracker + milestone table to keep Sprint 170 snapshot aligned with Sprint 171/174 execution plans. | Observability Guild |
| 2025-11-12 18:05 | Marked NOTIFY-ATTEST-74-001, NOTIFY-OAS-61-001, and TELEMETRY-OBS-50-001 as DOING in their sprint trackers; added status notes reflecting in-flight work vs. gated follow-ups. | Notifications Service Guild · Telemetry Core Guild |
| 2025-11-12 19:20 | Documented attestation template suite (Section 7 in `docs/notifications/templates.md`) to unblock NOTIFY-ATTEST-74-001 deliverables and updated sprint mirrors accordingly. | Notifications Service Guild |
| 2025-11-12 19:32 | Synced notifications architecture doc to reference the new attestation template suite so downstream teams see the dependency in one place. | Notifications Service Guild |
| 2025-11-12 19:45 | Updated notifications overview + rules docs with `tmpl-attest-*` requirements so rule authors/operators share the same contract. | Notifications Service Guild |
| 2025-11-12 20:05 | Published baseline Offline Kit templates under `offline/notifier/templates/attestation/` for Slack/Email/Webhook so NOTIFY-ATTEST-74-002 wiring has ready-made artefacts. | Notifications Service Guild |