Implement ledger metrics for observability and add tests for Ruby packages endpoints
- Added `LedgerMetrics` class to record write latency and total events for ledger operations.
- Created comprehensive tests for Ruby packages endpoints, covering scenarios for missing inventory, successful retrieval, and identifier handling.
- Introduced `TestSurfaceSecretsScope` for managing environment variables during tests.
- Developed `ProvenanceMongoExtensions` for attaching DSSE provenance and trust information to event documents.
- Implemented `EventProvenanceWriter` and `EventWriter` classes for managing event provenance in MongoDB.
- Established MongoDB indexes for efficient querying of events based on provenance and trust.
- Added models and JSON parsing logic for DSSE provenance and trust information.
docs/modules/findings-ledger/observability.md
@@ -0,0 +1,65 @@
# Findings Ledger Observability Profile (Sprint 120)
> **Audience:** Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
> **Scope:** Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.
## 1. Telemetry stack & conventions
- **Export path:** .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via `observability.enabled=true` in `appsettings`.
- **Namespace prefix:** `ledger.*` for metrics, `Ledger.*` for logs/traces. Labels follow `tenant`, `chain`, `policy`, `status`, `reason`, `anchor`.
- **Time provenance:** All timestamps are emitted in UTC ISO-8601. When metrics or logs include monotonic durations, those durations must derive from `TimeProvider`.
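
The time-provenance rule above can be illustrated with an injectable clock. This is a minimal sketch in Python rather than the service's .NET code: the `Clock` class and its method names are hypothetical stand-ins for what `TimeProvider` supplies, and only the two rules (UTC ISO-8601 timestamps, durations from a monotonic source) come from this document.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def _system_utc_now() -> datetime:
    return datetime.now(timezone.utc)


class Clock:
    """Hypothetical injected time source, standing in for .NET's TimeProvider."""

    def __init__(self,
                 utc_now: Callable[[], datetime] = _system_utc_now,
                 monotonic: Callable[[], float] = time.monotonic) -> None:
        self._utc_now = utc_now
        self._monotonic = monotonic

    def timestamp(self) -> str:
        # Wall-clock time, normalized to UTC and rendered as ISO-8601.
        return self._utc_now().astimezone(timezone.utc).isoformat()

    def start(self) -> float:
        # Opaque starting mark taken from the monotonic source.
        return self._monotonic()

    def elapsed_since(self, start: float) -> float:
        # Durations come from the monotonic source, never from
        # subtracting wall-clock timestamps.
        return self._monotonic() - start


# Deterministic clock for tests: fixed wall time, scripted monotonic ticks.
ticks = iter([10.0, 10.25])
fake = Clock(utc_now=lambda: datetime(2025, 11, 13, 12, 0, tzinfo=timezone.utc),
             monotonic=lambda: next(ticks))
started = fake.start()
print(fake.timestamp())             # 2025-11-13T12:00:00+00:00
print(fake.elapsed_since(started))  # 0.25
```

Injecting both time sources keeps replays and tests deterministic, which is the same reason the profile requires durations to flow through `TimeProvider`.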
## 2. Metrics
| Metric | Type | Labels | Description / target |
| --- | --- | --- | --- |
| `ledger_write_latency_seconds` | Histogram | `tenant`, `event_type` | End-to-end append latency (API ingress → persisted). Target: P95 ≤ 120 ms. |
| `ledger_events_total` | Counter | `tenant`, `event_type`, `source` (`policy`, `workflow`, `orchestrator`) | Incremented once per committed event. Mirrors the Merkle leaf count. |
| `ledger_ingest_backlog_events` | Gauge | `tenant` | Number of events buffered in the writer queue. Alert when > 5 000 for 5 min. |
| `ledger_projection_lag_seconds` | Gauge | `tenant` | Wall-clock difference between the latest ledger event and the projection tail. Target: < 30 s. |
| `ledger_projection_rebuild_seconds` | Histogram | `tenant` | Duration of replay/rebuild operations triggered by the LEDGER-29-008 harness. The harness adds a `scenario` label (see §5). |
| `ledger_merkle_anchor_duration_seconds` | Histogram | `tenant` | Time to batch and anchor events. Target: < 60 s per 10k events. |
| `ledger_merkle_anchor_failures_total` | Counter | `tenant`, `reason` (`db`, `signing`, `network`) | Anchor failures by cause. Alert on any increase (> 0) within 15 min. |
| `ledger_attachments_encryption_failures_total` | Counter | `tenant`, `stage` (`encrypt`, `sign`, `upload`) | Failures in the secure attachment pipeline, by stage; used to confirm the pipeline stays healthy. |
| `ledger_db_connections_active` | Gauge | `role` (`writer`, `projector`) | Active database connections per role; helps tune pool size. |
| `ledger_app_version_info` | Gauge | `version`, `git_sha` | Static metric for fleet observability. |
### Derived dashboards
- **Writer health:** `ledger_write_latency_seconds` (P50/P95/P99), backlog gauge, event throughput.
- **Projection health:** `ledger_projection_lag_seconds`, rebuild durations, conflict counts (from logs).
- **Anchoring:** Anchor duration histogram, failure counter, root hash timeline.
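
The P50/P95/P99 panels are estimates derived from histogram buckets, not exact percentiles. The sketch below shows one common way to compute such an estimate, by linear interpolation inside the bucket that contains the target rank. It is an illustration in Python under assumed bucket boundaries and counts, not the query the dashboards actually run.

```python
from typing import Sequence


def estimate_quantile(q: float, bounds: Sequence[float], counts: Sequence[int]) -> float:
    """Estimate quantile q from per-bucket counts.

    bounds: ascending bucket upper bounds, ending with float("inf"),
            with at least one finite bound before it.
    counts: non-cumulative observation count for each bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    total = sum(counts)
    if total == 0:
        return float("nan")
    rank = q * total
    cumulative = 0
    for i, (upper, count) in enumerate(zip(bounds, counts)):
        previous = cumulative
        cumulative += count
        if count > 0 and cumulative >= rank:
            if upper == float("inf"):
                # Open-ended bucket: fall back to the highest finite bound.
                return bounds[i - 1]
            lower = bounds[i - 1] if i > 0 else 0.0
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - previous) / count
    raise AssertionError("unreachable: cumulative count always reaches the rank")


bounds = [0.05, 0.12, 0.25, float("inf")]  # assumed boundaries, in seconds
counts = [80, 15, 5, 0]                    # made-up sample counts
print(estimate_quantile(0.50, bounds, counts))
print(estimate_quantile(0.95, bounds, counts))  # about 0.12 with these counts
```

Because the estimate can never be finer than the bucket layout, placing a boundary at the 0.12 s target keeps the P95 panel meaningful near the SLA.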
## 3. Logs & traces
- **Log structure:** Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot`.
- **Log levels:** `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors.
- **Correlation:** Each API request carries a `requestId` and `traceId`, both logged with the resulting events. Projector logs capture `replayId` and `rebuildReason`.
- **Secrets:** Never log `event_body`; log only metadata and hashes.
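
One way to enforce the "never log `event_body`" rule is an allow-list: only named metadata fields are copied into the log record, so any other key is dropped by construction. The following Python sketch illustrates the idea. The function name `to_log_record` and the sample values are hypothetical, and the real service uses Serilog rather than this helper.

```python
import json
from typing import Any, Dict

# Fields named in this section; anything else is dropped.
ALLOWED_FIELDS = (
    "tenant", "chainId", "sequence", "eventId", "eventType",
    "actorId", "policyVersion", "hash", "merkleRoot",
    "requestId", "traceId",
)


def to_log_record(level: str, message: str, event: Dict[str, Any]) -> str:
    # Allow-list rather than deny-list: unknown keys, including
    # event_body, can never reach the serialized log line.
    fields = {name: event[name] for name in ALLOWED_FIELDS if name in event}
    return json.dumps({"level": level, "message": message, **fields}, sort_keys=True)


event = {
    "tenant": "acme",
    "sequence": 42,
    "eventType": "finding.created",      # made-up event type
    "hash": "ab12",                      # made-up hash value
    "event_body": {"secret": "do not log"},
}
print(to_log_record("Information", "ledger event appended", event))
```

A deny-list that strips `event_body` would also satisfy the rule today, but an allow-list keeps it satisfied when new payload fields are added later.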
## 4. Alerts
| Alert | Condition | Response |
| --- | --- | --- |
| **LedgerWriteSLA** | `ledger_write_latency_seconds` P95 > 0.12 s for 3 consecutive intervals | Check DB contention, review queue backlog, scale the writer. |
| **LedgerBacklogGrowing** | `ledger_ingest_backlog_events` > 5 000 for 5 min | Inspect upstream policy runs; confirm the projector is keeping up. |
| **ProjectionLag** | `ledger_projection_lag_seconds` > 60 s | Trigger a rebuild; verify change streams. |
| **AnchorFailure** | `ledger_merkle_anchor_failures_total` increase > 0 | Collect logs, rerun the anchor, verify the signing service. |
| **AttachmentSecurityError** | `ledger_attachments_encryption_failures_total` increase > 0 | Audit the attachments pipeline; check key material and storage endpoints. |
Alerts integrate with the Notifier channel `ledger.alerts`. For air-gapped deployments, emit alerts to local syslog and the CLI incident scripts.
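
The "for N intervals" and "for 5 min" conditions both reduce to the same check: the threshold must be exceeded on several evaluations in a row before the alert fires. The Python sketch below shows that logic. The helper name `breached_for` is hypothetical, and treating each sample as one evaluation interval (for example one per minute for the backlog alert) is an assumption, not something this profile specifies.

```python
from typing import Iterable


def breached_for(samples: Iterable[float], threshold: float, consecutive: int) -> bool:
    """Return True once `consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # A single sample at or below the threshold resets the streak.
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


# LedgerWriteSLA: P95 latency (seconds) per evaluation interval, made-up values.
p95_by_interval = [0.09, 0.13, 0.14, 0.15]
print(breached_for(p95_by_interval, threshold=0.12, consecutive=3))  # True

# LedgerBacklogGrowing: backlog sampled once per minute (assumed), made-up values.
backlog_by_minute = [5200, 5400, 4900, 5100, 5300]
print(breached_for(backlog_by_minute, threshold=5000, consecutive=5))  # False
```

Requiring a streak rather than a single breach is what keeps a brief latency spike or queue burst from paging anyone.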
## 5. Testing & determinism harness
- **Replay harness:** Running `dotnet run --project tools/LedgerReplayHarness` executes deterministic replays at 5 M findings per tenant and emits `ledger_projection_rebuild_seconds` with a `scenario` label.
- **Property tests:** Seeded tests ensure `ledger_events_total` and Merkle leaf counts match after replay.
- **CI gating:** `LEDGER-29-008` requires the harness output to be uploaded as signed JSON (`harness-report.json` + DSSE) and referenced in the sprint notes.
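
The property described above (event counter equals Merkle leaf count, and the same seed yields the same result) can be sketched as follows. This is a toy Python model, not the replay harness: the `replay` and `merkle_root` functions, the synthetic payloads, and the tree layout (SHA-256, odd nodes promoted unchanged) are all assumptions made for illustration. The ledger's actual root structure is defined in the schema doc.

```python
import hashlib
import random
from typing import List, Tuple


def merkle_root(leaves: List[bytes]) -> bytes:
    """Binary SHA-256 Merkle root; an unpaired node is promoted unchanged."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level) - 1, 2):
            parents.append(hashlib.sha256(level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:
            parents.append(level[-1])
        level = parents
    return level[0]


def replay(seed: int, n_events: int) -> Tuple[int, int, str]:
    """Replay synthetic events; return (events_total, leaf_count, root_hex)."""
    rng = random.Random(seed)   # seeded, so the event stream is reproducible
    events_total = 0            # stands in for ledger_events_total
    leaves: List[bytes] = []
    for sequence in range(n_events):
        payload = f"{sequence}:{rng.getrandbits(64)}".encode()
        leaves.append(payload)
        events_total += 1
    return events_total, len(leaves), merkle_root(leaves).hex()


first = replay(seed=29, n_events=1000)
second = replay(seed=29, n_events=1000)
assert first == second          # determinism: same seed, same root
assert first[0] == first[1]     # counter mirrors the leaf count
```

In the real harness the counter and the leaf count come from separate components, which is what makes the equality a meaningful check rather than a tautology as in this toy.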
## 6. Offline & air-gap guidance
- Collect metrics/log snapshots via `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz`. Include the `ledger_write_latency_seconds` summary, anchor root history, and projection lag samples.
- Ship the default Grafana dashboard JSON under `offline/telemetry/dashboards/ledger/*.json`. The dashboards use the metrics above and filter by `tenant`.
- Ensure the sealed-mode doc (`docs/modules/findings-ledger/schema.md` §3.3) references `ledger_attachments_encryption_failures_total` so Ops can confirm encryption pipeline health without remote telemetry.
## 7. Runbook pointers
- **Anchoring issues:** See `docs/modules/findings-ledger/schema.md` §3 for the root structure and `ops/devops/telemetry/package_offline_bundle.py` for diagnostics.
- **Projection rebuilds:** See `docs/modules/findings-ledger/workflow-inference.md` for chain rules and `scripts/ledger/replay.sh` (LEDGER-29-008 deliverable) for deterministic replays.
---
*Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.*