Findings Ledger Observability Profile (Sprint 120)
Audience: Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
Scope: Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.
1. Telemetry stack & conventions
- Export path: .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via `observability.enabled=true` in `appsettings`.
- Namespace prefix: `ledger.*` for metrics, `Ledger.*` for logs/traces. Labels follow `tenant`, `chain`, `policy`, `status`, `reason`, `anchor`.
- Time provenance: All timestamps emitted in UTC ISO-8601. When metrics/logs include monotonic durations they must derive from `TimeProvider`.
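The time-provenance rule can be illustrated with a small, language-neutral sketch (written in Python here, although the service itself is .NET). It assumes an injectable clock playing the role that `TimeProvider` plays in the real code; `SystemClock`, `FakeClock`, `iso_utc`, and `timed_append` are hypothetical names introduced only for this example.

```python
from datetime import datetime, timedelta, timezone
import time


class SystemClock:
    """Production-style clock: wall time in UTC plus a monotonic counter."""

    def utc_now(self) -> datetime:
        return datetime.now(timezone.utc)

    def monotonic(self) -> float:
        return time.monotonic()


class FakeClock:
    """Test clock that only moves when advance() is called (hypothetical)."""

    def __init__(self, start: datetime) -> None:
        self._now = start
        self._mono = 0.0

    def advance(self, seconds: float) -> None:
        self._now += timedelta(seconds=seconds)
        self._mono += seconds

    def utc_now(self) -> datetime:
        return self._now

    def monotonic(self) -> float:
        return self._mono


def iso_utc(ts: datetime) -> str:
    """Render an aware datetime as UTC ISO-8601 with a trailing Z."""
    text = ts.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def timed_append(clock, append) -> dict:
    """Run append() and derive its duration from the monotonic source only."""
    started = clock.monotonic()
    append()
    return {
        "timestamp": iso_utc(clock.utc_now()),
        "duration_seconds": clock.monotonic() - started,
    }
```

Injecting the clock keeps durations independent of wall-clock adjustments and lets tests advance time deterministically instead of sleeping.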
2. Metrics
| Metric | Type | Labels | Description / target |
|---|---|---|---|
| `ledger_write_latency_seconds` | Histogram | `tenant`, `event_type` | End-to-end append latency (API ingress → persisted). P95 ≤ 120 ms. |
| `ledger_events_total` | Counter | `tenant`, `event_type`, `source` (`policy`, `workflow`, `orchestrator`) | Incremented per committed event. Mirrors Merkle leaf count. |
| `ledger_ingest_backlog_events` | Gauge | `tenant` | Number of events buffered in the writer queue. Alert when >5 000 for 5 min. |
| `ledger_projection_lag_seconds` | Gauge | `tenant` | Wall-clock difference between latest ledger event and projection tail. Target <30 s. |
| `ledger_projection_rebuild_seconds` | Histogram | `tenant` | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| `ledger_merkle_anchor_duration_seconds` | Histogram | `tenant` | Time to batch + anchor events. Target <60 s per 10k events. |
| `ledger_merkle_anchor_failures_total` | Counter | `tenant`, `reason` (`db`, `signing`, `network`) | Alerts at >0 within 15 min. |
| `ledger_attachments_encryption_failures_total` | Counter | `tenant`, `stage` (`encrypt`, `sign`, `upload`) | Ensures secure attachment pipeline stays healthy. |
| `ledger_db_connections_active` | Gauge | `role` (`writer`, `projector`) | Helps tune pool size. |
| `ledger_app_version_info` | Gauge | `version`, `git_sha` | Static metric for fleet observability. |
Derived dashboards
- Writer health: `ledger_write_latency_seconds` (P50/P95/P99), backlog gauge, event throughput.
- Projection health: `ledger_projection_lag_seconds`, rebuild durations, conflict counts (from logs).
- Anchoring: Anchor duration histogram, failure counter, root hash timeline.
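The P50/P95/P99 panels are typically derived from cumulative histogram buckets rather than raw samples. The sketch below, in Python for illustration, estimates a quantile by linear interpolation inside the bucket that contains the target rank, which is the same idea Prometheus's `histogram_quantile` uses. The bucket boundaries and counts are invented for the example; the document does not specify the real bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket layout for ledger_write_latency_seconds (seconds).
example_buckets = [
    (0.05, 600),
    (0.10, 900),
    (0.25, 990),
    (0.50, 1000),
    (math.inf, 1000),
]
p95 = histogram_quantile(0.95, example_buckets)
```

With these made-up counts the estimate lands near 0.18 s, above the 120 ms target. Because the estimate is interpolated, placing a bucket boundary at or close to 0.12 s would make the SLA comparison noticeably more precise.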
3. Logs & traces
- Log structure: Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot`.
- Log levels: `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors.
- Correlation: Each API request includes `requestId` + `traceId` logged with events. Projector logs capture `replayId` and `rebuildReason`.
- Secrets: Ensure `event_body` is never logged; log only metadata/hashes.
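One way to make the "never log `event_body`" rule hard to violate is an allow-list: only the documented fields are copied into the log record, so any new payload field is excluded by default. The following Python sketch illustrates the idea; the real service uses Serilog, and `ALLOWED_FIELDS` and `to_log_record` are hypothetical names for this example.

```python
import json

# Field names taken from the log structure and correlation bullets above.
ALLOWED_FIELDS = (
    "tenant", "chainId", "sequence", "eventId", "eventType",
    "actorId", "policyVersion", "hash", "merkleRoot",
    "requestId", "traceId",
)


def to_log_record(event: dict, level: str = "Information") -> str:
    """Serialize only allow-listed metadata; everything else is dropped."""
    record = {"level": level}
    for field in ALLOWED_FIELDS:
        if field in event:
            record[field] = event[field]
    return json.dumps(record, sort_keys=True)
```

An allow-list is chosen over a deny-list here because a deny-list only protects against field names someone remembered to add.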
4. Alerts
| Alert | Condition | Response |
|---|---|---|
| LedgerWriteSLA | `ledger_write_latency_seconds` P95 > 0.12 s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| LedgerBacklogGrowing | `ledger_ingest_backlog_events` > 5 000 for 5 min | Inspect upstream policy runs, ensure projector keeping up. |
| ProjectionLag | `ledger_projection_lag_seconds` > 60 s | Trigger rebuild, verify change streams. |
| AnchorFailure | `ledger_merkle_anchor_failures_total` increase > 0 | Collect logs, rerun anchor, verify signing service. |
| AttachmentSecurityError | `ledger_attachments_encryption_failures_total` increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |
Alerts integrate with Notifier channel ledger.alerts. For air-gapped deployments emit to local syslog + CLI incident scripts.
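The two condition shapes in the table, "above a threshold for N consecutive intervals" and "counter increase > 0", can be sketched as plain functions. This is an illustrative Python sketch rather than the alerting engine's actual logic; `sustained_breach` and `counter_increase` are hypothetical helpers, and the reset handling assumes Prometheus-style counters that restart from zero.

```python
def sustained_breach(samples: list, threshold: float, required: int) -> bool:
    """True when the most recent `required` samples all exceed the threshold."""
    if required <= 0 or len(samples) < required:
        return False
    return all(value > threshold for value in samples[-required:])


def counter_increase(samples: list) -> float:
    """Total increase across a window, treating any drop as a counter reset."""
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter restarted from zero: everything counted so far is new.
            increase += current
    return increase
```

For example, `sustained_breach(p95_samples, 0.12, 3)` corresponds to the LedgerWriteSLA condition, and `counter_increase(failure_samples) > 0` to AnchorFailure. Requiring consecutive breaches avoids paging on a single noisy interval.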
5. Testing & determinism harness
- Replay harness: CLI `dotnet run --project tools/LedgerReplayHarness` executes deterministic replays at 5 M findings/tenant. Metrics emitted: `ledger_projection_rebuild_seconds` with `scenario` label.
- Property tests: Seeded tests ensure `ledger_events_total` and Merkle leaf counts match after replay.
- CI gating: `LEDGER-29-008` requires harness output uploaded as signed JSON (`harness-report.json` + DSSE) and referenced in sprint notes.
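The property the seeded tests check can be sketched in a few lines: replaying the same seeded event stream twice should yield the same event count, the same leaf count, and the same root. The Python below is only an illustration. The event encoding, the SHA-256 pairing, and the duplicate-last-node rule for odd levels are assumptions made for the example, not the ledger's actual Merkle construction, and `make_events`, `merkle_root`, and `replay` are hypothetical names.

```python
import hashlib
import random


def make_events(seed: int, count: int) -> list:
    """Generate a reproducible stream of fake event payloads."""
    rng = random.Random(seed)
    return [f"event-{i}-{rng.randrange(1_000_000)}".encode() for i in range(count)]


def merkle_root(leaves: list) -> str:
    """Pairwise SHA-256 tree; an odd level duplicates its last node (assumed)."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def replay(seed: int, count: int) -> dict:
    """Replay a seeded stream, tracking the counter and the leaves side by side."""
    events_total = 0
    leaves = []
    for event in make_events(seed, count):
        events_total += 1  # stands in for ledger_events_total
        leaves.append(event)
    return {
        "events_total": events_total,
        "leaf_count": len(leaves),
        "root": merkle_root(leaves),
    }
```

A test would then assert that `replay(seed, n)` is identical across runs and that `events_total` equals `leaf_count`.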
6. Offline & air-gap guidance
- Collect metrics/log snapshots via `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz`. Include `ledger_write_latency_seconds` summary, anchor root history, and projection lag samples.
- Include default Grafana JSON under `offline/telemetry/dashboards/ledger/*.json`. Dashboards use the metrics above; filter by `tenant`.
- Ensure sealed-mode doc (`docs/modules/findings-ledger/schema.md` §3.3) references `ledger_attachments_encryption_failures_total` so Ops can confirm encryption pipeline health without remote telemetry.
7. Runbook pointers
- Anchoring issues: Refer to `docs/modules/findings-ledger/schema.md` §3 for root structure, `ops/devops/telemetry/package_offline_bundle.py` for diagnostics.
- Projection rebuilds: `docs/modules/findings-ledger/workflow-inference.md` for chain rules; `scripts/ledger/replay.sh` (LEDGER-29-008 deliverable) for deterministic replays.
Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.