# Findings Ledger Observability Profile (Sprint 120)
> **Audience:** Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
> **Scope:** Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.
## 1. Telemetry stack & conventions
- **Export path:** .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via `observability.enabled=true` in `appsettings` (a wiring sketch follows this list).
- **Namespace prefix:** `ledger.*` for metrics, `Ledger.*` for logs/traces. Labels follow `tenant`, `chain`, `policy`, `status`, `reason`, `anchor`.
- **Time provenance:** All timestamps are emitted in UTC ISO-8601. When metrics or logs include monotonic durations, they must derive from `TimeProvider`.
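A minimal wiring sketch for the export path above, assuming the service registers its instruments under a meter named `StellaOps.FindingsLedger` and activity sources prefixed `Ledger.*`; the meter/source names and the `observability:enabled` configuration key are illustrative assumptions, not taken from the actual codebase:

```csharp
// Program.cs sketch: OTLP export toward the shared collector (Prometheus/Tempo/Loki behind it).
// Meter/ActivitySource names and the config key are assumptions for illustration only.
using Microsoft.Extensions.Configuration;
using OpenTelemetry;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

if (builder.Configuration.GetValue<bool>("observability:enabled"))
{
    builder.Services.AddOpenTelemetry()
        .WithMetrics(metrics => metrics
            .AddMeter("StellaOps.FindingsLedger")   // ledger_* instruments (see §2)
            .AddOtlpExporter())                     // OTLP -> shared collector
        .WithTracing(tracing => tracing
            .AddSource("Ledger.Write", "Ledger.Projection")
            .AddOtlpExporter());
}

var app = builder.Build();
app.Run();
```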
## 2. Metrics
| Metric | Type | Labels | Description / target |
| --- | --- | --- | --- |
| `ledger_write_duration_seconds` | Histogram | `tenant`, `event_type`, `source` | End-to-end append latency (API ingress → persisted). P95 ≤120ms. |
| `ledger_events_total` | Counter | `tenant`, `event_type`, `source` (`policy`, `workflow`, `orchestrator`) | Incremented per committed event. Mirrors Merkle leaf count. |
| `ledger_ingest_backlog_events` | Gauge | `tenant` | Number of events buffered in the writer/anchor queues. Alert when >5000 for 5min. |
| `ledger_quota_remaining` | Gauge | `tenant` | Remaining ingest capacity before backpressure applies (defaults to 5000 events). |
| `ledger_backpressure_applied_total` | Counter | `tenant`, `reason`, `limit` | Incremented whenever backlog crosses quota threshold. |
| `ledger_quota_rejections_total` | Counter | `tenant`, `reason` | Incremented when requests are actively rejected due to quotas. |
| `ledger_projection_lag_seconds` | Gauge | `tenant` | Wall-clock difference between latest ledger event and projection tail. Target <30s. |
| `ledger_projection_rebuild_seconds` | Histogram | `tenant` | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| `ledger_projection_apply_seconds` | Histogram | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Time to apply a single ledger event to projection. Target P95 <1s. |
| `ledger_projection_events_total` | Counter | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Count of events applied to projections. |
| `ledger_merkle_anchor_duration_seconds` | Histogram | `tenant` | Time to batch + anchor events. Target <60s per 10k events. |
| `ledger_merkle_anchor_failures_total` | Counter | `tenant`, `reason` (`db`, `signing`, `network`) | Alert on any increase within a 15min window. |
| `ledger_attachments_encryption_failures_total` | Counter | `tenant`, `stage` (`encrypt`, `sign`, `upload`) | Ensures secure attachment pipeline stays healthy. |
| `ledger_db_connections_active` | Gauge | `role` (`writer`, `projector`) | Helps tune pool size. |
| `ledger_app_version_info` | Gauge | `version`, `git_sha` | Static metric for fleet observability. |
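As a sketch of how a few of these instruments could be registered with `System.Diagnostics.Metrics`, with durations derived from `TimeProvider` per §1. The `LedgerMetrics` class, meter name, and method shape below are assumptions for illustration, not the actual writer code:

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Sketch only: instrument names and labels follow the table above; the class itself is hypothetical.
public sealed class LedgerMetrics
{
    private readonly Histogram<double> _writeDuration;
    private readonly Counter<long> _eventsTotal;

    public LedgerMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("StellaOps.FindingsLedger");
        _writeDuration = meter.CreateHistogram<double>("ledger_write_duration_seconds", unit: "s");
        _eventsTotal = meter.CreateCounter<long>("ledger_events_total");
    }

    // Wrap the append path: record end-to-end latency and the committed-event count.
    public void RecordAppend(string tenant, string eventType, string source, long startTimestamp, TimeProvider time)
    {
        var tags = new TagList
        {
            { "tenant", tenant },
            { "event_type", eventType },
            { "source", source },
        };
        _writeDuration.Record(time.GetElapsedTime(startTimestamp).TotalSeconds, tags);
        _eventsTotal.Add(1, tags);
    }
}
```

At API ingress the caller would capture `var start = timeProvider.GetTimestamp();` and pass it through, so the recorded latency stays monotonic rather than wall-clock based.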
### Derived dashboards
- **Writer health:** `ledger_write_duration_seconds` (P50/P95/P99), backlog gauge, event throughput.
- **Projection health:** `ledger_projection_lag_seconds`, `ledger_projection_apply_seconds`, projection throughput, conflict counts (from logs).
- **Anchoring:** Anchor duration histogram, failure counter, root hash timeline.
## 3. Logs & traces
- **Log structure:** Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot` (see the sketch after this list).
- **Log levels:** `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors.
- **Correlation:** Each API request includes `requestId` + `traceId` logged with events. Projector logs capture `replayId` and `rebuildReason`.
- **Timeline events:** `ledger.event.appended` and `ledger.projection.updated` are emitted as structured logs carrying `tenant`, `chainId`, `sequence`, `eventId`, `policyVersion`, `traceId`, and placeholder `evidence_ref` fields for downstream timeline consumers.
- **Secrets:** Ensure `event_body` is never logged; log only metadata/hashes.
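A hedged sketch of the `ledger.event.appended` log emitted with Serilog; the `LedgerEventLog` helper and its parameter list are hypothetical, but the property names mirror the fields above and `event_body` is deliberately omitted:

```csharp
using Serilog;

// Sketch only: the LedgerEventLog helper is hypothetical; field names follow the log structure above.
public static class LedgerEventLog
{
    public static void EventAppended(
        ILogger logger, string tenant, string chainId, long sequence, string eventId,
        string eventType, string actorId, string policyVersion, string hash,
        string merkleRoot, string traceId, string requestId)
    {
        logger
            .ForContext("traceId", traceId)
            .ForContext("requestId", requestId)
            .ForContext("evidence_ref", (object?)null)   // placeholder for downstream timeline consumers
            .Information(
                "ledger.event.appended tenant={Tenant} chain={ChainId} seq={Sequence} event={EventId} " +
                "type={EventType} actor={ActorId} policy={PolicyVersion} hash={Hash} merkleRoot={MerkleRoot}",
                tenant, chainId, sequence, eventId, eventType, actorId, policyVersion, hash, merkleRoot);
    }
}
```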
## 4. Alerts
| Alert | Condition | Response |
| --- | --- | --- |
| **LedgerWriteSLA** | `ledger_write_duration_seconds` P95 > 1s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| **LedgerBacklogGrowing** | `ledger_ingest_backlog_events` > 5000 for 5min | Inspect upstream policy runs, ensure the projector is keeping up. |
| **LedgerBackpressure** | `ledger_backpressure_applied_total` increases while `ledger_quota_remaining` < 0 | Throttle callers, raise the quota, or scale the anchor worker. |
| **ProjectionLag** | `ledger_projection_lag_seconds` > 30s | Trigger rebuild, verify change streams. |
| **AnchorFailure** | `ledger_merkle_anchor_failures_total` increase > 0 | Collect logs, rerun anchor, verify signing service. |
| **AttachmentSecurityError** | `ledger_attachments_encryption_failures_total` increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |
Alerts integrate with the Notifier channel `ledger.alerts`. For air-gapped deployments, emit to local syslog and CLI incident scripts.
## 5. Testing & determinism harness
- **Replay harness:** CLI `dotnet run --project tools/LedgerReplayHarness` executes deterministic replays at 5M findings per tenant. Emits `ledger_projection_rebuild_seconds` with a `scenario` label.
- **Property tests:** Seeded tests ensure `ledger_events_total` and Merkle leaf counts match after replay (shape sketched after this list).
- **CI gating:** `LEDGER-29-008` requires harness output uploaded as signed JSON (`harness-report.json` + DSSE) and referenced in sprint notes.
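A minimal xUnit-style sketch of the seeded determinism property; `ReplayResult` and the fake replay below are stand-ins for the real harness in `tools/LedgerReplayHarness`, whose report schema is owned by LEDGER-29-008:

```csharp
using System.Security.Cryptography;
using System.Text;
using Xunit;

// Hypothetical replay summary; the real harness-report.json schema is defined by LEDGER-29-008.
public sealed record ReplayResult(long EventsApplied, long MerkleLeafCount, string ProjectionHash);

public class ReplayDeterminismTests
{
    [Fact]
    public void Seeded_replay_is_deterministic_and_counts_match()
    {
        var first = RunSeededReplay(seed: 42, eventCount: 1_000);
        var second = RunSeededReplay(seed: 42, eventCount: 1_000);

        Assert.Equal(first.EventsApplied, first.MerkleLeafCount);   // ledger_events_total mirrors leaf count
        Assert.Equal(first.EventsApplied, second.EventsApplied);    // same seed, same event count
        Assert.Equal(first.ProjectionHash, second.ProjectionHash);  // identical seed => identical projection
    }

    private static ReplayResult RunSeededReplay(int seed, int eventCount)
    {
        // Deterministic stand-in: derive a stable hash from the seeded event stream.
        var rng = new Random(seed);
        var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        for (var i = 0; i < eventCount; i++)
        {
            hasher.AppendData(Encoding.UTF8.GetBytes($"{i}:{rng.Next()}"));
        }
        var hash = Convert.ToHexString(hasher.GetHashAndReset());
        return new ReplayResult(eventCount, eventCount, hash);
    }
}
```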
## 6. Offline & air-gap guidance
- Collect metrics/log snapshots via `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz`. Include the `ledger_write_duration_seconds` summary, anchor root history, and projection lag samples.
- Include default Grafana JSON under `offline/telemetry/dashboards/ledger/*.json`. Dashboards use the metrics above; filter by `tenant`.
- Ensure sealed-mode doc (`docs/modules/findings-ledger/schema.md` §3.3) references `ledger_attachments_encryption_failures_total` so Ops can confirm encryption pipeline health without remote telemetry.
## 7. Runbook pointers
- **Anchoring issues:** Refer to `docs/modules/findings-ledger/schema.md` §3 for root structure, `ops/devops/telemetry/package_offline_bundle.py` for diagnostics.
- **Projection rebuilds:** `docs/modules/findings-ledger/workflow-inference.md` for chain rules; `scripts/ledger/replay.sh` (LEDGER-29-008 deliverable) for deterministic replays.
---
*Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.*