# Findings Ledger Observability Profile (Sprint 120)
> **Audience:** Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
> **Scope:** Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.
## 1. Telemetry stack & conventions
- **Export path:** .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via `observability.enabled=true` in `appsettings`.
- **Namespace prefix:** `ledger.*` for metrics, `Ledger.*` for logs/traces. Labels follow `tenant`, `chain`, `policy`, `status`, `reason`, `anchor`.
- **Time provenance:** All timestamps emitted in UTC ISO-8601. When metrics/logs include monotonic durations they must derive from `TimeProvider`.
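
The split between wall-clock timestamps and monotonic durations can be illustrated with a minimal sketch. The service itself uses the .NET `TimeProvider`; the Python below is only an analogy, and the helper names `utc_timestamp` and `timed` are hypothetical.

```python
import time
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Wall-clock timestamp in UTC, formatted as ISO-8601 (for log/event fields)."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


def timed(operation):
    """Run `operation` and return (result, elapsed_seconds).

    The duration comes from a monotonic clock, so it is unaffected by
    wall-clock adjustments (NTP steps, manual changes).
    """
    start = time.monotonic()
    result = operation()
    return result, time.monotonic() - start
```

Timestamps answer "when did this happen", while durations answer "how long did it take"; keeping the two clock sources separate avoids negative or inflated latencies when the system clock is adjusted.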
## 2. Metrics
| Metric | Type | Labels | Description / target |
| --- | --- | --- | --- |
| `ledger_write_duration_seconds` | Histogram | `tenant`, `event_type`, `source` | End-to-end append latency (API ingress → persisted). P95 ≤120ms. |
| `ledger_events_total` | Counter | `tenant`, `event_type`, `source` (`policy`, `workflow`, `orchestrator`) | Incremented per committed event. Mirrors Merkle leaf count. |
| `ledger_ingest_backlog_events` | Gauge | `tenant` | Number of events buffered in the writer/anchor queues. Alert when >5000 for 5min. |
| `ledger_quota_remaining` | Gauge | `tenant` | Remaining ingest capacity before backpressure applies (defaults to 5000 events). |
| `ledger_backpressure_applied_total` | Counter | `tenant`, `reason`, `limit` | Incremented whenever backlog crosses quota threshold. |
| `ledger_quota_rejections_total` | Counter | `tenant`, `reason` | Incremented when requests are actively rejected due to quotas. |
| `ledger_projection_lag_seconds` | Gauge | `tenant` | Wall-clock difference between latest ledger event and projection tail. Target <30s. |
| `ledger_projection_rebuild_seconds` | Histogram | `tenant` | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| `ledger_projection_apply_seconds` | Histogram | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Time to apply a single ledger event to projection. Target P95 <1s. |
| `ledger_projection_events_total` | Counter | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Count of events applied to projections. |
| `ledger_merkle_anchor_duration_seconds` | Histogram | `tenant` | Time to batch + anchor events. Target <60s per 10k events. |
| `ledger_merkle_anchor_failures_total` | Counter | `tenant`, `reason` (`db`, `signing`, `network`) | Count of failed anchor attempts. Alert when >0 within 15min. |
| `ledger_attachments_encryption_failures_total` | Counter | `tenant`, `stage` (`encrypt`, `sign`, `upload`) | Ensures secure attachment pipeline stays healthy. |
| `ledger_db_connections_active` | Gauge | `role` (`writer`, `projector`) | Helps tune pool size. |
| `ledger_app_version_info` | Gauge | `version`, `git_sha` | Static metric for fleet observability. |
| `ledger_scoring_latency_seconds` | Histogram | `tenant`, `policy_version`, `result` | Latency of risk scoring operations per finding. P95 target <500 ms. |
| `ledger_scoring_operations_total` | Counter | `tenant`, `policy_version`, `result` | Total number of scoring operations by result (success, partial_success, error, etc.). |
| `ledger_scoring_provider_gaps_total` | Counter | `tenant`, `provider`, `reason` | Count of findings where scoring provider was unavailable or returned no data. |
| `ledger_severity_distribution_critical` | Gauge | `tenant`, `policy_version` | Current count of critical severity findings by tenant and policy. |
| `ledger_severity_distribution_high` | Gauge | `tenant`, `policy_version` | Current count of high severity findings by tenant and policy. |
| `ledger_severity_distribution_medium` | Gauge | `tenant`, `policy_version` | Current count of medium severity findings by tenant and policy. |
| `ledger_severity_distribution_low` | Gauge | `tenant`, `policy_version` | Current count of low severity findings by tenant and policy. |
| `ledger_severity_distribution_unknown` | Gauge | `tenant`, `policy_version` | Current count of unknown/unscored findings by tenant and policy. |
| `ledger_score_freshness_seconds` | Gauge | `tenant` | Time since last scoring operation completed by tenant. Alert when >3600 s. |
| `ledger_scored_findings_exports_total` | Counter | `tenant`, `record_count` | Count of scored findings export operations. |
| `ledger_scored_findings_export_duration_seconds` | Histogram | `tenant`, `record_count` | Duration of scored findings export operations. |
| `ledger_airgap_staleness_seconds` | Histogram | `domain` | Distribution of observed staleness of air-gap imported data by domain. |
| `ledger_airgap_staleness_gauge_seconds` | Gauge | `domain` | Current staleness of air-gap data by domain (observable gauge). |
| `ledger_staleness_validation_failures_total` | Counter | `domain` | Count of staleness validation failures blocking exports. |
### Derived dashboards
- **Writer health:** `ledger_write_duration_seconds` (P50/P95/P99), backlog gauge, event throughput.
- **Projection health:** `ledger_projection_lag_seconds`, `ledger_projection_apply_seconds`, projection throughput, conflict counts (from logs).
- **Anchoring:** Anchor duration histogram, failure counter, root hash timeline.
- **Risk scoring:** `ledger_scoring_latency_seconds` (P50/P95/P99), severity distribution gauges, provider gap counter, score freshness.
- **Export operations:** `ledger_scored_findings_exports_total`, export duration histogram, record counts.
- **Air-gap health:** `ledger_airgap_staleness_gauge_seconds`, staleness validation failures, domain freshness trends.
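
The percentile targets in the table (for example P95 ≤120ms on `ledger_write_duration_seconds`) are read from histogram buckets rather than raw samples. The sketch below shows one common way to estimate a quantile from cumulative buckets using linear interpolation inside the matching bucket, similar in spirit to what Prometheus-style backends do. The function name and the bucket boundaries are illustrative assumptions, not the ledger's actual configuration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate quantile `q` from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) sorted by
    upper bound, with the last bound being +inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical bucket layout for write latency (seconds).
sample = [(0.05, 60), (0.1, 90), (0.25, 98), (0.5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

With this sample, the 95th observation falls in the 0.1 to 0.25 s bucket, so the estimate lands above the 120 ms target. The precision of such an estimate depends on where the bucket boundaries sit relative to the target.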
## 3. Logs & traces
- **Log structure:** Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot`.
- **Log levels:** `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors.
- **Correlation:** Each API request includes `requestId` + `traceId` logged with events. Projector logs capture `replayId` and `rebuildReason`.
- **Timeline events:** `ledger.event.appended` and `ledger.projection.updated` are emitted as structured logs carrying `tenant`, `chainId`, `sequence`, `eventId`, `policyVersion`, `traceId`, and placeholder `evidence_ref` fields for downstream timeline consumers.
- **Secrets:** Ensure `event_body` is never logged; log only metadata/hashes.
- **Incident mode:** When incident mode is active, emit `ledger.incident.mode`, `ledger.incident.lag_trace`, `ledger.incident.conflict_snapshot`, and `ledger.incident.replay_trace` logs (with activation id, retention extension days, lag seconds, conflict reason). Snapshot TTLs inherit an incident retention extension and are annotated with `incident.*` metadata.
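
One way to honor the rule that `event_body` is never logged is an allow-list: only named metadata fields are copied into the log record, so new payload fields cannot leak by accident. The sketch below is a hypothetical Python illustration (the service itself uses Serilog); `ALLOWED_LOG_FIELDS` and `build_log_record` are assumed names.

```python
import json

# Metadata fields listed under "Log structure"; anything else is dropped.
ALLOWED_LOG_FIELDS = (
    "tenant", "chainId", "sequence", "eventId", "eventType",
    "actorId", "policyVersion", "hash", "merkleRoot",
)


def build_log_record(event: dict) -> str:
    """Return a JSON log line containing only allow-listed metadata fields."""
    record = {key: event[key] for key in ALLOWED_LOG_FIELDS if key in event}
    return json.dumps(record, sort_keys=True)


line = build_log_record({
    "tenant": "tenant-a",
    "chainId": "chain-1",
    "sequence": 42,
    "eventType": "finding.created",
    "hash": "ab12",
    "event_body": {"secret": "must not appear in logs"},
})
```

An allow-list fails closed: a field added to the event later stays out of the logs until someone explicitly adds it to the list.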
## 4. Alerts
| Alert | Condition | Response |
| --- | --- | --- |
| **LedgerWriteSLA** | `ledger_write_duration_seconds` P95 > 1s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| **LedgerBacklogGrowing** | `ledger_ingest_backlog_events` > 5000 for 5min | Inspect upstream policy runs, ensure the projector is keeping up. |
| **LedgerBackpressure** | `ledger_backpressure_applied_total` increases while `ledger_quota_remaining` < 0 | Throttle callers, raise quota or scale anchor worker. |
| **ProjectionLag** | `ledger_projection_lag_seconds` > 30s | Trigger rebuild, verify change streams. |
| **AnchorFailure** | `ledger_merkle_anchor_failures_total` increase > 0 | Collect logs, rerun anchor, verify signing service. |
| **AttachmentSecurityError** | `ledger_attachments_encryption_failures_total` increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |
| **ScoringFreshnessStale** | `ledger_score_freshness_seconds` > 3600 s for any tenant | Check scoring pipeline, verify provider connectivity, re-trigger scoring job. |
| **ScoringProviderGaps** | `ledger_scoring_provider_gaps_total` increase > 10 in 5 min | Investigate provider failures; check rate limits or connectivity. |
| **AirgapDataStale** | `ledger_airgap_staleness_gauge_seconds` > threshold for 15 min | Re-import air-gap bundle; verify export pipeline in source enclave. |

Alerts integrate with Notifier channel `ledger.alerts`. For air-gapped deployments emit to local syslog + CLI incident scripts.
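
Several conditions above are of the form "value above threshold for N minutes" (for example `ledger_ingest_backlog_events` > 5000 for 5min). The following is a minimal sketch of how such a sustained condition could be evaluated over timestamped gauge samples, which may be useful in air-gapped setups without a full alerting stack. The function name and sample layout are assumptions for illustration only.

```python
from typing import Iterable, Tuple


def sustained_above(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    duration_seconds: float,
) -> bool:
    """Return True if the value has stayed above `threshold` continuously for
    at least `duration_seconds`, up to and including the latest sample.

    `samples` are (unix_timestamp, value) pairs in ascending time order.
    """
    breach_start = None
    last_ts = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach window
        else:
            breach_start = None  # any dip resets the window
        last_ts = ts
    return breach_start is not None and last_ts - breach_start >= duration_seconds


# Backlog sampled once per minute for five minutes, always above 5000.
steady = [(t, 6000.0) for t in range(0, 301, 60)]
# Same series with a single dip below the threshold at t=120.
dipped = [(t, 4000.0 if t == 120 else 6000.0) for t in range(0, 301, 60)]
```

Here `sustained_above(steady, 5000, 300)` fires, while the dipped series does not, because the breach window restarts after the dip.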
## 5. Testing & determinism harness
- **Replay harness:** CLI `dotnet run --project tools/LedgerReplayHarness` executes deterministic replays at 5M findings/tenant. Metrics emitted: `ledger_projection_rebuild_seconds` with `scenario` label.
- **Property tests:** Seeded tests ensure `ledger_events_total` and Merkle leaf counts match after replay.
- **CI gating:** `LEDGER-29-008` requires harness output uploaded as signed JSON (`harness-report.json` + DSSE) and referenced in sprint notes.
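
The property described above (event count matches Merkle leaf count, and a seeded replay is reproducible) can be sketched as follows. The tree construction here (SHA-256, duplicating the last node on odd levels) and the helper names are assumptions for illustration; the ledger's real hashing scheme is defined in the schema spec.

```python
import hashlib
import random
from typing import List, Tuple


def merkle_root(leaves: List[bytes]) -> bytes:
    """Compute a simple SHA-256 Merkle root (illustrative scheme only)."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd-sized levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def generate_events(seed: int, count: int) -> List[bytes]:
    """Produce a deterministic event stream from a seed."""
    rng = random.Random(seed)
    return [f"event-{i}-{rng.getrandbits(64):016x}".encode() for i in range(count)]


def replay(events: List[bytes]) -> Tuple[int, int, bytes]:
    """Replay events; return (events_total, merkle_leaf_count, merkle_root)."""
    events_total = 0
    leaves: List[bytes] = []
    for event in events:
        events_total += 1      # stands in for ledger_events_total
        leaves.append(event)   # one Merkle leaf per committed event
    return events_total, len(leaves), merkle_root(leaves)


first = replay(generate_events(seed=7, count=1000))
second = replay(generate_events(seed=7, count=1000))
```

Two replays with the same seed should agree on all three values; a mismatch in count or root indicates a non-deterministic step in the pipeline.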
## 6. Offline & air-gap guidance
- Collect metrics/log snapshots via `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz`. Include a `ledger_write_duration_seconds` summary, anchor root history, and projection lag samples.
- Include default Grafana JSON under `offline/telemetry/dashboards/ledger/*.json`. Dashboards use the metrics above; filter by `tenant`.
- Ensure sealed-mode doc (`docs/modules/findings-ledger/schema.md` §3.3) references `ledger_attachments_encryption_failures_total` so Ops can confirm encryption pipeline health without remote telemetry.
## 7. Runbook pointers
- **Anchoring issues:** Refer to `docs/modules/findings-ledger/schema.md` §3 for root structure, `ops/devops/telemetry/package_offline_bundle.py` for diagnostics.
- **Projection rebuilds:** `docs/modules/findings-ledger/workflow-inference.md` for chain rules; `scripts/ledger/replay.sh` (LEDGER-29-008 deliverable) for deterministic replays.
---
*Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.*