# Findings Ledger Observability Profile (Sprint 120) > **Audience:** Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild > **Scope:** Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs. ## 1. Telemetry stack & conventions - **Export path:** .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via `observability.enabled=true` in `appsettings`. - **Namespace prefix:** `ledger.*` for metrics, `Ledger.*` for logs/traces. Labels follow `tenant`, `chain`, `policy`, `status`, `reason`, `anchor`. - **Time provenance:** All timestamps emitted in UTC ISO-8601. When metrics/logs include monotonic durations they must derive from `TimeProvider`. ## 2. Metrics | Metric | Type | Labels | Description / target | | --- | --- | --- | --- | | `ledger_write_duration_seconds` | Histogram | `tenant`, `event_type`, `source` | End-to-end append latency (API ingress → persisted). P95 ≤ 120 ms. | | `ledger_events_total` | Counter | `tenant`, `event_type`, `source` (`policy`, `workflow`, `orchestrator`) | Incremented per committed event. Mirrors Merkle leaf count. | | `ledger_ingest_backlog_events` | Gauge | `tenant` | Number of events buffered in the writer/anchor queues. Alert when >5 000 for 5 min. | | `ledger_quota_remaining` | Gauge | `tenant` | Remaining ingest capacity before backpressure applies (defaults to 5 000 events). | | `ledger_backpressure_applied_total` | Counter | `tenant`, `reason`, `limit` | Incremented whenever backlog crosses quota threshold. | | `ledger_quota_rejections_total` | Counter | `tenant`, `reason` | Incremented when requests are actively rejected due to quotas. | | `ledger_projection_lag_seconds` | Gauge | `tenant` | Wall-clock difference between latest ledger event and projection tail. Target <30 s. | | `ledger_projection_rebuild_seconds` | Histogram | `tenant` | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. | | `ledger_projection_apply_seconds` | Histogram | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Time to apply a single ledger event to projection. Target P95 <1 s. | | `ledger_projection_events_total` | Counter | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Count of events applied to projections. | | `ledger_merkle_anchor_duration_seconds` | Histogram | `tenant` | Time to batch + anchor events. Target <60 s per 10k events. | | `ledger_merkle_anchor_failures_total` | Counter | `tenant`, `reason` (`db`, `signing`, `network`) | Alerts at >0 within 15 min. | | `ledger_attachments_encryption_failures_total` | Counter | `tenant`, `stage` (`encrypt`, `sign`, `upload`) | Ensures secure attachment pipeline stays healthy. | | `ledger_db_connections_active` | Gauge | `role` (`writer`, `projector`) | Helps tune pool size. | | `ledger_app_version_info` | Gauge | `version`, `git_sha` | Static metric for fleet observability. | | `ledger_scoring_latency_seconds` | Histogram | `tenant`, `policy_version`, `result` | Latency of risk scoring operations per finding. P95 target <500 ms. | | `ledger_scoring_operations_total` | Counter | `tenant`, `policy_version`, `result` | Total number of scoring operations by result (success, partial_success, error, etc.). | | `ledger_scoring_provider_gaps_total` | Counter | `tenant`, `provider`, `reason` | Count of findings where scoring provider was unavailable or returned no data. | | `ledger_severity_distribution_critical` | Gauge | `tenant`, `policy_version` | Current count of critical severity findings by tenant and policy. | | `ledger_severity_distribution_high` | Gauge | `tenant`, `policy_version` | Current count of high severity findings by tenant and policy. | | `ledger_severity_distribution_medium` | Gauge | `tenant`, `policy_version` | Current count of medium severity findings by tenant and policy. | | `ledger_severity_distribution_low` | Gauge | `tenant`, `policy_version` | Current count of low severity findings by tenant and policy. | | `ledger_severity_distribution_unknown` | Gauge | `tenant`, `policy_version` | Current count of unknown/unscored findings by tenant and policy. | | `ledger_score_freshness_seconds` | Gauge | `tenant` | Time since last scoring operation completed by tenant. Alert when >3600 s. | | `ledger_scored_findings_exports_total` | Counter | `tenant`, `record_count` | Count of scored findings export operations. | | `ledger_scored_findings_export_duration_seconds` | Histogram | `tenant`, `record_count` | Duration of scored findings export operations. | | `ledger_airgap_staleness_seconds` | Histogram | `domain` | Current staleness of air-gap imported data by domain. | | `ledger_airgap_staleness_gauge_seconds` | Gauge | `domain` | Current staleness of air-gap data by domain (observable gauge). | | `ledger_staleness_validation_failures_total` | Counter | `domain` | Count of staleness validation failures blocking exports. | ### Derived dashboards - **Writer health:** `ledger_write_latency_seconds` (P50/P95/P99), backlog gauge, event throughput. - **Projection health:** `ledger_projection_lag_seconds`, `ledger_projection_apply_seconds`, projection throughput, conflict counts (from logs). - **Anchoring:** Anchor duration histogram, failure counter, root hash timeline. - **Risk scoring:** `ledger_scoring_latency_seconds` (P50/P95/P99), severity distribution gauges, provider gap counter, score freshness. - **Export operations:** `ledger_scored_findings_exports_total`, export duration histogram, record counts. - **Air-gap health:** `ledger_airgap_staleness_gauge_seconds`, staleness validation failures, domain freshness trends. ## 3. Logs & traces - **Log structure:** Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot`. - **Log levels:** `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors. - **Correlation:** Each API request includes `requestId` + `traceId` logged with events. Projector logs capture `replayId` and `rebuildReason`. - **Timeline events:** `ledger.event.appended` and `ledger.projection.updated` are emitted as structured logs carrying `tenant`, `chainId`, `sequence`, `eventId`, `policyVersion`, `traceId`, and placeholder `evidence_ref` fields for downstream timeline consumers. - **Secrets:** Ensure `event_body` is never logged; log only metadata/hashes. - **Incident mode:** When incident mode is active, emit `ledger.incident.mode`, `ledger.incident.lag_trace`, `ledger.incident.conflict_snapshot`, and `ledger.incident.replay_trace` logs (with activation id, retention extension days, lag seconds, conflict reason). Snapshot TTLs inherit an incident retention extension and are annotated with `incident.*` metadata. ## 4. Alerts | Alert | Condition | Response | | --- | --- | --- | | **LedgerWriteSLA** | `ledger_write_latency_seconds` P95 > 1 s for 3 intervals | Check DB contention, review queue backlog, scale writer. | | **LedgerBacklogGrowing** | `ledger_ingest_backlog_events` > 5 000 for 5 min | Inspect upstream policy runs, ensure projector keeping up. | | **LedgerBackpressure** | `ledger_backpressure_applied_total` increases while `ledger_quota_remaining` < 0 | Throttle callers, raise quota or scale anchor worker. | | **ProjectionLag** | `ledger_projection_lag_seconds` > 30 s | Trigger rebuild, verify change streams. | | **AnchorFailure** | `ledger_merkle_anchor_failures_total` increase > 0 | Collect logs, rerun anchor, verify signing service. | | **AttachmentSecurityError** | `ledger_attachments_encryption_failures_total` increase > 0 | Audit attachments pipeline; check key material and storage endpoints. | | **ScoringFreshnessStale** | `ledger_score_freshness_seconds` > 3600 s for any tenant | Check scoring pipeline, verify provider connectivity, re-trigger scoring job. | | **ScoringProviderGaps** | `ledger_scoring_provider_gaps_total` increase > 10 in 5 min | Investigate provider failures; check rate limits or connectivity. | | **AirgapDataStale** | `ledger_airgap_staleness_gauge_seconds` > threshold for 15 min | Re-import air-gap bundle; verify export pipeline in source enclave. | Alerts integrate with Notifier channel `ledger.alerts`. For air-gapped deployments emit to local syslog + CLI incident scripts. ## 5. Testing & determinism harness - **Replay harness:** CLI `dotnet run --project tools/LedgerReplayHarness` executes deterministic replays at 5 M findings/tenant. Metrics emitted: `ledger_projection_rebuild_seconds` with `scenario` label. - **Property tests:** Seeded tests ensure `ledger_events_total` and Merkle leaf counts match after replay. - **CI gating:** `LEDGER-29-008` requires harness output uploaded as signed JSON (`harness-report.json` + DSSE) and referenced in sprint notes. ## 6. Offline & air-gap guidance - Collect metrics/log snapshots via `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz`. Include `ledger_write_latency_seconds` summary, anchor root history, and projection lag samples. - Include default Grafana JSON under `offline/telemetry/dashboards/ledger/*.json`. Dashboards use the metrics above; filter by `tenant`. - Ensure sealed-mode doc (`docs/modules/findings-ledger/schema.md` §3.3) references `ledger_attachments_encryption_failures_total` so Ops can confirm encryption pipeline health without remote telemetry. ## 7. Runbook pointers - **Anchoring issues:** Refer to `docs/modules/findings-ledger/schema.md` §3 for root structure, `ops/devops/telemetry/package_offline_bundle.py` for diagnostics. - **Projection rebuilds:** `docs/modules/findings-ledger/workflow-inference.md` for chain rules; `scripts/ledger/replay.sh` (LEDGER-29-008 deliverable) for deterministic replays. --- *Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.*