Findings Ledger Observability Profile (Sprint 120)
Audience: Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
Scope: Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.
1. Telemetry stack & conventions
- Export path: .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via `observability.enabled=true` in `appsettings`.
- Namespace prefix: `ledger.*` for metrics, `Ledger.*` for logs/traces. Labels follow `tenant`, `chain`, `policy`, `status`, `reason`, `anchor`.
- Time provenance: All timestamps emitted in UTC ISO-8601. When metrics/logs include monotonic durations they must derive from `TimeProvider`.
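The time-provenance rule can be illustrated with a minimal Python sketch. The service itself is .NET and relies on `TimeProvider`; the helper names below (`utc_iso8601`, `MonotonicTimer`) are hypothetical and only show the idea: wall-clock timestamps are rendered as UTC ISO-8601, durations come from a monotonic clock, and both clocks are injectable so tests stay deterministic.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def utc_iso8601(now: Callable[[], datetime] = lambda: datetime.now(timezone.utc)) -> str:
    """Return the current instant as a UTC ISO-8601 string ending in 'Z'."""
    instant = now().astimezone(timezone.utc)
    return instant.isoformat(timespec="milliseconds").replace("+00:00", "Z")


class MonotonicTimer:
    """Measures elapsed seconds using an injectable monotonic clock."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()  # reading taken once, at construction

    def elapsed_seconds(self) -> float:
        # Difference of two monotonic readings; unaffected by wall-clock changes.
        return self._clock() - self._start
```

Injecting the clock mirrors the reason for routing durations through a single time abstraction: a test can substitute a fake clock and get identical durations on every run.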
2. Metrics
| Metric | Type | Labels | Description / target |
|---|---|---|---|
| `ledger_write_duration_seconds` | Histogram | tenant, event_type, source | End-to-end append latency (API ingress → persisted). P95 ≤ 120 ms. |
| `ledger_events_total` | Counter | tenant, event_type, source (policy, workflow, orchestrator) | Incremented per committed event. Mirrors Merkle leaf count. |
| `ledger_ingest_backlog_events` | Gauge | tenant | Number of events buffered in the writer/anchor queues. Alert when >5 000 for 5 min. |
| `ledger_quota_remaining` | Gauge | tenant | Remaining ingest capacity before backpressure applies (defaults to 5 000 events). |
| `ledger_backpressure_applied_total` | Counter | tenant, reason, limit | Incremented whenever backlog crosses quota threshold. |
| `ledger_quota_rejections_total` | Counter | tenant, reason | Incremented when requests are actively rejected due to quotas. |
| `ledger_projection_lag_seconds` | Gauge | tenant | Wall-clock difference between latest ledger event and projection tail. Target <30 s. |
| `ledger_projection_rebuild_seconds` | Histogram | tenant | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| `ledger_projection_apply_seconds` | Histogram | tenant, event_type, policy_version, evaluation_status | Time to apply a single ledger event to projection. Target P95 <1 s. |
| `ledger_projection_events_total` | Counter | tenant, event_type, policy_version, evaluation_status | Count of events applied to projections. |
| `ledger_merkle_anchor_duration_seconds` | Histogram | tenant | Time to batch + anchor events. Target <60 s per 10k events. |
| `ledger_merkle_anchor_failures_total` | Counter | tenant, reason (db, signing, network) | Alerts at >0 within 15 min. |
| `ledger_attachments_encryption_failures_total` | Counter | tenant, stage (encrypt, sign, upload) | Ensures secure attachment pipeline stays healthy. |
| `ledger_db_connections_active` | Gauge | role (writer, projector) | Helps tune pool size. |
| `ledger_app_version_info` | Gauge | version, git_sha | Static metric for fleet observability. |
| `ledger_scoring_latency_seconds` | Histogram | tenant, policy_version, result | Latency of risk scoring operations per finding. P95 target <500 ms. |
| `ledger_scoring_operations_total` | Counter | tenant, policy_version, result | Total number of scoring operations by result (success, partial_success, error, etc.). |
| `ledger_scoring_provider_gaps_total` | Counter | tenant, provider, reason | Count of findings where scoring provider was unavailable or returned no data. |
| `ledger_severity_distribution_critical` | Gauge | tenant, policy_version | Current count of critical severity findings by tenant and policy. |
| `ledger_severity_distribution_high` | Gauge | tenant, policy_version | Current count of high severity findings by tenant and policy. |
| `ledger_severity_distribution_medium` | Gauge | tenant, policy_version | Current count of medium severity findings by tenant and policy. |
| `ledger_severity_distribution_low` | Gauge | tenant, policy_version | Current count of low severity findings by tenant and policy. |
| `ledger_severity_distribution_unknown` | Gauge | tenant, policy_version | Current count of unknown/unscored findings by tenant and policy. |
| `ledger_score_freshness_seconds` | Gauge | tenant | Time since last scoring operation completed by tenant. Alert when >3600 s. |
| `ledger_scored_findings_exports_total` | Counter | tenant, record_count | Count of scored findings export operations. |
| `ledger_scored_findings_export_duration_seconds` | Histogram | tenant, record_count | Duration of scored findings export operations. |
| `ledger_airgap_staleness_seconds` | Histogram | domain | Current staleness of air-gap imported data by domain. |
| `ledger_airgap_staleness_gauge_seconds` | Gauge | domain | Current staleness of air-gap data by domain (observable gauge). |
| `ledger_staleness_validation_failures_total` | Counter | domain | Count of staleness validation failures blocking exports. |
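
Several targets in the table are stated as percentiles over histograms. As a rough illustration of how a P95 could be estimated from cumulative bucket counts, the sketch below interpolates linearly inside the bucket that contains the target rank, similar in spirit to Prometheus's `histogram_quantile`. The function name and the bucket bounds and counts are invented for the example and do not reflect the ledger's actual bucket layout.

```python
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate quantile q from (upper_bound_seconds, cumulative_count) buckets.

    Buckets must be sorted by upper bound with non-decreasing cumulative counts.
    Interpolates linearly inside the bucket containing the target rank.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or buckets[-1][1] == 0:
        raise ValueError("histogram has no observations")

    target_rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= target_rank:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (target_rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


# Hypothetical write-latency buckets: (upper bound in seconds, cumulative count).
SAMPLE_BUCKETS = [(0.05, 600), (0.1, 900), (0.25, 990), (0.5, 1000)]
```

With the invented sample above, `estimate_quantile(0.95, SAMPLE_BUCKETS)` lands at roughly 0.183 s, which would miss a 120 ms P95 target. The estimate is only as precise as the bucket boundaries, so boundaries near each target value give more useful results.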
Derived dashboards
- Writer health: `ledger_write_latency_seconds` (P50/P95/P99), backlog gauge, event throughput.
- Projection health: `ledger_projection_lag_seconds`, `ledger_projection_apply_seconds`, projection throughput, conflict counts (from logs).
- Anchoring: Anchor duration histogram, failure counter, root hash timeline.
- Risk scoring: `ledger_scoring_latency_seconds` (P50/P95/P99), severity distribution gauges, provider gap counter, score freshness.
- Export operations: `ledger_scored_findings_exports_total`, export duration histogram, record counts.
- Air-gap health: `ledger_airgap_staleness_gauge_seconds`, staleness validation failures, domain freshness trends.
3. Logs & traces
- Log structure: Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot`.
- Log levels: `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors.
- Correlation: Each API request includes `requestId` + `traceId` logged with events. Projector logs capture `replayId` and `rebuildReason`.
- Timeline events: `ledger.event.appended` and `ledger.projection.updated` are emitted as structured logs carrying `tenant`, `chainId`, `sequence`, `eventId`, `policyVersion`, `traceId`, and placeholder `evidence_ref` fields for downstream timeline consumers.
- Secrets: Ensure `event_body` is never logged; log only metadata/hashes.
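
One way to honor the secrets rule is to build log records from an allow-list rather than stripping known-sensitive keys. The Python sketch below is illustrative only (the real service logs through Serilog). The function name `build_log_record` is hypothetical, and the field list is copied from the bullets above.

```python
import json
from typing import Any, Dict

# Field names taken from the log-structure and correlation bullets above.
ALLOWED_FIELDS = (
    "tenant", "chainId", "sequence", "eventId", "eventType",
    "actorId", "policyVersion", "hash", "merkleRoot",
    "requestId", "traceId",
)


def build_log_record(event: Dict[str, Any]) -> str:
    """Serialize only allow-listed metadata; the event body is never copied."""
    record = {name: event[name] for name in ALLOWED_FIELDS if name in event}
    # sort_keys keeps the output byte-stable for identical inputs.
    return json.dumps(record, sort_keys=True)
```

An allow-list fails closed: a new payload field is omitted from logs until someone deliberately adds it, whereas a deny-list would leak it by default.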
4. Alerts
| Alert | Condition | Response |
|---|---|---|
| LedgerWriteSLA | ledger_write_latency_seconds P95 > 1 s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| LedgerBacklogGrowing | ledger_ingest_backlog_events > 5 000 for 5 min | Inspect upstream policy runs, ensure projector keeping up. |
| LedgerBackpressure | ledger_backpressure_applied_total increases while ledger_quota_remaining < 0 | Throttle callers, raise quota or scale anchor worker. |
| ProjectionLag | ledger_projection_lag_seconds > 30 s | Trigger rebuild, verify change streams. |
| AnchorFailure | ledger_merkle_anchor_failures_total increase > 0 | Collect logs, rerun anchor, verify signing service. |
| AttachmentSecurityError | ledger_attachments_encryption_failures_total increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |
| ScoringFreshnessStale | ledger_score_freshness_seconds > 3600 s for any tenant | Check scoring pipeline, verify provider connectivity, re-trigger scoring job. |
| ScoringProviderGaps | ledger_scoring_provider_gaps_total increase > 10 in 5 min | Investigate provider failures; check rate limits or connectivity. |
| AirgapDataStale | ledger_airgap_staleness_gauge_seconds > threshold for 15 min | Re-import air-gap bundle; verify export pipeline in source enclave. |
Alerts integrate with Notifier channel ledger.alerts. For air-gapped deployments emit to local syslog + CLI incident scripts.
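
For air-gapped sites that evaluate conditions in local scripts rather than a central alert manager, a "threshold held for N minutes" rule such as LedgerBacklogGrowing could be approximated as sketched below. This is an assumption-laden Python illustration: `sustained_above` is a hypothetical helper, and its semantics (every sample in the trailing window must exceed the threshold, and the window must be fully covered by history) are a simplification, not the behavior of any specific alerting engine.

```python
from typing import Sequence, Tuple


def sustained_above(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    window_seconds: float,
) -> bool:
    """Return True if every sample in the trailing window exceeds the threshold.

    samples: (unix_timestamp_seconds, value) pairs sorted by time.
    The window must be fully covered: the oldest sample has to be at least
    window_seconds older than the newest one.
    """
    if not samples:
        return False
    newest = samples[-1][0]
    cutoff = newest - window_seconds
    if samples[0][0] > cutoff:
        return False  # not enough history to cover the window
    recent = [value for ts, value in samples if ts >= cutoff]
    return all(value > threshold for value in recent)
```

For example, backlog samples taken every 60 s that all read 6 000 over a 300 s span would satisfy `sustained_above(samples, 5000, 300)`, while a single dip to 5 000 or below inside the window would not.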
5. Testing & determinism harness
- Replay harness: CLI `dotnet run --project tools/LedgerReplayHarness` executes deterministic replays at 5 M findings/tenant. Metrics emitted: `ledger_projection_rebuild_seconds` with `scenario` label.
- Property tests: Seeded tests ensure `ledger_events_total` and Merkle leaf counts match after replay.
- CI gating: `LEDGER-29-008` requires harness output uploaded as signed JSON (`harness-report.json` + DSSE) and referenced in sprint notes.
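
The seeded property described above can be pictured with a toy Python model. Everything here is hypothetical: the event generator, the `replay` function, and the Merkle construction (SHA-256 with last-node duplication on odd levels) are generic stand-ins, not the ledger's actual hashing scheme or harness. The sketch only shows the shape of the check: the same seed yields the same root, and the event counter equals the leaf count.

```python
import hashlib
import random
from typing import List


def generate_events(seed: int, count: int) -> List[bytes]:
    """Produce a reproducible stream of fake event payloads from a seed."""
    rng = random.Random(seed)
    return [f"event-{i}-{rng.getrandbits(64):016x}".encode() for i in range(count)]


def merkle_root(leaves: List[bytes]) -> str:
    """Compute a SHA-256 Merkle root, duplicating the last node on odd levels."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def replay(seed: int, count: int) -> dict:
    """Replay a seeded stream and report the counter, leaf count, and root."""
    events_total = 0
    leaves: List[bytes] = []
    for payload in generate_events(seed, count):
        events_total += 1       # stands in for ledger_events_total
        leaves.append(payload)  # stands in for one Merkle leaf per event
    return {
        "events_total": events_total,
        "leaf_count": len(leaves),
        "root": merkle_root(leaves),
    }
```

A property test would run `replay` twice with the same seed and assert that both reports are identical and that `events_total` equals `leaf_count`. Varying the seed across runs broadens coverage without giving up reproducibility.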
6. Offline & air-gap guidance
- Collect metrics/log snapshots via `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz`. Include `ledger_write_latency_seconds` summary, anchor root history, and projection lag samples.
- Include default Grafana JSON under `offline/telemetry/dashboards/ledger/*.json`. Dashboards use the metrics above; filter by `tenant`.
- Ensure sealed-mode doc (`docs/modules/findings-ledger/schema.md` §3.3) references `ledger_attachments_encryption_failures_total` so Ops can confirm encryption pipeline health without remote telemetry.
7. Runbook pointers
- Anchoring issues: Refer to `docs/modules/findings-ledger/schema.md` §3 for root structure, `ops/devops/telemetry/package_offline_bundle.py` for diagnostics.
- Projection rebuilds: `docs/modules/findings-ledger/workflow-inference.md` for chain rules; `scripts/ledger/replay.sh` (LEDGER-29-008 deliverable) for deterministic replays.
Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.