Findings Ledger Observability Profile (Sprint 120)

Audience: Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
Scope: Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.

1. Telemetry stack & conventions

  • Export path: .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via observability.enabled=true in appsettings.
  • Namespace prefix: ledger.* for metrics, Ledger.* for logs/traces. Labels follow tenant, chain, policy, status, reason, anchor.
  • Time provenance: All timestamps are emitted in UTC, formatted as ISO-8601. Any monotonic duration reported in metrics or logs must be derived from TimeProvider.
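The time-provenance rule can be illustrated with a small injectable clock. This is a minimal Python sketch of the idea only, not the ledger's .NET implementation: `SystemClock`, `FakeClock`, `utc_timestamp`, and `timed` are hypothetical names introduced here, standing in for the role that TimeProvider plays in the service.

```python
from datetime import datetime, timedelta, timezone
import time


class SystemClock:
    """Production clock: real wall time plus a monotonic counter."""

    def utc_now(self) -> datetime:
        return datetime.now(timezone.utc)

    def monotonic(self) -> float:
        return time.monotonic()


class FakeClock:
    """Test clock (hypothetical) that only moves when advance() is called."""

    def __init__(self, start: datetime) -> None:
        self._now = start
        self._ticks = 0.0

    def utc_now(self) -> datetime:
        return self._now

    def monotonic(self) -> float:
        return self._ticks

    def advance(self, seconds: float) -> None:
        self._now += timedelta(seconds=seconds)
        self._ticks += seconds


def utc_timestamp(clock) -> str:
    """Render the clock's wall time as UTC ISO-8601 with a trailing Z."""
    stamp = clock.utc_now().astimezone(timezone.utc)
    return stamp.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def timed(clock, operation):
    """Run operation; return (result, elapsed seconds) from the monotonic source."""
    started = clock.monotonic()
    result = operation()
    return result, clock.monotonic() - started
```

Routing both wall-clock reads and duration measurements through one injected clock is what lets a replay produce identical timestamps and durations on every run.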

2. Metrics

| Metric | Type | Labels | Description / target |
|---|---|---|---|
| ledger_write_duration_seconds | Histogram | tenant, event_type, source | End-to-end append latency (API ingress → persisted). P95 ≤120ms. |
| ledger_events_total | Counter | tenant, event_type, source (policy, workflow, orchestrator) | Incremented per committed event. Mirrors Merkle leaf count. |
| ledger_ingest_backlog_events | Gauge | tenant | Number of events buffered in the writer/anchor queues. Alert when >5000 for 5min. |
| ledger_quota_remaining | Gauge | tenant | Remaining ingest capacity before backpressure applies (defaults to 5000 events). |
| ledger_backpressure_applied_total | Counter | tenant, reason, limit | Incremented whenever backlog crosses quota threshold. |
| ledger_quota_rejections_total | Counter | tenant, reason | Incremented when requests are actively rejected due to quotas. |
| ledger_projection_lag_seconds | Gauge | tenant | Wall-clock difference between latest ledger event and projection tail. Target <30s. |
| ledger_projection_rebuild_seconds | Histogram | tenant | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| ledger_projection_apply_seconds | Histogram | tenant, event_type, policy_version, evaluation_status | Time to apply a single ledger event to projection. Target P95 <1s. |
| ledger_projection_events_total | Counter | tenant, event_type, policy_version, evaluation_status | Count of events applied to projections. |
| ledger_merkle_anchor_duration_seconds | Histogram | tenant | Time to batch + anchor events. Target <60s per 10k events. |
| ledger_merkle_anchor_failures_total | Counter | tenant, reason (db, signing, network) | Alerts at >0 within 15min. |
| ledger_attachments_encryption_failures_total | Counter | tenant, stage (encrypt, sign, upload) | Ensures secure attachment pipeline stays healthy. |
| ledger_db_connections_active | Gauge | role (writer, projector) | Helps tune pool size. |
| ledger_app_version_info | Gauge | version, git_sha | Static metric for fleet observability. |
| ledger_scoring_latency_seconds | Histogram | tenant, policy_version, result | Latency of risk scoring operations per finding. P95 target <500 ms. |
| ledger_scoring_operations_total | Counter | tenant, policy_version, result | Total number of scoring operations by result (success, partial_success, error, etc.). |
| ledger_scoring_provider_gaps_total | Counter | tenant, provider, reason | Count of findings where scoring provider was unavailable or returned no data. |
| ledger_severity_distribution_critical | Gauge | tenant, policy_version | Current count of critical severity findings by tenant and policy. |
| ledger_severity_distribution_high | Gauge | tenant, policy_version | Current count of high severity findings by tenant and policy. |
| ledger_severity_distribution_medium | Gauge | tenant, policy_version | Current count of medium severity findings by tenant and policy. |
| ledger_severity_distribution_low | Gauge | tenant, policy_version | Current count of low severity findings by tenant and policy. |
| ledger_severity_distribution_unknown | Gauge | tenant, policy_version | Current count of unknown/unscored findings by tenant and policy. |
| ledger_score_freshness_seconds | Gauge | tenant | Time since last scoring operation completed by tenant. Alert when >3600 s. |
| ledger_scored_findings_exports_total | Counter | tenant, record_count | Count of scored findings export operations. |
| ledger_scored_findings_export_duration_seconds | Histogram | tenant, record_count | Duration of scored findings export operations. |
| ledger_airgap_staleness_seconds | Histogram | domain | Current staleness of air-gap imported data by domain. |
| ledger_airgap_staleness_gauge_seconds | Gauge | domain | Current staleness of air-gap data by domain (observable gauge). |
| ledger_staleness_validation_failures_total | Counter | domain | Count of staleness validation failures blocking exports. |
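Several targets above are stated as P95 values over histograms. The sketch below shows one common way to estimate a quantile from cumulative bucket counts, using linear interpolation within the bucket, similar in spirit to Prometheus's `histogram_quantile`. The bucket boundaries and counts are made-up illustration data, and `histogram_quantile` here is a local helper, not a ledger API.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or empty bucket): fall back to the lower edge.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical write-duration buckets, in seconds.
write_buckets = [(0.05, 600), (0.1, 900), (0.25, 990), (0.5, 1000), (math.inf, 1000)]
p95 = histogram_quantile(0.95, write_buckets)
meets_target = p95 <= 0.120  # the P95 ≤120ms target for ledger_write_duration_seconds
```

Because the estimate interpolates inside a bucket, its accuracy depends on bucket placement; a bucket boundary at or near each target (for example 0.120 s) keeps the comparison meaningful.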

Derived dashboards

  • Writer health: ledger_write_duration_seconds (P50/P95/P99), backlog gauge, event throughput.
  • Projection health: ledger_projection_lag_seconds, ledger_projection_apply_seconds, projection throughput, conflict counts (from logs).
  • Anchoring: Anchor duration histogram, failure counter, root hash timeline.
  • Risk scoring: ledger_scoring_latency_seconds (P50/P95/P99), severity distribution gauges, provider gap counter, score freshness.
  • Export operations: ledger_scored_findings_exports_total, export duration histogram, record counts.
  • Air-gap health: ledger_airgap_staleness_gauge_seconds, staleness validation failures, domain freshness trends.
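Throughput panels such as event throughput and projection throughput are derived from counters, so they have to tolerate counter resets when a process restarts. The following is a simplified sketch of that derivation, assuming evenly scraped `(timestamp, value)` samples; it omits the window-edge extrapolation that Prometheus's `rate()` performs, and the sample values are invented.

```python
def counter_increase(samples):
    """Total increase across (timestamp_seconds, value) samples, reset-aware."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter restarted from zero: count everything seen since the reset.
            increase += cur
    return increase


def throughput_per_second(samples):
    """Average events per second over the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed


# Hypothetical ledger_events_total scrapes; the drop at t=30 is a process restart.
scrapes = [(0, 100), (15, 400), (30, 50), (45, 350)]
events_per_second = throughput_per_second(scrapes)
```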

3. Logs & traces

  • Log structure: Serilog JSON with fields tenant, chainId, sequence, eventId, eventType, actorId, policyVersion, hash, merkleRoot.
  • Log levels: Information for success summaries (sampled), Warning for retried operations, Error for failed writes/anchors.
  • Correlation: Each API request includes requestId + traceId logged with events. Projector logs capture replayId and rebuildReason.
  • Timeline events: ledger.event.appended and ledger.projection.updated are emitted as structured logs carrying tenant, chainId, sequence, eventId, policyVersion, traceId, and placeholder evidence_ref fields for downstream timeline consumers.
  • Secrets: Ensure event_body is never logged; log only metadata/hashes.
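One way to honor the secrets rule is to build log records from an allow-list of metadata fields rather than stripping known-sensitive keys afterwards. The sketch below illustrates that in Python; the field list mirrors the log structure described above, while `build_log_record` and the sample event are hypothetical and not part of the ledger codebase.

```python
import json

# Metadata fields from the documented log structure; anything else is dropped.
ALLOWED_FIELDS = (
    "tenant", "chainId", "sequence", "eventId", "eventType",
    "actorId", "policyVersion", "hash", "merkleRoot",
    "requestId", "traceId",
)


def build_log_record(level, message, event):
    """Serialize a log line that carries only allow-listed metadata."""
    record = {"level": level, "message": message}
    for field in ALLOWED_FIELDS:
        if field in event:
            record[field] = event[field]
    return json.dumps(record, sort_keys=True)


# Hypothetical event: event_body must never reach the log output.
sample_event = {
    "tenant": "tenant-a",
    "chainId": "chain-1",
    "sequence": 42,
    "eventType": "finding.created",
    "hash": "abc123",
    "event_body": {"secret": "do-not-log"},
}
line = build_log_record("Information", "ledger.event.appended", sample_event)
```

An allow-list fails safe: a field added to the event payload later stays out of the logs until someone explicitly adds it to `ALLOWED_FIELDS`.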

4. Alerts

| Alert | Condition | Response |
|---|---|---|
| LedgerWriteSLA | ledger_write_duration_seconds P95 > 1s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| LedgerBacklogGrowing | ledger_ingest_backlog_events > 5000 for 5min | Inspect upstream policy runs, ensure the projector is keeping up. |
| LedgerBackpressure | ledger_backpressure_applied_total increases while ledger_quota_remaining < 0 | Throttle callers, raise quota or scale anchor worker. |
| ProjectionLag | ledger_projection_lag_seconds > 30s | Trigger rebuild, verify change streams. |
| AnchorFailure | ledger_merkle_anchor_failures_total increase > 0 | Collect logs, rerun anchor, verify signing service. |
| AttachmentSecurityError | ledger_attachments_encryption_failures_total increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |
| ScoringFreshnessStale | ledger_score_freshness_seconds > 3600 s for any tenant | Check scoring pipeline, verify provider connectivity, re-trigger scoring job. |
| ScoringProviderGaps | ledger_scoring_provider_gaps_total increase > 10 in 5 min | Investigate provider failures; check rate limits or connectivity. |
| AirgapDataStale | ledger_airgap_staleness_gauge_seconds > threshold for 15 min | Re-import air-gap bundle; verify export pipeline in source enclave. |

Alerts integrate with Notifier channel ledger.alerts. For air-gapped deployments, emit alerts to local syslog and CLI incident scripts.
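Conditions of the form "above threshold for N minutes" fire only when the breach is continuous; a single dip below the threshold resets the timer. The sketch below shows that evaluation for a rule like LedgerBacklogGrowing. It is illustrative only: `sustained_breach` is a hypothetical helper, the samples are invented, and a real deployment would express this in the alerting system's own rule language.

```python
def sustained_breach(samples, threshold, hold_seconds):
    """True if value > threshold continuously for hold_seconds up to the last sample.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of the current breach
        else:
            breach_start = None  # any recovery resets the hold timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# Hypothetical ledger_ingest_backlog_events samples at 60-second intervals.
steady = [(0, 4000), (60, 5200), (120, 5300), (180, 5400),
          (240, 5600), (300, 5700), (360, 5800)]
flapping = [(0, 5200), (60, 5300), (120, 4900), (180, 5100), (360, 5200)]

fires_on_steady = sustained_breach(steady, threshold=5000, hold_seconds=300)
fires_on_flapping = sustained_breach(flapping, threshold=5000, hold_seconds=300)
```

Here `steady` has been above 5000 for a full 300 seconds, while `flapping` recovered at t=120 and has only 180 seconds of continuous breach, so only the first case would alert.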

5. Testing & determinism harness

  • Replay harness: CLI dotnet run --project tools/LedgerReplayHarness executes deterministic replays at 5M findings/tenant. Metrics emitted: ledger_projection_rebuild_seconds with scenario label.
  • Property tests: Seeded tests ensure ledger_events_total and Merkle leaf counts match after replay.
  • CI gating: LEDGER-29-008 requires harness output uploaded as signed JSON (harness-report.json + DSSE) and referenced in sprint notes.
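The property checked by the seeded tests can be sketched as follows: replaying the same seeded event stream twice must give the same event count, the same number of Merkle leaves, and the same root. The hashing and tree construction below are generic SHA-256 placeholders chosen for illustration, not the ledger's actual Merkle scheme, and `generate_events` and `replay` are hypothetical helpers.

```python
import hashlib
import json
import random


def leaf_hash(event):
    """Hash a canonical JSON form of the event (placeholder leaf encoding)."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merkle_root(leaves):
    """Pairwise SHA-256 tree; an odd node is paired with itself."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256((left + right).encode("ascii")).hexdigest()
            for left, right in zip(level[::2], level[1::2])
        ]
    return level[0]


def generate_events(seed, count):
    """Deterministic synthetic events: the same seed yields the same stream."""
    rng = random.Random(seed)
    kinds = ["finding.created", "finding.status_changed", "finding.comment_added"]
    return [
        {"sequence": i, "tenant": f"tenant-{rng.randint(1, 3)}", "eventType": rng.choice(kinds)}
        for i in range(count)
    ]


def replay(events):
    """Append every event; return (events_total, leaf_count, root)."""
    events_total = 0
    leaves = []
    for event in events:
        leaves.append(leaf_hash(event))
        events_total += 1  # stands in for incrementing ledger_events_total
    return events_total, len(leaves), merkle_root(leaves)


first = replay(generate_events(seed=2025, count=1000))
second = replay(generate_events(seed=2025, count=1000))
```

A passing run has `first == second` and the event count equal to the leaf count; a mismatch in either points at non-determinism in event generation or at events that were counted but never anchored.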

6. Offline & air-gap guidance

  • Collect metrics/log snapshots via stella ledger observability snapshot --out offline/ledger/metrics.tar.gz. Include a ledger_write_duration_seconds summary, anchor root history, and projection lag samples.
  • Include default Grafana JSON under offline/telemetry/dashboards/ledger/*.json. Dashboards use the metrics above; filter by tenant.
  • Ensure sealed-mode doc (docs/modules/findings-ledger/schema.md §3.3) references ledger_attachments_encryption_failures_total so Ops can confirm encryption pipeline health without remote telemetry.

7. Runbook pointers

  • Anchoring issues: Refer to docs/modules/findings-ledger/schema.md §3 for root structure, ops/devops/telemetry/package_offline_bundle.py for diagnostics.
  • Projection rebuilds: docs/modules/findings-ledger/workflow-inference.md for chain rules; scripts/ledger/replay.sh (LEDGER-29-008 deliverable) for deterministic replays.

Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.