Findings Ledger Observability Profile (Sprint 120)

Audience: Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
Scope: Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.

1. Telemetry stack & conventions

  • Export path: .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via observability.enabled=true in appsettings.
  • Namespace prefix: ledger.* for metrics, Ledger.* for logs/traces. Labels are drawn from tenant, chain, policy, status, reason, anchor.
  • Time provenance: All timestamps are emitted as UTC ISO-8601. Any monotonic durations included in metrics or logs must derive from TimeProvider.
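
The time-provenance rule separates two kinds of clock: wall-clock timestamps (UTC ISO-8601) and monotonic durations. The sketch below illustrates that split in Python. It is only an analogue: the service itself is .NET and uses TimeProvider, and the helper names utc_timestamp and measure are hypothetical.

```python
import time
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current wall-clock time as a UTC ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


def measure(fn):
    """Run fn and return (result, elapsed_seconds).

    The duration comes from a monotonic clock, so it is unaffected by
    wall-clock adjustments such as NTP corrections.
    """
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start


if __name__ == "__main__":
    value, elapsed = measure(lambda: sum(range(1000)))
    print(utc_timestamp(), value, f"{elapsed:.6f}s")
```

Keeping the two clocks apart means a duration can never go negative when the system clock is stepped backwards.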

2. Metrics

| Metric | Type | Labels | Description / target |
| --- | --- | --- | --- |
| ledger_write_latency_seconds | Histogram | tenant, event_type | End-to-end append latency (API ingress → persisted). P95 ≤120ms. |
| ledger_events_total | Counter | tenant, event_type, source (policy, workflow, orchestrator) | Incremented per committed event. Mirrors Merkle leaf count. |
| ledger_ingest_backlog_events | Gauge | tenant | Number of events buffered in the writer queue. Alert when >5000 for 5min. |
| ledger_projection_lag_seconds | Gauge | tenant | Wall-clock difference between latest ledger event and projection tail. Target <30s. |
| ledger_projection_rebuild_seconds | Histogram | tenant | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| ledger_projection_apply_seconds | Histogram | tenant, event_type, policy_version, evaluation_status | Time to apply a single ledger event to projection. Target P95 <1s. |
| ledger_projection_events_total | Counter | tenant, event_type, policy_version, evaluation_status | Count of events applied to projections. |
| ledger_merkle_anchor_duration_seconds | Histogram | tenant | Time to batch + anchor events. Target <60s per 10k events. |
| ledger_merkle_anchor_failures_total | Counter | tenant, reason (db, signing, network) | Alerts at >0 within 15min. |
| ledger_attachments_encryption_failures_total | Counter | tenant, stage (encrypt, sign, upload) | Ensures secure attachment pipeline stays healthy. |
| ledger_db_connections_active | Gauge | role (writer, projector) | Helps tune pool size. |
| ledger_app_version_info | Gauge | version, git_sha | Static metric for fleet observability. |
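
Several targets above are stated as P95 values over histograms. As a rough illustration of how such a percentile can be estimated from cumulative bucket counts, here is a minimal sketch using linear interpolation within the matching bucket, in the spirit of Prometheus' histogram_quantile. The bucket boundaries and counts are made-up sample data, not the ledger's actual configuration.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket whose
    count is the total number of observations.
    """
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Cannot interpolate into the +Inf bucket; report its lower bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return upper
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical ledger_write_latency_seconds buckets (seconds, cumulative count).
    sample = [(0.05, 600), (0.1, 900), (0.25, 990), (0.5, 1000), (math.inf, 1000)]
    p95 = histogram_quantile(0.95, sample)
    print(f"P95 = {p95:.3f}s, within 120ms target: {p95 <= 0.120}")
    # prints "P95 = 0.183s, within 120ms target: False"
```

The estimate is only as precise as the bucket layout, so bucket boundaries near the 120ms and 1s thresholds make the targets easier to evaluate.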

Derived dashboards

  • Writer health: ledger_write_latency_seconds (P50/P95/P99), backlog gauge, event throughput.
  • Projection health: ledger_projection_lag_seconds, ledger_projection_apply_seconds, projection throughput, conflict counts (from logs).
  • Anchoring: Anchor duration histogram, failure counter, root hash timeline.

3. Logs & traces

  • Log structure: Serilog JSON with fields tenant, chainId, sequence, eventId, eventType, actorId, policyVersion, hash, merkleRoot.
  • Log levels: Information for success summaries (sampled), Warning for retried operations, Error for failed writes/anchors.
  • Correlation: Each API request includes requestId + traceId logged with events. Projector logs capture replayId and rebuildReason.
  • Timeline events: ledger.event.appended and ledger.projection.updated are emitted as structured logs carrying tenant, chainId, sequence, eventId, policyVersion, traceId, and placeholder evidence_ref fields for downstream timeline consumers.
  • Secrets: Ensure event_body is never logged; log only metadata/hashes.
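
One way to satisfy the "never log event_body" rule is an allow-list: copy only the documented metadata fields into the log record and drop everything else. The sketch below shows that idea in Python. The event dictionary shape and the to_log_record helper are assumptions for illustration; the real service emits these records through Serilog.

```python
import json

# Fields named in the log-structure bullet above; anything else is dropped.
LOG_FIELDS = (
    "tenant", "chainId", "sequence", "eventId", "eventType",
    "actorId", "policyVersion", "hash", "merkleRoot",
)


def to_log_record(event: dict) -> str:
    """Serialize only allow-listed metadata; event_body can never leak through."""
    record = {key: event[key] for key in LOG_FIELDS if key in event}
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical event; the values are placeholders.
    event = {
        "tenant": "tenant-a",
        "chainId": "chain-1",
        "sequence": 42,
        "eventId": "evt-001",
        "eventType": "finding.created",
        "hash": "abc123",
        "event_body": {"secret": "must not be logged"},
    }
    print(to_log_record(event))
```

An allow-list fails safe: a new sensitive field added to the event later stays out of the logs until someone explicitly adds it.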

4. Alerts

| Alert | Condition | Response |
| --- | --- | --- |
| LedgerWriteSLA | ledger_write_latency_seconds P95 > 1s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| LedgerBacklogGrowing | ledger_ingest_backlog_events > 5000 for 5min | Inspect upstream policy runs, ensure the projector is keeping up. |
| ProjectionLag | ledger_projection_lag_seconds > 30s | Trigger rebuild, verify change streams. |
| AnchorFailure | ledger_merkle_anchor_failures_total increase > 0 | Collect logs, rerun anchor, verify signing service. |
| AttachmentSecurityError | ledger_attachments_encryption_failures_total increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |

Alerts integrate with Notifier channel ledger.alerts. For air-gapped deployments, emit to local syslog and CLI incident scripts.
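
Conditions such as "P95 > 1s for 3 intervals" or "> 5000 for 5min" only fire when the breach is sustained. A minimal sketch of that evaluation logic follows. The SustainedThreshold class is hypothetical, and mapping "5min" to five samples assumes a one-minute evaluation interval, which this document does not specify.

```python
from dataclasses import dataclass


@dataclass
class SustainedThreshold:
    """Fires once a value has exceeded the threshold for N consecutive samples."""

    threshold: float
    required: int  # consecutive breaching samples needed to fire
    _streak: int = 0

    def observe(self, value: float) -> bool:
        # Any sample at or below the threshold resets the streak.
        self._streak = self._streak + 1 if value > self.threshold else 0
        return self._streak >= self.required


if __name__ == "__main__":
    # LedgerWriteSLA: P95 above 1s for 3 intervals (sample P95 values are made up).
    write_sla = SustainedThreshold(threshold=1.0, required=3)
    samples = [0.8, 1.2, 1.3, 0.9, 1.1, 1.4, 1.6]
    print([write_sla.observe(p95) for p95 in samples])
    # prints [False, False, False, False, False, False, True]
```

Resetting the streak on every healthy sample keeps a single latency spike from paging anyone, at the cost of a delay of `required` intervals before a real incident fires.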

5. Testing & determinism harness

  • Replay harness: dotnet run --project tools/LedgerReplayHarness executes deterministic replays at 5M findings/tenant and emits ledger_projection_rebuild_seconds with a scenario label.
  • Property tests: Seeded tests ensure ledger_events_total and Merkle leaf counts match after replay.
  • CI gating: LEDGER-29-008 requires harness output uploaded as signed JSON (harness-report.json + DSSE) and referenced in sprint notes.
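
The property described above (a seeded replay yields matching event and Merkle leaf counts, with a reproducible root) can be sketched as follows. This is an illustration only: the event generator, the replay function, and the Merkle construction (SHA-256 with leaf/node prefixes and last-node duplication) are assumptions, not the ledger's actual scheme or the real harness.

```python
import hashlib
import random
from typing import List, Tuple


def generate_events(seed: int, count: int) -> List[bytes]:
    """Produce a deterministic event stream from a seed (stand-in for real findings)."""
    rng = random.Random(seed)
    return [f"event-{i}-{rng.getrandbits(64):016x}".encode() for i in range(count)]


def merkle_root(leaves: List[bytes]) -> bytes:
    """Illustrative Merkle root: prefixed SHA-256, duplicating the last node on odd levels."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def replay(seed: int, count: int) -> Tuple[int, int, str]:
    """Replay a seeded stream; return (events_total, leaf_count, root_hex)."""
    events_total = 0
    leaves: List[bytes] = []
    for event in generate_events(seed, count):
        events_total += 1      # mirrors ledger_events_total
        leaves.append(event)   # mirrors one Merkle leaf per committed event
    return events_total, len(leaves), merkle_root(leaves).hex()


if __name__ == "__main__":
    first = replay(seed=42, count=1000)
    second = replay(seed=42, count=1000)
    assert first == second          # same seed, same counts and root
    assert first[0] == first[1]     # counter matches leaf count
    print("deterministic replay ok:", first[2][:16])
```

Comparing the root hash as well as the counts catches reordering bugs that a count-only check would miss.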

6. Offline & air-gap guidance

  • Collect metrics/log snapshots via stella ledger observability snapshot --out offline/ledger/metrics.tar.gz. Include ledger_write_latency_seconds summary, anchor root history, and projection lag samples.
  • Include default Grafana JSON under offline/telemetry/dashboards/ledger/*.json. Dashboards use the metrics above; filter by tenant.
  • Ensure the sealed-mode doc (docs/modules/findings-ledger/schema.md §3.3) references ledger_attachments_encryption_failures_total so Ops can confirm encryption pipeline health without remote telemetry.

7. Runbook pointers

  • Anchoring issues: Refer to docs/modules/findings-ledger/schema.md §3 for root structure and to ops/devops/telemetry/package_offline_bundle.py for diagnostics.
  • Projection rebuilds: Refer to docs/modules/findings-ledger/workflow-inference.md for chain rules and to scripts/ledger/replay.sh (LEDGER-29-008 deliverable) for deterministic replays.

Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.