git.stella-ops.org/docs/modules/vex-lens/runbooks/observability.md
StellaOps Bot efaf3cb789
2025-12-12 09:35:37 +02:00

VEX Lens observability runbook (stub · 2025-11-29 demo)

Dashboards (offline import)

  • Grafana JSON: docs/modules/vex-lens/runbooks/dashboards/vex-lens-observability.json (import locally; no external data sources assumed).
  • Planned panels: consensus latency, conflict backlog, recompute duration, issuer trust changes, export job success rate, and DSSE verification failures.

Key metrics

  • vex_consensus_latency_seconds_bucket — latency from observation intake to consensus write.
  • vex_conflict_queue_depth — size of unresolved conflict queue.
  • vex_recompute_duration_seconds_bucket{reason} — recompute times by trigger (issuer update, policy knob, ingestion delta).
  • vex_export_duration_seconds_bucket — export job runtime.
  • vex_dsse_verification_failures_total — failed attestations during export/ingest.
  • vex_consensus_conflicts_total{reason} — conflict counts by reason (status disagreement, scope mismatch, missing provenance).
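
The `_bucket` series listed above look like Prometheus-style cumulative histograms (an assumption based on the naming convention). As a rough sketch of how a percentile such as p99 is estimated from such buckets, the function below applies the same linear interpolation idea that Prometheus `histogram_quantile` uses. The bucket bounds and counts in the usage note are invented for illustration and are not taken from VEX Lens.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) sorted by
    upper bound and ending with the +Inf bucket.
    """
    if not buckets or not 0 <= q <= 1:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Quantile lies beyond the last finite bound (or the bucket
                # is empty): report the lower edge instead of interpolating.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical snapshot of vex_consensus_latency_seconds_bucket.
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]
```

With the sample data, `estimate_quantile(0.99, SAMPLE_BUCKETS)` works out to roughly 1.75 seconds. The estimate is only as precise as the bucket boundaries, so coarse buckets near the top of the range blur the p99 value.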

Logs & traces

  • Correlate by correlationId, artifactKey, advisoryKey, and issuer. Include trustTier, weightBefore, weightAfter, and justification fields for audits.
  • Traces are disabled by default for air-gapped deployments; enable them by setting Telemetry:ExportEnabled=true and pointing the OTLP endpoint at an on-prem collector.
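
The correlation fields above can be used to pull an audit trail out of the service logs. The sketch below assumes the logs are available as JSON lines with those field names as top-level keys; the log format is an assumption, and `audit_trail` is a hypothetical helper, not part of VEX Lens.

```python
import json

# Audit fields named in this runbook.
AUDIT_FIELDS = ("trustTier", "weightBefore", "weightAfter", "justification")

def audit_trail(lines, correlation_id):
    """Collect audit records for one correlationId from JSON-lines log text."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("correlationId") != correlation_id:
            continue
        record = {
            "artifactKey": entry.get("artifactKey"),
            "advisoryKey": entry.get("advisoryKey"),
            "issuer": entry.get("issuer"),
        }
        for field in AUDIT_FIELDS:
            record[field] = entry.get(field)
        # Flag entries that lack the fields an audit would need.
        record["missing"] = [f for f in AUDIT_FIELDS if f not in entry]
        records.append(record)
    return records
```

Passing an open log file (or any iterable of lines) plus a correlation ID returns one record per matching entry; a non-empty `missing` list points at log statements that should be extended before an audit.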

Health/diagnostics

  • /health/liveness and /health/readiness (service) must return HTTP 200; the readiness probe checks the projection store (PostgreSQL or in-memory), the cache, and event bus reachability.
  • /status exposes the build version, commit, and feature flags; verify that these match the offline bundle manifest.
  • Export self-check: run stella vex export --format json --manifest out/manifest.json and validate the hashes of the exported files against the manifest entries.
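
The endpoint checks above could be scripted along the following lines. This is a minimal sketch: the base URL is supplied by the operator, and it assumes that /status returns a JSON object with a `commit` key, which this runbook does not confirm. `check_service` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request

def http_get(url, timeout=5):
    """Return (status_code, body_text); HTTP error statuses are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")

def check_service(base_url, expected_commit, fetch=http_get):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for path in ("/health/liveness", "/health/readiness"):
        status, _ = fetch(base_url + path)
        if status != 200:
            problems.append(f"{path} returned {status}")
    status, body = fetch(base_url + "/status")
    if status != 200:
        problems.append(f"/status returned {status}")
        return problems
    # Assumed response shape: {"commit": "...", ...}
    commit = json.loads(body).get("commit")
    if commit != expected_commit:
        problems.append(f"/status commit {commit!r} != expected {expected_commit!r}")
    return problems
```

The `fetch` parameter exists so the check can be exercised without a running service. Connection failures (for example, a refused connection) still raise `urllib.error.URLError`, which is a reasonable signal that the service is not up at all.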

Alert hints

  • Consensus latency p99 > 1.5s over 5m.
  • Conflict queue depth > 500 for any tenant.
  • DSSE verification failures > 0 in a 10m window.
  • Export failure rate > 2% over 10m.
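
To make the thresholds above concrete, the sketch below evaluates them against a snapshot of already-aggregated values. The snapshot keys are hypothetical names chosen for this example; in practice the values would come from queries over the corresponding windows (5m for latency, 10m for DSSE failures and exports).

```python
def firing_alerts(snapshot):
    """Return the names of alert hints that the snapshot violates."""
    alerts = []
    # Consensus latency p99 > 1.5s over 5m.
    if snapshot["consensus_latency_p99_seconds"] > 1.5:
        alerts.append("consensus_latency_p99")
    # Conflict queue depth > 500 for any tenant.
    for tenant, depth in sorted(snapshot["conflict_queue_depth_by_tenant"].items()):
        if depth > 500:
            alerts.append(f"conflict_queue_depth:{tenant}")
    # DSSE verification failures > 0 in a 10m window.
    if snapshot["dsse_failures_10m"] > 0:
        alerts.append("dsse_verification_failures")
    # Export failure rate > 2% over 10m (skipped when no exports ran).
    total = snapshot["exports_total_10m"]
    if total > 0 and snapshot["exports_failed_10m"] / total > 0.02:
        alerts.append("export_failure_rate")
    return alerts
```

Guarding the export rate on `total > 0` avoids a division by zero during quiet periods; whether an idle exporter should itself raise an alert is a separate decision that this runbook does not cover.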

Offline verification steps

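  1. Import the Grafana JSON locally and point it at the Prometheus scrape labeled vex-lens.
  2. Run the export CLI command above, list the recorded hashes with jq -r '.files[].sha256' out/manifest.json, and compare them with the SHA-256 digests of the exported files.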
  3. Fetch /status and confirm commit/version match the exported manifest and offline kit bundle metadata.
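
The hash comparison in step 2 can be automated. The sketch below assumes that each entry under `files` in manifest.json carries a `sha256` field (as the jq expression implies) plus a `path` field relative to the manifest's directory; the `path` field is an assumption about the manifest layout and should be adjusted to the real schema.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """Return (path, reason) tuples for entries that fail verification."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    failures = []
    for entry in manifest["files"]:
        # Assumed layout: "path" is relative to the manifest's directory.
        target = manifest_path.parent / entry["path"]
        if not target.is_file():
            failures.append((entry["path"], "missing"))
        elif sha256_of(target) != entry["sha256"].lower():
            failures.append((entry["path"], "hash mismatch"))
    return failures
```

Calling `verify_manifest("out/manifest.json")` after the export returns an empty list when every listed file is present and matches its recorded digest.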

Evidence locations

  • Sprint tracker: docs/implplan/SPRINT_0332_0001_0001_docs_modules_vex_lens.md.
  • Module docs: README.md, architecture.md, implementation_plan.md.
  • Dashboard stub: runbooks/dashboards/vex-lens-observability.json.