git.stella-ops.org/docs/modules/vex-lens/runbooks/observability.md
StellaOps Bot efaf3cb789
2025-12-12 09:35:37 +02:00


# VEX Lens observability runbook (stub · 2025-11-29 demo)
## Dashboards (offline import)
- Grafana JSON: `docs/modules/vex-lens/runbooks/dashboards/vex-lens-observability.json` (import locally; no external data sources assumed).
- Planned panels: consensus latency, conflict backlog, recompute duration, issuer trust changes, export job success rate, and DSSE verification failures.
## Key metrics
- `vex_consensus_latency_seconds_bucket` — latency from observation intake to consensus write.
- `vex_conflict_queue_depth` — size of unresolved conflict queue.
- `vex_recompute_duration_seconds_bucket{reason}` — recompute times by trigger (issuer update, policy knob, ingestion delta).
- `vex_export_duration_seconds_bucket` — export job runtime.
- `vex_dsse_verification_failures_total` — failed attestations during export/ingest.
- `vex_consensus_conflicts_total{reason}` — conflict counts by reason (status disagreement, scope mismatch, missing provenance).
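The `_bucket` series above are cumulative Prometheus histograms, so percentiles such as p99 are estimated by interpolating between bucket bounds rather than read directly. The sketch below illustrates that estimate offline in Python; the function name and the sample bucket values are hypothetical, and it is a simplified approximation of what PromQL's `histogram_quantile` does, not a replacement for it.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified: assumes counts are cumulative and that the lowest bucket
    starts at 0, and interpolates linearly inside the matching bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if not buckets:
        raise ValueError("at least one bucket is required")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if bound == float("inf"):
                # No finite upper bound: fall back to the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical sample of vex_consensus_latency_seconds_bucket values (le, count).
sample = [(0.5, 90.0), (1.0, 98.0), (2.0, 100.0), (float("inf"), 100.0)]
p99 = estimate_quantile(0.99, sample)
print(f"estimated p99 latency: {p99:.2f}s")
```

Because the estimate depends on bucket boundaries, a coarse bucket layout around the alerting threshold makes the reported percentile correspondingly coarse.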
## Logs & traces
- Correlate by `correlationId`, `artifactKey`, `advisoryKey`, and `issuer`. Include `trustTier`, `weightBefore`, `weightAfter`, and `justification` fields for audits.
- Traces are disabled by default for air-gapped deployments; enable them by setting `Telemetry:ExportEnabled=true` and pointing the OTLP endpoint at an on-prem collector.
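As an illustration of the correlation described above, the following sketch pulls the audit fields for one `correlationId` out of a log stream. It assumes the service emits one JSON object per line, which this runbook does not confirm; the function name and the sample records are hypothetical.

```python
import json
from typing import Dict, Iterable, List

# Audit fields named in this runbook.
AUDIT_FIELDS = ("issuer", "trustTier", "weightBefore", "weightAfter", "justification")


def audit_trail(lines: Iterable[str], correlation_id: str) -> List[Dict[str, object]]:
    """Return the audit fields of every log record matching correlation_id.

    Assumes JSON-lines input; lines that are not JSON objects are skipped.
    """
    trail: List[Dict[str, object]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        if record.get("correlationId") != correlation_id:
            continue
        trail.append({field: record.get(field) for field in AUDIT_FIELDS})
    return trail


# Hypothetical log lines for demonstration only.
sample_logs = [
    '{"correlationId": "c-1", "issuer": "vendor-a", "trustTier": "high", '
    '"weightBefore": 0.6, "weightAfter": 0.8, "justification": "issuer update"}',
    '{"correlationId": "c-2", "issuer": "vendor-b"}',
    "plain text line that is not JSON",
]
for entry in audit_trail(sample_logs, "c-1"):
    print(entry)
```

The same filter can be keyed on `artifactKey` or `advisoryKey` by changing the compared field.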
## Health/diagnostics
- `/health/liveness` and `/health/readiness` (service) must return 200; the readiness probe checks that the projection store (PostgreSQL or in-memory), cache, and event bus are reachable.
- `/status` exposes the build version, commit, and feature flags; verify that they match the offline bundle manifest.
- Export self-check: run `stella vex export --format json --manifest out/manifest.json` and validate hashes against manifest entries.
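A minimal sketch of probing the two health endpoints above with only the Python standard library. The base URL and port are assumptions, as is the helper name `probe_health`; the `opener` parameter exists only so the function can be exercised without a running service.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

HEALTH_PATHS = ("/health/liveness", "/health/readiness")


def probe_health(
    base_url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 5.0,
) -> Dict[str, int]:
    """Return the HTTP status per health path (0 if the service is unreachable)."""
    results: Dict[str, int] = {}
    for path in HEALTH_PATHS:
        url = base_url.rstrip("/") + path
        try:
            with opener(url, timeout=timeout) as response:
                results[path] = response.status
        except urllib.error.HTTPError as exc:
            # Non-2xx responses (for example 503 from readiness) land here.
            results[path] = exc.code
        except urllib.error.URLError:
            results[path] = 0
    return results


def all_healthy(results: Dict[str, int]) -> bool:
    """True only when every probed endpoint returned 200."""
    return all(status == 200 for status in results.values())


if __name__ == "__main__":
    # Assumed local address; substitute the real service URL.
    statuses = probe_health("http://localhost:8080")
    print(statuses, "healthy" if all_healthy(statuses) else "unhealthy")
```

A readiness failure with a passing liveness probe points at one of the dependencies (projection store, cache, or event bus) rather than the service process itself.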
## Alert hints
- Consensus latency p99 > 1.5s over 5m.
- Conflict queue depth > 500 for any tenant.
- DSSE verification failures > 0 in a 10m window.
- Export failure rate > 2% over 10m.
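The thresholds above can be expressed as a small evaluation function, which is useful for dry-running alert logic against recorded values before writing real alert rules. This is a sketch only: the `Snapshot` type, its field names, and the alert labels are hypothetical, and it assumes the inputs were already aggregated over the stated windows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    """Pre-aggregated values; windows follow the alert hints in this runbook."""
    consensus_latency_p99_seconds: float       # p99 over 5m
    conflict_queue_depth_by_tenant: Dict[str, int]
    dsse_failures_last_10m: int
    export_failure_rate_10m: float             # fraction between 0 and 1


def firing_alerts(snapshot: Snapshot) -> List[str]:
    """Return labels for every alert hint whose threshold is exceeded."""
    alerts: List[str] = []
    if snapshot.consensus_latency_p99_seconds > 1.5:
        alerts.append("consensus-latency-p99")
    for tenant, depth in sorted(snapshot.conflict_queue_depth_by_tenant.items()):
        if depth > 500:
            alerts.append(f"conflict-queue-depth:{tenant}")
    if snapshot.dsse_failures_last_10m > 0:
        alerts.append("dsse-verification-failures")
    if snapshot.export_failure_rate_10m > 0.02:
        alerts.append("export-failure-rate")
    return alerts


# Hypothetical values for demonstration only.
example = Snapshot(
    consensus_latency_p99_seconds=1.8,
    conflict_queue_depth_by_tenant={"tenant-a": 120, "tenant-b": 650},
    dsse_failures_last_10m=0,
    export_failure_rate_10m=0.01,
)
print(firing_alerts(example))
```

All comparisons are strict, matching the `>` in the hints, so a value sitting exactly on a threshold does not fire.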
## Offline verification steps
1) Import Grafana JSON locally; point to Prometheus scrape labeled `vex-lens`.
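2) Run the export CLI command from the Health/diagnostics section, list the expected hashes in `manifest.json` via `jq -r '.files[].sha256'`, and compare them with the SHA-256 digests of the exported files.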
3) Fetch `/status` and confirm commit/version match the exported manifest and offline kit bundle metadata.
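The hash comparison in step 2 can be automated. The sketch below assumes the manifest has a top-level `files` array whose entries carry a `sha256` value (as the `jq` filter implies) and a `path` key resolved relative to the manifest's directory; the `path` key and the relative-path rule are assumptions, not confirmed by this runbook.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def verify_manifest(manifest_path: Path) -> List[str]:
    """Return the paths of manifest entries that are missing or whose hash differs.

    Assumed manifest shape: {"files": [{"path": "...", "sha256": "..."}]}.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    base_dir = manifest_path.parent
    mismatches: List[str] = []
    for entry in manifest["files"]:
        target = base_dir / entry["path"]  # 'path' key is an assumption
        if not target.is_file():
            mismatches.append(entry["path"])
            continue
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        if digest != entry["sha256"].lower():
            mismatches.append(entry["path"])
    return mismatches


if __name__ == "__main__":
    manifest_file = Path("out/manifest.json")
    if manifest_file.is_file():
        failed = verify_manifest(manifest_file)
        print("all hashes match" if not failed else f"mismatched: {failed}")
    else:
        print(f"{manifest_file} not found; run the export first")
```

An empty result means every listed file exists and matches its recorded digest; it does not detect extra files that the manifest omits.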
## Evidence locations
- Sprint tracker: `docs/implplan/SPRINT_0332_0001_0001_docs_modules_vex_lens.md`.
- Module docs: `README.md`, `architecture.md`, `implementation_plan.md`.
- Dashboard stub: `runbooks/dashboards/vex-lens-observability.json`.