# Vuln Explorer observability runbook (demo snapshot · 2025-11-29) ## Dashboards (offline-friendly) - Grafana JSON: `docs/modules/vuln-explorer/runbooks/dashboards/vuln-explorer-observability.json` (import locally; no external data sources assumed). - Ops dashboards: `ops/devops/vuln/dashboards/vuln-explorer.json` (CI/staging) adds API latency p95, projection lag, error rate, query budget enforcement. ## Key metrics - `vuln_projection_lag_seconds{tenant}` – seconds between latest ledger event and projector head. - `vuln_findings_open_total{severity,tenant}` – count of open findings by severity. - `vuln_export_duration_seconds_bucket` – histogram for export job runtime. - `vuln_projection_backlog_total` – queued events awaiting projection. - `vuln_triage_actions_total{type}` – immutable triage actions (assign, comment, risk_accept, remediation_note). - `vuln_api_request_duration_seconds_bucket{route}` – API latency for `GET /v1/findings*` and `POST /v1/reports`. - `vuln_query_hashes_total{tenant,query_hash}` – hashed query shapes (no PII) to observe cache effectiveness. - `vuln_api_payload_bytes_bucket{direction}` – request/response size histograms to spot oversized payloads. ## Logs & traces - Correlate by `correlationId` and `findingId`. Structured fields: `tenant`, `advisoryKey`, `policyVersion`, `projectId`, `route`. - Query PII guardrail: request filters are hashed (SHA-256 with deployment salt); raw filters are not logged. Strings longer than 128 chars are truncated; known PII fields (`email`, `userId`) are dropped before logging. - Trace exemplar anchors: `traceparent` headers are copied into logs; exporters stay disabled by default for air-gap. Enable by setting `Telemetry:ExportEnabled=true` and pointing to on-prem Tempo/Jaeger. ## Health/diagnostics - `/health/liveness` and `/health/readiness` (HTTP 200 expected; readiness checks Mongo + cache reachability). - `/status` returns build version, git commit, and enabled features; safe for anonymous fetch in sealed environments. - Ledger replay check: `GET /v1/findings?projectionMode=verify` emits `X-Vuln-Projection-Head` for quick consistency probes. ## Alert hints (wire to local Alertmanager or watchdog) - Projection lag > 120s for any tenant. - API p99 latency > 800ms for `GET /v1/findings` or `POST /v1/reports`. - Export failure rate > 2% over 10m window. - Accepted-risk approaching expiry within 7d (emit Notify event `vuln.accepted_risk.expiring`). ## Offline verification steps 1) Import Grafana JSON locally and point to Prometheus scrape job `vuln-explorer`. 2) Run `stella vuln export --format json --manifest out/manifest.json` and validate hashes using `jq -r '.files[].sha256'` against generated bundle. 3) Use `curl -s "$BASEURL/status" | jq '{commit,version,features}'` to confirm expected build metadata matches the exported bundle manifest. ## Evidence locations - Sprint alignment: `docs/implplan/SPRINT_0334_0001_0001_docs_modules_vuln_explorer.md`. - API contract draft: `docs/modules/vuln-explorer/api.md` and OpenAPI at `docs/modules/vuln-explorer/openapi/vuln-explorer.v1.yaml`. - Schema references: `docs/modules/vuln-explorer/architecture.md` (ledger model, VEX decision schemas).