Vuln Explorer observability runbook (demo snapshot · 2025-11-29)

Dashboards (offline-friendly)

  • Grafana JSON: docs/modules/vuln-explorer/runbooks/dashboards/vuln-explorer-observability.json (import locally; no external data sources assumed).
  • Panels: projection lag, open findings by severity/tenant, accepted-risk ageing, API 5xx rate, export duration p95, ledger replay backlog.

Key metrics

  • vuln_projection_lag_seconds{tenant}: seconds between the latest ledger event and the projector head.
  • vuln_findings_open_total{severity,tenant}: count of open findings by severity and tenant.
  • vuln_export_duration_seconds_bucket: histogram of export job runtime.
  • vuln_projection_backlog_total: queued events awaiting projection.
  • vuln_triage_actions_total{type}: count of immutable triage actions by type (assign, comment, risk_accept, remediation_note).
  • vuln_api_request_duration_seconds_bucket{route}: histogram of API latency per route for GET /v1/findings* and POST /v1/reports.
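The export duration p95 panel and the latency alerts are computed from the cumulative _bucket series above. As a rough sketch (not the dashboard's actual query), the Python function below reproduces the linear interpolation that Prometheus's histogram_quantile() applies to cumulative buckets, which helps when sanity-checking a panel by hand from raw bucket counts.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Mirrors Prometheus-style linear interpolation. Buckets must be sorted by
    upper bound and end with a +Inf bucket that holds the total count.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    # Find the first finite bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets[:-1]):
        if cumulative >= rank:
            break
    else:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    if i == 0 and upper <= 0:
        return upper
    # Interpolate linearly between the previous bound and this bucket's bound.
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    in_bucket = cumulative - below
    return lower + (upper - lower) * (rank - below) / in_bucket
```

For example, with made-up buckets [(0.5, 10), (1.0, 60), (2.0, 95), (math.inf, 100)], histogram_quantile(0.95, ...) returns 2.0. In PromQL the equivalent panel query would typically look like histogram_quantile(0.95, sum by (le) (rate(vuln_export_duration_seconds_bucket[5m]))); the 5m window is an assumption, not taken from the shipped dashboard.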

Logs & traces

  • Correlate logs by correlationId and findingId. Other structured fields: tenant, advisoryKey, policyVersion, projectId, route.
  • Trace exemplar anchors: incoming traceparent headers are copied into logs. Trace exporters are disabled by default for air-gapped deployments; to enable them, set Telemetry:ExportEnabled=true and point the exporter at an on-prem Tempo or Jaeger instance.

Health/diagnostics

  • /health/liveness and /health/readiness: both should return HTTP 200; readiness also checks MongoDB and cache reachability.
  • /status returns the build version, git commit, and enabled features; it is safe to fetch anonymously in sealed environments.
  • Ledger replay check: GET /v1/findings?projectionMode=verify returns the X-Vuln-Projection-Head response header for quick consistency probes.
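These endpoints can be probed from a script inside a sealed environment. A minimal sketch using only the Python standard library; it assumes the verify request needs no extra authentication or tenant headers, which may not hold in your deployment, and treats the X-Vuln-Projection-Head value as an opaque string because its format is not documented here. The probe function and its fetch parameter (a seam for tests) are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable

# A fetcher maps a URL to (HTTP status, response headers).
Fetcher = Callable[[str], tuple[int, dict[str, str]]]


def http_fetch(url: str, timeout: float = 5.0) -> tuple[int, dict[str, str]]:
    """GET url; HTTP error statuses are returned, connection errors still raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, dict(resp.headers.items())
    except urllib.error.HTTPError as err:
        return err.code, (dict(err.headers.items()) if err.headers else {})


def probe(base_url: str, fetch: Fetcher = http_fetch) -> dict[str, object]:
    """Probe liveness, readiness, and the projection-head header."""
    base = base_url.rstrip("/")
    results: dict[str, object] = {}
    for path in ("/health/liveness", "/health/readiness"):
        status, _ = fetch(base + path)
        results[path] = status == 200
    status, headers = fetch(base + "/v1/findings?projectionMode=verify")
    # HTTP header names are case-insensitive; normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    results["projection_head"] = (
        lowered.get("x-vuln-projection-head") if status == 200 else None
    )
    return results
```

Against a live instance, probe("http://localhost:8080") would go through http_fetch (the URL is illustrative); passing a fake fetcher lets the same logic run without a network.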

Alert hints (wire to local Alertmanager or watchdog)

  • Projection lag > 120s for any tenant.
  • API p99 latency > 800ms for GET /v1/findings or POST /v1/reports.
  • Export failure rate > 2% over a 10-minute window.
  • Accepted risk expiring within 7 days (emit the Notify event vuln.accepted_risk.expiring).
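Where no Alertmanager is available, a local watchdog can evaluate these thresholds directly. The Python sketch below assumes a hypothetical Snapshot of readings already pulled from Prometheus (latencies in seconds, matching the metric units); the class, its field names, and the alert message strings are illustrative, not part of the service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Thresholds taken from the alert hints above.
LAG_LIMIT_S = 120.0
P99_LIMIT_S = 0.8
EXPORT_FAILURE_LIMIT = 0.02
EXPIRY_WINDOW = timedelta(days=7)
WATCHED_ROUTES = {"GET /v1/findings", "POST /v1/reports"}


@dataclass
class Snapshot:
    """Hypothetical readings a watchdog has already collected."""
    projection_lag_s: dict[str, float] = field(default_factory=dict)  # tenant -> lag
    api_p99_s: dict[str, float] = field(default_factory=dict)  # route -> p99 seconds
    export_failures_10m: int = 0
    export_total_10m: int = 0
    risk_expiries: dict[str, datetime] = field(default_factory=dict)  # finding -> expiry


def evaluate(s: Snapshot, now: datetime) -> list[str]:
    """Return one human-readable alert per breached threshold."""
    alerts: list[str] = []
    for tenant, lag in sorted(s.projection_lag_s.items()):
        if lag > LAG_LIMIT_S:
            alerts.append(f"projection lag {lag:.0f}s for tenant {tenant}")
    for route, p99 in sorted(s.api_p99_s.items()):
        # Only the two routes named in the alert hint are watched.
        if route in WATCHED_ROUTES and p99 > P99_LIMIT_S:
            alerts.append(f"p99 latency {p99 * 1000:.0f}ms on {route}")
    if s.export_total_10m > 0:
        rate = s.export_failures_10m / s.export_total_10m
        if rate > EXPORT_FAILURE_LIMIT:
            alerts.append(f"export failure rate {rate:.1%} over 10m")
    for finding_id, expiry in sorted(s.risk_expiries.items()):
        # Risks not yet expired but expiring within the window.
        if now <= expiry <= now + EXPIRY_WINDOW:
            alerts.append(f"vuln.accepted_risk.expiring {finding_id}")
    return alerts
```

Each returned string would then be forwarded to whatever notification path the site uses; for accepted risks, that is where the vuln.accepted_risk.expiring Notify event would be emitted.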

Offline verification steps

  1. Import the Grafana JSON locally and point it at the Prometheus data source that scrapes the vuln-explorer job.
  2. Run stella vuln export --format json --manifest out/manifest.json, then check each hash listed by jq -r '.files[].sha256' out/manifest.json against the generated bundle.
  3. Run curl -s "$BASEURL/status" | jq '{commit,version,features}' and confirm that the reported build metadata matches the exported bundle manifest.
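The hash check in step 2 can also be scripted without jq. The Python sketch below recomputes each file's SHA-256 and compares it with the manifest. It assumes each files[] entry carries a path field, relative to the manifest's directory, alongside sha256; that is a guess about the manifest schema rather than something this runbook specifies, so adjust the field names to the real format.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return a problem string per missing or mismatched bundle file.

    Assumes (hypothetically) that each files[] entry has 'path' and 'sha256'.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    root = manifest_path.parent
    problems: list[str] = []
    for entry in manifest.get("files", []):
        target = root / entry["path"]
        if not target.is_file():
            problems.append(f"missing: {entry['path']}")
        elif sha256_file(target) != entry["sha256"].lower():
            problems.append(f"mismatch: {entry['path']}")
    return problems
```

An empty list from verify_manifest(Path("out/manifest.json")) means every listed file is present and matches its recorded hash.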

Evidence locations

  • Sprint alignment: docs/implplan/SPRINT_0334_0001_0001_docs_modules_vuln_explorer.md.
  • API contract draft: docs/modules/vuln-explorer/api.md and OpenAPI at docs/modules/vuln-explorer/openapi/vuln-explorer.v1.yaml.
  • Schema references: docs/modules/vuln-explorer/architecture.md (ledger model, VEX decision schemas).