git.stella-ops.org/docs/modules/export-center/operations/observability.md
2025-11-30 19:12:35 +02:00

Export Center observability runbook (stub · 2025-11-29 demo)

Dashboards (offline import)

  • Grafana JSON: docs/modules/export-center/operations/dashboards/export-center-observability.json (import locally; no external data sources assumed).
  • Planned panels: export job duration p95/p99, bundle size histogram, registry push latency, provenance/attestation verification failures, queue depth, and error rate per profile.

Key metrics

  • export_job_duration_seconds_bucket{profile} — export duration by profile.
  • export_bundle_size_bytes_bucket{profile} — bundle size distribution.
  • export_registry_push_latency_seconds_bucket{profile} — registry push latency.
  • export_attestation_failures_total{reason} — DSSE/provenance verification failures.
  • export_queue_depth — pending export jobs.
  • export_manifest_publish_total{result} — manifest publish successes/failures.
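
The planned p95/p99 panels are derived from the `_bucket` series listed above. As a minimal sketch, assuming the buckets are cumulative Prometheus-style histograms, the function below estimates a quantile by linear interpolation inside the matching bucket. The bucket bounds and counts in the usage example are made-up illustration values, not measurements.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes Prometheus-style cumulative buckets whose last bound is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical export_job_duration_seconds_bucket values for one profile.
sample_buckets = [(30, 50), (60, 80), (120, 95), (300, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
```

Because the estimate interpolates within a bucket, its accuracy depends on how tightly the bucket bounds bracket the per-profile SLA target.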

Logs & traces

  • Correlate logs by exportId, profile, and tenant; include bundleDigest, attestationStatus, and registry in each record. Traces are disabled by default; enable OTLP export to an on-prem collector when permitted.
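
One way to keep the correlation fields consistent is to emit every log line as JSON with a fixed key set. The sketch below uses the Python standard `logging` module purely for illustration; the field names come from this runbook, while the JSON layout, the logger name, and the `build_logger` helper are assumptions, not the export service's actual logging code.

```python
import json
import logging

# Correlation keys named in the runbook; the JSON layout is an assumption.
CORRELATION_FIELDS = (
    "exportId", "profile", "tenant",
    "bundleDigest", "attestationStatus", "registry",
)


class JsonCorrelationFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the correlation keys."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in CORRELATION_FIELDS:
            # Fields not supplied via `extra` are emitted as null.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("export-center-sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonCorrelationFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `build_logger(sys.stderr).info("bundle pushed", extra={"exportId": "exp-1", "tenant": "t1"})` then produces a single line that can be filtered by any of the six keys.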

Health/diagnostics

  • /health/liveness and /health/readiness (export service) check storage, registry reachability, and attestation verification path.
  • /status exposes build version, commit, feature flags; verify against offline bundle manifest.
  • Verification probe: stella export bundle verify --manifest <path> once bundle available; validate hashes against manifest.
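
The endpoint checks above can be scripted. The following is a minimal sketch, assuming the endpoints answer HTTP 200 when healthy and that /status and the offline bundle manifest both expose `version` and `commit` keys; those key names and the base URL are assumptions to adjust for the real payloads.

```python
import urllib.error
import urllib.request

HEALTH_PATHS = ("/health/liveness", "/health/readiness")


def http_get(url, timeout=5.0):
    """Return (status_code, body); network errors yield (None, reason)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        return exc.code, exc.read().decode("utf-8", "replace")
    except OSError as exc:  # URLError is a subclass of OSError
        return None, str(exc)


def probe(base_url, fetch=http_get):
    """Map each health path to True when it answers HTTP 200."""
    results = {}
    for path in HEALTH_PATHS:
        status, _body = fetch(base_url.rstrip("/") + path)
        results[path] = status == 200
    return results


def status_matches_manifest(status_doc, manifest_doc, keys=("version", "commit")):
    """Compare /status output with the manifest on the assumed keys."""
    return all(
        key in status_doc and status_doc[key] == manifest_doc.get(key)
        for key in keys
    )
```

Passing `fetch` as a parameter keeps the probe usable in sealed environments, where a stub or a file-backed reader can stand in for live HTTP calls.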

Alert hints

  • Export job duration p99 > target SLA per profile.
  • Attestation verification failures > 0 over 10m.
  • Registry push latency spikes or error rate > threshold.
  • Queue depth growth without completion.
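
Two of these hints reduce to simple checks over a window of samples. The sketch below assumes samples are collected oldest to newest over the evaluation window (for example 10m) and that a job-completion counter is available; this runbook does not name such a counter, so `completed_samples` is hypothetical.

```python
def counter_increase(samples):
    """Total increase of a counter across samples, tolerating resets to zero."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter restarted: count everything accumulated since the reset.
            increase += cur
    return increase


def attestation_alert(failure_samples):
    """Fire when export_attestation_failures_total grew at all in the window."""
    return counter_increase(failure_samples) > 0


def queue_stalled(depth_samples, completed_samples):
    """Fire when export_queue_depth grew while no job completed.

    `completed_samples` is a hypothetical completion counter.
    """
    grew = depth_samples[-1] > depth_samples[0]
    return grew and counter_increase(completed_samples) == 0
```

Comparing counter increases rather than raw values keeps the checks stable across service restarts, which reset counters to zero.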

Offline verification steps

  1. Import the Grafana JSON locally and point it at the Prometheus scrape labeled export-center.
  2. Run stella export bundle --profile <profile> --manifest out/manifest.json, then list the expected hashes with jq -r '.files[].sha256' out/manifest.json and compare them against the SHA-256 digests of the generated bundles.
  3. Fetch /status and compare commit/version to offline bundle manifest.
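
The hash comparison in step 2 can be automated without network access. This is a minimal sketch, assuming the manifest has a top-level `files` array whose entries carry a `sha256` field (as the jq filter implies) plus a relative `path` field; the `path` key is an assumption and may be named differently in the real manifest.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large bundles are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, bundle_dir):
    """Return (path, problem) tuples; an empty list means every hash matched."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    problems = []
    for entry in manifest["files"]:
        relative = entry["path"]  # assumed key name, see lead-in
        target = Path(bundle_dir) / relative
        if not target.is_file():
            problems.append((relative, "missing"))
        elif sha256_of(target) != entry["sha256"].lower():
            problems.append((relative, "sha256 mismatch"))
    return problems
```

For example, `verify_manifest("out/manifest.json", "out")` returns an empty list when every listed file is present and matches its recorded digest.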

Evidence locations

  • Sprint tracker: docs/implplan/SPRINT_0320_0001_0001_docs_modules_export_center.md.
  • Module docs: README.md, architecture.md, implementation_plan.md.
  • Dashboard stub: operations/dashboards/export-center-observability.json.