Zastava observability runbook (stub · 2025-11-29 demo)

Dashboards (offline import)

  • Grafana JSON: docs/modules/zastava/operations/dashboards/zastava-observability.json (import locally; no external data sources assumed).
  • Planned panels: admission decision rate, webhook latency p95/p99, cache freshness (Surface.FS), Surface.Env key misses, Secrets fetch failures, policy violation counts, and drift events.
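
Before importing the dashboard into an air-gapped Grafana, it can help to list which data source each panel references. The following is a minimal sketch, assuming the stub follows the usual Grafana dashboard layout (a top-level `panels` array, with row panels optionally nesting their own `panels`). The stub's actual contents are not shown here, and `panel_datasources` is a hypothetical helper name.

```python
import json


def panel_datasources(dashboard):
    """Return (title, datasource) pairs for every panel, including nested ones."""
    found = []

    def walk(panels):
        for panel in panels:
            found.append((panel.get("title", ""), panel.get("datasource")))
            # Row panels may carry their own nested "panels" list.
            walk(panel.get("panels", []))

    walk(dashboard.get("panels", []))
    return found


# Illustrative dashboard fragment (assumed, not taken from the real stub).
SAMPLE = json.loads("""
{
  "title": "Zastava observability",
  "panels": [
    {"title": "Admission decision rate", "datasource": "Prometheus"},
    {"title": "Latency", "type": "row", "panels": [
      {"title": "Webhook latency p99", "datasource": "Prometheus"}
    ]}
  ]
}
""")

if __name__ == "__main__":
    for title, source in panel_datasources(SAMPLE):
        print(f"{title!r}: {source!r}")
```

Any panel that names a data source other than the local Prometheus would need to be repointed after import.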

Key metrics

  • zastava_admission_latency_seconds_bucket{webhook} — admission webhook latency.
  • zastava_admission_decisions_total{result} — allow/deny counts.
  • zastava_surface_env_miss_total — Surface.Env key misses.
  • zastava_surface_secrets_failures_total{reason} — secret retrieval failures.
  • zastava_surface_fs_cache_freshness_seconds — cache age vs Scanner surface metadata.
  • zastava_drift_events_total{type} — drift detections by category.
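
The `_bucket` suffix suggests the latency metric is a Prometheus histogram, so p95/p99 values are estimated from cumulative bucket counts rather than read directly. The sketch below mirrors the linear-interpolation approach that Prometheus's `histogram_quantile` uses. The bucket boundaries and counts are made-up sample data, not values emitted by Zastava.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +Inf, as in Prometheus.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets, key=lambda b: b[0])
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies above the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for one webhook (bounds in seconds).
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    print("p95 ~", histogram_quantile(0.95, SAMPLE_BUCKETS))
    print("p99 ~", histogram_quantile(0.99, SAMPLE_BUCKETS))
```

The estimate is only as precise as the bucket layout, so a threshold of interest should ideally coincide with a bucket boundary.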

Logs & traces

  • Correlate by correlationId, tenant, cluster, and admissionId. Include policyVersion, surfaceEnvProfile, and secretsProvider fields.
  • Traces are disabled by default for air-gapped deployments; to enable them, set Telemetry:ExportEnabled=true and point the exporter at an on-prem collector.
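
As an illustration of the correlation fields above, the following sketch groups log records by the four correlation keys and reports records that lack any of the listed fields. It assumes logs are emitted as one JSON object per line with exactly these field names. The log format is not specified here, and `correlate` is a hypothetical helper.

```python
import json
from collections import defaultdict

CORRELATION_KEYS = ("correlationId", "tenant", "cluster", "admissionId")
CONTEXT_FIELDS = ("policyVersion", "surfaceEnvProfile", "secretsProvider")


def correlate(lines):
    """Group JSON log lines by correlation keys.

    Returns (groups, incomplete), where incomplete lists
    (record, missing_fields) pairs for records lacking expected fields.
    """
    groups = defaultdict(list)
    incomplete = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines in this sketch
        if not isinstance(record, dict):
            continue
        missing = [f for f in CORRELATION_KEYS + CONTEXT_FIELDS if f not in record]
        if missing:
            incomplete.append((record, missing))
        key = tuple(record.get(f) for f in CORRELATION_KEYS)
        groups[key].append(record)
    return dict(groups), incomplete
```

For example, passing the lines of a captured log file returns one group per admission, which makes it easy to follow a single request across the webhook and the observer.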

Health/diagnostics

  • /health/liveness and /health/readiness (webhook + observer) check cache reachability, Secrets provider connectivity, and policy fetch.
  • /status exposes build version, commit, feature flags; verify against offline bundle manifest.
  • Cache probe: GET /surface/fs/cache/status returns freshness and hash for cached surfaces.
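
The endpoints above can be polled with a small script. This is a minimal sketch using only the Python standard library. It treats HTTP 200 as healthy and ignores response bodies, because their shape is not documented here. The base URL in the usage note is a placeholder, and `probe` and `unhealthy` are hypothetical helpers.

```python
import urllib.error
import urllib.request

ENDPOINTS = (
    "/health/liveness",
    "/health/readiness",
    "/status",
    "/surface/fs/cache/status",
)


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with a non-2xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def probe(base_url, fetch=http_status):
    """Query every diagnostic endpoint and map path -> status code."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) for path in ENDPOINTS}


def unhealthy(results):
    """Return the sorted paths that did not answer with HTTP 200."""
    return sorted(path for path, status in results.items() if status != 200)
```

A call such as `unhealthy(probe("http://localhost:8080"))` (placeholder address) returns an empty list when all four endpoints respond with 200. The `fetch` parameter lets the probe be exercised without a live service.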

Alert hints

  • Admission latency p99 > 800ms.
  • Deny rate spike > 5% over 10m without policy change.
  • Surface.Env miss rate > 1% or Secrets failure > 0 over 10m.
  • Cache freshness > 10m behind Scanner surface metadata.
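
The thresholds above can be written down as a small evaluation function, for example to sanity-check values read off the dashboard. This sketch assumes each input has already been aggregated over a 10-minute window. It also reads the deny-rate hint as "deny ratio above 5% while no policy change occurred", which is one possible interpretation. The `Window` type and the alert names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Measurements aggregated over one evaluation window (assumed 10m)."""
    p99_latency_seconds: float
    deny_ratio: float            # denies / total decisions, 0..1
    policy_changed: bool         # whether policy changed in the window
    env_miss_ratio: float        # Surface.Env misses / lookups, 0..1
    secrets_failures: int        # count of Secrets retrieval failures
    cache_freshness_seconds: float


def firing_alerts(w):
    """Return the names of alert hints whose thresholds are exceeded."""
    alerts = []
    if w.p99_latency_seconds > 0.8:          # p99 > 800ms
        alerts.append("admission_latency_p99")
    if w.deny_ratio > 0.05 and not w.policy_changed:
        alerts.append("deny_rate_spike")
    if w.env_miss_ratio > 0.01:              # miss rate > 1%
        alerts.append("surface_env_miss_rate")
    if w.secrets_failures > 0:
        alerts.append("secrets_failures")
    if w.cache_freshness_seconds > 600:      # more than 10m behind
        alerts.append("cache_freshness")
    return alerts
```

All comparisons are strict, so a value exactly at a threshold does not fire.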

Offline verification steps

  1. Import the Grafana JSON locally and point it at the Prometheus scrape labeled zastava.
  2. Replay a sealed admission bundle and verify that the /status output and the cache probe hashes match the manifest in the offline kit.
  3. Run webhook smoke (kubectl apply --dry-run=server -f samples/admission-request.yaml) and confirm metrics increment locally.
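
Steps 2 and 3 can be checked mechanically. The sketch below compares reported hashes against expected ones and confirms that a counter grew between two metric scrapes. It assumes hashes are available as name-to-hex-string mappings (the manifest's real layout is not described here) and that metrics are in the Prometheus text exposition format. The parser is deliberately simplified, and both helpers are hypothetical.

```python
def mismatched_hashes(reported, manifest):
    """Return names whose reported hash is missing or differs from the manifest."""
    return sorted(
        name
        for name, expected in manifest.items()
        if (reported.get(name) or "").lower() != expected.lower()
    )


def counter_total(exposition, metric):
    """Sum all samples of `metric` across label sets in a text scrape."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP and TYPE comments
        if "{" in line:
            name, _, rest = line.partition("{")
            rest = rest[rest.rfind("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        if name.strip() != metric:
            continue
        fields = rest.split()
        if fields:
            total += float(fields[0])  # first field is the sample value
    return total


def incremented(before, after, metric="zastava_admission_decisions_total"):
    """True if the summed counter is larger in the second scrape."""
    return counter_total(after, metric) > counter_total(before, metric)
```

Scrape the metrics endpoint once before and once after the dry-run apply, then pass both bodies to `incremented`; an empty list from `mismatched_hashes` means every hash in the manifest was matched.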

Evidence locations

  • Sprint tracker: docs/implplan/SPRINT_0335_0001_0001_docs_modules_zastava.md.
  • Module docs: README.md, architecture.md, implementation_plan.md.
  • Dashboard stub: operations/dashboards/zastava-observability.json.