Export Center Operations Runbook

Export Center workers and API are landing across Sprints 35-37. This runbook captures the target operational procedures so DevOps can validate them as each milestone goes live. Update specific commands once EXPORT-SVC-35-006, EXPORT-SVC-36-001..004, and related CLI tasks ship.

1. Service scope

The Export Center packages StellaOps evidence and policy overlays into reproducible bundles (JSON, Trivy DB, mirror). Operations owns:

  • Worker scaling, queue management, and distribution storage.
  • Monitoring and alerts for run throughput, failures, and verification issues.
  • Runbook execution for recovery, retention, and compliance.
  • Coordination with DevOps validation (cosign + trivy module db import smoke tests).

Related documentation:

  • docs/export-center/overview.md
  • docs/export-center/architecture.md
  • docs/export-center/profiles.md
  • docs/export-center/trivy-adapter.md
  • docs/export-center/mirror-bundles.md
  • docs/export-center/api.md
  • docs/export-center/cli.md

2. Contacts & tooling

| Area | Owner(s) | Escalation |
|------|----------|------------|
| Export Center service | Exporter Service Guild | #export-center-ops, on-call rotation |
| Distribution & CI smoke | DevOps Guild | CI channel, PagerDuty devops-export |
| KMS / encryption | Authority Core | #authority-core |
| Offline Kit dissemination | Offline Kit Guild | #offline-kit |

Primary tooling:

  • stella export CLI (submit, watch, download, verify).
  • Export Center API (/api/export/*) for automation.
  • Grafana dashboards (Export Center / Run Health, Export Center / Distribution).
  • Alertmanager routes (Export.Center.Failures, Export.Center.Verify).

3. Monitoring & SLOs

Key metrics (exposed by workers and API):

| Metric | SLO / Alert | Notes |
|--------|-------------|-------|
| exporter_run_duration_seconds | p95 < 300 s (full), < 120 s (delta) | Break down by profile (profile_kind). |
| exporter_run_failures_total | Alert when > 3 failures/15 min per profile | Include error_code label. |
| exporter_run_bytes_total | Track growth trends | Helps with storage planning. |
| exporter_distribution_push_seconds | p95 < 60 s | Covers OCI/object storage. |
| exporter_verify_failures_total | Alert on any non-zero | Raised when cosign/Trivy smoke tests fail. |
| exporter_retention_pruned_total | Should increase nightly | Confirms retention job success. |
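
The duration SLOs above can be spot-checked offline against a sample of run durations. The following is a minimal sketch, assuming durations have already been collected per run kind; the nearest-rank percentile and the names `SLO_SECONDS`, `p95`, and `p95_within_slo` are illustrative and not part of Export Center.

```python
# SLO targets taken from the metrics table: p95 run duration in seconds.
# Keying by "full" / "delta" is an assumption made for this sketch.
SLO_SECONDS = {"full": 300.0, "delta": 120.0}


def p95(durations):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    ordered = sorted(durations)
    # Integer form of ceil(0.95 * n); avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_within_slo(kind, durations):
    """Return True when the p95 duration is strictly below the SLO target."""
    return p95(durations) < SLO_SECONDS[kind]
```

In production the same check would normally come from the metrics backend; this helper is only meant for ad hoc validation of exported samples.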

Dashboards must include:

  • Run throughput by profile.
  • Failure breakdown (adapter, signing, distribution).
  • Queue depth and worker concurrency (via Orchestrator metrics).
  • Storage consumption (object storage buckets, local staging).

Alerts (Alertmanager):

  • ExportCenterRunFailureSpike - exporter_run_failures_total increases by more than 3 within 15 minutes for a profile.
  • ExportCenterVerifyFailure - exporter_verify_failures_total reports any non-zero value.
  • ExportCenterWorkerLag - queue backlog > threshold for 10 minutes.
  • ExportCenterRetentionStale - no pruning events in 24 hours.
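
The failure-spike condition can be illustrated without an alerting stack. This is a sketch only, assuming failure events are available as `(timestamp_seconds, profile)` tuples; the function name and input shape are hypothetical, and the real alert is expected to be an Alertmanager rule over `exporter_run_failures_total`.

```python
WINDOW_SECONDS = 15 * 60  # 15-minute evaluation window
THRESHOLD = 3             # alert when a profile exceeds this many failures


def failure_spike_profiles(failures, now):
    """Return the sorted profiles with more than THRESHOLD failures
    in the window (now - WINDOW_SECONDS, now]."""
    counts = {}
    for ts, profile in failures:
        if now - WINDOW_SECONDS < ts <= now:
            counts[profile] = counts.get(profile, 0) + 1
    return sorted(p for p, n in counts.items() if n > THRESHOLD)
```

Counting per profile mirrors the "per profile" wording of the SLO table, so a noisy tenant profile does not mask a quieter one.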

4. Routine operations

4.1 Daily checklist

  • Review dashboard for run throughput and error classes.
  • Confirm CI smoke job (cosign + trivy module db import) passed.
  • Check storage usage against capacity thresholds.
  • Verify retention job executed (look for exporter_retention_pruned_total increment).
  • Scan logs for adapter.trivy.unsupported_schema_version or mirror.delta.apply_failed.

4.2 Weekly tasks

  • Rotate Download/OCI API tokens if configured with short-lived credentials.
  • Review upcoming profile changes (new tenants, profile updates).
  • Test stella export verify against a recent run for each profile.
  • Exercise worker failover (take one replica offline and confirm the remaining workers pick up its jobs).

4.3 Pre-release

  • Ensure bundles generated for release candidates pass cosign verification.
  • Capture sample manifests (export.json, manifest.yaml) for documentation archives.
  • Validate Offline Kit packaging includes latest full + delta mirror bundles.

5. Capacity & scaling

5.1 Worker sizing

  • Default workers handle ~2 full runs or 6 delta runs concurrently per 4 vCPU.
  • Scale out when:
    • Queue depth (exporter_jobs_ready) > 10 for 10 minutes.
    • p95 durations exceed SLO for multiple runs without failures.
  • Use Orchestrator quotas: ensure per-tenant concurrency (max_active_runs) is tuned.
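
The queue-depth trigger above ("> 10 for 10 minutes") is a sustained-breach check. A minimal sketch follows, assuming periodic samples of `exporter_jobs_ready` as `(timestamp_seconds, depth)` pairs; the function name and sampling model are assumptions for illustration.

```python
def queue_backlog_sustained(samples, now, depth_limit=10, hold_seconds=600):
    """Return True when queue depth has stayed above depth_limit
    continuously for at least hold_seconds up to `now`."""
    breach_start = None
    for ts, depth in sorted(samples):
        if ts > now:
            break  # ignore samples from the future
        if depth > depth_limit:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # breach interrupted; start over
    return breach_start is not None and now - breach_start >= hold_seconds
```

A single sample at or below the limit resets the timer, which keeps short bursts from triggering a scale-out.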

5.2 Storage planning

  • Staging storage (object store or filesystem) must hold at least:
    • Latest full bundle per tenant per profile.
    • Last N deltas (default N=5).
  • Set retention policy via configuration:
      ExportCenter:
        Retention:
          Mirror:
            Mode: days
            Value: 30
          Trivy:
            Mode: count
            Value: 10
  • Monitor exporter_storage_bytes_total (if available) or use bucket metrics from storage provider.
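
To make the two retention modes concrete, here is a sketch of how a pruning pass might pick candidates. The semantics are assumptions, not confirmed behaviour: `days` is read as "prune bundles older than Value days" and `count` as "keep only the newest Value bundles". The function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta


def select_for_pruning(bundles, mode, value, now):
    """bundles: list of (name, created_at) pairs.
    Return the names to prune, oldest first."""
    ordered = sorted(bundles, key=lambda b: b[1])
    if mode == "days":
        cutoff = now - timedelta(days=value)
        return [name for name, created in ordered if created < cutoff]
    if mode == "count":
        excess = max(len(ordered) - value, 0)
        return [name for name, _ in ordered[:excess]]
    raise ValueError(f"unsupported retention mode: {mode}")
```

A real retention job would also need to protect the latest full bundle per tenant and profile and the last N deltas, as required by the staging rules above; this sketch does not enforce that.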

6. Failure response

| Symptom | Likely cause | Immediate action | Follow-up |
|---------|--------------|------------------|-----------|
| ERR_EXPORT_UNSUPPORTED_SCHEMA | Trivy schema mismatch | Pin SchemaVersion to previous value; rerun export | Coordinate with Exporter Guild to add new mapping |
| ERR_EXPORT_BASE_MISSING | Base manifest unavailable | Trigger full export (mirror:full), notify tenant | Investigate storage retention settings |
| Run stuck in pending | Worker unavailable / queue paused | Check worker pods / Orchestrator status | Scale workers or fix queue |
| Signing failure (errorCode=signing) | KMS outage or permission change | Verify KMS health; retry run; escalate to Authority | Document incident, review key rotation schedule |
| Distribution failure (errorCode=distribution) | OCI/object store outage | Switch profile distribution to download-only (distribution: ["http"]) | Restore distribution backend, resume normal config |
| CLI verification failure in CI | New bundle did not pass cosign or Trivy import | Inspect pipeline logs; download bundle; rerun verification manually | Engage Exporter Guild if data quality issue |
| Retention job skipped | Scheduler failure or misconfiguration | Run retention job manually (stella export retention run) | Audit scheduler configuration |

Log locations: exporter service emits structured logs with runId, profile, errorCode. For Kubernetes deployments, check kubectl logs deployment/export-center-worker.
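
During triage it can help to tally error codes from a captured log stream. The sketch below assumes the structured logs are JSON objects, one per line, carrying the `errorCode` field named above; the JSON-lines encoding and the function name are assumptions, so adjust the parsing to the actual log format.

```python
import json
from collections import Counter


def count_error_codes(lines):
    """Count errorCode values across JSON log lines.
    Lines that are not JSON objects, or that carry no errorCode, are skipped."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate plain-text lines mixed into the stream
        if isinstance(record, dict) and record.get("errorCode"):
            counts[record["errorCode"]] += 1
    return counts
```

The function accepts any iterable of strings, so it can be fed the saved output of the kubectl logs command or an open file handle.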

7. Recovery playbooks

7.1 Replaying a failed run

  1. Identify run (runId) and root cause via GET /api/export/runs/{id}.
  2. If configuration changed, clone profile and adjust settings.
  3. Resubmit the run (stella export run submit or the API), adding --allow-empty if the export is intentionally empty.
  4. Monitor SSE stream or stella export run watch.
  5. After success, prune failed run data if necessary.

7.2 Restoring from previous full bundle

  1. Locate last successful full bundle (mirror:full) and associated manifest.
  2. Download and verify signatures.
  3. Extract into mirror staging area.
  4. Apply subsequent delta bundles in order.
  5. Trigger mirror verification script (mirror verify <path>).
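
Steps 1 and 4 depend on choosing the right base bundle and applying deltas in order. The sketch below shows one way to compute that order from file names. It assumes a naming scheme modelled on `mirror-20251029-full.tar.zst`; the `delta` variant of the name and the function `restore_plan` are hypothetical, and same-day deltas are ignored because their order relative to the full bundle cannot be inferred from the date alone.

```python
import re

# Assumed naming scheme: mirror-YYYYMMDD-(full|delta).tar.zst
BUNDLE_RE = re.compile(r"^mirror-(\d{8})-(full|delta)\.tar\.zst$")


def restore_plan(names):
    """Return the latest full bundle followed by later deltas, oldest first."""
    parsed = []
    for name in names:
        match = BUNDLE_RE.match(name)
        if match:
            parsed.append((match.group(1), match.group(2), name))
    fulls = [p for p in parsed if p[1] == "full"]
    if not fulls:
        raise ValueError("no full bundle available")
    base = max(fulls)  # YYYYMMDD strings sort chronologically
    deltas = sorted(p for p in parsed if p[1] == "delta" and p[0] > base[0])
    return [base[2]] + [d[2] for d in deltas]
```

Each bundle in the returned list should still be downloaded and signature-verified before extraction, as described in step 2.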

7.3 KMS outage response

  1. Disable new export submissions temporarily (set per-tenant quota to 0).
  2. Coordinate with Authority Core to restore KMS.
  3. Once KMS is back, run stella export run submit --profile <id> --selectors ... --priority catch-up for affected tenants.

8. Verification workflow

All bundles must pass both signature and content verification.

8.1 Trivy bundle validation (CI job)

cosign verify-blob \
  --key tenants/acme/export-center.pub \
  --signature signatures/trivy-db.sig \
  trivy/db.bundle

trivy module db import trivy/db.bundle --cache-dir /tmp/trivy-cache

Automation: DEVOPS-EXPORT-36-001 ensures this runs on every pipeline.

8.2 Mirror bundle validation

cosign verify-blob \
  --key tenants/acme/export-center.pub \
  --signature signatures/export.sig \
  mirror/export.json

./offline-kit/bin/mirror verify mirror-20251029-full.tar.zst

If encryption is enabled, decrypt the bundle with the age or AES key before verification.
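
For automation, the cosign step can be wrapped so that pipelines get a simple pass/fail result. This is a minimal sketch that builds the same `cosign verify-blob` invocation shown above; the helper names are illustrative, and it assumes `cosign` is on the PATH of the machine running it.

```python
import subprocess


def cosign_verify_blob_argv(key, signature, blob):
    """Build the argument list for the cosign verify-blob command."""
    return ["cosign", "verify-blob", "--key", key, "--signature", signature, blob]


def verify_blob(key, signature, blob):
    """Run cosign and return True when it exits with status 0."""
    result = subprocess.run(
        cosign_verify_blob_argv(key, signature, blob),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Passing arguments as a list rather than a shell string avoids quoting problems with tenant paths. The mirror verify step would still need to run separately after the signature check.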

9. Change management

  • Profile changes require change record referencing tenant impact and expected bundle size.
  • Distribution configuration updates (OCI vs HTTP) must be tested in staging.
  • Schema upgrades (e.g., Trivy schema v3) need coordination with DevOps, Exporter, and Docs.
  • Update runbook and related docs when processes change (tie updates to DOCS-EXPORT-37-005).

10. References

  • docs/export-center/trivy-adapter.md
  • docs/export-center/mirror-bundles.md
  • ops/devops/TASKS.md (DEVOPS-EXPORT-36-001, DEVOPS-EXPORT-37-001)
  • docs/ingestion/aggregation-only-contract.md
  • docs/24_OFFLINE_KIT.md

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.