Files
git.stella-ops.org/docs/operations/export-runbook.md
root 68da90a11a
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Restructure solution layout by module
2025-10-28 15:10:40 +02:00

204 lines
8.8 KiB
Markdown

# Export Center Operations Runbook
> Export Center workers and API are landing across Sprints 35-37. This runbook captures the target operational procedures so DevOps can validate them as each milestone goes live. Update specific commands once `EXPORT-SVC-35-006`, `EXPORT-SVC-36-001..004`, and related CLI tasks ship.
## 1. Service scope
The Export Center packages StellaOps evidence and policy overlays into reproducible bundles (JSON, Trivy DB, mirror). Operations owns:
- Worker scaling, queue management, and distribution storage.
- Monitoring and alerts for run throughput, failures, and verification issues.
- Runbook execution for recovery, retention, and compliance.
- Coordination with DevOps validation (cosign + `trivy module db import` smoke tests).
Related documentation:
- `docs/export-center/overview.md`
- `docs/export-center/architecture.md`
- `docs/export-center/profiles.md`
- `docs/export-center/trivy-adapter.md`
- `docs/export-center/mirror-bundles.md`
- `docs/export-center/api.md`
- `docs/export-center/cli.md`
## 2. Contacts & tooling
| Area | Owner(s) | Escalation |
|------|----------|------------|
| Export Center service | Exporter Service Guild | `#export-center-ops`, on-call rotation |
| Distribution & CI smoke | DevOps Guild | CI channel, PagerDuty `devops-export` |
| KMS / encryption | Authority Core | `#authority-core` |
| Offline Kit dissemination | Offline Kit Guild | `#offline-kit` |
Primary tooling:
- `stella export` CLI (submit, watch, download, verify).
- Export Center API (`/api/export/*`) for automation.
- Grafana dashboards (`Export Center / Run Health`, `Export Center / Distribution`).
- Alertmanager routes (`Export.Center.Failures`, `Export.Center.Verify`).
## 3. Monitoring & SLOs
Key metrics (exposed by workers and API):
| Metric | SLO / Alert | Notes |
|--------|-------------|-------|
| `exporter_run_duration_seconds` | p95 < 300 s (full), < 120 s (delta) | Break down by profile (`profile_kind`). |
| `exporter_run_failures_total` | Alert when > 3 failures/15 min per profile | Include `error_code` label. |
| `exporter_run_bytes_total` | Track growth trends | Helps with storage planning. |
| `exporter_distribution_push_seconds` | p95 < 60 s | Covers OCI/object storage. |
| `exporter_verify_failures_total` | Alert on any non-zero | Raised when cosign/Trivy smoke tests fail. |
| `exporter_retention_pruned_total` | Should increase nightly | Confirms retention job success. |
Dashboards must include:
- Run throughput by profile.
- Failure breakdown (adapter, signing, distribution).
- Queue depth and worker concurrency (via Orchestrator metrics).
- Storage consumption (object storage buckets, local staging).
Alerts (Alertmanager):
- `ExportCenterRunFailureSpike` - `exporter_run_failures_total` increase rate > 3/15 min.
- `ExportCenterVerifyFailure` - any entry in `exporter_verify_failures_total` > 0.
- `ExportCenterWorkerLag` - queue backlog > threshold for 10 minutes.
- `ExportCenterRetentionStale` - no pruning events in 24 hours.
## 4. Routine operations
### 4.1 Daily checklist
- Review dashboard for run throughput and error classes.
- Confirm CI smoke job (cosign + `trivy module db import`) passed.
- Check storage usage against capacity thresholds.
- Verify retention job executed (look for `exporter_retention_pruned_total` increment).
- Scan logs for `adapter.trivy.unsupported_schema_version` or `mirror.delta.apply_failed`.
### 4.2 Weekly tasks
- Rotate Download/OCI API tokens if configured with short-lived credentials.
- Review upcoming profile changes (new tenants, profile updates).
- Test `stella export verify` against a recent run for each profile.
- Exercise failover of workers (scale to zero one replica, ensure others pick up).
### 4.3 Pre-release
- Ensure bundles generated for release candidates pass cosign verification.
- Capture sample manifests (`export.json`, `manifest.yaml`) for documentation archives.
- Validate Offline Kit packaging includes latest full + delta mirror bundles.
## 5. Capacity & scaling
### 5.1 Worker sizing
- Default workers handle ~2 full runs or 6 delta runs concurrently per 4 vCPU.
- Scale out when:
- Queue depth (`exporter_jobs_ready`) > 10 for 10 minutes.
- p95 durations exceed SLO for multiple runs without failures.
- Use Orchestrator quotas: ensure per-tenant concurrency (`max_active_runs`) is tuned.
### 5.2 Storage planning
- Staging storage (object store or filesystem) must hold at least:
- Latest full bundle per tenant per profile.
- Last `N` deltas (default N=5).
- Set retention policy via configuration:
```yaml
ExportCenter:
Retention:
Mirror:
Mode: days
Value: 30
Trivy:
Mode: count
Value: 10
```
- Monitor `exporter_storage_bytes_total` (if available) or use bucket metrics from storage provider.
## 6. Failure response
| Symptom | Likely cause | Immediate action | Follow-up |
|---------|--------------|------------------|-----------|
| `ERR_EXPORT_UNSUPPORTED_SCHEMA` | Trivy schema mismatch | Pin `SchemaVersion` to previous value; rerun export | Coordinate with Exporter Guild to add new mapping |
| `ERR_EXPORT_BASE_MISSING` | Base manifest unavailable | Trigger full export (`mirror:full`), notify tenant | Investigate storage retention settings |
| Run stuck in `pending` | Worker unavailable / queue paused | Check worker pods / Orchestrator status | Scale workers or fix queue |
| Signing failure (`errorCode=signing`) | KMS outage or permission change | Verify KMS health; retry run; escalate to Authority | Document incident, review key rotation schedule |
| Distribution failure (`errorCode=distribution`) | OCI/object store outage | Switch profile distribution to download-only (`distribution: ["http"]`) | Restore distribution backend, resume normal config |
| CLI verification failure in CI | New bundle did not pass cosign or Trivy import | Inspect pipeline logs; download bundle; rerun verification manually | Engage Exporter Guild if data quality issue |
| Retention job skipped | Scheduler failure or misconfiguration | Run retention job manually (`stella export retention run`) | Audit scheduler configuration |
Log locations: `exporter` service emits structured logs with `runId`, `profile`, `errorCode`. For Kubernetes deployments, check `kubectl logs deployment/export-center-worker`.
## 7. Recovery playbooks
### 7.1 Replaying a failed run
1. Identify run (`runId`) and root cause via `GET /api/export/runs/{id}`.
2. If configuration changed, clone profile and adjust settings.
3. Resubmit run (`stella export run submit` or API) with `--allow-empty` if intentionally empty.
4. Monitor SSE stream or `stella export run watch`.
5. After success, prune failed run data if necessary.
### 7.2 Restoring from previous full bundle
1. Locate last successful full bundle (`mirror:full`) and associated manifest.
2. Download and verify signatures.
3. Extract into mirror staging area.
4. Apply subsequent delta bundles in order.
5. Trigger mirror verification script (`mirror verify <path>`).
### 7.3 KMS outage response
1. Disable new export submissions temporarily (set per-tenant quota to 0).
2. Coordinate with Authority Core to restore KMS.
3. Once KMS back, run `stella export run submit --profile <id> --selectors ... --priority catch-up` for affected tenants.
## 8. Verification workflow
All bundles must pass both signature and content verification.
### 8.1 Trivy bundle validation (CI job)
```bash
cosign verify-blob \
--key tenants/acme/export-center.pub \
--signature signatures/trivy-db.sig \
trivy/db.bundle
trivy module db import trivy/db.bundle --cache-dir /tmp/trivy-cache
```
Automation: `DEVOPS-EXPORT-36-001` ensures this runs on every pipeline.
### 8.2 Mirror bundle validation
```bash
cosign verify-blob \
--key tenants/acme/export-center.pub \
--signature signatures/export.sig \
mirror/export.json
./offline-kit/bin/mirror verify mirror-20251029-full.tar.zst
```
If encryption enabled, decrypt using age or AES key before verification.
## 9. Change management
- Profile changes require change record referencing tenant impact and expected bundle size.
- Distribution configuration updates (`OCI` vs `HTTP`) must be tested in staging.
- Schema upgrades (e.g., Trivy schema v3) need coordination with DevOps, Exporter, and Docs.
- Update runbook and related docs when processes change (tie updates to `DOCS-EXPORT-37-005`).
## 10. References
- `docs/export-center/trivy-adapter.md`
- `docs/export-center/mirror-bundles.md`
- `ops/devops/TASKS.md` (`DEVOPS-EXPORT-36-001`, `DEVOPS-EXPORT-37-001`)
- `docs/ingestion/aggregation-only-contract.md`
- `docs/24_OFFLINE_KIT.md`
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.