Export Center Operations Runbook
Export Center workers and API are landing across Sprints 35-37. This runbook captures the target operational procedures so DevOps can validate them as each milestone goes live. Update specific commands once EXPORT-SVC-35-006, EXPORT-SVC-36-001..004, and related CLI tasks ship.
1. Service scope
The Export Center packages StellaOps evidence and policy overlays into reproducible bundles (JSON, Trivy DB, mirror). Operations owns:
- Worker scaling, queue management, and distribution storage.
- Monitoring and alerts for run throughput, failures, and verification issues.
- Runbook execution for recovery, retention, and compliance.
- Coordination with DevOps validation (cosign + trivy module db import smoke tests).
Related documentation:
- docs/modules/export-center/overview.md
- docs/modules/export-center/architecture.md
- docs/modules/export-center/profiles.md
- docs/modules/export-center/trivy-adapter.md
- docs/modules/export-center/mirror-bundles.md
- docs/modules/export-center/api.md
- docs/modules/export-center/cli.md
2. Contacts & tooling
| Area | Owner(s) | Escalation | 
|---|---|---|
| Export Center service | Exporter Service Guild | #export-center-ops, on-call rotation | 
| Distribution & CI smoke | DevOps Guild | CI channel, PagerDuty devops-export | 
| KMS / encryption | Authority Core | #authority-core | 
| Offline Kit dissemination | Offline Kit Guild | #offline-kit | 
Primary tooling:
- stella export CLI (submit, watch, download, verify).
- Export Center API (/api/export/*) for automation.
- Grafana dashboards (Export Center / Run Health, Export Center / Distribution).
- Alertmanager routes (Export.Center.Failures, Export.Center.Verify).
3. Monitoring & SLOs
Key metrics (exposed by workers and API):
| Metric | SLO / Alert | Notes | 
|---|---|---|
| exporter_run_duration_seconds | p95 < 300 s (full), < 120 s (delta) | Break down by profile (profile_kind). | 
| exporter_run_failures_total | Alert when > 3 failures/15 min per profile | Include error_code label. | 
| exporter_run_bytes_total | Track growth trends | Helps with storage planning. | 
| exporter_distribution_push_seconds | p95 < 60 s | Covers OCI/object storage. | 
| exporter_verify_failures_total | Alert on any non-zero | Raised when cosign/Trivy smoke tests fail. | 
| exporter_retention_pruned_total | Should increase nightly | Confirms retention job success. | 
Dashboards must include:
- Run throughput by profile.
- Failure breakdown (adapter, signing, distribution).
- Queue depth and worker concurrency (via Orchestrator metrics).
- Storage consumption (object storage buckets, local staging).
Alerts (Alertmanager):
- ExportCenterRunFailureSpike - exporter_run_failures_total increase rate > 3/15 min.
- ExportCenterVerifyFailure - any entry in exporter_verify_failures_total > 0.
- ExportCenterWorkerLag - queue backlog > threshold for 10 minutes.
- ExportCenterRetentionStale - no pruning events in 24 hours.
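The failure-spike condition above can be sketched as a sliding-window check. This is only an illustration of the threshold logic, not the Alertmanager rule itself; the function name and the input shape (a list of failure timestamps for one profile) are assumptions.

```python
from datetime import datetime, timedelta

# Threshold and window taken from the ExportCenterRunFailureSpike description.
FAILURE_THRESHOLD = 3
WINDOW = timedelta(minutes=15)


def failure_spike(failure_times, now):
    """Return True when more than FAILURE_THRESHOLD failures for one
    profile fall inside the 15-minute window ending at `now`.

    `failure_times` is a hypothetical list of datetimes, one per failed run.
    """
    window_start = now - WINDOW
    recent = [t for t in failure_times if window_start < t <= now]
    return len(recent) > FAILURE_THRESHOLD
```

The same shape works for the verify alert by setting the threshold to zero, since any verification failure should fire.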
4. Routine operations
4.1 Daily checklist
- Review dashboard for run throughput and error classes.
- Confirm CI smoke job (cosign + trivy module db import) passed.
- Check storage usage against capacity thresholds.
- Verify retention job executed (look for exporter_retention_pruned_total increment).
- Scan logs for adapter.trivy.unsupported_schema_version or mirror.delta.apply_failed.
4.2 Weekly tasks
- Rotate Download/OCI API tokens if configured with short-lived credentials.
- Review upcoming profile changes (new tenants, profile updates).
- Test stella export verify against a recent run for each profile.
- Exercise failover of workers (scale one replica to zero, ensure the others pick up its work).
4.3 Pre-release
- Ensure bundles generated for release candidates pass cosign verification.
- Capture sample manifests (export.json, manifest.yaml) for documentation archives.
- Validate Offline Kit packaging includes latest full + delta mirror bundles.
5. Capacity & scaling
5.1 Worker sizing
- Default workers handle ~2 full runs or 6 delta runs concurrently per 4 vCPU.
- Scale out when:
  - Queue depth (exporter_jobs_ready) > 10 for 10 minutes.
  - p95 durations exceed SLO for multiple runs without failures.
- Use Orchestrator quotas: ensure per-tenant concurrency (max_active_runs) is tuned.
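The queue-depth trigger and the sizing rule of thumb above can be sketched as follows. The function names, the sampling interval, and the sample list are assumptions for illustration; only the thresholds (depth > 10 for 10 minutes, ~2 full or ~6 delta runs per 4 vCPU) come from this runbook.

```python
QUEUE_DEPTH_LIMIT = 10   # exporter_jobs_ready threshold from the list above
SUSTAINED_MINUTES = 10   # how long the backlog must persist


def should_scale_out(queue_depth_samples, sample_interval_minutes=1):
    """Return True when every sample in the trailing 10 minutes exceeds the limit.

    `queue_depth_samples` is a hypothetical list of exporter_jobs_ready
    readings, oldest first, taken every `sample_interval_minutes`.
    """
    needed = SUSTAINED_MINUTES // sample_interval_minutes
    if needed == 0 or len(queue_depth_samples) < needed:
        return False
    return all(depth > QUEUE_DEPTH_LIMIT for depth in queue_depth_samples[-needed:])


def estimated_concurrency(vcpus, run_kind):
    """Rough concurrent-run capacity using ~2 full or ~6 delta runs per 4 vCPU."""
    per_four_vcpu = {"full": 2, "delta": 6}[run_kind]
    return (vcpus // 4) * per_four_vcpu
```

Requiring every sample in the window to exceed the limit avoids scaling on a brief spike; a production rule would more likely live in the autoscaler or alerting configuration.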
5.2 Storage planning
- Staging storage (object store or filesystem) must hold at least:
  - Latest full bundle per tenant per profile.
  - Last N deltas (default N=5).
- Set retention policy via configuration:
ExportCenter:
  Retention:
    Mirror:
      Mode: days
      Value: 30
    Trivy:
      Mode: count
      Value: 10
- Monitor exporter_storage_bytes_total (if available) or use bucket metrics from storage provider.
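To reason about what a retention pass would remove under the two modes shown in the configuration, a minimal sketch follows. The interpretation of the modes (days keeps bundles newer than Value days, count keeps the Value most recent bundles) is an assumption, as are the function name and the tuple shape; check the actual retention job before relying on it.

```python
from datetime import datetime, timedelta


def bundles_to_prune(bundles, mode, value, now):
    """Return the bundles a retention pass would remove, newest first.

    `bundles` is a hypothetical list of (name, created_at) tuples.
    mode="days"  keeps bundles created within the last `value` days.
    mode="count" keeps the `value` most recent bundles.
    """
    newest_first = sorted(bundles, key=lambda b: b[1], reverse=True)
    if mode == "days":
        cutoff = now - timedelta(days=value)
        return [b for b in newest_first if b[1] < cutoff]
    if mode == "count":
        return newest_first[value:]
    raise ValueError(f"unknown retention mode: {mode}")
```

Running this against a bucket listing before changing Value gives a quick estimate of how much storage a policy change would free.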
6. Failure response
| Symptom | Likely cause | Immediate action | Follow-up | 
|---|---|---|---|
| ERR_EXPORT_UNSUPPORTED_SCHEMA | Trivy schema mismatch | Pin SchemaVersion to previous value; rerun export | Coordinate with Exporter Guild to add new mapping | 
| ERR_EXPORT_BASE_MISSING | Base manifest unavailable | Trigger full export (mirror:full), notify tenant | Investigate storage retention settings | 
| Run stuck in pending | Worker unavailable / queue paused | Check worker pods / Orchestrator status | Scale workers or fix queue | 
| Signing failure (errorCode=signing) | KMS outage or permission change | Verify KMS health; retry run; escalate to Authority | Document incident, review key rotation schedule | 
| Distribution failure (errorCode=distribution) | OCI/object store outage | Switch profile distribution to download-only (distribution: ["http"]) | Restore distribution backend, resume normal config | 
| CLI verification failure in CI | New bundle did not pass cosign or Trivy import | Inspect pipeline logs; download bundle; rerun verification manually | Engage Exporter Guild if data quality issue | 
| Retention job skipped | Scheduler failure or misconfiguration | Run retention job manually (stella export retention run) | Audit scheduler configuration | 
Log locations: exporter service emits structured logs with runId, profile, errorCode. For Kubernetes deployments, check kubectl logs deployment/export-center-worker.
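When triaging, it can help to tally errorCode values per profile from the worker logs. The sketch below assumes the structured logs are one JSON object per line, which this runbook does not state; the field names runId, profile, and errorCode are the ones listed above, and the function name is hypothetical.

```python
import json
from collections import Counter


def count_error_codes(log_lines):
    """Count (profile, errorCode) pairs across structured log lines.

    Assumes each line is a JSON object; lines that are not JSON or that
    carry no errorCode are skipped.
    """
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("errorCode"):
            counts[(entry.get("profile"), entry["errorCode"])] += 1
    return counts
```

Feeding it the output of kubectl logs deployment/export-center-worker, split into lines, gives a quick view of which row of the table above applies.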
7. Recovery playbooks
7.1 Replaying a failed run
- Identify run (runId) and root cause via GET /api/export/runs/{id}.
- If configuration changed, clone profile and adjust settings.
- Resubmit run (stella export run submit or API) with --allow-empty if intentionally empty.
- Monitor SSE stream or stella export run watch.
- After success, prune failed run data if necessary.
7.2 Restoring from previous full bundle
- Locate last successful full bundle (mirror:full) and associated manifest.
- Download and verify signatures.
- Extract into mirror staging area.
- Apply subsequent delta bundles in order.
- Trigger mirror verification script (mirror verify <path>).
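The "apply subsequent delta bundles in order" step can be sketched as a filter and sort. This assumes each delta carries a creation timestamp and that creation order is the correct apply order; neither is confirmed here, and the function name and tuple shape are hypothetical.

```python
from datetime import datetime


def delta_apply_order(full_created_at, deltas):
    """Return delta names in the order they should be applied.

    `deltas` is a hypothetical list of (name, created_at) tuples. Only
    deltas created after the full bundle are kept, oldest first.
    """
    later = [d for d in deltas if d[1] > full_created_at]
    return [name for name, _ in sorted(later, key=lambda d: d[1])]
```

If the manifests record an explicit base or sequence number, prefer that over timestamps when ordering.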
7.3 KMS outage response
- Disable new export submissions temporarily (set per-tenant quota to 0).
- Coordinate with Authority Core to restore KMS.
- Once KMS is back, run stella export run submit --profile <id> --selectors ... --priority catch-up for affected tenants.
8. Verification workflow
All bundles must pass both signature and content verification.
8.1 Trivy bundle validation (CI job)
cosign verify-blob \
  --key tenants/acme/export-center.pub \
  --signature signatures/trivy-db.sig \
  trivy/db.bundle
trivy module db import trivy/db.bundle --cache-dir /tmp/trivy-cache
Automation: DEVOPS-EXPORT-36-001 ensures this runs on every pipeline.
8.2 Mirror bundle validation
cosign verify-blob \
  --key tenants/acme/export-center.pub \
  --signature signatures/export.sig \
  mirror/export.json
./offline-kit/bin/mirror verify mirror-20251029-full.tar.zst
If encryption is enabled, decrypt using age or the AES key before verification.
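For scripted checks, the cosign step in 8.1 and 8.2 can be wrapped as below. This is a sketch: the helper names are hypothetical, the arguments mirror the cosign verify-blob invocations shown above, and it assumes cosign is on PATH.

```python
import subprocess


def cosign_verify_blob_cmd(public_key, signature, blob):
    """Build the cosign verify-blob argument list used in sections 8.1 and 8.2."""
    return ["cosign", "verify-blob", "--key", public_key, "--signature", signature, blob]


def verify_bundle(public_key, signature, blob):
    """Run cosign and return True when it exits with status 0."""
    cmd = cosign_verify_blob_cmd(public_key, signature, blob)
    result = subprocess.run(cmd, check=False)
    return result.returncode == 0
```

Passing the arguments as a list rather than a shell string avoids quoting problems with tenant-specific paths.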
9. Change management
- Profile changes require change record referencing tenant impact and expected bundle size.
- Distribution configuration updates (OCI vs HTTP) must be tested in staging.
- Schema upgrades (e.g., Trivy schema v3) need coordination with DevOps, Exporter, and Docs.
- Update runbook and related docs when processes change (tie updates to DOCS-EXPORT-37-005).
10. References
- docs/modules/export-center/trivy-adapter.md
- docs/modules/export-center/mirror-bundles.md
- ops/devops/TASKS.md (DEVOPS-EXPORT-36-001, DEVOPS-EXPORT-37-001)
- docs/ingestion/aggregation-only-contract.md
- docs/24_OFFLINE_KIT.md
Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.