feat: Implement console session management with tenant and profile handling
- Add ConsoleSessionStore for managing console session state including tenants, profile, and token information.
- Create OperatorContextService to manage operator context for orchestrator actions.
- Implement OperatorMetadataInterceptor to enrich HTTP requests with operator context metadata.
- Develop ConsoleProfileComponent to display user profile and session details, including tenant information and access tokens.
- Add corresponding HTML and SCSS for ConsoleProfileComponent to enhance UI presentation.
- Write unit tests for ConsoleProfileComponent to ensure correct rendering and functionality.
		
							
								
								
									
										203
									
								
								docs/operations/export-runbook.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,203 @@ | ||||
| # Export Center Operations Runbook | ||||
|  | ||||
| > Export Center workers and API are landing across Sprints 35-37. This runbook captures the target operational procedures so DevOps can validate them as each milestone goes live. Update specific commands once `EXPORT-SVC-35-006`, `EXPORT-SVC-36-001..004`, and related CLI tasks ship. | ||||
|  | ||||
| ## 1. Service scope | ||||
|  | ||||
| The Export Center packages StellaOps evidence and policy overlays into reproducible bundles (JSON, Trivy DB, mirror). Operations owns: | ||||
|  | ||||
| - Worker scaling, queue management, and distribution storage. | ||||
| - Monitoring and alerts for run throughput, failures, and verification issues. | ||||
| - Runbook execution for recovery, retention, and compliance. | ||||
| - Coordination with DevOps validation (cosign + `trivy module db import` smoke tests). | ||||
|  | ||||
| Related documentation: | ||||
|  | ||||
| - `docs/export-center/overview.md` | ||||
| - `docs/export-center/architecture.md` | ||||
| - `docs/export-center/profiles.md` | ||||
| - `docs/export-center/trivy-adapter.md` | ||||
| - `docs/export-center/mirror-bundles.md` | ||||
| - `docs/export-center/api.md` | ||||
| - `docs/export-center/cli.md` | ||||
|  | ||||
| ## 2. Contacts & tooling | ||||
|  | ||||
| | Area | Owner(s) | Escalation | | ||||
| |------|----------|------------| | ||||
| | Export Center service | Exporter Service Guild | `#export-center-ops`, on-call rotation | | ||||
| | Distribution & CI smoke | DevOps Guild | CI channel, PagerDuty `devops-export` | | ||||
| | KMS / encryption | Authority Core | `#authority-core` | | ||||
| | Offline Kit dissemination | Offline Kit Guild | `#offline-kit` | | ||||
|  | ||||
| Primary tooling: | ||||
|  | ||||
| - `stella export` CLI (submit, watch, download, verify). | ||||
| - Export Center API (`/api/export/*`) for automation. | ||||
| - Grafana dashboards (`Export Center / Run Health`, `Export Center / Distribution`). | ||||
| - Alertmanager routes (`Export.Center.Failures`, `Export.Center.Verify`). | ||||
|  | ||||
| ## 3. Monitoring & SLOs | ||||
|  | ||||
| Key metrics (exposed by workers and API): | ||||
|  | ||||
| | Metric | SLO / Alert | Notes | | ||||
| |--------|-------------|-------| | ||||
| | `exporter_run_duration_seconds` | p95 < 300 s (full), < 120 s (delta) | Break down by profile (`profile_kind`). | | ||||
| | `exporter_run_failures_total` | Alert when > 3 failures/15 min per profile | Include `error_code` label. | | ||||
| | `exporter_run_bytes_total` | Track growth trends | Helps with storage planning. | | ||||
| | `exporter_distribution_push_seconds` | p95 < 60 s | Covers OCI/object storage. | | ||||
| | `exporter_verify_failures_total` | Alert on any non-zero | Raised when cosign/Trivy smoke tests fail. | | ||||
| | `exporter_retention_pruned_total` | Should increase nightly | Confirms retention job success. | | ||||
|  | ||||
| Dashboards must include: | ||||
|  | ||||
| - Run throughput by profile. | ||||
| - Failure breakdown (adapter, signing, distribution). | ||||
| - Queue depth and worker concurrency (via Orchestrator metrics). | ||||
| - Storage consumption (object storage buckets, local staging). | ||||
|  | ||||
| Alerts (Alertmanager): | ||||
|  | ||||
| - `ExportCenterRunFailureSpike` - `exporter_run_failures_total` increases by more than 3 within 15 min for a profile. | ||||
| - `ExportCenterVerifyFailure` - any increase in `exporter_verify_failures_total`. | ||||
| - `ExportCenterWorkerLag` - queue backlog > threshold for 10 minutes. | ||||
| - `ExportCenterRetentionStale` - no pruning events in 24 hours. | ||||
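
The counter-based alerts above can be sketched as simple window checks. This is a minimal illustration, not the Alertmanager rule set: the `Sample` type and function names are hypothetical, samples are assumed to be sorted by time, counter resets are ignored, and the 15-minute window used for the verify alert is an assumption (the runbook only says "any non-zero").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since epoch
    value: float      # cumulative counter value at that time


def counter_increase(samples, now, window_seconds):
    """Increase of a cumulative counter over the trailing window (resets ignored)."""
    in_window = [s for s in samples if now - window_seconds <= s.timestamp <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1].value - in_window[0].value


def run_failure_spike(samples, now):
    # ExportCenterRunFailureSpike: more than 3 failures within 15 minutes.
    return counter_increase(samples, now, 15 * 60) > 3


def verify_failure(samples, now, window_seconds=15 * 60):
    # ExportCenterVerifyFailure: any new verification failure (window is assumed).
    return counter_increase(samples, now, window_seconds) > 0


def retention_stale(samples, now):
    # ExportCenterRetentionStale: no pruning events in the last 24 hours.
    return counter_increase(samples, now, 24 * 3600) == 0
```

Comparing increases rather than raw counter values keeps an alert from firing forever after a single historical failure.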
|  | ||||
| ## 4. Routine operations | ||||
|  | ||||
| ### 4.1 Daily checklist | ||||
|  | ||||
| - Review dashboard for run throughput and error classes. | ||||
| - Confirm CI smoke job (cosign + `trivy module db import`) passed. | ||||
| - Check storage usage against capacity thresholds. | ||||
| - Verify retention job executed (look for `exporter_retention_pruned_total` increment). | ||||
| - Scan logs for `adapter.trivy.unsupported_schema_version` or `mirror.delta.apply_failed`. | ||||
|  | ||||
| ### 4.2 Weekly tasks | ||||
|  | ||||
| - Rotate Download/OCI API tokens if configured with short-lived credentials. | ||||
| - Review upcoming profile changes (new tenants, profile updates). | ||||
| - Test `stella export verify` against a recent run for each profile. | ||||
| - Exercise worker failover (take one replica down and confirm the remaining workers pick up its jobs). | ||||
|  | ||||
| ### 4.3 Pre-release | ||||
|  | ||||
| - Ensure bundles generated for release candidates pass cosign verification. | ||||
| - Capture sample manifests (`export.json`, `manifest.yaml`) for documentation archives. | ||||
| - Validate Offline Kit packaging includes latest full + delta mirror bundles. | ||||
|  | ||||
| ## 5. Capacity & scaling | ||||
|  | ||||
| ### 5.1 Worker sizing | ||||
|  | ||||
| - Default workers handle ~2 full runs or 6 delta runs concurrently per 4 vCPU. | ||||
| - Scale out when: | ||||
|   - Queue depth (`exporter_jobs_ready`) > 10 for 10 minutes. | ||||
|   - p95 durations exceed SLO for multiple runs without failures. | ||||
| - Use Orchestrator quotas: ensure per-tenant concurrency (`max_active_runs`) is tuned. | ||||
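
As a rough sketch of the sizing guidance above: the helper names are hypothetical, and treating full and delta runs as linearly interchangeable shares of one 4 vCPU worker is an assumption made for illustration, not a measured capacity model.

```python
import math
from fractions import Fraction

FULL_RUNS_PER_WORKER = 2   # runbook figure: ~2 full runs per 4 vCPU worker
DELTA_RUNS_PER_WORKER = 6  # runbook figure: ~6 delta runs per 4 vCPU worker


def workers_needed(full_runs, delta_runs):
    """Estimate 4 vCPU workers needed to run the given mix concurrently."""
    load = (Fraction(full_runs, FULL_RUNS_PER_WORKER)
            + Fraction(delta_runs, DELTA_RUNS_PER_WORKER))
    return math.ceil(load)


def sustained_above(depth_samples, threshold=10, duration_seconds=600):
    """True if queue depth stayed above the threshold for the whole trailing window.

    depth_samples is a list of (timestamp_seconds, depth) sorted by timestamp.
    """
    if not depth_samples:
        return False
    now = depth_samples[-1][0]
    window_start = now - duration_seconds
    if depth_samples[0][0] > window_start:
        return False  # not enough history to cover the full window
    recent = [depth for ts, depth in depth_samples if ts >= window_start]
    return all(depth > threshold for depth in recent)
```

For example, `sustained_above` fed with periodic readings of `exporter_jobs_ready` mirrors the "> 10 for 10 minutes" trigger, and `workers_needed(4, 6)` suggests three workers under the linear assumption.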
|  | ||||
| ### 5.2 Storage planning | ||||
|  | ||||
| - Staging storage (object store or filesystem) must hold at least: | ||||
|   - Latest full bundle per tenant per profile. | ||||
|   - Last `N` deltas (default N=5). | ||||
| - Set retention policy via configuration: | ||||
|  | ||||
| ```yaml | ||||
| ExportCenter: | ||||
|   Retention: | ||||
|     Mirror: | ||||
|       Mode: days | ||||
|       Value: 30 | ||||
|     Trivy: | ||||
|       Mode: count | ||||
|       Value: 10 | ||||
| ``` | ||||
|  | ||||
| - Monitor `exporter_storage_bytes_total` (if available) or use bucket metrics from storage provider. | ||||
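
To make the two retention modes concrete, here is a small sketch of how a pruning pass might choose bundles. The interpretation is an assumption: `days` is read as "prune bundles older than `Value` days" and `count` as "keep only the newest `Value` bundles". The `Bundle` type and `bundles_to_prune` function are hypothetical and are not part of the Export Center code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Bundle:
    name: str
    created_at: datetime


def bundles_to_prune(bundles, mode, value, now):
    """Return the bundles a retention pass would remove under the assumed semantics."""
    newest_first = sorted(bundles, key=lambda b: b.created_at, reverse=True)
    if mode == "days":
        cutoff = now - timedelta(days=value)
        return [b for b in newest_first if b.created_at < cutoff]
    if mode == "count":
        return newest_first[value:]
    raise ValueError(f"unknown retention mode: {mode!r}")
```

With the configuration shown above, mirror bundles would be evaluated with `mode="days", value=30` and Trivy bundles with `mode="count", value=10`.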
|  | ||||
| ## 6. Failure response | ||||
|  | ||||
| | Symptom | Likely cause | Immediate action | Follow-up | | ||||
| |---------|--------------|------------------|-----------| | ||||
| | `ERR_EXPORT_UNSUPPORTED_SCHEMA` | Trivy schema mismatch | Pin `SchemaVersion` to previous value; rerun export | Coordinate with Exporter Guild to add new mapping | | ||||
| | `ERR_EXPORT_BASE_MISSING` | Base manifest unavailable | Trigger full export (`mirror:full`), notify tenant | Investigate storage retention settings | | ||||
| | Run stuck in `pending` | Worker unavailable / queue paused | Check worker pods / Orchestrator status | Scale workers or fix queue | | ||||
| | Signing failure (`errorCode=signing`) | KMS outage or permission change | Verify KMS health; retry run; escalate to Authority | Document incident, review key rotation schedule | | ||||
| | Distribution failure (`errorCode=distribution`) | OCI/object store outage | Switch profile distribution to download-only (`distribution: ["http"]`) | Restore distribution backend, resume normal config | | ||||
| | CLI verification failure in CI | New bundle did not pass cosign or Trivy import | Inspect pipeline logs; download bundle; rerun verification manually | Engage Exporter Guild if data quality issue | | ||||
| | Retention job skipped | Scheduler failure or misconfiguration | Run retention job manually (`stella export retention run`) | Audit scheduler configuration | | ||||
|  | ||||
| Log locations: `exporter` service emits structured logs with `runId`, `profile`, `errorCode`. For Kubernetes deployments, check `kubectl logs deployment/export-center-worker`. | ||||
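
When triaging, it can help to summarise errors per profile from the structured logs. The sketch below assumes the logs are emitted as one JSON object per line, which the runbook does not confirm; only the field names `profile` and `errorCode` come from the text above, and `error_counts` is a hypothetical helper.

```python
import json
from collections import Counter


def error_counts(log_lines):
    """Count log records per (profile, errorCode), skipping lines that are not JSON objects."""
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text or truncated line
        if isinstance(record, dict) and record.get("errorCode"):
            counts[(record.get("profile"), record["errorCode"])] += 1
    return counts
```

Lines captured from `kubectl logs deployment/export-center-worker` could be passed straight to this function, for instance via `sys.stdin`.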
|  | ||||
| ## 7. Recovery playbooks | ||||
|  | ||||
| ### 7.1 Replaying a failed run | ||||
|  | ||||
| 1. Identify run (`runId`) and root cause via `GET /api/export/runs/{id}`. | ||||
| 2. If configuration changed, clone profile and adjust settings. | ||||
| 3. Resubmit the run (`stella export run submit` or API), adding `--allow-empty` if the run is intentionally empty. | ||||
| 4. Monitor SSE stream or `stella export run watch`. | ||||
| 5. After success, prune failed run data if necessary. | ||||
|  | ||||
| ### 7.2 Restoring from previous full bundle | ||||
|  | ||||
| 1. Locate last successful full bundle (`mirror:full`) and associated manifest. | ||||
| 2. Download and verify signatures. | ||||
| 3. Extract into mirror staging area. | ||||
| 4. Apply subsequent delta bundles in order. | ||||
| 5. Trigger mirror verification script (`mirror verify <path>`). | ||||
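
The ordering in steps 1 and 4 can be sketched as follows. The full-bundle file name pattern matches the `mirror-20251029-full.tar.zst` example used elsewhere in this runbook; the `-delta` naming, the `restore_plan` helper, and the rule that only deltas dated strictly after the base apply are assumptions for illustration. The sketch does not check run status or signatures, so it does not replace step 2.

```python
import re

# Assumed naming: mirror-YYYYMMDD-full.tar.zst / mirror-YYYYMMDD-delta.tar.zst
BUNDLE_NAME = re.compile(r"^mirror-(\d{8})-(full|delta)\.tar\.zst$")


def restore_plan(filenames):
    """Return the latest full bundle followed by later deltas in date order."""
    parsed = []
    for name in filenames:
        match = BUNDLE_NAME.match(name)
        if match:
            parsed.append((match.group(1), match.group(2), name))
    fulls = [p for p in parsed if p[1] == "full"]
    if not fulls:
        raise ValueError("no full bundle available; trigger a full export")
    base = max(fulls)  # YYYYMMDD strings sort chronologically
    deltas = sorted(p for p in parsed if p[1] == "delta" and p[0] > base[0])
    return [base[2]] + [d[2] for d in deltas]
```

Deltas dated before the chosen full bundle are skipped because that bundle already contains their changes.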
|  | ||||
| ### 7.3 KMS outage response | ||||
|  | ||||
| 1. Disable new export submissions temporarily (set per-tenant quota to 0). | ||||
| 2. Coordinate with Authority Core to restore KMS. | ||||
| 3. Once KMS is back, run `stella export run submit --profile <id> --selectors ... --priority catch-up` for affected tenants. | ||||
|  | ||||
| ## 8. Verification workflow | ||||
|  | ||||
| All bundles must pass both signature and content verification. | ||||
|  | ||||
| ### 8.1 Trivy bundle validation (CI job) | ||||
|  | ||||
| ```bash | ||||
| cosign verify-blob \ | ||||
|   --key tenants/acme/export-center.pub \ | ||||
|   --signature signatures/trivy-db.sig \ | ||||
|   trivy/db.bundle | ||||
|  | ||||
| trivy module db import trivy/db.bundle --cache-dir /tmp/trivy-cache | ||||
| ``` | ||||
|  | ||||
| Automation: `DEVOPS-EXPORT-36-001` ensures this runs on every pipeline. | ||||
|  | ||||
| ### 8.2 Mirror bundle validation | ||||
|  | ||||
| ```bash | ||||
| cosign verify-blob \ | ||||
|   --key tenants/acme/export-center.pub \ | ||||
|   --signature signatures/export.sig \ | ||||
|   mirror/export.json | ||||
|  | ||||
| ./offline-kit/bin/mirror verify mirror-20251029-full.tar.zst | ||||
| ``` | ||||
|  | ||||
| If encryption is enabled, decrypt the bundle with the age or AES key before verification. | ||||
|  | ||||
| ## 9. Change management | ||||
|  | ||||
| - Profile changes require a change record referencing tenant impact and expected bundle size. | ||||
| - Distribution configuration updates (`OCI` vs `HTTP`) must be tested in staging. | ||||
| - Schema upgrades (e.g., Trivy schema v3) need coordination with DevOps, Exporter, and Docs. | ||||
| - Update runbook and related docs when processes change (tie updates to `DOCS-EXPORT-37-005`). | ||||
|  | ||||
| ## 10. References | ||||
|  | ||||
| - `docs/export-center/trivy-adapter.md` | ||||
| - `docs/export-center/mirror-bundles.md` | ||||
| - `ops/devops/TASKS.md` (`DEVOPS-EXPORT-36-001`, `DEVOPS-EXPORT-37-001`) | ||||
| - `docs/ingestion/aggregation-only-contract.md` | ||||
| - `docs/24_OFFLINE_KIT.md` | ||||
|  | ||||
| > **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied. | ||||