up
2025-10-28 09:58:55 +02:00
parent 4d932cc1ba
commit b0e56fa608
501 changed files with 51904 additions and 6663 deletions

docs/task-packs/runbook.md Normal file

@@ -0,0 +1,162 @@
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
# Task Pack Operations Runbook
This runbook guides SREs and on-call engineers through executing, monitoring, and troubleshooting Task Packs using the Task Runner service, Packs Registry, and StellaOps CLI. It aligns with Sprint 43 deliverables (approvals workflow, notifications, chaos resilience).

---
## 1 · Quick Reference
| Action | Command / UI | Notes |
|--------|--------------|-------|
| Validate pack | `stella pack validate --bundle <file>` | Run before publishing or importing. |
| Plan pack run | `stella pack plan --inputs inputs.json` | Outputs plan hash, required approvals, secret summary. |
| Execute pack | `stella pack run --pack <id>:<version>` | Streams logs; prompts for secrets/approvals if allowed. |
| Approve gate | Console notifications or `stella pack approve --run <id> --gate <gate>` | Requires `Packs.Approve`. |
| View run | Console `/console/packs/runs/:id` or `stella pack runs show <id>` | SSE stream available for live status. |
| Export evidence | `stella pack runs export --run <id>` | Produces bundle with plan, logs, artifacts, attestations. |
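
The validate, plan, run order in the table can be scripted. The following is a minimal sketch, not an official wrapper: the helper names `build_pack_commands` and `run_in_order` are hypothetical, the flags are copied from the table, and it assumes the `stella` CLI is on `PATH` and exits non-zero on failure (the runbook does not state exit-code behaviour).

```python
import subprocess
from typing import Callable, List, Sequence


def build_pack_commands(bundle: str, inputs: str, pack_ref: str) -> List[List[str]]:
    """Return argv lists for validate, plan, and run, in that order.

    Flags are copied from the Quick Reference table; pack_ref is "<id>:<version>".
    """
    return [
        ["stella", "pack", "validate", "--bundle", bundle],
        ["stella", "pack", "plan", "--inputs", inputs],
        ["stella", "pack", "run", "--pack", pack_ref],
    ]


def run_in_order(commands: Sequence[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in sequence.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit, so later commands are skipped after the first failure
    (assumption: the CLI signals failure through its exit code).
    """
    for argv in commands:
        runner(argv, check=True)
```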
---
## 2 · Run Lifecycle
1. **Submission**
   - CLI/Orchestrator submits run with inputs, pack version, tenant context.
   - Task Runner validates pack hash, scopes, sealed-mode constraints.
2. **Plan & Simulation**
   - Runner caches plan graph; optional simulation diff recorded.
3. **Approvals**
   - Gates emit notifications (`NOTIFY-SVC-40-001`).
   - Approvers can approve/resume via CLI, Console, or API.
4. **Execution**
   - Steps executed per plan (sequential/parallel).
   - Logs streamed via SSE (`/task-runner/runs/{id}/logs`).
5. **Evidence & Attestation**
   - On completion, DSSE attestation + evidence bundle stored.
   - Exports available via Export Center.
6. **Cleanup**
   - Artifacts retained per retention policy (default 30d).
   - Mirror pack run manifest to Offline Kit if configured.
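
The six phases above form a strictly ordered sequence, which can be modelled as a small state machine. This is an illustrative sketch only: the `Phase` enum and both helper functions are hypothetical names, not Task Runner internals, and the only value taken from the runbook is the 30-day default retention.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Lifecycle phases in the order listed in this section."""
    SUBMISSION = 1
    PLAN = 2
    APPROVALS = 3
    EXECUTION = 4
    EVIDENCE = 5
    CLEANUP = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after CLEANUP."""
    members = list(Phase)  # Enum iteration preserves definition order
    index = members.index(current)
    return members[index + 1] if index + 1 < len(members) else None


def retention_deadline(completed_at: datetime, retention_days: int = 30) -> datetime:
    """Earliest time artifacts may be removed under the default 30-day policy."""
    return completed_at + timedelta(days=retention_days)
```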
---
## 3 · Monitoring & Telemetry
- **Metrics dashboards:** `task-runner` Grafana board.
  - `pack_run_active`: active runs per tenant.
  - `pack_step_duration_seconds`: histograms per step type.
  - `pack_gate_wait_seconds`: approval wait time (alert when > 30 min).
  - `pack_run_success_ratio`: success vs. failure rate.
- **Logs:** Search by `runId`, `packId`, `tenant`, `stepId`.
- **Traces:** Query `taskrunner.run` span in Tempo/Jaeger.
- **Notifications:** Subscribe to `pack.run.*` topics via Notifier for Slack/email/PagerDuty hooks.
Observability configuration is referenced in Task Runner tasks (OBS-50-001..55-001).
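
The 30-minute gate-wait alert and the success ratio can be expressed as simple checks. The sketch below is hypothetical and stands in for whatever alert rules the Grafana board actually uses; the function names and the choice to report `1.0` when no runs have finished are assumptions. Only the metric meanings and the 30-minute threshold come from the list above.

```python
from typing import Dict, List

# "alert when > 30 min" for pack_gate_wait_seconds, expressed in seconds.
GATE_WAIT_ALERT_SECONDS = 30 * 60


def gates_over_threshold(wait_seconds: Dict[str, float],
                         threshold: float = GATE_WAIT_ALERT_SECONDS) -> List[str]:
    """Return the gate names whose wait time strictly exceeds the threshold."""
    return sorted(name for name, waited in wait_seconds.items() if waited > threshold)


def success_ratio(succeeded: int, failed: int) -> float:
    """Share of finished runs that succeeded.

    Assumption: with no finished runs the ratio is reported as 1.0
    so that an idle tenant does not trip a low-success alert.
    """
    total = succeeded + failed
    return succeeded / total if total else 1.0
```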

---
## 4 · Approvals Workflow
- Approvals may be requested via Console banner, CLI prompt, or email/Slack.
- Approver role: `Packs.Approve`, plus tenant membership.
- CLI command:
  ```bash
  stella pack approve \
    --run run:tenant:timestamp \
    --gate security-review \
    --comment "Validated remediation scope; proceeding."
  ```
- Auto-expiry triggers run cancellation (configurable per gate).
- Approval events are logged and included in the evidence bundle.
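
Auto-expiry can be reasoned about as a deadline check per gate. The sketch below is an assumption-laden illustration, not the Task Runner's implementation: `gate_state`, its parameters, and the three state labels are hypothetical, and the per-gate TTL is passed in because the runbook only says it is configurable.

```python
from datetime import datetime, timedelta
from typing import Optional


def gate_state(requested_at: datetime, ttl: timedelta, now: datetime,
               approved_at: Optional[datetime] = None) -> str:
    """Classify a gate as 'approved', 'pending', or 'expired'.

    An approval only counts if it arrived on or before the deadline;
    an 'expired' gate is the condition that would trigger run cancellation.
    """
    deadline = requested_at + ttl
    if approved_at is not None and approved_at <= deadline:
        return "approved"
    return "expired" if now > deadline else "pending"
```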
---
## 5 · Secrets Handling
- Secrets retrieved via Authority secure channel or CLI profile.
- Task Runner injects secrets into isolated environment variables or temp files (auto-shredded).
- Logs redact secrets; evidence bundles include only secret metadata (name, scope, last four characters).
- In sealed mode, secrets must originate from the sealed vault (configured via `TASKRUNNER_SEALED_VAULT_URL`).
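
The redaction and metadata rules above can be illustrated with two small helpers. This is a sketch under stated assumptions, not the Task Runner's redaction code: the function names, the `***` mask, and the `last4` key are hypothetical, and suppressing the suffix for secrets of eight characters or fewer is an added precaution the runbook does not mention.

```python
from typing import Dict, Iterable


def secret_metadata(name: str, scope: str, value: str) -> Dict[str, str]:
    """Metadata safe to store in an evidence bundle: name, scope, last four characters.

    Assumption: values of eight characters or fewer expose no suffix at all,
    since four characters would reveal too much of a short secret.
    """
    suffix = value[-4:] if len(value) > 8 else ""
    return {"name": name, "scope": scope, "last4": suffix}


def redact(line: str, secret_values: Iterable[str], mask: str = "***") -> str:
    """Replace every occurrence of each secret value in a log line with a mask."""
    # Longest first, so a secret that contains another is masked as a whole.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:  # skip empty strings, which would match everywhere
            line = line.replace(value, mask)
    return line
```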
---
## 6 · Failure Recovery
| Scenario | Symptom | Resolution |
|----------|---------|------------|
| **Plan hash mismatch** | Run aborted with `ERR_PACK_HASH_MISMATCH`. | Re-run `stella pack plan`; ensure pack not modified post-plan. |
| **Approval timeout** | `ERR_PACK_APPROVAL_TIMEOUT`. | Requeue run with extended TTL or escalate to approver; verify notifications delivered. |
| **Secret missing** | Run fails at injection step. | Provide secret via CLI (`--secrets`) or configure profile; check Authority scope. |
| **Network blocked (sealed)** | `ERR_PACK_NETWORK_BLOCKED`. | Update pack to avoid external calls or allowlist the domain via AirGap policy. |
| **Artifact upload failure** | Evidence missing, logs show storage errors. | Retry run with `--resume` (if supported); verify object storage health. |
| **Runner chaos trigger** | Run paused with chaos event note. | Review chaos test plan; resume if acceptable or cancel run. |
`stella pack runs resume --run <id>` resumes paused runs post-remediation (approvals or transient failures).
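
For on-call tooling, the three error codes named in the table can be turned into a lookup. This is a minimal sketch: `RESOLUTIONS` and `resolution_for` are hypothetical names, the resolution text is transcribed from the table, and scenarios without a documented error code are deliberately left out rather than guessed.

```python
from typing import Dict

# Error codes and resolutions transcribed from the Failure Recovery table.
RESOLUTIONS: Dict[str, str] = {
    "ERR_PACK_HASH_MISMATCH":
        "Re-run `stella pack plan`; ensure pack not modified post-plan.",
    "ERR_PACK_APPROVAL_TIMEOUT":
        "Requeue run with extended TTL or escalate to approver; "
        "verify notifications delivered.",
    "ERR_PACK_NETWORK_BLOCKED":
        "Update pack to avoid external calls or allow the domain via AirGap policy.",
}


def resolution_for(error_code: str) -> str:
    """Return the documented resolution, or a pointer to escalation if unknown."""
    return RESOLUTIONS.get(
        error_code,
        "Unknown error code; escalate per the Escalation Matrix.",
    )
```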

---
## 7 · Chaos & Resilience
- Chaos hooks pause runs, drop network, or delay approvals to test resilience.
- Track chaos events via `pack.chaos.injected` timeline entries.
- Post-chaos, ensure metrics return to baseline; record findings in Ops log.
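
One way to make "metrics return to baseline" checkable is a relative-tolerance comparison. The sketch below is purely illustrative: `metrics_off_baseline` is a hypothetical helper, and the 10% default tolerance is an assumption, not a documented threshold.

```python
from typing import Dict, List


def metrics_off_baseline(baseline: Dict[str, float], current: Dict[str, float],
                         tolerance: float = 0.10) -> List[str]:
    """Return metric names that deviate from baseline by more than `tolerance`
    (relative), or that are missing from `current`."""
    drifted = []
    for name, expected in baseline.items():
        observed = current.get(name)
        if observed is None:
            drifted.append(name)
            continue
        # Fall back to an absolute comparison when the baseline is zero.
        scale = abs(expected) if expected else 1.0
        if abs(observed - expected) / scale > tolerance:
            drifted.append(name)
    return sorted(drifted)
```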
---
## 8 · Offline & Air-Gapped Execution
- Use `stella pack mirror pull` to import packs into the sealed registry.
- CLI caches bundles under `~/.stella/packs/` for offline runs.
- Approvals require an offline process:
  - Generate approval request bundle (`stella pack approve --offline-request`).
  - Approver signs bundle using offline CLI.
  - Import approval via `stella pack approve --offline-response`.
- Evidence bundles exported to removable media; verify checksums before upload to online systems.
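
Checksum verification before upload can be done with a streaming hash. The sketch below assumes SHA-256 and a hex-encoded expected value; the runbook does not name the algorithm or the checksum format, so treat both as placeholders. All function names are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable, Iterator


def file_chunks(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's contents in fixed-size chunks to keep memory use flat."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return
            yield chunk


def sha256_hex(chunks: Iterable[bytes]) -> str:
    """Hex SHA-256 digest of the concatenated chunks (algorithm is an assumption)."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(chunks: Iterable[bytes], expected_hex: str) -> bool:
    """Compare against the expected digest, ignoring case and surrounding whitespace."""
    return hmac.compare_digest(sha256_hex(chunks), expected_hex.strip().lower())


# Usage: checksum_matches(file_chunks("/media/usb/evidence.bundle"), expected)
# where the path and the source of `expected` are site-specific.
```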
---
## 9 · Runbooks for Common Packs
Maintain per-pack playbooks in `docs/task-packs/runbook/<pack-name>.md`. Include:
- Purpose and scope.
- Required inputs and secrets.
- Approval stakeholders.
- Pre-checks and post-checks.
- Rollback procedures.
The Docs Guild can use this root runbook as a template.

---
## 10 · Escalation Matrix
| Issue | Primary | Secondary | Notes |
|-------|---------|-----------|-------|
| Pack validation errors | DevEx/CLI Guild | Task Runner Guild | Provide pack bundle + validation output. |
| Approval pipeline failure | Task Runner Guild | Authority Core | Confirm scope/role mapping. |
| Registry outage | Packs Registry Guild | DevOps Guild | Use mirror fallback if possible. |
| Evidence integrity issues | Evidence Locker Guild | Security Guild | Validate DSSE attestations, escalate if tampered. |
Escalations must include run ID, tenant, pack version, plan hash, and timestamps.
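
The required escalation details can be checked before a ticket is filed. This is a minimal sketch: the key names in `REQUIRED_FIELDS` and the helper `missing_escalation_fields` are hypothetical, since the runbook lists the required information but not a payload schema.

```python
from typing import Dict, List

# Key names are illustrative; the runbook only lists the required information.
REQUIRED_FIELDS = ("run_id", "tenant", "pack_version", "plan_hash", "timestamps")


def missing_escalation_fields(payload: Dict[str, object]) -> List[str]:
    """Return required fields that are absent or empty, in the order listed above."""
    return [field for field in REQUIRED_FIELDS if not payload.get(field)]
```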

---
## 11 · Compliance Checklist
- [ ] Run lifecycle documented (submission → evidence).
- [ ] Monitoring metrics, logs, traces, and notifications captured.
- [ ] Approvals workflow instructions provided (CLI + Console).
- [ ] Secret handling, sealed-mode constraints, and offline process described.
- [ ] Failure scenarios + recovery steps listed.
- [ ] Chaos/resilience guidance included.
- [ ] Escalation matrix defined.
- [ ] Imposed rule reminder included at top.
---
*Last updated: 2025-10-27 (Sprint 43).*