> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

# Task Pack Operations Runbook

This runbook guides SREs and on-call engineers through executing, monitoring, and troubleshooting Task Packs using the Task Runner service, Packs Registry, and StellaOps CLI. It aligns with Sprint 43 deliverables (approvals workflow, notifications, chaos resilience).

---

## 1 · Quick Reference

| Action | Command / UI | Notes |
|--------|--------------|-------|
| Validate pack | `stella pack validate --bundle ` | Run before publishing or importing. |
| Plan pack run | `stella pack plan --inputs inputs.json` | Outputs plan hash, required approvals, secret summary. |
| Execute pack | `stella pack run --pack :` | Streams logs; prompts for secrets/approvals if allowed. |
| Approve gate | Console notifications or `stella pack approve --run --gate ` | Requires `Packs.Approve`. |
| View run | Console `/console/packs/runs/:id` or `stella pack runs show ` | SSE stream available for live status. |
| Export evidence | `stella pack runs export --run ` | Produces bundle with plan, logs, artifacts, attestations. |

---

## 2 · Run Lifecycle

1. **Submission**
   - CLI/Orchestrator submits run with inputs, pack version, tenant context.
   - Task Runner validates pack hash, scopes, sealed-mode constraints.
2. **Plan & Simulation**
   - Runner caches plan graph; optional simulation diff recorded.
3. **Approvals**
   - Gates emit notifications (`NOTIFY-SVC-40-001`).
   - Approvers can approve/resume via CLI, Console, or API.
4. **Execution**
   - Steps executed per plan (sequential/parallel).
   - Logs streamed via SSE (`/task-runner/runs/{id}/logs`).
5. **Evidence & Attestation**
   - On completion, DSSE attestation + evidence bundle stored.
   - Exports available via Export Center.
6. **Cleanup**
   - Artifacts retained per retention policy (default 30 d).
   - Mirror pack run manifest to Offline Kit if configured.
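
The Execution step streams logs over SSE at `/task-runner/runs/{id}/logs`. The sketch below shows one way an on-call engineer might follow that stream from a shell when the CLI is not at hand. Only the endpoint path comes from this runbook. The `TASK_RUNNER_URL` and `TASK_RUNNER_TOKEN` variables, the bearer-token header, and the helper names are hypothetical. The assumption that each log record arrives as a single `data:` line is also unverified.

```shell
#!/usr/bin/env bash
# Sketch only: follow a run's SSE log stream and print each "data:" payload.
# Hypothetical: TASK_RUNNER_URL, TASK_RUNNER_TOKEN, bearer auth, helper names.

sse_data_lines() {
  # Read SSE frames on stdin; print the payload of every "data:" field.
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line%$'\r'}              # tolerate CRLF line endings
    case "$line" in
      data:*)
        line=${line#data:}
        printf '%s\n' "${line# }"   # SSE drops one optional leading space
        ;;
    esac
  done
}

follow_run_logs() {
  # $1 = run ID; the path is the one listed under Execution above.
  curl --silent --show-error --no-buffer \
    -H "Accept: text/event-stream" \
    -H "Authorization: Bearer ${TASK_RUNNER_TOKEN}" \
    "${TASK_RUNNER_URL}/task-runner/runs/$1/logs" | sse_data_lines
}
```

Calling `follow_run_logs` with a run ID would print one payload per line until the server closes the stream. Other SSE fields such as `event:` and `id:` are ignored, and multi-line `data:` events are printed line by line rather than rejoined.
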

---

## 3 · Monitoring & Telemetry

- **Metrics dashboards:** `task-runner` Grafana board.
  - `pack_run_active` – active runs per tenant.
  - `pack_step_duration_seconds` – histograms per step type.
  - `pack_gate_wait_seconds` – approval wait time (alert > 30 m).
  - `pack_run_success_ratio` – success vs failure rate.
- **Logs:** Search by `runId`, `packId`, `tenant`, `stepId`.
- **Traces:** Query `taskrunner.run` span in Tempo/Jaeger.
- **Notifications:** Subscribe to `pack.run.*` topics via Notifier for Slack/email/PagerDuty hooks.

Observability configuration referenced in Task Runner tasks (OBS-50-001..55-001).

---

## 4 · Approvals Workflow

- Approvals may be requested via Console banner, CLI prompt, or email/Slack.
- Approver roles: `Packs.Approve` + tenant membership.
- CLI command:

  ```bash
  stella pack approve \
    --run run:tenant:timestamp \
    --gate security-review \
    --comment "Validated remediation scope; proceeding."
  ```

- Auto-expiry triggers run cancellation (configurable per gate).
- Approval events logged and included in evidence bundle.

---

## 5 · Secrets Handling

- Secrets retrieved via Authority secure channel or CLI profile.
- Task Runner injects secrets into isolated environment variables or temp files (auto-shredded).
- Logs redact secrets; evidence bundles include only secret metadata (name, scope, last four characters).
- For sealed mode, secrets must originate from sealed vault (configured via `TASKRUNNER_SEALED_VAULT_URL`).

---

## 6 · Failure Recovery

| Scenario | Symptom | Resolution |
|----------|---------|------------|
| **Plan hash mismatch** | Run aborted with `ERR_PACK_HASH_MISMATCH`. | Re-run `stella pack plan`; ensure pack not modified post-plan. |
| **Approval timeout** | `ERR_PACK_APPROVAL_TIMEOUT`. | Requeue run with extended TTL or escalate to approver; verify notifications delivered. |
| **Secret missing** | Run fails at injection step. | Provide secret via CLI (`--secrets`) or configure profile; check Authority scope. |
| **Network blocked (sealed)** | `ERR_PACK_NETWORK_BLOCKED`. | Update pack to avoid external calls or whitelist domain via AirGap policy. |
| **Artifact upload failure** | Evidence missing, logs show storage errors. | Retry run with `--resume` (if supported); verify object storage health. |
| **Runner chaos trigger** | Run paused with chaos event note. | Review chaos test plan; resume if acceptable or cancel run. |

`stella pack runs resume --run ` resumes paused runs post-remediation (approvals or transient failures).

---

## 7 · Chaos & Resilience

- Chaos hooks pause runs, drop network, or delay approvals to test resilience.
- Track chaos events via `pack.chaos.injected` timeline entries.
- Post-chaos, ensure metrics return to baseline; record findings in Ops log.

---

## 8 · Offline & Air-Gapped Execution

- Use `stella pack mirror pull` to import packs into sealed registry.
- CLI caches bundles under `~/.stella/packs/` for offline runs.
- Approvals require offline process:
  - Generate approval request bundle (`stella pack approve --offline-request`).
  - Approver signs bundle using offline CLI.
  - Import approval via `stella pack approve --offline-response`.
- Evidence bundles exported to removable media; verify checksums before upload to online systems.

---

## 9 · Runbooks for Common Packs

Maintain per-pack playbooks in `docs/task-packs/runbook/.md`. Include:

- Purpose and scope.
- Required inputs and secrets.
- Approval stakeholders.
- Pre-checks and post-checks.
- Rollback procedures.

The Docs Guild can use this root runbook as a template.

---

## 10 · Escalation Matrix

| Issue | Primary | Secondary | Notes |
|-------|---------|-----------|-------|
| Pack validation errors | DevEx/CLI Guild | Task Runner Guild | Provide pack bundle + validation output. |
| Approval pipeline failure | Task Runner Guild | Authority Core | Confirm scope/role mapping. |
| Registry outage | Packs Registry Guild | DevOps Guild | Use mirror fallback if possible. |
| Evidence integrity issues | Evidence Locker Guild | Security Guild | Validate DSSE attestations, escalate if tampered. |

Escalations must include run ID, tenant, pack version, plan hash, and timestamps.

---

## 11 · Compliance Checklist

- [ ] Run lifecycle documented (submission → evidence).
- [ ] Monitoring metrics, logs, traces, and notifications captured.
- [ ] Approvals workflow instructions provided (CLI + Console).
- [ ] Secret handling, sealed-mode constraints, and offline process described.
- [ ] Failure scenarios + recovery steps listed.
- [ ] Chaos/resilience guidance included.
- [ ] Escalation matrix defined.
- [ ] Imposed rule reminder included at top.

---

*Last updated: 2025-10-27 (Sprint 43).*