Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Task Pack Operations Runbook

This runbook guides SREs and on-call engineers through executing, monitoring, and troubleshooting Task Packs using the Task Runner service, Packs Registry, and StellaOps CLI. It aligns with Sprint 43 deliverables (approvals workflow, notifications, chaos resilience).


1·Quick Reference

| Action | Command / UI | Notes |
| --- | --- | --- |
| Validate pack | stella pack validate --bundle <file> | Run before publishing or importing. |
| Plan pack run | stella pack plan --inputs inputs.json | Outputs plan hash, required approvals, secret summary. |
| Execute pack | stella pack run --pack <id>:<version> | Streams logs; prompts for secrets/approvals if allowed. |
| Approve gate | Console notifications or stella pack approve --run <id> --gate <gate> | Requires Packs.Approve. |
| View run | Console /console/packs/runs/:id or stella pack runs show <id> | SSE stream available for live status. |
| Export evidence | stella pack runs export --run <id> | Produces bundle with plan, logs, artifacts, attestations. |
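
The validate, plan, and run actions are usually performed in that order. The following is a minimal sketch that only builds the argument lists, using just the subcommands and flags listed in the table; the helper name preflight_commands and the sample file and pack names are hypothetical.

```python
import shlex


def preflight_commands(bundle: str, inputs: str, pack_id: str, version: str) -> list:
    """Return the validate -> plan -> run invocations in the order listed above."""
    return [
        ["stella", "pack", "validate", "--bundle", bundle],
        ["stella", "pack", "plan", "--inputs", inputs],
        ["stella", "pack", "run", "--pack", f"{pack_id}:{version}"],
    ]


if __name__ == "__main__":
    # Hypothetical bundle file, inputs file, and pack coordinates.
    for argv in preflight_commands("pack.bundle", "inputs.json", "example-pack", "1.0.0"):
        print(shlex.join(argv))
```

Each list can be passed to subprocess.run(argv, check=True) so that the sequence stops at the first failing step.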

2·Run Lifecycle

  1. Submission
    • CLI/Orchestrator submits run with inputs, pack version, tenant context.
    • Task Runner validates pack hash, scopes, sealed-mode constraints.
  2. Plan & Simulation
    • Runner caches plan graph; optional simulation diff recorded.
  3. Approvals
    • Gates emit notifications (NOTIFY-SVC-40-001).
    • Approvers can approve/resume via CLI, Console, or API.
  4. Execution
    • Steps executed per plan (sequential/parallel).
    • Logs streamed via SSE (/task-runner/runs/{id}/logs).
  5. Evidence & Attestation
    • On completion, DSSE attestation + evidence bundle stored.
    • Exports available via Export Center.
  6. Cleanup
    • Artifacts retained per retention policy (default 30d).
    • Mirror pack run manifest to Offline Kit if configured.
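
Step 4 streams logs over SSE. The sketch below parses a standard text/event-stream body into events. The event name log and the JSON payload fields in the sample are assumptions for illustration, not the Task Runner's documented schema.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event; a blank line terminates an event."""
    name, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # events without data are not dispatched
                yield {"event": name, "data": "\n".join(data)}
            name, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # strip the single optional space after the colon
            if field == "event":
                name = value
            elif field == "data":
                data.append(value)


if __name__ == "__main__":
    sample = [
        ": keep-alive\n",
        "event: log\n",
        'data: {"stepId": "s1", "message": "started"}\n',
        "\n",
    ]
    for evt in parse_sse(sample):
        print(evt["event"], evt["data"])
```

The parser takes any iterable of text lines, so it can be fed from an HTTP client's line iterator on /task-runner/runs/{id}/logs or from a saved capture.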

3·Monitoring & Telemetry

  • Metrics dashboards: task-runner Grafana board.
    • pack_run_active: active runs per tenant.
    • pack_step_duration_seconds: histograms per step type.
    • pack_gate_wait_seconds: approval wait time (alert >30m).
    • pack_run_success_ratio: success vs failure rate.
  • Logs: Search by runId, packId, tenant, stepId.
  • Traces: Query taskrunner.run span in Tempo/Jaeger.
  • Notifications: Subscribe to pack.run.* topics via Notifier for Slack/email/PagerDuty hooks.
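
The 30-minute approval-wait alert amounts to a threshold check. This is a minimal sketch, assuming current wait times per gate have already been read from pack_gate_wait_seconds; the function name and input shape are hypothetical, and in practice the check would more likely live in an alert rule than in a script.

```python
GATE_WAIT_ALERT_SECONDS = 30 * 60  # "alert >30m" from the metrics list


def gates_over_threshold(wait_seconds_by_gate: dict,
                         threshold: float = GATE_WAIT_ALERT_SECONDS) -> list:
    """Return gate names whose approval wait strictly exceeds the threshold, sorted."""
    return sorted(g for g, s in wait_seconds_by_gate.items() if s > threshold)


if __name__ == "__main__":
    # Hypothetical gate names and wait times in seconds.
    print(gates_over_threshold({"security-review": 2400, "change-board": 300}))
```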

Observability configuration is referenced in Task Runner tasks (OBS-50-001..55-001).


4·Approvals Workflow

  • Approvals may be requested via Console banner, CLI prompt, or email/Slack.
  • Approver roles: Packs.Approve + tenant membership.
  • CLI command:
    stella pack approve \
      --run run:tenant:timestamp \
      --gate security-review \
      --comment "Validated remediation scope; proceeding."
  • Auto-expiry triggers run cancellation (configurable per gate).
  • Approval events logged and included in evidence bundle.
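
Auto-expiry means a gate left pending long enough cancels the run. The following sketch shows that decision, assuming each gate records when approval was requested and carries a per-gate TTL. The Gate type, its field names, and the outcome strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Gate:
    name: str
    requested_at: datetime
    ttl: timedelta                       # configurable per gate
    approved_at: Optional[datetime] = None


def gate_outcome(gate: Gate, now: datetime) -> str:
    """Return 'approved', 'expired' (run should be cancelled), or 'pending'."""
    deadline = gate.requested_at + gate.ttl
    if gate.approved_at is not None and gate.approved_at <= deadline:
        return "approved"
    if now >= deadline:
        return "expired"
    return "pending"


if __name__ == "__main__":
    requested = datetime(2025, 10, 27, 12, 0, tzinfo=timezone.utc)
    gate = Gate("security-review", requested, timedelta(hours=1))
    print(gate_outcome(gate, requested + timedelta(minutes=90)))
```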

5·Secrets Handling

  • Secrets retrieved via Authority secure channel or CLI profile.
  • Task Runner injects secrets into isolated environment variables or temp files (auto-shredded).
  • Logs redact secrets; evidence bundles include only secret metadata (name, scope, last four characters).
  • For sealed mode, secrets must originate from sealed vault (configured via TASKRUNNER_SEALED_VAULT_URL).
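
The redaction and metadata rules above can be illustrated in a few lines. This is a sketch of the idea only, not the Task Runner's implementation; the function names, the mask string, and the last4 key are hypothetical.

```python
def secret_metadata(name: str, scope: str, value: str) -> dict:
    """Metadata safe for an evidence bundle: name, scope, last four characters."""
    return {"name": name, "scope": scope, "last4": value[-4:]}


def redact(line: str, secret_values: list, mask: str = "***") -> str:
    """Replace every occurrence of each secret value in a log line with a mask."""
    # Longest first, so a secret that contains another secret is masked whole.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:  # skip empty strings, which would match everywhere
            line = line.replace(value, mask)
    return line


if __name__ == "__main__":
    print(secret_metadata("db-password", "tenant-a", "s3cr3t-abcd"))
    print(redact("connecting with s3cr3t-abcd", ["s3cr3t-abcd"]))
```

Note that for secrets of four characters or fewer, the last-four field would expose the whole value, so a real implementation would need a minimum-length guard.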

6·Failure Recovery

| Scenario | Symptom | Resolution |
| --- | --- | --- |
| Plan hash mismatch | Run aborted with ERR_PACK_HASH_MISMATCH. | Re-run stella pack plan; ensure pack not modified post-plan. |
| Approval timeout | ERR_PACK_APPROVAL_TIMEOUT. | Requeue run with extended TTL or escalate to approver; verify notifications delivered. |
| Secret missing | Run fails at injection step. | Provide secret via CLI (--secrets) or configure profile; check Authority scope. |
| Network blocked (sealed) | ERR_PACK_NETWORK_BLOCKED. | Update pack to avoid external calls or whitelist domain via AirGap policy. |
| Artifact upload failure | Evidence missing, logs show storage errors. | Retry run with --resume (if supported); verify object storage health. |
| Runner chaos trigger | Run paused with chaos event note. | Review chaos test plan; resume if acceptable or cancel run. |

stella pack runs resume --run <id> resumes paused runs post-remediation (approvals or transient failures).
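
For quick triage, the three error codes named in the table can be turned into a lookup. This is a minimal sketch: the resolutions are copied from the table, while the triage helper and the assumption that codes appear verbatim in log lines are hypothetical.

```python
import re
from typing import Optional

RESOLUTIONS = {
    "ERR_PACK_HASH_MISMATCH":
        "Re-run stella pack plan; ensure pack not modified post-plan.",
    "ERR_PACK_APPROVAL_TIMEOUT":
        "Requeue run with extended TTL or escalate to approver; verify notifications delivered.",
    "ERR_PACK_NETWORK_BLOCKED":
        "Update pack to avoid external calls or whitelist domain via AirGap policy.",
}

_ERR_CODE = re.compile(r"\bERR_PACK_[A-Z_]+\b")


def triage(log_line: str) -> Optional[str]:
    """Return the documented resolution for the first error code in the line, if any."""
    match = _ERR_CODE.search(log_line)
    return RESOLUTIONS.get(match.group(0)) if match else None


if __name__ == "__main__":
    print(triage("Run aborted with ERR_PACK_HASH_MISMATCH."))
```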


7·Chaos & Resilience

  • Chaos hooks pause runs, drop network, or delay approvals to test resilience.
  • Track chaos events via pack.chaos.injected timeline entries.
  • Post-chaos, ensure metrics return to baseline; record findings in Ops log.
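
When reviewing a chaos exercise, it can help to count injected events per run. The sketch below assumes timeline entries are available as dictionaries with type and runId keys; that shape is hypothetical, and only the pack.chaos.injected name comes from this runbook.

```python
from collections import Counter

CHAOS_EVENT = "pack.chaos.injected"


def chaos_events_per_run(timeline: list) -> Counter:
    """Count pack.chaos.injected entries per run ID (hypothetical entry shape)."""
    return Counter(
        entry.get("runId", "unknown")
        for entry in timeline
        if entry.get("type") == CHAOS_EVENT
    )


if __name__ == "__main__":
    sample = [
        {"type": "pack.chaos.injected", "runId": "r1"},
        {"type": "pack.run.started", "runId": "r1"},
    ]
    print(chaos_events_per_run(sample))
```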

8·Offline & Air-Gapped Execution

  • Use stella pack mirror pull to import packs into sealed registry.
  • CLI caches bundles under ~/.stella/packs/ for offline runs.
  • Approvals require an offline process:
    • Generate approval request bundle (stella pack approve --offline-request).
    • Approver signs bundle using offline CLI.
    • Import approval via stella pack approve --offline-response.
  • Evidence bundles exported to removable media; verify checksums before upload to online systems.
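
The checksum verification in the last bullet can be scripted. This runbook does not name the digest algorithm, so the sketch below assumes SHA-256 and a hex digest recorded at export time; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Run verify_bundle against the copy on the removable media, not the original export, so that a transfer error is caught before upload.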

9·Runbooks for Common Packs

Maintain per-pack playbooks in docs/task-packs/runbook/<pack-name>.md. Include:

  • Purpose and scope.
  • Required inputs and secrets.
  • Approval stakeholders.
  • Pre-checks and post-checks.
  • Rollback procedures.

The Docs Guild can use this root runbook as a template.
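
A small lint check can confirm that a per-pack playbook covers the items listed above. This is a sketch only: it assumes each item's wording appears somewhere in the playbook text, and the helper names and sample pack name are hypothetical.

```python
from pathlib import PurePosixPath

REQUIRED_SECTIONS = [
    "Purpose and scope",
    "Required inputs and secrets",
    "Approval stakeholders",
    "Pre-checks and post-checks",
    "Rollback procedures",
]


def playbook_path(pack_name: str) -> PurePosixPath:
    """Location of a pack's playbook under docs/task-packs/runbook/."""
    return PurePosixPath("docs/task-packs/runbook") / f"{pack_name}.md"


def missing_sections(text: str) -> list:
    """Return the required items whose wording does not appear in the text."""
    lowered = text.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]


if __name__ == "__main__":
    print(playbook_path("example-pack"))
    print(missing_sections("Purpose and scope\nRollback procedures"))
```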


10·Escalation Matrix

| Issue | Primary | Secondary | Notes |
| --- | --- | --- | --- |
| Pack validation errors | DevEx/CLI Guild | Task Runner Guild | Provide pack bundle + validation output. |
| Approval pipeline failure | Task Runner Guild | Authority Core | Confirm scope/role mapping. |
| Registry outage | Packs Registry Guild | DevOps Guild | Use mirror fallback if possible. |
| Evidence integrity issues | Evidence Locker Guild | Security Guild | Validate DSSE attestations, escalate if tampered. |

Escalations must include run ID, tenant, pack version, plan hash, and timestamps.
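
The required escalation details can be checked before a ticket is filed. This is a minimal sketch, assuming the escalation is assembled as a dictionary; the key names are hypothetical stand-ins for the five items listed above.

```python
REQUIRED_FIELDS = ("run_id", "tenant", "pack_version", "plan_hash", "timestamps")


def missing_escalation_fields(escalation: dict) -> list:
    """Return required fields that are absent or empty, in the order listed above."""
    return [field for field in REQUIRED_FIELDS if not escalation.get(field)]


if __name__ == "__main__":
    # Hypothetical draft escalation with two required details still missing.
    draft = {"run_id": "r1", "tenant": "tenant-a", "timestamps": ["2025-10-27T12:00:00Z"]}
    print(missing_escalation_fields(draft))
```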


11·Compliance Checklist

  • Run lifecycle documented (submission → evidence).
  • Monitoring metrics, logs, traces, and notifications captured.
  • Approvals workflow instructions provided (CLI + Console).
  • Secret handling, sealed-mode constraints, and offline process described.
  • Failure scenarios + recovery steps listed.
  • Chaos/resilience guidance included.
  • Escalation matrix defined.
  • Imposed rule reminder included at top.

Last updated: 2025-10-27 (Sprint 43).