Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Task Pack Operations Runbook

This runbook guides SREs and on-call engineers through executing, monitoring, and troubleshooting Task Packs using the Task Runner service, Packs Registry, and StellaOps CLI. It aligns with Sprint 43 deliverables (approvals workflow, notifications, chaos resilience).

1 · Quick Reference

| Action | Command / UI | Notes |
| --- | --- | --- |
| Validate pack | `stella pack validate --bundle <file>` | Run before publishing or importing. |
| Plan pack run | `stella pack plan --inputs inputs.json` | Outputs plan hash, required approvals, secret summary. |
| Execute pack | `stella pack run --pack <id>:<version>` | Streams logs; prompts for secrets/approvals if allowed. |
| Approve gate | Console notifications or `stella pack approve --run <id> --gate <gate>` | Requires `Packs.Approve`. |
| View run | Console `/console/packs/runs/:id` or `stella pack runs show <id>` | SSE stream available for live status. |
| Export evidence | `stella pack runs export --run <id>` | Produces bundle with plan, logs, artifacts, attestations. |
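
The SSE stream noted above for live status can be read by any client that understands the standard `text/event-stream` framing. The sketch below is a minimal parser, assuming that framing; the event name and JSON payload in the sample are hypothetical, since this runbook does not define the event schema.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per server-sent event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches whatever has been collected so far.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
            # Other fields (id, retry) are ignored in this sketch.


# Hypothetical sample; real event names and payloads may differ.
sample = [
    "event: step.completed\n",
    'data: {"stepId": "scan", "status": "ok"}\n',
    "\n",
    ": keep-alive\n",
    "data: plain text line\n",
    "\n",
]
events = list(parse_sse(sample))
first_payload = json.loads(events[0]["data"])
```

In practice the lines would come from an HTTP response body read incrementally rather than from a list.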

2 · Run Lifecycle

Submission
- CLI/Orchestrator submits run with inputs, pack version, tenant context.
- Task Runner validates pack hash, scopes, sealed-mode constraints.
Plan & Simulation
- Runner caches plan graph; optional simulation diff recorded.
Approvals
- Gates emit notifications (NOTIFY-SVC-40-001).
- Approvers can approve/resume via CLI, Console, or API.
Execution
- Steps executed per plan (sequential/parallel).
- Logs streamed via SSE (/task-runner/runs/{id}/logs).
Evidence & Attestation
- On completion, DSSE attestation + evidence bundle stored.
- Exports available via Export Center.
Cleanup
- Artifacts retained per retention policy (default 30 days).
- Mirror pack run manifest to Offline Kit if configured.
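
The stage order above can be sketched as a small state machine. This is an illustration only: the `Stage` enum and `next_stage` helper are hypothetical names and not part of the Task Runner API, and the sketch ignores failure and cancellation paths.

```python
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """Lifecycle stages in the order listed in this section."""
    SUBMISSION = "Submission"
    PLAN_AND_SIMULATION = "Plan & Simulation"
    APPROVALS = "Approvals"
    EXECUTION = "Execution"
    EVIDENCE_AND_ATTESTATION = "Evidence & Attestation"
    CLEANUP = "Cleanup"


# Enum members iterate in definition order.
ORDER: List[Stage] = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after Cleanup."""
    idx = ORDER.index(current)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None


def remaining(current: Stage) -> List[Stage]:
    """List the stages still ahead of a run that is in `current`."""
    return ORDER[ORDER.index(current) + 1:]
```

A helper like `remaining(Stage.APPROVALS)` can be handy when annotating an incident timeline with what a stalled run has not yet done.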

3 · Monitoring & Telemetry

Metrics dashboards: task-runner Grafana board.
- pack_run_active – active runs per tenant.
- pack_step_duration_seconds – histograms per step type.
- pack_gate_wait_seconds – approval wait time (alert when > 30 min).
- pack_run_success_ratio – success vs failure rate.
Logs: Search by runId, packId, tenant, stepId.
Traces: Query taskrunner.run span in Tempo/Jaeger.
Notifications: Subscribe to pack.run.* topics via Notifier for Slack/email/PagerDuty hooks.

Observability configuration is referenced in the Task Runner tasks (OBS-50-001..55-001).
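
The gate-wait alert threshold and the success ratio can be expressed as two small checks. This is a sketch under assumptions: the ratio is taken here as succeeded / (succeeded + failed), which this runbook does not spell out, and the function names are hypothetical.

```python
from typing import Optional

# Alert threshold for pack_gate_wait_seconds from the metric list: 30 minutes.
GATE_WAIT_ALERT_SECONDS = 30 * 60


def gate_wait_breached(wait_seconds: float) -> bool:
    """True when an approval gate has waited longer than the alert threshold."""
    return wait_seconds > GATE_WAIT_ALERT_SECONDS


def success_ratio(succeeded: int, failed: int) -> Optional[float]:
    """Assumed definition: share of finished runs that succeeded.

    Returns None when no runs have finished, to avoid dividing by zero.
    """
    total = succeeded + failed
    if total == 0:
        return None
    return succeeded / total
```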

4 · Approvals Workflow

Approvals may be requested via Console banner, CLI prompt, or email/Slack.
Approver roles: Packs.Approve + tenant membership.
CLI command:

stella pack approve \
  --run run:tenant:timestamp \
  --gate security-review \
  --comment "Validated remediation scope; proceeding."

Auto-expiry triggers run cancellation (configurable per gate).
Approval events logged and included in evidence bundle.
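
For scripted approvals, the command above can be assembled as an argument list so that the comment never needs manual shell quoting. The sketch below uses only the flags shown in this section; the helper names and the shape of the role and tenant inputs are assumptions for illustration.

```python
import shlex
from typing import Iterable, List, Optional


def approve_argv(run_id: str, gate: str, comment: Optional[str] = None) -> List[str]:
    """Build the argv for `stella pack approve` using the flags shown above."""
    argv = ["stella", "pack", "approve", "--run", run_id, "--gate", gate]
    if comment:
        argv += ["--comment", comment]
    return argv


def can_approve(roles: Iterable[str], tenants: Iterable[str], run_tenant: str) -> bool:
    """Approver needs Packs.Approve plus membership in the run's tenant."""
    return "Packs.Approve" in set(roles) and run_tenant in set(tenants)


argv = approve_argv(
    "run:tenant:timestamp",
    "security-review",
    "Validated remediation scope; proceeding.",
)
# shlex.join produces a copy-pasteable, safely quoted command line.
command_line = shlex.join(argv)
```

Passing `argv` to `subprocess.run` directly avoids a shell altogether; `command_line` is only for display or logging.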

5 · Secrets Handling

Secrets retrieved via Authority secure channel or CLI profile.
Task Runner injects secrets into isolated environment variables or temp files (auto-shredded).
Logs redact secrets; evidence bundles include only secret metadata (name, scope, last four characters).
For sealed mode, secrets must originate from sealed vault (configured via TASKRUNNER_SEALED_VAULT_URL).
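
The redaction and metadata rules above can be illustrated with a short sketch. It is not the Task Runner implementation: the function names, the `[REDACTED]` marker, and the guard for very short secrets are assumptions made for this example.

```python
from typing import Dict, Iterable


def secret_metadata(name: str, scope: str, value: str) -> Dict[str, str]:
    """Keep only name, scope, and the last four characters of a secret."""
    # Assumption: for values of four characters or fewer, expose nothing,
    # since the last four characters would be the whole secret.
    last4 = value[-4:] if len(value) > 4 else ""
    return {"name": name, "scope": scope, "last4": last4}


def redact(line: str, secrets: Iterable[str], marker: str = "[REDACTED]") -> str:
    """Replace every occurrence of each secret value in a log line."""
    # Longest first, so a secret that contains another is fully masked.
    for value in sorted(secrets, key=len, reverse=True):
        if value:
            line = line.replace(value, marker)
    return line
```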

6 · Failure Recovery

| Scenario | Symptom | Resolution |
| --- | --- | --- |
| Plan hash mismatch | Run aborted with `ERR_PACK_HASH_MISMATCH`. | Re-run `stella pack plan`; ensure pack not modified post-plan. |
| Approval timeout | `ERR_PACK_APPROVAL_TIMEOUT`. | Requeue run with extended TTL or escalate to approver; verify notifications delivered. |
| Secret missing | Run fails at injection step. | Provide secret via CLI (`--secrets`) or configure profile; check Authority scope. |
| Network blocked (sealed) | `ERR_PACK_NETWORK_BLOCKED`. | Update pack to avoid external calls or whitelist domain via AirGap policy. |
| Artifact upload failure | Evidence missing, logs show storage errors. | Retry run with `--resume` (if supported); verify object storage health. |
| Runner chaos trigger | Run paused with chaos event note. | Review chaos test plan; resume if acceptable or cancel run. |

`stella pack runs resume --run <id>` resumes paused runs after remediation (approvals or transient failures).
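
During triage it can help to map an error code found in a log line straight to the resolution listed above. The sketch below covers only the three codes named in this section; the lookup helper and the regular expression are assumptions, and the log line format is not defined by this runbook.

```python
import re
from typing import Optional

# Resolutions copied from the failure table in this section.
RESOLUTIONS = {
    "ERR_PACK_HASH_MISMATCH":
        "Re-run `stella pack plan`; ensure pack not modified post-plan.",
    "ERR_PACK_APPROVAL_TIMEOUT":
        "Requeue run with extended TTL or escalate to approver; "
        "verify notifications delivered.",
    "ERR_PACK_NETWORK_BLOCKED":
        "Update pack to avoid external calls or whitelist domain via AirGap policy.",
}

_CODE = re.compile(r"ERR_PACK_[A-Z_]+")


def suggest_resolution(log_line: str) -> Optional[str]:
    """Return the documented resolution for the first error code in the line."""
    match = _CODE.search(log_line)
    if match is None:
        return None
    # Unknown codes fall through to None so they get escalated, not guessed at.
    return RESOLUTIONS.get(match.group(0))
```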

7 · Chaos & Resilience

Chaos hooks pause runs, drop network, or delay approvals to test resilience.
Track chaos events via pack.chaos.injected timeline entries.
Post-chaos, ensure metrics return to baseline; record findings in Ops log.
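
One way to check that metrics have returned to baseline is to compare a post-chaos snapshot with a pre-chaos one. The sketch below is illustrative only: the 10 % relative tolerance, the snapshot format, and the function name are assumptions, not values defined by this runbook.

```python
from typing import Dict, Optional, Tuple


def drifted_metrics(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,  # assumed relative tolerance
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics that are missing or deviate from baseline beyond tolerance."""
    drifted: Dict[str, Tuple[float, Optional[float]]] = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            drifted[name] = (base, None)  # metric disappeared after the test
        elif abs(now - base) > abs(base) * tolerance:
            drifted[name] = (base, now)
    return drifted
```

An empty result suggests the system has settled; anything returned is a candidate for the Ops log entry.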

8 · Offline & Air-Gapped Execution

Use `stella pack mirror pull` to import packs into the sealed registry.
The CLI caches bundles under `~/.stella/packs/` for offline runs.
Approvals require an offline process:
- Generate an approval request bundle (`stella pack approve --offline-request`).
- The approver signs the bundle using the offline CLI.
- Import the approval via `stella pack approve --offline-response`.
Evidence bundles are exported to removable media; verify checksums before uploading them to online systems.
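
The checksum verification step can be sketched as follows. This runbook does not state which digest algorithm the bundles use, so SHA-256 is an assumption here, as are the helper names; the expected digest would come from wherever the export recorded it.

```python
import hashlib
import hmac
import io
from typing import BinaryIO


def sha256_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large bundles need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(stream: BinaryIO, expected_hex: str) -> bool:
    """Compare the computed digest with the recorded one in constant time."""
    actual = sha256_hex(stream)
    return hmac.compare_digest(actual, expected_hex.strip().lower())


# In-memory stand-in for a bundle on removable media;
# real use would pass open(path, "rb") instead.
demo_bundle = io.BytesIO(b"example evidence bundle bytes")
demo_digest = hashlib.sha256(b"example evidence bundle bytes").hexdigest()
demo_ok = checksum_matches(demo_bundle, demo_digest)
```

Upload should proceed only when the comparison succeeds; a mismatch is a reason to re-export rather than to retry the upload.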

9 · Runbooks for Common Packs

Maintain per-pack playbooks in docs/task-packs/runbook/<pack-name>.md. Include:

Purpose and scope.
Required inputs and secrets.
Approval stakeholders.
Pre-checks and post-checks.
Rollback procedures.

The Docs Guild can use this root runbook as a template.
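
A quick lint for per-pack playbooks can confirm that each required item is present. The sketch below assumes that each item appears verbatim (case-insensitively) somewhere in the playbook text; the helper names are hypothetical and no such tool is implied to exist in the repository.

```python
from typing import List

# Items every per-pack playbook should include, as listed above.
REQUIRED_SECTIONS = [
    "Purpose and scope",
    "Required inputs and secrets",
    "Approval stakeholders",
    "Pre-checks and post-checks",
    "Rollback procedures",
]


def playbook_path(pack_name: str) -> str:
    """Location of a pack's playbook, following the path convention above."""
    return f"docs/task-packs/runbook/{pack_name}.md"


def missing_sections(playbook_text: str) -> List[str]:
    """Return required items that do not appear in the playbook text."""
    lowered = playbook_text.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]
```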

10 · Escalation Matrix

| Issue | Primary | Secondary | Notes |
| --- | --- | --- | --- |
| Pack validation errors | DevEx/CLI Guild | Task Runner Guild | Provide pack bundle + validation output. |
| Approval pipeline failure | Task Runner Guild | Authority Core | Confirm scope/role mapping. |
| Registry outage | Packs Registry Guild | DevOps Guild | Use mirror fallback if possible. |
| Evidence integrity issues | Evidence Locker Guild | Security Guild | Validate DSSE attestations, escalate if tampered. |

Escalations must include run ID, tenant, pack version, plan hash, and timestamps.
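
The required escalation details can be checked before a ticket is filed. The sketch below is a minimal example; the dictionary keys are hypothetical field names chosen for illustration, not a schema defined by any StellaOps service.

```python
from typing import Any, Dict, List

# Hypothetical keys for the details every escalation must include.
REQUIRED_FIELDS = ("run_id", "tenant", "pack_version", "plan_hash", "timestamps")


def missing_fields(escalation: Dict[str, Any]) -> List[str]:
    """Return required fields that are absent or empty in the escalation."""
    return [name for name in REQUIRED_FIELDS if not escalation.get(name)]


def is_complete(escalation: Dict[str, Any]) -> bool:
    """True when every required detail is present and non-empty."""
    return not missing_fields(escalation)
```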

11 · Compliance Checklist

Run lifecycle documented (submission → evidence).
Monitoring metrics, logs, traces, and notifications captured.
Approvals workflow instructions provided (CLI + Console).
Secret handling, sealed-mode constraints, and offline process described.
Failure scenarios + recovery steps listed.
Chaos/resilience guidance included.
Escalation matrix defined.
Imposed rule reminder included at top.

Last updated: 2025-10-27 (Sprint 43).
