Files
git.stella-ops.org/docs/operations/devops/task-runner-simulation.md

49 lines
3.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Task Runner — Simulation & Failure Policy Notes
> **Status:** Draft (2025-11-04) — execution wiring + CLI simulate command landed; docs pending final polish
The Task Runner planning layer now materialises additional runtime metadata to unblock execution and simulation flows:
- **Execution graph builder** converts `TaskPackPlan` steps (including `map` and `parallel`) into a deterministic graph with preserved enablement flags and per-step metadata (`maxParallel`, `continueOnError`, parameters, approval IDs).
- **Simulation engine** walks the execution graph and classifies steps as `pending`, `skipped`, `requires-approval`, or `requires-policy`, producing a deterministic preview for CLI/UI consumers while surfacing declared outputs.
- **Failure policy** pack-level `spec.failure.retries` is normalised into a `TaskPackPlanFailurePolicy` (default: `maxAttempts = 1`, `backoffSeconds = 0`). The new step state machine uses this policy to schedule retries and to determine when a run must abort.
- **Simulation API + Worker** `POST /v1/task-runner/simulations` returns the deterministic preview; `GET /v1/task-runner/runs/{id}` exposes persisted retry windows now written by the worker as it honours `maxParallel`, `continueOnError`, and retry windows during execution.
## Current behaviour
- Map steps expand into child iterations (`stepId[index]::templateId`) with per-item parameters preserved for runtime reference.
- Parallel blocks honour `maxParallel` (defaults to unlimited) and the worker executes children accordingly, short-circuiting when `continueOnError` is false.
- Simulation output mirrors approvals/policy gates, allowing the WebService/CLI to show which actions must occur before execution resumes.
- File-backed state store persists `PackRunState` snapshots (`nextAttemptAt`, attempts, reasons) so orchestration clients and CLI can resume runs deterministically even in air-gapped environments.
- Step state machine transitions:
- `pending → running → succeeded`
- `running → failed` (abort) once attempts ≥ `maxAttempts`
- `running → pending` with scheduled `nextAttemptAt` when retries remain
- `pending → skipped` for disabled steps (e.g., `when` expressions).
## CLI usage
Run the simulation without mutating state:
```bash
stella task-runner simulate \
--manifest ./packs/sample-pack.yaml \
--inputs ./inputs.json \
--format table
```
Use `--format json` (or `--output path.json`) to emit the raw payload produced by `POST /api/task-runner/simulations`.
## Follow-up gaps
- Fold the CLI command into the official reference/quickstart guides and capture exit-code conventions.
References:
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Core/Execution/PackRunExecutionGraphBuilder.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Core/Execution/Simulation/PackRunSimulationEngine.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Core/Execution/PackRunStepStateMachine.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Infrastructure/Execution/FilePackRunStateStore.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Worker/Services/PackRunWorkerService.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.WebService/Program.cs`