
# 03. Canonical Execution Model

## 1. Why The Engine Executes Canonical Definitions

The workflow corpus is now fully declarative and canonicalizable.

That changes the best runtime strategy:

- authored C# remains the source of truth
- canonical definition becomes the runtime execution contract
- the engine interprets canonical definitions directly

This gives the platform:

- deterministic runtime behavior
- shared semantics between export/import and execution
- less runtime coupling to workflow-specific CLR delegates
- a clean separation between authoring and execution

## 2. Definition Lifecycle

### 2.1 Authoring

Workflows are authored in C# through the declarative DSL.

### 2.2 Normalization

At service startup, each workflow registration is normalized into:

1. workflow registration metadata
2. canonical workflow definition
3. required module set
4. function usage metadata

### 2.3 Validation

The runtime should validate canonical definitions before accepting them for execution.

Recommended startup modes:

- `Strict`: startup fails if a definition is invalid.
- `Warn`: startup succeeds, but invalid definitions are marked unavailable.
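
The two modes can be sketched as follows. This is a minimal illustration in Python rather than the engine's C#, and every name in it (`StartupMode`, `ValidationReport`, `validate_at_startup`, the `validate` callback) is a hypothetical stand-in, not an existing engine API.

```python
from dataclasses import dataclass, field
from enum import Enum


class StartupMode(Enum):
    STRICT = "Strict"
    WARN = "Warn"


@dataclass
class ValidationReport:
    available: list = field(default_factory=list)
    unavailable: dict = field(default_factory=dict)  # workflow name -> errors


def validate_at_startup(definitions, validate, mode):
    """Apply the startup mode to every canonical definition.

    `definitions` maps workflow name -> definition and `validate` returns a
    list of error strings (empty when valid). Both are assumptions.
    """
    report = ValidationReport()
    for name, definition in definitions.items():
        errors = validate(definition)
        if not errors:
            report.available.append(name)
        elif mode is StartupMode.STRICT:
            # Strict: one invalid definition aborts startup.
            raise ValueError(f"invalid workflow definition {name!r}: {errors}")
        else:
            # Warn: startup continues, the definition is marked unavailable.
            report.unavailable[name] = errors
    return report
```

In `Warn` mode the caller would be expected to reject start requests for any workflow listed in `unavailable`.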

### 2.4 Runtime Cache

The engine should cache canonical runtime definitions in memory by:

- workflow name
- workflow version

This cache is immutable after startup in v1.
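
A minimal sketch of such a cache, written in Python for illustration only. It assumes each runtime definition exposes `name` and `version` entries, which are hypothetical field names, and uses a read-only mapping view to model the "immutable after startup" rule.

```python
from types import MappingProxyType


def build_definition_cache(definitions):
    """Freeze (name, version) -> runtime definition into a read-only mapping."""
    cache = {}
    for definition in definitions:
        key = (definition["name"], definition["version"])
        if key in cache:
            # Two registrations with the same identity would make lookup ambiguous.
            raise ValueError(f"duplicate workflow definition: {key}")
        cache[key] = definition
    # The proxy rejects writes; the underlying dict is not exposed to callers.
    return MappingProxyType(cache)
```

Because nothing can write to the mapping after startup, concurrent readers need no locking in this sketch.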

## 3. Canonical Runtime Definition Shape

The runtime definition should be treated as a compiled, execution-ready representation of the canonical contracts, not a raw JSON document.

The runtime model should contain:

- definition identity
- display metadata
- required modules
- step graph
- task declarations
- expression trees
- transport declarations
- subworkflow declarations
- continue-with declarations

## 4. Execution Context Model

The interpreter should run every step against a single canonical execution context.

Recommended execution context fields:

- `WorkflowName`
- `WorkflowVersion`
- `WorkflowInstanceId`
- `BusinessReference`
- `State`
- `StartPayload`
- `CompletionPayload`
- `CurrentTask`
- `CurrentSignal`
- `FunctionRuntime`
- `TransportDispatcher`
- `RuntimeMetadata`

`RuntimeMetadata` should hold:

- node id
- current signal id
- snapshot version
- waiting token
- execution started at
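
As a rough illustration, the recommended fields could be grouped like this. The sketch is in Python rather than C#, and all types shown are assumptions; the source only names the fields.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class RuntimeMetadata:
    node_id: str
    current_signal_id: Optional[str]
    snapshot_version: int
    waiting_token: Optional[str]
    execution_started_at: datetime


@dataclass
class ExecutionContext:
    workflow_name: str
    workflow_version: str
    workflow_instance_id: str
    business_reference: Optional[str]
    state: dict                      # business state visible to the author
    start_payload: dict
    runtime_metadata: RuntimeMetadata
    completion_payload: Optional[dict] = None
    current_task: Optional[str] = None
    current_signal: Optional[dict] = None
    function_runtime: Any = None     # expression function host (type assumed)
    transport_dispatcher: Any = None # transport adapter router (type assumed)
```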

## 5. Core Runtime State Model

The runtime must distinguish between:

- business state
- engine state

### 5.1 Business State

Business state is what the workflow author reasons about.

Examples:

- `srPolicyId`
- `policySubstatus`
- customer lookup state
- payload shaping outputs
- subworkflow results

### 5.2 Engine State

Engine state is what the runtime needs to resume correctly.

Examples:

- current workflow status
- current wait type
- current wait token
- active task identity
- resume pointer
- subworkflow frame stack
- outstanding timer descriptors
- last processed signal id

Business state must remain visible in runtime inspection.
Engine state must remain safe and deterministic for resume.

## 6. Run-To-Wait Execution Model

The engine uses a run-to-wait interpreter.

This means:

1. load snapshot
2. execute sequentially
3. stop when a durable wait boundary is reached
4. persist resulting snapshot
5. release instance

Wait boundaries are:

- human task activation
- scheduled timer
- external signal wait
- child workflow wait
- terminal completion

This model is essential for:

- multi-instance safety
- restart recovery
- no sticky ownership
- no in-memory correctness assumptions
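
The five steps can be sketched as a single function. This is an illustrative Python model, not the engine's implementation: the dict-based `store`, the snapshot keys, and the convention that a step returns a wait kind (or `None` to continue) are all assumptions.

```python
import copy

# Hypothetical short names for the wait boundaries listed above.
WAIT_KINDS = {"task", "timer", "signal", "child", "complete"}


def run_to_wait(store, instance_id, steps):
    """Run one execution slice: load, execute until a wait boundary, persist."""
    snapshot = copy.deepcopy(store[instance_id])   # 1. load snapshot
    index = snapshot["next_step"]
    wait = "complete"                              # running off the end completes
    while index < len(steps):                      # 2. execute sequentially
        outcome = steps[index](snapshot["state"])
        index += 1
        if outcome in WAIT_KINDS:                  # 3. stop at a durable wait boundary
            wait = outcome
            break
    snapshot["next_step"] = index
    snapshot["wait"] = wait
    snapshot["version"] += 1
    store[instance_id] = snapshot                  # 4. persist resulting snapshot
    return wait                                    # 5. release: nothing stays in memory
```

Because each slice works on a copy loaded from the store and writes it back at the end, any worker can pick up the next slice, which is what removes sticky ownership in this sketch.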

## 7. Step Semantics

### 7.1 State Assignment

State assignment is immediate and local to the current execution transaction.

The engine:

- evaluates the assignment expression
- writes to the business state dictionary
- keeps the changes in memory until the next durable checkpoint

### 7.2 Business Reference Assignment

Business reference assignment updates the canonical business reference attached to:

- the runtime snapshot
- new tasks
- instance projection updates

Business reference changes must be applied transactionally with other execution results.

### 7.3 Human Task Activation

A human task activation step is always a wait boundary: the interpreter does not continue past it in the same execution.

The result of task activation is:

- one active task projection
- updated instance status
- updated runtime snapshot
- optional runtime metadata for the active task

### 7.4 Transport Call

Transport calls are synchronous from the perspective of a single execution slice.

The engine:

- evaluates payload expressions
- dispatches through the correct transport adapter
- captures result payload
- stores result under the result key when present
- chooses the success, failure, or timeout branch

No engine-specific callback registration should be required for normal synchronous transport calls.
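
The sequence above might look like this in outline. The sketch is Python for illustration; the step keys (`transport`, `payload`, `result_key`), the `evaluate` callback, and the mapping of exceptions to branches are assumptions rather than the engine's actual contract.

```python
def execute_transport_call(step, state, adapters, evaluate):
    """Run one synchronous transport call and return the branch to follow."""
    # Evaluate each payload expression against the current business state.
    payload = {key: evaluate(expr, state) for key, expr in step["payload"].items()}
    adapter = adapters[step["transport"]]          # pick the transport adapter
    try:
        result = adapter(payload)                  # blocking call within this slice
    except TimeoutError:
        return "timeout"
    except Exception:
        return "failure"
    result_key = step.get("result_key")
    if result_key is not None:                     # store only when a key is declared
        state[result_key] = result
    return "success"
```

The call returns a branch name instead of registering a callback, which mirrors the point that normal synchronous transport calls need no engine-specific callback registration.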

### 7.5 Conditional Branch

Conditions evaluate against the current execution context.

Only one branch is executed.

The branch path must be reproducible in the resume pointer model.

### 7.6 Repeat

Repeat executes logically as:

- evaluate collection or repeat source
- for each iteration:
  - bind iteration context
  - execute nested sequence

If an iteration hits a wait boundary, the engine snapshot must preserve:

- repeat step id
- iteration index
- remaining resume location inside the iteration body
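
A minimal sketch of how those three values allow a repeat to resume mid-iteration. It is written in Python for illustration; the record keys, the `"wait"` return convention, and binding the current element under `state["item"]` are assumptions. It also assumes the repeat source evaluates to the same collection on resume.

```python
def run_repeat(step_id, items, body, state, resume=None):
    """Run a repeat step; return None when done, or a resume record on a wait."""
    start_iter = resume["iteration"] if resume else 0
    start_step = resume["body_index"] if resume else 0
    for i in range(start_iter, len(items)):
        state["item"] = items[i]                   # bind iteration context
        # Only the resumed iteration skips body steps that already ran.
        first = start_step if i == start_iter else 0
        for j in range(first, len(body)):
            if body[j](state) == "wait":
                return {
                    "repeat_step_id": step_id,
                    "iteration": i,
                    "body_index": j + 1,           # next body step after the wait
                }
    return None
```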

### 7.7 Subworkflow Invocation

Subworkflow invocation is a wait boundary unless the child completes inline before producing a wait.

Parent snapshot must record:

- child workflow identity
- child workflow version
- parent business reference
- parent resume pointer
- target result key
- parent workflow state needed for resume

### 7.8 Continue-With

Continue-with creates a new workflow start request as an engine side effect.

It is not a resume boundary for the current instance unless explicitly modeled that way by the workflow.

## 8. Resume Model

### 8.1 Resume Pointer

The engine must persist a deterministic resume pointer.

It should identify:

- entry point kind
- task name if resuming from task completion
- branch path
- next step index
- repeat iteration where applicable

The existing declarative resume model is the right conceptual baseline, but the engine should persist it inside the canonical runtime snapshot rather than inside a CLR-only execution flow.
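
One possible shape for a serializable resume pointer is sketched below in Python. The field types, the JSON encoding, and the example values are assumptions; the source only lists what the pointer should identify.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ResumePointer:
    entry_point_kind: str
    task_name: Optional[str]          # set when resuming from task completion
    branch_path: tuple                # branch choices leading to the step
    next_step_index: int
    repeat_iteration: Optional[int] = None


def to_json(pointer):
    # sort_keys keeps the serialized form stable across runs.
    return json.dumps(asdict(pointer), sort_keys=True)


def from_json(text):
    data = json.loads(text)
    data["branch_path"] = tuple(data["branch_path"])  # JSON arrays load as lists
    return ResumePointer(**data)
```

A stable round trip matters here because the pointer lives inside the persisted snapshot and must mean the same thing to whichever worker loads it next.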

### 8.2 Waiting Token

Every durable wait must have a waiting token.

The waiting token is how the engine prevents stale resumes.

When a signal arrives and its waiting token does not match the one in the snapshot, the signal is stale and must be ignored safely.

This is the primary guard for:

- canceled timers
- duplicate wake-ups
- late child completions
- redelivered signals
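
The token check can be sketched as follows, in Python for illustration. The snapshot and signal keys are hypothetical, and clearing the token on acceptance is one assumed way to make duplicates stale; the source does not prescribe it.

```python
import secrets


def begin_wait(snapshot, wait_type):
    """Enter a durable wait and issue a fresh token for it."""
    token = secrets.token_hex(16)
    snapshot["wait_type"] = wait_type
    snapshot["waiting_token"] = token   # replaces any token from an earlier wait
    return token


def accept_signal(snapshot, signal):
    """Return True if the signal may resume the instance; ignore stale ones."""
    token = snapshot.get("waiting_token")
    if token is None or signal.get("waiting_token") != token:
        return False                    # stale: canceled, duplicate, or late
    snapshot["waiting_token"] = None    # consume, so a redelivery no longer matches
    return True
```

A canceled timer, for example, still carries the token of the wait that scheduled it, so it fails the comparison once a newer wait has replaced that token.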

### 8.3 Version

Every successful execution commit must increment the snapshot version.

Signals may carry the expected version that created the wait.

This allows the engine to detect stale work before any mutation.
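
A small sketch of both uses of the version, in Python for illustration. The `expected_version` key and the dict-based `store` are assumptions, and the compare-before-write commit is an optimistic-concurrency pattern added here for illustration rather than something the source specifies.

```python
def is_stale(snapshot_version, signal):
    """A signal is stale if it names a version other than the current one."""
    expected = signal.get("expected_version")   # optional on the signal
    return expected is not None and expected != snapshot_version


def commit(store, instance_id, loaded_version, new_snapshot):
    """Write the new snapshot only if nobody else committed since it was loaded."""
    if store[instance_id]["version"] != loaded_version:
        return False                            # lost the race, nothing is mutated
    store[instance_id] = {**new_snapshot, "version": loaded_version + 1}
    return True
```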

## 9. Human Task Model

The task model remains projection-first.

The runtime does not wait on an in-memory task object.

Instead:

- task activation writes a task projection row
- runtime snapshot enters `WaitingForTaskCompletion`
- task completion API provides the wake-up event

Task completion is therefore an external signal into the engine.

## 10. Error Model

The interpreter should classify errors into:

- definition errors
- expression evaluation errors
- transport errors
- timeout errors
- authorization errors
- engine consistency errors

Definition errors are startup or validation failures.
All other classes are execution errors: runtime failures that may:

- route into a failure branch
- schedule a retry
- fail the workflow
- move the instance to a recoverable error state

## 11. Retry Model

Retries should be modeled explicitly as scheduled signals.

The engine should not sleep inside a worker.

A retry should:

1. persist the failure context
2. generate a new waiting token
3. enqueue a delayed resume signal
4. commit
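
The four steps might be staged like this. The sketch is Python for illustration; the snapshot keys, the `outbox` list standing in for a delayed signal queue, and the attempt counter are assumptions. The current time is passed in rather than read inside the function, so the step stays deterministic.

```python
import secrets
from datetime import datetime, timedelta


def schedule_retry(snapshot: dict, outbox: list, error: str,
                   delay: timedelta, now: datetime) -> str:
    """Stage a retry; the caller commits snapshot and outbox together."""
    attempt = snapshot.get("retry_attempt", 0) + 1
    snapshot["retry_attempt"] = attempt
    # 1. persist the failure context
    snapshot["failure_context"] = {"error": error, "attempt": attempt}
    # 2. generate a new waiting token, invalidating any earlier wait
    token = secrets.token_hex(16)
    snapshot["waiting_token"] = token
    # 3. enqueue a delayed resume signal instead of sleeping in the worker
    outbox.append({"kind": "retry", "waiting_token": token, "due_at": now + delay})
    # 4. commit is left to the caller so all changes land in one transaction
    return token
```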

## 12. Completion Model

A workflow completes when the interpreter reaches terminal completion with no outstanding waits.

Completion result must:

- mark instance projection completed
- mark runtime state completed
- clear stale timeout metadata
- apply retention timing

## 13. Determinism Requirements

The runtime must assume:

- expressions are deterministic given the execution context
- transport calls are side effects and must be treated explicitly
- no hidden CLR delegate behavior remains in workflow definitions

The runtime should not rely on:

- non-deterministic local time calls inside step execution
- in-memory mutable workflow objects
- ambient state outside the canonical execution context

## 14. Resulting Implementation Shape

The engine kernel should be implemented as:

- definition normalizer
- canonical interpreter
- transport dispatcher
- execution coordinator
- resume serializer/deserializer

This produces a runtime that is small, explicit, and aligned with the already-completed full-declaration effort.