git.stella-ops.org/docs/workflow/engine/04-persistence-signaling-and-scheduling.md

04. Persistence, Signaling, And Scheduling

1. Persistence Strategy

Oracle is the single durable backend for v1.

It stores:

  • workflow instance projections
  • workflow task projections
  • workflow task events
  • workflow runtime snapshots
  • hosted job locks
  • AQ queues for immediate and delayed signals

This keeps correctness inside one transactional platform.

2. Existing Tables To Preserve

The current workflow schema already has the right base tables:

  • WF_INSTANCES
  • WF_TASKS
  • WF_TASK_EVENTS
  • WF_RUNTIME_STATES
  • WF_HOST_LOCKS

The current workflow database model is the mapping baseline for these tables.

2.1 WF_INSTANCES

Purpose:

  • product-facing workflow instance summary
  • instance business reference
  • instance status
  • product-facing state snapshot

2.2 WF_TASKS

Purpose:

  • active and historical human task projections
  • task routing
  • assignment
  • task payload
  • effective roles

2.3 WF_TASK_EVENTS

Purpose:

  • append-only task event history
  • created, assigned, released, completed, reassigned events

2.4 WF_RUNTIME_STATES

Purpose:

  • engine-owned durable runtime snapshot

This table becomes the main source of truth for engine resume.

3. Proposed Runtime State Extensions

WF_RUNTIME_STATES should be extended to support canonical engine execution directly.

Recommended new columns:

  • STATE_VERSION: numeric optimistic concurrency version.
  • SNAPSHOT_SCHEMA_VERSION: snapshot format version for engine evolution.
  • WAITING_KIND: current wait type.
  • WAITING_TOKEN: stale-signal guard token.
  • WAITING_UNTIL_UTC: next due time when waiting on a time-based resume.
  • ACTIVE_TASK_ID: current active task projection id, when applicable.
  • RESUME_POINTER_JSON: serialized canonical resume pointer.
  • LAST_SIGNAL_ID: last successfully processed signal id.
  • LAST_ERROR_CODE: last engine error code.
  • LAST_ERROR_JSON: structured details of the last error.
  • LAST_EXECUTED_BY: node id that last committed execution.
  • LAST_EXECUTED_ON_UTC: timestamp of the last successful engine commit.

The existing fields remain useful:

  • workflow identity
  • business reference
  • runtime provider
  • runtime instance id
  • runtime status
  • state json
  • lifecycle timestamps
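
The column additions above could be sketched as Oracle DDL. This is a sketch only: the types, sizes, and defaults are assumptions to be aligned with the existing schema conventions.

```sql
-- Sketch only: types, sizes, and defaults are assumptions.
ALTER TABLE WF_RUNTIME_STATES ADD (
  STATE_VERSION            NUMBER       DEFAULT 0 NOT NULL,
  SNAPSHOT_SCHEMA_VERSION  NUMBER       DEFAULT 1 NOT NULL,
  WAITING_KIND             VARCHAR2(64),
  WAITING_TOKEN            VARCHAR2(128),
  WAITING_UNTIL_UTC        TIMESTAMP,
  ACTIVE_TASK_ID           VARCHAR2(64),
  RESUME_POINTER_JSON      CLOB,
  LAST_SIGNAL_ID           VARCHAR2(128),
  LAST_ERROR_CODE          VARCHAR2(64),
  LAST_ERROR_JSON          CLOB,
  LAST_EXECUTED_BY         VARCHAR2(128),
  LAST_EXECUTED_ON_UTC     TIMESTAMP
);
```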

4. Snapshot Structure

STATE_JSON should hold a provider snapshot object for SerdicaEngine.

Recommended shape:

{
  "engineSchemaVersion": 1,
  "workflowState": {},
  "businessReference": {
    "key": "1200345",
    "parts": {}
  },
  "status": "Open",
  "waiting": {
    "kind": "TaskCompletion",
    "token": "wait-123",
    "untilUtc": null
  },
  "resume": {
    "entryPointKind": "TaskOnComplete",
    "taskName": "ApproveApplication",
    "branchPath": [],
    "nextStepIndex": 3
  },
  "subWorkflowFrames": [],
  "continuationBuffer": []
}

5. Oracle AQ Strategy

5.1 Why AQ

AQ is the default signaling backend because it gives:

  • durable storage
  • blocking dequeue
  • delayed delivery
  • database-managed recovery
  • transactional semantics close to the runtime state store

5.2 Queue Topology

Use explicit, separate queues for operational clarity.

Recommended objects:

  • WF_SIGNAL_QTAB
  • WF_SIGNAL_Q
  • WF_SCHEDULE_QTAB
  • WF_SCHEDULE_Q
  • WF_DLQ_QTAB
  • WF_DLQ_Q

Rationale:

  • immediate signals and delayed signals are operationally different
  • dead-letter isolation matters for supportability
  • queue separation makes metrics and troubleshooting simpler
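
Under this topology, queue provisioning could look like the following DBMS_AQADM sketch. The RAW payload type matches section 5.3; everything beyond the queue names is an assumption.

```sql
-- Sketch only: queue names from this document; all options beyond the names
-- and the RAW payload type are assumptions.
BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(
    queue_table        => 'WF_SIGNAL_QTAB',
    queue_payload_type => 'RAW');
  DBMS_AQADM.CREATE_QUEUE(
    queue_name  => 'WF_SIGNAL_Q',
    queue_table => 'WF_SIGNAL_QTAB');
  DBMS_AQADM.START_QUEUE(queue_name => 'WF_SIGNAL_Q');
  -- Repeat for WF_SCHEDULE_Q / WF_SCHEDULE_QTAB and WF_DLQ_Q / WF_DLQ_QTAB.
END;
/
```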

5.3 Payload Format

Use a compact JSON envelope serialized to UTF-8 bytes in a RAW payload.

Reasons:

  • simple from .NET
  • explicit schema ownership in application code
  • small message size
  • backend abstraction remains possible later

Do not put full workflow snapshots into AQ messages.
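
As a minimal sketch of the envelope-to-RAW convention, assuming System.Text.Json:

```csharp
// Sketch only: the JSON envelope is serialized to UTF-8 bytes for the RAW
// payload, and deserialized back on dequeue.
using System.Text.Json;

static byte[] SerializeEnvelope<T>(T envelope) =>
    JsonSerializer.SerializeToUtf8Bytes(envelope);

static T DeserializeEnvelope<T>(byte[] raw) =>
    JsonSerializer.Deserialize<T>(raw)!;
```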

6. Signal Envelope

Recommended envelope:

public sealed record WorkflowSignalEnvelope
{
    public required string SignalId { get; init; }
    public required string WorkflowInstanceId { get; init; }
    public required string RuntimeProvider { get; init; }
    public required string SignalType { get; init; }
    public required long ExpectedVersion { get; init; }
    public string? WaitingToken { get; init; }
    public DateTime OccurredAtUtc { get; init; }
    public DateTime? DueAtUtc { get; init; }
    public Dictionary<string, JsonElement> Payload { get; init; } = new();
}

Signal types:

  • TaskCompleted
  • TimerDue
  • RetryDue
  • ExternalSignal
  • SubWorkflowCompleted
  • InternalContinue

7. Transaction Model

7.1 Start Transaction

Start must durably commit:

  • instance projection
  • runtime snapshot
  • task rows if any
  • task events if any
  • scheduled or immediate AQ messages if any

7.2 Completion Transaction

Task completion must durably commit:

  • task completion event
  • updated instance projection
  • new task rows if any
  • updated runtime snapshot
  • any resulting AQ signals or schedules

7.3 Signal Resume Transaction

Signal resume must durably commit:

  • AQ dequeue
  • updated runtime snapshot
  • resulting projection changes
  • any next AQ signals

The intended operational model is:

  • dequeue with transactional semantics
  • update state and projections
  • commit once

If commit fails, the signal must become visible again.
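
The dequeue-update-commit shape could be sketched as follows, assuming an ODP.NET build that exposes the AQ classes (OracleAQQueue and friends); the engine step and projection updates are elided. The point is that the dequeue shares the transaction with the state write.

```csharp
// Sketch only: assumes ODP.NET with AQ support; error handling elided.
using Oracle.DataAccess.Client;

static void ConsumeOneSignal(OracleConnection conn)
{
    using var tx = conn.BeginTransaction();

    var queue = new OracleAQQueue("WF_SIGNAL_Q", conn)
    {
        MessageType = OracleAQMessageType.Raw,
    };
    queue.DequeueOptions.Visibility = OracleAQVisibilityMode.OnCommit; // dequeue joins the tx
    queue.DequeueOptions.Wait = 30;                                    // blocking wait, seconds

    var message = queue.Dequeue();     // blocks until a message arrives or the wait expires
    // ... deserialize the envelope, run the engine step,
    // ... update WF_RUNTIME_STATES and projections on the same connection.

    tx.Commit();                       // state, projections, and dequeue commit together
    // If Commit is never reached, rollback makes the signal visible again.
}
```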

8. Blocking Receive Model

No polling loop should be used for work discovery.

Each node should run a signal pump that:

  • opens one or more blocking AQ dequeue consumers
  • waits on AQ rather than sleeping and scanning
  • dispatches envelopes to bounded execution workers

Suggested parameters:

  • dequeue wait seconds
  • max concurrent handlers
  • max poison retries
  • dead-letter policy
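
A minimal pump shape might look like this; blockingDequeueAsync and handleAsync are hypothetical stand-ins for the AQ consumer and the engine dispatcher, and the parameters above map onto the wait timeout and the semaphore bound.

```csharp
// Sketch only: a generic pump shape with a blocking dequeue and bounded workers.
using System;
using System.Threading;
using System.Threading.Tasks;

sealed class SignalPump(
    Func<TimeSpan, CancellationToken, Task<WorkflowSignalEnvelope?>> blockingDequeueAsync,
    Func<WorkflowSignalEnvelope, CancellationToken, Task> handleAsync,
    int maxConcurrentHandlers)
{
    private readonly SemaphoreSlim _slots = new(maxConcurrentHandlers);

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            // Block on AQ instead of sleeping and scanning tables.
            var envelope = await blockingDequeueAsync(TimeSpan.FromSeconds(30), ct);
            if (envelope is null) continue; // dequeue wait expired; wait again

            await _slots.WaitAsync(ct);     // bound concurrent execution workers
            _ = Task.Run(async () =>
            {
                try { await handleAsync(envelope, ct); }
                finally { _slots.Release(); }
            }, ct);
        }
    }
}
```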

9. Scheduling Model

9.1 Scheduling Requirement

The engine must support timers without a periodic sweep job.

9.2 Scheduling Approach

When a workflow enters a timed wait:

  1. runtime snapshot is updated with:
    • waiting kind
    • waiting token
    • due time
  2. a delayed AQ message is enqueued
  3. transaction commits

When the delayed message becomes available:

  1. a signal consumer dequeues it
  2. current snapshot is loaded
  3. waiting token is checked
  4. if token matches, resume
  5. if token does not match, ignore as stale
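
The stale-token guard in steps 3 to 5 can be sketched as:

```csharp
// Sketch only: a reschedule writes a new token into the snapshot, so an old
// delayed message no longer matches and is ignored as stale.
static bool ShouldResume(WorkflowSignalEnvelope signal, string? currentWaitingToken) =>
    signal.WaitingToken is not null
    && signal.WaitingToken == currentWaitingToken;
```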

9.3 Logical Cancel Instead Of Physical Delete

The scheduler should treat cancel and reschedule logically.

Do not make correctness depend on deleting a queued timer message.

Instead:

  • generate a new waiting token when schedule changes
  • old delayed message becomes stale automatically

This is simpler and more reliable in distributed execution.

10. Multi-Node Concurrency Model

The engine must assume multiple nodes can receive signals for the same workflow instance.

Correctness model:

  • signal delivery is at-least-once
  • snapshot update uses version control
  • waiting token guards stale work
  • duplicate resumes are safe

Recommended write model:

  • read snapshot version
  • execute
  • update WF_RUNTIME_STATES where STATE_VERSION = expected
  • if update count is zero, abandon and retry or ignore as stale

This avoids permanent instance ownership.
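
The guarded update could be sketched as below; the bind names and the RUNTIME_INSTANCE_ID column name are illustrative, to be mapped onto the actual WF_RUNTIME_STATES columns.

```sql
-- Sketch only: bind names and the instance-id column name are illustrative.
UPDATE WF_RUNTIME_STATES
   SET STATE_JSON           = :newStateJson,
       STATE_VERSION        = STATE_VERSION + 1,
       LAST_SIGNAL_ID       = :signalId,
       LAST_EXECUTED_BY     = :nodeId,
       LAST_EXECUTED_ON_UTC = SYS_EXTRACT_UTC(SYSTIMESTAMP)
 WHERE RUNTIME_INSTANCE_ID  = :instanceId
   AND STATE_VERSION        = :expectedVersion;
-- Zero rows updated means another node won; abandon and retry, or ignore as stale.
```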

11. Restart And Recovery Semantics

11.1 One Node Down

Other nodes continue consuming AQ and processing instances.

11.2 All Nodes Down, Database Up

Signals remain durable in AQ.

When any node comes back:

  • AQ consumers reconnect
  • pending immediate and delayed signals are processed
  • workflow resumes continue

11.3 Database Down

No execution can continue while Oracle is unavailable.

Once Oracle returns:

  • AQ queues recover with the database
  • runtime snapshots recover with the database
  • resumed node consumers continue from durable state

11.4 All Nodes And Database Down

After Oracle returns and at least one application node starts:

  • AQ messages are still present
  • runtime state is still present
  • due delayed messages can be consumed
  • execution resumes from durable state

This is one of the main reasons Oracle AQ is preferred over a separate volatile wake-up layer.

12. Redis Position In V1

Redis is optional and not part of the correctness path.

It may be used later for:

  • local cache
  • non-authoritative wake hints
  • metrics fanout

It should not be required for:

  • durable signal delivery
  • timer delivery
  • restart recovery

13. Dead-Letter Strategy

Messages should move to DLQ when:

  • deserialization fails
  • definition is missing
  • snapshot is irreparably inconsistent
  • retry count exceeds threshold

DLQ entry should preserve:

  • original envelope
  • failure reason
  • last node id
  • failure timestamp
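
One possible shape for such a DLQ entry, mirroring the preserved fields above; the type name is illustrative.

```csharp
// Sketch only: a hypothetical record mirroring the preserved DLQ fields.
using System;

public sealed record DeadLetterEntry
{
    public required WorkflowSignalEnvelope OriginalEnvelope { get; init; }
    public required string FailureReason { get; init; }
    public required string LastNodeId { get; init; }
    public required DateTime FailedAtUtc { get; init; }
}
```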

14. Retention

Retention remains a service responsibility.

It should continue to clean:

  • stale instances
  • stale tasks
  • completed data past purge window
  • runtime states past purge window

AQ retention policy should be aligned with application retention and supportability needs, but queue cleanup must not delete active work.