04. Persistence, Signaling, And Scheduling
1. Persistence Strategy
Oracle is the single durable backend for v1.
It stores:
- workflow instance projections
- workflow task projections
- workflow task events
- workflow runtime snapshots
- hosted job locks
- AQ queues for immediate and delayed signals
This keeps correctness inside one transactional platform.
2. Existing Tables To Preserve
The current workflow schema already has the right base tables:
- WF_INSTANCES
- WF_TASKS
- WF_TASK_EVENTS
- WF_RUNTIME_STATES
- WF_HOST_LOCKS
The current workflow database model is the mapping baseline for these tables.
2.1 WF_INSTANCES
Purpose:
- product-facing workflow instance summary
- instance business reference
- instance status
- product-facing state snapshot
2.2 WF_TASKS
Purpose:
- active and historical human task projections
- task routing
- assignment
- task payload
- effective roles
2.3 WF_TASK_EVENTS
Purpose:
- append-only task event history
- created, assigned, released, completed, reassigned events
2.4 WF_RUNTIME_STATES
Purpose:
- engine-owned durable runtime snapshot
This table becomes the main source of truth for engine resume.
3. Proposed Runtime State Extensions
WF_RUNTIME_STATES should be extended to support canonical engine execution directly.
Recommended new columns:
- STATE_VERSION: numeric optimistic concurrency version.
- SNAPSHOT_SCHEMA_VERSION: snapshot format version for engine evolution.
- WAITING_KIND: current wait type.
- WAITING_TOKEN: stale-signal guard token.
- WAITING_UNTIL_UTC: next due time when waiting on time-based resume.
- ACTIVE_TASK_ID: current active task projection id when applicable.
- RESUME_POINTER_JSON: serialized canonical resume pointer.
- LAST_SIGNAL_ID: last successfully processed signal id.
- LAST_ERROR_CODE: last engine error code.
- LAST_ERROR_JSON: structured last error details.
- LAST_EXECUTED_BY: node id that last committed execution.
- LAST_EXECUTED_ON_UTC: last successful engine commit timestamp.
The existing fields remain useful:
- workflow identity
- business reference
- runtime provider
- runtime instance id
- runtime status
- state json
- lifecycle timestamps
4. Snapshot Structure
STATE_JSON should hold a provider snapshot object for SerdicaEngine.
Recommended shape:
{
"engineSchemaVersion": 1,
"workflowState": {},
"businessReference": {
"key": "1200345",
"parts": {}
},
"status": "Open",
"waiting": {
"kind": "TaskCompletion",
"token": "wait-123",
"untilUtc": null
},
"resume": {
"entryPointKind": "TaskOnComplete",
"taskName": "ApproveApplication",
"branchPath": [],
"nextStepIndex": 3
},
"subWorkflowFrames": [],
"continuationBuffer": []
}
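The snapshot shape above can be mirrored by a small C# model for serialization. The following is a minimal sketch; the type and property names (RuntimeSnapshot, WaitingState, ResumePointer) are illustrative assumptions, not an existing engine API:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Illustrative model for the recommended STATE_JSON shape.
// Names are assumptions, not an existing engine API.
public sealed record BusinessReference(
    string Key,
    Dictionary<string, string> Parts);

public sealed record WaitingState(
    string Kind,          // e.g. "TaskCompletion"
    string? Token,        // stale-signal guard token
    DateTime? UntilUtc);  // only set for time-based waits

public sealed record ResumePointer(
    string EntryPointKind, // e.g. "TaskOnComplete"
    string? TaskName,
    List<string> BranchPath,
    int NextStepIndex);

public sealed record RuntimeSnapshot
{
    public required int EngineSchemaVersion { get; init; }
    public Dictionary<string, JsonElement> WorkflowState { get; init; } = new();
    public required BusinessReference BusinessReference { get; init; }
    public required string Status { get; init; }
    public required WaitingState Waiting { get; init; }
    public required ResumePointer Resume { get; init; }
    public List<JsonElement> SubWorkflowFrames { get; init; } = new();
    public List<JsonElement> ContinuationBuffer { get; init; } = new();
}
```

Keeping the waiting and resume data as structured fields (rather than opaque strings) lets the engine validate a snapshot before resuming from it.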
5. Oracle AQ Strategy
5.1 Why AQ
AQ is the default signaling backend because it gives:
- durable storage
- blocking dequeue
- delayed delivery
- database-managed recovery
- transactional semantics close to the runtime state store
5.2 Queue Topology
Use explicit, separately named queues for clarity and operational simplicity.
Recommended objects:
- WF_SIGNAL_QTAB and WF_SIGNAL_Q for immediate signals
- WF_SCHEDULE_QTAB and WF_SCHEDULE_Q for delayed signals
- WF_DLQ_QTAB and WF_DLQ_Q for dead-lettered messages
Rationale:
- immediate signals and delayed signals are operationally different
- dead-letter isolation matters for supportability
- queue separation makes metrics and troubleshooting simpler
5.3 Payload Format
Use a compact JSON envelope serialized to UTF-8 bytes in a RAW payload.
Reasons:
- simple from .NET
- explicit schema ownership in application code
- small message size
- backend abstraction remains possible later
Do not put full workflow snapshots into AQ messages.
6. Signal Envelope
Recommended envelope:
public sealed record WorkflowSignalEnvelope
{
public required string SignalId { get; init; }
public required string WorkflowInstanceId { get; init; }
public required string RuntimeProvider { get; init; }
public required string SignalType { get; init; }
public required long ExpectedVersion { get; init; }
public string? WaitingToken { get; init; }
public DateTime OccurredAtUtc { get; init; }
public DateTime? DueAtUtc { get; init; }
public Dictionary<string, JsonElement> Payload { get; init; } = [];
}
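Under the payload format from section 5.3, this envelope serializes to compact UTF-8 JSON bytes for the RAW payload. A minimal sketch follows; the serializer options and helper names are assumptions, not a fixed wire format:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// WorkflowSignalEnvelope as defined in section 6.
public sealed record WorkflowSignalEnvelope
{
    public required string SignalId { get; init; }
    public required string WorkflowInstanceId { get; init; }
    public required string RuntimeProvider { get; init; }
    public required string SignalType { get; init; }
    public required long ExpectedVersion { get; init; }
    public string? WaitingToken { get; init; }
    public DateTime OccurredAtUtc { get; init; }
    public DateTime? DueAtUtc { get; init; }
    public Dictionary<string, JsonElement> Payload { get; init; } = new();
}

public static class SignalEnvelopeSerializer
{
    // camelCase and no indentation keep the RAW payload small;
    // these choices are assumptions to be pinned down by the schema owner.
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase
    };

    public static byte[] ToRawPayload(WorkflowSignalEnvelope envelope)
        => JsonSerializer.SerializeToUtf8Bytes(envelope, Options);

    public static WorkflowSignalEnvelope FromRawPayload(byte[] raw)
        => JsonSerializer.Deserialize<WorkflowSignalEnvelope>(raw, Options)
           ?? throw new InvalidOperationException("Empty signal payload.");
}
```

Note the envelope carries only identity, routing, and guard data; per section 5.3, the full workflow snapshot stays in WF_RUNTIME_STATES, never in the message.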
Signal types:
- TaskCompleted
- TimerDue
- RetryDue
- ExternalSignal
- SubWorkflowCompleted
- InternalContinue
7. Transaction Model
7.1 Start Transaction
Start must durably commit:
- instance projection
- runtime snapshot
- task rows if any
- task events if any
- scheduled or immediate AQ messages if any
7.2 Completion Transaction
Task completion must durably commit:
- task completion event
- updated instance projection
- new task rows if any
- updated runtime snapshot
- any resulting AQ signals or schedules
7.3 Signal Resume Transaction
Signal resume must durably commit:
- AQ dequeue
- updated runtime snapshot
- resulting projection changes
- any next AQ signals
The intended operational model is:
- dequeue with transactional semantics
- update state and projections
- commit once
If commit fails, the signal must become visible again.
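The dequeue, update, commit-once sequence can be sketched against a hypothetical transaction abstraction. ISignalTransaction, SignalMessage, and ProcessOne are assumptions for illustration; a real implementation would sit on ODP.NET AQ dequeue and the Oracle state store inside one database transaction:

```csharp
using System;

// Hypothetical abstraction over one Oracle transaction spanning
// the AQ dequeue and the runtime-state write (section 7.3).
public interface ISignalTransaction : IDisposable
{
    // Transactional dequeue: the message stays invisible to other
    // consumers until Commit, and reappears on rollback.
    SignalMessage? Dequeue(TimeSpan wait);
    void SaveSnapshot(string instanceId, string newStateJson, long newVersion);
    void Commit();
    // Dispose without Commit rolls back: the signal becomes visible again.
}

public sealed record SignalMessage(
    string WorkflowInstanceId, long ExpectedVersion, string Body);

public static class SignalProcessor
{
    // Returns true when a signal was processed and durably committed.
    public static bool ProcessOne(
        ISignalTransaction tx,
        Func<SignalMessage, (string StateJson, long NewVersion)> execute)
    {
        var message = tx.Dequeue(TimeSpan.FromSeconds(30));
        if (message is null) return false;      // nothing due within the wait

        var result = execute(message);          // run the engine step in memory
        tx.SaveSnapshot(message.WorkflowInstanceId, result.StateJson, result.NewVersion);
        tx.Commit();                            // dequeue and state write commit together
        return true;
    }
}
```

The single Commit is the crux: if the process dies between Dequeue and Commit, the rollback makes the signal visible again, which is exactly the at-least-once behavior section 10 assumes.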
8. Blocking Receive Model
No polling loop should be used for work discovery.
Each node should run a signal pump that:
- opens one or more blocking AQ dequeue consumers
- waits on AQ rather than sleeping and scanning
- dispatches envelopes to bounded execution workers
Suggested parameters:
- dequeue wait seconds
- max concurrent handlers
- max poison retries
- dead-letter policy
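The parameters above could be captured in a pump options type. A sketch with assumed names and defaults, to be tuned per deployment:

```csharp
// Hypothetical signal-pump configuration; names and default values are
// assumptions for illustration, not an existing API.
public sealed record SignalPumpOptions
{
    public int DequeueWaitSeconds { get; init; } = 30;   // blocking AQ wait per dequeue call
    public int MaxConcurrentHandlers { get; init; } = 8; // bounded execution workers
    public int MaxPoisonRetries { get; init; } = 5;      // retries before dead-lettering
    public string DeadLetterQueueName { get; init; } = "WF_DLQ_Q";
}
```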
9. Scheduling Model
9.1 Scheduling Requirement
The engine must support timers without a periodic sweep job.
9.2 Scheduling Approach
When a workflow enters a timed wait:
- runtime snapshot is updated with:
- waiting kind
- waiting token
- due time
- a delayed AQ message is enqueued
- transaction commits
When the delayed message becomes available:
- a signal consumer dequeues it
- current snapshot is loaded
- waiting token is checked
- if token matches, resume
- if token does not match, ignore as stale
9.3 Logical Cancel Instead Of Physical Delete
The scheduler should treat cancel and reschedule logically.
Do not make correctness depend on deleting a queued timer message.
Instead:
- generate a new waiting token when schedule changes
- old delayed message becomes stale automatically
This is simpler and more reliable in distributed execution.
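The token comparison in 9.2 and the logical-cancel rule in 9.3 reduce to one pure check. A minimal sketch (the class and method names are illustrative):

```csharp
public static class TimerResumeGuard
{
    // A delayed message may resume the instance only if its waiting token
    // still matches the token in the current snapshot. Rescheduling writes
    // a new token, so older delayed messages fail this check and are
    // ignored as stale; no physical delete of the queued message is needed.
    public static bool ShouldResume(string? snapshotToken, string? messageToken)
        => snapshotToken is not null && snapshotToken == messageToken;
}
```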
10. Multi-Node Concurrency Model
The engine must assume multiple nodes can receive signals for the same workflow instance.
Correctness model:
- signal delivery is at-least-once
- snapshot update uses version control
- waiting token guards stale work
- duplicate resumes are safe
Recommended write model:
- read snapshot version
- execute
- update WF_RUNTIME_STATES where STATE_VERSION = expected
- if the update count is zero, abandon and retry or ignore as stale
This avoids permanent instance ownership.
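The guarded write can be expressed as a single conditional UPDATE over the columns proposed in section 3. A sketch; the INSTANCE_ID key column and the bind-variable names are assumptions:

```csharp
public static class RuntimeStateSql
{
    // Optimistic-concurrency write: succeeds only if no other node has
    // committed since this node read the snapshot. An affected-row count
    // of zero means the caller must abandon and retry, or ignore the
    // signal as stale.
    public const string UpdateSnapshot = """
        UPDATE WF_RUNTIME_STATES
           SET STATE_JSON = :stateJson,
               STATE_VERSION = :expectedVersion + 1,
               LAST_EXECUTED_BY = :nodeId,
               LAST_EXECUTED_ON_UTC = :nowUtc
         WHERE INSTANCE_ID = :instanceId
           AND STATE_VERSION = :expectedVersion
        """;
}
```

Because the version guard is in the WHERE clause, two nodes racing on the same instance cannot both commit; the loser observes zero affected rows and backs off.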
11. Restart And Recovery Semantics
11.1 One Node Down
Other nodes continue consuming AQ and processing instances.
11.2 All Nodes Down, Database Up
Signals remain durable in AQ.
When any node comes back:
- AQ consumers reconnect
- pending immediate and delayed signals are processed
- workflow resumes continue
11.3 Database Down
No execution can continue while Oracle is unavailable.
Once Oracle returns:
- AQ queues recover with the database
- runtime snapshots recover with the database
- resumed node consumers continue from durable state
11.4 All Nodes And Database Down
After Oracle returns and at least one application node starts:
- AQ messages are still present
- runtime state is still present
- due delayed messages can be consumed
- execution resumes from durable state
This is one of the main reasons Oracle AQ is preferred over a separate volatile wake-up layer.
12. Redis Position In V1
Redis is optional and not part of the correctness path.
It may be used later for:
- local cache
- non-authoritative wake hints
- metrics fanout
It should not be required for:
- durable signal delivery
- timer delivery
- restart recovery
13. Dead-Letter Strategy
Messages should move to DLQ when:
- deserialization fails
- definition is missing
- snapshot is irreparably inconsistent
- retry count exceeds threshold
DLQ entry should preserve:
- original envelope
- failure reason
- last node id
- failure timestamp
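The fields above map to a small dead-letter record. A sketch with assumed names:

```csharp
using System;

// Hypothetical dead-letter entry preserving the fields listed in section 13.
public sealed record DeadLetterEntry
{
    public required byte[] OriginalEnvelope { get; init; } // raw AQ payload bytes, kept verbatim
    public required string FailureReason { get; init; }    // e.g. "DeserializationFailed"
    public required string LastNodeId { get; init; }
    public required DateTime FailedAtUtc { get; init; }
}
```

Keeping the original envelope as raw bytes (rather than re-serializing it) matters when the failure was a deserialization error: support staff see exactly what arrived on the queue.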
14. Retention
Retention remains a service responsibility.
It should continue to clean:
- stale instances
- stale tasks
- completed data past purge window
- runtime states past purge window
AQ retention policy should be aligned with application retention and supportability needs, but queue cleanup must not delete active work.