# 04. Persistence, Signaling, And Scheduling

## 1. Persistence Strategy

Oracle is the single durable backend for v1.

It stores:

- workflow instance projections
- workflow task projections
- workflow task events
- workflow runtime snapshots
- hosted job locks
- AQ queues for immediate and delayed signals

This keeps correctness inside one transactional platform.

## 2. Existing Tables To Preserve

The current workflow schema already has the right base tables:

- `WF_INSTANCES`
- `WF_TASKS`
- `WF_TASK_EVENTS`
- `WF_RUNTIME_STATES`
- `WF_HOST_LOCKS`

The current workflow database model is the mapping baseline for these tables.

### 2.1 WF_INSTANCES

Purpose:

- product-facing workflow instance summary
- instance business reference
- instance status
- product-facing state snapshot

### 2.2 WF_TASKS

Purpose:

- active and historical human task projections
- task routing
- assignment
- task payload
- effective roles

### 2.3 WF_TASK_EVENTS

Purpose:

- append-only task event history
- created, assigned, released, completed, and reassigned events

### 2.4 WF_RUNTIME_STATES

Purpose:

- engine-owned durable runtime snapshot

This table becomes the main source of truth for engine resume.

## 3. Proposed Runtime State Extensions

`WF_RUNTIME_STATES` should be extended to support canonical engine execution directly.

Recommended new columns:

- `STATE_VERSION`
  Numeric optimistic concurrency version.
- `SNAPSHOT_SCHEMA_VERSION`
  Snapshot format version for engine evolution.
- `WAITING_KIND`
  Current wait type.
- `WAITING_TOKEN`
  Stale-signal guard token.
- `WAITING_UNTIL_UTC`
  Next due time when waiting on time-based resume.
- `ACTIVE_TASK_ID`
  Current active task projection id when applicable.
- `RESUME_POINTER_JSON`
  Serialized canonical resume pointer.
- `LAST_SIGNAL_ID`
  Last successfully processed signal id.
- `LAST_ERROR_CODE`
  Last engine error code.
- `LAST_ERROR_JSON`
  Structured last error details.
- `LAST_EXECUTED_BY`
  Node id that last committed execution.
- `LAST_EXECUTED_ON_UTC`
  Last successful engine commit timestamp.

The existing fields remain useful:

- workflow identity
- business reference
- runtime provider
- runtime instance id
- runtime status
- state json
- lifecycle timestamps

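For orientation, a minimal sketch of how the extended row could map onto a .NET type. The property-to-column pairings follow the lists above; the type name and mapping style are illustrative, not part of the schema proposal.

```csharp
using System;

// Hypothetical mapping of the extended WF_RUNTIME_STATES row.
public sealed record WorkflowRuntimeStateRow
{
    // Existing baseline fields (identity, reference, provider, status, state).
    public required string WorkflowInstanceId { get; init; }
    public required string BusinessReference { get; init; }
    public required string RuntimeProvider { get; init; }
    public required string RuntimeStatus { get; init; }
    public required string StateJson { get; init; }       // STATE_JSON

    // Proposed extensions.
    public long StateVersion { get; init; }               // STATE_VERSION
    public int SnapshotSchemaVersion { get; init; }       // SNAPSHOT_SCHEMA_VERSION
    public string? WaitingKind { get; init; }             // WAITING_KIND
    public string? WaitingToken { get; init; }            // WAITING_TOKEN
    public DateTime? WaitingUntilUtc { get; init; }       // WAITING_UNTIL_UTC
    public string? ActiveTaskId { get; init; }            // ACTIVE_TASK_ID
    public string? ResumePointerJson { get; init; }       // RESUME_POINTER_JSON
    public string? LastSignalId { get; init; }            // LAST_SIGNAL_ID
    public string? LastErrorCode { get; init; }           // LAST_ERROR_CODE
    public string? LastErrorJson { get; init; }           // LAST_ERROR_JSON
    public string? LastExecutedBy { get; init; }          // LAST_EXECUTED_BY
    public DateTime? LastExecutedOnUtc { get; init; }     // LAST_EXECUTED_ON_UTC
}
```
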
## 4. Snapshot Structure

`STATE_JSON` should hold a provider snapshot object for `SerdicaEngine`.

Recommended shape:

```json
{
  "engineSchemaVersion": 1,
  "workflowState": {},
  "businessReference": {
    "key": "1200345",
    "parts": {}
  },
  "status": "Open",
  "waiting": {
    "kind": "TaskCompletion",
    "token": "wait-123",
    "untilUtc": null
  },
  "resume": {
    "entryPointKind": "TaskOnComplete",
    "taskName": "ApproveApplication",
    "branchPath": [],
    "nextStepIndex": 3
  },
  "subWorkflowFrames": [],
  "continuationBuffer": []
}
```

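For readability alongside the engine code, a hypothetical C# mirror of this snapshot shape. The type names (`EngineSnapshot`, `WaitingState`, `ResumePointer`) follow the JSON keys and are illustrative, not the final contract:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical C# mirror of the STATE_JSON snapshot shown above.
public sealed record EngineSnapshot
{
    public int EngineSchemaVersion { get; init; } = 1;
    public Dictionary<string, JsonElement> WorkflowState { get; init; } = new();
    public required BusinessReference BusinessReference { get; init; }
    public required string Status { get; init; }
    public WaitingState? Waiting { get; init; }
    public ResumePointer? Resume { get; init; }
    public List<JsonElement> SubWorkflowFrames { get; init; } = new();
    public List<JsonElement> ContinuationBuffer { get; init; } = new();
}

public sealed record BusinessReference
{
    public required string Key { get; init; }
    public Dictionary<string, string> Parts { get; init; } = new();
}

public sealed record WaitingState
{
    public required string Kind { get; init; }   // e.g. "TaskCompletion"
    public required string Token { get; init; }  // stale-signal guard token
    public DateTime? UntilUtc { get; init; }
}

public sealed record ResumePointer
{
    public required string EntryPointKind { get; init; } // e.g. "TaskOnComplete"
    public string? TaskName { get; init; }
    public List<int> BranchPath { get; init; } = new();
    public int NextStepIndex { get; init; }
}
```
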
## 5. Oracle AQ Strategy

### 5.1 Why AQ

AQ is the default signaling backend because it provides:

- durable storage
- blocking dequeue
- delayed delivery
- database-managed recovery
- transactional semantics close to the runtime state store

### 5.2 Queue Topology

Use explicit, separate queues for clarity and operability.

Recommended objects:

- `WF_SIGNAL_QTAB`
- `WF_SIGNAL_Q`
- `WF_SCHEDULE_QTAB`
- `WF_SCHEDULE_Q`
- `WF_DLQ_QTAB`
- `WF_DLQ_Q`

Rationale:

- immediate signals and delayed signals are operationally different
- dead-letter isolation matters for supportability
- queue separation makes metrics and troubleshooting simpler

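A one-time setup sketch for these objects, assuming the standard `DBMS_AQADM` administration package executed from .NET; owning schema, storage options, and grants are deployment decisions left out here:

```csharp
using Oracle.ManagedDataAccess.Client;

public static class QueueSetup
{
    // Creates the signal queue objects via DBMS_AQADM; repeat the same
    // pattern for WF_SCHEDULE_QTAB / WF_SCHEDULE_Q and WF_DLQ_QTAB / WF_DLQ_Q.
    public static void CreateSignalQueue(OracleConnection connection)
    {
        const string plsql = """
            BEGIN
              DBMS_AQADM.CREATE_QUEUE_TABLE(
                queue_table        => 'WF_SIGNAL_QTAB',
                queue_payload_type => 'RAW');
              DBMS_AQADM.CREATE_QUEUE(
                queue_name  => 'WF_SIGNAL_Q',
                queue_table => 'WF_SIGNAL_QTAB');
              DBMS_AQADM.START_QUEUE(queue_name => 'WF_SIGNAL_Q');
            END;
            """;

        using var command = new OracleCommand(plsql, connection);
        command.ExecuteNonQuery();
    }
}
```
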
### 5.3 Payload Format

Use a compact JSON envelope serialized to UTF-8 bytes in a `RAW` payload.

Reasons:

- simple from .NET
- explicit schema ownership in application code
- small message size
- backend abstraction remains possible later

Do not put full workflow snapshots into AQ messages.

## 6. Signal Envelope

Recommended envelope:

```csharp
public sealed record WorkflowSignalEnvelope
{
    public required string SignalId { get; init; }
    public required string WorkflowInstanceId { get; init; }
    public required string RuntimeProvider { get; init; }
    public required string SignalType { get; init; }
    public required long ExpectedVersion { get; init; }
    public string? WaitingToken { get; init; }
    public DateTime OccurredAtUtc { get; init; }
    public DateTime? DueAtUtc { get; init; }
    public Dictionary<string, JsonElement> Payload { get; init; } = [];
}
```

Signal types:

- `TaskCompleted`
- `TimerDue`
- `RetryDue`
- `ExternalSignal`
- `SubWorkflowCompleted`
- `InternalContinue`

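A minimal enqueue sketch connecting this envelope to the `RAW` payload rule in section 5.3. It assumes the ODP.NET AQ types (`OracleAQQueue`, `OracleAQMessage`), available in recent driver versions; the exact surface may differ by version:

```csharp
using System.Text.Json;
using Oracle.ManagedDataAccess.Client;

public static class SignalEnqueuer
{
    // Serialize the envelope to UTF-8 JSON bytes and enqueue as a RAW payload.
    public static void EnqueueSignal(OracleConnection connection, WorkflowSignalEnvelope envelope)
    {
        byte[] payload = JsonSerializer.SerializeToUtf8Bytes(envelope);

        using var queue = new OracleAQQueue("WF_SIGNAL_Q", connection)
        {
            MessageType = OracleAQMessageType.Raw,
        };

        // OnCommit visibility keeps the enqueue inside the ambient transaction,
        // matching the transaction model in section 7.
        queue.EnqueueOptions.Visibility = OracleAQVisibilityMode.OnCommit;
        queue.Enqueue(new OracleAQMessage(payload));
    }
}
```
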
## 7. Transaction Model

### 7.1 Start Transaction

Start must durably commit:

- instance projection
- runtime snapshot
- task rows if any
- task events if any
- scheduled or immediate AQ messages if any

### 7.2 Completion Transaction

Task completion must durably commit:

- task completion event
- updated instance projection
- new task rows if any
- updated runtime snapshot
- any resulting AQ signals or schedules

### 7.3 Signal Resume Transaction

Signal resume must durably commit:

- AQ dequeue
- updated runtime snapshot
- resulting projection changes
- any next AQ signals

The intended operational model is:

- dequeue with transactional semantics
- update state and projections
- commit once

If commit fails, the signal must become visible again.

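A sketch of that single-commit shape, again assuming the ODP.NET AQ types; the `applySignal` step standing in for snapshot and projection updates is illustrative:

```csharp
using System;
using System.Text.Json;
using Oracle.ManagedDataAccess.Client;

public static class SignalConsumer
{
    public static void ConsumeOneSignal(
        OracleConnection connection,
        Action<OracleConnection, WorkflowSignalEnvelope> applySignal)
    {
        using var transaction = connection.BeginTransaction();

        using var queue = new OracleAQQueue("WF_SIGNAL_Q", connection)
        {
            MessageType = OracleAQMessageType.Raw,
        };

        // OnCommit visibility ties the dequeue to the surrounding transaction:
        // if the commit below never happens, the message becomes visible again.
        queue.DequeueOptions.Visibility = OracleAQVisibilityMode.OnCommit;
        queue.DequeueOptions.Wait = 30; // block up to 30 seconds for a message

        OracleAQMessage message = queue.Dequeue(); // throws on wait timeout
        var envelope = JsonSerializer.Deserialize<WorkflowSignalEnvelope>(
            (byte[])message.Payload)!; // RAW payloads surface as byte[]

        // Snapshot and projection updates run on the same connection,
        // inside the same transaction as the dequeue.
        applySignal(connection, envelope);

        transaction.Commit(); // one commit covers dequeue + state + projections
    }
}
```
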
## 8. Blocking Receive Model

No polling loop should be used for work discovery.

Each node should run a signal pump that:

- opens one or more blocking AQ dequeue consumers
- waits on AQ rather than sleeping and scanning
- dispatches envelopes to bounded execution workers

Suggested parameters:

- dequeue wait seconds
- max concurrent handlers
- max poison retries
- dead-letter policy

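A minimal pump sketch under these assumptions, built on the standard `BackgroundService` with a `SemaphoreSlim` as the worker bound; `DequeueAsync` and `HandleAsync` are placeholders for the AQ consumer and the resume transaction:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public sealed class SignalPump : BackgroundService
{
    private readonly SemaphoreSlim _workers;

    public SignalPump(int maxConcurrentHandlers = 8)
        => _workers = new SemaphoreSlim(maxConcurrentHandlers);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // The blocking dequeue replaces sleep-and-scan polling: the call
            // waits on AQ for up to the configured dequeue wait time.
            WorkflowSignalEnvelope? envelope = await DequeueAsync(stoppingToken);
            if (envelope is null) continue; // wait timed out; loop and wait again

            await _workers.WaitAsync(stoppingToken); // bound handler concurrency
            _ = Task.Run(async () =>
            {
                try { await HandleAsync(envelope, stoppingToken); }
                finally { _workers.Release(); }
            }, stoppingToken);
        }
    }

    // Placeholder for the blocking AQ dequeue consumer (see section 7.3).
    private Task<WorkflowSignalEnvelope?> DequeueAsync(CancellationToken ct)
        => throw new NotImplementedException();

    // Placeholder for the resume transaction, including poison-retry and
    // dead-letter handling (sections 7.3 and 13).
    private Task HandleAsync(WorkflowSignalEnvelope envelope, CancellationToken ct)
        => throw new NotImplementedException();
}
```
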
## 9. Scheduling Model

### 9.1 Scheduling Requirement

The engine must support timers without a periodic sweep job.

### 9.2 Scheduling Approach

When a workflow enters a timed wait:

1. the runtime snapshot is updated with:
   - waiting kind
   - waiting token
   - due time
2. a delayed AQ message is enqueued
3. the transaction commits

When the delayed message becomes available:

1. a signal consumer dequeues it
2. the current snapshot is loaded
3. the waiting token is checked
4. if the token matches, resume
5. if the token does not match, ignore as stale

### 9.3 Logical Cancel Instead Of Physical Delete

The scheduler should treat cancel and reschedule logically.

Do not make correctness depend on deleting a queued timer message.

Instead:

- generate a new waiting token when the schedule changes
- the old delayed message becomes stale automatically

This is simpler and more reliable in distributed execution.

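A sketch of the token mechanics, reusing the hypothetical envelope and row types from earlier sections; `OracleAQMessage.Delay` is assumed here to carry the per-message AQ delay in seconds:

```csharp
using System;
using System.Text.Json;
using Oracle.ManagedDataAccess.Client;

public static class TimerScheduling
{
    // Enqueue a delayed timer signal; AQ keeps it invisible until due.
    public static void ScheduleTimer(
        OracleAQQueue scheduleQueue, WorkflowSignalEnvelope envelope, TimeSpan delay)
    {
        var message = new OracleAQMessage(JsonSerializer.SerializeToUtf8Bytes(envelope))
        {
            Delay = (int)delay.TotalSeconds, // assumed per-message AQ delay, in seconds
        };
        scheduleQueue.Enqueue(message);
    }

    // Stale guard: a timer only resumes the instance when its token still
    // matches the snapshot. Rescheduling writes a new token, so the old
    // delayed message fails this check and is ignored; no delete needed.
    public static bool IsCurrentTimer(WorkflowRuntimeStateRow snapshot, WorkflowSignalEnvelope signal)
        => snapshot.WaitingToken is not null
           && snapshot.WaitingToken == signal.WaitingToken;
}
```
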
## 10. Multi-Node Concurrency Model

The engine must assume multiple nodes can receive signals for the same workflow instance.

Correctness model:

- signal delivery is at-least-once
- snapshot updates use optimistic version checks
- the waiting token guards stale work
- duplicate resumes are safe

Recommended write model:

- read the snapshot version
- execute
- update `WF_RUNTIME_STATES` where `STATE_VERSION = expected`
- if the update count is zero, abandon and retry or ignore as stale

This avoids permanent instance ownership.

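A sketch of that write model as a guarded `UPDATE` through ODP.NET; the key column name (`RUNTIME_INSTANCE_ID`) is illustrative, the version predicate is the point:

```csharp
using Oracle.ManagedDataAccess.Client;

public static class SnapshotWriter
{
    // Guarded UPDATE: succeeds only when STATE_VERSION still has the value
    // read before execution.
    public static bool TryCommitSnapshot(
        OracleConnection connection,
        string instanceId,
        long expectedVersion,
        string newStateJson)
    {
        const string sql = """
            UPDATE WF_RUNTIME_STATES
               SET STATE_JSON    = :stateJson,
                   STATE_VERSION = STATE_VERSION + 1
             WHERE RUNTIME_INSTANCE_ID = :instanceId
               AND STATE_VERSION       = :expectedVersion
            """;

        using var command = new OracleCommand(sql, connection);
        command.BindByName = true;
        command.Parameters.Add("stateJson", newStateJson);
        command.Parameters.Add("instanceId", instanceId);
        command.Parameters.Add("expectedVersion", expectedVersion);

        // Zero rows updated means another node committed first:
        // abandon and retry, or ignore the signal as stale.
        return command.ExecuteNonQuery() == 1;
    }
}
```
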
## 11. Restart And Recovery Semantics

### 11.1 One Node Down

Other nodes continue consuming AQ and processing instances.

### 11.2 All Nodes Down, Database Up

Signals remain durable in AQ.

When any node comes back:

- AQ consumers reconnect
- pending immediate and delayed signals are processed
- workflow resumes continue

### 11.3 Database Down

No execution can continue while Oracle is unavailable.

Once Oracle returns:

- AQ queues recover with the database
- runtime snapshots recover with the database
- resumed node consumers continue from durable state

### 11.4 All Nodes And Database Down

After Oracle returns and at least one application node starts:

- AQ messages are still present
- runtime state is still present
- due delayed messages can be consumed
- execution resumes from durable state

This is one of the main reasons Oracle AQ is preferred over a separate volatile wake-up layer.

## 12. Redis Position In V1

Redis is optional and not part of the correctness path.

It may be used later for:

- local cache
- non-authoritative wake hints
- metrics fanout

It should not be required for:

- durable signal delivery
- timer delivery
- restart recovery

## 13. Dead-Letter Strategy

Messages should move to DLQ when:

- deserialization fails
- definition is missing
- snapshot is irreparably inconsistent
- retry count exceeds threshold

DLQ entry should preserve:

- original envelope
- failure reason
- last node id
- failure timestamp

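A hypothetical record shape preserving those fields:

```csharp
using System;

// Hypothetical dead-letter entry; field names are illustrative.
public sealed record DeadLetterEntry
{
    public required WorkflowSignalEnvelope OriginalEnvelope { get; init; }
    public required string FailureReason { get; init; }   // e.g. "DeserializationFailed"
    public required string LastNodeId { get; init; }
    public required DateTime FailedAtUtc { get; init; }
    public int RetryCount { get; init; }
}
```
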
## 14. Retention

Retention remains a service responsibility.

It should continue to clean:

- stale instances
- stale tasks
- completed data past purge window
- runtime states past purge window

AQ retention policy should be aligned with application retention and supportability needs, but queue cleanup must not delete active work.