Add StellaOps.Workflow engine: 14 libraries, WebService, 8 test projects
Extract product-agnostic workflow engine from Ablera.Serdica.Workflow into standalone StellaOps.Workflow.* libraries targeting net10.0.

Libraries (14):
- Contracts, Abstractions (compiler, decompiler, expression runtime)
- Engine (execution, signaling, scheduling, projections, hosted services)
- ElkSharp (generic graph layout algorithm)
- Renderer.ElkSharp, Renderer.ElkJs, Renderer.Msagl, Renderer.Svg
- Signaling.Redis, Signaling.OracleAq
- DataStore.MongoDB, DataStore.PostgreSQL, DataStore.Oracle

WebService: ASP.NET Core Minimal API with 22 endpoints

Tests (8 projects, 109 tests pass):
- Engine.Tests (105 pass), WebService.Tests (4 E2E pass)
- Renderer.Tests, DataStore.MongoDB/Oracle/PostgreSQL.Tests
- Signaling.Redis.Tests, IntegrationTests.Shared

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/workflow/engine/04-persistence-signaling-and-scheduling.md
# 04. Persistence, Signaling, And Scheduling

## 1. Persistence Strategy

Oracle is the single durable backend for v1.

It stores:

- workflow instance projections
- workflow task projections
- workflow task events
- workflow runtime snapshots
- hosted job locks
- AQ queues for immediate and delayed signals

This keeps correctness inside one transactional platform.
## 2. Existing Tables To Preserve

The current workflow schema already has the right base tables:

- `WF_INSTANCES`
- `WF_TASKS`
- `WF_TASK_EVENTS`
- `WF_RUNTIME_STATES`
- `WF_HOST_LOCKS`

The current workflow database model is the mapping baseline for these tables.

### 2.1 WF_INSTANCES
Purpose:

- product-facing workflow instance summary
- instance business reference
- instance status
- product-facing state snapshot

### 2.2 WF_TASKS

Purpose:

- active and historical human task projections
- task routing
- assignment
- task payload
- effective roles

### 2.3 WF_TASK_EVENTS

Purpose:

- append-only task event history
- created, assigned, released, completed, reassigned events

### 2.4 WF_RUNTIME_STATES

Purpose:

- engine-owned durable runtime snapshot

This table becomes the main source of truth for engine resume.
## 3. Proposed Runtime State Extensions

`WF_RUNTIME_STATES` should be extended to support canonical engine execution directly.

Recommended new columns:

- `STATE_VERSION`
  Numeric optimistic concurrency version.
- `SNAPSHOT_SCHEMA_VERSION`
  Snapshot format version for engine evolution.
- `WAITING_KIND`
  Current wait type.
- `WAITING_TOKEN`
  Stale-signal guard token.
- `WAITING_UNTIL_UTC`
  Next due time when waiting on time-based resume.
- `ACTIVE_TASK_ID`
  Current active task projection id when applicable.
- `RESUME_POINTER_JSON`
  Serialized canonical resume pointer.
- `LAST_SIGNAL_ID`
  Last successfully processed signal id.
- `LAST_ERROR_CODE`
  Last engine error code.
- `LAST_ERROR_JSON`
  Structured last error details.
- `LAST_EXECUTED_BY`
  Node id that last committed execution.
- `LAST_EXECUTED_ON_UTC`
  Last successful engine commit timestamp.
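
To make the shape concrete, a DDL sketch for these columns might look as follows; types, sizes, and defaults are assumptions to be aligned with the existing schema conventions, not final values:

```sql
-- Sketch only: column types and sizes are assumptions, not final DDL.
ALTER TABLE WF_RUNTIME_STATES ADD (
  STATE_VERSION           NUMBER(19)  DEFAULT 0 NOT NULL,
  SNAPSHOT_SCHEMA_VERSION NUMBER(9)   DEFAULT 1 NOT NULL,
  WAITING_KIND            VARCHAR2(64),
  WAITING_TOKEN           VARCHAR2(128),
  WAITING_UNTIL_UTC       TIMESTAMP,
  ACTIVE_TASK_ID          VARCHAR2(128),
  RESUME_POINTER_JSON     CLOB,
  LAST_SIGNAL_ID          VARCHAR2(128),
  LAST_ERROR_CODE         VARCHAR2(64),
  LAST_ERROR_JSON         CLOB,
  LAST_EXECUTED_BY        VARCHAR2(128),
  LAST_EXECUTED_ON_UTC    TIMESTAMP
);
```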

The existing fields remain useful:

- workflow identity
- business reference
- runtime provider
- runtime instance id
- runtime status
- state json
- lifecycle timestamps
## 4. Snapshot Structure

`STATE_JSON` should hold a provider snapshot object for `SerdicaEngine`.

Recommended shape:

```json
{
  "engineSchemaVersion": 1,
  "workflowState": {},
  "businessReference": {
    "key": "1200345",
    "parts": {}
  },
  "status": "Open",
  "waiting": {
    "kind": "TaskCompletion",
    "token": "wait-123",
    "untilUtc": null
  },
  "resume": {
    "entryPointKind": "TaskOnComplete",
    "taskName": "ApproveApplication",
    "branchPath": [],
    "nextStepIndex": 3
  },
  "subWorkflowFrames": [],
  "continuationBuffer": []
}
```

## 5. Oracle AQ Strategy

### 5.1 Why AQ

AQ is the default signaling backend because it gives:

- durable storage
- blocking dequeue
- delayed delivery
- database-managed recovery
- transactional semantics close to the runtime state store

### 5.2 Queue Topology

Use explicit queues for clarity and operations.

Recommended objects:

- `WF_SIGNAL_QTAB`
- `WF_SIGNAL_Q`
- `WF_SCHEDULE_QTAB`
- `WF_SCHEDULE_Q`
- `WF_DLQ_QTAB`
- `WF_DLQ_Q`

Rationale:

- immediate signals and delayed signals are operationally different
- dead-letter isolation matters for supportability
- queue separation makes metrics and troubleshooting simpler
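
These objects could be provisioned with `DBMS_AQADM`, roughly as below; the payload type and retry setting are assumptions, not final values:

```sql
-- Sketch only: payload type and retry limit are assumptions.
BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(
    queue_table        => 'WF_SIGNAL_QTAB',
    queue_payload_type => 'RAW');
  DBMS_AQADM.CREATE_QUEUE(
    queue_name  => 'WF_SIGNAL_Q',
    queue_table => 'WF_SIGNAL_QTAB',
    max_retries => 5);
  DBMS_AQADM.START_QUEUE(queue_name => 'WF_SIGNAL_Q');
  -- WF_SCHEDULE_Q and WF_DLQ_Q follow the same pattern.
END;
/
```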

### 5.3 Payload Format

Use a compact JSON envelope serialized to UTF-8 bytes in a `RAW` payload.

Reasons:

- simple from .NET
- explicit schema ownership in application code
- small message size
- backend abstraction remains possible later

Do not put full workflow snapshots into AQ messages.

## 6. Signal Envelope

Recommended envelope:

```csharp
using System.Text.Json; // JsonElement payload values

public sealed record WorkflowSignalEnvelope
{
    public required string SignalId { get; init; }
    public required string WorkflowInstanceId { get; init; }
    public required string RuntimeProvider { get; init; }
    public required string SignalType { get; init; }
    public required long ExpectedVersion { get; init; }
    public string? WaitingToken { get; init; }
    public DateTime OccurredAtUtc { get; init; }
    public DateTime? DueAtUtc { get; init; }
    public Dictionary<string, JsonElement> Payload { get; init; } = [];
}
```

Signal types:

- `TaskCompleted`
- `TimerDue`
- `RetryDue`
- `ExternalSignal`
- `SubWorkflowCompleted`
- `InternalContinue`

## 7. Transaction Model

### 7.1 Start Transaction

Start must durably commit:

- instance projection
- runtime snapshot
- task rows if any
- task events if any
- scheduled or immediate AQ messages if any

### 7.2 Completion Transaction

Task completion must durably commit:

- task completion event
- updated instance projection
- new task rows if any
- updated runtime snapshot
- any resulting AQ signals or schedules

### 7.3 Signal Resume Transaction

Signal resume must durably commit:

- AQ dequeue
- updated runtime snapshot
- resulting projection changes
- any next AQ signals

The intended operational model is:

- dequeue with transactional semantics
- update state and projections
- commit once

If commit fails, the signal must become visible again.
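
The one-commit model can be sketched in PL/SQL terms; the state and projection updates are elided, and the wait value is illustrative:

```sql
-- Sketch only: shows dequeue, state update, and enqueue under one commit.
DECLARE
  l_opts  DBMS_AQ.DEQUEUE_OPTIONS_T;
  l_props DBMS_AQ.MESSAGE_PROPERTIES_T;
  l_msgid RAW(16);
  l_body  RAW(2000);
BEGIN
  l_opts.visibility := DBMS_AQ.ON_COMMIT;  -- dequeue joins the transaction
  l_opts.wait       := 30;                 -- blocking wait, seconds
  DBMS_AQ.DEQUEUE(
    queue_name         => 'WF_SIGNAL_Q',
    dequeue_options    => l_opts,
    message_properties => l_props,
    payload            => l_body,
    msgid              => l_msgid);
  -- ... update WF_RUNTIME_STATES, projections, enqueue next signals ...
  COMMIT;  -- one commit covers dequeue, state, projections, new messages
EXCEPTION
  WHEN OTHERS THEN
    ROLLBACK;  -- the message becomes visible again for redelivery
    RAISE;
END;
/
```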

## 8. Blocking Receive Model

No polling loop should be used for work discovery.

Each node should run a signal pump that:

- opens one or more blocking AQ dequeue consumers
- waits on AQ rather than sleeping and scanning
- dispatches envelopes to bounded execution workers

Suggested parameters:

- dequeue wait seconds
- max concurrent handlers
- max poison retries
- dead-letter policy
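
A minimal C# shape for such a pump is sketched below; `ISignalConsumer`, `DequeueAsync`, and `HandleAsync` are hypothetical names standing in for the AQ dequeue wrapper and the engine dispatch call:

```csharp
// Sketch only: ISignalConsumer is a hypothetical AQ dequeue wrapper.
public sealed class SignalPump : BackgroundService
{
    private readonly ISignalConsumer _consumer;
    private readonly SemaphoreSlim _slots = new(8, 8); // max concurrent handlers

    public SignalPump(ISignalConsumer consumer) => _consumer = consumer;

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            // Blocks on AQ instead of sleeping and scanning tables.
            var envelope = await _consumer.DequeueAsync(ct);
            await _slots.WaitAsync(ct); // bounded execution workers
            _ = Task.Run(async () =>
            {
                try { await _consumer.HandleAsync(envelope, ct); }
                finally { _slots.Release(); }
            }, ct);
        }
    }
}
```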

## 9. Scheduling Model

### 9.1 Scheduling Requirement

The engine must support timers without a periodic sweep job.

### 9.2 Scheduling Approach

When a workflow enters a timed wait:

1. runtime snapshot is updated with:
   - waiting kind
   - waiting token
   - due time
2. a delayed AQ message is enqueued
3. transaction commits

When the delayed message becomes available:

1. a signal consumer dequeues it
2. current snapshot is loaded
3. waiting token is checked
4. if token matches, resume
5. if token does not match, ignore as stale
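
The token check in steps 3-5 can be sketched as a small guard, reusing the envelope from section 6; the method name is illustrative:

```csharp
// Sketch only: mirrors steps 3-5 above.
public static bool ShouldResume(
    WorkflowSignalEnvelope signal, string? currentWaitingToken)
{
    // A waiting-bound signal resumes the workflow only if the token it
    // carries still matches the token stored in WF_RUNTIME_STATES.
    // A reschedule writes a new token, so old delayed messages are stale.
    return signal.WaitingToken is not null
        && signal.WaitingToken == currentWaitingToken;
}
```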

### 9.3 Logical Cancel Instead Of Physical Delete

The scheduler should treat cancel and reschedule logically.

Do not make correctness depend on deleting a queued timer message.

Instead:

- generate a new waiting token when the schedule changes
- the old delayed message becomes stale automatically

This is simpler and more reliable in distributed execution.

## 10. Multi-Node Concurrency Model

The engine must assume multiple nodes can receive signals for the same workflow instance.

Correctness model:

- signal delivery is at-least-once
- snapshot update uses version control
- waiting token guards stale work
- duplicate resumes are safe

Recommended write model:

- read snapshot version
- execute
- update `WF_RUNTIME_STATES` where `STATE_VERSION = expected`
- if the update count is zero, abandon and retry or ignore as stale
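
The guarded write can be expressed as a single conditional update; bind names and the exact column list are illustrative:

```sql
-- Sketch only: zero rows updated means another node won the race.
UPDATE WF_RUNTIME_STATES
   SET STATE_JSON           = :new_snapshot,
       STATE_VERSION        = STATE_VERSION + 1,
       WAITING_TOKEN        = :new_token,
       LAST_EXECUTED_BY     = :node_id,
       LAST_EXECUTED_ON_UTC = SYS_EXTRACT_UTC(SYSTIMESTAMP)
 WHERE RUNTIME_INSTANCE_ID  = :instance_id
   AND STATE_VERSION        = :expected_version;
-- rowcount = 0 => abandon and retry, or ignore as stale
```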

This avoids permanent instance ownership.

## 11. Restart And Recovery Semantics

### 11.1 One Node Down

Other nodes continue consuming AQ and processing instances.

### 11.2 All Nodes Down, Database Up

Signals remain durable in AQ.

When any node comes back:

- AQ consumers reconnect
- pending immediate and delayed signals are processed
- workflow resumes continue

### 11.3 Database Down

No execution can continue while Oracle is unavailable.

Once Oracle returns:

- AQ queues recover with the database
- runtime snapshots recover with the database
- resumed node consumers continue from durable state

### 11.4 All Nodes And Database Down

After Oracle returns and at least one application node starts:

- AQ messages are still present
- runtime state is still present
- due delayed messages can be consumed
- execution resumes from durable state

This is one of the main reasons Oracle AQ is preferred over a separate volatile wake-up layer.

## 12. Redis Position In V1

Redis is optional and not part of the correctness path.

It may be used later for:

- local cache
- non-authoritative wake hints
- metrics fanout

It should not be required for:

- durable signal delivery
- timer delivery
- restart recovery

## 13. Dead-Letter Strategy

Messages should move to the DLQ when:

- deserialization fails
- the definition is missing
- the snapshot is irreparably inconsistent
- the retry count exceeds the threshold

A DLQ entry should preserve:

- the original envelope
- the failure reason
- the last node id
- the failure timestamp

## 14. Retention

Retention remains a service responsibility.

It should continue to clean:

- stale instances
- stale tasks
- completed data past the purge window
- runtime states past the purge window

AQ retention policy should be aligned with application retention and supportability needs, but queue cleanup must not delete active work.