git.stella-ops.org/docs/workflow/engine/04-persistence-signaling-and-scheduling.md

04. Persistence, Signaling, And Scheduling

1. Persistence Strategy

Oracle is the single durable backend for v1.

It stores:

  • workflow instance projections
  • workflow task projections
  • workflow task events
  • workflow runtime snapshots
  • hosted job locks
  • AQ queues for immediate and delayed signals

This keeps correctness inside one transactional platform.

2. Existing Tables To Preserve

The current workflow schema already has the right base tables:

  • WF_INSTANCES
  • WF_TASKS
  • WF_TASK_EVENTS
  • WF_RUNTIME_STATES
  • WF_HOST_LOCKS

The current workflow database model is the mapping baseline for these tables.

2.1 WF_INSTANCES

Purpose:

  • product-facing workflow instance summary
  • instance business reference
  • instance status
  • product-facing state snapshot

2.2 WF_TASKS

Purpose:

  • active and historical human task projections
  • task routing
  • assignment
  • task payload
  • effective roles

2.3 WF_TASK_EVENTS

Purpose:

  • append-only task event history
  • created, assigned, released, completed, reassigned events

2.4 WF_RUNTIME_STATES

Purpose:

  • engine-owned durable runtime snapshot

This table becomes the main source of truth for engine resume.

3. Proposed Runtime State Extensions

WF_RUNTIME_STATES should be extended to support canonical engine execution directly.

Recommended new columns:

  • STATE_VERSION: numeric optimistic concurrency version.
  • SNAPSHOT_SCHEMA_VERSION: snapshot format version for engine evolution.
  • WAITING_KIND: current wait type.
  • WAITING_TOKEN: stale-signal guard token.
  • WAITING_UNTIL_UTC: next due time when waiting on a time-based resume.
  • ACTIVE_TASK_ID: current active task projection id, when applicable.
  • RESUME_POINTER_JSON: serialized canonical resume pointer.
  • LAST_SIGNAL_ID: last successfully processed signal id.
  • LAST_ERROR_CODE: last engine error code.
  • LAST_ERROR_JSON: structured details of the last error.
  • LAST_EXECUTED_BY: node id that last committed execution.
  • LAST_EXECUTED_ON_UTC: timestamp of the last successful engine commit.

The existing fields remain useful:

  • workflow identity
  • business reference
  • runtime provider
  • runtime instance id
  • runtime status
  • state json
  • lifecycle timestamps
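
The column additions above could be sketched as Oracle DDL. This is a sketch only: the types, sizes, and defaults are assumptions to be aligned with the existing schema conventions.

```sql
-- Sketch only: types, sizes, and defaults are assumptions.
ALTER TABLE WF_RUNTIME_STATES ADD (
  STATE_VERSION            NUMBER       DEFAULT 0 NOT NULL,
  SNAPSHOT_SCHEMA_VERSION  NUMBER       DEFAULT 1 NOT NULL,
  WAITING_KIND             VARCHAR2(64),
  WAITING_TOKEN            VARCHAR2(128),
  WAITING_UNTIL_UTC        TIMESTAMP,
  ACTIVE_TASK_ID           VARCHAR2(64),
  RESUME_POINTER_JSON      CLOB,
  LAST_SIGNAL_ID           VARCHAR2(128),
  LAST_ERROR_CODE          VARCHAR2(64),
  LAST_ERROR_JSON          CLOB,
  LAST_EXECUTED_BY         VARCHAR2(128),
  LAST_EXECUTED_ON_UTC     TIMESTAMP
);
```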

4. Snapshot Structure

STATE_JSON should hold a provider snapshot object for SerdicaEngine.

Recommended shape:

{
  "engineSchemaVersion": 1,
  "workflowState": {},
  "businessReference": {
    "key": "1200345",
    "parts": {}
  },
  "status": "Open",
  "waiting": {
    "kind": "TaskCompletion",
    "token": "wait-123",
    "untilUtc": null
  },
  "resume": {
    "entryPointKind": "TaskOnComplete",
    "taskName": "ApproveApplication",
    "branchPath": [],
    "nextStepIndex": 3
  },
  "subWorkflowFrames": [],
  "continuationBuffer": []
}

5. Oracle AQ Strategy

5.1 Why AQ

AQ is the default signaling backend because it gives:

  • durable storage
  • blocking dequeue
  • delayed delivery
  • database-managed recovery
  • transactional semantics close to the runtime state store

5.2 Queue Topology

Use explicit, separate queues for operational clarity.

Recommended objects:

  • WF_SIGNAL_QTAB
  • WF_SIGNAL_Q
  • WF_SCHEDULE_QTAB
  • WF_SCHEDULE_Q
  • WF_DLQ_QTAB
  • WF_DLQ_Q

Rationale:

  • immediate signals and delayed signals are operationally different
  • dead-letter isolation matters for supportability
  • queue separation makes metrics and troubleshooting simpler
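
Under this topology, queue provisioning could look like the following DBMS_AQADM sketch. The RAW payload type matches section 5.3; everything beyond the queue names is an assumption.

```sql
-- Sketch only: queue names from this document; all options beyond the names
-- and the RAW payload type are assumptions.
BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(
    queue_table        => 'WF_SIGNAL_QTAB',
    queue_payload_type => 'RAW');
  DBMS_AQADM.CREATE_QUEUE(
    queue_name  => 'WF_SIGNAL_Q',
    queue_table => 'WF_SIGNAL_QTAB');
  DBMS_AQADM.START_QUEUE(queue_name => 'WF_SIGNAL_Q');
  -- Repeat for WF_SCHEDULE_Q / WF_SCHEDULE_QTAB and WF_DLQ_Q / WF_DLQ_QTAB.
END;
/
```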

5.3 Payload Format

Use a compact JSON envelope serialized to UTF-8 bytes in a RAW payload.

Reasons:

  • simple from .NET
  • explicit schema ownership in application code
  • small message size
  • backend abstraction remains possible later

Do not put full workflow snapshots into AQ messages.
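
As a minimal sketch of the envelope-to-RAW convention, assuming System.Text.Json:

```csharp
// Sketch only: the JSON envelope is serialized to UTF-8 bytes for the RAW
// payload, and deserialized back on dequeue.
using System.Text.Json;

static byte[] SerializeEnvelope<T>(T envelope) =>
    JsonSerializer.SerializeToUtf8Bytes(envelope);

static T DeserializeEnvelope<T>(byte[] raw) =>
    JsonSerializer.Deserialize<T>(raw)!;
```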

6. Signal Envelope

Recommended envelope:

public sealed record WorkflowSignalEnvelope
{
    public required string SignalId { get; init; }
    public required string WorkflowInstanceId { get; init; }
    public required string RuntimeProvider { get; init; }
    public required string SignalType { get; init; }
    public required long ExpectedVersion { get; init; }
    public string? WaitingToken { get; init; }
    public DateTime OccurredAtUtc { get; init; }
    public DateTime? DueAtUtc { get; init; }
    public Dictionary<string, JsonElement> Payload { get; init; } = new();
}

Signal types:

  • TaskCompleted
  • TimerDue
  • RetryDue
  • ExternalSignal
  • SubWorkflowCompleted
  • InternalContinue

7. Transaction Model

7.1 Start Transaction

Start must durably commit:

  • instance projection
  • runtime snapshot
  • task rows if any
  • task events if any
  • scheduled or immediate AQ messages if any

7.2 Completion Transaction

Task completion must durably commit:

  • task completion event
  • updated instance projection
  • new task rows if any
  • updated runtime snapshot
  • any resulting AQ signals or schedules

7.3 Signal Resume Transaction

Signal resume must durably commit:

  • AQ dequeue
  • updated runtime snapshot
  • resulting projection changes
  • any next AQ signals

The intended operational model is:

  • dequeue with transactional semantics
  • update state and projections
  • commit once

If commit fails, the signal must become visible again.
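
The dequeue-update-commit shape could be sketched as follows, assuming an ODP.NET build that exposes the AQ classes (OracleAQQueue and friends); the engine step and projection updates are elided. The point is that the dequeue shares the transaction with the state write.

```csharp
// Sketch only: assumes ODP.NET with AQ support; error handling elided.
using Oracle.DataAccess.Client;

static void ConsumeOneSignal(OracleConnection conn)
{
    using var tx = conn.BeginTransaction();

    var queue = new OracleAQQueue("WF_SIGNAL_Q", conn)
    {
        MessageType = OracleAQMessageType.Raw,
    };
    queue.DequeueOptions.Visibility = OracleAQVisibilityMode.OnCommit; // dequeue joins the tx
    queue.DequeueOptions.Wait = 30;                                    // blocking wait, seconds

    var message = queue.Dequeue();     // blocks until a message arrives or the wait expires
    // ... deserialize the envelope, run the engine step,
    // ... update WF_RUNTIME_STATES and projections on the same connection.

    tx.Commit();                       // state, projections, and dequeue commit together
    // If Commit is never reached, rollback makes the signal visible again.
}
```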

8. Blocking Receive Model

No polling loop should be used for work discovery.

Each node should run a signal pump that:

  • opens one or more blocking AQ dequeue consumers
  • waits on AQ rather than sleeping and scanning
  • dispatches envelopes to bounded execution workers

Suggested parameters:

  • dequeue wait seconds
  • max concurrent handlers
  • max poison retries
  • dead-letter policy
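
A minimal pump shape might look like this; blockingDequeueAsync and handleAsync are hypothetical stand-ins for the AQ consumer and the engine dispatcher, and the parameters above map onto the wait timeout and the semaphore bound.

```csharp
// Sketch only: a generic pump shape with a blocking dequeue and bounded workers.
using System;
using System.Threading;
using System.Threading.Tasks;

sealed class SignalPump(
    Func<TimeSpan, CancellationToken, Task<WorkflowSignalEnvelope?>> blockingDequeueAsync,
    Func<WorkflowSignalEnvelope, CancellationToken, Task> handleAsync,
    int maxConcurrentHandlers)
{
    private readonly SemaphoreSlim _slots = new(maxConcurrentHandlers);

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            // Block on AQ instead of sleeping and scanning tables.
            var envelope = await blockingDequeueAsync(TimeSpan.FromSeconds(30), ct);
            if (envelope is null) continue; // dequeue wait expired; wait again

            await _slots.WaitAsync(ct);     // bound concurrent execution workers
            _ = Task.Run(async () =>
            {
                try { await handleAsync(envelope, ct); }
                finally { _slots.Release(); }
            }, ct);
        }
    }
}
```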

9. Scheduling Model

9.1 Scheduling Requirement

The engine must support timers without a periodic sweep job.

9.2 Scheduling Approach

When a workflow enters a timed wait:

  1. runtime snapshot is updated with:
    • waiting kind
    • waiting token
    • due time
  2. a delayed AQ message is enqueued
  3. transaction commits

When the delayed message becomes available:

  1. a signal consumer dequeues it
  2. current snapshot is loaded
  3. waiting token is checked
  4. if token matches, resume
  5. if token does not match, ignore as stale
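
The stale-token guard in steps 3 to 5 can be sketched as:

```csharp
// Sketch only: a reschedule writes a new token into the snapshot, so an old
// delayed message no longer matches and is ignored as stale.
static bool ShouldResume(WorkflowSignalEnvelope signal, string? currentWaitingToken) =>
    signal.WaitingToken is not null
    && signal.WaitingToken == currentWaitingToken;
```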

9.3 Logical Cancel Instead Of Physical Delete

The scheduler should treat cancel and reschedule logically.

Do not make correctness depend on deleting a queued timer message.

Instead:

  • generate a new waiting token when schedule changes
  • old delayed message becomes stale automatically

This is simpler and more reliable in distributed execution.

10. Multi-Node Concurrency Model

The engine must assume multiple nodes can receive signals for the same workflow instance.

Correctness model:

  • signal delivery is at-least-once
  • snapshot update uses version control
  • waiting token guards stale work
  • duplicate resumes are safe

Recommended write model:

  • read snapshot version
  • execute
  • update WF_RUNTIME_STATES where STATE_VERSION = expected
  • if update count is zero, abandon and retry or ignore as stale

This avoids permanent instance ownership.
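
The guarded update could be sketched as below; the bind names and the RUNTIME_INSTANCE_ID column name are illustrative, to be mapped onto the actual WF_RUNTIME_STATES columns.

```sql
-- Sketch only: bind names and the instance-id column name are illustrative.
UPDATE WF_RUNTIME_STATES
   SET STATE_JSON           = :newStateJson,
       STATE_VERSION        = STATE_VERSION + 1,
       LAST_SIGNAL_ID       = :signalId,
       LAST_EXECUTED_BY     = :nodeId,
       LAST_EXECUTED_ON_UTC = SYS_EXTRACT_UTC(SYSTIMESTAMP)
 WHERE RUNTIME_INSTANCE_ID  = :instanceId
   AND STATE_VERSION        = :expectedVersion;
-- Zero rows updated means another node won; abandon and retry, or ignore as stale.
```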

11. Restart And Recovery Semantics

11.1 One Node Down

Other nodes continue consuming AQ and processing instances.

11.2 All Nodes Down, Database Up

Signals remain durable in AQ.

When any node comes back:

  • AQ consumers reconnect
  • pending immediate and delayed signals are processed
  • workflow resumes continue

11.3 Database Down

No execution can continue while Oracle is unavailable.

Once Oracle returns:

  • AQ queues recover with the database
  • runtime snapshots recover with the database
  • resumed node consumers continue from durable state

11.4 All Nodes And Database Down

After Oracle returns and at least one application node starts:

  • AQ messages are still present
  • runtime state is still present
  • due delayed messages can be consumed
  • execution resumes from durable state

This is one of the main reasons Oracle AQ is preferred over a separate volatile wake-up layer.

12. Redis Position In V1

Redis is optional and not part of the correctness path.

It may be used later for:

  • local cache
  • non-authoritative wake hints
  • metrics fanout

It should not be required for:

  • durable signal delivery
  • timer delivery
  • restart recovery

13. Dead-Letter Strategy

Messages should move to DLQ when:

  • deserialization fails
  • definition is missing
  • snapshot is irreparably inconsistent
  • retry count exceeds threshold

DLQ entry should preserve:

  • original envelope
  • failure reason
  • last node id
  • failure timestamp
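
One possible shape for such a DLQ entry, mirroring the preserved fields above; the type name is illustrative.

```csharp
// Sketch only: a hypothetical record mirroring the preserved DLQ fields.
using System;

public sealed record DeadLetterEntry
{
    public required WorkflowSignalEnvelope OriginalEnvelope { get; init; }
    public required string FailureReason { get; init; }
    public required string LastNodeId { get; init; }
    public required DateTime FailedAtUtc { get; init; }
}
```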

14. Retention

Retention remains a service responsibility.

It should continue to clean:

  • stale instances
  • stale tasks
  • completed data past purge window
  • runtime states past purge window

AQ retention policy should be aligned with application retention and supportability needs, but queue cleanup must not delete active work.