git.stella-ops.org/docs/workflow/engine/09-backend-portability-plan.md
master f5b5f24d95 Add StellaOps.Workflow engine: 14 libraries, WebService, 8 test projects
Extract product-agnostic workflow engine from Ablera.Serdica.Workflow into
standalone StellaOps.Workflow.* libraries targeting net10.0.

Libraries (14):
- Contracts, Abstractions (compiler, decompiler, expression runtime)
- Engine (execution, signaling, scheduling, projections, hosted services)
- ElkSharp (generic graph layout algorithm)
- Renderer.ElkSharp, Renderer.ElkJs, Renderer.Msagl, Renderer.Svg
- Signaling.Redis, Signaling.OracleAq
- DataStore.MongoDB, DataStore.PostgreSQL, DataStore.Oracle

WebService: ASP.NET Core Minimal API with 22 endpoints

Tests (8 projects, 109 tests pass):
- Engine.Tests (105 pass), WebService.Tests (4 E2E pass)
- Renderer.Tests, DataStore.MongoDB/Oracle/PostgreSQL.Tests
- Signaling.Redis.Tests, IntegrationTests.Shared

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 19:14:44 +02:00

09. Backend Portability Plan

Purpose

This document defines how SerdicaEngine should evolve from an Oracle-first runtime into a backend-switchable engine that can also run on PostgreSQL and MongoDB without changing workflow declarations, canonical definitions, or runtime semantics.

The goal is not to support every backend in the same way internally.

The goal is to preserve one stable engine contract:

  • the same declarative workflow classes
  • the same canonical runtime definitions
  • the same public workflow/task APIs
  • the same runtime behavior around tasks, waits, timers, external signals, subworkflows, retries, and retention

Backend switching must only change infrastructure adapters and host configuration.

Current Baseline

Today the strongest backend shape is Oracle:

  • runtime state persists in an Oracle-backed runtime-state adapter
  • projections persist in an Oracle-backed projection adapter
  • immediate signaling and delayed scheduling run through Oracle AQ adapters
  • the engine host composes those adapters through backend registration

Oracle is the reference implementation because it already gives:

  • one durable database
  • durable queueing
  • delayed delivery
  • blocking dequeue without polling
  • transactional coupling between state mutation and queue enqueue

That reference point matters because PostgreSQL and MongoDB must match the engine contract even if they reach it through different infrastructure mechanisms.

Non-Negotiable Product Rules

Backend portability must not break these rules:

  1. Authored workflow classes remain pure declaration classes.
  2. Canonical runtime definitions remain backend-agnostic.
  3. Engine execution remains run-to-wait.
  4. Multi-instance deployment remains supported.
  5. Steady-state signal and timer discovery must not rely on polling loops.
  6. Signal delivery remains at-least-once.
  7. Resume remains idempotent through version and waiting-token checks.
  8. Public API contracts and projections remain stable.
  9. Operational features remain available:
    • signal raise
    • dead-letter inspection
    • dead-letter replay
    • runtime inspection
    • retention
    • diagram inspection

Architecture Principle

Do not make the engine "database-agnostic" by hiding everything behind one giant repository.

That approach will collapse important guarantees.

Instead, separate the backend into explicit capabilities:

  1. runtime state persistence
  2. projection persistence
  3. signal transport
  4. schedule transport
  5. mutation transaction boundary
  6. wake-up notification strategy
  7. lease or concurrency strategy
  8. dead-letter and replay strategy
  9. retention and purge strategy

Each backend implementation must satisfy the full capability matrix.

Implemented Signal Driver Split

The engine now separates durable signal ownership from wake-up delivery.

The shared seam is defined by engine signal-driver abstractions plus signal and schedule bridge contracts.

That split exists to preserve transactional correctness while still allowing faster wake strategies later.

The separation is:

  • IWorkflowSignalStore: durable immediate signal persistence
  • IWorkflowSignalDriver: wake-up and claim path for available signals
  • IWorkflowSignalScheduler: durable delayed-signal persistence
  • IWorkflowWakeOutbox: deferred wake publication when the driver is not transaction-coupled to the durable store

The public engine surface still uses:

  • IWorkflowSignalBus
  • IWorkflowScheduleBus

Those are now bridge contracts.

They do not define backend mechanics directly.
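
A minimal C# sketch of this seam, assuming illustrative member signatures (the interface names come from the list above; the method shapes are not the actual engine API):

```csharp
// Illustrative shapes only: interface names are from the document,
// but these member signatures are assumptions.
public interface IWorkflowSignalStore
{
    // Persist an immediate signal inside the backend mutation boundary.
    Task PersistAsync(WorkflowSignalEnvelope signal, CancellationToken ct = default);
}

public interface IWorkflowSignalDriver
{
    // Block until a signal may be available, then claim it from the durable store.
    // The driver never owns durability; it only wakes and claims.
    Task<WorkflowSignalEnvelope?> WaitAndClaimAsync(CancellationToken ct = default);
}
```

The same pattern applies to IWorkflowSignalScheduler and IWorkflowWakeOutbox: durable persistence on one side of the seam, wake and claim mechanics on the other.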

Current Backend Matrix

| Backend profile | Durable signal store | Wake driver | Schedule store | Dispatch mode |
| --- | --- | --- | --- | --- |
| Oracle | Oracle AQ signal adapter | Oracle AQ blocking dequeue | Oracle AQ schedule adapter | NativeTransactional |
| PostgreSQL | PostgreSQL durable signal store | PostgreSQL native wake or claim adapter | PostgreSQL durable schedule store | NativeTransactional |
| MongoDB | MongoDB durable signal store | MongoDB change-stream wake or claim adapter | MongoDB durable schedule store | NativeTransactional |

Implemented Optional Redis Wake Driver

The Redis driver is implemented as a separate wake-driver plugin.

Its shape is intentionally narrow:

  • Oracle, PostgreSQL, and MongoDB remain the durable signal stores.
  • Oracle, PostgreSQL, and MongoDB persist durable signals transactionally.
  • Redis receives wake hints directly after commit through the mutation scope post-commit hook.
  • workers wake through Redis and then claim from the durable backend store.

Oracle is now supported in this combination, but it is not the preferred Oracle profile. Oracle native AQ wake remains the default because it is slightly faster in the current measurements and keeps the cleanest native timer and dequeue path.

Redis on Oracle exists for topology consistency, not because Oracle needs Redis for correctness or because it is the current fastest Oracle path.

Redis Driver Rules

Redis must remain a wake driver plugin, not the authoritative durable signal queue for mixed backends.

The intended shape is:

  • Oracle or PostgreSQL or MongoDB remains the durable IWorkflowSignalStore
  • Redis becomes an IWorkflowSignalDriver
  • the Redis wake hint is published directly after the durable store transaction commits
  • backend-native wake drivers are not active when Redis is selected

That preserves the required correctness model:

  1. persist runtime state, projections, and durable signal inside the backend mutation boundary
  2. commit the mutation boundary
  3. publish the Redis wake hint from the registered post-commit action
  4. wake workers and claim from the durable backend store
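
The steps above might be sketched like this, assuming a post-commit hook on the mutation scope (the hook name `OnCommitted` is hypothetical; `PublishAsync` is the real StackExchange.Redis pub/sub call):

```csharp
// Hypothetical sketch: "mutationScope.OnCommitted" is an assumed hook name.
// Steps 1-2: the durable signal commits together with state and projections.
await signalStore.PersistAsync(signal, ct);

// Step 3: publish the non-durable wake hint only after the commit succeeds.
mutationScope.OnCommitted(() =>
    subscriber.PublishAsync(
        RedisChannel.Literal("wf:wake"),
        signal.WorkflowInstanceId));

// Step 4: a woken worker claims the signal from the durable backend store,
// never from Redis itself.
```

Publishing only after commit is what keeps Redis out of the correctness layer: a lost hint delays a wake-up, it never loses a signal.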

IWorkflowWakeOutbox remains in the abstraction set for future non-Redis wake drivers that may still need deferred publication, but it is not the active Redis hot path.

Redis may improve signal-to-resume latency, especially for PostgreSQL and MongoDB where the durable store and the wake path are already split cleanly.

Redis must not become the correctness layer unless the whole durable signal model also moves there, which is not the design target of this engine.

Required Capability Matrix

Every engine backend profile must define concrete answers for the following:

| Capability | Oracle | PostgreSQL | MongoDB |
| --- | --- | --- | --- |
| Runtime state durability | Native | Required | Required |
| Projection durability | Native | Required | Required |
| Optimistic concurrency | Row/version | Row/version | Document version |
| Immediate signal durability | AQ | Queue table or queue extension | Signal collection |
| Delayed scheduling | AQ delayed delivery | Durable due-message table | Durable due-message collection |
| Blocking wake-up | AQ dequeue | LISTEN/NOTIFY, Redis wake driver, or dedicated queue worker | Change streams or Redis wake driver |
| Atomic state + signal publish | Native DB transaction | Outbox transaction | Transactional outbox or equivalent |
| Dead-letter support | AQ + table | Queue/DLQ table | DLQ collection |
| Multi-node safety | DB + AQ | DB + wake hints | DB + change stream / wake hints |
| Restart recovery | Native | Required | Required |

The backend is not complete until every row has a real implementation.

Engine Backend Layers

The switchable backend model should be built around these interfaces.

1. Runtime State Store

Responsible for:

  • loading runtime snapshot by workflow instance id
  • inserting new snapshot
  • updating snapshot with expected version
  • querying runtime status for operational needs
  • storing engine-specific snapshot JSON

Target interface shape:

```csharp
public interface IWorkflowRuntimeStateStore
{
    Task<WorkflowRuntimeStateRecord?> GetAsync(string workflowInstanceId, CancellationToken ct = default);
    Task InsertAsync(WorkflowRuntimeStateRecord record, CancellationToken ct = default);
    Task UpdateAsync(
        WorkflowRuntimeStateRecord record,
        long expectedVersion,
        CancellationToken ct = default);
}
```

Notes:

  • Oracle and PostgreSQL should use explicit version columns.
  • MongoDB should use a document version field and conditional update filter.
  • This store must not also own signal publishing logic.
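
The version-checked update in the relational stores can be sketched as follows (table and column names are hypothetical):

```sql
-- Hypothetical table and column names.
UPDATE workflow_runtime_states
SET snapshot_json = :snapshot,
    version = version + 1
WHERE workflow_instance_id = :instance_id
  AND version = :expected_version;
-- Zero affected rows means a concurrent writer won; surface a version conflict.
-- MongoDB provides the same guarantee with a conditional update filter on the
-- document version field instead of a WHERE clause.
```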

2. Projection Store

Responsible for:

  • workflow instance summaries
  • task summaries
  • task event history
  • business reference lookup
  • support read APIs

The projection model is product-facing and must remain stable.

That means:

  • the shape of projection records must not depend on the backend
  • only the persistence adapter may change

Target direction:

  • split the current projection application service into a backend-neutral application service plus backend adapters
  • keep one projection contract
  • allow Oracle and PostgreSQL to stay relational
  • allow MongoDB to project into document collections if needed

3. Signal Bus

Responsible for durable immediate signals:

  • internal continue
  • external signal
  • task completion continuation
  • subworkflow completion
  • replay from dead-letter

The current contract already exists in the engine runtime abstractions.

Required guarantees:

  • at-least-once delivery
  • ack only after successful processing
  • delivery count visibility
  • explicit abandon
  • explicit dead-letter move
  • replay support

4. Schedule Bus

Responsible for durable delayed delivery:

  • timer due
  • retry due
  • delayed continuation

Required guarantees:

  • message is not lost across process restart
  • message becomes visible at or after due time
  • stale due messages are safely ignored through waiting tokens
  • schedule and immediate signal semantics use the same envelope model

5. Mutation Transaction Boundary

This is the most important portability seam.

The engine mutates three things together:

  • runtime state
  • projections
  • signals or schedules

Oracle can do that in one database transaction because state, projections, and AQ live inside the same durable boundary.

PostgreSQL and MongoDB may require an outbox-based boundary instead.

This must be explicit:

```csharp
public interface IWorkflowMutationCoordinator
{
    Task ExecuteAsync(
        Func<IWorkflowMutationContext, CancellationToken, Task> action,
        CancellationToken ct = default);
}
```

Where the mutation context exposes:

  • runtime state adapter
  • projection adapter
  • signal outbox writer
  • schedule outbox writer
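
Usage from the runtime service might then look like this sketch (the property names on the mutation context are assumptions):

```csharp
// Sketch only: IWorkflowMutationContext member names are assumed.
await coordinator.ExecuteAsync(async (ctx, ct) =>
{
    await ctx.RuntimeState.UpdateAsync(record, expectedVersion, ct);
    await ctx.Projections.ApplyAsync(projectionChanges, ct);
    await ctx.SignalOutbox.EnqueueAsync(continuationSignal, ct);
    // All three writes commit or roll back together; the backend decides
    // whether that is one native transaction or a transaction-plus-outbox
    // boundary, and the runtime service never knows which.
});
```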

Do not let the runtime service hand-roll transaction logic per backend.

6. Wake-Up Notifier

The engine must not scan due rows in a steady loop.

That means every backend needs a wake-up channel:

  • Oracle: AQ blocking dequeue
  • PostgreSQL: LISTEN/NOTIFY as wake hint for durable queue tables
  • MongoDB: change streams as wake hint for durable signal collections

The wake-up channel is not the durable source of truth except in Oracle AQ.

It is only the wake mechanism.

That distinction is mandatory for PostgreSQL and MongoDB.

Backend Profiles

Oracle Profile

Role

Oracle remains the reference backend profile and the operational default.

Storage Model

  • runtime state table
  • relational projection tables
  • AQ signal queue
  • AQ schedule queue or delayed signal queue
  • DLQ table and AQ-assisted replay

Commit Model

  • one transaction for runtime state, projections, and AQ enqueue

Wake Model

  • AQ blocking dequeue

Advantages

  • strongest correctness story
  • simplest atomic mutation model
  • no extra wake layer required

Risks

  • Oracle-specific infrastructure coupling
  • AQ operational expertise required
  • portability work must not assume AQ-only features in engine logic

Oracle should be treated as the semantic gold standard that other backends must match.

PostgreSQL Profile

Goal

Provide a backend profile that preserves engine semantics using PostgreSQL as the durable system of record.

  • runtime state in PostgreSQL tables
  • projections in PostgreSQL tables
  • durable signal queue table
  • durable schedule queue table
  • DLQ table
  • LISTEN/NOTIFY for wake-up hints only

Why Not LISTEN/NOTIFY Alone

LISTEN/NOTIFY is not sufficient as the durable signal layer because notifications are ephemeral.

The durable truth must stay in tables.

The recommended model is:

  1. insert durable signal row in the same transaction as state/projection mutation
  2. emit NOTIFY before commit or immediately after durable insert
  3. workers wake up and claim rows from the signal queue table
  4. if notification is missed, the next notification or startup recovery still finds the rows
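
Steps 1 and 2 of this model can be sketched in SQL (table and channel names are hypothetical):

```sql
-- Hypothetical table and channel names.
BEGIN;
-- ... runtime state and projection mutations happen in this same transaction ...
INSERT INTO workflow_signals (workflow_instance_id, signal_type, payload, available_at_utc)
VALUES (:instance_id, :signal_type, :payload, now());
SELECT pg_notify('workflow_signal', :instance_id);  -- ephemeral wake hint
COMMIT;
-- If the NOTIFY is missed, the durable row is still found on the next
-- wake-up or during startup recovery (steps 3 and 4).
```

Because `pg_notify` fires inside the transaction, the hint is only delivered if the durable insert commits.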

Queue Claim Strategy

Recommended queue-claim pattern:

  • FOR UPDATE SKIP LOCKED
  • ordered by available time, priority, and creation time
  • delivery count increment on claim
  • explicit ack by state transition or delete
  • explicit dead-letter move after delivery limit
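
A claim under this pattern might look like the following sketch (table and column names are hypothetical):

```sql
-- Claim one available signal; SKIP LOCKED lets concurrent workers
-- pass over rows another node has already locked.
UPDATE workflow_signals
SET status = 'Claimed',
    delivery_count = delivery_count + 1,
    claimed_at_utc = now()
WHERE id = (
    SELECT id
    FROM workflow_signals
    WHERE status = 'Ready'
      AND available_at_utc <= now()
    ORDER BY available_at_utc, priority, created_at_utc
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
```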

Schedule Strategy

Recommended schedule table:

  • signal_id
  • available_at_utc
  • workflow_instance_id
  • runtime_provider
  • signal_type
  • serialized payload
  • delivery count
  • dead-letter metadata
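
As a DDL sketch of that field list (names and types are hypothetical):

```sql
-- Hypothetical DDL sketch for the durable schedule table.
CREATE TABLE workflow_schedules (
    signal_id            text PRIMARY KEY,
    available_at_utc     timestamptz NOT NULL,
    workflow_instance_id text NOT NULL,
    runtime_provider     text NOT NULL,
    signal_type          text NOT NULL,
    payload              jsonb NOT NULL,
    delivery_count       int NOT NULL DEFAULT 0,
    dead_letter_reason   text NULL
);
-- Due-row lookups are driven by this index; see the operational notes
-- below on index strategy for due rows.
CREATE INDEX ix_workflow_schedules_due ON workflow_schedules (available_at_utc);
```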

Recommended wake-up path:

  • durable insert into schedule table
  • NOTIFY workflow_signal
  • workers wake and attempt claim of rows with available_at_utc <= now()

This is still not "polling" if workers block on LISTEN and only do bounded claim attempts on wake-up, startup, and recovery events.

Atomicity Model

PostgreSQL cannot rely on an external broker if we want the same atomicity guarantees.

The cleanest profile is:

  • database state
  • database projections
  • database signal queue
  • database schedule queue
  • NOTIFY as non-durable wake hint

That keeps the entire correctness boundary in PostgreSQL.

Operational Notes

Need explicit handling for:

  • orphan claimed rows after node crash
  • reclaim timeout
  • dead-letter browsing and replay
  • table bloat and retention
  • index strategy for due rows

Suggested Components

  • PostgresWorkflowRuntimeStateStore
  • PostgresWorkflowProjectionStore
  • PostgresWorkflowSignalQueue
  • PostgresWorkflowScheduleQueue
  • PostgresWorkflowWakeListener
  • PostgresWorkflowMutationCoordinator

MongoDB Profile

Goal

Provide a backend profile that preserves engine semantics using MongoDB as the durable system of record.

  • runtime state in a workflow_runtime_states collection
  • projections in dedicated collections
  • durable workflow_signals collection
  • durable workflow_schedules collection
  • dead-letter collection
  • change streams for wake-up hints

Why Change Streams Are Not Enough

Change streams are a wake mechanism, not the durable queue itself.

The durable truth must remain in collections so the engine can recover after:

  • service restart
  • watcher restart
  • temporary connectivity loss

Document Model

Signal document fields should include:

  • _id
  • workflowInstanceId
  • runtimeProvider
  • signalType
  • waitingToken
  • expectedVersion
  • dueAtUtc
  • status
  • deliveryCount
  • claimedBy
  • claimedAtUtc
  • deadLetterReason
  • payload

Claim Strategy

Recommended model:

  • atomically claim one available document with findOneAndUpdate
  • filter by:
    • status = Ready
    • dueAtUtc <= now
    • not already claimed
  • set:
    • status = Claimed
    • claimedBy
    • claimedAtUtc
    • increment deliveryCount

Ack means:

  • delete the signal or mark it completed

Abandon means:

  • move back to Ready

Dead-letter means:

  • move to DLQ collection or set status = DeadLetter
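
With the MongoDB .NET driver, the claim step can be sketched as follows (field names follow the document model above; the exact shape is an assumption):

```csharp
// Sketch using the real MongoDB .NET driver FindOneAndUpdateAsync API;
// field names follow the signal document model above.
var filter = Builders<BsonDocument>.Filter.Eq("status", "Ready")
           & Builders<BsonDocument>.Filter.Lte("dueAtUtc", DateTime.UtcNow);
var update = Builders<BsonDocument>.Update
    .Set("status", "Claimed")
    .Set("claimedBy", workerId)
    .Set("claimedAtUtc", DateTime.UtcNow)
    .Inc("deliveryCount", 1);
var claimed = await signals.FindOneAndUpdateAsync(
    filter, update,
    new FindOneAndUpdateOptions<BsonDocument>
    {
        ReturnDocument = ReturnDocument.After
    });
// null => nothing available; a non-null document is exclusively claimed,
// because findOneAndUpdate is atomic per document.
```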

Schedule Strategy

Two reasonable models exist.

Model A: Separate Schedule Collection

  • keep delayed signals in workflow_schedules
  • promote due documents into workflow_signals
  • wake workers through change streams

This is simpler conceptually but adds one extra movement step.

Model B: Unified Signal Collection

  • store all signals in one collection
  • use dueAtUtc and status
  • workers claim only due documents

This is the better v1 choice because it keeps one signal envelope pipeline.

Atomicity Model

MongoDB can support multi-document transactions in replica-set mode.

That means the preferred model is:

  • runtime state
  • projections
  • signal collection writes
  • schedule writes

all inside one MongoDB transaction.

If that operational assumption is unacceptable, then MongoDB is not a correctness-grade replacement for the Oracle profile and should not be offered as a production engine backend.

Wake Model

Use change streams to avoid steady-state polling:

  • watch inserts and state transitions for ready or due signals
  • on startup, run bounded recovery sweep for unclaimed ready signals
  • on worker restart, resume from durable signal documents, not from missed change stream events

Operational Notes

Need explicit handling for:

  • resume token persistence for observers
  • claimed-document recovery after node failure
  • shard-key implications if sharding is introduced later
  • transactional prerequisites in local and CI test environments

Suggested Components

  • MongoWorkflowRuntimeStateStore
  • MongoWorkflowProjectionStore
  • MongoWorkflowSignalStore
  • MongoWorkflowWakeStreamListener
  • MongoWorkflowMutationCoordinator

Backend Selection Model

The engine should not expose dozens of independent switches in appsettings.

Use one backend profile section plus internal composition.

Recommended shape:

```json
{
  "WorkflowEngine": {
    "BackendProfile": "Oracle"
  }
}
```

And then backend-specific option sections:

```json
{
  "WorkflowBackend:Oracle": {
    "ConnectionString": "...",
    "QueueOwner": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "DeadLetterQueueName": "WF_SIGNAL_DLQ"
  },
  "WorkflowBackend:PostgreSql": {
    "ConnectionString": "...",
    "SignalTable": "workflow_signals",
    "ScheduleTable": "workflow_schedules",
    "DeadLetterTable": "workflow_signal_dead_letters",
    "NotificationChannel": "workflow_signal"
  },
  "WorkflowBackend:MongoDb": {
    "ConnectionString": "...",
    "DatabaseName": "serdica_workflow",
    "SignalCollection": "workflow_signals",
    "RuntimeStateCollection": "workflow_runtime_states",
    "ProjectionPrefix": "workflow"
  }
}
```

The DI layer should map BackendProfile to one complete backend package, not a mix-and-match set of partial adapters.

That avoids unsupported combinations like:

  • Oracle state + Mongo signals
  • PostgreSQL state + Redis schedule

unless they are designed explicitly as a later profile.
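
The profile-to-package mapping can be sketched as one switch in the host (the `Add*WorkflowBackend` extension methods are hypothetical plugin entry points):

```csharp
// Hypothetical sketch: each Add* method is the single entry point of one
// complete backend plugin, so partial adapter mixes cannot be composed here.
var profile = builder.Configuration["WorkflowEngine:BackendProfile"];
switch (profile)
{
    case "Oracle":
        builder.Services.AddOracleWorkflowBackend(builder.Configuration);
        break;
    case "PostgreSql":
        builder.Services.AddPostgreSqlWorkflowBackend(builder.Configuration);
        break;
    case "MongoDb":
        builder.Services.AddMongoDbWorkflowBackend(builder.Configuration);
        break;
    default:
        throw new InvalidOperationException($"Unknown backend profile '{profile}'.");
}
```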

Implementation Refactor Needed

To make the backend switch clean, the current Oracle-first host should be refactored in this order.

Phase 1: Split Projection Persistence

Refactor the current projection application service into:

  • projection application service
  • backend-neutral projection contract
  • Oracle implementation

Then add backend implementations later without changing the application service.

Phase 2: Introduce Dedicated Backend Plugin Registration

Add:

```csharp
public interface IWorkflowBackendRegistrationMarker
{
    string BackendName { get; }
}
```

Then create dedicated backend plugins for:

  • Oracle
  • PostgreSQL
  • MongoDB

The host should remain backend-neutral and validate that the selected backend plugin has registered itself. Each backend plugin should own registration of:

  • runtime state store
  • projection store
  • mutation coordinator
  • signal bus
  • schedule bus
  • dead-letter store
  • backend-specific options and wake-up strategy
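
A plugin entry point might then look like this sketch (the registration method and component names are assumptions, with the store names taken from the suggested-components lists):

```csharp
// Hypothetical sketch of one complete backend plugin registration.
public sealed class PostgreSqlWorkflowBackend : IWorkflowBackendRegistrationMarker
{
    public string BackendName => "PostgreSql";
}

// Inside the plugin's registration extension method, register the full
// capability set so the host itself stays backend-neutral:
//   services.AddSingleton<IWorkflowBackendRegistrationMarker, PostgreSqlWorkflowBackend>();
//   services.AddSingleton<IWorkflowRuntimeStateStore, PostgresWorkflowRuntimeStateStore>();
//   services.AddSingleton<IWorkflowMutationCoordinator, PostgresWorkflowMutationCoordinator>();
//   plus the projection store, signal and schedule buses, dead-letter store,
//   wake listener, and backend-specific options.
```

The host can then assert at startup that exactly one marker is registered and that its BackendName matches the configured profile.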

Phase 3: Move Transaction Logic Into Backend Coordinator

Refactor the current workflow mutation transaction scope so the runtime service no longer knows whether the backend uses:

  • direct database transaction
  • database transaction plus outbox
  • document transaction

The runtime service should only ask for one mutation boundary.

Phase 4: Normalize Dead-Letter Model

Standardize a backend-neutral dead-letter record so the operational endpoints do not care which backend stores it.

That includes:

  • signal id
  • workflow instance id
  • signal type
  • first failure time
  • last failure time
  • delivery count
  • last error
  • payload snapshot

Phase 5: Introduce Backend Conformance Tests

Every backend must pass the same contract suite:

  • state insert/update/version conflict
  • task activation and completion
  • timer due resume
  • external signal resume
  • subworkflow completion resume
  • duplicate delivery safety
  • restart recovery
  • dead-letter move and replay
  • retention and purge

Oracle should remain the first backend to pass the full suite.

PostgreSQL and MongoDB are not ready until they pass the same suite.
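
One way to keep the suite identical across backends is an abstract fixture-based test class that each backend plugin closes over (a sketch; `IWorkflowBackendFixture` and the test body are assumptions):

```csharp
// Sketch of a shared conformance suite; each backend test project
// supplies its own fully composed backend profile via the fixture.
public abstract class WorkflowBackendConformanceTests
{
    protected abstract IWorkflowBackendFixture CreateFixture();

    [Fact]
    public async Task Update_with_stale_version_reports_conflict()
    {
        await using var fixture = CreateFixture();
        // Insert a runtime state record at version 1, update it with
        // expectedVersion 0, and assert the backend surfaces a version
        // conflict rather than silently overwriting.
    }

    // Further members cover timer due resume, duplicate delivery safety,
    // restart recovery, dead-letter move and replay, and retention.
}
```

Each backend then inherits the class once, so a new backend cannot pass a weaker suite than Oracle.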

Backend-Specific Risks

PostgreSQL Risks

  • row-level queue claim logic can create hot indexes under high throughput
  • LISTEN/NOTIFY payloads are not durable
  • reclaim and retry logic must be designed carefully to avoid stuck claimed rows
  • due-row access patterns must be tuned with indexes and partitioning if volume grows

MongoDB Risks

  • production-grade correctness depends on replica-set transactions
  • change streams add operational requirements and resume-token handling
  • projection queries may become more complex if the read model is heavily relational today
  • collection growth and retention strategy must be explicit early

Oracle Risks

  • Oracle remains the strongest correctness model but the least portable implementation
  • engine logic must not drift toward AQ-only assumptions that other backends cannot model

Recommended Build Order

Do not start by building PostgreSQL and MongoDB in parallel.

Use this order:

  1. stabilize Oracle as the contract baseline
  2. refactor the host into a true backend-plugin model
  3. implement PostgreSQL profile
  4. pass the full backend conformance suite on PostgreSQL
  5. implement MongoDB profile only if there is a real product need for MongoDB as the system of record

PostgreSQL should come before MongoDB because:

  • its runtime-state and projection model are closer to the current Oracle design
  • its transaction semantics fit the engine more naturally
  • the read-side model is already relational

Validation Order After Functional Backend Completion

Functional backend completion is not the same as backend readiness.

After a backend can start, resume, signal, schedule, and retain workflows, the next required order is:

  1. backend-neutral hostile-condition coverage
  2. curated Bulstrad parity coverage
  3. backend-neutral performance tiers
  4. backend-specific baseline publication
  5. final three-backend comparison

This means:

  • PostgreSQL is not done when its basic stores and buses compile; it must also match the Oracle hostile-condition and Bulstrad suites
  • MongoDB is not done when replica-set transactions and signal delivery work; it must also match the same parity and performance suites
  • the final adoption decision should be based on the shared comparison pack, not on isolated backend microbenchmarks

Proposed Sprint

Sprint 14: Backend Portability And Store Profiles

Goal

Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.

Scope

  • introduce backend profile abstraction and dedicated backend plugin registration
  • split projection persistence from the current Oracle-first application service
  • formalize mutation coordinator abstraction
  • add backend-neutral dead-letter contract
  • define and implement backend conformance suite
  • implement PostgreSQL profile
  • design MongoDB profile in executable detail, with implementation only after explicit product approval

Deliverables

  • IWorkflowBackendRegistrationMarker
  • backend-neutral projection contract
  • backend-neutral mutation coordinator contract
  • backend conformance test suite
  • dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
  • architecture-ready MongoDB backend plugin design package

Exit Criteria

  • host selects one backend profile by configuration
  • host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
  • Oracle and PostgreSQL pass the same conformance suite
  • MongoDB path is specified well enough that implementation is a bounded engineering task
  • workflow declarations and canonical definitions remain unchanged across backend profiles

Final Rule

Backend switching is an infrastructure concern, not a workflow concern.

If a future backend requires changing workflow declarations, canonical definitions, or engine semantics, that backend does not fit the architecture and should not be adopted without a new ADR.