git.stella-ops.org/docs/workflow/engine/09-backend-portability-plan.md
master f5b5f24d95 Add StellaOps.Workflow engine: 14 libraries, WebService, 8 test projects
Extract product-agnostic workflow engine from Ablera.Serdica.Workflow into
standalone StellaOps.Workflow.* libraries targeting net10.0.

Libraries (14):
- Contracts, Abstractions (compiler, decompiler, expression runtime)
- Engine (execution, signaling, scheduling, projections, hosted services)
- ElkSharp (generic graph layout algorithm)
- Renderer.ElkSharp, Renderer.ElkJs, Renderer.Msagl, Renderer.Svg
- Signaling.Redis, Signaling.OracleAq
- DataStore.MongoDB, DataStore.PostgreSQL, DataStore.Oracle

WebService: ASP.NET Core Minimal API with 22 endpoints

Tests (8 projects, 109 tests pass):
- Engine.Tests (105 pass), WebService.Tests (4 E2E pass)
- Renderer.Tests, DataStore.MongoDB/Oracle/PostgreSQL.Tests
- Signaling.Redis.Tests, IntegrationTests.Shared

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 19:14:44 +02:00

09. Backend Portability Plan

Purpose

This document defines how SerdicaEngine should evolve from an Oracle-first runtime into a backend-switchable engine that can also run on PostgreSQL and MongoDB without changing workflow declarations, canonical definitions, or runtime semantics.

The goal is not to support every backend in the same way internally.

The goal is to preserve one stable engine contract:

  • the same declarative workflow classes
  • the same canonical runtime definitions
  • the same public workflow/task APIs
  • the same runtime behavior around tasks, waits, timers, external signals, subworkflows, retries, and retention

Backend switching must only change infrastructure adapters and host configuration.

Current Baseline

Today the strongest backend shape is Oracle:

  • runtime state persists in an Oracle-backed runtime-state adapter
  • projections persist in an Oracle-backed projection adapter
  • immediate signaling and delayed scheduling run through Oracle AQ adapters
  • the engine host composes those adapters through backend registration

Oracle is the reference implementation because it already gives:

  • one durable database
  • durable queueing
  • delayed delivery
  • blocking dequeue without polling
  • transactional coupling between state mutation and queue enqueue

That reference point matters because PostgreSQL and MongoDB must match the engine contract even if they reach it through different infrastructure mechanisms.

Non-Negotiable Product Rules

Backend portability must not break these rules:

  1. Authored workflow classes remain pure declaration classes.
  2. Canonical runtime definitions remain backend-agnostic.
  3. Engine execution remains run-to-wait.
  4. Multi-instance deployment remains supported.
  5. Steady-state signal and timer discovery must not rely on polling loops.
  6. Signal delivery remains at-least-once.
  7. Resume remains idempotent through version and waiting-token checks.
  8. Public API contracts and projections remain stable.
  9. Operational features remain available:
    • signal raise
    • dead-letter inspection
    • dead-letter replay
    • runtime inspection
    • retention
    • diagram inspection

Architecture Principle

Do not make the engine "database-agnostic" by hiding everything behind one giant repository.

That approach will collapse important guarantees.

Instead, separate the backend into explicit capabilities:

  1. runtime state persistence
  2. projection persistence
  3. signal transport
  4. schedule transport
  5. mutation transaction boundary
  6. wake-up notification strategy
  7. lease or concurrency strategy
  8. dead-letter and replay strategy
  9. retention and purge strategy

Each backend implementation must satisfy the full capability matrix.

Implemented Signal Driver Split

The engine now separates durable signal ownership from wake-up delivery.

The shared seam is defined by engine signal-driver abstractions plus signal and schedule bridge contracts.

That split exists to preserve transactional correctness while still allowing faster wake strategies later.

The separation is:

  • IWorkflowSignalStore: durable immediate signal persistence
  • IWorkflowSignalDriver: wake-up and claim path for available signals
  • IWorkflowSignalScheduler: durable delayed-signal persistence
  • IWorkflowWakeOutbox: deferred wake publication when the driver is not transaction-coupled to the durable store

The public engine surface still uses:

  • IWorkflowSignalBus
  • IWorkflowScheduleBus

Those are now bridge contracts.

They do not define backend mechanics directly.
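
A minimal C# sketch of this seam, assuming illustrative member signatures (the interface names come from the list above; the method shapes are not the actual engine API):

```csharp
// Illustrative shapes only: interface names are from the document,
// but these member signatures are assumptions.
public interface IWorkflowSignalStore
{
    // Persist an immediate signal inside the backend mutation boundary.
    Task PersistAsync(WorkflowSignalEnvelope signal, CancellationToken ct = default);
}

public interface IWorkflowSignalDriver
{
    // Block until a signal may be available, then claim it from the durable store.
    // The driver never owns durability; it only wakes and claims.
    Task<WorkflowSignalEnvelope?> WaitAndClaimAsync(CancellationToken ct = default);
}
```

The same pattern applies to IWorkflowSignalScheduler and IWorkflowWakeOutbox: durable persistence on one side of the seam, wake and claim mechanics on the other.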

Current Backend Matrix

| Backend profile | Durable signal store | Wake driver | Schedule store | Dispatch mode |
| --- | --- | --- | --- | --- |
| Oracle | Oracle AQ signal adapter | Oracle AQ blocking dequeue | Oracle AQ schedule adapter | NativeTransactional |
| PostgreSQL | PostgreSQL durable signal store | PostgreSQL native wake or claim adapter | PostgreSQL durable schedule store | NativeTransactional |
| MongoDB | MongoDB durable signal store | MongoDB change-stream wake or claim adapter | MongoDB durable schedule store | NativeTransactional |

Implemented Optional Redis Wake Driver

The Redis driver is implemented as a separate wake-driver plugin.

Its shape is intentionally narrow:

  • Oracle, PostgreSQL, and MongoDB remain the durable signal stores.
  • Oracle, PostgreSQL, and MongoDB persist durable signals transactionally.
  • Redis receives wake hints directly after commit through the mutation scope post-commit hook.
  • workers wake through Redis and then claim from the durable backend store.

Oracle is now supported in this combination, but it is not the preferred Oracle profile. Oracle native AQ wake remains the default because it is slightly faster in the current measurements and keeps the cleanest native timer and dequeue path.

Redis on Oracle exists for topology consistency, not because Oracle needs Redis for correctness or because it is the current fastest Oracle path.

Redis Driver Rules

Redis must remain a wake driver plugin, not the authoritative durable signal queue for mixed backends.

The intended shape is:

  • Oracle or PostgreSQL or MongoDB remains the durable IWorkflowSignalStore
  • Redis becomes an IWorkflowSignalDriver
  • the Redis wake hint is published directly after the durable store transaction commits
  • backend-native wake drivers are not active when Redis is selected

That preserves the required correctness model:

  1. persist runtime state, projections, and durable signal inside the backend mutation boundary
  2. commit the mutation boundary
  3. publish the Redis wake hint from the registered post-commit action
  4. wake workers and claim from the durable backend store
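
The steps above might be sketched like this, assuming a post-commit hook on the mutation scope (the hook name `OnCommitted` is hypothetical; `PublishAsync` is the real StackExchange.Redis pub/sub call):

```csharp
// Hypothetical sketch: "mutationScope.OnCommitted" is an assumed hook name.
// Steps 1-2: the durable signal commits together with state and projections.
await signalStore.PersistAsync(signal, ct);

// Step 3: publish the non-durable wake hint only after the commit succeeds.
mutationScope.OnCommitted(() =>
    subscriber.PublishAsync(
        RedisChannel.Literal("wf:wake"),
        signal.WorkflowInstanceId));

// Step 4: a woken worker claims the signal from the durable backend store,
// never from Redis itself.
```

Publishing only after commit is what keeps Redis out of the correctness layer: a lost hint delays a wake-up, it never loses a signal.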

IWorkflowWakeOutbox remains in the abstraction set for future non-Redis wake drivers that may still need deferred publication, but it is not the active Redis hot path.

Redis may improve signal-to-resume latency, especially for PostgreSQL and MongoDB where the durable store and the wake path are already split cleanly.

Redis must not become the correctness layer unless the whole durable signal model also moves there, which is not the design target of this engine.

Required Capability Matrix

Every engine backend profile must define concrete answers for the following:

| Capability | Oracle | PostgreSQL | MongoDB |
| --- | --- | --- | --- |
| Runtime state durability | Native | Required | Required |
| Projection durability | Native | Required | Required |
| Optimistic concurrency | Row/version | Row/version | Document version |
| Immediate signal durability | AQ | Queue table or queue extension | Signal collection |
| Delayed scheduling | AQ delayed delivery | Durable due-message table | Durable due-message collection |
| Blocking wake-up | AQ dequeue | LISTEN/NOTIFY, Redis wake driver, or dedicated queue worker | Change streams or Redis wake driver |
| Atomic state + signal publish | Native DB transaction | Outbox transaction | Transactional outbox or equivalent |
| Dead-letter support | AQ + table | Queue/DLQ table | DLQ collection |
| Multi-node safety | DB + AQ | DB + wake hints | DB + change stream / wake hints |
| Restart recovery | Native | Required | Required |

The backend is not complete until every row has a real implementation.

Engine Backend Layers

The switchable backend model should be built around these interfaces.

1. Runtime State Store

Responsible for:

  • loading runtime snapshot by workflow instance id
  • inserting new snapshot
  • updating snapshot with expected version
  • querying runtime status for operational needs
  • storing engine-specific snapshot JSON

Target interface shape:

```csharp
public interface IWorkflowRuntimeStateStore
{
    Task<WorkflowRuntimeStateRecord?> GetAsync(string workflowInstanceId, CancellationToken ct = default);
    Task InsertAsync(WorkflowRuntimeStateRecord record, CancellationToken ct = default);
    Task UpdateAsync(
        WorkflowRuntimeStateRecord record,
        long expectedVersion,
        CancellationToken ct = default);
}
```

Notes:

  • Oracle and PostgreSQL should use explicit version columns.
  • MongoDB should use a document version field and conditional update filter.
  • This store must not also own signal publishing logic.
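
The version-checked update in the relational stores can be sketched as follows (table and column names are hypothetical):

```sql
-- Hypothetical table and column names.
UPDATE workflow_runtime_states
SET snapshot_json = :snapshot,
    version = version + 1
WHERE workflow_instance_id = :instance_id
  AND version = :expected_version;
-- Zero affected rows means a concurrent writer won; surface a version conflict.
-- MongoDB provides the same guarantee with a conditional update filter on the
-- document version field instead of a WHERE clause.
```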

2. Projection Store

Responsible for:

  • workflow instance summaries
  • task summaries
  • task event history
  • business reference lookup
  • support read APIs

The projection model is product-facing and must remain stable.

That means:

  • the shape of projection records must not depend on the backend
  • only the persistence adapter may change

Target direction:

  • split the current projection application service into a backend-neutral application service plus backend adapters
  • keep one projection contract
  • allow Oracle and PostgreSQL to stay relational
  • allow MongoDB to project into document collections if needed

3. Signal Bus

Responsible for durable immediate signals:

  • internal continue
  • external signal
  • task completion continuation
  • subworkflow completion
  • replay from dead-letter

The current contract already exists in the engine runtime abstractions.

Required guarantees:

  • at-least-once delivery
  • ack only after successful processing
  • delivery count visibility
  • explicit abandon
  • explicit dead-letter move
  • replay support

4. Schedule Bus

Responsible for durable delayed delivery:

  • timer due
  • retry due
  • delayed continuation

Required guarantees:

  • message is not lost across process restart
  • message becomes visible at or after due time
  • stale due messages are safely ignored through waiting tokens
  • schedule and immediate signal semantics use the same envelope model

5. Mutation Transaction Boundary

This is the most important portability seam.

The engine mutates three things together:

  • runtime state
  • projections
  • signals or schedules

Oracle can do that in one database transaction because state, projections, and AQ live inside the same durable boundary.

PostgreSQL and MongoDB may require an outbox-based boundary instead.

This must be explicit:

```csharp
public interface IWorkflowMutationCoordinator
{
    Task ExecuteAsync(
        Func<IWorkflowMutationContext, CancellationToken, Task> action,
        CancellationToken ct = default);
}
```

Where the mutation context exposes:

  • runtime state adapter
  • projection adapter
  • signal outbox writer
  • schedule outbox writer
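
Usage from the runtime service might then look like this sketch (the property names on the mutation context are assumptions):

```csharp
// Sketch only: IWorkflowMutationContext member names are assumed.
await coordinator.ExecuteAsync(async (ctx, ct) =>
{
    await ctx.RuntimeState.UpdateAsync(record, expectedVersion, ct);
    await ctx.Projections.ApplyAsync(projectionChanges, ct);
    await ctx.SignalOutbox.EnqueueAsync(continuationSignal, ct);
    // All three writes commit or roll back together; the backend decides
    // whether that is one native transaction or a transaction-plus-outbox
    // boundary, and the runtime service never knows which.
});
```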

Do not let the runtime service hand-roll transaction logic per backend.

6. Wake-Up Notifier

The engine must not scan due rows in a steady loop.

That means every backend needs a wake-up channel:

  • Oracle: AQ blocking dequeue
  • PostgreSQL: LISTEN/NOTIFY as wake hint for durable queue tables
  • MongoDB: change streams as wake hint for durable signal collections

The wake-up channel is not the durable source of truth except in Oracle AQ.

It is only the wake mechanism.

That distinction is mandatory for PostgreSQL and MongoDB.

Backend Profiles

Oracle Profile

Role

Oracle remains the reference backend profile and the operational default.

Storage Model

  • runtime state table
  • relational projection tables
  • AQ signal queue
  • AQ schedule queue or delayed signal queue
  • DLQ table and AQ-assisted replay

Commit Model

  • one transaction for runtime state, projections, and AQ enqueue

Wake Model

  • AQ blocking dequeue

Advantages

  • strongest correctness story
  • simplest atomic mutation model
  • no extra wake layer required

Risks

  • Oracle-specific infrastructure coupling
  • AQ operational expertise required
  • portability work must not assume AQ-only features in engine logic

Oracle should be treated as the semantic gold standard that other backends must match.

PostgreSQL Profile

Goal

Provide a backend profile that preserves engine semantics using PostgreSQL as the durable system of record.

  • runtime state in PostgreSQL tables
  • projections in PostgreSQL tables
  • durable signal queue table
  • durable schedule queue table
  • DLQ table
  • LISTEN/NOTIFY for wake-up hints only

Why Not LISTEN/NOTIFY Alone

LISTEN/NOTIFY is not sufficient as the durable signal layer because notifications are ephemeral.

The durable truth must stay in tables.

The recommended model is:

  1. insert durable signal row in the same transaction as state/projection mutation
  2. emit NOTIFY before commit or immediately after durable insert
  3. workers wake up and claim rows from the signal queue table
  4. if notification is missed, the next notification or startup recovery still finds the rows
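
Steps 1 and 2 of this model can be sketched in SQL (table and channel names are hypothetical):

```sql
-- Hypothetical table and channel names.
BEGIN;
-- ... runtime state and projection mutations happen in this same transaction ...
INSERT INTO workflow_signals (workflow_instance_id, signal_type, payload, available_at_utc)
VALUES (:instance_id, :signal_type, :payload, now());
SELECT pg_notify('workflow_signal', :instance_id);  -- ephemeral wake hint
COMMIT;
-- If the NOTIFY is missed, the durable row is still found on the next
-- wake-up or during startup recovery (steps 3 and 4).
```

Because `pg_notify` fires inside the transaction, the hint is only delivered if the durable insert commits.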

Queue Claim Strategy

Recommended queue-claim pattern:

  • FOR UPDATE SKIP LOCKED
  • ordered by available time, priority, and creation time
  • delivery count increment on claim
  • explicit ack by state transition or delete
  • explicit dead-letter move after delivery limit
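
A claim under this pattern might look like the following sketch (table and column names are hypothetical):

```sql
-- Claim one available signal; SKIP LOCKED lets concurrent workers
-- pass over rows another node has already locked.
UPDATE workflow_signals
SET status = 'Claimed',
    delivery_count = delivery_count + 1,
    claimed_at_utc = now()
WHERE id = (
    SELECT id
    FROM workflow_signals
    WHERE status = 'Ready'
      AND available_at_utc <= now()
    ORDER BY available_at_utc, priority, created_at_utc
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
```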

Schedule Strategy

Recommended schedule table:

  • signal_id
  • available_at_utc
  • workflow_instance_id
  • runtime_provider
  • signal_type
  • serialized payload
  • delivery count
  • dead-letter metadata
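
As a DDL sketch of that field list (names and types are hypothetical):

```sql
-- Hypothetical DDL sketch for the durable schedule table.
CREATE TABLE workflow_schedules (
    signal_id            text PRIMARY KEY,
    available_at_utc     timestamptz NOT NULL,
    workflow_instance_id text NOT NULL,
    runtime_provider     text NOT NULL,
    signal_type          text NOT NULL,
    payload              jsonb NOT NULL,
    delivery_count       int NOT NULL DEFAULT 0,
    dead_letter_reason   text NULL
);
-- Due-row lookups are driven by this index; see the operational notes
-- below on index strategy for due rows.
CREATE INDEX ix_workflow_schedules_due ON workflow_schedules (available_at_utc);
```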

Recommended wake-up path:

  • durable insert into schedule table
  • NOTIFY workflow_signal
  • workers wake and attempt claim of rows with available_at_utc <= now()

This is still not "polling" if workers block on LISTEN and only do bounded claim attempts on wake-up, startup, and recovery events.

Atomicity Model

PostgreSQL cannot rely on an external broker if we want the same atomicity guarantees.

The cleanest profile is:

  • database state
  • database projections
  • database signal queue
  • database schedule queue
  • NOTIFY as non-durable wake hint

That keeps the entire correctness boundary in PostgreSQL.

Operational Notes

Need explicit handling for:

  • orphan claimed rows after node crash
  • reclaim timeout
  • dead-letter browsing and replay
  • table bloat and retention
  • index strategy for due rows

Suggested Components

  • PostgresWorkflowRuntimeStateStore
  • PostgresWorkflowProjectionStore
  • PostgresWorkflowSignalQueue
  • PostgresWorkflowScheduleQueue
  • PostgresWorkflowWakeListener
  • PostgresWorkflowMutationCoordinator

MongoDB Profile

Goal

Provide a backend profile that preserves engine semantics using MongoDB as the durable system of record.

  • runtime state in a workflow_runtime_states collection
  • projections in dedicated collections
  • durable workflow_signals collection
  • durable workflow_schedules collection
  • dead-letter collection
  • change streams for wake-up hints

Why Change Streams Are Not Enough

Change streams are a wake mechanism, not the durable queue itself.

The durable truth must remain in collections so the engine can recover after:

  • service restart
  • watcher restart
  • temporary connectivity loss

Document Model

Signal document fields should include:

  • _id
  • workflowInstanceId
  • runtimeProvider
  • signalType
  • waitingToken
  • expectedVersion
  • dueAtUtc
  • status
  • deliveryCount
  • claimedBy
  • claimedAtUtc
  • deadLetterReason
  • payload

Claim Strategy

Recommended model:

  • atomically claim one available document with findOneAndUpdate
  • filter by:
    • status = Ready
    • dueAtUtc <= now
    • not already claimed
  • set:
    • status = Claimed
    • claimedBy
    • claimedAtUtc
    • increment deliveryCount

Ack means:

  • delete the signal or mark it completed

Abandon means:

  • move back to Ready

Dead-letter means:

  • move to DLQ collection or set status = DeadLetter
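
With the MongoDB .NET driver, the claim step can be sketched as follows (field names follow the document model above; the exact shape is an assumption):

```csharp
// Sketch using the real MongoDB .NET driver FindOneAndUpdateAsync API;
// field names follow the signal document model above.
var filter = Builders<BsonDocument>.Filter.Eq("status", "Ready")
           & Builders<BsonDocument>.Filter.Lte("dueAtUtc", DateTime.UtcNow);
var update = Builders<BsonDocument>.Update
    .Set("status", "Claimed")
    .Set("claimedBy", workerId)
    .Set("claimedAtUtc", DateTime.UtcNow)
    .Inc("deliveryCount", 1);
var claimed = await signals.FindOneAndUpdateAsync(
    filter, update,
    new FindOneAndUpdateOptions<BsonDocument>
    {
        ReturnDocument = ReturnDocument.After
    });
// null => nothing available; a non-null document is exclusively claimed,
// because findOneAndUpdate is atomic per document.
```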

Schedule Strategy

Two reasonable models exist.

Model A: Separate Schedule Collection

  • keep delayed signals in workflow_schedules
  • promote due documents into workflow_signals
  • wake workers through change streams

This is simpler conceptually but adds one extra movement step.

Model B: Unified Signal Collection

  • store all signals in one collection
  • use dueAtUtc and status
  • workers claim only due documents

This is the better v1 choice because it keeps one signal envelope pipeline.

Atomicity Model

MongoDB can support multi-document transactions in replica-set mode.

That means the preferred model is:

  • runtime state
  • projections
  • signal collection writes
  • schedule writes

all inside one MongoDB transaction.

If that operational assumption is unacceptable, then MongoDB is not a correctness-grade replacement for the Oracle profile and should not be offered as a production engine backend.

Wake Model

Use change streams to avoid steady-state polling:

  • watch inserts and state transitions for ready or due signals
  • on startup, run bounded recovery sweep for unclaimed ready signals
  • on worker restart, resume from durable signal documents, not from missed change stream events

Operational Notes

Need explicit handling for:

  • resume token persistence for observers
  • claimed-document recovery after node failure
  • shard-key implications if sharding is introduced later
  • transactional prerequisites in local and CI test environments

Suggested Components

  • MongoWorkflowRuntimeStateStore
  • MongoWorkflowProjectionStore
  • MongoWorkflowSignalStore
  • MongoWorkflowWakeStreamListener
  • MongoWorkflowMutationCoordinator

Backend Selection Model

The engine should not expose dozens of independent switches in appsettings.

Use one backend profile section plus internal composition.

Recommended shape:

```json
{
  "WorkflowEngine": {
    "BackendProfile": "Oracle"
  }
}
```

And then backend-specific option sections:

```json
{
  "WorkflowBackend:Oracle": {
    "ConnectionString": "...",
    "QueueOwner": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "DeadLetterQueueName": "WF_SIGNAL_DLQ"
  },
  "WorkflowBackend:PostgreSql": {
    "ConnectionString": "...",
    "SignalTable": "workflow_signals",
    "ScheduleTable": "workflow_schedules",
    "DeadLetterTable": "workflow_signal_dead_letters",
    "NotificationChannel": "workflow_signal"
  },
  "WorkflowBackend:MongoDb": {
    "ConnectionString": "...",
    "DatabaseName": "serdica_workflow",
    "SignalCollection": "workflow_signals",
    "RuntimeStateCollection": "workflow_runtime_states",
    "ProjectionPrefix": "workflow"
  }
}
```

The DI layer should map BackendProfile to one complete backend package, not a mix-and-match set of partial adapters.

That avoids unsupported combinations like:

  • Oracle state + Mongo signals
  • PostgreSQL state + Redis schedule

unless they are designed explicitly as a later profile.
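
The profile-to-package mapping can be sketched as one switch in the host (the `Add*WorkflowBackend` extension methods are hypothetical plugin entry points):

```csharp
// Hypothetical sketch: each Add* method is the single entry point of one
// complete backend plugin, so partial adapter mixes cannot be composed here.
var profile = builder.Configuration["WorkflowEngine:BackendProfile"];
switch (profile)
{
    case "Oracle":
        builder.Services.AddOracleWorkflowBackend(builder.Configuration);
        break;
    case "PostgreSql":
        builder.Services.AddPostgreSqlWorkflowBackend(builder.Configuration);
        break;
    case "MongoDb":
        builder.Services.AddMongoDbWorkflowBackend(builder.Configuration);
        break;
    default:
        throw new InvalidOperationException($"Unknown backend profile '{profile}'.");
}
```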

Implementation Refactor Needed

To make the backend switch clean, the current Oracle-first host should be refactored in this order.

Phase 1: Split Projection Persistence

Refactor the current projection application service into:

  • projection application service
  • backend-neutral projection contract
  • Oracle implementation

Then add backend implementations later without changing the application service.

Phase 2: Introduce Dedicated Backend Plugin Registration

Add:

```csharp
public interface IWorkflowBackendRegistrationMarker
{
    string BackendName { get; }
}
```

Then create dedicated backend plugins for:

  • Oracle
  • PostgreSQL
  • MongoDB

The host should remain backend-neutral and validate that the selected backend plugin has registered itself. Each backend plugin should own registration of:

  • runtime state store
  • projection store
  • mutation coordinator
  • signal bus
  • schedule bus
  • dead-letter store
  • backend-specific options and wake-up strategy
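
A plugin entry point might then look like this sketch (the registration method and component names are assumptions, with the store names taken from the suggested-components lists):

```csharp
// Hypothetical sketch of one complete backend plugin registration.
public sealed class PostgreSqlWorkflowBackend : IWorkflowBackendRegistrationMarker
{
    public string BackendName => "PostgreSql";
}

// Inside the plugin's registration extension method, register the full
// capability set so the host itself stays backend-neutral:
//   services.AddSingleton<IWorkflowBackendRegistrationMarker, PostgreSqlWorkflowBackend>();
//   services.AddSingleton<IWorkflowRuntimeStateStore, PostgresWorkflowRuntimeStateStore>();
//   services.AddSingleton<IWorkflowMutationCoordinator, PostgresWorkflowMutationCoordinator>();
//   plus the projection store, signal and schedule buses, dead-letter store,
//   wake listener, and backend-specific options.
```

The host can then assert at startup that exactly one marker is registered and that its BackendName matches the configured profile.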

Phase 3: Move Transaction Logic Into Backend Coordinator

Refactor the current workflow mutation transaction scope so the runtime service no longer knows whether the backend uses:

  • direct database transaction
  • database transaction plus outbox
  • document transaction

The runtime service should only ask for one mutation boundary.

Phase 4: Normalize Dead-Letter Model

Standardize a backend-neutral dead-letter record so the operational endpoints do not care which backend stores it.

That includes:

  • signal id
  • workflow instance id
  • signal type
  • first failure time
  • last failure time
  • delivery count
  • last error
  • payload snapshot

Phase 5: Introduce Backend Conformance Tests

Every backend must pass the same contract suite:

  • state insert/update/version conflict
  • task activation and completion
  • timer due resume
  • external signal resume
  • subworkflow completion resume
  • duplicate delivery safety
  • restart recovery
  • dead-letter move and replay
  • retention and purge

Oracle should remain the first backend to pass the full suite.

PostgreSQL and MongoDB are not ready until they pass the same suite.
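
One way to keep the suite identical across backends is an abstract fixture-based test class that each backend plugin closes over (a sketch; `IWorkflowBackendFixture` and the test body are assumptions):

```csharp
// Sketch of a shared conformance suite; each backend test project
// supplies its own fully composed backend profile via the fixture.
public abstract class WorkflowBackendConformanceTests
{
    protected abstract IWorkflowBackendFixture CreateFixture();

    [Fact]
    public async Task Update_with_stale_version_reports_conflict()
    {
        await using var fixture = CreateFixture();
        // Insert a runtime state record at version 1, update it with
        // expectedVersion 0, and assert the backend surfaces a version
        // conflict rather than silently overwriting.
    }

    // Further members cover timer due resume, duplicate delivery safety,
    // restart recovery, dead-letter move and replay, and retention.
}
```

Each backend then inherits the class once, so a new backend cannot pass a weaker suite than Oracle.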

Backend-Specific Risks

PostgreSQL Risks

  • row-level queue claim logic can create hot indexes under high throughput
  • LISTEN/NOTIFY payloads are not durable
  • reclaim and retry logic must be designed carefully to avoid stuck claimed rows
  • due-row access patterns must be tuned with indexes and partitioning if volume grows

MongoDB Risks

  • production-grade correctness depends on replica-set transactions
  • change streams add operational requirements and resume-token handling
  • projection queries may become more complex if the read model is heavily relational today
  • collection growth and retention strategy must be explicit early

Oracle Risks

  • Oracle remains the strongest correctness model but the least portable implementation
  • engine logic must not drift toward AQ-only assumptions that other backends cannot model

Recommended Build Order

Do not start by building PostgreSQL and MongoDB in parallel.

Use this order:

  1. stabilize Oracle as the contract baseline
  2. refactor the host into a true backend-plugin model
  3. implement PostgreSQL profile
  4. pass the full backend conformance suite on PostgreSQL
  5. implement MongoDB profile only if there is a real product need for MongoDB as the system of record

PostgreSQL should come before MongoDB because:

  • its runtime-state and projection model are closer to the current Oracle design
  • its transaction semantics fit the engine more naturally
  • the read-side model is already relational

Validation Order After Functional Backend Completion

Functional backend completion is not the same as backend readiness.

After a backend can start, resume, signal, schedule, and retain workflows, the next required order is:

  1. backend-neutral hostile-condition coverage
  2. curated Bulstrad parity coverage
  3. backend-neutral performance tiers
  4. backend-specific baseline publication
  5. final three-backend comparison

This means:

  • PostgreSQL is not done when its basic stores and buses compile; it must also match the Oracle hostile-condition and Bulstrad suites
  • MongoDB is not done when replica-set transactions and signal delivery work; it must also match the same parity and performance suites
  • the final adoption decision should be based on the shared comparison pack, not on isolated backend microbenchmarks

Proposed Sprint

Sprint 14: Backend Portability And Store Profiles

Goal

Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.

Scope

  • introduce backend profile abstraction and dedicated backend plugin registration
  • split projection persistence from the current Oracle-first application service
  • formalize mutation coordinator abstraction
  • add backend-neutral dead-letter contract
  • define and implement backend conformance suite
  • implement PostgreSQL profile
  • design MongoDB profile in executable detail, with implementation only after explicit product approval

Deliverables

  • IWorkflowBackendRegistrationMarker
  • backend-neutral projection contract
  • backend-neutral mutation coordinator contract
  • backend conformance test suite
  • dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
  • architecture-ready MongoDB backend plugin design package

Exit Criteria

  • host selects one backend profile by configuration
  • host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
  • Oracle and PostgreSQL pass the same conformance suite
  • MongoDB path is specified well enough that implementation is a bounded engineering task
  • workflow declarations and canonical definitions remain unchanged across backend profiles

Final Rule

Backend switching is an infrastructure concern, not a workflow concern.

If a future backend requires changing workflow declarations, canonical definitions, or engine semantics, that backend does not fit the architecture and should not be adopted without a new ADR.