09. Backend Portability Plan
Purpose
This document defines how SerdicaEngine should evolve from an Oracle-first runtime into a backend-switchable engine that can also run on PostgreSQL and MongoDB without changing workflow declarations, canonical definitions, or runtime semantics.
The goal is not to support every backend in the same way internally.
The goal is to preserve one stable engine contract:
- the same declarative workflow classes
- the same canonical runtime definitions
- the same public workflow/task APIs
- the same runtime behavior around tasks, waits, timers, external signals, subworkflows, retries, and retention
Backend switching must only change infrastructure adapters and host configuration.
Current Baseline
Today the strongest backend shape is Oracle:
- runtime state persists in an Oracle-backed runtime-state adapter
- projections persist in an Oracle-backed projection adapter
- immediate signaling and delayed scheduling run through Oracle AQ adapters
- the engine host composes those adapters through backend registration
Oracle is the reference implementation because it already gives:
- one durable database
- durable queueing
- delayed delivery
- blocking dequeue without polling
- transactional coupling between state mutation and queue enqueue
That reference point matters because PostgreSQL and MongoDB must match the engine contract even if they reach it through different infrastructure mechanisms.
Non-Negotiable Product Rules
Backend portability must not break these rules:
- Authored workflow classes remain pure declaration classes.
- Canonical runtime definitions remain backend-agnostic.
- Engine execution remains run-to-wait.
- Multi-instance deployment remains supported.
- Steady-state signal and timer discovery must not rely on polling loops.
- Signal delivery remains at-least-once.
- Resume remains idempotent through version and waiting-token checks.
- Public API contracts and projections remain stable.
- Operational features remain available:
- signal raise
- dead-letter inspection
- dead-letter replay
- runtime inspection
- retention
- diagram inspection
Architecture Principle
Do not make the engine "database-agnostic" by hiding everything behind one giant repository.
That approach will collapse important guarantees.
Instead, separate the backend into explicit capabilities:
- runtime state persistence
- projection persistence
- signal transport
- schedule transport
- mutation transaction boundary
- wake-up notification strategy
- lease or concurrency strategy
- dead-letter and replay strategy
- retention and purge strategy
Each backend implementation must satisfy the full capability matrix.
Implemented Signal Driver Split
The engine now separates durable signal ownership from wake-up delivery.
The shared seam is defined by engine signal-driver abstractions plus signal and schedule bridge contracts.
That split exists to preserve transactional correctness while still allowing faster wake strategies later.
The separation is:
- IWorkflowSignalStore: durable immediate signal persistence
- IWorkflowSignalDriver: wake-up and claim path for available signals
- IWorkflowSignalScheduler: durable delayed-signal persistence
- IWorkflowWakeOutbox: deferred wake publication when the driver is not transaction-coupled to the durable store
The public engine surface still uses:
- IWorkflowSignalBus
- IWorkflowScheduleBus
Those are now bridge contracts.
They do not define backend mechanics directly.
Current Backend Matrix
| Backend profile | Durable signal store | Wake driver | Schedule store | Dispatch mode |
|---|---|---|---|---|
| Oracle | Oracle AQ signal adapter | Oracle AQ blocking dequeue | Oracle AQ schedule adapter | NativeTransactional |
| PostgreSQL | PostgreSQL durable signal store | PostgreSQL native wake or claim adapter | PostgreSQL durable schedule store | NativeTransactional |
| MongoDB | MongoDB durable signal store | MongoDB change-stream wake or claim adapter | MongoDB durable schedule store | NativeTransactional |
Implemented Optional Redis Wake Driver
The Redis driver is implemented as a separate wake-driver plugin.
Its shape is intentionally narrow:
- Oracle, PostgreSQL, and MongoDB remain the durable signal stores.
- Oracle, PostgreSQL, and MongoDB persist durable signals transactionally.
- Redis receives wake hints directly after commit through the mutation scope post-commit hook.
- workers wake through Redis and then claim from the durable backend store.
Oracle is now supported in this combination, but it is not the preferred Oracle profile. Oracle native AQ wake remains the default because it is slightly faster in the current measurements and keeps the cleanest native timer and dequeue path.
Redis on Oracle exists for topology consistency, not because Oracle needs Redis for correctness or because it is the current fastest Oracle path.
Redis Driver Rules
Redis must remain a wake driver plugin, not the authoritative durable signal queue for mixed backends.
The intended shape is:
- Oracle, PostgreSQL, or MongoDB remains the durable IWorkflowSignalStore
- Redis becomes an IWorkflowSignalDriver
- the Redis wake hint is published directly after the durable store transaction commits
- backend-native wake drivers are not active when Redis is selected
That preserves the required correctness model:
- persist runtime state, projections, and durable signal inside the backend mutation boundary
- commit the mutation boundary
- publish the Redis wake hint from the registered post-commit action
- wake workers and claim from the durable backend store
IWorkflowWakeOutbox remains in the abstraction set for future non-Redis wake drivers that may still need deferred publication, but it is not the active Redis hot path.
Redis may improve signal-to-resume latency, especially for PostgreSQL and MongoDB where the durable store and the wake path are already split cleanly.
Redis must not become the correctness layer unless the whole durable signal model also moves there, which is not the design target of this engine.
Required Capability Matrix
Every engine backend profile must define concrete answers for the following:
| Capability | Oracle | PostgreSQL | MongoDB |
|---|---|---|---|
| Runtime state durability | Native | Required | Required |
| Projection durability | Native | Required | Required |
| Optimistic concurrency | Row/version | Row/version | Document version |
| Immediate signal durability | AQ | Queue table or queue extension | Signal collection |
| Delayed scheduling | AQ delayed delivery | Durable due-message table | Durable due-message collection |
| Blocking wake-up | AQ dequeue | LISTEN/NOTIFY, Redis wake driver, or dedicated queue worker | Change streams or Redis wake driver |
| Atomic state + signal publish | Native DB transaction | Outbox transaction | Transactional outbox or equivalent |
| Dead-letter support | AQ + table | Queue/DLQ table | DLQ collection |
| Multi-node safety | DB + AQ | DB + wake hints | DB + change stream / wake hints |
| Restart recovery | Native | Required | Required |
The backend is not complete until every row has a real implementation.
Engine Backend Layers
The switchable backend model should be built around these interfaces.
1. Runtime State Store
Responsible for:
- loading runtime snapshot by workflow instance id
- inserting new snapshot
- updating snapshot with expected version
- querying runtime status for operational needs
- storing engine-specific snapshot JSON
Target interface shape:
```csharp
public interface IWorkflowRuntimeStateStore
{
    Task<WorkflowRuntimeStateRecord?> GetAsync(string workflowInstanceId, CancellationToken ct = default);

    Task InsertAsync(WorkflowRuntimeStateRecord record, CancellationToken ct = default);

    Task UpdateAsync(
        WorkflowRuntimeStateRecord record,
        long expectedVersion,
        CancellationToken ct = default);
}
```
Notes:
- Oracle and PostgreSQL should use explicit version columns.
- MongoDB should use a document version field and conditional update filter.
- This store must not also own signal publishing logic.
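The version-check notes above can be sketched as follows. This is an illustration only: the table, column, and member names are assumptions, not the engine schema.

```csharp
using System;

// Sketch: optimistic-concurrency guard for the runtime state store.
// workflow_runtime_state and its columns are illustrative names.
public static class RuntimeStateConcurrency
{
    // Oracle/PostgreSQL shape: the WHERE clause enforces the expected version,
    // so a concurrent writer makes this statement affect zero rows.
    public const string UpdateSql = @"
        UPDATE workflow_runtime_state
           SET snapshot_json = :snapshot,
               version       = :expectedVersion + 1
         WHERE workflow_instance_id = :workflowInstanceId
           AND version              = :expectedVersion";

    // MongoDB shape: the same rule becomes a conditional update filter,
    // e.g. { _id: <id>, version: <expectedVersion> } with $set plus $inc.
    public static void ThrowIfNoMatch(long affected, string workflowInstanceId)
    {
        if (affected == 0)
            throw new InvalidOperationException(
                $"Version conflict for workflow instance '{workflowInstanceId}'.");
    }
}
```

Either way, the store reports the conflict and the engine decides how to retry; the store itself never publishes signals.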
2. Projection Store
Responsible for:
- workflow instance summaries
- task summaries
- task event history
- business reference lookup
- support read APIs
The projection model is product-facing and must remain stable.
That means:
- the shape of projection records must not depend on the backend
- only the persistence adapter may change
Target direction:
- split the current projection application service into a backend-neutral application service plus backend adapters
- keep one projection contract
- allow Oracle and PostgreSQL to stay relational
- allow MongoDB to project into document collections if needed
3. Signal Bus
Responsible for durable immediate signals:
- internal continue
- external signal
- task completion continuation
- subworkflow completion
- replay from dead-letter
The current contract already exists in the engine runtime abstractions.
Required guarantees:
- at-least-once delivery
- ack only after successful processing
- delivery count visibility
- explicit abandon
- explicit dead-letter move
- replay support
4. Schedule Bus
Responsible for durable delayed delivery:
- timer due
- retry due
- delayed continuation
Required guarantees:
- message is not lost across process restart
- message becomes visible at or after due time
- stale due messages are safely ignored through waiting tokens
- schedule and immediate signal semantics use the same envelope model
5. Mutation Transaction Boundary
This is the most important portability seam.
The engine mutates three things together:
- runtime state
- projections
- signals or schedules
Oracle can do that in one database transaction because state, projections, and AQ live inside the same durable boundary.
PostgreSQL and MongoDB may require an outbox-based boundary instead.
This must be explicit:
```csharp
public interface IWorkflowMutationCoordinator
{
    Task ExecuteAsync(
        Func<IWorkflowMutationContext, CancellationToken, Task> action,
        CancellationToken ct = default);
}
```
Where the mutation context exposes:
- runtime state adapter
- projection adapter
- signal outbox writer
- schedule outbox writer
Do not let the runtime service hand-roll transaction logic per backend.
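A minimal usage sketch of that seam, assuming the context exposes members named RuntimeState, Projections, and SignalOutbox (only IWorkflowMutationCoordinator itself is defined in this document):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Sketch: the runtime service asks for one mutation boundary and never opens
// backend transactions itself. Context member names are illustrative.
public sealed class WorkflowRuntimeService
{
    private readonly IWorkflowMutationCoordinator _coordinator;

    public WorkflowRuntimeService(IWorkflowMutationCoordinator coordinator)
        => _coordinator = coordinator;

    public Task CompleteTaskAsync(
        WorkflowRuntimeStateRecord record, long expectedVersion, CancellationToken ct)
        => _coordinator.ExecuteAsync(async (mutation, token) =>
        {
            // All three writes commit or roll back together; whether that is one
            // native transaction or a transaction plus outbox is the backend's choice.
            await mutation.RuntimeState.UpdateAsync(record, expectedVersion, token);
            await mutation.Projections.ApplyTaskCompletionAsync(record, token);
            await mutation.SignalOutbox.EnqueueContinuationAsync(record, token);
        }, ct);
}
```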
6. Wake-Up Notifier
The engine must not scan due rows in a steady loop.
That means every backend needs a wake-up channel:
- Oracle: AQ blocking dequeue
- PostgreSQL: LISTEN/NOTIFY as wake hint for durable queue tables
- MongoDB: change streams as wake hint for durable signal collections
The wake-up channel is not the durable source of truth except in Oracle AQ.
It is only the wake mechanism.
That distinction is mandatory for PostgreSQL and MongoDB.
Backend Profiles
Oracle Profile
Role
Oracle remains the reference backend profile and the operational default.
Storage Model
- runtime state table
- relational projection tables
- AQ signal queue
- AQ schedule queue or delayed signal queue
- DLQ table and AQ-assisted replay
Commit Model
- one transaction for runtime state, projections, and AQ enqueue
Wake Model
- AQ blocking dequeue
Advantages
- strongest correctness story
- simplest atomic mutation model
- no extra wake layer required
Risks
- Oracle-specific infrastructure coupling
- AQ operational expertise required
- portability work must not assume AQ-only features in engine logic
Oracle should be treated as the semantic gold standard that other backends must match.
PostgreSQL Profile
Goal
Provide a backend profile that preserves engine semantics using PostgreSQL as the durable system of record.
Recommended Shape
- runtime state in PostgreSQL tables
- projections in PostgreSQL tables
- durable signal queue table
- durable schedule queue table
- DLQ table
- LISTEN/NOTIFY for wake-up hints only
Why Not LISTEN/NOTIFY Alone
LISTEN/NOTIFY is not sufficient as the durable signal layer because notifications are ephemeral.
The durable truth must stay in tables.
The recommended model is:
- insert durable signal row in the same transaction as state/projection mutation
- emit NOTIFY before commit or immediately after the durable insert
- workers wake up and claim rows from the signal queue table
- if notification is missed, the next notification or startup recovery still finds the rows
Queue Claim Strategy
Recommended queue-claim pattern:
- claim with FOR UPDATE SKIP LOCKED
- ordered by available time, priority, and creation time
- delivery count increment on claim
- explicit ack by state transition or delete
- explicit dead-letter move after delivery limit
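The claim pattern above can be sketched as one PostgreSQL statement. Table and column names here are assumptions matching the schedule fields listed in this document, not a fixed schema:

```sql
-- Sketch: claim up to 10 due signals for this worker without blocking on rows
-- another worker already holds (SKIP LOCKED skips them instead of waiting).
WITH claimed AS (
    SELECT signal_id
      FROM workflow_signals
     WHERE status = 'Ready'
       AND available_at_utc <= now()
     ORDER BY available_at_utc, priority, created_at_utc
     FOR UPDATE SKIP LOCKED
     LIMIT 10
)
UPDATE workflow_signals s
   SET status         = 'Claimed',
       claimed_by     = $1,            -- worker identity
       claimed_at_utc = now(),
       delivery_count = s.delivery_count + 1
  FROM claimed
 WHERE s.signal_id = claimed.signal_id
RETURNING s.signal_id, s.payload;
```

Ack then deletes the row or transitions its status; a row whose delivery_count exceeds the limit is moved to the DLQ table instead of being reclaimed.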
Schedule Strategy
Recommended schedule table:
- signal_id
- available_at_utc
- workflow_instance_id
- runtime_provider
- signal_type
- serialized payload
- delivery count
- dead-letter metadata
Recommended wake-up path:
- durable insert into schedule table
- NOTIFY workflow_signal
- workers wake and attempt to claim rows with available_at_utc <= now()
This is still not "polling" if workers block on LISTEN and only do bounded claim attempts on wake-up, startup, and recovery events.
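As a hedged sketch, the schedule table and its due-row index could look like this (names and types are assumptions derived from the field list above):

```sql
-- Sketch only: illustrative DDL for the durable schedule table.
CREATE TABLE workflow_schedules (
    signal_id            uuid PRIMARY KEY,
    workflow_instance_id text        NOT NULL,
    runtime_provider     text        NOT NULL,
    signal_type          text        NOT NULL,
    available_at_utc     timestamptz NOT NULL,
    payload              jsonb       NOT NULL,
    status               text        NOT NULL DEFAULT 'Ready',
    delivery_count       int         NOT NULL DEFAULT 0,
    dead_letter_reason   text        NULL
);

-- Partial index so bounded claim attempts only scan due, unclaimed rows.
CREATE INDEX ix_workflow_schedules_due
    ON workflow_schedules (available_at_utc)
 WHERE status = 'Ready';
```

The partial index is also the starting point for the due-row tuning called out in the operational notes below.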
Atomicity Model
PostgreSQL cannot rely on an external broker if we want the same atomicity guarantees.
The cleanest profile is:
- database state
- database projections
- database signal queue
- database schedule queue
- NOTIFY as a non-durable wake hint
That keeps the entire correctness boundary in PostgreSQL.
Operational Notes
Need explicit handling for:
- orphan claimed rows after node crash
- reclaim timeout
- dead-letter browsing and replay
- table bloat and retention
- index strategy for due rows
Suggested Components
- PostgresWorkflowRuntimeStateStore
- PostgresWorkflowProjectionStore
- PostgresWorkflowSignalQueue
- PostgresWorkflowScheduleQueue
- PostgresWorkflowWakeListener
- PostgresWorkflowMutationCoordinator
MongoDB Profile
Goal
Provide a backend profile that preserves engine semantics using MongoDB as the durable system of record.
Recommended Shape
- runtime state in a workflow_runtime_states collection
- projections in dedicated collections
- a durable workflow_signals collection
- a durable workflow_schedules collection
- a dead-letter collection
- change streams for wake-up hints
Why Change Streams Are Not Enough
Change streams are a wake mechanism, not the durable queue itself.
The durable truth must remain in collections so the engine can recover after:
- service restart
- watcher restart
- temporary connectivity loss
Document Model
Signal document fields should include:
- _id
- workflowInstanceId
- runtimeProvider
- signalType
- waitingToken
- expectedVersion
- dueAtUtc
- status
- deliveryCount
- claimedBy
- claimedAtUtc
- deadLetterReason
- payload
Claim Strategy
Recommended model:
- atomically claim one available document with findOneAndUpdate
- filter by: status = Ready, dueAtUtc <= now, not already claimed
- set: status = Claimed, claimedBy, claimedAtUtc
- increment deliveryCount
Ack means:
- delete the signal or mark it completed
Abandon means:
- move back to Ready
Dead-letter means:
- move to the DLQ collection or set status = DeadLetter
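The claim step can be sketched with the MongoDB .NET driver. Field names mirror the document model above; the collection shape is an assumption, not a fixed contract:

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

// Sketch: atomic claim of one due signal document.
public static class MongoSignalClaim
{
    public static BsonDocument? ClaimOne(
        IMongoCollection<BsonDocument> signals, string workerId)
    {
        var now = DateTime.UtcNow;

        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("status", "Ready"),
            Builders<BsonDocument>.Filter.Lte("dueAtUtc", now));

        var update = Builders<BsonDocument>.Update
            .Set("status", "Claimed")
            .Set("claimedBy", workerId)
            .Set("claimedAtUtc", now)
            .Inc("deliveryCount", 1);

        // findOneAndUpdate is atomic per document, so two workers can never
        // claim the same signal; null means nothing is currently due.
        return signals.FindOneAndUpdate(filter, update,
            new FindOneAndUpdateOptions<BsonDocument>
            {
                Sort = Builders<BsonDocument>.Sort.Ascending("dueAtUtc"),
                ReturnDocument = ReturnDocument.After
            });
    }
}
```

Ack, abandon, and dead-letter are then plain status transitions (or a delete) against the claimed document.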
Schedule Strategy
Two reasonable models exist.
Model A: Separate Schedule Collection
- keep delayed signals in workflow_schedules
- promote due documents into workflow_signals
- wake workers through change streams
This is simpler conceptually but adds one extra movement step.
Model B: Unified Signal Collection
- store all signals in one collection
- use dueAtUtc and status
- workers claim only due documents
This is the better v1 choice because it keeps one signal envelope pipeline.
Atomicity Model
MongoDB can support multi-document transactions in replica-set mode.
That means the preferred model is:
- runtime state
- projections
- signal collection writes
- schedule writes
all inside one MongoDB transaction.
If that operational assumption is unacceptable, then MongoDB is not a correctness-grade replacement for the Oracle profile and should not be offered as a production engine backend.
Wake Model
Use change streams to avoid steady-state polling:
- watch inserts and state transitions for ready or due signals
- on startup, run bounded recovery sweep for unclaimed ready signals
- on worker restart, resume from durable signal documents, not from missed change stream events
Operational Notes
Need explicit handling for:
- resume token persistence for observers
- claimed-document recovery after node failure
- shard-key implications if sharding is introduced later
- transactional prerequisites in local and CI test environments
Suggested Components
- MongoWorkflowRuntimeStateStore
- MongoWorkflowProjectionStore
- MongoWorkflowSignalStore
- MongoWorkflowWakeStreamListener
- MongoWorkflowMutationCoordinator
Backend Selection Model
The engine should not expose dozens of independent switches in appsettings.
Use one backend profile section plus internal composition.
Recommended shape:
```json
{
  "WorkflowEngine": {
    "BackendProfile": "Oracle"
  }
}
```
And then backend-specific option sections:
```json
{
  "WorkflowBackend:Oracle": {
    "ConnectionString": "...",
    "QueueOwner": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "DeadLetterQueueName": "WF_SIGNAL_DLQ"
  },
  "WorkflowBackend:PostgreSql": {
    "ConnectionString": "...",
    "SignalTable": "workflow_signals",
    "ScheduleTable": "workflow_schedules",
    "DeadLetterTable": "workflow_signal_dead_letters",
    "NotificationChannel": "workflow_signal"
  },
  "WorkflowBackend:MongoDb": {
    "ConnectionString": "...",
    "DatabaseName": "serdica_workflow",
    "SignalCollection": "workflow_signals",
    "RuntimeStateCollection": "workflow_runtime_states",
    "ProjectionPrefix": "workflow"
  }
}
```
The DI layer should map BackendProfile to one complete backend package, not a mix-and-match set of partial adapters.
That avoids unsupported combinations like:
- Oracle state + Mongo signals
- PostgreSQL state + Redis schedule
unless they are designed explicitly as a later profile.
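One-profile-one-package composition can be sketched like this. The Add*WorkflowBackend extension methods are assumed names that would be owned by the backend plugin projects; they are not existing APIs:

```csharp
using System;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Sketch: BackendProfile selects exactly one complete backend package.
public static class WorkflowBackendCompositionExtensions
{
    public static IServiceCollection AddWorkflowBackend(
        this IServiceCollection services, IConfiguration configuration)
    {
        var profile = configuration["WorkflowEngine:BackendProfile"];

        return profile switch
        {
            "Oracle"     => services.AddOracleWorkflowBackend(configuration),
            "PostgreSql" => services.AddPostgresWorkflowBackend(configuration),
            "MongoDb"    => services.AddMongoWorkflowBackend(configuration),
            // Fail fast instead of silently composing a partial adapter mix.
            _ => throw new InvalidOperationException(
                $"Unknown WorkflowEngine:BackendProfile '{profile}'.")
        };
    }
}
```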
Implementation Refactor Needed
To make the backend switch clean, the current Oracle-first host should be refactored in this order.
Phase 1: Split Projection Persistence
Refactor the current projection application service into:
- projection application service
- backend-neutral projection contract
- Oracle implementation
Then add backend implementations later without changing the application service.
Phase 2: Introduce Dedicated Backend Plugin Registration
Add:
```csharp
public interface IWorkflowBackendRegistrationMarker
{
    string BackendName { get; }
}
```
Then create dedicated backend plugins for:
- Oracle
- PostgreSQL
- MongoDB
The host should remain backend-neutral and validate that the selected backend plugin has registered itself. Each backend plugin should own registration of:
- runtime state store
- projection store
- mutation coordinator
- signal bus
- schedule bus
- dead-letter store
- backend-specific options and wake-up strategy
Phase 3: Move Transaction Logic Into Backend Coordinator
Refactor the current workflow mutation transaction scope so the runtime service no longer knows whether the backend uses:
- direct database transaction
- database transaction plus outbox
- document transaction
The runtime service should only ask for one mutation boundary.
Phase 4: Normalize Dead-Letter Model
Standardize a backend-neutral dead-letter record so the operational endpoints do not care which backend stores it.
That includes:
- signal id
- workflow instance id
- signal type
- first failure time
- last failure time
- delivery count
- last error
- payload snapshot
Phase 5: Introduce Backend Conformance Tests
Every backend must pass the same contract suite:
- state insert/update/version conflict
- task activation and completion
- timer due resume
- external signal resume
- subworkflow completion resume
- duplicate delivery safety
- restart recovery
- dead-letter move and replay
- retention and purge
Oracle should remain the first backend to pass the full suite.
PostgreSQL and MongoDB are not ready until they pass the same suite.
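One way to enforce "same suite, every backend" is a single abstract test class that each backend plugin derives exactly once. The fixture seam and test names below are assumptions for illustration, not existing code:

```csharp
using System;
using System.Threading.Tasks;

// Sketch: shared conformance suite. Each backend derives this once and supplies
// its fully composed stores, buses, and coordinator through the fixture.
public abstract class WorkflowBackendConformanceSuite
{
    protected abstract Task<IWorkflowBackendFixture> CreateBackendAsync();

    // One test per contract guarantee from the checklist above, for example:
    //   StaleVersionUpdate_IsRejected
    //   TimerDueSignal_ResumesInstance
    //   DuplicateSignalDelivery_IsIdempotent
    //   DeadLetterMove_ThenReplay_Resumes
    //   RestartRecovery_FindsUnclaimedSignals
}

// Illustrative fixture seam owned by each backend plugin.
public interface IWorkflowBackendFixture : IAsyncDisposable
{
    IWorkflowRuntimeStateStore RuntimeState { get; }
    IWorkflowMutationCoordinator Mutations { get; }
}
```

A backend that cannot host this suite unmodified is, by definition, not contract-complete.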
Backend-Specific Risks
PostgreSQL Risks
- row-level queue claim logic can create hot indexes under high throughput
- LISTEN/NOTIFY payloads are not durable
- reclaim and retry logic must be designed carefully to avoid stuck claimed rows
- due-row access patterns must be tuned with indexes and partitioning if volume grows
MongoDB Risks
- production-grade correctness depends on replica-set transactions
- change streams add operational requirements and resume-token handling
- projection queries may become more complex if the read model is heavily relational today
- collection growth and retention strategy must be explicit early
Oracle Risks
- Oracle remains the strongest correctness model but the least portable implementation
- engine logic must not drift toward AQ-only assumptions that other backends cannot model
Recommended Rollout Order
Do not build PostgreSQL and MongoDB in parallel first.
Use this order:
- stabilize Oracle as the contract baseline
- refactor the host into a true backend-plugin model
- implement PostgreSQL profile
- pass the full backend conformance suite on PostgreSQL
- implement MongoDB profile only if there is a real product need for MongoDB as the system of record
PostgreSQL should come before MongoDB because:
- its runtime-state and projection model are closer to the current Oracle design
- its transaction semantics fit the engine more naturally
- the read-side model is already relational
Validation Order After Functional Backend Completion
Functional backend completion is not the same as backend readiness.
After a backend can start, resume, signal, schedule, and retain workflows, the next required order is:
- backend-neutral hostile-condition coverage
- curated Bulstrad parity coverage
- backend-neutral performance tiers
- backend-specific baseline publication
- final three-backend comparison
This means:
- PostgreSQL is not done when its basic stores and buses compile; it must also match the Oracle hostile-condition and Bulstrad suites
- MongoDB is not done when replica-set transactions and signal delivery work; it must also match the same parity and performance suites
- the final adoption decision should be based on the shared comparison pack, not on isolated backend microbenchmarks
Proposed Sprint
Sprint 14: Backend Portability And Store Profiles
Goal
Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.
Scope
- introduce backend profile abstraction and dedicated backend plugin registration
- split projection persistence from the current Oracle-first application service
- formalize mutation coordinator abstraction
- add backend-neutral dead-letter contract
- define and implement backend conformance suite
- implement PostgreSQL profile
- design MongoDB profile in executable detail, with implementation only after explicit product approval
Deliverables
- IWorkflowBackendRegistrationMarker
- backend-neutral projection contract
- backend-neutral mutation coordinator contract
- backend conformance test suite
- dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
- architecture-ready MongoDB backend plugin design package
Exit Criteria
- host selects one backend profile by configuration
- host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
- Oracle and PostgreSQL pass the same conformance suite
- MongoDB path is specified well enough that implementation is a bounded engineering task
- workflow declarations and canonical definitions remain unchanged across backend profiles
Final Rule
Backend switching is an infrastructure concern, not a workflow concern.
If a future backend requires changing workflow declarations, canonical definitions, or engine semantics, that backend does not fit the architecture and should not be adopted without a new ADR.