Add StellaOps.Workflow engine: 14 libraries, WebService, 8 test projects
Extract product-agnostic workflow engine from Ablera.Serdica.Workflow into standalone StellaOps.Workflow.* libraries targeting net10.0.

Libraries (14):
- Contracts, Abstractions (compiler, decompiler, expression runtime)
- Engine (execution, signaling, scheduling, projections, hosted services)
- ElkSharp (generic graph layout algorithm)
- Renderer.ElkSharp, Renderer.ElkJs, Renderer.Msagl, Renderer.Svg
- Signaling.Redis, Signaling.OracleAq
- DataStore.MongoDB, DataStore.PostgreSQL, DataStore.Oracle

WebService: ASP.NET Core Minimal API with 22 endpoints

Tests (8 projects, 109 tests pass):
- Engine.Tests (105 pass), WebService.Tests (4 E2E pass)
- Renderer.Tests, DataStore.MongoDB/Oracle/PostgreSQL.Tests
- Signaling.Redis.Tests, IntegrationTests.Shared

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs/workflow/engine/09-backend-portability-plan.md

# 09. Backend Portability Plan

## Purpose

This document defines how `SerdicaEngine` should evolve from an Oracle-first runtime into a backend-switchable engine that can also run on PostgreSQL and MongoDB without changing workflow declarations, canonical definitions, or runtime semantics.

The goal is not to support every backend in the same way internally.

The goal is to preserve one stable engine contract:

- the same declarative workflow classes
- the same canonical runtime definitions
- the same public workflow/task APIs
- the same runtime behavior around tasks, waits, timers, external signals, subworkflows, retries, and retention

Backend switching must only change infrastructure adapters and host configuration.

## Current Baseline

Today the strongest backend shape is Oracle:

- runtime state persists in an Oracle-backed runtime-state adapter
- projections persist in an Oracle-backed projection adapter
- immediate signaling and delayed scheduling run through Oracle AQ adapters
- the engine host composes those adapters through backend registration

Oracle is the reference implementation because it already gives:

- one durable database
- durable queueing
- delayed delivery
- blocking dequeue without polling
- transactional coupling between state mutation and queue enqueue

That reference point matters because PostgreSQL and MongoDB must match the engine contract even if they reach it through different infrastructure mechanisms.

## Non-Negotiable Product Rules

Backend portability must not break these rules:

1. Authored workflow classes remain pure declaration classes.
2. Canonical runtime definitions remain backend-agnostic.
3. Engine execution remains run-to-wait.
4. Multi-instance deployment remains supported.
5. Steady-state signal and timer discovery must not rely on polling loops.
6. Signal delivery remains at-least-once.
7. Resume remains idempotent through version and waiting-token checks.
8. Public API contracts and projections remain stable.
9. Operational features remain available:
   - signal raise
   - dead-letter inspection
   - dead-letter replay
   - runtime inspection
   - retention
   - diagram inspection

## Architecture Principle

Do not make the engine "database-agnostic" by hiding everything behind one giant repository.

That approach will collapse important guarantees.

Instead, separate the backend into explicit capabilities:

1. runtime state persistence
2. projection persistence
3. signal transport
4. schedule transport
5. mutation transaction boundary
6. wake-up notification strategy
7. lease or concurrency strategy
8. dead-letter and replay strategy
9. retention and purge strategy

Each backend implementation must satisfy the full capability matrix.

## Implemented Signal Driver Split

The engine now separates durable signal ownership from wake-up delivery.

The shared seam is defined by engine signal-driver abstractions plus signal and schedule bridge contracts.

That split exists to preserve transactional correctness while still allowing faster wake strategies later.

The separation is:

- `IWorkflowSignalStore`: durable immediate signal persistence
- `IWorkflowSignalDriver`: wake-up and claim path for available signals
- `IWorkflowSignalScheduler`: durable delayed-signal persistence
- `IWorkflowWakeOutbox`: deferred wake publication when the driver is not transaction-coupled to the durable store
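
A minimal sketch of how these four seams can be shaped, in the style of the engine's existing contracts; the member names, signatures, and the `WorkflowSignalEnvelope` type are illustrative assumptions, not the actual engine surface:

```csharp
// Illustrative sketch only: real engine signatures may differ.
public interface IWorkflowSignalStore
{
    // Persist a signal inside the backend mutation boundary.
    Task AppendAsync(WorkflowSignalEnvelope signal, CancellationToken ct = default);
}

public interface IWorkflowSignalDriver
{
    // Block until a signal may be available, then claim it from the durable store.
    Task<WorkflowSignalEnvelope?> WaitAndClaimAsync(CancellationToken ct = default);
}

public interface IWorkflowSignalScheduler
{
    // Persist a signal that becomes claimable at or after dueAtUtc.
    Task ScheduleAsync(WorkflowSignalEnvelope signal, DateTime dueAtUtc, CancellationToken ct = default);
}

public interface IWorkflowWakeOutbox
{
    // Record a wake hint to publish after the durable transaction commits.
    Task EnqueueWakeAsync(string workflowInstanceId, CancellationToken ct = default);
}
```

The split keeps durable persistence (`IWorkflowSignalStore`, `IWorkflowSignalScheduler`) and wake delivery (`IWorkflowSignalDriver`) independently replaceable per backend profile.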

The public engine surface still uses:

- `IWorkflowSignalBus`
- `IWorkflowScheduleBus`

Those are now bridge contracts.

They do not define backend mechanics directly.

### Current Backend Matrix

| Backend profile | Durable signal store | Wake driver | Schedule store | Dispatch mode |
|-----------------|----------------------|-------------|----------------|---------------|
| Oracle | Oracle AQ signal adapter | Oracle AQ blocking dequeue | Oracle AQ schedule adapter | `NativeTransactional` |
| PostgreSQL | PostgreSQL durable signal store | PostgreSQL native wake or claim adapter | PostgreSQL durable schedule store | `NativeTransactional` |
| MongoDB | MongoDB durable signal store | MongoDB change-stream wake or claim adapter | MongoDB durable schedule store | `NativeTransactional` |

### Implemented Optional Redis Wake Driver

The Redis driver is implemented as a separate wake-driver plugin.

Its shape is intentionally narrow:

- Oracle, PostgreSQL, and MongoDB remain the durable signal stores.
- Oracle, PostgreSQL, and MongoDB persist durable signals transactionally.
- Redis receives wake hints directly after commit through the mutation scope post-commit hook.
- workers wake through Redis and then claim from the durable backend store.

Oracle is now supported in this combination, but it is not the preferred Oracle profile.
Oracle native AQ wake remains the default because it is slightly faster in the current measurements and keeps the cleanest native timer and dequeue path.

Redis on Oracle exists for topology consistency, not because Oracle needs Redis for correctness or because it is the current fastest Oracle path.

### Redis Driver Rules

Redis must remain a wake driver plugin, not the authoritative durable signal queue for mixed backends.

The intended shape is:

- Oracle, PostgreSQL, or MongoDB remains the durable `IWorkflowSignalStore`
- Redis becomes an `IWorkflowSignalDriver`
- the Redis wake hint is published directly after the durable store transaction commits
- backend-native wake drivers are not active when Redis is selected

That preserves the required correctness model:

1. persist runtime state, projections, and the durable signal inside the backend mutation boundary
2. commit the mutation boundary
3. publish the Redis wake hint from the registered post-commit action
4. wake workers and claim from the durable backend store
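
A sketch of step 3, assuming StackExchange.Redis; the class name, channel name, and the post-commit hook that would invoke it are assumptions for illustration:

```csharp
// Illustrative post-commit wake hint publisher, assuming StackExchange.Redis.
using StackExchange.Redis;

public sealed class RedisWakeHintPublisher
{
    private readonly ISubscriber _subscriber;

    public RedisWakeHintPublisher(IConnectionMultiplexer redis)
        => _subscriber = redis.GetSubscriber();

    // Registered as a post-commit action: it runs only after the durable
    // store transaction has committed, so a lost hint can never hide an
    // uncommitted signal - workers recover it from the durable store.
    public Task PublishWakeAsync(string workflowInstanceId)
        => _subscriber.PublishAsync(
            RedisChannel.Literal("workflow:wake"),   // channel name is an assumption
            workflowInstanceId,
            CommandFlags.FireAndForget);
}
```

Fire-and-forget publication is acceptable here precisely because the hint is not the source of truth: a missed hint only delays the claim until the next wake or recovery sweep.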

`IWorkflowWakeOutbox` remains in the abstraction set for future non-Redis wake drivers that may still need deferred publication, but it is not the active Redis hot path.

Redis may improve signal-to-resume latency, especially for PostgreSQL and MongoDB where the durable store and the wake path are already split cleanly.

Redis must not become the correctness layer unless the whole durable signal model also moves there, which is not the design target of this engine.

## Required Capability Matrix

Every engine backend profile must define concrete answers for the following:

| Capability | Oracle | PostgreSQL | MongoDB |
|-----------|--------|------------|---------|
| Runtime state durability | Native | Required | Required |
| Projection durability | Native | Required | Required |
| Optimistic concurrency | Row/version | Row/version | Document version |
| Immediate signal durability | AQ | Queue table or queue extension | Signal collection |
| Delayed scheduling | AQ delayed delivery | Durable due-message table | Durable due-message collection |
| Blocking wake-up | AQ dequeue | `LISTEN/NOTIFY`, Redis wake driver, or dedicated queue worker | Change streams or Redis wake driver |
| Atomic state + signal publish | Native DB transaction | Outbox transaction | Transactional outbox or equivalent |
| Dead-letter support | AQ + table | Queue/DLQ table | DLQ collection |
| Multi-node safety | DB + AQ | DB + wake hints | DB + change stream / wake hints |
| Restart recovery | Native | Required | Required |

The backend is not complete until every row has a real implementation.

## Engine Backend Layers

The switchable backend model should be built around these interfaces.

### 1. Runtime State Store

Responsible for:

- loading the runtime snapshot by workflow instance id
- inserting a new snapshot
- updating a snapshot with an expected version
- querying runtime status for operational needs
- storing engine-specific snapshot JSON

Target interface shape:

```csharp
public interface IWorkflowRuntimeStateStore
{
    Task<WorkflowRuntimeStateRecord?> GetAsync(string workflowInstanceId, CancellationToken ct = default);

    Task InsertAsync(WorkflowRuntimeStateRecord record, CancellationToken ct = default);

    Task UpdateAsync(
        WorkflowRuntimeStateRecord record,
        long expectedVersion,
        CancellationToken ct = default);
}
```

Notes:

- Oracle and PostgreSQL should use explicit version columns.
- MongoDB should use a document version field and conditional update filter.
- This store must not also own signal publishing logic.
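
For the MongoDB note, the conditional update filter can be sketched with the official C# driver; the record's property names, mutability, and the conflict exception are assumptions for the sketch:

```csharp
// Illustrative MongoDB conditional update for the document-version note above.
using MongoDB.Driver;

public static async Task UpdateWithVersionAsync(
    IMongoCollection<WorkflowRuntimeStateRecord> states,
    WorkflowRuntimeStateRecord record,
    long expectedVersion,
    CancellationToken ct = default)
{
    // Match both the id and the expected version; a concurrent writer that
    // already bumped the version makes this filter match nothing.
    var filter = Builders<WorkflowRuntimeStateRecord>.Filter.And(
        Builders<WorkflowRuntimeStateRecord>.Filter.Eq(r => r.WorkflowInstanceId, record.WorkflowInstanceId),
        Builders<WorkflowRuntimeStateRecord>.Filter.Eq(r => r.Version, expectedVersion));

    record.Version = expectedVersion + 1;
    var result = await states.ReplaceOneAsync(filter, record, cancellationToken: ct);

    // Zero matches means a version conflict, not a missing document.
    if (result.MatchedCount == 0)
        throw new InvalidOperationException(
            $"Version conflict for workflow instance {record.WorkflowInstanceId}.");
}
```

This mirrors the row/version semantics of the relational backends without requiring a server-side lock.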

### 2. Projection Store

Responsible for:

- workflow instance summaries
- task summaries
- task event history
- business reference lookup
- support read APIs

The projection model is product-facing and must remain stable.

That means:

- the shape of projection records must not depend on the backend
- only the persistence adapter may change

Target direction:

- split the current projection application service into a backend-neutral application service plus backend adapters
- keep one projection contract
- allow Oracle and PostgreSQL to stay relational
- allow MongoDB to project into document collections if needed

### 3. Signal Bus

Responsible for durable immediate signals:

- internal continue
- external signal
- task completion continuation
- subworkflow completion
- replay from dead-letter

The current contract already exists in the engine runtime abstractions.

Required guarantees:

- at-least-once delivery
- ack only after successful processing
- delivery count visibility
- explicit abandon
- explicit dead-letter move
- replay support

### 4. Schedule Bus

Responsible for durable delayed delivery:

- timer due
- retry due
- delayed continuation

Required guarantees:

- the message is not lost across process restart
- the message becomes visible at or after its due time
- stale due messages are safely ignored through waiting tokens
- schedule and immediate signal semantics use the same envelope model

### 5. Mutation Transaction Boundary

This is the most important portability seam.

The engine mutates three things together:

- runtime state
- projections
- signals or schedules

Oracle can do that in one database transaction because state, projections, and AQ live inside the same durable boundary.

PostgreSQL and MongoDB may require an outbox-based boundary instead.

This must be explicit:

```csharp
public interface IWorkflowMutationCoordinator
{
    Task ExecuteAsync(
        Func<IWorkflowMutationContext, CancellationToken, Task> action,
        CancellationToken ct = default);
}
```

Where the mutation context exposes:

- runtime state adapter
- projection adapter
- signal outbox writer
- schedule outbox writer

Do not let the runtime service hand-roll transaction logic per backend.
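
A caller-side sketch of what "one mutation boundary" means for the runtime service; the context member names (`RuntimeState`, `Projections`, `SignalOutbox`) and the adapter method names are assumptions:

```csharp
// Illustrative runtime-service usage of the mutation coordinator.
public sealed class WorkflowRuntimeService
{
    private readonly IWorkflowMutationCoordinator _mutations;

    public WorkflowRuntimeService(IWorkflowMutationCoordinator mutations)
        => _mutations = mutations;

    public Task CompleteTaskAsync(
        WorkflowRuntimeStateRecord state,
        WorkflowSignalEnvelope continuation,
        CancellationToken ct)
        => _mutations.ExecuteAsync(async (ctx, token) =>
        {
            // All three mutations commit or roll back together; whether that
            // is one database transaction, a transaction plus outbox, or a
            // document transaction is the coordinator's concern, not ours.
            await ctx.RuntimeState.UpdateAsync(state, state.Version, token);
            await ctx.Projections.ApplyAsync(state, token);
            await ctx.SignalOutbox.AppendAsync(continuation, token);
        }, ct);
}
```

The service never opens a connection or transaction itself, which is what keeps it portable across backend profiles.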

### 6. Wake-Up Notifier

The engine must not scan due rows in a steady loop.

That means every backend needs a wake-up channel:

- Oracle: AQ blocking dequeue
- PostgreSQL: `LISTEN/NOTIFY` as wake hint for durable queue tables
- MongoDB: change streams as wake hint for durable signal collections

The wake-up channel is not the durable source of truth except in Oracle AQ.

It is only the wake mechanism.

That distinction is mandatory for PostgreSQL and MongoDB.

## Backend Profiles

## Oracle Profile

### Role

Oracle remains the reference backend profile and the operational default.

### Storage Model

- runtime state table
- relational projection tables
- AQ signal queue
- AQ schedule queue or delayed signal queue
- DLQ table and AQ-assisted replay

### Commit Model

- one transaction for runtime state, projections, and AQ enqueue

### Wake Model

- AQ blocking dequeue

### Advantages

- strongest correctness story
- simplest atomic mutation model
- no extra wake layer required

### Risks

- Oracle-specific infrastructure coupling
- AQ operational expertise required
- portability work must not assume AQ-only features in engine logic

Oracle should be treated as the semantic gold standard that other backends must match.

## PostgreSQL Profile

### Goal

Provide a backend profile that preserves engine semantics using PostgreSQL as the durable system of record.

### Recommended Shape

- runtime state in PostgreSQL tables
- projections in PostgreSQL tables
- durable signal queue table
- durable schedule queue table
- DLQ table
- `LISTEN/NOTIFY` for wake-up hints only

### Why Not `LISTEN/NOTIFY` Alone

`LISTEN/NOTIFY` is not sufficient as the durable signal layer because notifications are ephemeral.

The durable truth must stay in tables.

The recommended model is:

1. insert the durable signal row in the same transaction as the state/projection mutation
2. emit `NOTIFY` inside that transaction (PostgreSQL delivers the notification only after commit) or immediately after the durable insert
3. workers wake up and claim rows from the signal queue table
4. if a notification is missed, the next notification or startup recovery still finds the rows

### Queue Claim Strategy

Recommended queue-claim pattern:

- `FOR UPDATE SKIP LOCKED`
- ordered by available time, priority, and creation time
- delivery count increment on claim
- explicit ack by state transition or delete
- explicit dead-letter move after delivery limit
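
The claim pattern above can be sketched as one statement; the table and column names are illustrative assumptions aligned with the suggested schema in this document, not final DDL:

```sql
-- Illustrative single-row claim: SKIP LOCKED lets concurrent workers pass
-- over rows another worker is already claiming instead of blocking on them.
WITH claimed AS (
    SELECT signal_id
    FROM workflow_signals
    WHERE status = 'Ready'
      AND available_at_utc <= now()
    ORDER BY available_at_utc, priority, created_at_utc
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE workflow_signals s
SET status         = 'Claimed',
    claimed_by     = $1,                      -- worker identity
    claimed_at_utc = now(),
    delivery_count = s.delivery_count + 1     -- increment on claim
FROM claimed
WHERE s.signal_id = claimed.signal_id
RETURNING s.*;
```

Ack then deletes the row or transitions its status; exceeding the delivery limit moves the row to the DLQ table instead of reclaiming it.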

### Schedule Strategy

Recommended schedule table:

- `signal_id`
- `available_at_utc`
- `workflow_instance_id`
- `runtime_provider`
- `signal_type`
- serialized payload
- delivery count
- dead-letter metadata

Recommended wake-up path:

- durable insert into the schedule table
- `NOTIFY workflow_signal`
- workers wake and attempt to claim rows with `available_at_utc <= now()`

This is still not "polling" if workers block on `LISTEN` and only do bounded claim attempts on wake-up, startup, and recovery events.
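
A sketch of that blocking wake loop, assuming Npgsql; `claimDueSignals` stands in for the claim query described above:

```csharp
// Illustrative LISTEN-based wake loop: blocks without polling, treats each
// notification as a hint, and recovers due rows on startup.
using Npgsql;

public static async Task ListenLoopAsync(
    string connectionString,
    Func<CancellationToken, Task> claimDueSignals,
    CancellationToken ct)
{
    await using var conn = new NpgsqlConnection(connectionString);
    await conn.OpenAsync(ct);

    await using (var listen = new NpgsqlCommand("LISTEN workflow_signal", conn))
        await listen.ExecuteNonQueryAsync(ct);

    // Startup recovery: claim anything already due before blocking.
    await claimDueSignals(ct);

    while (!ct.IsCancellationRequested)
    {
        // Blocks this connection until a NOTIFY arrives - no steady-state loop.
        await conn.WaitAsync(ct);

        // The notification is only a wake hint; the durable rows are the truth.
        await claimDueSignals(ct);
    }
}
```

A missed notification is harmless here because the next wake, startup, or recovery event re-runs the same bounded claim attempt against the durable table.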

### Atomicity Model

PostgreSQL cannot rely on an external broker if we want the same atomicity guarantees.

The cleanest profile is:

- database state
- database projections
- database signal queue
- database schedule queue
- `NOTIFY` as a non-durable wake hint

That keeps the entire correctness boundary in PostgreSQL.

### Operational Notes

Need explicit handling for:

- orphan claimed rows after node crash
- reclaim timeout
- dead-letter browsing and replay
- table bloat and retention
- index strategy for due rows

### Suggested Components

- `PostgresWorkflowRuntimeStateStore`
- `PostgresWorkflowProjectionStore`
- `PostgresWorkflowSignalQueue`
- `PostgresWorkflowScheduleQueue`
- `PostgresWorkflowWakeListener`
- `PostgresWorkflowMutationCoordinator`

## MongoDB Profile

### Goal

Provide a backend profile that preserves engine semantics using MongoDB as the durable system of record.

### Recommended Shape

- runtime state in a `workflow_runtime_states` collection
- projections in dedicated collections
- durable `workflow_signals` collection
- durable `workflow_schedules` collection
- dead-letter collection
- change streams for wake-up hints

### Why Change Streams Are Not Enough

Change streams are a wake mechanism, not the durable queue itself.

The durable truth must remain in collections so the engine can recover after:

- service restart
- watcher restart
- temporary connectivity loss

### Document Model

Signal document fields should include:

- `_id`
- `workflowInstanceId`
- `runtimeProvider`
- `signalType`
- `waitingToken`
- `expectedVersion`
- `dueAtUtc`
- `status`
- `deliveryCount`
- `claimedBy`
- `claimedAtUtc`
- `deadLetterReason`
- `payload`

### Claim Strategy

Recommended model:

- atomically claim one available document with `findOneAndUpdate`
- filter by:
  - `status = Ready`
  - `dueAtUtc <= now`
  - not already claimed
- set:
  - `status = Claimed`
  - `claimedBy`
  - `claimedAtUtc`
- increment `deliveryCount`

Ack means:

- delete the signal or mark it completed

Abandon means:

- move back to `Ready`

Dead-letter means:

- move to the DLQ collection or set `status = DeadLetter`
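
The claim model above can be sketched with the official C# driver; `SignalDocument` is a hypothetical class mirroring the field list in this section:

```csharp
// Illustrative atomic claim via findOneAndUpdate: atomic per document, so two
// workers can never claim the same signal - the loser simply gets null.
using MongoDB.Driver;

public static Task<SignalDocument> ClaimOneAsync(
    IMongoCollection<SignalDocument> signals,
    string workerId,
    CancellationToken ct = default)
{
    var now = DateTime.UtcNow;

    var filter = Builders<SignalDocument>.Filter.And(
        Builders<SignalDocument>.Filter.Eq(s => s.Status, "Ready"),
        Builders<SignalDocument>.Filter.Lte(s => s.DueAtUtc, now),
        Builders<SignalDocument>.Filter.Eq(s => s.ClaimedBy, null));

    var update = Builders<SignalDocument>.Update
        .Set(s => s.Status, "Claimed")
        .Set(s => s.ClaimedBy, workerId)
        .Set(s => s.ClaimedAtUtc, now)
        .Inc(s => s.DeliveryCount, 1);

    // Returns the claimed document, or null when nothing is due.
    return signals.FindOneAndUpdateAsync(
        filter,
        update,
        new FindOneAndUpdateOptions<SignalDocument> { ReturnDocument = ReturnDocument.After },
        ct);
}
```

Ack, abandon, and dead-letter are then ordinary status transitions (or a delete) on the claimed document.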

### Schedule Strategy

Two reasonable models exist.

#### Model A: Separate Schedule Collection

- keep delayed signals in `workflow_schedules`
- promote due documents into `workflow_signals`
- wake workers through change streams

This is simpler conceptually but adds one extra movement step.

#### Model B: Unified Signal Collection

- store all signals in one collection
- use `dueAtUtc` and `status`
- workers claim only due documents

This is the better v1 choice because it keeps one signal envelope pipeline.

### Atomicity Model

MongoDB can support multi-document transactions in replica-set mode.

That means the preferred model is:

- runtime state
- projections
- signal collection writes
- schedule writes

all inside one MongoDB transaction.

If that operational assumption is unacceptable, then MongoDB is not a correctness-grade replacement for the Oracle profile and should not be offered as a production engine backend.

### Wake Model

Use change streams to avoid steady-state polling:

- watch inserts and state transitions for ready or due signals
- on startup, run a bounded recovery sweep for unclaimed ready signals
- on worker restart, resume from durable signal documents, not from missed change stream events
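
A sketch of that wake model with the C# driver; `claimReadySignals` stands in for the bounded recovery sweep and claim path, and the pipeline shape is an assumption:

```csharp
// Illustrative change-stream wake watcher: stream events are hints,
// durable signal documents remain the source of truth.
using System.Linq;
using MongoDB.Driver;

public static async Task WatchForWakeAsync(
    IMongoCollection<SignalDocument> signals,
    Func<CancellationToken, Task> claimReadySignals,
    CancellationToken ct)
{
    // Only inserts and updates can make a signal newly claimable.
    var pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<SignalDocument>>()
        .Match(c => c.OperationType == ChangeStreamOperationType.Insert
                 || c.OperationType == ChangeStreamOperationType.Update);

    // Startup recovery: the stream only wakes us for future changes.
    await claimReadySignals(ct);

    using var cursor = await signals.WatchAsync(pipeline, cancellationToken: ct);
    while (await cursor.MoveNextAsync(ct))
    {
        if (cursor.Current.Any())
            await claimReadySignals(ct);
    }
}
```

Persisting the stream's resume token (noted in the operational section below) lets a restarted watcher avoid both gaps and a full re-sweep.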

### Operational Notes

Need explicit handling for:

- resume token persistence for observers
- claimed-document recovery after node failure
- shard-key implications if sharding is introduced later
- transactional prerequisites in local and CI test environments

### Suggested Components

- `MongoWorkflowRuntimeStateStore`
- `MongoWorkflowProjectionStore`
- `MongoWorkflowSignalStore`
- `MongoWorkflowWakeStreamListener`
- `MongoWorkflowMutationCoordinator`

## Backend Selection Model

The engine should not expose dozens of independent switches in appsettings.

Use one backend profile section plus internal composition.

Recommended shape:

```json
{
  "WorkflowEngine": {
    "BackendProfile": "Oracle"
  }
}
```

And then backend-specific option sections:

```json
{
  "WorkflowBackend:Oracle": {
    "ConnectionString": "...",
    "QueueOwner": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "DeadLetterQueueName": "WF_SIGNAL_DLQ"
  },
  "WorkflowBackend:PostgreSql": {
    "ConnectionString": "...",
    "SignalTable": "workflow_signals",
    "ScheduleTable": "workflow_schedules",
    "DeadLetterTable": "workflow_signal_dead_letters",
    "NotificationChannel": "workflow_signal"
  },
  "WorkflowBackend:MongoDb": {
    "ConnectionString": "...",
    "DatabaseName": "serdica_workflow",
    "SignalCollection": "workflow_signals",
    "RuntimeStateCollection": "workflow_runtime_states",
    "ProjectionPrefix": "workflow"
  }
}
```

The DI layer should map `BackendProfile` to one complete backend package, not a mix-and-match set of partial adapters.

That avoids unsupported combinations like:

- Oracle state + Mongo signals
- PostgreSQL state + Redis schedule

unless they are designed explicitly as a later profile.
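
A sketch of that DI mapping; the `Add*WorkflowBackend` extension methods are hypothetical names for each backend plugin's registration entry point:

```csharp
// Illustrative host composition: one profile key selects one complete
// backend package, and anything else fails fast at startup.
public static class WorkflowBackendServiceCollectionExtensions
{
    public static IServiceCollection AddWorkflowBackend(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        var profile = configuration["WorkflowEngine:BackendProfile"];

        return profile switch
        {
            "Oracle" => services.AddOracleWorkflowBackend(
                configuration.GetSection("WorkflowBackend:Oracle")),
            "PostgreSql" => services.AddPostgresWorkflowBackend(
                configuration.GetSection("WorkflowBackend:PostgreSql")),
            "MongoDb" => services.AddMongoWorkflowBackend(
                configuration.GetSection("WorkflowBackend:MongoDb")),
            _ => throw new InvalidOperationException(
                $"Unknown workflow backend profile '{profile}'.")
        };
    }
}
```

Because each branch registers a full package, a partial mix such as Oracle state with Mongo signals simply has no configuration path that can produce it.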

## Implementation Refactor Needed

To make the backend switch clean, the current Oracle-first host should be refactored in this order.

### Phase 1: Split Projection Persistence

Refactor the current projection application service into:

- projection application service
- backend-neutral projection contract
- Oracle implementation

Then add backend implementations later without changing the application service.

### Phase 2: Introduce Dedicated Backend Plugin Registration

Add:

```csharp
public interface IWorkflowBackendRegistrationMarker
{
    string BackendName { get; }
}
```

Then create dedicated backend plugins for:

- Oracle
- PostgreSQL
- MongoDB

The host should remain backend-neutral and validate that the selected backend plugin has registered itself.
Each backend plugin should own registration of:

- runtime state store
- projection store
- mutation coordinator
- signal bus
- schedule bus
- dead-letter store
- backend-specific options and wake-up strategy

### Phase 3: Move Transaction Logic Into Backend Coordinator

Refactor the current workflow mutation transaction scope so the runtime service no longer knows whether the backend uses:

- a direct database transaction
- a database transaction plus outbox
- a document transaction

The runtime service should only ask for one mutation boundary.

### Phase 4: Normalize Dead-Letter Model

Standardize a backend-neutral dead-letter record so the operational endpoints do not care which backend stores it.

That includes:

- signal id
- workflow instance id
- signal type
- first failure time
- last failure time
- delivery count
- last error
- payload snapshot

### Phase 5: Introduce Backend Conformance Tests

Every backend must pass the same contract suite:

- state insert/update/version conflict
- task activation and completion
- timer due resume
- external signal resume
- subworkflow completion resume
- duplicate delivery safety
- restart recovery
- dead-letter move and replay
- retention and purge

Oracle should remain the first backend to pass the full suite.

PostgreSQL and MongoDB are not ready until they pass the same suite.

## Backend-Specific Risks

## PostgreSQL Risks

- row-level queue claim logic can create hot indexes under high throughput
- `LISTEN/NOTIFY` payloads are not durable
- reclaim and retry logic must be designed carefully to avoid stuck claimed rows
- due-row access patterns must be tuned with indexes and partitioning if volume grows

## MongoDB Risks

- production-grade correctness depends on replica-set transactions
- change streams add operational requirements and resume-token handling
- projection queries may become more complex if the read model is heavily relational today
- collection growth and retention strategy must be explicit early

## Oracle Risks

- Oracle remains the strongest correctness model but the least portable implementation
- engine logic must not drift toward AQ-only assumptions that other backends cannot model

## Recommended Rollout Order

Do not build PostgreSQL and MongoDB in parallel first.

Use this order:

1. stabilize Oracle as the contract baseline
2. refactor the host into a true backend-plugin model
3. implement the PostgreSQL profile
4. pass the full backend conformance suite on PostgreSQL
5. implement the MongoDB profile only if there is a real product need for MongoDB as the system of record

PostgreSQL should come before MongoDB because:

- its runtime-state and projection model are closer to the current Oracle design
- its transaction semantics fit the engine more naturally
- the read-side model is already relational

## Validation Order After Functional Backend Completion

Functional backend completion is not the same as backend readiness.

After a backend can start, resume, signal, schedule, and retain workflows, the next required order is:

1. backend-neutral hostile-condition coverage
2. curated Bulstrad parity coverage
3. backend-neutral performance tiers
4. backend-specific baseline publication
5. final three-backend comparison

This means:

- PostgreSQL is not done when its basic stores and buses compile; it must also match the Oracle hostile-condition and Bulstrad suites
- MongoDB is not done when replica-set transactions and signal delivery work; it must also match the same parity and performance suites
- the final adoption decision should be based on the shared comparison pack, not on isolated backend microbenchmarks

## Proposed Sprint

## Sprint 14: Backend Portability And Store Profiles

### Goal

Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.

### Scope

- introduce the backend profile abstraction and dedicated backend plugin registration
- split projection persistence from the current Oracle-first application service
- formalize the mutation coordinator abstraction
- add a backend-neutral dead-letter contract
- define and implement the backend conformance suite
- implement the PostgreSQL profile
- design the MongoDB profile in executable detail, with implementation only after explicit product approval

### Deliverables

- `IWorkflowBackendRegistrationMarker`
- backend-neutral projection contract
- backend-neutral mutation coordinator contract
- backend conformance test suite
- dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
- architecture-ready MongoDB backend plugin design package

### Exit Criteria

- the host selects one backend profile by configuration
- the host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
- Oracle and PostgreSQL pass the same conformance suite
- the MongoDB path is specified well enough that implementation is a bounded engineering task
- workflow declarations and canonical definitions remain unchanged across backend profiles

## Final Rule

Backend switching is an infrastructure concern, not a workflow concern.

If a future backend requires changing workflow declarations, canonical definitions, or engine semantics, that backend does not fit the architecture and should not be adopted without a new ADR.