# 09. Backend Portability Plan

## Purpose

This document defines how `SerdicaEngine` should evolve from an Oracle-first runtime into a backend-switchable engine that can also run on PostgreSQL and MongoDB without changing workflow declarations, canonical definitions, or runtime semantics.

The goal is not to support every backend in the same way internally. The goal is to preserve one stable engine contract:

- the same declarative workflow classes
- the same canonical runtime definitions
- the same public workflow/task APIs
- the same runtime behavior around tasks, waits, timers, external signals, subworkflows, retries, and retention

Backend switching must only change infrastructure adapters and host configuration.

## Current Baseline

Today the strongest backend shape is Oracle:

- runtime state persists in an Oracle-backed runtime-state adapter
- projections persist in an Oracle-backed projection adapter
- immediate signaling and delayed scheduling run through Oracle AQ adapters
- the engine host composes those adapters through backend registration

Oracle is the reference implementation because it already gives:

- one durable database
- durable queueing
- delayed delivery
- blocking dequeue without polling
- transactional coupling between state mutation and queue enqueue

That reference point matters because PostgreSQL and MongoDB must match the engine contract even if they reach it through different infrastructure mechanisms.

## Non-Negotiable Product Rules

Backend portability must not break these rules:

1. Authored workflow classes remain pure declaration classes.
2. Canonical runtime definitions remain backend-agnostic.
3. Engine execution remains run-to-wait.
4. Multi-instance deployment remains supported.
5. Steady-state signal and timer discovery must not rely on polling loops.
6. Signal delivery remains at-least-once.
7. Resume remains idempotent through version and waiting-token checks.
8. Public API contracts and projections remain stable.
9. Operational features remain available:
   - signal raise
   - dead-letter inspection
   - dead-letter replay
   - runtime inspection
   - retention
   - diagram inspection

## Architecture Principle

Do not make the engine "database-agnostic" by hiding everything behind one giant repository. That approach will collapse important guarantees.

Instead, separate the backend into explicit capabilities:

1. runtime state persistence
2. projection persistence
3. signal transport
4. schedule transport
5. mutation transaction boundary
6. wake-up notification strategy
7. lease or concurrency strategy
8. dead-letter and replay strategy
9. retention and purge strategy

Each backend implementation must satisfy the full capability matrix.

## Implemented Signal Driver Split

The engine now separates durable signal ownership from wake-up delivery. The shared seam is defined by engine signal-driver abstractions plus signal and schedule bridge contracts. That split exists to preserve transactional correctness while still allowing faster wake strategies later.

The separation is:

- `IWorkflowSignalStore`: durable immediate signal persistence
- `IWorkflowSignalDriver`: wake-up and claim path for available signals
- `IWorkflowSignalScheduler`: durable delayed-signal persistence
- `IWorkflowWakeOutbox`: deferred wake publication when the driver is not transaction-coupled to the durable store

The public engine surface still uses:

- `IWorkflowSignalBus`
- `IWorkflowScheduleBus`

Those are now bridge contracts. They do not define backend mechanics directly.
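A minimal sketch of how these seam contracts might look. The interface names are the ones listed above; the method shapes and the `WorkflowSignalEnvelope` record are illustrative assumptions, not the engine's actual signatures:

```csharp
// Illustrative only: interface names come from the split above; the
// method shapes and WorkflowSignalEnvelope are assumptions.
public sealed record WorkflowSignalEnvelope(
    string SignalId,
    string WorkflowInstanceId,
    string SignalType,
    string Payload);

public interface IWorkflowSignalStore
{
    // Persist an immediate signal inside the backend mutation boundary.
    Task PersistAsync(WorkflowSignalEnvelope signal, CancellationToken ct = default);
}

public interface IWorkflowSignalDriver
{
    // Block until a signal may be available, then claim it from the durable store.
    Task<WorkflowSignalEnvelope?> WaitAndClaimAsync(CancellationToken ct = default);
}

public interface IWorkflowSignalScheduler
{
    // Persist a delayed signal that becomes claimable at or after its due time.
    Task ScheduleAsync(WorkflowSignalEnvelope signal, DateTime dueAtUtc, CancellationToken ct = default);
}

public interface IWorkflowWakeOutbox
{
    // Record a wake hint for deferred publication when the driver is not
    // transaction-coupled to the durable store.
    Task EnqueueWakeHintAsync(string workflowInstanceId, CancellationToken ct = default);
}
```

The point of the sketch is the ownership split: the store and scheduler are durable and transactional, while the driver only wakes and claims.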
### Current Backend Matrix

| Backend profile | Durable signal store | Wake driver | Schedule store | Dispatch mode |
|-----------------|----------------------|-------------|----------------|---------------|
| Oracle | Oracle AQ signal adapter | Oracle AQ blocking dequeue | Oracle AQ schedule adapter | `NativeTransactional` |
| PostgreSQL | PostgreSQL durable signal store | PostgreSQL native wake or claim adapter | PostgreSQL durable schedule store | `NativeTransactional` |
| MongoDB | MongoDB durable signal store | MongoDB change-stream wake or claim adapter | MongoDB durable schedule store | `NativeTransactional` |

### Implemented Optional Redis Wake Driver

The Redis driver is implemented as a separate wake-driver plugin. Its shape is intentionally narrow:

- Oracle, PostgreSQL, and MongoDB remain the durable signal stores.
- Oracle, PostgreSQL, and MongoDB persist durable signals transactionally.
- Redis receives wake hints directly after commit through the mutation scope post-commit hook.
- Workers wake through Redis and then claim from the durable backend store.

Oracle is now supported in this combination, but it is not the preferred Oracle profile. Oracle native AQ wake remains the default because it is slightly faster in the current measurements and keeps the cleanest native timer and dequeue path. Redis on Oracle exists for topology consistency, not because Oracle needs Redis for correctness or because it is the current fastest Oracle path.

### Redis Driver Rules

Redis must remain a wake-driver plugin, not the authoritative durable signal queue for mixed backends. The intended shape is:

- Oracle, PostgreSQL, or MongoDB remains the durable `IWorkflowSignalStore`
- Redis becomes an `IWorkflowSignalDriver`
- the Redis wake hint is published directly after the durable store transaction commits
- backend-native wake drivers are not active when Redis is selected

That preserves the required correctness model:

1. persist runtime state, projections, and the durable signal inside the backend mutation boundary
2. commit the mutation boundary
3. publish the Redis wake hint from the registered post-commit action
4. wake workers and claim from the durable backend store

`IWorkflowWakeOutbox` remains in the abstraction set for future non-Redis wake drivers that may still need deferred publication, but it is not the active Redis hot path.

Redis may improve signal-to-resume latency, especially for PostgreSQL and MongoDB where the durable store and the wake path are already split cleanly. Redis must not become the correctness layer unless the whole durable signal model also moves there, which is not the design target of this engine.

## Required Capability Matrix

Every engine backend profile must define concrete answers for the following:

| Capability | Oracle | PostgreSQL | MongoDB |
|-----------|--------|------------|---------|
| Runtime state durability | Native | Required | Required |
| Projection durability | Native | Required | Required |
| Optimistic concurrency | Row/version | Row/version | Document version |
| Immediate signal durability | AQ | Queue table or queue extension | Signal collection |
| Delayed scheduling | AQ delayed delivery | Durable due-message table | Durable due-message collection |
| Blocking wake-up | AQ dequeue | `LISTEN/NOTIFY`, Redis wake driver, or dedicated queue worker | Change streams or Redis wake driver |
| Atomic state + signal publish | Native DB transaction | Outbox transaction | Transactional outbox or equivalent |
| Dead-letter support | AQ + table | Queue/DLQ table | DLQ collection |
| Multi-node safety | DB + AQ | DB + wake hints | DB + change stream / wake hints |
| Restart recovery | Native | Required | Required |

The backend is not complete until every row has a real implementation.

## Engine Backend Layers

The switchable backend model should be built around these interfaces.

### 1. Runtime State Store

Responsible for:

- loading a runtime snapshot by workflow instance id
- inserting a new snapshot
- updating a snapshot with an expected version
- querying runtime status for operational needs
- storing engine-specific snapshot JSON

Target interface shape:

```csharp
public interface IWorkflowRuntimeStateStore
{
    Task<WorkflowRuntimeStateRecord?> GetAsync(
        string workflowInstanceId,
        CancellationToken ct = default);

    Task InsertAsync(
        WorkflowRuntimeStateRecord record,
        CancellationToken ct = default);

    Task UpdateAsync(
        WorkflowRuntimeStateRecord record,
        long expectedVersion,
        CancellationToken ct = default);
}
```

Notes:

- Oracle and PostgreSQL should use explicit version columns.
- MongoDB should use a document version field and a conditional update filter.
- This store must not also own signal publishing logic.

### 2. Projection Store

Responsible for:

- workflow instance summaries
- task summaries
- task event history
- business reference lookup
- supporting read APIs

The projection model is product-facing and must remain stable. That means:

- the shape of projection records must not depend on the backend
- only the persistence adapter may change

Target direction:

- split the current projection application service into a backend-neutral application service plus backend adapters
- keep one projection contract
- allow Oracle and PostgreSQL to stay relational
- allow MongoDB to project into document collections if needed

### 3. Signal Bus

Responsible for durable immediate signals:

- internal continue
- external signal
- task completion continuation
- subworkflow completion
- replay from dead-letter

The current contract already exists in the engine runtime abstractions.

Required guarantees:

- at-least-once delivery
- ack only after successful processing
- delivery count visibility
- explicit abandon
- explicit dead-letter move
- replay support

### 4. Schedule Bus

Responsible for durable delayed delivery:

- timer due
- retry due
- delayed continuation

Required guarantees:

- a message is not lost across process restart
- a message becomes visible at or after its due time
- stale due messages are safely ignored through waiting tokens
- schedule and immediate signal semantics use the same envelope model

### 5. Mutation Transaction Boundary

This is the most important portability seam. The engine mutates three things together:

- runtime state
- projections
- signals or schedules

Oracle can do that in one database transaction because state, projections, and AQ live inside the same durable boundary. PostgreSQL and MongoDB may require an outbox-based boundary instead.

This must be explicit:

```csharp
public interface IWorkflowMutationCoordinator
{
    Task ExecuteAsync(
        Func<IWorkflowMutationContext, Task> action,
        CancellationToken ct = default);
}
```

Where the mutation context exposes:

- the runtime state adapter
- the projection adapter
- a signal outbox writer
- a schedule outbox writer

Do not let the runtime service hand-roll transaction logic per backend.

### 6. Wake-Up Notifier

The engine must not scan due rows in a steady loop. That means every backend needs a wake-up channel:

- Oracle: AQ blocking dequeue
- PostgreSQL: `LISTEN/NOTIFY` as a wake hint for durable queue tables
- MongoDB: change streams as a wake hint for durable signal collections

The wake-up channel is not the durable source of truth except in Oracle AQ. It is only the wake mechanism. That distinction is mandatory for PostgreSQL and MongoDB.

## Backend Profiles

## Oracle Profile

### Role

Oracle remains the reference backend profile and the operational default.
### Storage Model

- runtime state table
- relational projection tables
- AQ signal queue
- AQ schedule queue or delayed signal queue
- DLQ table and AQ-assisted replay

### Commit Model

- one transaction for runtime state, projections, and AQ enqueue

### Wake Model

- AQ blocking dequeue

### Advantages

- strongest correctness story
- simplest atomic mutation model
- no extra wake layer required

### Risks

- Oracle-specific infrastructure coupling
- AQ operational expertise required
- portability work must not assume AQ-only features in engine logic

Oracle should be treated as the semantic gold standard that other backends must match.

## PostgreSQL Profile

### Goal

Provide a backend profile that preserves engine semantics using PostgreSQL as the durable system of record.

### Recommended Shape

- runtime state in PostgreSQL tables
- projections in PostgreSQL tables
- durable signal queue table
- durable schedule queue table
- DLQ table
- `LISTEN/NOTIFY` for wake-up hints only

### Why Not `LISTEN/NOTIFY` Alone

`LISTEN/NOTIFY` is not sufficient as the durable signal layer because notifications are ephemeral. The durable truth must stay in tables.

The recommended model is:

1. insert the durable signal row in the same transaction as the state/projection mutation
2. emit `NOTIFY` before commit or immediately after the durable insert
3. workers wake up and claim rows from the signal queue table
4. if a notification is missed, the next notification or startup recovery still finds the rows

### Queue Claim Strategy

Recommended queue-claim pattern:

- `FOR UPDATE SKIP LOCKED`
- ordered by available time, priority, and creation time
- delivery count increment on claim
- explicit ack by state transition or delete
- explicit dead-letter move after the delivery limit

### Schedule Strategy

Recommended schedule table:

- `signal_id`
- `available_at_utc`
- `workflow_instance_id`
- `runtime_provider`
- `signal_type`
- serialized payload
- delivery count
- dead-letter metadata

Recommended wake-up path:

- durable insert into the schedule table
- `NOTIFY workflow_signal`
- workers wake and attempt to claim rows with `available_at_utc <= now()`

This is still not "polling" if workers block on `LISTEN` and only do bounded claim attempts on wake-up, startup, and recovery events.

### Atomicity Model

PostgreSQL cannot rely on an external broker if we want the same atomicity guarantees. The cleanest profile is:

- database state
- database projections
- database signal queue
- database schedule queue
- `NOTIFY` as a non-durable wake hint

That keeps the entire correctness boundary in PostgreSQL.

### Operational Notes

Need explicit handling for:

- orphaned claimed rows after a node crash
- reclaim timeout
- dead-letter browsing and replay
- table bloat and retention
- index strategy for due rows

### Suggested Components

- `PostgresWorkflowRuntimeStateStore`
- `PostgresWorkflowProjectionStore`
- `PostgresWorkflowSignalQueue`
- `PostgresWorkflowScheduleQueue`
- `PostgresWorkflowWakeListener`
- `PostgresWorkflowMutationCoordinator`

## MongoDB Profile

### Goal

Provide a backend profile that preserves engine semantics using MongoDB as the durable system of record.
### Recommended Shape

- runtime state in a `workflow_runtime_states` collection
- projections in dedicated collections
- a durable `workflow_signals` collection
- a durable `workflow_schedules` collection
- a dead-letter collection
- change streams for wake-up hints

### Why Change Streams Are Not Enough

Change streams are a wake mechanism, not the durable queue itself. The durable truth must remain in collections so the engine can recover after:

- service restart
- watcher restart
- temporary connectivity loss

### Document Model

Signal document fields should include:

- `_id`
- `workflowInstanceId`
- `runtimeProvider`
- `signalType`
- `waitingToken`
- `expectedVersion`
- `dueAtUtc`
- `status`
- `deliveryCount`
- `claimedBy`
- `claimedAtUtc`
- `deadLetterReason`
- `payload`

### Claim Strategy

Recommended model:

- atomically claim one available document with `findOneAndUpdate`
- filter by:
  - `status = Ready`
  - `dueAtUtc <= now`
  - not already claimed
- set:
  - `status = Claimed`
  - `claimedBy`
  - `claimedAtUtc`
- increment `deliveryCount`

Ack means:

- delete the signal or mark it completed

Abandon means:

- move the signal back to `Ready`

Dead-letter means:

- move the signal to the DLQ collection or set `status = DeadLetter`

### Schedule Strategy

Two reasonable models exist.

#### Model A: Separate Schedule Collection

- keep delayed signals in `workflow_schedules`
- promote due documents into `workflow_signals`
- wake workers through change streams

This is conceptually simpler but adds one extra movement step.

#### Model B: Unified Signal Collection

- store all signals in one collection
- use `dueAtUtc` and `status`
- workers claim only due documents

This is the better v1 choice because it keeps one signal envelope pipeline.

### Atomicity Model

MongoDB can support multi-document transactions in replica-set mode. That means the preferred model is:

- runtime state
- projections
- signal collection writes
- schedule writes

all inside one MongoDB transaction.
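Assuming the official MongoDB .NET driver, an `IMongoCollection<BsonDocument>` named `signals`, and a hypothetical `workerId` node identity, the claim step might look like this sketch:

```csharp
// Sketch of the atomic claim described above; field names follow the
// document model, everything else is illustrative.
var now = DateTime.UtcNow;

var filter = Builders<BsonDocument>.Filter.And(
    Builders<BsonDocument>.Filter.Eq("status", "Ready"),
    Builders<BsonDocument>.Filter.Lte("dueAtUtc", now),
    Builders<BsonDocument>.Filter.Eq("claimedBy", BsonNull.Value));

var update = Builders<BsonDocument>.Update
    .Set("status", "Claimed")
    .Set("claimedBy", workerId)
    .Set("claimedAtUtc", now)
    .Inc("deliveryCount", 1);

// FindOneAndUpdateAsync claims at most one available document atomically;
// a null result means nothing is currently due.
var claimed = await signals.FindOneAndUpdateAsync(
    filter,
    update,
    new FindOneAndUpdateOptions<BsonDocument> { ReturnDocument = ReturnDocument.After });
```

Because `findOneAndUpdate` is atomic per document, two workers can never claim the same signal, which is exactly the multi-node safety the capability matrix requires.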
If that operational assumption is unacceptable, then MongoDB is not a correctness-grade replacement for the Oracle profile and should not be offered as a production engine backend.

### Wake Model

Use change streams to avoid steady-state polling:

- watch inserts and state transitions for ready or due signals
- on startup, run a bounded recovery sweep for unclaimed ready signals
- on worker restart, resume from durable signal documents, not from missed change-stream events

### Operational Notes

Need explicit handling for:

- resume-token persistence for observers
- claimed-document recovery after node failure
- shard-key implications if sharding is introduced later
- transactional prerequisites in local and CI test environments

### Suggested Components

- `MongoWorkflowRuntimeStateStore`
- `MongoWorkflowProjectionStore`
- `MongoWorkflowSignalStore`
- `MongoWorkflowWakeStreamListener`
- `MongoWorkflowMutationCoordinator`

## Backend Selection Model

The engine should not expose dozens of independent switches in appsettings. Use one backend-profile section plus internal composition.

Recommended shape:

```json
{
  "WorkflowEngine": {
    "BackendProfile": "Oracle"
  }
}
```

And then backend-specific option sections:

```json
{
  "WorkflowBackend:Oracle": {
    "ConnectionString": "...",
    "QueueOwner": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "DeadLetterQueueName": "WF_SIGNAL_DLQ"
  },
  "WorkflowBackend:PostgreSql": {
    "ConnectionString": "...",
    "SignalTable": "workflow_signals",
    "ScheduleTable": "workflow_schedules",
    "DeadLetterTable": "workflow_signal_dead_letters",
    "NotificationChannel": "workflow_signal"
  },
  "WorkflowBackend:MongoDb": {
    "ConnectionString": "...",
    "DatabaseName": "serdica_workflow",
    "SignalCollection": "workflow_signals",
    "RuntimeStateCollection": "workflow_runtime_states",
    "ProjectionPrefix": "workflow"
  }
}
```

The DI layer should map `BackendProfile` to one complete backend package, not a mix-and-match set of partial adapters. That avoids unsupported combinations like:

- Oracle state + Mongo signals
- PostgreSQL state + Redis schedule

unless they are designed explicitly as a later profile.

## Implementation Refactor Needed

To make the backend switch clean, the current Oracle-first host should be refactored in this order.

### Phase 1: Split Projection Persistence

Refactor the current projection application service into:

- a projection application service
- a backend-neutral projection contract
- an Oracle implementation

Then add backend implementations later without changing the application service.

### Phase 2: Introduce Dedicated Backend Plugin Registration

Add:

```csharp
public interface IWorkflowBackendRegistrationMarker
{
    string BackendName { get; }
}
```

Then create dedicated backend plugins for:

- Oracle
- PostgreSQL
- MongoDB

The host should remain backend-neutral and validate that the selected backend plugin has registered itself.

Each backend plugin should own registration of:

- the runtime state store
- the projection store
- the mutation coordinator
- the signal bus
- the schedule bus
- the dead-letter store
- backend-specific options and the wake-up strategy

### Phase 3: Move Transaction Logic Into Backend Coordinator

Refactor the current workflow mutation transaction scope so the runtime service no longer knows whether the backend uses:

- a direct database transaction
- a database transaction plus outbox
- a document transaction

The runtime service should only ask for one mutation boundary.

### Phase 4: Normalize Dead-Letter Model

Standardize a backend-neutral dead-letter record so the operational endpoints do not care which backend stores it.
That includes:

- signal id
- workflow instance id
- signal type
- first failure time
- last failure time
- delivery count
- last error
- payload snapshot

### Phase 5: Introduce Backend Conformance Tests

Every backend must pass the same contract suite:

- state insert/update/version conflict
- task activation and completion
- timer due resume
- external signal resume
- subworkflow completion resume
- duplicate delivery safety
- restart recovery
- dead-letter move and replay
- retention and purge

Oracle should remain the first backend to pass the full suite. PostgreSQL and MongoDB are not ready until they pass the same suite.

## Backend-Specific Risks

## PostgreSQL Risks

- row-level queue claim logic can create hot indexes under high throughput
- `LISTEN/NOTIFY` payloads are not durable
- reclaim and retry logic must be designed carefully to avoid stuck claimed rows
- due-row access patterns must be tuned with indexes and partitioning if volume grows

## MongoDB Risks

- production-grade correctness depends on replica-set transactions
- change streams add operational requirements and resume-token handling
- projection queries may become more complex if the read model is heavily relational today
- collection growth and retention strategy must be explicit early

## Oracle Risks

- Oracle remains the strongest correctness model but the least portable implementation
- engine logic must not drift toward AQ-only assumptions that other backends cannot model

## Recommended Rollout Order

Do not build PostgreSQL and MongoDB in parallel first. Use this order:

1. stabilize Oracle as the contract baseline
2. refactor the host into a true backend-plugin model
3. implement the PostgreSQL profile
4. pass the full backend conformance suite on PostgreSQL
5. implement the MongoDB profile only if there is a real product need for MongoDB as the system of record

PostgreSQL should come before MongoDB because:

- its runtime-state and projection model are closer to the current Oracle design
- its transaction semantics fit the engine more naturally
- the read-side model is already relational

## Validation Order After Functional Backend Completion

Functional backend completion is not the same as backend readiness. After a backend can start, resume, signal, schedule, and retain workflows, the next required order is:

1. backend-neutral hostile-condition coverage
2. curated Bulstrad parity coverage
3. backend-neutral performance tiers
4. backend-specific baseline publication
5. final three-backend comparison

This means:

- PostgreSQL is not done when its basic stores and buses compile; it must also match the Oracle hostile-condition and Bulstrad suites
- MongoDB is not done when replica-set transactions and signal delivery work; it must also match the same parity and performance suites
- the final adoption decision should be based on the shared comparison pack, not on isolated backend microbenchmarks

## Proposed Sprint

## Sprint 14: Backend Portability And Store Profiles

### Goal

Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.
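A hedged sketch of the profile selection this goal implies. Only the `WorkflowEngine:BackendProfile` key and `IWorkflowBackendRegistrationMarker` come from this document; the per-backend registration extensions are hypothetical names:

```csharp
// Illustrative composition: one configured profile maps to one complete
// backend plugin; the AddXxxWorkflowBackend extensions are hypothetical.
var profile = configuration["WorkflowEngine:BackendProfile"];

switch (profile)
{
    case "Oracle":
        services.AddOracleWorkflowBackend(configuration);
        break;
    case "PostgreSql":
        services.AddPostgresWorkflowBackend(configuration);
        break;
    case "MongoDb":
        services.AddMongoWorkflowBackend(configuration);
        break;
    default:
        throw new InvalidOperationException($"Unknown backend profile '{profile}'.");
}

// After registration, the host can validate that exactly one
// IWorkflowBackendRegistrationMarker matching the profile is present.
```

The design point is that the switch selects a whole plugin, never individual adapters, which is what rules out unsupported mix-and-match combinations.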
### Scope

- introduce backend profile abstraction and dedicated backend plugin registration
- split projection persistence from the current Oracle-first application service
- formalize the mutation coordinator abstraction
- add a backend-neutral dead-letter contract
- define and implement the backend conformance suite
- implement the PostgreSQL profile
- design the MongoDB profile in executable detail, with implementation only after explicit product approval

### Deliverables

- `IWorkflowBackendRegistrationMarker`
- backend-neutral projection contract
- backend-neutral mutation coordinator contract
- backend conformance test suite
- dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
- architecture-ready MongoDB backend plugin design package

### Exit Criteria

- host selects one backend profile by configuration
- host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
- Oracle and PostgreSQL pass the same conformance suite
- MongoDB path is specified well enough that implementation is a bounded engineering task
- workflow declarations and canonical definitions remain unchanged across backend profiles

## Final Rule

Backend switching is an infrastructure concern, not a workflow concern.

If a future backend requires changing workflow declarations, canonical definitions, or engine semantics, that backend does not fit the architecture and should not be adopted without a new ADR.