
# 07. Sprint Plan

## Planning Assumptions

- sprint length: 2 weeks
- one team owning runtime, persistence, and service integration
- Oracle AQ available
- no concurrent-engine migration scope
- acceptance means code, tests, and updated docs

## Sprint 1: Foundations And Contracts

### Goal

Create the engine skeleton and the stable interfaces.

### Scope

- add runtime provider abstraction
- add signal bus abstraction
- add schedule bus abstraction
- add runtime snapshot abstraction
- add engine option classes
- add `docs/engine/` package

### Deliverables

- interface set compiled into shared abstractions
- configuration classes
- initial DI composition path
- unit tests for options and registration

### Exit Criteria

- service builds with engine abstractions present
- no Elsa runtime assumptions are introduced into new code
- docs and interface names are stable enough for later sprints

## Sprint 2: Canonical Runtime Definition Store

### Goal

Make canonical execution definitions available at runtime without Elsa.

### Scope

- compile authored workflows to canonical runtime definitions at startup
- validate definitions during startup
- cache runtime definitions
- expose startup failure mode for invalid definitions

### Deliverables

- `WorkflowRuntimeDefinitionStore`
- definition normalization pipeline
- startup validator
- tests covering:
  - valid definition load
  - invalid definition rejection
  - version resolution

### Exit Criteria

- all registered workflows load into runtime definition cache
- the runtime can resolve a definition by name and version
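
The load, validate, and resolve behaviour described for this sprint can be sketched as follows. This is a minimal illustration, not the actual `WorkflowRuntimeDefinitionStore`: the class names, the `steps` field, and the rule that an omitted version means "latest registered" are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RuntimeDefinition:
    # Hypothetical shape; a real canonical definition would carry far more.
    name: str
    version: int
    steps: Tuple[str, ...]


class InvalidDefinitionError(Exception):
    """Raised at startup so the host fails fast on a bad definition."""


class DefinitionStore:
    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, int], RuntimeDefinition] = {}
        self._latest: Dict[str, int] = {}

    def register(self, definition: RuntimeDefinition) -> None:
        # Startup validation: reject empty and duplicate definitions.
        if not definition.steps:
            raise InvalidDefinitionError(
                f"{definition.name} v{definition.version} has no steps"
            )
        key = (definition.name, definition.version)
        if key in self._by_key:
            raise InvalidDefinitionError(f"duplicate definition {key}")
        self._by_key[key] = definition
        self._latest[definition.name] = max(
            self._latest.get(definition.name, 0), definition.version
        )

    def resolve(self, name: str, version: Optional[int] = None) -> RuntimeDefinition:
        # Assumption: omitting the version resolves the latest registered one.
        if version is None:
            version = self._latest[name]
        return self._by_key[(name, version)]
```

Raising during registration keeps invalid definitions out of the cache entirely, so later lookups never have to re-validate.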

## Sprint 3: Snapshot Store And Versioned Runtime State

### Goal

Turn `WF_RUNTIME_STATES` into a first-class engine snapshot store.

### Scope

- extend runtime state schema
- implement snapshot mapper
- implement optimistic concurrency versioning
- wire snapshot reads and writes

### Deliverables

- database migration scripts
- `OracleWorkflowRuntimeSnapshotStore`
- snapshot serialization contracts
- tests for:
  - initial insert
  - update with expected version
  - stale version conflict

### Exit Criteria

- runtime snapshots can be loaded and committed with version control
- stale updates are rejected safely
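
The three test cases listed above (initial insert, update with expected version, stale version conflict) map onto a small optimistic-concurrency contract. The sketch below uses an in-memory dictionary as a stand-in for `WF_RUNTIME_STATES`; the class and method names are hypothetical, and starting versions at `0` is an assumption.

```python
from typing import Any, Dict, Optional, Tuple


class StaleVersionError(Exception):
    """The caller's expected version no longer matches the stored version."""


class InMemorySnapshotStore:
    def __init__(self) -> None:
        # instance_id -> (version, state); a stand-in for the database table.
        self._rows: Dict[str, Tuple[int, Any]] = {}

    def load(self, instance_id: str) -> Optional[Tuple[int, Any]]:
        return self._rows.get(instance_id)

    def commit(self, instance_id: str, expected_version: int, state: Any) -> int:
        current = self._rows.get(instance_id)
        # Assumption: a missing row counts as version 0, so the first
        # insert is committed with expected_version=0.
        current_version = current[0] if current else 0
        if current_version != expected_version:
            raise StaleVersionError(
                f"{instance_id}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._rows[instance_id] = (new_version, state)
        return new_version
```

In a relational store the same check is commonly expressed as a conditional update on the version column followed by an affected-row-count check, so the comparison and the write happen atomically.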

## Sprint 4: AQ Signal And Schedule Backbone

### Goal

Introduce Oracle AQ as the durable event backbone.

### Scope

- create AQ setup scripts
- implement signal bus
- implement schedule bus
- implement signal envelope serialization
- implement hosted signal consumer skeleton

### Deliverables

- AQ DDL scripts
- `OracleAqWorkflowSignalBus`
- `OracleAqWorkflowScheduleBus`
- integration tests with enqueue/dequeue
- delayed message smoke tests

### Exit Criteria

- engine can publish and receive immediate signals without polling
- engine can publish and receive delayed signals
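
A signal envelope that covers both immediate and delayed delivery could look roughly like the sketch below. Every field name here is an assumption made for illustration; the plan does not specify the envelope schema. The computed delay is what a bus implementation would hand to the queue's delayed-delivery mechanism.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SignalEnvelope:
    # Hypothetical fields; the real envelope contract may differ.
    signal_id: str
    instance_id: str
    signal_type: str
    payload: Dict[str, Any]
    due_at_utc: Optional[datetime] = None  # None means "deliver immediately"

    def to_json(self) -> str:
        data = asdict(self)
        data["due_at_utc"] = self.due_at_utc.isoformat() if self.due_at_utc else None
        # sort_keys keeps the serialized form stable across runs.
        return json.dumps(data, sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "SignalEnvelope":
        data = json.loads(raw)
        due = data.pop("due_at_utc")
        return SignalEnvelope(
            due_at_utc=datetime.fromisoformat(due) if due else None, **data
        )

    def delay_seconds(self, now: datetime) -> float:
        # Overdue or immediate messages get a zero delay, never a negative one.
        if self.due_at_utc is None:
            return 0.0
        return max(0.0, (self.due_at_utc - now).total_seconds())
```

Carrying an absolute due time rather than a relative delay means a message that is re-enqueued after a restart still fires at the originally intended moment.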

## Sprint 5: Start Flow And Human Task Activation

### Goal

Run workflows from start until the first durable wait.

### Scope

- implement execution coordinator
- implement canonical interpreter subset:
  - state assignment
  - business reference assignment
  - task activation
  - terminal completion
- integrate with `WorkflowRuntimeService`
- keep existing projection model

### Deliverables

- `SerdicaEngineRuntimeProvider.StartAsync`
- execution slice result model
- task activation write path
- tests for:
  - start to task
  - start to completion
  - business reference propagation

### Exit Criteria

- selected declarative workflows can start and create correct tasks without Elsa

## Sprint 6: Task Completion And Transport Calls

### Goal

Advance workflows after task completion and support transport-backed orchestration.

### Scope

- implement task completion execution path
- implement canonical interpreter support for:
  - transport calls
  - branches
  - success/failure paths
- integrate completion flow with runtime snapshot commit

### Deliverables

- `SerdicaEngineRuntimeProvider.CompleteAsync`
- transport dispatcher
- tests for:
  - completion to next task
  - failure branch
  - timeout branch where applicable

### Exit Criteria

- representative workflows can complete the first task and reach the correct next state

## Sprint 7: Subworkflows, Continue-With, And Repeat

### Goal

Support the higher-order orchestration patterns used heavily in the corpus.

### Scope

- implement subworkflow frame persistence
- implement parent resume
- implement continue-with production
- implement repeat resume semantics

### Deliverables

- subworkflow coordinator
- resume pointer serializer
- tests for:
  - child completion resumes parent
  - nested frame handling
  - repeat interrupted by wait
  - continue-with request emission

### Exit Criteria

- representative subworkflow-heavy families execute correctly

## Sprint 8: Timers, Retries, And Delayed Resume

### Goal

Finish the non-polling scheduling path.

### Scope

- implement timer waits
- implement retry scheduling
- ignore stale timer deliveries by checking waiting tokens
- integrate delayed AQ delivery into execution coordinator

### Deliverables

- timer wait model
- delayed resume handler
- tests for:
  - timer due resume
  - retry due resume
  - canceled timer ignored
  - restart-safe delayed processing

### Exit Criteria

- the engine supports time-based orchestration without polling loops
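
One way to make canceled or superseded timers harmless is to give every wait a fresh token and drop any delayed message whose token no longer matches. The sketch below illustrates that idea only; the snapshot shape and function names are hypothetical, and the real engine may store and compare tokens differently.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceSnapshot:
    # Hypothetical minimal snapshot: only the fields this sketch needs.
    waiting_token: Optional[str] = None
    resume_count: int = 0


def begin_timer_wait(snapshot: InstanceSnapshot) -> str:
    """Start a wait and return the token to embed in the delayed message."""
    token = uuid.uuid4().hex
    snapshot.waiting_token = token
    return token


def cancel_wait(snapshot: InstanceSnapshot) -> None:
    """Cancel the wait; the delayed message stays queued but becomes stale."""
    snapshot.waiting_token = None


def on_timer_due(snapshot: InstanceSnapshot, message_token: str) -> bool:
    """Resume only if the message belongs to the wait that is still active."""
    if snapshot.waiting_token != message_token:
        return False  # stale: canceled, superseded, or already consumed
    snapshot.waiting_token = None  # consume the wait so duplicates are ignored
    snapshot.resume_count += 1
    return True
```

With this approach nothing has to be removed from the queue on cancellation, and a duplicate delivery of the same timer message is ignored by the same check.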

## Sprint 9: Operational Parity

### Goal

Reach product-surface and operations parity with the existing workflow service.

### Scope

- diagram parity validation
- runtime state inspection parity
- retention integration
- structured metrics and logging
- DLQ handling and diagnostics

### Deliverables

- runtime metadata mapping updates
- operational dashboards or documented metric set
- DLQ support
- tests for supportability paths

### Exit Criteria

- operations can inspect and support engine-driven instances through the existing product surface

## Sprint 10: Corpus Parity And Hardening

### Goal

Prove the engine against the real declarative workflow corpus.

### Scope

- execute representative high-fanout families end-to-end
- resolve remaining interpreter gaps
- multi-node duplicate delivery testing
- restart and recovery testing
- performance and soak tests

### Deliverables

- parity report against selected workflow families
- load test results
- recovery test results
- production readiness checklist

### Exit Criteria

- selected production-grade workflows run without Elsa
- restart recovery is proven
- no polling is used for steady-state signal or timer discovery

## Sprint 11: Bulstrad E2E Parity And Oracle Reliability

### Goal

Turn the engine from a validated runtime into a production-grade execution platform by proving it against real Bulstrad workflows and hostile Oracle operating conditions.

### Scope

- build a curated Bulstrad Oracle-AQ E2E suite
- replace synthetic runtime-state backing in Oracle integration tests with the real Oracle runtime-state store
- add Oracle transaction-coupling tests for state, projections, and AQ publish
- add Oracle restart, redelivery, and DLQ replay tests
- add multi-worker and duplicate-delivery race tests
- add deterministic fault-injection around commit boundaries

### Deliverables

- `BulstradOracleAqE2ETests`
- curated representative workflows with scripted downstream responders
- Oracle transport reliability suite covering:
  - immediate and delayed delivery
  - rollback and redelivery
  - dead-letter browse and replay
  - restart-safe delayed processing
- concurrency suite covering:
  - duplicate signal delivery
  - same-instance multi-worker races
  - retry-after-conflict behavior
- documented timing expectations for cold-start and steady-state Oracle AQ

### Implemented Coverage

The Oracle-backed integration harness currently includes:

- Bulstrad policy-change families:
  - `OpenForChangePolicy`
  - `ReviewPolicyOpenForChange`
  - `AssistantAddAnnex`
  - `AnnexCancellation`
  - `AssistantPolicyReinstate`
  - `AssistantPolicyCancellation`
  - `AssistantPrintInsisDocuments`
- shared policy families:
  - `InsisIntegrationNew`
  - `QuotationConfirm`
  - `QuoteOrAplCancel`
- Oracle transport and recovery matrix:
  - immediate and delayed AQ delivery
  - delayed backlog drain within a bounded latency envelope
  - dequeue rollback redelivery
  - ambient Oracle transaction commit and rollback for immediate messages
  - ambient Oracle transaction commit and rollback for delayed messages
  - dead-letter browse, replay, and backlog replay
  - dead-letter backlog survival across Oracle restart
  - timer backlog recovery across provider restart and Oracle restart
  - external-signal backlog recovery, worker abandon/recovery, and duplicate-delivery races
  - schedule/publish failure rollback inside workflow mutation transactions

### Exit Criteria

- representative Bulstrad workflows execute correctly on `SerdicaEngine` with real Oracle AQ
- AQ-backed restart and delayed-delivery behavior is proven under realistic timing variance
- duplicate delivery and commit-boundary failures are shown to be safe
- the team has a stable PR suite and a broader nightly suite for Oracle-backed engine validation
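
The interaction between duplicate delivery and retry-after-conflict can be shown with a deterministic race. This is an illustrative sketch, not the engine's mechanism: recording applied signal IDs inside the snapshot is an assumption, and `before_commit` is a hypothetical test hook used to inject a competing worker at the commit boundary.

```python
from typing import Callable, Optional, Set, Tuple


class Conflict(Exception):
    """Another worker committed first."""


class Store:
    def __init__(self) -> None:
        self.version = 0
        self.applied: Set[str] = set()

    def load(self) -> Tuple[int, Set[str]]:
        return self.version, set(self.applied)  # copy, so callers cannot mutate

    def commit(self, expected_version: int, applied: Set[str]) -> None:
        if expected_version != self.version:
            raise Conflict()
        self.version += 1
        self.applied = applied


def apply_signal(
    store: Store,
    signal_id: str,
    max_attempts: int = 3,
    before_commit: Optional[Callable[[int], None]] = None,
) -> str:
    for attempt in range(1, max_attempts + 1):
        version, applied = store.load()
        if signal_id in applied:
            return "duplicate"  # already handled; safe to acknowledge
        applied.add(signal_id)
        if before_commit:
            before_commit(attempt)  # fault-injection point for tests
        try:
            store.commit(version, applied)
            return "applied"
        except Conflict:
            continue  # reload fresh state and re-evaluate
    raise Conflict(f"gave up on {signal_id} after {max_attempts} attempts")
```

A test can pass a `before_commit` hook that delivers the same signal through a second worker on the first attempt. The first worker then hits a conflict, reloads, finds the signal already applied, and returns `"duplicate"` without mutating state twice.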

## Sprint 12: Load, Performance, And Capacity Characterization

### Goal

Turn the correctness-focused Oracle validation suite into a real load and performance program with stable smoke gates, nightly trend runs, soak coverage, and first capacity numbers.

### Scope

- build a dedicated performance harness on top of the Oracle AQ integration foundation
- separate PR smoke, nightly characterization, weekly soak, and explicit capacity tiers
- add synthetic engine workloads for stable measurement
- add representative Bulstrad workload runners for business realism
- persist performance artifacts and summary reports
- define baseline and regression strategy per environment

### Deliverables

- categorized performance scenarios:
  - `WorkflowPerfLatency`
  - `WorkflowPerfThroughput`
  - `WorkflowPerfSmoke`
  - `WorkflowPerfNightly`
  - `WorkflowPerfSoak`
  - `WorkflowPerfCapacity`
- result artifact writer under `TestResults/workflow-performance/`
- scenario matrix covering:
  - AQ immediate bursts
  - AQ delayed bursts
  - mixed signal backlogs
  - synthetic start/task/signal/timer/subworkflow flows
  - representative Bulstrad families
  - restart and replay under load
- first baseline report for local Docker and CI Oracle
- first capacity note for one-node and multi-node assumptions

### Exit Criteria

- PR smoke load checks are cheap and stable enough to run continuously
- nightly runs capture latency, throughput, and correctness artifacts
- soak runs prove no backlog drift or correctness decay over extended execution
- representative Bulstrad workflows have measured latency envelopes, not just functional pass/fail
- the team has an initial sizing recommendation for worker concurrency and queue backlog expectations

### Implemented Foundation

The Sprint 12 implementation currently includes:

- performance categories and artifact generation under `TestResults/workflow-performance/`
- Oracle AQ smoke scenarios for:
  - immediate burst drain
  - delayed burst drain
  - synthetic external-signal backlog resume
  - short Bulstrad business burst using `QuoteOrAplCancel`
- persisted comparison against the previous artifact for the same scenario and tier
- Oracle AQ nightly scenarios for:
  - larger immediate burst drain
  - larger delayed burst drain
  - larger synthetic external-signal backlog resume
  - Bulstrad `QuotationConfirm -> PdfGenerator` burst
- Oracle AQ soak scenario for:
  - sustained synthetic signal round-trip waves without correctness drift
- Oracle AQ latency baseline for:
  - one-at-a-time synthetic signal round-trip with phase-level latency summaries
- Oracle AQ throughput baseline for:
  - parallel synthetic signal round-trip with `16` workload concurrency and `8` signal workers
- Oracle AQ capacity ladder for:
  - synthetic signal round-trip at concurrency `1`, `4`, `8`, and `16`
- thread-safe scripted transport recording for concurrent smoke scenarios
- first full Oracle baseline run with documented metrics in:
  - [10-oracle-performance-baseline-2026-03-17.md](10-oracle-performance-baseline-2026-03-17.md)
  - [10-oracle-performance-baseline-2026-03-17.json](10-oracle-performance-baseline-2026-03-17.json)
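
Comparing a run against the previous artifact for the same scenario and tier can be as simple as the sketch below. It is an assumption-laden illustration: the metric names, the nearest-rank percentile method, and the 20% tolerance are placeholders, not values taken from the harness.

```python
import math
from typing import Dict, List, Tuple


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    # Hypothetical metric set; the real artifact records more than this.
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "max": max(latencies_ms),
    }


def find_regressions(
    current: Dict[str, float],
    previous: Dict[str, float],
    tolerance: float = 0.20,  # assumed threshold: flag anything over 20% slower
) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (previous, current)} for metrics beyond the tolerance."""
    return {
        name: (previous[name], current[name])
        for name in previous
        if name in current and current[name] > previous[name] * (1 + tolerance)
    }
```

A relative tolerance rather than an absolute one lets the same check serve both the cheap smoke tier and the larger nightly tier, where absolute latencies differ widely.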

### Reference

The detailed workload model, KPI set, harness design, and baseline strategy are defined in [08-load-and-performance-plan.md](08-load-and-performance-plan.md).

## Sprint 13: Engine-Native Rendering And Authoring Projection

### Goal

Restore definition rendering and authoring projection without reintroducing Elsa types or runtime dependencies into the workflow declarations or the engine host.

### Scope

- design and implement a native definition-to-diagram projection for declarative and canonical workflows
- support deterministic node and edge generation from runtime definitions
- preserve task, branch, repeat, fork, timer, signal, and subworkflow visibility in the rendered output
- define a stable rendering contract for the operational API and future authoring tools
- keep rendering as a separate projection layer, not as part of runtime execution

### Deliverables

- native rendering model and renderer for `WorkflowRuntimeDefinition`
- canonical-to-diagram projection rules for:
  - linear sequences
  - decisions and conditional branches
  - repeats
  - forks and joins
  - timers and external-signal waits
  - continuations and subworkflows
- updated operational metadata and diagram endpoints backed only by engine assets
- test suite covering rendering determinism and parity for representative Bulstrad workflows

### Exit Criteria

- workflow definitions render without any Elsa packages, builders, or activity models
- rendered diagrams remain stable for the same declarative definition across rebuilds
- operational diagram inspection uses the native renderer only
- the rendering layer is ready to support a later authoring surface without changing workflow declarations
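
Rendering determinism largely comes down to deriving node identifiers from the definition itself rather than from random or run-dependent values. The sketch below covers only linear sequences and uses a hypothetical step shape (`kind`, optional `name`); branches, forks, and subworkflows would need additional projection rules.

```python
import json
from typing import Any, Dict, List


def project_linear(definition_name: str, steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project an ordered step list into nodes and edges with stable IDs."""
    nodes: List[Dict[str, str]] = []
    edges: List[Dict[str, str]] = []
    previous_id = None
    for index, step in enumerate(steps):
        # ID depends only on definition name, position, and step kind,
        # so the same definition always yields the same diagram.
        node_id = f"{definition_name}/{index}:{step['kind']}"
        nodes.append(
            {"id": node_id, "kind": step["kind"], "label": step.get("name", step["kind"])}
        )
        if previous_id is not None:
            edges.append({"from": previous_id, "to": node_id})
        previous_id = node_id
    return {"nodes": nodes, "edges": edges}


def render_json(diagram: Dict[str, Any]) -> str:
    # sort_keys makes the serialized output byte-stable across rebuilds.
    return json.dumps(diagram, sort_keys=True)
```

A determinism test then only needs to project the same definition twice and compare the serialized output byte for byte.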

## Sprint 14: Backend Portability And Store Profiles

### Goal

Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.

### Scope

- introduce backend profile abstraction and dedicated backend plugin registration
- split projection persistence from the current Oracle-first application service
- formalize mutation coordinator abstraction
- add backend-neutral dead-letter contract
- add backend conformance suite
- implement PostgreSQL profile
- design MongoDB profile in executable detail, with implementation only after explicit product approval

### Deliverables

- `IWorkflowBackendRegistrationMarker`
- backend-neutral projection contract
- backend-neutral mutation coordinator contract
- backend conformance suite
- dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
- executable MongoDB backend plugin design package

### Exit Criteria

- host selects one backend profile by configuration
- host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
- Oracle and PostgreSQL pass the same conformance suite
- MongoDB path is specified well enough that implementation is a bounded engineering task
- workflow declarations and canonical definitions remain unchanged across backend profiles
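
Selecting exactly one backend profile by configuration, while keeping the host unaware of concrete backends, can be sketched with a small registry. The names below, including the `workflow.backend` configuration key, are hypothetical and do not describe `IWorkflowBackendRegistrationMarker` or the actual plugin registration path.

```python
from typing import Callable, Dict, Mapping


class BackendProfile:
    # Hypothetical placeholder for the stores and buses a backend provides.
    def __init__(self, name: str) -> None:
        self.name = name


class BackendRegistry:
    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], BackendProfile]] = {}

    def register(self, name: str, factory: Callable[[], BackendProfile]) -> None:
        # Each backend plugin registers itself; the host never names a backend.
        self._factories[name.lower()] = factory

    def select(self, config: Mapping[str, str]) -> BackendProfile:
        # Assumed configuration key; exactly one profile is active per deployment.
        requested = config.get("workflow.backend", "").strip().lower()
        if not requested:
            raise ValueError("no backend profile configured")
        if requested not in self._factories:
            known = ", ".join(sorted(self._factories))
            raise ValueError(f"unknown backend '{requested}'; registered: {known}")
        return self._factories[requested]()
```

Failing at selection time with the list of registered profiles makes a misconfigured deployment visible at startup rather than at the first workflow operation.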

## Sprint 15: Backend-Neutral Parity And Performance Harness

### Goal

Remove the remaining Oracle-only assumptions from the validation stack so PostgreSQL and MongoDB can be measured with the same correctness, Bulstrad, and performance scenarios.

### Scope

- extract backend-neutral performance artifacts, categories, and scenario drivers
- extract backend-neutral runtime workload helpers from the Oracle-only harness
- define one hostile-condition matrix shared by Oracle, PostgreSQL, and MongoDB
- define one curated Bulstrad parity pack shared by all backends
- define one normalized performance artifact format and baseline comparison model

### Deliverables

- shared `IntegrationTests/Performance/Common/` package
- shared normalized performance metrics model
- shared Bulstrad workload catalog for:
  - `OpenForChangePolicy`
  - `ReviewPolicyOpenForChange`
  - `AssistantPrintInsisDocuments`
  - `AssistantAddAnnex`
  - `AnnexCancellation`
  - `AssistantPolicyCancellation`
  - `AssistantPolicyReinstate`
  - `InsisIntegrationNew`
  - `QuotationConfirm`
  - `QuoteOrAplCancel`
- backend-neutral hostile-condition checklist for:
  - duplicate delivery
  - same-instance resume race
  - abandon and reclaim
  - rollback on publish/schedule failure
  - restart with pending due messages
  - DLQ replay
  - backlog drain

### Exit Criteria

- Oracle, PostgreSQL, and MongoDB use the same performance artifact shape
- Oracle no longer owns the reporting model for later backend baselines
- PostgreSQL and MongoDB can plug into the same workload definitions without changing workflow semantics

## Sprint 16: PostgreSQL Hardening, Bulstrad Parity, And Baseline

### Goal

Bring PostgreSQL to Oracle-level confidence for correctness, hostile conditions, representative product behavior, and measured performance.

### Scope

- close the PostgreSQL hostile-condition gap to the Oracle matrix
- add PostgreSQL-backed Bulstrad E2E parity
- implement PostgreSQL latency, throughput, smoke, nightly, soak, and capacity suites
- publish PostgreSQL baseline artifacts and narrative summary

### Deliverables

- PostgreSQL hostile-condition integration suite
- PostgreSQL Bulstrad parity suite
- PostgreSQL performance suites for:
  - latency
  - throughput
  - smoke
  - nightly
  - soak
  - capacity
- baseline documents:
  - `11-postgres-performance-baseline-<date>.md`
  - `11-postgres-performance-baseline-<date>.json`

### Exit Criteria

- PostgreSQL passes the same hostile-condition matrix as Oracle
- representative Bulstrad workflows run correctly on PostgreSQL
- PostgreSQL has a durable, documented performance baseline comparable to Oracle

## Sprint 17: MongoDB Hardening, Bulstrad Parity, And Baseline

### Goal

Bring MongoDB to the same product and operational confidence level as the relational backends without changing workflow behavior.

### Scope

- close the MongoDB hostile-condition gap to the Oracle matrix
- add MongoDB-backed Bulstrad E2E parity
- implement MongoDB latency, throughput, smoke, nightly, soak, and capacity suites
- publish MongoDB baseline artifacts and narrative summary

### Deliverables

- MongoDB hostile-condition integration suite
- MongoDB Bulstrad parity suite
- MongoDB performance suites for:
  - latency
  - throughput
  - smoke
  - nightly
  - soak
  - capacity
- baseline documents:
  - `12-mongo-performance-baseline-<date>.md`
  - `12-mongo-performance-baseline-<date>.json`

### Exit Criteria

- MongoDB passes the same hostile-condition matrix as Oracle
- representative Bulstrad workflows run correctly on MongoDB
- MongoDB has a durable, documented performance baseline comparable to Oracle and PostgreSQL

## Sprint 18: Final Three-Backend Characterization And Decision Pack

### Goal

Produce the final side-by-side comparison for Oracle, PostgreSQL, and MongoDB using the same workloads, the same correctness rules, and the same performance artifact format.

### Scope

- rerun the shared Bulstrad parity pack on all three backends
- rerun the shared hostile-condition matrix on all three backends
- rerun the shared performance tiers and compare normalized metrics
- capture backend-specific metrics appendices without letting them replace normalized workflow metrics
- publish the final recommendation pack

### Deliverables

- final comparison documents:
  - `13-backend-comparison-<date>.md`
  - `13-backend-comparison-<date>.json`
- normalized comparison across:
  - serial latency
  - steady-state throughput
  - capacity ladder
  - backlog drain
  - duplicate-delivery safety
  - restart recovery
- backend-specific appendices for:
  - Oracle wait and AQ observations
  - PostgreSQL lock, WAL, and queue-table observations
  - MongoDB transaction, lock, and change-stream observations

### Exit Criteria

- all three backends are compared through the same workload lens
- the team has one documented backend recommendation pack
- future backend decisions can reuse the same comparison harness instead of inventing new ad hoc measurements

### Current Status

- baseline comparison pack published in:
  - [13-backend-comparison-2026-03-17.md](13-backend-comparison-2026-03-17.md)
  - [13-backend-comparison-2026-03-17.json](13-backend-comparison-2026-03-17.json)
- normalized performance comparison is complete for Oracle, PostgreSQL, and MongoDB
- reliability and Bulstrad hardening depth remains Oracle-first, so the current comparison is a baseline decision pack, not the final production closeout
- the signal path is now split into durable store and wake driver seams
- PostgreSQL and MongoDB now persist transactional wake-outbox records behind that seam
- the optional Redis wake-driver plugin is implemented for PostgreSQL and MongoDB
- Oracle intentionally remains on native AQ and does not support the Redis wake-driver combination

## Cross-Sprint Work Items

Maintain these continuously rather than leaving them to the end:

- architecture doc updates
- test harness improvements
- canonical execution parity assertions
- operational telemetry quality
- snapshot schema versioning discipline
- Oracle timing-envelope observations for CI and local Docker environments

## Final Milestone Definition

The project is complete when:

- the workflow service can run on the engine as the active runtime
- task and instance APIs remain stable
- Oracle AQ handles both immediate signaling and delayed scheduling
- the service resumes correctly after restart without polling
- the engine runs representative real workflows with production-grade observability