# 07. Sprint Plan
## Planning Assumptions
- sprint length: 2 weeks
- one team owning runtime, persistence, and service integration
- Oracle AQ available
- no concurrent-engine migration scope
- acceptance means code, tests, and updated docs
## Sprint 1: Foundations And Contracts
### Goal
Create the engine skeleton and the stable interfaces.
### Scope
- add runtime provider abstraction
- add signal bus abstraction
- add schedule bus abstraction
- add runtime snapshot abstraction
- add engine option classes
- add `docs/engine/` package
### Deliverables
- interface set compiled into shared abstractions
- configuration classes
- initial DI composition path
- unit tests for options and registration
### Exit Criteria
- service builds with engine abstractions present
- no Elsa runtime assumptions are introduced into new code
- docs and interface names are stable enough for later sprints
## Sprint 2: Canonical Runtime Definition Store
### Goal
Make canonical execution definitions available at runtime without Elsa.
### Scope
- compile authored workflows to canonical runtime definitions at startup
- validate definitions during startup
- cache runtime definitions
- expose startup failure mode for invalid definitions
### Deliverables
- `WorkflowRuntimeDefinitionStore`
- definition normalization pipeline
- startup validator
- tests covering:
  - valid definition load
  - invalid definition rejection
  - version resolution
### Exit Criteria
- all registered workflows load into runtime definition cache
- the runtime can resolve definition by name/version
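
The load, validate, cache, and resolve flow above can be sketched in a few lines. This is an illustration in Python, not the engine's `WorkflowRuntimeDefinitionStore`: the `RuntimeDefinition` shape, the validation rules, and the "highest version wins when none is requested" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeDefinition:
    name: str
    version: int
    steps: tuple  # canonical steps; the shape is hypothetical


class InvalidDefinitionError(Exception):
    """Raised during startup so the host fails fast on a bad definition."""


class DefinitionStore:
    def __init__(self):
        self._cache = {}  # (name, version) -> RuntimeDefinition

    def load(self, definitions):
        """Validate and cache each definition; the first invalid one aborts the load."""
        for definition in definitions:
            key = (definition.name, definition.version)
            if not definition.name or definition.version < 1 or not definition.steps:
                raise InvalidDefinitionError(f"invalid definition {key}")
            if key in self._cache:
                raise InvalidDefinitionError(f"duplicate definition {key}")
            self._cache[key] = definition

    def resolve(self, name, version=None):
        """Return an exact version, or the highest one when version is None (assumed rule)."""
        if version is not None:
            return self._cache[(name, version)]
        versions = [v for (n, v) in self._cache if n == name]
        if not versions:
            raise KeyError(name)
        return self._cache[(name, max(versions))]
```

Raising from `load` is what turns an invalid definition into a startup failure rather than a runtime surprise.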
## Sprint 3: Snapshot Store And Versioned Runtime State
### Goal
Turn `WF_RUNTIME_STATES` into a first-class engine snapshot store.
### Scope
- extend runtime state schema
- implement snapshot mapper
- implement optimistic concurrency versioning
- wire snapshot reads and writes
### Deliverables
- database migration scripts
- `OracleWorkflowRuntimeSnapshotStore`
- snapshot serialization contracts
- tests for:
  - initial insert
  - update with expected version
  - stale version conflict
### Exit Criteria
- runtime snapshots can be loaded and committed with version control
- stale updates are rejected safely
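
The three test cases listed above (initial insert, update with expected version, stale version conflict) correspond to the following behavior. This is a minimal in-memory sketch, not `OracleWorkflowRuntimeSnapshotStore`; the class and method names are hypothetical, and a relational store would typically enforce the same rule with a conditional update on the version column.

```python
class StaleVersionError(Exception):
    """The caller committed against a version that is no longer current."""


class InMemorySnapshotStore:
    def __init__(self):
        self._rows = {}  # instance_id -> (version, state)

    def load(self, instance_id):
        """Return (version, state), or None when the instance has no snapshot."""
        return self._rows.get(instance_id)

    def commit(self, instance_id, expected_version, state):
        """Write the snapshot only if the stored version matches expected_version.

        Version 0 means "no snapshot yet", so the initial insert uses expected_version=0.
        """
        current = self._rows.get(instance_id)
        current_version = current[0] if current else 0
        if current_version != expected_version:
            raise StaleVersionError(
                f"{instance_id}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._rows[instance_id] = (new_version, dict(state))
        return new_version
```

A worker that receives `StaleVersionError` should reload the snapshot and decide again rather than overwrite; that is what makes duplicate delivery and multi-worker races safe at the store level.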
## Sprint 4: AQ Signal And Schedule Backbone
### Goal
Introduce Oracle AQ as the durable event backbone.
### Scope
- create AQ setup scripts
- implement signal bus
- implement schedule bus
- implement signal envelope serialization
- implement hosted signal consumer skeleton
### Deliverables
- AQ DDL scripts
- `OracleAqWorkflowSignalBus`
- `OracleAqWorkflowScheduleBus`
- integration tests with enqueue/dequeue
- delayed message smoke tests
### Exit Criteria
- engine can publish and receive immediate signals without polling
- engine can publish and receive delayed signals
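
As an illustration of signal envelope serialization, the sketch below round-trips an envelope through JSON bytes. The field names (`instance_id`, `signal_type`, `waiting_token`, `delay_seconds`, `payload`) are assumptions chosen for the example; the engine's actual envelope contract is not specified in this plan.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class SignalEnvelope:
    instance_id: str
    signal_type: str
    waiting_token: str
    delay_seconds: int = 0          # 0 means immediate delivery (assumed convention)
    payload: dict = field(default_factory=dict)


def serialize(envelope):
    """Encode the envelope as UTF-8 JSON with stable key order."""
    return json.dumps(asdict(envelope), sort_keys=True).encode("utf-8")


def deserialize(raw):
    """Rebuild the envelope from the bytes produced by serialize()."""
    return SignalEnvelope(**json.loads(raw.decode("utf-8")))
```

Keeping the delay in the envelope lets one shape serve both the signal bus and the schedule bus, which is one possible design, not necessarily the one adopted.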
## Sprint 5: Start Flow And Human Task Activation
### Goal
Run workflows from start until first durable wait.
### Scope
- implement execution coordinator
- implement canonical interpreter subset:
  - state assignment
  - business reference assignment
  - task activation
  - terminal completion
- integrate with `WorkflowRuntimeService`
- keep existing projection model
### Deliverables
- `SerdicaEngineRuntimeProvider.StartAsync`
- execution slice result model
- task activation write path
- tests for:
  - start to task
  - start to completion
  - business reference propagation
### Exit Criteria
- selected declarative workflows can start and create correct tasks without Elsa
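
"Run until first durable wait" can be pictured as an execution slice: interpret steps in order and stop at the first task activation or at completion. The sketch below covers only the four step kinds in this sprint's interpreter subset; the step dictionaries, the `run_slice` name, and the result shape are hypothetical.

```python
def run_slice(steps, state, pointer=0):
    """Run steps from `pointer` until a task wait or terminal completion."""
    while pointer < len(steps):
        step = steps[pointer]
        pointer += 1  # the resume pointer always names the step after the current one
        kind = step["kind"]
        if kind == "assign_state":
            state[step["key"]] = step["value"]
        elif kind == "assign_business_reference":
            state["business_reference"] = state[step["from_key"]]
        elif kind == "activate_task":
            # Durable wait: the caller persists state and resume_pointer, then stops.
            return {"status": "waiting", "task": step["task"], "resume_pointer": pointer}
        elif kind == "complete":
            return {"status": "completed", "task": None, "resume_pointer": None}
        else:
            raise ValueError(f"unsupported step kind: {kind}")
    return {"status": "completed", "task": None, "resume_pointer": None}
```

Task completion in the next sprint would then call the same function again with the stored `resume_pointer`.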
## Sprint 6: Task Completion And Transport Calls
### Goal
Advance workflows after task completion and support transport-backed orchestration.
### Scope
- implement task completion execution path
- implement canonical interpreter support for:
  - transport calls
  - branches
  - success/failure paths
- integrate completion flow with runtime snapshot commit
### Deliverables
- `SerdicaEngineRuntimeProvider.CompleteAsync`
- transport dispatcher
- tests for:
  - completion to next task
  - failure branch
  - timeout branch where applicable
### Exit Criteria
- representative workflows can complete first task and reach correct next state
## Sprint 7: Subworkflows, Continue-With, And Repeat
### Goal
Support the higher-order orchestration patterns used heavily in the corpus.
### Scope
- implement subworkflow frame persistence
- implement parent resume
- implement continue-with production
- implement repeat resume semantics
### Deliverables
- subworkflow coordinator
- resume pointer serializer
- tests for:
  - child completion resumes parent
  - nested frame handling
  - repeat interrupted by wait
  - continue-with request emission
### Exit Criteria
- representative subworkflow-heavy families execute correctly
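
Subworkflow frame persistence and parent resume can be modeled as a stack of frames, each remembering where its workflow resumes. The sketch below is an assumption about one workable shape; the frame fields and function names are illustrative, not the engine's subworkflow coordinator.

```python
def start_child(frames, child_workflow, resume_pointer, result_key):
    """Record where the parent resumes, then push a frame for the child."""
    frames[-1]["resume_pointer"] = resume_pointer
    frames[-1]["result_key"] = result_key
    frames.append(
        {"workflow": child_workflow, "state": {}, "resume_pointer": 0, "result_key": None}
    )


def complete_child(frames, result):
    """Pop the finished frame and hand its result to the parent, if there is one."""
    frames.pop()
    if not frames:
        return {"status": "completed", "result": result}
    parent = frames[-1]
    parent["state"][parent["result_key"]] = result
    return {
        "status": "resume",
        "workflow": parent["workflow"],
        "pointer": parent["resume_pointer"],
    }
```

Because the frames are plain dictionaries and lists, the whole stack can be stored inside the runtime snapshot, so nested children survive a restart.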
## Sprint 8: Timers, Retries, And Delayed Resume
### Goal
Finish the non-polling scheduling path.
### Scope
- implement timer waits
- implement retry scheduling
- implement stale timer ignore logic via waiting tokens
- integrate delayed AQ delivery into execution coordinator
### Deliverables
- timer wait model
- delayed resume handler
- tests for:
  - timer due resume
  - retry due resume
  - canceled timer ignored
  - restart-safe delayed processing
### Exit Criteria
- the engine supports time-based orchestration without polling loops
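
The stale-timer rule can be sketched as a token comparison: each wait gets a fresh token, the delayed message carries it, and a due timer is honored only if its token still matches the instance's current wait. The names below are hypothetical, and token generation by `uuid4` is an assumption.

```python
import uuid


class Instance:
    def __init__(self):
        self.waiting_token = None  # None means the instance is not waiting on a timer
        self.resume_count = 0


def arm_timer(instance):
    """Start a new wait and return the token to embed in the delayed message."""
    instance.waiting_token = uuid.uuid4().hex
    return instance.waiting_token


def cancel_wait(instance):
    """Cancel the wait; any message already queued for it becomes stale."""
    instance.waiting_token = None


def on_timer_due(instance, token):
    """Resume only when the token matches the current wait; otherwise ignore."""
    if instance.waiting_token is None or token != instance.waiting_token:
        return False  # canceled, superseded, or duplicate delivery
    instance.waiting_token = None
    instance.resume_count += 1
    return True
```

The queued message is never deleted on cancel; it is simply ignored on arrival, which avoids having to find and remove a delayed message from the queue.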
## Sprint 9: Operational Parity
### Goal
Reach product-surface and operations parity with the existing workflow service.
### Scope
- diagram parity validation
- runtime state inspection parity
- retention integration
- structured metrics and logging
- DLQ handling and diagnostics
### Deliverables
- runtime metadata mapping updates
- operational dashboards or documented metric set
- DLQ support
- tests for supportability paths
### Exit Criteria
- operations can inspect and support engine-driven instances through the existing product surface
## Sprint 10: Corpus Parity And Hardening
### Goal
Prove the engine against the real declarative workflow corpus.
### Scope
- execute representative high-fanout families end-to-end
- resolve remaining interpreter gaps
- multi-node duplicate delivery testing
- restart and recovery testing
- performance and soak tests
### Deliverables
- parity report against selected workflow families
- load test results
- recovery test results
- production readiness checklist
### Exit Criteria
- selected production-grade workflows run without Elsa
- restart recovery is proven
- no polling is used for steady-state signal or timer discovery
## Sprint 11: Bulstrad E2E Parity And Oracle Reliability
### Goal
Turn the engine from a validated runtime into a production-grade execution platform by proving it against real Bulstrad workflows and hostile Oracle operating conditions.
### Scope
- build a curated Bulstrad Oracle-AQ E2E suite
- replace synthetic runtime-state backing in Oracle integration tests with the real Oracle runtime-state store
- add Oracle transaction-coupling tests for state, projections, and AQ publish
- add Oracle restart, redelivery, and DLQ replay tests
- add multi-worker and duplicate-delivery race tests
- add deterministic fault-injection around commit boundaries
### Deliverables
- `BulstradOracleAqE2ETests`
- curated representative workflows with scripted downstream responders
- Oracle transport reliability suite covering:
  - immediate and delayed delivery
  - rollback and redelivery
  - dead-letter browse and replay
  - restart-safe delayed processing
- concurrency suite covering:
  - duplicate signal delivery
  - same-instance multi-worker races
  - retry-after-conflict behavior
- documented timing expectations for cold-start and steady-state Oracle AQ
### Implemented Coverage
The Oracle-backed integration harness currently includes:
- Bulstrad policy-change families:
  - `OpenForChangePolicy`
  - `ReviewPolicyOpenForChange`
  - `AssistantAddAnnex`
  - `AnnexCancellation`
  - `AssistantPolicyReinstate`
  - `AssistantPolicyCancellation`
  - `AssistantPrintInsisDocuments`
- shared policy families:
  - `InsisIntegrationNew`
  - `QuotationConfirm`
  - `QuoteOrAplCancel`
- Oracle transport and recovery matrix:
  - immediate and delayed AQ delivery
  - delayed backlog drain within a bounded latency envelope
  - dequeue rollback redelivery
  - ambient Oracle transaction commit and rollback for immediate messages
  - ambient Oracle transaction commit and rollback for delayed messages
  - dead-letter browse, replay, and backlog replay
  - dead-letter backlog survival across Oracle restart
  - timer backlog recovery across provider restart and Oracle restart
  - external-signal backlog recovery, worker abandon/recovery, and duplicate-delivery races
  - schedule/publish failure rollback inside workflow mutation transactions
### Exit Criteria
- representative Bulstrad workflows execute correctly on `SerdicaEngine` with real Oracle AQ
- AQ-backed restart and delayed-delivery behavior is proven under realistic timing variance
- duplicate delivery and commit-boundary failures are shown to be safe
- the team has a stable PR suite and a broader nightly suite for Oracle-backed engine validation
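
The transaction-coupling requirement (state change and message publish commit or roll back together) can be demonstrated with any transactional store. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for Oracle and AQ; the table names, the `apply_mutation` helper, and the `fail_publish` switch are all hypothetical test scaffolding.

```python
import sqlite3


def apply_mutation(conn, instance_id, new_state, signal_body, fail_publish=False):
    """Update runtime state and enqueue a signal in one transaction.

    Returns True when both writes commit, False when the publish step fails
    and the state update is rolled back with it.
    """
    try:
        with conn:  # sqlite3 commits on success and rolls back on an exception
            conn.execute(
                "UPDATE runtime_state SET state = ? WHERE id = ?",
                (new_state, instance_id),
            )
            if fail_publish:
                raise RuntimeError("simulated publish failure")
            conn.execute(
                "INSERT INTO signal_queue (instance_id, body) VALUES (?, ?)",
                (instance_id, signal_body),
            )
        return True
    except RuntimeError:
        return False
```

A fault-injection test then asserts that after a failed publish neither the new state nor the message is visible, which is the property the commit-boundary tests in this sprint are meant to prove against the real Oracle store.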
## Sprint 12: Load, Performance, And Capacity Characterization
### Goal
Turn the correctness-focused Oracle validation suite into a real load and performance program with stable smoke gates, nightly trend runs, soak coverage, and first capacity numbers.
### Scope
- build a dedicated performance harness on top of the Oracle AQ integration foundation
- separate PR smoke, nightly characterization, weekly soak, and explicit capacity tiers
- add synthetic engine workloads for stable measurement
- add representative Bulstrad workload runners for business realism
- persist performance artifacts and summary reports
- define baseline and regression strategy per environment
### Deliverables
- categorized performance scenarios:
  - `WorkflowPerfLatency`
  - `WorkflowPerfThroughput`
  - `WorkflowPerfSmoke`
  - `WorkflowPerfNightly`
  - `WorkflowPerfSoak`
  - `WorkflowPerfCapacity`
- result artifact writer under `TestResults/workflow-performance/`
- scenario matrix covering:
  - AQ immediate bursts
  - AQ delayed bursts
  - mixed signal backlogs
  - synthetic start/task/signal/timer/subworkflow flows
  - representative Bulstrad families
  - restart and replay under load
- first baseline report for local Docker and CI Oracle
- first capacity note for one-node and multi-node assumptions
### Exit Criteria
- PR smoke load checks are cheap and stable enough to run continuously
- nightly runs capture latency, throughput, and correctness artifacts
- soak runs prove no backlog drift or correctness decay over extended execution
- representative Bulstrad workflows have measured latency envelopes, not just functional pass/fail
- the team has an initial sizing recommendation for worker concurrency and queue backlog expectations
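
To make "latency envelopes" and artifact comparison concrete, the sketch below summarizes one scenario run and compares it with a previous artifact. The artifact fields, the nearest-rank percentile method, and the 20% tolerance are assumptions for illustration; the actual KPI set is defined in the load and performance plan.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(scenario, tier, latencies_ms, duration_s):
    """Build a small artifact dictionary for one scenario run."""
    return {
        "scenario": scenario,
        "tier": tier,
        "count": len(latencies_ms),
        "throughput_per_s": len(latencies_ms) / duration_s,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "max_ms": max(latencies_ms),
    }


def regressed(current, previous, tolerance=0.20):
    """Flag a regression when p95 grows by more than the tolerance (assumed rule)."""
    return current["p95_ms"] > previous["p95_ms"] * (1 + tolerance)
```

Comparing only artifacts with the same scenario and tier keeps a smoke run from being judged against a nightly baseline.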
### Implemented Foundation
The Sprint 12 implementation currently includes:
- performance categories and artifact generation under `TestResults/workflow-performance/`
- Oracle AQ smoke scenarios for:
  - immediate burst drain
  - delayed burst drain
  - synthetic external-signal backlog resume
  - short Bulstrad business burst using `QuoteOrAplCancel`
- persisted comparison against the previous artifact for the same scenario and tier
- Oracle AQ nightly scenarios for:
  - larger immediate burst drain
  - larger delayed burst drain
  - larger synthetic external-signal backlog resume
  - Bulstrad `QuotationConfirm -> PdfGenerator` burst
- Oracle AQ soak scenario for:
  - sustained synthetic signal round-trip waves without correctness drift
- Oracle AQ latency baseline for:
  - one-at-a-time synthetic signal round-trip with phase-level latency summaries
- Oracle AQ throughput baseline for:
  - parallel synthetic signal round-trip with `16` workload concurrency and `8` signal workers
- Oracle AQ capacity ladder for:
  - synthetic signal round-trip at concurrency `1`, `4`, `8`, and `16`
- thread-safe scripted transport recording for concurrent smoke scenarios
- first full Oracle baseline run with documented metrics in:
  - [10-oracle-performance-baseline-2026-03-17.md](10-oracle-performance-baseline-2026-03-17.md)
  - [10-oracle-performance-baseline-2026-03-17.json](10-oracle-performance-baseline-2026-03-17.json)
### Reference
The detailed workload model, KPI set, harness design, and baseline strategy are defined in [08-load-and-performance-plan.md](08-load-and-performance-plan.md).
## Sprint 13: Engine-Native Rendering And Authoring Projection
### Goal
Restore definition rendering and authoring projection without reintroducing Elsa types or runtime dependencies into the workflow declarations or the engine host.
### Scope
- design and implement a native definition-to-diagram projection for declarative and canonical workflows
- support deterministic node and edge generation from runtime definitions
- preserve task, branch, repeat, fork, timer, signal, and subworkflow visibility in the rendered output
- define a stable rendering contract for the operational API and future authoring tools
- keep rendering as a separate projection layer, not as part of runtime execution
### Deliverables
- native rendering model and renderer for `WorkflowRuntimeDefinition`
- canonical-to-diagram projection rules for:
  - linear sequences
  - decisions and conditional branches
  - repeats
  - forks and joins
  - timers and external-signal waits
  - continuations and subworkflows
- updated operational metadata and diagram endpoints backed only by engine assets
- test suite covering rendering determinism and parity for representative Bulstrad workflows
### Exit Criteria
- workflow definitions render without any Elsa packages, builders, or activity models
- rendered diagrams remain stable for the same declarative definition across rebuilds
- operational diagram inspection uses the native renderer only
- the rendering layer is ready to support a later authoring surface without changing workflow declarations
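
Rendering determinism mostly comes down to deriving node identifiers from the definition itself rather than from object identity or generation order. The sketch below handles only the linear-sequence rule; the `project` function, the id scheme, and the node and edge shapes are hypothetical.

```python
def project(definition_name, steps):
    """Project an ordered step list into nodes and edges with stable ids."""
    start_id = f"{definition_name}/start"
    nodes = [{"id": start_id, "kind": "start", "label": "Start"}]
    edges = []
    previous = start_id
    for index, step in enumerate(steps):
        # The id depends only on the definition name and step position,
        # so the same definition yields the same diagram on every rebuild.
        node_id = f"{definition_name}/{index}"
        nodes.append(
            {"id": node_id, "kind": step["kind"], "label": step.get("label", step["kind"])}
        )
        edges.append({"from": previous, "to": node_id})
        previous = node_id
    return {"nodes": nodes, "edges": edges}
```

A determinism test can then project the same definition twice and compare the serialized results for equality.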
## Sprint 14: Backend Portability And Store Profiles
### Goal
Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.
### Scope
- introduce backend profile abstraction and dedicated backend plugin registration
- split projection persistence from the current Oracle-first application service
- formalize mutation coordinator abstraction
- add backend-neutral dead-letter contract
- add backend conformance suite
- implement PostgreSQL profile
- design MongoDB profile in executable detail, with implementation only after explicit product approval
### Deliverables
- `IWorkflowBackendRegistrationMarker`
- backend-neutral projection contract
- backend-neutral mutation coordinator contract
- backend conformance suite
- dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
- executable MongoDB backend plugin design package
### Exit Criteria
- host selects one backend profile by configuration
- host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
- Oracle and PostgreSQL pass the same conformance suite
- MongoDB path is specified well enough that implementation is a bounded engineering task
- workflow declarations and canonical definitions remain unchanged across backend profiles
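
The "one backend profile per deployment" rule can be sketched as a registry that plugins populate and the host queries by configuration. The registry functions and the `backend` configuration key below are hypothetical; the document's `IWorkflowBackendRegistrationMarker` is the real contract and is not modeled here.

```python
class BackendProfileError(Exception):
    """The configured backend profile is missing or unknown."""


_PROFILES = {}  # profile name (lowercase) -> factory(config) -> backend services


def register_backend(name, factory):
    """Called by each backend plugin; the host never imports a backend directly."""
    _PROFILES[name.lower()] = factory


def select_backend(config):
    """Resolve exactly one backend profile from configuration."""
    name = config.get("backend", "")
    factory = _PROFILES.get(name.lower())
    if factory is None:
        raise BackendProfileError(
            f"unknown backend profile {name!r}; registered: {sorted(_PROFILES)}"
        )
    return factory(config)
```

The host depends only on `select_backend`, which is what keeps it backend-neutral while each plugin owns its own store, signal, and schedule wiring.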
## Sprint 15: Backend-Neutral Parity And Performance Harness
### Goal
Remove the remaining Oracle-only assumptions from the validation stack so PostgreSQL and MongoDB can be measured with the same correctness, Bulstrad, and performance scenarios.
### Scope
- extract backend-neutral performance artifacts, categories, and scenario drivers
- extract backend-neutral runtime workload helpers from the Oracle-only harness
- define one hostile-condition matrix shared by Oracle, PostgreSQL, and MongoDB
- define one curated Bulstrad parity pack shared by all backends
- define one normalized performance artifact format and baseline comparison model
### Deliverables
- shared `IntegrationTests/Performance/Common/` package
- shared normalized performance metrics model
- shared Bulstrad workload catalog for:
  - `OpenForChangePolicy`
  - `ReviewPolicyOpenForChange`
  - `AssistantPrintInsisDocuments`
  - `AssistantAddAnnex`
  - `AnnexCancellation`
  - `AssistantPolicyCancellation`
  - `AssistantPolicyReinstate`
  - `InsisIntegrationNew`
  - `QuotationConfirm`
  - `QuoteOrAplCancel`
- backend-neutral hostile-condition checklist for:
  - duplicate delivery
  - same-instance resume race
  - abandon and reclaim
  - rollback on publish/schedule failure
  - restart with pending due messages
  - DLQ replay
  - backlog drain
### Exit Criteria
- Oracle, PostgreSQL, and MongoDB use the same performance artifact shape
- Oracle no longer owns the reporting model for later backend baselines
- PostgreSQL and MongoDB can plug into the same workload definitions without changing workflow semantics
## Sprint 16: PostgreSQL Hardening, Bulstrad Parity, And Baseline
### Goal
Bring PostgreSQL to Oracle-level confidence for correctness, hostile conditions, representative product behavior, and measured performance.
### Scope
- close the PostgreSQL hostile-condition gap to the Oracle matrix
- add PostgreSQL-backed Bulstrad E2E parity
- implement PostgreSQL latency, throughput, smoke, nightly, soak, and capacity suites
- publish PostgreSQL baseline artifacts and narrative summary
### Deliverables
- PostgreSQL hostile-condition integration suite
- PostgreSQL Bulstrad parity suite
- PostgreSQL performance suites for:
  - latency
  - throughput
  - smoke
  - nightly
  - soak
  - capacity
- baseline documents:
  - `11-postgres-performance-baseline-<date>.md`
  - `11-postgres-performance-baseline-<date>.json`
### Exit Criteria
- PostgreSQL passes the same hostile-condition matrix as Oracle
- representative Bulstrad workflows run correctly on PostgreSQL
- PostgreSQL has a durable, documented performance baseline comparable to Oracle
## Sprint 17: MongoDB Hardening, Bulstrad Parity, And Baseline
### Goal
Bring MongoDB to the same product and operational confidence level as the relational backends without changing workflow behavior.
### Scope
- close the MongoDB hostile-condition gap to the Oracle matrix
- add MongoDB-backed Bulstrad E2E parity
- implement MongoDB latency, throughput, smoke, nightly, soak, and capacity suites
- publish MongoDB baseline artifacts and narrative summary
### Deliverables
- MongoDB hostile-condition integration suite
- MongoDB Bulstrad parity suite
- MongoDB performance suites for:
  - latency
  - throughput
  - smoke
  - nightly
  - soak
  - capacity
- baseline documents:
  - `12-mongo-performance-baseline-<date>.md`
  - `12-mongo-performance-baseline-<date>.json`
### Exit Criteria
- MongoDB passes the same hostile-condition matrix as Oracle
- representative Bulstrad workflows run correctly on MongoDB
- MongoDB has a durable, documented performance baseline comparable to Oracle and PostgreSQL
## Sprint 18: Final Three-Backend Characterization And Decision Pack
### Goal
Produce the final side-by-side comparison for Oracle, PostgreSQL, and MongoDB using the same workloads, the same correctness rules, and the same performance artifact format.
### Scope
- rerun the shared Bulstrad parity pack on all three backends
- rerun the shared hostile-condition matrix on all three backends
- rerun the shared performance tiers and compare normalized metrics
- capture backend-specific metrics appendices without letting them replace normalized workflow metrics
- publish the final recommendation pack
### Deliverables
- final comparison documents:
  - `13-backend-comparison-<date>.md`
  - `13-backend-comparison-<date>.json`
- normalized comparison across:
  - serial latency
  - steady-state throughput
  - capacity ladder
  - backlog drain
  - duplicate-delivery safety
  - restart recovery
- backend-specific appendices for:
  - Oracle wait and AQ observations
  - PostgreSQL lock, WAL, and queue-table observations
  - MongoDB transaction, lock, and change-stream observations
### Exit Criteria
- all three backends are compared through the same workload lens
- the team has one documented backend recommendation pack
- future backend decisions can reuse the same comparison harness instead of inventing new ad hoc measurements
### Current Status
- baseline comparison pack published in:
  - [13-backend-comparison-2026-03-17.md](13-backend-comparison-2026-03-17.md)
  - [13-backend-comparison-2026-03-17.json](13-backend-comparison-2026-03-17.json)
- normalized performance comparison is complete for Oracle, PostgreSQL, and MongoDB
- reliability and Bulstrad hardening depth remains Oracle-first, so the current comparison is a baseline decision pack, not the final production closeout
- the signal path is now split into durable store and wake driver seams
- PostgreSQL and MongoDB now persist transactional wake-outbox records behind that seam
- the optional Redis wake-driver plugin is implemented for PostgreSQL and MongoDB
- Oracle intentionally remains on native AQ and does not support the Redis wake-driver combination
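
One way to read the durable-store and wake-driver split is that signals are always persisted first, while the wake is only a hint that new work exists. The sketch below illustrates that reading; it is an assumption about the design, and the class names, the `publish` helper, and the `wake_delivered` switch (which simulates a lost wake) are hypothetical.

```python
class SignalStore:
    """Durable side: signals stay here until a worker claims them."""

    def __init__(self):
        self._pending = []

    def append(self, signal):
        self._pending.append(signal)

    def claim_all(self):
        claimed, self._pending = self._pending, []
        return claimed


class WakeDriver:
    """Wake side: a best-effort hint that carries no signal data."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def notify(self):
        for listener in self._listeners:
            listener()


def publish(store, wake_driver, signal, wake_delivered=True):
    """Persist first, then wake; a lost wake delays processing but loses nothing."""
    store.append(signal)
    if wake_delivered:
        wake_driver.notify()
```

Under this reading, swapping the wake driver (for example to Redis) changes only how quickly workers notice new signals, not whether signals survive.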
## Cross-Sprint Work Items
These should be maintained continuously, not left to the end:
- architecture doc updates
- test harness improvements
- canonical execution parity assertions
- operational telemetry quality
- snapshot schema versioning discipline
- Oracle timing-envelope observations for CI and local Docker environments
## Final Milestone Definition
The project is complete when:
- the workflow service can run on the engine as the active runtime
- task and instance APIs remain stable
- Oracle AQ handles both immediate signaling and delayed scheduling
- the service resumes correctly after restart without polling
- the engine runs representative real workflows with production-grade observability