Add StellaOps.Workflow engine: 14 libraries, WebService, 8 test projects

Extract product-agnostic workflow engine from Ablera.Serdica.Workflow into
standalone StellaOps.Workflow.* libraries targeting net10.0.

Libraries (14):
- Contracts, Abstractions (compiler, decompiler, expression runtime)
- Engine (execution, signaling, scheduling, projections, hosted services)
- ElkSharp (generic graph layout algorithm)
- Renderer.ElkSharp, Renderer.ElkJs, Renderer.Msagl, Renderer.Svg
- Signaling.Redis, Signaling.OracleAq
- DataStore.MongoDB, DataStore.PostgreSQL, DataStore.Oracle

WebService: ASP.NET Core Minimal API with 22 endpoints

Tests (8 projects, 109 tests pass):
- Engine.Tests (105 pass), WebService.Tests (4 E2E pass)
- Renderer.Tests, DataStore.MongoDB/Oracle/PostgreSQL.Tests
- Signaling.Redis.Tests, IntegrationTests.Shared

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 19:14:44 +02:00
parent e56f9a114a
commit f5b5f24d95
422 changed files with 85428 additions and 0 deletions
--- a/docs/workflow/engine/07-sprint-plan.md
+++ b/docs/workflow/engine/07-sprint-plan.md
@@ -0,0 +1,676 @@
+# 07. Sprint Plan
+
+## Planning Assumptions
+
+- sprint length: 2 weeks
+- one team owning runtime, persistence, and service integration
+- Oracle AQ available
+- no concurrent-engine migration scope
+- acceptance means code, tests, and updated docs
+
+## Sprint 1: Foundations And Contracts
+
+### Goal
+
+Create the engine skeleton and the stable interfaces.
+
+### Scope
+
+- add runtime provider abstraction
+- add signal bus abstraction
+- add schedule bus abstraction
+- add runtime snapshot abstraction
+- add engine option classes
+- add `docs/engine/` package
+
+### Deliverables
+
+- interface set compiled into shared abstractions
+- configuration classes
+- initial DI composition path
+- unit tests for options and registration
+
+### Exit Criteria
+
+- service builds with engine abstractions present
+- no Elsa runtime assumptions are introduced into new code
+- docs and interface names are stable enough for later sprints
+
+## Sprint 2: Canonical Runtime Definition Store
+
+### Goal
+
+Make canonical execution definitions available at runtime without Elsa.
+
+### Scope
+
+- compile authored workflows to canonical runtime definitions at startup
+- validate definitions during startup
+- cache runtime definitions
+- expose startup failure mode for invalid definitions
+
+### Deliverables
+
+- `WorkflowRuntimeDefinitionStore`
+- definition normalization pipeline
+- startup validator
+- tests covering:
+  - valid definition load
+  - invalid definition rejection
+  - version resolution
+
+### Exit Criteria
+
+- all registered workflows load into runtime definition cache
+- the runtime can resolve a definition by name and version
+
+## Sprint 3: Snapshot Store And Versioned Runtime State
+
+### Goal
+
+Turn `WF_RUNTIME_STATES` into a first-class engine snapshot store.
+
+### Scope
+
+- extend runtime state schema
+- implement snapshot mapper
+- implement optimistic concurrency versioning
+- wire snapshot reads and writes
+
+### Deliverables
+
+- database migration scripts
+- `OracleWorkflowRuntimeSnapshotStore`
+- snapshot serialization contracts
+- tests for:
+  - initial insert
+  - update with expected version
+  - stale version conflict
+
+### Exit Criteria
+
+- runtime snapshots can be loaded and committed with version control
+- stale updates are rejected safely
+
+## Sprint 4: AQ Signal And Schedule Backbone
+
+### Goal
+
+Introduce Oracle AQ as the durable event backbone.
+
+### Scope
+
+- create AQ setup scripts
+- implement signal bus
+- implement schedule bus
+- implement signal envelope serialization
+- implement hosted signal consumer skeleton
+
+### Deliverables
+
+- AQ DDL scripts
+- `OracleAqWorkflowSignalBus`
+- `OracleAqWorkflowScheduleBus`
+- integration tests with enqueue/dequeue
+- delayed message smoke tests
+
+### Exit Criteria
+
+- engine can publish and receive immediate signals without polling
+- engine can publish and receive delayed signals
+
+## Sprint 5: Start Flow And Human Task Activation
+
+### Goal
+
+Run workflows from start until the first durable wait.
+
+### Scope
+
+- implement execution coordinator
+- implement canonical interpreter subset:
+  - state assignment
+  - business reference assignment
+  - task activation
+  - terminal completion
+- integrate with `WorkflowRuntimeService`
+- keep existing projection model
+
+### Deliverables
+
+- `SerdicaEngineRuntimeProvider.StartAsync`
+- execution slice result model
+- task activation write path
+- tests for:
+  - start to task
+  - start to completion
+  - business reference propagation
+
+### Exit Criteria
+
+- selected declarative workflows can start and create the correct tasks without Elsa
+
+## Sprint 6: Task Completion And Transport Calls
+
+### Goal
+
+Advance workflows after task completion and support transport-backed orchestration.
+
+### Scope
+
+- implement task completion execution path
+- implement canonical interpreter support for:
+  - transport calls
+  - branches
+  - success/failure paths
+- integrate completion flow with runtime snapshot commit
+
+### Deliverables
+
+- `SerdicaEngineRuntimeProvider.CompleteAsync`
+- transport dispatcher
+- tests for:
+  - completion to next task
+  - failure branch
+  - timeout branch where applicable
+
+### Exit Criteria
+
+- representative workflows can complete the first task and reach the correct next state
+
+## Sprint 7: Subworkflows, Continue-With, And Repeat
+
+### Goal
+
+Support the higher-order orchestration patterns used heavily in the corpus.
+
+### Scope
+
+- implement subworkflow frame persistence
+- implement parent resume
+- implement continue-with production
+- implement repeat resume semantics
+
+### Deliverables
+
+- subworkflow coordinator
+- resume pointer serializer
+- tests for:
+  - child completion resumes parent
+  - nested frame handling
+  - repeat interrupted by wait
+  - continue-with request emission
+
+### Exit Criteria
+
+- representative subworkflow-heavy families execute correctly
+
+## Sprint 8: Timers, Retries, And Delayed Resume
+
+### Goal
+
+Finish the non-polling scheduling path.
+
+### Scope
+
+- implement timer waits
+- implement retry scheduling
+- implement stale timer ignore logic via waiting tokens
+- integrate delayed AQ delivery into execution coordinator
+
+### Deliverables
+
+- timer wait model
+- delayed resume handler
+- tests for:
+  - timer due resume
+  - retry due resume
+  - canceled timer ignored
+  - restart-safe delayed processing
+
+### Exit Criteria
+
+- the engine supports time-based orchestration without polling loops
+
+## Sprint 9: Operational Parity
+
+### Goal
+
+Reach product-surface and operations parity with the existing workflow service.
+
+### Scope
+
+- diagram parity validation
+- runtime state inspection parity
+- retention integration
+- structured metrics and logging
+- DLQ handling and diagnostics
+
+### Deliverables
+
+- runtime metadata mapping updates
+- operational dashboards or documented metric set
+- DLQ support
+- tests for supportability paths
+
+### Exit Criteria
+
+- operations can inspect and support engine-driven instances through the existing product surface
+
+## Sprint 10: Corpus Parity And Hardening
+
+### Goal
+
+Prove the engine against the real declarative workflow corpus.
+
+### Scope
+
+- execute representative high-fanout families end-to-end
+- resolve remaining interpreter gaps
+- multi-node duplicate delivery testing
+- restart and recovery testing
+- performance and soak tests
+
+### Deliverables
+
+- parity report against selected workflow families
+- load test results
+- recovery test results
+- production readiness checklist
+
+### Exit Criteria
+
+- selected production-grade workflows run without Elsa
+- restart recovery is proven
+- no polling is used for steady-state signal or timer discovery
+
+## Sprint 11: Bulstrad E2E Parity And Oracle Reliability
+
+### Goal
+
+Turn the engine from a validated runtime into a production-grade execution platform by proving it against real Bulstrad workflows and hostile Oracle operating conditions.
+
+### Scope
+
+- build a curated Bulstrad Oracle-AQ E2E suite
+- replace synthetic runtime-state backing in Oracle integration tests with the real Oracle runtime-state store
+- add Oracle transaction-coupling tests for state, projections, and AQ publish
+- add Oracle restart, redelivery, and DLQ replay tests
+- add multi-worker and duplicate-delivery race tests
+- add deterministic fault-injection around commit boundaries
+
+### Deliverables
+
+- `BulstradOracleAqE2ETests`
+- curated representative workflows with scripted downstream responders
+- Oracle transport reliability suite covering:
+  - immediate and delayed delivery
+  - rollback and redelivery
+  - dead-letter browse and replay
+  - restart-safe delayed processing
+- concurrency suite covering:
+  - duplicate signal delivery
+  - same-instance multi-worker races
+  - retry-after-conflict behavior
+- documented timing expectations for cold-start and steady-state Oracle AQ
+
+### Implemented Coverage
+
+The Oracle-backed integration harness currently includes:
+
+- Bulstrad policy-change families:
+  - `OpenForChangePolicy`
+  - `ReviewPolicyOpenForChange`
+  - `AssistantAddAnnex`
+  - `AnnexCancellation`
+  - `AssistantPolicyReinstate`
+  - `AssistantPolicyCancellation`
+  - `AssistantPrintInsisDocuments`
+- shared policy families:
+  - `InsisIntegrationNew`
+  - `QuotationConfirm`
+  - `QuoteOrAplCancel`
+- Oracle transport and recovery matrix:
+  - immediate and delayed AQ delivery
+  - delayed backlog drain within a bounded latency envelope
+  - dequeue rollback redelivery
+  - ambient Oracle transaction commit and rollback for immediate messages
+  - ambient Oracle transaction commit and rollback for delayed messages
+  - dead-letter browse, replay, and backlog replay
+  - dead-letter backlog survival across Oracle restart
+  - timer backlog recovery across provider restart and Oracle restart
+  - external-signal backlog recovery, worker abandon/recovery, and duplicate-delivery races
+  - schedule/publish failure rollback inside workflow mutation transactions
+
+### Exit Criteria
+
+- representative Bulstrad workflows execute correctly on `SerdicaEngine` with real Oracle AQ
+- AQ-backed restart and delayed-delivery behavior is proven under realistic timing variance
+- duplicate delivery and commit-boundary failures are shown to be safe
+- the team has a stable PR suite and a broader nightly suite for Oracle-backed engine validation
+
+## Sprint 12: Load, Performance, And Capacity Characterization
+
+### Goal
+
+Turn the correctness-focused Oracle validation suite into a real load and performance program with stable smoke gates, nightly trend runs, soak coverage, and first capacity numbers.
+
+### Scope
+
+- build a dedicated performance harness on top of the Oracle AQ integration foundation
+- separate PR smoke, nightly characterization, weekly soak, and explicit capacity tiers
+- add synthetic engine workloads for stable measurement
+- add representative Bulstrad workload runners for business realism
+- persist performance artifacts and summary reports
+- define baseline and regression strategy per environment
+
+### Deliverables
+
+- categorized performance scenarios:
+  - `WorkflowPerfLatency`
+  - `WorkflowPerfThroughput`
+  - `WorkflowPerfSmoke`
+  - `WorkflowPerfNightly`
+  - `WorkflowPerfSoak`
+  - `WorkflowPerfCapacity`
+- result artifact writer under `TestResults/workflow-performance/`
+- scenario matrix covering:
+  - AQ immediate bursts
+  - AQ delayed bursts
+  - mixed signal backlogs
+  - synthetic start/task/signal/timer/subworkflow flows
+  - representative Bulstrad families
+  - restart and replay under load
+- first baseline report for local Docker and CI Oracle
+- first capacity note for one-node and multi-node assumptions
+
+### Exit Criteria
+
+- PR smoke load checks are cheap and stable enough to run continuously
+- nightly runs capture latency, throughput, and correctness artifacts
+- soak runs prove no backlog drift or correctness decay over extended execution
+- representative Bulstrad workflows have measured latency envelopes, not just functional pass/fail
+- the team has an initial sizing recommendation for worker concurrency and queue backlog expectations
+
+### Implemented Foundation
+
+The Sprint 12 implementation currently includes:
+
+- performance categories and artifact generation under `TestResults/workflow-performance/`
+- Oracle AQ smoke scenarios for:
+  - immediate burst drain
+  - delayed burst drain
+  - synthetic external-signal backlog resume
+  - short Bulstrad business burst using `QuoteOrAplCancel`
+- persisted comparison against the previous artifact for the same scenario and tier
+- Oracle AQ nightly scenarios for:
+  - larger immediate burst drain
+  - larger delayed burst drain
+  - larger synthetic external-signal backlog resume
+  - Bulstrad `QuotationConfirm -> PdfGenerator` burst
+- Oracle AQ soak scenario for:
+  - sustained synthetic signal round-trip waves without correctness drift
+- Oracle AQ latency baseline for:
+  - one-at-a-time synthetic signal round-trip with phase-level latency summaries
+- Oracle AQ throughput baseline for:
+  - parallel synthetic signal round-trip with `16` workload concurrency and `8` signal workers
+- Oracle AQ capacity ladder for:
+  - synthetic signal round-trip at concurrency `1`, `4`, `8`, and `16`
+- thread-safe scripted transport recording for concurrent smoke scenarios
+- first full Oracle baseline run with documented metrics in:
+  - [10-oracle-performance-baseline-2026-03-17.md](10-oracle-performance-baseline-2026-03-17.md)
+  - [10-oracle-performance-baseline-2026-03-17.json](10-oracle-performance-baseline-2026-03-17.json)
+
+### Reference
+
+The detailed workload model, KPI set, harness design, and baseline strategy are defined in [08-load-and-performance-plan.md](08-load-and-performance-plan.md).
+
+## Sprint 13: Engine-Native Rendering And Authoring Projection
+
+### Goal
+
+Restore definition rendering and authoring projection without reintroducing Elsa types or runtime dependencies into the workflow declarations or the engine host.
+
+### Scope
+
+- design and implement a native definition-to-diagram projection for declarative and canonical workflows
+- support deterministic node and edge generation from runtime definitions
+- preserve task, branch, repeat, fork, timer, signal, and subworkflow visibility in the rendered output
+- define a stable rendering contract for the operational API and future authoring tools
+- keep rendering as a separate projection layer, not as part of runtime execution
+
+### Deliverables
+
+- native rendering model and renderer for `WorkflowRuntimeDefinition`
+- canonical-to-diagram projection rules for:
+  - linear sequences
+  - decisions and conditional branches
+  - repeats
+  - forks and joins
+  - timers and external-signal waits
+  - continuations and subworkflows
+- updated operational metadata and diagram endpoints backed only by engine assets
+- test suite covering rendering determinism and parity for representative Bulstrad workflows
+
+### Exit Criteria
+
+- workflow definitions render without any Elsa packages, builders, or activity models
+- rendered diagrams remain stable for the same declarative definition across rebuilds
+- operational diagram inspection uses the native renderer only
+- the rendering layer is ready to support a later authoring surface without changing workflow declarations
+
+## Sprint 14: Backend Portability And Store Profiles
+
+### Goal
+
+Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.
+
+### Scope
+
+- introduce backend profile abstraction and dedicated backend plugin registration
+- split projection persistence from the current Oracle-first application service
+- formalize mutation coordinator abstraction
+- add backend-neutral dead-letter contract
+- add backend conformance suite
+- implement PostgreSQL profile
+- design MongoDB profile in executable detail, with implementation only after explicit product approval
+
+### Deliverables
+
+- `IWorkflowBackendRegistrationMarker`
+- backend-neutral projection contract
+- backend-neutral mutation coordinator contract
+- backend conformance suite
+- dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
+- executable MongoDB backend plugin design package
+
+### Exit Criteria
+
+- host selects one backend profile by configuration
+- host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
+- Oracle and PostgreSQL pass the same conformance suite
+- MongoDB path is specified well enough that implementation is a bounded engineering task
+- workflow declarations and canonical definitions remain unchanged across backend profiles
+
+## Sprint 15: Backend-Neutral Parity And Performance Harness
+
+### Goal
+
+Remove the remaining Oracle-only assumptions from the validation stack so PostgreSQL and MongoDB can be measured with the same correctness, Bulstrad, and performance scenarios.
+
+### Scope
+
+- extract backend-neutral performance artifacts, categories, and scenario drivers
+- extract backend-neutral runtime workload helpers from the Oracle-only harness
+- define one hostile-condition matrix shared by Oracle, PostgreSQL, and MongoDB
+- define one curated Bulstrad parity pack shared by all backends
+- define one normalized performance artifact format and baseline comparison model
+
+### Deliverables
+
+- shared `IntegrationTests/Performance/Common/` package
+- shared normalized performance metrics model
+- shared Bulstrad workload catalog for:
+  - `OpenForChangePolicy`
+  - `ReviewPolicyOpenForChange`
+  - `AssistantPrintInsisDocuments`
+  - `AssistantAddAnnex`
+  - `AnnexCancellation`
+  - `AssistantPolicyCancellation`
+  - `AssistantPolicyReinstate`
+  - `InsisIntegrationNew`
+  - `QuotationConfirm`
+  - `QuoteOrAplCancel`
+- backend-neutral hostile-condition checklist for:
+  - duplicate delivery
+  - same-instance resume race
+  - abandon and reclaim
+  - rollback on publish/schedule failure
+  - restart with pending due messages
+  - DLQ replay
+  - backlog drain
+
+### Exit Criteria
+
+- Oracle, PostgreSQL, and MongoDB use the same performance artifact shape
+- Oracle no longer owns the reporting model for later backend baselines
+- PostgreSQL and MongoDB can plug into the same workload definitions without changing workflow semantics
+
+## Sprint 16: PostgreSQL Hardening, Bulstrad Parity, And Baseline
+
+### Goal
+
+Bring PostgreSQL to Oracle-level confidence for correctness, hostile conditions, representative product behavior, and measured performance.
+
+### Scope
+
+- close the PostgreSQL hostile-condition gap to the Oracle matrix
+- add PostgreSQL-backed Bulstrad E2E parity
+- implement PostgreSQL latency, throughput, smoke, nightly, soak, and capacity suites
+- publish PostgreSQL baseline artifacts and narrative summary
+
+### Deliverables
+
+- PostgreSQL hostile-condition integration suite
+- PostgreSQL Bulstrad parity suite
+- PostgreSQL performance suites for:
+  - latency
+  - throughput
+  - smoke
+  - nightly
+  - soak
+  - capacity
+- baseline documents:
+  - `11-postgres-performance-baseline-<date>.md`
+  - `11-postgres-performance-baseline-<date>.json`
+
+### Exit Criteria
+
+- PostgreSQL passes the same hostile-condition matrix as Oracle
+- representative Bulstrad workflows run correctly on PostgreSQL
+- PostgreSQL has a durable, documented performance baseline comparable to Oracle
+
+## Sprint 17: MongoDB Hardening, Bulstrad Parity, And Baseline
+
+### Goal
+
+Bring MongoDB to the same product and operational confidence level as the relational backends without changing workflow behavior.
+
+### Scope
+
+- close the MongoDB hostile-condition gap to the Oracle matrix
+- add MongoDB-backed Bulstrad E2E parity
+- implement MongoDB latency, throughput, smoke, nightly, soak, and capacity suites
+- publish MongoDB baseline artifacts and narrative summary
+
+### Deliverables
+
+- MongoDB hostile-condition integration suite
+- MongoDB Bulstrad parity suite
+- MongoDB performance suites for:
+  - latency
+  - throughput
+  - smoke
+  - nightly
+  - soak
+  - capacity
+- baseline documents:
+  - `12-mongo-performance-baseline-<date>.md`
+  - `12-mongo-performance-baseline-<date>.json`
+
+### Exit Criteria
+
+- MongoDB passes the same hostile-condition matrix as Oracle
+- representative Bulstrad workflows run correctly on MongoDB
+- MongoDB has a durable, documented performance baseline comparable to Oracle and PostgreSQL
+
+## Sprint 18: Final Three-Backend Characterization And Decision Pack
+
+### Goal
+
+Produce the final side-by-side comparison for Oracle, PostgreSQL, and MongoDB using the same workloads, the same correctness rules, and the same performance artifact format.
+
+### Scope
+
+- rerun the shared Bulstrad parity pack on all three backends
+- rerun the shared hostile-condition matrix on all three backends
+- rerun the shared performance tiers and compare normalized metrics
+- capture backend-specific metrics appendices without letting them replace normalized workflow metrics
+- publish the final recommendation pack
+
+### Deliverables
+
+- final comparison documents:
+  - `13-backend-comparison-<date>.md`
+  - `13-backend-comparison-<date>.json`
+- normalized comparison across:
+  - serial latency
+  - steady-state throughput
+  - capacity ladder
+  - backlog drain
+  - duplicate-delivery safety
+  - restart recovery
+- backend-specific appendices for:
+  - Oracle wait and AQ observations
+  - PostgreSQL lock, WAL, and queue-table observations
+  - MongoDB transaction, lock, and change-stream observations
+
+### Exit Criteria
+
+- all three backends are compared through the same workload lens
+- the team has one documented backend recommendation pack
+- future backend decisions can reuse the same comparison harness instead of inventing new ad hoc measurements
+
+### Current Status
+
+- baseline comparison pack published in:
+  - [13-backend-comparison-2026-03-17.md](13-backend-comparison-2026-03-17.md)
+  - [13-backend-comparison-2026-03-17.json](13-backend-comparison-2026-03-17.json)
+- normalized performance comparison is complete for Oracle, PostgreSQL, and MongoDB
+- reliability and Bulstrad hardening depth remains Oracle-first, so the current comparison is a baseline decision pack, not the final production closeout
+- the signal path is now split into durable store and wake driver seams
+- PostgreSQL and MongoDB now persist transactional wake-outbox records behind that seam
+- the optional Redis wake-driver plugin is implemented for PostgreSQL and MongoDB
+- Oracle intentionally remains on native AQ and does not support the Redis wake-driver combination
+
+## Cross-Sprint Work Items
+
+These should be maintained continuously, not left to the end:
+
+- architecture doc updates
+- test harness improvements
+- canonical execution parity assertions
+- operational telemetry quality
+- snapshot schema versioning discipline
+- Oracle timing-envelope observations for CI and local Docker environments
+
+## Final Milestone Definition
+
+The project is complete when:
+
+- the workflow service can run on the engine as the active runtime
+- task and instance APIs remain stable
+- Oracle AQ handles both immediate signaling and delayed scheduling
+- the service resumes correctly after restart without polling
+- the engine runs representative real workflows with production-grade observability
+