git.stella-ops.org/docs/workflow/engine/07-sprint-plan.md
2026-03-20 19:14:44 +02:00

07. Sprint Plan

Planning Assumptions

  • sprint length: 2 weeks
  • one team owning runtime, persistence, and service integration
  • Oracle AQ available
  • no concurrent-engine migration scope
  • acceptance means code, tests, and updated docs

Sprint 1: Foundations And Contracts

Goal

Create the engine skeleton and the stable interfaces.

Scope

  • add runtime provider abstraction
  • add signal bus abstraction
  • add schedule bus abstraction
  • add runtime snapshot abstraction
  • add engine option classes
  • add docs/engine/ package

Deliverables

  • interface set compiled into shared abstractions
  • configuration classes
  • initial DI composition path
  • unit tests for options and registration

Exit Criteria

  • service builds with engine abstractions present
  • no Elsa runtime assumptions are introduced into new code
  • docs and interface names are stable enough for later sprints

Sprint 2: Canonical Runtime Definition Store

Goal

Make canonical execution definitions available at runtime without Elsa.

Scope

  • compile authored workflows to canonical runtime definitions at startup
  • validate definitions during startup
  • cache runtime definitions
  • expose startup failure mode for invalid definitions

Deliverables

  • WorkflowRuntimeDefinitionStore
  • definition normalization pipeline
  • startup validator
  • tests covering:
    • valid definition load
    • invalid definition rejection
    • version resolution

Exit Criteria

  • all registered workflows load into runtime definition cache
  • the runtime can resolve definition by name/version
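
The load, validate, and resolve behavior described for this sprint can be sketched as follows. This is an illustrative Python sketch, not the engine's actual WorkflowRuntimeDefinitionStore: the class names, the `steps` field, and the validation rules are assumptions chosen to show fail-fast startup validation and name/version resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeDefinition:
    name: str
    version: int
    steps: tuple  # canonical step names; this shape is hypothetical


class InvalidDefinitionError(Exception):
    """Raised during startup so the host fails fast."""


class DefinitionStore:
    def __init__(self):
        self._by_key = {}   # (name, version) -> definition
        self._latest = {}   # name -> highest loaded version

    def load(self, definitions):
        definitions = list(definitions)
        # Validate everything before caching anything, so one bad
        # definition aborts startup instead of failing later at runtime.
        for d in definitions:
            if not d.name or d.version < 1 or not d.steps:
                raise InvalidDefinitionError(f"invalid definition: {d!r}")
        for d in definitions:
            self._by_key[(d.name, d.version)] = d
            self._latest[d.name] = max(self._latest.get(d.name, 0), d.version)

    def resolve(self, name, version=None):
        # No explicit version means "latest loaded version".
        if version is None:
            version = self._latest[name]
        return self._by_key[(name, version)]
```

Validating the whole batch before populating the cache keeps the cache all-or-nothing, which matches the "startup failure mode for invalid definitions" item in the scope.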

Sprint 3: Snapshot Store And Versioned Runtime State

Goal

Turn WF_RUNTIME_STATES into a first-class engine snapshot store.

Scope

  • extend runtime state schema
  • implement snapshot mapper
  • implement optimistic concurrency versioning
  • wire snapshot reads and writes

Deliverables

  • database migration scripts
  • OracleWorkflowRuntimeSnapshotStore
  • snapshot serialization contracts
  • tests for:
    • initial insert
    • update with expected version
    • stale version conflict

Exit Criteria

  • runtime snapshots can be loaded and committed with version control
  • stale updates are rejected safely
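
The three test cases listed for this sprint (initial insert, update with expected version, stale version conflict) map onto a small optimistic-concurrency contract. The sketch below is an in-memory Python illustration under assumed names; it is not the OracleWorkflowRuntimeSnapshotStore. A database-backed store would typically enforce the same rule with a conditional update on the version column.

```python
class StaleVersionError(Exception):
    """The caller's expected version no longer matches the stored one."""


class InMemorySnapshotStore:
    def __init__(self):
        self._rows = {}  # instance_id -> (version, state)

    def load(self, instance_id):
        # Returns (version, state), or None when no snapshot exists yet.
        return self._rows.get(instance_id)

    def commit(self, instance_id, state, expected_version):
        current = self._rows.get(instance_id)
        current_version = current[0] if current else 0
        # Reject the write if someone else committed in between.
        if current_version != expected_version:
            raise StaleVersionError(
                f"{instance_id}: expected {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._rows[instance_id] = (new_version, state)
        return new_version
```

In this sketch an initial insert uses `expected_version=0`, and a rejected commit leaves the stored snapshot untouched.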

Sprint 4: AQ Signal And Schedule Backbone

Goal

Introduce Oracle AQ as the durable event backbone.

Scope

  • create AQ setup scripts
  • implement signal bus
  • implement schedule bus
  • implement signal envelope serialization
  • implement hosted signal consumer skeleton

Deliverables

  • AQ DDL scripts
  • OracleAqWorkflowSignalBus
  • OracleAqWorkflowScheduleBus
  • integration tests with enqueue/dequeue
  • delayed message smoke tests

Exit Criteria

  • engine can publish and receive immediate signals without polling
  • engine can publish and receive delayed signals
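
To make the envelope serialization and the immediate versus delayed delivery cases concrete, here is an illustrative Python sketch. The envelope fields and the `InMemorySignalBus` class are assumptions, not the engine's contracts. The in-memory bus only models visibility by due time; it says nothing about how Oracle AQ delivers messages, and the explicit `now` argument stands in for the broker's clock.

```python
import heapq
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SignalEnvelope:
    instance_id: str
    signal_name: str
    waiting_token: str
    payload: dict

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text):
        return SignalEnvelope(**json.loads(text))


class InMemorySignalBus:
    def __init__(self):
        self._heap = []  # (due_at, sequence, serialized envelope)
        self._seq = 0    # tie-breaker that keeps publish order stable

    def publish(self, envelope, due_at=0.0):
        # due_at=0.0 means "visible immediately"; larger values delay delivery.
        heapq.heappush(self._heap, (due_at, self._seq, envelope.to_json()))
        self._seq += 1

    def receive(self, now):
        # Hand out the earliest message that is already due, if any.
        if self._heap and self._heap[0][0] <= now:
            return SignalEnvelope.from_json(heapq.heappop(self._heap)[2])
        return None
```

Storing the serialized form on the queue, rather than the object, forces every message through the same round trip a durable transport would require.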

Sprint 5: Start Flow And Human Task Activation

Goal

Run workflows from start until first durable wait.

Scope

  • implement execution coordinator
  • implement canonical interpreter subset:
    • state assignment
    • business reference assignment
    • task activation
    • terminal completion
  • integrate with WorkflowRuntimeService
  • keep existing projection model

Deliverables

  • SerdicaEngineRuntimeProvider.StartAsync
  • execution slice result model
  • task activation write path
  • tests for:
    • start to task
    • start to completion
    • business reference propagation

Exit Criteria

  • selected declarative workflows can start and create correct tasks without Elsa
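
"Run until the first durable wait" can be illustrated with a minimal interpreter over the four step kinds in scope. This Python sketch uses a hypothetical `(kind, argument)` step encoding and a hypothetical `SliceResult` shape; the real canonical definition format and execution slice result model are not shown in this plan.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SliceResult:
    state: dict
    business_reference: Optional[str] = None
    activated_tasks: list = field(default_factory=list)
    completed: bool = False
    next_index: int = 0  # where a later resume would continue


def run_slice(steps, state=None, start_index=0):
    result = SliceResult(state=dict(state or {}), next_index=start_index)
    for index in range(start_index, len(steps)):
        kind, arg = steps[index]
        result.next_index = index + 1
        if kind == "set_state":
            result.state.update(arg)
        elif kind == "set_business_reference":
            result.business_reference = arg
        elif kind == "activate_task":
            # A human task is a durable wait: stop and let the caller persist.
            result.activated_tasks.append(arg)
            return result
        elif kind == "complete":
            result.completed = True
            return result
        else:
            raise ValueError(f"unsupported step kind: {kind}")
    result.completed = True  # ran off the end without an explicit terminal step
    return result
```

The slice returns data only; persisting the snapshot and writing the task projection stay with the caller, which keeps the interpreter free of storage concerns.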

Sprint 6: Task Completion And Transport Calls

Goal

Advance workflows after task completion and support transport-backed orchestration.

Scope

  • implement task completion execution path
  • implement canonical interpreter support for:
    • transport calls
    • branches
    • success/failure paths
  • integrate completion flow with runtime snapshot commit

Deliverables

  • SerdicaEngineRuntimeProvider.CompleteAsync
  • transport dispatcher
  • tests for:
    • completion to next task
    • failure branch
    • timeout branch where applicable

Exit Criteria

  • representative workflows can complete first task and reach correct next state
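
One way to picture the success, failure, and timeout paths of a transport call is a dispatcher that turns the outcome of a call into a branch name for the interpreter. The function below is an illustrative Python sketch; `dispatch_transport_call`, the `send` callable, and the branch names are assumptions, not the engine's transport dispatcher.

```python
def dispatch_transport_call(call, send):
    """Invoke `send(call)` and map the outcome to (branch, detail)."""
    try:
        response = send(call)
    except TimeoutError:
        # Timeouts get their own branch so workflows can handle them separately.
        return "timeout", None
    except Exception as exc:
        return "failure", str(exc)
    return "success", response
```

Returning a branch name instead of raising keeps downstream failures inside the workflow's own control flow, where a failure branch can decide what happens next.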

Sprint 7: Subworkflows, Continue-With, And Repeat

Goal

Support the higher-order orchestration patterns used heavily in the corpus.

Scope

  • implement subworkflow frame persistence
  • implement parent resume
  • implement continue-with production
  • implement repeat resume semantics

Deliverables

  • subworkflow coordinator
  • resume pointer serializer
  • tests for:
    • child completion resumes parent
    • nested frame handling
    • repeat interrupted by wait
    • continue-with request emission

Exit Criteria

  • representative subworkflow-heavy families execute correctly
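
Subworkflow frame persistence and parent resume can be modeled as a serializable stack of frames, each holding a resume pointer. The following Python sketch is illustrative only; `Frame`, `FrameStack`, and the JSON layout are hypothetical and do not describe the engine's resume pointer serializer.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Frame:
    workflow: str
    resume_pointer: int  # index of the next step to run when resumed


class FrameStack:
    def __init__(self):
        self._frames = []

    def call_subworkflow(self, parent, resume_pointer, child):
        # Record where the parent continues, then start the child at step 0.
        self._frames.append(Frame(parent, resume_pointer))
        return Frame(child, 0)

    def complete_child(self):
        # Child finished: pop the innermost waiting parent, or None at the root.
        return self._frames.pop() if self._frames else None

    def dumps(self):
        return json.dumps([asdict(f) for f in self._frames])

    @classmethod
    def loads(cls, text):
        stack = cls()
        stack._frames = [Frame(**row) for row in json.loads(text)]
        return stack
```

Because the stack round-trips through a string, a child that blocks on a durable wait can be resumed later and still find the correct parent frame.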

Sprint 8: Timers, Retries, And Delayed Resume

Goal

Finish the non-polling scheduling path.

Scope

  • implement timer waits
  • implement retry scheduling
  • implement stale timer ignore logic via waiting tokens
  • integrate delayed AQ delivery into execution coordinator

Deliverables

  • timer wait model
  • delayed resume handler
  • tests for:
    • timer due resume
    • retry due resume
    • canceled timer ignored
    • restart-safe delayed processing

Exit Criteria

  • the engine supports time-based orchestration without polling loops
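
The "stale timer ignore logic via waiting tokens" item can be sketched as follows: every wait gets a fresh token, and a delayed message resumes the instance only if it carries the token of the wait that is still active. This is an illustrative Python sketch with hypothetical names, not the engine's timer wait model.

```python
import uuid


class WaitingInstance:
    def __init__(self):
        self.waiting_token = None
        self.resumed = 0

    def begin_wait(self):
        # Each wait gets a new token; older delayed messages become stale.
        self.waiting_token = uuid.uuid4().hex
        return self.waiting_token

    def cancel_wait(self):
        self.waiting_token = None

    def on_timer_due(self, token):
        # Ignore timers for canceled or superseded waits, and duplicates.
        if token is None or token != self.waiting_token:
            return False
        self.waiting_token = None
        self.resumed += 1
        return True
```

Clearing the token on resume also makes a duplicate delivery of the same timer a no-op, so the check covers both canceled timers and redelivery.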

Sprint 9: Operational Parity

Goal

Reach product-surface and operations parity with the existing workflow service.

Scope

  • diagram parity validation
  • runtime state inspection parity
  • retention integration
  • structured metrics and logging
  • DLQ handling and diagnostics

Deliverables

  • runtime metadata mapping updates
  • operational dashboards or documented metric set
  • DLQ support
  • tests for supportability paths

Exit Criteria

  • operations can inspect and support engine-driven instances through the existing product surface

Sprint 10: Corpus Parity And Hardening

Goal

Prove the engine against the real declarative workflow corpus.

Scope

  • execute representative high-fanout families end-to-end
  • resolve remaining interpreter gaps
  • multi-node duplicate delivery testing
  • restart and recovery testing
  • performance and soak tests

Deliverables

  • parity report against selected workflow families
  • load test results
  • recovery test results
  • production readiness checklist

Exit Criteria

  • selected production-grade workflows run without Elsa
  • restart recovery is proven
  • no polling is used for steady-state signal or timer discovery

Sprint 11: Bulstrad E2E Parity And Oracle Reliability

Goal

Turn the engine from a validated runtime into a production-grade execution platform by proving it against real Bulstrad workflows and hostile Oracle operating conditions.

Scope

  • build a curated Bulstrad Oracle-AQ E2E suite
  • replace synthetic runtime-state backing in Oracle integration tests with the real Oracle runtime-state store
  • add Oracle transaction-coupling tests for state, projections, and AQ publish
  • add Oracle restart, redelivery, and DLQ replay tests
  • add multi-worker and duplicate-delivery race tests
  • add deterministic fault-injection around commit boundaries

Deliverables

  • BulstradOracleAqE2ETests
  • curated representative workflows with scripted downstream responders
  • Oracle transport reliability suite covering:
    • immediate and delayed delivery
    • rollback and redelivery
    • dead-letter browse and replay
    • restart-safe delayed processing
  • concurrency suite covering:
    • duplicate signal delivery
    • same-instance multi-worker races
    • retry-after-conflict behavior
  • documented timing expectations for cold-start and steady-state Oracle AQ

Implemented Coverage

The Oracle-backed integration harness currently includes:

  • Bulstrad policy-change families:
    • OpenForChangePolicy
    • ReviewPolicyOpenForChange
    • AssistantAddAnnex
    • AnnexCancellation
    • AssistantPolicyReinstate
    • AssistantPolicyCancellation
    • AssistantPrintInsisDocuments
  • shared policy families:
    • InsisIntegrationNew
    • QuotationConfirm
    • QuoteOrAplCancel
  • Oracle transport and recovery matrix:
    • immediate and delayed AQ delivery
    • delayed backlog drain within a bounded latency envelope
    • dequeue rollback redelivery
    • ambient Oracle transaction commit and rollback for immediate messages
    • ambient Oracle transaction commit and rollback for delayed messages
    • dead-letter browse, replay, and backlog replay
    • dead-letter backlog survival across Oracle restart
    • timer backlog recovery across provider restart and Oracle restart
    • external-signal backlog recovery, worker abandon/recovery, and duplicate-delivery races
    • schedule/publish failure rollback inside workflow mutation transactions

Exit Criteria

  • representative Bulstrad workflows execute correctly on SerdicaEngine with real Oracle AQ
  • AQ-backed restart and delayed-delivery behavior is proven under realistic timing variance
  • duplicate delivery and commit-boundary failures are shown to be safe
  • the team has a stable PR suite and a broader nightly suite for Oracle-backed engine validation
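
The rollback-redelivery and dead-letter replay behavior in the matrix above can be illustrated with a small in-memory queue. The Python sketch below is a model for reasoning about the test cases, not a description of Oracle AQ: `RetryingQueue`, `max_attempts`, and the replay method are hypothetical names.

```python
from collections import deque


class RetryingQueue:
    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self._ready = deque()    # (message, failed attempts so far)
        self.dead_letters = []   # browsable list of poisoned messages

    def enqueue(self, message):
        self._ready.append((message, 0))

    def process_next(self, handler):
        if not self._ready:
            return False
        message, attempts = self._ready.popleft()
        try:
            handler(message)
        except Exception:
            # A failed handler acts like a rolled-back dequeue: the message
            # is redelivered until it exhausts its attempts.
            attempts += 1
            if attempts >= self.max_attempts:
                self.dead_letters.append(message)
            else:
                self._ready.append((message, attempts))
        return True

    def replay_dead_letters(self):
        # Move every dead letter back to the ready queue with a fresh count.
        count = len(self.dead_letters)
        for message in self.dead_letters:
            self.enqueue(message)
        self.dead_letters.clear()
        return count
```

A test built on this model would fail the handler until the message is dead-lettered, repair the downstream dependency, replay, and then assert that the workflow reaches the same state as an undisturbed run.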

Sprint 12: Load, Performance, And Capacity Characterization

Goal

Turn the correctness-focused Oracle validation suite into a real load and performance program with stable smoke gates, nightly trend runs, soak coverage, and first capacity numbers.

Scope

  • build a dedicated performance harness on top of the Oracle AQ integration foundation
  • separate PR smoke, nightly characterization, weekly soak, and explicit capacity tiers
  • add synthetic engine workloads for stable measurement
  • add representative Bulstrad workload runners for business realism
  • persist performance artifacts and summary reports
  • define baseline and regression strategy per environment

Deliverables

  • categorized performance scenarios:
    • WorkflowPerfLatency
    • WorkflowPerfThroughput
    • WorkflowPerfSmoke
    • WorkflowPerfNightly
    • WorkflowPerfSoak
    • WorkflowPerfCapacity
  • result artifact writer under TestResults/workflow-performance/
  • scenario matrix covering:
    • AQ immediate bursts
    • AQ delayed bursts
    • mixed signal backlogs
    • synthetic start/task/signal/timer/subworkflow flows
    • representative Bulstrad families
    • restart and replay under load
  • first baseline report for local Docker and CI Oracle
  • first capacity note for one-node and multi-node assumptions

Exit Criteria

  • PR smoke load checks are cheap and stable enough to run continuously
  • nightly runs capture latency, throughput, and correctness artifacts
  • soak runs prove no backlog drift or correctness decay over extended execution
  • representative Bulstrad workflows have measured latency envelopes, not just functional pass/fail
  • the team has an initial sizing recommendation for worker concurrency and queue backlog expectations

Implemented Foundation

The Sprint 12 implementation currently includes:

  • performance categories and artifact generation under TestResults/workflow-performance/
  • Oracle AQ smoke scenarios for:
    • immediate burst drain
    • delayed burst drain
    • synthetic external-signal backlog resume
    • short Bulstrad business burst using QuoteOrAplCancel
  • persisted comparison against the previous artifact for the same scenario and tier
  • Oracle AQ nightly scenarios for:
    • larger immediate burst drain
    • larger delayed burst drain
    • larger synthetic external-signal backlog resume
    • Bulstrad QuotationConfirm -> PdfGenerator burst
  • Oracle AQ soak scenario for:
    • sustained synthetic signal round-trip waves without correctness drift
  • Oracle AQ latency baseline for:
    • one-at-a-time synthetic signal round-trip with phase-level latency summaries
  • Oracle AQ throughput baseline for:
    • parallel synthetic signal round-trip with 16 workload concurrency and 8 signal workers
  • Oracle AQ capacity ladder for:
    • synthetic signal round-trip at concurrency 1, 4, 8, and 16
  • thread-safe scripted transport recording for concurrent smoke scenarios
  • first full Oracle baseline run with documented metrics

Reference

The detailed workload model, KPI set, harness design, and baseline strategy are defined in 08-load-and-performance-plan.md.
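
The "persisted comparison against the previous artifact" idea from the Sprint 12 foundation can be sketched as a pure function over two metric sets. This Python sketch is illustrative: the function name, the 10% tolerance, and the assumption that every metric is "lower is better" are not taken from the harness.

```python
import json


def find_regressions(previous_json, current, tolerance=0.10):
    """Return the metrics that got worse by more than `tolerance`.

    Assumes every metric is "lower is better", for example latency in ms.
    `previous_json` is the stored artifact for the same scenario and tier,
    or None on the first run.
    """
    if previous_json is None:
        return {}  # nothing to compare against yet
    previous = json.loads(previous_json)
    regressions = {}
    for name, value in current.items():
        old = previous.get(name)
        if old is not None and value > old * (1 + tolerance):
            regressions[name] = {"previous": old, "current": value}
    return regressions
```

Keeping the comparison scoped to one scenario and tier avoids flagging differences that only reflect a larger workload.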

Sprint 13: Engine-Native Rendering And Authoring Projection

Goal

Restore definition rendering and authoring projection without reintroducing Elsa types or runtime dependencies into the workflow declarations or the engine host.

Scope

  • design and implement a native definition-to-diagram projection for declarative and canonical workflows
  • support deterministic node and edge generation from runtime definitions
  • preserve task, branch, repeat, fork, timer, signal, and subworkflow visibility in the rendered output
  • define a stable rendering contract for the operational API and future authoring tools
  • keep rendering as a separate projection layer, not as part of runtime execution

Deliverables

  • native rendering model and renderer for WorkflowRuntimeDefinition
  • canonical-to-diagram projection rules for:
    • linear sequences
    • decisions and conditional branches
    • repeats
    • forks and joins
    • timers and external-signal waits
    • continuations and subworkflows
  • updated operational metadata and diagram endpoints backed only by engine assets
  • test suite covering rendering determinism and parity for representative Bulstrad workflows

Exit Criteria

  • workflow definitions render without any Elsa packages, builders, or activity models
  • rendered diagrams remain stable for the same declarative definition across rebuilds
  • operational diagram inspection uses the native renderer only
  • the rendering layer is ready to support a later authoring surface without changing workflow declarations
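
Deterministic node and edge generation mostly comes down to deriving identifiers from the definition itself and emitting them in a stable order. The Python sketch below illustrates that idea under an assumed `(step_id, kind, next_step_ids)` encoding; it is not the native renderer or its rendering contract.

```python
def project_diagram(definition_name, steps):
    """Project steps into sorted nodes and edges.

    `steps` is a list of (step_id, kind, next_step_ids) tuples.
    """
    def node_id(step_id):
        # IDs come only from definition content, never from counters or
        # object identity, so rebuilds produce identical output.
        return f"{definition_name}/{step_id}"

    nodes = sorted(
        ({"id": node_id(sid), "kind": kind} for sid, kind, _ in steps),
        key=lambda node: node["id"],
    )
    edges = sorted(
        (node_id(sid), node_id(target))
        for sid, _, targets in steps
        for target in targets
    )
    return {"nodes": nodes, "edges": edges}
```

A determinism test can then project the same definition twice, or with its steps shuffled, and assert that the results are equal.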

Sprint 14: Backend Portability And Store Profiles

Goal

Turn the Oracle-first engine into a backend-switchable engine with one selected backend profile per deployment.

Scope

  • introduce backend profile abstraction and dedicated backend plugin registration
  • split projection persistence from the current Oracle-first application service
  • formalize mutation coordinator abstraction
  • add backend-neutral dead-letter contract
  • add backend conformance suite
  • implement PostgreSQL profile
  • design MongoDB profile in executable detail, with implementation only after explicit product approval

Deliverables

  • IWorkflowBackendRegistrationMarker
  • backend-neutral projection contract
  • backend-neutral mutation coordinator contract
  • backend conformance suite
  • dedicated Oracle, PostgreSQL, and MongoDB backend plugin projects
  • executable MongoDB backend plugin design package

Exit Criteria

  • host selects one backend profile by configuration
  • host stays backend-neutral and does not resolve Oracle/PostgreSQL directly
  • Oracle and PostgreSQL pass the same conformance suite
  • MongoDB path is specified well enough that implementation is a bounded engineering task
  • workflow declarations and canonical definitions remain unchanged across backend profiles
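
The exit criterion "host selects one backend profile by configuration" implies that the host only ever talks to a registry, never to a concrete backend. The following Python sketch illustrates that shape; `BackendRegistry`, the `backend` configuration key, and the factory signature are assumptions and do not describe IWorkflowBackendRegistrationMarker or the plugin projects.

```python
class BackendRegistry:
    def __init__(self):
        self._factories = {}  # lower-cased profile name -> factory

    def register(self, name, factory):
        key = name.lower()
        if key in self._factories:
            raise ValueError(f"backend already registered: {name}")
        self._factories[key] = factory

    def create(self, config):
        # Exactly one profile is selected per deployment, by name.
        name = config.get("backend")
        if not name:
            raise ValueError("no backend profile configured")
        factory = self._factories.get(name.lower())
        if factory is None:
            raise ValueError(f"unknown backend profile: {name}")
        return factory(config)
```

Each backend plugin would call `register` for itself, so adding a backend does not require changes to the host.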

Sprint 15: Backend-Neutral Parity And Performance Harness

Goal

Remove the remaining Oracle-only assumptions from the validation stack so PostgreSQL and MongoDB can be measured with the same correctness, Bulstrad, and performance scenarios.

Scope

  • extract backend-neutral performance artifacts, categories, and scenario drivers
  • extract backend-neutral runtime workload helpers from the Oracle-only harness
  • define one hostile-condition matrix shared by Oracle, PostgreSQL, and MongoDB
  • define one curated Bulstrad parity pack shared by all backends
  • define one normalized performance artifact format and baseline comparison model

Deliverables

  • shared IntegrationTests/Performance/Common/ package
  • shared normalized performance metrics model
  • shared Bulstrad workload catalog for:
    • OpenForChangePolicy
    • ReviewPolicyOpenForChange
    • AssistantPrintInsisDocuments
    • AssistantAddAnnex
    • AnnexCancellation
    • AssistantPolicyCancellation
    • AssistantPolicyReinstate
    • InsisIntegrationNew
    • QuotationConfirm
    • QuoteOrAplCancel
  • backend-neutral hostile-condition checklist for:
    • duplicate delivery
    • same-instance resume race
    • abandon and reclaim
    • rollback on publish/schedule failure
    • restart with pending due messages
    • DLQ replay
    • backlog drain

Exit Criteria

  • Oracle, PostgreSQL, and MongoDB use the same performance artifact shape
  • Oracle no longer owns the reporting model for later backend baselines
  • PostgreSQL and MongoDB can plug into the same workload definitions without changing workflow semantics
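
A shared artifact shape means every backend reports the same fields, computed the same way. The Python sketch below shows one possible normalized record; the field names and the nearest-rank percentile method are assumptions, not the harness's actual metrics model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfArtifact:
    backend: str
    scenario: str
    tier: str
    operations: int
    throughput_per_s: float
    p50_ms: float
    p95_ms: float


def percentile(sorted_values, p):
    # Nearest-rank percentile: smallest value with at least p% at or below it.
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(backend, scenario, tier, latencies_ms, duration_s):
    ordered = sorted(latencies_ms)
    return PerfArtifact(
        backend=backend,
        scenario=scenario,
        tier=tier,
        operations=len(ordered),
        throughput_per_s=len(ordered) / duration_s,
        p50_ms=percentile(ordered, 50),
        p95_ms=percentile(ordered, 95),
    )
```

Backend-specific observations, such as lock or queue-table metrics, would live in separate appendices so that this record stays comparable across Oracle, PostgreSQL, and MongoDB.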

Sprint 16: PostgreSQL Hardening, Bulstrad Parity, And Baseline

Goal

Bring PostgreSQL to Oracle-level confidence for correctness, hostile conditions, representative product behavior, and measured performance.

Scope

  • close the PostgreSQL hostile-condition gap to the Oracle matrix
  • add PostgreSQL-backed Bulstrad E2E parity
  • implement PostgreSQL latency, throughput, smoke, nightly, soak, and capacity suites
  • publish PostgreSQL baseline artifacts and narrative summary

Deliverables

  • PostgreSQL hostile-condition integration suite
  • PostgreSQL Bulstrad parity suite
  • PostgreSQL performance suites for:
    • latency
    • throughput
    • smoke
    • nightly
    • soak
    • capacity
  • baseline documents:
    • 11-postgres-performance-baseline-<date>.md
    • 11-postgres-performance-baseline-<date>.json

Exit Criteria

  • PostgreSQL passes the same hostile-condition matrix as Oracle
  • representative Bulstrad workflows run correctly on PostgreSQL
  • PostgreSQL has a durable, documented performance baseline comparable to Oracle

Sprint 17: MongoDB Hardening, Bulstrad Parity, And Baseline

Goal

Bring MongoDB to the same product and operational confidence level as the relational backends without changing workflow behavior.

Scope

  • close the MongoDB hostile-condition gap to the Oracle matrix
  • add MongoDB-backed Bulstrad E2E parity
  • implement MongoDB latency, throughput, smoke, nightly, soak, and capacity suites
  • publish MongoDB baseline artifacts and narrative summary

Deliverables

  • MongoDB hostile-condition integration suite
  • MongoDB Bulstrad parity suite
  • MongoDB performance suites for:
    • latency
    • throughput
    • smoke
    • nightly
    • soak
    • capacity
  • baseline documents:
    • 12-mongo-performance-baseline-<date>.md
    • 12-mongo-performance-baseline-<date>.json

Exit Criteria

  • MongoDB passes the same hostile-condition matrix as Oracle
  • representative Bulstrad workflows run correctly on MongoDB
  • MongoDB has a durable, documented performance baseline comparable to Oracle and PostgreSQL

Sprint 18: Final Three-Backend Characterization And Decision Pack

Goal

Produce the final side-by-side comparison for Oracle, PostgreSQL, and MongoDB using the same workloads, the same correctness rules, and the same performance artifact format.

Scope

  • rerun the shared Bulstrad parity pack on all three backends
  • rerun the shared hostile-condition matrix on all three backends
  • rerun the shared performance tiers and compare normalized metrics
  • capture backend-specific metrics appendices without letting them replace normalized workflow metrics
  • publish the final recommendation pack

Deliverables

  • final comparison documents:
    • 13-backend-comparison-<date>.md
    • 13-backend-comparison-<date>.json
  • normalized comparison across:
    • serial latency
    • steady-state throughput
    • capacity ladder
    • backlog drain
    • duplicate-delivery safety
    • restart recovery
  • backend-specific appendices for:
    • Oracle wait and AQ observations
    • PostgreSQL lock, WAL, and queue-table observations
    • MongoDB transaction, lock, and change-stream observations

Exit Criteria

  • all three backends are compared through the same workload lens
  • the team has one documented backend recommendation pack
  • future backend decisions can reuse the same comparison harness instead of inventing new ad hoc measurements

Current Status

  • baseline comparison pack published
  • normalized performance comparison is complete for Oracle, PostgreSQL, and MongoDB
  • reliability and Bulstrad hardening depth remains Oracle-first, so the current comparison is a baseline decision pack, not the final production closeout
  • the signal path is now split into durable store and wake driver seams
  • PostgreSQL and MongoDB now persist transactional wake-outbox records behind that seam
  • the optional Redis wake-driver plugin is implemented for PostgreSQL and MongoDB
  • Oracle intentionally remains on native AQ and does not support the Redis wake-driver combination
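
The transactional wake-outbox mentioned above follows a general pattern: the state change and the wake record are written in one transaction, and a separate driver later drains the outbox and notifies workers. The Python sketch below demonstrates the pattern with an in-memory SQLite database; the table names, columns, and functions are hypothetical and do not reflect the PostgreSQL or MongoDB schemas.

```python
import sqlite3


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE state (instance_id TEXT PRIMARY KEY, data TEXT)")
    db.execute(
        "CREATE TABLE wake_outbox "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, instance_id TEXT)"
    )
    db.commit()
    return db


def commit_with_wake(db, instance_id, data, fail=False):
    try:
        with db:  # commits on success, rolls back if the block raises
            db.execute(
                "INSERT OR REPLACE INTO state VALUES (?, ?)", (instance_id, data)
            )
            db.execute(
                "INSERT INTO wake_outbox (instance_id) VALUES (?)", (instance_id,)
            )
            if fail:
                raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        return False
    return True


def drain_outbox(db, notify):
    rows = db.execute(
        "SELECT id, instance_id FROM wake_outbox ORDER BY id"
    ).fetchall()
    for row_id, instance_id in rows:
        notify(instance_id)  # notify first: a crash here means a repeat wake
        with db:
            db.execute("DELETE FROM wake_outbox WHERE id = ?", (row_id,))
    return len(rows)
```

Because the wake is sent before the outbox row is deleted, delivery in this sketch is at-least-once, so the consumer has to tolerate duplicate wakes.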

Cross-Sprint Work Items

These should be maintained continuously, not left to the end:

  • architecture doc updates
  • test harness improvements
  • canonical execution parity assertions
  • operational telemetry quality
  • snapshot schema versioning discipline
  • Oracle timing-envelope observations for CI and local Docker environments

Final Milestone Definition

The project is complete when:

  • the workflow service can run on the engine as the active runtime
  • task and instance APIs remain stable
  • Oracle AQ handles both immediate signaling and delayed scheduling
  • the service resumes correctly after restart without polling
  • the engine runs representative real workflows with production-grade observability