01. Requirements And Principles
1. Product Goal
Build a Serdica-owned workflow engine that can run the current Bulstrad workflow corpus without Elsa while preserving the existing service-level workflow product:
- workflow start
- task inbox and task lifecycle
- business-reference-based lookup
- runtime state inspection
- workflow diagrams
- canonical schema and canonical validation exposure
- workflow retention and hosted jobs
The engine must execute the same business behavior currently expressed in the declarative workflow DSL and canonical workflow definition model.
2. Functional Requirements
2.1 Workflow Definition Handling
The engine must:
- discover workflow registrations from authored C# workflow classes
- resolve the latest or exact workflow version through the existing registration catalog
- compile authored declarative workflows into canonical runtime definitions
- keep canonical validation as a first-class platform capability
- reject invalid or unsupported definitions during startup or validation
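A minimal sketch of this pipeline, with all type names hypothetical (they are illustrative placeholders, not the actual Serdica API):

```csharp
using System;
using System.Collections.Generic;

// All names below are illustrative placeholders, not the real API.
public sealed record WorkflowRegistration(string Name, int Version, Type AuthoredType);
public sealed record CanonicalDefinition(string Name, int Version);

public interface IWorkflowRegistry
{
    // Resolve the latest version, or an exact version when one is requested,
    // from the existing registration catalog.
    WorkflowRegistration Resolve(string name, int? exactVersion = null);
}

public interface ICanonicalCompiler
{
    // Compile an authored C# workflow class into a canonical runtime definition.
    CanonicalDefinition Compile(WorkflowRegistration registration);
}

public interface ICanonicalValidator
{
    // Canonical validation stays first-class: errors are surfaced rather than
    // letting an invalid definition reach the runtime.
    IReadOnlyList<string> Validate(CanonicalDefinition definition);
}

public static class DefinitionStartup
{
    // Invalid or unsupported definitions are rejected here, at startup.
    public static CanonicalDefinition Load(
        IWorkflowRegistry registry, ICanonicalCompiler compiler,
        ICanonicalValidator validator, string name)
    {
        var registration = registry.Resolve(name);
        var definition = compiler.Compile(registration);
        var errors = validator.Validate(definition);
        if (errors.Count > 0)
            throw new InvalidOperationException(
                $"Definition '{name}' v{registration.Version} rejected: {string.Join("; ", errors)}");
        return definition;
    }
}
```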
2.2 Workflow Start
The engine must:
- bind the untyped start payload to the workflow start request type
- resolve or derive business reference data
- initialize canonical workflow state
- execute the initial sequence until a wait boundary or completion
- create workflow projections and runtime state in one durable flow
- support workflow continuations created during start
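The start sequence above can be sketched as one method over hypothetical seams (every interface and name here is an assumption for illustration):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative sketch of the start flow; every name is a placeholder.
public interface IStartBinder { object Bind(string workflowName, string untypedPayload); }
public interface IBusinessReferenceResolver { string Resolve(object startRequest); }
public interface IInitialRuntime { Task<object> RunUntilWaitAsync(object startRequest); }
public interface IDurableUnitOfWork
{
    // Projections, the runtime snapshot, and any continuations created
    // during start are committed in one durable flow.
    Task CommitAsync(Guid instanceId, object snapshot, string businessReference);
}

public sealed class WorkflowStarter
{
    private readonly IStartBinder _binder;
    private readonly IBusinessReferenceResolver _references;
    private readonly IInitialRuntime _runtime;
    private readonly IDurableUnitOfWork _unitOfWork;

    public WorkflowStarter(IStartBinder binder, IBusinessReferenceResolver references,
        IInitialRuntime runtime, IDurableUnitOfWork unitOfWork)
        => (_binder, _references, _runtime, _unitOfWork) = (binder, references, runtime, unitOfWork);

    public async Task<Guid> StartAsync(string workflowName, string untypedPayload)
    {
        var request = _binder.Bind(workflowName, untypedPayload);  // bind untyped start payload
        var reference = _references.Resolve(request);              // resolve or derive business reference
        var snapshot = await _runtime.RunUntilWaitAsync(request);  // run to wait boundary or completion
        var instanceId = Guid.NewGuid();
        await _unitOfWork.CommitAsync(instanceId, snapshot, reference);
        return instanceId;
    }
}
```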
2.3 Human Tasks
The engine must:
- activate human tasks with:
- task type
- route
- workflow roles
- task roles
- runtime roles
- payload
- business reference
- preserve the current task assignment model:
- assign to self
- assign to user
- assign to runtime roles
- release
- expose completed and active task history through the existing projection model
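The activation attributes and the preserved assignment model can be pictured as a record and an enum (field names are assumptions, not the actual contract):

```csharp
using System;
using System.Collections.Generic;

// Illustrative shape of a human-task activation; names are placeholders.
public sealed record HumanTaskActivation(
    string TaskType,
    string Route,
    IReadOnlyList<string> WorkflowRoles,
    IReadOnlyList<string> TaskRoles,
    IReadOnlyList<string> RuntimeRoles,
    string PayloadJson,
    string BusinessReference);

// The current assignment model, preserved unchanged.
public enum TaskAssignment
{
    AssignToSelf,
    AssignToUser,
    AssignToRuntimeRoles,
    Release
}
```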
2.4 Task Completion
The engine must:
- load the current workflow state and task context
- authorize completion through the existing service layer
- apply completion payload
- continue execution from the task completion entry point
- produce next tasks, next waits, next continuations, or completion
- update runtime state and read projections durably
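A sketch of the completion flow under the same hypothetical-seams convention (all names are illustrative):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative sketch of the completion flow; names are placeholders.
public enum StepOutcome { NextTasks, NextWaits, NextContinuations, Completed }

public interface ITaskContextStore { Task<object> LoadAsync(Guid instanceId, Guid taskId); }
public interface ICompletionAuthorizer { Task EnsureAllowedAsync(object context, string userId); }
public interface IRuntimeContinuation
{
    // Resumes from the task-completion entry point and reports what came next.
    Task<StepOutcome> ContinueAsync(object context, string completionPayload);
}
public interface IDurableProjectionWriter { Task UpdateAsync(Guid instanceId, StepOutcome outcome); }

public sealed class TaskCompletionHandler
{
    private readonly ITaskContextStore _store;
    private readonly ICompletionAuthorizer _authorizer;
    private readonly IRuntimeContinuation _runtime;
    private readonly IDurableProjectionWriter _projections;

    public TaskCompletionHandler(ITaskContextStore store, ICompletionAuthorizer authorizer,
        IRuntimeContinuation runtime, IDurableProjectionWriter projections)
        => (_store, _authorizer, _runtime, _projections) = (store, authorizer, runtime, projections);

    public async Task CompleteAsync(Guid instanceId, Guid taskId, string userId, string payload)
    {
        var context = await _store.LoadAsync(instanceId, taskId);      // load state and task context
        await _authorizer.EnsureAllowedAsync(context, userId);         // existing service-layer authorization
        var outcome = await _runtime.ContinueAsync(context, payload);  // apply payload, continue execution
        await _projections.UpdateAsync(instanceId, outcome);           // durable runtime + read-model update
    }
}
```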
2.5 Runtime Semantics
The engine must support the semantic surface already present in declarative workflows:
- state assignment
- business reference assignment
- human task activation
- microservice calls
- legacy rabbit calls
- GraphQL calls
- HTTP calls
- conditional branches
- decision branches
- repeat loops
- subworkflow invocation
- continue-with orchestration
- timeout branches
- failure branches
- function-backed expressions
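The semantic surface above maps naturally onto a node taxonomy in the canonical model; this enum is an illustration of that mapping, not the actual taxonomy:

```csharp
// Illustrative enumeration of the canonical semantic surface listed above;
// the real canonical model's node taxonomy may differ.
public enum CanonicalNodeKind
{
    StateAssignment, BusinessReferenceAssignment, HumanTask,
    MicroserviceCall, LegacyRabbitCall, GraphQlCall, HttpCall,
    ConditionalBranch, DecisionBranch, RepeatLoop,
    Subworkflow, ContinueWith, TimeoutBranch, FailureBranch,
    FunctionExpression
}
```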
2.6 Subworkflows
The engine must:
- start child workflows
- persist parent resume frames
- carry child output back into parent state
- support nested resume across multiple levels
- preserve current declarative subworkflow semantics
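A sketch of the parent resume frame, assuming a stack-per-lineage shape (names are hypothetical):

```csharp
using System;
using System.Collections.Generic;

// Illustrative shape of a parent resume frame; names are placeholders.
public sealed record ResumeFrame(
    Guid ParentInstanceId,
    string ResumePointId,   // where the parent continues after the child
    string OutputBinding);  // where the child's output lands in parent state

// Nested resume: each subworkflow level pushes one frame; child completion
// pops the top frame, copies the child output into the parent state, and
// signals the parent to continue.
public sealed class ResumeStack
{
    private readonly Stack<ResumeFrame> _frames = new();
    public void Push(ResumeFrame frame) => _frames.Push(frame);
    public bool TryPop(out ResumeFrame frame) => _frames.TryPop(out frame!);
}
```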
2.7 Scheduling
The engine must support:
- timeouts
- retry wake-ups
- delayed continuation
- explicit wait-until behavior
This must happen without a steady-state polling loop.
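One way to picture poll-free scheduling is a schedule-bus seam whose backend performs the delay (for example, a queue with delayed delivery); the interface below is an assumed sketch, not the actual API:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative schedule-bus boundary; names are placeholders. The point is
// that delayed delivery is pushed down to the backend, so no steady-state
// loop scans for due timers.
public sealed record ScheduledSignal(Guid InstanceId, string WaitId, DateTimeOffset DueAt);

public interface IScheduleBus
{
    // Enqueue with a delivery delay; a blocked consumer is woken by the
    // backend when the signal becomes visible. This one primitive covers
    // timeouts, retry wake-ups, delayed continuation, and wait-until.
    Task ScheduleAsync(ScheduledSignal signal, CancellationToken ct);
}
```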
2.8 Inspection And Operations
The service must continue to expose:
- workflow definitions
- workflow instances
- workflow tasks
- workflow task events
- workflow diagrams
- runtime state snapshots
- canonical schema
- canonical validation
3. Non-Functional Requirements
3.1 Multi-Instance Deployment
The service must support multiple application nodes against one shared Oracle database.
Implications:
- no single-node assumptions
- no in-memory-only correctness logic
- no sticky workflow ownership
- duplicate signal delivery must be safe
3.2 Durability
The system of record must be durable across:
- process restart
- node restart
- full cluster restart
- database restart
Workflow progress, pending waits, active tasks, and due timers must not be lost.
3.3 No Polling
Signal-driven wake-up is mandatory.
The engine must not rely on a periodic database scan loop to discover work. Blocking or event-driven delivery is required for:
- task completion wake-up
- delayed resume wake-up
- subworkflow completion wake-up
- external signal wake-up
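The consumer side of this requirement can be sketched as a blocking-dequeue pump (all names are hypothetical; the blocking dequeue itself is backend-specific, e.g. a dequeue call with a wait timeout):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative consumer loop; names are placeholders. The dequeue call
// blocks until a signal arrives instead of scanning tables on a timer.
public sealed record WorkflowSignal(Guid InstanceId, string WaitId);

public interface IBlockingSignalConsumer
{
    // Returns null when the wait window elapses without a signal.
    Task<WorkflowSignal?> DequeueAsync(TimeSpan wait, CancellationToken ct);
}

public interface ISignalDispatcher { Task ResumeAsync(WorkflowSignal signal, CancellationToken ct); }

public sealed class SignalPump
{
    private readonly IBlockingSignalConsumer _consumer;
    private readonly ISignalDispatcher _dispatcher;

    public SignalPump(IBlockingSignalConsumer consumer, ISignalDispatcher dispatcher)
        => (_consumer, _dispatcher) = (consumer, dispatcher);

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var signal = await _consumer.DequeueAsync(TimeSpan.FromSeconds(30), ct);
            if (signal is null) continue; // wait window elapsed; block again
            // One path serves all wake-up kinds: task completion, delayed
            // resume, subworkflow completion, external signal.
            await _dispatcher.ResumeAsync(signal, ct);
        }
    }
}
```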
3.4 One Database
Oracle is the shared durable state backend for:
- workflow projections
- workflow runtime snapshots
- host coordination
- signal and schedule durability through Oracle AQ
Redis may exist in the wider platform, but it is not required for engine correctness.
3.5 Observability
The engine must produce enough telemetry to answer:
- what instance is waiting
- why it is waiting
- which signal resumed it
- which node executed it
- which definition version it used
- why it failed
- whether a message was retried, dead-lettered, or ignored as stale
3.6 Compatibility
The engine must preserve the existing public workflow service contracts unless a future product change explicitly changes them.
The following service-contract groups are especially important:
- workflow start contracts
- workflow definition contracts
- workflow task contracts
- workflow instance contracts
- workflow operational contracts
4. Explicit V1 Assumptions
These assumptions simplify the engine architecture and are intentional.
4.1 Single Active Runtime Provider Per Deployment
The service runs one engine provider at a time.
This means:
- no mixed-provider instance routing
- no live migration between engines
- no simultaneous old-runtime and engine execution inside one deployment
The design still keeps abstractions around the runtime, signaling bus, and scheduler so that future replacement remains possible.
4.2 Canonical Runtime, Not Elsa Activity Runtime
The target engine executes canonical workflow definitions directly.
Authored C# remains the source of truth, but runtime semantics are driven by canonical definitions compiled from that source.
4.3 Oracle AQ Is The Default Event Backbone
Oracle AQ is treated as part of the durable engine platform because it satisfies:
- one-database architecture
- blocking dequeue
- durable delivery
- delayed delivery
- transactional behavior
5. Design Principles
5.1 Keep The Product Surface Stable
The workflow service remains the product boundary. The engine is an internal subsystem.
5.2 Separate Read Model From Runtime Model
Task and instance projections are optimized for product reads.
Runtime snapshots are optimized for deterministic resume.
They are related, but they are not the same data structure.
5.3 Run To Wait
The engine should never keep a workflow instance “hot” in memory for correctness.
Execution should run until:
- a task is activated
- a timer is scheduled
- an external signal wait is registered
- the workflow completes
Then the snapshot is persisted and released.
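Run-to-wait can be sketched as a single execute-then-release step, assuming an interpreter that reports why it stopped (names are illustrative):

```csharp
using System.Threading.Tasks;

// Illustrative run-to-wait step; names are placeholders.
public enum WaitBoundary { TaskActivated, TimerScheduled, SignalWaitRegistered, Completed }

public interface IInterpreter
{
    // Executes canonical nodes until a wait boundary or completion.
    WaitBoundary RunToWait(object state);
}

public interface ISnapshotWriter { Task SaveAsync(object state, WaitBoundary boundary); }

public static class RunToWaitStep
{
    public static async Task ExecuteAsync(IInterpreter interpreter, ISnapshotWriter writer, object state)
    {
        var boundary = interpreter.RunToWait(state);
        // Whatever the boundary, the snapshot is persisted and the instance
        // released; nothing stays resident in memory for correctness.
        await writer.SaveAsync(state, boundary);
    }
}
```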
5.4 Make Delivery At-Least-Once And Resume Idempotent
Distributed delivery is never exactly-once in practice.
The engine must treat duplicate signals, duplicate wake-ups, and late timer arrivals as normal conditions.
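One common way to make resume idempotent is to version the snapshot and let signals carry the version they target; the guard below is a sketch under that assumption, not the actual mechanism:

```csharp
using System;

// Illustrative duplicate/stale guard; names are placeholders. Duplicates
// and late timer arrivals are dropped as normal conditions, not errors.
public sealed record VersionedSignal(Guid InstanceId, string WaitId, long SnapshotVersion);

public enum ResumeDecision { Resume, IgnoredAsStale }

public static class ResumeGuard
{
    public static ResumeDecision Decide(VersionedSignal signal, long currentSnapshotVersion)
        => signal.SnapshotVersion < currentSnapshotVersion
            ? ResumeDecision.IgnoredAsStale   // duplicate delivery or late wake-up
            : ResumeDecision.Resume;
}
```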
5.5 Keep Signals Small
Signals should identify work, not carry the full workflow state.
The database snapshot remains authoritative.
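Concretely, a signal under this principle is little more than an identifier pair (field names are illustrative):

```csharp
using System;

// Illustrative minimal signal: it identifies work, it carries no state.
// The durable snapshot remains authoritative; the resuming node reloads it.
public sealed record WakeUp(
    Guid InstanceId,   // which instance to wake
    string WaitId);    // which registered wait is satisfied
```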
5.6 Keep Abstractions At The Backend Boundary
Abstract:
- runtime provider
- signal bus
- schedule bus
- snapshot store
Do not abstract away the workflow semantics themselves.
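The four seams named above can be sketched as interfaces (names and signatures are assumptions for illustration; workflow semantics live above these seams and are deliberately not abstracted):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative backend-boundary interfaces; all names are placeholders.
public interface IRuntimeProvider
{
    Task RunToWaitAsync(Guid instanceId, CancellationToken ct);
}

public interface ISignalBus
{
    Task PublishAsync(Guid instanceId, string waitId, CancellationToken ct);
}

public interface IDelayedScheduleBus
{
    Task ScheduleAsync(Guid instanceId, string waitId, DateTimeOffset dueAt, CancellationToken ct);
}

public interface ISnapshotStore
{
    // expectedVersion enables optimistic concurrency across nodes.
    Task SaveAsync(Guid instanceId, ReadOnlyMemory<byte> snapshot,
        long expectedVersion, CancellationToken ct);
}
```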
5.7 Prefer Transactional Consistency Over Cleverness
If a feature can be made transactional in Oracle, prefer that over eventually-consistent coordination tricks.
6. Success Criteria
The engine architecture is successful when:
- the service can start and complete workflows without Elsa
- task projections remain correct
- delayed resumes happen without polling
- a stopped cluster resumes safely after restart
- a multi-node deployment does not corrupt workflow state
- canonical definitions remain the execution contract
- operations can inspect and support the system with existing product-level APIs