# 01. Requirements And Principles

## 1. Product Goal

Build a Serdica-owned workflow engine that can run the current Bulstrad workflow corpus without Elsa while preserving the existing service-level workflow product:

- workflow start
- task inbox and task lifecycle
- business-reference-based lookup
- runtime state inspection
- workflow diagrams
- canonical schema and canonical validation exposure
- workflow retention and hosted jobs

The engine must execute the same business behavior currently expressed in the declarative workflow DSL and canonical workflow definition model.

## 2. Functional Requirements

### 2.1 Workflow Definition Handling

The engine must:

- discover workflow registrations from authored C# workflow classes
- resolve the latest or exact workflow version through the existing registration catalog
- compile authored declarative workflows into canonical runtime definitions
- keep canonical validation as a first-class platform capability
- reject invalid or unsupported definitions during startup or validation

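As a minimal sketch of latest-or-exact version resolution, the catalog lookup could look like the following. All names (`WorkflowRegistration`, `WorkflowRegistrationCatalog`, `Resolve`) are illustrative assumptions, not the real Serdica API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical registration entry: a named, versioned authored workflow class.
public sealed record WorkflowRegistration(string Name, int Version, Type WorkflowType);

public sealed class WorkflowRegistrationCatalog
{
    private readonly List<WorkflowRegistration> _registrations = new();

    public void Register(WorkflowRegistration registration) => _registrations.Add(registration);

    // Exact version when one is requested; otherwise the highest registered version.
    public WorkflowRegistration Resolve(string name, int? version = null)
    {
        var candidates = _registrations.Where(r => r.Name == name);
        var match = version is int v
            ? candidates.FirstOrDefault(r => r.Version == v)
            : candidates.OrderByDescending(r => r.Version).FirstOrDefault();
        return match ?? throw new InvalidOperationException(
            $"No registration for '{name}' (requested: {(version?.ToString() ?? "latest")})");
    }
}
```

A compile step would then turn the resolved registration into a canonical runtime definition, with invalid or unsupported definitions rejected at startup or validation time.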
### 2.2 Workflow Start

The engine must:

- bind the untyped start payload to the workflow start request type
- resolve or derive business reference data
- initialize canonical workflow state
- execute the initial sequence until a wait boundary or completion
- create workflow projections and runtime state in one durable flow
- support workflow continuations created during start

### 2.3 Human Tasks

The engine must:

- activate human tasks with:
  - task type
  - route
  - workflow roles
  - task roles
  - runtime roles
  - payload
  - business reference
- preserve the current task assignment model:
  - assign to self
  - assign to user
  - assign to runtime roles
  - release
- expose completed and active task history through the existing projection model

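The activation attributes above can be pictured as a single record, with the assignment model as a closed set of actions. Field and type names here are hypothetical, shown only to make the shape concrete:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a human task activation; field names are illustrative.
public sealed record HumanTaskActivation(
    Guid WorkflowInstanceId,
    string TaskType,
    string Route,
    IReadOnlyList<string> WorkflowRoles,
    IReadOnlyList<string> TaskRoles,
    IReadOnlyList<string> RuntimeRoles,
    string PayloadJson,
    string BusinessReference);

// The current assignment model as a closed set of actions.
public enum TaskAssignmentAction
{
    AssignToSelf,
    AssignToUser,
    AssignToRuntimeRoles,
    Release
}
```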
### 2.4 Task Completion

The engine must:

- load the current workflow state and task context
- authorize completion through the existing service layer
- apply the completion payload
- continue execution from the task completion entry point
- produce next tasks, next waits, next continuations, or completion
- update runtime state and read projections durably

### 2.5 Runtime Semantics

The engine must support the semantic surface already present in declarative workflows:

- state assignment
- business reference assignment
- human task activation
- microservice calls
- legacy rabbit calls
- GraphQL calls
- HTTP calls
- conditional branches
- decision branches
- repeat loops
- subworkflow invocation
- continue-with orchestration
- timeout branches
- failure branches
- function-backed expressions

### 2.6 Subworkflows

The engine must:

- start child workflows
- persist parent resume frames
- carry child output back into parent state
- support nested resume across multiple levels
- preserve current declarative subworkflow semantics

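A parent resume frame can be sketched as a small persisted record; nested invocation is then a stack of such frames, popped as each child completes. All names are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical resume frame persisted when a parent starts a child workflow.
// On child completion the engine pops the frame, merges child output into the
// parent state at OutputTargetPath, and continues from ResumePointId.
public sealed record ParentResumeFrame(
    Guid ParentInstanceId,
    Guid ChildInstanceId,
    string ResumePointId,      // where the parent continues after the child
    string OutputTargetPath);  // where the child output lands in parent state
```

Nested resume across multiple levels falls out naturally: each level pushes its own frame, so a `Stack<ParentResumeFrame>` (or its durable equivalent) unwinds one level per child completion.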
### 2.7 Scheduling

The engine must support:

- timeouts
- retry wake-ups
- delayed continuation
- explicit wait-until behavior

This must happen without a steady-state polling loop.

### 2.8 Inspection And Operations

The service must continue to expose:

- workflow definitions
- workflow instances
- workflow tasks
- workflow task events
- workflow diagrams
- runtime state snapshots
- canonical schema
- canonical validation

## 3. Non-Functional Requirements

### 3.1 Multi-Instance Deployment

The service must support multiple application nodes against one shared Oracle database.

Implications:

- no single-node assumptions
- no in-memory-only correctness logic
- no sticky workflow ownership
- duplicate signal delivery must be safe

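One common way to satisfy these implications is an optimistic version check on the snapshot row, so concurrent nodes and duplicate deliveries race safely. This is a minimal in-memory sketch under that assumption; the real store would be Oracle with a versioned `UPDATE ... WHERE` clause, and all names here are hypothetical:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical snapshot store with optimistic concurrency. A dictionary stands
// in for the shared Oracle database.
public sealed class SnapshotStore
{
    private readonly ConcurrentDictionary<Guid, (int Version, string State)> _rows = new();

    public void Seed(Guid id, string state) => _rows[id] = (0, state);

    public (int Version, string State) Load(Guid id) => _rows[id];

    // Succeeds only if nobody advanced the snapshot since we loaded it, so a
    // duplicate or concurrent writer loses harmlessly instead of corrupting state.
    public bool TrySave(Guid id, int expectedVersion, string newState)
    {
        var current = _rows[id];
        if (current.Version != expectedVersion) return false;  // stale writer loses
        return _rows.TryUpdate(id, (expectedVersion + 1, newState), current);
    }
}
```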
### 3.2 Durability

The system of record must be durable across:

- process restart
- node restart
- full cluster restart
- database restart

Workflow progress, pending waits, active tasks, and due timers must not be lost.

### 3.3 No Polling

Signal-driven wake-up is mandatory.

The engine must not rely on a periodic database scan loop to discover work. Blocking or event-driven delivery is required for:

- task completion wake-up
- delayed resume wake-up
- subworkflow completion wake-up
- external signal wake-up

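The no-polling requirement can be sketched as a consumer that blocks on a queue rather than scanning the database. Here `BlockingCollection<T>` stands in for an Oracle AQ blocking dequeue; the type and member names are illustrative assumptions:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical signal pump: consumers block until a signal arrives, so there
// is no steady-state database scan loop.
public sealed class SignalPump
{
    private readonly BlockingCollection<string> _queue = new();

    public void Publish(string signal) => _queue.Add(signal);

    // Blocks until a signal arrives or the token is cancelled; returns null on
    // cancellation so hosted services can shut down cleanly.
    public string? WaitForNext(CancellationToken token)
    {
        try { return _queue.Take(token); }
        catch (OperationCanceledException) { return null; }
    }
}
```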
### 3.4 One Database

Oracle is the shared durable state backend for:

- workflow projections
- workflow runtime snapshots
- host coordination
- signal and schedule durability through Oracle AQ

Redis may exist in the wider platform, but it is not required for engine correctness.

### 3.5 Observability

The engine must produce enough telemetry to answer:

- what instance is waiting
- why it is waiting
- which signal resumed it
- which node executed it
- which definition version it used
- why it failed
- whether a message was retried, dead-lettered, or ignored as stale

### 3.6 Compatibility

The engine must preserve the existing public workflow service contracts unless a future product change explicitly changes them.

The following service-contract groups are especially important:

- workflow start contracts
- workflow definition contracts
- workflow task contracts
- workflow instance contracts
- workflow operational contracts

## 4. Explicit V1 Assumptions

These assumptions simplify the engine architecture and are intentional.

### 4.1 Single Active Runtime Provider Per Deployment

The service runs one engine provider at a time.

This means:

- no mixed-provider instance routing
- no live migration between engines
- no simultaneous old-runtime and engine execution inside one deployment

The design still keeps abstractions around the runtime, signaling bus, and scheduler so that future replacement remains possible.

### 4.2 Canonical Runtime, Not Elsa Activity Runtime

The target engine executes canonical workflow definitions directly.

Authored C# remains the source of truth, but runtime semantics are driven by canonical definitions compiled from that source.

### 4.3 Oracle AQ Is The Default Event Backbone

Oracle AQ is treated as part of the durable engine platform because it satisfies:

- one-database architecture
- blocking dequeue
- durable delivery
- delayed delivery
- transactional behavior

## 5. Design Principles

### 5.1 Keep The Product Surface Stable

The workflow service remains the product boundary. The engine is an internal subsystem.

### 5.2 Separate Read Model From Runtime Model

Task and instance projections are optimized for product reads.

Runtime snapshots are optimized for deterministic resume.

They are related, but they are not the same data structure.

### 5.3 Run To Wait

The engine should never keep a workflow instance "hot" in memory for correctness.

Execution should run until:

- a task is activated
- a timer is scheduled
- an external signal wait is registered
- the workflow completes

Then the snapshot is persisted and released.

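The run-to-wait loop above can be sketched as follows. The step model and outcome names are illustrative assumptions, not the canonical definition model:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wait boundaries a step can report.
public enum StepOutcome
{
    Continue,              // keep executing the sequence
    TaskActivated,         // a human task was activated
    TimerScheduled,        // a timer was scheduled
    SignalWaitRegistered,  // an external signal wait was registered
    Completed              // the workflow finished
}

public static class RunToWait
{
    // Runs steps until one reports a wait boundary or the sequence ends, then
    // returns the boundary so the caller can persist the snapshot and release
    // the instance from memory.
    public static StepOutcome Execute(IEnumerable<Func<StepOutcome>> steps)
    {
        foreach (var step in steps)
        {
            var outcome = step();
            if (outcome != StepOutcome.Continue) return outcome;
        }
        return StepOutcome.Completed;
    }
}
```

The key property is that nothing after the wait boundary runs: execution stops, the snapshot is written, and a later signal restarts the loop from the persisted state.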
### 5.4 Make Delivery At-Least-Once And Resume Idempotent

Distributed delivery is never exactly-once in practice.

The engine must treat duplicate signals, duplicate wake-ups, and late timer arrivals as normal conditions.

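One way to make resume idempotent under at-least-once delivery is a wait-token comparison: the snapshot records the wait it is parked on, and any signal carrying a different token is classified as stale rather than treated as an error. The names here are hypothetical:

```csharp
using System;

public enum SignalDisposition { Resume, IgnoreStale }

public static class ResumeGuard
{
    // A signal resumes the instance only if it targets the wait the snapshot is
    // currently parked on. Duplicates and late timer arrivals carry an old
    // token and become harmless no-ops.
    public static SignalDisposition Classify(Guid snapshotWaitToken, Guid signalWaitToken) =>
        signalWaitToken == snapshotWaitToken
            ? SignalDisposition.Resume
            : SignalDisposition.IgnoreStale;
}
```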
### 5.5 Keep Signals Small

Signals should identify work, not carry the full workflow state.

The database snapshot remains authoritative.

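A "small signal" in this sense is essentially an identifier triple: which instance, which wait, what kind of event. This record is an illustrative assumption of that shape:

```csharp
using System;

// Hypothetical signal payload: just enough to locate the work. The durable
// snapshot in the database stays authoritative; the signal carries no state.
public sealed record WorkflowSignal(
    Guid InstanceId,   // which instance to wake
    Guid WaitToken,    // which wait this signal targets
    string Kind);      // e.g. "task-completed", "timer-due", "child-completed"
```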
### 5.6 Keep Abstractions At The Backend Boundary

Abstract:

- runtime provider
- signal bus
- schedule bus
- snapshot store

Do not abstract away the workflow semantics themselves.

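The four boundaries above could be expressed as narrow interfaces along these lines. The signatures are illustrative assumptions, not the real contracts:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical backend-boundary contracts; each has one Oracle-backed default
// implementation, but nothing about workflow semantics leaks into them.

public interface ISignalBus
{
    Task PublishAsync(Guid instanceId, Guid waitToken, string kind, CancellationToken token);
}

public interface IScheduleBus
{
    // Durable delayed delivery, e.g. Oracle AQ delay on the default backend.
    Task ScheduleAsync(Guid instanceId, Guid waitToken, TimeSpan delay, CancellationToken token);
}

public interface ISnapshotStore
{
    Task<byte[]?> LoadAsync(Guid instanceId, CancellationToken token);
    Task SaveAsync(Guid instanceId, byte[] snapshot, int expectedVersion, CancellationToken token);
}

public interface IRuntimeProvider
{
    // Executes the instance until its next wait boundary (see Run To Wait).
    Task RunToWaitAsync(Guid instanceId, CancellationToken token);
}
```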
### 5.7 Prefer Transactional Consistency Over Cleverness

If a feature can be made transactional in Oracle, prefer that over eventually-consistent coordination tricks.

## 6. Success Criteria

The engine architecture is successful when:

- the service can start and complete workflows without Elsa
- task projections remain correct
- delayed resumes happen without polling
- a stopped cluster resumes safely after restart
- a multi-node deployment does not corrupt workflow state
- canonical definitions remain the execution contract
- operations can inspect and support the system with existing product-level APIs