05. Service Surface, Hosting, And Operations
1. Public Service Surface
The engine replacement must preserve the current workflow product APIs.
That means the following capability groups remain stable:
- workflow definition inspection
- workflow start
- workflow tasks list/get/assign/release/complete
- workflow instances list/get
- workflow diagrams
- workflow retention run
- canonical schema inspection
- canonical import validation
The existing service-contract groups remain the baseline:
- workflow definition contracts
- workflow start contracts
- workflow task contracts
- workflow instance contracts
- workflow operational contracts
2. Service Metadata
The service should continue to advertise:
- definition inspection support
- instance inspection support
- canonical schema inspection support
- canonical validation support
The diagram provider value should change from old-runtime semantics to an engine-compatible diagram provider, but the public contract can remain unchanged.
3. Workflow Diagram Strategy
The current diagram service builds a simplified linear diagram from definition metadata and overlays instance/task status.
This simplified diagram service remains the baseline, and the v1 engine design keeps the approach.
Why:
- it is already product-compatible
- it does not depend on Elsa runtime internals
- it uses task and instance projections, which remain in place
The engine should not block on building a richer graph renderer.
4. Authorization And Assignment
Authorization remains in the service layer, not the engine kernel.
This should remain true in v1:
- engine activates tasks
- projection store writes tasks
- service decides who may assign/release/complete them
The engine should never embed user-specific authorization policy.
5. Hosting Model
5.1 Host Shape
The service process should host:
- API endpoints
- canonical definition cache
- runtime provider
- AQ signal consumer hosted service
- retention hosted service
5.2 Background Services
Recommended hosted services:
- WorkflowEngineSignalHostedService
- WorkflowEngineScheduleHostedService (may be unnecessary if delayed AQ messages are consumed by the same signal service)
- WorkflowRetentionHostedService
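A minimal registration sketch for these hosted services (the class names follow the recommendation above; the implementations themselves are assumed, not an existing package):

```csharp
// Program.cs sketch: wiring the recommended hosted services.
// The hosted service classes are the names recommended above; their
// implementations are assumptions, not part of an existing library.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHostedService<WorkflowEngineSignalHostedService>();

// Optional: only needed when delayed AQ messages are NOT consumed by the
// same signal service (see the note above).
builder.Services.AddHostedService<WorkflowEngineScheduleHostedService>();

builder.Services.AddHostedService<WorkflowRetentionHostedService>();

var app = builder.Build();
app.Run();
```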
5.3 Concurrency Configuration
The host must expose configuration for:
- signal consumer count
- max concurrent execution handlers
- dequeue wait duration
- per-execution timeout
6. Configuration Model
6.1 Runtime Configuration
Recommended runtime options:
{
  "WorkflowRuntime": {
    "Provider": "SerdicaEngine",
    "FailStartupOnInvalidDefinition": true
  }
}
In v1 this is a single-provider choice, not a mixed routing system.
6.2 Engine Execution Configuration
Recommended engine options:
{
  "WorkflowEngine": {
    "NodeId": "workflow-node-1",
    "MaxConcurrentExecutions": 16,
    "ExecutionTimeoutSeconds": 300,
    "DefinitionCacheMode": "Startup"
  }
}
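The engine options can be bound with the standard options pattern; a sketch (the options class mirrors the JSON above and is an assumption, not an existing type):

```csharp
// Options class mirroring the "WorkflowEngine" section above.
// This class is an illustrative assumption, not an existing API.
public sealed class WorkflowEngineOptions
{
    public string NodeId { get; set; } = "";
    public int MaxConcurrentExecutions { get; set; } = 16;
    public int ExecutionTimeoutSeconds { get; set; } = 300;
    public string DefinitionCacheMode { get; set; } = "Startup";
}

// In Program.cs: bind the section so consumers can inject
// IOptions<WorkflowEngineOptions>.
builder.Services.Configure<WorkflowEngineOptions>(
    builder.Configuration.GetSection("WorkflowEngine"));
```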
6.3 AQ Configuration
Recommended AQ options:
{
  "WorkflowAq": {
    "Schema": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "ScheduleQueueName": "WF_SCHEDULE_Q",
    "DeadLetterQueueName": "WF_DLQ_Q",
    "DequeueWaitSeconds": 30,
    "MaxDeliveryAttempts": 10,
    "SignalConsumers": 4
  }
}
6.4 Retention Configuration
Reuse the existing retention options and align engine snapshot retention with projection retention.
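A hypothetical shape for aligned retention options (the key names are illustrative only; the actual option names come from the existing retention configuration being reused):

```json
{
  "WorkflowRetention": {
    "CompletedInstanceRetentionDays": 90,
    "SnapshotRetentionDays": 90,
    "RunIntervalMinutes": 60
  }
}
```

The important property is that SnapshotRetentionDays and the projection retention window do not diverge, so engine snapshots never outlive or predecease their projections.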
7. Operational Diagnostics
The engine must make the following available in logs and metrics:
- workflow instance id
- workflow name
- workflow version
- business reference key
- signal id
- signal type
- waiting token
- state version
- node id
- execution duration
- dequeue latency
- retry count
- dead-letter count
- transport name and step id on failure
8. Metrics
Recommended metrics:
- workflows started
- workflows completed
- workflows failed
- tasks activated
- task completions processed
- AQ signals dequeued
- AQ signal failures
- AQ DLQ count
- timer signals fired
- stale signals ignored
- execution conflict retries
- average execution slice duration
- active waiting instances by waiting kind
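A few of these counters sketched with System.Diagnostics.Metrics (the meter and instrument names are suggestions, not an existing contract):

```csharp
using System.Diagnostics.Metrics;

// Instruments for a subset of the metrics recommended above.
// Meter and instrument names are illustrative suggestions.
public static class WorkflowMetrics
{
    private static readonly Meter Meter = new("Workflow.Engine");

    public static readonly Counter<long> WorkflowsStarted =
        Meter.CreateCounter<long>("workflow.instances.started");

    public static readonly Counter<long> WorkflowsFailed =
        Meter.CreateCounter<long>("workflow.instances.failed");

    public static readonly Counter<long> SignalsDequeued =
        Meter.CreateCounter<long>("workflow.aq.signals.dequeued");

    public static readonly Histogram<double> ExecutionSliceMs =
        Meter.CreateHistogram<double>("workflow.execution.slice.duration", unit: "ms");
}

// Usage at the call site, e.g. after a successful start:
// WorkflowMetrics.WorkflowsStarted.Add(1);
```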
9. Logging
Logging should distinguish between:
- product logs
- engine execution logs
- signal bus logs
- scheduler logs
- transport logs
The engine should log structured fields, not only free text.
Minimum structured fields:
- workflowInstanceId
- workflowName
- workflowVersion
- businessReferenceKey
- signalId
- signalType
- nodeId
- stateVersion
- waitingToken
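A structured logging call carrying these fields might look like this (each message-template placeholder becomes a structured property; the surrounding signal and instance types are assumptions):

```csharp
// ILogger structured-field sketch: each {Placeholder} is captured as a
// named structured property, not interpolated into free text.
// The signal/instance objects here are hypothetical shapes.
logger.LogInformation(
    "Signal {SignalType} {SignalId} applied to instance {WorkflowInstanceId} " +
    "({WorkflowName} v{WorkflowVersion}, ref {BusinessReferenceKey}) " +
    "on node {NodeId}: stateVersion {StateVersion}, waitingToken {WaitingToken}",
    signal.Type, signal.Id, instance.Id,
    instance.WorkflowName, instance.WorkflowVersion, instance.BusinessReferenceKey,
    nodeId, instance.StateVersion, signal.WaitingToken);
```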
10. Failure Handling Policy
10.1 Recoverable Failures
Examples:
- transient transport failure
- transient AQ dequeue failure
- optimistic concurrency conflict
Handling:
- retry execution
- reschedule if policy exists
- keep workflow consistent
10.2 Non-Recoverable Failures
Examples:
- invalid snapshot format
- missing definition for existing instance
- unresolvable signal payload
Handling:
- move signal to DLQ
- mark instance runtime state as failed or blocked
- expose failure through inspection
11. Security Boundaries
11.1 Service API Boundary
User-facing authorization stays where it currently belongs:
- endpoint layer
- task authorization service
11.2 Engine Boundary
The engine should trust only:
- validated workflow definitions
- validated task completion requests from the service layer
- authenticated transport adapters
11.3 AQ Boundary
AQ queues should be scoped to the workflow schema and not shared casually with unrelated services.
12. Testing Strategy
12.1 Unit Tests
Test:
- canonical interpreter step behavior
- resume pointer serialization
- waiting token behavior
- optimistic concurrency conflict handling
- AQ envelope serialization
12.2 Component Tests
Test:
- start flow to task activation
- task completion to next task
- timer registration to delayed resume
- subworkflow completion to parent resume
- transport failure to retry or failure branch
12.3 Integration Tests
Test with real Oracle and AQ:
- signal enqueue/dequeue
- delayed message handling
- restart recovery
- multi-node duplicate delivery safety
12.4 Oracle And AQ Reliability Tests
The engine should have a dedicated Oracle-focused integration suite, not just generic workflow integration coverage.
The Oracle suite should be split into four layers.
12.4.1 Oracle Transport Reality Tests
These tests prove the raw AQ behavior that the engine depends on:
- immediate enqueue followed by blocking dequeue
- delayed enqueue followed by eventual dequeue
- enqueue with transaction commit succeeds
- enqueue with transaction rollback disappears
- dequeue with OnCommit plus rollback causes redelivery
- dequeue with OnCommit plus commit removes message
- dead-letter enqueue and replay path
- browse path against dead-letter queue
- queue creation and teardown in ephemeral schemas or ephemeral queue names
These tests should stay small and synthetic so transport failures are easy to isolate.
12.4.2 Engine Persistence And Delivery Coupling Tests
These tests prove that Oracle state and Oracle AQ stay consistent together:
- runtime state update plus AQ enqueue committed atomically
- runtime state update rolled back means no visible signal
- projection update plus AQ enqueue committed atomically
- duplicate AQ delivery with the same waiting token is harmless
- stale expected version plus valid waiting token is ignored safely
- stale timer message after reschedule becomes a no-op
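The duplicate-delivery case can be sketched as a test (xUnit style; the WorkflowTestHost harness and its methods are hypothetical names for whatever fixture the suite provides):

```csharp
[Fact]
public async Task DuplicateSignalDeliveryWithSameWaitingTokenIsHarmless()
{
    // Hypothetical harness: starts a workflow that parks on a waiting token.
    await using var host = await WorkflowTestHost.StartAsync();
    var instance = await host.StartWorkflowAsync("SampleApprovalFlow");
    var token = await host.GetWaitingTokenAsync(instance.Id);

    // Deliver the same signal twice, as AQ may redeliver after a crash
    // between dequeue and commit.
    await host.DeliverSignalAsync(instance.Id, token);
    await host.DeliverSignalAsync(instance.Id, token);

    // The second delivery must be a no-op: exactly one state transition
    // and no duplicate open tasks.
    var state = await host.GetRuntimeStateAsync(instance.Id);
    Assert.Equal(instance.StateVersion + 1, state.StateVersion);
    Assert.Single(await host.GetOpenTasksAsync(instance.Id));
}
```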
These are the most important correctness tests for the run-to-wait architecture.
12.4.3 Restart And Recovery Tests
These tests should simulate realistic restart conditions:
- app restart with immediate signal already in queue
- app restart with delayed signal not yet due
- app restart after delayed signal becomes due
- app restart after dequeue but before commit
- Oracle container restart while waiting instances exist
- Oracle restart while delayed messages are still pending
- service restart with dead-letter backlog present
These tests should prove that no polling is needed to recover normal execution.
12.4.4 Oracle Load And Timing Tests
These tests should focus on timing variance and backlog behavior:
- cold-container delayed message latency envelope
- many delayed messages becoming due in the same second
- burst of immediate signals after service startup
- mixed immediate and delayed signals on one queue
- long-running dequeue loops with empty polls between real messages
- bounded backlog drain time for representative queue depth
The goal is not only correctness, but knowing what timing variance is normal on local and CI Oracle containers.
The detailed workload model, KPI set, harness structure, and test-tier split should live in 08-load-and-performance-plan.md.
12.5 Bulstrad Product-Parity Tests
Synthetic engine tests are necessary but not sufficient.
The main parity suite should use real Bulstrad declarative workflows with scripted downstream transport responses. The purpose is to prove that the Serdica engine executes product workflows, not just toy workflows.
Recommended first-wave Bulstrad coverage:
- transport-heavy completion flows such as AssistantPrintInsisDocuments
- approval/review chains such as ReviewPolicyOpenForChange
- parent-child workflow chains such as OpenForChangePolicy
- cancellation flows such as AnnexCancellation
- policy end-state flows such as AssistantPolicyCancellation
- reinstate or reopen flows such as AssistantPolicyReinstate
- shared-policy integration flows such as InsisIntegrationNew
- shared-policy confirmation and conversion flows such as QuotationConfirm
- failure-tolerant cleanup flows such as QuoteOrAplCancel
Each Bulstrad test should assert:
- task sequence
- task payload shape
- transport invocation order
- final workflow state
- runtime version progression
- absence of leaked subworkflow frames or stale wait metadata
Current Oracle-backed parity coverage already includes these families and uses restarted providers plus real Oracle workflow tables, not synthetic in-memory state.
12.6 Chaos And Fault-Injection Tests
The engine should also have a deterministic chaos suite.
Recommended failure points:
- before snapshot save
- after snapshot save but before projection save
- after projection save but before AQ enqueue
- after AQ enqueue but before commit
- after dequeue but before signal processing completes
- after signal processing but before lease commit
Recommended assertions:
- no duplicate open tasks
- no lost committed signal
- no unbounded retry loop
- no invalid version rollback
- no stuck instance without an explainable wait reason
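The failure points above can be modeled as a deterministic injection seam; a sketch (the IFaultInjector abstraction is an assumption for test builds, not an existing API):

```csharp
// Deterministic fault-injection seam: each enum value names one of the
// failure points listed above. Chaos tests arm exactly one point per run.
public enum FaultPoint
{
    BeforeSnapshotSave,
    AfterSnapshotSaveBeforeProjectionSave,
    AfterProjectionSaveBeforeAqEnqueue,
    AfterAqEnqueueBeforeCommit,
    AfterDequeueBeforeProcessing,
    AfterProcessingBeforeLeaseCommit
}

// Hypothetical abstraction: production registers a no-op implementation,
// chaos tests register one that throws at the armed point.
public interface IFaultInjector
{
    void ThrowIfArmed(FaultPoint point);
}
```

After each armed run, the suite restarts the host and asserts the invariants above (no duplicate open tasks, no lost committed signal, and so on), which makes every crash point a reproducible test case rather than a race.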
12.7 Parity Tests
The most important tests compare outcomes against the current declarative workflow expectations:
- same task sequence
- same state changes
- same business reference results
- same transport payload shaping
13. Supportability
Operations staff should be able to answer:
- what is this instance waiting for
- when was it last executed
- what signal is due next
- why was a signal ignored
- why did a signal go to DLQ
- which step failed
This is why runtime state inspection and structured failure metadata are mandatory.