05. Service Surface, Hosting, And Operations

1. Public Service Surface

The engine replacement must preserve the current workflow product APIs.

That means the following capability groups remain stable:

  • workflow definition inspection
  • workflow start
  • workflow tasks list/get/assign/release/complete
  • workflow instances list/get
  • workflow diagrams
  • workflow retention run
  • canonical schema inspection
  • canonical import validation

The existing service-contract groups remain the baseline:

  • workflow definition contracts
  • workflow start contracts
  • workflow task contracts
  • workflow instance contracts
  • workflow operational contracts

2. Service Metadata

The service should continue to advertise:

  • definition inspection support
  • instance inspection support
  • canonical schema inspection support
  • canonical validation support

The diagram provider value should change from old-runtime semantics to an engine-compatible diagram provider, but the public contract can remain unchanged.

3. Workflow Diagram Strategy

The current diagram service builds a simplified linear diagram from definition metadata and overlays instance/task status.

This simplified diagram service remains the baseline; the v1 engine design keeps the approach.

Why:

  • it is already product-compatible
  • it does not depend on Elsa runtime internals
  • it uses task and instance projections, which remain in place

The engine should not block on building a richer graph renderer.

4. Authorization And Assignment

Authorization remains in the service layer, not the engine kernel.

This should remain true in v1:

  • engine activates tasks
  • projection store writes tasks
  • service decides who may assign/release/complete them

The engine should never embed user-specific authorization policy.
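The split above can be made concrete with a minimal sketch. The engine here is .NET; this Python stand-in and all names in it (TaskAuthorizationService, WorkflowEngine, complete_task_endpoint) are illustrative, not the real StellaOps.Workflow API.

```python
class AuthorizationError(Exception):
    pass

class TaskAuthorizationService:
    """Service-layer policy: the engine kernel never sees user identity."""
    def __init__(self, assignments):
        self._assignments = assignments  # task_id -> assigned user_id

    def ensure_can_complete(self, user_id, task_id):
        assignee = self._assignments.get(task_id)
        if assignee is not None and assignee != user_id:
            raise AuthorizationError(f"{user_id} may not complete {task_id}")

class WorkflowEngine:
    """Engine kernel: accepts already-authorized completion requests."""
    def __init__(self):
        self.completed = []

    def complete_task(self, task_id, payload):
        self.completed.append((task_id, payload))

def complete_task_endpoint(auth, engine, user_id, task_id, payload):
    # Authorization happens here, in the service layer, before the engine call.
    auth.ensure_can_complete(user_id, task_id)
    engine.complete_task(task_id, payload)
```

The point of the shape is that removing the service layer removes all user-specific policy; the engine call site carries no identity.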

5. Hosting Model

5.1 Host Shape

The service process should host:

  • API endpoints
  • canonical definition cache
  • runtime provider
  • AQ signal consumer hosted service
  • retention hosted service

5.2 Background Services

Recommended hosted services:

  • WorkflowEngineSignalHostedService
  • WorkflowEngineScheduleHostedService (may be unnecessary if delayed AQ messages are consumed by the same signal service)
  • WorkflowRetentionHostedService

5.3 Concurrency Configuration

The host must expose configuration for:

  • signal consumer count
  • max concurrent execution handlers
  • dequeue wait duration
  • per-execution timeout
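These knobs map onto the option shapes shown in sections 6.2 and 6.3. Pulled together for illustration (key names taken from those sections):

```json
{
  "WorkflowAq": {
    "SignalConsumers": 4,
    "DequeueWaitSeconds": 30
  },
  "WorkflowEngine": {
    "MaxConcurrentExecutions": 16,
    "ExecutionTimeoutSeconds": 300
  }
}
```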

6. Configuration Model

6.1 Runtime Configuration

Recommended runtime options:

{
  "WorkflowRuntime": {
    "Provider": "SerdicaEngine",
    "FailStartupOnInvalidDefinition": true
  }
}

In v1 this is a single-provider choice, not a mixed routing system.

6.2 Engine Execution Configuration

Recommended engine options:

{
  "WorkflowEngine": {
    "NodeId": "workflow-node-1",
    "MaxConcurrentExecutions": 16,
    "ExecutionTimeoutSeconds": 300,
    "DefinitionCacheMode": "Startup"
  }
}

6.3 AQ Configuration

Recommended AQ options:

{
  "WorkflowAq": {
    "Schema": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "ScheduleQueueName": "WF_SCHEDULE_Q",
    "DeadLetterQueueName": "WF_DLQ_Q",
    "DequeueWaitSeconds": 30,
    "MaxDeliveryAttempts": 10,
    "SignalConsumers": 4
  }
}

6.4 Retention Configuration

Reuse the existing retention options and align engine snapshot retention with projection retention.
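For symmetry with 6.1 through 6.3, a sketch of what the retention options could look like. The key names below are hypothetical, since the actual shape comes from the existing retention options being reused; the equal day counts illustrate aligning snapshot retention with projection retention.

```json
{
  "WorkflowRetention": {
    "ProjectionRetentionDays": 90,
    "SnapshotRetentionDays": 90,
    "RunIntervalMinutes": 60
  }
}
```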

7. Operational Diagnostics

The engine must make the following available in logs and metrics:

  • workflow instance id
  • workflow name
  • workflow version
  • business reference key
  • signal id
  • signal type
  • waiting token
  • state version
  • node id
  • execution duration
  • dequeue latency
  • retry count
  • dead-letter count
  • transport name and step id on failure

8. Metrics

Recommended metrics:

  • workflows started
  • workflows completed
  • workflows failed
  • tasks activated
  • task completions processed
  • AQ signals dequeued
  • AQ signal failures
  • AQ DLQ count
  • timer signals fired
  • stale signals ignored
  • execution conflict retries
  • average execution slice duration
  • active waiting instances by waiting kind

9. Logging

Logging should distinguish between:

  • product logs
  • engine execution logs
  • signal bus logs
  • scheduler logs
  • transport logs

The engine should log structured fields, not only free text.

Minimum structured fields:

  • workflowInstanceId
  • workflowName
  • workflowVersion
  • businessReferenceKey
  • signalId
  • signalType
  • nodeId
  • stateVersion
  • waitingToken
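As an illustration (all field values hypothetical), one structured log event carrying the minimum fields might look like:

```json
{
  "message": "Signal processed",
  "workflowInstanceId": "0b6e2a1c-3f44-4a8b-9c1d-2e5f6a7b8c9d",
  "workflowName": "ReviewPolicyOpenForChange",
  "workflowVersion": 3,
  "businessReferenceKey": "POL-100045",
  "signalId": "sig-7f21",
  "signalType": "TaskCompleted",
  "nodeId": "workflow-node-1",
  "stateVersion": 12,
  "waitingToken": "wt-4411"
}
```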

10. Failure Handling Policy

10.1 Recoverable Failures

Examples:

  • transient transport failure
  • transient AQ dequeue failure
  • optimistic concurrency conflict

Handling:

  • retry execution
  • reschedule if policy exists
  • keep workflow consistent
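The optimistic-concurrency case can be sketched as a bounded retry loop. This is a minimal Python stand-in, not the engine's real store API; RuntimeStateStore and the exception name are illustrative.

```python
class ConcurrencyConflict(Exception):
    pass

class RuntimeStateStore:
    """In-memory stand-in for the versioned runtime state table."""
    def __init__(self):
        self.state_version = 0
        self.state = None

    def save(self, expected_version, new_state):
        if expected_version != self.state_version:
            raise ConcurrencyConflict()
        self.state_version += 1
        self.state = new_state

def execute_with_retry(store, compute_next_state, max_attempts=3):
    """Retry an execution slice on conflict; never loop forever."""
    for attempt in range(1, max_attempts + 1):
        expected = store.state_version
        try:
            store.save(expected, compute_next_state(expected))
            return attempt  # number of attempts used
        except ConcurrencyConflict:
            continue  # re-read the version and retry keeps the workflow consistent
    raise RuntimeError("retry budget exhausted; escalate rather than loop")
```

The bounded attempt count matters: an unbounded loop would mask a real fault as load.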

10.2 Non-Recoverable Failures

Examples:

  • invalid snapshot format
  • missing definition for existing instance
  • unresolvable signal payload

Handling:

  • move signal to DLQ
  • mark instance runtime state as failed or blocked
  • expose failure through inspection
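The recoverable / non-recoverable split amounts to a classification step in front of the handling policy. A minimal sketch, with illustrative failure-kind names:

```python
# Illustrative failure-kind labels; the real engine's exception taxonomy differs.
RECOVERABLE = {"TransportTimeout", "AqDequeueTimeout", "ConcurrencyConflict"}
NON_RECOVERABLE = {"InvalidSnapshotFormat", "MissingDefinition",
                   "UnresolvableSignalPayload"}

def handle_failure(kind, signal, dlq, mark_failed):
    if kind in RECOVERABLE:
        return "retry"              # retry / reschedule per policy
    if kind in NON_RECOVERABLE:
        dlq.append(signal)          # move signal to DLQ
        mark_failed(signal)         # mark instance runtime state as failed
        return "dead-lettered"
    raise ValueError(f"unclassified failure kind: {kind}")
```

An unclassified kind raising loudly is deliberate: a failure the policy cannot name should surface, not silently retry.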

11. Security Boundaries

11.1 Service API Boundary

User-facing authorization stays where it currently belongs:

  • endpoint layer
  • task authorization service

11.2 Engine Boundary

The engine should trust only:

  • validated workflow definitions
  • validated task completion requests from the service layer
  • authenticated transport adapters

11.3 AQ Boundary

AQ queues should be scoped to the workflow schema and should not be shared with unrelated services.

12. Testing Strategy

12.1 Unit Tests

Test:

  • canonical interpreter step behavior
  • resume pointer serialization
  • waiting token behavior
  • optimistic concurrency conflict handling
  • AQ envelope serialization

12.2 Component Tests

Test:

  • start flow to task activation
  • task completion to next task
  • timer registration to delayed resume
  • subworkflow completion to parent resume
  • transport failure to retry or failure branch

12.3 Integration Tests

Test with real Oracle and AQ:

  • signal enqueue/dequeue
  • delayed message handling
  • restart recovery
  • multi-node duplicate delivery safety

12.4 Oracle And AQ Reliability Tests

The engine should have a dedicated Oracle-focused integration suite, not just generic workflow integration coverage.

The Oracle suite should be split into four layers.

12.4.1 Oracle Transport Reality Tests

These tests prove the raw AQ behavior that the engine depends on:

  • immediate enqueue followed by blocking dequeue
  • delayed enqueue followed by eventual dequeue
  • enqueue with transaction commit succeeds
  • enqueue with transaction rollback disappears
  • dequeue with OnCommit plus rollback causes redelivery
  • dequeue with OnCommit plus commit removes message
  • dead-letter enqueue and replay path
  • browse path against dead-letter queue
  • queue creation and teardown in ephemeral schemas or ephemeral queue names

These tests should stay small and synthetic so transport failures are easy to isolate.

12.4.2 Engine Persistence And Delivery Coupling Tests

These tests prove that Oracle state and Oracle AQ stay consistent together:

  • runtime state update plus AQ enqueue committed atomically
  • runtime state update rolled back means no visible signal
  • projection update plus AQ enqueue committed atomically
  • duplicate AQ delivery with the same waiting token is harmless
  • stale expected version plus valid waiting token is ignored safely
  • stale timer message after reschedule becomes a no-op

These are the most important correctness tests for the run-to-wait architecture.
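Two of these invariants (duplicate delivery is harmless, stale version is ignored) reduce to the resume path being idempotent on the waiting token. A minimal Python sketch, with illustrative names:

```python
class Instance:
    def __init__(self):
        self.state_version = 1
        self.waiting_token = "tok-1"
        self.resumed = 0

def apply_signal(instance, waiting_token, expected_version):
    if waiting_token != instance.waiting_token:
        return "ignored-stale-token"    # duplicate or reassigned wait: no-op
    if expected_version != instance.state_version:
        return "ignored-stale-version"  # concurrent update already happened
    instance.waiting_token = None       # consume the token: the next duplicate is a no-op
    instance.state_version += 1
    instance.resumed += 1
    return "resumed"
```

Consuming the token inside the same guarded step is what makes AQ's at-least-once delivery safe for the run-to-wait model.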

12.4.3 Restart And Recovery Tests

These tests should simulate realistic restart conditions:

  • app restart with immediate signal already in queue
  • app restart with delayed signal not yet due
  • app restart after delayed signal becomes due
  • app restart after dequeue but before commit
  • Oracle container restart while waiting instances exist
  • Oracle restart while delayed messages are still pending
  • service restart with dead-letter backlog present

These tests should prove that no polling is needed to recover normal execution.

12.4.4 Oracle Load And Timing Tests

These tests should focus on timing variance and backlog behavior:

  • cold-container delayed message latency envelope
  • many delayed messages becoming due in the same second
  • burst of immediate signals after service startup
  • mixed immediate and delayed signals on one queue
  • long-running dequeue loops with empty polls between real messages
  • bounded backlog drain time for representative queue depth

The goal is not only correctness but also establishing what timing variance is normal on local and CI Oracle containers.

The detailed workload model, KPI set, harness structure, and test-tier split should live in 08-load-and-performance-plan.md.

12.5 Bulstrad Product-Parity Tests

Synthetic engine tests are necessary but not sufficient.

The main parity suite should use real Bulstrad declarative workflows with scripted downstream transport responses. The purpose is to prove that the Serdica engine executes product workflows, not just toy workflows.

Recommended first-wave Bulstrad coverage:

  • transport-heavy completion flows such as AssistantPrintInsisDocuments
  • approval/review chains such as ReviewPolicyOpenForChange
  • parent-child workflow chains such as OpenForChangePolicy
  • cancellation flows such as AnnexCancellation
  • policy end-state flows such as AssistantPolicyCancellation
  • reinstate or reopen flows such as AssistantPolicyReinstate
  • shared-policy integration flows such as InsisIntegrationNew
  • shared-policy confirmation and conversion flows such as QuotationConfirm
  • failure-tolerant cleanup flows such as QuoteOrAplCancel

Each Bulstrad test should assert:

  • task sequence
  • task payload shape
  • transport invocation order
  • final workflow state
  • runtime version progression
  • absence of leaked subworkflow frames or stale wait metadata

Current Oracle-backed parity coverage already includes these families and uses restarted providers plus real Oracle workflow tables, not synthetic in-memory state.

12.6 Chaos And Fault-Injection Tests

The engine should also have a deterministic chaos suite.

Recommended failure points:

  • before snapshot save
  • after snapshot save but before projection save
  • after projection save but before AQ enqueue
  • after AQ enqueue but before commit
  • after dequeue but before signal processing completes
  • after signal processing but before lease commit

Recommended assertions:

  • no duplicate open tasks
  • no lost committed signal
  • no unbounded retry loop
  • no invalid version rollback
  • no stuck instance without an explainable wait reason
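The harness shape can be sketched deterministically: inject a crash at a named pipeline point, then assert the invariants. The pipeline steps and names below are illustrative stand-ins for the real persistence sequence.

```python
class InjectedCrash(Exception):
    pass

def run_slice(store, enqueue, crash_at=None):
    """Stage all writes, crash deterministically if asked, commit atomically."""
    staged = {"snapshot": None, "projection": None, "signal": None}
    for step in ["snapshot_save", "projection_save", "aq_enqueue", "commit"]:
        if step == crash_at:
            raise InjectedCrash(step)
        if step == "snapshot_save":
            staged["snapshot"] = {"version": store["version"] + 1}
        elif step == "projection_save":
            staged["projection"] = {"open_tasks": 1}
        elif step == "aq_enqueue":
            staged["signal"] = {"token": "tok-1"}
        elif step == "commit":
            store.update(snapshot=staged["snapshot"],
                         projection=staged["projection"])
            enqueue(staged["signal"])
    return store

def invariants_hold(store, queue):
    # No partial commit: either every effect landed, or none did.
    all_in = store.get("snapshot") and store.get("projection") and queue
    none_in = not store.get("snapshot") and not store.get("projection") and not queue
    return bool(all_in) or bool(none_in)
```

Because the crash point is a named parameter rather than random, every failure point in the list above becomes one reproducible test case.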

12.7 Parity Tests

The most important tests compare outcomes against the current declarative workflow expectations:

  • same task sequence
  • same state changes
  • same business reference results
  • same transport payload shaping

13. Supportability

Operations staff should be able to answer:

  • what is this instance waiting for
  • when was it last executed
  • what signal is due next
  • why was a signal ignored
  • why did a signal go to DLQ
  • which step failed

This is why runtime state inspection and structured failure metadata are mandatory.