05. Service Surface, Hosting, And Operations

1. Public Service Surface

The engine replacement must preserve the current workflow product APIs.

That means the following capability groups remain stable:

  • workflow definition inspection
  • workflow start
  • workflow tasks list/get/assign/release/complete
  • workflow instances list/get
  • workflow diagrams
  • workflow retention run
  • canonical schema inspection
  • canonical import validation

The existing service-contract groups remain the baseline:

  • workflow definition contracts
  • workflow start contracts
  • workflow task contracts
  • workflow instance contracts
  • workflow operational contracts

2. Service Metadata

The service should continue to advertise:

  • definition inspection support
  • instance inspection support
  • canonical schema inspection support
  • canonical validation support

The diagram provider value should change from old-runtime semantics to an engine-compatible diagram provider, but the public contract can remain unchanged.

3. Workflow Diagram Strategy

The current diagram service builds a simplified linear diagram from definition metadata and overlays instance/task status.

This simplified diagram service remains the baseline; the v1 engine design keeps the approach.

Why:

  • it is already product-compatible
  • it does not depend on Elsa runtime internals
  • it uses task and instance projections, which remain in place

The engine should not block on building a richer graph renderer.

4. Authorization And Assignment

Authorization remains in the service layer, not the engine kernel.

This should remain true in v1:

  • engine activates tasks
  • projection store writes tasks
  • service decides who may assign/release/complete them

The engine should never embed user-specific authorization policy.
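The split above can be made concrete with a minimal sketch. The engine here is .NET; this Python stand-in and all names in it (TaskAuthorizationService, WorkflowEngine, complete_task_endpoint) are illustrative, not the real StellaOps.Workflow API.

```python
class AuthorizationError(Exception):
    pass

class TaskAuthorizationService:
    """Service-layer policy: the engine kernel never sees user identity."""
    def __init__(self, assignments):
        self._assignments = assignments  # task_id -> assigned user_id

    def ensure_can_complete(self, user_id, task_id):
        assignee = self._assignments.get(task_id)
        if assignee is not None and assignee != user_id:
            raise AuthorizationError(f"{user_id} may not complete {task_id}")

class WorkflowEngine:
    """Engine kernel: accepts already-authorized completion requests."""
    def __init__(self):
        self.completed = []

    def complete_task(self, task_id, payload):
        self.completed.append((task_id, payload))

def complete_task_endpoint(auth, engine, user_id, task_id, payload):
    # Authorization happens here, in the service layer, before the engine call.
    auth.ensure_can_complete(user_id, task_id)
    engine.complete_task(task_id, payload)
```

The point of the shape is that removing the service layer removes all user-specific policy; the engine call site carries no identity.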

5. Hosting Model

5.1 Host Shape

The service process should host:

  • API endpoints
  • canonical definition cache
  • runtime provider
  • AQ signal consumer hosted service
  • retention hosted service

5.2 Background Services

Recommended hosted services:

  • WorkflowEngineSignalHostedService
  • WorkflowEngineScheduleHostedService (may be unnecessary if delayed AQ messages are consumed by the same signal service)
  • WorkflowRetentionHostedService

5.3 Concurrency Configuration

The host must expose configuration for:

  • signal consumer count
  • max concurrent execution handlers
  • dequeue wait duration
  • per-execution timeout
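These knobs map onto the option shapes shown in sections 6.2 and 6.3. Pulled together for illustration (key names taken from those sections):

```json
{
  "WorkflowAq": {
    "SignalConsumers": 4,
    "DequeueWaitSeconds": 30
  },
  "WorkflowEngine": {
    "MaxConcurrentExecutions": 16,
    "ExecutionTimeoutSeconds": 300
  }
}
```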

6. Configuration Model

6.1 Runtime Configuration

Recommended runtime options:

{
  "WorkflowRuntime": {
    "Provider": "SerdicaEngine",
    "FailStartupOnInvalidDefinition": true
  }
}

In v1 this is a single-provider choice, not a mixed routing system.

6.2 Engine Execution Configuration

Recommended engine options:

{
  "WorkflowEngine": {
    "NodeId": "workflow-node-1",
    "MaxConcurrentExecutions": 16,
    "ExecutionTimeoutSeconds": 300,
    "DefinitionCacheMode": "Startup"
  }
}

6.3 AQ Configuration

Recommended AQ options:

{
  "WorkflowAq": {
    "Schema": "SRD_WFKLW",
    "SignalQueueName": "WF_SIGNAL_Q",
    "ScheduleQueueName": "WF_SCHEDULE_Q",
    "DeadLetterQueueName": "WF_DLQ_Q",
    "DequeueWaitSeconds": 30,
    "MaxDeliveryAttempts": 10,
    "SignalConsumers": 4
  }
}

6.4 Retention Configuration

Reuse the existing retention options and align engine snapshot retention with projection retention.
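For symmetry with 6.1 through 6.3, a sketch of what the retention options could look like. The key names below are hypothetical, since the actual shape comes from the existing retention options being reused; the equal day counts illustrate aligning snapshot retention with projection retention.

```json
{
  "WorkflowRetention": {
    "ProjectionRetentionDays": 90,
    "SnapshotRetentionDays": 90,
    "RunIntervalMinutes": 60
  }
}
```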

7. Operational Diagnostics

The engine must make the following available in logs and metrics:

  • workflow instance id
  • workflow name
  • workflow version
  • business reference key
  • signal id
  • signal type
  • waiting token
  • state version
  • node id
  • execution duration
  • dequeue latency
  • retry count
  • dead-letter count
  • transport name and step id on failure

8. Metrics

Recommended metrics:

  • workflows started
  • workflows completed
  • workflows failed
  • tasks activated
  • task completions processed
  • AQ signals dequeued
  • AQ signal failures
  • AQ DLQ count
  • timer signals fired
  • stale signals ignored
  • execution conflict retries
  • average execution slice duration
  • active waiting instances by waiting kind

9. Logging

Logging should distinguish between:

  • product logs
  • engine execution logs
  • signal bus logs
  • scheduler logs
  • transport logs

The engine should log structured fields, not only free text.

Minimum structured fields:

  • workflowInstanceId
  • workflowName
  • workflowVersion
  • businessReferenceKey
  • signalId
  • signalType
  • nodeId
  • stateVersion
  • waitingToken
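As an illustration (all field values hypothetical), one structured log event carrying the minimum fields might look like:

```json
{
  "message": "Signal processed",
  "workflowInstanceId": "0b6e2a1c-3f44-4a8b-9c1d-2e5f6a7b8c9d",
  "workflowName": "ReviewPolicyOpenForChange",
  "workflowVersion": 3,
  "businessReferenceKey": "POL-100045",
  "signalId": "sig-7f21",
  "signalType": "TaskCompleted",
  "nodeId": "workflow-node-1",
  "stateVersion": 12,
  "waitingToken": "wt-4411"
}
```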

10. Failure Handling Policy

10.1 Recoverable Failures

Examples:

  • transient transport failure
  • transient AQ dequeue failure
  • optimistic concurrency conflict

Handling:

  • retry execution
  • reschedule if policy exists
  • keep workflow consistent
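The optimistic-concurrency case can be sketched as a bounded retry loop. This is a minimal Python stand-in, not the engine's real store API; RuntimeStateStore and the exception name are illustrative.

```python
class ConcurrencyConflict(Exception):
    pass

class RuntimeStateStore:
    """In-memory stand-in for the versioned runtime state table."""
    def __init__(self):
        self.state_version = 0
        self.state = None

    def save(self, expected_version, new_state):
        if expected_version != self.state_version:
            raise ConcurrencyConflict()
        self.state_version += 1
        self.state = new_state

def execute_with_retry(store, compute_next_state, max_attempts=3):
    """Retry an execution slice on conflict; never loop forever."""
    for attempt in range(1, max_attempts + 1):
        expected = store.state_version
        try:
            store.save(expected, compute_next_state(expected))
            return attempt  # number of attempts used
        except ConcurrencyConflict:
            continue  # re-read the version and retry keeps the workflow consistent
    raise RuntimeError("retry budget exhausted; escalate rather than loop")
```

The bounded attempt count matters: an unbounded loop would mask a real fault as load.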

10.2 Non-Recoverable Failures

Examples:

  • invalid snapshot format
  • missing definition for existing instance
  • unresolvable signal payload

Handling:

  • move signal to DLQ
  • mark instance runtime state as failed or blocked
  • expose failure through inspection
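The recoverable / non-recoverable split amounts to a classification step in front of the handling policy. A minimal sketch, with illustrative failure-kind names:

```python
# Illustrative failure-kind labels; the real engine's exception taxonomy differs.
RECOVERABLE = {"TransportTimeout", "AqDequeueTimeout", "ConcurrencyConflict"}
NON_RECOVERABLE = {"InvalidSnapshotFormat", "MissingDefinition",
                   "UnresolvableSignalPayload"}

def handle_failure(kind, signal, dlq, mark_failed):
    if kind in RECOVERABLE:
        return "retry"              # retry / reschedule per policy
    if kind in NON_RECOVERABLE:
        dlq.append(signal)          # move signal to DLQ
        mark_failed(signal)         # mark instance runtime state as failed
        return "dead-lettered"
    raise ValueError(f"unclassified failure kind: {kind}")
```

An unclassified kind raising loudly is deliberate: a failure the policy cannot name should surface, not silently retry.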

11. Security Boundaries

11.1 Service API Boundary

User-facing authorization stays where it currently belongs:

  • endpoint layer
  • task authorization service

11.2 Engine Boundary

The engine should trust only:

  • validated workflow definitions
  • validated task completion requests from the service layer
  • authenticated transport adapters

11.3 AQ Boundary

AQ queues should be scoped to the workflow schema and should not be shared with unrelated services.

12. Testing Strategy

12.1 Unit Tests

Test:

  • canonical interpreter step behavior
  • resume pointer serialization
  • waiting token behavior
  • optimistic concurrency conflict handling
  • AQ envelope serialization

12.2 Component Tests

Test:

  • start flow to task activation
  • task completion to next task
  • timer registration to delayed resume
  • subworkflow completion to parent resume
  • transport failure to retry or failure branch

12.3 Integration Tests

Test with real Oracle and AQ:

  • signal enqueue/dequeue
  • delayed message handling
  • restart recovery
  • multi-node duplicate delivery safety

12.4 Oracle And AQ Reliability Tests

The engine should have a dedicated Oracle-focused integration suite, not just generic workflow integration coverage.

The Oracle suite should be split into four layers.

12.4.1 Oracle Transport Reality Tests

These tests prove the raw AQ behavior that the engine depends on:

  • immediate enqueue followed by blocking dequeue
  • delayed enqueue followed by eventual dequeue
  • enqueue with transaction commit succeeds
  • enqueue with transaction rollback disappears
  • dequeue with OnCommit plus rollback causes redelivery
  • dequeue with OnCommit plus commit removes message
  • dead-letter enqueue and replay path
  • browse path against dead-letter queue
  • queue creation and teardown in ephemeral schemas or ephemeral queue names

These tests should stay small and synthetic so transport failures are easy to isolate.

12.4.2 Engine Persistence And Delivery Coupling Tests

These tests prove that Oracle state and Oracle AQ stay consistent together:

  • runtime state update plus AQ enqueue committed atomically
  • runtime state update rolled back means no visible signal
  • projection update plus AQ enqueue committed atomically
  • duplicate AQ delivery with the same waiting token is harmless
  • stale expected version plus valid waiting token is ignored safely
  • stale timer message after reschedule becomes a no-op

These are the most important correctness tests for the run-to-wait architecture.
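Two of these invariants (duplicate delivery is harmless, stale version is ignored) reduce to the resume path being idempotent on the waiting token. A minimal Python sketch, with illustrative names:

```python
class Instance:
    def __init__(self):
        self.state_version = 1
        self.waiting_token = "tok-1"
        self.resumed = 0

def apply_signal(instance, waiting_token, expected_version):
    if waiting_token != instance.waiting_token:
        return "ignored-stale-token"    # duplicate or reassigned wait: no-op
    if expected_version != instance.state_version:
        return "ignored-stale-version"  # concurrent update already happened
    instance.waiting_token = None       # consume the token: the next duplicate is a no-op
    instance.state_version += 1
    instance.resumed += 1
    return "resumed"
```

Consuming the token inside the same guarded step is what makes AQ's at-least-once delivery safe for the run-to-wait model.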

12.4.3 Restart And Recovery Tests

These tests should simulate realistic restart conditions:

  • app restart with immediate signal already in queue
  • app restart with delayed signal not yet due
  • app restart after delayed signal becomes due
  • app restart after dequeue but before commit
  • Oracle container restart while waiting instances exist
  • Oracle restart while delayed messages are still pending
  • service restart with dead-letter backlog present

These tests should prove that no polling is needed to recover normal execution.

12.4.4 Oracle Load And Timing Tests

These tests should focus on timing variance and backlog behavior:

  • cold-container delayed message latency envelope
  • many delayed messages becoming due in the same second
  • burst of immediate signals after service startup
  • mixed immediate and delayed signals on one queue
  • long-running dequeue loops with empty polls between real messages
  • bounded backlog drain time for representative queue depth

The goal is not only correctness but also establishing what timing variance is normal on local and CI Oracle containers.

The detailed workload model, KPI set, harness structure, and test-tier split should live in 08-load-and-performance-plan.md.

12.5 Bulstrad Product-Parity Tests

Synthetic engine tests are necessary but not sufficient.

The main parity suite should use real Bulstrad declarative workflows with scripted downstream transport responses. The purpose is to prove that the Serdica engine executes product workflows, not just toy workflows.

Recommended first-wave Bulstrad coverage:

  • transport-heavy completion flows such as AssistantPrintInsisDocuments
  • approval/review chains such as ReviewPolicyOpenForChange
  • parent-child workflow chains such as OpenForChangePolicy
  • cancellation flows such as AnnexCancellation
  • policy end-state flows such as AssistantPolicyCancellation
  • reinstate or reopen flows such as AssistantPolicyReinstate
  • shared-policy integration flows such as InsisIntegrationNew
  • shared-policy confirmation and conversion flows such as QuotationConfirm
  • failure-tolerant cleanup flows such as QuoteOrAplCancel

Each Bulstrad test should assert:

  • task sequence
  • task payload shape
  • transport invocation order
  • final workflow state
  • runtime version progression
  • absence of leaked subworkflow frames or stale wait metadata

Current Oracle-backed parity coverage already includes these families and uses restarted providers plus real Oracle workflow tables, not synthetic in-memory state.

12.6 Chaos And Fault-Injection Tests

The engine should also have a deterministic chaos suite.

Recommended failure points:

  • before snapshot save
  • after snapshot save but before projection save
  • after projection save but before AQ enqueue
  • after AQ enqueue but before commit
  • after dequeue but before signal processing completes
  • after signal processing but before lease commit

Recommended assertions:

  • no duplicate open tasks
  • no lost committed signal
  • no unbounded retry loop
  • no invalid version rollback
  • no stuck instance without an explainable wait reason
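The harness shape can be sketched deterministically: inject a crash at a named pipeline point, then assert the invariants. The pipeline steps and names below are illustrative stand-ins for the real persistence sequence.

```python
class InjectedCrash(Exception):
    pass

def run_slice(store, enqueue, crash_at=None):
    """Stage all writes, crash deterministically if asked, commit atomically."""
    staged = {"snapshot": None, "projection": None, "signal": None}
    for step in ["snapshot_save", "projection_save", "aq_enqueue", "commit"]:
        if step == crash_at:
            raise InjectedCrash(step)
        if step == "snapshot_save":
            staged["snapshot"] = {"version": store["version"] + 1}
        elif step == "projection_save":
            staged["projection"] = {"open_tasks": 1}
        elif step == "aq_enqueue":
            staged["signal"] = {"token": "tok-1"}
        elif step == "commit":
            store.update(snapshot=staged["snapshot"],
                         projection=staged["projection"])
            enqueue(staged["signal"])
    return store

def invariants_hold(store, queue):
    # No partial commit: either every effect landed, or none did.
    all_in = store.get("snapshot") and store.get("projection") and queue
    none_in = not store.get("snapshot") and not store.get("projection") and not queue
    return bool(all_in) or bool(none_in)
```

Because the crash point is a named parameter rather than random, every failure point in the list above becomes one reproducible test case.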

12.7 Parity Tests

The most important tests compare outcomes against the current declarative workflow expectations:

  • same task sequence
  • same state changes
  • same business reference results
  • same transport payload shaping

13. Supportability

Operations staff should be able to answer:

  • what is this instance waiting for
  • when was it last executed
  • what signal is due next
  • why was a signal ignored
  • why did a signal go to DLQ
  • which step failed

This is why runtime state inspection and structured failure metadata are mandatory.