Add StellaOps.Workflow engine: 14 libraries, WebService, 8 test projects

Extract product-agnostic workflow engine from Ablera.Serdica.Workflow into standalone StellaOps.Workflow.* libraries targeting net10.0.

Libraries (14):
- Contracts, Abstractions (compiler, decompiler, expression runtime)
- Engine (execution, signaling, scheduling, projections, hosted services)
- ElkSharp (generic graph layout algorithm)
- Renderer.ElkSharp, Renderer.ElkJs, Renderer.Msagl, Renderer.Svg
- Signaling.Redis, Signaling.OracleAq
- DataStore.MongoDB, DataStore.PostgreSQL, DataStore.Oracle

WebService: ASP.NET Core Minimal API with 22 endpoints

Tests (8 projects, 109 tests pass):
- Engine.Tests (105 pass), WebService.Tests (4 E2E pass)
- Renderer.Tests, DataStore.MongoDB/Oracle/PostgreSQL.Tests
- Signaling.Redis.Tests, IntegrationTests.Shared

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/workflow/engine/08-load-and-performance-plan.md (new file, 544 lines)
# 08. Load And Performance Plan

## Purpose

This document defines how the Serdica workflow engine should be load-tested, performance-characterized, and capacity-sized once functional parity is in place.

The goal is not only to prove that the engine is correct under load, but also to answer these product and platform questions:

- how many workflow starts, task completions, and signal resumes can one node sustain
- how quickly does backlog drain after a restart or outage
- how much timing variance is normal for Oracle AQ on local Docker, CI, and shared environments
- which workloads are Oracle-bound, AQ-bound, or engine-bound
- which scenarios are safe to gate in PR and which belong in nightly or explicit soak runs

## Principles

The performance plan follows these rules:

- correctness comes first; a fast but lossy engine result is a failed run
- performance tests must be split by intent: smoke, characterization, stress, soak, and failure-under-load
- transport-only tests and full workflow tests must both exist; they answer different questions
- synthetic workflows are required for stable measurement
- representative Bulstrad workflows are required for product confidence
- PR gates should use coarse, stable envelopes
- nightly and explicit runs should record and compare detailed metrics
- Oracle and AQ behavior must be measured directly, not inferred from app logs alone

## What Must Be Measured

### Correctness Under Load

Every load run should capture:

- total workflows started
- total tasks activated
- total tasks completed
- total signals published
- total signals processed
- total signals ignored as stale or duplicate
- total dead-lettered signals
- total runtime concurrency conflicts
- total failed runs
- total stuck instances at end of run

Correctness invariants:

- no lost committed signal
- no duplicate open task for the same logical wait
- no orphan subworkflow frame
- no runtime state row left without a valid explainable wait reason
- no queue backlog remaining after a successful drain phase unless the scenario intentionally leaves poison messages in DLQ

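The counter set above supports simple conservation checks at the end of a run. A minimal sketch in Python (the harness itself is .NET; all names here are illustrative, not the real API):

```python
from dataclasses import dataclass

@dataclass
class RunCounters:
    # Totals captured at the end of a load run (illustrative field names).
    signals_published: int
    signals_processed: int
    signals_ignored: int       # stale or duplicate
    signals_dead_lettered: int
    stuck_instances: int
    failed_runs: int

def check_invariants(c: RunCounters, poison_expected: bool = False) -> list[str]:
    """Return a list of violated invariants; an empty list means a clean run."""
    violations = []
    # Every published signal must be accounted for: processed, ignored, or dead-lettered.
    if c.signals_processed + c.signals_ignored + c.signals_dead_lettered != c.signals_published:
        violations.append("lost committed signal")
    # Dead letters are only acceptable when the scenario intentionally plants poison messages.
    if c.signals_dead_lettered > 0 and not poison_expected:
        violations.append("unexpected dead letters")
    if c.stuck_instances > 0:
        violations.append("stuck instances at end of run")
    if c.failed_runs > 0:
        violations.append("failed runs")
    return violations
```

Whatever form the real checks take, they should run inside the harness so a lossy run fails loudly rather than passing on latency numbers alone.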
### Latency

The engine should measure at least:

- start-to-first-task latency
- start-to-completion latency
- task-complete-to-next-task latency
- signal-publish-to-task-visible latency
- timer-due-to-resume latency
- delayed-message lateness relative to requested due time
- backlog-drain completion time
- restart-to-first-processed-signal time

These should be recorded as:

- average
- p50
- p95
- p99
- max

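The summary set above is cheap to compute from raw per-event samples; a sketch using nearest-rank percentiles (one reasonable convention, assumed here rather than mandated by the plan):

```python
import math

def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize latency samples as the avg/p50/p95/p99/max set above.

    Uses nearest-rank percentiles: the smallest sample with at least
    p percent of all samples at or below it.
    """
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    n = len(ordered)

    def pct(p: float) -> float:
        rank = max(1, math.ceil(p / 100.0 * n))
        return ordered[rank - 1]

    return {
        "avg": sum(ordered) / n,
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "max": ordered[-1],
    }
```

Recording raw samples alongside the summary keeps later re-analysis possible when a percentile looks suspicious.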
### Throughput

The engine should measure:

- workflows started per second
- task completions per second
- signals published per second
- signals processed per second
- backlog drain rate in signals per second
- completed end-to-end business workflows per minute

### Saturation

The engine should measure:

- app process CPU
- app process private memory and working set
- Oracle container CPU and memory when running locally
- queue depth over time
- active waiting instances over time
- dead-letter depth over time
- runtime state update conflicts over time
- open task count over time

### Oracle-Side Signals

If the environment permits access, also collect:

- AQ queue depth before, during, and after load
- queue-table growth during sustained runs
- visible dequeue lag
- Oracle session count for the test service
- lock or wait spikes on workflow tables
- transaction duration for mutation transactions

If the environment does not permit these views, fall back to:

- app-side timing
- browse counts from AQ
- workflow table row counts
- signal pump telemetry snapshots

## Workload Model

The load plan should be split into four workload families.

### 1. Transport Microbenchmarks

These isolate Oracle AQ behavior from workflow logic.

Use them to answer:

- how fast can AQ accept immediate messages
- how fast can AQ release delayed messages
- what is the drain rate for mixed backlogs
- how much delayed-message jitter is normal

Core scenarios:

- burst immediate enqueue and drain
- burst delayed enqueue with same due second
- mixed immediate and delayed enqueue on one queue
- dequeue rollback redelivery under sustained load
- dead-letter and replay backlog
- delayed backlog surviving Oracle restart

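Delayed-message jitter from the scenarios above can be quantified by comparing actual release times against requested due times. A hedged sketch (timestamp sources and field names are illustrative):

```python
def delayed_lateness_ms(requested_due: list[float], released_at: list[float]) -> dict[str, float]:
    """Per-message lateness of delayed AQ messages, in milliseconds.

    requested_due[i] and released_at[i] are wall-clock timestamps in seconds
    for the same message; negative lateness would mean an early release.
    """
    if len(requested_due) != len(released_at):
        raise ValueError("mismatched sample lists")
    lateness = [(r - d) * 1000.0 for d, r in zip(requested_due, released_at)]
    ordered = sorted(lateness)
    n = len(ordered)
    return {
        "min": ordered[0],
        "median": ordered[n // 2],
        "max": ordered[-1],
        # Jitter here means the spread of lateness, not the mean delay itself.
        "jitter": ordered[-1] - ordered[0],
    }
```

Capturing this per environment (local Docker, CI, shared) is what makes the "how much jitter is normal" question answerable with data instead of folklore.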
### 2. Synthetic Engine Workloads

These isolate the runtime from business-specific transport noise.

Recommended synthetic workflow types:

- start-to-complete with no task
- start-to-task with one human task
- signal-wait then task activation
- timer-wait then task activation
- continue-with dispatcher chain
- parent-child subworkflow chain

Use them to answer:

- raw start throughput
- raw resume throughput
- timer-due drain rate
- subworkflow coordination cost
- task activation/update cost

### 3. Representative Bulstrad Workloads

These prove that realistic product workflows behave well under load.

The first performance wave should use workflows that are already functionally covered in the Oracle suite:

- `AssistantPrintInsisDocuments`
- `OpenForChangePolicy`
- `ReviewPolicyOpenForChange`
- `AssistantAddAnnex`
- `AnnexCancellation`
- `AssistantPolicyCancellation`
- `AssistantPolicyReinstate`
- `InsisIntegrationNew`
- `QuotationConfirm`
- `QuoteOrAplCancel`

Use them to answer:

- how the engine behaves with realistic transport payload shaping
- how nested child workflows affect latency
- how multi-step review chains behave during backlog drain
- how short utility flows compare to long policy chains

### 4. Failure-Under-Load Workloads

These are not optional. A production engine must be tested while busy.

Scenarios:

- provider restart during active signal drain
- Oracle restart while delayed backlog exists
- dead-letter replay while new live signals continue to arrive
- duplicate signal storm against the same waiting instance set
- one worker repeatedly failing while another healthy worker continues
- scheduled backlog plus external-signal backlog mixed together

Use them to answer:

- whether recovery stays bounded
- whether backlog drain remains monotonic
- whether duplicate-delivery protections still hold under pressure
- whether DLQ replay can safely coexist with live traffic

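Whether backlog drain "remains monotonic" can be checked directly from periodic queue-depth samples. A minimal sketch, assuming depth sampling already exists (the sampling mechanism itself is not shown):

```python
def drain_is_monotonic(depth_samples: list[int], tolerance: int = 0) -> bool:
    """True if queue depth never rises by more than `tolerance` between samples.

    depth_samples are periodic queue-depth readings taken during the drain
    phase, after new-traffic injection has stopped. A small tolerance absorbs
    sampling race noise without hiding real backlog growth.
    """
    return all(b <= a + tolerance for a, b in zip(depth_samples, depth_samples[1:]))
```

For failure-under-load runs, this check should be applied only after the injected fault clears, since depth legitimately rises while the fault is active.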
## Test Tiers

Performance testing should not be a single bucket.

### Tier 1: PR Smoke

Purpose:

- catch catastrophic regressions quickly

Characteristics:

- small datasets
- short run time
- deterministic scenarios
- hard pass/fail envelopes

Recommended scope:

- one AQ immediate burst
- one AQ delayed backlog burst
- one synthetic signal-resume scenario
- one short Bulstrad business flow

Target duration:

- under 5 minutes total

Gating style:

- zero correctness failures
- no DLQ unless explicitly expected
- coarse latency ceilings only

### Tier 2: Nightly Characterization

Purpose:

- measure trends and detect meaningful performance regression

Characteristics:

- moderate dataset
- multiple concurrency levels
- metrics persisted as artifacts

Recommended scope:

- full Oracle transport matrix
- synthetic engine workloads at 1, 4, 8, and 16-way concurrency
- 3 to 5 representative Bulstrad families
- restart and DLQ replay under moderate backlog

Target duration:

- 15 to 45 minutes

Gating style:

- correctness failures fail the run
- latency/throughput compare against baseline with tolerance

### Tier 3: Weekly Soak

Purpose:

- detect leaks, drift, and long-tail timing issues

Characteristics:

- long-running mixed workload
- periodic restarts or controlled faults
- queue depth and runtime-state stability tracking

Recommended scope:

- 30 to 120 minutes of mixed load
- immediate, delayed, and replay traffic mixed together
- repeated provider restarts
- one Oracle restart in the middle of the run

Gating style:

- no unbounded backlog growth
- no stuck instances
- no memory growth trend outside a defined envelope

### Tier 4: Explicit Capacity And Breakpoint Runs

Purpose:

- learn real limits before production sizing decisions

Characteristics:

- not part of normal CI
- intentionally pushes throughput until latency or failure thresholds break

Recommended scope:

- ramp concurrency upward until queue lag or DB pressure exceeds target
- test one-node and multi-node configurations
- record saturation points, not just pass/fail

Deliverable:

- capacity report with recommended node counts and operational envelopes

## Scenario Matrix

The initial scenario matrix should look like this.

### Oracle AQ Transport

- immediate burst: 100, 500, 1000 messages
- delayed burst: 50, 100, 250 messages due in the same second
- mixed burst: 70 percent immediate, 30 percent delayed
- redelivery burst: 25 messages rolled back once then committed
- DLQ burst: 25 poison messages then replay

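The mixed burst above needs a deterministic message schedule so that repeated runs are comparable. A sketch of one way to build it (the counts and ratio come from the matrix; everything else is illustrative):

```python
import random

def build_mixed_burst(total: int, immediate_ratio: float = 0.7, seed: int = 42) -> list[dict]:
    """Build a shuffled schedule of immediate and delayed messages.

    A fixed seed keeps the interleaving identical across runs, so measured
    differences come from the system under test, not from the workload shape.
    """
    immediate_count = round(total * immediate_ratio)
    schedule = (
        [{"kind": "immediate", "delay_s": 0} for _ in range(immediate_count)]
        + [{"kind": "delayed", "delay_s": 5} for _ in range(total - immediate_count)]
    )
    random.Random(seed).shuffle(schedule)
    return schedule
```

The same generator can serve the pure-immediate and pure-delayed bursts by setting the ratio to 1.0 or 0.0.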
### Synthetic Engine

- start-to-task: 50, 200, 500 workflow starts
- task-complete-to-next-task: 50, 200 completions
- signal-wait-resume: 50, 200, 500 waiting instances resumed concurrently
- timer-wait-resume: 50, 200 due timers
- subworkflow chain: 25, 100 parent-child chains

### Bulstrad Business

- short business flow: `QuoteOrAplCancel`
- medium transport flow: `InsisIntegrationNew`
- child-workflow flow: `QuotationConfirm`
- long review chain: `OpenForChangePolicy`
- print flow: `AssistantPrintInsisDocuments`
- cancellation flow: `AnnexCancellation`

### Failure Under Load

- 100 waiting instances, provider restart during drain
- 100 delayed messages, Oracle restart before due time
- 50 poison signals plus live replay traffic
- duplicate external signal storm against 50 waiting instances
- mixed task completions and signal resumes on the same service instance set

## Concurrency Steps

Use explicit concurrency ladders instead of one arbitrary load value.

Recommended first ladder:

- 1
- 4
- 8
- 16
- 32

Use different ladders if the environment is too small, but always record:

- node count
- worker concurrency
- queue backlog size
- workflow count
- message mix

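A ladder run is the same scenario repeated at each concurrency step, with the run parameters recorded alongside the results. A thread-pool sketch of the driver shape (the real driver would invoke the engine, not an arbitrary callable):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_ladder(scenario, work_items: int, steps=(1, 4, 8, 16, 32)) -> list[dict]:
    """Run `scenario(item_index)` at each concurrency step and record throughput."""
    results = []
    for workers in steps:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Drain the map so all work completes before the clock stops.
            list(pool.map(scenario, range(work_items)))
        elapsed = time.perf_counter() - start
        results.append({
            "concurrency": workers,        # worker concurrency, per the record list above
            "work_items": work_items,
            "elapsed_s": elapsed,
            "throughput_per_s": work_items / elapsed,
        })
    return results
```

Keeping the step list explicit in the result records is what lets later runs on a different ladder still be compared point by point.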
## Metrics Collection Design

The harness should persist results for every performance run.

Each result set should include:

- scenario name
- git commit or working tree marker
- test timestamp
- environment label
- node count
- concurrency level
- workflow count
- signal count
- Oracle queue names used
- measured latency summary
- throughput summary
- correctness summary
- process resource summary
- optional Oracle observations

Recommended output format:

- JSON artifact for machines
- short markdown summary for humans

Recommended location:

- `TestResults/workflow-performance/`

## Baseline Strategy

Do not hard-code aggressive latency thresholds before collecting stable data.

Use this sequence:

1. characterization phase

   Run each scenario several times on local Docker and CI Oracle.

2. baseline phase

   Record stable p50, p95, p99, throughput, and drain-rate envelopes.

3. gating phase

   Add coarse PR thresholds and tighter nightly regression detection.

PR thresholds should be:

- intentionally forgiving
- correctness-first
- designed to catch major regressions only

Nightly thresholds should be:

- baseline-relative
- environment-specific if necessary
- reviewed whenever Oracle container images or CI hardware change

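Baseline-relative gating reduces to comparing a run's summary against the recorded envelope with a tolerance. A minimal sketch (the 20 percent tolerance is a placeholder to be replaced by measured envelopes from the characterization phase):

```python
def check_against_baseline(current: dict, baseline: dict, tolerance: float = 0.20) -> list[str]:
    """Flag metrics that regressed beyond `tolerance` relative to the baseline.

    Latency metrics regress by going up; throughput regresses by going down.
    Returns human-readable regression descriptions; empty means within envelope.
    """
    regressions = []
    for key in ("p50", "p95", "p99"):
        if current[key] > baseline[key] * (1 + tolerance):
            regressions.append(f"{key} latency regressed: {current[key]} vs baseline {baseline[key]}")
    if current["throughput_per_s"] < baseline["throughput_per_s"] * (1 - tolerance):
        regressions.append("throughput regressed")
    return regressions
```

PR gates would run this with a large tolerance; nightly runs with a tighter, environment-specific one.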
## Harness Design

The load harness should be separate from the normal fast integration suite.

Recommended structure:

- keep correctness-focused Oracle AQ tests in the current integration project
- add categorized performance tests with explicit categories such as:
  - `WorkflowPerfLatency`
  - `WorkflowPerfThroughput`
  - `WorkflowPerfSmoke`
  - `WorkflowPerfNightly`
  - `WorkflowPerfSoak`
  - `WorkflowPerfCapacity`
- keep scenario builders reusable so the same workflows and transports can be used in correctness and performance runs

The harness should include:

- scenario driver
- result collector
- metric aggregator
- optional Oracle observation collector
- artifact writer
- explicit phase-latency capture for start, signal publish, and signal-to-completion on the synthetic signal round-trip workload

## Multi-Backend Expansion Rules

Once Oracle is the validated reference baseline, PostgreSQL and MongoDB must adopt the same load and performance structure instead of inventing backend-specific suites first.

Required rules:

- keep one shared scenario catalog for Oracle, PostgreSQL, and MongoDB
- compare backends first on normalized workflow metrics, not backend-native counters
- keep backend-native metrics as appendices, not as the headline result
- use the same tier names and artifact schema across all backends
- keep the same curated Bulstrad workload pack across all backends unless a workflow is backend-blocked by a real functional defect

The shared artifact set should ultimately include:

- `10-oracle-performance-baseline-<date>.md/.json`
- `11-postgres-performance-baseline-<date>.md/.json`
- `12-mongo-performance-baseline-<date>.md/.json`
- `13-backend-comparison-<date>.md/.json`

The shared normalized metrics are:

- serial end-to-end latency
- start-to-first-task latency
- signal-publish-to-visible-resume latency
- steady-state throughput
- capacity ladder at `c1`, `c4`, `c8`, and `c16`
- backlog drain time
- failures
- dead letters
- runtime conflicts
- stuck instances

Backend-native appendices should include:

- Oracle:
  - AQ browse depth
  - `V$SYSSTAT` deltas
  - `V$SYS_TIME_MODEL` deltas
  - top wait deltas
- PostgreSQL:
  - queue-table depth
  - `pg_stat_database`
  - `pg_stat_statements`
  - lock and wait observations
  - WAL pressure observations
- MongoDB:
  - signal collection depth
  - `serverStatus` counters
  - transaction counters
  - change-stream wake observations
  - lock percentage observations

## Oracle-Specific Observation Plan

For Oracle-backed runs, observe both the engine and the database.

At minimum, record:

- AQ browse depth before, during, and after the run
- count of runtime-state rows touched
- count of task and task-event rows created
- number of dead-lettered signals
- duplicate/stale resume ignore count

If the environment allows deeper Oracle access, also record:

- session count for the service user
- top wait classes during the run
- lock waits on workflow tables
- statement time for key mutation queries

## Exit Criteria

The load and performance work is complete when:

- PR smoke scenarios are stable and cheap enough to run continuously
- nightly characterization produces persisted metrics and a useful regression signal
- at least one weekly soak run is stable without correctness drift
- representative Bulstrad families have measured latency and throughput envelopes
- Oracle restart, provider restart, DLQ replay, and duplicate-delivery scenarios are all characterized under load
- the team can state a first production sizing recommendation for one-node and multi-node deployments

## Next Sprint Shape

This plan maps naturally to a dedicated sprint focused on:

- performance harness infrastructure
- synthetic scenario library
- representative Bulstrad workload runner
- metrics artifact generation
- baseline capture
- first capacity report