Add StellaOps.Workflow engine: 14 libraries, WebService, 8 test projects

Extract product-agnostic workflow engine from Ablera.Serdica.Workflow into standalone StellaOps.Workflow.* libraries targeting net10.0.

Libraries (14):
- Contracts, Abstractions (compiler, decompiler, expression runtime)
- Engine (execution, signaling, scheduling, projections, hosted services)
- ElkSharp (generic graph layout algorithm)
- Renderer.ElkSharp, Renderer.ElkJs, Renderer.Msagl, Renderer.Svg
- Signaling.Redis, Signaling.OracleAq
- DataStore.MongoDB, DataStore.PostgreSQL, DataStore.Oracle

WebService: ASP.NET Core Minimal API with 22 endpoints

Tests (8 projects, 109 tests pass):
- Engine.Tests (105 pass), WebService.Tests (4 E2E pass)
- Renderer.Tests, DataStore.MongoDB/Oracle/PostgreSQL.Tests
- Signaling.Redis.Tests, IntegrationTests.Shared

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/workflow/engine/08-load-and-performance-plan.md (new file, 544 lines)
# 08. Load And Performance Plan

## Purpose

This document defines how the Serdica workflow engine should be load-tested, performance-characterized, and capacity-sized once functional parity is in place.

The goal is not only to prove that the engine is correct under load, but also to answer these product and platform questions:

- how many workflow starts, task completions, and signal resumes can one node sustain
- how quickly does backlog drain after a restart or outage
- how much timing variance is normal for Oracle AQ on local Docker, CI, and shared environments
- which workloads are Oracle-bound, AQ-bound, or engine-bound
- which scenarios are safe to gate in PR and which belong in nightly or explicit soak runs

## Principles

The performance plan follows these rules:

- correctness comes first; a fast but lossy engine result is a failed run
- performance tests must be split by intent: smoke, characterization, stress, soak, and failure-under-load
- transport-only tests and full workflow tests must both exist; they answer different questions
- synthetic workflows are required for stable measurement
- representative Bulstrad workflows are required for product confidence
- PR gates should use coarse, stable envelopes
- nightly and explicit runs should record and compare detailed metrics
- Oracle and AQ behavior must be measured directly, not inferred from app logs alone

## What Must Be Measured

### Correctness Under Load

Every load run should capture:

- total workflows started
- total tasks activated
- total tasks completed
- total signals published
- total signals processed
- total signals ignored as stale or duplicate
- total dead-lettered signals
- total runtime concurrency conflicts
- total failed runs
- total stuck instances at end of run

Correctness invariants:

- no lost committed signal
- no duplicate open task for the same logical wait
- no orphan subworkflow frame
- no runtime state row left without a valid explainable wait reason
- no queue backlog remaining after a successful drain phase unless the scenario intentionally leaves poison messages in DLQ

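The counter set above supports simple conservation checks at the end of a run. A minimal sketch in Python (the harness itself is .NET; all names here are illustrative, not the real API):

```python
from dataclasses import dataclass

@dataclass
class RunCounters:
    # Totals captured at the end of a load run (illustrative field names).
    signals_published: int
    signals_processed: int
    signals_ignored: int       # stale or duplicate
    signals_dead_lettered: int
    stuck_instances: int
    failed_runs: int

def check_invariants(c: RunCounters, poison_expected: bool = False) -> list[str]:
    """Return a list of violated invariants; an empty list means a clean run."""
    violations = []
    # Every published signal must be accounted for: processed, ignored, or dead-lettered.
    if c.signals_processed + c.signals_ignored + c.signals_dead_lettered != c.signals_published:
        violations.append("lost committed signal")
    # Dead letters are only acceptable when the scenario intentionally plants poison messages.
    if c.signals_dead_lettered > 0 and not poison_expected:
        violations.append("unexpected dead letters")
    if c.stuck_instances > 0:
        violations.append("stuck instances at end of run")
    if c.failed_runs > 0:
        violations.append("failed runs")
    return violations
```

Whatever form the real checks take, they should run inside the harness so a lossy run fails loudly rather than passing on latency numbers alone.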
### Latency

The engine should measure at least:

- start-to-first-task latency
- start-to-completion latency
- task-complete-to-next-task latency
- signal-publish-to-task-visible latency
- timer-due-to-resume latency
- delayed-message lateness relative to requested due time
- backlog-drain completion time
- restart-to-first-processed-signal time

These should be recorded as:

- average
- p50
- p95
- p99
- max

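The summary set above is cheap to compute from raw per-event samples; a sketch using nearest-rank percentiles (one reasonable convention, assumed here rather than mandated by the plan):

```python
import math

def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize latency samples as the avg/p50/p95/p99/max set above.

    Uses nearest-rank percentiles: the smallest sample with at least
    p percent of all samples at or below it.
    """
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    n = len(ordered)

    def pct(p: float) -> float:
        rank = max(1, math.ceil(p / 100.0 * n))
        return ordered[rank - 1]

    return {
        "avg": sum(ordered) / n,
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "max": ordered[-1],
    }
```

Recording raw samples alongside the summary keeps later re-analysis possible when a percentile looks suspicious.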
### Throughput

The engine should measure:

- workflows started per second
- task completions per second
- signals published per second
- signals processed per second
- backlog drain rate in signals per second
- completed end-to-end business workflows per minute

### Saturation

The engine should measure:

- app process CPU
- app process private memory and working set
- Oracle container CPU and memory when running locally
- queue depth over time
- active waiting instances over time
- dead-letter depth over time
- runtime state update conflicts over time
- open task count over time

### Oracle-Side Signals

If the environment permits access, also collect:

- AQ queue depth before, during, and after load
- queue-table growth during sustained runs
- visible dequeue lag
- Oracle session count for the test service
- lock or wait spikes on workflow tables
- transaction duration for mutation transactions

If the environment does not permit these views, fall back to:

- app-side timing
- browse counts from AQ
- workflow table row counts
- signal pump telemetry snapshots

## Workload Model

The load plan should be split into four workload families.

### 1. Transport Microbenchmarks

These isolate Oracle AQ behavior from workflow logic.

Use them to answer:

- how fast can AQ accept immediate messages
- how fast can AQ release delayed messages
- what is the drain rate for mixed backlogs
- how much delayed-message jitter is normal

Core scenarios:

- burst immediate enqueue and drain
- burst delayed enqueue with same due second
- mixed immediate and delayed enqueue on one queue
- dequeue rollback redelivery under sustained load
- dead-letter and replay backlog
- delayed backlog surviving Oracle restart

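Delayed-message jitter from the scenarios above can be quantified by comparing actual release times against requested due times. A hedged sketch (timestamp sources and field names are illustrative):

```python
def delayed_lateness_ms(requested_due: list[float], released_at: list[float]) -> dict[str, float]:
    """Per-message lateness of delayed AQ messages, in milliseconds.

    requested_due[i] and released_at[i] are wall-clock timestamps in seconds
    for the same message; negative lateness would mean an early release.
    """
    if len(requested_due) != len(released_at):
        raise ValueError("mismatched sample lists")
    lateness = [(r - d) * 1000.0 for d, r in zip(requested_due, released_at)]
    ordered = sorted(lateness)
    n = len(ordered)
    return {
        "min": ordered[0],
        "median": ordered[n // 2],
        "max": ordered[-1],
        # Jitter here means the spread of lateness, not the mean delay itself.
        "jitter": ordered[-1] - ordered[0],
    }
```

Capturing this per environment (local Docker, CI, shared) is what makes the "how much jitter is normal" question answerable with data instead of folklore.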
### 2. Synthetic Engine Workloads

These isolate the runtime from business-specific transport noise.

Recommended synthetic workflow types:

- start-to-complete with no task
- start-to-task with one human task
- signal-wait then task activation
- timer-wait then task activation
- continue-with dispatcher chain
- parent-child subworkflow chain

Use them to answer:

- raw start throughput
- raw resume throughput
- timer-due drain rate
- subworkflow coordination cost
- task activation/update cost

### 3. Representative Bulstrad Workloads

These prove that realistic product workflows behave well under load.

The first performance wave should use workflows that are already functionally covered in the Oracle suite:

- `AssistantPrintInsisDocuments`
- `OpenForChangePolicy`
- `ReviewPolicyOpenForChange`
- `AssistantAddAnnex`
- `AnnexCancellation`
- `AssistantPolicyCancellation`
- `AssistantPolicyReinstate`
- `InsisIntegrationNew`
- `QuotationConfirm`
- `QuoteOrAplCancel`

Use them to answer:

- how the engine behaves with realistic transport payload shaping
- how nested child workflows affect latency
- how multi-step review chains behave during backlog drain
- how short utility flows compare to long policy chains

### 4. Failure-Under-Load Workloads

These are not optional. A production engine must be tested while busy.

Scenarios:

- provider restart during active signal drain
- Oracle restart while delayed backlog exists
- dead-letter replay while new live signals continue to arrive
- duplicate signal storm against the same waiting instance set
- one worker repeatedly failing while another healthy worker continues
- scheduled backlog plus external-signal backlog mixed together

Use them to answer:

- whether recovery stays bounded
- whether backlog drain remains monotonic
- whether duplicate-delivery protections still hold under pressure
- whether DLQ replay can safely coexist with live traffic

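Whether backlog drain "remains monotonic" can be checked directly from periodic queue-depth samples. A minimal sketch, assuming depth sampling already exists (the sampling mechanism itself is not shown):

```python
def drain_is_monotonic(depth_samples: list[int], tolerance: int = 0) -> bool:
    """True if queue depth never rises by more than `tolerance` between samples.

    depth_samples are periodic queue-depth readings taken during the drain
    phase, after new-traffic injection has stopped. A small tolerance absorbs
    sampling race noise without hiding real backlog growth.
    """
    return all(b <= a + tolerance for a, b in zip(depth_samples, depth_samples[1:]))
```

For failure-under-load runs, this check should be applied only after the injected fault clears, since depth legitimately rises while the fault is active.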
## Test Tiers

Performance testing should not be a single bucket.

### Tier 1: PR Smoke

Purpose:

- catch catastrophic regressions quickly

Characteristics:

- small datasets
- short run time
- deterministic scenarios
- hard pass/fail envelopes

Recommended scope:

- one AQ immediate burst
- one AQ delayed backlog burst
- one synthetic signal-resume scenario
- one short Bulstrad business flow

Target duration:

- under 5 minutes total

Gating style:

- zero correctness failures
- no DLQ unless explicitly expected
- coarse latency ceilings only

### Tier 2: Nightly Characterization

Purpose:

- measure trends and detect meaningful performance regression

Characteristics:

- moderate dataset
- multiple concurrency levels
- metrics persisted as artifacts

Recommended scope:

- full Oracle transport matrix
- synthetic engine workloads at 1, 4, 8, and 16-way concurrency
- 3 to 5 representative Bulstrad families
- restart and DLQ replay under moderate backlog

Target duration:

- 15 to 45 minutes

Gating style:

- correctness failures fail the run
- latency/throughput compare against baseline with tolerance

### Tier 3: Weekly Soak

Purpose:

- detect leaks, drift, and long-tail timing issues

Characteristics:

- long-running mixed workload
- periodic restarts or controlled faults
- queue depth and runtime-state stability tracking

Recommended scope:

- 30 to 120 minutes of mixed load
- immediate, delayed, and replay traffic mixed together
- repeated provider restarts
- one Oracle restart in the middle of the run

Gating style:

- no unbounded backlog growth
- no stuck instances
- no memory growth trend outside a defined envelope

### Tier 4: Explicit Capacity And Breakpoint Runs

Purpose:

- learn real limits before production sizing decisions

Characteristics:

- not part of normal CI
- intentionally pushes throughput until latency or failure thresholds break

Recommended scope:

- ramp concurrency upward until queue lag or DB pressure exceeds target
- test one-node and multi-node configurations
- record saturation points, not just pass/fail

Deliverable:

- capacity report with recommended node counts and operational envelopes

## Scenario Matrix

The initial scenario matrix should look like this.

### Oracle AQ Transport

- immediate burst: 100, 500, 1000 messages
- delayed burst: 50, 100, 250 messages due in the same second
- mixed burst: 70 percent immediate, 30 percent delayed
- redelivery burst: 25 messages rolled back once then committed
- DLQ burst: 25 poison messages then replay

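The mixed burst above needs a deterministic message schedule so that repeated runs are comparable. A sketch of one way to build it (the counts and ratio come from the matrix; everything else is illustrative):

```python
import random

def build_mixed_burst(total: int, immediate_ratio: float = 0.7, seed: int = 42) -> list[dict]:
    """Build a shuffled schedule of immediate and delayed messages.

    A fixed seed keeps the interleaving identical across runs, so measured
    differences come from the system under test, not from the workload shape.
    """
    immediate_count = round(total * immediate_ratio)
    schedule = (
        [{"kind": "immediate", "delay_s": 0} for _ in range(immediate_count)]
        + [{"kind": "delayed", "delay_s": 5} for _ in range(total - immediate_count)]
    )
    random.Random(seed).shuffle(schedule)
    return schedule
```

The same generator can serve the pure-immediate and pure-delayed bursts by setting the ratio to 1.0 or 0.0.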
### Synthetic Engine

- start-to-task: 50, 200, 500 workflow starts
- task-complete-to-next-task: 50, 200 completions
- signal-wait-resume: 50, 200, 500 waiting instances resumed concurrently
- timer-wait-resume: 50, 200 due timers
- subworkflow chain: 25, 100 parent-child chains

### Bulstrad Business

- short business flow: `QuoteOrAplCancel`
- medium transport flow: `InsisIntegrationNew`
- child-workflow flow: `QuotationConfirm`
- long review chain: `OpenForChangePolicy`
- print flow: `AssistantPrintInsisDocuments`
- cancellation flow: `AnnexCancellation`

### Failure Under Load

- 100 waiting instances, provider restart during drain
- 100 delayed messages, Oracle restart before due time
- 50 poison signals plus live replay traffic
- duplicate external signal storm against 50 waiting instances
- mixed task completions and signal resumes on the same service instance set

## Concurrency Steps

Use explicit concurrency ladders instead of one arbitrary load value.

Recommended first ladder:

- 1
- 4
- 8
- 16
- 32

Use different ladders if the environment is too small, but always record:

- node count
- worker concurrency
- queue backlog size
- workflow count
- message mix

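A ladder run is the same scenario repeated at each concurrency step, with the run parameters recorded alongside the results. A thread-pool sketch of the driver shape (the real driver would invoke the engine, not an arbitrary callable):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_ladder(scenario, work_items: int, steps=(1, 4, 8, 16, 32)) -> list[dict]:
    """Run `scenario(item_index)` at each concurrency step and record throughput."""
    results = []
    for workers in steps:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Drain the map so all work completes before the clock stops.
            list(pool.map(scenario, range(work_items)))
        elapsed = time.perf_counter() - start
        results.append({
            "concurrency": workers,        # worker concurrency, per the record list above
            "work_items": work_items,
            "elapsed_s": elapsed,
            "throughput_per_s": work_items / elapsed,
        })
    return results
```

Keeping the step list explicit in the result records is what lets later runs on a different ladder still be compared point by point.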
## Metrics Collection Design

The harness should persist results for every performance run.

Each result set should include:

- scenario name
- git commit or working tree marker
- test timestamp
- environment label
- node count
- concurrency level
- workflow count
- signal count
- Oracle queue names used
- measured latency summary
- throughput summary
- correctness summary
- process resource summary
- optional Oracle observations

Recommended output format:

- JSON artifact for machines
- short markdown summary for humans

Recommended location:

- `TestResults/workflow-performance/`

## Baseline Strategy

Do not hard-code aggressive latency thresholds before collecting stable data.

Use this sequence:

1. characterization phase

   Run each scenario several times on local Docker and CI Oracle.

2. baseline phase

   Record stable p50, p95, p99, throughput, and drain-rate envelopes.

3. gating phase

   Add coarse PR thresholds and tighter nightly regression detection.

PR thresholds should be:

- intentionally forgiving
- correctness-first
- designed to catch major regressions only

Nightly thresholds should be:

- baseline-relative
- environment-specific if necessary
- reviewed whenever Oracle container images or CI hardware change

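Baseline-relative gating reduces to comparing a run's summary against the recorded envelope with a tolerance. A minimal sketch (the 20 percent tolerance is a placeholder to be replaced by measured envelopes from the characterization phase):

```python
def check_against_baseline(current: dict, baseline: dict, tolerance: float = 0.20) -> list[str]:
    """Flag metrics that regressed beyond `tolerance` relative to the baseline.

    Latency metrics regress by going up; throughput regresses by going down.
    Returns human-readable regression descriptions; empty means within envelope.
    """
    regressions = []
    for key in ("p50", "p95", "p99"):
        if current[key] > baseline[key] * (1 + tolerance):
            regressions.append(f"{key} latency regressed: {current[key]} vs baseline {baseline[key]}")
    if current["throughput_per_s"] < baseline["throughput_per_s"] * (1 - tolerance):
        regressions.append("throughput regressed")
    return regressions
```

PR gates would run this with a large tolerance; nightly runs with a tighter, environment-specific one.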
## Harness Design

The load harness should be separate from the normal fast integration suite.

Recommended structure:

- keep correctness-focused Oracle AQ tests in the current integration project
- add categorized performance tests with explicit categories such as:
  - `WorkflowPerfLatency`
  - `WorkflowPerfThroughput`
  - `WorkflowPerfSmoke`
  - `WorkflowPerfNightly`
  - `WorkflowPerfSoak`
  - `WorkflowPerfCapacity`
- keep scenario builders reusable so the same workflows and transports can be used in correctness and performance runs

The harness should include:

- scenario driver
- result collector
- metric aggregator
- optional Oracle observation collector
- artifact writer
- explicit phase-latency capture for start, signal publish, and signal-to-completion on the synthetic signal round-trip workload

## Multi-Backend Expansion Rules

Once Oracle is the validated reference baseline, PostgreSQL and MongoDB must adopt the same load and performance structure instead of inventing backend-specific suites first.

Required rules:

- keep one shared scenario catalog for Oracle, PostgreSQL, and MongoDB
- compare backends first on normalized workflow metrics, not backend-native counters
- keep backend-native metrics as appendices, not as the headline result
- use the same tier names and artifact schema across all backends
- keep the same curated Bulstrad workload pack across all backends unless a workflow is backend-blocked by a real functional defect

The shared artifact set should ultimately include:

- `10-oracle-performance-baseline-<date>.md/.json`
- `11-postgres-performance-baseline-<date>.md/.json`
- `12-mongo-performance-baseline-<date>.md/.json`
- `13-backend-comparison-<date>.md/.json`

The shared normalized metrics are:

- serial end-to-end latency
- start-to-first-task latency
- signal-publish-to-visible-resume latency
- steady-state throughput
- capacity ladder at `c1`, `c4`, `c8`, and `c16`
- backlog drain time
- failures
- dead letters
- runtime conflicts
- stuck instances

Backend-native appendices should include:

- Oracle:
  - AQ browse depth
  - `V$SYSSTAT` deltas
  - `V$SYS_TIME_MODEL` deltas
  - top wait deltas
- PostgreSQL:
  - queue-table depth
  - `pg_stat_database`
  - `pg_stat_statements`
  - lock and wait observations
  - WAL pressure observations
- MongoDB:
  - signal collection depth
  - `serverStatus` counters
  - transaction counters
  - change-stream wake observations
  - lock percentage observations

## Oracle-Specific Observation Plan

For Oracle-backed runs, observe both the engine and the database.

At minimum, record:

- AQ browse depth before, during, and after the run
- count of runtime-state rows touched
- count of task and task-event rows created
- number of dead-lettered signals
- duplicate/stale resume ignore count

If the environment allows deeper Oracle access, also record:

- session count for the service user
- top wait classes during the run
- lock waits on workflow tables
- statement time for key mutation queries

## Exit Criteria

The load and performance work is complete when:

- PR smoke scenarios are stable and cheap enough to run continuously
- nightly characterization produces persisted metrics and a useful regression signal
- at least one weekly soak run is stable without correctness drift
- representative Bulstrad families have measured latency and throughput envelopes
- Oracle restart, provider restart, DLQ replay, and duplicate-delivery scenarios are all characterized under load
- the team can state a first production sizing recommendation for one-node and multi-node deployments

## Next Sprint Shape

This plan maps naturally to a dedicated sprint focused on:

- performance harness infrastructure
- synthetic scenario library
- representative Bulstrad workload runner
- metrics artifact generation
- baseline capture
- first capacity report