git.stella-ops.org/docs/workflow/engine/08-load-and-performance-plan.md


08. Load And Performance Plan

Purpose

This document defines how the Serdica workflow engine should be load-tested, performance-characterized, and capacity-sized once functional parity is in place.

The goal is not only to prove that the engine is correct under load, but to answer these product and platform questions:

  • how many workflow starts, task completions, and signal resumes can one node sustain
  • how quickly does backlog drain after restart or outage
  • how much timing variance is normal for Oracle AQ on local Docker, CI, and shared environments
  • which workloads are Oracle-bound, AQ-bound, or engine-bound
  • which scenarios are safe to gate in PR and which belong in nightly or explicit soak runs

Principles

The performance plan follows these rules:

  • correctness comes first; a fast run that loses or corrupts work is a failed run
  • performance tests must be split by intent: smoke, characterization, stress, soak, and failure-under-load
  • transport-only tests and full workflow tests must both exist; they answer different questions
  • synthetic workflows are required for stable measurement
  • representative Bulstrad workflows are required for product confidence
  • PR gates should use coarse, stable envelopes
  • nightly and explicit runs should record and compare detailed metrics
  • Oracle and AQ behavior must be measured directly, not inferred from app logs alone

What Must Be Measured

Correctness Under Load

Every load run should capture:

  • total workflows started
  • total tasks activated
  • total tasks completed
  • total signals published
  • total signals processed
  • total signals ignored as stale or duplicate
  • total dead-lettered signals
  • total runtime concurrency conflicts
  • total failed runs
  • total stuck instances at end of run

Correctness invariants:

  • no lost committed signal
  • no duplicate open task for the same logical wait
  • no orphan subworkflow frame
  • no runtime state row left without a valid, explainable wait reason
  • no queue backlog remaining after a successful drain phase, unless the scenario intentionally leaves poison messages in the DLQ
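
The counters and invariants above lend themselves to a simple end-of-run accounting check. A minimal sketch in Python (the field names are illustrative, not the harness's actual schema):

```python
def check_run_invariants(c: dict) -> list[str]:
    """Return a list of invariant violations for one load run; empty means pass."""
    violations = []
    # Every committed signal must be accounted for: processed, deliberately
    # ignored as stale/duplicate, or dead-lettered -- never silently lost.
    accounted = c["signals_processed"] + c["signals_ignored"] + c["signals_dead_lettered"]
    if accounted != c["signals_published"]:
        violations.append(f"signal accounting mismatch: {accounted} != {c['signals_published']}")
    # Activated tasks must either complete or remain as explainable open tasks.
    if c["tasks_completed"] + c["tasks_open_at_end"] != c["tasks_activated"]:
        violations.append("task accounting mismatch")
    # A successful drain phase leaves no stuck instances behind.
    if c["stuck_instances"] != 0:
        violations.append(f"{c['stuck_instances']} stuck instances at end of run")
    # Dead letters are only allowed when the scenario intentionally poisons messages.
    if c["signals_dead_lettered"] > 0 and not c.get("dlq_expected", False):
        violations.append("unexpected dead-lettered signals")
    return violations
```

The real harness would feed these counters from the datastore and signal-pump telemetry rather than a plain dictionary.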

Latency

The engine should measure at least:

  • start-to-first-task latency
  • start-to-completion latency
  • task-complete-to-next-task latency
  • signal-publish-to-task-visible latency
  • timer-due-to-resume latency
  • delayed-message lateness relative to requested due time
  • backlog-drain completion time
  • restart-to-first-processed-signal time

These should be recorded as:

  • average
  • p50
  • p95
  • p99
  • max
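
Computing these summaries is straightforward; a dependency-free sketch using nearest-rank percentiles (one reasonable convention among several):

```python
import math

def latency_summary(samples_ms: list[float]) -> dict:
    """Summarize one latency series as average, p50, p95, p99, and max."""
    s = sorted(samples_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile: the smallest sample such that at least
        # p percent of all samples are less than or equal to it.
        rank = max(0, math.ceil(p / 100 * len(s)) - 1)
        return s[rank]

    return {
        "avg": sum(s) / len(s),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "max": s[-1],
    }
```

Whichever percentile convention the harness settles on, it should use the same one across all tiers so baselines stay comparable.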

Throughput

The engine should measure:

  • workflows started per second
  • task completions per second
  • signals published per second
  • signals processed per second
  • backlog drain rate in signals per second
  • completed end-to-end business workflows per minute

Saturation

The engine should measure:

  • app process CPU
  • app process private memory and working set
  • Oracle container CPU and memory when running locally
  • queue depth over time
  • active waiting instances over time
  • dead-letter depth over time
  • runtime state update conflicts over time
  • open task count over time

Oracle-Side Signals

If the environment permits access, also collect:

  • AQ queue depth before, during, and after load
  • queue-table growth during sustained runs
  • visible dequeue lag
  • Oracle session count for the test service
  • lock or wait spikes on workflow tables
  • transaction duration for mutation transactions

If the environment does not permit these views, fall back to:

  • app-side timing
  • browse counts from AQ
  • workflow table row counts
  • signal pump telemetry snapshots

Workload Model

The load plan should be split into four workload families.

1. Transport Microbenchmarks

These isolate Oracle AQ behavior from workflow logic.

Use them to answer:

  • how fast can AQ accept immediate messages
  • how fast can AQ release delayed messages
  • what is the drain rate for mixed backlogs
  • how much delayed-message jitter is normal

Core scenarios:

  • burst immediate enqueue and drain
  • burst delayed enqueue with same due second
  • mixed immediate and delayed enqueue on one queue
  • dequeue rollback redelivery under sustained load
  • dead-letter and replay backlog
  • delayed backlog surviving Oracle restart
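
Delayed-message jitter in these scenarios reduces to lateness relative to the requested due time. A hedged sketch of that computation, assuming the harness's dequeue log carries a due timestamp and an observed release timestamp per message:

```python
def lateness_stats(messages: list[dict]) -> dict:
    """Compute how late delayed messages were released, relative to their due time.

    Each entry is assumed to carry 'due' and 'released' as epoch seconds;
    negative lateness (apparent early release, e.g. from clock skew) is
    clamped to zero, since AQ should never release before the due time.
    """
    lateness = sorted(max(0.0, m["released"] - m["due"]) for m in messages)
    n = len(lateness)
    return {
        "min_s": lateness[0],
        "median_s": lateness[n // 2],
        "max_s": lateness[-1],
    }
```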

2. Synthetic Engine Workloads

These isolate the runtime from business-specific transport noise.

Recommended synthetic workflow types:

  • start-to-complete with no task
  • start-to-task with one human task
  • signal-wait then task activation
  • timer-wait then task activation
  • continue-with dispatcher chain
  • parent-child subworkflow chain

Use them to answer:

  • raw start throughput
  • raw resume throughput
  • timer-due drain rate
  • subworkflow coordination cost
  • task activation/update cost

3. Representative Bulstrad Workloads

These prove that realistic product workflows behave well under load.

The first performance wave should use workflows that are already functionally covered in the Oracle suite:

  • AssistantPrintInsisDocuments
  • OpenForChangePolicy
  • ReviewPolicyOpenForChange
  • AssistantAddAnnex
  • AnnexCancellation
  • AssistantPolicyCancellation
  • AssistantPolicyReinstate
  • InsisIntegrationNew
  • QuotationConfirm
  • QuoteOrAplCancel

Use them to answer:

  • how the engine behaves with realistic transport payload shaping
  • how nested child workflows affect latency
  • how multi-step review chains behave during backlog drain
  • how short utility flows compare to long policy chains

4. Failure-Under-Load Workloads

These are not optional. A production engine must be tested while busy.

Scenarios:

  • provider restart during active signal drain
  • Oracle restart while delayed backlog exists
  • dead-letter replay while new live signals continue to arrive
  • duplicate signal storm against the same waiting instance set
  • one worker repeatedly failing while another healthy worker continues
  • scheduled backlog plus external-signal backlog mixed together

Use them to answer:

  • whether recovery stays bounded
  • whether backlog drain remains monotonic
  • whether duplicate-delivery protections still hold under pressure
  • whether DLQ replay can safely coexist with live traffic
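
The monotonic-drain question in particular can be checked mechanically from periodic queue-depth samples. A minimal sketch (the sampling mechanism and tolerance value are assumptions):

```python
def drain_is_monotonic(backlog_samples: list[int], tolerance: int = 0) -> bool:
    """Check that queue backlog never grows during the drain phase.

    backlog_samples are periodic queue-depth readings taken after recovery
    begins; tolerance allows small transient bumps, e.g. from in-flight
    redelivery, without failing the run.
    """
    return all(b - a <= tolerance for a, b in zip(backlog_samples, backlog_samples[1:]))
```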

Test Tiers

Performance testing should not be a single bucket.

Tier 1: PR Smoke

Purpose:

  • catch catastrophic regressions quickly

Characteristics:

  • small datasets
  • short run time
  • deterministic scenarios
  • hard pass/fail envelopes

Recommended scope:

  • one AQ immediate burst
  • one AQ delayed backlog burst
  • one synthetic signal-resume scenario
  • one short Bulstrad business flow

Target duration:

  • under 5 minutes total

Gating style:

  • zero correctness failures
  • no DLQ unless explicitly expected
  • coarse latency ceilings only
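
This gating style amounts to a short correctness-first predicate; a sketch of what the PR gate might evaluate (field names and the ceiling parameter are illustrative):

```python
def pr_smoke_gate(result: dict, p95_ceiling_ms: float) -> bool:
    """Coarse pass/fail gate for a PR smoke run.

    Correctness-first: any correctness failure or unexpected dead letter
    fails the run outright; latency is checked only against a deliberately
    forgiving ceiling so infrastructure noise does not block PRs.
    """
    if result["correctness_failures"] > 0:
        return False
    if result["dead_letters"] > 0 and not result.get("dlq_expected", False):
        return False
    return result["latency_p95_ms"] <= p95_ceiling_ms
```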

Tier 2: Nightly Characterization

Purpose:

  • measure trends and detect meaningful performance regression

Characteristics:

  • moderate dataset
  • multiple concurrency levels
  • metrics persisted as artifacts

Recommended scope:

  • full Oracle transport matrix
  • synthetic engine workloads at 1, 4, 8, and 16-way concurrency
  • 3 to 5 representative Bulstrad families
  • restart and DLQ replay under moderate backlog

Target duration:

  • 15 to 45 minutes

Gating style:

  • correctness failures fail the run
  • latency/throughput compare against baseline with tolerance

Tier 3: Weekly Soak

Purpose:

  • detect leaks, drift, and long-tail timing issues

Characteristics:

  • long-running mixed workload
  • periodic restarts or controlled faults
  • queue depth and runtime-state stability tracking

Recommended scope:

  • 30 to 120 minutes of mixed load
  • immediate, delayed, and replay traffic mixed together
  • repeated provider restarts
  • one Oracle restart in the middle of the run

Gating style:

  • no unbounded backlog growth
  • no stuck instances
  • no memory growth trend outside a defined envelope

Tier 4: Explicit Capacity And Breakpoint Runs

Purpose:

  • learn real limits before production sizing decisions

Characteristics:

  • not part of normal CI
  • intentionally pushes throughput until latency or failure thresholds break

Recommended scope:

  • ramp concurrency upward until queue lag or DB pressure exceeds target
  • test one-node and multi-node configurations
  • record saturation points, not just pass/fail

Deliverable:

  • capacity report with recommended node counts and operational envelopes

Scenario Matrix

The initial scenario matrix should look like this.

Oracle AQ Transport

  • immediate burst: 100, 500, 1000 messages
  • delayed burst: 50, 100, 250 messages due in same second
  • mixed burst: 70 percent immediate, 30 percent delayed
  • redelivery burst: 25 messages rolled back once then committed
  • DLQ burst: 25 poison messages then replay

Synthetic Engine

  • start-to-task: 50, 200, 500 workflow starts
  • task-complete-to-next-task: 50, 200 completions
  • signal-wait-resume: 50, 200, 500 waiting instances resumed concurrently
  • timer-wait-resume: 50, 200 due timers
  • subworkflow chain: 25, 100 parent-child chains

Bulstrad Business

  • short business flow: QuoteOrAplCancel
  • medium transport flow: InsisIntegrationNew
  • child-workflow flow: QuotationConfirm
  • long review chain: OpenForChangePolicy
  • print flow: AssistantPrintInsisDocuments
  • cancellation flow: AnnexCancellation

Failure Under Load

  • 100 waiting instances, provider restart during drain
  • 100 delayed messages, Oracle restart before due time
  • 50 poison signals plus live replay traffic
  • duplicate external signal storm against 50 waiting instances
  • mixed task completions and signal resumes on same service instance set

Concurrency Steps

Use explicit concurrency ladders instead of one arbitrary load value.

Recommended first ladder:

  • 1
  • 4
  • 8
  • 16
  • 32

Use different ladders if the environment is too small, but always record:

  • node count
  • worker concurrency
  • queue backlog size
  • workflow count
  • message mix
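
A ladder run is a simple loop over concurrency levels. The sketch below uses a local thread pool and a stub callable as a stand-in for one unit of work; the real harness would drive workflow starts against the test service instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_ladder(scenario, ladder=(1, 4, 8, 16, 32), iterations=200):
    """Run one scenario across a concurrency ladder and record throughput per rung.

    'scenario' is a no-argument callable representing one unit of work,
    e.g. one workflow start (a hypothetical stand-in for this sketch).
    """
    results = []
    for concurrency in ladder:
        started = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            for f in [pool.submit(scenario) for _ in range(iterations)]:
                f.result()  # propagate failures instead of hiding them
        elapsed = time.perf_counter() - started
        results.append({
            "concurrency": concurrency,
            "iterations": iterations,
            "per_second": iterations / elapsed,
        })
    return results
```

Each rung's result record should be extended with the node count, backlog size, and message mix listed above before being persisted.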

Metrics Collection Design

The harness should persist results for every performance run.

Each result set should include:

  • scenario name
  • git commit or working tree marker
  • test timestamp
  • environment label
  • node count
  • concurrency level
  • workflow count
  • signal count
  • Oracle queue names used
  • measured latency summary
  • throughput summary
  • correctness summary
  • process resource summary
  • optional Oracle observations

Recommended output format:

  • JSON artifact for machines
  • short markdown summary for humans

Recommended location:

  • TestResults/workflow-performance/
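
An illustrative shape for one persisted result set follows. Every field mirrors the list above; the concrete key names and placeholder values are assumptions, not a fixed schema:

```json
{
  "scenario": "synthetic-signal-wait-resume",
  "commit": "<git-sha-or-working-tree-marker>",
  "timestamp": "<iso-8601>",
  "environment": "local-docker",
  "nodeCount": 1,
  "concurrency": 8,
  "workflowCount": 500,
  "signalCount": 500,
  "queues": ["<aq-queue-name>"],
  "latencyMs": { "avg": 0, "p50": 0, "p95": 0, "p99": 0, "max": 0 },
  "throughput": { "startsPerSecond": 0, "signalsPerSecond": 0 },
  "correctness": { "failures": 0, "deadLetters": 0, "stuckInstances": 0 },
  "process": { "cpuPercent": 0, "workingSetMb": 0 },
  "oracle": null
}
```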

Baseline Strategy

Do not hard-code aggressive latency thresholds before collecting stable data.

Use this sequence:

  1. Characterization phase: run each scenario several times on local Docker and CI Oracle.

  2. Baseline phase: record stable p50, p95, p99, throughput, and drain-rate envelopes.

  3. Gating phase: add coarse PR thresholds and tighter nightly regression detection.

PR thresholds should be:

  • intentionally forgiving
  • correctness-first
  • designed to catch major regressions only

Nightly thresholds should be:

  • baseline-relative
  • environment-specific if necessary
  • reviewed whenever Oracle container images or CI hardware changes
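
Baseline-relative checking can be expressed as a small comparison function; a sketch, where the 20 percent tolerance is a starting assumption to be tightened once several stable baselines exist:

```python
def regression_check(current: dict, baseline: dict, tolerance: float = 0.20) -> list[str]:
    """Flag nightly metrics that degraded more than 'tolerance' versus baseline.

    Latency metrics regress when they grow; throughput metrics regress
    when they shrink. Returns human-readable regression descriptions.
    """
    regressions = []
    for key in ("p50_ms", "p95_ms", "p99_ms"):
        if current[key] > baseline[key] * (1 + tolerance):
            regressions.append(f"{key}: {current[key]} vs baseline {baseline[key]}")
    for key in ("signals_per_second",):
        if current[key] < baseline[key] * (1 - tolerance):
            regressions.append(f"{key}: {current[key]} vs baseline {baseline[key]}")
    return regressions
```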

Harness Design

The load harness should be separate from the normal fast integration suite.

Recommended structure:

  • keep correctness-focused Oracle AQ tests in the current integration project
  • add categorized performance tests with explicit categories such as:
    • WorkflowPerfLatency
    • WorkflowPerfThroughput
    • WorkflowPerfSmoke
    • WorkflowPerfNightly
    • WorkflowPerfSoak
    • WorkflowPerfCapacity
  • keep scenario builders reusable so the same workflow/transports can be used in correctness and performance runs

The harness should include:

  • scenario driver
  • result collector
  • metric aggregator
  • optional Oracle observation collector
  • artifact writer
  • explicit phase-latency capture for start, signal publish, and signal-to-completion on the synthetic signal round-trip workload
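
The phase-latency capture reduces to timestamping named checkpoints and taking deltas. A language-agnostic sketch (the phase names are assumptions matching the synthetic signal round-trip workload):

```python
import time

class PhaseClock:
    """Capture phase timestamps for one synthetic signal round-trip.

    The driver marks each checkpoint as it happens; phase latencies
    fall out as deltas between marks.
    """
    def __init__(self):
        self.marks: dict[str, float] = {}

    def mark(self, phase: str) -> None:
        self.marks[phase] = time.perf_counter()

    def phase_ms(self, start_phase: str, end_phase: str) -> float:
        return (self.marks[end_phase] - self.marks[start_phase]) * 1000.0
```

A driver would call `mark("start")`, `mark("signal_published")`, and `mark("completed")`, then persist `start -> signal_published` and `signal_published -> completed` as the two phase latencies.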

Multi-Backend Expansion Rules

Once Oracle is the validated reference baseline, PostgreSQL and MongoDB must adopt the same load and performance structure instead of inventing backend-specific suites first.

Required rules:

  • keep one shared scenario catalog for Oracle, PostgreSQL, and MongoDB
  • compare backends first on normalized workflow metrics, not backend-native counters
  • keep backend-native metrics as appendices, not as the headline result
  • use the same tier names and artifact schema across all backends
  • keep the same curated Bulstrad workload pack across all backends unless a workflow is backend-blocked by a real functional defect

The shared artifact set should ultimately include:

  • 10-oracle-performance-baseline-<date>.md/.json
  • 11-postgres-performance-baseline-<date>.md/.json
  • 12-mongo-performance-baseline-<date>.md/.json
  • 13-backend-comparison-<date>.md/.json

The shared normalized metrics are:

  • serial end-to-end latency
  • start-to-first-task latency
  • signal-publish-to-visible-resume latency
  • steady-state throughput
  • capacity ladder at c1, c4, c8, and c16
  • backlog drain time
  • failures
  • dead letters
  • runtime conflicts
  • stuck instances

Backend-native appendices should include:

  • Oracle:
    • AQ browse depth
    • V$SYSSTAT deltas
    • V$SYS_TIME_MODEL deltas
    • top wait deltas
  • PostgreSQL:
    • queue-table depth
    • pg_stat_database
    • pg_stat_statements
    • lock and wait observations
    • WAL pressure observations
  • MongoDB:
    • signal collection depth
    • serverStatus counters
    • transaction counters
    • change-stream wake observations
    • lock percentage observations

Oracle-Specific Observation Plan

For Oracle-backed runs, observe both the engine and the database.

At minimum, record:

  • AQ browse depth before, during, and after the run
  • count of runtime-state rows touched
  • count of task and task-event rows created
  • number of dead-lettered signals
  • duplicate/stale resume ignore count

If the environment allows deeper Oracle access, also record:

  • session count for the service user
  • top wait classes during the run
  • lock waits on workflow tables
  • statement time for key mutation queries

Exit Criteria

The load/performance work is complete when:

  • PR smoke scenarios are stable and cheap enough to run continuously
  • nightly characterization produces persisted metrics and useful regression signal
  • at least one weekly soak run is stable without correctness drift
  • representative Bulstrad families have measured latency and throughput envelopes
  • Oracle restart, provider restart, DLQ replay, and duplicate-delivery scenarios are all characterized under load
  • the team can state a first production sizing recommendation for one-node and multi-node deployments

Next Sprint Shape

This plan maps naturally to a dedicated sprint focused on:

  • performance harness infrastructure
  • synthetic scenario library
  • representative Bulstrad workload runner
  • metrics artifact generation
  • baseline capture
  • first capacity report