git.stella-ops.org/docs/workflow/engine/08-load-and-performance-plan.md


08. Load And Performance Plan

Purpose

This document defines how the Serdica workflow engine should be load-tested, performance-characterized, and capacity-sized once functional parity is in place.

The goal is not only to prove that the engine is correct under load, but to answer these product and platform questions:

  • how many workflow starts, task completions, and signal resumes can one node sustain
  • how quickly does backlog drain after restart or outage
  • how much timing variance is normal for Oracle AQ on local Docker, CI, and shared environments
  • which workloads are Oracle-bound, AQ-bound, or engine-bound
  • which scenarios are safe to gate in PR and which belong in nightly or explicit soak runs

Principles

The performance plan follows these rules:

  • correctness comes first; a fast run that loses or corrupts work is a failed run
  • performance tests must be split by intent: smoke, characterization, stress, soak, and failure-under-load
  • transport-only tests and full workflow tests must both exist; they answer different questions
  • synthetic workflows are required for stable measurement
  • representative Bulstrad workflows are required for product confidence
  • PR gates should use coarse, stable envelopes
  • nightly and explicit runs should record and compare detailed metrics
  • Oracle and AQ behavior must be measured directly, not inferred from app logs alone

What Must Be Measured

Correctness Under Load

Every load run should capture:

  • total workflows started
  • total tasks activated
  • total tasks completed
  • total signals published
  • total signals processed
  • total signals ignored as stale or duplicate
  • total dead-lettered signals
  • total runtime concurrency conflicts
  • total failed runs
  • total stuck instances at end of run

Correctness invariants:

  • no lost committed signal
  • no duplicate open task for the same logical wait
  • no orphan subworkflow frame
  • no runtime state row left without a valid, explainable wait reason
  • no queue backlog remaining after a successful drain phase, unless the scenario intentionally leaves poison messages in the DLQ
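
The counters and invariants above lend themselves to a simple end-of-run accounting check. A minimal sketch in Python (the field names are illustrative, not the harness's actual schema):

```python
def check_run_invariants(c: dict) -> list[str]:
    """Return a list of invariant violations for one load run; empty means pass."""
    violations = []
    # Every committed signal must be accounted for: processed, deliberately
    # ignored as stale/duplicate, or dead-lettered -- never silently lost.
    accounted = c["signals_processed"] + c["signals_ignored"] + c["signals_dead_lettered"]
    if accounted != c["signals_published"]:
        violations.append(f"signal accounting mismatch: {accounted} != {c['signals_published']}")
    # Activated tasks must either complete or remain as explainable open tasks.
    if c["tasks_completed"] + c["tasks_open_at_end"] != c["tasks_activated"]:
        violations.append("task accounting mismatch")
    # A successful drain phase leaves no stuck instances behind.
    if c["stuck_instances"] != 0:
        violations.append(f"{c['stuck_instances']} stuck instances at end of run")
    # Dead letters are only allowed when the scenario intentionally poisons messages.
    if c["signals_dead_lettered"] > 0 and not c.get("dlq_expected", False):
        violations.append("unexpected dead-lettered signals")
    return violations
```

The real harness would feed these counters from the datastore and signal-pump telemetry rather than a plain dictionary.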

Latency

The engine should measure at least:

  • start-to-first-task latency
  • start-to-completion latency
  • task-complete-to-next-task latency
  • signal-publish-to-task-visible latency
  • timer-due-to-resume latency
  • delayed-message lateness relative to requested due time
  • backlog-drain completion time
  • restart-to-first-processed-signal time

These should be recorded as:

  • average
  • p50
  • p95
  • p99
  • max
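
Computing these summaries is straightforward; a dependency-free sketch using nearest-rank percentiles (one reasonable convention among several):

```python
import math

def latency_summary(samples_ms: list[float]) -> dict:
    """Summarize one latency series as average, p50, p95, p99, and max."""
    s = sorted(samples_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile: the smallest sample such that at least
        # p percent of all samples are less than or equal to it.
        rank = max(0, math.ceil(p / 100 * len(s)) - 1)
        return s[rank]

    return {
        "avg": sum(s) / len(s),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "max": s[-1],
    }
```

Whichever percentile convention the harness settles on, it should use the same one across all tiers so baselines stay comparable.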

Throughput

The engine should measure:

  • workflows started per second
  • task completions per second
  • signals published per second
  • signals processed per second
  • backlog drain rate in signals per second
  • completed end-to-end business workflows per minute

Saturation

The engine should measure:

  • app process CPU
  • app process private memory and working set
  • Oracle container CPU and memory when running locally
  • queue depth over time
  • active waiting instances over time
  • dead-letter depth over time
  • runtime state update conflicts over time
  • open task count over time

Oracle-Side Signals

If the environment permits access, also collect:

  • AQ queue depth before, during, and after load
  • queue-table growth during sustained runs
  • visible dequeue lag
  • Oracle session count for the test service
  • lock or wait spikes on workflow tables
  • transaction duration for mutation transactions

If the environment does not permit these views, fall back to:

  • app-side timing
  • browse counts from AQ
  • workflow table row counts
  • signal pump telemetry snapshots

Workload Model

The load plan should be split into four workload families.

1. Transport Microbenchmarks

These isolate Oracle AQ behavior from workflow logic.

Use them to answer:

  • how fast can AQ accept immediate messages
  • how fast can AQ release delayed messages
  • what is the drain rate for mixed backlogs
  • how much delayed-message jitter is normal

Core scenarios:

  • burst immediate enqueue and drain
  • burst delayed enqueue with same due second
  • mixed immediate and delayed enqueue on one queue
  • dequeue rollback redelivery under sustained load
  • dead-letter and replay backlog
  • delayed backlog surviving Oracle restart
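
Delayed-message jitter in these scenarios reduces to lateness relative to the requested due time. A hedged sketch of that computation, assuming the harness's dequeue log carries a due timestamp and an observed release timestamp per message:

```python
def lateness_stats(messages: list[dict]) -> dict:
    """Compute how late delayed messages were released, relative to their due time.

    Each entry is assumed to carry 'due' and 'released' as epoch seconds;
    negative lateness (apparent early release, e.g. from clock skew) is
    clamped to zero, since AQ should never release before the due time.
    """
    lateness = sorted(max(0.0, m["released"] - m["due"]) for m in messages)
    n = len(lateness)
    return {
        "min_s": lateness[0],
        "median_s": lateness[n // 2],
        "max_s": lateness[-1],
    }
```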

2. Synthetic Engine Workloads

These isolate the runtime from business-specific transport noise.

Recommended synthetic workflow types:

  • start-to-complete with no task
  • start-to-task with one human task
  • signal-wait then task activation
  • timer-wait then task activation
  • continue-with dispatcher chain
  • parent-child subworkflow chain

Use them to answer:

  • raw start throughput
  • raw resume throughput
  • timer-due drain rate
  • subworkflow coordination cost
  • task activation/update cost

3. Representative Bulstrad Workloads

These prove that realistic product workflows behave well under load.

The first performance wave should use workflows that are already functionally covered in the Oracle suite:

  • AssistantPrintInsisDocuments
  • OpenForChangePolicy
  • ReviewPolicyOpenForChange
  • AssistantAddAnnex
  • AnnexCancellation
  • AssistantPolicyCancellation
  • AssistantPolicyReinstate
  • InsisIntegrationNew
  • QuotationConfirm
  • QuoteOrAplCancel

Use them to answer:

  • how the engine behaves with realistic transport payload shaping
  • how nested child workflows affect latency
  • how multi-step review chains behave during backlog drain
  • how short utility flows compare to long policy chains

4. Failure-Under-Load Workloads

These are not optional. A production engine must be tested while busy.

Scenarios:

  • provider restart during active signal drain
  • Oracle restart while delayed backlog exists
  • dead-letter replay while new live signals continue to arrive
  • duplicate signal storm against the same waiting instance set
  • one worker repeatedly failing while another healthy worker continues
  • scheduled backlog plus external-signal backlog mixed together

Use them to answer:

  • whether recovery stays bounded
  • whether backlog drain remains monotonic
  • whether duplicate-delivery protections still hold under pressure
  • whether DLQ replay can safely coexist with live traffic
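
The monotonic-drain question in particular can be checked mechanically from periodic queue-depth samples. A minimal sketch (the sampling mechanism and tolerance value are assumptions):

```python
def drain_is_monotonic(backlog_samples: list[int], tolerance: int = 0) -> bool:
    """Check that queue backlog never grows during the drain phase.

    backlog_samples are periodic queue-depth readings taken after recovery
    begins; tolerance allows small transient bumps, e.g. from in-flight
    redelivery, without failing the run.
    """
    return all(b - a <= tolerance for a, b in zip(backlog_samples, backlog_samples[1:]))
```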

Test Tiers

Performance testing should not be a single bucket.

Tier 1: PR Smoke

Purpose:

  • catch catastrophic regressions quickly

Characteristics:

  • small datasets
  • short run time
  • deterministic scenarios
  • hard pass/fail envelopes

Recommended scope:

  • one AQ immediate burst
  • one AQ delayed backlog burst
  • one synthetic signal-resume scenario
  • one short Bulstrad business flow

Target duration:

  • under 5 minutes total

Gating style:

  • zero correctness failures
  • no DLQ unless explicitly expected
  • coarse latency ceilings only
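
This gating style amounts to a short correctness-first predicate; a sketch of what the PR gate might evaluate (field names and the ceiling parameter are illustrative):

```python
def pr_smoke_gate(result: dict, p95_ceiling_ms: float) -> bool:
    """Coarse pass/fail gate for a PR smoke run.

    Correctness-first: any correctness failure or unexpected dead letter
    fails the run outright; latency is checked only against a deliberately
    forgiving ceiling so infrastructure noise does not block PRs.
    """
    if result["correctness_failures"] > 0:
        return False
    if result["dead_letters"] > 0 and not result.get("dlq_expected", False):
        return False
    return result["latency_p95_ms"] <= p95_ceiling_ms
```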

Tier 2: Nightly Characterization

Purpose:

  • measure trends and detect meaningful performance regression

Characteristics:

  • moderate dataset
  • multiple concurrency levels
  • metrics persisted as artifacts

Recommended scope:

  • full Oracle transport matrix
  • synthetic engine workloads at 1, 4, 8, and 16-way concurrency
  • 3 to 5 representative Bulstrad families
  • restart and DLQ replay under moderate backlog

Target duration:

  • 15 to 45 minutes

Gating style:

  • correctness failures fail the run
  • latency/throughput compare against baseline with tolerance

Tier 3: Weekly Soak

Purpose:

  • detect leaks, drift, and long-tail timing issues

Characteristics:

  • long-running mixed workload
  • periodic restarts or controlled faults
  • queue depth and runtime-state stability tracking

Recommended scope:

  • 30 to 120 minutes of mixed load
  • immediate, delayed, and replay traffic mixed together
  • repeated provider restarts
  • one Oracle restart in the middle of the run

Gating style:

  • no unbounded backlog growth
  • no stuck instances
  • no memory growth trend outside a defined envelope

Tier 4: Explicit Capacity And Breakpoint Runs

Purpose:

  • learn real limits before production sizing decisions

Characteristics:

  • not part of normal CI
  • intentionally pushes throughput until latency or failure thresholds break

Recommended scope:

  • ramp concurrency upward until queue lag or DB pressure exceeds target
  • test one-node and multi-node configurations
  • record saturation points, not just pass/fail

Deliverable:

  • capacity report with recommended node counts and operational envelopes

Scenario Matrix

The initial scenario matrix should look like this.

Oracle AQ Transport

  • immediate burst: 100, 500, 1000 messages
  • delayed burst: 50, 100, 250 messages due in same second
  • mixed burst: 70 percent immediate, 30 percent delayed
  • redelivery burst: 25 messages rolled back once then committed
  • DLQ burst: 25 poison messages then replay

Synthetic Engine

  • start-to-task: 50, 200, 500 workflow starts
  • task-complete-to-next-task: 50, 200 completions
  • signal-wait-resume: 50, 200, 500 waiting instances resumed concurrently
  • timer-wait-resume: 50, 200 due timers
  • subworkflow chain: 25, 100 parent-child chains

Bulstrad Business

  • short business flow: QuoteOrAplCancel
  • medium transport flow: InsisIntegrationNew
  • child-workflow flow: QuotationConfirm
  • long review chain: OpenForChangePolicy
  • print flow: AssistantPrintInsisDocuments
  • cancellation flow: AnnexCancellation

Failure Under Load

  • 100 waiting instances, provider restart during drain
  • 100 delayed messages, Oracle restart before due time
  • 50 poison signals plus live replay traffic
  • duplicate external signal storm against 50 waiting instances
  • mixed task completions and signal resumes on same service instance set

Concurrency Steps

Use explicit concurrency ladders instead of one arbitrary load value.

Recommended first ladder:

  • 1
  • 4
  • 8
  • 16
  • 32

Use different ladders if the environment is too small, but always record:

  • node count
  • worker concurrency
  • queue backlog size
  • workflow count
  • message mix
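
A ladder run is a simple loop over concurrency levels. The sketch below uses a local thread pool and a stub callable as a stand-in for one unit of work; the real harness would drive workflow starts against the test service instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_ladder(scenario, ladder=(1, 4, 8, 16, 32), iterations=200):
    """Run one scenario across a concurrency ladder and record throughput per rung.

    'scenario' is a no-argument callable representing one unit of work,
    e.g. one workflow start (a hypothetical stand-in for this sketch).
    """
    results = []
    for concurrency in ladder:
        started = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            for f in [pool.submit(scenario) for _ in range(iterations)]:
                f.result()  # propagate failures instead of hiding them
        elapsed = time.perf_counter() - started
        results.append({
            "concurrency": concurrency,
            "iterations": iterations,
            "per_second": iterations / elapsed,
        })
    return results
```

Each rung's result record should be extended with the node count, backlog size, and message mix listed above before being persisted.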

Metrics Collection Design

The harness should persist results for every performance run.

Each result set should include:

  • scenario name
  • git commit or working tree marker
  • test timestamp
  • environment label
  • node count
  • concurrency level
  • workflow count
  • signal count
  • Oracle queue names used
  • measured latency summary
  • throughput summary
  • correctness summary
  • process resource summary
  • optional Oracle observations

Recommended output format:

  • JSON artifact for machines
  • short markdown summary for humans

Recommended location:

  • TestResults/workflow-performance/
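
An illustrative shape for one persisted result set follows. Every field mirrors the list above; the concrete key names and placeholder values are assumptions, not a fixed schema:

```json
{
  "scenario": "synthetic-signal-wait-resume",
  "commit": "<git-sha-or-working-tree-marker>",
  "timestamp": "<iso-8601>",
  "environment": "local-docker",
  "nodeCount": 1,
  "concurrency": 8,
  "workflowCount": 500,
  "signalCount": 500,
  "queues": ["<aq-queue-name>"],
  "latencyMs": { "avg": 0, "p50": 0, "p95": 0, "p99": 0, "max": 0 },
  "throughput": { "startsPerSecond": 0, "signalsPerSecond": 0 },
  "correctness": { "failures": 0, "deadLetters": 0, "stuckInstances": 0 },
  "process": { "cpuPercent": 0, "workingSetMb": 0 },
  "oracle": null
}
```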

Baseline Strategy

Do not hard-code aggressive latency thresholds before collecting stable data.

Use this sequence:

  1. Characterization phase: run each scenario several times on local Docker and CI Oracle.

  2. Baseline phase: record stable p50, p95, p99, throughput, and drain-rate envelopes.

  3. Gating phase: add coarse PR thresholds and tighter nightly regression detection.

PR thresholds should be:

  • intentionally forgiving
  • correctness-first
  • designed to catch major regressions only

Nightly thresholds should be:

  • baseline-relative
  • environment-specific if necessary
  • reviewed whenever Oracle container images or CI hardware changes
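
Baseline-relative checking can be expressed as a small comparison function; a sketch, where the 20 percent tolerance is a starting assumption to be tightened once several stable baselines exist:

```python
def regression_check(current: dict, baseline: dict, tolerance: float = 0.20) -> list[str]:
    """Flag nightly metrics that degraded more than 'tolerance' versus baseline.

    Latency metrics regress when they grow; throughput metrics regress
    when they shrink. Returns human-readable regression descriptions.
    """
    regressions = []
    for key in ("p50_ms", "p95_ms", "p99_ms"):
        if current[key] > baseline[key] * (1 + tolerance):
            regressions.append(f"{key}: {current[key]} vs baseline {baseline[key]}")
    for key in ("signals_per_second",):
        if current[key] < baseline[key] * (1 - tolerance):
            regressions.append(f"{key}: {current[key]} vs baseline {baseline[key]}")
    return regressions
```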

Harness Design

The load harness should be separate from the normal fast integration suite.

Recommended structure:

  • keep correctness-focused Oracle AQ tests in the current integration project
  • add categorized performance tests with explicit categories such as:
    • WorkflowPerfLatency
    • WorkflowPerfThroughput
    • WorkflowPerfSmoke
    • WorkflowPerfNightly
    • WorkflowPerfSoak
    • WorkflowPerfCapacity
  • keep scenario builders reusable so the same workflow/transports can be used in correctness and performance runs

The harness should include:

  • scenario driver
  • result collector
  • metric aggregator
  • optional Oracle observation collector
  • artifact writer
  • explicit phase-latency capture for start, signal publish, and signal-to-completion on the synthetic signal round-trip workload
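
The phase-latency capture reduces to timestamping named checkpoints and taking deltas. A language-agnostic sketch (the phase names are assumptions matching the synthetic signal round-trip workload):

```python
import time

class PhaseClock:
    """Capture phase timestamps for one synthetic signal round-trip.

    The driver marks each checkpoint as it happens; phase latencies
    fall out as deltas between marks.
    """
    def __init__(self):
        self.marks: dict[str, float] = {}

    def mark(self, phase: str) -> None:
        self.marks[phase] = time.perf_counter()

    def phase_ms(self, start_phase: str, end_phase: str) -> float:
        return (self.marks[end_phase] - self.marks[start_phase]) * 1000.0
```

A driver would call `mark("start")`, `mark("signal_published")`, and `mark("completed")`, then persist `start -> signal_published` and `signal_published -> completed` as the two phase latencies.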

Multi-Backend Expansion Rules

Once Oracle is the validated reference baseline, PostgreSQL and MongoDB must adopt the same load and performance structure instead of inventing backend-specific suites first.

Required rules:

  • keep one shared scenario catalog for Oracle, PostgreSQL, and MongoDB
  • compare backends first on normalized workflow metrics, not backend-native counters
  • keep backend-native metrics as appendices, not as the headline result
  • use the same tier names and artifact schema across all backends
  • keep the same curated Bulstrad workload pack across all backends unless a workflow is backend-blocked by a real functional defect

The shared artifact set should ultimately include:

  • 10-oracle-performance-baseline-<date>.md/.json
  • 11-postgres-performance-baseline-<date>.md/.json
  • 12-mongo-performance-baseline-<date>.md/.json
  • 13-backend-comparison-<date>.md/.json

The shared normalized metrics are:

  • serial end-to-end latency
  • start-to-first-task latency
  • signal-publish-to-visible-resume latency
  • steady-state throughput
  • capacity ladder at c1, c4, c8, and c16
  • backlog drain time
  • failures
  • dead letters
  • runtime conflicts
  • stuck instances

Backend-native appendices should include:

  • Oracle:
    • AQ browse depth
    • V$SYSSTAT deltas
    • V$SYS_TIME_MODEL deltas
    • top wait deltas
  • PostgreSQL:
    • queue-table depth
    • pg_stat_database
    • pg_stat_statements
    • lock and wait observations
    • WAL pressure observations
  • MongoDB:
    • signal collection depth
    • serverStatus counters
    • transaction counters
    • change-stream wake observations
    • lock percentage observations

Oracle-Specific Observation Plan

For Oracle-backed runs, observe both the engine and the database.

At minimum, record:

  • AQ browse depth before, during, and after the run
  • count of runtime-state rows touched
  • count of task and task-event rows created
  • number of dead-lettered signals
  • duplicate/stale resume ignore count

If the environment allows deeper Oracle access, also record:

  • session count for the service user
  • top wait classes during the run
  • lock waits on workflow tables
  • statement time for key mutation queries

Exit Criteria

The load/performance work is complete when:

  • PR smoke scenarios are stable and cheap enough to run continuously
  • nightly characterization produces persisted metrics and useful regression signal
  • at least one weekly soak run is stable without correctness drift
  • representative Bulstrad families have measured latency and throughput envelopes
  • Oracle restart, provider restart, DLQ replay, and duplicate-delivery scenarios are all characterized under load
  • the team can state a first production sizing recommendation for one-node and multi-node deployments

Next Sprint Shape

This plan maps naturally to a dedicated sprint focused on:

  • performance harness infrastructure
  • synthetic scenario library
  • representative Bulstrad workload runner
  • metrics artifact generation
  • baseline capture
  • first capacity report