# 08. Load And Performance Plan

## Purpose

This document defines how the Serdica workflow engine should be load-tested, performance-characterized, and capacity-sized once functional parity is in place.

The goal is not only to prove that the engine is correct under load, but to answer these product and platform questions:

- how many workflow starts, task completions, and signal resumes can one node sustain
- how quickly does backlog drain after restart or outage
- how much timing variance is normal for Oracle AQ on local Docker, CI, and shared environments
- which workloads are Oracle-bound, AQ-bound, or engine-bound
- which scenarios are safe to gate in PR and which belong in nightly or explicit soak runs

## Principles

The performance plan follows these rules:

- correctness comes first; a fast but lossy engine result is a failed run
- performance tests must be split by intent: smoke, characterization, stress, soak, and failure-under-load
- transport-only tests and full workflow tests must both exist; they answer different questions
- synthetic workflows are required for stable measurement
- representative Bulstrad workflows are required for product confidence
- PR gates should use coarse, stable envelopes
- nightly and explicit runs should record and compare detailed metrics
- Oracle and AQ behavior must be measured directly, not inferred from app logs alone

## What Must Be Measured

### Correctness Under Load

Every load run should capture:

- total workflows started
- total tasks activated
- total tasks completed
- total signals published
- total signals processed
- total signals ignored as stale or duplicate
- total dead-lettered signals
- total runtime concurrency conflicts
- total failed runs
- total stuck instances at end of run

Correctness invariants:

- no lost committed signal
- no duplicate open task for the same logical wait
- no orphan subworkflow frame
- no runtime state row left without a valid, explainable wait reason
- no queue backlog remaining after a successful drain phase unless the scenario intentionally leaves poison messages in the DLQ

### Latency

The engine should measure at least:

- start-to-first-task latency
- start-to-completion latency
- task-complete-to-next-task latency
- signal-publish-to-task-visible latency
- timer-due-to-resume latency
- delayed-message lateness relative to requested due time
- backlog-drain completion time
- restart-to-first-processed-signal time

These should be recorded as:

- average
- p50
- p95
- p99
- max

### Throughput

The engine should measure:

- workflows started per second
- task completions per second
- signals published per second
- signals processed per second
- backlog drain rate in signals per second
- completed end-to-end business workflows per minute

### Saturation

The engine should measure:

- app process CPU
- app process private memory and working set
- Oracle container CPU and memory when running locally
- queue depth over time
- active waiting instances over time
- dead-letter depth over time
- runtime state update conflicts over time
- open task count over time

### Oracle-Side Signals

If the environment permits access, also collect:

- AQ queue depth before, during, and after load
- queue-table growth during sustained runs
- visible dequeue lag
- Oracle session count for the test service
- lock or wait spikes on workflow tables
- transaction duration for mutation transactions

If the environment does not permit these views, fall back to:

- app-side timing
- browse counts from AQ
- workflow table row counts
- signal pump telemetry snapshots

## Workload Model

The load plan should be split into four workload families.

### 1. Transport Microbenchmarks

These isolate Oracle AQ behavior from workflow logic.
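As a rough illustration, one of these microbenchmarks (burst enqueue followed by drain) can be sketched transport-agnostically. The `enqueue` and `dequeue` callables below are assumed stand-ins for the real AQ operations, not part of any existing harness:

```python
import time

def burst_and_drain(enqueue, dequeue, count):
    """Enqueue `count` messages as fast as possible, then drain the queue.

    `enqueue(payload)` sends one message; `dequeue()` returns one message
    or None when the queue is empty. Returns enqueue throughput and drain
    rate in messages per second.
    """
    t0 = time.perf_counter()
    for i in range(count):
        enqueue(i)
    t1 = time.perf_counter()

    drained = 0
    while dequeue() is not None:
        drained += 1
    t2 = time.perf_counter()

    return {
        "enqueued": count,
        "drained": drained,
        "enqueue_per_sec": count / max(t1 - t0, 1e-9),
        "drain_per_sec": drained / max(t2 - t1, 1e-9),
    }
```

Wiring the two callables to the real AQ enqueue/dequeue calls keeps the measurement loop identical across immediate, delayed, and mixed bursts.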
Use them to answer:

- how fast can AQ accept immediate messages
- how fast can AQ release delayed messages
- what is the drain rate for mixed backlogs
- how much delayed-message jitter is normal

Core scenarios:

- burst immediate enqueue and drain
- burst delayed enqueue with the same due second
- mixed immediate and delayed enqueue on one queue
- dequeue rollback redelivery under sustained load
- dead-letter and replay backlog
- delayed backlog surviving Oracle restart

### 2. Synthetic Engine Workloads

These isolate the runtime from business-specific transport noise.

Recommended synthetic workflow types:

- start-to-complete with no task
- start-to-task with one human task
- signal-wait then task activation
- timer-wait then task activation
- continue-with dispatcher chain
- parent-child subworkflow chain

Use them to answer:

- raw start throughput
- raw resume throughput
- timer-due drain rate
- subworkflow coordination cost
- task activation/update cost

### 3. Representative Bulstrad Workloads

These prove that realistic product workflows behave well under load.

The first performance wave should use workflows that are already functionally covered in the Oracle suite:

- `AssistantPrintInsisDocuments`
- `OpenForChangePolicy`
- `ReviewPolicyOpenForChange`
- `AssistantAddAnnex`
- `AnnexCancellation`
- `AssistantPolicyCancellation`
- `AssistantPolicyReinstate`
- `InsisIntegrationNew`
- `QuotationConfirm`
- `QuoteOrAplCancel`

Use them to answer:

- how the engine behaves with realistic transport payload shaping
- how nested child workflows affect latency
- how multi-step review chains behave during backlog drain
- how short utility flows compare to long policy chains

### 4. Failure-Under-Load Workloads

These are not optional. A production engine must be tested while busy.
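A check worth automating for these runs is that backlog drain stays bounded and ends empty even while faults are injected. A minimal sketch over periodic queue-depth samples follows; the `slack` tolerance for transient growth is an assumption:

```python
def drain_is_bounded(depth_samples, slack=0):
    """Check drain behavior from periodic queue-depth samples.

    Passes when depth never rises more than `slack` above the lowest
    depth seen so far (bounded transient growth only) and the final
    sample is zero (fully drained). `depth_samples` must be non-empty.
    """
    low = depth_samples[0]
    for depth in depth_samples:
        if depth > low + slack:
            return False          # backlog regrew beyond the allowed slack
        low = min(low, depth)
    return depth_samples[-1] == 0  # drain must finish empty
```

With `slack=0` this enforces strictly non-increasing depth; a small positive slack tolerates the brief regrowth expected around a provider or Oracle restart.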
Scenarios:

- provider restart during active signal drain
- Oracle restart while a delayed backlog exists
- dead-letter replay while new live signals continue to arrive
- duplicate signal storm against the same waiting instance set
- one worker repeatedly failing while another healthy worker continues
- scheduled backlog plus external-signal backlog mixed together

Use them to answer:

- whether recovery stays bounded
- whether backlog drain remains monotonic
- whether duplicate-delivery protections still hold under pressure
- whether DLQ replay can safely coexist with live traffic

## Test Tiers

Performance testing should not be a single bucket.

### Tier 1: PR Smoke

Purpose:

- catch catastrophic regressions quickly

Characteristics:

- small datasets
- short run time
- deterministic scenarios
- hard pass/fail envelopes

Recommended scope:

- one AQ immediate burst
- one AQ delayed backlog burst
- one synthetic signal-resume scenario
- one short Bulstrad business flow

Target duration:

- under 5 minutes total

Gating style:

- zero correctness failures
- no DLQ entries unless explicitly expected
- coarse latency ceilings only

### Tier 2: Nightly Characterization

Purpose:

- measure trends and detect meaningful performance regression

Characteristics:

- moderate dataset
- multiple concurrency levels
- metrics persisted as artifacts

Recommended scope:

- full Oracle transport matrix
- synthetic engine workloads at 1, 4, 8, and 16-way concurrency
- 3-5 representative Bulstrad families
- restart and DLQ replay under moderate backlog

Target duration:

- 15 to 45 minutes

Gating style:

- correctness failures fail the run
- latency/throughput compared against baseline with tolerance

### Tier 3: Weekly Soak

Purpose:

- detect leaks, drift, and long-tail timing issues

Characteristics:

- long-running mixed workload
- periodic restarts or controlled faults
- queue depth and runtime-state stability tracking

Recommended scope:

- 30 to 120 minutes of mixed load
- immediate, delayed, and replay traffic mixed together
- repeated provider restarts
- one Oracle restart in the middle of the run

Gating style:

- no unbounded backlog growth
- no stuck instances
- no memory growth trend outside a defined envelope

### Tier 4: Explicit Capacity And Breakpoint Runs

Purpose:

- learn real limits before production sizing decisions

Characteristics:

- not part of normal CI
- intentionally pushes throughput until latency or failure thresholds break

Recommended scope:

- ramp concurrency upward until queue lag or DB pressure exceeds target
- test one-node and multi-node configurations
- record saturation points, not just pass/fail

Deliverable:

- capacity report with recommended node counts and operational envelopes

## Scenario Matrix

The initial scenario matrix should look like this.

### Oracle AQ Transport

- immediate burst: 100, 500, 1000 messages
- delayed burst: 50, 100, 250 messages due in the same second
- mixed burst: 70 percent immediate, 30 percent delayed
- redelivery burst: 25 messages rolled back once then committed
- DLQ burst: 25 poison messages then replay

### Synthetic Engine

- start-to-task: 50, 200, 500 workflow starts
- task-complete-to-next-task: 50, 200 completions
- signal-wait-resume: 50, 200, 500 waiting instances resumed concurrently
- timer-wait-resume: 50, 200 due timers
- subworkflow chain: 25, 100 parent-child chains

### Bulstrad Business

- short business flow: `QuoteOrAplCancel`
- medium transport flow: `InsisIntegrationNew`
- child-workflow flow: `QuotationConfirm`
- long review chain: `OpenForChangePolicy`
- print flow: `AssistantPrintInsisDocuments`
- cancellation flow: `AnnexCancellation`

### Failure Under Load

- 100 waiting instances, provider restart during drain
- 100 delayed messages, Oracle restart before due time
- 50 poison signals plus live replay traffic
- duplicate external signal storm against 50 waiting instances
- mixed task completions and signal resumes on the same service instance set

## Concurrency Steps

Use explicit concurrency ladders instead of one arbitrary load value.

Recommended first ladder:

- 1
- 4
- 8
- 16
- 32

Use different ladders if the environment is too small, but always record:

- node count
- worker concurrency
- queue backlog size
- workflow count
- message mix

## Metrics Collection Design

The harness should persist results for every performance run.

Each result set should include:

- scenario name
- git commit or working tree marker
- test timestamp
- environment label
- node count
- concurrency level
- workflow count
- signal count
- Oracle queue names used
- measured latency summary
- throughput summary
- correctness summary
- process resource summary
- optional Oracle observations

Recommended output format:

- JSON artifact for machines
- short markdown summary for humans

Recommended location:

- `TestResults/workflow-performance/`

## Baseline Strategy

Do not hard-code aggressive latency thresholds before collecting stable data.

Use this sequence:

1. characterization phase: run each scenario several times on local Docker and CI Oracle
2. baseline phase: record stable p50, p95, p99, throughput, and drain-rate envelopes
3. gating phase: add coarse PR thresholds and tighter nightly regression detection

PR thresholds should be:

- intentionally forgiving
- correctness-first
- designed to catch major regressions only

Nightly thresholds should be:

- baseline-relative
- environment-specific if necessary
- reviewed whenever Oracle container images or CI hardware changes

## Harness Design

The load harness should be separate from the normal fast integration suite.
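One piece the harness will need is the baseline-relative comparison used in the gating phase above. A minimal sketch follows; the metric-name convention (a `_ms` suffix marking latency metrics) and the default 20 percent tolerance are assumptions, not settled conventions:

```python
def check_against_baseline(current, baseline, tolerance=0.20):
    """Compare a run's metric summary against a recorded baseline.

    Both arguments map metric name -> value. Metrics ending in `_ms`
    are latencies (higher is worse); all others are throughputs
    (lower is worse). Returns the names of regressed metrics; an
    empty list means the run passes.
    """
    regressions = []
    for name, base in baseline.items():
        cur = current[name]
        if name.endswith("_ms"):
            # latency regression: grew beyond the allowed tolerance
            if cur > base * (1 + tolerance):
                regressions.append(name)
        else:
            # throughput regression: shrank beyond the allowed tolerance
            if cur < base * (1 - tolerance):
                regressions.append(name)
    return regressions
```

Keeping the tolerance a parameter lets PR gates run the same comparison with a much looser value than nightly runs.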
Recommended structure:

- keep correctness-focused Oracle AQ tests in the current integration project
- add categorized performance tests with explicit categories such as:
  - `WorkflowPerfLatency`
  - `WorkflowPerfThroughput`
  - `WorkflowPerfSmoke`
  - `WorkflowPerfNightly`
  - `WorkflowPerfSoak`
  - `WorkflowPerfCapacity`
- keep scenario builders reusable so the same workflows/transports can be used in correctness and performance runs

The harness should include:

- scenario driver
- result collector
- metric aggregator
- optional Oracle observation collector
- artifact writer
- explicit phase-latency capture for start, signal publish, and signal-to-completion on the synthetic signal round-trip workload

## Multi-Backend Expansion Rules

Once Oracle is the validated reference baseline, PostgreSQL and MongoDB must adopt the same load and performance structure instead of inventing backend-specific suites first.

Required rules:

- keep one shared scenario catalog for Oracle, PostgreSQL, and MongoDB
- compare backends first on normalized workflow metrics, not backend-native counters
- keep backend-native metrics as appendices, not as the headline result
- use the same tier names and artifact schema across all backends
- keep the same curated Bulstrad workload pack across all backends unless a workflow is backend-blocked by a real functional defect

The shared artifact set should ultimately include:

- `10-oracle-performance-baseline-.md/.json`
- `11-postgres-performance-baseline-.md/.json`
- `12-mongo-performance-baseline-.md/.json`
- `13-backend-comparison-.md/.json`

The shared normalized metrics are:

- serial end-to-end latency
- start-to-first-task latency
- signal-publish-to-visible-resume latency
- steady-state throughput
- capacity ladder at `c1`, `c4`, `c8`, and `c16`
- backlog drain time
- failures
- dead letters
- runtime conflicts
- stuck instances

Backend-native appendices should include:

- Oracle:
  - AQ browse depth
  - `V$SYSSTAT` deltas
  - `V$SYS_TIME_MODEL` deltas
  - top wait deltas
- PostgreSQL:
  - queue-table depth
  - `pg_stat_database`
  - `pg_stat_statements`
  - lock and wait observations
  - WAL pressure observations
- MongoDB:
  - signal collection depth
  - `serverStatus` counters
  - transaction counters
  - change-stream wake observations
  - lock percentage observations

## Oracle-Specific Observation Plan

For Oracle-backed runs, observe both the engine and the database.

At minimum, record:

- AQ browse depth before, during, and after the run
- count of runtime-state rows touched
- count of task and task-event rows created
- number of dead-lettered signals
- duplicate/stale resume ignore count

If the environment allows deeper Oracle access, also record:

- session count for the service user
- top wait classes during the run
- lock waits on workflow tables
- statement time for key mutation queries

## Exit Criteria

The load/performance work is complete when:

- PR smoke scenarios are stable and cheap enough to run continuously
- nightly characterization produces persisted metrics and a useful regression signal
- at least one weekly soak run is stable without correctness drift
- representative Bulstrad families have measured latency and throughput envelopes
- Oracle restart, provider restart, DLQ replay, and duplicate-delivery scenarios are all characterized under load
- the team can state a first production sizing recommendation for one-node and multi-node deployments

## Next Sprint Shape

This plan maps naturally to a dedicated sprint focused on:

- performance harness infrastructure
- synthetic scenario library
- representative Bulstrad workload runner
- metrics artifact generation
- baseline capture
- first capacity report
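Finally, as a concrete illustration of the shared metric shape, the latency summaries used throughout this plan (average, p50, p95, p99, max) can come from one helper reused by every backend so the normalized numbers stay comparable. A sketch using nearest-rank percentiles:

```python
def latency_summary(samples_ms):
    """Summarize latency samples (milliseconds) into the percentile
    shape used across all performance artifacts. `samples_ms` must be
    a non-empty sequence of numbers."""
    ordered = sorted(samples_ms)
    n = len(ordered)

    def pct(p):
        # nearest-rank percentile: smallest sample with at least p% below or equal
        rank = int(p / 100 * n + 0.5)
        return ordered[min(n - 1, max(0, rank - 1))]

    return {
        "avg": sum(ordered) / n,
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "max": ordered[-1],
    }
```

Nearest-rank is deliberately simple and reproducible; if interpolated percentiles are preferred, the harness can swap the `pct` helper without changing the artifact schema.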