# Performance Testing Pipeline for Queue-Based Workflows
> **Note**: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.
---
## What we're measuring (plain English)
* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job.
* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round-trip.
* **Worker service time:** time to pick up, process, and ack.
* **Queueing delay:** time spent waiting in the queue (arrival → start of worker).
These four add up to the "hop latency" users feel when the system is under load.
---
## Minimal tracing you can add today
Emit these IDs/headers end-to-end:
* `x-stella-corr-id` (uuid)
* `x-stella-enq-ts` (gateway enqueue ts, ns)
* `x-stella-claim-ts` (worker claim ts, ns)
* `x-stella-done-ts` (worker done ts, ns)
From these, compute:
* `queue_delay = claim_ts - enq_ts`
* `service_time = done_ts - claim_ts`
* `http_ttfs = gateway_first_byte_ts - http_request_start_ts`
* `hop_latency = done_ts - enq_ts` (or return-path if synchronous)
Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock.
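As a minimal sketch of the derived durations (the record and field names are illustrative, not a fixed schema), once the three nanosecond timestamps are available:
```csharp
// Sketch: derive the durations from the propagated nanosecond timestamps.
// TimeSpan ticks are 100 ns, so ns / 100 converts to ticks.
public readonly record struct HopTimings(long EnqTsNs, long ClaimTsNs, long DoneTsNs)
{
    public TimeSpan QueueDelay  => TimeSpan.FromTicks((ClaimTsNs - EnqTsNs) / 100);
    public TimeSpan ServiceTime => TimeSpan.FromTicks((DoneTsNs - ClaimTsNs) / 100);
    public TimeSpan HopLatency  => TimeSpan.FromTicks((DoneTsNs - EnqTsNs) / 100);
}
```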
---
## Valkey commands (safe, BSD Valkey)
Use **Valkey Streams + Consumer Groups** for fairness and metrics:
* Enqueue: `XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Add a small Lua for timestamping at enqueue (atomic):
```lua
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
```
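If the gateway is .NET, a hedged sketch of the enqueue side using StackExchange.Redis (which speaks the Redis protocol Valkey implements) might look like the following; the `valkey:6379` endpoint and the payload are placeholders. The Lua variant above keeps the timestamp atomic with the `XADD`; the plain client call below is the simpler alternative if gateway clocks are NTP-synced.
```csharp
// Sketch: gateway-side enqueue (XADD jobs * corr <uuid> enq <ns> p <payload>).
using StackExchange.Redis;

var mux = await ConnectionMultiplexer.ConnectAsync("valkey:6379");
var db  = mux.GetDatabase();

// Ensure the stream and consumer group exist (normally done once in setup code).
try { await db.StreamCreateConsumerGroupAsync("jobs", "workers", StreamPosition.NewMessages, createStream: true); }
catch (RedisServerException) { /* BUSYGROUP: group already exists */ }

var corr   = Guid.NewGuid().ToString();
long enqNs = (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100;   // wall-clock ns since the Unix epoch

await db.StreamAddAsync("jobs", new NameValueEntry[]
{
    new("corr", corr),
    new("enq", enqNs),
    new("p", "{\"data\":\"ping\"}"),
});
```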
---
## Load shapes to test (find the envelope)
1. **Open-loop (arrival-rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
3. **Step-up/down:** double every 2 min until SLO breach; then halve down.
4. **Long tail soak:** run at 70–80% of max for 1h; watch p95–p99.9 drift.
Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**.
---
## k6 script (HTTP client pressure)
```javascript
// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export let options = {
  scenarios: {
    step_load: {
      executor: 'ramping-arrival-rate',
      startRate: 20, timeUnit: '1s',
      preAllocatedVUs: 200, maxVUs: 5000,
      stages: [
        { target: 50, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 200, duration: '1m' },
        { target: 400, duration: '1m' },
        { target: 800, duration: '1m' },
      ],
    },
  },
  thresholds: {
    'http_req_failed': ['rate<0.01'],
    'http_req_duration{phase:hop}': ['p(95)<500'],
  },
};

export default function () {
  const corr = uuidv4();
  const res = http.post(
    __ENV.GW_URL,
    JSON.stringify({ data: 'ping', corr }),
    {
      headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
      tags: { phase: 'hop' },
    }
  );
  check(res, { 'status 2xx/202': (r) => r.status === 200 || r.status === 202 });
  sleep(0.01);
}
```
Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js`
---
## Worker hooks (.NET 10 sketch)
```csharp
// At claim
var now = Stopwatch.GetTimestamp(); // monotonic
var claimNs = now.ToNanoseconds();
log.AddTag("x-stella-claim-ts", claimNs);
// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics
```
Helper:
```csharp
using System.Diagnostics;

public static class MonoTime {
    static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
    public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}
```
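Putting the hooks together, a hedged worker-loop sketch (StackExchange.Redis; `ProcessAsync` and the endpoint are placeholders, and the claim side polls because this client does not support blocking `XREADGROUP`):
```csharp
// Sketch: claim → process → ack, recording claim/done and the derived durations.
// Wall-clock ns keeps queue_delay comparable with the gateway's enq timestamp (see clock note below).
using StackExchange.Redis;

static long NowNs() => (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100;
static Task ProcessAsync(StreamEntry entry) => Task.Delay(10);   // stand-in for real work

var db = (await ConnectionMultiplexer.ConnectAsync("valkey:6379")).GetDatabase();

while (true)
{
    var entries = await db.StreamReadGroupAsync("jobs", "workers", "w1",
        StreamPosition.NewMessages, count: 1);
    if (entries.Length == 0) { await Task.Delay(20); continue; }   // poll; no BLOCK support

    foreach (var entry in entries)
    {
        long claimNs = NowNs();
        await ProcessAsync(entry);
        long doneNs = NowNs();

        long enqNs = (long)entry["enq"];
        var corr   = entry["corr"];
        Console.WriteLine($"corr={corr} queue_delay_ns={claimNs - enqNs} " +
                          $"service_time_ns={doneNs - claimNs} hop_ns={doneNs - enqNs}");

        await db.StreamAcknowledgeAsync("jobs", "workers", entry.Id);
    }
}
```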
---
## Prometheus metrics to expose
* `valkey_enqueue_ns` (histogram)
* `valkey_claim_block_ms` (gauge)
* `worker_service_ns` (histogram, labels: worker_type, route)
* `queue_delay_ns` (histogram, derived from the enq/claim timestamps)
* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`)
* `enqueue_rate`, `dequeue_rate` (counters)
Example recording rules:
```yaml
- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
  expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95
  expr: hop:queue_delay_p95 + hop:service_time_p95
```
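One hedged way to expose these from the .NET services is the prometheus-net client (an assumption; any Prometheus client library works). Bucket boundaries and label values below are illustrative:
```csharp
// Sketch: declare the histograms and scrape them on :9090/metrics (prometheus-net).
using Prometheus;

var metricServer = new MetricServer(port: 9090);
metricServer.Start();

var serviceHistogram = Metrics.CreateHistogram(
    "worker_service_ns", "Worker service time in nanoseconds.",
    new HistogramConfiguration
    {
        LabelNames = new[] { "worker_type", "route" },
        Buckets = Histogram.ExponentialBuckets(100_000, 2, 20),   // ~0.1 ms to ~52 s
    });

var queueDelayHistogram = Metrics.CreateHistogram(
    "queue_delay_ns", "Time from enqueue to claim in nanoseconds.",
    new HistogramConfiguration { Buckets = Histogram.ExponentialBuckets(100_000, 2, 20) });

// Example observation (values would come from the measured durations, in ns):
serviceHistogram.WithLabels("default", "/hop").Observe(12_345_678);
queueDelayHistogram.Observe(3_456_789);
```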
---
## Autoscaling signals (HPA/KEDA friendly)
* **Primary:** queue depth & its derivative (d/dt).
* **Secondary:** p95 `queue_delay` and worker CPU.
* **Safety:** max in-flight per worker; backpressure HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8`.
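A hedged sketch of that backpressure rule in an ASP.NET Core gateway; the depth threshold, endpoint names, and the per-request `XLEN` call are placeholders (in practice, sample the depth on a timer rather than per request):
```csharp
// Sketch: shed load with HTTP 429 when the jobs stream is deeper than a configured limit.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using StackExchange.Redis;

var app = WebApplication.Create(args);
var db  = (await ConnectionMultiplexer.ConnectAsync("valkey:6379")).GetDatabase();

const long maxQueueDepth = 10_000;   // "D" from the rule above; tune from the measured envelope

app.Use(async (context, next) =>
{
    long depth = await db.StreamLengthAsync("jobs");   // XLEN jobs; cache/sample this in production
    if (depth > maxQueueDepth)
    {
        context.Response.StatusCode = StatusCodes.Status429TooManyRequests;
        context.Response.Headers.RetryAfter = "1";
        return;
    }
    await next();
});

app.MapPost("/hop", () => Results.Accepted());   // stand-in for the real enqueue handler
app.Run();
```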
---
## Plot the "envelope" (what you'll look at)
* X-axis: **offered load** (req/s).
* Y-axis: **p95 hop latency** (ms).
* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises).
* Add secondary panel: **queue depth** vs load.
---
# Performance Test Guidelines
## HTTP → Valkey → Worker pipeline
## 1) Objectives and scope
### Primary objectives
Your performance tests MUST answer these questions with evidence:
1. **Capacity knee**: At what offered load does **queue delay** start growing sharply?
2. **User-impact envelope**: What are p50/p95/p99 **hop latency** curves vs offered load?
3. **Decomposition**: How much of hop latency is:
* gateway enqueue time
* Valkey enqueue/claim RTT
* queue wait time
* worker service time
4. **Scaling behavior**: How do these change with worker replica counts (N workers)?
5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?
### Non-goals (explicitly out of scope unless you add them later)
* Micro-optimizing single function runtime
* Synthetic "max QPS" records without a representative payload
* End-to-end-only tests that skip per-segment metrics (acceptable only for basic smoke runs)
---
## 2) Definitions and required metrics
### Required latency definitions (standardize these names)
Agents MUST compute and report these per request/job:
* **`t_http_accept`**: time from client send → gateway accepts request
* **`t_enqueue`**: time spent in gateway to enqueue into Valkey (server-side)
* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s)
* **`t_queue_delay`**: `claim_ts - enq_ts`
* **`t_service`**: `done_ts - claim_ts`
* **`t_hop`**: `done_ts - enq_ts` (this is the "true pipeline hop" latency)
* Optional but recommended:
* **`t_ack`**: time to ack completion (Valkey ack RTT)
* **`t_http_response`**: request start → gateway response sent (TTFB/TTFS)
### Required percentiles and aggregations
Per scenario step (e.g., each offered load plateau), agents MUST output:
* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue`
* Throughput: offered rps and achieved rps
* Error rate: HTTP failures, enqueue failures, worker failures
* Queue depth and backlog drain time
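How the percentiles are computed should also be pinned down so runs stay comparable. A minimal nearest-rank sketch over raw samples (the `samples` list is a placeholder for collected `t_hop` values in ns); raw-sample percentiles are exact, whereas `histogram_quantile` over bucketed data is an approximation:
```csharp
// Sketch: nearest-rank percentile over raw duration samples.
using System.Linq;

static long Percentile(List<long> samples, double p)
{
    if (samples.Count == 0) return 0;                              // no samples collected
    var sorted = samples.OrderBy(x => x).ToList();
    int rank = (int)Math.Ceiling(p / 100.0 * sorted.Count);        // 1-based nearest rank
    return sorted[Math.Clamp(rank - 1, 0, sorted.Count - 1)];
}

var samples = new List<long> { /* t_hop values in ns, per plateau */ };
Console.WriteLine($"p50={Percentile(samples, 50)} p95={Percentile(samples, 95)} p99={Percentile(samples, 99)}");
```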
### Required system-level telemetry (minimum)
Agents MUST collect these time series during tests:
* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
* **Gateway**: CPU/mem, request rate, response codes, request duration histogram
---
## 3) Environment and test hygiene requirements
### Environment requirements
Agents SHOULD run tests in an environment that matches production in:
* container CPU/memory limits
* number of nodes, network topology
* Valkey topology (single, cluster, sentinel, etc.)
* worker replica autoscaling rules (or deliberately disabled)
If exact parity isn't possible, agents MUST record all known differences in the report.
### Test hygiene (non-negotiable)
Agents MUST:
1. **Start from empty queues** (no backlog).
2. **Disable client retries** (or explicitly run two variants: retries off / retries on).
3. **Warm up** before measuring (e.g., 60s warm-up minimum).
4. **Hold steady plateaus** long enough to stabilize (usually 2–5 minutes per step).
5. **Cool down** and verify backlog drains (queue depth returns to baseline).
6. Record exact versions/SHAs of gateway/worker and Valkey config.
### Load generator hygiene
Agents MUST ensure the load generator is not the bottleneck:
* CPU < ~70% during test
* no local socket exhaustion
* enough VUs/connections
* if needed, distributed load generation
---
## 4) Instrumentation spec (agents implement this first)
### Correlation and timestamps
Agents MUST propagate an end-to-end correlation ID and timestamps.
**Required fields**
* `corr_id` (UUID)
* `enq_ts_ns` (set at enqueue, monotonic or consistent clock)
* `claim_ts_ns` (set by worker when job is claimed)
* `done_ts_ns` (set by worker when job processing ends)
**Where these live**
* HTTP request header: `x-corr-id: <uuid>`
* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type
* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns`
### Clock requirements
Agents MUST use a consistent timing source:
* Prefer monotonic timers for durations (Stopwatch / monotonic clock)
* If timestamps cross machines, ensure they're comparable:
* either rely on synchronized clocks (NTP) **and** monitor drift
* or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)
**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.
### Valkey queue semantics (recommended)
Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability:
* Enqueue: `XADD jobs * corr <uuid> enq <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Agents MUST record stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag.
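A hedged sampler sketch (StackExchange.Redis + prometheus-net; stream/group names and the 1 s interval are placeholders). Consumer-group lag as reported by Valkey/Redis 7's `XINFO GROUPS` may require a recent client version, so the pending-entry count is used here:
```csharp
// Sketch: periodically publish queue depth (XLEN) and pending entries (XINFO GROUPS).
using Prometheus;
using StackExchange.Redis;

var db = (await ConnectionMultiplexer.ConnectAsync("valkey:6379")).GetDatabase();
var depthGauge   = Metrics.CreateGauge("queue_depth", "Entries in the jobs stream.");
var pendingGauge = Metrics.CreateGauge("queue_pending", "Delivered-but-unacked entries for the workers group.");

using var timer = new PeriodicTimer(TimeSpan.FromSeconds(1));
while (await timer.WaitForNextTickAsync())
{
    depthGauge.Set(await db.StreamLengthAsync("jobs"));            // XLEN jobs

    foreach (var group in await db.StreamGroupInfoAsync("jobs"))   // XINFO GROUPS jobs
        if (group.Name == "workers")
            pendingGauge.Set(group.PendingMessageCount);
}
```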
### Metrics exposure
Agents MUST publish Prometheus (or equivalent) histograms:
* `gateway_enqueue_seconds` (or ns) histogram
* `valkey_enqueue_rtt_seconds` histogram
* `worker_service_seconds` histogram
* `queue_delay_seconds` histogram (derived from timestamps; can be computed in worker or offline)
* `hop_latency_seconds` histogram
---
## 5) Workload modeling and test data
Agents MUST define a workload model before running capacity tests:
1. **Endpoint(s)**: list exact gateway routes under test
2. **Payload types**: small/typical/large
3. **Mix**: e.g., 70/25/5 by payload size
4. **Idempotency rules**: ensure repeated jobs don't corrupt state
5. **Data reset strategy**: how test data is cleaned or isolated per run
Agents SHOULD test at least:
* Typical payload (p50)
* Large payload (p95)
* Worst-case allowed payload (bounded by your API limits)
---
## 6) Scenario suite your agents MUST implement
Each scenario MUST be defined as code/config (not manual).
### Scenario A — Smoke (fast sanity)
**Goal**: verify instrumentation + basic correctness
**Load**: low (e.g., 1–5 rps), 2 minutes
**Pass**:
* 0 backlog after run
* error rate < 0.1%
* metrics present for all segments
### Scenario B — Baseline (repeatable reference point)
**Goal**: establish a stable baseline for regression tracking
**Load**: fixed moderate load (e.g., 30–50% of expected capacity), 10 minutes
**Pass**:
* p95 `t_hop` within baseline ± tolerance (set after first runs)
* no upward drift in p95 across time (trend line ~flat)
### Scenario C — Capacity ramp (open-loop)
**Goal**: find the knee where queueing begins
**Method**: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):
* 50 rps for 2m
* 100 rps for 2m
* 200 rps for 2m
* 400 rps for 2m
* until SLO breach or errors spike
**MUST**:
* warm-up stage before first plateau
* record per-plateau summary
**Stop conditions** (any triggers stop):
* error rate > 1%
* queue depth grows without bound over an entire plateau
* p95 `t_hop` exceeds SLO for 2 consecutive plateaus
### Scenario D — Stress (push past capacity)
**Goal**: characterize failure mode and recovery
**Load**: 120–200% of knee load, 5–10 minutes
**Pass** (for resilience):
* system does not crash permanently
* once load stops, backlog drains within target time (define it)
### Scenario E — Burst / spike
**Goal**: see how quickly queue grows and drains
**Load shape**:
* baseline low load
* sudden burst (e.g., 10× for 10–30 s)
* return to baseline
**Report**:
* peak queue depth
* time to drain to baseline
* p99 `t_hop` during burst
### Scenario F — Soak (long-running)
**Goal**: detect drift (leaks, fragmentation, GC patterns)
**Load**: 70–85% of knee, 60–180 minutes
**Pass**:
* p95 does not trend upward beyond threshold
* memory remains bounded
* no rising error rate
### Scenario G — Scaling curve (worker replica sweep)
**Goal**: turn results into scaling rules
**Method**:
* Repeat Scenario C with worker replicas = 1, 2, 4, 8…
**Deliverable**:
* plot of knee load vs worker count
* p95 `t_service` vs worker count (should remain similar; queue delay should drop)
---
## 7) Execution protocol (runbook)
Agents MUST run every scenario using the same disciplined flow:
### Pre-run checklist
* confirm system versions/SHAs
* confirm autoscaling mode:
* **Off** for baseline capacity characterization
* **On** for validating autoscaling policies
* clear queues and consumer group pending entries
* restart or at least record "time since deploy" for services (cold vs warm)
### During run
* ensure load is truly open-loop when required (arrival-rate based)
* continuously record:
* offered vs achieved rate
* queue depth
* CPU/mem for gateway/worker/Valkey
### Post-run
* stop load
* wait until backlog drains (or record that it doesn't)
* export:
* k6/runner raw output
* Prometheus time series snapshot
* sampled logs with corr_id fields
* generate a summary report automatically (no hand calculations)
---
## 8) Analysis rules (how agents compute "the envelope")
Agents MUST generate at minimum two plots per run:
1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis)
* overlay p99 (and SLO line)
2. **Queue behavior**: offered load vs queue depth (or lag), plus drain time
### How to identify the "knee"
Agents SHOULD mark the knee as the first plateau where:
* queue depth grows monotonically within the plateau, **or**
* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50–100%), as in the sketch below
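A minimal sketch of that step-to-step rule; the input shape and the 50% growth default are assumptions to tune:
```csharp
// Sketch: flag the first plateau whose p95 queue delay grew by more than `growth` vs the previous one.
static int? FindKnee(IReadOnlyList<(double offeredRps, double p95QueueDelayMs)> plateaus, double growth = 0.5)
{
    for (int i = 1; i < plateaus.Count; i++)
    {
        double prev = plateaus[i - 1].p95QueueDelayMs;
        if (prev > 0 && (plateaus[i].p95QueueDelayMs - prev) / prev > growth)
            return i;                          // index of the first "knee" plateau
    }
    return null;                               // no knee within the tested range
}
```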
### Convert results into scaling guidance
Agents SHOULD compute:
* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker)
* recommended replicas for offered load λ at target utilization U:
* `workers_needed = ceil(λ * mean(t_service) / U)`
* choose U ~ 0.6–0.75 for headroom
This should be reported alongside the measured envelope.
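As a worked example of the rule above (numbers are illustrative): with a mean service time of 40 ms, capacity_per_worker ≈ 25 jobs/sec; serving 400 rps at U = 0.7 needs ceil(400 × 0.040 / 0.7) = 23 workers.
```csharp
// Sketch of the sizing rule; inputs are illustrative.
static int WorkersNeeded(double offeredRps, double meanServiceSeconds, double targetUtilization = 0.7) =>
    (int)Math.Ceiling(offeredRps * meanServiceSeconds / targetUtilization);

Console.WriteLine(WorkersNeeded(offeredRps: 400, meanServiceSeconds: 0.040));   // 23
```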
---
## 9) Pass/fail criteria and regression gates
Agents MUST define gates in configuration, not in someone's head.
Suggested gating structure:
* **Smoke gate**: error rate < 0.1%, backlog drains
* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history)
* **Capacity gate**: knee load regression < 10% (optional but very valuable)
* **Soak gate**: p95 drift over time < 15% and no memory runaway
---
## 10) Common pitfalls (agents must avoid)
1. **Closed-loop tests used for capacity**
Closed-loop ("N concurrent users") self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity.
2. **Ignoring queue depth**
A system can look "healthy" in request latency while silently building backlog.
3. **Measuring only gateway latency**
You must measure enqueue → claim → done to see the real hop.
4. **Load generator bottleneck**
If the generator saturates, you'll under-estimate capacity.
5. **Retries enabled by default**
Retries can inflate load and hide root causes; run with retries off first.
6. **Not controlling warm vs cold**
Cold caches vs warmed services produce different envelopes; record the condition.
---
# Agent implementation checklist (deliverables)
Assign these as concrete tasks to your agents.
## Agent 1 — Observability & tracing
MUST deliver:
* correlation ID propagation: gateway → Valkey → worker
* timestamps `enq/claim/done`
* Prometheus histograms for enqueue, service, hop
* queue depth metric (`XLEN` / `XINFO` lag)
## Agent 2 — Load test harness
MUST deliver:
* test runner scripts (k6 or equivalent) for scenarios A–G
* test config file (YAML/JSON) controlling:
* stages (rates/durations)
* payload mix
* headers (corr-id)
* reproducible seeds and version stamping
## Agent 3 — Result collector and analyzer
MUST deliver:
* a pipeline that merges:
* load generator output
* hop timing data (from logs or a completion stream)
* Prometheus snapshots
* automatic summary + plots:
* latency envelope
* queue depth/drain
* CSV/JSON exports for long-term tracking
## Agent 4 — Reporting and dashboards
MUST deliver:
* a standard report template that includes:
* environment details
* scenario details
* key charts
* knee estimate
* scaling recommendation
* Grafana dashboard with the required panels
## Agent 5 — CI / release integration
SHOULD deliver:
* PR-level smoke test (Scenario A)
* nightly baseline (Scenario B)
* weekly capacity sweep (Scenario C + scaling curve)
---
## Template: scenario spec (agents can copy/paste)
```yaml
test_run:
  system_under_test:
    gateway_sha: "<git sha>"
    worker_sha: "<git sha>"
    valkey_version: "<version>"
  environment:
    cluster: "<name>"
    workers: 4
    autoscaling: "off"   # off|on
  workload:
    endpoint: "/hop"
    payload_profile: "p50"
    mix:
      p50: 0.7
      p95: 0.25
      max: 0.05
  scenario:
    name: "capacity_ramp"
    mode: "open_loop"
    warmup_seconds: 60
    stages:
      - rps: 50
        duration_seconds: 120
      - rps: 100
        duration_seconds: 120
      - rps: 200
        duration_seconds: 120
      - rps: 400
        duration_seconds: 120
  gates:
    max_error_rate: 0.01
    slo_ms_p95_hop: 500
    backlog_must_drain_seconds: 300
  outputs:
    artifacts_dir: "./artifacts/<timestamp>/"
```
---
## Sample folder layout
```
perf/
  docker-compose.yml
  prometheus/
    prometheus.yml
  k6/
    lib.js
    smoke.js
    capacity_ramp.js
    burst.js
    soak.js
    stress.js
    scaling_curve.sh
  tools/
    analyze.py
  src/
    Perf.Gateway/
    Perf.Worker/
```
---
**Document Version**: 1.0
**Archived From**: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
**Archive Reason**: Wrong content was pasted; this performance testing content preserved for future use.