# Performance Testing Pipeline for Queue-Based Workflows

> **Note**: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.

---

## What we're measuring (plain English)

* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job.
* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round-trip.
* **Worker service time:** time to pick up, process, and ack.
* **Queueing delay:** time spent waiting in the queue (arrival → start of worker).

These four add up to the "hop latency" users feel when the system is under load.

---

## Minimal tracing you can add today

Emit these IDs/headers end-to-end:

* `x-stella-corr-id` (uuid)
* `x-stella-enq-ts` (gateway enqueue ts, ns)
* `x-stella-claim-ts` (worker claim ts, ns)
* `x-stella-done-ts` (worker done ts, ns)

From these, compute:

* `queue_delay = claim_ts - enq_ts`
* `service_time = done_ts - claim_ts`
* `http_ttfs = gateway_first_byte_ts - http_request_start_ts`
* `hop_latency = done_ts - enq_ts` (or return-path if synchronous)

Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix monotonic and wall-clock timestamps.

---

## Valkey commands (safe, BSD-licensed Valkey)

Use **Valkey Streams + Consumer Groups** for fairness and metrics:

* Enqueue: `XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <entry-id>`

Add a small Lua script for timestamping at enqueue (atomic):

```lua
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*', 'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
```

---

## Load shapes to test (find the envelope)

1. **Open-loop (arrival-rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
3. **Step-up/down:** double every 2 min until SLO breach; then halve down.
4. **Long-tail soak:** run at 70–80% of max for 1h; watch p95–p99.9 drift.

Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**.
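A minimal sketch of that per-step summary computation, assuming each completed job has been exported as a record with the enqueue/claim/done timestamps and a success flag (the record shape and field names are assumptions of the sketch, not part of the pipeline contract):

```python
# Per-plateau summary: latency percentiles, throughput, and error rate.
# Input: one list of job records per load plateau; field names are illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    enq_ts_ns: int    # x-stella-enq-ts
    claim_ts_ns: int  # x-stella-claim-ts
    done_ts_ns: int   # x-stella-done-ts
    ok: bool          # worker reported success

def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100); good enough for step summaries."""
    if not sorted_vals:
        return float("nan")
    idx = round(q / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]

def summarize(jobs, plateau_seconds):
    queue_delay = sorted((j.claim_ts_ns - j.enq_ts_ns) / 1e6 for j in jobs)   # ms
    service     = sorted((j.done_ts_ns - j.claim_ts_ns) / 1e6 for j in jobs)  # ms
    hop         = sorted((j.done_ts_ns - j.enq_ts_ns) / 1e6 for j in jobs)    # ms
    summary = {
        "throughput_rps": len(jobs) / plateau_seconds,
        "error_rate": sum(not j.ok for j in jobs) / max(len(jobs), 1),
    }
    for name, series in (("queue_delay", queue_delay), ("service", service), ("hop", hop)):
        for q in (50, 90, 95, 99):
            summary[f"{name}_p{q}_ms"] = percentile(series, q)
    return summary
```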
---

## k6 script (HTTP client pressure)

```javascript
// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  scenarios: {
    step_load: {
      executor: 'ramping-arrival-rate',
      startRate: 20,
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 5000,
      stages: [
        { target: 50, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 200, duration: '1m' },
        { target: 400, duration: '1m' },
        { target: 800, duration: '1m' },
      ],
    },
  },
  thresholds: {
    'http_req_failed': ['rate<0.01'],
    'http_req_duration{phase:hop}': ['p(95)<500'],
  },
};

export default function () {
  // crypto.randomUUID() requires a k6 build that exposes the WebCrypto API globally;
  // on older k6 versions, import uuidv4 from the k6-utils jslib instead.
  const corr = crypto.randomUUID();
  const res = http.post(
    __ENV.GW_URL,
    JSON.stringify({ data: 'ping', corr }),
    {
      headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
      tags: { phase: 'hop' },
    }
  );
  check(res, { 'status 2xx/202': r => r.status === 200 || r.status === 202 });
  sleep(0.01);
}
```

Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js`

---

## Worker hooks (.NET 10 sketch)

```csharp
// Requires: using System.Diagnostics;  (Stopwatch)

// At claim
var now = Stopwatch.GetTimestamp(); // monotonic
var claimNs = now.ToNanoseconds();
log.AddTag("x-stella-claim-ts", claimNs);

// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);

// Include corr-id and stream entry id in logs/metrics
```

Helper:

```csharp
public static class MonoTime
{
    static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
    public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}
```

---

## Prometheus metrics to expose

* `valkey_enqueue_ns` (histogram)
* `valkey_claim_block_ms` (gauge)
* `worker_service_ns` (histogram, labels: worker_type, route)
* `queue_delay_ns` (histogram, `claim_ts - enq_ts`, recorded by the worker)
* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`)
* `enqueue_rate`, `dequeue_rate` (counters)

Example recording rules:

```yaml
- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
  expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95
  expr: hop:queue_delay_p95 + hop:service_time_p95
```

---

## Autoscaling signals (HPA/KEDA friendly)

* **Primary:** queue depth & its derivative (d/dt).
* **Secondary:** p95 `queue_delay` and worker CPU.
* **Safety:** max in-flight per worker; backpressure with HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8`.

---

## Plot the "envelope" (what you'll look at)

* X-axis: **offered load** (req/s).
* Y-axis: **p95 hop latency** (ms).
* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises).
* Add secondary panel: **queue depth** vs load.
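A plotting sketch for this chart, assuming the per-plateau summaries have been collected into a list of dicts (matplotlib and the field names are assumptions of the sketch):

```python
# Latency envelope: p95/p99 hop latency vs offered load with the SLO line,
# plus a secondary panel for queue depth. Field names are illustrative.
import matplotlib.pyplot as plt

def plot_envelope(steps, slo_ms=500, out_path="envelope.png"):
    load  = [s["offered_rps"] for s in steps]
    p95   = [s["hop_p95_ms"] for s in steps]
    p99   = [s["hop_p99_ms"] for s in steps]
    depth = [s["max_queue_depth"] for s in steps]

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
    ax1.plot(load, p95, marker="o", label="p95 hop latency")
    ax1.plot(load, p99, linestyle="--", marker=".", label="p99 hop latency")
    ax1.axhline(slo_ms, color="red", linewidth=1, label=f"SLO {slo_ms} ms")
    ax1.set_ylabel("hop latency (ms)")
    ax1.legend()

    ax2.plot(load, depth, marker="o", color="gray")
    ax2.set_ylabel("peak queue depth")
    ax2.set_xlabel("offered load (req/s)")

    fig.tight_layout()
    fig.savefig(out_path)
```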
---

# Performance Test Guidelines

## HTTP → Valkey → Worker pipeline

## 1) Objectives and scope

### Primary objectives

Your performance tests MUST answer these questions with evidence:

1. **Capacity knee**: At what offered load does **queue delay** start growing sharply?
2. **User-impact envelope**: What are the p50/p95/p99 **hop latency** curves vs offered load?
3. **Decomposition**: How much of hop latency is:
   * gateway enqueue time
   * Valkey enqueue/claim RTT
   * queue wait time
   * worker service time
4. **Scaling behavior**: How do these change with worker replica counts (N workers)?
5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?

### Non-goals (explicitly out of scope unless you add them later)

* Micro-optimizing single-function runtime
* Synthetic "max QPS" records without a representative payload
* Tests that don't collect segment metrics (end-to-end only) for anything beyond basic smoke

---

## 2) Definitions and required metrics

### Required latency definitions (standardize these names)

Agents MUST compute and report these per request/job:

* **`t_http_accept`**: time from client send → gateway accepts request
* **`t_enqueue`**: time spent in the gateway to enqueue into Valkey (server-side)
* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s)
* **`t_queue_delay`**: `claim_ts - enq_ts`
* **`t_service`**: `done_ts - claim_ts`
* **`t_hop`**: `done_ts - enq_ts` (this is the "true pipeline hop" latency)
* Optional but recommended:
  * **`t_ack`**: time to ack completion (Valkey ack RTT)
  * **`t_http_response`**: request start → gateway response sent (TTFB/TTFS)

### Required percentiles and aggregations

Per scenario step (e.g., each offered-load plateau), agents MUST output:

* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue`
* Throughput: offered rps and achieved rps
* Error rate: HTTP failures, enqueue failures, worker failures
* Queue depth and backlog drain time

### Required system-level telemetry (minimum)

Agents MUST collect these time series during tests:

* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
* **Gateway**: CPU/mem, request rate, response codes, request duration histogram

---

## 3) Environment and test hygiene requirements

### Environment requirements

Agents SHOULD run tests in an environment that matches production in:

* container CPU/memory limits
* number of nodes, network topology
* Valkey topology (single, cluster, sentinel, etc.)
* worker replica autoscaling rules (or deliberately disabled)

If exact parity isn't possible, agents MUST record all known differences in the report.

### Test hygiene (non-negotiable)

Agents MUST:

1. **Start from empty queues** (no backlog).
2. **Disable client retries** (or explicitly run two variants: retries off / retries on).
3. **Warm up** before measuring (e.g., 60s warm-up minimum).
4. **Hold steady plateaus** long enough to stabilize (usually 2–5 minutes per step).
5. **Cool down** and verify the backlog drains (queue depth returns to baseline).
6. **Record exact versions/SHAs** of gateway/worker and the Valkey config.

### Load generator hygiene

Agents MUST ensure the load generator is not the bottleneck:

* CPU < ~70% during the test
* no local socket exhaustion
* enough VUs/connections
* if needed, distributed load generation

---

## 4) Instrumentation spec (agents implement this first)

### Correlation and timestamps

Agents MUST propagate an end-to-end correlation ID and timestamps.

**Required fields**

* `corr_id` (UUID)
* `enq_ts_ns` (set at enqueue, monotonic or consistent clock)
* `claim_ts_ns` (set by worker when the job is claimed)
* `done_ts_ns` (set by worker when job processing ends)

**Where these live**

* HTTP request header: `x-corr-id: <uuid>`
* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type
* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns`
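A minimal sketch of stamping and propagating these fields end-to-end, using the `redis` Python client against a Valkey stream (the client library, key/group names, and the `handle`/`record` helpers are assumptions of the sketch, mirroring the Streams examples in this document):

```python
# Gateway side: stamp corr_id + enq_ts_ns and enqueue; worker side: claim,
# stamp claim/done timestamps, compute segment durations, and ack.
import time
import uuid
import redis

r = redis.Redis(host="valkey", port=6379, decode_responses=True)
# Assumes the consumer group already exists, e.g.:
#   r.xgroup_create("jobs", "workers", id="$", mkstream=True)

def enqueue(payload: str) -> str:
    corr_id = str(uuid.uuid4())             # also sent as the x-corr-id header
    enq_ts_ns = time.time_ns()              # wall-clock ns; hosts NTP-synced
    r.xadd("jobs", {"corr": corr_id, "enq": enq_ts_ns, "p": payload})
    return corr_id

def claim_and_process(consumer: str = "w1"):
    resp = r.xreadgroup("workers", consumer, {"jobs": ">"}, count=1, block=1000)
    if not resp:
        return                              # nothing claimed within the block window
    _, entries = resp[0]
    entry_id, fields = entries[0]
    claim_ts_ns = time.time_ns()
    queue_delay_ns = claim_ts_ns - int(fields["enq"])

    handle(fields["p"])                     # your actual work
    done_ts_ns = time.time_ns()
    service_ns = done_ts_ns - claim_ts_ns

    r.xack("jobs", "workers", entry_id)
    record(fields["corr"], queue_delay_ns, service_ns)  # emit to logs/metrics

def handle(payload): ...
def record(corr_id, queue_delay_ns, service_ns): ...
```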
### Clock requirements

Agents MUST use a consistent timing source:

* Prefer monotonic timers for durations (Stopwatch / monotonic clock)
* If timestamps cross machines, ensure they're comparable:
  * either rely on synchronized clocks (NTP) **and** monitor drift
  * or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)

**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.

### Valkey queue semantics (recommended)

Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability:

* Enqueue: `XADD jobs * corr <uuid> enq <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <entry-id>`

Agents MUST record the stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag.

### Metrics exposure

Agents MUST publish Prometheus (or equivalent) histograms:

* `gateway_enqueue_seconds` (or ns) histogram
* `valkey_enqueue_rtt_seconds` histogram
* `worker_service_seconds` histogram
* `queue_delay_seconds` histogram (derived from timestamps; can be computed in the worker or offline)
* `hop_latency_seconds` histogram

---

## 5) Workload modeling and test data

Agents MUST define a workload model before running capacity tests:

1. **Endpoint(s)**: list the exact gateway routes under test
2. **Payload types**: small/typical/large
3. **Mix**: e.g., 70/25/5 by payload size
4. **Idempotency rules**: ensure repeated jobs don't corrupt state
5. **Data reset strategy**: how test data is cleaned or isolated per run

Agents SHOULD test at least:

* Typical payload (p50)
* Large payload (p95)
* Worst-case allowed payload (bounded by your API limits)
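A small sketch of selecting a payload profile per request according to the example 70/25/5 mix (the profile bodies and sizes are placeholders, not measured values):

```python
# Pick a payload profile per request according to the configured mix.
# Profiles and weights mirror the example mix above; bodies are placeholders.
import json
import random

PROFILES = {
    "p50": {"data": "x" * 512},        # typical payload
    "p95": {"data": "x" * 16_384},     # large payload
    "max": {"data": "x" * 262_144},    # worst-case allowed by the API limit
}
MIX = {"p50": 0.70, "p95": 0.25, "max": 0.05}

def next_body() -> str:
    profile = random.choices(list(MIX), weights=MIX.values())[0]
    return json.dumps(PROFILES[profile])
```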
---

## 6) Scenario suite your agents MUST implement

Each scenario MUST be defined as code/config (not manual).

### Scenario A — Smoke (fast sanity)

**Goal**: verify instrumentation + basic correctness

**Load**: low (e.g., 1–5 rps), 2 minutes

**Pass**:

* 0 backlog after run
* error rate < 0.1%
* metrics present for all segments

### Scenario B — Baseline (repeatable reference point)

**Goal**: establish a stable baseline for regression tracking

**Load**: fixed moderate load (e.g., 30–50% of expected capacity), 10 minutes

**Pass**:

* p95 `t_hop` within baseline ± tolerance (set after first runs)
* no upward drift in p95 across time (trend line ~flat)

### Scenario C — Capacity ramp (open-loop)

**Goal**: find the knee where queueing begins

**Method**: open-loop arrival-rate ramp with plateaus

Example stages (edit to fit your system):

* 50 rps for 2m
* 100 rps for 2m
* 200 rps for 2m
* 400 rps for 2m
* … until SLO breach or errors spike

**MUST**:

* warm-up stage before first plateau
* record per-plateau summary

**Stop conditions** (any triggers stop):

* error rate > 1%
* queue depth grows without bound over an entire plateau
* p95 `t_hop` exceeds SLO for 2 consecutive plateaus

### Scenario D — Stress (push past capacity)

**Goal**: characterize failure mode and recovery

**Load**: 120–200% of knee load, 5–10 minutes

**Pass** (for resilience):

* system does not crash permanently
* once load stops, backlog drains within target time (define it)

### Scenario E — Burst / spike

**Goal**: see how quickly queue grows and drains

**Load shape**:

* baseline low load
* sudden burst (e.g., 10× for 10–30s)
* return to baseline

**Report**:

* peak queue depth
* time to drain to baseline
* p99 `t_hop` during burst

### Scenario F — Soak (long-running)

**Goal**: detect drift (leaks, fragmentation, GC patterns)

**Load**: 70–85% of knee, 60–180 minutes

**Pass**:

* p95 does not trend upward beyond threshold
* memory remains bounded
* no rising error rate

### Scenario G — Scaling curve (worker replica sweep)

**Goal**: turn results into scaling rules

**Method**:

* Repeat Scenario C with worker replicas = 1, 2, 4, 8…

**Deliverable**:

* plot of knee load vs worker count
* p95 `t_service` vs worker count (should remain similar; queue delay should drop)

---

## 7) Execution protocol (runbook)

Agents MUST run every scenario using the same disciplined flow:

### Pre-run checklist

* confirm system versions/SHAs
* confirm autoscaling mode:
  * **Off** for baseline capacity characterization
  * **On** for validating autoscaling policies
* clear queues and consumer group pending entries
* restart or at least record "time since deploy" for services (cold vs warm)

### During run

* ensure load is truly open-loop when required (arrival-rate based)
* continuously record:
  * offered vs achieved rate
  * queue depth
  * CPU/mem for gateway/worker/Valkey

### Post-run

* stop load
* wait until backlog drains (or record that it doesn't)
* export:
  * k6/runner raw output
  * Prometheus time-series snapshot
  * sampled logs with corr_id fields
* generate a summary report automatically (no hand calculations)

---

## 8) Analysis rules (how agents compute "the envelope")

Agents MUST generate at minimum two plots per run:

1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis)
   * overlay p99 (and SLO line)
2. **Queue behavior**: offered load vs queue depth (or lag), plus drain time
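Drain time can be read off the sampled queue-depth series. A minimal sketch, assuming `(unix_ts, depth)` samples and a recorded load-stop timestamp (both assumptions of the sketch):

```python
# Backlog drain time: how long after load stops the queue depth takes to
# return to (near) its pre-test baseline. Input format is an assumption.
def drain_time_seconds(depth_series, load_stop_ts, baseline_depth=0, tolerance=5):
    """depth_series: iterable of (unix_ts, depth) samples, ordered by time."""
    for ts, depth in depth_series:
        if ts < load_stop_ts:
            continue
        if depth <= baseline_depth + tolerance:
            return ts - load_stop_ts
    return None   # backlog never drained in the observation window -> report it
```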
### How to identify the "knee"

Agents SHOULD mark the knee as the first plateau where:

* queue depth grows monotonically within the plateau, **or**
* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50–100%)

### Convert results into scaling guidance

Agents SHOULD compute:

* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker)
* recommended replicas for offered load λ at target utilization U:
  * `workers_needed = ceil(λ * mean(t_service) / U)`
  * choose U ~ 0.6–0.75 for headroom
  * e.g., at λ = 400 jobs/s with mean `t_service` = 20 ms and U = 0.7: `workers_needed = ceil(400 × 0.02 / 0.7) = 12`

This should be reported alongside the measured envelope.

---

## 9) Pass/fail criteria and regression gates

Agents MUST define gates in configuration, not in someone's head.

Suggested gating structure:

* **Smoke gate**: error rate < 0.1%, backlog drains
* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history)
* **Capacity gate**: knee-load regression < 10% (optional but very valuable)
* **Soak gate**: p95 drift over time < 15% and no memory runaway

---

## 10) Common pitfalls (agents must avoid)

1. **Closed-loop tests used for capacity**
   Closed-loop load ("N concurrent users") self-throttles and can hide the onset of queueing. Use open-loop arrival rates for capacity.
2. **Ignoring queue depth**
   A system can look "healthy" in request latency while silently building a backlog.
3. **Measuring only gateway latency**
   You must measure enqueue → claim → done to see the real hop.
4. **Load generator bottleneck**
   If the generator saturates, you'll underestimate capacity.
5. **Retries enabled by default**
   Retries can inflate load and hide root causes; run with retries off first.
6. **Not controlling warm vs cold**
   Cold caches vs warmed services produce different envelopes; record the condition.
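The gates from section 9 can be encoded directly in the analyzer so they stay in configuration. A minimal evaluator sketch (the gate names and the run-summary shape are illustrative, not a defined format):

```python
# Evaluate pass/fail gates against a run summary; both structures are illustrative.
GATES = {
    "max_error_rate": 0.001,          # smoke gate
    "slo_ms_p95_hop": 500,            # capacity/SLO gate
    "max_p95_regression_pct": 10,     # baseline gate vs. previous run
}

def evaluate(summary, baseline=None, gates=GATES):
    failures = []
    if summary["error_rate"] > gates["max_error_rate"]:
        failures.append("error rate above gate")
    if summary["hop_p95_ms"] > gates["slo_ms_p95_hop"]:
        failures.append("p95 hop latency above SLO")
    if baseline:
        regression = 100 * (summary["hop_p95_ms"] / baseline["hop_p95_ms"] - 1)
        if regression > gates["max_p95_regression_pct"]:
            failures.append(f"p95 regressed {regression:.1f}% vs baseline")
    return (len(failures) == 0, failures)
```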
---

# Agent implementation checklist (deliverables)

Assign these as concrete tasks to your agents.

## Agent 1 — Observability & tracing

MUST deliver:

* correlation-id propagation gateway → Valkey → worker
* timestamps `enq`/`claim`/`done`
* Prometheus histograms for enqueue, service, hop
* queue depth metric (`XLEN` / `XINFO` lag)

## Agent 2 — Load test harness

MUST deliver:

* test runner scripts (k6 or equivalent) for scenarios A–G
* a test config file (YAML/JSON) controlling:
  * stages (rates/durations)
  * payload mix
  * headers (corr-id)
* reproducible seeds and version stamping

## Agent 3 — Result collector and analyzer

MUST deliver:

* a pipeline that merges:
  * load generator output
  * hop timing data (from logs or a completion stream)
  * Prometheus snapshots
* automatic summary + plots:
  * latency envelope
  * queue depth/drain
* CSV/JSON exports for long-term tracking

## Agent 4 — Reporting and dashboards

MUST deliver:

* a standard report template that includes:
  * environment details
  * scenario details
  * key charts
  * knee estimate
  * scaling recommendation
* a Grafana dashboard with the required panels

## Agent 5 — CI / release integration

SHOULD deliver:

* PR-level smoke test (Scenario A)
* nightly baseline (Scenario B)
* weekly capacity sweep (Scenario C + scaling curve)

---

## Template: scenario spec (agents can copy/paste)

```yaml
test_run:
  system_under_test:
    gateway_sha: ""
    worker_sha: ""
    valkey_version: ""
  environment:
    cluster: ""
    workers: 4
    autoscaling: "off"   # off|on
  workload:
    endpoint: "/hop"
    payload_profile: "p50"
    mix:
      p50: 0.7
      p95: 0.25
      max: 0.05
  scenario:
    name: "capacity_ramp"
    mode: "open_loop"
    warmup_seconds: 60
    stages:
      - rps: 50
        duration_seconds: 120
      - rps: 100
        duration_seconds: 120
      - rps: 200
        duration_seconds: 120
      - rps: 400
        duration_seconds: 120
  gates:
    max_error_rate: 0.01
    slo_ms_p95_hop: 500
    backlog_must_drain_seconds: 300
  outputs:
    artifacts_dir: "./artifacts/<run_id>/"
```

---

## Sample folder layout

```
perf/
  docker-compose.yml
  prometheus/
    prometheus.yml
  k6/
    lib.js
    smoke.js
    capacity_ramp.js
    burst.js
    soak.js
    stress.js
    scaling_curve.sh
  tools/
    analyze.py
  src/
    Perf.Gateway/
    Perf.Worker/
```

---

**Document Version**: 1.0
**Archived From**: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
**Archive Reason**: Wrong content was pasted; this performance testing content is preserved for future use.