Performance Testing Pipeline for Queue-Based Workflows

Note: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.


What we're measuring (plain English)

  • TTFB/TTFS (HTTP): time the gateway spends accepting the request + queuing the job.
  • Valkey latency: enqueue (LPUSH/XADD), pop/claim (BRPOP/XREADGROUP), and round-trip.
  • Worker service time: time to pick up, process, and ack.
  • Queueing delay: time spent waiting in the queue (arrival → start of worker).

These four add up to the "hop latency" users feel when the system is under load.


Minimal tracing you can add today

Emit these IDs/headers end-to-end:

  • x-stella-corr-id (uuid)
  • x-stella-enq-ts (gateway enqueue ts, ns)
  • x-stella-claim-ts (worker claim ts, ns)
  • x-stella-done-ts (worker done ts, ns)

From these, compute:

  • queue_delay = claim_ts - enq_ts
  • service_time = done_ts - claim_ts
  • http_ttfs = gateway_first_byte_ts - http_request_start_ts
  • hop_latency = done_ts - enq_ts (or return-path if synchronous)

Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock.
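
For example, the derived values are simple subtractions once the timestamps are joined per corr-id. A minimal sketch in Python (assuming the records have already been merged from gateway and worker logs; field names mirror the list above):

# Hypothetical record joined by corr-id; all timestamps in ns.
record = {
    "corr_id": "9f1c...",
    "http_request_start_ts": 1_000_000_000,
    "gateway_first_byte_ts": 1_004_000_000,
    "enq_ts": 1_002_000_000,
    "claim_ts": 1_052_000_000,
    "done_ts": 1_092_000_000,
}

def derive_latencies(r: dict) -> dict:
    # Compute the derived latencies (ns) defined above.
    return {
        "queue_delay": r["claim_ts"] - r["enq_ts"],
        "service_time": r["done_ts"] - r["claim_ts"],
        "http_ttfs": r["gateway_first_byte_ts"] - r["http_request_start_ts"],
        "hop_latency": r["done_ts"] - r["enq_ts"],
    }

print(derive_latencies(record))  # values in ns; divide by 1e6 for ms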


Valkey commands (safe, BSD Valkey)

Use Valkey Streams + Consumer Groups for fairness and metrics:

  • Enqueue: XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>
  • Claim: XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >
  • Ack: XACK jobs workers <id>

Add a small Lua for timestamping at enqueue (atomic):

-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
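
A sketch of calling that script from the gateway side using redis-py, which is protocol-compatible with Valkey; the key and field names follow the snippet above, and the connection details are illustrative:

import time
import uuid
import redis  # redis-py client; works against Valkey

ENQUEUE_LUA = """
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
"""

r = redis.Redis(host="localhost", port=6379)
enqueue = r.register_script(ENQUEUE_LUA)

def enqueue_job(payload: bytes) -> tuple[str, bytes]:
    # Atomically add a job to the 'jobs' stream with corr-id and enqueue timestamp.
    corr = str(uuid.uuid4())
    enq_ts_ns = time.time_ns()  # wall-clock ns; see the clock-sync tip above
    entry_id = enqueue(keys=["jobs"], args=[enq_ts_ns, corr, payload])
    return corr, entry_id

corr_id, stream_id = enqueue_job(b'{"data":"ping"}')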

Load shapes to test (find the envelope)

  1. Open-loop (arrival-rate controlled): 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
  2. Burst: 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
  3. Step-up/down: double every 2 min until SLO breach; then halve down.
  4. Long tail soak: run at 70-80% of max for 1h; watch p95-p99.9 drift.

Target outputs per step: p50/p90/p95/p99 for queue_delay, service_time, hop_latency, plus throughput and error rate.


k6 script (HTTP client pressure)

// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export let options = {
  scenarios: {
    step_load: {
      executor: 'ramping-arrival-rate',
      startRate: 20, timeUnit: '1s',
      preAllocatedVUs: 200, maxVUs: 5000,
      stages: [
        { target: 50, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 200, duration: '1m' },
        { target: 400, duration: '1m' },
        { target: 800, duration: '1m' },
      ],
    },
  },
  thresholds: {
    'http_req_failed': ['rate<0.01'],
    'http_req_duration{phase:hop}': ['p(95)<500'],
  },
};

export default function () {
  const corr = uuidv4();
  const res = http.post(
    __ENV.GW_URL,
    JSON.stringify({ data: 'ping', corr }),
    {
      headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
      tags: { phase: 'hop' },
    }
  );
  check(res, { 'status 200/202': r => r.status === 200 || r.status === 202 });
  sleep(0.01);
}

Run: GW_URL=https://gateway.example/hop k6 run hop-test.js


Worker hooks (.NET 10 sketch)

// At claim
var now = Stopwatch.GetTimestamp(); // monotonic
var claimNs = now.ToNanoseconds();
log.AddTag("x-stella-claim-ts", claimNs);

// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics

Helper:

public static class MonoTime {
  static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
  public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}

Prometheus metrics to expose

  • valkey_enqueue_ns (histogram)
  • valkey_claim_block_ms (gauge)
  • worker_service_ns (histogram, labels: worker_type, route)
  • queue_delay_ns (histogram, derived from claim_ts - enq_ts)
  • queue_depth (gauge via XLEN or XINFO STREAM)
  • enqueue_rate, dequeue_rate (counters)

Example recording rules:

- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
  expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95   # budget-style upper estimate, not a true p95 of the sum
  expr: hop:queue_delay_p95 + hop:service_time_p95

Autoscaling signals (HPA/KEDA friendly)

  • Primary: queue depth & its derivative (d/dt).
  • Secondary: p95 queue_delay and worker CPU.
  • Safety: max in-flight per worker; backpressure HTTP 429 when queue_depth > D or p95_queue_delay > SLO*0.8.
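
A minimal sketch of that backpressure check in Python (the thresholds, the depth lookup, and where the p95 estimate comes from are all illustrative assumptions):

import redis

r = redis.Redis(host="localhost", port=6379)

SLO_MS = 500              # p95 hop-latency SLO
MAX_QUEUE_DEPTH = 5_000   # "D" in the rule above; tune per system

def should_shed_load(p95_queue_delay_ms: float) -> bool:
    # Return True when the gateway should answer 429 instead of enqueueing.
    # p95_queue_delay_ms would typically come from a recent Prometheus query
    # or a sliding-window estimate kept in the gateway (assumption).
    queue_depth = r.xlen("jobs")
    return queue_depth > MAX_QUEUE_DEPTH or p95_queue_delay_ms > SLO_MS * 0.8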

Plot the "envelope" (what you'll look at)

  • X-axis: offered load (req/s).
  • Y-axis: p95 hop latency (ms).
  • Overlay: p99 (dashed), SLO line (e.g., 500 ms), and capacity knee (where p95 sharply rises).
  • Add secondary panel: queue depth vs load.

Performance Test Guidelines

HTTP → Valkey → Worker pipeline

1) Objectives and scope

Primary objectives

Your performance tests MUST answer these questions with evidence:

  1. Capacity knee: At what offered load does queue delay start growing sharply?
  2. User-impact envelope: What are p50/p95/p99 hop latency curves vs offered load?
  3. Decomposition: How much of hop latency is:
    • gateway enqueue time
    • Valkey enqueue/claim RTT
    • queue wait time
    • worker service time
  4. Scaling behavior: How do these change with worker replica counts (N workers)?
  5. Stability: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?

Non-goals (explicitly out of scope unless you add them later)

  • Micro-optimizing single function runtime
  • Synthetic "max QPS" records without a representative payload
  • Tests that collect only end-to-end metrics with no per-segment breakdown (acceptable only for basic smoke)

2) Definitions and required metrics

Required latency definitions (standardize these names)

Agents MUST compute and report these per request/job:

  • t_http_accept: time from client send → gateway accepts request
  • t_enqueue: time spent in gateway to enqueue into Valkey (server-side)
  • t_valkey_rtt_enq: client-observed RTT for enqueue command(s)
  • t_queue_delay: claim_ts - enq_ts
  • t_service: done_ts - claim_ts
  • t_hop: done_ts - enq_ts (this is the "true pipeline hop" latency)
  • Optional but recommended:
    • t_ack: time to ack completion (Valkey ack RTT)
    • t_http_response: request start → gateway response sent (TTFB/TTFS)

Required percentiles and aggregations

Per scenario step (e.g., each offered load plateau), agents MUST output:

  • p50 / p90 / p95 / p99 / p99.9 for: t_hop, t_queue_delay, t_service, t_enqueue
  • Throughput: offered rps and achieved rps
  • Error rate: HTTP failures, enqueue failures, worker failures
  • Queue depth and backlog drain time
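
A sketch of the per-plateau summary computation (Python with numpy assumed available; input is the list of per-job records for one plateau):

import numpy as np

def summarize_plateau(records: list[dict], offered_rps: float, duration_s: float) -> dict:
    # Compute the required percentiles, throughput, and error rate for one plateau.
    ok = [r for r in records if not r.get("error")]
    summary = {
        "offered_rps": offered_rps,
        "achieved_rps": len(ok) / duration_s,
        "error_rate": 1 - len(ok) / max(len(records), 1),
    }
    for metric in ("t_hop", "t_queue_delay", "t_service", "t_enqueue"):
        values = np.array([r[metric] for r in ok], dtype=float)
        p50, p90, p95, p99, p999 = np.percentile(values, [50, 90, 95, 99, 99.9])
        summary[metric] = {"p50": p50, "p90": p90, "p95": p95, "p99": p99, "p99.9": p999}
    return summary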

Required system-level telemetry (minimum)

Agents MUST collect these time series during tests:

  • Worker: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
  • Valkey: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
  • Gateway: CPU/mem, request rate, response codes, request duration histogram

3) Environment and test hygiene requirements

Environment requirements

Agents SHOULD run tests in an environment that matches production in:

  • container CPU/memory limits
  • number of nodes, network topology
  • Valkey topology (single, cluster, sentinel, etc.)
  • worker replica autoscaling rules (or deliberately disabled)

If exact parity isn't possible, agents MUST record all known differences in the report.

Test hygiene (non-negotiable)

Agents MUST:

  1. Start from empty queues (no backlog).
  2. Disable client retries (or explicitly run two variants: retries off / retries on).
  3. Warm up before measuring (e.g., 60s warm-up minimum).
  4. Hold steady plateaus long enough to stabilize (usually 2-5 minutes per step).
  5. Cool down and verify backlog drains (queue depth returns to baseline).
  6. Record exact versions/SHAs of gateway/worker and Valkey config.

Load generator hygiene

Agents MUST ensure the load generator is not the bottleneck:

  • CPU < ~70% during test
  • no local socket exhaustion
  • enough VUs/connections
  • if needed, distributed load generation

4) Instrumentation spec (agents implement this first)

Correlation and timestamps

Agents MUST propagate an end-to-end correlation ID and timestamps.

Required fields

  • corr_id (UUID)
  • enq_ts_ns (set at enqueue, monotonic or consistent clock)
  • claim_ts_ns (set by worker when job is claimed)
  • done_ts_ns (set by worker when job processing ends)

Where these live

  • HTTP request header: x-corr-id: <uuid>
  • Valkey job payload fields: corr, enq, and optionally payload size/type
  • Worker logs/metrics: include corr_id, job id, claim_ts_ns, done_ts_ns

Clock requirements

Agents MUST use a consistent timing source:

  • Prefer monotonic timers for durations (Stopwatch / monotonic clock)
  • If timestamps cross machines, ensure they're comparable:
    • either rely on synchronized clocks (NTP) and monitor drift
    • or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)

Practical recommendation: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.

Valkey usage (Streams + Consumer Groups)

Agents SHOULD use Streams + Consumer Groups for stable claim semantics and good observability:

  • Enqueue: XADD jobs * corr <uuid> enq <ns> payload <...>
  • Claim: XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >
  • Ack: XACK jobs workers <id>

Agents MUST record stream length (XLEN) or consumer group lag (XINFO GROUPS) as queue depth/lag.
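
A sketch of the worker side of this pattern with redis-py (stream, group, and consumer names follow the commands above; processing and metric emission are placeholders):

import time
import redis

r = redis.Redis(host="localhost", port=6379)
STREAM, GROUP, CONSUMER = "jobs", "workers", "w1"

try:
    r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

while True:
    resp = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=1, block=1000)
    if not resp:
        continue  # nothing claimed within the 1s block window
    _, entries = resp[0]
    for entry_id, fields in entries:
        claim_ts_ns = time.time_ns()
        enq_ts_ns = int(fields[b"enq"])
        # ... process fields[b"p"] here ...
        done_ts_ns = time.time_ns()
        r.xack(STREAM, GROUP, entry_id)
        # queue_delay = claim - enq, service = done - claim; log with corr-id and entry id.
        print(fields[b"corr"], entry_id, claim_ts_ns - enq_ts_ns, done_ts_ns - claim_ts_ns)
        # Queue depth for the required gauge: r.xlen(STREAM) or XINFO GROUPS lag.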

Metrics exposure

Agents MUST publish Prometheus (or equivalent) histograms:

  • gateway_enqueue_seconds (or ns) histogram
  • valkey_enqueue_rtt_seconds histogram
  • worker_service_seconds histogram
  • queue_delay_seconds histogram (derived from timestamps; can be computed in worker or offline)
  • hop_latency_seconds histogram
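
A sketch of exposing these with prometheus_client (bucket boundaries and the scrape port are illustrative assumptions):

from prometheus_client import Histogram, Gauge, start_http_server

BUCKETS = (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5)

gateway_enqueue_seconds = Histogram("gateway_enqueue_seconds", "Gateway-side enqueue time", buckets=BUCKETS)
valkey_enqueue_rtt_seconds = Histogram("valkey_enqueue_rtt_seconds", "Client-observed enqueue RTT", buckets=BUCKETS)
worker_service_seconds = Histogram("worker_service_seconds", "Worker processing time", buckets=BUCKETS)
queue_delay_seconds = Histogram("queue_delay_seconds", "claim_ts - enq_ts", buckets=BUCKETS)
hop_latency_seconds = Histogram("hop_latency_seconds", "done_ts - enq_ts", buckets=BUCKETS)
queue_depth = Gauge("queue_depth", "Stream length / consumer-group lag")

start_http_server(9100)  # scrape endpoint; port is an assumption

# In the worker loop, after ack:
#   queue_delay_seconds.observe((claim_ts_ns - enq_ts_ns) / 1e9)
#   worker_service_seconds.observe((done_ts_ns - claim_ts_ns) / 1e9)
#   hop_latency_seconds.observe((done_ts_ns - enq_ts_ns) / 1e9)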

5) Workload modeling and test data

Agents MUST define a workload model before running capacity tests:

  1. Endpoint(s): list exact gateway routes under test
  2. Payload types: small/typical/large
  3. Mix: e.g., 70/25/5 by payload size
  4. Idempotency rules: ensure repeated jobs don't corrupt state
  5. Data reset strategy: how test data is cleaned or isolated per run

Agents SHOULD test at least:

  • Typical payload (p50)
  • Large payload (p95)
  • Worst-case allowed payload (bounded by your API limits)

6) Scenario suite your agents MUST implement

Each scenario MUST be defined as code/config (not manual).

Scenario A — Smoke (fast sanity)

Goal: verify instrumentation + basic correctness
Load: low (e.g., 1-5 rps), 2 minutes
Pass:

  • 0 backlog after run
  • error rate < 0.1%
  • metrics present for all segments

Scenario B — Baseline (repeatable reference point)

Goal: establish a stable baseline for regression tracking
Load: fixed moderate load (e.g., 30-50% of expected capacity), 10 minutes
Pass:

  • p95 t_hop within baseline ± tolerance (set after first runs)
  • no upward drift in p95 across time (trend line ~flat)

Scenario C — Capacity ramp (open-loop)

Goal: find the knee where queueing begins
Method: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):

  • 50 rps for 2m
  • 100 rps for 2m
  • 200 rps for 2m
  • 400 rps for 2m
  • … until SLO breach or errors spike

MUST:

  • warm-up stage before first plateau
  • record per-plateau summary

Stop conditions (any triggers stop):

  • error rate > 1%
  • queue depth grows without bound over an entire plateau
  • p95 t_hop exceeds SLO for 2 consecutive plateaus

Scenario D — Stress (push past capacity)

Goal: characterize failure mode and recovery
Load: 120-200% of knee load, 5-10 minutes
Pass (for resilience):

  • system does not crash permanently
  • once load stops, backlog drains within target time (define it)

Scenario E — Burst / spike

Goal: see how quickly queue grows and drains
Load shape:

  • baseline low load
  • sudden burst (e.g., 10× for 10-30 s)
  • return to baseline

Report:

  • peak queue depth
  • time to drain to baseline
  • p99 t_hop during burst

Scenario F — Soak (long-running)

Goal: detect drift (leaks, fragmentation, GC patterns)
Load: 70-85% of knee, 60-180 minutes
Pass:

  • p95 does not trend upward beyond threshold
  • memory remains bounded
  • no rising error rate

Scenario G — Scaling curve (worker replica sweep)

Goal: turn results into scaling rules
Method:

  • Repeat Scenario C with worker replicas = 1, 2, 4, 8…

Deliverable:

  • plot of knee load vs worker count
  • p95 t_service vs worker count (should remain similar; queue delay should drop)

7) Execution protocol (runbook)

Agents MUST run every scenario using the same disciplined flow:

Pre-run checklist

  • confirm system versions/SHAs
  • confirm autoscaling mode:
    • Off for baseline capacity characterization
    • On for validating autoscaling policies
  • clear queues and consumer group pending entries
  • restart or at least record "time since deploy" for services (cold vs warm)

During run

  • ensure load is truly open-loop when required (arrival-rate based)
  • continuously record:
    • offered vs achieved rate
    • queue depth
    • CPU/mem for gateway/worker/Valkey

Post-run

  • stop load
  • wait until backlog drains (or record that it doesn't)
  • export:
    • k6/runner raw output
    • Prometheus time series snapshot
    • sampled logs with corr_id fields
  • generate a summary report automatically (no hand calculations)

8) Analysis rules (how agents compute "the envelope")

Agents MUST generate at minimum two plots per run:

  1. Latency envelope: offered load (x-axis) vs p95 t_hop (y-axis)
    • overlay p99 (and SLO line)
  2. Queue behavior: offered load vs queue depth (or lag), plus drain time
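
A sketch of the envelope plot (Python with matplotlib assumed; input is the list of per-plateau summaries described in section 2):

import matplotlib.pyplot as plt

def plot_envelope(summaries: list[dict], slo_ms: float = 500) -> None:
    # Offered load (x) vs p95/p99 hop latency (y), with the SLO line overlaid.
    load = [s["offered_rps"] for s in summaries]
    p95 = [s["t_hop"]["p95"] / 1e6 for s in summaries]  # ns -> ms
    p99 = [s["t_hop"]["p99"] / 1e6 for s in summaries]
    plt.plot(load, p95, marker="o", label="p95 t_hop")
    plt.plot(load, p99, linestyle="--", marker="o", label="p99 t_hop")
    plt.axhline(slo_ms, color="red", linestyle=":", label=f"SLO {slo_ms} ms")
    plt.xlabel("offered load (req/s)")
    plt.ylabel("hop latency (ms)")
    plt.legend()
    plt.savefig("latency_envelope.png")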

How to identify the "knee"

Agents SHOULD mark the knee as the first plateau where:

  • queue depth grows monotonically within the plateau, or
  • p95 t_queue_delay increases by > X% step-to-step (e.g., 50-100%)
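
A sketch of the second rule applied to ordered per-plateau summaries (the 50% threshold is the example value above; the in-plateau queue-depth check needs the raw depth time series and is not shown):

def find_knee(summaries: list[dict], growth_threshold: float = 0.5) -> dict | None:
    # Return the first plateau where p95 queue delay jumps by more than the threshold.
    for prev, curr in zip(summaries, summaries[1:]):
        prev_p95 = prev["t_queue_delay"]["p95"]
        curr_p95 = curr["t_queue_delay"]["p95"]
        if prev_p95 > 0 and (curr_p95 - prev_p95) / prev_p95 > growth_threshold:
            return curr
    return None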

Convert results into scaling guidance

Agents SHOULD compute:

  • capacity_per_worker ≈ 1 / mean(t_service) (jobs/sec per worker)
  • recommended replicas for offered load λ at target utilization U:
    • workers_needed = ceil(λ * mean(t_service) / U)
    • choose U ~ 0.6-0.75 for headroom

This should be reported alongside the measured envelope.
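
As a quick worked example of the sizing formula (numbers are illustrative, not measured): with mean t_service = 20 ms, offered load λ = 300 jobs/s, and U = 0.7, capacity_per_worker ≈ 50 jobs/s and workers_needed = ceil(300 × 0.020 / 0.7) = 9.

import math

mean_t_service_s = 0.020   # 20 ms mean service time (illustrative)
offered_load_rps = 300     # lambda, jobs per second (illustrative)
target_utilization = 0.7   # U, leaves ~30% headroom

capacity_per_worker = 1 / mean_t_service_s  # ~50 jobs/s per worker
workers_needed = math.ceil(offered_load_rps * mean_t_service_s / target_utilization)
print(capacity_per_worker, workers_needed)  # 50.0, 9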


9) Pass/fail criteria and regression gates

Agents MUST define gates in configuration, not in someone's head.

Suggested gating structure:

  • Smoke gate: error rate < 0.1%, backlog drains
  • Baseline gate: p95 t_hop regression < 10% (tune after you have history)
  • Capacity gate: knee load regression < 10% (optional but very valuable)
  • Soak gate: p95 drift over time < 15% and no memory runaway
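
A sketch of evaluating such gates in CI (Python; the gate keys follow the scenario spec template later in this document, while the run-summary field names are assumptions):

def evaluate_gates(gates: dict, run: dict) -> list[str]:
    # Return a list of gate failures; an empty list means the run passes.
    failures = []
    if run["error_rate"] > gates["max_error_rate"]:
        failures.append(f"error rate {run['error_rate']:.3%} > {gates['max_error_rate']:.3%}")
    if run["p95_hop_ms"] > gates["slo_ms_p95_hop"]:
        failures.append(f"p95 t_hop {run['p95_hop_ms']:.0f} ms > SLO {gates['slo_ms_p95_hop']} ms")
    if run["drain_seconds"] > gates["backlog_must_drain_seconds"]:
        failures.append(f"backlog drain {run['drain_seconds']} s > {gates['backlog_must_drain_seconds']} s")
    return failures

# Baseline/capacity regression gates would compare against stored history in the same way.
gates = {"max_error_rate": 0.01, "slo_ms_p95_hop": 500, "backlog_must_drain_seconds": 300}
run = {"error_rate": 0.002, "p95_hop_ms": 430, "drain_seconds": 120}
assert evaluate_gates(gates, run) == []  # CI fails when the list is non-empty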

10) Common pitfalls (agents must avoid)

  1. Closed-loop tests used for capacity. Closed-loop load ("N concurrent users") self-throttles and can hide queueing onset; use open-loop arrival rate for capacity.

  2. Ignoring queue depth. A system can look "healthy" in request latency while silently building backlog.

  3. Measuring only gateway latency. You must measure enqueue → claim → done to see the real hop.

  4. Load generator bottleneck. If the generator saturates, you'll underestimate capacity.

  5. Retries enabled by default. Retries can inflate load and hide root causes; run with retries off first.

  6. Not controlling warm vs cold. Cold caches vs warmed services produce different envelopes; record the condition.


Agent implementation checklist (deliverables)

Assign these as concrete tasks to your agents.

Agent 1 — Observability & tracing

MUST deliver:

  • correlation id propagation gateway → Valkey → worker
  • timestamps enq/claim/done
  • Prometheus histograms for enqueue, service, hop
  • queue depth metric (XLEN / XINFO lag)

Agent 2 — Load test harness

MUST deliver:

  • test runner scripts (k6 or equivalent) for scenarios A-G
  • test config file (YAML/JSON) controlling:
    • stages (rates/durations)
    • payload mix
    • headers (corr-id)
  • reproducible seeds and version stamping

Agent 3 — Result collector and analyzer

MUST deliver:

  • a pipeline that merges:
    • load generator output
    • hop timing data (from logs or a completion stream)
    • Prometheus snapshots
  • automatic summary + plots:
    • latency envelope
    • queue depth/drain
  • CSV/JSON exports for long-term tracking

Agent 4 — Reporting and dashboards

MUST deliver:

  • a standard report template that includes:
    • environment details
    • scenario details
    • key charts
    • knee estimate
    • scaling recommendation
  • Grafana dashboard with the required panels

Agent 5 — CI / release integration

SHOULD deliver:

  • PR-level smoke test (Scenario A)
  • nightly baseline (Scenario B)
  • weekly capacity sweep (Scenario C + scaling curve)

Template: scenario spec (agents can copy/paste)

test_run:
  system_under_test:
    gateway_sha: "<git sha>"
    worker_sha: "<git sha>"
    valkey_version: "<version>"
  environment:
    cluster: "<name>"
    workers: 4
    autoscaling: "off"   # off|on
  workload:
    endpoint: "/hop"
    payload_profile: "p50"
    mix:
      p50: 0.7
      p95: 0.25
      max: 0.05
  scenario:
    name: "capacity_ramp"
    mode: "open_loop"
    warmup_seconds: 60
    stages:
      - rps: 50
        duration_seconds: 120
      - rps: 100
        duration_seconds: 120
      - rps: 200
        duration_seconds: 120
      - rps: 400
        duration_seconds: 120
  gates:
    max_error_rate: 0.01
    slo_ms_p95_hop: 500
    backlog_must_drain_seconds: 300
  outputs:
    artifacts_dir: "./artifacts/<timestamp>/"

Sample folder layout

perf/
  docker-compose.yml
  prometheus/
    prometheus.yml
  k6/
    lib.js
    smoke.js
    capacity_ramp.js
    burst.js
    soak.js
    stress.js
    scaling_curve.sh
  tools/
    analyze.py
  src/
    Perf.Gateway/
    Perf.Worker/

Document Version: 1.0
Archived From: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
Archive Reason: Wrong content was pasted; this performance testing content preserved for future use.