Performance Testing Pipeline for Queue-Based Workflows

Note: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.


What we're measuring (plain English)

  • TTFB/TTFS (HTTP): time the gateway spends accepting the request + queuing the job.
  • Valkey latency: enqueue (LPUSH/XADD), pop/claim (BRPOP/XREADGROUP), and round-trip.
  • Worker service time: time to pick up, process, and ack.
  • Queueing delay: time spent waiting in the queue (arrival → start of worker).

These four add up to the "hop latency" users feel when the system is under load.


Minimal tracing you can add today

Emit these IDs/headers end-to-end:

  • x-stella-corr-id (uuid)
  • x-stella-enq-ts (gateway enqueue ts, ns)
  • x-stella-claim-ts (worker claim ts, ns)
  • x-stella-done-ts (worker done ts, ns)

From these, compute:

  • queue_delay = claim_ts - enq_ts
  • service_time = done_ts - claim_ts
  • http_ttfs = gateway_first_byte_ts - http_request_start_ts
  • hop_latency = done_ts - enq_ts (or return-path if synchronous)

Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock.
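
For example, the derived values are simple subtractions once the timestamps are joined per corr-id. A minimal sketch in Python (assuming the records have already been merged from gateway and worker logs; field names mirror the list above):

# Hypothetical record joined by corr-id; all timestamps in ns.
record = {
    "corr_id": "9f1c...",
    "http_request_start_ts": 1_000_000_000,
    "gateway_first_byte_ts": 1_004_000_000,
    "enq_ts": 1_002_000_000,
    "claim_ts": 1_052_000_000,
    "done_ts": 1_092_000_000,
}

def derive_latencies(r: dict) -> dict:
    # Compute the derived latencies (ns) defined above.
    return {
        "queue_delay": r["claim_ts"] - r["enq_ts"],
        "service_time": r["done_ts"] - r["claim_ts"],
        "http_ttfs": r["gateway_first_byte_ts"] - r["http_request_start_ts"],
        "hop_latency": r["done_ts"] - r["enq_ts"],
    }

print(derive_latencies(record))  # values in ns; divide by 1e6 for ms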


Valkey commands (safe, BSD Valkey)

Use Valkey Streams + Consumer Groups for fairness and metrics:

  • Enqueue: XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>
  • Claim: XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >
  • Ack: XACK jobs workers <id>

Add a small Lua for timestamping at enqueue (atomic):

-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
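
A sketch of calling that script from the gateway side using redis-py, which is protocol-compatible with Valkey; the key and field names follow the snippet above, and the connection details are illustrative:

import time
import uuid
import redis  # redis-py client; works against Valkey

ENQUEUE_LUA = """
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
"""

r = redis.Redis(host="localhost", port=6379)
enqueue = r.register_script(ENQUEUE_LUA)

def enqueue_job(payload: bytes) -> tuple[str, bytes]:
    # Atomically add a job to the 'jobs' stream with corr-id and enqueue timestamp.
    corr = str(uuid.uuid4())
    enq_ts_ns = time.time_ns()  # wall-clock ns; see the clock-sync tip above
    entry_id = enqueue(keys=["jobs"], args=[enq_ts_ns, corr, payload])
    return corr, entry_id

corr_id, stream_id = enqueue_job(b'{"data":"ping"}')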

Load shapes to test (find the envelope)

  1. Open-loop (arrival-rate controlled): 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
  2. Burst: 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
  3. Step-up/down: double every 2 min until SLO breach; then halve down.
  4. Long tail soak: run at 70-80% of max for 1h; watch p95-p99.9 drift.

Target outputs per step: p50/p90/p95/p99 for queue_delay, service_time, hop_latency, plus throughput and error rate.


k6 script (HTTP client pressure)

// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export let options = {
  scenarios: {
    step_load: {
      executor: 'ramping-arrival-rate',
      startRate: 20, timeUnit: '1s',
      preAllocatedVUs: 200, maxVUs: 5000,
      stages: [
        { target: 50, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 200, duration: '1m' },
        { target: 400, duration: '1m' },
        { target: 800, duration: '1m' },
      ],
    },
  },
  thresholds: {
    'http_req_failed': ['rate<0.01'],
    'http_req_duration{phase:hop}': ['p(95)<500'],
  },
};

export default function () {
  const corr = uuidv4();
  const res = http.post(
    __ENV.GW_URL,
    JSON.stringify({ data: 'ping', corr }),
    {
      headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
      tags: { phase: 'hop' },
    }
  );
  check(res, { 'status 200/202': r => r.status === 200 || r.status === 202 });
  sleep(0.01);
}

Run: GW_URL=https://gateway.example/hop k6 run hop-test.js


Worker hooks (.NET 10 sketch)

// At claim
var now = Stopwatch.GetTimestamp(); // monotonic
var claimNs = now.ToNanoseconds();
log.AddTag("x-stella-claim-ts", claimNs);

// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics

Helper:

public static class MonoTime {
  static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
  public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}

Prometheus metrics to expose

  • valkey_enqueue_ns (histogram)
  • valkey_claim_block_ms (gauge)
  • worker_service_ns (histogram, labels: worker_type, route)
  • queue_delay_ns (histogram, derived from claim_ts - enq_ts)
  • queue_depth (gauge via XLEN or XINFO STREAM)
  • enqueue_rate, dequeue_rate (counters)

Example recording rules:

- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
  expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95   # budget-style upper estimate, not a true p95 of the sum
  expr: hop:queue_delay_p95 + hop:service_time_p95

Autoscaling signals (HPA/KEDA friendly)

  • Primary: queue depth & its derivative (d/dt).
  • Secondary: p95 queue_delay and worker CPU.
  • Safety: max in-flight per worker; backpressure HTTP 429 when queue_depth > D or p95_queue_delay > SLO*0.8.
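
A minimal sketch of that backpressure check in Python (the thresholds, the depth lookup, and where the p95 estimate comes from are all illustrative assumptions):

import redis

r = redis.Redis(host="localhost", port=6379)

SLO_MS = 500              # p95 hop-latency SLO
MAX_QUEUE_DEPTH = 5_000   # "D" in the rule above; tune per system

def should_shed_load(p95_queue_delay_ms: float) -> bool:
    # Return True when the gateway should answer 429 instead of enqueueing.
    # p95_queue_delay_ms would typically come from a recent Prometheus query
    # or a sliding-window estimate kept in the gateway (assumption).
    queue_depth = r.xlen("jobs")
    return queue_depth > MAX_QUEUE_DEPTH or p95_queue_delay_ms > SLO_MS * 0.8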

Plot the "envelope" (what you'll look at)

  • X-axis: offered load (req/s).
  • Y-axis: p95 hop latency (ms).
  • Overlay: p99 (dashed), SLO line (e.g., 500 ms), and capacity knee (where p95 sharply rises).
  • Add secondary panel: queue depth vs load.

Performance Test Guidelines

HTTP → Valkey → Worker pipeline

1) Objectives and scope

Primary objectives

Your performance tests MUST answer these questions with evidence:

  1. Capacity knee: At what offered load does queue delay start growing sharply?
  2. User-impact envelope: What are p50/p95/p99 hop latency curves vs offered load?
  3. Decomposition: How much of hop latency is:
    • gateway enqueue time
    • Valkey enqueue/claim RTT
    • queue wait time
    • worker service time
  4. Scaling behavior: How do these change with worker replica counts (N workers)?
  5. Stability: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?

Non-goals (explicitly out of scope unless you add them later)

  • Micro-optimizing single function runtime
  • Synthetic "max QPS" records without a representative payload
  • Tests that collect only end-to-end metrics with no per-segment breakdown (acceptable only for basic smoke)

2) Definitions and required metrics

Required latency definitions (standardize these names)

Agents MUST compute and report these per request/job:

  • t_http_accept: time from client send → gateway accepts request
  • t_enqueue: time spent in gateway to enqueue into Valkey (server-side)
  • t_valkey_rtt_enq: client-observed RTT for enqueue command(s)
  • t_queue_delay: claim_ts - enq_ts
  • t_service: done_ts - claim_ts
  • t_hop: done_ts - enq_ts (this is the "true pipeline hop" latency)
  • Optional but recommended:
    • t_ack: time to ack completion (Valkey ack RTT)
    • t_http_response: request start → gateway response sent (TTFB/TTFS)

Required percentiles and aggregations

Per scenario step (e.g., each offered load plateau), agents MUST output:

  • p50 / p90 / p95 / p99 / p99.9 for: t_hop, t_queue_delay, t_service, t_enqueue
  • Throughput: offered rps and achieved rps
  • Error rate: HTTP failures, enqueue failures, worker failures
  • Queue depth and backlog drain time
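
A sketch of the per-plateau summary computation (Python with numpy assumed available; input is the list of per-job records for one plateau):

import numpy as np

def summarize_plateau(records: list[dict], offered_rps: float, duration_s: float) -> dict:
    # Compute the required percentiles, throughput, and error rate for one plateau.
    ok = [r for r in records if not r.get("error")]
    summary = {
        "offered_rps": offered_rps,
        "achieved_rps": len(ok) / duration_s,
        "error_rate": 1 - len(ok) / max(len(records), 1),
    }
    for metric in ("t_hop", "t_queue_delay", "t_service", "t_enqueue"):
        values = np.array([r[metric] for r in ok], dtype=float)
        p50, p90, p95, p99, p999 = np.percentile(values, [50, 90, 95, 99, 99.9])
        summary[metric] = {"p50": p50, "p90": p90, "p95": p95, "p99": p99, "p99.9": p999}
    return summary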

Required system-level telemetry (minimum)

Agents MUST collect these time series during tests:

  • Worker: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
  • Valkey: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
  • Gateway: CPU/mem, request rate, response codes, request duration histogram

3) Environment and test hygiene requirements

Environment requirements

Agents SHOULD run tests in an environment that matches production in:

  • container CPU/memory limits
  • number of nodes, network topology
  • Valkey topology (single, cluster, sentinel, etc.)
  • worker replica autoscaling rules (or deliberately disabled)

If exact parity isn't possible, agents MUST record all known differences in the report.

Test hygiene (non-negotiable)

Agents MUST:

  1. Start from empty queues (no backlog).
  2. Disable client retries (or explicitly run two variants: retries off / retries on).
  3. Warm up before measuring (e.g., 60s warm-up minimum).
  4. Hold steady plateaus long enough to stabilize (usually 2-5 minutes per step).
  5. Cool down and verify backlog drains (queue depth returns to baseline).
  6. Record exact versions/SHAs of gateway/worker and Valkey config.

Load generator hygiene

Agents MUST ensure the load generator is not the bottleneck:

  • CPU < ~70% during test
  • no local socket exhaustion
  • enough VUs/connections
  • if needed, distributed load generation

4) Instrumentation spec (agents implement this first)

Correlation and timestamps

Agents MUST propagate an end-to-end correlation ID and timestamps.

Required fields

  • corr_id (UUID)
  • enq_ts_ns (set at enqueue, monotonic or consistent clock)
  • claim_ts_ns (set by worker when job is claimed)
  • done_ts_ns (set by worker when job processing ends)

Where these live

  • HTTP request header: x-corr-id: <uuid>
  • Valkey job payload fields: corr, enq, and optionally payload size/type
  • Worker logs/metrics: include corr_id, job id, claim_ts_ns, done_ts_ns

Clock requirements

Agents MUST use a consistent timing source:

  • Prefer monotonic timers for durations (Stopwatch / monotonic clock)
  • If timestamps cross machines, ensure they're comparable:
    • either rely on synchronized clocks (NTP) and monitor drift
    • or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)

Practical recommendation: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.

Valkey usage (Streams + Consumer Groups)

Agents SHOULD use Streams + Consumer Groups for stable claim semantics and good observability:

  • Enqueue: XADD jobs * corr <uuid> enq <ns> payload <...>
  • Claim: XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >
  • Ack: XACK jobs workers <id>

Agents MUST record stream length (XLEN) or consumer group lag (XINFO GROUPS) as queue depth/lag.
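
A sketch of the worker side of this pattern with redis-py (stream, group, and consumer names follow the commands above; processing and metric emission are placeholders):

import time
import redis

r = redis.Redis(host="localhost", port=6379)
STREAM, GROUP, CONSUMER = "jobs", "workers", "w1"

try:
    r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

while True:
    resp = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=1, block=1000)
    if not resp:
        continue  # nothing claimed within the 1s block window
    _, entries = resp[0]
    for entry_id, fields in entries:
        claim_ts_ns = time.time_ns()
        enq_ts_ns = int(fields[b"enq"])
        # ... process fields[b"p"] here ...
        done_ts_ns = time.time_ns()
        r.xack(STREAM, GROUP, entry_id)
        # queue_delay = claim - enq, service = done - claim; log with corr-id and entry id.
        print(fields[b"corr"], entry_id, claim_ts_ns - enq_ts_ns, done_ts_ns - claim_ts_ns)
        # Queue depth for the required gauge: r.xlen(STREAM) or XINFO GROUPS lag.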

Metrics exposure

Agents MUST publish Prometheus (or equivalent) histograms:

  • gateway_enqueue_seconds (or ns) histogram
  • valkey_enqueue_rtt_seconds histogram
  • worker_service_seconds histogram
  • queue_delay_seconds histogram (derived from timestamps; can be computed in worker or offline)
  • hop_latency_seconds histogram
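
A sketch of exposing these with prometheus_client (bucket boundaries and the scrape port are illustrative assumptions):

from prometheus_client import Histogram, Gauge, start_http_server

BUCKETS = (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5)

gateway_enqueue_seconds = Histogram("gateway_enqueue_seconds", "Gateway-side enqueue time", buckets=BUCKETS)
valkey_enqueue_rtt_seconds = Histogram("valkey_enqueue_rtt_seconds", "Client-observed enqueue RTT", buckets=BUCKETS)
worker_service_seconds = Histogram("worker_service_seconds", "Worker processing time", buckets=BUCKETS)
queue_delay_seconds = Histogram("queue_delay_seconds", "claim_ts - enq_ts", buckets=BUCKETS)
hop_latency_seconds = Histogram("hop_latency_seconds", "done_ts - enq_ts", buckets=BUCKETS)
queue_depth = Gauge("queue_depth", "Stream length / consumer-group lag")

start_http_server(9100)  # scrape endpoint; port is an assumption

# In the worker loop, after ack:
#   queue_delay_seconds.observe((claim_ts_ns - enq_ts_ns) / 1e9)
#   worker_service_seconds.observe((done_ts_ns - claim_ts_ns) / 1e9)
#   hop_latency_seconds.observe((done_ts_ns - enq_ts_ns) / 1e9)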

5) Workload modeling and test data

Agents MUST define a workload model before running capacity tests:

  1. Endpoint(s): list exact gateway routes under test
  2. Payload types: small/typical/large
  3. Mix: e.g., 70/25/5 by payload size
  4. Idempotency rules: ensure repeated jobs don't corrupt state
  5. Data reset strategy: how test data is cleaned or isolated per run

Agents SHOULD test at least:

  • Typical payload (p50)
  • Large payload (p95)
  • Worst-case allowed payload (bounded by your API limits)

6) Scenario suite your agents MUST implement

Each scenario MUST be defined as code/config (not manual).

Scenario A — Smoke (fast sanity)

Goal: verify instrumentation + basic correctness
Load: low (e.g., 1-5 rps), 2 minutes
Pass:

  • 0 backlog after run
  • error rate < 0.1%
  • metrics present for all segments

Scenario B — Baseline (repeatable reference point)

Goal: establish a stable baseline for regression tracking
Load: fixed moderate load (e.g., 30-50% of expected capacity), 10 minutes
Pass:

  • p95 t_hop within baseline ± tolerance (set after first runs)
  • no upward drift in p95 across time (trend line ~flat)

Scenario C — Capacity ramp (open-loop)

Goal: find the knee where queueing begins
Method: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):

  • 50 rps for 2m
  • 100 rps for 2m
  • 200 rps for 2m
  • 400 rps for 2m
  • … until SLO breach or errors spike

MUST:

  • warm-up stage before first plateau
  • record per-plateau summary

Stop conditions (any triggers stop):

  • error rate > 1%
  • queue depth grows without bound over an entire plateau
  • p95 t_hop exceeds SLO for 2 consecutive plateaus

Scenario D — Stress (push past capacity)

Goal: characterize failure mode and recovery
Load: 120-200% of knee load, 5-10 minutes
Pass (for resilience):

  • system does not crash permanently
  • once load stops, backlog drains within target time (define it)

Scenario E — Burst / spike

Goal: see how quickly queue grows and drains
Load shape:

  • baseline low load
  • sudden burst (e.g., 10× for 10-30 s)
  • return to baseline

Report:

  • peak queue depth
  • time to drain to baseline
  • p99 t_hop during burst

Scenario F — Soak (long-running)

Goal: detect drift (leaks, fragmentation, GC patterns)
Load: 70-85% of knee, 60-180 minutes
Pass:

  • p95 does not trend upward beyond threshold
  • memory remains bounded
  • no rising error rate

Scenario G — Scaling curve (worker replica sweep)

Goal: turn results into scaling rules
Method:

  • Repeat Scenario C with worker replicas = 1, 2, 4, 8…

Deliverable:

  • plot of knee load vs worker count
  • p95 t_service vs worker count (should remain similar; queue delay should drop)

7) Execution protocol (runbook)

Agents MUST run every scenario using the same disciplined flow:

Pre-run checklist

  • confirm system versions/SHAs
  • confirm autoscaling mode:
    • Off for baseline capacity characterization
    • On for validating autoscaling policies
  • clear queues and consumer group pending entries
  • restart or at least record "time since deploy" for services (cold vs warm)

During run

  • ensure load is truly open-loop when required (arrival-rate based)
  • continuously record:
    • offered vs achieved rate
    • queue depth
    • CPU/mem for gateway/worker/Valkey

Post-run

  • stop load
  • wait until backlog drains (or record that it doesn't)
  • export:
    • k6/runner raw output
    • Prometheus time series snapshot
    • sampled logs with corr_id fields
  • generate a summary report automatically (no hand calculations)

8) Analysis rules (how agents compute "the envelope")

Agents MUST generate at minimum two plots per run:

  1. Latency envelope: offered load (x-axis) vs p95 t_hop (y-axis)
    • overlay p99 (and SLO line)
  2. Queue behavior: offered load vs queue depth (or lag), plus drain time
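
A sketch of the envelope plot (Python with matplotlib assumed; input is the list of per-plateau summaries described in section 2):

import matplotlib.pyplot as plt

def plot_envelope(summaries: list[dict], slo_ms: float = 500) -> None:
    # Offered load (x) vs p95/p99 hop latency (y), with the SLO line overlaid.
    load = [s["offered_rps"] for s in summaries]
    p95 = [s["t_hop"]["p95"] / 1e6 for s in summaries]  # ns -> ms
    p99 = [s["t_hop"]["p99"] / 1e6 for s in summaries]
    plt.plot(load, p95, marker="o", label="p95 t_hop")
    plt.plot(load, p99, linestyle="--", marker="o", label="p99 t_hop")
    plt.axhline(slo_ms, color="red", linestyle=":", label=f"SLO {slo_ms} ms")
    plt.xlabel("offered load (req/s)")
    plt.ylabel("hop latency (ms)")
    plt.legend()
    plt.savefig("latency_envelope.png")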

How to identify the "knee"

Agents SHOULD mark the knee as the first plateau where:

  • queue depth grows monotonically within the plateau, or
  • p95 t_queue_delay increases by > X% step-to-step (e.g., 50-100%)
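
A sketch of the second rule applied to ordered per-plateau summaries (the 50% threshold is the example value above; the in-plateau queue-depth check needs the raw depth time series and is not shown):

def find_knee(summaries: list[dict], growth_threshold: float = 0.5) -> dict | None:
    # Return the first plateau where p95 queue delay jumps by more than the threshold.
    for prev, curr in zip(summaries, summaries[1:]):
        prev_p95 = prev["t_queue_delay"]["p95"]
        curr_p95 = curr["t_queue_delay"]["p95"]
        if prev_p95 > 0 and (curr_p95 - prev_p95) / prev_p95 > growth_threshold:
            return curr
    return None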

Convert results into scaling guidance

Agents SHOULD compute:

  • capacity_per_worker ≈ 1 / mean(t_service) (jobs/sec per worker)
  • recommended replicas for offered load λ at target utilization U:
    • workers_needed = ceil(λ * mean(t_service) / U)
    • choose U ~ 0.6-0.75 for headroom

This should be reported alongside the measured envelope.
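
As a quick worked example of the sizing formula (numbers are illustrative, not measured): with mean t_service = 20 ms, offered load λ = 300 jobs/s, and U = 0.7, capacity_per_worker ≈ 50 jobs/s and workers_needed = ceil(300 × 0.020 / 0.7) = 9.

import math

mean_t_service_s = 0.020   # 20 ms mean service time (illustrative)
offered_load_rps = 300     # lambda, jobs per second (illustrative)
target_utilization = 0.7   # U, leaves ~30% headroom

capacity_per_worker = 1 / mean_t_service_s  # ~50 jobs/s per worker
workers_needed = math.ceil(offered_load_rps * mean_t_service_s / target_utilization)
print(capacity_per_worker, workers_needed)  # 50.0, 9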


9) Pass/fail criteria and regression gates

Agents MUST define gates in configuration, not in someone's head.

Suggested gating structure:

  • Smoke gate: error rate < 0.1%, backlog drains
  • Baseline gate: p95 t_hop regression < 10% (tune after you have history)
  • Capacity gate: knee load regression < 10% (optional but very valuable)
  • Soak gate: p95 drift over time < 15% and no memory runaway
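
A sketch of evaluating such gates in CI (Python; the gate keys follow the scenario spec template later in this document, while the run-summary field names are assumptions):

def evaluate_gates(gates: dict, run: dict) -> list[str]:
    # Return a list of gate failures; an empty list means the run passes.
    failures = []
    if run["error_rate"] > gates["max_error_rate"]:
        failures.append(f"error rate {run['error_rate']:.3%} > {gates['max_error_rate']:.3%}")
    if run["p95_hop_ms"] > gates["slo_ms_p95_hop"]:
        failures.append(f"p95 t_hop {run['p95_hop_ms']:.0f} ms > SLO {gates['slo_ms_p95_hop']} ms")
    if run["drain_seconds"] > gates["backlog_must_drain_seconds"]:
        failures.append(f"backlog drain {run['drain_seconds']} s > {gates['backlog_must_drain_seconds']} s")
    return failures

# Baseline/capacity regression gates would compare against stored history in the same way.
gates = {"max_error_rate": 0.01, "slo_ms_p95_hop": 500, "backlog_must_drain_seconds": 300}
run = {"error_rate": 0.002, "p95_hop_ms": 430, "drain_seconds": 120}
assert evaluate_gates(gates, run) == []  # CI fails when the list is non-empty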

10) Common pitfalls (agents must avoid)

  1. Closed-loop tests used for capacity. Closed-loop load ("N concurrent users") self-throttles and can hide queueing onset; use open-loop arrival rate for capacity.

  2. Ignoring queue depth. A system can look "healthy" in request latency while silently building backlog.

  3. Measuring only gateway latency. You must measure enqueue → claim → done to see the real hop.

  4. Load generator bottleneck. If the generator saturates, you'll underestimate capacity.

  5. Retries enabled by default. Retries can inflate load and hide root causes; run with retries off first.

  6. Not controlling warm vs cold. Cold caches vs warmed services produce different envelopes; record the condition.


Agent implementation checklist (deliverables)

Assign these as concrete tasks to your agents.

Agent 1 — Observability & tracing

MUST deliver:

  • correlation id propagation gateway → Valkey → worker
  • timestamps enq/claim/done
  • Prometheus histograms for enqueue, service, hop
  • queue depth metric (XLEN / XINFO lag)

Agent 2 — Load test harness

MUST deliver:

  • test runner scripts (k6 or equivalent) for scenarios A-G
  • test config file (YAML/JSON) controlling:
    • stages (rates/durations)
    • payload mix
    • headers (corr-id)
  • reproducible seeds and version stamping

Agent 3 — Result collector and analyzer

MUST deliver:

  • a pipeline that merges:
    • load generator output
    • hop timing data (from logs or a completion stream)
    • Prometheus snapshots
  • automatic summary + plots:
    • latency envelope
    • queue depth/drain
  • CSV/JSON exports for long-term tracking

Agent 4 — Reporting and dashboards

MUST deliver:

  • a standard report template that includes:
    • environment details
    • scenario details
    • key charts
    • knee estimate
    • scaling recommendation
  • Grafana dashboard with the required panels

Agent 5 — CI / release integration

SHOULD deliver:

  • PR-level smoke test (Scenario A)
  • nightly baseline (Scenario B)
  • weekly capacity sweep (Scenario C + scaling curve)

Template: scenario spec (agents can copy/paste)

test_run:
  system_under_test:
    gateway_sha: "<git sha>"
    worker_sha: "<git sha>"
    valkey_version: "<version>"
  environment:
    cluster: "<name>"
    workers: 4
    autoscaling: "off"   # off|on
  workload:
    endpoint: "/hop"
    payload_profile: "p50"
    mix:
      p50: 0.7
      p95: 0.25
      max: 0.05
  scenario:
    name: "capacity_ramp"
    mode: "open_loop"
    warmup_seconds: 60
    stages:
      - rps: 50
        duration_seconds: 120
      - rps: 100
        duration_seconds: 120
      - rps: 200
        duration_seconds: 120
      - rps: 400
        duration_seconds: 120
  gates:
    max_error_rate: 0.01
    slo_ms_p95_hop: 500
    backlog_must_drain_seconds: 300
  outputs:
    artifacts_dir: "./artifacts/<timestamp>/"

Sample folder layout

perf/
  docker-compose.yml
  prometheus/
    prometheus.yml
  k6/
    lib.js
    smoke.js
    capacity_ramp.js
    burst.js
    soak.js
    stress.js
    scaling_curve.sh
  tools/
    analyze.py
  src/
    Perf.Gateway/
    Perf.Worker/

Document Version: 1.0
Archived From: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
Archive Reason: Wrong content was pasted; this performance testing content preserved for future use.