# Performance Testing Pipeline for Queue-Based Workflows
> **Note**: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.
---
## What we're measuring (plain English)
* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job.
* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round-trip.
* **Worker service time:** time to pick up, process, and ack.
* **Queueing delay:** time spent waiting in the queue (arrival → start of worker).
These four add up to the "hop latency" users feel when the system is under load.
---
## Minimal tracing you can add today
Emit these IDs/headers end-to-end:
* `x-stella-corr-id` (uuid)
* `x-stella-enq-ts` (gateway enqueue ts, ns)
* `x-stella-claim-ts` (worker claim ts, ns)
* `x-stella-done-ts` (worker done ts, ns)
From these, compute:
* `queue_delay = claim_ts - enq_ts`
* `service_time = done_ts - claim_ts`
* `http_ttfs = gateway_first_byte_ts - http_request_start_ts`
* `hop_latency = done_ts - enq_ts` (or return-path if synchronous)
Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock.
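As a minimal sketch of the derived durations (the record and field names are illustrative, not a fixed schema), once the three nanosecond timestamps are available:
```csharp
// Sketch: derive the durations from the propagated nanosecond timestamps.
// TimeSpan ticks are 100 ns, so ns / 100 converts to ticks.
public readonly record struct HopTimings(long EnqTsNs, long ClaimTsNs, long DoneTsNs)
{
    public TimeSpan QueueDelay  => TimeSpan.FromTicks((ClaimTsNs - EnqTsNs) / 100);
    public TimeSpan ServiceTime => TimeSpan.FromTicks((DoneTsNs - ClaimTsNs) / 100);
    public TimeSpan HopLatency  => TimeSpan.FromTicks((DoneTsNs - EnqTsNs) / 100);
}
```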
---
## Valkey commands (safe, BSD Valkey)
Use **Valkey Streams + Consumer Groups** for fairness and metrics:
* Enqueue: `XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Add a small Lua for timestamping at enqueue (atomic):
```lua
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
```
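If the gateway is .NET, a hedged sketch of the enqueue side using StackExchange.Redis (which speaks the Redis protocol Valkey implements) might look like the following; the `valkey:6379` endpoint and the payload are placeholders. The Lua variant above keeps the timestamp atomic with the `XADD`; the plain client call below is the simpler alternative if gateway clocks are NTP-synced.
```csharp
// Sketch: gateway-side enqueue (XADD jobs * corr <uuid> enq <ns> p <payload>).
using StackExchange.Redis;

var mux = await ConnectionMultiplexer.ConnectAsync("valkey:6379");
var db  = mux.GetDatabase();

// Ensure the stream and consumer group exist (normally done once in setup code).
try { await db.StreamCreateConsumerGroupAsync("jobs", "workers", StreamPosition.NewMessages, createStream: true); }
catch (RedisServerException) { /* BUSYGROUP: group already exists */ }

var corr   = Guid.NewGuid().ToString();
long enqNs = (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100;   // wall-clock ns since the Unix epoch

await db.StreamAddAsync("jobs", new NameValueEntry[]
{
    new("corr", corr),
    new("enq", enqNs),
    new("p", "{\"data\":\"ping\"}"),
});
```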
---
## Load shapes to test (find the envelope)
1. **Open-loop (arrival-rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
3. **Step-up/down:** double every 2 min until SLO breach; then halve down.
4. **Long tail soak:** run at 70–80% of max for 1h; watch p95–p99.9 drift.
Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**.
---
## k6 script (HTTP client pressure)
```javascript
// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export let options = {
  scenarios: {
    step_load: {
      executor: 'ramping-arrival-rate',
      startRate: 20, timeUnit: '1s',
      preAllocatedVUs: 200, maxVUs: 5000,
      stages: [
        { target: 50, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 200, duration: '1m' },
        { target: 400, duration: '1m' },
        { target: 800, duration: '1m' },
      ],
    },
  },
  thresholds: {
    'http_req_failed': ['rate<0.01'],
    'http_req_duration{phase:hop}': ['p(95)<500'],
  },
};

export default function () {
  const corr = uuidv4();
  const res = http.post(
    __ENV.GW_URL,
    JSON.stringify({ data: 'ping', corr }),
    {
      headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
      tags: { phase: 'hop' },
    }
  );
  check(res, { 'status 2xx/202': (r) => r.status === 200 || r.status === 202 });
  sleep(0.01);
}
```
Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js`
---
## Worker hooks (.NET 10 sketch)
```csharp
// At claim
var now = Stopwatch.GetTimestamp(); // monotonic
var claimNs = now.ToNanoseconds();
log.AddTag("x-stella-claim-ts", claimNs);
// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics
```
Helper:
```csharp
using System.Diagnostics;

public static class MonoTime {
    static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
    public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}
```
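Putting the hooks together, a hedged worker-loop sketch (StackExchange.Redis; `ProcessAsync` and the endpoint are placeholders, and the claim side polls because this client does not support blocking `XREADGROUP`):
```csharp
// Sketch: claim → process → ack, recording claim/done and the derived durations.
// Wall-clock ns keeps queue_delay comparable with the gateway's enq timestamp (see clock note below).
using StackExchange.Redis;

static long NowNs() => (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100;
static Task ProcessAsync(StreamEntry entry) => Task.Delay(10);   // stand-in for real work

var db = (await ConnectionMultiplexer.ConnectAsync("valkey:6379")).GetDatabase();

while (true)
{
    var entries = await db.StreamReadGroupAsync("jobs", "workers", "w1",
        StreamPosition.NewMessages, count: 1);
    if (entries.Length == 0) { await Task.Delay(20); continue; }   // poll; no BLOCK support

    foreach (var entry in entries)
    {
        long claimNs = NowNs();
        await ProcessAsync(entry);
        long doneNs = NowNs();

        long enqNs = (long)entry["enq"];
        var corr   = entry["corr"];
        Console.WriteLine($"corr={corr} queue_delay_ns={claimNs - enqNs} " +
                          $"service_time_ns={doneNs - claimNs} hop_ns={doneNs - enqNs}");

        await db.StreamAcknowledgeAsync("jobs", "workers", entry.Id);
    }
}
```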
---
## Prometheus metrics to expose
* `valkey_enqueue_ns` (histogram)
* `valkey_claim_block_ms` (gauge)
* `worker_service_ns` (histogram, labels: worker_type, route)
* `queue_delay_ns` (histogram, derived from the enq/claim timestamps)
* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`)
* `enqueue_rate`, `dequeue_rate` (counters)
Example recording rules:
```yaml
- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
  expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95
  expr: hop:queue_delay_p95 + hop:service_time_p95
```
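One hedged way to expose these from the .NET services is the prometheus-net client (an assumption; any Prometheus client library works). Bucket boundaries and label values below are illustrative:
```csharp
// Sketch: declare the histograms and scrape them on :9090/metrics (prometheus-net).
using Prometheus;

var metricServer = new MetricServer(port: 9090);
metricServer.Start();

var serviceHistogram = Metrics.CreateHistogram(
    "worker_service_ns", "Worker service time in nanoseconds.",
    new HistogramConfiguration
    {
        LabelNames = new[] { "worker_type", "route" },
        Buckets = Histogram.ExponentialBuckets(100_000, 2, 20),   // ~0.1 ms to ~52 s
    });

var queueDelayHistogram = Metrics.CreateHistogram(
    "queue_delay_ns", "Time from enqueue to claim in nanoseconds.",
    new HistogramConfiguration { Buckets = Histogram.ExponentialBuckets(100_000, 2, 20) });

// Example observation (values would come from the measured durations, in ns):
serviceHistogram.WithLabels("default", "/hop").Observe(12_345_678);
queueDelayHistogram.Observe(3_456_789);
```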
---
## Autoscaling signals (HPA/KEDA friendly)
* **Primary:** queue depth & its derivative (d/dt).
* **Secondary:** p95 `queue_delay` and worker CPU.
* **Safety:** max in-flight per worker; backpressure HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8`.
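A hedged sketch of that backpressure rule in an ASP.NET Core gateway; the depth threshold, endpoint names, and the per-request `XLEN` call are placeholders (in practice, sample the depth on a timer rather than per request):
```csharp
// Sketch: shed load with HTTP 429 when the jobs stream is deeper than a configured limit.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using StackExchange.Redis;

var app = WebApplication.Create(args);
var db  = (await ConnectionMultiplexer.ConnectAsync("valkey:6379")).GetDatabase();

const long maxQueueDepth = 10_000;   // "D" from the rule above; tune from the measured envelope

app.Use(async (context, next) =>
{
    long depth = await db.StreamLengthAsync("jobs");   // XLEN jobs; cache/sample this in production
    if (depth > maxQueueDepth)
    {
        context.Response.StatusCode = StatusCodes.Status429TooManyRequests;
        context.Response.Headers.RetryAfter = "1";
        return;
    }
    await next();
});

app.MapPost("/hop", () => Results.Accepted());   // stand-in for the real enqueue handler
app.Run();
```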
---
## Plot the "envelope" (what you'll look at)
* X-axis: **offered load** (req/s).
* Y-axis: **p95 hop latency** (ms).
* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises).
* Add secondary panel: **queue depth** vs load.
---
# Performance Test Guidelines
## HTTP → Valkey → Worker pipeline
## 1) Objectives and scope
### Primary objectives
Your performance tests MUST answer these questions with evidence:
1. **Capacity knee**: At what offered load does **queue delay** start growing sharply?
2. **User-impact envelope**: What are p50/p95/p99 **hop latency** curves vs offered load?
3. **Decomposition**: How much of hop latency is:
* gateway enqueue time
* Valkey enqueue/claim RTT
* queue wait time
* worker service time
4. **Scaling behavior**: How do these change with worker replica counts (N workers)?
5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?
### Non-goals (explicitly out of scope unless you add them later)
* Micro-optimizing single function runtime
* Synthetic "max QPS" records without a representative payload
* End-to-end-only tests that skip per-segment metrics (acceptable only for basic smoke runs)
---
## 2) Definitions and required metrics
### Required latency definitions (standardize these names)
Agents MUST compute and report these per request/job:
* **`t_http_accept`**: time from client send → gateway accepts request
* **`t_enqueue`**: time spent in gateway to enqueue into Valkey (server-side)
* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s)
* **`t_queue_delay`**: `claim_ts - enq_ts`
* **`t_service`**: `done_ts - claim_ts`
* **`t_hop`**: `done_ts - enq_ts` (this is the "true pipeline hop" latency)
* Optional but recommended:
* **`t_ack`**: time to ack completion (Valkey ack RTT)
* **`t_http_response`**: request start → gateway response sent (TTFB/TTFS)
### Required percentiles and aggregations
Per scenario step (e.g., each offered load plateau), agents MUST output:
* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue`
* Throughput: offered rps and achieved rps
* Error rate: HTTP failures, enqueue failures, worker failures
* Queue depth and backlog drain time
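How the percentiles are computed should also be pinned down so runs stay comparable. A minimal nearest-rank sketch over raw samples (the `samples` list is a placeholder for collected `t_hop` values in ns); raw-sample percentiles are exact, whereas `histogram_quantile` over bucketed data is an approximation:
```csharp
// Sketch: nearest-rank percentile over raw duration samples.
using System.Linq;

static long Percentile(List<long> samples, double p)
{
    if (samples.Count == 0) return 0;                              // no samples collected
    var sorted = samples.OrderBy(x => x).ToList();
    int rank = (int)Math.Ceiling(p / 100.0 * sorted.Count);        // 1-based nearest rank
    return sorted[Math.Clamp(rank - 1, 0, sorted.Count - 1)];
}

var samples = new List<long> { /* t_hop values in ns, per plateau */ };
Console.WriteLine($"p50={Percentile(samples, 50)} p95={Percentile(samples, 95)} p99={Percentile(samples, 99)}");
```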
### Required system-level telemetry (minimum)
Agents MUST collect these time series during tests:
* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
* **Gateway**: CPU/mem, request rate, response codes, request duration histogram
---
## 3) Environment and test hygiene requirements
### Environment requirements
Agents SHOULD run tests in an environment that matches production in:
* container CPU/memory limits
* number of nodes, network topology
* Valkey topology (single, cluster, sentinel, etc.)
* worker replica autoscaling rules (or deliberately disabled)
If exact parity isn't possible, agents MUST record all known differences in the report.
### Test hygiene (non-negotiable)
Agents MUST:
1. **Start from empty queues** (no backlog).
2. **Disable client retries** (or explicitly run two variants: retries off / retries on).
3. **Warm up** before measuring (e.g., 60s warm-up minimum).
4. **Hold steady plateaus** long enough to stabilize (usually 2–5 minutes per step).
5. **Cool down** and verify backlog drains (queue depth returns to baseline).
6. Record exact versions/SHAs of gateway/worker and Valkey config.
### Load generator hygiene
Agents MUST ensure the load generator is not the bottleneck:
* CPU < ~70% during test
* no local socket exhaustion
* enough VUs/connections
* if needed, distributed load generation
---
## 4) Instrumentation spec (agents implement this first)
### Correlation and timestamps
Agents MUST propagate an end-to-end correlation ID and timestamps.
**Required fields**
* `corr_id` (UUID)
* `enq_ts_ns` (set at enqueue, monotonic or consistent clock)
* `claim_ts_ns` (set by worker when job is claimed)
* `done_ts_ns` (set by worker when job processing ends)
**Where these live**
* HTTP request header: `x-corr-id: <uuid>`
* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type
* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns`
### Clock requirements
Agents MUST use a consistent timing source:
* Prefer monotonic timers for durations (Stopwatch / monotonic clock)
* If timestamps cross machines, ensure they're comparable:
* either rely on synchronized clocks (NTP) **and** monitor drift
* or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)
**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.
### Valkey queue semantics (recommended)
Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability:
* Enqueue: `XADD jobs * corr <uuid> enq <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Agents MUST record stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag.
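A hedged sampler sketch (StackExchange.Redis + prometheus-net; stream/group names and the 1 s interval are placeholders). Consumer-group lag as reported by Valkey/Redis 7's `XINFO GROUPS` may require a recent client version, so the pending-entry count is used here:
```csharp
// Sketch: periodically publish queue depth (XLEN) and pending entries (XINFO GROUPS).
using Prometheus;
using StackExchange.Redis;

var db = (await ConnectionMultiplexer.ConnectAsync("valkey:6379")).GetDatabase();
var depthGauge   = Metrics.CreateGauge("queue_depth", "Entries in the jobs stream.");
var pendingGauge = Metrics.CreateGauge("queue_pending", "Delivered-but-unacked entries for the workers group.");

using var timer = new PeriodicTimer(TimeSpan.FromSeconds(1));
while (await timer.WaitForNextTickAsync())
{
    depthGauge.Set(await db.StreamLengthAsync("jobs"));            // XLEN jobs

    foreach (var group in await db.StreamGroupInfoAsync("jobs"))   // XINFO GROUPS jobs
        if (group.Name == "workers")
            pendingGauge.Set(group.PendingMessageCount);
}
```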
### Metrics exposure
Agents MUST publish Prometheus (or equivalent) histograms:
* `gateway_enqueue_seconds` (or ns) histogram
* `valkey_enqueue_rtt_seconds` histogram
* `worker_service_seconds` histogram
* `queue_delay_seconds` histogram (derived from timestamps; can be computed in worker or offline)
* `hop_latency_seconds` histogram
---
## 5) Workload modeling and test data
Agents MUST define a workload model before running capacity tests:
1. **Endpoint(s)**: list exact gateway routes under test
2. **Payload types**: small/typical/large
3. **Mix**: e.g., 70/25/5 by payload size
4. **Idempotency rules**: ensure repeated jobs don't corrupt state
5. **Data reset strategy**: how test data is cleaned or isolated per run
Agents SHOULD test at least:
* Typical payload (p50)
* Large payload (p95)
* Worst-case allowed payload (bounded by your API limits)
---
## 6) Scenario suite your agents MUST implement
Each scenario MUST be defined as code/config (not manual).
### Scenario A — Smoke (fast sanity)
**Goal**: verify instrumentation + basic correctness
**Load**: low (e.g., 1–5 rps), 2 minutes
**Pass**:
* 0 backlog after run
* error rate < 0.1%
* metrics present for all segments
### Scenario B — Baseline (repeatable reference point)
**Goal**: establish a stable baseline for regression tracking
**Load**: fixed moderate load (e.g., 30–50% of expected capacity), 10 minutes
**Pass**:
* p95 `t_hop` within baseline ± tolerance (set after first runs)
* no upward drift in p95 across time (trend line ~flat)
### Scenario C — Capacity ramp (open-loop)
**Goal**: find the knee where queueing begins
**Method**: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):
* 50 rps for 2m
* 100 rps for 2m
* 200 rps for 2m
* 400 rps for 2m
* until SLO breach or errors spike
**MUST**:
* warm-up stage before first plateau
* record per-plateau summary
**Stop conditions** (any triggers stop):
* error rate > 1%
* queue depth grows without bound over an entire plateau
* p95 `t_hop` exceeds SLO for 2 consecutive plateaus
### Scenario D — Stress (push past capacity)
**Goal**: characterize failure mode and recovery
**Load**: 120–200% of knee load, 5–10 minutes
**Pass** (for resilience):
* system does not crash permanently
* once load stops, backlog drains within target time (define it)
### Scenario E — Burst / spike
**Goal**: see how quickly queue grows and drains
**Load shape**:
* baseline low load
* sudden burst (e.g., 10× for 10–30 s)
* return to baseline
**Report**:
* peak queue depth
* time to drain to baseline
* p99 `t_hop` during burst
### Scenario F — Soak (long-running)
**Goal**: detect drift (leaks, fragmentation, GC patterns)
**Load**: 70–85% of knee, 60–180 minutes
**Pass**:
* p95 does not trend upward beyond threshold
* memory remains bounded
* no rising error rate
### Scenario G — Scaling curve (worker replica sweep)
**Goal**: turn results into scaling rules
**Method**:
* Repeat Scenario C with worker replicas = 1, 2, 4, 8…
**Deliverable**:
* plot of knee load vs worker count
* p95 `t_service` vs worker count (should remain similar; queue delay should drop)
---
## 7) Execution protocol (runbook)
Agents MUST run every scenario using the same disciplined flow:
### Pre-run checklist
* confirm system versions/SHAs
* confirm autoscaling mode:
* **Off** for baseline capacity characterization
* **On** for validating autoscaling policies
* clear queues and consumer group pending entries
* restart or at least record "time since deploy" for services (cold vs warm)
### During run
* ensure load is truly open-loop when required (arrival-rate based)
* continuously record:
* offered vs achieved rate
* queue depth
* CPU/mem for gateway/worker/Valkey
### Post-run
* stop load
* wait until backlog drains (or record that it doesn't)
* export:
* k6/runner raw output
* Prometheus time series snapshot
* sampled logs with corr_id fields
* generate a summary report automatically (no hand calculations)
---
## 8) Analysis rules (how agents compute "the envelope")
Agents MUST generate at minimum two plots per run:
1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis)
* overlay p99 (and SLO line)
2. **Queue behavior**: offered load vs queue depth (or lag), plus drain time
### How to identify the "knee"
Agents SHOULD mark the knee as the first plateau where:
* queue depth grows monotonically within the plateau, **or**
* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50–100%), as in the sketch below
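A minimal sketch of that step-to-step rule; the input shape and the 50% growth default are assumptions to tune:
```csharp
// Sketch: flag the first plateau whose p95 queue delay grew by more than `growth` vs the previous one.
static int? FindKnee(IReadOnlyList<(double offeredRps, double p95QueueDelayMs)> plateaus, double growth = 0.5)
{
    for (int i = 1; i < plateaus.Count; i++)
    {
        double prev = plateaus[i - 1].p95QueueDelayMs;
        if (prev > 0 && (plateaus[i].p95QueueDelayMs - prev) / prev > growth)
            return i;                          // index of the first "knee" plateau
    }
    return null;                               // no knee within the tested range
}
```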
### Convert results into scaling guidance
Agents SHOULD compute:
* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker)
* recommended replicas for offered load λ at target utilization U:
* `workers_needed = ceil(λ * mean(t_service) / U)`
* choose U ~ 0.6–0.75 for headroom
This should be reported alongside the measured envelope.
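As a worked example of the rule above (numbers are illustrative): with a mean service time of 40 ms, capacity_per_worker ≈ 25 jobs/sec; serving 400 rps at U = 0.7 needs ceil(400 × 0.040 / 0.7) = 23 workers.
```csharp
// Sketch of the sizing rule; inputs are illustrative.
static int WorkersNeeded(double offeredRps, double meanServiceSeconds, double targetUtilization = 0.7) =>
    (int)Math.Ceiling(offeredRps * meanServiceSeconds / targetUtilization);

Console.WriteLine(WorkersNeeded(offeredRps: 400, meanServiceSeconds: 0.040));   // 23
```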
---
## 9) Pass/fail criteria and regression gates
Agents MUST define gates in configuration, not in someone's head.
Suggested gating structure:
* **Smoke gate**: error rate < 0.1%, backlog drains
* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history)
* **Capacity gate**: knee load regression < 10% (optional but very valuable)
* **Soak gate**: p95 drift over time < 15% and no memory runaway
---
## 10) Common pitfalls (agents must avoid)
1. **Closed-loop tests used for capacity**
Closed-loop ("N concurrent users") self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity.
2. **Ignoring queue depth**
A system can look "healthy" in request latency while silently building backlog.
3. **Measuring only gateway latency**
You must measure enqueue → claim → done to see the real hop.
4. **Load generator bottleneck**
If the generator saturates, you'll under-estimate capacity.
5. **Retries enabled by default**
Retries can inflate load and hide root causes; run with retries off first.
6. **Not controlling warm vs cold**
Cold caches vs warmed services produce different envelopes; record the condition.
---
# Agent implementation checklist (deliverables)
Assign these as concrete tasks to your agents.
## Agent 1 — Observability & tracing
MUST deliver:
* correlation ID propagation: gateway → Valkey → worker
* timestamps `enq/claim/done`
* Prometheus histograms for enqueue, service, hop
* queue depth metric (`XLEN` / `XINFO` lag)
## Agent 2 — Load test harness
MUST deliver:
* test runner scripts (k6 or equivalent) for scenarios A–G
* test config file (YAML/JSON) controlling:
* stages (rates/durations)
* payload mix
* headers (corr-id)
* reproducible seeds and version stamping
## Agent 3 — Result collector and analyzer
MUST deliver:
* a pipeline that merges:
* load generator output
* hop timing data (from logs or a completion stream)
* Prometheus snapshots
* automatic summary + plots:
* latency envelope
* queue depth/drain
* CSV/JSON exports for long-term tracking
## Agent 4 — Reporting and dashboards
MUST deliver:
* a standard report template that includes:
* environment details
* scenario details
* key charts
* knee estimate
* scaling recommendation
* Grafana dashboard with the required panels
## Agent 5 — CI / release integration
SHOULD deliver:
* PR-level smoke test (Scenario A)
* nightly baseline (Scenario B)
* weekly capacity sweep (Scenario C + scaling curve)
---
## Template: scenario spec (agents can copy/paste)
```yaml
test_run:
  system_under_test:
    gateway_sha: "<git sha>"
    worker_sha: "<git sha>"
    valkey_version: "<version>"
  environment:
    cluster: "<name>"
    workers: 4
    autoscaling: "off"   # off|on
  workload:
    endpoint: "/hop"
    payload_profile: "p50"
    mix:
      p50: 0.7
      p95: 0.25
      max: 0.05
  scenario:
    name: "capacity_ramp"
    mode: "open_loop"
    warmup_seconds: 60
    stages:
      - rps: 50
        duration_seconds: 120
      - rps: 100
        duration_seconds: 120
      - rps: 200
        duration_seconds: 120
      - rps: 400
        duration_seconds: 120
  gates:
    max_error_rate: 0.01
    slo_ms_p95_hop: 500
    backlog_must_drain_seconds: 300
  outputs:
    artifacts_dir: "./artifacts/<timestamp>/"
```
---
## Sample folder layout
```
perf/
  docker-compose.yml
  prometheus/
    prometheus.yml
  k6/
    lib.js
    smoke.js
    capacity_ramp.js
    burst.js
    soak.js
    stress.js
    scaling_curve.sh
  tools/
    analyze.py
  src/
    Perf.Gateway/
    Perf.Worker/
```
---
**Document Version**: 1.0
**Archived From**: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
**Archive Reason**: Wrong content was pasted; this performance testing content preserved for future use.