# Performance Testing Pipeline for Queue-Based Workflows
> **Note**: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.
---
## What we're measuring (plain English)
* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job.
* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round-trip.
* **Worker service time:** time to pick up, process, and ack.
* **Queueing delay:** time spent waiting in the queue (arrival → start of worker).
These four add up to the "hop latency" users feel when the system is under load.
---
## Minimal tracing you can add today
Emit these IDs/headers end-to-end:
* `x-stella-corr-id` (uuid)
* `x-stella-enq-ts` (gateway enqueue ts, ns)
* `x-stella-claim-ts` (worker claim ts, ns)
* `x-stella-done-ts` (worker done ts, ns)
From these, compute:
* `queue_delay = claim_ts - enq_ts`
* `service_time = done_ts - claim_ts`
* `http_ttfs = gateway_first_byte_ts - http_request_start_ts`
* `hop_latency = done_ts - enq_ts` (or return-path if synchronous)
Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock.
---
## Valkey commands (safe, BSD Valkey)
Use **Valkey Streams + Consumer Groups** for fairness and metrics:
* Enqueue: `XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Add a small Lua script so the enqueue timestamp, correlation id, and payload are written in a single atomic command:
```lua
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
```
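For reference, a minimal gateway-side enqueue sketch in Python using redis-py (which speaks the same protocol Valkey implements). The stream name `jobs`, the field names, and the connection details are assumptions carried over from the script above, and `enqueue_job` is a hypothetical helper, not an existing API:
```python
# Minimal enqueue sketch (redis-py; Valkey is protocol-compatible).
import json
import time
import uuid

import redis  # pip install redis

ENQUEUE_LUA = """
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
"""

client = redis.Redis(host="localhost", port=6379)  # placeholder connection details
enqueue = client.register_script(ENQUEUE_LUA)

def enqueue_job(payload: dict) -> tuple[str, str]:
    """Enqueue a job and return (corr_id, stream entry id)."""
    corr_id = str(uuid.uuid4())
    enq_ts_ns = time.time_ns()  # wall-clock ns; pair with NTP drift checks for cross-host math
    entry_id = enqueue(keys=["jobs"], args=[enq_ts_ns, corr_id, json.dumps(payload)])
    return corr_id, entry_id.decode()
```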
---
## Load shapes to test (find the envelope)
1. **Open-loop (arrival-rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
3. **Step-up/down:** double every 2 min until SLO breach; then halve down.
4. **Long tail soak:** run at 70-80% of max for 1h; watch p95-p99.9 drift.
Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**.
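A minimal sketch of the per-step summary, assuming you collect raw nanosecond samples per metric per plateau (NumPy; the sample values below are illustrative only):
```python
# Per-plateau percentile summary sketch.
import numpy as np

PERCENTILES = [50, 90, 95, 99]

def summarize(samples_ns) -> dict:
    """p50/p90/p95/p99 in milliseconds for one metric within one load step."""
    arr = np.asarray(samples_ns, dtype=float) / 1e6  # ns -> ms
    return {f"p{p}": round(float(np.percentile(arr, p)), 2) for p in PERCENTILES}

# Illustrative samples only; real runs feed measured timestamps per plateau.
metrics_by_name = {
    "queue_delay": [4_000_000, 6_000_000, 9_000_000, 35_000_000],
    "service_time": [12_000_000, 15_000_000, 18_000_000, 22_000_000],
}
step_report = {name: summarize(vals) for name, vals in metrics_by_name.items()}
print(step_report)
```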
---
## k6 script (HTTP client pressure)
```javascript
// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
// uuidv4 from the k6 utils jslib, so the script does not depend on WebCrypto being available in your k6 version
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';
export let options = {
scenarios: {
step_load: {
executor: 'ramping-arrival-rate',
startRate: 20, timeUnit: '1s',
preAllocatedVUs: 200, maxVUs: 5000,
stages: [
{ target: 50, duration: '1m' },
{ target: 100, duration: '1m' },
{ target: 200, duration: '1m' },
{ target: 400, duration: '1m' },
{ target: 800, duration: '1m' },
],
},
},
thresholds: {
'http_req_failed': ['rate<0.01'],
'http_req_duration{phase:hop}': ['p(95)<500'],
},
};
export default function () {
  const corr = uuidv4();
const res = http.post(
__ENV.GW_URL,
JSON.stringify({ data: 'ping', corr }),
{
headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
tags: { phase: 'hop' },
}
);
check(res, { 'status 2xx/202': r => r.status === 200 || r.status === 202 });
sleep(0.01);
}
```
Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js`
---
## Worker hooks (.NET 10 sketch)
```csharp
// Requires `using System.Diagnostics;` for Stopwatch; ToNanoseconds() is the helper below.
// At claim
var claimNs = Stopwatch.GetTimestamp().ToNanoseconds(); // monotonic ticks -> ns
log.AddTag("x-stella-claim-ts", claimNs);               // `log` is a placeholder for your logging/metrics API
// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics
```
Helper:
```csharp
public static class MonoTime {
static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}
```
---
## Prometheus metrics to expose
* `valkey_enqueue_ns` (histogram)
* `valkey_claim_block_ms` (gauge)
* `queue_delay_ns` (histogram, `claim_ts - enq_ts`, observed by the worker at claim)
* `worker_service_ns` (histogram, labels: worker_type, route)
* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`)
* `enqueue_rate`, `dequeue_rate` (counters)
Example recording rules:
```yaml
- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95
expr: hop:queue_delay_p95 + hop:service_time_p95
```
---
## Autoscaling signals (HPA/KEDA friendly)
* **Primary:** queue depth & its derivative (d/dt).
* **Secondary:** p95 `queue_delay` and worker CPU.
* **Safety:** max in-flight per worker; backpressure HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8`.
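As a sketch of the safety rule, a small framework-agnostic predicate the gateway could evaluate before enqueueing; the thresholds and the `BackpressurePolicy` type are illustrative assumptions, not an existing API:
```python
# Backpressure decision sketch: shed load (HTTP 429) before the queue runs away.
from dataclasses import dataclass

@dataclass
class BackpressurePolicy:
    max_queue_depth: int = 5_000           # "D": tune from the capacity ramp
    slo_p95_queue_delay_ms: float = 500.0  # SLO for p95 queue delay
    shed_fraction_of_slo: float = 0.8      # start shedding at 80% of the SLO

    def should_shed(self, queue_depth: int, p95_queue_delay_ms: float) -> bool:
        return (queue_depth > self.max_queue_depth
                or p95_queue_delay_ms > self.slo_p95_queue_delay_ms * self.shed_fraction_of_slo)

policy = BackpressurePolicy()
if policy.should_shed(queue_depth=6_200, p95_queue_delay_ms=310.0):
    ...  # respond 429 (ideally with Retry-After) instead of enqueueing
```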
---
## Plot the "envelope" (what you'll look at)
* X-axis: **offered load** (req/s).
* Y-axis: **p95 hop latency** (ms).
* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises).
* Add secondary panel: **queue depth** vs load.
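A minimal plotting sketch (matplotlib) for those two panels; all numbers below are placeholders for real per-plateau results:
```python
# Envelope plot sketch: p95/p99 hop latency and queue depth vs offered load.
import matplotlib.pyplot as plt

offered_rps = [50, 100, 200, 400, 800]   # placeholder per-plateau results
p95_hop_ms  = [42, 45, 60, 180, 900]
p99_hop_ms  = [70, 80, 120, 420, 2500]
queue_depth = [3, 5, 20, 450, 9000]
SLO_MS = 500

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(7, 6))
ax1.plot(offered_rps, p95_hop_ms, marker="o", label="p95 hop")
ax1.plot(offered_rps, p99_hop_ms, linestyle="--", marker="o", label="p99 hop")
ax1.axhline(SLO_MS, color="red", linewidth=1, label="SLO (500 ms)")
ax1.set_ylabel("hop latency (ms)")
ax1.legend()

ax2.plot(offered_rps, queue_depth, marker="o", color="gray")
ax2.set_xlabel("offered load (req/s)")
ax2.set_ylabel("queue depth")
fig.tight_layout()
fig.savefig("latency_envelope.png")
```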
---
# Performance Test Guidelines
## HTTP → Valkey → Worker pipeline
## 1) Objectives and scope
### Primary objectives
Your performance tests MUST answer these questions with evidence:
1. **Capacity knee**: At what offered load does **queue delay** start growing sharply?
2. **User-impact envelope**: What are p50/p95/p99 **hop latency** curves vs offered load?
3. **Decomposition**: How much of hop latency is:
* gateway enqueue time
* Valkey enqueue/claim RTT
* queue wait time
* worker service time
4. **Scaling behavior**: How do these change with worker replica counts (N workers)?
5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?
### Non-goals (explicitly out of scope unless you add them later)
* Micro-optimizing single function runtime
* Synthetic "max QPS" records without a representative payload
* Anything beyond a basic smoke test that collects only end-to-end timings (no segment metrics)
---
## 2) Definitions and required metrics
### Required latency definitions (standardize these names)
Agents MUST compute and report these per request/job:
* **`t_http_accept`**: time from client send → gateway accepts request
* **`t_enqueue`**: time spent in gateway to enqueue into Valkey (server-side)
* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s)
* **`t_queue_delay`**: `claim_ts - enq_ts`
* **`t_service`**: `done_ts - claim_ts`
* **`t_hop`**: `done_ts - enq_ts` (this is the "true pipeline hop" latency)
* Optional but recommended:
* **`t_ack`**: time to ack completion (Valkey ack RTT)
* **`t_http_response`**: request start → gateway response sent (TTFB/TTFS)
### Required percentiles and aggregations
Per scenario step (e.g., each offered load plateau), agents MUST output:
* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue`
* Throughput: offered rps and achieved rps
* Error rate: HTTP failures, enqueue failures, worker failures
* Queue depth and backlog drain time
### Required system-level telemetry (minimum)
Agents MUST collect these time series during tests:
* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
* **Gateway**: CPU/mem, request rate, response codes, request duration histogram
---
## 3) Environment and test hygiene requirements
### Environment requirements
Agents SHOULD run tests in an environment that matches production in:
* container CPU/memory limits
* number of nodes, network topology
* Valkey topology (single, cluster, sentinel, etc.)
* worker replica autoscaling rules (or deliberately disabled)
If exact parity isn't possible, agents MUST record all known differences in the report.
### Test hygiene (non-negotiable)
Agents MUST:
1. **Start from empty queues** (no backlog).
2. **Disable client retries** (or explicitly run two variants: retries off / retries on).
3. **Warm up** before measuring (e.g., 60s warm-up minimum).
4. **Hold steady plateaus** long enough to stabilize (usually 2-5 minutes per step).
5. **Cool down** and verify backlog drains (queue depth returns to baseline).
6. Record exact versions/SHAs of gateway/worker and Valkey config.
### Load generator hygiene
Agents MUST ensure the load generator is not the bottleneck:
* CPU < ~70% during test
* no local socket exhaustion
* enough VUs/connections
* if needed, distributed load generation
---
## 4) Instrumentation spec (agents implement this first)
### Correlation and timestamps
Agents MUST propagate an end-to-end correlation ID and timestamps.
**Required fields**
* `corr_id` (UUID)
* `enq_ts_ns` (set at enqueue, monotonic or consistent clock)
* `claim_ts_ns` (set by worker when job is claimed)
* `done_ts_ns` (set by worker when job processing ends)
**Where these live**
* HTTP request header: `x-corr-id: <uuid>`
* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type
* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns`
### Clock requirements
Agents MUST use a consistent timing source:
* Prefer monotonic timers for durations (Stopwatch / monotonic clock)
* If timestamps cross machines, ensure they're comparable:
* either rely on synchronized clocks (NTP) **and** monitor drift
* or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)
**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.
### Valkey queue semantics (recommended)
Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability:
* Enqueue: `XADD jobs * corr <uuid> enq <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Agents MUST record stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag.
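A minimal depth/lag sampler sketch, assuming redis-py against Valkey and the `jobs`/`workers` names used above; the `lag` field requires a server that reports it in `XINFO GROUPS` (Redis/Valkey 7.0+ semantics):
```python
# Queue depth / consumer-group lag sampler sketch.
import time
import redis  # pip install redis

client = redis.Redis(host="localhost", port=6379)  # placeholder connection details

def sample_queue(stream: str = "jobs", group: str = "workers") -> dict:
    depth = client.xlen(stream)                 # total entries still in the stream
    groups = client.xinfo_groups(stream)        # per-group pending / lag info
    lag = next((g.get("lag") for g in groups if g.get("name") == group.encode()), None)
    return {"ts": time.time(), "depth": depth, "lag": lag}

for _ in range(10):          # sample once per second during a test step
    print(sample_queue())
    time.sleep(1)
```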
### Metrics exposure
Agents MUST publish Prometheus (or equivalent) histograms:
* `gateway_enqueue_seconds` (or ns) histogram
* `valkey_enqueue_rtt_seconds` histogram
* `worker_service_seconds` histogram
* `queue_delay_seconds` histogram (derived from timestamps; can be computed in worker or offline)
* `hop_latency_seconds` histogram
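A sketch of how these histograms might be registered and fed with `prometheus_client`; the bucket boundaries, the port, and the `record_job` helper are illustrative assumptions:
```python
# Histogram exposure sketch using prometheus_client.
from prometheus_client import Histogram, start_http_server

BUCKETS = (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5)  # seconds

queue_delay = Histogram("queue_delay_seconds", "claim_ts - enq_ts", buckets=BUCKETS)
service = Histogram("worker_service_seconds", "done_ts - claim_ts", buckets=BUCKETS)
hop = Histogram("hop_latency_seconds", "done_ts - enq_ts", buckets=BUCKETS)

start_http_server(9100)  # scrape endpoint; port is a placeholder

def record_job(enq_ts_ns: int, claim_ts_ns: int, done_ts_ns: int) -> None:
    """Observe one completed job's segment timings (nanoseconds in, seconds out)."""
    queue_delay.observe((claim_ts_ns - enq_ts_ns) / 1e9)
    service.observe((done_ts_ns - claim_ts_ns) / 1e9)
    hop.observe((done_ts_ns - enq_ts_ns) / 1e9)
```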
---
## 5) Workload modeling and test data
Agents MUST define a workload model before running capacity tests:
1. **Endpoint(s)**: list exact gateway routes under test
2. **Payload types**: small/typical/large
3. **Mix**: e.g., 70/25/5 by payload size
4. **Idempotency rules**: ensure repeated jobs don't corrupt state
5. **Data reset strategy**: how test data is cleaned or isolated per run
Agents SHOULD test at least:
* Typical payload (p50)
* Large payload (p95)
* Worst-case allowed payload (bounded by your API limits)
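A trivial payload-mix sampler sketch using the 70/25/5 example mix from above; the payload sizes are placeholders bounded by whatever your API limits actually are:
```python
# Payload mix sampler sketch: pick a payload profile per request according to the declared mix.
import random

PAYLOADS = {
    "p50": {"data": "x" * 256},      # typical
    "p95": {"data": "x" * 16_384},   # large
    "max": {"data": "x" * 262_144},  # worst-case allowed (placeholder sizes)
}
MIX = {"p50": 0.70, "p95": 0.25, "max": 0.05}

def next_payload(rng: random.Random) -> dict:
    profile = rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]
    return PAYLOADS[profile]

rng = random.Random(42)  # fixed seed so runs are reproducible
print(len(next_payload(rng)["data"]))
```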
---
## 6) Scenario suite your agents MUST implement
Each scenario MUST be defined as code/config (not manual).
### Scenario A — Smoke (fast sanity)
**Goal**: verify instrumentation + basic correctness
**Load**: low (e.g., 1-5 rps), 2 minutes
**Pass**:
* 0 backlog after run
* error rate < 0.1%
* metrics present for all segments
### Scenario B — Baseline (repeatable reference point)
**Goal**: establish a stable baseline for regression tracking
**Load**: fixed moderate load (e.g., 30-50% of expected capacity), 10 minutes
**Pass**:
* p95 `t_hop` within baseline ± tolerance (set after first runs)
* no upward drift in p95 across time (trend line ~flat)
### Scenario C — Capacity ramp (open-loop)
**Goal**: find the knee where queueing begins
**Method**: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):
* 50 rps for 2m
* 100 rps for 2m
* 200 rps for 2m
* 400 rps for 2m
* until SLO breach or errors spike
**MUST**:
* warm-up stage before first plateau
* record per-plateau summary
**Stop conditions** (any triggers stop):
* error rate > 1%
* queue depth grows without bound over an entire plateau
* p95 `t_hop` exceeds SLO for 2 consecutive plateaus
### Scenario D — Stress (push past capacity)
**Goal**: characterize failure mode and recovery
**Load**: 120-200% of knee load, 5-10 minutes
**Pass** (for resilience):
* system does not crash permanently
* once load stops, backlog drains within target time (define it)
### Scenario E — Burst / spike
**Goal**: see how quickly queue grows and drains
**Load shape**:
* baseline low load
* sudden burst (e.g., 10× for 10-30 s)
* return to baseline
**Report**:
* peak queue depth
* time to drain to baseline
* p99 `t_hop` during burst
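A small analysis sketch for deriving peak depth and drain time from sampled `(timestamp, queue_depth)` pairs; the baseline threshold and sample values are illustrative:
```python
# Burst analysis sketch: peak queue depth and time to drain back to baseline.
def analyze_burst(samples, baseline_depth: int = 10):
    """samples: list of (unix_ts, queue_depth) pairs, ordered by time."""
    peak_ts, peak_depth = max(samples, key=lambda s: s[1])
    drain_ts = next((ts for ts, depth in samples if ts > peak_ts and depth <= baseline_depth), None)
    drain_seconds = (drain_ts - peak_ts) if drain_ts is not None else None
    return {"peak_depth": peak_depth, "drain_seconds": drain_seconds}

samples = [(0, 5), (10, 4200), (20, 2600), (40, 900), (70, 8)]  # illustrative only
print(analyze_burst(samples))  # {'peak_depth': 4200, 'drain_seconds': 60}
```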
### Scenario F — Soak (long-running)
**Goal**: detect drift (leaks, fragmentation, GC patterns)
**Load**: 70-85% of knee, 60-180 minutes
**Pass**:
* p95 does not trend upward beyond threshold
* memory remains bounded
* no rising error rate
### Scenario G — Scaling curve (worker replica sweep)
**Goal**: turn results into scaling rules
**Method**:
* Repeat Scenario C with worker replicas = 1, 2, 4, 8…
**Deliverable**:
* plot of knee load vs worker count
* p95 `t_service` vs worker count (should remain similar; queue delay should drop)
---
## 7) Execution protocol (runbook)
Agents MUST run every scenario using the same disciplined flow:
### Pre-run checklist
* confirm system versions/SHAs
* confirm autoscaling mode:
* **Off** for baseline capacity characterization
* **On** for validating autoscaling policies
* clear queues and consumer group pending entries
* restart or at least record "time since deploy" for services (cold vs warm)
### During run
* ensure load is truly open-loop when required (arrival-rate based)
* continuously record:
* offered vs achieved rate
* queue depth
* CPU/mem for gateway/worker/Valkey
### Post-run
* stop load
* wait until backlog drains (or record that it doesn't)
* export:
* k6/runner raw output
* Prometheus time series snapshot
* sampled logs with corr_id fields
* generate a summary report automatically (no hand calculations)
---
## 8) Analysis rules (how agents compute "the envelope")
Agents MUST generate at minimum two plots per run:
1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis)
* overlay p99 (and SLO line)
2. **Queue behavior**: offered load vs queue depth (or lag), plus drain time
### How to identify the "knee"
Agents SHOULD mark the knee as the first plateau where:
* queue depth grows monotonically within the plateau, **or**
* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50-100%)
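These rules can be codified directly; a sketch over per-plateau summaries (field names are illustrative, not a required schema):
```python
# Knee detection sketch over per-plateau summaries.
def find_knee(plateaus, rel_increase: float = 0.5):
    """Return the offered rate of the first plateau that meets either knee rule, else None."""
    for prev, cur in zip(plateaus, plateaus[1:]):
        growing = cur["depth_grew_monotonically"]
        jump = (cur["p95_queue_delay_ms"] - prev["p95_queue_delay_ms"]) / max(prev["p95_queue_delay_ms"], 1e-9)
        if growing or jump > rel_increase:
            return cur["rps"]
    return None

plateaus = [
    {"rps": 100, "p95_queue_delay_ms": 8.0,  "depth_grew_monotonically": False},
    {"rps": 200, "p95_queue_delay_ms": 9.5,  "depth_grew_monotonically": False},
    {"rps": 400, "p95_queue_delay_ms": 31.0, "depth_grew_monotonically": True},
]
print(find_knee(plateaus))  # 400
```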
### Convert results into scaling guidance
Agents SHOULD compute:
* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker)
* recommended replicas for offered load λ at target utilization U:
* `workers_needed = ceil(λ * mean(t_service) / U)`
* choose U ≈ 0.6-0.75 for headroom
This should be reported alongside the measured envelope.
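A worked sketch of that sizing rule, with illustrative numbers:
```python
# Worker sizing sketch: workers_needed = ceil(lambda * mean(t_service) / U).
import math

mean_t_service_s = 0.040  # 40 ms mean service time (illustrative)
offered_lambda = 300      # jobs/sec the system must absorb (illustrative)
target_util = 0.7         # headroom; 0.6-0.75 is a reasonable band

capacity_per_worker = 1 / mean_t_service_s                               # ~25 jobs/sec per worker
workers_needed = math.ceil(offered_lambda * mean_t_service_s / target_util)

print(capacity_per_worker, workers_needed)  # 25.0 18  (300 * 0.04 / 0.7 = 17.14 -> 18)
```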
---
## 9) Pass/fail criteria and regression gates
Agents MUST define gates in configuration, not in someone's head.
Suggested gating structure:
* **Smoke gate**: error rate < 0.1%, backlog drains
* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history)
* **Capacity gate**: knee load regression < 10% (optional but very valuable)
* **Soak gate**: p95 drift over time < 15% and no memory runaway
---
## 10) Common pitfalls (agents must avoid)
1. **Closed-loop tests used for capacity**
Closed-loop ("N concurrent users") self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity.
2. **Ignoring queue depth**
A system can look "healthy" in request latency while silently building backlog.
3. **Measuring only gateway latency**
You must measure enqueue → claim → done to see the real hop.
4. **Load generator bottleneck**
If the generator saturates, you'll under-estimate capacity.
5. **Retries enabled by default**
Retries can inflate load and hide root causes; run with retries off first.
6. **Not controlling warm vs cold**
Cold caches vs warmed services produce different envelopes; record the condition.
---
# Agent implementation checklist (deliverables)
Assign these as concrete tasks to your agents.
## Agent 1 — Observability & tracing
MUST deliver:
* correlation id propagation gateway → Valkey → worker
* timestamps `enq/claim/done`
* Prometheus histograms for enqueue, service, hop
* queue depth metric (`XLEN` / `XINFO` lag)
## Agent 2 — Load test harness
MUST deliver:
* test runner scripts (k6 or equivalent) for scenarios A-G
* test config file (YAML/JSON) controlling:
* stages (rates/durations)
* payload mix
* headers (corr-id)
* reproducible seeds and version stamping
## Agent 3 — Result collector and analyzer
MUST deliver:
* a pipeline that merges:
* load generator output
* hop timing data (from logs or a completion stream)
* Prometheus snapshots
* automatic summary + plots:
* latency envelope
* queue depth/drain
* CSV/JSON exports for long-term tracking
## Agent 4 — Reporting and dashboards
MUST deliver:
* a standard report template that includes:
* environment details
* scenario details
* key charts
* knee estimate
* scaling recommendation
* Grafana dashboard with the required panels
## Agent 5 — CI / release integration
SHOULD deliver:
* PR-level smoke test (Scenario A)
* nightly baseline (Scenario B)
* weekly capacity sweep (Scenario C + scaling curve)
---
## Template: scenario spec (agents can copy/paste)
```yaml
test_run:
system_under_test:
gateway_sha: "<git sha>"
worker_sha: "<git sha>"
valkey_version: "<version>"
environment:
cluster: "<name>"
workers: 4
autoscaling: "off" # off|on
workload:
endpoint: "/hop"
payload_profile: "p50"
mix:
p50: 0.7
p95: 0.25
max: 0.05
scenario:
name: "capacity_ramp"
mode: "open_loop"
warmup_seconds: 60
stages:
- rps: 50
duration_seconds: 120
- rps: 100
duration_seconds: 120
- rps: 200
duration_seconds: 120
- rps: 400
duration_seconds: 120
gates:
max_error_rate: 0.01
slo_ms_p95_hop: 500
backlog_must_drain_seconds: 300
outputs:
artifacts_dir: "./artifacts/<timestamp>/"
```
---
## Sample folder layout
```
perf/
docker-compose.yml
prometheus/
prometheus.yml
k6/
lib.js
smoke.js
capacity_ramp.js
burst.js
soak.js
stress.js
scaling_curve.sh
tools/
analyze.py
src/
Perf.Gateway/
Perf.Worker/
```
---
**Document Version**: 1.0
**Archived From**: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
**Archive Reason**: Wrong content was pasted; this performance testing content preserved for future use.