# Performance Testing Pipeline for Queue-Based Workflows
> **Note**: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.
---
## What we're measuring (plain English)
* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job.
* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round-trip.
* **Worker service time:** time to pick up, process, and ack.
* **Queueing delay:** time spent waiting in the queue (arrival → start of worker).
These four add up to the "hop latency" users feel when the system is under load.
---
## Minimal tracing you can add today
Emit these IDs/headers end-to-end:
* `x-stella-corr-id` (uuid)
* `x-stella-enq-ts` (gateway enqueue ts, ns)
* `x-stella-claim-ts` (worker claim ts, ns)
* `x-stella-done-ts` (worker done ts, ns)
From these, compute:
* `queue_delay = claim_ts - enq_ts`
* `service_time = done_ts - claim_ts`
* `http_ttfs = gateway_first_byte_ts - http_request_start_ts`
* `hop_latency = done_ts - enq_ts` (or return-path if synchronous)
Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock.
---
## Valkey commands (safe, BSD Valkey)
Use **Valkey Streams + Consumer Groups** for fairness and metrics:
* Enqueue: `XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Add a small Lua script so the enqueue timestamp, correlation id, and payload are written in a single atomic command:
```lua
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
```
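For reference, a minimal gateway-side enqueue sketch in Python using redis-py (which speaks the same protocol Valkey implements). The stream name `jobs`, the field names, and the connection details are assumptions carried over from the script above, and `enqueue_job` is a hypothetical helper, not an existing API:
```python
# Minimal enqueue sketch (redis-py; Valkey is protocol-compatible).
import json
import time
import uuid

import redis  # pip install redis

ENQUEUE_LUA = """
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
"""

client = redis.Redis(host="localhost", port=6379)  # placeholder connection details
enqueue = client.register_script(ENQUEUE_LUA)

def enqueue_job(payload: dict) -> tuple[str, str]:
    """Enqueue a job and return (corr_id, stream entry id)."""
    corr_id = str(uuid.uuid4())
    enq_ts_ns = time.time_ns()  # wall-clock ns; pair with NTP drift checks for cross-host math
    entry_id = enqueue(keys=["jobs"], args=[enq_ts_ns, corr_id, json.dumps(payload)])
    return corr_id, entry_id.decode()
```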
---
## Load shapes to test (find the envelope)
1. **Open-loop (arrival-rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
3. **Step-up/down:** double every 2 min until SLO breach; then halve down.
4. **Long tail soak:** run at 70-80% of max for 1h; watch p95-p99.9 drift.
Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**.
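A minimal sketch of the per-step summary, assuming you collect raw nanosecond samples per metric per plateau (NumPy; the sample values below are illustrative only):
```python
# Per-plateau percentile summary sketch.
import numpy as np

PERCENTILES = [50, 90, 95, 99]

def summarize(samples_ns) -> dict:
    """p50/p90/p95/p99 in milliseconds for one metric within one load step."""
    arr = np.asarray(samples_ns, dtype=float) / 1e6  # ns -> ms
    return {f"p{p}": round(float(np.percentile(arr, p)), 2) for p in PERCENTILES}

# Illustrative samples only; real runs feed measured timestamps per plateau.
metrics_by_name = {
    "queue_delay": [4_000_000, 6_000_000, 9_000_000, 35_000_000],
    "service_time": [12_000_000, 15_000_000, 18_000_000, 22_000_000],
}
step_report = {name: summarize(vals) for name, vals in metrics_by_name.items()}
print(step_report)
```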
---
## k6 script (HTTP client pressure)
```javascript
// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
// uuidv4 from the k6 utils jslib, so the script does not depend on WebCrypto being available in your k6 version
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';
export let options = {
scenarios: {
step_load: {
executor: 'ramping-arrival-rate',
startRate: 20, timeUnit: '1s',
preAllocatedVUs: 200, maxVUs: 5000,
stages: [
{ target: 50, duration: '1m' },
{ target: 100, duration: '1m' },
{ target: 200, duration: '1m' },
{ target: 400, duration: '1m' },
{ target: 800, duration: '1m' },
],
},
},
thresholds: {
'http_req_failed': ['rate<0.01'],
'http_req_duration{phase:hop}': ['p(95)<500'],
},
};
export default function () {
  const corr = uuidv4();
const res = http.post(
__ENV.GW_URL,
JSON.stringify({ data: 'ping', corr }),
{
headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
tags: { phase: 'hop' },
}
);
check(res, { 'status 2xx/202': r => r.status === 200 || r.status === 202 });
sleep(0.01);
}
```
Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js`
---
## Worker hooks (.NET 10 sketch)
```csharp
// Requires `using System.Diagnostics;` for Stopwatch; ToNanoseconds() is the helper below.
// At claim
var claimNs = Stopwatch.GetTimestamp().ToNanoseconds(); // monotonic ticks -> ns
log.AddTag("x-stella-claim-ts", claimNs);               // `log` is a placeholder for your logging/metrics API
// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics
```
Helper:
```csharp
public static class MonoTime {
static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}
```
---
## Prometheus metrics to expose
* `valkey_enqueue_ns` (histogram)
* `valkey_claim_block_ms` (gauge)
* `queue_delay_ns` (histogram, `claim_ts - enq_ts`, observed by the worker at claim)
* `worker_service_ns` (histogram, labels: worker_type, route)
* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`)
* `enqueue_rate`, `dequeue_rate` (counters)
Example recording rules:
```yaml
- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95
expr: hop:queue_delay_p95 + hop:service_time_p95
```
---
## Autoscaling signals (HPA/KEDA friendly)
* **Primary:** queue depth & its derivative (d/dt).
* **Secondary:** p95 `queue_delay` and worker CPU.
* **Safety:** max in-flight per worker; backpressure HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8`.
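As a sketch of the safety rule, a small framework-agnostic predicate the gateway could evaluate before enqueueing; the thresholds and the `BackpressurePolicy` type are illustrative assumptions, not an existing API:
```python
# Backpressure decision sketch: shed load (HTTP 429) before the queue runs away.
from dataclasses import dataclass

@dataclass
class BackpressurePolicy:
    max_queue_depth: int = 5_000           # "D": tune from the capacity ramp
    slo_p95_queue_delay_ms: float = 500.0  # SLO for p95 queue delay
    shed_fraction_of_slo: float = 0.8      # start shedding at 80% of the SLO

    def should_shed(self, queue_depth: int, p95_queue_delay_ms: float) -> bool:
        return (queue_depth > self.max_queue_depth
                or p95_queue_delay_ms > self.slo_p95_queue_delay_ms * self.shed_fraction_of_slo)

policy = BackpressurePolicy()
if policy.should_shed(queue_depth=6_200, p95_queue_delay_ms=310.0):
    ...  # respond 429 (ideally with Retry-After) instead of enqueueing
```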
---
## Plot the "envelope" (what you'll look at)
* X-axis: **offered load** (req/s).
* Y-axis: **p95 hop latency** (ms).
* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises).
* Add secondary panel: **queue depth** vs load.
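A minimal plotting sketch (matplotlib) for those two panels; all numbers below are placeholders for real per-plateau results:
```python
# Envelope plot sketch: p95/p99 hop latency and queue depth vs offered load.
import matplotlib.pyplot as plt

offered_rps = [50, 100, 200, 400, 800]   # placeholder per-plateau results
p95_hop_ms  = [42, 45, 60, 180, 900]
p99_hop_ms  = [70, 80, 120, 420, 2500]
queue_depth = [3, 5, 20, 450, 9000]
SLO_MS = 500

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(7, 6))
ax1.plot(offered_rps, p95_hop_ms, marker="o", label="p95 hop")
ax1.plot(offered_rps, p99_hop_ms, linestyle="--", marker="o", label="p99 hop")
ax1.axhline(SLO_MS, color="red", linewidth=1, label="SLO (500 ms)")
ax1.set_ylabel("hop latency (ms)")
ax1.legend()

ax2.plot(offered_rps, queue_depth, marker="o", color="gray")
ax2.set_xlabel("offered load (req/s)")
ax2.set_ylabel("queue depth")
fig.tight_layout()
fig.savefig("latency_envelope.png")
```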
---
# Performance Test Guidelines
## HTTP → Valkey → Worker pipeline
## 1) Objectives and scope
### Primary objectives
Your performance tests MUST answer these questions with evidence:
1. **Capacity knee**: At what offered load does **queue delay** start growing sharply?
2. **User-impact envelope**: What are p50/p95/p99 **hop latency** curves vs offered load?
3. **Decomposition**: How much of hop latency is:
* gateway enqueue time
* Valkey enqueue/claim RTT
* queue wait time
* worker service time
4. **Scaling behavior**: How do these change with worker replica counts (N workers)?
5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?
### Non-goals (explicitly out of scope unless you add them later)
* Micro-optimizing single function runtime
* Synthetic "max QPS" records without a representative payload
* Anything beyond a basic smoke test that collects only end-to-end timings (no segment metrics)
---
## 2) Definitions and required metrics
### Required latency definitions (standardize these names)
Agents MUST compute and report these per request/job:
* **`t_http_accept`**: time from client send → gateway accepts request
* **`t_enqueue`**: time spent in gateway to enqueue into Valkey (server-side)
* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s)
* **`t_queue_delay`**: `claim_ts - enq_ts`
* **`t_service`**: `done_ts - claim_ts`
* **`t_hop`**: `done_ts - enq_ts` (this is the "true pipeline hop" latency)
* Optional but recommended:
* **`t_ack`**: time to ack completion (Valkey ack RTT)
* **`t_http_response`**: request start → gateway response sent (TTFB/TTFS)
### Required percentiles and aggregations
Per scenario step (e.g., each offered load plateau), agents MUST output:
* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue`
* Throughput: offered rps and achieved rps
* Error rate: HTTP failures, enqueue failures, worker failures
* Queue depth and backlog drain time
### Required system-level telemetry (minimum)
Agents MUST collect these time series during tests:
* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
* **Gateway**: CPU/mem, request rate, response codes, request duration histogram
---
## 3) Environment and test hygiene requirements
### Environment requirements
Agents SHOULD run tests in an environment that matches production in:
* container CPU/memory limits
* number of nodes, network topology
* Valkey topology (single, cluster, sentinel, etc.)
* worker replica autoscaling rules (or deliberately disabled)
If exact parity isn't possible, agents MUST record all known differences in the report.
### Test hygiene (non-negotiable)
Agents MUST:
1. **Start from empty queues** (no backlog).
2. **Disable client retries** (or explicitly run two variants: retries off / retries on).
3. **Warm up** before measuring (e.g., 60s warm-up minimum).
4. **Hold steady plateaus** long enough to stabilize (usually 2-5 minutes per step).
5. **Cool down** and verify backlog drains (queue depth returns to baseline).
6. Record exact versions/SHAs of gateway/worker and Valkey config.
### Load generator hygiene
Agents MUST ensure the load generator is not the bottleneck:
* CPU < ~70% during test
* no local socket exhaustion
* enough VUs/connections
* if needed, distributed load generation
---
## 4) Instrumentation spec (agents implement this first)
### Correlation and timestamps
Agents MUST propagate an end-to-end correlation ID and timestamps.
**Required fields**
* `corr_id` (UUID)
* `enq_ts_ns` (set at enqueue, monotonic or consistent clock)
* `claim_ts_ns` (set by worker when job is claimed)
* `done_ts_ns` (set by worker when job processing ends)
**Where these live**
* HTTP request header: `x-corr-id: <uuid>`
* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type
* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns`
### Clock requirements
Agents MUST use a consistent timing source:
* Prefer monotonic timers for durations (Stopwatch / monotonic clock)
* If timestamps cross machines, ensure they're comparable:
* either rely on synchronized clocks (NTP) **and** monitor drift
* or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)
**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.
### Valkey queue semantics (recommended)
Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability:
* Enqueue: `XADD jobs * corr <uuid> enq <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`
Agents MUST record stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag.
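A minimal depth/lag sampler sketch, assuming redis-py against Valkey and the `jobs`/`workers` names used above; the `lag` field requires a server that reports it in `XINFO GROUPS` (Redis/Valkey 7.0+ semantics):
```python
# Queue depth / consumer-group lag sampler sketch.
import time
import redis  # pip install redis

client = redis.Redis(host="localhost", port=6379)  # placeholder connection details

def sample_queue(stream: str = "jobs", group: str = "workers") -> dict:
    depth = client.xlen(stream)                 # total entries still in the stream
    groups = client.xinfo_groups(stream)        # per-group pending / lag info
    lag = next((g.get("lag") for g in groups if g.get("name") == group.encode()), None)
    return {"ts": time.time(), "depth": depth, "lag": lag}

for _ in range(10):          # sample once per second during a test step
    print(sample_queue())
    time.sleep(1)
```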
### Metrics exposure
Agents MUST publish Prometheus (or equivalent) histograms:
* `gateway_enqueue_seconds` (or ns) histogram
* `valkey_enqueue_rtt_seconds` histogram
* `worker_service_seconds` histogram
* `queue_delay_seconds` histogram (derived from timestamps; can be computed in worker or offline)
* `hop_latency_seconds` histogram
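A sketch of how these histograms might be registered and fed with `prometheus_client`; the bucket boundaries, the port, and the `record_job` helper are illustrative assumptions:
```python
# Histogram exposure sketch using prometheus_client.
from prometheus_client import Histogram, start_http_server

BUCKETS = (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5)  # seconds

queue_delay = Histogram("queue_delay_seconds", "claim_ts - enq_ts", buckets=BUCKETS)
service = Histogram("worker_service_seconds", "done_ts - claim_ts", buckets=BUCKETS)
hop = Histogram("hop_latency_seconds", "done_ts - enq_ts", buckets=BUCKETS)

start_http_server(9100)  # scrape endpoint; port is a placeholder

def record_job(enq_ts_ns: int, claim_ts_ns: int, done_ts_ns: int) -> None:
    """Observe one completed job's segment timings (nanoseconds in, seconds out)."""
    queue_delay.observe((claim_ts_ns - enq_ts_ns) / 1e9)
    service.observe((done_ts_ns - claim_ts_ns) / 1e9)
    hop.observe((done_ts_ns - enq_ts_ns) / 1e9)
```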
---
## 5) Workload modeling and test data
Agents MUST define a workload model before running capacity tests:
1. **Endpoint(s)**: list exact gateway routes under test
2. **Payload types**: small/typical/large
3. **Mix**: e.g., 70/25/5 by payload size
4. **Idempotency rules**: ensure repeated jobs don't corrupt state
5. **Data reset strategy**: how test data is cleaned or isolated per run
Agents SHOULD test at least:
* Typical payload (p50)
* Large payload (p95)
* Worst-case allowed payload (bounded by your API limits)
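A trivial payload-mix sampler sketch using the 70/25/5 example mix from above; the payload sizes are placeholders bounded by whatever your API limits actually are:
```python
# Payload mix sampler sketch: pick a payload profile per request according to the declared mix.
import random

PAYLOADS = {
    "p50": {"data": "x" * 256},      # typical
    "p95": {"data": "x" * 16_384},   # large
    "max": {"data": "x" * 262_144},  # worst-case allowed (placeholder sizes)
}
MIX = {"p50": 0.70, "p95": 0.25, "max": 0.05}

def next_payload(rng: random.Random) -> dict:
    profile = rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]
    return PAYLOADS[profile]

rng = random.Random(42)  # fixed seed so runs are reproducible
print(len(next_payload(rng)["data"]))
```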
---
## 6) Scenario suite your agents MUST implement
Each scenario MUST be defined as code/config (not manual).
### Scenario A — Smoke (fast sanity)
**Goal**: verify instrumentation + basic correctness
**Load**: low (e.g., 1-5 rps), 2 minutes
**Pass**:
* 0 backlog after run
* error rate < 0.1%
* metrics present for all segments
### Scenario B — Baseline (repeatable reference point)
**Goal**: establish a stable baseline for regression tracking
**Load**: fixed moderate load (e.g., 30-50% of expected capacity), 10 minutes
**Pass**:
* p95 `t_hop` within baseline ± tolerance (set after first runs)
* no upward drift in p95 across time (trend line ~flat)
### Scenario C — Capacity ramp (open-loop)
**Goal**: find the knee where queueing begins
**Method**: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):
* 50 rps for 2m
* 100 rps for 2m
* 200 rps for 2m
* 400 rps for 2m
* until SLO breach or errors spike
**MUST**:
* warm-up stage before first plateau
* record per-plateau summary
**Stop conditions** (any triggers stop):
* error rate > 1%
* queue depth grows without bound over an entire plateau
* p95 `t_hop` exceeds SLO for 2 consecutive plateaus
### Scenario D — Stress (push past capacity)
**Goal**: characterize failure mode and recovery
**Load**: 120-200% of knee load, 5-10 minutes
**Pass** (for resilience):
* system does not crash permanently
* once load stops, backlog drains within target time (define it)
### Scenario E — Burst / spike
**Goal**: see how quickly queue grows and drains
**Load shape**:
* baseline low load
* sudden burst (e.g., 10× for 10-30 s)
* return to baseline
**Report**:
* peak queue depth
* time to drain to baseline
* p99 `t_hop` during burst
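A small analysis sketch for deriving peak depth and drain time from sampled `(timestamp, queue_depth)` pairs; the baseline threshold and sample values are illustrative:
```python
# Burst analysis sketch: peak queue depth and time to drain back to baseline.
def analyze_burst(samples, baseline_depth: int = 10):
    """samples: list of (unix_ts, queue_depth) pairs, ordered by time."""
    peak_ts, peak_depth = max(samples, key=lambda s: s[1])
    drain_ts = next((ts for ts, depth in samples if ts > peak_ts and depth <= baseline_depth), None)
    drain_seconds = (drain_ts - peak_ts) if drain_ts is not None else None
    return {"peak_depth": peak_depth, "drain_seconds": drain_seconds}

samples = [(0, 5), (10, 4200), (20, 2600), (40, 900), (70, 8)]  # illustrative only
print(analyze_burst(samples))  # {'peak_depth': 4200, 'drain_seconds': 60}
```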
### Scenario F — Soak (long-running)
**Goal**: detect drift (leaks, fragmentation, GC patterns)
**Load**: 70-85% of knee, 60-180 minutes
**Pass**:
* p95 does not trend upward beyond threshold
* memory remains bounded
* no rising error rate
### Scenario G — Scaling curve (worker replica sweep)
**Goal**: turn results into scaling rules
**Method**:
* Repeat Scenario C with worker replicas = 1, 2, 4, 8…
**Deliverable**:
* plot of knee load vs worker count
* p95 `t_service` vs worker count (should remain similar; queue delay should drop)
---
## 7) Execution protocol (runbook)
Agents MUST run every scenario using the same disciplined flow:
### Pre-run checklist
* confirm system versions/SHAs
* confirm autoscaling mode:
* **Off** for baseline capacity characterization
* **On** for validating autoscaling policies
* clear queues and consumer group pending entries
* restart or at least record "time since deploy" for services (cold vs warm)
### During run
* ensure load is truly open-loop when required (arrival-rate based)
* continuously record:
* offered vs achieved rate
* queue depth
* CPU/mem for gateway/worker/Valkey
### Post-run
* stop load
* wait until backlog drains (or record that it doesn't)
* export:
* k6/runner raw output
* Prometheus time series snapshot
* sampled logs with corr_id fields
* generate a summary report automatically (no hand calculations)
---
## 8) Analysis rules (how agents compute "the envelope")
Agents MUST generate at minimum two plots per run:
1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis)
* overlay p99 (and SLO line)
2. **Queue behavior**: offered load vs queue depth (or lag), plus drain time
### How to identify the "knee"
Agents SHOULD mark the knee as the first plateau where:
* queue depth grows monotonically within the plateau, **or**
* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50-100%)
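These rules can be codified directly; a sketch over per-plateau summaries (field names are illustrative, not a required schema):
```python
# Knee detection sketch over per-plateau summaries.
def find_knee(plateaus, rel_increase: float = 0.5):
    """Return the offered rate of the first plateau that meets either knee rule, else None."""
    for prev, cur in zip(plateaus, plateaus[1:]):
        growing = cur["depth_grew_monotonically"]
        jump = (cur["p95_queue_delay_ms"] - prev["p95_queue_delay_ms"]) / max(prev["p95_queue_delay_ms"], 1e-9)
        if growing or jump > rel_increase:
            return cur["rps"]
    return None

plateaus = [
    {"rps": 100, "p95_queue_delay_ms": 8.0,  "depth_grew_monotonically": False},
    {"rps": 200, "p95_queue_delay_ms": 9.5,  "depth_grew_monotonically": False},
    {"rps": 400, "p95_queue_delay_ms": 31.0, "depth_grew_monotonically": True},
]
print(find_knee(plateaus))  # 400
```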
### Convert results into scaling guidance
Agents SHOULD compute:
* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker)
* recommended replicas for offered load λ at target utilization U:
* `workers_needed = ceil(λ * mean(t_service) / U)`
* choose U ≈ 0.6-0.75 for headroom
This should be reported alongside the measured envelope.
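A worked sketch of that sizing rule, with illustrative numbers:
```python
# Worker sizing sketch: workers_needed = ceil(lambda * mean(t_service) / U).
import math

mean_t_service_s = 0.040  # 40 ms mean service time (illustrative)
offered_lambda = 300      # jobs/sec the system must absorb (illustrative)
target_util = 0.7         # headroom; 0.6-0.75 is a reasonable band

capacity_per_worker = 1 / mean_t_service_s                               # ~25 jobs/sec per worker
workers_needed = math.ceil(offered_lambda * mean_t_service_s / target_util)

print(capacity_per_worker, workers_needed)  # 25.0 18  (300 * 0.04 / 0.7 = 17.14 -> 18)
```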
---
## 9) Pass/fail criteria and regression gates
Agents MUST define gates in configuration, not in someone's head.
Suggested gating structure:
* **Smoke gate**: error rate < 0.1%, backlog drains
* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history)
* **Capacity gate**: knee load regression < 10% (optional but very valuable)
* **Soak gate**: p95 drift over time < 15% and no memory runaway
---
## 10) Common pitfalls (agents must avoid)
1. **Closed-loop tests used for capacity**
Closed-loop ("N concurrent users") self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity.
2. **Ignoring queue depth**
A system can look "healthy" in request latency while silently building backlog.
3. **Measuring only gateway latency**
You must measure enqueue → claim → done to see the real hop.
4. **Load generator bottleneck**
If the generator saturates, you'll under-estimate capacity.
5. **Retries enabled by default**
Retries can inflate load and hide root causes; run with retries off first.
6. **Not controlling warm vs cold**
Cold caches vs warmed services produce different envelopes; record the condition.
---
# Agent implementation checklist (deliverables)
Assign these as concrete tasks to your agents.
## Agent 1 — Observability & tracing
MUST deliver:
* correlation id propagation gateway → Valkey → worker
* timestamps `enq/claim/done`
* Prometheus histograms for enqueue, service, hop
* queue depth metric (`XLEN` / `XINFO` lag)
## Agent 2 — Load test harness
MUST deliver:
* test runner scripts (k6 or equivalent) for scenarios A-G
* test config file (YAML/JSON) controlling:
* stages (rates/durations)
* payload mix
* headers (corr-id)
* reproducible seeds and version stamping
## Agent 3 — Result collector and analyzer
MUST deliver:
* a pipeline that merges:
* load generator output
* hop timing data (from logs or a completion stream)
* Prometheus snapshots
* automatic summary + plots:
* latency envelope
* queue depth/drain
* CSV/JSON exports for long-term tracking
## Agent 4 — Reporting and dashboards
MUST deliver:
* a standard report template that includes:
* environment details
* scenario details
* key charts
* knee estimate
* scaling recommendation
* Grafana dashboard with the required panels
## Agent 5 — CI / release integration
SHOULD deliver:
* PR-level smoke test (Scenario A)
* nightly baseline (Scenario B)
* weekly capacity sweep (Scenario C + scaling curve)
---
## Template: scenario spec (agents can copy/paste)
```yaml
test_run:
system_under_test:
gateway_sha: "<git sha>"
worker_sha: "<git sha>"
valkey_version: "<version>"
environment:
cluster: "<name>"
workers: 4
autoscaling: "off" # off|on
workload:
endpoint: "/hop"
payload_profile: "p50"
mix:
p50: 0.7
p95: 0.25
max: 0.05
scenario:
name: "capacity_ramp"
mode: "open_loop"
warmup_seconds: 60
stages:
- rps: 50
duration_seconds: 120
- rps: 100
duration_seconds: 120
- rps: 200
duration_seconds: 120
- rps: 400
duration_seconds: 120
gates:
max_error_rate: 0.01
slo_ms_p95_hop: 500
backlog_must_drain_seconds: 300
outputs:
artifacts_dir: "./artifacts/<timestamp>/"
```
---
## Sample folder layout
```
perf/
docker-compose.yml
prometheus/
prometheus.yml
k6/
lib.js
smoke.js
capacity_ramp.js
burst.js
soak.js
stress.js
scaling_curve.sh
tools/
analyze.py
src/
Perf.Gateway/
Perf.Worker/
```
---
**Document Version**: 1.0
**Archived From**: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
**Archive Reason**: Wrong content was pasted; this performance testing content preserved for future use.