# Performance Testing Pipeline for Queue-Based Workflows

> **Note**: This document was originally created as part of advisory analysis. It provides a comprehensive playbook for HTTP → Valkey → Worker performance testing.

---
## What we're measuring (plain English)

* **TTFB/TTFS (HTTP):** time the gateway spends accepting the request + queuing the job.
* **Valkey latency:** enqueue (`LPUSH`/`XADD`), pop/claim (`BRPOP`/`XREADGROUP`), and round-trip.
* **Worker service time:** time to pick up, process, and ack.
* **Queueing delay:** time spent waiting in the queue (arrival → start of worker).

Together, these segments make up the "hop latency" users feel when the system is under load.

---
## Minimal tracing you can add today

Emit these IDs/headers end-to-end:

* `x-stella-corr-id` (uuid)
* `x-stella-enq-ts` (gateway enqueue ts, ns)
* `x-stella-claim-ts` (worker claim ts, ns)
* `x-stella-done-ts` (worker done ts, ns)

From these, compute:

* `queue_delay = claim_ts - enq_ts`
* `service_time = done_ts - claim_ts`
* `http_ttfs = gateway_first_byte_ts - http_request_start_ts`
* `hop_latency = done_ts - enq_ts` (or return-path if synchronous)

Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock.

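If the gateway runs on ASP.NET Core, a minimal sketch of stamping these values could look like the block below. The middleware, the `/hop` route, and the header handling are illustrative assumptions for this playbook, not a description of the actual gateway.

```csharp
// Illustrative ASP.NET Core middleware: assign a correlation id and an enqueue
// timestamp (ns) per request, and echo the correlation id back to the caller.
// Route and header names mirror the list above; everything else is an assumption.
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var app = WebApplication.CreateBuilder(args).Build();

app.Use(async (ctx, next) =>
{
    var corr = ctx.Request.Headers.TryGetValue("x-stella-corr-id", out var existing)
        ? existing.ToString()
        : Guid.NewGuid().ToString();

    // Wall-clock ns since the Unix epoch; acceptable for cross-host timestamps
    // when NTP drift is monitored (see the clock-sync tip above).
    var enqTsNs = (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100;

    ctx.Items["corr-id"] = corr;
    ctx.Items["enq-ts"] = enqTsNs;
    ctx.Response.Headers["x-stella-corr-id"] = corr;

    await next();
});

// The handler would enqueue to Valkey here and return 202 with the ids.
app.MapPost("/hop", (HttpContext ctx) =>
    Results.Accepted(value: new { corr = ctx.Items["corr-id"], enq = ctx.Items["enq-ts"] }));

app.Run();
```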
---

## Valkey commands (safe, BSD Valkey)

Use **Valkey Streams + Consumer Groups** for fairness and metrics:

* Enqueue: `XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`

Add a small Lua script for atomic timestamping at enqueue:
```lua
-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
                  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
```
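From a .NET gateway, the same enqueue (without the Lua helper) might look like this sketch, assuming the StackExchange.Redis client; the stream and field names mirror the commands above, and the connection string is a placeholder. Unlike the Lua version, the timestamp here is taken client-side just before `XADD`.

```csharp
// Illustrative enqueue via StackExchange.Redis: XADD jobs * corr <uuid> enq <ns> p <payload>
// Connection string, stream name, and field names are assumptions for this sketch.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class JobQueue
{
    private static readonly ConnectionMultiplexer Mux =
        ConnectionMultiplexer.Connect("valkey:6379");

    public static async Task<RedisValue> EnqueueAsync(string corrId, byte[] payload)
    {
        var db = Mux.GetDatabase();
        var enqTsNs = (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100; // wall-clock ns

        return await db.StreamAddAsync("jobs", new[]
        {
            new NameValueEntry("corr", corrId),
            new NameValueEntry("enq", enqTsNs),
            new NameValueEntry("p", payload),
        });
    }
}
```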
---

## Load shapes to test (find the envelope)

1. **Open-loop (arrival-rate controlled):** 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
2. **Burst:** 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
3. **Step-up/down:** double every 2 min until SLO breach; then halve down.
4. **Long tail soak:** run at 70–80% of max for 1h; watch p95-p99.9 drift.

Target outputs per step: **p50/p90/p95/p99** for `queue_delay`, `service_time`, `hop_latency`, plus **throughput** and **error rate**.

---
## k6 script (HTTP client pressure)

```javascript
// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
// uuidv4 from the k6 jslib avoids relying on a global crypto object
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export let options = {
  scenarios: {
    step_load: {
      executor: 'ramping-arrival-rate',
      startRate: 20, timeUnit: '1s',
      preAllocatedVUs: 200, maxVUs: 5000,
      stages: [
        { target: 50, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 200, duration: '1m' },
        { target: 400, duration: '1m' },
        { target: 800, duration: '1m' },
      ],
    },
  },
  thresholds: {
    'http_req_failed': ['rate<0.01'],
    'http_req_duration{phase:hop}': ['p(95)<500'],
  },
};

export default function () {
  const corr = uuidv4();
  const res = http.post(
    __ENV.GW_URL,
    JSON.stringify({ data: 'ping', corr }),
    {
      headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
      tags: { phase: 'hop' },
    }
  );
  check(res, { 'status 2xx/202': r => r.status === 200 || r.status === 202 });
  sleep(0.01);
}
```
Run: `GW_URL=https://gateway.example/hop k6 run hop-test.js`

---

## Worker hooks (.NET 10 sketch)

```csharp
// At claim
var now = Stopwatch.GetTimestamp(); // monotonic
var claimNs = now.ToNanoseconds();
log.AddTag("x-stella-claim-ts", claimNs);

// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics
```
Helper:

```csharp
using System.Diagnostics;

public static class MonoTime {
    static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
    public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}
```
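Putting the hooks together, a worker claim loop might look like the sketch below. It assumes StackExchange.Redis (whose `XREADGROUP` wrapper does not support `BLOCK`, hence the polling delay), an existing consumer group (`XGROUP CREATE jobs workers $ MKSTREAM`), and a hypothetical `processAsync` callback.

```csharp
// Illustrative worker loop: claim one entry, record claim/done timestamps, process, ack.
// Stream/group/consumer names and the processing callback are assumptions for this sketch.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed class Worker
{
    private readonly IDatabase _db;
    public Worker(IDatabase db) => _db = db;

    public async Task RunAsync(Func<StreamEntry, Task> processAsync)
    {
        while (true)
        {
            // XREADGROUP GROUP workers w1 COUNT 1 STREAMS jobs >
            var entries = await _db.StreamReadGroupAsync(
                "jobs", "workers", "w1", StreamPosition.NewMessages, count: 1);

            if (entries.Length == 0) { await Task.Delay(50); continue; }

            var entry = entries[0];
            var claimNs = Stopwatch.GetTimestamp().ToNanoseconds(); // MonoTime helper above

            await processAsync(entry);

            var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
            // Emit claimNs/doneNs plus the corr id from entry.Values to logs/metrics here.

            await _db.StreamAcknowledgeAsync("jobs", "workers", entry.Id);
        }
    }
}
```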
---

## Prometheus metrics to expose

* `valkey_enqueue_ns` (histogram)
* `valkey_claim_block_ms` (gauge)
* `worker_service_ns` (histogram, labels: worker_type, route)
* `queue_depth` (gauge via `XLEN` or `XINFO STREAM`)
* `enqueue_rate`, `dequeue_rate` (counters)

Example recording rules:
```yaml
- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(valkey_enqueue_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
  expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95
  expr: hop:queue_delay_p95 + hop:service_time_p95
```
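On the worker side, one way to expose these histograms is the prometheus-net package; the metric and label names below simply mirror the list above, and the bucket boundaries are guesses you would tune.

```csharp
// Illustrative metric registration with prometheus-net.
// Names/labels mirror the metric list above; bucket boundaries are placeholders.
using Prometheus;

public static class PerfMetrics
{
    public static readonly Histogram WorkerServiceNs = Metrics.CreateHistogram(
        "worker_service_ns", "Worker service time in nanoseconds",
        new HistogramConfiguration
        {
            LabelNames = new[] { "worker_type", "route" },
            Buckets = Histogram.ExponentialBuckets(start: 1e6, factor: 2, count: 16), // ~1 ms .. ~33 s
        });

    public static readonly Histogram ValkeyEnqueueNs = Metrics.CreateHistogram(
        "valkey_enqueue_ns", "Valkey enqueue round-trip in nanoseconds");
}

// usage: PerfMetrics.WorkerServiceNs.WithLabels("hop", "/hop").Observe(doneNs - claimNs);
```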
---

## Autoscaling signals (HPA/KEDA friendly)

* **Primary:** queue depth & its derivative (d/dt).
* **Secondary:** p95 `queue_delay` and worker CPU.
* **Safety:** max in-flight per worker; backpressure HTTP 429 when `queue_depth > D` or `p95_queue_delay > SLO*0.8` (see the sketch after this list).

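A hedged sketch of that 429 backpressure check on the gateway, using `XLEN` through StackExchange.Redis; the depth threshold `D` is a placeholder you would derive from your SLO and drain-rate measurements.

```csharp
// Illustrative backpressure guard: shed load with 429 once the stream backlog
// exceeds a threshold. Threshold and stream name are placeholders for this sketch.
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using StackExchange.Redis;

public static class Backpressure
{
    public const long MaxQueueDepth = 10_000; // "D" from the bullet above; tune per SLO

    public static async Task<IResult?> CheckAsync(IDatabase db)
    {
        var depth = await db.StreamLengthAsync("jobs"); // XLEN jobs
        return depth > MaxQueueDepth
            ? Results.StatusCode(StatusCodes.Status429TooManyRequests)
            : null; // null = accept the request and enqueue as usual
    }
}
```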
---

## Plot the "envelope" (what you'll look at)

* X-axis: **offered load** (req/s).
* Y-axis: **p95 hop latency** (ms).
* Overlay: p99 (dashed), **SLO line** (e.g., 500 ms), and **capacity knee** (where p95 sharply rises).
* Add secondary panel: **queue depth** vs load.

---
# Performance Test Guidelines

## HTTP → Valkey → Worker pipeline

## 1) Objectives and scope

### Primary objectives

Your performance tests MUST answer these questions with evidence:

1. **Capacity knee**: At what offered load does **queue delay** start growing sharply?
2. **User-impact envelope**: What are p50/p95/p99 **hop latency** curves vs offered load?
3. **Decomposition**: How much of hop latency is:
   * gateway enqueue time
   * Valkey enqueue/claim RTT
   * queue wait time
   * worker service time
4. **Scaling behavior**: How do these change with worker replica counts (N workers)?
5. **Stability**: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?

### Non-goals (explicitly out of scope unless you add them later)

* Micro-optimizing single function runtime
* Synthetic "max QPS" records without a representative payload
* End-to-end-only tests that skip per-segment metrics (acceptable only for basic smoke checks)

---
## 2) Definitions and required metrics

### Required latency definitions (standardize these names)

Agents MUST compute and report these per request/job:

* **`t_http_accept`**: time from client send → gateway accepts request
* **`t_enqueue`**: time spent in gateway to enqueue into Valkey (server-side)
* **`t_valkey_rtt_enq`**: client-observed RTT for enqueue command(s)
* **`t_queue_delay`**: `claim_ts - enq_ts`
* **`t_service`**: `done_ts - claim_ts`
* **`t_hop`**: `done_ts - enq_ts` (this is the "true pipeline hop" latency)
* Optional but recommended:
  * **`t_ack`**: time to ack completion (Valkey ack RTT)
  * **`t_http_response`**: request start → gateway response sent (TTFB/TTFS)

### Required percentiles and aggregations

Per scenario step (e.g., each offered load plateau), agents MUST output:

* p50 / p90 / p95 / p99 / p99.9 for: `t_hop`, `t_queue_delay`, `t_service`, `t_enqueue`
* Throughput: offered rps and achieved rps
* Error rate: HTTP failures, enqueue failures, worker failures
* Queue depth and backlog drain time

### Required system-level telemetry (minimum)

Agents MUST collect these time series during tests:

* **Worker**: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
* **Valkey**: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
* **Gateway**: CPU/mem, request rate, response codes, request duration histogram

---
## 3) Environment and test hygiene requirements

### Environment requirements

Agents SHOULD run tests in an environment that matches production in:

* container CPU/memory limits
* number of nodes, network topology
* Valkey topology (single, cluster, sentinel, etc.)
* worker replica autoscaling rules (or deliberately disabled)

If exact parity isn't possible, agents MUST record all known differences in the report.

### Test hygiene (non-negotiable)

Agents MUST:

1. **Start from empty queues** (no backlog).
2. **Disable client retries** (or explicitly run two variants: retries off / retries on).
3. **Warm up** before measuring (e.g., 60s warm-up minimum).
4. **Hold steady plateaus** long enough to stabilize (usually 2–5 minutes per step).
5. **Cool down** and verify backlog drains (queue depth returns to baseline).
6. Record exact versions/SHAs of gateway/worker and Valkey config.

### Load generator hygiene

Agents MUST ensure the load generator is not the bottleneck:

* CPU < ~70% during test
* no local socket exhaustion
* enough VUs/connections
* if needed, distributed load generation

---
## 4) Instrumentation spec (agents implement this first)

### Correlation and timestamps

Agents MUST propagate an end-to-end correlation ID and timestamps.

**Required fields**

* `corr_id` (UUID)
* `enq_ts_ns` (set at enqueue, monotonic or consistent clock)
* `claim_ts_ns` (set by worker when job is claimed)
* `done_ts_ns` (set by worker when job processing ends)

**Where these live**

* HTTP request header: `x-corr-id: <uuid>`
* Valkey job payload fields: `corr`, `enq`, and optionally payload size/type
* Worker logs/metrics: include `corr_id`, job id, `claim_ts_ns`, `done_ts_ns`

### Clock requirements

Agents MUST use a consistent timing source:

* Prefer monotonic timers for durations (Stopwatch / monotonic clock)
* If timestamps cross machines, ensure they're comparable:
  * either rely on synchronized clocks (NTP) **and** monitor drift
  * or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)

**Practical recommendation**: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.

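A minimal sketch of that recommendation: capture a wall-clock ns value for the cross-host fields (`enq/claim/done`) and a monotonic reading for per-host duration sanity checks.

```csharp
// Illustrative timestamp capture: wall-clock ns for cross-host fields plus a
// monotonic reading for per-host duration checks. Type and names are assumptions.
using System;
using System.Diagnostics;

public readonly record struct TimePoint(long WallNs, long MonoTicks)
{
    public static TimePoint Now() => new(
        WallNs: (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100, // 1 tick = 100 ns
        MonoTicks: Stopwatch.GetTimestamp());

    public static double MonoElapsedNs(TimePoint start, TimePoint end) =>
        (end.MonoTicks - start.MonoTicks) * (1_000_000_000d / Stopwatch.Frequency);
}
```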
### Valkey queue semantics (recommended)

Agents SHOULD use **Streams + Consumer Groups** for stable claim semantics and good observability:

* Enqueue: `XADD jobs * corr <uuid> enq <ns> payload <...>`
* Claim: `XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >`
* Ack: `XACK jobs workers <id>`

Agents MUST record stream length (`XLEN`) or consumer group lag (`XINFO GROUPS`) as queue depth/lag.

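One way to satisfy that requirement is a small background sampler that copies `XLEN` into a gauge. The sketch below assumes StackExchange.Redis and prometheus-net; the stream name, metric name, and 1-second interval are placeholders.

```csharp
// Illustrative queue-depth sampler: copy XLEN into a Prometheus gauge once per second.
// Stream name, metric name, and interval are assumptions for this sketch.
using System;
using System.Threading;
using System.Threading.Tasks;
using Prometheus;
using StackExchange.Redis;

public static class QueueDepthSampler
{
    private static readonly Gauge Depth =
        Metrics.CreateGauge("queue_depth", "Length of the jobs stream (XLEN)");

    public static async Task RunAsync(IDatabase db, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            Depth.Set(await db.StreamLengthAsync("jobs"));
            await Task.Delay(TimeSpan.FromSeconds(1), ct);
        }
    }
}
```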
### Metrics exposure

Agents MUST publish Prometheus (or equivalent) histograms:

* `gateway_enqueue_seconds` (or ns) histogram
* `valkey_enqueue_rtt_seconds` histogram
* `worker_service_seconds` histogram
* `queue_delay_seconds` histogram (derived from timestamps; can be computed in worker or offline)
* `hop_latency_seconds` histogram

---
## 5) Workload modeling and test data

Agents MUST define a workload model before running capacity tests:

1. **Endpoint(s)**: list exact gateway routes under test
2. **Payload types**: small/typical/large
3. **Mix**: e.g., 70/25/5 by payload size
4. **Idempotency rules**: ensure repeated jobs don't corrupt state
5. **Data reset strategy**: how test data is cleaned or isolated per run

Agents SHOULD test at least:

* Typical payload (p50)
* Large payload (p95)
* Worst-case allowed payload (bounded by your API limits)

---
## 6) Scenario suite your agents MUST implement

Each scenario MUST be defined as code/config (not manual).

### Scenario A — Smoke (fast sanity)

**Goal**: verify instrumentation + basic correctness
**Load**: low (e.g., 1–5 rps), 2 minutes
**Pass**:

* 0 backlog after run
* error rate < 0.1%
* metrics present for all segments

### Scenario B — Baseline (repeatable reference point)

**Goal**: establish a stable baseline for regression tracking
**Load**: fixed moderate load (e.g., 30–50% of expected capacity), 10 minutes
**Pass**:

* p95 `t_hop` within baseline ± tolerance (set after first runs)
* no upward drift in p95 across time (trend line ~flat)

### Scenario C — Capacity ramp (open-loop)

**Goal**: find the knee where queueing begins
**Method**: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):

* 50 rps for 2m
* 100 rps for 2m
* 200 rps for 2m
* 400 rps for 2m
* … until SLO breach or errors spike

**MUST**:

* warm-up stage before first plateau
* record per-plateau summary

**Stop conditions** (any triggers stop):

* error rate > 1%
* queue depth grows without bound over an entire plateau
* p95 `t_hop` exceeds SLO for 2 consecutive plateaus

### Scenario D — Stress (push past capacity)

**Goal**: characterize failure mode and recovery
**Load**: 120–200% of knee load, 5–10 minutes
**Pass** (for resilience):

* system does not crash permanently
* once load stops, backlog drains within target time (define it)

### Scenario E — Burst / spike

**Goal**: see how quickly queue grows and drains
**Load shape**:

* baseline low load
* sudden burst (e.g., 10× for 10–30s)
* return to baseline

**Report**:

* peak queue depth
* time to drain to baseline
* p99 `t_hop` during burst

### Scenario F — Soak (long-running)

**Goal**: detect drift (leaks, fragmentation, GC patterns)
**Load**: 70–85% of knee, 60–180 minutes
**Pass**:

* p95 does not trend upward beyond threshold
* memory remains bounded
* no rising error rate

### Scenario G — Scaling curve (worker replica sweep)

**Goal**: turn results into scaling rules
**Method**:

* Repeat Scenario C with worker replicas = 1, 2, 4, 8…

**Deliverable**:

* plot of knee load vs worker count
* p95 `t_service` vs worker count (should remain similar; queue delay should drop)

---
## 7) Execution protocol (runbook)

Agents MUST run every scenario using the same disciplined flow:

### Pre-run checklist

* confirm system versions/SHAs
* confirm autoscaling mode:
  * **Off** for baseline capacity characterization
  * **On** for validating autoscaling policies
* clear queues and consumer group pending entries
* restart or at least record "time since deploy" for services (cold vs warm)

### During run

* ensure load is truly open-loop when required (arrival-rate based)
* continuously record:
  * offered vs achieved rate
  * queue depth
  * CPU/mem for gateway/worker/Valkey

### Post-run

* stop load
* wait until backlog drains (or record that it doesn't)
* export:
  * k6/runner raw output
  * Prometheus time series snapshot
  * sampled logs with corr_id fields
* generate a summary report automatically (no hand calculations)

---
## 8) Analysis rules (how agents compute "the envelope")

Agents MUST generate at minimum two plots per run:

1. **Latency envelope**: offered load (x-axis) vs p95 `t_hop` (y-axis)
   * overlay p99 (and SLO line)
2. **Queue behavior**: offered load vs queue depth (or lag), plus drain time

### How to identify the "knee"

Agents SHOULD mark the knee as the first plateau where (a detection sketch follows this list):

* queue depth grows monotonically within the plateau, **or**
* p95 `t_queue_delay` increases by > X% step-to-step (e.g., 50–100%)

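A minimal sketch of the second rule, assuming per-plateau summaries are already available as (offered load, p95 queue delay) pairs; the 50% step-to-step threshold is illustrative.

```csharp
// Illustrative knee detection: flag the first plateau whose p95 queue delay grows
// by more than `threshold` (0.5 = +50%) relative to the previous plateau.
using System.Collections.Generic;

public record PlateauSummary(double OfferedRps, double P95QueueDelayMs);

public static class KneeFinder
{
    public static PlateauSummary? FindKnee(IReadOnlyList<PlateauSummary> plateaus, double threshold = 0.5)
    {
        for (var i = 1; i < plateaus.Count; i++)
        {
            var prev = plateaus[i - 1].P95QueueDelayMs;
            var curr = plateaus[i].P95QueueDelayMs;
            if (prev > 0 && (curr - prev) / prev > threshold)
                return plateaus[i]; // first plateau where queueing takes off
        }
        return null; // no knee observed in the tested range
    }
}
```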
### Convert results into scaling guidance

Agents SHOULD compute:

* `capacity_per_worker ≈ 1 / mean(t_service)` (jobs/sec per worker)
* recommended replicas for offered load λ at target utilization U (worked example below):
  * `workers_needed = ceil(λ * mean(t_service) / U)`
  * choose U ~ 0.6–0.75 for headroom

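For example, at λ = 300 jobs/s with mean `t_service` = 20 ms and U = 0.7, `workers_needed = ceil(300 × 0.020 / 0.7) = ceil(8.6) = 9`; the numbers here are purely illustrative.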
This should be reported alongside the measured envelope.

---

## 9) Pass/fail criteria and regression gates

Agents MUST define gates in configuration, not in someone's head.

Suggested gating structure:

* **Smoke gate**: error rate < 0.1%, backlog drains
* **Baseline gate**: p95 `t_hop` regression < 10% (tune after you have history)
* **Capacity gate**: knee load regression < 10% (optional but very valuable)
* **Soak gate**: p95 drift over time < 15% and no memory runaway

---
## 10) Common pitfalls (agents must avoid)

1. **Closed-loop tests used for capacity**
   Closed-loop ("N concurrent users") self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity.

2. **Ignoring queue depth**
   A system can look "healthy" in request latency while silently building backlog.

3. **Measuring only gateway latency**
   You must measure enqueue → claim → done to see the real hop.

4. **Load generator bottleneck**
   If the generator saturates, you'll underestimate capacity.

5. **Retries enabled by default**
   Retries can inflate load and hide root causes; run with retries off first.

6. **Not controlling warm vs cold**
   Cold caches vs warmed services produce different envelopes; record the condition.

---
# Agent implementation checklist (deliverables)

Assign these as concrete tasks to your agents.

## Agent 1 — Observability & tracing

MUST deliver:

* correlation id propagation gateway → Valkey → worker
* timestamps `enq/claim/done`
* Prometheus histograms for enqueue, service, hop
* queue depth metric (`XLEN` / `XINFO` lag)

## Agent 2 — Load test harness

MUST deliver:

* test runner scripts (k6 or equivalent) for scenarios A–G
* test config file (YAML/JSON) controlling:
  * stages (rates/durations)
  * payload mix
  * headers (corr-id)
* reproducible seeds and version stamping

## Agent 3 — Result collector and analyzer

MUST deliver:

* a pipeline that merges:
  * load generator output
  * hop timing data (from logs or a completion stream)
  * Prometheus snapshots
* automatic summary + plots:
  * latency envelope
  * queue depth/drain
* CSV/JSON exports for long-term tracking

## Agent 4 — Reporting and dashboards

MUST deliver:

* a standard report template that includes:
  * environment details
  * scenario details
  * key charts
  * knee estimate
  * scaling recommendation
* Grafana dashboard with the required panels

## Agent 5 — CI / release integration

SHOULD deliver:

* PR-level smoke test (Scenario A)
* nightly baseline (Scenario B)
* weekly capacity sweep (Scenario C + scaling curve)

---
## Template: scenario spec (agents can copy/paste)

```yaml
test_run:
  system_under_test:
    gateway_sha: "<git sha>"
    worker_sha: "<git sha>"
    valkey_version: "<version>"
  environment:
    cluster: "<name>"
    workers: 4
    autoscaling: "off"   # off|on
  workload:
    endpoint: "/hop"
    payload_profile: "p50"
    mix:
      p50: 0.7
      p95: 0.25
      max: 0.05
  scenario:
    name: "capacity_ramp"
    mode: "open_loop"
    warmup_seconds: 60
    stages:
      - rps: 50
        duration_seconds: 120
      - rps: 100
        duration_seconds: 120
      - rps: 200
        duration_seconds: 120
      - rps: 400
        duration_seconds: 120
  gates:
    max_error_rate: 0.01
    slo_ms_p95_hop: 500
    backlog_must_drain_seconds: 300
  outputs:
    artifacts_dir: "./artifacts/<timestamp>/"
```

---
## Sample folder layout

```
perf/
  docker-compose.yml
  prometheus/
    prometheus.yml
  k6/
    lib.js
    smoke.js
    capacity_ramp.js
    burst.js
    soak.js
    stress.js
    scaling_curve.sh
  tools/
    analyze.py
  src/
    Perf.Gateway/
    Perf.Worker/
```

---
**Document Version**: 1.0
**Archived From**: docs/product-advisories/unprocessed/16-Dec-2025 - Reimagining Proof-Linked UX in Security Workflows.md
**Archive Reason**: Wrong content was pasted; this performance-testing content was preserved for future use.