Modeling StellaRouter Performance Curves

Here's a compact, ready-to-use playbook to measure and plot performance envelopes for an HTTP → Valkey → Worker hop under variable concurrency, so you can tune autoscaling and predict user-visible latency spikes.


What we're measuring (plain English)

  • TTFB/TTFS (HTTP): time the gateway spends accepting the request + queuing the job.
  • Valkey latency: enqueue (LPUSH/XADD), pop/claim (BRPOP/XREADGROUP), and round-trip.
  • Worker service time: time to pick up, process, and ack.
  • Queueing delay: time spent waiting in the queue (arrival → start of worker).

These four add up to the “hop latency” users feel when the system is under load.


Minimal tracing you can add today

Emit these IDs/headers end-to-end:

  • x-stella-corr-id (uuid)
  • x-stella-enq-ts (gateway enqueue ts, ns)
  • x-stella-claim-ts (worker claim ts, ns)
  • x-stella-done-ts (worker done ts, ns)

From these, compute:

  • queue_delay = claim_ts - enq_ts
  • service_time = done_ts - claim_ts
  • http_ttfs = gateway_first_byte_ts - http_request_start_ts
  • hop_latency = done_ts - enq_ts (or the return-path completion time if synchronous)

Clock-sync tip: use monotonic clocks in code and convert to ns; don't mix wall-clock and monotonic timestamps.


Valkey commands (safe, BSD Valkey)

Use Valkey Streams + Consumer Groups for fairness and metrics:

  • Enqueue: XADD jobs * corr-id <uuid> enq-ts <ns> payload <...>
  • Claim: XREADGROUP GROUP workers w1 COUNT 1 BLOCK 1000 STREAMS jobs >
  • Ack: XACK jobs workers <id>

Add a small Lua script to make enqueue timestamping atomic:

-- KEYS[1]=stream
-- ARGV[1]=enq_ts_ns, ARGV[2]=corr_id, ARGV[3]=payload
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])
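
If the gateway is .NET, the same script can be invoked through StackExchange.Redis (MIT-licensed and Valkey-compatible). A minimal sketch, assuming a local Valkey at localhost:6379 and a stream named jobs; the values are illustrative:

using StackExchange.Redis;

// Sketch: invoke the enqueue Lua above from C# (evaluated server-side).
const string EnqueueLua = @"
return redis.call('XADD', KEYS[1], '*',
  'corr', ARGV[2], 'enq', ARGV[1], 'p', ARGV[3])";

var mux = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
var db = mux.GetDatabase();

long enqTsNs = (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100L; // 1 tick = 100 ns
var entryId = await db.ScriptEvaluateAsync(
    EnqueueLua,
    keys: new RedisKey[] { "jobs" },
    values: new RedisValue[] { enqTsNs, Guid.NewGuid().ToString(), "{\"data\":\"ping\"}" });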

Load shapes to test (find the envelope)

  1. Open-loop (arrival-rate controlled): 50 → 10k req/min in steps; constant rate per step. Reveals queueing onset.
  2. Burst: 0 → N in short spikes (e.g., 5k in 10s) to see saturation and drain time.
  3. Step-up/down: double every 2 min until SLO breach; then halve back down.
  4. Long-tail soak: run at 70–80% of max for 1h; watch p95–p99.9 drift.

Target outputs per step: p50/p90/p95/p99 for queue_delay, service_time, hop_latency, plus throughput and error rate.


k6 script (HTTP client pressure)

// save as hop-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  scenarios: {
    step_load: {
      executor: 'ramping-arrival-rate',
      startRate: 20, timeUnit: '1s',
      preAllocatedVUs: 200, maxVUs: 5000,
      stages: [
        { target: 50, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 200, duration: '1m' },
        { target: 400, duration: '1m' },
        { target: 800, duration: '1m' },
      ],
    },
  },
  thresholds: {
    'http_req_failed': ['rate<0.01'],
    'http_req_duration{phase:hop}': ['p(95)<500'],
  },
};

export default function () {
  // crypto.randomUUID() needs a recent k6 release; older versions can use
  // uuidv4 from https://jslib.k6.io/k6-utils/ instead.
  const corr = crypto.randomUUID();
  const res = http.post(
    __ENV.GW_URL,
    JSON.stringify({ data: 'ping', corr }),
    {
      headers: { 'Content-Type': 'application/json', 'x-stella-corr-id': corr },
      tags: { phase: 'hop' },
    }
  );
  check(res, { 'status 2xx/202': r => r.status === 200 || r.status === 202 });
  sleep(0.01);
}

Run: GW_URL=https://gateway.example/hop k6 run hop-test.js


Worker hooks (.NET 10 sketch)

// At claim
var now = Stopwatch.GetTimestamp(); // monotonic ticks (host-local)
var claimNs = now.ToNanoseconds();
log.AddTag("x-stella-claim-ts", claimNs); // log.AddTag is a placeholder for your logging API

// After processing
var doneNs = Stopwatch.GetTimestamp().ToNanoseconds();
log.AddTag("x-stella-done-ts", doneNs);
// Include corr-id and stream entry id in logs/metrics.
// Note: Stopwatch-based ns are only comparable within the same host/process.

Helper:

public static class MonoTime {
  static readonly double _nsPerTick = 1_000_000_000d / Stopwatch.Frequency;
  public static long ToNanoseconds(this long ticks) => (long)(ticks * _nsPerTick);
}

Prometheus metrics to expose

  • valkey_enqueue_ns (histogram)
  • queue_delay_ns (histogram, derived from claim_ts - enq_ts)
  • valkey_claim_block_ms (gauge)
  • worker_service_ns (histogram, labels: worker_type, route)
  • queue_depth (gauge via XLEN or XINFO STREAM)
  • enqueue_rate, dequeue_rate (counters)

Example recording rules:

- record: hop:queue_delay_p95
  expr: histogram_quantile(0.95, sum(rate(queue_delay_ns_bucket[1m])) by (le))
- record: hop:service_time_p95
  expr: histogram_quantile(0.95, sum(rate(worker_service_ns_bucket[1m])) by (le))
- record: hop:latency_budget_p95
  expr: hop:queue_delay_p95 + hop:service_time_p95

Note: summing two p95s gives a budget-style upper estimate, not the true p95 of the sum; treat hop:latency_budget_p95 as a planning aid.

Autoscaling signals (HPA/KEDA friendly)

  • Primary: queue depth & its derivative (d/dt).
  • Secondary: p95 queue_delay and worker CPU.
  • Safety: max inflight per worker; backpressure HTTP 429 when queue_depth > D or p95_queue_delay > SLO*0.8.
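
A minimal backpressure sketch for the gateway side, assuming StackExchange.Redis and a minimal-API gateway; the stream name and the depth limit D are placeholders to be tuned from your measured envelope:

// Sketch: shed load with 429 before the queue grows without bound.
app.MapPost("/hop", async (HttpContext ctx, IConnectionMultiplexer mux) =>
{
    const long maxQueueDepth = 10_000; // "D": pick from your envelope, not this number
    var db = mux.GetDatabase();
    var depth = (long)await db.ExecuteAsync("XLEN", "jobs");
    if (depth > maxQueueDepth) // the p95_queue_delay trigger would come from your metrics pipeline
    {
        ctx.Response.Headers["Retry-After"] = "1"; // hint clients to back off
        return Results.StatusCode(StatusCodes.Status429TooManyRequests);
    }
    // ... enqueue as usual, then:
    return Results.Accepted();
});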

Plot the “envelope” (what you'll look at)

  • X-axis: offered load (req/s).
  • Y-axis: p95 hop latency (ms).
  • Overlay: p99 (dashed), SLO line (e.g., 500ms), and capacity knee (where p95 sharply rises).
  • Add secondary panel: queue depth vs load.

If you want, I can generate a ready-made notebook that ingests your logs/metrics CSV and outputs these plots. Below is a set of implementation guidelines your agents can follow to build a repeatable performance test system for the HTTP → Valkey → Worker pipeline. It's written as a “spec + runbook” with clear MUST/SHOULD requirements and concrete scenario definitions.


Performance Test Guidelines

HTTP → Valkey → Worker pipeline

1) Objectives and scope

Primary objectives

Your performance tests MUST answer these questions with evidence:

  1. Capacity knee: At what offered load does queue delay start growing sharply?

  2. User-impact envelope: What are p50/p95/p99 hop latency curves vs offered load?

  3. Decomposition: How much of hop latency is:

    • gateway enqueue time
    • Valkey enqueue/claim RTT
    • queue wait time
    • worker service time
  4. Scaling behavior: How do these change with worker replica counts (N workers)?

  5. Stability: Under sustained load, do latencies drift (GC, memory, fragmentation, background jobs)?

Non-goals (explicitly out of scope unless you add them later)

  • Micro-optimizing single function runtime
  • Synthetic “max QPS” records without a representative payload
  • Tests that don't collect segment metrics (end-to-end only) for anything beyond basic smoke

2) Definitions and required metrics

Required latency definitions (standardize these names)

Agents MUST compute and report these per request/job:

  • t_http_accept: time from client send → gateway accepts request

  • t_enqueue: time spent in gateway to enqueue into Valkey (server-side)

  • t_valkey_rtt_enq: client-observed RTT for enqueue command(s)

  • t_queue_delay: claim_ts - enq_ts

  • t_service: done_ts - claim_ts

  • t_hop: done_ts - enq_ts (this is the “true pipeline hop” latency)

  • Optional but recommended:

    • t_ack: time to ack completion (Valkey ack RTT)
    • t_http_response: request start → gateway response sent (TTFB/TTFS)

Required percentiles and aggregations

Per scenario step (e.g., each offered load plateau), agents MUST output:

  • p50 / p90 / p95 / p99 / p99.9 for: t_hop, t_queue_delay, t_service, t_enqueue
  • Throughput: offered rps and achieved rps
  • Error rate: HTTP failures, enqueue failures, worker failures
  • Queue depth and backlog drain time

Required system-level telemetry (minimum)

Agents MUST collect these time series during tests:

  • Worker: CPU, memory, GC pauses (if .NET), threadpool saturation indicators
  • Valkey: ops/sec, connected clients, blocked clients, memory used, evictions, slowlog count
  • Gateway: CPU/mem, request rate, response codes, request duration histogram

3) Environment and test hygiene requirements

Environment requirements

Agents SHOULD run tests in an environment that matches production in:

  • container CPU/memory limits
  • number of nodes, network topology
  • Valkey topology (single, cluster, sentinel, etc.)
  • worker replica autoscaling rules (or deliberately disabled)

If exact parity isn't possible, agents MUST record all known differences in the report.

Test hygiene (non-negotiable)

Agents MUST:

  1. Start from empty queues (no backlog).
  2. Disable client retries (or explicitly run two variants: retries off / retries on).
  3. Warm up before measuring (e.g., 60s warm-up minimum).
  4. Hold steady plateaus long enough to stabilize (usually 2–5 minutes per step).
  5. Cool down and verify backlog drains (queue depth returns to baseline).
  6. Record exact versions/SHAs of gateway/worker and Valkey config.

Load generator hygiene

Agents MUST ensure the load generator is not the bottleneck:

  • CPU < ~70% during test
  • no local socket exhaustion
  • enough VUs/connections
  • if needed, distributed load generation

4) Instrumentation spec (agents implement this first)

Correlation and timestamps

Agents MUST propagate an end-to-end correlation ID and timestamps.

Required fields

  • corr_id (UUID)
  • enq_ts_ns (set at enqueue, monotonic or consistent clock)
  • claim_ts_ns (set by worker when job is claimed)
  • done_ts_ns (set by worker when job processing ends)

Where these live

  • HTTP request header: x-corr-id: <uuid>
  • Valkey job payload fields: corr, enq, and optionally payload size/type
  • Worker logs/metrics: include corr_id, job id, claim_ts_ns, done_ts_ns

Clock requirements

Agents MUST use a consistent timing source:

  • Prefer monotonic timers for durations (Stopwatch / monotonic clock)

  • If timestamps cross machines, ensure they're comparable:

    • either rely on synchronized clocks (NTP) and monitor drift
    • or compute durations using monotonic tick deltas within the same host and transmit durations (less ideal for queue delay)

Practical recommendation: use wall-clock ns for cross-host timestamps with NTP + drift checks, and also record per-host monotonic durations for sanity.
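
A sketch of that pattern in C# (names are illustrative; the wall-clock values go into cross-host fields, the Stopwatch delta stays host-local):

using System.Diagnostics;

// Capture both clock families at a boundary event (e.g., job claim).
long claimWallNs = (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100L; // NTP-disciplined, cross-host
long monoStart   = Stopwatch.GetTimestamp();                            // monotonic, this host only

// ... process the job ...

double serviceSec = (Stopwatch.GetTimestamp() - monoStart) / (double)Stopwatch.Frequency;
long doneWallNs   = (DateTime.UtcNow - DateTime.UnixEpoch).Ticks * 100L;
// Emit claimWallNs/doneWallNs for cross-host queue-delay math,
// and serviceSec for the local service-time histogram (drift sanity check).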

Queue semantics (Valkey Streams)

Agents SHOULD use Streams + Consumer Groups for stable claim semantics and good observability:

  • Enqueue: XADD jobs * corr <uuid> enq <ns> payload <...>
  • Claim: XREADGROUP GROUP workers <consumer> COUNT 1 BLOCK 1000 STREAMS jobs >
  • Ack: XACK jobs workers <id>

Agents MUST record stream length (XLEN) or consumer group lag (XINFO GROUPS) as queue depth/lag.
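
A minimal depth-poller sketch (StackExchange.Redis; the stream name and the setGauge callback are assumptions, not part of a specific metrics library):

// Sketch: poll queue depth once per second and hand it to your gauge.
async Task PollDepthAsync(IDatabase db, Action<long> setGauge, CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        var depth = (long)await db.ExecuteAsync("XLEN", "stella:perf:jobs");
        setGauge(depth); // plug into your metrics registry
        // Consumer-group lag is also reported by XINFO GROUPS <stream>
        // (the "lag" field, on servers that expose it).
        await Task.Delay(TimeSpan.FromSeconds(1), ct);
    }
}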

Metrics exposure

Agents MUST publish Prometheus (or equivalent) histograms:

  • gateway_enqueue_seconds (or ns) histogram
  • valkey_enqueue_rtt_seconds histogram
  • worker_service_seconds histogram
  • queue_delay_seconds histogram (derived from timestamps; can be computed in worker or offline)
  • hop_latency_seconds histogram

5) Workload modeling and test data

Agents MUST define a workload model before running capacity tests:

  1. Endpoint(s): list exact gateway routes under test
  2. Payload types: small/typical/large
  3. Mix: e.g., 70/25/5 by payload size
  4. Idempotency rules: ensure repeated jobs dont corrupt state
  5. Data reset strategy: how test data is cleaned or isolated per run

Agents SHOULD test at least:

  • Typical payload (p50)
  • Large payload (p95)
  • Worst-case allowed payload (bounded by your API limits)

6) Scenario suite your agents MUST implement

Each scenario MUST be defined as code/config (not manual).

Scenario A — Smoke (fast sanity)

Goal: verify instrumentation + basic correctness
Load: low (e.g., 1–5 rps), 2 minutes
Pass:

  • 0 backlog after run
  • error rate < 0.1%
  • metrics present for all segments

Scenario B — Baseline (repeatable reference point)

Goal: establish a stable baseline for regression tracking
Load: fixed moderate load (e.g., 30–50% of expected capacity), 10 minutes
Pass:

  • p95 t_hop within baseline ± tolerance (set after first runs)
  • no upward drift in p95 across time (trend line ~flat)

Scenario C — Capacity ramp (open-loop)

Goal: find the knee where queueing begins
Method: open-loop arrival-rate ramp with plateaus
Example stages (edit to fit your system):

  • 50 rps for 2m
  • 100 rps for 2m
  • 200 rps for 2m
  • 400 rps for 2m
  • … until SLO breach or errors spike

MUST:

  • warm-up stage before first plateau
  • record per-plateau summary

Stop conditions (any triggers stop):

  • error rate > 1%
  • queue depth grows without bound over an entire plateau
  • p95 t_hop exceeds SLO for 2 consecutive plateaus

Scenario D — Stress (push past capacity)

Goal: characterize failure mode and recovery
Load: 120–200% of knee load, 5–10 minutes
Pass (for resilience):

  • system does not crash permanently
  • once load stops, backlog drains within target time (define it)

Scenario E — Burst / spike

Goal: see how quickly the queue grows and drains
Load shape:

  • baseline low load
  • sudden burst (e.g., 10× for 10–30s)
  • return to baseline

Report:

  • peak queue depth
  • time to drain to baseline
  • p99 t_hop during burst

Scenario F — Soak (long-running)

Goal: detect drift (leaks, fragmentation, GC patterns)
Load: 70–85% of knee, 60–180 minutes
Pass:

  • p95 does not trend upward beyond threshold
  • memory remains bounded
  • no rising error rate

Scenario G — Scaling curve (worker replica sweep)

Goal: turn results into scaling rules
Method:

  • Repeat Scenario C with worker replicas = 1, 2, 4, 8…

Deliverable:

  • plot of knee load vs worker count
  • p95 t_service vs worker count (should remain similar; queue delay should drop)

7) Execution protocol (runbook)

Agents MUST run every scenario using the same disciplined flow:

Pre-run checklist

  • confirm system versions/SHAs

  • confirm autoscaling mode:

    • Off for baseline capacity characterization
    • On for validating autoscaling policies
  • clear queues and consumer group pending entries

  • restart or at least record “time since deploy” for services (cold vs warm)

During run

  • ensure load is truly open-loop when required (arrival-rate based)

  • continuously record:

    • offered vs achieved rate
    • queue depth
    • CPU/mem for gateway/worker/Valkey

Post-run

  • stop load

  • wait until backlog drains (or record that it doesn't)

  • export:

    • k6/runner raw output
    • Prometheus time series snapshot
    • sampled logs with corr_id fields
  • generate a summary report automatically (no hand calculations)


8) Analysis rules (how agents compute “the envelope”)

Agents MUST generate at minimum two plots per run:

  1. Latency envelope: offered load (x-axis) vs p95 t_hop (y-axis)

    • overlay p99 (and SLO line)
  2. Queue behavior: offered load vs queue depth (or lag), plus drain time

How to identify the “knee”

Agents SHOULD mark the knee as the first plateau where:

  • queue depth grows monotonically within the plateau, or
  • p95 t_queue_delay increases by > X% step-to-step (e.g., 50–100%)

Convert results into scaling guidance

Agents SHOULD compute:

  • capacity_per_worker ≈ 1 / mean(t_service) (jobs/sec per worker)

  • recommended replicas for offered load λ at target utilization U:

    • workers_needed = ceil(λ * mean(t_service) / U)
    • choose U ~ 0.6–0.75 for headroom

This should be reported alongside the measured envelope.
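
Worked example (illustrative numbers, not measurements): with mean(t_service) = 25 ms, an offered load of λ = 400 jobs/s, and U = 0.7, workers_needed = ceil(400 × 0.025 / 0.7) = ceil(14.3) = 15 replicas.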


9) Pass/fail criteria and regression gates

Agents MUST define gates in configuration, not in someone's head.

Suggested gating structure:

  • Smoke gate: error rate < 0.1%, backlog drains
  • Baseline gate: p95 t_hop regression < 10% (tune after you have history)
  • Capacity gate: knee load regression < 10% (optional but very valuable)
  • Soak gate: p95 drift over time < 15% and no memory runaway
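
A sketch of a config-driven gate check in C# (System.Text.Json; the summary and gate field names are assumptions chosen to echo the YAML template later in this document):

using System.Text.Json;

// Sketch: fail the pipeline when the run summary breaches configured gates.
// Assumes summary.json like { "error_rate": 0.002, "p95_hop_ms": 420 }
// and gates.json like      { "max_error_rate": 0.01, "slo_ms_p95_hop": 500 }.
var summary = JsonDocument.Parse(File.ReadAllText("summary.json")).RootElement;
var gates   = JsonDocument.Parse(File.ReadAllText("gates.json")).RootElement;

bool pass =
    summary.GetProperty("error_rate").GetDouble() <= gates.GetProperty("max_error_rate").GetDouble()
    && summary.GetProperty("p95_hop_ms").GetDouble() <= gates.GetProperty("slo_ms_p95_hop").GetDouble();

Console.WriteLine(pass ? "GATES: PASS" : "GATES: FAIL");
Environment.Exit(pass ? 0 : 1); // non-zero exit fails the CI step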

10) Common pitfalls (agents must avoid)

  1. Closed-loop tests used for capacity. Closed-loop (“N concurrent users”) self-throttles and can hide queueing onset. Use open-loop arrival rate for capacity.

  2. Ignoring queue depth. A system can look “healthy” in request latency while silently building backlog.

  3. Measuring only gateway latency. You must measure enqueue → claim → done to see the real hop.

  4. Load generator bottleneck. If the generator saturates, you'll under-estimate capacity.

  5. Retries enabled by default. Retries can inflate load and hide root causes; run with retries off first.

  6. Not controlling warm vs cold. Cold caches vs warmed services produce different envelopes; record the condition.


Agent implementation checklist (deliverables)

Assign these as concrete tasks to your agents.

Agent 1 — Observability & tracing

MUST deliver:

  • correlation id propagation gateway → Valkey → worker
  • timestamps enq/claim/done
  • Prometheus histograms for enqueue, service, hop
  • queue depth metric (XLEN / XINFO lag)

Agent 2 — Load test harness

MUST deliver:

  • test runner scripts (k6 or equivalent) for scenarios A–G

  • test config file (YAML/JSON) controlling:

    • stages (rates/durations)
    • payload mix
    • headers (corr-id)
  • reproducible seeds and version stamping

Agent 3 — Result collector and analyzer

MUST deliver:

  • a pipeline that merges:

    • load generator output
    • hop timing data (from logs or a completion stream)
    • Prometheus snapshots
  • automatic summary + plots:

    • latency envelope
    • queue depth/drain
  • CSV/JSON exports for long-term tracking

Agent 4 — Reporting and dashboards

MUST deliver:

  • a standard report template that includes:

    • environment details
    • scenario details
    • key charts
    • knee estimate
    • scaling recommendation
  • Grafana dashboard with the required panels

Agent 5 — CI / release integration

SHOULD deliver:

  • PR-level smoke test (Scenario A)
  • nightly baseline (Scenario B)
  • weekly capacity sweep (Scenario C + scaling curve)

Template: scenario spec (agents can copy/paste)

test_run:
  system_under_test:
    gateway_sha: "<git sha>"
    worker_sha: "<git sha>"
    valkey_version: "<version>"
  environment:
    cluster: "<name>"
    workers: 4
    autoscaling: "off"   # off|on
  workload:
    endpoint: "/hop"
    payload_profile: "p50"
    mix:
      p50: 0.7
      p95: 0.25
      max: 0.05
  scenario:
    name: "capacity_ramp"
    mode: "open_loop"
    warmup_seconds: 60
    stages:
      - rps: 50
        duration_seconds: 120
      - rps: 100
        duration_seconds: 120
      - rps: 200
        duration_seconds: 120
      - rps: 400
        duration_seconds: 120
  gates:
    max_error_rate: 0.01
    slo_ms_p95_hop: 500
    backlog_must_drain_seconds: 300
  outputs:
    artifacts_dir: "./artifacts/<timestamp>/"

If you want, I can also provide a single “golden” folder structure (tests/ scripts/ dashboards/ analysis/) and a “definition of done” checklist that matches how your repo is organized—but the above is already sufficient for agents to start implementing immediately. Below is a sample / partial implementation that gives full functional coverage of your performance-test requirements (instrumentation, correlation, timestamps, queue semantics, scenarios A–G, artifact export, and analysis). It is intentionally minimal and “swap-in-real-code” friendly.

You can copy these files into a perf/ folder in your repo, build, and run locally with Docker Compose.


1) Suggested folder layout

perf/
  docker-compose.yml
  prometheus/
    prometheus.yml
  k6/
    lib.js
    smoke.js
    capacity_ramp.js
    burst.js
    soak.js
    stress.js
    scaling_curve.sh
  tools/
    analyze.py
  src/
    Perf.Gateway/
      Perf.Gateway.csproj
      Program.cs
      Metrics.cs
      ValkeyStreams.cs
      TimeNs.cs
    Perf.Worker/
      Perf.Worker.csproj
      Program.cs
      WorkerService.cs
      Metrics.cs
      ValkeyStreams.cs
      TimeNs.cs

2) Gateway sample (.NET 10, Minimal API)

perf/src/Perf.Gateway/Perf.Gateway.csproj

<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net10.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>

  <ItemGroup>
    <!-- StackExchange.Redis is MIT and works with Valkey -->
    <PackageReference Include="StackExchange.Redis" Version="2.8.16" />
  </ItemGroup>
</Project>

perf/src/Perf.Gateway/TimeNs.cs

namespace Perf.Gateway;

public static class TimeNs
{
    private static readonly long UnixEpochTicks = DateTime.UnixEpoch.Ticks; // 100ns units

    public static long UnixNowNs()
    {
        var ticks = DateTime.UtcNow.Ticks - UnixEpochTicks; // 100ns
        return ticks * 100L; // ns
    }
}

perf/src/Perf.Gateway/Metrics.cs

using System.Collections.Concurrent;
using System.Globalization;
using System.Text;

namespace Perf.Gateway;

public sealed class Metrics
{
    private readonly ConcurrentDictionary<string, long> _counters = new();

    // Simple fixed-bucket histograms in seconds (Prometheus histogram format)
    private readonly ConcurrentDictionary<string, Histogram> _h = new();

    public void Inc(string name, long by = 1) => _counters.AddOrUpdate(name, by, (_, v) => v + by);

    public Histogram Hist(string name, double[] bucketsSeconds) =>
        _h.GetOrAdd(name, _ => new Histogram(name, bucketsSeconds));

    public string ExportPrometheus()
    {
        var sb = new StringBuilder(16 * 1024);

        foreach (var (k, v) in _counters.OrderBy(kv => kv.Key))
        {
            sb.Append("# TYPE ").Append(k).Append(" counter\n");
            sb.Append(k).Append(' ').Append(v.ToString(CultureInfo.InvariantCulture)).Append('\n');
        }

        foreach (var hist in _h.Values.OrderBy(h => h.Name))
        {
            sb.Append(hist.Export());
        }

        return sb.ToString();
    }

    public sealed class Histogram
    {
        public string Name { get; }
        private readonly double[] _buckets; // sorted
        private readonly long[] _bucketCounts; // cumulative exposed later
        private long _count;
        private double _sum;

        private readonly object _lock = new();

        public Histogram(string name, double[] bucketsSeconds)
        {
            Name = name;
            _buckets = bucketsSeconds.OrderBy(x => x).ToArray();
            _bucketCounts = new long[_buckets.Length];
        }

        public void Observe(double seconds)
        {
            lock (_lock)
            {
                _count++;
                _sum += seconds;

                for (int i = 0; i < _buckets.Length; i++)
                {
                    if (seconds <= _buckets[i]) _bucketCounts[i]++;
                }
            }
        }

        public string Export()
        {
            // Prometheus hist buckets are cumulative; we already maintain that.
            var sb = new StringBuilder(2048);
            sb.Append("# TYPE ").Append(Name).Append(" histogram\n");

            lock (_lock)
            {
                for (int i = 0; i < _buckets.Length; i++)
                {
                    sb.Append(Name).Append("_bucket{le=\"")
                      .Append(_buckets[i].ToString("0.################", CultureInfo.InvariantCulture))
                      .Append("\"} ")
                      .Append(_bucketCounts[i].ToString(CultureInfo.InvariantCulture))
                      .Append('\n');
                }

                sb.Append(Name).Append("_bucket{le=\"+Inf\"} ")
                  .Append(_count.ToString(CultureInfo.InvariantCulture))
                  .Append('\n');

                sb.Append(Name).Append("_sum ")
                  .Append(_sum.ToString(CultureInfo.InvariantCulture))
                  .Append('\n');

                sb.Append(Name).Append("_count ")
                  .Append(_count.ToString(CultureInfo.InvariantCulture))
                  .Append('\n');
            }

            return sb.ToString();
        }
    }
}

perf/src/Perf.Gateway/ValkeyStreams.cs

using StackExchange.Redis;

namespace Perf.Gateway;

public sealed class ValkeyStreams
{
    private readonly IDatabase _db;
    public ValkeyStreams(IConnectionMultiplexer mux) => _db = mux.GetDatabase();

    public async Task EnsureConsumerGroupAsync(string stream, string group)
    {
        try
        {
            // XGROUP CREATE <key> <groupname> $ MKSTREAM
            await _db.ExecuteAsync("XGROUP", "CREATE", stream, group, "$", "MKSTREAM");
        }
        catch (RedisServerException ex) when (ex.Message.Contains("BUSYGROUP", StringComparison.OrdinalIgnoreCase))
        {
            // ok
        }
    }

    public async Task<RedisResult> XAddAsync(string stream, NameValueEntry[] fields)
    {
        // XADD stream * field value field value ...
        // (StackExchange.Redis also has a typed StreamAddAsync; raw ExecuteAsync is
        // used here to mirror the Valkey commands one-to-one.)
        var args = new List<object>(2 + fields.Length * 2) { stream, "*" };
        foreach (var f in fields) { args.Add(f.Name); args.Add(f.Value); }
        return await _db.ExecuteAsync("XADD", args.ToArray());
    }
}

perf/src/Perf.Gateway/Program.cs

using Perf.Gateway;
using StackExchange.Redis;
using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);

var valkey = builder.Configuration["VALKEY_ENDPOINT"] ?? "valkey:6379";
builder.Services.AddSingleton<IConnectionMultiplexer>(_ => ConnectionMultiplexer.Connect(valkey));
builder.Services.AddSingleton<ValkeyStreams>();
builder.Services.AddSingleton<Metrics>();

var app = builder.Build();

var metrics = app.Services.GetRequiredService<Metrics>();
var streams = app.Services.GetRequiredService<ValkeyStreams>();

const string JobsStream = "stella:perf:jobs";
const string DoneStream = "stella:perf:done";
const string Group = "workers";

await streams.EnsureConsumerGroupAsync(JobsStream, Group);

var allowTestControl = (app.Configuration["ALLOW_TEST_CONTROL"] ?? "1") == "1";
var runs = new Dictionary<string, long>(StringComparer.Ordinal); // run_id -> start_ns

if (allowTestControl)
{
    app.MapPost("/test/start", () =>
    {
        var runId = Guid.NewGuid().ToString("N");
        var startNs = TimeNs.UnixNowNs();
        lock (runs) runs[runId] = startNs;

        metrics.Inc("perf_test_start_total");
        return Results.Ok(new { run_id = runId, start_ns = startNs, jobs_stream = JobsStream, done_stream = DoneStream });
    });

    app.MapPost("/test/end/{runId}", (string runId) =>
    {
        lock (runs) runs.Remove(runId);
        metrics.Inc("perf_test_end_total");
        return Results.Ok(new { run_id = runId });
    });
}

app.MapGet("/metrics", () => Results.Text(metrics.ExportPrometheus(), "text/plain; version=0.0.4"));

app.MapPost("/hop", async (HttpRequest req) =>
{
    // Correlation / run id
    var corr = req.Headers["x-stella-corr-id"].FirstOrDefault() ?? Guid.NewGuid().ToString();
    var runId = req.Headers["x-stella-run-id"].FirstOrDefault() ?? "no-run";

    // Enqueue timestamp (UTC-derived ns)
    var enqNs = TimeNs.UnixNowNs();

    // Read raw body (payload) - keep it simple for perf harness
    string payload;
    using (var sr = new StreamReader(req.Body))
        payload = await sr.ReadToEndAsync();

    var sw = Stopwatch.GetTimestamp();

    // Valkey enqueue
    var valkeySw = Stopwatch.GetTimestamp();
    var entryId = await streams.XAddAsync(JobsStream, new[]
    {
        new NameValueEntry("corr", corr),
        new NameValueEntry("run", runId),
        new NameValueEntry("enq_ns", enqNs),
        new NameValueEntry("payload", payload),
    });
    var valkeyRttSec = (Stopwatch.GetTimestamp() - valkeySw) / (double)Stopwatch.Frequency;

    var enqueueSec = (Stopwatch.GetTimestamp() - sw) / (double)Stopwatch.Frequency;

    metrics.Inc("hop_requests_total");
    metrics.Hist("gateway_enqueue_seconds", new[] { .001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2 }).Observe(enqueueSec);
    metrics.Hist("valkey_enqueue_rtt_seconds", new[] { .0005, .001, .002, .005, .01, .02, .05, .1, .2 }).Observe(valkeyRttSec);

    return Results.Accepted(value: new { corr, run_id = runId, enq_ns = enqNs, entry_id = entryId.ToString() });
});

app.Run("http://0.0.0.0:8080");

3) Worker sample (.NET 10 hosted service + metrics)

perf/src/Perf.Worker/Perf.Worker.csproj

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net10.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="StackExchange.Redis" Version="2.8.16" />
  </ItemGroup>
</Project>

perf/src/Perf.Worker/TimeNs.cs

namespace Perf.Worker;

public static class TimeNs
{
    private static readonly long UnixEpochTicks = DateTime.UnixEpoch.Ticks;
    public static long UnixNowNs() => (DateTime.UtcNow.Ticks - UnixEpochTicks) * 100L;
}

perf/src/Perf.Worker/Metrics.cs

// Same as gateway Metrics.cs (copy/paste). Keep identical for consistency.
using System.Collections.Concurrent;
using System.Globalization;
using System.Text;

namespace Perf.Worker;

public sealed class Metrics
{
    private readonly ConcurrentDictionary<string, long> _counters = new();
    private readonly ConcurrentDictionary<string, Histogram> _h = new();

    public void Inc(string name, long by = 1) => _counters.AddOrUpdate(name, by, (_, v) => v + by);
    public Histogram Hist(string name, double[] bucketsSeconds) =>
        _h.GetOrAdd(name, _ => new Histogram(name, bucketsSeconds));

    public string ExportPrometheus()
    {
        var sb = new StringBuilder(16 * 1024);

        foreach (var (k, v) in _counters.OrderBy(kv => kv.Key))
        {
            sb.Append("# TYPE ").Append(k).Append(" counter\n");
            sb.Append(k).Append(' ').Append(v.ToString(CultureInfo.InvariantCulture)).Append('\n');
        }

        foreach (var hist in _h.Values.OrderBy(h => h.Name))
            sb.Append(hist.Export());

        return sb.ToString();
    }

    public sealed class Histogram
    {
        public string Name { get; }
        private readonly double[] _buckets;
        private readonly long[] _bucketCounts;
        private long _count;
        private double _sum;
        private readonly object _lock = new();

        public Histogram(string name, double[] bucketsSeconds)
        {
            Name = name;
            _buckets = bucketsSeconds.OrderBy(x => x).ToArray();
            _bucketCounts = new long[_buckets.Length];
        }

        public void Observe(double seconds)
        {
            lock (_lock)
            {
                _count++;
                _sum += seconds;
                for (int i = 0; i < _buckets.Length; i++)
                    if (seconds <= _buckets[i]) _bucketCounts[i]++;
            }
        }

        public string Export()
        {
            var sb = new StringBuilder(2048);
            sb.Append("# TYPE ").Append(Name).Append(" histogram\n");
            lock (_lock)
            {
                for (int i = 0; i < _buckets.Length; i++)
                {
                    sb.Append(Name).Append("_bucket{le=\"")
                      .Append(_buckets[i].ToString("0.################", CultureInfo.InvariantCulture))
                      .Append("\"} ")
                      .Append(_bucketCounts[i].ToString(CultureInfo.InvariantCulture))
                      .Append('\n');
                }

                sb.Append(Name).Append("_bucket{le=\"+Inf\"} ")
                  .Append(_count.ToString(CultureInfo.InvariantCulture))
                  .Append('\n');

                sb.Append(Name).Append("_sum ")
                  .Append(_sum.ToString(CultureInfo.InvariantCulture))
                  .Append('\n');

                sb.Append(Name).Append("_count ")
                  .Append(_count.ToString(CultureInfo.InvariantCulture))
                  .Append('\n');
            }
            return sb.ToString();
        }
    }
}

perf/src/Perf.Worker/ValkeyStreams.cs

using StackExchange.Redis;

namespace Perf.Worker;

public sealed class ValkeyStreams
{
    private readonly IDatabase _db;
    public ValkeyStreams(IConnectionMultiplexer mux) => _db = mux.GetDatabase();

    public async Task EnsureConsumerGroupAsync(string stream, string group)
    {
        try
        {
            await _db.ExecuteAsync("XGROUP", "CREATE", stream, group, "$", "MKSTREAM");
        }
        catch (RedisServerException ex) when (ex.Message.Contains("BUSYGROUP", StringComparison.OrdinalIgnoreCase)) { }
    }

    public async Task<RedisResult> XReadGroupAsync(string group, string consumer, string stream, string id, int count, int blockMs)
        => await _db.ExecuteAsync("XREADGROUP", "GROUP", group, consumer, "COUNT", count, "BLOCK", blockMs, "STREAMS", stream, id);

    public async Task XAckAsync(string stream, string group, RedisValue id)
        => await _db.ExecuteAsync("XACK", stream, group, id);

    public async Task<RedisResult> XAddAsync(string stream, NameValueEntry[] fields)
    {
        var args = new List<object>(2 + fields.Length * 2) { stream, "*" };
        foreach (var f in fields) { args.Add(f.Name); args.Add(f.Value); }
        return await _db.ExecuteAsync("XADD", args.ToArray());
    }
}

perf/src/Perf.Worker/WorkerService.cs

using StackExchange.Redis;
using System.Diagnostics;

namespace Perf.Worker;

public sealed class WorkerService : BackgroundService
{
    private readonly ValkeyStreams _streams;
    private readonly Metrics _metrics;
    private readonly ILogger<WorkerService> _log;

    private const string JobsStream = "stella:perf:jobs";
    private const string DoneStream = "stella:perf:done";
    private const string Group = "workers";

    private readonly string _consumer;

    public WorkerService(ValkeyStreams streams, Metrics metrics, ILogger<WorkerService> log)
    {
        _streams = streams;
        _metrics = metrics;
        _log = log;
        _consumer = Environment.GetEnvironmentVariable("WORKER_CONSUMER") ?? $"w-{Environment.MachineName}-{Guid.NewGuid():N}";
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await _streams.EnsureConsumerGroupAsync(JobsStream, Group);

        var serviceBuckets = new[] { .001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5 };
        var queueBuckets = new[] { .001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10, 30 };
        var hopBuckets   = new[] { .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10, 30 };

        while (!stoppingToken.IsCancellationRequested)
        {
            RedisResult res;
            try
            {
                res = await _streams.XReadGroupAsync(Group, _consumer, JobsStream, ">", count: 1, blockMs: 1000);
            }
            catch (Exception ex)
            {
                _metrics.Inc("worker_xread_errors_total");
                _log.LogWarning(ex, "XREADGROUP failed");
                await Task.Delay(250, stoppingToken);
                continue;
            }

            if (res.IsNull) continue;

            // Parse XREADGROUP result (array -> stream -> entries)
            // Expected shape: [[stream, [[id, [field, value, field, value...]], ...]]]
            var outer = (RedisResult[])res!;
            foreach (var streamBlock in outer)
            {
                var sb = (RedisResult[])streamBlock!;
                var entries = (RedisResult[])sb[1]!;

                foreach (var entry in entries)
                {
                    var e = (RedisResult[])entry!;
                    var entryId = (RedisValue)e[0]!;
                    var fields = (RedisResult[])e[1]!;

                    string corr = "", run = "no-run";
                    long enqNs = 0;

                    for (int i = 0; i < fields.Length; i += 2)
                    {
                        var key = (string)fields[i]!;
                        var val = fields[i + 1].ToString();
                        if (key == "corr") corr = val;
                        else if (key == "run") run = val;
                        else if (key == "enq_ns") _ = long.TryParse(val, out enqNs);
                    }

                    var claimNs = TimeNs.UnixNowNs();

                    var sw = Stopwatch.GetTimestamp();

                    // Placeholder "service work"; replace with real processing.
                    // Keep it deterministic-ish; use an env var to model different service times.
                    var workMs = int.TryParse(Environment.GetEnvironmentVariable("WORK_MS"), out var ms) ? ms : 5;
                    await Task.Delay(workMs, stoppingToken);

                    var doneNs = TimeNs.UnixNowNs();
                    var serviceSec = (Stopwatch.GetTimestamp() - sw) / (double)Stopwatch.Frequency;

                    var queueDelaySec = enqNs > 0 ? (claimNs - enqNs) / 1_000_000_000d : double.NaN;
                    var hopSec = enqNs > 0 ? (doneNs - enqNs) / 1_000_000_000d : double.NaN;

                    // Ack then publish "done" record for offline analysis
                    await _streams.XAckAsync(JobsStream, Group, entryId);

                    await _streams.XAddAsync(DoneStream, new[]
                    {
                        new NameValueEntry("run", run),
                        new NameValueEntry("corr", corr),
                        new NameValueEntry("entry", entryId),
                        new NameValueEntry("enq_ns", enqNs),
                        new NameValueEntry("claim_ns", claimNs),
                        new NameValueEntry("done_ns", doneNs),
                        new NameValueEntry("work_ms", workMs),
                    });

                    _metrics.Inc("worker_jobs_total");
                    _metrics.Hist("worker_service_seconds", serviceBuckets).Observe(serviceSec);

                    if (!double.IsNaN(queueDelaySec))
                        _metrics.Hist("queue_delay_seconds", queueBuckets).Observe(queueDelaySec);

                    if (!double.IsNaN(hopSec))
                        _metrics.Hist("hop_latency_seconds", hopBuckets).Observe(hopSec);
                }
            }
        }
    }
}

perf/src/Perf.Worker/Program.cs

using Perf.Worker;
using StackExchange.Redis;

var builder = Host.CreateApplicationBuilder(args);

var valkey = builder.Configuration["VALKEY_ENDPOINT"] ?? "valkey:6379";
builder.Services.AddSingleton<IConnectionMultiplexer>(_ => ConnectionMultiplexer.Connect(valkey));
builder.Services.AddSingleton<ValkeyStreams>();
builder.Services.AddSingleton<Metrics>();
builder.Services.AddHostedService<WorkerService>();

// Minimal metrics endpoint (listen URL only; the /metrics route is mapped inside)
builder.Services.AddSingleton<IHostedService>(sp =>
{
    return new SimpleMetricsServer(
        sp.GetRequiredService<Metrics>(),
        url: "http://0.0.0.0:8081"
    );
});

var host = builder.Build();
await host.RunAsync();

// ---- minimal metrics server ----
file sealed class SimpleMetricsServer : BackgroundService
{
    private readonly Metrics _metrics;
    private readonly string _url;

    public SimpleMetricsServer(Metrics metrics, string url) { _metrics = metrics; _url = url; }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var builder = WebApplication.CreateBuilder();
        var app = builder.Build();
        app.MapGet("/metrics", () => Results.Text(_metrics.ExportPrometheus(), "text/plain; version=0.0.4"));
        stoppingToken.Register(() => _ = app.StopAsync()); // WebApplication.RunAsync(url) takes no token
        await app.RunAsync(_url);
    }
}

4) Docker Compose (Valkey + gateway + worker + Prometheus)

perf/docker-compose.yml

services:
  valkey:
    image: valkey/valkey:7.2
    ports: ["6379:6379"]

  gateway:
    build:
      context: ./src/Perf.Gateway
    environment:
      - VALKEY_ENDPOINT=valkey:6379
      - ALLOW_TEST_CONTROL=1
    ports: ["8080:8080"]
    depends_on: [valkey]

  worker:
    build:
      context: ./src/Perf.Worker
    environment:
      - VALKEY_ENDPOINT=valkey:6379
      - WORK_MS=5
    # No fixed host port: Prometheus scrapes worker:8081 over the compose network,
    # and publishing 8081 on the host would break `--scale worker=N` (Scenario G).
    expose: ["8081"]
    depends_on: [valkey]

  prometheus:
    image: prom/prometheus:v2.55.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports: ["9090:9090"]
    depends_on: [gateway, worker]

perf/prometheus/prometheus.yml

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: gateway
    static_configs:
      - targets: ["gateway:8080"]

  - job_name: worker
    static_configs:
      - targets: ["worker:8081"]

Run (each service folder is assumed to contain a suitable Dockerfile; they are not shown here):

cd perf
docker compose up -d --build

5) k6 scenarios AG (open-loop where required)

perf/k6/lib.js

import http from "k6/http";

export function startRun(baseUrl) {
  const res = http.post(`${baseUrl}/test/start`, null, { tags: { phase: "control" } });
  if (res.status !== 200) throw new Error(`startRun failed: ${res.status} ${res.body}`);
  return res.json();
}

export function hop(baseUrl, runId) {
  // crypto.randomUUID() needs a recent k6 release; older versions can use
  // uuidv4 from https://jslib.k6.io/k6-utils/ instead.
  const corr = crypto.randomUUID();
  const payload = JSON.stringify({ corr, data: "ping" });

  return http.post(
    `${baseUrl}/hop`,
    payload,
    {
      headers: {
        "content-type": "application/json",
        "x-stella-run-id": runId,
        "x-stella-corr-id": corr
      },
      tags: { phase: "hop" }
    }
  );
}

Scenario A: Smoke — perf/k6/smoke.js

import { check, sleep } from "k6";
import { startRun, hop } from "./lib.js";

export const options = {
  scenarios: {
    smoke: {
      executor: "constant-arrival-rate",
      rate: 2,
      timeUnit: "1s",
      duration: "2m",
      preAllocatedVUs: 20,
      maxVUs: 200
    }
  },
  thresholds: {
    http_req_failed: ["rate<0.001"]
  }
};

export function setup() {
  return startRun(__ENV.GW_URL);
}

export default function (data) {
  const res = hop(__ENV.GW_URL, data.run_id);
  check(res, { "202 accepted": r => r.status === 202 });
  sleep(0.01);
}

Scenario C: Capacity ramp (open-loop) — perf/k6/capacity_ramp.js

import { check } from "k6";
import { startRun, hop } from "./lib.js";

export const options = {
  scenarios: {
    ramp: {
      executor: "ramping-arrival-rate",
      startRate: 50,
      timeUnit: "1s",
      preAllocatedVUs: 200,
      maxVUs: 5000,
      stages: [
        { target: 50, duration: "2m" },
        { target: 100, duration: "2m" },
        { target: 200, duration: "2m" },
        { target: 400, duration: "2m" },
        { target: 800, duration: "2m" }
      ]
    }
  },
  thresholds: {
    http_req_failed: ["rate<0.01"]
  }
};

export function setup() {
  return startRun(__ENV.GW_URL);
}

export default function (data) {
  const res = hop(__ENV.GW_URL, data.run_id);
  check(res, { "202 accepted": r => r.status === 202 });
}

Scenario E: Burst — perf/k6/burst.js

import { check } from "k6";
import { startRun, hop } from "./lib.js";

export const options = {
  scenarios: {
    burst: {
      executor: "ramping-arrival-rate",
      startRate: 20,
      timeUnit: "1s",
      preAllocatedVUs: 200,
      maxVUs: 5000,
      stages: [
        { target: 20, duration: "60s" },
        { target: 400, duration: "20s" },
        { target: 20, duration: "120s" }
      ]
    }
  }
};

export function setup() { return startRun(__ENV.GW_URL); }

export default function (data) {
  const res = hop(__ENV.GW_URL, data.run_id);
  check(res, { "202": r => r.status === 202 });
}

Scenario F: Soak — perf/k6/soak.js

import { check } from "k6";
import { startRun, hop } from "./lib.js";

export const options = {
  scenarios: {
    soak: {
      executor: "constant-arrival-rate",
      rate: 200,
      timeUnit: "1s",
      duration: "60m",
      preAllocatedVUs: 500,
      maxVUs: 5000
    }
  }
};

export function setup() { return startRun(__ENV.GW_URL); }

export default function (data) {
  const res = hop(__ENV.GW_URL, data.run_id);
  check(res, { "202": r => r.status === 202 });
}

Scenario D: Stress — perf/k6/stress.js

import { check } from "k6";
import { startRun, hop } from "./lib.js";

export const options = {
  scenarios: {
    stress: {
      executor: "constant-arrival-rate",
      rate: 1500,
      timeUnit: "1s",
      duration: "10m",
      preAllocatedVUs: 2000,
      maxVUs: 15000
    }
  },
  thresholds: {
    http_req_failed: ["rate<0.05"]
  }
};

export function setup() { return startRun(__ENV.GW_URL); }

export default function (data) {
  const res = hop(__ENV.GW_URL, data.run_id);
  check(res, { "202": r => r.status === 202 });
}

Scenario G: Scaling curve orchestration — perf/k6/scaling_curve.sh

#!/usr/bin/env bash
set -euo pipefail

GW_URL="${GW_URL:-http://localhost:8080}"

for n in 1 2 4 8; do
  echo "== Scaling workers to $n =="
  docker compose -f ../docker-compose.yml up -d --scale worker="$n"

  mkdir -p "../artifacts/scale-$n"
  k6 run \
    -e GW_URL="$GW_URL" \
    --summary-export "../artifacts/scale-$n/k6-summary.json" \
    ./capacity_ramp.js
done

Run (examples):

cd perf/k6
GW_URL=http://localhost:8080 k6 run --summary-export ../artifacts/smoke-summary.json smoke.js
GW_URL=http://localhost:8080 k6 run --summary-export ../artifacts/ramp-summary.json capacity_ramp.js

6) Offline analysis tool (reads “done” stream by run_id)

perf/tools/analyze.py

import os, sys, json, math
from datetime import datetime, timezone

import redis

def pct(values, p):
    if not values:
        return None
    values = sorted(values)
    k = (len(values) - 1) * (p / 100.0)
    f = math.floor(k); c = math.ceil(k)
    if f == c:
        return values[int(k)]
    return values[f] * (c - k) + values[c] * (k - f)

def main():
    valkey = os.getenv("VALKEY_ENDPOINT", "localhost:6379")
    host, port = valkey.split(":")
    r = redis.Redis(host=host, port=int(port), decode_responses=True)

    run_id = os.getenv("RUN_ID")
    if not run_id:
        print("Set RUN_ID env var (from /test/start response).", file=sys.stderr)
        sys.exit(2)

    done_stream = os.getenv("DONE_STREAM", "stella:perf:done")

    # Read all entries (fine at sample scale). For big runs, paginate XRANGE
    # using the last-seen entry id as the next 'min' cursor.
    entries = r.xrange(done_stream, min='-', max='+', count=200000)

    hop_ms = []
    queue_ms = []
    service_ms = []

    matched = 0
    for entry_id, fields in entries:
        if fields.get("run") != run_id:
            continue
        matched += 1

        enq_ns = int(fields.get("enq_ns", "0"))
        claim_ns = int(fields.get("claim_ns", "0"))
        done_ns = int(fields.get("done_ns", "0"))

        if enq_ns > 0 and claim_ns > 0:
            queue_ms.append((claim_ns - enq_ns) / 1_000_000.0)
        if claim_ns > 0 and done_ns > 0:
            service_ms.append((done_ns - claim_ns) / 1_000_000.0)
        if enq_ns > 0 and done_ns > 0:
            hop_ms.append((done_ns - enq_ns) / 1_000_000.0)

    summary = {
        "run_id": run_id,
        "done_stream": done_stream,
        "matched_jobs": matched,
        "hop_ms": {
            "p50": pct(hop_ms, 50), "p95": pct(hop_ms, 95), "p99": pct(hop_ms, 99)
        },
        "queue_ms": {
            "p50": pct(queue_ms, 50), "p95": pct(queue_ms, 95), "p99": pct(queue_ms, 99)
        },
        "service_ms": {
            "p50": pct(service_ms, 50), "p95": pct(service_ms, 95), "p99": pct(service_ms, 99)
        },
        "generated_at": datetime.now(timezone.utc).isoformat()
    }

    print(json.dumps(summary, indent=2))

if __name__ == "__main__":
    main()

Run:

pip install redis
RUN_ID=<value_from_/test/start> python perf/tools/analyze.py

This yields the key percentiles for hop, queue_delay, and service from the authoritative worker-side timestamps.


7) What this sample already covers

  • Correlation: x-stella-corr-id end-to-end.

  • Run isolation: x-stella-run-id created via /test/start, used to filter results.

  • Valkey Streams + consumer group: fair claim semantics.

  • Required timestamps: enq_ns, claim_ns, done_ns.

  • Metrics:

    • gateway_enqueue_seconds histogram
    • valkey_enqueue_rtt_seconds histogram
    • worker_service_seconds, queue_delay_seconds, hop_latency_seconds histograms
  • Scenarios:

    • A Smoke, C Capacity ramp, D Stress, E Burst, F Soak
    • G Scaling curve via script (repeat ramp across worker counts)

8) Immediate next hardening steps (still “small”)

  1. Add queue depth / lag gauges: in the worker or gateway, poll XLEN stella:perf:jobs and export it as a gauge metric in Prometheus format.
  2. Drain-time measurement: implement /test/end/{runId} so it waits until “matched jobs stop increasing” and queue depth returns to baseline, then records a final metric (a minimal drain-wait sketch follows this list).
  3. Stage slicing (per plateau stats): extend analyze.py to accept your k6 stage plan and compute p95 per stage window (based on start_ns).
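
A minimal drain-wait sketch for step 2 (StackExchange.Redis; the stream name, poll interval, and baseline depth are placeholders):

// Sketch: block until the jobs stream drains back to its baseline depth.
async Task<bool> WaitForDrainAsync(IDatabase db, string stream, long baselineDepth,
                                   TimeSpan timeout, CancellationToken ct)
{
    var deadline = DateTime.UtcNow + timeout;
    while (DateTime.UtcNow < deadline && !ct.IsCancellationRequested)
    {
        var depth = (long)await db.ExecuteAsync("XLEN", stream);
        if (depth <= baselineDepth) return true;       // drained
        await Task.Delay(TimeSpan.FromSeconds(1), ct);
    }
    return false; // did not drain in time; record this in the report
}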

If you want, I can extend the sample with (1) queue-depth export and (2) per-plateau slicing in analyze.py without adding any new dependencies.