Router Rate Limiting - Implementation Guide

For: Implementation agents / reviewers for Sprint 1200_001_001 through 1200_001_006
Status: DONE (Sprints 1–6 closed; Sprint 4 closed N/A)
Evidence: src/__Libraries/StellaOps.Router.Gateway/RateLimit/, tests/StellaOps.Router.Gateway.Tests/
Last Updated: 2025-12-18


Purpose

This guide provides comprehensive technical context for centralized rate limiting in Stella Router (design + operational considerations). The implementation for Sprints 1–3 has landed in the repo; Sprint 4 is closed as N/A, and Sprints 5–6 are complete (tests + docs).


Table of Contents

  1. Architecture Overview
  2. Configuration Philosophy
  3. Performance Considerations
  4. Valkey Integration
  5. Testing Strategy
  6. Common Pitfalls
  7. Debugging Guide
  8. Operational Runbook

Architecture Overview

Design Principles

  1. Router-Centralized: Rate limiting is a router responsibility, not a microservice responsibility
  2. Fail-Open: Never block all traffic due to infrastructure failures
  3. Observable: Every decision must be recorded in metrics
  4. Deterministic: Same request at same time should get same decision (within window)
  5. Fair: Use sliding windows where possible to avoid thundering herd

Two-Tier Architecture

Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
              ↓ DENY                                  ↓ DENY
            429 + Retry-After                      429 + Retry-After

Why two tiers?

  • Instance tier protects individual router process (CPU, memory, sockets)
  • Environment tier protects shared backend (aggregate across all routers)

Both are necessary: a single router can be overwhelmed locally even if aggregate traffic is low.

Decision Flow

1. Extract microservice + route from request
2. Check instance limits (always, fast path)
   └─> DENY? Return 429
3. Check activation gate (local 5-min counter)
   └─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
   └─> Circuit breaker open? Skip (fail-open)
   └─> Valkey error? Skip (fail-open)
   └─> DENY? Return 429
5. Forward to upstream
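
A minimal middleware sketch of steps 1-5 above. The limiter interfaces, ResolveTarget, and WriteTooManyRequestsAsync helpers are illustrative assumptions, not the exact types in StellaOps.Router.Gateway:

// Sketch only: names are assumed, error handling trimmed for brevity.
public async Task InvokeAsync(HttpContext context)
{
    var microservice = ResolveTarget(context);                          // 1. identify target

    var instanceDecision = _instanceLimiter.TryAcquire(microservice);   // 2. fast in-memory check
    if (!instanceDecision.Allowed)
    {
        await WriteTooManyRequestsAsync(context, instanceDecision.RetryAfterSeconds);
        return;
    }

    // 3. activation gate + 4. environment check (skipped while the breaker is open)
    if (_activationCounter.GetCount() >= _options.ActivationThresholdPer5Min
        && !_circuitBreaker.IsOpen)
    {
        var envDecision = await _environmentLimiter.TryAcquireAsync(microservice, context.RequestAborted);

        // A null decision means the Valkey call failed: fail-open and continue.
        if (envDecision.HasValue && !envDecision.Value.Allowed)
        {
            await WriteTooManyRequestsAsync(context, envDecision.Value.RetryAfterSeconds);
            return;
        }
    }

    await _next(context);                                                // 5. forward upstream
}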

Configuration Philosophy

Inheritance Model

Global Defaults
  └─> Environment Defaults
       └─> Microservice Overrides
            └─> Route Overrides (most specific)

Replacement, not merge: When a child level specifies limits, it REPLACES parent limits entirely.

Example:

for_environment:
  per_seconds: 300
  max_requests: 30000        # Global default

  microservices:
    scanner:
      per_seconds: 60
      max_requests: 600       # REPLACES global (not merged)
      routes:
        scan_submit:
          per_seconds: 10
          max_requests: 50    # REPLACES microservice (not merged)

Result:

  • POST /scanner/api/scans → 50 req/10sec (route level)
  • GET /scanner/api/other → 600 req/60sec (microservice level)
  • GET /policy/api/evaluate → 30000 req/300sec (global level)
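
A sketch of how "replacement, not merge" resolution might look in code; the config types and property names below are assumptions for illustration, not the shipped configuration model:

// Most specific level that defines rules wins outright; nothing is merged.
static IReadOnlyList<RateLimitRule> ResolveEffectiveRules(
    EnvironmentLimitsConfig config, string microservice, string? route)
{
    var svc = config.Microservices?.GetValueOrDefault(microservice);

    // Route-level rules replace microservice rules entirely.
    if (route is not null && svc?.Routes?.GetValueOrDefault(route) is { Count: > 0 } routeRules)
        return routeRules;

    // Microservice-level rules replace global defaults entirely.
    if (svc?.Rules is { Count: > 0 } svcRules)
        return svcRules;

    return config.GlobalRules; // global defaults apply
}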

Rule Stacking (AND Logic)

Multiple rules at same level = ALL must pass.

concelier:
  rules:
    - per_seconds: 1
      max_requests: 10        # Rule 1: 10/sec
    - per_seconds: 3600
      max_requests: 3000      # Rule 2: 3000/hour

Both rules enforced. Request denied if EITHER limit exceeded.
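
A sketch of the AND check, using the same assumed shapes as above; TryAcquireAsync-per-rule and RateLimitDecision.Allow() are illustrative, and the null-means-fail-open convention mirrors the environment limiter described later:

// All rules at the effective level must allow the request (AND logic).
foreach (var rule in effectiveRules)
{
    var decision = await limiter.TryAcquireAsync(microservice, rule, cancellationToken);
    if (decision.HasValue && !decision.Value.Allowed)
    {
        return decision.Value;          // deny as soon as any single rule is exceeded
    }
}
return RateLimitDecision.Allow();       // every stacked rule had remaining budget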

Sensible Defaults

If configuration omitted:

  • for_instance: No limits (effectively unlimited)
  • for_environment: No limits
  • activation_threshold: 5000 (skip Valkey if <5000 req/5min)
  • circuit_breaker.failure_threshold: 5
  • circuit_breaker.timeout_seconds: 30

Recommendation: Always configure at least global defaults.
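
One way these defaults could be expressed as an options type; the property names are assumptions, not the actual configuration binder model:

public sealed class RateLimitOptions
{
    public InstanceLimitsConfig? ForInstance { get; init; }         // null = no instance limits
    public EnvironmentLimitsConfig? ForEnvironment { get; init; }   // null = no environment limits
    public int ActivationThresholdPer5Min { get; init; } = 5000;
    public int CircuitBreakerFailureThreshold { get; init; } = 5;
    public int CircuitBreakerTimeoutSeconds { get; init; } = 30;
}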


Performance Considerations

Instance Limiter Performance

Target: <1ms P99 latency

Implementation: Sliding window with ring buffer.

// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets;  // Ring buffer, size = window_seconds / granularity
long _total;      // Running sum

Lock contention: Single lock per counter. Acceptable for <10k req/sec per router.

Memory: ~24 bytes per window (array overhead + fields).

Optimization: For very high traffic (>50k req/sec), consider lock-free implementation with Interlocked operations.
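
A self-contained sketch of the ring-buffer counter described above, assuming 1-second buckets (the real implementation may use a different granularity and member names):

public sealed class SlidingWindowCounter
{
    private readonly object _lock = new();
    private readonly long[] _buckets;      // one bucket per second of the window
    private readonly int _windowSeconds;
    private long _total;                   // running sum across live buckets
    private long _lastTickSeconds;         // unix second of the most recent advance

    public SlidingWindowCounter(int windowSeconds)
    {
        _windowSeconds = windowSeconds;
        _buckets = new long[windowSeconds];
        _lastTickSeconds = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
    }

    public void Increment()
    {
        lock (_lock)
        {
            Advance();
            _buckets[_lastTickSeconds % _windowSeconds]++;   // O(1)
            _total++;
        }
    }

    public long GetCount()
    {
        lock (_lock)
        {
            Advance();
            return _total;
        }
    }

    // Zero out buckets that fell out of the window since the last call: O(k).
    private void Advance()
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
        if (now <= _lastTickSeconds) return;                 // same second, nothing to clear

        var elapsed = Math.Min(now - _lastTickSeconds, _windowSeconds);
        for (long i = 1; i <= elapsed; i++)
        {
            var index = (_lastTickSeconds + i) % _windowSeconds;
            _total -= _buckets[index];
            _buckets[index] = 0;
        }
        _lastTickSeconds = now;
    }
}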

Environment Limiter Performance

Target: <10ms P99 latency (including Valkey RTT)

Critical path: Every request that reaches the environment limiter makes a Valkey call.

Optimization: Activation Gate

Skip Valkey if local instance traffic < threshold:

if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
    // Skip expensive Valkey check
    return instanceDecision;
}

Effect: Reduces Valkey load by 80%+ in low-traffic scenarios.

Trade-off: Under threshold, environment limits not enforced. Acceptable if:

  • Each router instance threshold is set appropriately
  • Primary concern is high-traffic scenarios

Lua Script Performance

  • Single round-trip to Valkey (atomic)
  • Multiple INCR operations in single script (fast, no network)
  • TTL set only on first increment (optimization)

Valkey Sizing: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).


Valkey Integration

Connection Management

Use ConnectionMultiplexer from StackExchange.Redis:

var _connection = ConnectionMultiplexer.Connect(connectionString);
var _db = _connection.GetDatabase();

Important: ConnectionMultiplexer is thread-safe and expensive to create. Create ONCE per application, reuse everywhere.
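
A minimal reuse pattern, for example via a lazily created singleton; the host name here is a placeholder, and in the gateway this would typically be registered once in DI instead:

using StackExchange.Redis;

public static class ValkeyConnection
{
    // Created once on first use and shared by every caller for the process lifetime.
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new(() => ConnectionMultiplexer.Connect("valkey.stellaops.local:6379"));

    public static IDatabase Database => Connection.Value.GetDatabase();
}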

Lua Script Loading

Scripts loaded at startup and cached by SHA:

var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);

Persistence: Valkey caches scripts in memory. They survive across requests but NOT across restarts.

Recommendation: Load script at startup, store SHA, use ScriptEvaluateAsync(sha, ...) for all calls.
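
A sketch of the load-once / call-by-SHA pattern (the StackExchange.Redis calls are real; the script path, key layout, and return convention are assumptions):

using StackExchange.Redis;

public sealed class RateLimitScript
{
    private readonly IDatabase _db;
    private readonly byte[] _sha;

    public RateLimitScript(ConnectionMultiplexer connection, string scriptPath)
    {
        _db = connection.GetDatabase();
        var script = File.ReadAllText(scriptPath);
        var server = connection.GetServer(connection.GetEndPoints().First());
        _sha = server.ScriptLoad(script);               // SHA cached by Valkey until restart
    }

    public async Task<long> EvaluateAsync(string key, int windowSeconds, int maxRequests)
    {
        var result = await _db.ScriptEvaluateAsync(
            _sha,
            new RedisKey[] { key },
            new RedisValue[] { windowSeconds, maxRequests });

        return (long)result;                            // script's verdict / remaining budget
    }
}

After a Valkey restart the cached SHA is gone and EVALSHA fails with NOSCRIPT, so production code should be prepared to re-load the script (or route that error through the circuit breaker as a transient failure).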

Key Naming Strategy

Format: {bucket}:env:{service}:{rule_name}:{window_start}

Example: stella-router-rate-limit:env:concelier:per_second:1702821600

Why include window_start in key?

Fixed windows—each window is a separate key with TTL. When window expires, key auto-deleted.

Benefit: No manual cleanup, memory efficient.
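
The same window-start arithmetic in C#, for clarity; in the real flow the Lua script computes it from Valkey's TIME, as described in the next subsection, and the helper name here is illustrative:

// Fixed-window key: one key per (service, rule, window), each with a TTL.
static string BuildWindowKey(string bucket, string service, string ruleName,
    int windowSeconds, long unixNowSeconds)
{
    var windowStart = unixNowSeconds - (unixNowSeconds % windowSeconds);
    return $"{bucket}:env:{service}:{ruleName}:{windowStart}";
}

// BuildWindowKey("stella-router-rate-limit", "concelier", "per_second", 1, 1702821600)
// => "stella-router-rate-limit:env:concelier:per_second:1702821600"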

Clock Skew Handling

Problem: Different routers may have slightly different clocks, causing them to disagree on window boundaries.

Solution: Use Valkey server time (redis.call("TIME")) in Lua script, not client time.

local now = tonumber(redis.call("TIME")[1])  -- Valkey server time
local window_start = now - (now % window_sec)

Result: All routers agree on window boundaries (Valkey is source of truth).

Circuit Breaker Thresholds

  • failure_threshold: 5 consecutive failures before opening
  • timeout_seconds: 30 seconds before attempting half-open
  • half_open_timeout: 10 seconds to test one request

Tuning:

  • Lower failure_threshold = faster fail-open (more availability, less strict limiting)
  • Higher failure_threshold = tolerate more transient errors (stricter limiting)

Recommendation: Start with defaults, adjust based on Valkey stability.
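
A minimal consecutive-failure breaker matching these thresholds (illustrative only; it folds half_open_timeout into a single probe-on-expiry check rather than a separate timer):

public sealed class ValkeyCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openTimeout;
    private readonly object _lock = new();

    private int _consecutiveFailures;
    private DateTime _openedAtUtc;
    private bool _open;

    public ValkeyCircuitBreaker(int failureThreshold = 5, int timeoutSeconds = 30)
    {
        _failureThreshold = failureThreshold;
        _openTimeout = TimeSpan.FromSeconds(timeoutSeconds);
    }

    // True while open: callers skip the Valkey check entirely (fail-open).
    public bool IsOpen
    {
        get
        {
            lock (_lock)
            {
                if (_open && DateTime.UtcNow - _openedAtUtc >= _openTimeout)
                {
                    _open = false;   // half-open: the next Valkey call acts as the probe
                }
                return _open;
            }
        }
    }

    public void RecordSuccess()
    {
        lock (_lock)
        {
            _consecutiveFailures = 0;
            _open = false;
        }
    }

    public void RecordFailure()
    {
        lock (_lock)
        {
            if (++_consecutiveFailures >= _failureThreshold)
            {
                _open = true;
                _openedAtUtc = DateTime.UtcNow;
            }
        }
    }
}

A failed probe re-opens the breaker immediately (the failure counter is still at the threshold); a successful probe resets it and closes the breaker.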


Testing Strategy

Unit Tests (xUnit)

Coverage targets:

  • Configuration loading: 100%
  • Validation logic: 100%
  • Sliding window counter: 100%
  • Route matching: 100%
  • Inheritance resolution: 100%

Test patterns:

[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
    var counter = new SlidingWindowCounter(windowSeconds: 10);
    counter.Increment(); // count = 1

    // Simulate time passing (mock or Thread.Sleep in tests)
    AdvanceTime(11); // seconds

    Assert.Equal(0, counter.GetCount()); // Window expired, count reset
}

Integration Tests (TestServer + Testcontainers)

Valkey integration:

[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
    using var valkey = new ValkeyContainer();
    await valkey.StartAsync();

    var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
    var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);

    var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);

    // First 5 requests should pass
    for (int i = 0; i < 5; i++)
    {
        var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
        Assert.True(decision.Value.Allowed);
    }

    // 6th request should be denied
    var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
    Assert.False(deniedDecision.Value.Allowed);
    Assert.True(deniedDecision.Value.RetryAfterSeconds > 0); // Retry-After hint in seconds, not the HTTP status
}

Middleware integration:

[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
    using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
    var client = testServer.CreateClient();

    // Configure rate limit: 5 req/sec
    // Send 6 requests rapidly
    for (int i = 0; i < 6; i++)
    {
        var response = await client.GetAsync("/api/test");
        if (i < 5)
        {
            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
        else
        {
            Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
            Assert.True(response.Headers.Contains("Retry-After"));
        }
    }
}

Load Tests (k6)

Scenario A: Instance Limits

import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    instance_limit: {
      executor: 'constant-arrival-rate',
      rate: 100, // 100 req/sec
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 50,
    },
  },
};

export default function () {
  const res = http.get('http://router/api/test');
  check(res, {
    'status 200 or 429': (r) => r.status === 200 || r.status === 429,
    'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
  });
}

Scenario B: Environment Limits (Multi-Instance)

Run k6 from 5 different machines simultaneously → simulate 5 router instances → verify aggregate limit enforced.

Scenario E: Valkey Failure

Use Toxiproxy to inject network failures → verify circuit breaker opens → verify requests still allowed (fail-open).


Common Pitfalls

1. Forgetting to Update Middleware Pipeline Order

Problem: Rate limit middleware added AFTER routing decision → can't identify microservice.

Solution: Add rate limit middleware BEFORE routing decision:

app.UsePayloadLimits();
app.UseRateLimiting();        // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();

2. Circuit Breaker Never Closes

Problem: Circuit breaker opens, but never attempts recovery.

Cause: Half-open logic not implemented or timeout too long.

Solution: Implement half-open state with timeout:

if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow one test request
}

3. Lua Script Not Found at Runtime

Problem: Script file not copied to output directory.

Solution: Set file properties in .csproj:

<ItemGroup>
  <Content Include="RateLimit\Scripts\*.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>

4. Activation Gate Never Triggers

Problem: Activation counter not incremented on every request.

Cause: Counter incremented only when instance limit is enforced.

Solution: Increment activation counter ALWAYS, not just when checking limits:

public RateLimitDecision TryAcquire(string? microservice)
{
    _activationCounter.Increment(); // ALWAYS increment
    // ... rest of logic
}

5. Route Matching Case-Sensitivity Issues

Problem: /API/Scans doesn't match /api/scans.

Solution: Use case-insensitive comparisons:

string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)

6. Valkey Key Explosion

Problem: Too many keys in Valkey, memory usage high.

Cause: Forgetting to set TTL on keys.

Solution: ALWAYS set TTL when creating keys:

if count == 1 then
    redis.call("EXPIRE", key, window_sec + 2)
end

+2 buffer: Gives grace period to avoid edge cases.


Debugging Guide

Scenario 1: Requests Being Denied But Shouldn't Be

Steps:

  1. Check metrics: Which scope is denying? (instance or environment)
rate(stella_router_rate_limit_denied_total[1m])
  2. Check configured limits:
# View config
kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
  3. Check activation gate:
stella_router_rate_limit_activation_gate_enabled

If 0, activation gate is disabled—all requests hit Valkey.

  4. Check Valkey keys:
redis-cli -h valkey.stellaops.local
> KEYS stella-router-rate-limit:env:*
> TTL stella-router-rate-limit:env:concelier:per_second:1702821600
> GET stella-router-rate-limit:env:concelier:per_second:1702821600
  5. Check circuit breaker state:
stella_router_rate_limit_circuit_breaker_state{state="open"}

If 1, circuit breaker is open—env limits not enforced.

Scenario 2: Rate Limits Not Being Enforced

Steps:

  1. Verify middleware is registered:
// Check Startup.cs or Program.cs
app.UseRateLimiting(); // Should be present
  2. Verify configuration loaded:
// Add logging in RateLimitService constructor
_logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
    _config.ForInstance != null,
    _config.ForEnvironment != null);
  3. Check metrics—are requests even hitting rate limiter?
rate(stella_router_rate_limit_allowed_total[1m])

If 0, middleware not in pipeline or not being called.

  4. Check microservice identification:
// Add logging in middleware
var microservice = context.Items["RoutingTarget"] as string;
_logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);

If "unknown", routing metadata not set—rate limiter can't apply service-specific limits.

Scenario 3: Valkey Errors

Steps:

  1. Check circuit breaker metrics:
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
  2. Check Valkey connectivity:
redis-cli -h valkey.stellaops.local PING
  3. Check Lua script loaded:
redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
  4. Check Valkey logs for errors:
kubectl logs -f valkey-0 | grep ERROR
  5. Verify Lua script syntax:
redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua

Operational Runbook

Deployment Checklist

  • Valkey cluster healthy (check redis-cli PING)
  • Configuration validated (run stella-router validate-config)
  • Metrics scraping configured (Prometheus targets)
  • Dashboards imported (Grafana)
  • Alerts configured (Alertmanager)
  • Shadow mode enabled (limits set to 10x expected traffic)
  • Rollback plan documented

Monitoring Dashboards

Dashboard 1: Rate Limiting Overview

Panels:

  • Requests allowed vs denied (pie chart)
  • Denial rate by microservice (line graph)
  • Denial rate by route (heatmap)
  • Retry-After distribution (histogram)

Dashboard 2: Performance

Panels:

  • Decision latency P50/P95/P99 (instance vs environment)
  • Valkey call latency P95
  • Activation gate effectiveness (% skipped)

Dashboard 3: Health

Panels:

  • Circuit breaker state (gauge)
  • Valkey error rate
  • Most denied routes (top 10 table)

Alert Definitions

Critical:

- alert: RateLimitValkeyCriticalFailure
  expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
  for: 5m
  annotations:
    summary: "Rate limit circuit breaker open for >5min"
    description: "Valkey unavailable, environment limits not enforced"

- alert: RateLimitAllRequestsDenied
  expr: rate(stella_router_rate_limit_denied_total[1m]) / (rate(stella_router_rate_limit_allowed_total[1m]) + rate(stella_router_rate_limit_denied_total[1m])) > 0.99
  for: 1m
  annotations:
    summary: "100% denial rate"
    description: "Possible configuration error"

Warning:

- alert: RateLimitHighDenialRate
  expr: rate(stella_router_rate_limit_denied_total[5m]) / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m])) > 0.2
  for: 5m
  annotations:
    summary: ">20% requests denied"
    description: "High denial rate, check if expected"

- alert: RateLimitValkeyHighLatency
  expr: histogram_quantile(0.95, rate(stella_router_rate_limit_decision_latency_ms_bucket{scope="environment"}[5m])) > 100
  for: 5m
  annotations:
    summary: "Valkey latency >100ms P95"
    description: "Valkey performance degraded"

Tuning Guidelines

Scenario: Too many requests denied

  1. Check if denial rate is expected (traffic spike?)
  2. If not, increase limits:
    • Start with 2x current limits
    • Monitor for 24 hours
    • Adjust as needed

Scenario: Valkey overloaded

  1. Check ops/sec: redis-cli INFO stats | grep instantaneous_ops_per_sec
  2. If >50k ops/sec, consider:
    • Increase activation threshold (reduce Valkey calls)
    • Add Valkey replicas (read scaling)
    • Shard by microservice (write scaling)

Scenario: Circuit breaker flapping

  1. Check failure rate:
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
  2. If transient errors, increase failure_threshold
  3. If persistent errors, fix Valkey issue

Rollback Procedure

  1. Disable rate limiting:
rate_limiting:
  for_instance: null
  for_environment: null
  2. Deploy config update
  3. Verify traffic flows normally
  4. Investigate issue offline

References

  • Advisory: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + RetryAfter Backpressure Control.md
  • Master Sprint Tracker: docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md
  • Sprint Files: docs/implplan/SPRINT_1200_001_00X_*.md
  • HTTP 429 Semantics: RFC 6585
  • HTTP Retry-After: RFC 7231 Section 7.1.3
  • Valkey Documentation: https://valkey.io/docs/