# Router Rate Limiting - Implementation Guide

**For:** Implementation agents executing Sprint 1200_001_001 through 1200_001_006
**Last Updated:** 2025-12-17

---

## Purpose

This guide provides comprehensive technical context for implementing centralized rate limiting in Stella Router. It covers architecture decisions, patterns, gotchas, and operational considerations.

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Configuration Philosophy](#configuration-philosophy)
3. [Performance Considerations](#performance-considerations)
4. [Valkey Integration](#valkey-integration)
5. [Testing Strategy](#testing-strategy)
6. [Common Pitfalls](#common-pitfalls)
7. [Debugging Guide](#debugging-guide)
8. [Operational Runbook](#operational-runbook)

---

## Architecture Overview

### Design Principles

1. **Router-Centralized**: Rate limiting is a router responsibility, not a microservice responsibility
2. **Fail-Open**: Never block all traffic due to infrastructure failures
3. **Observable**: Every decision must be metrified
4. **Deterministic**: Same request at same time should get same decision (within window)
5. **Fair**: Use sliding windows where possible to avoid thundering herd

### Two-Tier Architecture

```
Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
                ↓ DENY                               ↓ DENY
          429 + Retry-After                    429 + Retry-After
```

**Why two tiers?**

- **Instance tier** protects the individual router process (CPU, memory, sockets)
- **Environment tier** protects the shared backend (aggregate across all routers)

Both are necessary—a single router can be overwhelmed locally even if aggregate traffic is low.

### Decision Flow

```
1. Extract microservice + route from request
2. Check instance limits (always, fast path)
   └─> DENY? Return 429
3. Check activation gate (local 5-min counter)
   └─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
   └─> Circuit breaker open? Skip (fail-open)
   └─> Valkey error? Skip (fail-open)
   └─> DENY? Return 429
5. Forward to upstream
```

---

## Configuration Philosophy

### Inheritance Model

```
Global Defaults
  └─> Environment Defaults
        └─> Microservice Overrides
              └─> Route Overrides (most specific)
```

**Replacement, not merge**: When a child level specifies limits, it REPLACES parent limits entirely.

**Example:**

```yaml
for_environment:
  per_seconds: 300
  max_requests: 30000        # Global default

microservices:
  scanner:
    per_seconds: 60
    max_requests: 600        # REPLACES global (not merged)
    routes:
      scan_submit:
        per_seconds: 10
        max_requests: 50     # REPLACES microservice (not merged)
```

Result:

- `POST /scanner/api/scans` → 50 req/10sec (route level)
- `GET /scanner/api/other` → 600 req/60sec (microservice level)
- `GET /policy/api/evaluate` → 30000 req/300sec (global level)

### Rule Stacking (AND Logic)

Multiple rules at the same level = ALL must pass.

```yaml
concelier:
  rules:
    - per_seconds: 1
      max_requests: 10       # Rule 1: 10/sec
    - per_seconds: 3600
      max_requests: 3000     # Rule 2: 3000/hour
```

Both rules are enforced. A request is denied if EITHER limit is exceeded.

### Sensible Defaults

If configuration is omitted:

- `for_instance`: No limits (effectively unlimited)
- `for_environment`: No limits
- `activation_threshold`: 5000 (skip Valkey if <5000 req/5min)
- `circuit_breaker.failure_threshold`: 5
- `circuit_breaker.timeout_seconds`: 30

**Recommendation**: Always configure at least global defaults.

---

## Performance Considerations

### Instance Limiter Performance

**Target:** <1ms P99 latency

**Implementation:** Sliding window with ring buffer.

```csharp
// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets;  // Ring buffer, size = window_seconds / granularity
long _total;      // Running sum
```

**Lock contention**: Single lock per counter. Acceptable for <10k req/sec per router.

**Memory**: ~24 bytes per window (array overhead + fields).
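The ring-buffer design above can be sketched in language-agnostic form. This is a Python model for illustration only—the production implementation is C#, and all names here are assumed, not taken from the codebase:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: one bucket per second plus a running sum.

    Illustrative sketch of the ring-buffer counter described in the guide:
    O(1) increment, O(k) advance where k = buckets that fell out of the window.
    """

    def __init__(self, window_seconds: int, clock=time.monotonic):
        self._buckets = [0] * window_seconds  # ring buffer, 1s granularity
        self._total = 0                       # running sum across live buckets
        self._window = window_seconds
        self._clock = clock                   # injectable for testing
        self._last_tick = int(clock())

    def _advance(self) -> None:
        now = int(self._clock())
        elapsed = now - self._last_tick
        if elapsed <= 0:
            return
        # Clear every bucket that slid out of the window (at most the whole ring).
        for i in range(min(elapsed, self._window)):
            idx = (self._last_tick + 1 + i) % self._window
            self._total -= self._buckets[idx]
            self._buckets[idx] = 0
        self._last_tick = now

    def increment(self) -> None:
        self._advance()
        self._buckets[self._last_tick % self._window] += 1
        self._total += 1

    def get_count(self) -> int:
        self._advance()
        return self._total
```

Injecting the clock keeps the counter deterministic in unit tests, which matches the "simulate time passing" pattern used in the Testing Strategy section.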
**Optimization**: For very high traffic (>50k req/sec), consider a lock-free implementation with `Interlocked` operations.

### Environment Limiter Performance

**Target:** <10ms P99 latency (including Valkey RTT)

**Critical path**: Every request to the environment limiter makes a Valkey call.

**Optimization: Activation Gate**

Skip Valkey if local instance traffic < threshold:

```csharp
if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
    // Skip expensive Valkey check
    return instanceDecision;
}
```

**Effect**: Reduces Valkey load by 80%+ in low-traffic scenarios.

**Trade-off**: Under the threshold, environment limits are not enforced. Acceptable if:

- Each router instance's threshold is set appropriately
- The primary concern is high-traffic scenarios

**Lua Script Performance**

- Single round-trip to Valkey (atomic)
- Multiple `INCR` operations in a single script (fast, no network)
- TTL set only on first increment (optimization)

**Valkey Sizing**: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).

---

## Valkey Integration

### Connection Management

Use `ConnectionMultiplexer` from StackExchange.Redis:

```csharp
var _connection = ConnectionMultiplexer.Connect(connectionString);
var _db = _connection.GetDatabase();
```

**Important**: `ConnectionMultiplexer` is thread-safe and expensive to create. Create it ONCE per application, reuse everywhere.

### Lua Script Loading

Scripts are loaded at startup and cached by SHA:

```csharp
var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);
```

**Persistence**: Valkey caches scripts in memory. They survive across requests but NOT across restarts.

**Recommendation**: Load the script at startup, store the SHA, and use `ScriptEvaluateAsync(sha, ...)` for all calls.
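The fixed-window check that the Lua script performs atomically can be modeled as follows. This is a hedged Python sketch, not the production script: a plain dict stands in for Valkey, and the function and parameter names are assumptions for illustration:

```python
def fixed_window_check(store: dict, ttls: dict, now: int,
                       service: str, rule: str, window_sec: int,
                       max_requests: int,
                       bucket: str = "stella-router-rate-limit"):
    """Model of the atomic fixed-window logic: INCR, EXPIRE on first
    increment (+2s grace), deny with a Retry-After hint when over limit.

    Returns (allowed, retry_after_seconds).
    """
    # All callers agree on the window boundary because `now` comes from
    # the server clock (TIME in the real script), not the client.
    window_start = now - (now % window_sec)
    key = f"{bucket}:env:{service}:{rule}:{window_start}"

    count = store.get(key, 0) + 1  # INCR equivalent
    store[key] = count
    if count == 1:
        # TTL set only on first increment; +2s grace past the window end.
        ttls[key] = window_sec + 2

    if count > max_requests:
        retry_after = window_start + window_sec - now
        return False, max(retry_after, 1)
    return True, 0
```

Because each window gets its own key with a TTL, expired windows clean themselves up—no sweeper process is needed.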
### Key Naming Strategy

Format: `{bucket}:env:{service}:{rule_name}:{window_start}`

Example: `stella-router-rate-limit:env:concelier:per_second:1702821600`

**Why include window_start in the key?** Fixed windows—each window is a separate key with a TTL. When the window expires, the key is auto-deleted.

**Benefit**: No manual cleanup; memory efficient.

### Clock Skew Handling

**Problem**: Different routers may have slightly different clocks, causing them to disagree on window boundaries.

**Solution**: Use Valkey server time (`redis.call("TIME")`) in the Lua script, not client time.

```lua
local now = tonumber(redis.call("TIME")[1])  -- Valkey server time
local window_start = now - (now % window_sec)
```

**Result**: All routers agree on window boundaries (Valkey is the source of truth).

### Circuit Breaker Thresholds

- **failure_threshold**: 5 consecutive failures before opening
- **timeout_seconds**: 30 seconds before attempting half-open
- **half_open_timeout**: 10 seconds to test one request

**Tuning**:

- Lower failure_threshold = faster fail-open (more availability, less strict limiting)
- Higher failure_threshold = tolerate more transient errors (stricter limiting)

**Recommendation**: Start with defaults, adjust based on Valkey stability.
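The breaker behavior above can be sketched as a small state machine. This Python model is illustrative only—the defaults mirror the guide's thresholds, but the class and method names are assumptions, not the production C# API:

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    """Fail-open breaker guarding Valkey calls (illustrative sketch).

    Simplification: a production breaker would admit exactly one probe
    while half-open; this sketch only models the state transitions.
    """

    def __init__(self, failure_threshold: int = 5,
                 timeout_seconds: float = 30.0, clock=time.monotonic):
        self.state = CircuitState.CLOSED
        self._failures = 0
        self._threshold = failure_threshold
        self._timeout = timeout_seconds
        self._opened_at = 0.0
        self._clock = clock  # injectable for testing

    def allow_call(self) -> bool:
        """True = attempt the Valkey call; False = skip it (fail-open)."""
        if self.state == CircuitState.OPEN:
            if self._clock() - self._opened_at >= self._timeout:
                self.state = CircuitState.HALF_OPEN  # let a probe through
                return True
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        if self.state == CircuitState.HALF_OPEN or self._failures >= self._threshold:
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()
```

Note the asymmetry: a failure during half-open reopens immediately, while reaching the threshold is required from the closed state—this is what prevents a single transient error from flapping the breaker.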
---

## Testing Strategy

### Unit Tests (xUnit)

**Coverage targets:**

- Configuration loading: 100%
- Validation logic: 100%
- Sliding window counter: 100%
- Route matching: 100%
- Inheritance resolution: 100%

**Test patterns:**

```csharp
[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
    var counter = new SlidingWindowCounter(windowSeconds: 10);
    counter.Increment();  // count = 1

    // Simulate time passing (mock or Thread.Sleep in tests)
    AdvanceTime(11);  // seconds

    Assert.Equal(0, counter.GetCount());  // Window expired, count reset
}
```

### Integration Tests (TestServer + Testcontainers)

**Valkey integration:**

```csharp
[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
    using var valkey = new ValkeyContainer();
    await valkey.StartAsync();

    var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
    var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);
    var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);

    // First 5 requests should pass
    for (int i = 0; i < 5; i++)
    {
        var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
        Assert.True(decision.Value.Allowed);
    }

    // 6th request should be denied, with a Retry-After hint
    var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
    Assert.False(deniedDecision.Value.Allowed);
    Assert.True(deniedDecision.Value.RetryAfterSeconds > 0);
}
```

**Middleware integration:**

```csharp
[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
    using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
    var client = testServer.CreateClient();

    // Configure rate limit: 5 req/sec
    // Send 6 requests rapidly
    for (int i = 0; i < 6; i++)
    {
        var response = await client.GetAsync("/api/test");
        if (i < 5)
        {
            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
        else
        {
            Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
            Assert.True(response.Headers.Contains("Retry-After"));
        }
    }
}
```

### Load Tests (k6)

**Scenario A: Instance Limits**

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    instance_limit: {
      executor: 'constant-arrival-rate',
      rate: 100,       // 100 req/sec
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 50,
    },
  },
};

export default function () {
  const res = http.get('http://router/api/test');
  check(res, {
    'status 200 or 429': (r) => r.status === 200 || r.status === 429,
    'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
  });
}
```

**Scenario B: Environment Limits (Multi-Instance)**

Run k6 from 5 different machines simultaneously → simulate 5 router instances → verify the aggregate limit is enforced.

**Scenario E: Valkey Failure**

Use Toxiproxy to inject network failures → verify the circuit breaker opens → verify requests are still allowed (fail-open).

---

## Common Pitfalls

### 1. Forgetting to Update Middleware Pipeline Order

**Problem**: Rate limit middleware added AFTER the routing decision → can't identify the microservice.

**Solution**: Add the rate limit middleware BEFORE the routing decision:

```csharp
app.UsePayloadLimits();
app.UseRateLimiting();       // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();
```

### 2. Circuit Breaker Never Closes

**Problem**: Circuit breaker opens, but never attempts recovery.

**Cause**: Half-open logic not implemented, or timeout too long.

**Solution**: Implement a half-open state with a timeout:

```csharp
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen;  // Allow one test request
}
```

### 3. Lua Script Not Found at Runtime

**Problem**: Script file not copied to the output directory.

**Solution**: Set the file's copy behavior in the `.csproj`:

```xml
<ItemGroup>
  <None Update="rate_limit_check.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </None>
</ItemGroup>
```

### 4. Activation Gate Never Triggers

**Problem**: Activation counter not incremented on every request.
**Cause**: Counter incremented only when the instance limit is enforced.

**Solution**: Increment the activation counter ALWAYS, not just when checking limits:

```csharp
public RateLimitDecision TryAcquire(string? microservice)
{
    _activationCounter.Increment();  // ALWAYS increment
    // ... rest of logic
}
```

### 5. Route Matching Case-Sensitivity Issues

**Problem**: `/API/Scans` doesn't match `/api/scans`.

**Solution**: Use case-insensitive comparisons:

```csharp
string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)
```

### 6. Valkey Key Explosion

**Problem**: Too many keys in Valkey; memory usage high.

**Cause**: Forgetting to set a TTL on keys.

**Solution**: ALWAYS set a TTL when creating keys:

```lua
if count == 1 then
  redis.call("EXPIRE", key, window_sec + 2)
end
```

**+2 buffer**: Gives a grace period to avoid edge cases at window boundaries.

---

## Debugging Guide

### Scenario 1: Requests Being Denied But Shouldn't Be

**Steps:**

1. Check metrics: which scope is denying? (instance or environment)

   ```promql
   rate(stella_router_rate_limit_denied_total[1m])
   ```

2. Check configured limits:

   ```bash
   # View config
   kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
   ```

3. Check the activation gate:

   ```promql
   stella_router_rate_limit_activation_gate_enabled
   ```

   If 0, the activation gate is disabled—all requests hit Valkey.

4. Check Valkey keys:

   ```bash
   redis-cli -h valkey.stellaops.local
   > KEYS stella-router-rate-limit:env:*
   > TTL stella-router-rate-limit:env:concelier:per_second:1702821600
   > GET stella-router-rate-limit:env:concelier:per_second:1702821600
   ```

5. Check the circuit breaker state:

   ```promql
   stella_router_rate_limit_circuit_breaker_state{state="open"}
   ```

   If 1, the circuit breaker is open—environment limits are not enforced.

### Scenario 2: Rate Limits Not Being Enforced

**Steps:**

1. Verify the middleware is registered:

   ```csharp
   // Check Startup.cs or Program.cs
   app.UseRateLimiting();  // Should be present
   ```

2. Verify the configuration loaded:

   ```csharp
   // Add logging in the RateLimitService constructor
   _logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
       _config.ForInstance != null, _config.ForEnvironment != null);
   ```

3. Check metrics—are requests even hitting the rate limiter?

   ```promql
   rate(stella_router_rate_limit_allowed_total[1m])
   ```

   If 0, the middleware is not in the pipeline or not being called.

4. Check microservice identification:

   ```csharp
   // Add logging in the middleware
   var microservice = context.Items["RoutingTarget"] as string;
   _logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);
   ```

   If "unknown", routing metadata is not set—the rate limiter can't apply service-specific limits.

### Scenario 3: Valkey Errors

**Steps:**

1. Check circuit breaker metrics:

   ```promql
   rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
   ```

2. Check Valkey connectivity:

   ```bash
   redis-cli -h valkey.stellaops.local PING
   ```

3. Check the Lua script is loaded (pass the SHA stored at startup):

   ```bash
   redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
   ```

4. Check Valkey logs for errors:

   ```bash
   kubectl logs -f valkey-0 | grep ERROR
   ```

5.
Verify the Lua script syntax:

   ```bash
   redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua
   ```

---

## Operational Runbook

### Deployment Checklist

- [ ] Valkey cluster healthy (check `redis-cli PING`)
- [ ] Configuration validated (run `stella-router validate-config`)
- [ ] Metrics scraping configured (Prometheus targets)
- [ ] Dashboards imported (Grafana)
- [ ] Alerts configured (Alertmanager)
- [ ] Shadow mode enabled (limits set 10x expected traffic)
- [ ] Rollback plan documented

### Monitoring Dashboards

**Dashboard 1: Rate Limiting Overview**

Panels:

- Requests allowed vs denied (pie chart)
- Denial rate by microservice (line graph)
- Denial rate by route (heatmap)
- Retry-After distribution (histogram)

**Dashboard 2: Performance**

Panels:

- Decision latency P50/P95/P99 (instance vs environment)
- Valkey call latency P95
- Activation gate effectiveness (% skipped)

**Dashboard 3: Health**

Panels:

- Circuit breaker state (gauge)
- Valkey error rate
- Most denied routes (top 10 table)

### Alert Definitions

**Critical:**

```yaml
- alert: RateLimitValkeyCriticalFailure
  expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
  for: 5m
  annotations:
    summary: "Rate limit circuit breaker open for >5min"
    description: "Valkey unavailable, environment limits not enforced"

- alert: RateLimitAllRequestsDenied
  expr: >
    rate(stella_router_rate_limit_denied_total[1m])
    / (rate(stella_router_rate_limit_allowed_total[1m]) + rate(stella_router_rate_limit_denied_total[1m]))
    > 0.99
  for: 1m
  annotations:
    summary: ">99% of requests denied"
    description: "Possible configuration error"
```

**Warning:**

```yaml
- alert: RateLimitHighDenialRate
  expr: >
    rate(stella_router_rate_limit_denied_total[5m])
    / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m]))
    > 0.2
  for: 5m
  annotations:
    summary: ">20% requests denied"
    description: "High denial rate, check if expected"

- alert: RateLimitValkeyHighLatency
  expr: >
    histogram_quantile(0.95,
      rate(stella_router_rate_limit_decision_latency_ms_bucket{scope="environment"}[5m]))
    > 100
  for: 5m
  annotations:
    summary: "Valkey latency >100ms P95"
    description: "Valkey performance degraded"
```

### Tuning Guidelines

**Scenario: Too many requests denied**

1. Check whether the denial rate is expected (traffic spike?)
2. If not, increase limits:
   - Start with 2x current limits
   - Monitor for 24 hours
   - Adjust as needed

**Scenario: Valkey overloaded**

1. Check ops/sec: `redis-cli INFO stats | grep instantaneous_ops_per_sec`
2. If >50k ops/sec, consider:
   - Increasing the activation threshold (fewer Valkey calls)
   - Adding Valkey replicas (read scaling)
   - Sharding by microservice (write scaling)

**Scenario: Circuit breaker flapping**

1. Check the failure rate:

   ```promql
   rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
   ```

2. If errors are transient, increase failure_threshold
3. If errors are persistent, fix the Valkey issue

### Rollback Procedure

1. Disable rate limiting:

   ```yaml
   rate_limiting:
     for_instance: null
     for_environment: null
   ```

2. Deploy the config update
3. Verify traffic flows normally
4. Investigate the issue offline

---

## References

- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
- **Master Sprint Tracker:** `docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md`
- **Sprint Files:** `docs/implplan/SPRINT_1200_001_00X_*.md`
- **HTTP 429 Semantics:** RFC 6585
- **HTTP Retry-After:** RFC 7231, Section 7.1.3
- **Valkey Documentation:** https://valkey.io/docs/