Router Rate Limiting - Implementation Guide
For: Implementation agents / reviewers for Sprint 1200_001_001 through 1200_001_006
Status: DOING (Sprints 1–3 DONE; Sprint 4 closed N/A; Sprints 5–6 in progress)
Evidence: src/__Libraries/StellaOps.Router.Gateway/RateLimit/, tests/StellaOps.Router.Gateway.Tests/
Last Updated: 2025-12-17
Purpose
This guide provides comprehensive technical context for centralized rate limiting in Stella Router (design + operational considerations). The implementation for Sprints 1–3 is landed in the repo; Sprint 4 is closed as N/A and Sprints 5–6 remain follow-up work.
Table of Contents
- Architecture Overview
- Configuration Philosophy
- Performance Considerations
- Valkey Integration
- Testing Strategy
- Common Pitfalls
- Debugging Guide
- Operational Runbook
Architecture Overview
Design Principles
- Router-Centralized: Rate limiting is a router responsibility, not a microservice responsibility
- Fail-Open: Never block all traffic due to infrastructure failures
- Observable: Every decision must be metrified
- Deterministic: Same request at same time should get same decision (within window)
- Fair: Use sliding windows where possible to avoid thundering herd
Two-Tier Architecture
Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
                   ↓ DENY                                  ↓ DENY
             429 + Retry-After                       429 + Retry-After
Why two tiers?
- Instance tier protects individual router process (CPU, memory, sockets)
- Environment tier protects shared backend (aggregate across all routers)
Both are necessary—single router can be overwhelmed locally even if aggregate traffic is low.
Decision Flow
1. Extract microservice + route from request
2. Check instance limits (always, fast path)
   └─> DENY? Return 429
3. Check activation gate (local 5-min counter)
   └─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
   └─> Circuit breaker open? Skip (fail-open)
   └─> Valkey error? Skip (fail-open)
   └─> DENY? Return 429
5. Forward to upstream
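The five steps above can be sketched as a single decision function. This is an illustrative Python sketch, not the gateway's C# code; `check_request` and its callable parameters are hypothetical stand-ins for the real components:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    retry_after_seconds: int = 0

def check_request(instance_check, gate_active, breaker_open, env_check):
    # 1-2. Instance limits: always checked, fast in-memory path.
    inst = instance_check()
    if not inst.allowed:
        return inst  # 429 + Retry-After

    # 3. Activation gate: skip the Valkey round-trip under low local traffic.
    if not gate_active():
        return inst

    # 4. Environment limits: fail-open when the breaker is open or Valkey errors.
    if breaker_open():
        return inst
    try:
        env = env_check()
    except ConnectionError:
        return inst  # fail-open
    if not env.allowed:
        return env  # 429 + Retry-After

    # 5. Forward to upstream.
    return Decision(allowed=True)
```

Note how every failure path in step 4 falls back to the instance decision rather than denying: that is the fail-open principle from the design list above.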
Configuration Philosophy
Inheritance Model
Global Defaults
 └─> Environment Defaults
      └─> Microservice Overrides
           └─> Route Overrides (most specific)
Replacement, not merge: When a child level specifies limits, it REPLACES parent limits entirely.
Example:
for_environment:
  per_seconds: 300
  max_requests: 30000      # Global default
microservices:
  scanner:
    per_seconds: 60
    max_requests: 600      # REPLACES global (not merged)
    routes:
      scan_submit:
        per_seconds: 10
        max_requests: 50   # REPLACES microservice (not merged)
Result:
- POST /scanner/api/scans → 50 req/10sec (route level)
- GET /scanner/api/other → 600 req/60sec (microservice level)
- GET /policy/api/evaluate → 30000 req/300sec (global level)
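The most-specific-wins resolution with replacement semantics can be sketched as follows (illustrative Python; the dict mirrors the YAML above, and `resolve_limits` is a hypothetical helper, not the repo's resolver):

```python
def resolve_limits(config, service=None, route=None):
    """Return the most specific limit set; child levels REPLACE parents entirely."""
    limits = config.get("for_environment")          # global default
    svc = config.get("microservices", {}).get(service, {})
    if "limits" in svc:
        limits = svc["limits"]                      # replaces global (not merged)
    rt = svc.get("routes", {}).get(route, {})
    if rt:
        limits = rt                                 # replaces microservice (not merged)
    return limits

config = {
    "for_environment": {"per_seconds": 300, "max_requests": 30000},
    "microservices": {
        "scanner": {
            "limits": {"per_seconds": 60, "max_requests": 600},
            "routes": {"scan_submit": {"per_seconds": 10, "max_requests": 50}},
        }
    },
}
```

Because each level replaces rather than merges, a route override must restate every field it needs; nothing is inherited field by field.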
Rule Stacking (AND Logic)
Multiple rules at same level = ALL must pass.
concelier:
  rules:
    - per_seconds: 1
      max_requests: 10     # Rule 1: 10/sec
    - per_seconds: 3600
      max_requests: 3000   # Rule 2: 3000/hour
Both rules enforced. Request denied if EITHER limit exceeded.
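The AND semantics reduce to a simple all-rules check (illustrative Python; `allowed_under_all` and its inputs are hypothetical names, with per-window counts assumed to be tracked elsewhere):

```python
def allowed_under_all(rules, counts):
    """AND logic: the request passes only if EVERY rule's window is under its limit.

    counts[i] is the current request count in rule i's window, including this request.
    """
    return all(count <= rule["max_requests"] for rule, count in zip(rules, counts))

rules = [
    {"per_seconds": 1, "max_requests": 10},       # Rule 1: 10/sec (burst)
    {"per_seconds": 3600, "max_requests": 3000},  # Rule 2: 3000/hour (sustained)
]
```

This pairing is the common pattern: a tight short window to cap bursts plus a loose long window to cap sustained volume.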
Sensible Defaults
If configuration omitted:
- for_instance: No limits (effectively unlimited)
- for_environment: No limits
- activation_threshold: 5000 (skip Valkey if <5000 req/5min)
- circuit_breaker.failure_threshold: 5
- circuit_breaker.timeout_seconds: 30
Recommendation: Always configure at least global defaults.
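For illustration, a minimal global-defaults block might look like the following. The key names follow the examples in this guide, but the exact schema is an assumption; verify it against the repo before use:

```yaml
rate_limiting:
  for_instance:
    per_seconds: 60
    max_requests: 6000        # per router process
  for_environment:
    per_seconds: 300
    max_requests: 30000       # aggregate across all routers
  activation_threshold: 5000  # skip Valkey below 5000 req/5min
  circuit_breaker:
    failure_threshold: 5
    timeout_seconds: 30
```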
Performance Considerations
Instance Limiter Performance
Target: <1ms P99 latency
Implementation: Sliding window with ring buffer.
// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets; // Ring buffer, size = window_seconds / granularity
long _total; // Running sum
Lock contention: Single lock per counter. Acceptable for <10k req/sec per router.
Memory: ~8 bytes per bucket plus ~24 bytes of fixed fields, i.e. roughly 0.5 KiB for a 60-bucket window.
Optimization: For very high traffic (>50k req/sec), consider lock-free implementation with Interlocked operations.
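For reference, the ring-buffer idea can be written out end to end. This is a minimal single-threaded Python sketch of the same O(1)-increment, O(k)-advance design; the production counter is C# with a lock per counter, and takes time from a clock rather than as a parameter:

```python
class SlidingWindowCounter:
    """Approximate sliding window: ring buffer of per-bucket counts plus a running sum."""

    def __init__(self, window_seconds, granularity=1):
        self.granularity = granularity
        self.size = window_seconds // granularity
        self.buckets = [0] * self.size
        self.total = 0
        self.last_tick = None

    def _advance(self, now):
        tick = int(now // self.granularity)
        if self.last_tick is None:
            self.last_tick = tick
            return
        # Clear every bucket that fell out of the window: O(k), k = buckets cleared.
        for t in range(self.last_tick + 1, min(tick, self.last_tick + self.size) + 1):
            idx = t % self.size
            self.total -= self.buckets[idx]
            self.buckets[idx] = 0
        self.last_tick = max(self.last_tick, tick)

    def increment(self, now):
        self._advance(now)
        self.buckets[self.last_tick % self.size] += 1
        self.total += 1

    def get_count(self, now):
        self._advance(now)
        return self.total
```

The running sum is what makes increments O(1): the window total never needs to be recomputed by scanning all buckets.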
Environment Limiter Performance
Target: <10ms P99 latency (including Valkey RTT)
Critical path: Every request to environment limiter makes a Valkey call.
Optimization: Activation Gate
Skip Valkey if local instance traffic < threshold:
if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
    // Skip expensive Valkey check
    return instanceDecision;
}
Effect: Reduces Valkey load by 80%+ in low-traffic scenarios.
Trade-off: Under threshold, environment limits not enforced. Acceptable if:
- Each router instance threshold is set appropriately
- Primary concern is high-traffic scenarios
Lua Script Performance
- Single round-trip to Valkey (atomic)
- Multiple INCR operations in a single script (fast, no network round-trips)
- TTL set only on first increment (optimization)
Valkey Sizing: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).
Valkey Integration
Connection Management
Use ConnectionMultiplexer from StackExchange.Redis:
var _connection = ConnectionMultiplexer.Connect(connectionString);
var _db = _connection.GetDatabase();
Important: ConnectionMultiplexer is thread-safe and expensive to create. Create ONCE per application, reuse everywhere.
Lua Script Loading
Scripts loaded at startup and cached by SHA:
var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);
Persistence: Valkey caches scripts in memory. They survive across requests but NOT across restarts.
Recommendation: Load script at startup, store SHA, use ScriptEvaluateAsync(sha, ...) for all calls.
Key Naming Strategy
Format: {bucket}:env:{service}:{rule_name}:{window_start}
Example: stella-router-rate-limit:env:concelier:per_second:1702821600
Why include window_start in key?
Fixed windows—each window is a separate key with TTL. When window expires, key auto-deleted.
Benefit: No manual cleanup, memory efficient.
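Key construction can be sketched as follows (illustrative Python; in practice the window computation happens inside the Lua script using Valkey server time, as described below):

```python
def window_key(bucket, service, rule_name, now, window_sec):
    # Fixed windows: align the timestamp down to the window boundary, so every
    # router computing a key for the same instant produces the same key.
    window_start = now - (now % window_sec)
    return f"{bucket}:env:{service}:{rule_name}:{window_start}"
```

A new window means a new key; the old key simply expires via its TTL, which is why no manual cleanup is needed.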
Clock Skew Handling
Problem: Different routers may have slightly different clocks, causing them to disagree on window boundaries.
Solution: Use Valkey server time (redis.call("TIME")) in Lua script, not client time.
local now = tonumber(redis.call("TIME")[1]) -- Valkey server time
local window_start = now - (now % window_sec)
Result: All routers agree on window boundaries (Valkey is source of truth).
Circuit Breaker Thresholds
- failure_threshold: 5 consecutive failures before opening
- timeout_seconds: 30 seconds before attempting half-open
- half_open_timeout: 10 seconds to test one request
Tuning:
- Lower failure_threshold = faster fail-open (more availability, less strict limiting)
- Higher failure_threshold = tolerate more transient errors (stricter limiting)
Recommendation: Start with defaults, adjust based on Valkey stability.
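The breaker described above can be sketched as a small state machine (minimal Python sketch with the default thresholds; the real implementation is the C# gateway's, and the injectable `clock` parameter is an illustrative convenience for testing):

```python
import time

class CircuitBreaker:
    """Closed -> Open after N consecutive failures; Open -> HalfOpen after a
    timeout; HalfOpen -> Closed on one success, back to Open on failure."""

    def __init__(self, failure_threshold=5, timeout_seconds=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow_call(self):
        if self.state == "open" and self.clock() - self.opened_at >= self.timeout_seconds:
            self.state = "half_open"  # allow one probe request
        return self.state != "open"

    def record_success(self):
        self.state = "closed"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0
```

When allow_call returns False the caller skips Valkey entirely and falls back to the instance decision (fail-open).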
Testing Strategy
Unit Tests (xUnit)
Coverage targets:
- Configuration loading: 100%
- Validation logic: 100%
- Sliding window counter: 100%
- Route matching: 100%
- Inheritance resolution: 100%
Test patterns:
[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
    var counter = new SlidingWindowCounter(windowSeconds: 10);
    counter.Increment(); // count = 1
    // Simulate time passing (mock or Thread.Sleep in tests)
    AdvanceTime(11); // seconds
    Assert.Equal(0, counter.GetCount()); // Window expired, count reset
}
Integration Tests (TestServer + Testcontainers)
Valkey integration:
[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
    using var valkey = new ValkeyContainer();
    await valkey.StartAsync();
    var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
    var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);
    var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);
    // First 5 requests should pass
    for (int i = 0; i < 5; i++)
    {
        var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
        Assert.True(decision.Value.Allowed);
    }
    // 6th request should be denied, with a Retry-After hint
    var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
    Assert.False(deniedDecision.Value.Allowed);
    Assert.True(deniedDecision.Value.RetryAfterSeconds > 0);
}
Middleware integration:
[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
    using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
    var client = testServer.CreateClient();
    // Configure rate limit: 5 req/sec
    // Send 6 requests rapidly
    for (int i = 0; i < 6; i++)
    {
        var response = await client.GetAsync("/api/test");
        if (i < 5)
        {
            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
        else
        {
            Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
            Assert.True(response.Headers.Contains("Retry-After"));
        }
    }
}
Load Tests (k6)
Scenario A: Instance Limits
import http from 'k6/http';
import { check } from 'k6';
export const options = {
  scenarios: {
    instance_limit: {
      executor: 'constant-arrival-rate',
      rate: 100, // 100 req/sec
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 50,
    },
  },
};
export default function () {
  const res = http.get('http://router/api/test');
  check(res, {
    'status 200 or 429': (r) => r.status === 200 || r.status === 429,
    'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
  });
}
Scenario B: Environment Limits (Multi-Instance)
Run k6 from 5 different machines simultaneously → simulate 5 router instances → verify aggregate limit enforced.
Scenario E: Valkey Failure
Use Toxiproxy to inject network failures → verify circuit breaker opens → verify requests still allowed (fail-open).
Common Pitfalls
1. Forgetting to Update Middleware Pipeline Order
Problem: Rate limit middleware added AFTER routing decision → can't identify microservice.
Solution: Add rate limit middleware BEFORE routing decision:
app.UsePayloadLimits();
app.UseRateLimiting(); // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();
2. Circuit Breaker Never Closes
Problem: Circuit breaker opens, but never attempts recovery.
Cause: Half-open logic not implemented or timeout too long.
Solution: Implement half-open state with timeout:
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow one test request
}
3. Lua Script Not Found at Runtime
Problem: Script file not copied to output directory.
Solution: Set file properties in .csproj:
<ItemGroup>
  <Content Include="RateLimit\Scripts\*.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>
4. Activation Gate Never Triggers
Problem: Activation counter not incremented on every request.
Cause: Counter incremented only when instance limit is enforced.
Solution: Increment activation counter ALWAYS, not just when checking limits:
public RateLimitDecision TryAcquire(string? microservice)
{
    _activationCounter.Increment(); // ALWAYS increment
    // ... rest of logic
}
5. Route Matching Case-Sensitivity Issues
Problem: /API/Scans doesn't match /api/scans.
Solution: Use case-insensitive comparisons:
string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)
6. Valkey Key Explosion
Problem: Too many keys in Valkey, memory usage high.
Cause: Forgetting to set TTL on keys.
Solution: ALWAYS set TTL when creating keys:
if count == 1 then
  redis.call("EXPIRE", key, window_sec + 2)
end
+2 buffer: Gives grace period to avoid edge cases.
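Putting the pieces together, the atomic fixed-window check the Lua script performs (window-aligned key, INCR, TTL on first increment, Retry-After hint on denial) can be mirrored as follows. This is an illustrative Python sketch against a plain dict standing in for Valkey; the real version runs atomically inside Valkey via the Lua script:

```python
def fixed_window_check(store, ttls, key_prefix, now, window_sec, max_requests):
    """One rule's check: increment the current window's counter, set a TTL when
    the key is created, deny with a Retry-After hint once the limit is exceeded."""
    window_start = now - (now % window_sec)
    key = f"{key_prefix}:{window_start}"
    count = store.get(key, 0) + 1        # INCR
    store[key] = count
    if count == 1:
        ttls[key] = window_sec + 2       # EXPIRE with the +2s grace buffer
    if count > max_requests:
        retry_after = window_start + window_sec - now
        return False, retry_after
    return True, 0
```

Note the Retry-After hint: the remaining time in the current window, which is exactly when the next window's fresh counter becomes available.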
Debugging Guide
Scenario 1: Requests Being Denied But Shouldn't Be
Steps:
- Check metrics: Which scope is denying? (instance or environment)
rate(stella_router_rate_limit_denied_total[1m])
- Check configured limits:
# View config
kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
- Check activation gate:
stella_router_rate_limit_activation_gate_enabled
If 0, activation gate is disabled—all requests hit Valkey.
- Check Valkey keys:
redis-cli -h valkey.stellaops.local
> SCAN 0 MATCH stella-router-rate-limit:env:* COUNT 100   (prefer SCAN; KEYS blocks Valkey on large keyspaces)
> TTL stella-router-rate-limit:env:concelier:per_second:1702821600
> GET stella-router-rate-limit:env:concelier:per_second:1702821600
- Check circuit breaker state:
stella_router_rate_limit_circuit_breaker_state{state="open"}
If 1, circuit breaker is open—env limits not enforced.
Scenario 2: Rate Limits Not Being Enforced
Steps:
- Verify middleware is registered:
// Check Startup.cs or Program.cs
app.UseRateLimiting(); // Should be present
- Verify configuration loaded:
// Add logging in RateLimitService constructor
_logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
    _config.ForInstance != null,
    _config.ForEnvironment != null);
- Check metrics—are requests even hitting rate limiter?
rate(stella_router_rate_limit_allowed_total[1m])
If 0, middleware not in pipeline or not being called.
- Check microservice identification:
// Add logging in middleware
var microservice = context.Items["RoutingTarget"] as string;
_logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);
If "unknown", routing metadata not set—rate limiter can't apply service-specific limits.
Scenario 3: Valkey Errors
Steps:
- Check circuit breaker metrics:
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
- Check Valkey connectivity:
redis-cli -h valkey.stellaops.local PING
- Check Lua script loaded:
redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
- Check Valkey logs for errors:
kubectl logs -f valkey-0 | grep ERROR
- Verify Lua script syntax:
redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua
Operational Runbook
Deployment Checklist
- Valkey cluster healthy (check redis-cli PING)
- Configuration validated (run stella-router validate-config)
- Metrics scraping configured (Prometheus targets)
- Dashboards imported (Grafana)
- Alerts configured (Alertmanager)
- Shadow mode enabled (limits set 10x expected traffic)
- Rollback plan documented
Monitoring Dashboards
Dashboard 1: Rate Limiting Overview
Panels:
- Requests allowed vs denied (pie chart)
- Denial rate by microservice (line graph)
- Denial rate by route (heatmap)
- Retry-After distribution (histogram)
Dashboard 2: Performance
Panels:
- Decision latency P50/P95/P99 (instance vs environment)
- Valkey call latency P95
- Activation gate effectiveness (% skipped)
Dashboard 3: Health
Panels:
- Circuit breaker state (gauge)
- Valkey error rate
- Most denied routes (top 10 table)
Alert Definitions
Critical:
- alert: RateLimitValkeyCriticalFailure
  expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
  for: 5m
  annotations:
    summary: "Rate limit circuit breaker open for >5min"
    description: "Valkey unavailable, environment limits not enforced"
- alert: RateLimitAllRequestsDenied
  expr: rate(stella_router_rate_limit_denied_total[1m]) / rate(stella_router_rate_limit_allowed_total[1m]) > 0.99
  for: 1m
  annotations:
    summary: "100% denial rate"
    description: "Possible configuration error"
Warning:
- alert: RateLimitHighDenialRate
  expr: rate(stella_router_rate_limit_denied_total[5m]) / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m])) > 0.2
  for: 5m
  annotations:
    summary: ">20% requests denied"
    description: "High denial rate, check if expected"
- alert: RateLimitValkeyHighLatency
  expr: histogram_quantile(0.95, stella_router_rate_limit_decision_latency_ms{scope="environment"}) > 100
  for: 5m
  annotations:
    summary: "Valkey latency >100ms P95"
    description: "Valkey performance degraded"
Tuning Guidelines
Scenario: Too many requests denied
- Check if denial rate is expected (traffic spike?)
- If not, increase limits:
- Start with 2x current limits
- Monitor for 24 hours
- Adjust as needed
Scenario: Valkey overloaded
- Check ops/sec: redis-cli INFO stats | grep instantaneous_ops_per_sec
- If >50k ops/sec, consider:
- Increase activation threshold (reduce Valkey calls)
- Add Valkey replicas (read scaling)
- Shard by microservice (write scaling)
Scenario: Circuit breaker flapping
- Check failure rate:
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
- If transient errors, increase failure_threshold
- If persistent errors, fix Valkey issue
Rollback Procedure
- Disable rate limiting:
rate_limiting:
  for_instance: null
  for_environment: null
- Deploy config update
- Verify traffic flows normally
- Investigate issue offline
References
- Advisory: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md
- Master Sprint Tracker: docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md
- Sprint Files: docs/implplan/SPRINT_1200_001_00X_*.md
- HTTP 429 Semantics: RFC 6585
- HTTP Retry-After: RFC 7231 Section 7.1.3
- Valkey Documentation: https://valkey.io/docs/