feat(rate-limiting): Implement core rate limiting functionality with configuration, decision-making, metrics, middleware, and service registration

- Add RateLimitConfig for configuration management with YAML binding support.
- Introduce RateLimitDecision to encapsulate the result of rate limit checks.
- Implement RateLimitMetrics for OpenTelemetry metrics tracking.
- Create RateLimitMiddleware for enforcing rate limits on incoming requests.
- Develop RateLimitService to orchestrate instance and environment rate limit checks.
- Add RateLimitServiceCollectionExtensions for dependency injection registration.
This commit is contained in: master
2025-12-17 18:02:37 +02:00
parent 394b57f6bf
commit 8bbfe4d2d2
211 changed files with 47179 additions and 1590 deletions

# Router Rate Limiting - Implementation Guide
**For:** Implementation agents executing Sprint 1200_001_001 through 1200_001_006
**Last Updated:** 2025-12-17
---
## Purpose
This guide provides comprehensive technical context for implementing centralized rate limiting in Stella Router. It covers architecture decisions, patterns, gotchas, and operational considerations.
---
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Configuration Philosophy](#configuration-philosophy)
3. [Performance Considerations](#performance-considerations)
4. [Valkey Integration](#valkey-integration)
5. [Testing Strategy](#testing-strategy)
6. [Common Pitfalls](#common-pitfalls)
7. [Debugging Guide](#debugging-guide)
8. [Operational Runbook](#operational-runbook)
---
## Architecture Overview
### Design Principles
1. **Router-Centralized**: Rate limiting is a router responsibility, not a microservice responsibility
2. **Fail-Open**: Never block all traffic due to infrastructure failures
3. **Observable**: Every decision must emit metrics
4. **Deterministic**: Same request at same time should get same decision (within window)
5. **Fair**: Use sliding windows where possible to avoid thundering herd
### Two-Tier Architecture
```
Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
              ↓ DENY                               ↓ DENY
        429 + Retry-After                    429 + Retry-After
```
**Why two tiers?**
- **Instance tier** protects individual router process (CPU, memory, sockets)
- **Environment tier** protects shared backend (aggregate across all routers)
Both are necessary: a single router can be overwhelmed locally even if aggregate traffic is low.
### Decision Flow
```
1. Extract microservice + route from request
2. Check instance limits (always, fast path)
└─> DENY? Return 429
3. Check activation gate (local 5-min counter)
└─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
└─> Circuit breaker open? Skip (fail-open)
└─> Valkey error? Skip (fail-open)
└─> DENY? Return 429
5. Forward to upstream
```
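The five steps above can be sketched as a single decision routine. This is a hedged illustration, not the actual middleware: every type and member name here (`RateLimitPipeline`, `Decision`, the injected delegates) is hypothetical, and the real implementation lives in `RateLimitMiddleware`/`RateLimitService`.

```csharp
using System;

// Illustrative sketch of the decision flow; all names are hypothetical.
public sealed record Decision(bool Allowed, int RetryAfterSeconds = 0);

public sealed class RateLimitPipeline
{
    // Dependencies injected as delegates so the flow itself stays visible.
    public Func<string, Decision> CheckInstance = _ => new Decision(true);
    public Func<long> ActivationCount = () => 0;       // local 5-min counter
    public long ActivationThreshold = 5000;
    public Func<bool> CircuitOpen = () => false;
    public Func<string, Decision?> CheckEnvironment = _ => new Decision(true); // null = Valkey error

    public Decision Decide(string microservice)
    {
        var instance = CheckInstance(microservice);     // step 2: always, fast path
        if (!instance.Allowed) return instance;         // DENY -> 429

        if (ActivationCount() < ActivationThreshold)    // step 3: activation gate
            return instance;                            // below threshold: skip env check

        if (CircuitOpen()) return instance;             // step 4: breaker open -> fail-open
        var env = CheckEnvironment(microservice);
        if (env is null) return instance;               // Valkey error -> fail-open
        return env;                                     // DENY -> 429, else forward (step 5)
    }
}
```

Note how every failure path in step 4 degrades to the instance-tier decision rather than blocking traffic, which is the fail-open principle from the design section.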
---
## Configuration Philosophy
### Inheritance Model
```
Global Defaults
└─> Environment Defaults
└─> Microservice Overrides
└─> Route Overrides (most specific)
```
**Replacement, not merge**: When a child level specifies limits, it REPLACES parent limits entirely.
**Example:**
```yaml
for_environment:
per_seconds: 300
max_requests: 30000 # Global default
microservices:
scanner:
per_seconds: 60
max_requests: 600 # REPLACES global (not merged)
routes:
scan_submit:
per_seconds: 10
max_requests: 50 # REPLACES microservice (not merged)
```
Result:
- `POST /scanner/api/scans` → 50 req/10sec (route level)
- `GET /scanner/api/other` → 600 req/60sec (microservice level)
- `GET /policy/api/evaluate` → 30000 req/300sec (global level)
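The replacement semantics can be expressed as a first-match lookup from most to least specific. A minimal sketch, assuming hypothetical `Limits`/`LimitResolver` types (the real config classes live in `RateLimitConfig`):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only.
public sealed record Limits(int PerSeconds, int MaxRequests);

public static class LimitResolver
{
    // Replacement, not merge: the most specific level that defines limits wins outright.
    public static Limits Resolve(
        Limits globalDefault,
        IReadOnlyDictionary<string, Limits> microserviceOverrides,
        IReadOnlyDictionary<(string Service, string Route), Limits> routeOverrides,
        string service, string route)
    {
        if (routeOverrides.TryGetValue((service, route), out var routeLimits))
            return routeLimits;                 // route level REPLACES everything above it
        if (microserviceOverrides.TryGetValue(service, out var serviceLimits))
            return serviceLimits;               // microservice level REPLACES global
        return globalDefault;                   // fall back to the global default
    }
}
```

Because each level replaces rather than merges, a route override must restate every limit it needs, not just the fields it wants to change.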
### Rule Stacking (AND Logic)
Multiple rules at same level = ALL must pass.
```yaml
concelier:
rules:
- per_seconds: 1
max_requests: 10 # Rule 1: 10/sec
- per_seconds: 3600
max_requests: 3000 # Rule 2: 3000/hour
```
Both rules enforced. Request denied if EITHER limit exceeded.
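The AND logic reduces to "every stacked rule must have headroom." A sketch with hypothetical `Rule`/`RuleStack` names, where `counts[i]` is the number of requests already seen in rule *i*'s current window:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: a request passes only if EVERY stacked rule has headroom.
public sealed record Rule(int PerSeconds, int MaxRequests);

public static class RuleStack
{
    // counts[i] = requests already counted in rule i's current window.
    public static bool Allowed(IReadOnlyList<Rule> rules, IReadOnlyList<long> counts)
        => rules.Select((rule, i) => counts[i] < rule.MaxRequests).All(ok => ok);
}
```

With the concelier example above, a client at 5 req/sec but 3000 requests this hour is denied by the hourly rule even though the per-second rule still has room.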
### Sensible Defaults
If configuration omitted:
- `for_instance`: No limits (effectively unlimited)
- `for_environment`: No limits
- `activation_threshold`: 5000 (skip Valkey if <5000 req/5min)
- `circuit_breaker.failure_threshold`: 5
- `circuit_breaker.timeout_seconds`: 30
**Recommendation**: Always configure at least global defaults.
---
## Performance Considerations
### Instance Limiter Performance
**Target:** <1ms P99 latency
**Implementation:** Sliding window with ring buffer.
```csharp
// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets; // Ring buffer, size = window_seconds / granularity
long _total; // Running sum
```
**Lock contention**: Single lock per counter. Acceptable for <10k req/sec per router.
**Memory**: one `long` per bucket, so roughly 8 bytes × (window_seconds / granularity) per window, plus a small fixed per-counter overhead.
**Optimization**: For very high traffic (>50k req/sec), consider lock-free implementation with `Interlocked` operations.
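The ring-buffer counter described above can be sketched as follows. Assumptions not fixed by the source: 1-second bucket granularity and a caller-supplied clock (the real implementation may read time internally).

```csharp
using System;

// Minimal sketch of the sliding-window ring buffer; names are illustrative.
public sealed class SlidingWindowCounter
{
    private readonly long[] _buckets;   // one bucket per second of the window
    private long _total;                // running sum across all buckets
    private long _lastSecond;           // unix second of the most recent update
    private readonly object _lock = new();

    public SlidingWindowCounter(int windowSeconds) => _buckets = new long[windowSeconds];

    public void Increment(long nowUnixSeconds)
    {
        lock (_lock)
        {
            Advance(nowUnixSeconds);
            _buckets[nowUnixSeconds % _buckets.Length]++;  // O(1) increment
            _total++;
        }
    }

    public long GetCount(long nowUnixSeconds)
    {
        lock (_lock) { Advance(nowUnixSeconds); return _total; }
    }

    // O(k) where k = buckets cleared since the last update (capped at window size).
    private void Advance(long now)
    {
        long elapsed = Math.Min(now - _lastSecond, _buckets.Length);
        for (long s = 1; s <= elapsed; s++)
        {
            int idx = (int)((_lastSecond + s) % _buckets.Length);
            _total -= _buckets[idx];    // drop counts that slid out of the window
            _buckets[idx] = 0;
        }
        _lastSecond = Math.Max(_lastSecond, now);
    }
}
```

The single lock matches the contention note above; an `Interlocked`-based variant would replace the lock for the >50k req/sec case.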
### Environment Limiter Performance
**Target:** <10ms P99 latency (including Valkey RTT)
**Critical path**: Every request to environment limiter makes a Valkey call.
**Optimization: Activation Gate**
Skip Valkey if local instance traffic < threshold:
```csharp
if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
// Skip expensive Valkey check
return instanceDecision;
}
```
**Effect**: Reduces Valkey load by 80%+ in low-traffic scenarios.
**Trade-off**: Under threshold, environment limits not enforced. Acceptable if:
- Each router instance threshold is set appropriately
- Primary concern is high-traffic scenarios
**Lua Script Performance**
- Single round-trip to Valkey (atomic)
- Multiple `INCR` operations in single script (fast, no network)
- TTL set only on first increment (optimization)
**Valkey Sizing**: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).
---
## Valkey Integration
### Connection Management
Use `ConnectionMultiplexer` from StackExchange.Redis:
```csharp
var _connection = ConnectionMultiplexer.Connect(connectionString);
var _db = _connection.GetDatabase();
```
**Important**: ConnectionMultiplexer is thread-safe and expensive to create. Create ONCE per application, reuse everywhere.
### Lua Script Loading
Scripts loaded at startup and cached by SHA:
```csharp
var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);
```
**Persistence**: Valkey caches scripts in memory. They survive across requests but NOT across restarts.
**Recommendation**: Load script at startup, store SHA, use `ScriptEvaluateAsync(sha, ...)` for all calls.
### Key Naming Strategy
Format: `{bucket}:env:{service}:{rule_name}:{window_start}`
Example: `stella-router-rate-limit:env:concelier:per_second:1702821600`
**Why include window_start in key?**
Fixed windows: each window is a separate key with its own TTL. When the window expires, the key is auto-deleted.
**Benefit**: No manual cleanup, memory efficient.
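Key construction amounts to aligning the timestamp to a window boundary and interpolating. A sketch with a hypothetical `RateLimitKeys` helper:

```csharp
using System;

// Illustrative key builder for the scheme above; the helper name is hypothetical.
public static class RateLimitKeys
{
    public static string For(string bucket, string service, string ruleName,
                             long nowUnixSeconds, int windowSeconds)
    {
        // Align to a fixed window boundary so every router computes the same key
        // for the same window (given agreement on the clock; see clock skew below).
        long windowStart = nowUnixSeconds - (nowUnixSeconds % windowSeconds);
        return $"{bucket}:env:{service}:{ruleName}:{windowStart}";
    }
}
```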
### Clock Skew Handling
**Problem**: Different routers may have slightly different clocks, causing them to disagree on window boundaries.
**Solution**: Use Valkey server time (`redis.call("TIME")`) in Lua script, not client time.
```lua
local now = tonumber(redis.call("TIME")[1]) -- Valkey server time
local window_start = now - (now % window_sec)
```
**Result**: All routers agree on window boundaries (Valkey is source of truth).
### Circuit Breaker Thresholds
**failure_threshold**: 5 consecutive failures before opening
**timeout_seconds**: 30 seconds before attempting half-open
**half_open_timeout**: 10 seconds to test one request
**Tuning**:
- Lower failure_threshold = faster fail-open (more availability, less strict limiting)
- Higher failure_threshold = tolerate more transient errors (stricter limiting)
**Recommendation**: Start with defaults, adjust based on Valkey stability.
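The three thresholds above drive a small state machine. A minimal sketch with the default values (5 failures, 30s open timeout); the class shape is hypothetical, and the clock is passed in for testability:

```csharp
using System;

// Illustrative circuit breaker; names and structure are assumptions, not the real class.
public enum CircuitState { Closed, Open, HalfOpen }

public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openTimeout;
    private int _consecutiveFailures;
    private DateTime _halfOpenAt;
    public CircuitState State { get; private set; } = CircuitState.Closed;

    public CircuitBreaker(int failureThreshold = 5, int timeoutSeconds = 30)
        => (_failureThreshold, _openTimeout) = (failureThreshold, TimeSpan.FromSeconds(timeoutSeconds));

    // Returns true when a Valkey call may be attempted (Closed or HalfOpen probe).
    public bool AllowCall(DateTime utcNow)
    {
        if (State == CircuitState.Open && utcNow >= _halfOpenAt)
            State = CircuitState.HalfOpen;              // allow one test request
        return State != CircuitState.Open;
    }

    public void OnSuccess()
    {
        _consecutiveFailures = 0;
        State = CircuitState.Closed;
    }

    public void OnFailure(DateTime utcNow)
    {
        _consecutiveFailures++;
        if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
        {
            State = CircuitState.Open;                  // trip: fail-open until timeout
            _halfOpenAt = utcNow + _openTimeout;
            _consecutiveFailures = 0;                   // fresh count for next closed period
        }
    }
}
```

While the breaker is open, callers skip Valkey entirely and fall back to the instance-tier decision (fail-open).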
---
## Testing Strategy
### Unit Tests (xUnit)
**Coverage targets:**
- Configuration loading: 100%
- Validation logic: 100%
- Sliding window counter: 100%
- Route matching: 100%
- Inheritance resolution: 100%
**Test patterns:**
```csharp
[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
var counter = new SlidingWindowCounter(windowSeconds: 10);
counter.Increment(); // count = 1
// Simulate time passing (mock or Thread.Sleep in tests)
AdvanceTime(11); // seconds
Assert.Equal(0, counter.GetCount()); // Window expired, count reset
}
```
### Integration Tests (TestServer + Testcontainers)
**Valkey integration:**
```csharp
[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
using var valkey = new ValkeyContainer();
await valkey.StartAsync();
var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);
var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);
// First 5 requests should pass
for (int i = 0; i < 5; i++)
{
var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
Assert.True(decision.Value.Allowed);
}
// 6th request should be denied
var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
Assert.False(deniedDecision.Value.Allowed);
    Assert.True(deniedDecision.Value.RetryAfterSeconds > 0);
}
```
**Middleware integration:**
```csharp
[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
var client = testServer.CreateClient();
// Configure rate limit: 5 req/sec
// Send 6 requests rapidly
for (int i = 0; i < 6; i++)
{
var response = await client.GetAsync("/api/test");
if (i < 5)
{
Assert.Equal(HttpStatusCode.OK, response.StatusCode);
}
else
{
Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
Assert.True(response.Headers.Contains("Retry-After"));
}
}
}
```
### Load Tests (k6)
**Scenario A: Instance Limits**
```javascript
import http from 'k6/http';
import { check } from 'k6';
export const options = {
scenarios: {
instance_limit: {
executor: 'constant-arrival-rate',
rate: 100, // 100 req/sec
timeUnit: '1s',
duration: '30s',
preAllocatedVUs: 50,
},
},
};
export default function () {
const res = http.get('http://router/api/test');
check(res, {
'status 200 or 429': (r) => r.status === 200 || r.status === 429,
'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
});
}
```
**Scenario B: Environment Limits (Multi-Instance)**
Run k6 from 5 different machines simultaneously to simulate 5 router instances, then verify the aggregate limit is enforced.
**Scenario E: Valkey Failure**
Use Toxiproxy to inject network failures, verify the circuit breaker opens, and verify requests are still allowed (fail-open).
---
## Common Pitfalls
### 1. Forgetting to Update Middleware Pipeline Order
**Problem**: If the rate limit middleware is added AFTER the routing decision, it can't identify the target microservice.
**Solution**: Add rate limit middleware BEFORE routing decision:
```csharp
app.UsePayloadLimits();
app.UseRateLimiting(); // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();
```
### 2. Circuit Breaker Never Closes
**Problem**: Circuit breaker opens, but never attempts recovery.
**Cause**: Half-open logic not implemented or timeout too long.
**Solution**: Implement half-open state with timeout:
```csharp
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
_state = CircuitState.HalfOpen; // Allow one test request
}
```
### 3. Lua Script Not Found at Runtime
**Problem**: Script file not copied to output directory.
**Solution**: Set file properties in `.csproj`:
```xml
<ItemGroup>
<Content Include="RateLimit\Scripts\*.lua">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
</ItemGroup>
```
### 4. Activation Gate Never Triggers
**Problem**: Activation counter not incremented on every request.
**Cause**: Counter incremented only when instance limit is enforced.
**Solution**: Increment activation counter ALWAYS, not just when checking limits:
```csharp
public RateLimitDecision TryAcquire(string? microservice)
{
_activationCounter.Increment(); // ALWAYS increment
// ... rest of logic
}
```
### 5. Route Matching Case-Sensitivity Issues
**Problem**: `/API/Scans` doesn't match `/api/scans`.
**Solution**: Use case-insensitive comparisons:
```csharp
string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)
```
### 6. Valkey Key Explosion
**Problem**: Too many keys in Valkey, memory usage high.
**Cause**: Forgetting to set TTL on keys.
**Solution**: ALWAYS set TTL when creating keys:
```lua
if count == 1 then
redis.call("EXPIRE", key, window_sec + 2)
end
```
**+2 buffer**: Gives grace period to avoid edge cases.
---
## Debugging Guide
### Scenario 1: Requests Being Denied But Shouldn't Be
**Steps:**
1. Check metrics: Which scope is denying? (instance or environment)
```promql
rate(stella_router_rate_limit_denied_total[1m])
```
2. Check configured limits:
```bash
# View config
kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
```
3. Check activation gate:
```promql
stella_router_rate_limit_activation_gate_enabled
```
If 0, the activation gate is disabled: all requests hit Valkey.
4. Check Valkey keys:
```bash
redis-cli -h valkey.stellaops.local
> KEYS stella-router-rate-limit:env:*
> TTL stella-router-rate-limit:env:concelier:per_second:1702821600
> GET stella-router-rate-limit:env:concelier:per_second:1702821600
```
5. Check circuit breaker state:
```promql
stella_router_rate_limit_circuit_breaker_state{state="open"}
```
If 1, the circuit breaker is open: environment limits are not enforced.
### Scenario 2: Rate Limits Not Being Enforced
**Steps:**
1. Verify middleware is registered:
```csharp
// Check Startup.cs or Program.cs
app.UseRateLimiting(); // Should be present
```
2. Verify configuration loaded:
```csharp
// Add logging in RateLimitService constructor
_logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
_config.ForInstance != null,
_config.ForEnvironment != null);
```
3. Check metrics: are requests even hitting the rate limiter?
```promql
rate(stella_router_rate_limit_allowed_total[1m])
```
If 0, middleware not in pipeline or not being called.
4. Check microservice identification:
```csharp
// Add logging in middleware
var microservice = context.Items["RoutingTarget"] as string;
_logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);
```
If "unknown", routing metadata is not set: the rate limiter can't apply service-specific limits.
### Scenario 3: Valkey Errors
**Steps:**
1. Check circuit breaker metrics:
```promql
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
```
2. Check Valkey connectivity:
```bash
redis-cli -h valkey.stellaops.local PING
```
3. Check Lua script loaded:
```bash
redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
```
4. Check Valkey logs for errors:
```bash
kubectl logs -f valkey-0 | grep ERROR
```
5. Verify Lua script syntax:
```bash
redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua
```
---
## Operational Runbook
### Deployment Checklist
- [ ] Valkey cluster healthy (check `redis-cli PING`)
- [ ] Configuration validated (run `stella-router validate-config`)
- [ ] Metrics scraping configured (Prometheus targets)
- [ ] Dashboards imported (Grafana)
- [ ] Alerts configured (Alertmanager)
- [ ] Shadow mode enabled (limits set to 10x expected traffic)
- [ ] Rollback plan documented
### Monitoring Dashboards
**Dashboard 1: Rate Limiting Overview**
Panels:
- Requests allowed vs denied (pie chart)
- Denial rate by microservice (line graph)
- Denial rate by route (heatmap)
- Retry-After distribution (histogram)
**Dashboard 2: Performance**
Panels:
- Decision latency P50/P95/P99 (instance vs environment)
- Valkey call latency P95
- Activation gate effectiveness (% skipped)
**Dashboard 3: Health**
Panels:
- Circuit breaker state (gauge)
- Valkey error rate
- Most denied routes (top 10 table)
### Alert Definitions
**Critical:**
```yaml
- alert: RateLimitValkeyCriticalFailure
expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
for: 5m
annotations:
summary: "Rate limit circuit breaker open for >5min"
description: "Valkey unavailable, environment limits not enforced"
- alert: RateLimitAllRequestsDenied
  expr: rate(stella_router_rate_limit_denied_total[1m]) / (rate(stella_router_rate_limit_allowed_total[1m]) + rate(stella_router_rate_limit_denied_total[1m])) > 0.99
for: 1m
annotations:
summary: "100% denial rate"
description: "Possible configuration error"
```
**Warning:**
```yaml
- alert: RateLimitHighDenialRate
expr: rate(stella_router_rate_limit_denied_total[5m]) / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m])) > 0.2
for: 5m
annotations:
summary: ">20% requests denied"
description: "High denial rate, check if expected"
- alert: RateLimitValkeyHighLatency
  expr: histogram_quantile(0.95, rate(stella_router_rate_limit_decision_latency_ms_bucket{scope="environment"}[5m])) > 100
for: 5m
annotations:
summary: "Valkey latency >100ms P95"
description: "Valkey performance degraded"
```
### Tuning Guidelines
**Scenario: Too many requests denied**
1. Check if denial rate is expected (traffic spike?)
2. If not, increase limits:
- Start with 2x current limits
- Monitor for 24 hours
- Adjust as needed
**Scenario: Valkey overloaded**
1. Check ops/sec: `redis-cli INFO stats | grep instantaneous_ops_per_sec`
2. If >50k ops/sec, consider:
- Increase activation threshold (reduce Valkey calls)
- Add Valkey replicas (read scaling)
- Shard by microservice (write scaling)
**Scenario: Circuit breaker flapping**
1. Check failure rate:
```promql
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
```
2. If transient errors, increase failure_threshold
3. If persistent errors, fix Valkey issue
### Rollback Procedure
1. Disable rate limiting:
```yaml
rate_limiting:
for_instance: null
for_environment: null
```
2. Deploy config update
3. Verify traffic flows normally
4. Investigate issue offline
---
## References
- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + RetryAfter Backpressure Control.md`
- **Master Sprint Tracker:** `docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md`
- **Sprint Files:** `docs/implplan/SPRINT_1200_001_00X_*.md`
- **HTTP 429 Semantics:** RFC 6585
- **HTTP Retry-After:** RFC 7231 Section 7.1.3
- **Valkey Documentation:** https://valkey.io/docs/