feat(rate-limiting): Implement core rate limiting functionality with configuration, decision-making, metrics, middleware, and service registration

- Add RateLimitConfig for configuration management with YAML binding support.
- Introduce RateLimitDecision to encapsulate the result of rate limit checks.
- Implement RateLimitMetrics for OpenTelemetry metrics tracking.
- Create RateLimitMiddleware for enforcing rate limits on incoming requests.
- Develop RateLimitService to orchestrate instance and environment rate limit checks.
- Add RateLimitServiceCollectionExtensions for dependency injection registration.
This commit is contained in: master
2025-12-17 18:02:37 +02:00
parent 394b57f6bf
commit 8bbfe4d2d2
211 changed files with 47179 additions and 1590 deletions

# Router Rate Limiting - Implementation Guide
**For:** Implementation agents executing Sprint 1200_001_001 through 1200_001_006
**Last Updated:** 2025-12-17
---
## Purpose
This guide provides comprehensive technical context for implementing centralized rate limiting in Stella Router. It covers architecture decisions, patterns, gotchas, and operational considerations.
---
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Configuration Philosophy](#configuration-philosophy)
3. [Performance Considerations](#performance-considerations)
4. [Valkey Integration](#valkey-integration)
5. [Testing Strategy](#testing-strategy)
6. [Common Pitfalls](#common-pitfalls)
7. [Debugging Guide](#debugging-guide)
8. [Operational Runbook](#operational-runbook)
---
## Architecture Overview
### Design Principles
1. **Router-Centralized**: Rate limiting is a router responsibility, not a microservice responsibility
2. **Fail-Open**: Never block all traffic due to infrastructure failures
3. **Observable**: Every decision must emit metrics
4. **Deterministic**: Same request at same time should get same decision (within window)
5. **Fair**: Use sliding windows where possible to avoid thundering herd
### Two-Tier Architecture
```
Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
              ↓ DENY                               ↓ DENY
        429 + Retry-After                    429 + Retry-After
```
**Why two tiers?**
- **Instance tier** protects individual router process (CPU, memory, sockets)
- **Environment tier** protects shared backend (aggregate across all routers)
Both are necessary: a single router can be overwhelmed locally even if aggregate traffic is low.
### Decision Flow
```
1. Extract microservice + route from request
2. Check instance limits (always, fast path)
└─> DENY? Return 429
3. Check activation gate (local 5-min counter)
└─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
└─> Circuit breaker open? Skip (fail-open)
└─> Valkey error? Skip (fail-open)
└─> DENY? Return 429
5. Forward to upstream
```
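The five steps above can be sketched as a single decision routine. This is a hedged illustration, not the actual middleware: every type and member name here (`RateLimitPipeline`, `Decision`, the injected delegates) is hypothetical, and the real implementation lives in `RateLimitMiddleware`/`RateLimitService`.

```csharp
using System;

// Illustrative sketch of the decision flow; all names are hypothetical.
public sealed record Decision(bool Allowed, int RetryAfterSeconds = 0);

public sealed class RateLimitPipeline
{
    // Dependencies injected as delegates so the flow itself stays visible.
    public Func<string, Decision> CheckInstance = _ => new Decision(true);
    public Func<long> ActivationCount = () => 0;       // local 5-min counter
    public long ActivationThreshold = 5000;
    public Func<bool> CircuitOpen = () => false;
    public Func<string, Decision?> CheckEnvironment = _ => new Decision(true); // null = Valkey error

    public Decision Decide(string microservice)
    {
        var instance = CheckInstance(microservice);     // step 2: always, fast path
        if (!instance.Allowed) return instance;         // DENY -> 429

        if (ActivationCount() < ActivationThreshold)    // step 3: activation gate
            return instance;                            // below threshold: skip env check

        if (CircuitOpen()) return instance;             // step 4: breaker open -> fail-open
        var env = CheckEnvironment(microservice);
        if (env is null) return instance;               // Valkey error -> fail-open
        return env;                                     // DENY -> 429, else forward (step 5)
    }
}
```

Note how every failure path in step 4 degrades to the instance-tier decision rather than blocking traffic, which is the fail-open principle from the design section.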
---
## Configuration Philosophy
### Inheritance Model
```
Global Defaults
└─> Environment Defaults
└─> Microservice Overrides
└─> Route Overrides (most specific)
```
**Replacement, not merge**: When a child level specifies limits, it REPLACES parent limits entirely.
**Example:**
```yaml
for_environment:
per_seconds: 300
max_requests: 30000 # Global default
microservices:
scanner:
per_seconds: 60
max_requests: 600 # REPLACES global (not merged)
routes:
scan_submit:
per_seconds: 10
max_requests: 50 # REPLACES microservice (not merged)
```
Result:
- `POST /scanner/api/scans` → 50 req/10sec (route level)
- `GET /scanner/api/other` → 600 req/60sec (microservice level)
- `GET /policy/api/evaluate` → 30000 req/300sec (global level)
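The replacement semantics can be expressed as a first-match lookup from most to least specific. A minimal sketch, assuming hypothetical `Limits`/`LimitResolver` types (the real config classes live in `RateLimitConfig`):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only.
public sealed record Limits(int PerSeconds, int MaxRequests);

public static class LimitResolver
{
    // Replacement, not merge: the most specific level that defines limits wins outright.
    public static Limits Resolve(
        Limits globalDefault,
        IReadOnlyDictionary<string, Limits> microserviceOverrides,
        IReadOnlyDictionary<(string Service, string Route), Limits> routeOverrides,
        string service, string route)
    {
        if (routeOverrides.TryGetValue((service, route), out var routeLimits))
            return routeLimits;                 // route level REPLACES everything above it
        if (microserviceOverrides.TryGetValue(service, out var serviceLimits))
            return serviceLimits;               // microservice level REPLACES global
        return globalDefault;                   // fall back to the global default
    }
}
```

Because each level replaces rather than merges, a route override must restate every limit it needs, not just the fields it wants to change.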
### Rule Stacking (AND Logic)
Multiple rules at same level = ALL must pass.
```yaml
concelier:
rules:
- per_seconds: 1
max_requests: 10 # Rule 1: 10/sec
- per_seconds: 3600
max_requests: 3000 # Rule 2: 3000/hour
```
Both rules enforced. Request denied if EITHER limit exceeded.
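The AND logic reduces to "every stacked rule must have headroom." A sketch with hypothetical `Rule`/`RuleStack` names, where `counts[i]` is the number of requests already seen in rule *i*'s current window:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: a request passes only if EVERY stacked rule has headroom.
public sealed record Rule(int PerSeconds, int MaxRequests);

public static class RuleStack
{
    // counts[i] = requests already counted in rule i's current window.
    public static bool Allowed(IReadOnlyList<Rule> rules, IReadOnlyList<long> counts)
        => rules.Select((rule, i) => counts[i] < rule.MaxRequests).All(ok => ok);
}
```

With the concelier example above, a client at 5 req/sec but 3000 requests this hour is denied by the hourly rule even though the per-second rule still has room.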
### Sensible Defaults
If configuration omitted:
- `for_instance`: No limits (effectively unlimited)
- `for_environment`: No limits
- `activation_threshold`: 5000 (skip Valkey if <5000 req/5min)
- `circuit_breaker.failure_threshold`: 5
- `circuit_breaker.timeout_seconds`: 30
**Recommendation**: Always configure at least global defaults.
---
## Performance Considerations
### Instance Limiter Performance
**Target:** <1ms P99 latency
**Implementation:** Sliding window with ring buffer.
```csharp
// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets; // Ring buffer, size = window_seconds / granularity
long _total; // Running sum
```
**Lock contention**: Single lock per counter. Acceptable for <10k req/sec per router.
**Memory**: one `long` per bucket, so roughly 8 bytes × (window_seconds / granularity) per window, plus a small fixed per-counter overhead.
**Optimization**: For very high traffic (>50k req/sec), consider lock-free implementation with `Interlocked` operations.
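The ring-buffer counter described above can be sketched as follows. Assumptions not fixed by the source: 1-second bucket granularity and a caller-supplied clock (the real implementation may read time internally).

```csharp
using System;

// Minimal sketch of the sliding-window ring buffer; names are illustrative.
public sealed class SlidingWindowCounter
{
    private readonly long[] _buckets;   // one bucket per second of the window
    private long _total;                // running sum across all buckets
    private long _lastSecond;           // unix second of the most recent update
    private readonly object _lock = new();

    public SlidingWindowCounter(int windowSeconds) => _buckets = new long[windowSeconds];

    public void Increment(long nowUnixSeconds)
    {
        lock (_lock)
        {
            Advance(nowUnixSeconds);
            _buckets[nowUnixSeconds % _buckets.Length]++;  // O(1) increment
            _total++;
        }
    }

    public long GetCount(long nowUnixSeconds)
    {
        lock (_lock) { Advance(nowUnixSeconds); return _total; }
    }

    // O(k) where k = buckets cleared since the last update (capped at window size).
    private void Advance(long now)
    {
        long elapsed = Math.Min(now - _lastSecond, _buckets.Length);
        for (long s = 1; s <= elapsed; s++)
        {
            int idx = (int)((_lastSecond + s) % _buckets.Length);
            _total -= _buckets[idx];    // drop counts that slid out of the window
            _buckets[idx] = 0;
        }
        _lastSecond = Math.Max(_lastSecond, now);
    }
}
```

The single lock matches the contention note above; an `Interlocked`-based variant would replace the lock for the >50k req/sec case.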
### Environment Limiter Performance
**Target:** <10ms P99 latency (including Valkey RTT)
**Critical path**: Every request to environment limiter makes a Valkey call.
**Optimization: Activation Gate**
Skip Valkey if local instance traffic < threshold:
```csharp
if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
// Skip expensive Valkey check
return instanceDecision;
}
```
**Effect**: Reduces Valkey load by 80%+ in low-traffic scenarios.
**Trade-off**: Under threshold, environment limits not enforced. Acceptable if:
- Each router instance threshold is set appropriately
- Primary concern is high-traffic scenarios
**Lua Script Performance**
- Single round-trip to Valkey (atomic)
- Multiple `INCR` operations in single script (fast, no network)
- TTL set only on first increment (optimization)
**Valkey Sizing**: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).
---
## Valkey Integration
### Connection Management
Use `ConnectionMultiplexer` from StackExchange.Redis:
```csharp
var _connection = ConnectionMultiplexer.Connect(connectionString);
var _db = _connection.GetDatabase();
```
**Important**: ConnectionMultiplexer is thread-safe and expensive to create. Create ONCE per application, reuse everywhere.
### Lua Script Loading
Scripts loaded at startup and cached by SHA:
```csharp
var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);
```
**Persistence**: Valkey caches scripts in memory. They survive across requests but NOT across restarts.
**Recommendation**: Load script at startup, store SHA, use `ScriptEvaluateAsync(sha, ...)` for all calls.
### Key Naming Strategy
Format: `{bucket}:env:{service}:{rule_name}:{window_start}`
Example: `stella-router-rate-limit:env:concelier:per_second:1702821600`
**Why include window_start in key?**
Fixed windows: each window is a separate key with its own TTL. When the window expires, the key is auto-deleted.
**Benefit**: No manual cleanup, memory efficient.
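Key construction amounts to aligning the timestamp to a window boundary and interpolating. A sketch with a hypothetical `RateLimitKeys` helper:

```csharp
using System;

// Illustrative key builder for the scheme above; the helper name is hypothetical.
public static class RateLimitKeys
{
    public static string For(string bucket, string service, string ruleName,
                             long nowUnixSeconds, int windowSeconds)
    {
        // Align to a fixed window boundary so every router computes the same key
        // for the same window (given agreement on the clock; see clock skew below).
        long windowStart = nowUnixSeconds - (nowUnixSeconds % windowSeconds);
        return $"{bucket}:env:{service}:{ruleName}:{windowStart}";
    }
}
```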
### Clock Skew Handling
**Problem**: Different routers may have slightly different clocks, causing them to disagree on window boundaries.
**Solution**: Use Valkey server time (`redis.call("TIME")`) in Lua script, not client time.
```lua
local now = tonumber(redis.call("TIME")[1]) -- Valkey server time
local window_start = now - (now % window_sec)
```
**Result**: All routers agree on window boundaries (Valkey is source of truth).
### Circuit Breaker Thresholds
**failure_threshold**: 5 consecutive failures before opening
**timeout_seconds**: 30 seconds before attempting half-open
**half_open_timeout**: 10 seconds to test one request
**Tuning**:
- Lower failure_threshold = faster fail-open (more availability, less strict limiting)
- Higher failure_threshold = tolerate more transient errors (stricter limiting)
**Recommendation**: Start with defaults, adjust based on Valkey stability.
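The three thresholds above drive a small state machine. A minimal sketch with the default values (5 failures, 30s open timeout); the class shape is hypothetical, and the clock is passed in for testability:

```csharp
using System;

// Illustrative circuit breaker; names and structure are assumptions, not the real class.
public enum CircuitState { Closed, Open, HalfOpen }

public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openTimeout;
    private int _consecutiveFailures;
    private DateTime _halfOpenAt;
    public CircuitState State { get; private set; } = CircuitState.Closed;

    public CircuitBreaker(int failureThreshold = 5, int timeoutSeconds = 30)
        => (_failureThreshold, _openTimeout) = (failureThreshold, TimeSpan.FromSeconds(timeoutSeconds));

    // Returns true when a Valkey call may be attempted (Closed or HalfOpen probe).
    public bool AllowCall(DateTime utcNow)
    {
        if (State == CircuitState.Open && utcNow >= _halfOpenAt)
            State = CircuitState.HalfOpen;              // allow one test request
        return State != CircuitState.Open;
    }

    public void OnSuccess()
    {
        _consecutiveFailures = 0;
        State = CircuitState.Closed;
    }

    public void OnFailure(DateTime utcNow)
    {
        _consecutiveFailures++;
        if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
        {
            State = CircuitState.Open;                  // trip: fail-open until timeout
            _halfOpenAt = utcNow + _openTimeout;
            _consecutiveFailures = 0;                   // fresh count for next closed period
        }
    }
}
```

While the breaker is open, callers skip Valkey entirely and fall back to the instance-tier decision (fail-open).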
---
## Testing Strategy
### Unit Tests (xUnit)
**Coverage targets:**
- Configuration loading: 100%
- Validation logic: 100%
- Sliding window counter: 100%
- Route matching: 100%
- Inheritance resolution: 100%
**Test patterns:**
```csharp
[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
var counter = new SlidingWindowCounter(windowSeconds: 10);
counter.Increment(); // count = 1
// Simulate time passing (mock or Thread.Sleep in tests)
AdvanceTime(11); // seconds
Assert.Equal(0, counter.GetCount()); // Window expired, count reset
}
```
### Integration Tests (TestServer + Testcontainers)
**Valkey integration:**
```csharp
[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
using var valkey = new ValkeyContainer();
await valkey.StartAsync();
var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);
var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);
// First 5 requests should pass
for (int i = 0; i < 5; i++)
{
var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
Assert.True(decision.Value.Allowed);
}
// 6th request should be denied
var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
Assert.False(deniedDecision.Value.Allowed);
    Assert.True(deniedDecision.Value.RetryAfterSeconds > 0);
}
```
**Middleware integration:**
```csharp
[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
var client = testServer.CreateClient();
// Configure rate limit: 5 req/sec
// Send 6 requests rapidly
for (int i = 0; i < 6; i++)
{
var response = await client.GetAsync("/api/test");
if (i < 5)
{
Assert.Equal(HttpStatusCode.OK, response.StatusCode);
}
else
{
Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
Assert.True(response.Headers.Contains("Retry-After"));
}
}
}
```
### Load Tests (k6)
**Scenario A: Instance Limits**
```javascript
import http from 'k6/http';
import { check } from 'k6';
export const options = {
scenarios: {
instance_limit: {
executor: 'constant-arrival-rate',
rate: 100, // 100 req/sec
timeUnit: '1s',
duration: '30s',
preAllocatedVUs: 50,
},
},
};
export default function () {
const res = http.get('http://router/api/test');
check(res, {
'status 200 or 429': (r) => r.status === 200 || r.status === 429,
'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
});
}
```
**Scenario B: Environment Limits (Multi-Instance)**
Run k6 from 5 different machines simultaneously to simulate 5 router instances, then verify the aggregate limit is enforced.
**Scenario E: Valkey Failure**
Use Toxiproxy to inject network failures, verify the circuit breaker opens, and verify requests are still allowed (fail-open).
---
## Common Pitfalls
### 1. Forgetting to Update Middleware Pipeline Order
**Problem**: If the rate limit middleware is added AFTER the routing decision, it can't identify the target microservice.
**Solution**: Add rate limit middleware BEFORE routing decision:
```csharp
app.UsePayloadLimits();
app.UseRateLimiting(); // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();
```
### 2. Circuit Breaker Never Closes
**Problem**: Circuit breaker opens, but never attempts recovery.
**Cause**: Half-open logic not implemented or timeout too long.
**Solution**: Implement half-open state with timeout:
```csharp
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
_state = CircuitState.HalfOpen; // Allow one test request
}
```
### 3. Lua Script Not Found at Runtime
**Problem**: Script file not copied to output directory.
**Solution**: Set file properties in `.csproj`:
```xml
<ItemGroup>
<Content Include="RateLimit\Scripts\*.lua">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
</ItemGroup>
```
### 4. Activation Gate Never Triggers
**Problem**: Activation counter not incremented on every request.
**Cause**: Counter incremented only when instance limit is enforced.
**Solution**: Increment activation counter ALWAYS, not just when checking limits:
```csharp
public RateLimitDecision TryAcquire(string? microservice)
{
_activationCounter.Increment(); // ALWAYS increment
// ... rest of logic
}
```
### 5. Route Matching Case-Sensitivity Issues
**Problem**: `/API/Scans` doesn't match `/api/scans`.
**Solution**: Use case-insensitive comparisons:
```csharp
string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)
```
### 6. Valkey Key Explosion
**Problem**: Too many keys in Valkey, memory usage high.
**Cause**: Forgetting to set TTL on keys.
**Solution**: ALWAYS set TTL when creating keys:
```lua
if count == 1 then
redis.call("EXPIRE", key, window_sec + 2)
end
```
**+2 buffer**: Gives grace period to avoid edge cases.
---
## Debugging Guide
### Scenario 1: Requests Being Denied But Shouldn't Be
**Steps:**
1. Check metrics: Which scope is denying? (instance or environment)
```promql
rate(stella_router_rate_limit_denied_total[1m])
```
2. Check configured limits:
```bash
# View config
kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
```
3. Check activation gate:
```promql
stella_router_rate_limit_activation_gate_enabled
```
If 0, the activation gate is disabled: all requests hit Valkey.
4. Check Valkey keys:
```bash
redis-cli -h valkey.stellaops.local
> KEYS stella-router-rate-limit:env:*
> TTL stella-router-rate-limit:env:concelier:per_second:1702821600
> GET stella-router-rate-limit:env:concelier:per_second:1702821600
```
5. Check circuit breaker state:
```promql
stella_router_rate_limit_circuit_breaker_state{state="open"}
```
If 1, the circuit breaker is open: environment limits are not enforced.
### Scenario 2: Rate Limits Not Being Enforced
**Steps:**
1. Verify middleware is registered:
```csharp
// Check Startup.cs or Program.cs
app.UseRateLimiting(); // Should be present
```
2. Verify configuration loaded:
```csharp
// Add logging in RateLimitService constructor
_logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
_config.ForInstance != null,
_config.ForEnvironment != null);
```
3. Check metrics: are requests even hitting the rate limiter?
```promql
rate(stella_router_rate_limit_allowed_total[1m])
```
If 0, middleware not in pipeline or not being called.
4. Check microservice identification:
```csharp
// Add logging in middleware
var microservice = context.Items["RoutingTarget"] as string;
_logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);
```
If "unknown", routing metadata is not set: the rate limiter can't apply service-specific limits.
### Scenario 3: Valkey Errors
**Steps:**
1. Check circuit breaker metrics:
```promql
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
```
2. Check Valkey connectivity:
```bash
redis-cli -h valkey.stellaops.local PING
```
3. Check Lua script loaded:
```bash
redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
```
4. Check Valkey logs for errors:
```bash
kubectl logs -f valkey-0 | grep ERROR
```
5. Verify Lua script syntax:
```bash
redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua
```
---
## Operational Runbook
### Deployment Checklist
- [ ] Valkey cluster healthy (check `redis-cli PING`)
- [ ] Configuration validated (run `stella-router validate-config`)
- [ ] Metrics scraping configured (Prometheus targets)
- [ ] Dashboards imported (Grafana)
- [ ] Alerts configured (Alertmanager)
- [ ] Shadow mode enabled (limits set to 10x expected traffic)
- [ ] Rollback plan documented
### Monitoring Dashboards
**Dashboard 1: Rate Limiting Overview**
Panels:
- Requests allowed vs denied (pie chart)
- Denial rate by microservice (line graph)
- Denial rate by route (heatmap)
- Retry-After distribution (histogram)
**Dashboard 2: Performance**
Panels:
- Decision latency P50/P95/P99 (instance vs environment)
- Valkey call latency P95
- Activation gate effectiveness (% skipped)
**Dashboard 3: Health**
Panels:
- Circuit breaker state (gauge)
- Valkey error rate
- Most denied routes (top 10 table)
### Alert Definitions
**Critical:**
```yaml
- alert: RateLimitValkeyCriticalFailure
expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
for: 5m
annotations:
summary: "Rate limit circuit breaker open for >5min"
description: "Valkey unavailable, environment limits not enforced"
- alert: RateLimitAllRequestsDenied
  expr: rate(stella_router_rate_limit_denied_total[1m]) / (rate(stella_router_rate_limit_allowed_total[1m]) + rate(stella_router_rate_limit_denied_total[1m])) > 0.99
for: 1m
annotations:
summary: "100% denial rate"
description: "Possible configuration error"
```
**Warning:**
```yaml
- alert: RateLimitHighDenialRate
expr: rate(stella_router_rate_limit_denied_total[5m]) / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m])) > 0.2
for: 5m
annotations:
summary: ">20% requests denied"
description: "High denial rate, check if expected"
- alert: RateLimitValkeyHighLatency
  expr: histogram_quantile(0.95, rate(stella_router_rate_limit_decision_latency_ms_bucket{scope="environment"}[5m])) > 100
for: 5m
annotations:
summary: "Valkey latency >100ms P95"
description: "Valkey performance degraded"
```
### Tuning Guidelines
**Scenario: Too many requests denied**
1. Check if denial rate is expected (traffic spike?)
2. If not, increase limits:
- Start with 2x current limits
- Monitor for 24 hours
- Adjust as needed
**Scenario: Valkey overloaded**
1. Check ops/sec: `redis-cli INFO stats | grep instantaneous_ops_per_sec`
2. If >50k ops/sec, consider:
- Increase activation threshold (reduce Valkey calls)
- Add Valkey replicas (read scaling)
- Shard by microservice (write scaling)
**Scenario: Circuit breaker flapping**
1. Check failure rate:
```promql
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
```
2. If transient errors, increase failure_threshold
3. If persistent errors, fix Valkey issue
### Rollback Procedure
1. Disable rate limiting:
```yaml
rate_limiting:
for_instance: null
for_environment: null
```
2. Deploy config update
3. Verify traffic flows normally
4. Investigate issue offline
---
## References
- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + RetryAfter Backpressure Control.md`
- **Master Sprint Tracker:** `docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md`
- **Sprint Files:** `docs/implplan/SPRINT_1200_001_00X_*.md`
- **HTTP 429 Semantics:** RFC 6585
- **HTTP Retry-After:** RFC 7231 Section 7.1.3
- **Valkey Documentation:** https://valkey.io/docs/