feat(rate-limiting): Implement core rate limiting functionality with configuration, decision-making, metrics, middleware, and service registration
- Add RateLimitConfig for configuration management with YAML binding support.
- Introduce RateLimitDecision to encapsulate the result of rate limit checks.
- Implement RateLimitMetrics for OpenTelemetry metrics tracking.
- Create RateLimitMiddleware for enforcing rate limits on incoming requests.
- Develop RateLimitService to orchestrate instance and environment rate limit checks.
- Add RateLimitServiceCollectionExtensions for dependency injection registration.
docs/implplan/SPRINT_1200_001_IMPLEMENTATION_GUIDE.md (new file, +707 lines)

# Router Rate Limiting - Implementation Guide

**For:** Implementation agents executing Sprint 1200_001_001 through 1200_001_006
**Last Updated:** 2025-12-17

---

## Purpose

This guide provides comprehensive technical context for implementing centralized rate limiting in Stella Router. It covers architecture decisions, patterns, gotchas, and operational considerations.

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Configuration Philosophy](#configuration-philosophy)
3. [Performance Considerations](#performance-considerations)
4. [Valkey Integration](#valkey-integration)
5. [Testing Strategy](#testing-strategy)
6. [Common Pitfalls](#common-pitfalls)
7. [Debugging Guide](#debugging-guide)
8. [Operational Runbook](#operational-runbook)

---

## Architecture Overview

### Design Principles

1. **Router-Centralized**: Rate limiting is a router responsibility, not a microservice responsibility.
2. **Fail-Open**: Never block all traffic due to infrastructure failures.
3. **Observable**: Every decision must emit metrics.
4. **Deterministic**: The same request at the same time should get the same decision (within a window).
5. **Fair**: Use sliding windows where possible to avoid thundering-herd effects.

### Two-Tier Architecture

```
Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
                 ↓ DENY                               ↓ DENY
         429 + Retry-After                    429 + Retry-After
```

**Why two tiers?**

- **Instance tier** protects the individual router process (CPU, memory, sockets)
- **Environment tier** protects the shared backend (aggregate across all routers)

Both are necessary: a single router can be overwhelmed locally even if aggregate traffic is low.

### Decision Flow

```
1. Extract microservice + route from request
2. Check instance limits (always, fast path)
   └─> DENY? Return 429
3. Check activation gate (local 5-min counter)
   └─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
   └─> Circuit breaker open? Skip (fail-open)
   └─> Valkey error? Skip (fail-open)
   └─> DENY? Return 429
5. Forward to upstream
```
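
To make the flow concrete, here is a minimal sketch of how the five steps can be orchestrated in one service method. It is illustrative only: the record, interfaces, and member names below are assumptions, not the actual Sprint 1200 contracts.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal stand-ins for the real contracts; every name here is an assumption.
public sealed record RateLimitDecision(bool Allowed, int RetryAfterSeconds = 0);

public interface IInstanceRateLimiter { RateLimitDecision TryAcquire(string microservice); }
public interface IEnvironmentRateLimiter { Task<RateLimitDecision> TryAcquireAsync(string microservice, CancellationToken ct); }
public interface ICircuitBreaker { bool AllowCall(); void RecordSuccess(); void RecordFailure(); }

public sealed class RateLimitService
{
    private readonly IInstanceRateLimiter _instance;
    private readonly IEnvironmentRateLimiter _environment;
    private readonly ICircuitBreaker _breaker;
    private readonly Func<long> _localCountLast5Min;   // activation counter read
    private readonly long _activationThreshold;

    public RateLimitService(IInstanceRateLimiter instance, IEnvironmentRateLimiter environment,
        ICircuitBreaker breaker, Func<long> localCountLast5Min, long activationThreshold)
    {
        _instance = instance;
        _environment = environment;
        _breaker = breaker;
        _localCountLast5Min = localCountLast5Min;
        _activationThreshold = activationThreshold;
    }

    public async Task<RateLimitDecision> CheckAsync(string microservice, CancellationToken ct)
    {
        // Step 2: instance limits are always checked (fast, in-memory).
        var instanceDecision = _instance.TryAcquire(microservice);
        if (!instanceDecision.Allowed)
            return instanceDecision;                 // deny -> 429 + Retry-After

        // Step 3: activation gate - below the local threshold, skip the Valkey round-trip.
        if (_localCountLast5Min() < _activationThreshold)
            return instanceDecision;

        // Step 4: environment limits - fail open on an open circuit breaker or Valkey errors.
        if (!_breaker.AllowCall())
            return instanceDecision;

        try
        {
            var envDecision = await _environment.TryAcquireAsync(microservice, ct);
            _breaker.RecordSuccess();
            return envDecision;
        }
        catch (Exception)
        {
            _breaker.RecordFailure();
            return instanceDecision;                 // fail-open (step 4 error path)
        }
    }
}
```

Note how every failure path on the environment tier collapses back to the instance decision, which is what keeps the system fail-open.
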

---

## Configuration Philosophy

### Inheritance Model

```
Global Defaults
  └─> Environment Defaults
        └─> Microservice Overrides
              └─> Route Overrides (most specific)
```

**Replacement, not merge**: When a child level specifies limits, it REPLACES parent limits entirely.

**Example:**

```yaml
for_environment:
  per_seconds: 300
  max_requests: 30000        # Global default

microservices:
  scanner:
    per_seconds: 60
    max_requests: 600        # REPLACES global (not merged)
    routes:
      scan_submit:
        per_seconds: 10
        max_requests: 50     # REPLACES microservice (not merged)
```

Result:
- `POST /scanner/api/scans` → 50 req/10sec (route level)
- `GET /scanner/api/other` → 600 req/60sec (microservice level)
- `GET /policy/api/evaluate` → 30000 req/300sec (global level)
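
The replace-not-merge rule boils down to "most specific non-null level wins". A minimal sketch of that lookup; the `RuleSet` record and `LimitResolver` helper are hypothetical names, not the real configuration types:

```csharp
// Illustrative resolution of effective limits; type and member names are assumptions.
public sealed record RuleSet(int PerSeconds, int MaxRequests);

public static class LimitResolver
{
    // The most specific non-null level wins outright; levels are never merged.
    public static RuleSet? Resolve(RuleSet? routeLevel, RuleSet? microserviceLevel, RuleSet? globalLevel)
        => routeLevel ?? microserviceLevel ?? globalLevel;
}

// Matching the YAML above:
//   Resolve(new(10, 50), new(60, 600), new(300, 30000)) -> 50 req / 10 sec    (route level)
//   Resolve(null,        new(60, 600), new(300, 30000)) -> 600 req / 60 sec   (microservice level)
//   Resolve(null,        null,         new(300, 30000)) -> 30000 req / 300 sec (global level)
```
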

### Rule Stacking (AND Logic)

Multiple rules at the same level = ALL must pass.

```yaml
concelier:
  rules:
    - per_seconds: 1
      max_requests: 10       # Rule 1: 10/sec
    - per_seconds: 3600
      max_requests: 3000     # Rule 2: 3000/hour
```

Both rules are enforced. A request is denied if EITHER limit is exceeded.
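
Stacked rules reduce to an all-must-pass check. A short sketch, assuming a hypothetical `currentCountForWindow` lookup that returns the running count for a given window length:

```csharp
using System;
using System.Collections.Generic;

public static class RuleStacking
{
    // AND logic: the request is allowed only if every rule at the level still has headroom.
    public static bool AllRulesAllow(
        IEnumerable<(int PerSeconds, int MaxRequests)> rules,
        Func<int, long> currentCountForWindow)
    {
        foreach (var (perSeconds, maxRequests) in rules)
        {
            if (currentCountForWindow(perSeconds) >= maxRequests)
                return false;   // deny if ANY rule is exceeded
        }
        return true;
    }
}
```
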

### Sensible Defaults

If configuration is omitted:
- `for_instance`: No limits (effectively unlimited)
- `for_environment`: No limits
- `activation_threshold`: 5000 (skip Valkey if <5000 req/5min)
- `circuit_breaker.failure_threshold`: 5
- `circuit_breaker.timeout_seconds`: 30

**Recommendation**: Always configure at least global defaults.
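
These defaults map naturally onto the bound options type. The sketch below only illustrates the shape; the real `RateLimitConfig` introduced in this commit may name and nest things differently:

```csharp
// Sketch of an options type carrying the defaults listed above; property names are assumptions.
public sealed class RateLimitOptions
{
    public InstanceLimitOptions? ForInstance { get; set; }        // null => no instance limits
    public EnvironmentLimitOptions? ForEnvironment { get; set; }  // null => no environment limits
    public int ActivationThreshold { get; set; } = 5000;          // skip Valkey below 5000 req / 5 min
    public CircuitBreakerOptions CircuitBreaker { get; set; } = new();
}

public sealed class CircuitBreakerOptions
{
    public int FailureThreshold { get; set; } = 5;
    public int TimeoutSeconds { get; set; } = 30;
}

// Placeholders for the per-scope rule collections (per_seconds / max_requests).
public sealed class InstanceLimitOptions { }
public sealed class EnvironmentLimitOptions { }
```
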

---

## Performance Considerations

### Instance Limiter Performance

**Target:** <1ms P99 latency

**Implementation:** Sliding window with ring buffer.

```csharp
// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets; // Ring buffer, size = window_seconds / granularity
long _total;     // Running sum
```

**Lock contention**: Single lock per counter. Acceptable for <10k req/sec per router.

**Memory**: ~24 bytes per window (array overhead + fields).

**Optimization**: For very high traffic (>50k req/sec), consider lock-free implementation with `Interlocked` operations.
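
For reference, a lock-based ring-buffer counter along the lines sketched above might look like this (1-second bucket granularity assumed; this is a sketch, not the production `SlidingWindowCounter`):

```csharp
using System;

// Sliding window approximated by one bucket per second; a sketch, not the production type.
public sealed class SlidingWindowCounter
{
    private readonly long[] _buckets;   // ring buffer, one slot per second of window
    private readonly object _lock = new();
    private long _total;                // running sum over live buckets
    private long _lastSecond;

    public SlidingWindowCounter(int windowSeconds)
    {
        _buckets = new long[windowSeconds];
        _lastSecond = NowSeconds();
    }

    public void Increment()
    {
        lock (_lock)
        {
            Advance();
            _buckets[Index(_lastSecond)]++;
            _total++;
        }
    }

    public long GetCount()
    {
        lock (_lock)
        {
            Advance();
            return _total;
        }
    }

    // Clear buckets that fell out of the window since the last call: O(k), k = buckets cleared.
    private void Advance()
    {
        long now = NowSeconds();
        long elapsed = Math.Min(now - _lastSecond, _buckets.Length);
        for (long i = 1; i <= elapsed; i++)
        {
            int idx = Index(_lastSecond + i);
            _total -= _buckets[idx];
            _buckets[idx] = 0;
        }
        _lastSecond = now;
    }

    private int Index(long second) => (int)(second % _buckets.Length);

    private static long NowSeconds() => DateTimeOffset.UtcNow.ToUnixTimeSeconds();
}
```

A production version would likely take a clock abstraction so tests (such as the counter test in the Testing Strategy section) can advance time without sleeping.
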

### Environment Limiter Performance

**Target:** <10ms P99 latency (including Valkey RTT)

**Critical path**: Every request to the environment limiter makes a Valkey call.

**Optimization: Activation Gate**

Skip Valkey if local instance traffic < threshold:

```csharp
if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
    // Skip expensive Valkey check
    return instanceDecision;
}
```

**Effect**: Reduces Valkey load by 80%+ in low-traffic scenarios.

**Trade-off**: Under the threshold, environment limits are not enforced. Acceptable if:
- Each router instance threshold is set appropriately
- Primary concern is high-traffic scenarios

**Lua Script Performance**

- Single round-trip to Valkey (atomic)
- Multiple `INCR` operations in a single script (fast, no network)
- TTL set only on first increment (optimization)

**Valkey Sizing**: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).

---

## Valkey Integration

### Connection Management

Use `ConnectionMultiplexer` from StackExchange.Redis:

```csharp
// Create once at startup; the multiplexer is thread-safe and intended to be shared.
var connection = ConnectionMultiplexer.Connect(connectionString);
var db = connection.GetDatabase();
```

**Important**: ConnectionMultiplexer is thread-safe and expensive to create. Create ONCE per application, reuse everywhere.

### Lua Script Loading

Scripts are loaded at startup and cached by SHA:

```csharp
var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);   // returns the SHA1 digest of the cached script
```

**Persistence**: Valkey caches scripts in memory. They survive across requests but NOT across restarts.

**Recommendation**: Load the script at startup, store the SHA, and use `ScriptEvaluateAsync(sha, ...)` for all calls.
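
Because the script cache does not survive a Valkey restart, callers typically guard `EVALSHA` with a reload-and-retry path. A sketch using StackExchange.Redis; the class name and wiring here are assumptions, not the actual store implementation:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Sketch: evaluate the rate-limit script by SHA, reloading it if Valkey restarted
// and lost its script cache. Class and member names are illustrative.
public sealed class RateLimitScriptRunner
{
    private readonly IDatabase _db;
    private readonly IServer _server;
    private readonly string _scriptText;
    private byte[] _scriptSha;

    public RateLimitScriptRunner(IConnectionMultiplexer connection, string scriptText)
    {
        _db = connection.GetDatabase();
        _server = connection.GetServer(connection.GetEndPoints()[0]);
        _scriptText = scriptText;
        _scriptSha = _server.ScriptLoad(scriptText);   // load once at startup, keep the SHA
    }

    public async Task<RedisResult> EvaluateAsync(RedisKey[] keys, RedisValue[] args)
    {
        try
        {
            return await _db.ScriptEvaluateAsync(_scriptSha, keys, args);
        }
        catch (RedisServerException ex) when (ex.Message.StartsWith("NOSCRIPT", StringComparison.Ordinal))
        {
            // Script cache was flushed (e.g. Valkey restart): reload once and retry.
            _scriptSha = await _server.ScriptLoadAsync(_scriptText);
            return await _db.ScriptEvaluateAsync(_scriptSha, keys, args);
        }
    }
}
```
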

### Key Naming Strategy

Format: `{bucket}:env:{service}:{rule_name}:{window_start}`

Example: `stella-router-rate-limit:env:concelier:per_second:1702821600`

**Why include window_start in the key?**

Fixed windows: each window is a separate key with a TTL. When the window expires, the key is auto-deleted.

**Benefit**: No manual cleanup, memory efficient.
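
For reference, the key format and window alignment can be computed as below (a small sketch; the real store may build keys differently):

```csharp
// Builds "{bucket}:env:{service}:{rule_name}:{window_start}" with the window start
// aligned to a fixed boundary; a sketch, not the actual key builder.
public static class RateLimitKeys
{
    public static string ForEnvironment(string bucket, string service, string ruleName,
                                        long nowUnixSeconds, int windowSeconds)
    {
        long windowStart = nowUnixSeconds - (nowUnixSeconds % windowSeconds);
        return $"{bucket}:env:{service}:{ruleName}:{windowStart}";
    }
}

// ForEnvironment("stella-router-rate-limit", "concelier", "per_second", 1702821600, 1)
//   -> "stella-router-rate-limit:env:concelier:per_second:1702821600"
```

In practice the alignment happens inside the Lua script using Valkey server time (see Clock Skew Handling below); the helper above only illustrates the arithmetic.
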

### Clock Skew Handling

**Problem**: Different routers may have slightly different clocks, causing them to disagree on window boundaries.

**Solution**: Use Valkey server time (`redis.call("TIME")`) in the Lua script, not client time.

```lua
local now = tonumber(redis.call("TIME")[1])  -- Valkey server time (seconds)
local window_start = now - (now % window_sec)
```

**Result**: All routers agree on window boundaries (Valkey is the source of truth).

### Circuit Breaker Thresholds

- **failure_threshold**: 5 consecutive failures before opening
- **timeout_seconds**: 30 seconds before attempting half-open
- **half_open_timeout**: 10 seconds to test one request

**Tuning**:
- Lower failure_threshold = faster fail-open (more availability, less strict limiting)
- Higher failure_threshold = tolerate more transient errors (stricter limiting)

**Recommendation**: Start with defaults, adjust based on Valkey stability.
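
A minimal breaker matching these thresholds could look like the sketch below. The class is hypothetical; the production implementation may track half-open probes differently (for example, enforcing the 10-second half_open_timeout explicitly):

```csharp
using System;

// Minimal circuit breaker matching the thresholds above; a sketch, not the production type.
public sealed class ValkeyCircuitBreaker
{
    private enum CircuitState { Closed, Open, HalfOpen }

    private readonly int _failureThreshold;     // e.g. 5 consecutive failures
    private readonly TimeSpan _openTimeout;     // e.g. 30s before probing again
    private readonly object _lock = new();
    private CircuitState _state = CircuitState.Closed;
    private int _consecutiveFailures;
    private DateTime _halfOpenAt;

    public ValkeyCircuitBreaker(int failureThreshold = 5, int timeoutSeconds = 30)
    {
        _failureThreshold = failureThreshold;
        _openTimeout = TimeSpan.FromSeconds(timeoutSeconds);
    }

    public bool AllowCall()
    {
        lock (_lock)
        {
            if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
                _state = CircuitState.HalfOpen;          // allow a probe request

            return _state != CircuitState.Open;
        }
    }

    public void RecordSuccess()
    {
        lock (_lock)
        {
            _consecutiveFailures = 0;
            _state = CircuitState.Closed;
        }
    }

    public void RecordFailure()
    {
        lock (_lock)
        {
            _consecutiveFailures++;
            if (_state == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
            {
                _state = CircuitState.Open;
                _halfOpenAt = DateTime.UtcNow + _openTimeout;
            }
        }
    }
}
```
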

---

## Testing Strategy

### Unit Tests (xUnit)

**Coverage targets:**
- Configuration loading: 100%
- Validation logic: 100%
- Sliding window counter: 100%
- Route matching: 100%
- Inheritance resolution: 100%

**Test patterns:**

```csharp
[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
    var counter = new SlidingWindowCounter(windowSeconds: 10);
    counter.Increment(); // count = 1

    // Simulate time passing (fake/injected clock, or Thread.Sleep in tests)
    AdvanceTime(11); // seconds; test helper that advances the injected clock

    Assert.Equal(0, counter.GetCount()); // Window expired, count reset
}
```

### Integration Tests (TestServer + Testcontainers)

**Valkey integration:**

```csharp
[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
    using var valkey = new ValkeyContainer();
    await valkey.StartAsync();

    var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
    var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);

    var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);

    // First 5 requests should pass
    for (int i = 0; i < 5; i++)
    {
        var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
        Assert.True(decision.Value.Allowed);
    }

    // 6th request should be denied and carry a positive Retry-After hint
    var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
    Assert.False(deniedDecision.Value.Allowed);
    Assert.True(deniedDecision.Value.RetryAfterSeconds > 0);
}
```

**Middleware integration:**

```csharp
[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
    using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
    var client = testServer.CreateClient();

    // Configure rate limit: 5 req/sec
    // Send 6 requests rapidly
    for (int i = 0; i < 6; i++)
    {
        var response = await client.GetAsync("/api/test");
        if (i < 5)
        {
            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
        else
        {
            Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
            Assert.True(response.Headers.Contains("Retry-After"));
        }
    }
}
```

### Load Tests (k6)

**Scenario A: Instance Limits**

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    instance_limit: {
      executor: 'constant-arrival-rate',
      rate: 100, // 100 req/sec
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 50,
    },
  },
};

export default function () {
  const res = http.get('http://router/api/test');
  check(res, {
    'status 200 or 429': (r) => r.status === 200 || r.status === 429,
    'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
  });
}
```

**Scenario B: Environment Limits (Multi-Instance)**

Run k6 from 5 different machines simultaneously → simulate 5 router instances → verify aggregate limit enforced.

**Scenario E: Valkey Failure**

Use Toxiproxy to inject network failures → verify circuit breaker opens → verify requests still allowed (fail-open).

---

## Common Pitfalls

### 1. Forgetting to Update Middleware Pipeline Order

**Problem**: Rate limit middleware added AFTER routing decision → can't identify microservice.

**Solution**: Add rate limit middleware BEFORE routing decision:

```csharp
app.UsePayloadLimits();
app.UseRateLimiting();        // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();
```

### 2. Circuit Breaker Never Closes

**Problem**: Circuit breaker opens, but never attempts recovery.

**Cause**: Half-open logic not implemented or timeout too long.

**Solution**: Implement half-open state with timeout:

```csharp
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow one test request
}
```

### 3. Lua Script Not Found at Runtime

**Problem**: Script file not copied to output directory.

**Solution**: Set file properties in `.csproj`:

```xml
<ItemGroup>
  <Content Include="RateLimit\Scripts\*.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>
```

### 4. Activation Gate Never Triggers

**Problem**: Activation counter not incremented on every request.

**Cause**: Counter incremented only when instance limit is enforced.

**Solution**: Increment activation counter ALWAYS, not just when checking limits:

```csharp
public RateLimitDecision TryAcquire(string? microservice)
{
    _activationCounter.Increment(); // ALWAYS increment
    // ... rest of logic
}
```

### 5. Route Matching Case-Sensitivity Issues

**Problem**: `/API/Scans` doesn't match `/api/scans`.

**Solution**: Use case-insensitive comparisons:

```csharp
string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)
```

### 6. Valkey Key Explosion

**Problem**: Too many keys in Valkey, memory usage high.

**Cause**: Forgetting to set TTL on keys.

**Solution**: ALWAYS set TTL when creating keys:

```lua
if count == 1 then
    redis.call("EXPIRE", key, window_sec + 2)
end
```

**+2 buffer**: Gives grace period to avoid edge cases.

---

## Debugging Guide

### Scenario 1: Requests Being Denied But Shouldn't Be

**Steps:**

1. Check metrics: Which scope is denying? (instance or environment)

   ```promql
   rate(stella_router_rate_limit_denied_total[1m])
   ```

2. Check configured limits:

   ```bash
   # View config
   kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
   ```

3. Check activation gate:

   ```promql
   stella_router_rate_limit_activation_gate_enabled
   ```

   If 0, the activation gate is disabled: all requests hit Valkey.

4. Check Valkey keys:

   ```bash
   redis-cli -h valkey.stellaops.local
   > KEYS stella-router-rate-limit:env:*
   > TTL stella-router-rate-limit:env:concelier:per_second:1702821600
   > GET stella-router-rate-limit:env:concelier:per_second:1702821600
   ```

5. Check circuit breaker state:

   ```promql
   stella_router_rate_limit_circuit_breaker_state{state="open"}
   ```

   If 1, the circuit breaker is open: environment limits are not enforced.

### Scenario 2: Rate Limits Not Being Enforced

**Steps:**

1. Verify middleware is registered:

   ```csharp
   // Check Startup.cs or Program.cs
   app.UseRateLimiting(); // Should be present
   ```

2. Verify configuration loaded:

   ```csharp
   // Add logging in RateLimitService constructor
   _logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
       _config.ForInstance != null,
       _config.ForEnvironment != null);
   ```

3. Check metrics: are requests even hitting the rate limiter?

   ```promql
   rate(stella_router_rate_limit_allowed_total[1m])
   ```

   If 0, the middleware is not in the pipeline or not being called.

4. Check microservice identification:

   ```csharp
   // Add logging in middleware
   var microservice = context.Items["RoutingTarget"] as string;
   _logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);
   ```

   If "unknown", routing metadata is not set: the rate limiter can't apply service-specific limits.

### Scenario 3: Valkey Errors

**Steps:**

1. Check circuit breaker metrics:

   ```promql
   rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
   ```

2. Check Valkey connectivity:

   ```bash
   redis-cli -h valkey.stellaops.local PING
   ```

3. Check Lua script loaded:

   ```bash
   redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
   ```

4. Check Valkey logs for errors:

   ```bash
   kubectl logs -f valkey-0 | grep ERROR
   ```

5. Verify Lua script syntax:

   ```bash
   # Runs the script once; a compile error is reported immediately
   redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua
   ```

---

## Operational Runbook

### Deployment Checklist

- [ ] Valkey cluster healthy (check `redis-cli PING`)
- [ ] Configuration validated (run `stella-router validate-config`)
- [ ] Metrics scraping configured (Prometheus targets)
- [ ] Dashboards imported (Grafana)
- [ ] Alerts configured (Alertmanager)
- [ ] Shadow mode enabled (limits set 10x expected traffic)
- [ ] Rollback plan documented

### Monitoring Dashboards

**Dashboard 1: Rate Limiting Overview**

Panels:
- Requests allowed vs denied (pie chart)
- Denial rate by microservice (line graph)
- Denial rate by route (heatmap)
- Retry-After distribution (histogram)

**Dashboard 2: Performance**

Panels:
- Decision latency P50/P95/P99 (instance vs environment)
- Valkey call latency P95
- Activation gate effectiveness (% skipped)

**Dashboard 3: Health**

Panels:
- Circuit breaker state (gauge)
- Valkey error rate
- Most denied routes (top 10 table)
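
The dashboard panels above and the alerts below query series derived from the router's rate-limit instruments. A sketch of how such instruments might be declared with `System.Diagnostics.Metrics`; the instrument names and tags are assumptions, the real `RateLimitMetrics` may differ, and the exported Prometheus series names also depend on the exporter (which typically appends `_total` to counters):

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Illustrative instruments backing the dashboards and alerts; names, tags, and units
// are assumptions rather than the actual RateLimitMetrics definitions.
public sealed class RateLimitMetrics
{
    private readonly Counter<long> _allowed;
    private readonly Counter<long> _denied;
    private readonly Histogram<double> _decisionLatencyMs;

    public RateLimitMetrics()
    {
        var meter = new Meter("StellaOps.Router.RateLimit");
        _allowed = meter.CreateCounter<long>("stella_router_rate_limit_allowed");
        _denied = meter.CreateCounter<long>("stella_router_rate_limit_denied");
        _decisionLatencyMs = meter.CreateHistogram<double>(
            "stella_router_rate_limit_decision_latency_ms", unit: "ms");
    }

    public void RecordDecision(bool allowed, string scope, string microservice, double latencyMs)
    {
        var tags = new TagList
        {
            { "scope", scope },                 // "instance" or "environment"
            { "microservice", microservice },
        };
        (allowed ? _allowed : _denied).Add(1, tags);
        _decisionLatencyMs.Record(latencyMs, tags);
    }
}
```
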

### Alert Definitions

**Critical:**

```yaml
- alert: RateLimitValkeyCriticalFailure
  expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
  for: 5m
  annotations:
    summary: "Rate limit circuit breaker open for >5min"
    description: "Valkey unavailable, environment limits not enforced"

- alert: RateLimitAllRequestsDenied
  expr: rate(stella_router_rate_limit_denied_total[1m]) / (rate(stella_router_rate_limit_allowed_total[1m]) + rate(stella_router_rate_limit_denied_total[1m])) > 0.99
  for: 1m
  annotations:
    summary: "100% denial rate"
    description: "Possible configuration error"
```

**Warning:**

```yaml
- alert: RateLimitHighDenialRate
  expr: rate(stella_router_rate_limit_denied_total[5m]) / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m])) > 0.2
  for: 5m
  annotations:
    summary: ">20% requests denied"
    description: "High denial rate, check if expected"

- alert: RateLimitValkeyHighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(stella_router_rate_limit_decision_latency_ms_bucket{scope="environment"}[5m]))) > 100
  for: 5m
  annotations:
    summary: "Valkey latency >100ms P95"
    description: "Valkey performance degraded"
```

### Tuning Guidelines

**Scenario: Too many requests denied**

1. Check if denial rate is expected (traffic spike?)
2. If not, increase limits:
   - Start with 2x current limits
   - Monitor for 24 hours
   - Adjust as needed

**Scenario: Valkey overloaded**

1. Check ops/sec: `redis-cli INFO stats | grep instantaneous_ops_per_sec`
2. If >50k ops/sec, consider:
   - Increase activation threshold (reduce Valkey calls)
   - Add Valkey replicas (read scaling)
   - Shard by microservice (write scaling)

**Scenario: Circuit breaker flapping**

1. Check failure rate:

   ```promql
   rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
   ```

2. If transient errors, increase failure_threshold
3. If persistent errors, fix Valkey issue

### Rollback Procedure

1. Disable rate limiting:

   ```yaml
   rate_limiting:
     for_instance: null
     for_environment: null
   ```

2. Deploy config update
3. Verify traffic flows normally
4. Investigate issue offline

---

## References

- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
- **Master Sprint Tracker:** `docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md`
- **Sprint Files:** `docs/implplan/SPRINT_1200_001_00X_*.md`
- **HTTP 429 Semantics:** RFC 6585
- **HTTP Retry-After:** RFC 7231 Section 7.1.3
- **Valkey Documentation:** https://valkey.io/docs/