# Router Rate Limiting - Implementation Guide

**For:** Implementation agents executing Sprint 1200_001_001 through 1200_001_006
**Last Updated:** 2025-12-17

---

## Purpose

This guide provides comprehensive technical context for implementing centralized rate limiting in Stella Router. It covers architecture decisions, patterns, gotchas, and operational considerations.

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Configuration Philosophy](#configuration-philosophy)
3. [Performance Considerations](#performance-considerations)
4. [Valkey Integration](#valkey-integration)
5. [Testing Strategy](#testing-strategy)
6. [Common Pitfalls](#common-pitfalls)
7. [Debugging Guide](#debugging-guide)
8. [Operational Runbook](#operational-runbook)

---

## Architecture Overview

### Design Principles

1. **Router-Centralized**: Rate limiting is a router responsibility, not a microservice responsibility
2. **Fail-Open**: Never block all traffic due to infrastructure failures
3. **Observable**: Every decision must be metrified
4. **Deterministic**: Same request at same time should get same decision (within window)
5. **Fair**: Use sliding windows where possible to avoid thundering herd

### Two-Tier Architecture

```
Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
                ↓ DENY                               ↓ DENY
          429 + Retry-After                    429 + Retry-After
```

**Why two tiers?**

- **Instance tier** protects the individual router process (CPU, memory, sockets)
- **Environment tier** protects the shared backend (aggregate across all routers)

Both are necessary—a single router can be overwhelmed locally even if aggregate traffic is low.

### Decision Flow

```
1. Extract microservice + route from request
2. Check instance limits (always, fast path)
   └─> DENY? Return 429
3. Check activation gate (local 5-min counter)
   └─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
   └─> Circuit breaker open? Skip (fail-open)
   └─> Valkey error? Skip (fail-open)
   └─> DENY? Return 429
5. Forward to upstream
```

---

## Configuration Philosophy

### Inheritance Model

```
Global Defaults
  └─> Environment Defaults
        └─> Microservice Overrides
              └─> Route Overrides (most specific)
```

**Replacement, not merge**: When a child level specifies limits, it REPLACES parent limits entirely.

**Example:**

```yaml
for_environment:
  per_seconds: 300
  max_requests: 30000        # Global default

microservices:
  scanner:
    per_seconds: 60
    max_requests: 600        # REPLACES global (not merged)
    routes:
      scan_submit:
        per_seconds: 10
        max_requests: 50     # REPLACES microservice (not merged)
```

Result:

- `POST /scanner/api/scans` → 50 req/10sec (route level)
- `GET /scanner/api/other` → 600 req/60sec (microservice level)
- `GET /policy/api/evaluate` → 30000 req/300sec (global level)

### Rule Stacking (AND Logic)

Multiple rules at the same level = ALL must pass.

```yaml
concelier:
  rules:
    - per_seconds: 1
      max_requests: 10       # Rule 1: 10/sec
    - per_seconds: 3600
      max_requests: 3000     # Rule 2: 3000/hour
```

Both rules are enforced. A request is denied if EITHER limit is exceeded.

### Sensible Defaults

If configuration is omitted:

- `for_instance`: No limits (effectively unlimited)
- `for_environment`: No limits
- `activation_threshold`: 5000 (skip Valkey if <5000 req/5min)
- `circuit_breaker.failure_threshold`: 5
- `circuit_breaker.timeout_seconds`: 30

**Recommendation**: Always configure at least global defaults.

---

## Performance Considerations

### Instance Limiter Performance

**Target:** <1ms P99 latency

**Implementation:** Sliding window with ring buffer.

```csharp
// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets;  // Ring buffer, size = window_seconds / granularity
long _total;      // Running sum
```

**Lock contention**: Single lock per counter. Acceptable for <10k req/sec per router.

**Memory**: ~24 bytes per window (array overhead + fields).
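The ring-buffer design above can be sketched in language-agnostic form. This is a Python model for illustration only—the production implementation is C#, and all names here are assumed, not taken from the codebase:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: one bucket per second plus a running sum.

    Illustrative sketch of the ring-buffer counter described in the guide:
    O(1) increment, O(k) advance where k = buckets that fell out of the window.
    """

    def __init__(self, window_seconds: int, clock=time.monotonic):
        self._buckets = [0] * window_seconds  # ring buffer, 1s granularity
        self._total = 0                       # running sum across live buckets
        self._window = window_seconds
        self._clock = clock                   # injectable for testing
        self._last_tick = int(clock())

    def _advance(self) -> None:
        now = int(self._clock())
        elapsed = now - self._last_tick
        if elapsed <= 0:
            return
        # Clear every bucket that slid out of the window (at most the whole ring).
        for i in range(min(elapsed, self._window)):
            idx = (self._last_tick + 1 + i) % self._window
            self._total -= self._buckets[idx]
            self._buckets[idx] = 0
        self._last_tick = now

    def increment(self) -> None:
        self._advance()
        self._buckets[self._last_tick % self._window] += 1
        self._total += 1

    def get_count(self) -> int:
        self._advance()
        return self._total
```

Injecting the clock keeps the counter deterministic in unit tests, which matches the "simulate time passing" pattern used in the Testing Strategy section.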
**Optimization**: For very high traffic (>50k req/sec), consider a lock-free implementation with `Interlocked` operations.

### Environment Limiter Performance

**Target:** <10ms P99 latency (including Valkey RTT)

**Critical path**: Every request to the environment limiter makes a Valkey call.

**Optimization: Activation Gate**

Skip Valkey if local instance traffic < threshold:

```csharp
if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
    // Skip expensive Valkey check
    return instanceDecision;
}
```

**Effect**: Reduces Valkey load by 80%+ in low-traffic scenarios.

**Trade-off**: Under the threshold, environment limits are not enforced. Acceptable if:

- Each router instance's threshold is set appropriately
- The primary concern is high-traffic scenarios

**Lua Script Performance**

- Single round-trip to Valkey (atomic)
- Multiple `INCR` operations in a single script (fast, no network)
- TTL set only on first increment (optimization)

**Valkey Sizing**: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).

---

## Valkey Integration

### Connection Management

Use `ConnectionMultiplexer` from StackExchange.Redis:

```csharp
var _connection = ConnectionMultiplexer.Connect(connectionString);
var _db = _connection.GetDatabase();
```

**Important**: `ConnectionMultiplexer` is thread-safe and expensive to create. Create it ONCE per application, reuse everywhere.

### Lua Script Loading

Scripts are loaded at startup and cached by SHA:

```csharp
var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);
```

**Persistence**: Valkey caches scripts in memory. They survive across requests but NOT across restarts.

**Recommendation**: Load the script at startup, store the SHA, and use `ScriptEvaluateAsync(sha, ...)` for all calls.
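The fixed-window check that the Lua script performs atomically can be modeled as follows. This is a hedged Python sketch, not the production script: a plain dict stands in for Valkey, and the function and parameter names are assumptions for illustration:

```python
def fixed_window_check(store: dict, ttls: dict, now: int,
                       service: str, rule: str, window_sec: int,
                       max_requests: int,
                       bucket: str = "stella-router-rate-limit"):
    """Model of the atomic fixed-window logic: INCR, EXPIRE on first
    increment (+2s grace), deny with a Retry-After hint when over limit.

    Returns (allowed, retry_after_seconds).
    """
    # All callers agree on the window boundary because `now` comes from
    # the server clock (TIME in the real script), not the client.
    window_start = now - (now % window_sec)
    key = f"{bucket}:env:{service}:{rule}:{window_start}"

    count = store.get(key, 0) + 1  # INCR equivalent
    store[key] = count
    if count == 1:
        # TTL set only on first increment; +2s grace past the window end.
        ttls[key] = window_sec + 2

    if count > max_requests:
        retry_after = window_start + window_sec - now
        return False, max(retry_after, 1)
    return True, 0
```

Because each window gets its own key with a TTL, expired windows clean themselves up—no sweeper process is needed.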
### Key Naming Strategy

Format: `{bucket}:env:{service}:{rule_name}:{window_start}`

Example: `stella-router-rate-limit:env:concelier:per_second:1702821600`

**Why include window_start in the key?** Fixed windows—each window is a separate key with a TTL. When the window expires, the key is auto-deleted.

**Benefit**: No manual cleanup; memory efficient.

### Clock Skew Handling

**Problem**: Different routers may have slightly different clocks, causing them to disagree on window boundaries.

**Solution**: Use Valkey server time (`redis.call("TIME")`) in the Lua script, not client time.

```lua
local now = tonumber(redis.call("TIME")[1])  -- Valkey server time
local window_start = now - (now % window_sec)
```

**Result**: All routers agree on window boundaries (Valkey is the source of truth).

### Circuit Breaker Thresholds

- **failure_threshold**: 5 consecutive failures before opening
- **timeout_seconds**: 30 seconds before attempting half-open
- **half_open_timeout**: 10 seconds to test one request

**Tuning**:

- Lower failure_threshold = faster fail-open (more availability, less strict limiting)
- Higher failure_threshold = tolerate more transient errors (stricter limiting)

**Recommendation**: Start with defaults, adjust based on Valkey stability.
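The breaker behavior above can be sketched as a small state machine. This Python model is illustrative only—the defaults mirror the guide's thresholds, but the class and method names are assumptions, not the production C# API:

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    """Fail-open breaker guarding Valkey calls (illustrative sketch).

    Simplification: a production breaker would admit exactly one probe
    while half-open; this sketch only models the state transitions.
    """

    def __init__(self, failure_threshold: int = 5,
                 timeout_seconds: float = 30.0, clock=time.monotonic):
        self.state = CircuitState.CLOSED
        self._failures = 0
        self._threshold = failure_threshold
        self._timeout = timeout_seconds
        self._opened_at = 0.0
        self._clock = clock  # injectable for testing

    def allow_call(self) -> bool:
        """True = attempt the Valkey call; False = skip it (fail-open)."""
        if self.state == CircuitState.OPEN:
            if self._clock() - self._opened_at >= self._timeout:
                self.state = CircuitState.HALF_OPEN  # let a probe through
                return True
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        if self.state == CircuitState.HALF_OPEN or self._failures >= self._threshold:
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()
```

Note the asymmetry: a failure during half-open reopens immediately, while reaching the threshold is required from the closed state—this is what prevents a single transient error from flapping the breaker.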
---

## Testing Strategy

### Unit Tests (xUnit)

**Coverage targets:**

- Configuration loading: 100%
- Validation logic: 100%
- Sliding window counter: 100%
- Route matching: 100%
- Inheritance resolution: 100%

**Test patterns:**

```csharp
[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
    var counter = new SlidingWindowCounter(windowSeconds: 10);
    counter.Increment();  // count = 1

    // Simulate time passing (mock or Thread.Sleep in tests)
    AdvanceTime(11);  // seconds

    Assert.Equal(0, counter.GetCount());  // Window expired, count reset
}
```

### Integration Tests (TestServer + Testcontainers)

**Valkey integration:**

```csharp
[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
    using var valkey = new ValkeyContainer();
    await valkey.StartAsync();

    var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
    var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);
    var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);

    // First 5 requests should pass
    for (int i = 0; i < 5; i++)
    {
        var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
        Assert.True(decision.Value.Allowed);
    }

    // 6th request should be denied, with a Retry-After hint
    var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
    Assert.False(deniedDecision.Value.Allowed);
    Assert.True(deniedDecision.Value.RetryAfterSeconds > 0);
}
```

**Middleware integration:**

```csharp
[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
    using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
    var client = testServer.CreateClient();

    // Configure rate limit: 5 req/sec
    // Send 6 requests rapidly
    for (int i = 0; i < 6; i++)
    {
        var response = await client.GetAsync("/api/test");
        if (i < 5)
        {
            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
        else
        {
            Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
            Assert.True(response.Headers.Contains("Retry-After"));
        }
    }
}
```

### Load Tests (k6)

**Scenario A: Instance Limits**

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    instance_limit: {
      executor: 'constant-arrival-rate',
      rate: 100,       // 100 req/sec
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 50,
    },
  },
};

export default function () {
  const res = http.get('http://router/api/test');
  check(res, {
    'status 200 or 429': (r) => r.status === 200 || r.status === 429,
    'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
  });
}
```

**Scenario B: Environment Limits (Multi-Instance)**

Run k6 from 5 different machines simultaneously → simulate 5 router instances → verify the aggregate limit is enforced.

**Scenario E: Valkey Failure**

Use Toxiproxy to inject network failures → verify the circuit breaker opens → verify requests are still allowed (fail-open).

---

## Common Pitfalls

### 1. Forgetting to Update Middleware Pipeline Order

**Problem**: Rate limit middleware added AFTER the routing decision → can't identify the microservice.

**Solution**: Add the rate limit middleware BEFORE the routing decision:

```csharp
app.UsePayloadLimits();
app.UseRateLimiting();       // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();
```

### 2. Circuit Breaker Never Closes

**Problem**: Circuit breaker opens, but never attempts recovery.

**Cause**: Half-open logic not implemented, or timeout too long.

**Solution**: Implement a half-open state with a timeout:

```csharp
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen;  // Allow one test request
}
```

### 3. Lua Script Not Found at Runtime

**Problem**: Script file not copied to the output directory.

**Solution**: Set the file's copy behavior in the `.csproj`:

```xml
<ItemGroup>
  <None Update="rate_limit_check.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </None>
</ItemGroup>
```

### 4. Activation Gate Never Triggers

**Problem**: Activation counter not incremented on every request.
**Cause**: Counter incremented only when the instance limit is enforced.

**Solution**: Increment the activation counter ALWAYS, not just when checking limits:

```csharp
public RateLimitDecision TryAcquire(string? microservice)
{
    _activationCounter.Increment();  // ALWAYS increment
    // ... rest of logic
}
```

### 5. Route Matching Case-Sensitivity Issues

**Problem**: `/API/Scans` doesn't match `/api/scans`.

**Solution**: Use case-insensitive comparisons:

```csharp
string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)
```

### 6. Valkey Key Explosion

**Problem**: Too many keys in Valkey; memory usage high.

**Cause**: Forgetting to set a TTL on keys.

**Solution**: ALWAYS set a TTL when creating keys:

```lua
if count == 1 then
  redis.call("EXPIRE", key, window_sec + 2)
end
```

**+2 buffer**: Gives a grace period to avoid edge cases at window boundaries.

---

## Debugging Guide

### Scenario 1: Requests Being Denied But Shouldn't Be

**Steps:**

1. Check metrics: which scope is denying? (instance or environment)

   ```promql
   rate(stella_router_rate_limit_denied_total[1m])
   ```

2. Check configured limits:

   ```bash
   # View config
   kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
   ```

3. Check the activation gate:

   ```promql
   stella_router_rate_limit_activation_gate_enabled
   ```

   If 0, the activation gate is disabled—all requests hit Valkey.

4. Check Valkey keys:

   ```bash
   redis-cli -h valkey.stellaops.local
   > KEYS stella-router-rate-limit:env:*
   > TTL stella-router-rate-limit:env:concelier:per_second:1702821600
   > GET stella-router-rate-limit:env:concelier:per_second:1702821600
   ```

5. Check the circuit breaker state:

   ```promql
   stella_router_rate_limit_circuit_breaker_state{state="open"}
   ```

   If 1, the circuit breaker is open—environment limits are not enforced.

### Scenario 2: Rate Limits Not Being Enforced

**Steps:**

1. Verify the middleware is registered:

   ```csharp
   // Check Startup.cs or Program.cs
   app.UseRateLimiting();  // Should be present
   ```

2. Verify the configuration loaded:

   ```csharp
   // Add logging in the RateLimitService constructor
   _logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
       _config.ForInstance != null, _config.ForEnvironment != null);
   ```

3. Check metrics—are requests even hitting the rate limiter?

   ```promql
   rate(stella_router_rate_limit_allowed_total[1m])
   ```

   If 0, the middleware is not in the pipeline or not being called.

4. Check microservice identification:

   ```csharp
   // Add logging in the middleware
   var microservice = context.Items["RoutingTarget"] as string;
   _logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);
   ```

   If "unknown", routing metadata is not set—the rate limiter can't apply service-specific limits.

### Scenario 3: Valkey Errors

**Steps:**

1. Check circuit breaker metrics:

   ```promql
   rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
   ```

2. Check Valkey connectivity:

   ```bash
   redis-cli -h valkey.stellaops.local PING
   ```

3. Check the Lua script is loaded (pass the SHA stored at startup):

   ```bash
   redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
   ```

4. Check Valkey logs for errors:

   ```bash
   kubectl logs -f valkey-0 | grep ERROR
   ```

5.
Verify the Lua script syntax:

   ```bash
   redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua
   ```

---

## Operational Runbook

### Deployment Checklist

- [ ] Valkey cluster healthy (check `redis-cli PING`)
- [ ] Configuration validated (run `stella-router validate-config`)
- [ ] Metrics scraping configured (Prometheus targets)
- [ ] Dashboards imported (Grafana)
- [ ] Alerts configured (Alertmanager)
- [ ] Shadow mode enabled (limits set 10x expected traffic)
- [ ] Rollback plan documented

### Monitoring Dashboards

**Dashboard 1: Rate Limiting Overview**

Panels:

- Requests allowed vs denied (pie chart)
- Denial rate by microservice (line graph)
- Denial rate by route (heatmap)
- Retry-After distribution (histogram)

**Dashboard 2: Performance**

Panels:

- Decision latency P50/P95/P99 (instance vs environment)
- Valkey call latency P95
- Activation gate effectiveness (% skipped)

**Dashboard 3: Health**

Panels:

- Circuit breaker state (gauge)
- Valkey error rate
- Most denied routes (top 10 table)

### Alert Definitions

**Critical:**

```yaml
- alert: RateLimitValkeyCriticalFailure
  expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
  for: 5m
  annotations:
    summary: "Rate limit circuit breaker open for >5min"
    description: "Valkey unavailable, environment limits not enforced"

- alert: RateLimitAllRequestsDenied
  expr: >
    rate(stella_router_rate_limit_denied_total[1m])
    / (rate(stella_router_rate_limit_allowed_total[1m]) + rate(stella_router_rate_limit_denied_total[1m]))
    > 0.99
  for: 1m
  annotations:
    summary: ">99% of requests denied"
    description: "Possible configuration error"
```

**Warning:**

```yaml
- alert: RateLimitHighDenialRate
  expr: >
    rate(stella_router_rate_limit_denied_total[5m])
    / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m]))
    > 0.2
  for: 5m
  annotations:
    summary: ">20% requests denied"
    description: "High denial rate, check if expected"

- alert: RateLimitValkeyHighLatency
  expr: >
    histogram_quantile(0.95,
      rate(stella_router_rate_limit_decision_latency_ms_bucket{scope="environment"}[5m]))
    > 100
  for: 5m
  annotations:
    summary: "Valkey latency >100ms P95"
    description: "Valkey performance degraded"
```

### Tuning Guidelines

**Scenario: Too many requests denied**

1. Check whether the denial rate is expected (traffic spike?)
2. If not, increase limits:
   - Start with 2x current limits
   - Monitor for 24 hours
   - Adjust as needed

**Scenario: Valkey overloaded**

1. Check ops/sec: `redis-cli INFO stats | grep instantaneous_ops_per_sec`
2. If >50k ops/sec, consider:
   - Increasing the activation threshold (fewer Valkey calls)
   - Adding Valkey replicas (read scaling)
   - Sharding by microservice (write scaling)

**Scenario: Circuit breaker flapping**

1. Check the failure rate:

   ```promql
   rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
   ```

2. If errors are transient, increase failure_threshold
3. If errors are persistent, fix the Valkey issue

### Rollback Procedure

1. Disable rate limiting:

   ```yaml
   rate_limiting:
     for_instance: null
     for_environment: null
   ```

2. Deploy the config update
3. Verify traffic flows normally
4. Investigate the issue offline

---

## References

- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
- **Master Sprint Tracker:** `docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md`
- **Sprint Files:** `docs/implplan/SPRINT_1200_001_00X_*.md`
- **HTTP 429 Semantics:** RFC 6585
- **HTTP Retry-After:** RFC 7231, Section 7.1.3
- **Valkey Documentation:** https://valkey.io/docs/