# Router Rate Limiting Runbook

Last updated: 2025-12-17

## Purpose

- Enforce centralized admission control at the Router (429 + Retry-After).
- Reduce duplicate per-service HTTP throttling and standardize response semantics.
- Keep the platform available under dependency failures (Valkey fail-open + circuit breaker).

## Preconditions

- Router rate limiting configured under `rate_limiting` (see `docs/router/rate-limiting.md`).
- If `for_environment` is enabled:
  - Valkey reachable from Router instances.
  - Circuit breaker parameters reviewed for the environment.

## Rollout plan (recommended)

1. **Dry-run wiring**: enable rate limiting with limits set far above peak traffic to validate middleware order, headers, and metrics.
2. **Soft limits**: set limits to ~2× peak traffic and monitor rejected rate and latency.
3. **Production limits**: set limits to target SLO and operational constraints.
4. **Migration cleanup**: remove any remaining service-level HTTP rate limiters to avoid double-limiting.

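
The first three rollout stages differ only in how the limit relates to observed peak traffic, so the staged values can be derived up front. The following is a minimal sketch, not part of the Router tooling: `RolloutStage`, `plan_rollout`, and the `dry_run_multiplier` default of 100 are hypothetical, since the runbook only says "far above peak traffic" without giving a number.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutStage:
    """One rollout stage and the limit to configure for it (hypothetical type)."""
    name: str
    limit_per_window: int


def plan_rollout(
    peak_per_window: int,
    production_limit: int,
    dry_run_multiplier: int = 100,  # assumption: stands in for "far above peak"
) -> list[RolloutStage]:
    """Derive per-stage limits from the observed peak for one limit window."""
    if peak_per_window <= 0 or production_limit <= 0:
        raise ValueError("peak and production limit must be positive")
    return [
        # Stage 1: limit high enough that nothing is rejected; validates wiring only.
        RolloutStage("dry-run", peak_per_window * dry_run_multiplier),
        # Stage 2: roughly 2x peak, as recommended for soft limits.
        RolloutStage("soft", peak_per_window * 2),
        # Stage 3: the target chosen from SLO and operational constraints.
        RolloutStage("production", production_limit),
    ]


if __name__ == "__main__":
    for stage in plan_rollout(peak_per_window=1200, production_limit=1500):
        print(f"{stage.name}: {stage.limit_per_window} requests per window")
```

The production limit is passed in rather than computed because it depends on the SLO and operational constraints, which cannot be derived from peak traffic alone.
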
## Monitoring

### Key metrics (OpenTelemetry)

- `stellaops.router.ratelimit.allowed{scope,microservice,route?}`
- `stellaops.router.ratelimit.rejected{scope,microservice,route?}`
- `stellaops.router.ratelimit.check_latency{scope}`
- `stellaops.router.ratelimit.valkey.errors{error_type}`
- `stellaops.router.ratelimit.circuit_breaker.trips{reason}`
- `stellaops.router.ratelimit.instance.current`
- `stellaops.router.ratelimit.environment.current`

### PromQL examples

- Deny ratio (by microservice):
  - `sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice) / (sum(rate(stellaops_router_ratelimit_allowed_total[5m])) by (microservice) + sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice))`
- P95 check latency (environment):
  - `histogram_quantile(0.95, sum(rate(stellaops_router_ratelimit_check_latency_bucket{scope="environment"}[5m])) by (le))`

## Incident response

### Sudden spike in 429s

- Confirm whether this is expected traffic growth or misconfiguration.
- Identify the top offenders: `rejected` by `microservice` and (optionally) `route`.
- If misconfigured: raise limits conservatively (2×), redeploy config, then tighten gradually.

### Valkey unavailable / circuit breaker opening

- Expectation: **fail-open** for environment limits; instance limits (if configured) still apply.
- Check:
  - `stellaops.router.ratelimit.valkey.errors`
  - `stellaops.router.ratelimit.circuit_breaker.trips`
- Actions:
  - Restore Valkey connectivity/performance.
  - Consider temporarily increasing `process_back_pressure_when_more_than_per_5min` to reduce Valkey load.

## Troubleshooting checklist

- [ ] Confirm rate limiting middleware is enabled and runs after endpoint resolution (microservice identity available).
- [ ] Validate YAML binding: incorrect keys should fail fast at startup.
- [ ] Confirm Valkey connectivity from Router nodes (if `for_environment` enabled).
- [ ] Ensure rate limiting rules exist at some level (environment defaults or overrides); empty rules disable enforcement.
- [ ] Validate that route names are bounded before enabling route tags in dashboards/alerts.

## Load testing

- Run `tests/load/router-rate-limiting-load-test.js` against a staging Router configured with known limits.
- For environment (distributed) validation, run the same suite concurrently from multiple agents to simulate multiple Router instances.
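
When the suite runs from several agents, the per-agent counts have to be combined before they can be compared against the known environment limit. The sketch below shows one way to do that, assuming each agent reports how many responses were allowed and how many were 429s within a single limit window. `AgentResult`, `deny_ratio`, `within_shared_limit`, and the 5% `tolerance` are hypothetical and not taken from the load test script; the deny ratio uses the same rejected / (allowed + rejected) form as the PromQL deny-ratio expression.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    """Response counts from one load-test agent for one limit window (hypothetical)."""
    allowed: int   # responses that were not 429
    rejected: int  # 429 responses


def deny_ratio(results: list[AgentResult]) -> float:
    """Fraction of all requests that were rejected, summed across agents."""
    total_allowed = sum(r.allowed for r in results)
    total_rejected = sum(r.rejected for r in results)
    total = total_allowed + total_rejected
    return total_rejected / total if total else 0.0


def within_shared_limit(
    results: list[AgentResult],
    limit: int,
    tolerance: float = 0.05,  # assumption: slack for window-boundary effects
) -> bool:
    """True if the combined allowed count stays within the shared limit."""
    total_allowed = sum(r.allowed for r in results)
    return total_allowed <= limit * (1 + tolerance)


if __name__ == "__main__":
    run = [AgentResult(allowed=400, rejected=100), AgentResult(allowed=500, rejected=0)]
    print(f"deny ratio: {deny_ratio(run):.2f}")
    print(f"within limit of 900: {within_shared_limit(run, limit=900)}")
```

If the combined allowed count exceeds the configured limit by much more than the tolerance, the environment limit is probably not being shared across instances, or the Router was failing open during the run.
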