3.3 KiB
3.3 KiB
Router Rate Limiting Runbook
Last updated: 2025-12-17
Purpose
- Enforce centralized admission control at the Router (429 + Retry-After).
- Reduce duplicate per-service HTTP throttling and standardize response semantics.
- Keep the platform available under dependency failures (Valkey fail-open + circuit breaker).
Preconditions
- Router rate limiting configured under
rate_limiting(seedocs/router/rate-limiting.md). - If
for_environmentis enabled:- Valkey reachable from Router instances.
- Circuit breaker parameters reviewed for the environment.
Rollout plan (recommended)
- Dry-run wiring: enable rate limiting with limits set far above peak traffic to validate middleware order, headers, and metrics.
- Soft limits: set limits to ~2× peak traffic and monitor rejected rate and latency.
- Production limits: set limits to target SLO and operational constraints.
- Migration cleanup: remove any remaining service-level HTTP rate limiters to avoid double-limiting.
Monitoring
Key metrics (OpenTelemetry)
stellaops.router.ratelimit.allowed{scope,microservice,route?}stellaops.router.ratelimit.rejected{scope,microservice,route?}stellaops.router.ratelimit.check_latency{scope}stellaops.router.ratelimit.valkey.errors{error_type}stellaops.router.ratelimit.circuit_breaker.trips{reason}stellaops.router.ratelimit.instance.currentstellaops.router.ratelimit.environment.current
PromQL examples
- Deny ratio (by microservice):
sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice) / (sum(rate(stellaops_router_ratelimit_allowed_total[5m])) by (microservice) + sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice))
- P95 check latency (environment):
histogram_quantile(0.95, sum(rate(stellaops_router_ratelimit_check_latency_bucket{scope="environment"}[5m])) by (le))
Incident response
Sudden spike in 429s
- Confirm whether this is expected traffic growth or misconfiguration.
- Identify the top offenders:
rejectedbymicroserviceand (optionally)route. - If misconfigured: raise limits conservatively (2×), redeploy config, then tighten gradually.
Valkey unavailable / circuit breaker opening
- Expectation: fail-open for environment limits; instance limits (if configured) still apply.
- Check:
stellaops.router.ratelimit.valkey.errorsstellaops.router.ratelimit.circuit_breaker.trips
- Actions:
- Restore Valkey connectivity/performance.
- Consider temporarily increasing
process_back_pressure_when_more_than_per_5minto reduce Valkey load.
Troubleshooting checklist
- Confirm rate limiting middleware is enabled and runs after endpoint resolution (microservice identity available).
- Validate YAML binding: incorrect keys should fail fast at startup.
- Confirm Valkey connectivity from Router nodes (if
for_environmentenabled). - Ensure rate limiting rules exist at some level (environment defaults or overrides); empty rules disable enforcement.
- Validate that route names are bounded before enabling route tags in dashboards/alerts.
Load testing
- Run
tests/load/router-rate-limiting-load-test.jsagainst a staging Router configured with known limits. - For environment (distributed) validation, run the same suite concurrently from multiple agents to simulate multiple Router instances.