# Router Rate Limiting Runbook

Last updated: 2025-12-17

## Purpose
- Enforce centralized admission control at the Router (HTTP 429 responses with a `Retry-After` header).
- Reduce duplicate per-service HTTP throttling and standardize response semantics.
- Keep the platform available under dependency failures (Valkey fail-open plus a circuit breaker).

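
To illustrate the standardized response semantics from the caller's side, here is a minimal sketch of how a client might turn a `Retry-After` value into a wait time. The HTTP specification allows either a number of seconds or an HTTP-date. The function name `retry_after_seconds` and the fallback of `0.0` for unparseable values are assumptions made for this example, not part of the Router.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value: str, now: datetime) -> float:
    """Return the wait time in seconds implied by a Retry-After header value.

    Hypothetical helper: accepts delta-seconds or an HTTP-date and falls back
    to 0.0 when the value cannot be parsed (the caller picks a default backoff).
    """
    value = header_value.strip()
    # Form 1: a non-negative integer number of seconds.
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 0.0
    if retry_at.tzinfo is None:
        # Treat a date without zone information as UTC.
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (retry_at - now).total_seconds())
```

A caller would pass the current time explicitly, for example `retry_after_seconds(value, datetime.now(timezone.utc))`, which also keeps the helper easy to test.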
## Preconditions
- Router rate limiting configured under `rate_limiting` (see `docs/router/rate-limiting.md`).
- If `for_environment` is enabled:
  - Valkey reachable from Router instances.
  - Circuit breaker parameters reviewed for the environment.

## Rollout plan (recommended)
1. **Dry-run wiring**: enable rate limiting with limits set far above peak traffic to validate middleware order, headers, and metrics.
2. **Soft limits**: set limits to ~2× peak traffic and monitor the rejection rate and latency.
3. **Production limits**: set limits according to the target SLO and operational constraints.
4. **Migration cleanup**: remove any remaining service-level HTTP rate limiters to avoid double-limiting.

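
The first three stages can be sketched as simple arithmetic on an observed peak. This is an illustration only: the names `RolloutStage` and `plan_rollout`, and the default dry-run multiplier of 10 (one reading of "far above peak"), are assumptions. The production limit is an input, because it comes from the SLO rather than from a formula. Step 4 is a cleanup task and has no limit to compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutStage:
    name: str
    limit_per_window: int


def plan_rollout(peak_per_window: int, production_limit: int,
                 dry_run_multiplier: int = 10) -> list[RolloutStage]:
    """Derive per-stage limits from observed peak traffic (hypothetical helper)."""
    if peak_per_window <= 0 or production_limit <= 0:
        raise ValueError("peak and production limit must be positive")
    if dry_run_multiplier <= 2:
        raise ValueError("the dry-run limit should sit above the 2x soft limit")
    return [
        # Stage 1: high enough that no real traffic is rejected.
        RolloutStage("dry-run", peak_per_window * dry_run_multiplier),
        # Stage 2: roughly twice the observed peak.
        RolloutStage("soft", peak_per_window * 2),
        # Stage 3: the value chosen from the SLO and operational constraints.
        RolloutStage("production", production_limit),
    ]
```

For example, `plan_rollout(1000, 1500)` yields limits of 10000, 2000, and 1500 requests per window for the three stages.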
## Monitoring

### Key metrics (OpenTelemetry)
- `stellaops.router.ratelimit.allowed{scope,microservice,route?}`
- `stellaops.router.ratelimit.rejected{scope,microservice,route?}`
- `stellaops.router.ratelimit.check_latency{scope}`
- `stellaops.router.ratelimit.valkey.errors{error_type}`
- `stellaops.router.ratelimit.circuit_breaker.trips{reason}`
- `stellaops.router.ratelimit.instance.current`
- `stellaops.router.ratelimit.environment.current`

### PromQL examples
- Deny ratio (by microservice):
  - `sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice) / (sum(rate(stellaops_router_ratelimit_allowed_total[5m])) by (microservice) + sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice))`
- P95 check latency (environment):
  - `histogram_quantile(0.95, sum(rate(stellaops_router_ratelimit_check_latency_bucket{scope="environment"}[5m])) by (le))`

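
To make the arithmetic behind these two queries concrete, the sketch below reproduces it on plain numbers. `deny_ratio` is rejected / (allowed + rejected). `estimate_quantile` is a simplified version of the linear interpolation that `histogram_quantile` applies to cumulative buckets. Both function names are made up for this example. The sketch returns `0.0` for the ratio when there is no traffic, whereas the PromQL division would produce `NaN`.

```python
import math


def deny_ratio(allowed_rate: float, rejected_rate: float) -> float:
    """Fraction of requests rejected: rejected / (allowed + rejected)."""
    total = allowed_rate + rejected_rate
    if total == 0:
        return 0.0  # assumption: report 0 rather than NaN when idle
    return rejected_rate / total


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket that holds the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The target falls beyond the last finite bucket.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound  # not reached when the +Inf bucket holds the total
```

Because the estimate interpolates within a bucket, its precision depends on how finely the latency buckets are spaced around the quantile of interest.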
## Incident response

### Sudden spike in 429s
- Confirm whether the spike reflects expected traffic growth or a misconfiguration.
- Identify the top offenders: break down `rejected` by `microservice` and (optionally) `route`.
- If misconfigured: raise limits conservatively (2×), redeploy the config, then tighten gradually.

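
When the breakdown has to be done on exported samples rather than in a dashboard, ranking the offenders is a small aggregation. The sketch below assumes samples arrive as `(microservice, route, rejected_count)` tuples. That shape, the function name `top_offenders`, and the service names in the usage note are all hypothetical.

```python
from collections import Counter
from typing import Iterable


def top_offenders(samples: Iterable[tuple[str, str, int]], n: int = 3,
                  by_route: bool = False) -> list:
    """Rank rejection counts by microservice, or by (microservice, route)."""
    totals: Counter = Counter()
    for microservice, route, rejected in samples:
        # Group by route only when asked, mirroring the optional `route` tag.
        key = (microservice, route) if by_route else microservice
        totals[key] += rejected
    # most_common sorts by count, highest first.
    return totals.most_common(n)
```

With samples such as `("scanner", "/scan", 120)` and `("policy", "/evaluate", 40)`, the default call ranks whole microservices, and `by_route=True` narrows the view to individual routes.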
### Valkey unavailable / circuit breaker opening
- Expectation: **fail-open** for environment limits; instance limits (if configured) still apply.
- Check:
  - `stellaops.router.ratelimit.valkey.errors`
  - `stellaops.router.ratelimit.circuit_breaker.trips`
- Actions:
  - Restore Valkey connectivity/performance.
  - Consider temporarily increasing `process_back_pressure_when_more_than_per_5min` to reduce Valkey load.

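
The fail-open expectation can be pictured with a small circuit breaker. This is a generic sketch of the pattern, not the Router's implementation: the class name `FailOpenBreaker`, its parameters (`failure_threshold`, `open_seconds`), and the `remote_check` callable standing in for the Valkey-backed check are all assumptions.

```python
import time
from typing import Callable, Optional


class FailOpenBreaker:
    """Allow requests whenever the remote limit check is failing or skipped."""

    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._open_seconds:
            # Timeout elapsed: close again so the next call probes the backend.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def check(self, remote_check: Callable[[], bool]) -> bool:
        """Return True when the request should be admitted."""
        if self.is_open():
            return True  # fail-open: skip the remote check entirely
        try:
            allowed = remote_check()
        except Exception:
            # Count the failure, trip the breaker at the threshold, and admit.
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return True
        self._failures = 0
        return allowed
```

While the breaker is open, no calls reach the backend, which is why a burst of trips usually coincides with environment limits not being enforced. Any instance-level limit would be evaluated separately from this check.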
## Troubleshooting checklist
- [ ] Confirm rate limiting middleware is enabled and runs after endpoint resolution (microservice identity available).
- [ ] Validate YAML binding: incorrect keys should fail fast at startup.
- [ ] Confirm Valkey connectivity from Router nodes (if `for_environment` enabled).
- [ ] Ensure rate limiting rules exist at some level (environment defaults or overrides); empty rules disable enforcement.
- [ ] Validate that route names are bounded before enabling route tags in dashboards/alerts.

## Load testing
- Run `tests/load/router-rate-limiting-load-test.js` against a staging Router configured with known limits.
- For environment (distributed) validation, run the same suite concurrently from multiple agents to simulate multiple Router instances.