# Router Rate Limiting Runbook
Last updated: 2025-12-17
## Purpose
- Enforce centralized admission control at the Router (HTTP 429 responses with a `Retry-After` header).
- Reduce duplicate per-service HTTP throttling and standardize response semantics.
- Keep the platform available under dependency failures (Valkey fail-open + circuit breaker).
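
To illustrate the 429 + Retry-After semantics from a caller's point of view, here is a minimal Python sketch of parsing the header. The function name `retry_after_seconds` is hypothetical, and this runbook does not say which form the Router emits; HTTP allows both a number of seconds and an HTTP-date, so the sketch accepts either.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Return seconds to wait before retrying, or None if the header is unusable.

    Hypothetical helper: accepts both Retry-After forms allowed by HTTP
    (delta-seconds or an HTTP-date).
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        # delta-seconds form, e.g. "120"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A client that receives a 429 could sleep for the returned value (falling back to its own backoff when the result is `None`) before retrying.
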
## Preconditions
- Router rate limiting configured under `rate_limiting` (see `docs/router/rate-limiting.md`).
- If `for_environment` is enabled:
  - Valkey reachable from Router instances.
  - Circuit breaker parameters reviewed for the environment.
## Rollout plan (recommended)
1. **Dry-run wiring**: enable rate limiting with limits set far above peak traffic to validate middleware order, headers, and metrics.
2. **Soft limits**: set limits to ~2× peak traffic and monitor rejected rate and latency.
3. **Production limits**: set limits to match the target SLO and operational constraints.
4. **Migration cleanup**: remove any remaining service-level HTTP rate limiters to avoid double-limiting.
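
The staged limits above can be derived from a measured peak. The following Python sketch is one way to do the arithmetic; `staged_limits` and `DRY_RUN_MULTIPLIER` are hypothetical names, and the value 100 for "far above peak" is an assumption, since the runbook gives no number for that stage.

```python
import math

# Assumption: the runbook only says "far above peak" for the dry-run stage.
DRY_RUN_MULTIPLIER = 100
# From the rollout plan: soft limits are roughly 2x peak traffic.
SOFT_MULTIPLIER = 2


def staged_limits(peak_requests_per_window, production_limit):
    """Return the per-stage limit for one rate-limiting window.

    production_limit is supplied by the operator from the target SLO;
    it is not computed here.
    """
    if peak_requests_per_window <= 0:
        raise ValueError("peak must be a positive request count")
    return {
        "dry_run": math.ceil(peak_requests_per_window * DRY_RUN_MULTIPLIER),
        "soft": math.ceil(peak_requests_per_window * SOFT_MULTIPLIER),
        "production": production_limit,
    }
```

For a measured peak of 1500 requests per window and an SLO-derived limit of 1800, this yields 150000 for the dry run, 3000 for the soft stage, and 1800 for production.
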
## Monitoring
### Key metrics (OpenTelemetry)
- `stellaops.router.ratelimit.allowed{scope,microservice,route?}`
- `stellaops.router.ratelimit.rejected{scope,microservice,route?}`
- `stellaops.router.ratelimit.check_latency{scope}`
- `stellaops.router.ratelimit.valkey.errors{error_type}`
- `stellaops.router.ratelimit.circuit_breaker.trips{reason}`
- `stellaops.router.ratelimit.instance.current`
- `stellaops.router.ratelimit.environment.current`
### PromQL examples
- Deny ratio (by microservice):
  - `sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice) / (sum(rate(stellaops_router_ratelimit_allowed_total[5m])) by (microservice) + sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice))`
- P95 check latency (environment):
  - `histogram_quantile(0.95, sum(rate(stellaops_router_ratelimit_check_latency_bucket{scope="environment"}[5m])) by (le))`
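
The deny-ratio query divides the rejected rate by the total (allowed plus rejected) rate for each microservice. As a worked check of that arithmetic, here is a small Python sketch; `deny_ratio` is a hypothetical helper, and the input rates are illustrative values, not measurements.

```python
def deny_ratio(allowed_rate, rejected_rate):
    """Return rejected / (allowed + rejected), or None when there is no traffic.

    Mirrors the PromQL expression for a single microservice. PromQL itself
    would produce NaN for 0 / 0; this sketch returns None instead.
    """
    total = allowed_rate + rejected_rate
    if total == 0:
        return None
    return rejected_rate / total


# Illustrative per-second rates for one microservice over a 5m window.
ratio = deny_ratio(allowed_rate=95.0, rejected_rate=5.0)
print(f"deny ratio: {ratio:.2%}")
```

With 95 allowed and 5 rejected requests per second, 5% of requests are being denied.
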
## Incident response
### Sudden spike in 429s
- Confirm whether this is expected traffic growth or misconfiguration.
- Identify the top offenders: break down the `rejected` metric by `microservice` and, optionally, by `route`.
- If misconfigured: raise limits conservatively (2×), redeploy config, then tighten gradually.
### Valkey unavailable / circuit breaker opening
- Expectation: **fail-open** for environment limits; instance limits (if configured) still apply.
- Check:
  - `stellaops.router.ratelimit.valkey.errors`
  - `stellaops.router.ratelimit.circuit_breaker.trips`
- Actions:
  - Restore Valkey connectivity/performance.
  - Consider temporarily increasing `process_back_pressure_when_more_than_per_5min` to reduce Valkey load.
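
To make the fail-open expectation concrete, here is a minimal Python sketch of a limiter that admits requests whenever the backing store errors, and that stops calling the store for a cooldown period after repeated failures. This is not the Router's implementation: the class name `FailOpenLimiter`, its parameters, and the default thresholds are all assumptions chosen for illustration, and the instance-scope limits that still apply are outside the sketch.

```python
import time


class FailOpenLimiter:
    """Environment-scope check that fails open when the store is unavailable.

    Hypothetical sketch: check_store stands in for a Valkey-backed check and
    is any callable(key) -> bool that may raise on connectivity problems.
    """

    def __init__(self, check_store, failure_threshold=3, open_seconds=30.0,
                 clock=time.monotonic):
        self._check_store = check_store
        self._failure_threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0

    def allow(self, key):
        if self._clock() < self._open_until:
            # Breaker is open: skip the store entirely and admit the request.
            return True
        try:
            allowed = self._check_store(key)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Trip the breaker so the store gets time to recover.
                self._open_until = self._clock() + self._open_seconds
                self._failures = 0
            # Store error: fail open rather than reject traffic.
            return True
        # A successful check resets the consecutive-failure count.
        self._failures = 0
        return allowed
```

The design choice to return `True` on both paths (store error and open breaker) is what keeps the platform available during a store outage, at the cost of environment limits not being enforced until the store recovers.
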
## Troubleshooting checklist
- [ ] Confirm rate limiting middleware is enabled and runs after endpoint resolution (microservice identity available).
- [ ] Validate YAML binding: incorrect keys should fail fast at startup.
- [ ] Confirm Valkey connectivity from Router nodes (if `for_environment` enabled).
- [ ] Ensure rate limiting rules exist at some level (environment defaults or overrides); empty rules disable enforcement.
- [ ] Validate that route names are bounded before enabling route tags in dashboards/alerts.
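
For the last checklist item, one way to confirm that route names are bounded is to count the distinct `route` label values seen in a sample of traffic. The Python sketch below assumes the route values have already been collected into a list; `check_route_cardinality` and the default threshold of 100 are hypothetical and should be tuned to what the metrics backend can tolerate.

```python
def check_route_cardinality(route_names, max_distinct=100):
    """Return (ok, distinct_count) for a sample of observed route label values.

    Hypothetical helper: ok is False when the sample contains more distinct
    route names than max_distinct, which suggests an unbounded label.
    """
    distinct = len(set(route_names))
    return distinct <= max_distinct, distinct


# Templated route names collapse to a few values...
templated = ["/scans/{id}", "/scans/{id}", "/reports/{id}"]
# ...while raw paths with embedded IDs grow with traffic.
raw = [f"/scans/{i}" for i in range(500)]

print(check_route_cardinality(templated))
print(check_route_cardinality(raw))
```

If the raw-path case fails the check, route tags are better left disabled in dashboards and alerts until the names are normalized.
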
## Load testing
- Run `tests/load/router-rate-limiting-load-test.js` against a staging Router configured with known limits.
- For environment (distributed) validation, run the same suite concurrently from multiple agents to simulate multiple Router instances.