work work hard work

StellaOps Bot
2025-12-18 00:47:24 +02:00
parent dee252940b
commit b4235c134c
189 changed files with 9627 additions and 3258 deletions

@@ -0,0 +1,65 @@
# Router Rate Limiting Runbook
Last updated: 2025-12-17
## Purpose
- Enforce centralized admission control at the Router (429 + Retry-After).
- Reduce duplicate per-service HTTP throttling and standardize response semantics.
- Keep the platform available under dependency failures (Valkey fail-open + circuit breaker).
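
Because every rejection uses the same 429 + Retry-After shape, callers can share one backoff rule. The following is a minimal client-side sketch, not Router code. It assumes the header may arrive either as delay seconds or as an HTTP date (both forms are valid HTTP); the function name `retry_after_seconds` and the 1-second fallback are illustrative assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None,
                        default: float = 1.0) -> float:
    """Convert a Retry-After header value into a non-negative wait in seconds."""
    if header_value is None:
        return default  # header missing: fall back to an assumed default delay
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "5"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: fall back instead of failing
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A caller would sleep for the returned number of seconds before retrying the request that received the 429.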
## Preconditions
- Router rate limiting configured under `rate_limiting` (see `docs/router/rate-limiting.md`).
- If `for_environment` is enabled:
  - Valkey reachable from Router instances.
  - Circuit breaker parameters reviewed for the environment.
## Rollout plan (recommended)
1. **Dry-run wiring**: enable rate limiting with limits set far above peak traffic to validate middleware order, headers, and metrics.
2. **Soft limits**: set limits to ~2× peak traffic and monitor the rejection rate and the rate-limit check latency.
3. **Production limits**: set limits according to the target SLO and operational constraints.
4. **Migration cleanup**: remove any remaining service-level HTTP rate limiters to avoid double-limiting.
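
The arithmetic behind the first three stages can be sketched as follows. This is an illustration only: the runbook says "far above peak traffic" without giving a number, so the `dry_run_multiplier` of 100 is an assumption, and `staged_limits` is a hypothetical helper rather than part of the Router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagedLimits:
    dry_run: int      # stage 1: far above peak, should never reject
    soft: int         # stage 2: ~2x peak
    production: int   # stage 3: derived from SLO and operational constraints


def staged_limits(peak_per_window: int,
                  production_per_window: int,
                  dry_run_multiplier: int = 100) -> StagedLimits:
    """Derive per-stage limits from observed peak traffic for one window."""
    if peak_per_window <= 0 or production_per_window <= 0:
        raise ValueError("traffic figures must be positive")
    return StagedLimits(
        dry_run=peak_per_window * dry_run_multiplier,  # assumed multiplier
        soft=2 * peak_per_window,
        production=production_per_window,
    )
```

For example, with an observed peak of 500 requests per window and a production target of 800, the stages would use limits of 50000, 1000, and 800 respectively.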
## Monitoring
### Key metrics (OpenTelemetry)
- `stellaops.router.ratelimit.allowed{scope,microservice,route?}`
- `stellaops.router.ratelimit.rejected{scope,microservice,route?}`
- `stellaops.router.ratelimit.check_latency{scope}`
- `stellaops.router.ratelimit.valkey.errors{error_type}`
- `stellaops.router.ratelimit.circuit_breaker.trips{reason}`
- `stellaops.router.ratelimit.instance.current`
- `stellaops.router.ratelimit.environment.current`
### PromQL examples
- Deny ratio (by microservice):
  - `sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice) / (sum(rate(stellaops_router_ratelimit_allowed_total[5m])) by (microservice) + sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice))`
- P95 check latency (environment):
  - `histogram_quantile(0.95, sum(rate(stellaops_router_ratelimit_check_latency_bucket{scope="environment"}[5m])) by (le))`
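
The deny ratio query computes rejected / (allowed + rejected) per microservice. The same arithmetic in a small sketch, useful for sanity-checking dashboard values by hand; `deny_ratio` is a hypothetical helper, and returning 0.0 for an idle service is a choice made here (the PromQL expression would yield NaN for 0/0).

```python
def deny_ratio(allowed_rate: float, rejected_rate: float) -> float:
    """Fraction of requests rejected, given allowed and rejected rates."""
    total = allowed_rate + rejected_rate
    if total == 0:
        return 0.0  # no traffic at all: report 0 rather than dividing by zero
    return rejected_rate / total
```

With 950 allowed and 50 rejected requests per second, the ratio is 50 / 1000 = 0.05, that is, 5% of requests are being denied.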
## Incident response
### Sudden spike in 429s
- Confirm whether this is expected traffic growth or misconfiguration.
- Identify the top offenders: break down `stellaops.router.ratelimit.rejected` by `microservice` and (optionally) `route`.
- If misconfigured: raise limits conservatively (2×), redeploy config, then tighten gradually.
### Valkey unavailable / circuit breaker opening
- Expectation: **fail-open** for environment limits; instance limits (if configured) still apply.
- Check:
  - `stellaops.router.ratelimit.valkey.errors`
  - `stellaops.router.ratelimit.circuit_breaker.trips`
- Actions:
  - Restore Valkey connectivity/performance.
  - Consider temporarily increasing `process_back_pressure_when_more_than_per_5min` to reduce Valkey load.
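
To make the expected fail-open behavior concrete, here is a minimal sketch of an environment-scope check wrapped in a circuit breaker. It is not the Router's implementation: the class name, the `failure_threshold` and `open_seconds` parameters, and the use of `ConnectionError` to represent a Valkey failure are all assumptions for illustration. Instance-scope limits are outside this sketch and would still be evaluated separately.

```python
import time
from typing import Callable


class FailOpenLimiter:
    """Environment-scope check that admits traffic when the store is failing."""

    def __init__(self, check: Callable[[str], bool],
                 failure_threshold: int = 3,
                 open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check                      # store-backed check (hypothetical)
        self._failure_threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock
        self._failures = 0                       # consecutive store failures
        self._open_until = 0.0                   # breaker is open until this time
        self.trips = 0                           # analogous to circuit_breaker.trips

    def allow(self, key: str) -> bool:
        if self._clock() < self._open_until:
            return True  # breaker open: skip the store entirely and fail open
        try:
            allowed = self._check(key)
        except ConnectionError:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Too many consecutive failures: stop calling the store for a while.
                self._open_until = self._clock() + self._open_seconds
                self._failures = 0
                self.trips += 1
            return True  # store error: fail open rather than reject
        self._failures = 0  # a successful check resets the failure streak
        return allowed
```

While the breaker is open, no calls reach the store, which is why a rising trip count together with a flat rejected rate is consistent with the fail-open expectation above.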
## Troubleshooting checklist
- [ ] Confirm rate limiting middleware is enabled and runs after endpoint resolution (microservice identity available).
- [ ] Validate YAML binding: incorrect keys should fail fast at startup.
- [ ] Confirm Valkey connectivity from Router nodes (if `for_environment` enabled).
- [ ] Ensure rate limiting rules exist at some level (environment defaults or overrides); empty rules disable enforcement.
- [ ] Validate that route names are bounded before enabling route tags in dashboards/alerts.
## Load testing
- Run `tests/load/router-rate-limiting-load-test.js` against a staging Router configured with known limits.
- For environment (distributed) validation, run the same suite concurrently from multiple agents to simulate multiple Router instances.
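
When several agents run concurrently, their results need to be combined before comparing against the configured limit. A rough sketch of that check, assuming each agent records the HTTP status code of every request and that all requests fall within a single rate-limit window; the helper names are hypothetical and unrelated to the load-test script itself.

```python
from collections import Counter
from typing import Dict, Iterable, List


def summarize_agents(per_agent_codes: Iterable[List[int]]) -> Dict[str, int]:
    """Merge status codes from all agents into allowed/rejected/other counts."""
    counts: Counter = Counter()
    for codes in per_agent_codes:
        counts.update(codes)
    allowed = sum(n for code, n in counts.items() if 200 <= code < 300)
    rejected = counts.get(429, 0)
    other = sum(counts.values()) - allowed - rejected  # e.g. 5xx, timeouts mapped to codes
    return {"allowed": allowed, "rejected": rejected, "other": other}


def within_limit(summary: Dict[str, int], limit: int, tolerance: int = 0) -> bool:
    """True if the combined allowed count stays within the environment limit."""
    return summary["allowed"] <= limit + tolerance
```

If two agents together see more allowed requests than the environment limit permits, the distributed limit is not being shared across Router instances as intended (or the circuit breaker was open during the run).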