Router Rate Limiting Runbook

Last updated: 2025-12-17

Purpose

  • Enforce centralized admission control at the Router (HTTP 429 responses with a Retry-After header).
  • Reduce duplicate per-service HTTP throttling and standardize response semantics.
  • Keep the platform available under dependency failures (Valkey fail-open + circuit breaker).

Preconditions

  • Router rate limiting configured under rate_limiting (see docs/router/rate-limiting.md).
  • If for_environment is enabled:
    • Valkey reachable from Router instances.
    • Circuit breaker parameters reviewed for the environment.

Rollout plan

  1. Dry-run wiring: enable rate limiting with limits set far above peak traffic to validate middleware order, headers, and metrics.
  2. Soft limits: set limits to ~2× peak traffic and monitor the rejection rate and check latency.
  3. Production limits: set limits according to the target SLOs and operational constraints.
  4. Migration cleanup: remove any remaining service-level HTTP rate limiters to avoid double-limiting.
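
The staged limits above can be worked out from a measured peak. The following is a minimal sketch, not part of the Router: the function name, the stage labels, and the 100× dry-run multiplier are assumptions chosen for illustration. Only the ~2× soft-limit factor comes from this runbook.

```python
from typing import Optional

# Assumption: "far above peak traffic" is modelled as 100x peak. Pick your own factor.
DRY_RUN_MULTIPLIER = 100.0
# From the rollout plan: soft limits sit at roughly 2x peak traffic.
SOFT_MULTIPLIER = 2.0


def staged_limit(peak_rps: float, stage: str,
                 production_limit: Optional[float] = None) -> float:
    """Return the requests-per-second limit to configure for a rollout stage."""
    if peak_rps <= 0:
        raise ValueError("peak_rps must be positive")
    if stage == "dry_run":
        return peak_rps * DRY_RUN_MULTIPLIER
    if stage == "soft":
        return peak_rps * SOFT_MULTIPLIER
    if stage == "production":
        # Production limits come from SLOs and operational constraints,
        # so they are supplied explicitly rather than derived from peak.
        if production_limit is None:
            raise ValueError("production stage needs an explicit limit")
        return production_limit
    raise ValueError(f"unknown stage: {stage!r}")
```

For a measured peak of 500 requests per second, `staged_limit(500, "soft")` gives 1000.0, which would then be translated into whatever window the configured rule uses.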

Monitoring

Key metrics (OpenTelemetry)

  • stellaops.router.ratelimit.allowed{scope,microservice,route?}
  • stellaops.router.ratelimit.rejected{scope,microservice,route?}
  • stellaops.router.ratelimit.check_latency{scope}
  • stellaops.router.ratelimit.valkey.errors{error_type}
  • stellaops.router.ratelimit.circuit_breaker.trips{reason}
  • stellaops.router.ratelimit.instance.current
  • stellaops.router.ratelimit.environment.current

PromQL examples

  • Deny ratio (by microservice):
    • sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice) / (sum(rate(stellaops_router_ratelimit_allowed_total[5m])) by (microservice) + sum(rate(stellaops_router_ratelimit_rejected_total[5m])) by (microservice))
  • P95 check latency (environment):
    • histogram_quantile(0.95, sum(rate(stellaops_router_ratelimit_check_latency_bucket{scope="environment"}[5m])) by (le))
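
To make the deny-ratio query concrete, the same arithmetic can be reproduced offline from counter rates, for example when checking a dashboard panel against exported samples. This is a sketch under assumptions: the function names and the service names in the usage note are hypothetical, and it returns 0.0 for a service with no traffic, whereas the PromQL division would yield NaN or no sample.

```python
from typing import Dict


def deny_ratio(allowed: float, rejected: float) -> float:
    """rejected / (allowed + rejected), with 0.0 when there was no traffic."""
    total = allowed + rejected
    return rejected / total if total > 0 else 0.0


def deny_ratio_by_microservice(allowed: Dict[str, float],
                               rejected: Dict[str, float]) -> Dict[str, float]:
    """Compute the deny ratio per microservice from per-service rates."""
    services = sorted(set(allowed) | set(rejected))
    return {
        name: deny_ratio(allowed.get(name, 0.0), rejected.get(name, 0.0))
        for name in services
    }
```

With allowed rates `{"service-a": 90, "service-b": 30}` and rejected rates `{"service-a": 10, "service-b": 10}`, the result is `{"service-a": 0.1, "service-b": 0.25}`.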

Incident response

Sudden spike in 429s

  • Confirm whether this is expected traffic growth or misconfiguration.
  • Identify the top offenders: rejected by microservice and (optionally) route.
  • If misconfigured: raise limits conservatively (for example, to 2× the current value), redeploy config, then tighten gradually.
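
Identifying the top offenders amounts to ranking the rejected counts by their label. A minimal sketch, assuming the counts have already been pulled from the metrics backend into a mapping keyed by microservice (or by microservice and route); the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def top_offenders(rejected_by_key: Dict[str, float],
                  limit: int = 5) -> List[Tuple[str, float]]:
    """Return the keys with the most rejections, highest first.

    Keys with zero rejections are dropped; ties are ordered by key
    so the output is stable between runs.
    """
    nonzero = [item for item in rejected_by_key.items() if item[1] > 0]
    ranked = sorted(nonzero, key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

For example, `top_offenders({"service-a": 5, "service-b": 50, "service-c": 0}, limit=2)` returns `[("service-b", 50), ("service-a", 5)]`.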

Valkey unavailable / circuit breaker opening

  • Expectation: fail-open for environment limits; instance limits (if configured) still apply.
  • Check:
    • stellaops.router.ratelimit.valkey.errors
    • stellaops.router.ratelimit.circuit_breaker.trips
  • Actions:
    • Restore Valkey connectivity/performance.
    • Consider temporarily increasing process_back_pressure_when_more_than_per_5min to reduce Valkey load.
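
The fail-open expectation can be illustrated with a small circuit breaker. This is a generic sketch of the pattern, not the Router's implementation: the class name, the failure threshold, and the open duration are assumptions, and `backend_check` stands in for whatever call consults Valkey.

```python
import time
from typing import Callable


class FailOpenBreaker:
    """Admit requests when the backing store fails, and stop calling it for a while."""

    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0      # consecutive backend failures
        self.open_until = 0.0  # breaker is open while clock() < open_until
        self.trips = 0         # how many times the breaker has opened

    def is_open(self) -> bool:
        return self.clock() < self.open_until

    def check(self, backend_check: Callable[[], bool]) -> bool:
        """Return True if the request is admitted."""
        if self.is_open():
            # Fail-open: skip the backend entirely and admit the request.
            return True
        try:
            allowed = backend_check()
        except Exception:
            # Any backend error counts as a failure; the request is still admitted.
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = self.clock() + self.open_seconds
                self.failures = 0
                self.trips += 1
            return True
        self.failures = 0
        return allowed
```

While the breaker is open the backend is not called at all, which is what relieves a struggling store. Instance-scope limits, if configured, would be evaluated separately and are not modelled here.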

Troubleshooting checklist

  • Confirm rate limiting middleware is enabled and runs after endpoint resolution (microservice identity available).
  • Validate YAML binding: incorrect keys should fail fast at startup.
  • Confirm Valkey connectivity from Router nodes (if for_environment enabled).
  • Ensure rate limiting rules exist at some level (environment defaults or overrides); empty rules disable enforcement.
  • Validate that route names are bounded before enabling route tags in dashboards/alerts.
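
The last check, bounded route names, is about metric label cardinality: a route tag built from raw paths grows without limit, while one built from route templates does not. A minimal sketch for checking a sample of observed route labels; the function name and the budget of 100 distinct values are assumptions.

```python
from typing import Iterable


def route_tag_is_bounded(route_names: Iterable[str], max_distinct: int = 100) -> bool:
    """Return True if the sample contains at most max_distinct distinct route names."""
    return len(set(route_names)) <= max_distinct
```

A sample of 1000 requests labelled `/items/{id}` passes with any budget, whereas the same requests labelled `/items/0`, `/items/1`, and so on produce 1000 distinct values and fail.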

Load testing

  • Run tests/load/router-rate-limiting-load-test.js against a staging Router configured with known limits.
  • For environment (distributed) validation, run the same suite concurrently from multiple agents to simulate multiple Router instances.
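
Why the distributed case needs its own validation can be seen in a toy model. The sketch below is not the load test and makes strong simplifying assumptions: a single time window, a plain counter standing in for Valkey, and each agent's traffic landing on its own Router instance. All names are hypothetical.

```python
def admitted_requests(instance_count: int, requests_per_instance: int,
                      limit: int, shared: bool) -> int:
    """Count requests admitted in one window across all instances.

    shared=True models an environment-scope limit (one counter for everyone);
    shared=False models instance-scope limits (one counter per instance).
    """
    shared_count = 0
    admitted = 0
    for _ in range(instance_count):
        local_count = 0
        for _ in range(requests_per_instance):
            current = shared_count if shared else local_count
            if current < limit:
                admitted += 1
                if shared:
                    shared_count += 1
                else:
                    local_count += 1
    return admitted
```

With 3 instances, 100 requests each, and a limit of 50, the shared counter admits 50 requests in total while per-instance counters admit 150. A single-agent run cannot tell these two behaviours apart, which is the reason for running the suite from multiple agents.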