# Router Rate Limiting - Sprint Package README

**Package Created:** 2025-12-17
**For:** Implementation agents
**Advisory Source:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`

---

## Package Contents

This sprint package contains everything needed to implement centralized rate limiting in Stella Router.

### Core Sprint Files

| File | Purpose | Agent Role |
|------|---------|------------|
| `SPRINT_1200_001_000_router_rate_limiting_master.md` | Master tracker | **START HERE** - Overview & progress tracking |
| `SPRINT_1200_001_001_router_rate_limiting_core.md` | Sprint 1: Core implementation | Implementer - 5-7 days |
| `SPRINT_1200_001_002_router_rate_limiting_per_route.md` | Sprint 2: Per-route granularity | Implementer - 2-3 days |
| `SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md` | Sprint 3: Rule stacking | Implementer - 2-3 days |
| `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` | Technical reference | **READ FIRST** before coding |

### Documentation Files (To Be Created in Sprint 6)

| File | Purpose | Created In |
|------|---------|------------|
| `docs/router/rate-limiting.md` | User-facing configuration guide | Sprint 6 |
| `docs/operations/router-rate-limiting.md` | Operational runbook | Sprint 6 |
| `docs/modules/router/architecture.md` | Architecture documentation | Sprint 6 |

---

## Implementation Sequence

### Phase 1: Core Implementation (Sprints 1-3)

```
Sprint 1 (5-7 days)
├── Task 1.1: Configuration Models
├── Task 1.2: Instance Rate Limiter
├── Task 1.3: Valkey Backend
├── Task 1.4: Middleware Integration
├── Task 1.5: Metrics
└── Task 1.6: Wire into Pipeline

Sprint 2 (2-3 days)
├── Task 2.1: Extend Config for Routes
├── Task 2.2: Route Matching
├── Task 2.3: Inheritance Resolution
├── Task 2.4: Integrate into Service
└── Task 2.5: Documentation

Sprint 3 (2-3 days)
├── Task 3.1: Config for Rule Arrays
├── Task 3.2: Update Instance Limiter
├── Task 3.3: Enhance Valkey Lua Script
└── Task 3.4: Update Inheritance Resolver
```

### Phase 2: Migration & Testing (Sprints 4-5)

```
Sprint 4 (3-4 days) - Service Migration
├── Extract AdaptiveRateLimiter configs
├── Add to Router configuration
├── Refactor AdaptiveRateLimiter
└── Integration validation

Sprint 5 (3-5 days) - Comprehensive Testing
├── Unit test suite
├── Integration tests (Testcontainers)
├── Load tests (k6 scenarios A-F)
└── Configuration matrix tests
```

### Phase 3: Documentation (Sprint 6)

```
Sprint 6 (2 days)
├── Architecture docs
├── Configuration guide
├── Operational runbook
└── Migration guide
```

### Phase 4: Rollout (3 weeks, post-implementation)

```
Week 1: Shadow Mode
└── Metrics only, no enforcement

Week 2: Soft Limits
└── 2x traffic peaks

Week 3: Production Limits
└── Full enforcement

Week 4+: Service Migration
└── Remove redundant limiters
```

---

## Quick Start for Agents

### 1. Context Gathering (30 minutes)

**Read in this order:**

1. `SPRINT_1200_001_000_router_rate_limiting_master.md` - Overview
2. `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` - Technical details
3. Original advisory: `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
4. Analysis plan: `C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md`

### 2. Environment Setup

```bash
# Working directory
cd src/__Libraries/StellaOps.Router.Gateway/

# Verify dependencies
dotnet restore

# Install Valkey for local testing
docker run -d -p 6379:6379 valkey/valkey:latest

# Run existing tests to ensure baseline
dotnet test
```

### 3. Start Sprint 1

Open `SPRINT_1200_001_001_router_rate_limiting_core.md` and follow the task breakdown.

**Task execution pattern:**

```
For each task:
1. Read task description
2. Review implementation code samples
3. Create files as specified
4. Write unit tests
5. Mark task complete in master tracker
6. Commit with message: "feat(router): [Sprint 1.X] Task name"
```

---

## Key Design Decisions (Reference)

### 1. Status Codes

- ✅ **429 Too Many Requests** for rate limiting
- ❌ NOT 503 (that's for service health)
- ❌ NOT 202 (that's for async job acceptance)

### 2. Two-Scope Architecture

- **for_instance**: In-memory, protects a single router instance
- **for_environment**: Valkey-backed, protects the aggregate across all instances

Both are necessary; neither can replace the other.

### 3. Fail-Open Philosophy

- Circuit breaker on Valkey failures
- Activation gate optimization
- Instance limits enforced even if Valkey is down

### 4. Configuration Inheritance

- Replacement semantics (not merge)
- Most specific wins: route > microservice > environment > global

### 5. Rule Stacking

- Multiple rules per target = AND logic
- All rules must pass
- Most restrictive Retry-After returned

---

## Performance Targets

| Metric | Target | Measurement |
|--------|--------|-------------|
| Instance check latency | <1ms P99 | BenchmarkDotNet |
| Environment check latency | <10ms P99 | k6 load test |
| Router throughput | 100k req/sec | k6 constant-arrival-rate |
| Valkey load per instance | <1000 ops/sec | redis-cli INFO |

---

## Testing Requirements

### Unit Tests

- **Coverage:** >90% for all RateLimit/* files
- **Framework:** xUnit
- **Patterns:** Arrange-Act-Assert

### Integration Tests

- **Tool:** TestServer + Testcontainers (Valkey)
- **Scope:** End-to-end middleware pipeline
- **Scenarios:** All config combinations

### Load Tests

- **Tool:** k6
- **Scenarios:** A (instance), B (environment), C (activation gate), D (microservice), E (Valkey failure), F (max throughput)
- **Duration:** 30s per scenario minimum

---

## Common Implementation Gotchas

⚠️ **Middleware Pipeline Order**

```csharp
// CORRECT:
app.UsePayloadLimits();
app.UseRateLimiting();        // BEFORE routing
app.UseEndpointResolution();

// WRONG:
app.UseEndpointResolution();
app.UseRateLimiting();        // Too late, can't identify microservice
```

⚠️ **Lua Script Deployment**

```xml
<!-- Assumed item type and path; the point is that the Lua script must be copied to the output directory -->
<ItemGroup>
  <None Update="RateLimit\Scripts\rate_limit_check.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </None>
</ItemGroup>
```

⚠️ **Clock Skew**

```lua
-- CORRECT: Use Valkey server time
local now = tonumber(redis.call("TIME")[1])

-- WRONG: Use client time (clock skew issues)
local now = os.time()
```

⚠️ **Circuit Breaker Half-Open**

```csharp
// REQUIRED: Implement half-open state
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow ONE test request
}
```

---

## Success Criteria Checklist

Copy this to the master tracker and update it as you progress:

### Functional

- [ ] Router enforces per-instance limits (in-memory)
- [ ] Router enforces per-environment limits (Valkey-backed)
- [ ] Per-microservice configuration works
- [ ] Per-route configuration works
- [ ] Multiple rules per target work (rule stacking)
- [ ] 429 + Retry-After response format correct
- [ ] Circuit breaker handles Valkey failures
- [ ] Activation gate reduces Valkey load

### Performance

- [ ] Instance check <1ms P99
- [ ] Environment check <10ms P99
- [ ] 100k req/sec throughput maintained
- [ ] Valkey load <1000 ops/sec per instance

### Operational

- [ ] Metrics exported to OpenTelemetry
- [ ] Dashboards created (Grafana)
- [ ] Alerts configured (Alertmanager)
- [ ] Documentation complete
- [ ] Migration from service-level rate limiters complete

### Quality

- [ ] Unit test coverage >90%
- [ ] Integration tests pass (all scenarios)
- [ ] Load tests pass (k6 scenarios A-F)
- [ ] Failure injection tests pass

---

## Escalation & Support

### Blocked on Technical Decision

**Escalate to:** Architecture Guild (#stella-architecture)
**Response SLA:** 24 hours

### Blocked on Resource (Valkey, config, etc.)

**Escalate to:** Platform Engineering (#stella-platform)
**Response SLA:** 4 hours

### Blocked on Clarification

**Escalate to:** Router Team Lead (#stella-router-dev)
**Response SLA:** 2 hours

### Sprint Falling Behind Schedule

**Escalate to:** Project Manager (update master tracker with BLOCKED status)
**Action:** Add note in "Decisions & Risks" section

---

## File Structure (After Implementation)

```
src/__Libraries/StellaOps.Router.Gateway/
├── RateLimit/
│   ├── RateLimitConfig.cs
│   ├── IRateLimiter.cs
│   ├── InstanceRateLimiter.cs
│   ├── EnvironmentRateLimiter.cs
│   ├── RateLimitService.cs
│   ├── RateLimitMetrics.cs
│   ├── RateLimitDecision.cs
│   ├── ValkeyRateLimitStore.cs
│   ├── CircuitBreaker.cs
│   ├── LimitInheritanceResolver.cs
│   ├── Models/
│   │   ├── InstanceLimitsConfig.cs
│   │   ├── EnvironmentLimitsConfig.cs
│   │   ├── MicroserviceLimitsConfig.cs
│   │   ├── RouteLimitsConfig.cs
│   │   ├── RateLimitRule.cs
│   │   └── EffectiveLimits.cs
│   ├── RouteMatching/
│   │   ├── IRouteMatcher.cs
│   │   ├── RouteMatcher.cs
│   │   ├── ExactRouteMatcher.cs
│   │   ├── PrefixRouteMatcher.cs
│   │   └── RegexRouteMatcher.cs
│   ├── Internal/
│   │   └── SlidingWindowCounter.cs
│   └── Scripts/
│       └── rate_limit_check.lua
├── Middleware/
│   └── RateLimitMiddleware.cs
├── ApplicationBuilderExtensions.cs (modified)
└── ServiceCollectionExtensions.cs (modified)

__Tests/
├── RateLimit/
│   ├── InstanceRateLimiterTests.cs
│   ├── EnvironmentRateLimiterTests.cs
│   ├── ValkeyRateLimitStoreTests.cs
│   ├── RateLimitMiddlewareTests.cs
│   ├── ConfigurationTests.cs
│   ├── RouteMatchingTests.cs
│   └── InheritanceResolverTests.cs

tests/load/k6/
└── rate-limit-scenarios.js
```

---

## Next Steps After Package Review

1. **Acknowledge receipt** of sprint package
2. **Set up development environment** (Valkey, dependencies)
3. **Read Implementation Guide** in full
4. **Start Sprint 1, Task 1.1** (Configuration Models)
5. **Update master tracker** as tasks complete
6. **Commit frequently** with clear messages
7. **Run tests after each task**
8. **Ask questions early** if blocked

---

## Configuration Quick Reference

### Minimal Config (Just Defaults)

```yaml
rate_limiting:
  for_instance:
    per_seconds: 300
    max_requests: 30000
```

### Full Config (All Features)

```yaml
rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000

  for_instance:
    rules:
      - per_seconds: 300
        max_requests: 30000
      - per_seconds: 30
        max_requests: 5000

  for_environment:
    valkey_bucket: "stella-router-rate-limit"
    valkey_connection: "valkey.stellaops.local:6379"

    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_timeout: 10

    rules:
      - per_seconds: 300
        max_requests: 30000

    microservices:
      concelier:
        rules:
          - per_seconds: 1
            max_requests: 10
          - per_seconds: 3600
            max_requests: 3000

      scanner:
        rules:
          - per_seconds: 60
            max_requests: 600

        routes:
          scan_submit:
            pattern: "/api/scans"
            match_type: exact
            rules:
              - per_seconds: 10
                max_requests: 50
```

---

## Related Documentation

### Source Documents

- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
- **Analysis Plan:** `C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md`
- **Architecture:** `docs/modules/platform/architecture-overview.md`

### Implementation Sprints

- **Master Tracker:** `SPRINT_1200_001_000_router_rate_limiting_master.md`
- **Sprint 1:** `SPRINT_1200_001_001_router_rate_limiting_core.md`
- **Sprint 2:** `SPRINT_1200_001_002_router_rate_limiting_per_route.md`
- **Sprint 3:** `SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md`
- **Sprints 4-6:** To be created by implementer (templates in master tracker)

### Technical Guides

- **Implementation Guide:** `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` (comprehensive)
- **HTTP 429 Semantics:** RFC 6585
- **Valkey Documentation:** https://valkey.io/docs/

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-12-17 | Initial sprint package created |

---

**Ready to implement?** Start with the Implementation Guide, then proceed to Sprint 1!
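
---

## Appendix: Rule Stacking Sketch

The rule-stacking decision listed under Key Design Decisions (AND logic, most restrictive Retry-After) can be sketched as follows. This is a minimal illustration, not the implementation from the sprint files: `RuleResult`, `StackedDecision`, and `RuleStacking` are hypothetical names, and "most restrictive" is assumed to mean the longest wait among the rejecting rules.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; the real models are expected
// to live under RateLimit/Models.
public sealed record RuleResult(bool Allowed, TimeSpan RetryAfter);

public sealed record StackedDecision(bool Allowed, TimeSpan RetryAfter);

public static class RuleStacking
{
    // AND semantics: the request passes only if every rule allows it.
    // When several rules reject, the longest wait is reported, so a client
    // that honors it satisfies all rejecting rules (assumed interpretation
    // of "most restrictive Retry-After").
    public static StackedDecision Combine(IEnumerable<RuleResult> results)
    {
        var rejected = results.Where(r => !r.Allowed).ToList();
        if (rejected.Count == 0)
        {
            return new StackedDecision(true, TimeSpan.Zero);
        }

        return new StackedDecision(false, rejected.Max(r => r.RetryAfter));
    }

    // The Retry-After header carries whole seconds, so round up to avoid
    // telling the client to retry slightly too early.
    public static int RetryAfterSeconds(TimeSpan wait) =>
        (int)Math.Ceiling(wait.TotalSeconds);
}
```

With the stacked `concelier` rules from the full config (10 requests per second and 3000 per hour), a request that exceeds both windows would be rejected once, with the longer of the two waits in the 429 response.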