Router Rate Limiting - Sprint Package README
Package Created: 2025-12-17
For: Implementation agents
Advisory Source: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md
Package Contents
This sprint package contains everything needed to implement centralized rate limiting in Stella Router.
Core Sprint Files
| File | Purpose | Agent Role |
|---|---|---|
| SPRINT_1200_001_000_router_rate_limiting_master.md | Master tracker | START HERE - Overview & progress tracking |
| SPRINT_1200_001_001_router_rate_limiting_core.md | Sprint 1: Core implementation | Implementer - 5-7 days |
| SPRINT_1200_001_002_router_rate_limiting_per_route.md | Sprint 2: Per-route granularity | Implementer - 2-3 days |
| SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md | Sprint 3: Rule stacking | Implementer - 2-3 days |
| SPRINT_1200_001_IMPLEMENTATION_GUIDE.md | Technical reference | READ FIRST before coding |
Documentation Files (To Be Created in Sprint 6)
| File | Purpose | Created In |
|---|---|---|
| docs/router/rate-limiting.md | User-facing configuration guide | Sprint 6 |
| docs/operations/router-rate-limiting.md | Operational runbook | Sprint 6 |
| docs/modules/router/architecture.md | Architecture documentation | Sprint 6 |
Implementation Sequence
Phase 1: Core Implementation (Sprints 1-3)
Sprint 1 (5-7 days)
├── Task 1.1: Configuration Models
├── Task 1.2: Instance Rate Limiter
├── Task 1.3: Valkey Backend
├── Task 1.4: Middleware Integration
├── Task 1.5: Metrics
└── Task 1.6: Wire into Pipeline
Sprint 2 (2-3 days)
├── Task 2.1: Extend Config for Routes
├── Task 2.2: Route Matching
├── Task 2.3: Inheritance Resolution
├── Task 2.4: Integrate into Service
└── Task 2.5: Documentation
Sprint 3 (2-3 days)
├── Task 3.1: Config for Rule Arrays
├── Task 3.2: Update Instance Limiter
├── Task 3.3: Enhance Valkey Lua Script
└── Task 3.4: Update Inheritance Resolver
Phase 2: Migration & Testing (Sprints 4-5)
Sprint 4 (3-4 days) - Service Migration
├── Extract AdaptiveRateLimiter configs
├── Add to Router configuration
├── Refactor AdaptiveRateLimiter
└── Integration validation
Sprint 5 (3-5 days) - Comprehensive Testing
├── Unit test suite
├── Integration tests (Testcontainers)
├── Load tests (k6 scenarios A-F)
└── Configuration matrix tests
Phase 3: Documentation & Rollout (Sprint 6)
Sprint 6 (2 days)
├── Architecture docs
├── Configuration guide
├── Operational runbook
└── Migration guide
Phase 4: Rollout (3 weeks, post-implementation)
Week 1: Shadow Mode
└── Metrics only, no enforcement
Week 2: Soft Limits
└── Limits set at 2x observed traffic peaks
Week 3: Production Limits
└── Full enforcement
Week 4+: Service Migration
└── Remove redundant limiters
Quick Start for Agents
1. Context Gathering (30 minutes)
Read in this order:
1. SPRINT_1200_001_000_router_rate_limiting_master.md - Overview
2. SPRINT_1200_001_IMPLEMENTATION_GUIDE.md - Technical details
3. Original advisory: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md
4. Analysis plan: C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md
2. Environment Setup
# Working directory
cd src/__Libraries/StellaOps.Router.Gateway/
# Verify dependencies
dotnet restore
# Install Valkey for local testing
docker run -d -p 6379:6379 valkey/valkey:latest
# Run existing tests to ensure baseline
dotnet test
3. Start Sprint 1
Open SPRINT_1200_001_001_router_rate_limiting_core.md and follow the task breakdown.
Task execution pattern:
For each task:
1. Read task description
2. Review implementation code samples
3. Create files as specified
4. Write unit tests
5. Mark task complete in master tracker
6. Commit with message: "feat(router): [Sprint 1.X] Task name"
Key Design Decisions (Reference)
1. Status Codes
- ✅ 429 Too Many Requests for rate limiting
- ❌ NOT 503 (that's for service health)
- ❌ NOT 202 (that's for async job acceptance)
2. Two-Scope Architecture
- for_instance: In-memory, protects single router
- for_environment: Valkey-backed, protects aggregate
Both are necessary; neither can replace the other.
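As an illustration of the in-memory for_instance scope, here is a minimal sketch of a sliding-log limiter. The class name `SlidingWindowLimiter` and its members are hypothetical and are not taken from the sprint files; the real `InstanceRateLimiter` may use a bucketed counter instead. The clock is passed in so the behaviour is easy to unit test.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: sliding-log limiter protecting a single router instance.
public sealed class SlidingWindowLimiter
{
    private readonly int _maxRequests;
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _hits = new();
    private readonly object _gate = new();

    public SlidingWindowLimiter(int maxRequests, TimeSpan window)
    {
        _maxRequests = maxRequests;
        _window = window;
    }

    // Returns true if the request is allowed. When it is rejected,
    // retryAfter is the time until the oldest recorded hit leaves the window.
    public bool TryAcquire(DateTime nowUtc, out TimeSpan retryAfter)
    {
        lock (_gate)
        {
            // Drop hits that have aged out of the window.
            while (_hits.Count > 0 && nowUtc - _hits.Peek() >= _window)
                _hits.Dequeue();

            if (_hits.Count < _maxRequests)
            {
                _hits.Enqueue(nowUtc);
                retryAfter = TimeSpan.Zero;
                return true;
            }

            retryAfter = _hits.Peek() + _window - nowUtc;
            return false;
        }
    }
}
```

A sliding log stores one timestamp per accepted request, so it is simple but not the most memory-efficient choice for large windows.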
3. Fail-Open Philosophy
- Circuit breaker on Valkey failures
- Activation gate optimization
- Instance limits enforced even if Valkey down
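The fail-open behaviour can be pictured with the following sketch. It assumes the two checks are exposed as simple delegates; the `FailOpen` helper is hypothetical and only shows the control flow, not the package's actual service.

```csharp
using System;

public static class FailOpen
{
    // Hypothetical helper: the instance check is always enforced, while a
    // failure of the Valkey-backed environment check lets the request through.
    public static bool IsAllowed(Func<bool> instanceCheck, Func<bool> environmentCheck)
    {
        // Instance limits are local, so they apply even when Valkey is down.
        if (!instanceCheck())
            return false;

        try
        {
            return environmentCheck();
        }
        catch (Exception)
        {
            // Fail open: a backend outage alone must not reject traffic.
            return true;
        }
    }
}
```

Production code would likely catch only connection and timeout exceptions and record a metric on each fallback.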
4. Configuration Inheritance
- Replacement semantics (not merge)
- Most specific wins: route > microservice > environment > global
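A minimal sketch of replacement semantics, assuming each level either carries its own rule list or is left unset (null). The `RateLimitRule` shape and the `LimitInheritance.Resolve` helper are hypothetical; the properties mirror the per_seconds and max_requests configuration keys.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical rule shape mirroring the per_seconds / max_requests keys.
public sealed record RateLimitRule(int PerSeconds, int MaxRequests);

public static class LimitInheritance
{
    // Returns the rule list of the most specific level that defines one.
    // Replacement semantics: lists from different levels are never merged.
    public static IReadOnlyList<RateLimitRule> Resolve(
        IReadOnlyList<RateLimitRule>? route,
        IReadOnlyList<RateLimitRule>? microservice,
        IReadOnlyList<RateLimitRule>? environment,
        IReadOnlyList<RateLimitRule>? global)
    {
        return route ?? microservice ?? environment ?? global
               ?? Array.Empty<RateLimitRule>();
    }
}
```

With this approach a route that defines a single rule drops every rule inherited from its microservice, which is the intended difference from merge semantics.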
5. Rule Stacking
- Multiple rules per target = AND logic
- All rules must pass
- Most restrictive Retry-After returned
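The stacking rules above can be sketched as a small combiner. `RuleResult` and `RuleStacking.Combine` are hypothetical names; the sketch assumes each rule has already been evaluated and reports its own wait time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical outcome of evaluating a single rule.
public readonly record struct RuleResult(bool Allowed, TimeSpan RetryAfter);

public static class RuleStacking
{
    // AND logic: the request passes only if every rule allows it.
    // On rejection, the longest wait among the failing rules is reported,
    // i.e. the most restrictive Retry-After.
    public static RuleResult Combine(IEnumerable<RuleResult> results)
    {
        var denied = results.Where(r => !r.Allowed).ToList();
        if (denied.Count == 0)
            return new RuleResult(true, TimeSpan.Zero);

        return new RuleResult(false, denied.Max(r => r.RetryAfter));
    }
}
```

For example, if a 1-second rule asks for a 1 s wait and an hourly rule asks for 120 s, the combined result is a rejection with a 120 s Retry-After.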
Performance Targets
| Metric | Target | Measurement |
|---|---|---|
| Instance check latency | <1ms P99 | BenchmarkDotNet |
| Environment check latency | <10ms P99 | k6 load test |
| Router throughput | 100k req/sec | k6 constant-arrival-rate |
| Valkey load per instance | <1000 ops/sec | redis-cli INFO |
Testing Requirements
Unit Tests
- Coverage: >90% for all RateLimit/* files
- Framework: xUnit
- Patterns: Arrange-Act-Assert
Integration Tests
- Tool: TestServer + Testcontainers (Valkey)
- Scope: End-to-end middleware pipeline
- Scenarios: All config combinations
Load Tests
- Tool: k6
- Scenarios: A (instance), B (environment), C (activation gate), D (microservice), E (Valkey failure), F (max throughput)
- Duration: 30s per scenario minimum
Common Implementation Gotchas
⚠️ Middleware Pipeline Order
// CORRECT:
app.UsePayloadLimits();
app.UseRateLimiting(); // BEFORE routing
app.UseEndpointResolution();
// WRONG:
app.UseEndpointResolution();
app.UseRateLimiting(); // Too late, can't identify microservice
⚠️ Lua Script Deployment
<!-- REQUIRED in .csproj -->
<ItemGroup>
  <Content Include="RateLimit\Scripts\*.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>
⚠️ Clock Skew
-- CORRECT: Use Valkey server time
local now = tonumber(redis.call("TIME")[1])
-- WRONG: Use client time (clock skew issues)
local now = os.time()
⚠️ Circuit Breaker Half-Open
// REQUIRED: Implement half-open state
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow ONE test request
}
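For context, the fragment above could sit inside a breaker along the following lines. This is a sketch under stated assumptions: `SimpleCircuitBreaker` and its method names are hypothetical, it is not thread-safe, and it takes the current time as a parameter rather than reading `DateTime.UtcNow` so the transitions can be tested.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Hypothetical three-state breaker; not thread-safe, for illustration only.
public sealed class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _failures;
    private DateTime _halfOpenAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    // True if the caller may try the backend now.
    public bool AllowRequest(DateTime nowUtc)
    {
        if (State == CircuitState.Open && nowUtc >= _halfOpenAt)
        {
            State = CircuitState.HalfOpen; // let a single probe through
            return true;
        }
        // While a probe is in flight (HalfOpen) further calls are refused.
        return State == CircuitState.Closed;
    }

    public void RecordSuccess()
    {
        _failures = 0;
        State = CircuitState.Closed;
    }

    public void RecordFailure(DateTime nowUtc)
    {
        _failures++;
        // A failed probe, or too many failures while closed, (re)opens the circuit.
        if (State == CircuitState.HalfOpen || _failures >= _failureThreshold)
        {
            State = CircuitState.Open;
            _halfOpenAt = nowUtc + _openDuration;
        }
    }
}
```

When `AllowRequest` returns false, the caller would skip the Valkey check and fall back to the fail-open path.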
Success Criteria Checklist
Copy this to master tracker and update as you progress:
Functional
- Router enforces per-instance limits (in-memory)
- Router enforces per-environment limits (Valkey-backed)
- Per-microservice configuration works
- Per-route configuration works
- Multiple rules per target work (rule stacking)
- 429 + Retry-After response format correct
- Circuit breaker handles Valkey failures
- Activation gate reduces Valkey load
Performance
- Instance check <1ms P99
- Environment check <10ms P99
- 100k req/sec throughput maintained
- Valkey load <1000 ops/sec per instance
Operational
- Metrics exported to OpenTelemetry
- Dashboards created (Grafana)
- Alerts configured (Alertmanager)
- Documentation complete
- Migration from service-level rate limiters complete
Quality
- Unit test coverage >90%
- Integration tests pass (all scenarios)
- Load tests pass (k6 scenarios A-F)
- Failure injection tests pass
Escalation & Support
Blocked on Technical Decision
Escalate to: Architecture Guild (#stella-architecture)
Response SLA: 24 hours
Blocked on Resource (Valkey, config, etc.)
Escalate to: Platform Engineering (#stella-platform)
Response SLA: 4 hours
Blocked on Clarification
Escalate to: Router Team Lead (#stella-router-dev)
Response SLA: 2 hours
Sprint Falling Behind Schedule
Escalate to: Project Manager (update master tracker with BLOCKED status)
Action: Add note in "Decisions & Risks" section
File Structure (After Implementation)
src/__Libraries/StellaOps.Router.Gateway/
├── RateLimit/
│   ├── RateLimitConfig.cs
│   ├── IRateLimiter.cs
│   ├── InstanceRateLimiter.cs
│   ├── EnvironmentRateLimiter.cs
│   ├── RateLimitService.cs
│   ├── RateLimitMetrics.cs
│   ├── RateLimitDecision.cs
│   ├── ValkeyRateLimitStore.cs
│   ├── CircuitBreaker.cs
│   ├── LimitInheritanceResolver.cs
│   ├── Models/
│   │   ├── InstanceLimitsConfig.cs
│   │   ├── EnvironmentLimitsConfig.cs
│   │   ├── MicroserviceLimitsConfig.cs
│   │   ├── RouteLimitsConfig.cs
│   │   ├── RateLimitRule.cs
│   │   └── EffectiveLimits.cs
│   ├── RouteMatching/
│   │   ├── IRouteMatcher.cs
│   │   ├── RouteMatcher.cs
│   │   ├── ExactRouteMatcher.cs
│   │   ├── PrefixRouteMatcher.cs
│   │   └── RegexRouteMatcher.cs
│   ├── Internal/
│   │   └── SlidingWindowCounter.cs
│   └── Scripts/
│       └── rate_limit_check.lua
├── Middleware/
│   └── RateLimitMiddleware.cs
├── ApplicationBuilderExtensions.cs (modified)
└── ServiceCollectionExtensions.cs (modified)
__Tests/
├── RateLimit/
│   ├── InstanceRateLimiterTests.cs
│   ├── EnvironmentRateLimiterTests.cs
│   ├── ValkeyRateLimitStoreTests.cs
│   ├── RateLimitMiddlewareTests.cs
│   ├── ConfigurationTests.cs
│   ├── RouteMatchingTests.cs
│   └── InheritanceResolverTests.cs
tests/load/k6/
└── rate-limit-scenarios.js
Next Steps After Package Review
- Acknowledge receipt of sprint package
- Set up development environment (Valkey, dependencies)
- Read Implementation Guide in full
- Start Sprint 1, Task 1.1 (Configuration Models)
- Update master tracker as tasks complete
- Commit frequently with clear messages
- Run tests after each task
- Ask questions early if blocked
Configuration Quick Reference
Minimal Config (Just Defaults)
rate_limiting:
  for_instance:
    per_seconds: 300
    max_requests: 30000
Full Config (All Features)
rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000
  for_instance:
    rules:
      - per_seconds: 300
        max_requests: 30000
      - per_seconds: 30
        max_requests: 5000
  for_environment:
    valkey_bucket: "stella-router-rate-limit"
    valkey_connection: "valkey.stellaops.local:6379"
    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_timeout: 10
    rules:
      - per_seconds: 300
        max_requests: 30000
    microservices:
      concelier:
        rules:
          - per_seconds: 1
            max_requests: 10
          - per_seconds: 3600
            max_requests: 3000
      scanner:
        rules:
          - per_seconds: 60
            max_requests: 600
        routes:
          scan_submit:
            pattern: "/api/scans"
            match_type: exact
            rules:
              - per_seconds: 10
                max_requests: 50
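The route entry in the full config pairs a pattern with a match_type. A minimal sketch of how such matching might work is shown below. Only `exact` appears in the config sample; the `Prefix` and `Regex` kinds are inferred from the planned file names, and the case-insensitive comparison is an assumption. `MatchType` and `RouteMatching.IsMatch` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public enum MatchType { Exact, Prefix, Regex }

public static class RouteMatching
{
    // Hypothetical matcher: decides whether a request path falls under a
    // configured route pattern for the given match type.
    public static bool IsMatch(string path, string pattern, MatchType matchType) => matchType switch
    {
        MatchType.Exact => string.Equals(path, pattern, StringComparison.OrdinalIgnoreCase),
        MatchType.Prefix => path.StartsWith(pattern, StringComparison.OrdinalIgnoreCase),
        MatchType.Regex => Regex.IsMatch(path, pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant),
        _ => throw new ArgumentOutOfRangeException(nameof(matchType)),
    };
}
```

With the sample route, `/api/scans` matches under `Exact`, while `/api/scans/42` would need a prefix or regex route.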
Related Documentation
Source Documents
- Advisory: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md
- Analysis Plan: C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md
- Architecture: docs/modules/platform/architecture-overview.md
Implementation Sprints
- Master Tracker: SPRINT_1200_001_000_router_rate_limiting_master.md
- Sprint 1: SPRINT_1200_001_001_router_rate_limiting_core.md
- Sprint 2: SPRINT_1200_001_002_router_rate_limiting_per_route.md
- Sprint 3: SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md
- Sprint 4-6: To be created by implementer (templates in master tracker)
Technical Guides
- Implementation Guide: SPRINT_1200_001_IMPLEMENTATION_GUIDE.md (comprehensive)
- HTTP 429 Semantics: RFC 6585
- Valkey Documentation: https://valkey.io/docs/
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-12-17 | Initial sprint package created |
Ready to implement? Start with the Implementation Guide, then proceed to Sprint 1!