# Router Rate Limiting - Sprint Package README

**Package Created:** 2025-12-17
**For:** Implementation agents
**Advisory Source:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`

---

## Package Contents

This sprint package contains everything needed to implement centralized rate limiting in Stella Router.

### Core Sprint Files

| File | Purpose | Agent Role |
|------|---------|------------|
| `SPRINT_1200_001_000_router_rate_limiting_master.md` | Master tracker | **START HERE** - Overview & progress tracking |
| `SPRINT_1200_001_001_router_rate_limiting_core.md` | Sprint 1: Core implementation | Implementer - 5-7 days |
| `SPRINT_1200_001_002_router_rate_limiting_per_route.md` | Sprint 2: Per-route granularity | Implementer - 2-3 days |
| `SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md` | Sprint 3: Rule stacking | Implementer - 2-3 days |
| `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` | Technical reference | **READ FIRST** before coding |

### Documentation Files (To Be Created in Sprint 6)

| File | Purpose | Created In |
|------|---------|------------|
| `docs/router/rate-limiting.md` | User-facing configuration guide | Sprint 6 |
| `docs/operations/router-rate-limiting.md` | Operational runbook | Sprint 6 |
| `docs/modules/router/architecture.md` | Architecture documentation | Sprint 6 |

---

## Implementation Sequence

### Phase 1: Core Implementation (Sprints 1-3)

```
Sprint 1 (5-7 days)
├── Task 1.1: Configuration Models
├── Task 1.2: Instance Rate Limiter
├── Task 1.3: Valkey Backend
├── Task 1.4: Middleware Integration
├── Task 1.5: Metrics
└── Task 1.6: Wire into Pipeline

Sprint 2 (2-3 days)
├── Task 2.1: Extend Config for Routes
├── Task 2.2: Route Matching
├── Task 2.3: Inheritance Resolution
├── Task 2.4: Integrate into Service
└── Task 2.5: Documentation

Sprint 3 (2-3 days)
├── Task 3.1: Config for Rule Arrays
├── Task 3.2: Update Instance Limiter
├── Task 3.3: Enhance Valkey Lua Script
└── Task 3.4: Update Inheritance Resolver
```

### Phase 2: Migration & Testing (Sprints 4-5)

```
Sprint 4 (3-4 days) - Service Migration
├── Extract AdaptiveRateLimiter configs
├── Add to Router configuration
├── Refactor AdaptiveRateLimiter
└── Integration validation

Sprint 5 (3-5 days) - Comprehensive Testing
├── Unit test suite
├── Integration tests (Testcontainers)
├── Load tests (k6 scenarios A-F)
└── Configuration matrix tests
```

### Phase 3: Documentation (Sprint 6)

```
Sprint 6 (2 days)
├── Architecture docs
├── Configuration guide
├── Operational runbook
└── Migration guide
```

### Phase 4: Rollout (post-implementation; 3 weeks of staged enforcement, then service migration)

```
Week 1: Shadow Mode
└── Metrics only, no enforcement

Week 2: Soft Limits
└── Enforce limits set at 2x observed traffic peaks

Week 3: Production Limits
└── Full enforcement

Week 4+: Service Migration
└── Remove redundant limiters
```

---

## Quick Start for Agents

### 1. Context Gathering (30 minutes)

**Read in this order:**

1. `SPRINT_1200_001_000_router_rate_limiting_master.md` - Overview
2. `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` - Technical details
3. Original advisory: `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
4. Analysis plan (local path on the author's machine, outside the repository): `C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md`

### 2. Environment Setup

```bash
# Working directory
cd src/__Libraries/StellaOps.Router.Gateway/

# Verify dependencies
dotnet restore

# Start a local Valkey container for testing
docker run -d -p 6379:6379 valkey/valkey:latest

# Run existing tests to ensure baseline
dotnet test
```

### 3. Start Sprint 1

Open `SPRINT_1200_001_001_router_rate_limiting_core.md` and follow the task breakdown.

**Task execution pattern:**

```
For each task:
1. Read task description
2. Review implementation code samples
3. Create files as specified
4. Write unit tests
5. Mark task complete in master tracker
6. Commit with message: "feat(router): [Sprint 1.X] Task name"
```

---

## Key Design Decisions (Reference)

### 1. Status Codes
- ✅ **429 Too Many Requests** for rate limiting
- ❌ NOT 503 (that's for service health)
- ❌ NOT 202 (that's for async job acceptance)

### 2. Two-Scope Architecture
- **for_instance**: In-memory, protects a single router instance
- **for_environment**: Valkey-backed, protects the environment in aggregate (all router instances combined)

Both scopes are necessary; neither can replace the other.

### 3. Fail-Open Philosophy
- Circuit breaker on Valkey failures (requests are allowed rather than blocked)
- Activation gate optimization (reduces Valkey load)
- Instance limits are enforced even if Valkey is down
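
The following is a minimal sketch of how these three points could combine in one decision function. It is not the package's actual API: `FailOpenSketch`, its parameters, and the delegate-based limiters are hypothetical. The interpretation of the activation gate (skip the Valkey check until this router has seen more requests in the last 5 minutes than `process_back_pressure_when_more_than_per_5min`) is an assumption based on the config key name.

```csharp
using System;

public static class FailOpenSketch
{
    // instanceCheck:       in-memory limiter, always enforced.
    // recentLocalCount:    requests this router has seen in the last 5 minutes (assumed input).
    // activationThreshold: mirrors process_back_pressure_when_more_than_per_5min (assumed meaning).
    // environmentCheck:    Valkey-backed limiter; may throw when Valkey is unreachable.
    public static bool IsAllowed(
        Func<bool> instanceCheck,
        long recentLocalCount,
        long activationThreshold,
        Func<bool> environmentCheck)
    {
        // The local limit applies regardless of Valkey health.
        if (!instanceCheck()) return false;

        // Activation gate: below the threshold, skip the Valkey round trip.
        if (recentLocalCount <= activationThreshold) return true;

        try
        {
            return environmentCheck();
        }
        catch (Exception)
        {
            // Fail open: a Valkey problem must not block traffic.
            return true;
        }
    }
}
```

A production version would route the `environmentCheck` call through the circuit breaker instead of catching every exception inline; the try/catch here only illustrates the fail-open outcome.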

### 4. Configuration Inheritance
- Replacement semantics (not merge): rules defined at a more specific level replace the inherited rules entirely
- Most specific wins: route > microservice > environment > global
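
As a rough illustration of replacement semantics, the sketch below picks the first configured level and ignores everything less specific. The `Rule` record and `InheritanceSketch.Resolve` are hypothetical names, and treating a null or empty rule list as "not configured" is an assumption, not something the sprint files confirm.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical shape for illustration only.
public sealed record Rule(int PerSeconds, int MaxRequests);

public static class InheritanceSketch
{
    // Levels are passed most specific first: route, microservice, environment, global.
    // A null or empty level counts as "not configured" (assumption).
    public static IReadOnlyList<Rule> Resolve(params IReadOnlyList<Rule>?[] levels)
    {
        foreach (var level in levels)
        {
            if (level is { Count: > 0 })
            {
                // Replacement, not merge: less specific levels are ignored entirely.
                return level;
            }
        }
        return Array.Empty<Rule>();
    }
}
```

Under this reading, and using the values from the Full Config example in this README, a request to the scanner `scan_submit` route would be checked only against the 10-second / 50-request rule; the scanner-level 60-second / 600-request rule would not also apply.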

### 5. Rule Stacking
- Multiple rules per target = AND logic
- All rules must pass
- The most restrictive (longest) Retry-After is returned
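
A minimal sketch of the stacking logic, assuming each rule has already been evaluated into an allow/deny result with a suggested wait. `RuleResult`, `StackedDecision`, and `RuleStacking` are hypothetical names introduced for illustration. Rounding the wait up to whole seconds reflects that the `Retry-After` delay form is an integer number of seconds; the minimum of 1 second is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only.
public sealed record RuleResult(bool Allowed, TimeSpan RetryAfter);
public sealed record StackedDecision(bool Allowed, int RetryAfterSeconds);

public static class RuleStacking
{
    // AND logic: the request passes only if every rule allows it.
    // When several rules reject, the longest wait is reported so that a client
    // honoring Retry-After does not immediately hit another exhausted rule.
    public static StackedDecision Evaluate(IEnumerable<RuleResult> results)
    {
        var allowed = true;
        var longestWait = TimeSpan.Zero;

        foreach (var result in results)
        {
            if (result.Allowed) continue;
            allowed = false;
            if (result.RetryAfter > longestWait) longestWait = result.RetryAfter;
        }

        if (allowed) return new StackedDecision(true, 0);

        // Round up to whole seconds and never send 0 on a rejection (assumption).
        var seconds = Math.Max(1, (int)Math.Ceiling(longestWait.TotalSeconds));
        return new StackedDecision(false, seconds);
    }
}
```

For example, if a 1-second rule asks for a 0.4 s wait and a 3600-second rule asks for a 12 s wait, the combined decision is a rejection with `Retry-After: 12`.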

---

## Performance Targets

| Metric | Target | Measurement |
|--------|--------|-------------|
| Instance check latency | <1ms P99 | BenchmarkDotNet |
| Environment check latency | <10ms P99 | k6 load test |
| Router throughput | 100k req/sec | k6 constant-arrival-rate |
| Valkey load per instance | <1000 ops/sec | redis-cli INFO |

---

## Testing Requirements

### Unit Tests
- **Coverage:** >90% for all RateLimit/* files
- **Framework:** xUnit
- **Patterns:** Arrange-Act-Assert

### Integration Tests
- **Tool:** TestServer + Testcontainers (Valkey)
- **Scope:** End-to-end middleware pipeline
- **Scenarios:** All config combinations

### Load Tests
- **Tool:** k6
- **Scenarios:** A (instance), B (environment), C (activation gate), D (microservice), E (Valkey failure), F (max throughput)
- **Duration:** 30s per scenario minimum

---

## Common Implementation Gotchas

⚠️ **Middleware Pipeline Order**
```csharp
// CORRECT:
app.UsePayloadLimits();
app.UseRateLimiting();        // BEFORE routing
app.UseEndpointResolution();

// WRONG:
app.UseEndpointResolution();
app.UseRateLimiting();        // Too late, can't identify microservice
```

⚠️ **Lua Script Deployment**
```xml
<!-- REQUIRED in .csproj -->
<ItemGroup>
  <Content Include="RateLimit\Scripts\*.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>
```

⚠️ **Clock Skew**
```lua
-- CORRECT: Use Valkey server time
local now = tonumber(redis.call("TIME")[1])

-- WRONG: Use a client-supplied timestamp (clock skew between router instances)
local now = tonumber(ARGV[1])
```

⚠️ **Circuit Breaker Half-Open**
```csharp
// REQUIRED: Implement half-open state
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow ONE test request
}
```
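
The fragment above only shows the transition into half-open. A fuller sketch of the state machine might look like the following. The class name, constructor parameters, and the injectable clock are hypothetical; `failureThreshold` and `openDuration` loosely correspond to `failure_threshold` and `timeout_seconds` in the config, and `half_open_timeout` is not modeled here because its exact semantics are not stated in this README.

```csharp
#nullable enable
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Illustrative sketch only, not the package's CircuitBreaker.cs.
public sealed class CircuitBreakerSketch
{
    private readonly object _gate = new();
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTime> _utcNow; // injectable clock for tests
    private CircuitState _state = CircuitState.Closed;
    private int _failures;
    private DateTime _halfOpenAt;
    private bool _probeInFlight;

    public CircuitBreakerSketch(int failureThreshold, TimeSpan openDuration, Func<DateTime>? utcNow = null)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    public CircuitState State { get { lock (_gate) return _state; } }

    // Returns true if the caller may contact Valkey for this request.
    public bool TryAcquire()
    {
        lock (_gate)
        {
            if (_state == CircuitState.Open && _utcNow() >= _halfOpenAt)
            {
                _state = CircuitState.HalfOpen;
                _probeInFlight = false;
            }
            if (_state == CircuitState.Closed) return true;
            if (_state == CircuitState.HalfOpen && !_probeInFlight)
            {
                _probeInFlight = true; // exactly one test request gets through
                return true;
            }
            return false;
        }
    }

    public void RecordSuccess()
    {
        lock (_gate)
        {
            _failures = 0;
            _probeInFlight = false;
            _state = CircuitState.Closed;
        }
    }

    public void RecordFailure()
    {
        lock (_gate)
        {
            _failures++;
            // A failed probe reopens immediately; otherwise wait for the threshold.
            if (_state == CircuitState.HalfOpen || _failures >= _failureThreshold)
            {
                _state = CircuitState.Open;
                _halfOpenAt = _utcNow() + _openDuration;
                _probeInFlight = false;
            }
        }
    }
}
```

The lock keeps the "one test request" guarantee under concurrency: without it, several threads could observe the half-open state at once and all send probes to a Valkey instance that may still be recovering.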

---

## Success Criteria Checklist

Copy this to the master tracker and update it as you progress:

### Functional
- [ ] Router enforces per-instance limits (in-memory)
- [ ] Router enforces per-environment limits (Valkey-backed)
- [ ] Per-microservice configuration works
- [ ] Per-route configuration works
- [ ] Multiple rules per target work (rule stacking)
- [ ] 429 + Retry-After response format correct
- [ ] Circuit breaker handles Valkey failures
- [ ] Activation gate reduces Valkey load

### Performance
- [ ] Instance check <1ms P99
- [ ] Environment check <10ms P99
- [ ] 100k req/sec throughput maintained
- [ ] Valkey load <1000 ops/sec per instance

### Operational
- [ ] Metrics exported to OpenTelemetry
- [ ] Dashboards created (Grafana)
- [ ] Alerts configured (Alertmanager)
- [ ] Documentation complete
- [ ] Migration from service-level rate limiters complete

### Quality
- [ ] Unit test coverage >90%
- [ ] Integration tests pass (all scenarios)
- [ ] Load tests pass (k6 scenarios A-F)
- [ ] Failure injection tests pass

---

## Escalation & Support

### Blocked on Technical Decision
- **Escalate to:** Architecture Guild (#stella-architecture)
- **Response SLA:** 24 hours

### Blocked on Resource (Valkey, config, etc.)
- **Escalate to:** Platform Engineering (#stella-platform)
- **Response SLA:** 4 hours

### Blocked on Clarification
- **Escalate to:** Router Team Lead (#stella-router-dev)
- **Response SLA:** 2 hours

### Sprint Falling Behind Schedule
- **Escalate to:** Project Manager (update master tracker with BLOCKED status)
- **Action:** Add note in "Decisions & Risks" section

---

## File Structure (After Implementation)

```
src/__Libraries/StellaOps.Router.Gateway/
├── RateLimit/
│   ├── RateLimitConfig.cs
│   ├── IRateLimiter.cs
│   ├── InstanceRateLimiter.cs
│   ├── EnvironmentRateLimiter.cs
│   ├── RateLimitService.cs
│   ├── RateLimitMetrics.cs
│   ├── RateLimitDecision.cs
│   ├── ValkeyRateLimitStore.cs
│   ├── CircuitBreaker.cs
│   ├── LimitInheritanceResolver.cs
│   ├── Models/
│   │   ├── InstanceLimitsConfig.cs
│   │   ├── EnvironmentLimitsConfig.cs
│   │   ├── MicroserviceLimitsConfig.cs
│   │   ├── RouteLimitsConfig.cs
│   │   ├── RateLimitRule.cs
│   │   └── EffectiveLimits.cs
│   ├── RouteMatching/
│   │   ├── IRouteMatcher.cs
│   │   ├── RouteMatcher.cs
│   │   ├── ExactRouteMatcher.cs
│   │   ├── PrefixRouteMatcher.cs
│   │   └── RegexRouteMatcher.cs
│   ├── Internal/
│   │   └── SlidingWindowCounter.cs
│   └── Scripts/
│       └── rate_limit_check.lua
├── Middleware/
│   └── RateLimitMiddleware.cs
├── ApplicationBuilderExtensions.cs (modified)
└── ServiceCollectionExtensions.cs (modified)

__Tests/
└── RateLimit/
    ├── InstanceRateLimiterTests.cs
    ├── EnvironmentRateLimiterTests.cs
    ├── ValkeyRateLimitStoreTests.cs
    ├── RateLimitMiddlewareTests.cs
    ├── ConfigurationTests.cs
    ├── RouteMatchingTests.cs
    └── InheritanceResolverTests.cs

tests/load/k6/
└── rate-limit-scenarios.js
```

---

## Next Steps After Package Review

1. **Acknowledge receipt** of sprint package
2. **Set up development environment** (Valkey, dependencies)
3. **Read Implementation Guide** in full
4. **Start Sprint 1, Task 1.1** (Configuration Models)
5. **Update master tracker** as tasks complete
6. **Commit frequently** with clear messages
7. **Run tests after each task**
8. **Ask questions early** if blocked

---

## Configuration Quick Reference

### Minimal Config (Just Defaults)

```yaml
rate_limiting:
  for_instance:
    per_seconds: 300
    max_requests: 30000
```

### Full Config (All Features)

```yaml
rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000

  for_instance:
    rules:
      - per_seconds: 300
        max_requests: 30000
      - per_seconds: 30
        max_requests: 5000

  for_environment:
    valkey_bucket: "stella-router-rate-limit"
    valkey_connection: "valkey.stellaops.local:6379"

    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_timeout: 10

    rules:
      - per_seconds: 300
        max_requests: 30000

    microservices:
      concelier:
        rules:
          - per_seconds: 1
            max_requests: 10
          - per_seconds: 3600
            max_requests: 3000

      scanner:
        rules:
          - per_seconds: 60
            max_requests: 600

        routes:
          scan_submit:
            pattern: "/api/scans"
            match_type: exact
            rules:
              - per_seconds: 10
                max_requests: 50
```
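
The `pattern` and `match_type` fields on a route could be evaluated along these lines. This is a hedged sketch: only `match_type: exact` appears in the config above, so the `Prefix` and `Regex` values are inferred from the matcher file names in the file structure, and the case-insensitive comparison is an assumption. `MatchType` and `RouteMatchSketch` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical enum; only "exact" is confirmed by the config example.
public enum MatchType { Exact, Prefix, Regex }

public static class RouteMatchSketch
{
    // Returns true when requestPath satisfies the route's pattern.
    // Case-insensitive matching is an assumption, not a documented behavior.
    public static bool IsMatch(string pattern, MatchType matchType, string requestPath) =>
        matchType switch
        {
            MatchType.Exact => string.Equals(pattern, requestPath, StringComparison.OrdinalIgnoreCase),
            MatchType.Prefix => requestPath.StartsWith(pattern, StringComparison.OrdinalIgnoreCase),
            // Regex.IsMatch finds a match anywhere in the input, so patterns
            // should carry their own ^ and $ anchors when a full match is intended.
            MatchType.Regex => Regex.IsMatch(requestPath, pattern,
                RegexOptions.IgnoreCase | RegexOptions.CultureInvariant),
            _ => throw new ArgumentOutOfRangeException(nameof(matchType)),
        };
}
```

With the `scan_submit` route above, `IsMatch("/api/scans", MatchType.Exact, "/api/scans")` is true, while a request to `/api/scans/123` would not match and would fall back to the scanner-level rules.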

---

## Related Documentation

### Source Documents
- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
- **Analysis Plan (local path on the author's machine, outside the repository):** `C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md`
- **Architecture:** `docs/modules/platform/architecture-overview.md`

### Implementation Sprints
- **Master Tracker:** `SPRINT_1200_001_000_router_rate_limiting_master.md`
- **Sprint 1:** `SPRINT_1200_001_001_router_rate_limiting_core.md`
- **Sprint 2:** `SPRINT_1200_001_002_router_rate_limiting_per_route.md`
- **Sprint 3:** `SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md`
- **Sprints 4-6:** To be created by implementer (templates in master tracker)

### Technical Guides
- **Implementation Guide:** `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` (comprehensive)
- **HTTP 429 Semantics:** RFC 6585
- **Valkey Documentation:** https://valkey.io/docs/

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-12-17 | Initial sprint package created |

---

**Ready to implement?** Read the master tracker and the Implementation Guide, then proceed to Sprint 1.