Router Rate Limiting - Sprint Package README

Package Created: 2025-12-17
For: Implementation agents
Advisory Source: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + RetryAfter Backpressure Control.md


Package Contents

This sprint package contains everything needed to implement centralized rate limiting in Stella Router.

Core Sprint Files

File | Purpose | Agent Role
--- | --- | ---
SPRINT_1200_001_000_router_rate_limiting_master.md | Master tracker | START HERE - Overview & progress tracking
SPRINT_1200_001_001_router_rate_limiting_core.md | Sprint 1: Core implementation | Implementer - 5-7 days
SPRINT_1200_001_002_router_rate_limiting_per_route.md | Sprint 2: Per-route granularity | Implementer - 2-3 days
SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md | Sprint 3: Rule stacking | Implementer - 2-3 days
SPRINT_1200_001_IMPLEMENTATION_GUIDE.md | Technical reference | READ FIRST before coding

Documentation Files (To Be Created in Sprint 6)

File | Purpose | Created In
--- | --- | ---
docs/router/rate-limiting.md | User-facing configuration guide | Sprint 6
docs/operations/router-rate-limiting.md | Operational runbook | Sprint 6
docs/modules/router/architecture.md | Architecture documentation | Sprint 6

Implementation Sequence

Phase 1: Core Implementation (Sprints 1-3)

Sprint 1 (5-7 days)
├── Task 1.1: Configuration Models
├── Task 1.2: Instance Rate Limiter
├── Task 1.3: Valkey Backend
├── Task 1.4: Middleware Integration
├── Task 1.5: Metrics
└── Task 1.6: Wire into Pipeline

Sprint 2 (2-3 days)
├── Task 2.1: Extend Config for Routes
├── Task 2.2: Route Matching
├── Task 2.3: Inheritance Resolution
├── Task 2.4: Integrate into Service
└── Task 2.5: Documentation

Sprint 3 (2-3 days)
├── Task 3.1: Config for Rule Arrays
├── Task 3.2: Update Instance Limiter
├── Task 3.3: Enhance Valkey Lua Script
└── Task 3.4: Update Inheritance Resolver

Phase 2: Migration & Testing (Sprints 4-5)

Sprint 4 (3-4 days) - Service Migration
├── Extract AdaptiveRateLimiter configs
├── Add to Router configuration
├── Refactor AdaptiveRateLimiter
└── Integration validation

Sprint 5 (3-5 days) - Comprehensive Testing
├── Unit test suite
├── Integration tests (Testcontainers)
├── Load tests (k6 scenarios A-F)
└── Configuration matrix tests

Phase 3: Documentation & Rollout (Sprint 6)

Sprint 6 (2 days)
├── Architecture docs
├── Configuration guide
├── Operational runbook
└── Migration guide

Phase 4: Rollout (3 weeks plus follow-up service migration, post-implementation)

Week 1: Shadow Mode
└── Metrics only, no enforcement

Week 2: Soft Limits
└── Enforce limits set at 2x observed traffic peaks

Week 3: Production Limits
└── Full enforcement

Week 4+: Service Migration
└── Remove redundant limiters

Quick Start for Agents

1. Context Gathering (30 minutes)

Read in this order:

  1. SPRINT_1200_001_000_router_rate_limiting_master.md - Overview
  2. SPRINT_1200_001_IMPLEMENTATION_GUIDE.md - Technical details
  3. Original advisory: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + RetryAfter Backpressure Control.md
  4. Analysis plan: C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md

2. Environment Setup

# Working directory
cd src/__Libraries/StellaOps.Router.Gateway/

# Verify dependencies
dotnet restore

# Install Valkey for local testing
docker run -d -p 6379:6379 valkey/valkey:latest

# Run existing tests to ensure baseline
dotnet test

3. Start Sprint 1

Open SPRINT_1200_001_001_router_rate_limiting_core.md and follow the task breakdown.

Task execution pattern:

For each task:
1. Read task description
2. Review implementation code samples
3. Create files as specified
4. Write unit tests
5. Mark task complete in master tracker
6. Commit with message: "feat(router): [Sprint 1.X] Task name"

Key Design Decisions (Reference)

1. Status Codes

  • 429 Too Many Requests for rate limiting
  • NOT 503 (that's for service health)
  • NOT 202 (that's for async job acceptance)
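
A minimal sketch of how a rejection could be turned into a 429 response. The class and method names are hypothetical and do not come from the sprint files. The rounding policy (round up, at least 1 second) is also an assumption; it is chosen so that a client honoring the header does not retry early.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; the actual middleware in Sprint 1 may shape this differently.
public static class RateLimitRejection
{
    // 429 Too Many Requests, as defined by RFC 6585.
    public const int StatusCode = 429;

    // Formats a wait time as a Retry-After value in whole seconds.
    // Rounds up and never emits less than 1 second (assumed policy).
    public static string RetryAfterHeaderValue(TimeSpan wait)
    {
        var seconds = (long)Math.Ceiling(wait.TotalSeconds);
        return Math.Max(1L, seconds).ToString(CultureInfo.InvariantCulture);
    }
}
```

In a middleware this would typically set the response status to RateLimitRejection.StatusCode and add a Retry-After header with the formatted value before short-circuiting the pipeline.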

2. Two-Scope Architecture

  • for_instance: In-memory, protects single router
  • for_environment: Valkey-backed, protects aggregate

Both scopes are necessary; neither can replace the other.

3. Fail-Open Philosophy

  • Circuit breaker on Valkey failures
  • Activation gate optimization (reduces Valkey load)
  • Instance limits are still enforced when Valkey is down
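
The two scopes and the fail-open rule can be sketched together as follows. This is an illustration under assumptions, not the RateLimitService from the sprint: the delegate-based constructor, the record shape, and the decision to catch every exception from the environment check are all hypothetical. The circuit breaker and activation gate are left out to keep the sketch short.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical decision shape; the real RateLimitDecision may carry more fields.
public sealed record RateLimitDecision(bool Allowed, TimeSpan RetryAfter)
{
    public static readonly RateLimitDecision Allow = new(true, TimeSpan.Zero);
}

public sealed class TwoScopeRateLimitService
{
    private readonly Func<string, Task<RateLimitDecision>> _instanceCheck;
    private readonly Func<string, Task<RateLimitDecision>> _environmentCheck;

    public TwoScopeRateLimitService(
        Func<string, Task<RateLimitDecision>> instanceCheck,
        Func<string, Task<RateLimitDecision>> environmentCheck)
    {
        _instanceCheck = instanceCheck;
        _environmentCheck = environmentCheck;
    }

    public async Task<RateLimitDecision> CheckAsync(string key)
    {
        // The in-memory instance check runs first and is never skipped.
        var local = await _instanceCheck(key);
        if (!local.Allowed)
        {
            return local;
        }

        try
        {
            // The Valkey-backed environment check protects the aggregate.
            return await _environmentCheck(key);
        }
        catch (Exception)
        {
            // Fail open: a Valkey outage must not reject traffic.
            // The instance limit above has already been enforced.
            return RateLimitDecision.Allow;
        }
    }
}
```

Because the instance check returns before the try block, a request that exceeds the local limit is rejected even while the environment backend is unavailable.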

4. Configuration Inheritance

  • Replacement semantics (not merge)
  • Most specific wins: route > microservice > environment > global
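
Replacement semantics can be illustrated with a small resolver. The names below are hypothetical and may not match LimitInheritanceResolver. The sketch assumes that a level with no rules counts as "not defined" and falls through to the next, less specific level.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical rule shape mirroring per_seconds / max_requests in the config.
public sealed record RateLimitRule(int PerSeconds, int MaxRequests);

public static class LimitInheritance
{
    // Returns the rules of the most specific level that defines any.
    // Rules are never merged across levels: the winning level replaces the rest.
    public static IReadOnlyList<RateLimitRule> Resolve(
        IReadOnlyList<RateLimitRule>? routeRules,
        IReadOnlyList<RateLimitRule>? microserviceRules,
        IReadOnlyList<RateLimitRule>? environmentRules,
        IReadOnlyList<RateLimitRule>? globalRules)
    {
        // Ordered from most specific to least specific.
        var levels = new[] { routeRules, microserviceRules, environmentRules, globalRules };
        foreach (var level in levels)
        {
            if (level is { Count: > 0 })
            {
                return level;
            }
        }

        return Array.Empty<RateLimitRule>();
    }
}
```

With the full config shown later in this README, a request to the scan_submit route would resolve to the single 10-second rule only; the scanner-level 60-second rule would not be applied on top of it.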

5. Rule Stacking

  • Multiple rules per target = AND logic
  • All rules must pass
  • The most restrictive (longest) Retry-After is returned
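
A sketch of how per-rule results might be combined. The type and method names are hypothetical. "Most restrictive" is read here as the longest wait among the rules that failed, which is an interpretation rather than something the sprint files confirm.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result shape, used both per rule and for the combined decision.
public sealed record RateLimitDecision(bool Allowed, TimeSpan RetryAfter);

public static class RuleStacking
{
    // AND logic: the request is allowed only if every rule allows it.
    // When several rules reject, the longest Retry-After is reported.
    public static RateLimitDecision Combine(IEnumerable<RateLimitDecision> ruleResults)
    {
        var allowed = true;
        var retryAfter = TimeSpan.Zero;

        foreach (var result in ruleResults)
        {
            if (result.Allowed)
            {
                continue;
            }

            allowed = false;
            if (result.RetryAfter > retryAfter)
            {
                retryAfter = result.RetryAfter;
            }
        }

        return new RateLimitDecision(allowed, retryAfter);
    }
}
```

Reporting the longest wait means a client that honors the header should not be rejected again by a different rule immediately after retrying.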

Performance Targets

Metric | Target | Measurement
--- | --- | ---
Instance check latency | <1ms P99 | BenchmarkDotNet
Environment check latency | <10ms P99 | k6 load test
Router throughput | 100k req/sec | k6 constant-arrival-rate
Valkey load per instance | <1000 ops/sec | redis-cli INFO

Testing Requirements

Unit Tests

  • Coverage: >90% for all RateLimit/* files
  • Framework: xUnit
  • Patterns: Arrange-Act-Assert

Integration Tests

  • Tool: TestServer + Testcontainers (Valkey)
  • Scope: End-to-end middleware pipeline
  • Scenarios: All config combinations

Load Tests

  • Tool: k6
  • Scenarios: A (instance), B (environment), C (activation gate), D (microservice), E (Valkey failure), F (max throughput)
  • Duration: 30s per scenario minimum

Common Implementation Gotchas

⚠️ Middleware Pipeline Order

// CORRECT:
app.UsePayloadLimits();
app.UseRateLimiting();        // BEFORE routing
app.UseEndpointResolution();

// WRONG:
app.UseEndpointResolution();
app.UseRateLimiting();        // Too late, can't identify microservice

⚠️ Lua Script Deployment

<!-- REQUIRED in .csproj -->
<ItemGroup>
  <Content Include="RateLimit\Scripts\*.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>

⚠️ Clock Skew

-- CORRECT: Use Valkey server time
local now = tonumber(redis.call("TIME")[1])

-- WRONG: Use a client-supplied timestamp (clock skew between router instances)
local now = tonumber(ARGV[1])

⚠️ Circuit Breaker Half-Open

// REQUIRED: Implement half-open state
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow ONE test request
}
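
For context, the fragment above could sit inside a breaker along these lines. This is a sketch under assumptions: the member names, the injected clock, and the choice to admit exactly one probe while half-open are illustrative. The mapping of failure_threshold and timeout_seconds from the config onto the constructor arguments is also assumed, not confirmed.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Hypothetical breaker; the CircuitBreaker.cs planned for the sprint may differ.
public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTime> _utcNow;   // injected clock, eases testing
    private readonly object _gate = new();

    private CircuitState _state = CircuitState.Closed;
    private int _failures;
    private DateTime _halfOpenAt;
    private bool _probeInFlight;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration, Func<DateTime> utcNow)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _utcNow = utcNow;
    }

    public CircuitState State { get { lock (_gate) { return _state; } } }

    // Returns true when the caller may attempt a Valkey call.
    public bool TryAcquire()
    {
        lock (_gate)
        {
            if (_state == CircuitState.Open && _utcNow() >= _halfOpenAt)
            {
                _state = CircuitState.HalfOpen;
                _probeInFlight = false;
            }

            switch (_state)
            {
                case CircuitState.Closed:
                    return true;
                case CircuitState.HalfOpen:
                    if (_probeInFlight) return false; // only one test request
                    _probeInFlight = true;
                    return true;
                default:
                    return false;
            }
        }
    }

    public void RecordSuccess()
    {
        lock (_gate)
        {
            _failures = 0;
            _probeInFlight = false;
            _state = CircuitState.Closed;
        }
    }

    public void RecordFailure()
    {
        lock (_gate)
        {
            _failures++;
            // A failed probe reopens immediately; otherwise wait for the threshold.
            if (_state == CircuitState.HalfOpen || _failures >= _failureThreshold)
            {
                _state = CircuitState.Open;
                _halfOpenAt = _utcNow() + _openDuration;
                _probeInFlight = false;
            }
        }
    }
}
```

A caller would construct it as new CircuitBreaker(5, TimeSpan.FromSeconds(30), () => DateTime.UtcNow), call TryAcquire() before each Valkey check, and treat a false result as "skip the environment check and fail open".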

Success Criteria Checklist

Copy this to the master tracker and update it as you progress:

Functional

  • Router enforces per-instance limits (in-memory)
  • Router enforces per-environment limits (Valkey-backed)
  • Per-microservice configuration works
  • Per-route configuration works
  • Multiple rules per target work (rule stacking)
  • 429 + Retry-After response format correct
  • Circuit breaker handles Valkey failures
  • Activation gate reduces Valkey load

Performance

  • Instance check <1ms P99
  • Environment check <10ms P99
  • 100k req/sec throughput maintained
  • Valkey load <1000 ops/sec per instance

Operational

  • Metrics exported to OpenTelemetry
  • Dashboards created (Grafana)
  • Alerts configured (Alertmanager)
  • Documentation complete
  • Migration from service-level rate limiters complete

Quality

  • Unit test coverage >90%
  • Integration tests pass (all scenarios)
  • Load tests pass (k6 scenarios A-F)
  • Failure injection tests pass

Escalation & Support

Blocked on Technical Decision

Escalate to: Architecture Guild (#stella-architecture)
Response SLA: 24 hours

Blocked on Resource (Valkey, config, etc.)

Escalate to: Platform Engineering (#stella-platform)
Response SLA: 4 hours

Blocked on Clarification

Escalate to: Router Team Lead (#stella-router-dev)
Response SLA: 2 hours

Sprint Falling Behind Schedule

Escalate to: Project Manager (update master tracker with BLOCKED status)
Action: Add note in "Decisions & Risks" section


File Structure (After Implementation)

src/__Libraries/StellaOps.Router.Gateway/
├── RateLimit/
│   ├── RateLimitConfig.cs
│   ├── IRateLimiter.cs
│   ├── InstanceRateLimiter.cs
│   ├── EnvironmentRateLimiter.cs
│   ├── RateLimitService.cs
│   ├── RateLimitMetrics.cs
│   ├── RateLimitDecision.cs
│   ├── ValkeyRateLimitStore.cs
│   ├── CircuitBreaker.cs
│   ├── LimitInheritanceResolver.cs
│   ├── Models/
│   │   ├── InstanceLimitsConfig.cs
│   │   ├── EnvironmentLimitsConfig.cs
│   │   ├── MicroserviceLimitsConfig.cs
│   │   ├── RouteLimitsConfig.cs
│   │   ├── RateLimitRule.cs
│   │   └── EffectiveLimits.cs
│   ├── RouteMatching/
│   │   ├── IRouteMatcher.cs
│   │   ├── RouteMatcher.cs
│   │   ├── ExactRouteMatcher.cs
│   │   ├── PrefixRouteMatcher.cs
│   │   └── RegexRouteMatcher.cs
│   ├── Internal/
│   │   └── SlidingWindowCounter.cs
│   └── Scripts/
│       └── rate_limit_check.lua
├── Middleware/
│   └── RateLimitMiddleware.cs
├── ApplicationBuilderExtensions.cs (modified)
└── ServiceCollectionExtensions.cs (modified)

__Tests/
├── RateLimit/
│   ├── InstanceRateLimiterTests.cs
│   ├── EnvironmentRateLimiterTests.cs
│   ├── ValkeyRateLimitStoreTests.cs
│   ├── RateLimitMiddlewareTests.cs
│   ├── ConfigurationTests.cs
│   ├── RouteMatchingTests.cs
│   └── InheritanceResolverTests.cs

tests/load/k6/
└── rate-limit-scenarios.js

Next Steps After Package Review

  1. Acknowledge receipt of sprint package
  2. Set up development environment (Valkey, dependencies)
  3. Read Implementation Guide in full
  4. Start Sprint 1, Task 1.1 (Configuration Models)
  5. Update master tracker as tasks complete
  6. Commit frequently with clear messages
  7. Run tests after each task
  8. Ask questions early if blocked

Configuration Quick Reference

Minimal Config (Just Defaults)

rate_limiting:
  for_instance:
    per_seconds: 300
    max_requests: 30000

Full Config (All Features)

rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000

  for_instance:
    rules:
      - per_seconds: 300
        max_requests: 30000
      - per_seconds: 30
        max_requests: 5000

  for_environment:
    valkey_bucket: "stella-router-rate-limit"
    valkey_connection: "valkey.stellaops.local:6379"

    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_timeout: 10

    rules:
      - per_seconds: 300
        max_requests: 30000

    microservices:
      concelier:
        rules:
          - per_seconds: 1
            max_requests: 10
          - per_seconds: 3600
            max_requests: 3000

      scanner:
        rules:
          - per_seconds: 60
            max_requests: 600

        routes:
          scan_submit:
            pattern: "/api/scans"
            match_type: exact
            rules:
              - per_seconds: 10
                max_requests: 50

Source Documents

  • Advisory: docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + RetryAfter Backpressure Control.md
  • Analysis Plan: C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md
  • Architecture: docs/modules/platform/architecture-overview.md

Implementation Sprints

  • Master Tracker: SPRINT_1200_001_000_router_rate_limiting_master.md
  • Sprint 1: SPRINT_1200_001_001_router_rate_limiting_core.md
  • Sprint 2: SPRINT_1200_001_002_router_rate_limiting_per_route.md
  • Sprint 3: SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md
  • Sprints 4-6: To be created by the implementer (templates in master tracker)

Technical Guides

  • Implementation Guide: SPRINT_1200_001_IMPLEMENTATION_GUIDE.md (comprehensive)
  • HTTP 429 Semantics: RFC 6585
  • Valkey Documentation: https://valkey.io/docs/

Version History

Version | Date | Changes
--- | --- | ---
1.0 | 2025-12-17 | Initial sprint package created

Ready to implement? Start with the Implementation Guide, then proceed to Sprint 1!