# Router Chaos Testing Runbook

**Sprint:** SPRINT_5100_0005_0001

**Last Updated:** 2025-12-22

## Overview

This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.

## Test Categories

### 1. Load Testing (k6)

**Location:** `tests/load/router/`

#### Spike Test Scenarios

| Scenario | Rate | Duration | Purpose |
|----------|------|----------|---------|
| Baseline | 100 req/s | 1 min | Establish normal operation |
| 10x Spike | 1000 req/s | 30s | Moderate overload |
| 50x Spike | 5000 req/s | 30s | Severe overload |
| Recovery | 100 req/s | 2 min | Measure recovery time |

#### Running Load Tests

```bash
# Install k6
brew install k6  # macOS
# or
choco install k6  # Windows

# Run spike test against local router
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=http://localhost:8080

# Run against staging
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=https://router.staging.stellaops.io

# Output results to JSON
k6 run tests/load/router/spike-test.js \
  --out json=results.json
```

### 2. Backpressure Verification

**Location:** `tests/chaos/BackpressureVerificationTests.cs`

Tests verify:
- HTTP 429 responses include a `Retry-After` header
- HTTP 503 responses include a `Retry-After` header
- `Retry-After` values are reasonable (1-60 seconds)
- No data loss during throttling

#### Expected Behavior

| Load Level | Expected Response | Retry-After |
|------------|-------------------|-------------|
| Normal | 200 OK | N/A |
| High (>80% capacity) | 429 Too Many Requests | 1-10s |
| Critical (>95% capacity) | 503 Service Unavailable | 10-60s |

### 3. Recovery Testing

**Location:** `tests/chaos/RecoveryTests.cs`

Tests verify:
- Router recovers within 30 seconds after load drops
- No request queue corruption
- Metrics return to baseline

#### Recovery Thresholds

| Metric | Target | Critical |
|--------|--------|----------|
| P95 Recovery Time | <15s | <30s |
| P99 Recovery Time | <25s | <45s |
| Data Loss | 0% | 0% |

### 4. Valkey Failure Injection

**Location:** `tests/chaos/ValkeyFailureTests.cs`

Tests verify router behavior when Valkey (cache/session store) fails:
- Graceful degradation to stateless mode
- No crashes or hangs
- Proper error logging
- Recovery when Valkey returns

#### Failure Scenarios

| Scenario | Expected Behavior |
|----------|-------------------|
| Valkey unreachable | Fallback to direct processing |
| Valkey slow (>500ms) | Timeout and continue |
| Valkey returns | Resume normal caching |

## CI Integration

**Workflow:** `.gitea/workflows/router-chaos.yml`

The chaos tests run:
- On every PR to `main` that touches router code
- Nightly against the staging environment
- Before production deployments

### Workflow Stages

1. **Build** - Compile router and test projects
2. **Unit Tests** - Run BackpressureVerificationTests
3. **Integration Tests** - Run RecoveryTests, ValkeyFailureTests
4. **Load Tests** - Run k6 spike scenarios (staging only)
5. **Report** - Upload results as artifacts

## Interpreting Results

### Success Criteria

| Metric | Pass | Fail |
|--------|------|------|
| Request success rate during normal load | >=99% | <95% |
| Throttle response rate during spike | >0% (expected) | 0% (no backpressure) |
| Recovery time P95 | <30s | >=45s |
| Data loss | 0% | >0% |

### Common Failure Patterns

#### No Throttling Under Load

**Symptom:** 0% throttled requests during 50x spike

**Cause:** Backpressure not configured or circuit breaker disabled

**Fix:** Check that the router configuration sets `backpressure.enabled=true`

#### Slow Recovery

**Symptom:** Recovery time >45s

**Cause:** Request queue not draining properly

**Fix:** Check the `maxQueueSize` and `drainTimeoutSeconds` settings

#### Missing Retry-After Header

**Symptom:** 429/503 without `Retry-After`

**Cause:** Header middleware not applied

**Fix:** Ensure `UseRetryAfterMiddleware()` is in the pipeline

## Metrics & Dashboards

### Key Metrics to Monitor

```promql
# Throttle rate (both sides aggregated so the division is not matched per label set)
sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))

# Recovery time
histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m]))

# Queue depth
router_request_queue_depth
```

### Alert Thresholds

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Throttle Rate | throttle_rate > 10% for 5m | Warning |
| Extended Throttle | throttle_rate > 50% for 2m | Critical |
| Slow Recovery | p95_recovery > 30s | Warning |
| No Recovery | p99_recovery > 60s | Critical |

## Troubleshooting

### Test Environment Setup

```bash
# Start router locally
docker-compose up router valkey

# Verify router health
curl http://localhost:8080/health

# Verify Valkey connection
docker exec -it valkey redis-cli ping
```

### Debug Mode

```bash
# Run tests with verbose logging
dotnet test tests/chaos/ --logger "console;verbosity=detailed"

# k6 with debug output
k6 run tests/load/router/spike-test.js --verbose
```

## References

- [Router Architecture](../modules/router/architecture.md)
- [Backpressure Design](../product-advisories/15-Dec-2025%20-%20Designing%20202%20+%20Retry-After%20Backpressure%20Control.md)
- [Testing Strategy](../product-advisories/20-Dec-2025%20-%20Testing%20strategy.md)
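
## Example: Grading a Run Against the Success Criteria

The Success Criteria table under Interpreting Results can be read as a small decision procedure. The sketch below is one possible way to apply it to a run summary; it is not part of the router test suite. The `RunMetrics` type, its field names, and the `evaluate` function are hypothetical, and the "inconclusive" verdict is an assumption for values that fall between the Pass and Fail columns (for example a success rate of 97%), which the table leaves undefined.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Summary of one chaos run. Field names are hypothetical, not a router schema."""
    normal_success_rate: float   # fraction (0..1) of successful requests at baseline load
    spike_throttle_rate: float   # fraction (0..1) of throttled responses during the spike
    recovery_p95_seconds: float  # P95 recovery time after load drops
    data_loss_rate: float        # fraction (0..1) of requests lost


def grade(passed: bool, failed: bool) -> str:
    """Return "pass" or "fail"; "inconclusive" (an assumption) covers the gap between them."""
    if passed:
        return "pass"
    return "fail" if failed else "inconclusive"


def evaluate(m: RunMetrics) -> dict[str, str]:
    """Apply the Pass/Fail thresholds from the Success Criteria table to one run."""
    return {
        "normal_success_rate": grade(m.normal_success_rate >= 0.99,
                                     m.normal_success_rate < 0.95),
        "spike_throttle_rate": grade(m.spike_throttle_rate > 0,
                                     m.spike_throttle_rate == 0),
        "recovery_p95": grade(m.recovery_p95_seconds < 30,
                              m.recovery_p95_seconds >= 45),
        "data_loss": grade(m.data_loss_rate == 0, m.data_loss_rate > 0),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    run = RunMetrics(normal_success_rate=0.97, spike_throttle_rate=0.42,
                     recovery_p95_seconds=28.0, data_loss_rate=0.0)
    for criterion, verdict in evaluate(run).items():
        print(f"{criterion}: {verdict}")
```

Keeping the Pass and Fail conditions as separate predicates, rather than a single threshold, mirrors the table and makes the undefined middle band explicit instead of silently rounding it to either outcome.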