Router Chaos Testing Runbook
Sprint: SPRINT_5100_0005_0001
Last Updated: 2025-12-22
Overview
This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.
Test Categories
1. Load Testing (k6)
Location: tests/load/router/
Spike Test Scenarios
| Scenario | Rate | Duration | Purpose |
|---|---|---|---|
| Baseline | 100 req/s | 1 min | Establish normal operation |
| 10x Spike | 1000 req/s | 30s | Moderate overload |
| 50x Spike | 5000 req/s | 30s | Severe overload |
| Recovery | 100 req/s | 2 min | Measure recovery time |
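The scenario table maps fairly directly onto a k6 arrival-rate scenario. The sketch below is an illustration only, not the contents of spike-test.js: the helper names (`SPIKE_SCENARIOS`, `buildStages`, `buildOptions`), the 1s ramp between phases, and the VU limits are all assumptions. It uses only the documented fields of k6's `ramping-arrival-rate` executor.

```javascript
// Scenario table from this runbook, expressed as data.
const SPIKE_SCENARIOS = [
  { name: 'baseline', rate: 100, duration: '1m' },
  { name: 'spike_10x', rate: 1000, duration: '30s' },
  { name: 'spike_50x', rate: 5000, duration: '30s' },
  { name: 'recovery', rate: 100, duration: '2m' },
];

// k6 ramps linearly toward each stage target, so every phase becomes
// a short ramp (assumed 1s) followed by a hold at the target rate.
function buildStages(scenarios, rampDuration = '1s') {
  const stages = [];
  for (const s of scenarios) {
    stages.push({ target: s.rate, duration: rampDuration }); // quick ramp
    stages.push({ target: s.rate, duration: s.duration });   // hold
  }
  return stages;
}

// Hypothetical options object for a k6 script.
function buildOptions(scenarios) {
  return {
    scenarios: {
      spike: {
        executor: 'ramping-arrival-rate',
        startRate: scenarios[0].rate, // begin at the baseline rate
        timeUnit: '1s',               // rates are requests per second
        preAllocatedVUs: 200,         // assumed; size for your environment
        maxVUs: 2000,                 // assumed upper bound
        stages: buildStages(scenarios),
      },
    },
  };
}
```

In a k6 script the result would be exported as `options`, for example `export const options = buildOptions(SPIKE_SCENARIOS);`. Keeping the table as data makes it easy to check that the script and this runbook stay in step.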
Running Load Tests
# Install k6
brew install k6 # macOS
# or
choco install k6 # Windows
# Run spike test against local router
k6 run tests/load/router/spike-test.js \
-e ROUTER_URL=http://localhost:8080
# Run against staging
k6 run tests/load/router/spike-test.js \
-e ROUTER_URL=https://router.staging.stellaops.io
# Output results to JSON
k6 run tests/load/router/spike-test.js \
--out json=results.json
2. Backpressure Verification
Location: tests/chaos/BackpressureVerificationTests.cs
Tests verify:
- HTTP 429 responses include a Retry-After header
- HTTP 503 responses include a Retry-After header
- Retry-After values are reasonable (1-60 seconds)
- No data loss during throttling
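The header checks in this list can be expressed as one small predicate. This is a minimal sketch rather than the logic in BackpressureVerificationTests.cs: the function name `checkRetryAfter` and the assumption that headers arrive as a plain object with lowercase names are hypothetical.

```javascript
// Validate the throttling contract for a single response.
// `headers` is assumed to be a plain object keyed by lowercase header name.
function checkRetryAfter(status, headers) {
  if (status !== 429 && status !== 503) {
    return { ok: true, reason: 'not a throttle response' };
  }
  const raw = headers['retry-after'];
  if (raw === undefined) {
    return { ok: false, reason: 'missing Retry-After header' };
  }
  // Retry-After may also carry an HTTP-date; this sketch accepts only
  // the delay-seconds form.
  const text = String(raw).trim();
  if (!/^\d+$/.test(text)) {
    return { ok: false, reason: 'Retry-After is not a whole number of seconds' };
  }
  const seconds = Number(text);
  if (seconds < 1 || seconds > 60) {
    return { ok: false, reason: `Retry-After ${seconds}s is outside 1-60s` };
  }
  return { ok: true, seconds };
}
```

Returning a reason string alongside the boolean makes a failing assertion in a test report self-explanatory.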
Expected Behavior
| Load Level | Expected Response | Retry-After |
|---|---|---|
| Normal | 200 OK | N/A |
| High (>80% capacity) | 429 Too Many Requests | 1-10s |
| Critical (>95% capacity) | 503 Service Unavailable | 10-60s |
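As a rough illustration of how this table could be applied in a test, the sketch below maps a utilization figure to the expected status and Retry-After range. How the router measures capacity is not specified here, so the `utilization` input (a fraction from 0 to 1) and both function names are assumptions.

```javascript
// Expected response for a given utilization (0..1), per the table above.
function expectedForLoad(utilization) {
  if (utilization > 0.95) return { status: 503, retryAfterRange: [10, 60] };
  if (utilization > 0.80) return { status: 429, retryAfterRange: [1, 10] };
  return { status: 200, retryAfterRange: null };
}

// True when an observed response matches the expectation for that load.
function matchesExpectation(utilization, status, retryAfterSeconds) {
  const expected = expectedForLoad(utilization);
  if (status !== expected.status) return false;
  if (expected.retryAfterRange === null) return true; // no header required
  const [min, max] = expected.retryAfterRange;
  return retryAfterSeconds >= min && retryAfterSeconds <= max;
}
```

The thresholds are strict inequalities, so exactly 80% utilization still counts as normal load.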
3. Recovery Testing
Location: tests/chaos/RecoveryTests.cs
Tests verify:
- Router recovers within 30 seconds after load drops
- No request queue corruption
- Metrics return to baseline
Recovery Thresholds
| Metric | Target | Critical |
|---|---|---|
| P95 Recovery Time | <15s | <30s |
| P99 Recovery Time | <25s | <45s |
| Data Loss | 0% | 0% |
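To make the thresholds concrete, here is one way recovery-time samples could be graded against them. It is a sketch under assumptions: the nearest-rank percentile method, the function names, and the grade labels (`meets-target`, `within-critical`, `exceeds-critical`) are illustrative and not taken from RecoveryTests.cs.

```javascript
// Nearest-rank percentile over recovery-time samples (seconds).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Limits copied from the Recovery Thresholds table (seconds).
const RECOVERY_LIMITS = {
  p95: { target: 15, critical: 30 },
  p99: { target: 25, critical: 45 },
};

function grade(value, limit) {
  if (value < limit.target) return 'meets-target';
  if (value < limit.critical) return 'within-critical';
  return 'exceeds-critical';
}

function gradeRecovery(samples) {
  const p95 = percentile(samples, 95);
  const p99 = percentile(samples, 99);
  return {
    p95,
    p99,
    p95Grade: grade(p95, RECOVERY_LIMITS.p95),
    p99Grade: grade(p99, RECOVERY_LIMITS.p99),
  };
}
```

With few samples the nearest-rank P95 and P99 both collapse to the maximum, so a meaningful grade needs a reasonably large number of recovery measurements.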
4. Valkey Failure Injection
Location: tests/chaos/ValkeyFailureTests.cs
Tests verify router behavior when Valkey (cache/session store) fails:
- Graceful degradation to stateless mode
- No crashes or hangs
- Proper error logging
- Recovery when Valkey returns
Failure Scenarios
| Scenario | Expected Behavior |
|---|---|
| Valkey unreachable | Fallback to direct processing |
| Valkey slow (>500ms) | Timeout and continue |
| Valkey returns | Resume normal caching |
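The degradation behavior in this table follows a common pattern: race the cache lookup against a timeout and fall back to direct processing on any failure. The sketch below illustrates that pattern only and says nothing about how the router implements it; `getWithFallback`, `cacheGet`, and `compute` are hypothetical names, and the 500 ms default mirrors the "slow" threshold above.

```javascript
// Race a cache lookup against a timeout; on miss, error, or timeout,
// compute the value directly. `cacheGet` and `compute` are hypothetical
// async callbacks supplied by the caller.
async function getWithFallback(cacheGet, compute, timeoutMs = 500) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('cache timeout')), timeoutMs);
  });
  try {
    const cached = await Promise.race([cacheGet(), timeout]);
    if (cached !== undefined && cached !== null) {
      return { value: cached, source: 'cache' };
    }
  } catch (err) {
    // Unreachable or slow cache: log and degrade instead of failing.
    console.warn(`cache unavailable, falling back: ${err.message}`);
  } finally {
    clearTimeout(timer); // avoid a stray timer keeping the process alive
  }
  return { value: await compute(), source: 'direct' };
}
```

Because every call retries the cache first, normal caching resumes on its own once the store is reachable again, which corresponds to the "Valkey returns" row.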
CI Integration
Workflow: .gitea/workflows/router-chaos.yml
The chaos tests run:
- On every PR to main that touches router code
- Nightly against staging environment
- Before production deployments
Workflow Stages
- Build - Compile router and test projects
- Unit Tests - Run BackpressureVerificationTests
- Integration Tests - Run RecoveryTests, ValkeyFailureTests
- Load Tests - Run k6 spike scenarios (staging only)
- Report - Upload results as artifacts
Interpreting Results
Success Criteria
| Metric | Pass | Fail |
|---|---|---|
| Request success rate during normal load | >=99% | <95% |
| Throttle response rate during spike | >0% (expected) | 0% (no backpressure) |
| Recovery time P95 | <30s | >=45s |
| Data loss | 0% | >0% |
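The criteria can be applied mechanically to a run summary. The sketch below assumes a hypothetical metrics object (`normalSuccessRate`, `spikeThrottleRate`, `recoveryP95Seconds`, `dataLossRate`, with rates as fractions) and labels values that fall between the Pass and Fail columns as `inconclusive`, which is an assumption since the table does not name that band.

```javascript
// Grade one run against the Success Criteria table.
function evaluateRun(m) {
  const band = (pass, fail) => (pass ? 'pass' : fail ? 'fail' : 'inconclusive');
  return {
    successRate: band(m.normalSuccessRate >= 0.99, m.normalSuccessRate < 0.95),
    throttling: m.spikeThrottleRate > 0 ? 'pass' : 'fail',
    recovery: band(m.recoveryP95Seconds < 30, m.recoveryP95Seconds >= 45),
    dataLoss: m.dataLossRate === 0 ? 'pass' : 'fail',
  };
}

// Any failing metric fails the run; otherwise all must pass.
function overallVerdict(results) {
  const values = Object.values(results);
  if (values.includes('fail')) return 'fail';
  return values.every((v) => v === 'pass') ? 'pass' : 'inconclusive';
}
```

For example, a run with a 99.5% success rate, 30% throttled requests during the spike, a 12 s P95 recovery, and no data loss grades as `pass` on every metric.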
Common Failure Patterns
No Throttling Under Load
Symptom: 0% throttled requests during 50x spike
Cause: Backpressure not configured or circuit breaker disabled
Fix: Verify that backpressure.enabled=true is set in the router configuration
Slow Recovery
Symptom: Recovery time >45s
Cause: Request queue not draining properly
Fix: Check maxQueueSize and drainTimeoutSeconds settings
Missing Retry-After Header
Symptom: 429/503 without Retry-After
Cause: Header middleware not applied
Fix: Ensure UseRetryAfterMiddleware() is registered in the request pipeline
Metrics & Dashboards
Key Metrics to Monitor
# Throttle rate
sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))
# Recovery time
histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m]))
# Queue depth
router_request_queue_depth
Alert Thresholds
| Alert | Condition | Severity |
|---|---|---|
| High Throttle Rate | throttle_rate > 10% for 5m | Warning |
| Extended Throttle | throttle_rate > 50% for 2m | Critical |
| Slow Recovery | p95_recovery > 30s | Warning |
| No Recovery | p99_recovery > 60s | Critical |
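The "for 5m" and "for 2m" conditions mean a threshold must hold continuously over the window, not just at one instant. In practice these rules would live in the alerting system; the sketch below only illustrates the logic, assuming one throttle-rate sample per minute (as a fraction, oldest first) and hypothetical function names.

```javascript
// True when the most recent `minutes` samples all exceed `threshold`.
function sustainedAbove(rates, threshold, minutes) {
  if (rates.length < minutes) return false; // not enough history yet
  return rates.slice(-minutes).every((r) => r > threshold);
}

// Throttle alerts; the more severe rule is checked first.
function throttleAlert(rates) {
  if (sustainedAbove(rates, 0.5, 2)) {
    return { alert: 'Extended Throttle', severity: 'Critical' };
  }
  if (sustainedAbove(rates, 0.1, 5)) {
    return { alert: 'High Throttle Rate', severity: 'Warning' };
  }
  return null;
}

// Recovery alerts based on measured percentiles (seconds).
function recoveryAlert(p95Seconds, p99Seconds) {
  if (p99Seconds > 60) return { alert: 'No Recovery', severity: 'Critical' };
  if (p95Seconds > 30) return { alert: 'Slow Recovery', severity: 'Warning' };
  return null;
}
```

A single sample dipping below the threshold resets the window, which keeps brief spikes from paging anyone.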
Troubleshooting
Test Environment Setup
# Start router locally
docker-compose up router valkey
# Verify router health
curl http://localhost:8080/health
# Verify Valkey connection
docker exec -it valkey redis-cli ping
Debug Mode
# Run tests with verbose logging
dotnet test tests/chaos/ --logger "console;verbosity=detailed"
# k6 with debug output
k6 run tests/load/router/spike-test.js --verbose