feat: add security sink detection patterns for JavaScript/TypeScript
- Introduced `sink-detect.js` with various security sink detection patterns categorized by type (e.g., command injection, SQL injection, file operations).
- Implemented functions to build a lookup map for fast sink detection and to match sink calls against known patterns.
- Added `package-lock.json` for dependency management.
197
docs/operations/router-chaos-testing-runbook.md
Normal file
@@ -0,0 +1,197 @@
# Router Chaos Testing Runbook

**Sprint:** SPRINT_5100_0005_0001
**Last Updated:** 2025-12-22

## Overview

This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.

## Test Categories

### 1. Load Testing (k6)

**Location:** `tests/load/router/`

#### Spike Test Scenarios

| Scenario | Rate | Duration | Purpose |
|----------|------|----------|---------|
| Baseline | 100 req/s | 1 min | Establish normal operation |
| 10x Spike | 1000 req/s | 30s | Moderate overload |
| 50x Spike | 5000 req/s | 30s | Severe overload |
| Recovery | 100 req/s | 2 min | Measure recovery time |
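
To make the scale of the scenarios concrete, the sketch below encodes the table as plain data and derives the load multipliers and the approximate request volume. It is illustrative only: the names `scenarios`, `multiplier`, `totalRequests`, and `toStages` are hypothetical, and the real `spike-test.js` may be structured differently. The `{ target, duration }` shape is the stage format used by k6 arrival-rate executors.

```javascript
// Scenario values copied from the table above.
const scenarios = [
  { name: "Baseline", rate: 100, durationSeconds: 60 },
  { name: "10x Spike", rate: 1000, durationSeconds: 30 },
  { name: "50x Spike", rate: 5000, durationSeconds: 30 },
  { name: "Recovery", rate: 100, durationSeconds: 120 },
];

const baselineRate = scenarios[0].rate;

// Load multiplier of a scenario relative to the baseline rate.
function multiplier(scenario) {
  return scenario.rate / baselineRate;
}

// Requests issued if each rate is held constant for its full duration.
function totalRequests(list) {
  return list.reduce((sum, s) => sum + s.rate * s.durationSeconds, 0);
}

// Total wall-clock time of the sequence, in seconds.
function totalDurationSeconds(list) {
  return list.reduce((sum, s) => sum + s.durationSeconds, 0);
}

// Hypothetical mapping to k6-style stages ({ target, duration }).
function toStages(list) {
  return list.map((s) => ({ target: s.rate, duration: `${s.durationSeconds}s` }));
}

console.log(scenarios.map((s) => `${s.name}: ${multiplier(s)}x`).join(", "));
console.log(
  `total: ${totalRequests(scenarios)} requests over ${totalDurationSeconds(scenarios)}s`
);
```

Under the constant-rate assumption this comes to 198,000 requests over 240 seconds, 150,000 of them during the 50x spike. If the script ramps between stages rather than stepping, the actual count will differ.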

#### Running Load Tests

```bash
# Install k6
brew install k6  # macOS
# or
choco install k6  # Windows

# Run spike test against local router
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=http://localhost:8080

# Run against staging
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=https://router.staging.stellaops.io

# Output results to JSON
k6 run tests/load/router/spike-test.js \
  --out json=results.json
```

### 2. Backpressure Verification

**Location:** `tests/chaos/BackpressureVerificationTests.cs`

Tests verify:
- HTTP 429 responses include `Retry-After` header
- HTTP 503 responses include `Retry-After` header
- Retry-After values are reasonable (1-60 seconds)
- No data loss during throttling
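
The header checks above can be sketched as a small validator. This is not the code in `BackpressureVerificationTests.cs`, only an illustration of the rule. The function names and the assumption that `headers` is a plain object with lower-cased keys are hypothetical. RFC 9110 allows `Retry-After` to carry either a number of seconds or an HTTP-date, so the sketch accepts both.

```javascript
// Parse a Retry-After value into seconds, or return null if absent/unparseable.
// Accepts delay-seconds ("30") or an HTTP-date, as RFC 9110 permits.
function parseRetryAfterSeconds(value, now = Date.now()) {
  if (value === undefined || value === null) return null;
  const trimmed = String(value).trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed);
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.ceil((dateMs - now) / 1000);
}

// Check one response against the rules listed above.
// Assumption: `headers` is a plain object with lower-cased header names.
function checkThrottleResponse(status, headers, now = Date.now()) {
  if (status !== 429 && status !== 503) {
    return { ok: true, reason: "not a throttle response" };
  }
  const seconds = parseRetryAfterSeconds(headers["retry-after"], now);
  if (seconds === null) {
    return { ok: false, reason: "missing or unparseable Retry-After" };
  }
  if (seconds < 1 || seconds > 60) {
    return { ok: false, reason: `Retry-After ${seconds}s outside 1-60s` };
  }
  return { ok: true, reason: `Retry-After ${seconds}s` };
}
```

For example, `checkThrottleResponse(429, { "retry-after": "5" })` passes, while a 503 with no header or a 429 asking for 120 seconds fails.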

#### Expected Behavior

| Load Level | Expected Response | Retry-After |
|------------|-------------------|-------------|
| Normal | 200 OK | N/A |
| High (>80% capacity) | 429 Too Many Requests | 1-10s |
| Critical (>95% capacity) | 503 Service Unavailable | 10-60s |
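
Read as a decision rule, the table maps a utilization level to an expected status and `Retry-After` range. The sketch below assumes utilization is available as a fraction between 0 and 1; how the router actually measures "capacity" is not stated here, and the function names are hypothetical.

```javascript
// Expected response for a given utilization (0..1), per the table above.
// Assumption: the thresholds are strict (">80%", ">95%"), as written.
function expectedResponse(utilization) {
  if (utilization > 0.95) return { status: 503, retryAfterRange: [10, 60] };
  if (utilization > 0.8) return { status: 429, retryAfterRange: [1, 10] };
  return { status: 200, retryAfterRange: null };
}

// Compare an observed response with the expectation for that load level.
function matchesExpectation(utilization, status, retryAfterSeconds) {
  const expected = expectedResponse(utilization);
  if (status !== expected.status) return false;
  if (expected.retryAfterRange === null) return true;
  const [min, max] = expected.retryAfterRange;
  return retryAfterSeconds >= min && retryAfterSeconds <= max;
}
```

With these rules, `matchesExpectation(0.9, 429, 5)` holds, while `matchesExpectation(0.97, 429, 5)` does not, because critical load should produce a 503.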

### 3. Recovery Testing

**Location:** `tests/chaos/RecoveryTests.cs`

Tests verify:
- Router recovers within 30 seconds after load drops
- No request queue corruption
- Metrics return to baseline

#### Recovery Thresholds

| Metric | Target | Critical |
|--------|--------|----------|
| P95 Recovery Time | <15s | <30s |
| P99 Recovery Time | <25s | <45s |
| Data Loss | 0% | 0% |
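
As an illustration of how measured recovery times could be compared with these thresholds, the sketch below computes nearest-rank percentiles over a list of samples and classifies each one. The percentile method and all names are assumptions. `RecoveryTests.cs` may calculate them differently.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Thresholds copied from the table above, in seconds.
const thresholds = {
  p95: { target: 15, critical: 30 },
  p99: { target: 25, critical: 45 },
};

// Classify one percentile value against its target and critical limits.
function classify(value, { target, critical }) {
  if (value < target) return "target met";
  if (value < critical) return "within critical limit";
  return "critical limit exceeded";
}

// Summarize a set of recovery times (seconds) against both thresholds.
function evaluateRecovery(samplesSeconds) {
  const p95 = percentile(samplesSeconds, 95);
  const p99 = percentile(samplesSeconds, 99);
  return {
    p95,
    p99,
    p95Status: classify(p95, thresholds.p95),
    p99Status: classify(p99, thresholds.p99),
  };
}
```

A run whose P95 lands at 20s misses the 15s target but stays within the 30s critical limit, which is worth tracking even though it is not a failure.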

### 4. Valkey Failure Injection

**Location:** `tests/chaos/ValkeyFailureTests.cs`

Tests verify router behavior when Valkey (cache/session store) fails:
- Graceful degradation to stateless mode
- No crashes or hangs
- Proper error logging
- Recovery when Valkey returns

#### Failure Scenarios

| Scenario | Expected Behavior |
|----------|-------------------|
| Valkey unreachable | Fallback to direct processing |
| Valkey slow (>500ms) | Timeout and continue |
| Valkey returns | Resume normal caching |
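
The three scenarios describe a timeout-with-fallback pattern. The sketch below shows that pattern in general form. It is not the router's implementation, and `withCacheFallback`, `cacheCall`, and `fallback` are hypothetical names. The 500 ms default mirrors the threshold in the table.

```javascript
// Resolve with the cache result, or with `fallback()` if the cache call
// rejects or takes longer than `timeoutMs`.
async function withCacheFallback(cacheCall, fallback, timeoutMs = 500) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`cache timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  try {
    // Whichever settles first wins: the cache call or the timeout.
    const value = await Promise.race([cacheCall(), timeout]);
    return { source: "cache", value };
  } catch (err) {
    // Log the failure and continue without the cache.
    console.error(`cache unavailable, processing directly: ${err.message}`);
    return { source: "direct", value: await fallback() };
  } finally {
    clearTimeout(timer);
  }
}
```

Because the cache is consulted again on every call, normal caching resumes as soon as Valkey answers within the timeout. No separate recovery step is needed in this sketch.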

## CI Integration

**Workflow:** `.gitea/workflows/router-chaos.yml`

The chaos tests run:
- On every PR to `main` that touches router code
- Nightly against staging environment
- Before production deployments

### Workflow Stages

1. **Build** - Compile router and test projects
2. **Unit Tests** - Run BackpressureVerificationTests
3. **Integration Tests** - Run RecoveryTests, ValkeyFailureTests
4. **Load Tests** - Run k6 spike scenarios (staging only)
5. **Report** - Upload results as artifacts

## Interpreting Results

### Success Criteria

| Metric | Pass | Fail |
|--------|------|------|
| Request success rate during normal load | >=99% | <95% |
| Throttle response rate during spike | >0% (expected) | 0% (no backpressure) |
| Recovery time P95 | <30s | >=45s |
| Data loss | 0% | >0% |
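
The table leaves gaps between some pass and fail bounds, for example a 97% success rate or a 35s P95 recovery. A minimal sketch of an evaluator that reports such values as `inconclusive` instead of guessing follows. The input field names and the `inconclusive` label are assumptions, and rates are expressed as fractions between 0 and 1.

```javascript
// Evaluate one test run against the success criteria above.
// Values that satisfy neither the pass nor the fail bound are "inconclusive".
function evaluateRun({ normalSuccessRate, spikeThrottleRate, recoveryP95Seconds, dataLossRate }) {
  const verdict = (pass, fail) => (pass ? "pass" : fail ? "fail" : "inconclusive");
  return {
    successRate: verdict(normalSuccessRate >= 0.99, normalSuccessRate < 0.95),
    throttling: verdict(spikeThrottleRate > 0, spikeThrottleRate === 0),
    recovery: verdict(recoveryP95Seconds < 30, recoveryP95Seconds >= 45),
    dataLoss: verdict(dataLossRate === 0, dataLossRate > 0),
  };
}

console.log(
  evaluateRun({
    normalSuccessRate: 0.97,
    spikeThrottleRate: 0.2,
    recoveryP95Seconds: 35,
    dataLossRate: 0,
  })
);
```

Treating the in-between band as a separate outcome keeps borderline runs visible for manual review.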

### Common Failure Patterns

#### No Throttling Under Load
**Symptom:** 0% throttled requests during 50x spike
**Cause:** Backpressure not configured or circuit breaker disabled
**Fix:** Verify that the router configuration sets `backpressure.enabled=true`

#### Slow Recovery
**Symptom:** Recovery time >45s
**Cause:** Request queue not draining properly
**Fix:** Check `maxQueueSize` and `drainTimeoutSeconds` settings

#### Missing Retry-After Header
**Symptom:** 429/503 without Retry-After
**Cause:** Header middleware not applied
**Fix:** Ensure `UseRetryAfterMiddleware()` is in the pipeline

## Metrics & Dashboards

### Key Metrics to Monitor

```promql
# Throttle rate: share of all requests answered with 429.
# Both sides are summed so the division is not restricted to matching label sets.
sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))

# Recovery time
histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m]))

# Queue depth
router_request_queue_depth
```

### Alert Thresholds

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Throttle Rate | throttle_rate > 10% for 5m | Warning |
| Extended Throttle | throttle_rate > 50% for 2m | Critical |
| Slow Recovery | p95_recovery > 30s | Warning |
| No Recovery | p99_recovery > 60s | Critical |
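
The "for 5m" and "for 2m" conditions mean the threshold must be exceeded continuously over the whole window, not at a single moment. The sketch below illustrates that rule for the two throttle alerts. The sample format (`{ t, value }`, with `t` in seconds and `value` as a fraction) and the function names are assumptions, and a real alerting system evaluates this on its own schedule.

```javascript
// True when every sample in the trailing window exceeds `threshold` and the
// history reaches back at least `windowSeconds`.
// Assumption: `samples` are { t, value } objects sorted by ascending t (seconds).
function heldFor(samples, threshold, windowSeconds) {
  if (samples.length === 0) return false;
  const end = samples[samples.length - 1].t;
  const start = end - windowSeconds;
  if (samples[0].t > start) return false; // not enough history yet
  return samples.filter((s) => s.t >= start).every((s) => s.value > threshold);
}

// Evaluate the two throttle-rate alerts from the table, most severe first.
function throttleAlert(samples) {
  if (heldFor(samples, 0.5, 120)) return "Critical: Extended Throttle";
  if (heldFor(samples, 0.1, 300)) return "Warning: High Throttle Rate";
  return null;
}
```

A throttle rate that stays at 20% for five minutes raises the warning, while a single dip below 10% inside the window resets it.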

## Troubleshooting

### Test Environment Setup

```bash
# Start router locally
docker-compose up router valkey

# Verify router health
curl http://localhost:8080/health

# Verify Valkey connection
docker exec -it valkey redis-cli ping
```

### Debug Mode

```bash
# Run tests with verbose logging
dotnet test tests/chaos/ --logger "console;verbosity=detailed"

# k6 with debug output
k6 run tests/load/router/spike-test.js --verbose
```

## References

- [Router Architecture](../modules/router/architecture.md)
- [Backpressure Design](../product-advisories/15-Dec-2025%20-%20Designing%20202%20+%20Retry-After%20Backpressure%20Control.md)
- [Testing Strategy](../product-advisories/20-Dec-2025%20-%20Testing%20strategy.md)