feat: add security sink detection patterns for JavaScript/TypeScript

- Introduced `sink-detect.js` with various security sink detection patterns categorized by type (e.g., command injection, SQL injection, file operations).
- Implemented functions to build a lookup map for fast sink detection and to match sink calls against known patterns.
- Added `package-lock.json` for dependency management.
2025-12-22 23:21:21 +02:00
parent 3ba7157b00
commit 5146204f1b
529 changed files with 73579 additions and 5985 deletions
--- a/docs/operations/router-chaos-testing-runbook.md
+++ b/docs/operations/router-chaos-testing-runbook.md
@@ -0,0 +1,197 @@
+# Router Chaos Testing Runbook
+
+**Sprint:** SPRINT_5100_0005_0001
+**Last Updated:** 2025-12-22
+
+## Overview
+
+This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.
+
+## Test Categories
+
+### 1. Load Testing (k6)
+
+**Location:** `tests/load/router/`
+
+#### Spike Test Scenarios
+
+| Scenario | Rate | Duration | Purpose |
+|----------|------|----------|---------|
+| Baseline | 100 req/s | 1 min | Establish normal operation |
+| 10x Spike | 1000 req/s | 30s | Moderate overload |
+| 50x Spike | 5000 req/s | 30s | Severe overload |
+| Recovery | 100 req/s | 2 min | Measure recovery time |
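+
+As an illustration of how these phases might be encoded, the sketch below builds a scenario definition for k6's `ramping-arrival-rate` executor. The contents of `spike-test.js` are not shown in this runbook, so the names `SPIKE_PHASES`, `buildSpikeScenario`, and `totalSeconds`, as well as the VU limits, are assumptions made for the example.
+
+```javascript
+// Hypothetical helper: the real spike-test.js is not shown in this runbook.
+// Rates and durations mirror the Spike Test Scenarios table.
+const SPIKE_PHASES = [
+  { name: 'baseline', rate: 100, seconds: 60 },
+  { name: 'spike_10x', rate: 1000, seconds: 30 },
+  { name: 'spike_50x', rate: 5000, seconds: 30 },
+  { name: 'recovery', rate: 100, seconds: 120 },
+];
+
+// Build a scenario object for k6's `ramping-arrival-rate` executor.
+// preAllocatedVUs and maxVUs are placeholder values, not tuned numbers.
+function buildSpikeScenario(phases) {
+  return {
+    executor: 'ramping-arrival-rate',
+    startRate: phases[0].rate,
+    timeUnit: '1s',
+    preAllocatedVUs: 200,
+    maxVUs: 2000,
+    stages: phases.map((p) => ({ target: p.rate, duration: `${p.seconds}s` })),
+  };
+}
+
+// Total planned run time in seconds, useful when sizing CI timeouts.
+function totalSeconds(phases) {
+  return phases.reduce((sum, p) => sum + p.seconds, 0);
+}
+```
+
+In a k6 script the result could be wired in with `export const options = { scenarios: { spike: buildSpikeScenario(SPIKE_PHASES) } };`. Note that this executor ramps linearly toward each target over the stage duration, so the spikes here are steep ramps rather than instant steps.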
+
+#### Running Load Tests
+
+```bash
+# Install k6
+brew install k6  # macOS
+# or
+choco install k6  # Windows
+
+# Run spike test against local router
+k6 run tests/load/router/spike-test.js \
+  -e ROUTER_URL=http://localhost:8080
+
+# Run against staging
+k6 run tests/load/router/spike-test.js \
+  -e ROUTER_URL=https://router.staging.stellaops.io
+
+# Output results to JSON
+k6 run tests/load/router/spike-test.js \
+  --out json=results.json
+```
+
+### 2. Backpressure Verification
+
+**Location:** `tests/chaos/BackpressureVerificationTests.cs`
+
+Tests verify:
+- HTTP 429 responses include `Retry-After` header
+- HTTP 503 responses include `Retry-After` header
+- `Retry-After` values fall within 1-60 seconds
+- No data loss during throttling
+
+#### Expected Behavior
+
+| Load Level | Expected Response | Retry-After |
+|------------|-------------------|-------------|
+| Normal | 200 OK | N/A |
+| High (>80% capacity) | 429 Too Many Requests | 1-10s |
+| Critical (>95% capacity) | 503 Service Unavailable | 10-60s |
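+
+The table above can be turned into a small response check, for example inside a load-test script. This is a sketch only: `checkBackpressureResponse` and `RETRY_AFTER_RANGES` are hypothetical names, the actual assertions live in `BackpressureVerificationTests.cs`, and only the delta-seconds form of `Retry-After` is handled.
+
+```javascript
+// Allowed Retry-After ranges (seconds) per throttling status, from the table above.
+const RETRY_AFTER_RANGES = {
+  429: { min: 1, max: 10 },
+  503: { min: 10, max: 60 },
+};
+
+// Returns { ok, reason } for a single response.
+// `headers` is a plain object; the header name is matched case-insensitively.
+// HTTP-date values of Retry-After are rejected by this sketch.
+function checkBackpressureResponse(status, headers) {
+  const range = RETRY_AFTER_RANGES[status];
+  if (!range) {
+    return { ok: true, reason: 'not a throttling response' };
+  }
+  const key = Object.keys(headers).find((k) => k.toLowerCase() === 'retry-after');
+  if (key === undefined) {
+    return { ok: false, reason: 'missing Retry-After' };
+  }
+  const raw = String(headers[key]).trim();
+  if (!/^\d+$/.test(raw)) {
+    return { ok: false, reason: 'Retry-After is not delta-seconds' };
+  }
+  const seconds = Number(raw);
+  if (seconds < range.min || seconds > range.max) {
+    return { ok: false, reason: `Retry-After ${seconds}s outside ${range.min}-${range.max}s` };
+  }
+  return { ok: true, reason: 'within range' };
+}
+```
+
+For instance, `checkBackpressureResponse(503, { 'Retry-After': '5' })` reports a failure because 5 seconds is below the 10-60s range expected at critical load.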
+
+### 3. Recovery Testing
+
+**Location:** `tests/chaos/RecoveryTests.cs`
+
+Tests verify:
+- Router recovers within 30 seconds after load drops
+- No request queue corruption
+- Metrics return to baseline
+
+#### Recovery Thresholds
+
+| Metric | Target | Critical |
+|--------|--------|----------|
+| P95 Recovery Time | <15s | <30s |
+| P99 Recovery Time | <25s | <45s |
+| Data Loss | 0% | 0% |
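+
+One way to apply these thresholds to a set of measured recovery times is sketched below. It assumes the samples are recovery durations in seconds and uses the nearest-rank percentile definition; the function names and the result labels are illustrative and do not come from `RecoveryTests.cs`.
+
+```javascript
+// Thresholds in seconds, from the Recovery Thresholds table.
+const RECOVERY_THRESHOLDS = {
+  p95: { target: 15, critical: 30 },
+  p99: { target: 25, critical: 45 },
+};
+
+// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
+function percentile(samples, p) {
+  if (samples.length === 0) throw new Error('no samples');
+  const sorted = [...samples].sort((a, b) => a - b);
+  const rank = Math.ceil((p * sorted.length) / 100);
+  return sorted[Math.max(rank, 1) - 1];
+}
+
+// Compare one value against its target and critical limits.
+function classify(value, { target, critical }) {
+  if (value < target) return 'within target';
+  if (value < critical) return 'within critical limit';
+  return 'failed';
+}
+
+// Summarize P95 and P99 recovery for a list of recovery times in seconds.
+function evaluateRecovery(samplesSeconds) {
+  const p95 = percentile(samplesSeconds, 95);
+  const p99 = percentile(samplesSeconds, 99);
+  return {
+    p95: { value: p95, result: classify(p95, RECOVERY_THRESHOLDS.p95) },
+    p99: { value: p99, result: classify(p99, RECOVERY_THRESHOLDS.p99) },
+  };
+}
+```
+
+Nearest-rank is chosen here because it always returns an observed sample; an interpolating percentile would give slightly different values on small sample sets.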
+
+### 4. Valkey Failure Injection
+
+**Location:** `tests/chaos/ValkeyFailureTests.cs`
+
+Tests verify router behavior when Valkey (cache/session store) fails:
+- Graceful degradation to stateless mode
+- No crashes or hangs
+- Proper error logging
+- Recovery when Valkey returns
+
+#### Failure Scenarios
+
+| Scenario | Expected Behavior |
+|----------|-------------------|
+| Valkey unreachable | Fallback to direct processing |
+| Valkey slow (>500ms) | Timeout and continue |
+| Valkey becomes reachable again | Resume normal caching |
+
+## CI Integration
+
+**Workflow:** `.gitea/workflows/router-chaos.yml`
+
+The chaos tests run:
+- On every PR to `main` that touches router code
+- Nightly against staging environment
+- Before production deployments
+
+### Workflow Stages
+
+1. **Build** - Compile router and test projects
+2. **Unit Tests** - Run BackpressureVerificationTests
+3. **Integration Tests** - Run RecoveryTests, ValkeyFailureTests
+4. **Load Tests** - Run k6 spike scenarios (staging only)
+5. **Report** - Upload results as artifacts
+
+## Interpreting Results
+
+### Success Criteria
+
+| Metric | Pass | Fail |
+|--------|------|------|
+| Request success rate during normal load | >=99% | <95% |
+| Throttle response rate during spike | >0% (expected) | 0% (no backpressure) |
+| Recovery time P95 | <30s | >=45s |
+| Data loss | 0% | >0% |
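+
+The criteria above could be checked mechanically along the following lines. This is a hedged sketch: the metric names are invented for the example, and the `inconclusive` verdict is an assumption covering values that fall between the Pass and Fail columns, which the table does not define.
+
+```javascript
+// Each rule returns 'pass', 'fail', or 'inconclusive' for values the table leaves undefined.
+const RULES = {
+  successRatePct: (v) => (v >= 99 ? 'pass' : v < 95 ? 'fail' : 'inconclusive'),
+  throttleRatePct: (v) => (v > 0 ? 'pass' : 'fail'),
+  recoveryP95Seconds: (v) => (v < 30 ? 'pass' : v >= 45 ? 'fail' : 'inconclusive'),
+  dataLossPct: (v) => (v === 0 ? 'pass' : 'fail'),
+};
+
+// Evaluate one test run; throws if a required metric is missing.
+function evaluateRun(metrics) {
+  const results = {};
+  for (const [name, rule] of Object.entries(RULES)) {
+    if (typeof metrics[name] !== 'number' || Number.isNaN(metrics[name])) {
+      throw new Error(`missing metric: ${name}`);
+    }
+    results[name] = rule(metrics[name]);
+  }
+  const verdicts = Object.values(results);
+  const overall = verdicts.includes('fail')
+    ? 'fail'
+    : verdicts.includes('inconclusive') ? 'inconclusive' : 'pass';
+  return { overall, results };
+}
+```
+
+A run with a 99.5% success rate, 12% throttled requests, a 20s P95 recovery, and no data loss evaluates to `pass`; setting the throttle rate to 0 flips the overall verdict to `fail`.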
+
+### Common Failure Patterns
+
+#### No Throttling Under Load
+- **Symptom:** 0% throttled requests during the 50x spike
+- **Cause:** Backpressure is not configured, or the circuit breaker is disabled
+- **Fix:** Confirm that the router configuration sets `backpressure.enabled=true`
+
+#### Slow Recovery
+- **Symptom:** Recovery time >45s
+- **Cause:** The request queue is not draining properly
+- **Fix:** Review the `maxQueueSize` and `drainTimeoutSeconds` settings
+
+#### Missing Retry-After Header
+- **Symptom:** 429/503 response without a `Retry-After` header
+- **Cause:** The header middleware is not applied
+- **Fix:** Ensure `UseRetryAfterMiddleware()` is registered in the request pipeline
+
+## Metrics & Dashboards
+
+### Key Metrics to Monitor
+
+```promql
+# Throttle rate (aggregate both sides so the label sets match before dividing)
+sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))
+
+# Recovery time
+histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m]))
+
+# Queue depth
+router_request_queue_depth
+```
+
+### Alert Thresholds
+
+| Alert | Condition | Severity |
+|-------|-----------|----------|
+| High Throttle Rate | throttle_rate > 10% for 5m | Warning |
+| Extended Throttle | throttle_rate > 50% for 2m | Critical |
+| Slow Recovery | p95_recovery > 30s | Warning |
+| No Recovery | p99_recovery > 60s | Critical |
+
+## Troubleshooting
+
+### Test Environment Setup
+
+```bash
+# Start router locally
+docker-compose up router valkey
+
+# Verify router health
+curl http://localhost:8080/health
+
+# Verify Valkey connection
+docker exec -it valkey redis-cli ping
+```
+
+### Debug Mode
+
+```bash
+# Run tests with verbose logging
+dotnet test tests/chaos/ --logger "console;verbosity=detailed"
+
+# k6 with debug output
+k6 run tests/load/router/spike-test.js --verbose
+```
+
+## References
+
+- [Router Architecture](../modules/router/architecture.md)
+- [Backpressure Design](../product-advisories/15-Dec-2025%20-%20Designing%20202%20+%20Retry-After%20Backpressure%20Control.md)
+- [Testing Strategy](../product-advisories/20-Dec-2025%20-%20Testing%20strategy.md)