Router Chaos Testing Runbook

Sprint: SPRINT_5100_0005_0001
Last Updated: 2025-12-22

Overview

This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.

Test Categories

1. Load Testing (k6)

Location: tests/load/router/

Spike Test Scenarios

| Scenario | Rate | Duration | Purpose |
|----------|------|----------|---------|
| Baseline | 100 req/s | 1 min | Establish normal operation |
| 10x Spike | 1000 req/s | 30 s | Moderate overload |
| 50x Spike | 5000 req/s | 30 s | Severe overload |
| Recovery | 100 req/s | 2 min | Measure recovery time |
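
As a rough illustration of how the scenarios above could map onto k6 configuration, the sketch below builds a scenario object for k6's `ramping-arrival-rate` executor. The helper names (`buildSpikeScenario`, `totalSeconds`) and the VU counts are assumptions made for this example; the actual `spike-test.js` may be structured differently.

```javascript
// Scenario values copied from the table above; durations are in seconds.
const SCENARIOS = [
  { name: 'baseline', rate: 100, seconds: 60 },
  { name: 'spike_10x', rate: 1000, seconds: 30 },
  { name: 'spike_50x', rate: 5000, seconds: 30 },
  { name: 'recovery', rate: 100, seconds: 120 },
];

// Build a scenario object in the shape k6's `ramping-arrival-rate` executor
// expects. preAllocatedVUs and maxVUs are placeholder values, not tuned ones.
function buildSpikeScenario(scenarios = SCENARIOS) {
  return {
    executor: 'ramping-arrival-rate',
    startRate: scenarios[0].rate, // iterations per timeUnit at test start
    timeUnit: '1s',
    preAllocatedVUs: 200,
    maxVUs: 2000,
    // k6 ramps linearly toward each target over the stage duration, so the
    // spike stages climb to their rate rather than holding it throughout.
    stages: scenarios.map((s) => ({ target: s.rate, duration: `${s.seconds}s` })),
  };
}

// Total planned run time in seconds, useful for sizing CI timeouts.
function totalSeconds(scenarios = SCENARIOS) {
  return scenarios.reduce((sum, s) => sum + s.seconds, 0);
}
```

Inside a k6 script, the result would typically be exposed as `export const options = { scenarios: { spike: buildSpikeScenario() } };`. If each rate needs to be held for its full duration, a short ramp stage can be added in front of each hold stage.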

Running Load Tests

# Install k6
brew install k6  # macOS
# or
choco install k6  # Windows

# Run spike test against local router
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=http://localhost:8080

# Run against staging
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=https://router.staging.stellaops.io

# Output results to JSON
k6 run tests/load/router/spike-test.js \
  --out json=results.json

2. Backpressure Verification

Location: tests/chaos/BackpressureVerificationTests.cs

Tests verify:

  • HTTP 429 responses include Retry-After header
  • HTTP 503 responses include Retry-After header
  • Retry-After values are reasonable (1-60 seconds)
  • No data loss during throttling

Expected Behavior

| Load Level | Expected Response | Retry-After |
|------------|-------------------|-------------|
| Normal | 200 OK | N/A |
| High (>80% capacity) | 429 Too Many Requests | 1-10s |
| Critical (>95% capacity) | 503 Service Unavailable | 10-60s |
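
The expected behavior above can be expressed as a small response check. The sketch below is illustrative only and is not taken from `BackpressureVerificationTests.cs`; the function names and the assumption that headers arrive as a plain object with lowercase keys are hypothetical. It accepts both forms of `Retry-After` that HTTP allows: a number of seconds or an HTTP-date.

```javascript
// Allowed Retry-After ranges in seconds, taken from the table above.
const RETRY_AFTER_RANGES = {
  429: { min: 1, max: 10 },
  503: { min: 10, max: 60 },
};

// Parse a Retry-After value as either delta-seconds or an HTTP-date.
// Returns the delay in seconds, or null if the value cannot be parsed.
function parseRetryAfter(value, now = Date.now()) {
  if (typeof value !== 'string' || value.trim() === '') return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed);
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? null : Math.max(0, Math.ceil((date - now) / 1000));
}

// Check one response against the expected backpressure behavior.
// `headers` is assumed to be a plain object with lowercase header names.
function checkBackpressureResponse(status, headers, now = Date.now()) {
  const range = RETRY_AFTER_RANGES[status];
  if (!range) return { ok: true, reason: 'not a throttle response' };
  const seconds = parseRetryAfter(headers['retry-after'], now);
  if (seconds === null) return { ok: false, reason: 'missing or invalid Retry-After' };
  if (seconds < range.min || seconds > range.max) {
    return { ok: false, reason: `Retry-After ${seconds}s outside ${range.min}-${range.max}s` };
  }
  return { ok: true, reason: 'Retry-After within range' };
}
```

For example, `checkBackpressureResponse(503, { 'retry-after': '5' })` is rejected because 5 seconds falls below the 10-60s range expected at critical load.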

3. Recovery Testing

Location: tests/chaos/RecoveryTests.cs

Tests verify:

  • Router recovers within 30 seconds after load drops
  • No request queue corruption
  • Metrics return to baseline

Recovery Thresholds

| Metric | Target | Critical |
|--------|--------|----------|
| P95 Recovery Time | <15s | <30s |
| P99 Recovery Time | <25s | <45s |
| Data Loss | 0% | 0% |
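
To make the percentile thresholds concrete, the following sketch computes P95 and P99 from a list of measured recovery times and compares them with the limits above. It uses the nearest-rank percentile definition, which is one common choice and may not match how `RecoveryTests.cs` or the monitoring stack computes percentiles. The function names and status labels are assumptions for this example, and the Data Loss row is not covered.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of the samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Limits in seconds, taken from the table above.
const RECOVERY_LIMITS = {
  p95: { percentile: 95, target: 15, critical: 30 },
  p99: { percentile: 99, target: 25, critical: 45 },
};

// Classify a set of recovery times (in seconds) against both percentiles.
// Status labels are hypothetical: 'on-target', 'within-critical', 'breach'.
function classifyRecovery(recoverySeconds) {
  const result = {};
  for (const [name, limits] of Object.entries(RECOVERY_LIMITS)) {
    const value = percentile(recoverySeconds, limits.percentile);
    let status = 'breach';
    if (value < limits.target) status = 'on-target';
    else if (value < limits.critical) status = 'within-critical';
    result[name] = { value, status };
  }
  return result;
}
```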

4. Valkey Failure Injection

Location: tests/chaos/ValkeyFailureTests.cs

Tests verify router behavior when Valkey (cache/session store) fails:

  • Graceful degradation to stateless mode
  • No crashes or hangs
  • Proper error logging
  • Recovery when Valkey returns

Failure Scenarios

| Scenario | Expected Behavior |
|----------|-------------------|
| Valkey unreachable | Fallback to direct processing |
| Valkey slow (>500ms) | Timeout and continue |
| Valkey returns | Resume normal caching |
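
The degradation pattern described above (time out a slow cache, fall back to direct processing, resume caching once the cache responds again) can be sketched generically. This is not the router's implementation; `getWithFallback`, `cacheLookup`, and `processDirectly` are hypothetical names, and the 500 ms default mirrors the threshold in the table.

```javascript
// Race a cache lookup against a timeout and fall back to direct processing
// when the cache is slow, unreachable, or has no entry.
async function getWithFallback(cacheLookup, processDirectly, timeoutMs = 500) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('cache timeout')), timeoutMs);
  });
  try {
    const cached = await Promise.race([cacheLookup(), timeout]);
    if (cached !== undefined && cached !== null) {
      return { value: cached, source: 'cache' };
    }
  } catch (err) {
    // Log the failure and continue; a cache error must not fail the request.
    console.error(`cache unavailable, falling back: ${err.message}`);
  } finally {
    clearTimeout(timer);
  }
  return { value: await processDirectly(), source: 'direct' };
}
```

Because every call tries the cache first, caching resumes on its own once lookups start succeeding again. A production implementation would more likely add a circuit breaker so that a failing cache is not probed on every request.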

CI Integration

Workflow: .gitea/workflows/router-chaos.yml

The chaos tests run:

  • On every PR to main that touches router code
  • Nightly against staging environment
  • Before production deployments

Workflow Stages

  1. Build - Compile router and test projects
  2. Unit Tests - Run BackpressureVerificationTests
  3. Integration Tests - Run RecoveryTests, ValkeyFailureTests
  4. Load Tests - Run k6 spike scenarios (staging only)
  5. Report - Upload results as artifacts

Interpreting Results

Success Criteria

| Metric | Pass | Fail |
|--------|------|------|
| Request success rate during normal load | >=99% | <95% |
| Throttle response rate during spike | >0% (expected) | 0% (no backpressure) |
| Recovery time P95 | <30s | >=45s |
| Data loss | 0% | >0% |
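
The criteria above leave a gap between the pass and fail limits for two metrics (success rate between 95% and 99%, P95 recovery between 30s and 45s). The sketch below applies the table to a run summary and reports values in that gap as `inconclusive`, which is an interpretation added for this example rather than something the runbook defines. The input field names are also hypothetical, and rates are expressed as fractions between 0 and 1.

```javascript
// Evaluate one chaos run against the success criteria table.
function evaluateRun(m) {
  const checks = {
    successRate: m.normalSuccessRate >= 0.99 ? 'pass'
      : m.normalSuccessRate < 0.95 ? 'fail' : 'inconclusive',
    // Some throttling during the spike is expected; none means no backpressure.
    throttling: m.spikeThrottleRate > 0 ? 'pass' : 'fail',
    recovery: m.recoveryP95Seconds < 30 ? 'pass'
      : m.recoveryP95Seconds >= 45 ? 'fail' : 'inconclusive',
    dataLoss: m.dataLossRate === 0 ? 'pass' : 'fail',
  };
  const statuses = Object.values(checks);
  const overall = statuses.includes('fail') ? 'fail'
    : statuses.includes('inconclusive') ? 'inconclusive' : 'pass';
  return { overall, checks };
}
```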

Common Failure Patterns

No Throttling Under Load

Symptom: 0% throttled requests during the 50x spike
Cause: Backpressure is not configured, or the circuit breaker is disabled
Fix: Confirm that the router configuration sets backpressure.enabled=true

Slow Recovery

Symptom: Recovery time >45s
Cause: The request queue is not draining properly
Fix: Review the maxQueueSize and drainTimeoutSeconds settings

Missing Retry-After Header

Symptom: 429/503 responses without a Retry-After header
Cause: The header middleware is not applied
Fix: Ensure UseRetryAfterMiddleware() is registered in the request pipeline

Metrics & Dashboards

Key Metrics to Monitor

# Throttle rate
rate(http_requests_total{status="429"}[5m]) / rate(http_requests_total[5m])

# Recovery time
histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m]))

# Queue depth
router_request_queue_depth

Alert Thresholds

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Throttle Rate | throttle_rate > 10% for 5m | Warning |
| Extended Throttle | throttle_rate > 50% for 2m | Critical |
| Slow Recovery | p95_recovery > 30s | Warning |
| No Recovery | p99_recovery > 60s | Critical |

Troubleshooting

Test Environment Setup

# Start router locally
docker-compose up router valkey

# Verify router health
curl http://localhost:8080/health

# Verify Valkey connection
docker exec -it valkey redis-cli ping

Debug Mode

# Run tests with verbose logging
dotnet test tests/chaos/ --logger "console;verbosity=detailed"

# k6 with debug output
k6 run tests/load/router/spike-test.js --verbose

References