Router Chaos Testing Runbook

Sprint: SPRINT_5100_0005_0001
Last Updated: 2025-12-22

Overview

This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.

Test Categories

1. Load Testing (k6)

Location: tests/load/router/

Spike Test Scenarios

Scenario	Rate	Duration	Purpose
Baseline	100 req/s	1 min	Establish normal operation
10x Spike	1000 req/s	30s	Moderate overload
50x Spike	5000 req/s	30s	Severe overload
Recovery	100 req/s	2 min	Measure recovery time
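As a rough illustration, the four phases in the table could be expressed as stages for k6's `ramping-arrival-rate` executor. This is a sketch, not the contents of `spike-test.js`: the scenario name `router_spike` and the VU counts are assumptions, and k6 ramps linearly toward each stage's target, so the steps are approximated rather than instantaneous.

```javascript
// Phases copied from the Spike Test Scenarios table.
const PHASES = [
  { name: 'baseline', rate: 100, seconds: 60 },
  { name: 'spike_10x', rate: 1000, seconds: 30 },
  { name: 'spike_50x', rate: 5000, seconds: 30 },
  { name: 'recovery', rate: 100, seconds: 120 },
];

// Convert each phase into a k6 stage object ({ target, duration }).
function buildStages(phases) {
  return phases.map((p) => ({ target: p.rate, duration: `${p.seconds}s` }));
}

// Total planned run time, useful for sizing CI timeouts.
function totalSeconds(phases) {
  return phases.reduce((sum, p) => sum + p.seconds, 0);
}

// Hypothetical options object. In a k6 script this would be exported
// as `export const options = ...`.
const options = {
  scenarios: {
    router_spike: {
      executor: 'ramping-arrival-rate',
      startRate: PHASES[0].rate, // iterations per timeUnit at start
      timeUnit: '1s',
      preAllocatedVUs: 200, // assumed value, tune for the load generator
      maxVUs: 2000, // assumed value
      stages: buildStages(PHASES),
    },
  },
};
```

Keeping the phases in one array means the table and the script can be compared line by line when a rate or duration changes.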

Running Load Tests

# Install k6
brew install k6  # macOS
# or
choco install k6  # Windows

# Run spike test against local router
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=http://localhost:8080

# Run against staging
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=https://router.staging.stellaops.io

# Output results to JSON
k6 run tests/load/router/spike-test.js \
  --out json=results.json

2. Backpressure Verification

Location: tests/chaos/BackpressureVerificationTests.cs

Tests verify:

HTTP 429 responses include Retry-After header
HTTP 503 responses include Retry-After header
Retry-After values are reasonable (1-60 seconds)
No data loss during throttling
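The header checks above can be sketched as a small helper that a load script or test harness might call on each response. This is an illustrative sketch, not the code in `BackpressureVerificationTests.cs`: the function name is hypothetical, headers are assumed to arrive as a plain object with lower-cased names, and only the integer-seconds form of `Retry-After` is handled (the header may also carry an HTTP date).

```javascript
// Returns a list of problems found in a throttled response.
// An empty list means the response passes the checks.
function checkThrottledResponse(status, headers) {
  const problems = [];
  if (status !== 429 && status !== 503) {
    return problems; // not a throttle response, nothing to verify
  }
  const raw = headers['retry-after'];
  if (raw === undefined || raw === null) {
    problems.push(`HTTP ${status} without Retry-After`);
    return problems;
  }
  const text = String(raw).trim();
  // Assumption: only delta-seconds values are expected from the router.
  if (!/^\d+$/.test(text)) {
    problems.push(`Retry-After is not an integer number of seconds: ${text}`);
    return problems;
  }
  const seconds = Number(text);
  if (seconds < 1 || seconds > 60) {
    problems.push(`Retry-After ${seconds}s is outside the 1-60s range`);
  }
  return problems;
}
```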

Expected Behavior

Load Level	Expected Response	Retry-After
Normal	200 OK	N/A
High (>80% capacity)	429 Too Many Requests	1-10s
Critical (>95% capacity)	503 Service Unavailable	10-60s
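One way to read the table is as a lookup from utilization to the expected status and Retry-After range. The sketch below assumes utilization is available as a fraction between 0 and 1; how the router actually measures capacity is not described here, and the function names are hypothetical.

```javascript
// Map a utilization fraction (0..1) to the expected response from the table.
function expectedResponse(utilization) {
  if (utilization > 0.95) return { status: 503, retryAfter: [10, 60] };
  if (utilization > 0.80) return { status: 429, retryAfter: [1, 10] };
  return { status: 200, retryAfter: null };
}

// Check an observed response against the expectation for a given utilization.
// `retryAfterSeconds` is ignored when no Retry-After is expected.
function matchesExpectation(utilization, status, retryAfterSeconds) {
  const expected = expectedResponse(utilization);
  if (status !== expected.status) return false;
  if (expected.retryAfter === null) return true;
  const [min, max] = expected.retryAfter;
  return retryAfterSeconds >= min && retryAfterSeconds <= max;
}
```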

3. Recovery Testing

Location: tests/chaos/RecoveryTests.cs

Tests verify:

Router recovers within 30 seconds after load drops
No request queue corruption
Metrics return to baseline

Recovery Thresholds

Metric	Target	Critical
P95 Recovery Time	<15s	<30s
P99 Recovery Time	<25s	<45s
Data Loss	0%	0%
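To make the thresholds concrete, the sketch below computes nearest-rank percentiles over a set of measured recovery times and compares them with the Target and Critical columns. The labels `within_target`, `within_critical`, and `breach` are hypothetical, and the actual tests in `RecoveryTests.cs` may compute percentiles differently.

```javascript
// Nearest-rank percentile of a non-empty array of numbers.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Compare a value with the Target and Critical limits from the table.
function classify(value, target, critical) {
  if (value < target) return 'within_target';
  if (value < critical) return 'within_critical';
  return 'breach';
}

// Evaluate recovery times (in seconds) against the P95 and P99 rows.
function evaluateRecovery(recoverySeconds) {
  return {
    p95: classify(percentile(recoverySeconds, 95), 15, 30),
    p99: classify(percentile(recoverySeconds, 99), 25, 45),
  };
}
```

With only a handful of samples, P95 and P99 both collapse to the slowest observation, so a run needs enough recovery measurements for the two rows to say different things.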

4. Valkey Failure Injection

Location: tests/chaos/ValkeyFailureTests.cs

Tests verify router behavior when Valkey (cache/session store) fails:

Graceful degradation to stateless mode
No crashes or hangs
Proper error logging
Recovery when Valkey returns

Failure Scenarios

Scenario	Expected Behavior
Valkey unreachable	Fallback to direct processing
Valkey slow (>500ms)	Timeout and continue
Valkey returns	Resume normal caching
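The "timeout and continue" behaviour in the table can be sketched as a cache lookup wrapped in a deadline, falling back to direct processing when the lookup fails or is too slow. This is only an illustration of the pattern in JavaScript: the router's real implementation is not shown in this runbook, and `cacheGet` and `processDirect` are hypothetical callbacks.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try the cache first; on error or timeout, log and process the request directly.
async function handleWithFallback(key, cacheGet, processDirect, timeoutMs = 500) {
  try {
    const cached = await withTimeout(cacheGet(key), timeoutMs);
    if (cached !== undefined && cached !== null) {
      return { source: 'cache', value: cached };
    }
  } catch (err) {
    // Valkey unreachable or slow: record the failure and carry on.
    console.warn(`cache bypassed for ${key}: ${err.message}`);
  }
  return { source: 'direct', value: await processDirect(key) };
}
```

Because every request re-attempts the cache lookup, caching resumes on its own once Valkey is reachable again, which corresponds to the last row of the table.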

CI Integration

Workflow: .gitea/workflows/router-chaos.yml

The chaos tests run:

On every PR to main that touches router code
Nightly against staging environment
Before production deployments

Workflow Stages

Build - Compile router and test projects
Unit Tests - Run BackpressureVerificationTests
Integration Tests - Run RecoveryTests, ValkeyFailureTests
Load Tests - Run k6 spike scenarios (staging only)
Report - Upload results as artifacts

Interpreting Results

Success Criteria

Metric	Pass	Fail
Request success rate during normal load	>=99%	<95%
Throttle response rate during spike	>0% (expected)	0% (no backpressure)
Recovery time P95	<30s	>=45s
Data loss	0%	>0%
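The criteria can be applied mechanically to a run summary, as in the sketch below. The field names on `summary` are hypothetical, and because the table leaves gaps between Pass and Fail (for example a 97% success rate, or a 35s P95 recovery), the sketch reports those cases as `inconclusive` rather than guessing.

```javascript
// Evaluate a run summary against the Success Criteria table.
// Rates are fractions (0..1); recovery time is in seconds.
function evaluateRun(summary) {
  const verdicts = {
    successRate:
      summary.normalSuccessRate >= 0.99 ? 'pass'
        : summary.normalSuccessRate < 0.95 ? 'fail' : 'inconclusive',
    throttling: summary.spikeThrottleRate > 0 ? 'pass' : 'fail',
    recoveryP95:
      summary.recoveryP95Seconds < 30 ? 'pass'
        : summary.recoveryP95Seconds >= 45 ? 'fail' : 'inconclusive',
    dataLoss: summary.dataLossRate === 0 ? 'pass' : 'fail',
  };
  const values = Object.values(verdicts);
  // Any failure fails the run; otherwise any gap makes it inconclusive.
  const overall = values.includes('fail') ? 'fail'
    : values.includes('inconclusive') ? 'inconclusive' : 'pass';
  return { overall, verdicts };
}
```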

Common Failure Patterns

No Throttling Under Load

Symptom: 0% throttled requests during the 50x spike
Cause: Backpressure not configured or circuit breaker disabled
Fix: Verify that the router configuration sets backpressure.enabled=true

Slow Recovery

Symptom: Recovery time >45s
Cause: Request queue not draining properly
Fix: Check maxQueueSize and drainTimeoutSeconds settings

Missing Retry-After Header

Symptom: 429/503 without Retry-After
Cause: Header middleware not applied
Fix: Ensure UseRetryAfterMiddleware() is in pipeline

Metrics & Dashboards

Key Metrics to Monitor

# Throttle rate (share of all requests answered with 429)
sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))

# Recovery time (P95)
histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m]))

# Queue depth
router_request_queue_depth

Alert Thresholds

Alert	Condition	Severity
High Throttle Rate	throttle_rate > 10% for 5m	Warning
Extended Throttle	throttle_rate > 50% for 2m	Critical
Slow Recovery	p95_recovery > 30s	Warning
No Recovery	p99_recovery > 60s	Critical
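The "for 5m" and "for 2m" conditions mean the threshold must be exceeded continuously for that long, not just once. The sketch below approximates that rule over a list of samples; it is a simplified model for reasoning about test output, not how the alerting system evaluates rules, and the sample shape `{ t, value }` (time in seconds, rate as a fraction) is an assumption.

```javascript
// Returns true if `value` has stayed above `threshold` continuously for at
// least `forSeconds`, measured up to the most recent sample.
// `samples` must be sorted by ascending time `t` (seconds).
function firesFor(samples, threshold, forSeconds) {
  let since = null; // start of the current uninterrupted breach, if any
  for (const s of samples) {
    if (s.value > threshold) {
      if (since === null) since = s.t;
    } else {
      since = null; // any dip below the threshold resets the clock
    }
  }
  if (since === null) return false;
  const last = samples[samples.length - 1];
  return last.t - since >= forSeconds;
}

// Alert definitions taken from the throttle rows of the table.
const highThrottle = (samples) => firesFor(samples, 0.10, 5 * 60);
const extendedThrottle = (samples) => firesFor(samples, 0.50, 2 * 60);
```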

Troubleshooting

Test Environment Setup

# Start router and Valkey locally (detached, so the checks below can run in the same shell)
docker-compose up -d router valkey

# Verify router health
curl http://localhost:8080/health

# Verify Valkey connection (a healthy instance replies PONG)
docker exec -it valkey redis-cli ping

Debug Mode

# Run tests with verbose logging
dotnet test tests/chaos/ --logger "console;verbosity=detailed"

# k6 with debug output
k6 run tests/load/router/spike-test.js --verbose
