
# Router Chaos Testing Runbook

- **Sprint:** SPRINT_5100_0005_0001
- **Last Updated:** 2025-12-22

## Overview

This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.

## Test Categories

### 1. Load Testing (k6)

**Location:** `tests/load/router/`

#### Spike Test Scenarios

| Scenario | Rate | Duration | Purpose |
|----------|------|----------|---------|
| Baseline | 100 req/s | 1 min | Establish normal operation |
| 10x Spike | 1000 req/s | 30s | Moderate overload |
| 50x Spike | 5000 req/s | 30s | Severe overload |
| Recovery | 100 req/s | 2 min | Measure recovery time |

#### Running Load Tests

```bash
# Install k6
brew install k6  # macOS
# or
choco install k6  # Windows

# Run spike test against local router
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=http://localhost:8080

# Run against staging
k6 run tests/load/router/spike-test.js \
  -e ROUTER_URL=https://router.staging.stellaops.io

# Output results to JSON
k6 run tests/load/router/spike-test.js \
  --out json=results.json
```

### 2. Backpressure Verification

**Location:** `tests/chaos/BackpressureVerificationTests.cs`

Tests verify:
- HTTP 429 responses include `Retry-After` header
- HTTP 503 responses include `Retry-After` header
- Retry-After values are reasonable (1-60 seconds)
- No data loss during throttling

#### Expected Behavior

| Load Level | Expected Response | Retry-After |
|------------|-------------------|-------------|
| Normal | 200 OK | N/A |
| High (>80% capacity) | 429 Too Many Requests | 1-10s |
| Critical (>95% capacity) | 503 Service Unavailable | 10-60s |
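
The header expectations above can be spot-checked by hand. The following is a minimal sketch, not the logic in `BackpressureVerificationTests.cs`: the function name `check_retry_after` is hypothetical, and it assumes `Retry-After` is sent in its delta-seconds form (an HTTP-date value is reported as non-numeric).

```bash
# Hypothetical helper: validate backpressure headers read from stdin.
# Succeeds when the response is not throttled, or when a 429/503
# carries a Retry-After value between 1 and 60 seconds.
check_retry_after() (
  headers=$(tr -d '\r')   # strip CRs from the raw HTTP header block
  status=$(printf '%s\n' "$headers" | awk 'NR==1 {print $2}')
  case "$status" in
    429|503) ;;
    *) echo "status $status: not throttled"; exit 0 ;;
  esac
  # Header names are case-insensitive, so compare in lower case.
  retry=$(printf '%s\n' "$headers" |
    awk -F': *' 'tolower($1)=="retry-after" {gsub(/[ \t]+$/, "", $2); print $2; exit}')
  case "$retry" in
    ''|*[!0-9]*) echo "status $status: missing or non-numeric Retry-After"; exit 1 ;;
  esac
  if [ "$retry" -ge 1 ] && [ "$retry" -le 60 ]; then
    echo "status $status: Retry-After ${retry}s ok"
  else
    echo "status $status: Retry-After ${retry}s out of range"; exit 1
  fi
)
```

For example, `curl -sS -o /dev/null -D - http://localhost:8080/health | check_retry_after` pipes the response headers of a single request into the check; any router endpoint under load can be substituted for the health URL.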

### 3. Recovery Testing

**Location:** `tests/chaos/RecoveryTests.cs`

Tests verify:
- Router recovers within 30 seconds after load drops
- No request queue corruption
- Metrics return to baseline

#### Recovery Thresholds

| Metric | Target | Critical |
|--------|--------|----------|
| P95 Recovery Time | <15s | <30s |
| P99 Recovery Time | <25s | <45s |
| Data Loss | 0% | 0% |
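
To make the table concrete, here is a small sketch that classifies a measured recovery time against these thresholds. It assumes the "Critical" column is the upper limit that must not be reached; the function name and the labels `target`, `acceptable`, and `breach` are hypothetical and do not come from the test suite.

```bash
# Hypothetical helper: classify_recovery PERCENTILE SECONDS
# Prints "target", "acceptable", or "breach" using the thresholds above.
classify_recovery() (
  case "$1" in
    p95) target=15; critical=30 ;;
    p99) target=25; critical=45 ;;
    *) echo "unknown percentile: $1" >&2; exit 2 ;;
  esac
  # awk handles fractional seconds, which plain `test` cannot compare.
  awk -v s="$2" -v t="$target" -v c="$critical" 'BEGIN {
    if (s < t) print "target"
    else if (s < c) print "acceptable"
    else print "breach"
  }'
)
```

For example, `classify_recovery p95 12.4` reports a P95 recovery of 12.4 seconds as within target, while `classify_recovery p99 45` reports a breach because the value reaches the critical limit.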

### 4. Valkey Failure Injection

**Location:** `tests/chaos/ValkeyFailureTests.cs`

Tests verify router behavior when Valkey (cache/session store) fails:
- Graceful degradation to stateless mode
- No crashes or hangs
- Proper error logging
- Recovery when Valkey returns

#### Failure Scenarios

| Scenario | Expected Behavior |
|----------|-------------------|
| Valkey unreachable | Fallback to direct processing |
| Valkey slow (>500ms) | Timeout and continue |
| Valkey returns | Resume normal caching |
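
The "unreachable" and "returns" scenarios can also be reproduced by hand against the local environment. This sketch assumes the `valkey` compose service and the `/health` endpoint used elsewhere in this runbook; the helper names `run` and `valkey_outage` are hypothetical, and the "slow" scenario is not covered because it needs latency injection.

```bash
# Hypothetical helper: print commands instead of running them when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# Hypothetical helper: take Valkey down for N seconds (default 30), then restore it.
valkey_outage() (
  duration=${1:-30}
  run docker-compose stop valkey
  # During this window the router is expected to keep serving requests.
  run sleep "$duration"
  run docker-compose start valkey
  # After Valkey returns, the router should still report healthy.
  run curl -fsS "${ROUTER_URL:-http://localhost:8080}/health"
)
```

Running `DRY_RUN=1 valkey_outage 10` first shows the exact commands without touching any container, which is a cheap way to confirm the target before interrupting a shared environment.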

## CI Integration

**Workflow:** `.gitea/workflows/router-chaos.yml`

The chaos tests run:
- On every PR to `main` that touches router code
- Nightly against staging environment
- Before production deployments

### Workflow Stages

1. **Build** - Compile router and test projects
2. **Unit Tests** - Run `BackpressureVerificationTests`
3. **Integration Tests** - Run `RecoveryTests`, `ValkeyFailureTests`
4. **Load Tests** - Run k6 spike scenarios (staging only)
5. **Report** - Upload results as artifacts

## Interpreting Results

### Success Criteria

| Metric | Pass | Fail |
|--------|------|------|
| Request success rate during normal load | >=99% | <95% |
| Throttle response rate during spike | >0% (expected) | 0% (no backpressure) |
| Recovery time P95 | <30s | >=45s |
| Data loss | 0% | >0% |
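
The throttle criterion can be checked from any list of observed status codes. The sketch below is one possible approach rather than the logic in the k6 script: the function name `throttle_verdict` is hypothetical, and it assumes both 429 and 503 count as throttle responses, as described in the Backpressure Verification section.

```bash
# Hypothetical helper: read one HTTP status code per line on stdin,
# print the throttle rate, and fail when no request was throttled.
throttle_verdict() {
  awk '
    NF { total++ }                           # ignore blank lines
    $1 == 429 || $1 == 503 { throttled++ }   # backpressure responses
    END {
      if (total == 0) { print "no samples"; exit 2 }
      printf "throttled %d/%d (%.1f%%)\n", throttled, total, 100 * throttled / total
      exit (throttled > 0 ? 0 : 1)
    }'
}
```

Status codes can be collected with `curl -s -o /dev/null -w '%{http_code}\n' "$ROUTER_URL"` in a loop during a spike and piped into `throttle_verdict`. A non-zero exit status after a 50x spike corresponds to the "0% (no backpressure)" failure case.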

### Common Failure Patterns

#### No Throttling Under Load

- **Symptom:** 0% throttled requests during the 50x spike
- **Cause:** Backpressure is not configured, or the circuit breaker is disabled
- **Fix:** Confirm that the router configuration sets `backpressure.enabled=true`

#### Slow Recovery

- **Symptom:** Recovery time >45s
- **Cause:** The request queue is not draining properly
- **Fix:** Review the `maxQueueSize` and `drainTimeoutSeconds` settings

#### Missing Retry-After Header

- **Symptom:** 429/503 responses without a `Retry-After` header
- **Cause:** The header middleware is not applied
- **Fix:** Ensure `UseRetryAfterMiddleware()` is registered in the request pipeline

## Metrics & Dashboards

### Key Metrics to Monitor

```promql
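# Throttle rate (aggregate both sides so the label sets match for the division)
sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))

# Recovery time (P95, aggregated across instances)
histogram_quantile(0.95, sum by (le) (rate(request_recovery_seconds_bucket[5m])))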

# Queue depth
router_request_queue_depth
```

### Alert Thresholds

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Throttle Rate | throttle_rate > 10% for 5m | Warning |
| Extended Throttle | throttle_rate > 50% for 2m | Critical |
| Slow Recovery | p95_recovery > 30s | Warning |
| No Recovery | p99_recovery > 60s | Critical |

## Troubleshooting

### Test Environment Setup

```bash
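# Start router and Valkey locally (detached, so the shell stays free for the checks below)
docker-compose up -d router valkey

# Verify router health (-f makes curl exit non-zero on an HTTP error status)
curl -f http://localhost:8080/health

# Verify Valkey connection (expect PONG)
docker exec -it valkey redis-cli ping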
```

### Debug Mode

```bash
# Run tests with verbose logging
dotnet test tests/chaos/ --logger "console;verbosity=detailed"

# k6 with debug output
k6 run tests/load/router/spike-test.js --verbose
```

## References

- [Router Architecture](../modules/router/architecture.md)
- [Backpressure Design](../product-advisories/15-Dec-2025%20-%20Designing%20202%20+%20Retry-After%20Backpressure%20Control.md)
- [Testing Strategy](../product-advisories/20-Dec-2025%20-%20Testing%20strategy.md)