git.stella-ops.org/devops/services/tenant/README.md
2025-12-26 18:11:06 +02:00

# Tenant audit & chaos kit (DEVOPS-TEN-49-001)
Artifacts live in this folder to cover tenant audit logging, usage metrics, JWKS outage chaos, and load/perf benchmarks.
## What's here
- `recording-rules.yaml`: Prometheus recording rules for per-tenant rate/error/latency and JWKS cache ratio.
- `alerts.yaml`: Alert rules for error rate, JWKS cache miss spike, p95 latency, auth failures, and rate limit hits.
- `dashboards/tenant-audit.json`: Grafana dashboard with tenant/service variables.
- `k6-tenant-load.js`: Multi-tenant load/perf scenario (read/write 90/10, tenant header, configurable paths).
- `jwks-chaos.sh`: iptables-based JWKS dropper for chaos drills.
## Import & wiring
1. Load `recording-rules.yaml` and `alerts.yaml` into the Prometheus rule groups for the tenancy stack.
2. Import `dashboards/tenant-audit.json` into Grafana (folder `StellaOps / Tenancy`).
3. Ensure services emit `tenant` labels on request metrics and structured logs (`tenant`, `subject`, `action`, `resource`, `result`, `traceId`).
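To make the structured-log requirement in step 3 concrete, here is a minimal sketch of one audit log line carrying those six fields. The helper name `audit_log_line` and the sample values are hypothetical, and the JSON shape is an assumption; the kit only specifies the field names. The helper does no JSON escaping, so it assumes values contain no quotes or backslashes.

```bash
# Hypothetical helper (not part of this kit): prints one audit log line as JSON.
# Field names are taken from step 3 above; everything else is illustrative.
# Assumes values contain no double quotes or backslashes (no escaping is done).
audit_log_line() {
  printf '{"tenant":"%s","subject":"%s","action":"%s","resource":"%s","result":"%s","traceId":"%s"}\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}
```

For example, `audit_log_line tenant-a user-1 read policy/42 allow abc123` emits a single line that a log pipeline could filter by `tenant`.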
## Load/perf (k6)
```bash
BASE_URL=https://api.stella.local \
TENANTS=tenant-a,tenant-b,tenant-c \
TENANT_HEADER=X-StellaOps-Tenant \
VUS=5000 DURATION=15m \
k6 run ops/devops/tenant/k6-tenant-load.js
```
Set `TENANT_READ_PATHS` / `TENANT_WRITE_PATHS` to point at Policy/Vuln/Notify endpoints. Default thresholds: p95 latency < 300 ms for reads and < 600 ms for writes, error rate < 0.5%.
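If you extract the p95 and error-rate numbers from a run yourself (for example, from a summary export), the default thresholds can be re-checked with a small gate like the sketch below. `check_thresholds` is a hypothetical helper, not something this kit ships, and how `k6-tenant-load.js` itself enforces the thresholds is not shown here.

```bash
# Hypothetical post-run gate: compares observed numbers with the default thresholds.
# Usage: check_thresholds <read_p95_ms> <write_p95_ms> <error_rate_percent>
# Prints "pass" or "fail" and returns 0 or 1 accordingly.
check_thresholds() {
  awk -v r="$1" -v w="$2" -v e="$3" 'BEGIN {
    # Limits from the README: read p95 < 300 ms, write p95 < 600 ms, errors < 0.5%.
    ok = (r < 300) && (w < 600) && (e < 0.5)
    print (ok ? "pass" : "fail")
    exit (ok ? 0 : 1)
  }'
}
```

A CI step could call `check_thresholds 250 580 0.2` and fail the job on a non-zero exit status.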
## JWKS chaos drill
```bash
JWKS_HOST=authority.local JWKS_PORT=8440 DURATION=300 \
./ops/devops/tenant/jwks-chaos.sh &
BASE_URL=https://api.stella.local TENANTS=tenant-a,tenant-b \
k6 run ops/devops/tenant/k6-tenant-load.js
```
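Run this on an isolated agent where sudo and iptables are available, since the chaos script drops traffic to the JWKS endpoint. During the drill, watch the recording rules `jwks_cache_hit_ratio:5m` and `tenant_error_rate:5m`, and the alerts `jwks_cache_miss_spike` / `tenant_auth_failures_spike`.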
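For readers unfamiliar with the technique, the following is a rough sketch of what an iptables-based block of a JWKS endpoint can look like. It is not the contents of `jwks-chaos.sh`. The function name `jwks_block` and the `DRY_RUN` switch are assumptions made for illustration, and the defaults simply mirror the values used in the command above.

```bash
# Illustrative sketch only; NOT the actual jwks-chaos.sh.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
jwks_block() {
  host=${JWKS_HOST:-authority.local}
  port=${JWKS_PORT:-8440}
  duration=${DURATION:-300}
  # In dry-run mode every command below is echoed rather than executed.
  if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; else run=""; fi
  # Insert a rule that drops outbound TCP traffic to the JWKS endpoint.
  $run sudo iptables -I OUTPUT -p tcp -d "$host" --dport "$port" -j DROP
  # Hold the outage for the drill window (seconds).
  $run sleep "$duration"
  # Delete the same rule to restore connectivity.
  $run sudo iptables -D OUTPUT -p tcp -d "$host" --dport "$port" -j DROP
}
```

Calling `jwks_block` with `DRY_RUN=1` shows the three commands without touching the firewall. A real script would likely also need a `trap` so the DROP rule is removed if the run is interrupted.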