# Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)

Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.

## Pipeline components
- **Audit collector**: scrape structured logs from services emitting `tenant`, `subject`, `action`, `resource`, `result`, `traceId`. Ship via OTLP->collector->Loki/ClickHouse.
- **Usage metrics**: Prometheus counters/gauges
  - `tenant_requests_total{tenant,service,route,status}`
  - `tenant_rate_limit_hits_total{tenant,service}`
  - `tenant_data_volume_bytes_total{tenant,service}`
  - `tenant_queue_depth{tenant,service}` (NATS/Redis)
- **Data retention**: 30d logs; 90d metrics (downsampled after 30d).

## JWKS outage chaos
- Scenario: Authority/JWKS becomes unreachable for 5m.
- Steps:
  1. Run synthetic tenant traffic via k6 (reuse `ops/devops/vuln/k6-vuln-explorer.js` or service-specific scripts) with `X-StellaOps-Tenant` set.
  2. Block JWKS endpoint (iptables or envoy fault) for 5 minutes.
  3. Assert: services fall back to cached keys (if within TTL), error rate < 1%, audit pipeline records `auth.degraded` events, alerts fire if cache expired.
- Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate limit hits.

## Load/perf benchmarks
- Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify) using k6 scenario that mixes read/write 90/10.
- SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
- Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics maintain `tenant` label cardinality cap (<= 1000 active tenants).

## Implementation steps
- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate, auth failures.
- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`, `tenant_latency_p95_high`, `tenant_auth_failures_spike` with supporting recording rules in `recording-rules.yaml`.
- Load/perf: k6 scenario `k6-tenant-load.js` (read/write 90/10, random tenants, headers configurable) targeting 5k RPS.
- Chaos: reusable script `jwks-chaos.sh` + CI stub in `README.md` describing manual-gated run to drop JWKS egress while k6 runs.
- Docs: update `deploy/README.md` Tenancy section once dashboards/alerts live. Status: added Tenancy Observability section with import steps.

## Artefacts
- Dashboard JSON: `ops/devops/tenant/dashboards/tenant-audit.json`
- Alert rules: `ops/devops/tenant/alerts.yaml`
- Recording rules: `ops/devops/tenant/recording-rules.yaml`
- Load/perf harness: `ops/devops/tenant/k6-tenant-load.js`
- Chaos script: `ops/devops/tenant/jwks-chaos.sh`
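
## Sketch: tenant mix for the load harness

The 90/10 read/write mix, per-VU tenant randomisation, and SLO limits described under Load/perf benchmarks could be expressed roughly as below. This is a minimal sketch, not the contents of `k6-tenant-load.js`: the helper names (`pickTenant`, `pickOperation`, `buildHeaders`), the `tenant-N` IDs, and the `op` request tag are all assumptions made for illustration. Only the `X-StellaOps-Tenant` header, the 90/10 ratio, the 10-tenant minimum, and the SLO numbers come from this plan. The helpers are plain JavaScript with no k6 imports, so they can be checked outside k6.

```javascript
// Hypothetical helpers for a k6 tenant load script; all names are illustrative.

// Placeholder tenant IDs; the plan asks for at least 10 tenants.
const TENANTS = Array.from({ length: 10 }, (_, i) => `tenant-${i + 1}`);

// Pick a tenant uniformly at random.
// `rnd` is injectable so the choice can be made deterministic in tests.
function pickTenant(tenants = TENANTS, rnd = Math.random) {
  return tenants[Math.floor(rnd() * tenants.length)];
}

// Return 'read' with probability `readRatio`, otherwise 'write'.
// The default of 0.9 gives the 90/10 mix from the plan.
function pickOperation(readRatio = 0.9, rnd = Math.random) {
  return rnd() < readRatio ? 'read' : 'write';
}

// Build request headers carrying the tenant header named in the plan.
// `extra` lets callers add further headers (for example Authorization).
function buildHeaders(tenant, extra = {}) {
  return { 'X-StellaOps-Tenant': tenant, ...extra };
}

// SLOs from the plan in k6 threshold syntax.
// Assumption: each request is tagged with op=read or op=write.
const thresholds = {
  'http_req_duration{op:read}': ['p(95)<300'],
  'http_req_duration{op:write}': ['p(95)<600'],
  http_req_failed: ['rate<0.005'],
};
```

Inside a k6 script, `thresholds` would sit in the exported `options` object, and each iteration would call `pickTenant()` and `pickOperation()` before issuing the request with `buildHeaders(tenant)` and a matching `op` tag. With only 10 placeholder tenants, the `tenant` label stays well under the 1000-tenant cardinality cap.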