CI/CD consolidation

StellaOps Bot
2025-12-26 17:32:23 +02:00
parent a866eb6277
commit c786faae84
638 changed files with 3821 additions and 181 deletions


@@ -0,0 +1,34 @@
# Tenant audit & chaos kit (DEVOPS-TEN-49-001)
Artifacts live in this folder to cover tenant audit logging, usage metrics, JWKS outage chaos, and load/perf benchmarks.
## What's here
- `recording-rules.yaml`: Prometheus recording rules for per-tenant request rate, error rate, p95 latency, rate-limit hits, and the JWKS cache hit ratio.
- `alerts.yaml`: Alert rules for error rate, JWKS cache miss spikes, p95 latency, auth failures, and rate-limit hits.
- `dashboards/tenant-audit.json`: Grafana dashboard with `tenant` and `service` variables.
- `k6-tenant-load.js`: Multi-tenant load/perf scenario (90/10 read/write mix, tenant header, configurable paths).
- `jwks-chaos.sh`: iptables-based script that drops JWKS traffic for chaos drills.
## Import & wiring
1. Load `recording-rules.yaml` and `alerts.yaml` into the Prometheus rule groups for the tenancy stack.
2. Import `dashboards/tenant-audit.json` into Grafana (folder `StellaOps / Tenancy`).
3. Ensure services emit `tenant` labels on request metrics and structured logs (`tenant`, `subject`, `action`, `resource`, `result`, `traceId`).
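As an illustration of step 3, the sketch below builds one structured audit log line carrying the six fields listed above. The helper name `auditEntry`, the extra `timestamp` field, and the sample values are assumptions made for this example; they are not part of the kit.

```javascript
// Hypothetical helper (not part of this kit): serialises one audit event as a
// JSON log line carrying the six fields named in step 3.
const AUDIT_FIELDS = ['tenant', 'subject', 'action', 'resource', 'result', 'traceId'];

function auditEntry(event) {
  const entry = {};
  for (const field of AUDIT_FIELDS) {
    const value = event[field];
    // Reject events that would produce an incomplete audit record.
    if (typeof value !== 'string' || value.length === 0) {
      throw new Error(`audit field "${field}" is required`);
    }
    entry[field] = value;
  }
  // The timestamp field is an assumption; the README only names the six fields.
  return JSON.stringify({ timestamp: new Date().toISOString(), ...entry });
}

// Sample values are illustrative only.
const sampleAuditLine = auditEntry({
  tenant: 'tenant-a',
  subject: 'user:alice',
  action: 'policy.evaluate',
  resource: 'policy/default',
  result: 'allowed',
  traceId: 'trace-0001',
});
console.log(sampleAuditLine);
```

Validating the fields at the point of emission keeps incomplete records out of the audit pipeline, at the cost of failing loudly in the service.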
## Load/perf (k6)
```bash
BASE_URL=https://api.stella.local \
TENANTS=tenant-a,tenant-b,tenant-c \
TENANT_HEADER=X-StellaOps-Tenant \
VUS=5000 DURATION=15m \
k6 run ops/devops/tenant/k6-tenant-load.js
```
Adjust `TENANT_READ_PATHS` / `TENANT_WRITE_PATHS` to point at Policy/Vuln/Notify endpoints. Default thresholds: p95 <300ms (read), <600ms (write), error rate <0.5%.
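Both path variables are comma-separated lists. The sketch below mirrors how `k6-tenant-load.js` handles them (split on commas, trim whitespace, drop empty entries). The function name `parseList` is introduced here for illustration only; the script inlines the same chain.

```javascript
// Mirrors the split/trim/filter chain used in k6-tenant-load.js.
// `parseList` is a name chosen for this sketch.
function parseList(raw, fallback) {
  return (raw || fallback)
    .split(',')
    .map((item) => item.trim())
    .filter(Boolean);
}

// Stray spaces and empty entries are discarded.
const readPaths = parseList(
  ' /api/v1/policy/effective, /api/v1/vuln/search?limit=50,, ',
  '/health/readiness',
);
console.log(readPaths); // [ '/api/v1/policy/effective', '/api/v1/vuln/search?limit=50' ]

// An unset variable falls back to the default list.
console.log(parseList(undefined, '/health/readiness')); // [ '/health/readiness' ]
```

Because the split is on every comma, a path whose query string itself contains a comma would be broken into two entries.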
## JWKS chaos drill
```bash
JWKS_HOST=authority.local JWKS_PORT=8440 DURATION=300 \
./ops/devops/tenant/jwks-chaos.sh &
BASE_URL=https://api.stella.local TENANTS=tenant-a,tenant-b \
k6 run ops/devops/tenant/k6-tenant-load.js
```
Run on an isolated agent with sudo/iptables available. Watch `jwks_cache_hit_ratio:5m`, `tenant_error_rate:5m`, and alerts `jwks_cache_miss_spike` / `tenant_auth_failures_spike`.
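To make the watched signal concrete: the `jwks_cache_miss_spike` alert in `alerts.yaml` compares misses / (hits + misses) against 0.2. The sketch below performs the same arithmetic on two counter snapshots. The function name and the sample counts are made up for illustration, and the real alert additionally requires the condition to hold for 5 minutes.

```javascript
// Computes the JWKS cache miss ratio between two counter snapshots, the same
// quantity the jwks_cache_miss_spike alert compares against 0.2.
// Function name and sample numbers are hypothetical.
function missRatio(before, after) {
  const hits = after.hits - before.hits;
  const misses = after.misses - before.misses;
  const total = hits + misses;
  // No lookups in the window: report null rather than dividing by zero.
  return total === 0 ? null : misses / total;
}

// 300 hits and 100 misses in the window: 100 / 400 = 0.25.
const drillRatio = missRatio({ hits: 1000, misses: 10 }, { hits: 1300, misses: 110 });
console.log(drillRatio, drillRatio > 0.2 ? 'above the 20% threshold' : 'within threshold');
```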


@@ -0,0 +1,45 @@
# Alert rules for tenant audit & auth (DEVOPS-TEN-49-001)
groups:
  - name: tenant-audit
    rules:
      - alert: tenant_error_rate_gt_0_5pct
        expr: sum(rate(tenant_requests_total{status=~"5.."}[5m])) / sum(rate(tenant_requests_total[5m])) > 0.005
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Tenant error rate high
          description: Error rate across tenant-labelled requests exceeds 0.5%.
      - alert: jwks_cache_miss_spike
        expr: rate(auth_jwks_cache_misses_total[5m]) / (rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: JWKS cache miss rate spike
          description: JWKS miss ratio above 20% may indicate outage or cache expiry.
      - alert: tenant_latency_p95_high
        expr: tenant_latency_p95:5m > 0.6
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: Tenant p95 latency high
          description: Per-tenant p95 latency over 600ms for 10m.
      - alert: tenant_rate_limit_exceeded
        expr: rate(tenant_rate_limit_hits_total[5m]) > 10
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: Frequent rate limit hits
          description: Tenant rate limit hits exceed 10 per second (5m average).
      - alert: tenant_auth_failures_spike
        expr: rate(auth_token_validation_failures_total{tenant!=""}[5m]) > 5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Tenant auth failures elevated
          description: Token validation failures exceed 5 per second (5m average) for tenant-scoped traffic.


@@ -0,0 +1,39 @@
# Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)
Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.
## Pipeline components
- **Audit collector**: scrape structured logs from services emitting `tenant`, `subject`, `action`, `resource`, `result`, `traceId`. Ship via OTLP->collector->Loki/ClickHouse.
- **Usage metrics**: Prometheus counters/gauges
  - `tenant_requests_total{tenant,service,route,status}`
  - `tenant_rate_limit_hits_total{tenant,service}`
  - `tenant_data_volume_bytes_total{tenant,service}`
  - `tenant_queue_depth{tenant,service}` (NATS/Redis)
- **Data retention**: 30d logs; 90d metrics (downsampled after 30d).
## JWKS outage chaos
- Scenario: Authority/JWKS becomes unreachable for 5m.
- Steps:
  1. Run synthetic tenant traffic via k6 (reuse `ops/devops/vuln/k6-vuln-explorer.js` or service-specific scripts) with `X-StellaOps-Tenant` set.
  2. Block the JWKS endpoint (iptables or Envoy fault injection) for 5 minutes.
  3. Assert: services fall back to cached keys (if within TTL), error rate < 1%, audit pipeline records `auth.degraded` events, alerts fire if the cache has expired.
- Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate limit hits.
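The fallback behaviour asserted in step 3 can be sketched as a pure decision function. This is an assumption about how a service might behave during the drill, not the actual Authority client implementation; `resolveKeys`, its argument shapes, and the 10-minute TTL are all hypothetical.

```javascript
// Hypothetical decision logic for JWKS lookups during an outage.
// fresh: key set from a successful fetch, or null when the fetch failed.
// cache: { keys, fetchedAt } from the last successful fetch, or null.
function resolveKeys(cache, fresh, nowMs, ttlMs) {
  if (fresh) {
    return { keys: fresh, fetchedAt: nowMs, degraded: false };
  }
  if (cache && nowMs - cache.fetchedAt <= ttlMs) {
    // Serve cached keys; the caller would record an `auth.degraded` audit event.
    return { keys: cache.keys, fetchedAt: cache.fetchedAt, degraded: true };
  }
  // Cache missing or expired: token validation has to fail closed.
  throw new Error('JWKS unavailable and cached keys expired');
}

const jwksTtlMs = 10 * 60 * 1000; // assumed TTL, longer than the 5-minute drill
const cachedKeys = { keys: ['key-1'], fetchedAt: 0 };
const duringOutage = resolveKeys(cachedKeys, null, 5 * 60 * 1000, jwksTtlMs);
console.log(duringOutage.degraded); // true: cached keys served while JWKS is blocked
```

With a TTL shorter than the outage, the same call throws instead, which is the case where the plan expects alerts to fire.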
## Load/perf benchmarks
- Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify) using k6 scenario that mixes read/write 90/10.
- SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
- Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics maintain `tenant` label cardinality cap (<= 1000 active tenants).
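One way to hold the `tenant` label under a cardinality cap is to fold tenants beyond the cap into a single overflow value. The plan only states the cap, so the sketch below is an assumption about one possible mechanism; the `makeTenantLabeler` name and the `other` overflow label are hypothetical.

```javascript
// Hypothetical guard that keeps the `tenant` label below a cardinality cap by
// folding tenants beyond the cap into a single overflow label.
function makeTenantLabeler(maxTenants, overflowLabel = 'other') {
  const seen = new Set();
  return function labelFor(tenant) {
    if (seen.has(tenant)) return tenant;
    if (seen.size < maxTenants) {
      seen.add(tenant);
      return tenant;
    }
    return overflowLabel;
  };
}

// Tiny cap for demonstration; the plan's cap is 1000 active tenants.
const labelFor = makeTenantLabeler(2);
const labels = ['tenant-a', 'tenant-b', 'tenant-c', 'tenant-a'].map((t) => labelFor(t));
console.log(labels); // [ 'tenant-a', 'tenant-b', 'other', 'tenant-a' ]
```

This simplified version never evicts tenants that have gone idle, so it caps tenants ever seen by the process, not currently active ones.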
## Implementation steps
- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate, auth failures.
- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`, `tenant_latency_p95_high`, `tenant_auth_failures_spike` with supporting recording rules in `recording-rules.yaml`.
- Load/perf: k6 scenario `k6-tenant-load.js` (read/write 90/10, random tenants, headers configurable) targeting 5k concurrent virtual users.
- Chaos: reusable script `jwks-chaos.sh` + CI stub in `README.md` describing manual-gated run to drop JWKS egress while k6 runs.
- Docs: update `deploy/README.md` Tenancy section once dashboards/alerts live. Status: added Tenancy Observability section with import steps.
## Artefacts
- Dashboard JSON: `ops/devops/tenant/dashboards/tenant-audit.json`
- Alert rules: `ops/devops/tenant/alerts.yaml`
- Recording rules: `ops/devops/tenant/recording-rules.yaml`
- Load/perf harness: `ops/devops/tenant/k6-tenant-load.js`
- Chaos script: `ops/devops/tenant/jwks-chaos.sh`


@@ -0,0 +1,18 @@
{
  "title": "Tenant Audit & Auth",
  "timezone": "utc",
  "templating": {
    "list": [
      { "name": "tenant", "type": "query", "datasource": "Prometheus", "query": "label_values(tenant_requests_total, tenant)", "refresh": 2, "multi": true, "includeAll": true },
      { "name": "service", "type": "query", "datasource": "Prometheus", "query": "label_values(tenant_requests_total, service)", "refresh": 2, "multi": true, "includeAll": true }
    ]
  },
  "panels": [
    { "type": "timeseries", "title": "p95 latency (by service)", "targets": [ { "expr": "tenant_latency_p95:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] },
    { "type": "timeseries", "title": "Error rate", "targets": [ { "expr": "tenant_error_rate:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] },
    { "type": "timeseries", "title": "Requests per second", "targets": [ { "expr": "tenant_requests_rate:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] },
    { "type": "timeseries", "title": "JWKS cache hit ratio", "targets": [ { "expr": "jwks_cache_hit_ratio:5m" } ] },
    { "type": "timeseries", "title": "Auth validation failures", "targets": [ { "expr": "rate(auth_token_validation_failures_total{tenant!=\"\",tenant=~\"$tenant\"}[5m])" } ] },
    { "type": "timeseries", "title": "Rate limit hits", "targets": [ { "expr": "tenant_rate_limit_hits:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] }
  ]
}


@@ -0,0 +1,19 @@
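#!/usr/bin/env bash
# Simulate JWKS outage for chaos testing (DEVOPS-TEN-49-001)
# Usage: JWKS_HOST=authority.local JWKS_PORT=8440 DURATION=300 ./jwks-chaos.sh
set -euo pipefail

HOST=${JWKS_HOST:-authority}
PORT=${JWKS_PORT:-8440}
DURATION=${DURATION:-300}
# Label for this drill; currently unused by the iptables rule itself.
rule_name=stellaops-jwks-chaos

# Remove the DROP rule; errors are ignored so a repeated call is harmless.
cleanup() {
  sudo iptables -D OUTPUT -p tcp --dport "$PORT" -d "$HOST" -j DROP 2>/dev/null || true
}
# Restore connectivity even if the script is interrupted.
trap cleanup EXIT

# Drop outbound TCP traffic to the JWKS endpoint for the duration of the drill.
sudo iptables -I OUTPUT -p tcp --dport "$PORT" -d "$HOST" -j DROP
echo "JWKS traffic to ${HOST}:${PORT} dropped for ${DURATION}s" >&2
sleep "$DURATION"
cleanup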


@@ -0,0 +1,84 @@
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';
const TENANT_HEADER = __ENV.TENANT_HEADER || 'X-StellaOps-Tenant';
const TENANTS = (__ENV.TENANTS || 'tenant-a,tenant-b,tenant-c,tenant-d,tenant-e,tenant-f,tenant-g,tenant-h,tenant-i,tenant-j')
  .split(',')
  .map((t) => t.trim())
  .filter(Boolean);
const READ_PATHS = (__ENV.TENANT_READ_PATHS || '/api/v1/policy/effective,/api/v1/vuln/search?limit=50,/notify/api/v1/events?limit=20,/health/readiness')
  .split(',')
  .map((p) => p.trim())
  .filter(Boolean);
const WRITE_PATHS = (__ENV.TENANT_WRITE_PATHS || '/api/v1/policy/evaluate,/notify/api/v1/test,/api/v1/tasks/submit')
  .split(',')
  .map((p) => p.trim())
  .filter(Boolean);
const READ_FRACTION = Number(__ENV.READ_FRACTION || '0.9');
const SLEEP_MS = Number(__ENV.SLEEP_MS || '250');

// Seeded linear congruential generator so runs are reproducible for a given SEED.
// Each VU runs this init code separately, so every VU starts from the same seed.
let seed = Number(__ENV.SEED || '1');
function rnd() {
  seed = (seed * 1664525 + 1013904223) >>> 0;
  return seed / 4294967296;
}

export const options = {
  vus: Number(__ENV.VUS || '250'),
  duration: __ENV.DURATION || '10m',
  thresholds: {
    http_req_failed: ['rate<0.005'],
    // Read and write latency are tracked separately to match the documented SLOs
    // (p95 < 300ms read, < 600ms write).
    'tenant_read_duration': ['p(95)<300'],
    'tenant_write_duration': ['p(95)<600'],
    'tenant_auth_failures': ['rate<0.01'],
  },
};

const readDuration = new Trend('tenant_read_duration', true);
const writeDuration = new Trend('tenant_write_duration', true);
const authFailures = new Rate('tenant_auth_failures');

// Pick a pseudo-random element from a list.
function pick(list) {
  return list[Math.floor(rnd() * list.length)];
}

export default function () {
  const tenant = pick(TENANTS);
  const doWrite = rnd() > READ_FRACTION;
  const path = doWrite ? pick(WRITE_PATHS) : pick(READ_PATHS);
  const headers = {
    [TENANT_HEADER]: tenant,
    'Content-Type': 'application/json',
  };
  const url = `${BASE_URL}${path}`;
  const payload = JSON.stringify({
    tenant,
    traceId: __VU + '-' + Date.now(),
    now: new Date().toISOString(),
    sample: 'tenant-chaos',
  });
  const params = { headers, tags: { tenant, path, kind: doWrite ? 'write' : 'read' } };
  const res = doWrite ? http.post(url, payload, params) : http.get(url, params);

  check(res, {
    'status ok': (r) => r.status >= 200 && r.status < 300,
  });
  // A Rate needs a sample for every request; adding only the failures would
  // make the metric read 100% as soon as one 401/403 occurs.
  authFailures.add(res.status === 401 || res.status === 403);

  if (doWrite) {
    writeDuration.add(res.timings.duration);
  } else {
    readDuration.add(res.timings.duration);
  }
  sleep(SLEEP_MS / 1000);
}


@@ -0,0 +1,18 @@
# Recording rules supporting tenant audit dashboards/alerts (DEVOPS-TEN-49-001)
groups:
  - name: tenant-sli
    interval: 30s
    rules:
      - record: tenant_requests_rate:5m
        expr: sum by (tenant, service) (rate(tenant_requests_total[5m]))
      # clamp_min(..., 1) guards against division by zero, but it also understates
      # the ratio whenever the denominator is below 1 request per second.
      - record: tenant_error_rate:5m
        expr: sum by (tenant, service) (rate(tenant_requests_total{status=~"5.."}[5m])) /
          clamp_min(sum by (tenant, service) (rate(tenant_requests_total[5m])), 1)
      - record: tenant_latency_p95:5m
        expr: histogram_quantile(0.95, sum by (le, tenant, service) (rate(tenant_requests_duration_seconds_bucket[5m])))
      # Same clamp_min caveat as tenant_error_rate:5m applies here.
      - record: jwks_cache_hit_ratio:5m
        expr: rate(auth_jwks_cache_hits_total[5m]) /
          clamp_min(rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m]), 1)
      - record: tenant_rate_limit_hits:5m
        expr: sum by (tenant, service) (rate(tenant_rate_limit_hits_total[5m]))