Add tests and implement timeline ingestion options with NATS and Redis subscribers

- Introduced `BinaryReachabilityLifterTests` to validate binary lifting functionality.
- Created `PackRunWorkerOptions` for configuring worker paths and execution persistence.
- Added `TimelineIngestionOptions` for configuring NATS and Redis ingestion transports.
- Implemented `NatsTimelineEventSubscriber` for subscribing to NATS events.
- Developed `RedisTimelineEventSubscriber` for reading from Redis Streams.
- Added `TimelineEnvelopeParser` to normalize incoming event envelopes.
- Created unit tests for `TimelineEnvelopeParser` to ensure correct field mapping.
- Implemented `TimelineAuthorizationAuditSink` for logging authorization outcomes.
2025-12-03 09:46:48 +02:00
parent e923880694
commit 35c8f9216f
520 changed files with 4416 additions and 31492 deletions
--- a/ops/devops/tenant/README.md
+++ b/ops/devops/tenant/README.md
@@ -0,0 +1,34 @@
+# Tenant audit & chaos kit (DEVOPS-TEN-49-001)
+
+Artifacts live in this folder to cover tenant audit logging, usage metrics, JWKS outage chaos, and load/perf benchmarks.
+
+## What’s here
+- `recording-rules.yaml` – Prometheus recording rules for per-tenant rate/error/latency and JWKS cache hit ratio.
+- `alerts.yaml` – Alert rules for error rate, JWKS cache miss spike, p95 latency, auth failures, and rate limit hits.
+- `dashboards/tenant-audit.json` – Grafana dashboard with tenant/service variables.
+- `k6-tenant-load.js` – Multi-tenant load/perf scenario (read/write 90/10, tenant header, configurable paths).
+- `jwks-chaos.sh` – iptables-based JWKS dropper for chaos drills.
+
+## Import & wiring
+1. Load `recording-rules.yaml` and `alerts.yaml` into the Prometheus rule groups for the tenancy stack.
+2. Import `dashboards/tenant-audit.json` into Grafana (folder `StellaOps / Tenancy`).
+3. Ensure services emit `tenant` labels on request metrics and structured logs (`tenant`, `subject`, `action`, `resource`, `result`, `traceId`).
+
+## Load/perf (k6)
+```bash
+BASE_URL=https://api.stella.local \
+TENANTS=tenant-a,tenant-b,tenant-c \
+TENANT_HEADER=X-StellaOps-Tenant \
+VUS=5000 DURATION=15m \
+k6 run ops/devops/tenant/k6-tenant-load.js
+```
+Adjust `TENANT_READ_PATHS` / `TENANT_WRITE_PATHS` to point at Policy/Vuln/Notify endpoints. Default thresholds: p95 <300ms (read), <600ms (write), error rate <0.5%.
+
+## JWKS chaos drill
+```bash
+JWKS_HOST=authority.local JWKS_PORT=8440 DURATION=300 \
+./ops/devops/tenant/jwks-chaos.sh &
+BASE_URL=https://api.stella.local TENANTS=tenant-a,tenant-b \
+k6 run ops/devops/tenant/k6-tenant-load.js
+```
+Run on an isolated agent with sudo/iptables available. Watch `jwks_cache_hit_ratio:5m`, `tenant_error_rate:5m`, and alerts `jwks_cache_miss_spike` / `tenant_auth_failures_spike`.
--- a/ops/devops/tenant/alerts.yaml
+++ b/ops/devops/tenant/alerts.yaml
@@ -19,6 +19,14 @@ groups:
    annotations:
      summary: JWKS cache miss rate spike
      description: JWKS miss ratio above 20% may indicate outage or cache expiry.
+  - alert: tenant_latency_p95_high
+    expr: tenant_latency_p95:5m > 0.6
+    for: 10m
+    labels:
+      severity: warn
+    annotations:
+      summary: Tenant p95 latency high
+      description: Per-tenant p95 latency over 600ms for 10m.
  - alert: tenant_rate_limit_exceeded
    expr: rate(tenant_rate_limit_hits_total[5m]) > 10
    for: 5m
@@ -27,3 +35,11 @@ groups:
    annotations:
      summary: Frequent rate limit hits
      description: Tenant rate limit exceeded more than 10 times per 5m window.
+  - alert: tenant_auth_failures_spike
+    expr: rate(auth_token_validation_failures_total{tenant!=""}[5m]) > 5
+    for: 5m
+    labels:
+      severity: page
+    annotations:
+      summary: Tenant auth failures elevated
+      description: Token validation failures exceed 5 per second (5m average) for tenant-scoped traffic.
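The threshold alerts in this file compare against `rate(...[5m])`, which PromQL reports as a per-second average over the window. As a small illustration of what such a threshold means in absolute events, the sketch below does the conversion in plain JavaScript; `eventsInWindow` is a hypothetical helper, not part of the commit.

```javascript
// Converts a per-second rate threshold into the number of events that must
// occur within the window to reach it (eventsInWindow is hypothetical).
function eventsInWindow(perSecondThreshold, windowSeconds) {
  return perSecondThreshold * windowSeconds;
}

// rate(...[5m]) > 5 corresponds to more than 1500 events in 5 minutes.
console.log(eventsInWindow(5, 300)); // 1500
// rate(...[5m]) > 10 corresponds to more than 3000 events in 5 minutes.
console.log(eventsInWindow(10, 300)); // 3000
```

If a count per window is what an alert should express, `increase(...[5m])` is the PromQL function that estimates it.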
--- a/ops/devops/tenant/audit-pipeline-plan.md
+++ b/ops/devops/tenant/audit-pipeline-plan.md
@@ -25,12 +25,15 @@ Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chao
 - Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics maintain `tenant` label cardinality cap (<= 1000 active tenants).

 ## Implementation steps
-- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate.
-- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`.
-- CI: add chaos test job stub (uses docker-compose + iptables fault) gated behind manual approval.
-- Docs: update `deploy/README.md` Tenancy section once dashboards/alerts live.
+- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate, auth failures.
+- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`, `tenant_latency_p95_high`, `tenant_auth_failures_spike` with supporting recording rules in `recording-rules.yaml`.
+- Load/perf: k6 scenario `k6-tenant-load.js` (read/write 90/10, random tenants, headers configurable) targeting 5k RPS.
+- Chaos: reusable script `jwks-chaos.sh` + CI stub in `README.md` describing manual-gated run to drop JWKS egress while k6 runs.
+- Docs: update `deploy/README.md` Tenancy section once dashboards/alerts live. Status: added Tenancy Observability section with import steps.

 ## Artefacts
 - Dashboard JSON: `ops/devops/tenant/dashboards/tenant-audit.json`
 - Alert rules: `ops/devops/tenant/alerts.yaml`
+- Recording rules: `ops/devops/tenant/recording-rules.yaml`
+- Load/perf harness: `ops/devops/tenant/k6-tenant-load.js`
 - Chaos script: `ops/devops/tenant/jwks-chaos.sh`
--- a/ops/devops/tenant/dashboards/tenant-audit.json
+++ b/ops/devops/tenant/dashboards/tenant-audit.json
@@ -1,11 +1,18 @@
 {
  "title": "Tenant Audit & Auth",
  "timezone": "utc",
+  "templating": {
+    "list": [
+      { "name": "tenant", "type": "query", "datasource": "Prometheus", "query": "label_values(tenant_requests_total, tenant)", "refresh": 2, "multi": true, "includeAll": true },
+      { "name": "service", "type": "query", "datasource": "Prometheus", "query": "label_values(tenant_requests_total, service)", "refresh": 2, "multi": true, "includeAll": true }
+    ]
+  },
  "panels": [
-    {"type": "timeseries", "title": "Tenant request latency p95", "targets": [{"expr": "histogram_quantile(0.95, rate(tenant_requests_duration_seconds_bucket[5m]))"}]},
-    {"type": "timeseries", "title": "Tenant error rate", "targets": [{"expr": "sum(rate(tenant_requests_total{status=~\"5..\"}[5m])) / sum(rate(tenant_requests_total[5m]))"}]},
-    {"type": "timeseries", "title": "JWKS cache hit rate", "targets": [{"expr": "rate(auth_jwks_cache_hits_total[5m]) / (rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m]))"}]},
-    {"type": "timeseries", "title": "Rate limit hits", "targets": [{"expr": "rate(tenant_rate_limit_hits_total[5m])"}]},
-    {"type": "timeseries", "title": "Tenant queue depth", "targets": [{"expr": "tenant_queue_depth"}]}
+    { "type": "timeseries", "title": "p95 latency (by service)", "targets": [ { "expr": "tenant_latency_p95:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] },
+    { "type": "timeseries", "title": "Error rate", "targets": [ { "expr": "tenant_error_rate:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] },
+    { "type": "timeseries", "title": "Requests per second", "targets": [ { "expr": "tenant_requests_rate:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] },
+    { "type": "timeseries", "title": "JWKS cache hit ratio", "targets": [ { "expr": "jwks_cache_hit_ratio:5m" } ] },
+    { "type": "timeseries", "title": "Auth validation failures", "targets": [ { "expr": "rate(auth_token_validation_failures_total{tenant!=\"\",tenant=~\"$tenant\"}[5m])" } ] },
+    { "type": "timeseries", "title": "Rate limit hits", "targets": [ { "expr": "tenant_rate_limit_hits:5m{tenant=~\"$tenant\",service=~\"$service\"}" } ] }
  ]
 }
--- a/ops/devops/tenant/k6-tenant-load.js
+++ b/ops/devops/tenant/k6-tenant-load.js
@@ -0,0 +1,84 @@
+import http from 'k6/http';
+import { check, sleep } from 'k6';
+import { Rate, Trend } from 'k6/metrics';
+
+const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';
+const TENANT_HEADER = __ENV.TENANT_HEADER || 'X-StellaOps-Tenant';
+const TENANTS = (__ENV.TENANTS || 'tenant-a,tenant-b,tenant-c,tenant-d,tenant-e,tenant-f,tenant-g,tenant-h,tenant-i,tenant-j')
+  .split(',')
+  .map((t) => t.trim())
+  .filter(Boolean);
+const READ_PATHS = (__ENV.TENANT_READ_PATHS || '/api/v1/policy/effective,/api/v1/vuln/search?limit=50,/notify/api/v1/events?limit=20,/health/readiness')
+  .split(',')
+  .map((p) => p.trim())
+  .filter(Boolean);
+const WRITE_PATHS = (__ENV.TENANT_WRITE_PATHS || '/api/v1/policy/evaluate,/notify/api/v1/test,/api/v1/tasks/submit')
+  .split(',')
+  .map((p) => p.trim())
+  .filter(Boolean);
+
+const READ_FRACTION = Number(__ENV.READ_FRACTION || '0.9');
+const SLEEP_MS = Number(__ENV.SLEEP_MS || '250');
+let seed = Number(__ENV.SEED || '1');
+
+function rnd() {
+  seed = (seed * 1664525 + 1013904223) >>> 0;
+  return seed / 4294967296;
+}
+
+export const options = {
+  vus: Number(__ENV.VUS || '250'),
+  duration: __ENV.DURATION || '10m',
+  thresholds: {
+    http_req_failed: ['rate<0.005'],
+    tenant_read_duration: ['p(95)<300'],
+    'tenant_write_duration': ['p(95)<600'],
+    'tenant_auth_failures': ['rate<0.01'],
+  },
+};
+
+const readDuration = new Trend('tenant_read_duration', true);
+const writeDuration = new Trend('tenant_write_duration', true);
+const authFailures = new Rate('tenant_auth_failures');
+
+function pick(list) {
+  return list[Math.floor(rnd() * list.length)];
+}
+
+export default function () {
+  const tenant = pick(TENANTS);
+  const doWrite = rnd() > READ_FRACTION;
+  const path = doWrite ? pick(WRITE_PATHS) : pick(READ_PATHS);
+
+  const headers = {
+    [TENANT_HEADER]: tenant,
+    'Content-Type': 'application/json',
+  };
+
+  const url = `${BASE_URL}${path}`;
+  const payload = JSON.stringify({
+    tenant,
+    traceId: __VU + '-' + Date.now(),
+    now: new Date().toISOString(),
+    sample: 'tenant-chaos',
+  });
+
+  const params = { headers, tags: { tenant, path, kind: doWrite ? 'write' : 'read' } };
+  const res = doWrite ? http.post(url, payload, params) : http.get(url, params);
+
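+  check(res, {
+    'status ok': (r) => r.status >= 200 && r.status < 300,
+  });
+
+  // Record every request so the Rate is auth failures / total requests; adding
+  // only on failure would pin it at 100% as soon as one 401/403 occurs.
+  authFailures.add(res.status === 401 || res.status === 403);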
+
+  if (doWrite) {
+    writeDuration.add(res.timings.duration);
+  } else {
+    readDuration.add(res.timings.duration);
+  }
+
+  sleep(SLEEP_MS / 1000);
+}
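A note on the load script's sampling: `rnd()` is a linear congruential generator, and `rnd() > READ_FRACTION` selects writes. The sketch below reproduces that logic outside k6 so it can be checked with plain Node.js. `makeRnd` and `writeShare` are hypothetical helpers introduced for illustration, and the sketch draws only the read/write decision per iteration, whereas the script also draws a tenant and a path.

```javascript
// Same recurrence as rnd() in k6-tenant-load.js, wrapped in a factory so
// each generator keeps its own state (makeRnd is a hypothetical helper).
function makeRnd(seed) {
  let state = seed >>> 0;
  return function rnd() {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

// Fraction of iterations that would be writes for a given read fraction
// (writeShare is a hypothetical helper).
function writeShare(seed, readFraction, iterations) {
  const rnd = makeRnd(seed);
  let writes = 0;
  for (let i = 0; i < iterations; i++) {
    if (rnd() > readFraction) writes++;
  }
  return writes / iterations;
}

// With a read fraction of 0.9 the write share should settle near 10%.
console.log(writeShare(1, 0.9, 100000).toFixed(3));

// Two generators started from the same seed emit the same sequence.
const first = makeRnd(1);
const second = makeRnd(1);
console.log(first() === second(), first() === second()); // true true
```

Because each k6 VU evaluates the script with its own copy of `seed`, VUs that share a `SEED` value would walk the same sequence; whether that matters depends on how much tenant spread a run needs.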
--- a/ops/devops/tenant/recording-rules.yaml
+++ b/ops/devops/tenant/recording-rules.yaml
@@ -0,0 +1,18 @@
+# Recording rules supporting tenant audit dashboards/alerts (DEVOPS-TEN-49-001)
+# Load via `rule_files` in prometheus.yml; validate with `promtool check rules`.
+groups:
+- name: tenant-sli
+  interval: 30s
+  rules:
+  - record: tenant_requests_rate:5m
+    expr: sum by (tenant, service) (rate(tenant_requests_total[5m]))
+  - record: tenant_error_rate:5m
+    expr: sum by (tenant, service) (rate(tenant_requests_total{status=~"5.."}[5m])) /
+      clamp_min(sum by (tenant, service) (rate(tenant_requests_total[5m])), 1e-9)
+  - record: tenant_latency_p95:5m
+    expr: histogram_quantile(0.95, sum by (le, tenant, service) (rate(tenant_requests_duration_seconds_bucket[5m])))
+  - record: jwks_cache_hit_ratio:5m
+    expr: rate(auth_jwks_cache_hits_total[5m]) /
+      clamp_min(rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m]), 1e-9)
+  - record: tenant_rate_limit_hits:5m
+    expr: sum by (tenant, service) (rate(tenant_rate_limit_hits_total[5m]))
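The two ratio rules guard their denominators with `clamp_min`. As a rough illustration of how the size of that floor shapes the recorded value, the sketch below mirrors the division in plain JavaScript. `guardedRatio` is a hypothetical helper, the traffic figures are invented for the example, and it assumes both numerator and denominator series exist.

```javascript
// Mirrors `numerator / clamp_min(denominator, floor)` from the ratio rules.
// guardedRatio is a hypothetical helper; the inputs are made-up rates.
function guardedRatio(numerator, denominator, floor) {
  return numerator / Math.max(denominator, floor);
}

// Busy tenant: 200 req/s with 2 errors/s. The floor never engages.
console.log(guardedRatio(2, 200, 1)); // 0.01

// Quiet tenant: 0.5 req/s, all failing. A floor of 1 halves the ratio...
console.log(guardedRatio(0.5, 0.5, 1)); // 0.5
// ...while a tiny floor leaves it intact and still avoids dividing by zero.
console.log(guardedRatio(0.5, 0.5, 1e-9)); // 1
console.log(guardedRatio(0, 0, 1e-9)); // 0
```

A floor close to zero keeps the ratio accurate for tenants below one request per second; a larger floor trades that accuracy for damping noise on near-idle series.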