Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)
Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.
Pipeline components
- Audit collector: scrape structured logs from services emitting `tenant`, `subject`, `action`, `resource`, `result`, `traceId`. Ship via OTLP -> collector -> Loki/ClickHouse.
- Usage metrics: Prometheus counters/gauges
  - `tenant_requests_total{tenant,service,route,status}`
  - `tenant_rate_limit_hits_total{tenant,service}`
  - `tenant_data_volume_bytes_total{tenant,service}`
  - `tenant_queue_depth{tenant,service}` (NATS/Redis)
- Data retention: 30d logs; 90d metrics (downsampled after 30d).
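
As an illustration of the audit record shape, the sketch below checks that a log line carries the six audit fields before it is shipped. It assumes services emit one JSON object per line, which this plan does not state; `parse_audit_line` is a hypothetical helper, not part of the pipeline.

```python
import json

# Field names taken from the audit collector description above.
REQUIRED_FIELDS = ("tenant", "subject", "action", "resource", "result", "traceId")


def parse_audit_line(line: str) -> dict:
    """Parse one JSON log line and return only the audit fields.

    Assumption: logs are JSON lines. Raises ValueError if the line is not
    a JSON object or if any required audit field is missing.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("audit record must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing audit fields: {', '.join(missing)}")
    # Drop unrelated log fields so downstream storage sees a stable schema.
    return {name: record[name] for name in REQUIRED_FIELDS}
```

Rejecting incomplete records at the collector would keep gaps visible early rather than surfacing as empty columns in Loki/ClickHouse queries.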
JWKS outage chaos
- Scenario: Authority/JWKS becomes unreachable for 5m.
- Steps:
  - Run synthetic tenant traffic via k6 (reuse `ops/devops/vuln/k6-vuln-explorer.js` or service-specific scripts) with `X-StellaOps-Tenant` set.
  - Block JWKS endpoint (iptables or Envoy fault) for 5 minutes.
  - Assert: services fall back to cached keys (if within TTL), error rate < 1%, audit pipeline records `auth.degraded` events, alerts fire if cache expired.
- Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate limit hits.
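
The fallback behaviour the chaos test asserts can be sketched as follows. This is a minimal model, not the services' actual implementation: `JwksCache`, `refresh_seconds`, and the `auth.cache_expired` event name are hypothetical, while `auth.degraded` comes from the assertion above. The fetch function and clock are injected so an outage can be simulated without a network.

```python
import time
from typing import Callable, Optional


class JwksCache:
    """Serves cached signing keys while upstream is down, within a TTL."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float,
                 refresh_seconds: float,
                 clock: Callable[[], float] = time.monotonic,
                 on_event: Callable[[str], None] = print):
        self._fetch = fetch              # raises when JWKS is unreachable
        self._ttl = ttl_seconds          # how long stale keys stay usable
        self._refresh = refresh_seconds  # how often to re-fetch when healthy
        self._clock = clock
        self._on_event = on_event        # hook for audit events
        self._keys: Optional[dict] = None
        self._fetched_at = 0.0

    def get_keys(self) -> dict:
        age = self._clock() - self._fetched_at
        if self._keys is not None and age < self._refresh:
            return self._keys  # cache hit, no upstream call
        try:
            self._keys = self._fetch()
            self._fetched_at = self._clock()
            return self._keys
        except Exception:
            if self._keys is not None and age <= self._ttl:
                # Upstream is down but cached keys are still within TTL.
                self._on_event("auth.degraded")
                return self._keys
            # Hypothetical event name; this is the case where alerts fire.
            self._on_event("auth.cache_expired")
            raise
```

With a TTL longer than the 5-minute outage, every validation during the fault would take the `auth.degraded` path; a shorter TTL exercises the cache-expired alerts instead.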
Load/perf benchmarks
- Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify) using k6 scenario that mixes read/write 90/10.
- SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
- Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics maintain `tenant` label cardinality cap (<= 1000 active tenants).
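
The SLO thresholds above can be checked against raw results with a few lines of arithmetic. The sketch below is illustrative only: `p95` and `check_slos` are hypothetical helpers, and the nearest-rank percentile method is an assumption (k6 and Prometheus may compute percentiles differently).

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


def check_slos(read_ms, write_ms, total_requests, failed_requests):
    """Compare one run against the thresholds listed above."""
    return {
        "read_p95_ok": p95(read_ms) < 300,    # p95 < 300ms read
        "write_p95_ok": p95(write_ms) < 600,  # p95 < 600ms write
        "error_rate_ok": failed_requests / total_requests < 0.005,  # < 0.5%
    }
```

For example, a run with 4 failures in 1,000 requests has an error rate of 0.4% and passes, while 5 failures sits exactly at 0.5% and fails the strict inequality.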
Implementation steps
- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate, auth failures.
- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`, `tenant_latency_p95_high`, `tenant_auth_failures_spike` with supporting recording rules in `recording-rules.yaml`.
- Load/perf: k6 scenario `k6-tenant-load.js` (read/write 90/10, random tenants, headers configurable) targeting 5k RPS.
- Chaos: reusable script `jwks-chaos.sh` + CI stub in `README.md` describing manual-gated run to drop JWKS egress while k6 runs.
- Docs: update `deploy/README.md` Tenancy section once dashboards/alerts live. Status: added Tenancy Observability section with import steps.
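
The per-request decisions in the load scenario (random tenant, 90/10 read/write mix, tenant header) can be modelled independently of k6. The sketch below is a language-neutral illustration in Python, not the contents of `k6-tenant-load.js`; `plan_request` and the tenant names are hypothetical.

```python
import random

TENANT_HEADER = "X-StellaOps-Tenant"  # header name from the chaos steps


def plan_request(rng, tenants, write_ratio=0.10):
    """Pick a tenant and an operation kind for one simulated request.

    Assumption: tenants are chosen uniformly and each request is a write
    with probability write_ratio, otherwise a read.
    """
    tenant = rng.choice(tenants)
    kind = "write" if rng.random() < write_ratio else "read"
    return {"kind": kind, "headers": {TENANT_HEADER: tenant}}


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so a run is reproducible
    tenants = [f"tenant-{i}" for i in range(10)]  # hypothetical tenant IDs
    for _ in range(5):
        print(plan_request(rng, tenants))
```

Passing the random generator in explicitly keeps the mix reproducible across runs, which helps when comparing benchmark results between builds.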
Artefacts
- Dashboard JSON: `ops/devops/tenant/dashboards/tenant-audit.json`
- Alert rules: `ops/devops/tenant/alerts.yaml`
- Recording rules: `ops/devops/tenant/recording-rules.yaml`
- Load/perf harness: `ops/devops/tenant/k6-tenant-load.js`
- Chaos script: `ops/devops/tenant/jwks-chaos.sh`