StellaOps Bot 35c8f9216f Add tests and implement timeline ingestion options with NATS and Redis subscribers

- Introduced `BinaryReachabilityLifterTests` to validate binary lifting functionality.
- Created `PackRunWorkerOptions` for configuring worker paths and execution persistence.
- Added `TimelineIngestionOptions` for configuring NATS and Redis ingestion transports.
- Implemented `NatsTimelineEventSubscriber` for subscribing to NATS events.
- Developed `RedisTimelineEventSubscriber` for reading from Redis Streams.
- Added `TimelineEnvelopeParser` to normalize incoming event envelopes.
- Created unit tests for `TimelineEnvelopeParser` to ensure correct field mapping.
- Implemented `TimelineAuthorizationAuditSink` for logging authorization outcomes.

2025-12-03 09:46:48 +02:00

Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)

Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.

Pipeline components

Audit collector: scrape structured logs from services that emit the fields tenant, subject, action, resource, result, and traceId. Ship via OTLP -> collector -> Loki/ClickHouse.
Usage metrics (Prometheus counters and gauges):
- tenant_requests_total{tenant,service,route,status}
- tenant_rate_limit_hits_total{tenant,service}
- tenant_data_volume_bytes_total{tenant,service}
- tenant_queue_depth{tenant,service} (gauge; NATS/Redis queues)
Data retention: 30d logs; 90d metrics (downsampled after 30d).
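To make the audit field list concrete, here is a minimal sketch of a helper that emits one JSON log line per audit event. The six field names come from the audit collector item above; the helper name `buildAuditRecord`, the added `timestamp` field, and the sample values are illustrative assumptions, not the services' actual logging code.

```javascript
// Field names are taken from the plan; everything else here is hypothetical.
const REQUIRED_AUDIT_FIELDS = ["tenant", "subject", "action", "resource", "result", "traceId"];

// Builds an audit record, rejecting input that lacks any required field.
function buildAuditRecord(fields, now = new Date()) {
  const missing = REQUIRED_AUDIT_FIELDS.filter(
    (name) => typeof fields[name] !== "string" || fields[name].length === 0
  );
  if (missing.length > 0) {
    throw new Error(`audit record missing fields: ${missing.join(", ")}`);
  }
  // Copy only the known fields so stray keys do not leak into the log.
  const record = { timestamp: now.toISOString() }; // timestamp is an assumed extra field
  for (const name of REQUIRED_AUDIT_FIELDS) {
    record[name] = fields[name];
  }
  return record;
}

// One JSON object per line keeps the output easy for a log collector to parse.
const line = JSON.stringify(
  buildAuditRecord(
    {
      tenant: "tenant-a",
      subject: "svc-policy",
      action: "policy.read",
      resource: "policy/42",
      result: "allowed",
      traceId: "abc123",
    },
    new Date(0)
  )
);
console.log(line);
```

Rejecting incomplete records at the source is one possible design choice; it keeps the tenant and traceId columns dependable for later queries in Loki or ClickHouse.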

JWKS outage chaos

Scenario: Authority/JWKS becomes unreachable for 5m.
Steps:
1. Run synthetic tenant traffic via k6 (reuse ops/devops/vuln/k6-vuln-explorer.js or service-specific scripts) with X-StellaOps-Tenant set.
2. Block JWKS endpoint (iptables or envoy fault) for 5 minutes.
3. Assert: services fall back to cached keys while the cache is within its TTL; error rate stays < 1%; the audit pipeline records auth.degraded events; alerts fire once the cache has expired.
Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate limit hits.
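The fallback behaviour asserted in step 3 can be sketched as a small decision function. This is an illustration under stated assumptions, not the services' real token validation code: the names `selectKeys` and `simulateOutage`, the `alert` flag, and the TTL values are all hypothetical.

```javascript
// Decide which signing keys to use for one validation attempt.
// fetchResult: { ok: boolean, keys?: string[] } from the JWKS endpoint (assumed shape)
// cache: { keys: string[], fetchedAtMs: number } or null
function selectKeys(fetchResult, cache, nowMs, ttlMs) {
  if (fetchResult.ok) {
    return { keys: fetchResult.keys, source: "fresh", degraded: false, alert: false };
  }
  const cacheValid = cache !== null && nowMs - cache.fetchedAtMs <= ttlMs;
  if (cacheValid) {
    // Endpoint unreachable but cache within TTL: keep serving, flag as degraded.
    return { keys: cache.keys, source: "cache", degraded: true, alert: false };
  }
  // Cache expired or never populated: validation cannot proceed, alert should fire.
  return { keys: null, source: "none", degraded: true, alert: true };
}

// Replay a JWKS outage minute by minute, with the last good fetch at t = 0.
function simulateOutage(ttlMs, outageMinutes = 5) {
  const cache = { keys: ["key-1"], fetchedAtMs: 0 };
  const sources = [];
  for (let minute = 1; minute <= outageMinutes; minute++) {
    sources.push(selectKeys({ ok: false }, cache, minute * 60000, ttlMs).source);
  }
  return sources;
}

console.log(simulateOutage(10 * 60000)); // TTL longer than the outage
console.log(simulateOutage(2 * 60000));  // TTL shorter than the outage
```

With a TTL longer than the 5 minute outage, every check is served from cache; with a shorter TTL the later checks find no usable keys, which is the case where the plan expects alerts to fire.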

Load/perf benchmarks

Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify), using a k6 scenario with a 90/10 read/write mix.
SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics stay within the tenant label cardinality cap (<= 1000 active tenants).
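The tenant spread and the 90/10 mix can be sketched with a few pure helpers that a load script could call on each iteration. All function names are hypothetical, and folding overflow tenants into an "other" label is only one possible way to respect the cardinality cap; the plan does not specify the mechanism.

```javascript
// Generate a fixed pool of synthetic tenant IDs (naming scheme is an assumption).
function makeTenants(count) {
  return Array.from({ length: count }, (_, i) => `tenant-${String(i + 1).padStart(3, "0")}`);
}

// Pick a tenant uniformly at random; rand is injectable for deterministic tests.
function pickTenant(tenants, rand = Math.random) {
  return tenants[Math.floor(rand() * tenants.length)];
}

// Choose "read" with probability readShare (0.9 gives the 90/10 mix).
function pickOperation(rand = Math.random, readShare = 0.9) {
  return rand() < readShare ? "read" : "write";
}

// Cardinality guard: once `cap` distinct tenants have been seen,
// any new tenant is reported under a shared "other" label (assumed policy).
function tenantLabel(tenant, seen, cap = 1000) {
  if (seen.has(tenant)) return tenant;
  if (seen.size < cap) {
    seen.add(tenant);
    return tenant;
  }
  return "other";
}

const tenants = makeTenants(10);
const seen = new Set();
const tenant = pickTenant(tenants);
console.log(tenantLabel(tenant, seen), pickOperation());
```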

Implementation steps

Add dashboards (Grafana folder StellaOps / Tenancy) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate, auth failures.
Alert rules: tenant_error_rate_gt_0_5pct, jwks_cache_miss_spike, tenant_rate_limit_exceeded, tenant_latency_p95_high, tenant_auth_failures_spike with supporting recording rules in recording-rules.yaml.
Load/perf: k6 scenario k6-tenant-load.js (read/write 90/10, random tenants, headers configurable) targeting 5k RPS.
Chaos: reusable script jwks-chaos.sh, plus a CI stub in README.md describing a manually gated run that drops JWKS egress while k6 runs.
Docs: update deploy/README.md Tenancy section once dashboards/alerts live. Status: added Tenancy Observability section with import steps.
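As a rough sketch of how the load/perf step could encode the SLOs, the object below uses k6's threshold syntax and its `constant-arrival-rate` executor for the 5k RPS target. It is not the contents of k6-tenant-load.js: the scenario name, the `op` tag, the 10 minute duration, and the VU pre-allocation are assumptions. The plan mentions both 5k concurrent requests and 5k RPS; a `constant-vus` executor would model the former instead.

```javascript
// In a k6 script this would be `export const options`; it is a plain const
// here so the sketch also runs under Node.
const options = {
  scenarios: {
    tenant_load: {
      executor: "constant-arrival-rate",
      rate: 5000,            // iterations started per timeUnit (5k RPS target)
      timeUnit: "1s",
      duration: "10m",       // assumed; the plan does not state a duration
      preAllocatedVUs: 5000, // assumed sizing
    },
  },
  thresholds: {
    // SLOs from the plan: p95 < 300 ms read, < 600 ms write, errors < 0.5%.
    "http_req_duration{op:read}": ["p(95)<300"],
    "http_req_duration{op:write}": ["p(95)<600"],
    http_req_failed: ["rate<0.005"],
  },
};

// Per-request params: tenant header from the plan, plus the assumed `op` tag
// that the thresholds above filter on.
function requestParams(tenant, op) {
  return {
    headers: { "X-StellaOps-Tenant": tenant },
    tags: { op },
  };
}

console.log(JSON.stringify(requestParams("tenant-001", "read")));
```

Tagging each request with its operation type is what lets a single run report separate read and write latency thresholds.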

Artefacts

Dashboard JSON: ops/devops/tenant/dashboards/tenant-audit.json
Alert rules: ops/devops/tenant/alerts.yaml
Recording rules: ops/devops/tenant/recording-rules.yaml
Load/perf harness: ops/devops/tenant/k6-tenant-load.js
Chaos script: ops/devops/tenant/jwks-chaos.sh
