git.stella-ops.org/ops/devops/tenant/audit-pipeline-plan.md
StellaOps Bot 2d08f52715 feat(zastava): add evidence locker plan and schema examples
2025-12-02 09:27:31 +02:00


Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)

Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.

Pipeline components

  • Audit collector: scrape structured logs from services emitting tenant, subject, action, resource, result, traceId. Ship via OTLP->collector->Loki/ClickHouse.
  • Usage metrics: Prometheus counters/gauges
    • tenant_requests_total{tenant,service,route,status}
    • tenant_rate_limit_hits_total{tenant,service}
    • tenant_data_volume_bytes_total{tenant,service}
    • tenant_queue_depth{tenant,service} (NATS/Redis)
  • Data retention: 30d logs; 90d metrics (downsampled after 30d).
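A minimal sketch of how the collector might validate an incoming audit record before shipping it. The six field names come from the list above; treating them all as required, non-empty strings, and the function name `parse_audit_record`, are assumptions made for illustration only.

```python
import json

# Field names come from the plan; requiring all six as non-empty strings is an assumption.
REQUIRED_FIELDS = ("tenant", "subject", "action", "resource", "result", "traceId")


def parse_audit_record(line: str) -> dict:
    """Parse one structured log line and return only the audit fields.

    Raises ValueError if the line is not a JSON object or a field is missing/empty.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("audit record must be a JSON object")
    missing = [
        f for f in REQUIRED_FIELDS
        if not isinstance(record.get(f), str) or not record[f]
    ]
    if missing:
        raise ValueError(f"missing or empty audit fields: {', '.join(missing)}")
    # Drop everything except the audit fields so extra keys never reach storage.
    return {f: record[f] for f in REQUIRED_FIELDS}
```

Rejecting incomplete records at the collector keeps per-tenant queries in Loki/ClickHouse from silently skipping rows; whether to drop or quarantine such records is left open here.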

JWKS outage chaos

  • Scenario: Authority/JWKS becomes unreachable for 5m.
  • Steps:
    1. Run synthetic tenant traffic via k6 (reuse ops/devops/vuln/k6-vuln-explorer.js or service-specific scripts) with X-StellaOps-Tenant set.
    2. Block JWKS endpoint (iptables or envoy fault) for 5 minutes.
    3. Assert: services fall back to cached keys (if within TTL); error rate stays < 1%; the audit pipeline records auth.degraded events; alerts fire if the cache has expired.
  • Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate limit hits.
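The assertions in step 3 could be checked after the run along these lines. This is a sketch only: `ChaosObservation` and its fields are hypothetical names for values gathered from metrics and the audit store, and applying the 1% error bound regardless of cache state is an assumption the plan does not spell out.

```python
from dataclasses import dataclass


@dataclass
class ChaosObservation:
    """Measurements taken during the JWKS outage window (all names are hypothetical)."""
    total_requests: int
    failed_requests: int
    auth_degraded_events: int  # audit records with action "auth.degraded"
    cache_expired: bool        # did cached keys pass their TTL during the outage?
    alerts_fired: bool


def check_jwks_chaos(obs: ChaosObservation, max_error_rate: float = 0.01) -> list[str]:
    """Return the failed assertions; an empty list means the run passed."""
    if obs.total_requests == 0:
        return ["no synthetic traffic was recorded"]
    failures = []
    error_rate = obs.failed_requests / obs.total_requests
    if error_rate >= max_error_rate:
        failures.append(f"error rate {error_rate:.2%} is not below {max_error_rate:.0%}")
    if obs.auth_degraded_events == 0:
        failures.append("audit pipeline recorded no auth.degraded events")
    if obs.cache_expired and not obs.alerts_fired:
        failures.append("cache expired but no alert fired")
    return failures
```

Returning every failed assertion, rather than stopping at the first, gives the chaos job a complete report from a single five-minute outage.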

Load/perf benchmarks

  • Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify), using a k6 scenario with a 90/10 read/write mix.
  • SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
  • Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics stay within the tenant label cardinality cap (<= 1000 active tenants).
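As a rough illustration, the SLO thresholds and the cardinality cap above can be evaluated offline from exported benchmark samples. The nearest-rank percentile used here may differ slightly from what k6 reports, and folding overflow tenants into an "other" label is an assumed enforcement strategy, not something the plan specifies.

```python
# Thresholds from the plan: p95 < 300 ms read, < 600 ms write, error rate < 0.5%.
P95_LIMIT_MS = {"read": 300.0, "write": 600.0}
MAX_ERROR_RATE = 0.005


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile, computed with integer arithmetic."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def check_slos(latencies_ms: dict[str, list[float]], errors: int, total: int) -> dict[str, bool]:
    """Map each SLO name to True (met) or False (violated)."""
    results = {
        f"p95_{kind}": p95(latencies_ms[kind]) < limit
        for kind, limit in P95_LIMIT_MS.items()
    }
    results["error_rate"] = total > 0 and errors / total < MAX_ERROR_RATE
    return results


def tenant_label(tenant: str, active: set[str], cap: int = 1000) -> str:
    """Return the label value to emit; tenants beyond the cap share "other" (assumption)."""
    if tenant in active or len(active) < cap:
        active.add(tenant)
        return tenant
    return "other"
```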

Implementation steps

  • Add dashboards (Grafana folder StellaOps / Tenancy) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate.
  • Alert rules: tenant_error_rate_gt_0_5pct, jwks_cache_miss_spike, tenant_rate_limit_exceeded.
  • CI: add chaos test job stub (uses docker-compose + iptables fault) gated behind manual approval.
  • Docs: update deploy/README.md Tenancy section once dashboards/alerts are live.
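The condition behind tenant_error_rate_gt_0_5pct might look like the following when expressed over per-window request counts. This is a sketch, not the alert rule itself: the `Sample` shape is hypothetical, and counting only 5xx statuses as errors is an assumption.

```python
from collections import defaultdict

# Each sample mirrors one tenant_requests_total{tenant,service,route,status} series:
# (tenant, status, request count over the evaluation window).
Sample = tuple[str, str, int]


def tenants_over_error_budget(samples: list[Sample], threshold: float = 0.005) -> dict[str, float]:
    """Return {tenant: error_rate} for tenants whose error rate exceeds the threshold.

    Assumption: a request is an error when its status label starts with "5".
    """
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for tenant, status, count in samples:
        totals[tenant] += count
        if status.startswith("5"):
            errors[tenant] += count
    return {
        tenant: errors[tenant] / n
        for tenant, n in totals.items()
        if n > 0 and errors[tenant] / n > threshold
    }
```

Aggregating across service and route before comparing keeps the alert per tenant, matching the rule name; a per-route variant would need the extra labels kept in the key.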

Artefacts

  • Dashboard JSON: ops/devops/tenant/dashboards/tenant-audit.json
  • Alert rules: ops/devops/tenant/alerts.yaml
  • Chaos script: ops/devops/tenant/jwks-chaos.sh