feat(zastava): add evidence locker plan and schema examples
- Introduced README.md for Zastava Evidence Locker Plan detailing artifacts to sign and post-signing steps.
- Added example JSON schemas for observer events and webhook admissions.
- Updated implementor guidelines with a checklist for CI linting, determinism, secrets management, and schema control.
- Created alert rules for Vuln Explorer to monitor API latency and projection errors.
- Developed analytics ingestion plan for Vuln Explorer, focusing on telemetry and PII guardrails.
- Implemented Grafana dashboard configuration for Vuln Explorer metrics visualization.
- Added expected projection SHA256 for vulnerability events.
- Created k6 load-testing script for the Vuln Explorer API.
- Added sample projection and replay event data for testing.
- Implemented ReplayInputsLock for deterministic replay-inputs management.
- Developed tests for ReplayInputsLock to ensure stable hash computation.
- Created SurfaceManifestDeterminismVerifier to validate manifest determinism and integrity.
- Added unit tests for SurfaceManifestDeterminismVerifier.
- Implemented Angular tests for VulnerabilityHttpClient and VulnerabilityDetailComponent to verify API interactions and UI rendering.
ops/devops/tenant/audit-pipeline-plan.md (new file, +36 lines)
# Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)

Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.
## Pipeline components

- **Audit collector**: scrape structured logs from services emitting `tenant`, `subject`, `action`, `resource`, `result`, `traceId`. Ship via OTLP -> collector -> Loki/ClickHouse.
- **Usage metrics**: Prometheus counters/gauges:
  - `tenant_requests_total{tenant,service,route,status}`
  - `tenant_rate_limit_hits_total{tenant,service}`
  - `tenant_data_volume_bytes_total{tenant,service}`
  - `tenant_queue_depth{tenant,service}` (NATS/Redis)
- **Data retention**: 30d logs; 90d metrics (downsampled after 30d).

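The audit-collector bullet above implies one structured log record per request. A minimal sketch of that record shape follows; the field names come from the plan, while the `buildAuditRecord` helper, its validation, and the `ts` timestamp field are illustrative assumptions, not part of any service's actual API.

```javascript
// Illustrative only: the flat record a service might emit for the audit
// collector. Field list is from the plan; helper name/validation are assumed.
const AUDIT_FIELDS = ["tenant", "subject", "action", "resource", "result", "traceId"];

function buildAuditRecord(fields) {
  // Reject records missing any required audit field so bad events are
  // caught at the emitter rather than downstream in Loki/ClickHouse.
  const missing = AUDIT_FIELDS.filter((f) => !(f in fields));
  if (missing.length > 0) {
    throw new Error(`audit record missing fields: ${missing.join(", ")}`);
  }
  // A flat JSON object keeps every field indexable as a label/column
  // on the OTLP -> collector -> Loki/ClickHouse path.
  return JSON.stringify({ ts: new Date().toISOString(), ...fields });
}
```

Keeping the field set closed also protects the `tenant` label cardinality budget mentioned later in the plan.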
## JWKS outage chaos

- Scenario: Authority/JWKS becomes unreachable for 5 minutes.
- Steps:
  1. Run synthetic tenant traffic via k6 (reuse `ops/devops/vuln/k6-vuln-explorer.js` or service-specific scripts) with the `X-StellaOps-Tenant` header set.
  2. Block the JWKS endpoint (iptables or an Envoy fault) for 5 minutes.
  3. Assert: services fall back to cached keys (if within TTL), error rate stays < 1%, the audit pipeline records `auth.degraded` events, and alerts fire if the cache has expired.
- Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate-limit hits.

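The fallback behaviour asserted in step 3 can be sketched as a small cache policy. The `JwksCache` class, its method names, and the 5-minute TTL are assumptions for illustration only, not the services' real implementation:

```javascript
// Sketch (assumed names/TTL): decide whether cached signing keys may still
// be used while the JWKS endpoint is unreachable.
class JwksCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.keys = null;
    this.fetchedAt = 0;
  }

  store(keys, now) {
    this.keys = keys;
    this.fetchedAt = now;
  }

  // During an outage: serve cached keys while inside the TTL; otherwise
  // signal degraded auth so the pipeline can record an `auth.degraded`
  // event and the cache-expiry alert can fire.
  keysForOutage(now) {
    if (this.keys && now - this.fetchedAt <= this.ttlMs) {
      return { keys: this.keys, degraded: false };
    }
    return { keys: null, degraded: true };
  }
}
```

The `degraded` flag is the hinge of the chaos assertion: inside the TTL the error rate should stay below 1%; past it, alerts must fire.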
## Load/perf benchmarks

- Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify) using a k6 scenario that mixes reads/writes 90/10.
- SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
- Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics stay within the `tenant` label cardinality cap (<= 1000 active tenants).

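The traffic shape above could be produced by small helpers inside the k6 script; the tenant naming scheme, function names, and deterministic per-VU spread below are illustrative assumptions:

```javascript
// Illustrative helpers for the k6 scenario: spread VUs across >= 10 tenants
// and pick reads vs writes at a 90/10 ratio. All names are assumed.
const TENANTS = Array.from({ length: 10 }, (_, i) => `tenant-${String(i + 1).padStart(2, "0")}`);

function pickTenant(vuId) {
  // Deterministic per-VU assignment keeps the `tenant` label set bounded,
  // respecting the cardinality cap, while still exercising every tenant.
  return TENANTS[vuId % TENANTS.length];
}

function pickOperation(rand) {
  // rand in [0, 1): 90% of iterations issue reads, 10% issue writes.
  return rand < 0.9 ? "read" : "write";
}
```

Inside k6, `pickTenant` would feed the `X-StellaOps-Tenant` header and `pickOperation(Math.random())` would choose the request to send.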
## Implementation steps

- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, and JWKS cache hit rate.
- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`.
- CI: add a chaos-test job stub (docker-compose + an iptables fault), gated behind manual approval.
- Docs: update the Tenancy section of `deploy/README.md` once dashboards/alerts are live.

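The first alert rule listed above could look like the following Prometheus rule sketch; the PromQL expression, windows, and label set are assumptions to be tuned against the real metrics, not the shipped `alerts.yaml`:

```yaml
# Sketch only: a possible shape for tenant_error_rate_gt_0_5pct.
# Expression, windows, and severity are assumptions.
groups:
  - name: tenancy
    rules:
      - alert: tenant_error_rate_gt_0_5pct
        expr: |
          sum by (tenant) (rate(tenant_requests_total{status=~"5.."}[5m]))
            / sum by (tenant) (rate(tenant_requests_total[5m])) > 0.005
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} error rate above 0.5% for 10m"
```

Grouping by `tenant` keeps the alert per-tenant while reusing the `tenant_requests_total` counter defined in the pipeline components.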
## Artefacts

- Dashboard JSON: `ops/devops/tenant/dashboards/tenant-audit.json`
- Alert rules: `ops/devops/tenant/alerts.yaml`
- Chaos script: `ops/devops/tenant/jwks-chaos.sh`