feat(zastava): add evidence locker plan and schema examples

- Introduced README.md for Zastava Evidence Locker Plan detailing artifacts to sign and post-signing steps.
- Added example JSON schemas for observer events and webhook admissions.
- Updated implementor guidelines with checklist for CI linting, determinism, secrets management, and schema control.
- Created alert rules for Vuln Explorer to monitor API latency and projection errors.
- Developed analytics ingestion plan for Vuln Explorer, focusing on telemetry and PII guardrails.
- Implemented Grafana dashboard configuration for Vuln Explorer metrics visualization.
- Added expected projection SHA256 for vulnerability events.
- Created k6 load testing script for Vuln Explorer API.
- Added sample projection and replay event data for testing.
- Implemented ReplayInputsLock for deterministic replay inputs management.
- Developed tests for ReplayInputsLock to ensure stable hash computation.
- Created SurfaceManifestDeterminismVerifier to validate manifest determinism and integrity.
- Added unit tests for SurfaceManifestDeterminismVerifier to ensure correct functionality.
- Implemented Angular tests for VulnerabilityHttpClient and VulnerabilityDetailComponent to verify API interactions and UI rendering.
2025-12-02 09:27:31 +02:00
parent 885ce86af4
commit 2d08f52715
74 changed files with 1690 additions and 131 deletions
--- a/ops/devops/tenant/alerts.yaml
+++ b/ops/devops/tenant/alerts.yaml
@@ -0,0 +1,29 @@
+# Alert rules for tenant audit & auth (DEVOPS-TEN-49-001)
+apiVersion: 1
+groups:
+- name: tenant-audit
+  rules:
+  - alert: tenant_error_rate_gt_0_5pct
+    expr: sum(rate(tenant_requests_total{status=~"5.."}[5m])) / sum(rate(tenant_requests_total[5m])) > 0.005
+    for: 5m
+    labels:
+      severity: page
+    annotations:
+      summary: Tenant error rate high
+      description: Error rate across tenant-labelled requests exceeds 0.5%.
+  - alert: jwks_cache_miss_spike
+    expr: rate(auth_jwks_cache_misses_total[5m]) / (rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m])) > 0.2
+    for: 5m
+    labels:
+      severity: warn
+    annotations:
+      summary: JWKS cache miss rate spike
+      description: JWKS miss ratio above 20% may indicate outage or cache expiry.
+  - alert: tenant_rate_limit_exceeded
+    expr: rate(tenant_rate_limit_hits_total[5m]) > 10
+    for: 5m
+    labels:
+      severity: warn
+    annotations:
+      summary: Frequent rate limit hits
+      description: Tenant rate limit hits exceed 10 per second, averaged over a 5m window.
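Note on the thresholds in `alerts.yaml`: PromQL `rate()` returns a per-second average over the range, so `rate(tenant_rate_limit_hits_total[5m]) > 10` corresponds to more than 3000 hits in a 5m window for a given series, while the error-rate rule is a plain ratio test against 0.005. The sketch below only mirrors that arithmetic for sanity-checking numbers by hand. The function names `error_ratio_exceeds` and `hits_per_window` are hypothetical and not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this commit) that mirror the alert arithmetic.

# Succeeds when errors/total is strictly above the threshold (default 0.005),
# the same comparison used by tenant_error_rate_gt_0_5pct.
# Zero or negative total traffic is treated as "not firing".
error_ratio_exceeds() {
  local errors_per_s=$1 total_per_s=$2 threshold=${3:-0.005}
  awk -v e="$errors_per_s" -v t="$total_per_s" -v th="$threshold" \
    'BEGIN { if (t <= 0) exit 1; exit !((e / t) > th) }'
}

# Converts a per-second rate threshold into a count over a window of N seconds.
hits_per_window() {
  local per_second=$1 window_seconds=$2
  awk -v r="$per_second" -v w="$window_seconds" 'BEGIN { printf "%d\n", r * w }'
}
```

For example, `hits_per_window 10 300` prints `3000`, and `error_ratio_exceeds 6 1000` succeeds because 0.6% is above the 0.5% threshold. If the intent is "more than 10 hits per 5m window", `increase()` over the same range would be the closer PromQL function.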
--- a/ops/devops/tenant/audit-pipeline-plan.md
+++ b/ops/devops/tenant/audit-pipeline-plan.md
@@ -0,0 +1,36 @@
+# Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)
+
+Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.
+
+## Pipeline components
+- **Audit collector**: scrape structured logs from services emitting `tenant`, `subject`, `action`, `resource`, `result`, `traceId`. Ship via OTLP->collector->Loki/ClickHouse.
+- **Usage metrics**: Prometheus counters/gauges
+  - `tenant_requests_total{tenant,service,route,status}`
+  - `tenant_rate_limit_hits_total{tenant,service}`
+  - `tenant_data_volume_bytes_total{tenant,service}`
+  - `tenant_queue_depth{tenant,service}` (NATS/Redis)
+- **Data retention**: 30d logs; 90d metrics (downsampled after 30d).
+
+## JWKS outage chaos
+- Scenario: Authority/JWKS becomes unreachable for 5m.
+- Steps:
+  1. Run synthetic tenant traffic via k6 (reuse `ops/devops/vuln/k6-vuln-explorer.js` or service-specific scripts) with `X-StellaOps-Tenant` set.
+  2. Block JWKS endpoint (iptables or envoy fault) for 5 minutes.
+  3. Assert: services fall back to cached keys (if within TTL), error rate < 1%, audit pipeline records `auth.degraded` events, alerts fire if cache expired.
+- Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate limit hits.
+
+## Load/perf benchmarks
+- Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify) using k6 scenario that mixes read/write 90/10.
+- SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
+- Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics maintain `tenant` label cardinality cap (<= 1000 active tenants).
+
+## Implementation steps
+- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate.
+- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`.
+- CI: add chaos test job stub (uses docker-compose + iptables fault) gated behind manual approval.
+- Docs: update `deploy/README.md` Tenancy section once dashboards/alerts live.
+
+## Artefacts
+- Dashboard JSON: `ops/devops/tenant/dashboards/tenant-audit.json`
+- Alert rules: `ops/devops/tenant/alerts.yaml`
+- Chaos script: `ops/devops/tenant/jwks-chaos.sh`
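Note on the SLOs in the plan (p95 < 300ms read, < 600ms write): a minimal sketch of a post-run check, assuming latencies have been exported as one millisecond value per line. k6 thresholds would normally enforce this during the run, and k6 may interpolate percentiles differently from the nearest-rank method used here. The function names `p95` and `p95_below` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical post-run SLO check (not part of this commit).

# Prints the nearest-rank p95 of newline-separated latencies (ms) read from stdin.
# Fails with status 1 when stdin is empty.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[rank]
    }'
}

# Succeeds when the p95 of stdin is strictly below the given limit in ms.
p95_below() {
  local limit_ms=$1 value
  value=$(p95) || return 1
  awk -v v="$value" -v l="$limit_ms" 'BEGIN { exit !(v < l) }'
}
```

Usage under these assumptions: `p95_below 300 < read-latencies.txt && p95_below 600 < write-latencies.txt`, where both file names are placeholders.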
--- a/ops/devops/tenant/dashboards/tenant-audit.json
+++ b/ops/devops/tenant/dashboards/tenant-audit.json
@@ -0,0 +1,11 @@
+{
+  "title": "Tenant Audit & Auth",
+  "timezone": "utc",
+  "panels": [
+    {"type": "timeseries", "title": "Tenant request latency p95", "targets": [{"expr": "histogram_quantile(0.95, rate(tenant_requests_duration_seconds_bucket[5m]))"}]},
+    {"type": "timeseries", "title": "Tenant error rate", "targets": [{"expr": "sum(rate(tenant_requests_total{status=~\"5..\"}[5m])) / sum(rate(tenant_requests_total[5m]))"}]},
+    {"type": "timeseries", "title": "JWKS cache hit rate", "targets": [{"expr": "rate(auth_jwks_cache_hits_total[5m]) / (rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m]))"}]},
+    {"type": "timeseries", "title": "Rate limit hits", "targets": [{"expr": "rate(tenant_rate_limit_hits_total[5m])"}]},
+    {"type": "timeseries", "title": "Tenant queue depth", "targets": [{"expr": "tenant_queue_depth"}]}
+  ]
+}
--- a/ops/devops/tenant/jwks-chaos.sh
+++ b/ops/devops/tenant/jwks-chaos.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+# Simulate JWKS outage for chaos testing (DEVOPS-TEN-49-001)
+# Usage: JWKS_HOST=authority.local JWKS_PORT=8440 DURATION=300 ./jwks-chaos.sh
+set -euo pipefail
+HOST=${JWKS_HOST:-authority}
+PORT=${JWKS_PORT:-8440}
+DURATION=${DURATION:-300}
+
+rule_name=stellaops-jwks-chaos
+
+cleanup() {
+  sudo iptables -D OUTPUT -p tcp --dport "$PORT" -d "$HOST" -m comment --comment "$rule_name" -j DROP 2>/dev/null || true
+}
+trap cleanup EXIT
+
+sudo iptables -I OUTPUT -p tcp --dport "$PORT" -d "$HOST" -m comment --comment "$rule_name" -j DROP
+echo "JWKS traffic to ${HOST}:${PORT} dropped for ${DURATION}s" >&2
+sleep "$DURATION"
+cleanup
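Note on `jwks-chaos.sh` inputs: the script passes `JWKS_PORT` and `DURATION` straight to `iptables` and `sleep`. A minimal sketch of a pre-flight check follows, assuming both values should be plain positive integers (seconds for the duration, matching the `${DURATION}s` log line). The function `validate_chaos_inputs` is hypothetical and not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of jwks-chaos.sh): reject bad inputs
# before any firewall rule is inserted.
validate_chaos_inputs() {
  local port=$1 duration=$2
  # Port must be a decimal integer in 1..65535 (10# avoids octal parsing of "08").
  if ! [[ $port =~ ^[0-9]+$ ]] || (( 10#$port < 1 || 10#$port > 65535 )); then
    echo "invalid JWKS_PORT: ${port}" >&2
    return 1
  fi
  # Duration must be a positive whole number of seconds.
  if ! [[ $duration =~ ^[0-9]+$ ]] || (( 10#$duration < 1 )); then
    echo "invalid DURATION: ${duration}" >&2
    return 1
  fi
}
```

It could be called as `validate_chaos_inputs "$PORT" "$DURATION"` before the `iptables -I` line, so that a malformed value fails before any traffic is dropped.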