feat(zastava): add evidence locker plan and schema examples
- Introduced README.md for Zastava Evidence Locker Plan detailing artifacts to sign and post-signing steps.
- Added example JSON schemas for observer events and webhook admissions.
- Updated implementor guidelines with checklist for CI linting, determinism, secrets management, and schema control.
- Created alert rules for Vuln Explorer to monitor API latency and projection errors.
- Developed analytics ingestion plan for Vuln Explorer, focusing on telemetry and PII guardrails.
- Implemented Grafana dashboard configuration for Vuln Explorer metrics visualization.
- Added expected projection SHA256 for vulnerability events.
- Created k6 load testing script for Vuln Explorer API.
- Added sample projection and replay event data for testing.
- Implemented ReplayInputsLock for deterministic replay inputs management.
- Developed tests for ReplayInputsLock to ensure stable hash computation.
- Created SurfaceManifestDeterminismVerifier to validate manifest determinism and integrity.
- Added unit tests for SurfaceManifestDeterminismVerifier to ensure correct functionality.
- Implemented Angular tests for VulnerabilityHttpClient and VulnerabilityDetailComponent to verify API interactions and UI rendering.
ops/devops/tenant/alerts.yaml (new file, 29 lines)
@@ -0,0 +1,29 @@
# Alert rules for tenant audit & auth (DEVOPS-TEN-49-001)
apiVersion: 1
groups:
  - name: tenant-audit
    rules:
      - alert: tenant_error_rate_gt_0_5pct
        expr: sum(rate(tenant_requests_total{status=~"5.."}[5m])) / sum(rate(tenant_requests_total[5m])) > 0.005
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Tenant error rate high
          description: Error rate across tenant-labelled requests exceeds 0.5%.
      - alert: jwks_cache_miss_spike
        expr: rate(auth_jwks_cache_misses_total[5m]) / (rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: JWKS cache miss rate spike
          description: JWKS miss ratio above 20% may indicate outage or cache expiry.
      - alert: tenant_rate_limit_exceeded
        expr: rate(tenant_rate_limit_hits_total[5m]) > 10
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: Frequent rate limit hits
          description: Tenant rate limit exceeded more than 10 times per 5m window.
ops/devops/tenant/audit-pipeline-plan.md (new file, 36 lines)
@@ -0,0 +1,36 @@
# Tenant audit pipeline & chaos plan (DEVOPS-TEN-49-001)

Scope: deploy audit pipeline, capture tenant usage metrics, run JWKS outage chaos tests, and benchmark tenant load/perf.

## Pipeline components
- **Audit collector**: scrape structured logs from services emitting `tenant`, `subject`, `action`, `resource`, `result`, `traceId`. Ship via OTLP -> collector -> Loki/ClickHouse.
- **Usage metrics**: Prometheus counters/gauges
  - `tenant_requests_total{tenant,service,route,status}`
  - `tenant_rate_limit_hits_total{tenant,service}`
  - `tenant_data_volume_bytes_total{tenant,service}`
  - `tenant_queue_depth{tenant,service}` (NATS/Redis)
- **Data retention**: 30d logs; 90d metrics (downsampled after 30d).

## JWKS outage chaos
- Scenario: Authority/JWKS becomes unreachable for 5m.
- Steps:
  1. Run synthetic tenant traffic via k6 (reuse `ops/devops/vuln/k6-vuln-explorer.js` or service-specific scripts) with `X-StellaOps-Tenant` set.
  2. Block JWKS endpoint (iptables or envoy fault) for 5 minutes.
  3. Assert: services fall back to cached keys (if within TTL), error rate < 1%, audit pipeline records `auth.degraded` events, alerts fire if cache expired.
- Metrics/alerts to watch: auth cache hit/miss, token validation failures, request error rate, rate limit hits.
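The steps above could be glued together in a single chaos job. A minimal sketch, assuming a Prometheus instance reachable at `http://prometheus:9090`; only `jwks-chaos.sh` and `k6-vuln-explorer.js` are paths from this commit, the tenant name, endpoint, and threshold check are illustrative:

```bash
#!/usr/bin/env bash
# Illustrative chaos-run wrapper (not part of this commit).
set -euo pipefail

# Step 1: synthetic tenant traffic in the background (k6 reads VULN_* from the environment).
VULN_TENANT=alpha k6 run ops/devops/vuln/k6-vuln-explorer.js &
K6_PID=$!

# Step 2: drop JWKS traffic for 5 minutes using the chaos script from this commit.
JWKS_HOST=authority.local JWKS_PORT=8440 DURATION=300 ops/devops/tenant/jwks-chaos.sh

# Step 3: assert the tenant error rate stayed below 1% during the outage window.
ERR=$(curl -sf http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(tenant_requests_total{status=~"5.."}[5m])) / sum(rate(tenant_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1] // "0"')
awk -v e="$ERR" 'BEGIN { exit (e < 0.01) ? 0 : 1 }'

wait "$K6_PID"
```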

## Load/perf benchmarks
- Target: 5k concurrent tenant requests across API surfaces (Policy, Vuln, Notify) using a k6 scenario that mixes read/write 90/10.
- SLOs: p95 < 300ms read, < 600ms write; error rate < 0.5%.
- Multi-tenant spread: at least 10 tenants, randomised per VU; ensure metrics maintain the `tenant` label cardinality cap (<= 1000 active tenants).
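The bundled k6 script drives a single tenant, so one way to approximate the multi-tenant spread described above is a fan-out wrapper. This is a sketch (one k6 process per tenant rather than per-VU randomisation); the tenant names and base URL are assumptions:

```bash
#!/usr/bin/env bash
# Illustrative multi-tenant load fan-out (not part of this commit).
set -euo pipefail
BASE=${VULN_BASE:-http://localhost:8449}

pids=()
for i in $(seq 1 10); do
  # Each tenant gets its own k6 process; the script reads VULN_* from the environment.
  VULN_BASE="$BASE" VULN_TENANT="tenant-$(printf '%02d' "$i")" \
    k6 run --quiet ops/devops/vuln/k6-vuln-explorer.js &
  pids+=("$!")
done

# Fail the run if any tenant's thresholds were breached (k6 exits non-zero on threshold failure).
for pid in "${pids[@]}"; do wait "$pid"; done
```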

## Implementation steps
- Add dashboards (Grafana folder `StellaOps / Tenancy`) with panels for per-tenant latency, error rate, rate-limit hits, JWKS cache hit rate.
- Alert rules: `tenant_error_rate_gt_0_5pct`, `jwks_cache_miss_spike`, `tenant_rate_limit_exceeded`.
- CI: add chaos test job stub (uses docker-compose + iptables fault) gated behind manual approval.
- Docs: update `deploy/README.md` Tenancy section once dashboards/alerts live.

## Artefacts
- Dashboard JSON: `ops/devops/tenant/dashboards/tenant-audit.json`
- Alert rules: `ops/devops/tenant/alerts.yaml`
- Chaos script: `ops/devops/tenant/jwks-chaos.sh`
ops/devops/tenant/dashboards/tenant-audit.json (new file, 11 lines)
@@ -0,0 +1,11 @@
{
  "title": "Tenant Audit & Auth",
  "timezone": "utc",
  "panels": [
    {"type": "timeseries", "title": "Tenant request latency p95", "targets": [{"expr": "histogram_quantile(0.95, rate(tenant_requests_duration_seconds_bucket[5m]))"}]},
    {"type": "timeseries", "title": "Tenant error rate", "targets": [{"expr": "sum(rate(tenant_requests_total{status=~\"5..\"}[5m])) / sum(rate(tenant_requests_total[5m]))"}]},
    {"type": "timeseries", "title": "JWKS cache hit rate", "targets": [{"expr": "rate(auth_jwks_cache_hits_total[5m]) / (rate(auth_jwks_cache_hits_total[5m]) + rate(auth_jwks_cache_misses_total[5m]))"}]},
    {"type": "timeseries", "title": "Rate limit hits", "targets": [{"expr": "rate(tenant_rate_limit_hits_total[5m])"}]},
    {"type": "timeseries", "title": "Tenant queue depth", "targets": [{"expr": "tenant_queue_depth"}]}
  ]
}
ops/devops/tenant/jwks-chaos.sh (new file, 19 lines)
@@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Simulate JWKS outage for chaos testing (DEVOPS-TEN-49-001)
# Usage: JWKS_HOST=authority.local JWKS_PORT=8440 DURATION=300 ./jwks-chaos.sh
set -euo pipefail
HOST=${JWKS_HOST:-authority}
PORT=${JWKS_PORT:-8440}
DURATION=${DURATION:-300}

rule_name=stellaops-jwks-chaos

cleanup() {
  sudo iptables -D OUTPUT -p tcp --dport "$PORT" -d "$HOST" -j DROP 2>/dev/null || true
}
trap cleanup EXIT

sudo iptables -I OUTPUT -p tcp --dport "$PORT" -d "$HOST" -j DROP
echo "JWKS traffic to ${HOST}:${PORT} dropped for ${DURATION}s" >&2
sleep "$DURATION"
cleanup
ops/devops/vuln/alerts.yaml (new file, 37 lines)
@@ -0,0 +1,37 @@
# Alert rules for Vuln Explorer (DEVOPS-VULN-29-002/003)
apiVersion: 1
groups:
  - name: vuln-explorer
    rules:
      - alert: vuln_api_latency_p95_gt_300ms
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="vuln-explorer",path=~"/findings.*"}[5m])) > 0.3
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Vuln Explorer API p95 latency high
          description: p95 latency for /findings exceeds 300ms for 5m.
      - alert: vuln_projection_lag_gt_60s
        expr: vuln_projection_lag_seconds > 60
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Vuln projection lag exceeds 60s
          description: Ledger projector lag is above 60s.
      - alert: vuln_projection_error_rate_gt_1pct
        expr: rate(vuln_projection_errors_total[5m]) / rate(vuln_projection_runs_total[5m]) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Vuln projector error rate >1%
          description: Projection errors exceed 1% over 5m.
      - alert: vuln_query_budget_enforced_gt_50_per_min
        expr: rate(vuln_query_budget_enforced_total[1m]) > 50
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: Query budget enforcement high
          description: Budget enforcement is firing more than 50/min.
ops/devops/vuln/analytics-ingest-plan.md (new file, 26 lines)
@@ -0,0 +1,26 @@
# Vuln Explorer analytics pipeline plan (DEVOPS-VULN-29-003)

Goals: instrument analytics ingestion (query hashes, privacy/PII guardrails), update observability docs, and supply deployable configs.

## Instrumentation tasks
- Expose Prometheus counters/histograms in API:
  - `vuln_query_hashes_total{tenant,query_hash}` increments on cached/served queries.
  - `vuln_api_latency_seconds` histogram (already present; ensure labels avoid PII).
  - `vuln_api_payload_bytes` histogram for request/response sizes.
- Redact/avoid PII:
  - Hash query bodies server-side (SHA256 with a per-deployment salt) before logging/metrics; store only hash+shape, not raw filters.
  - Truncate any request field names/values in logs to 128 chars and drop known PII fields (email/userId).
- Telemetry export:
  - OTLP metrics/logs via existing collector profile; add `service="vuln-explorer"` resource attrs.
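For reference, the intended hash derivation can be reproduced offline. This is only an illustration of the scheme (canonicalise, prepend the per-deployment salt, SHA-256); the real hashing lives in the API logging middleware, and the salt path and `query.json` input are assumptions:

```bash
#!/usr/bin/env bash
# Illustrative query-hash derivation (the API middleware owns the real implementation).
set -euo pipefail

SALT=$(cat /etc/stellaops/query-hash-salt)    # per-deployment salt (assumed location)
CANON=$(jq -cS . query.json)                  # canonical JSON: compact, keys sorted
HASH=$(printf '%s%s' "$SALT" "$CANON" | sha256sum | awk '{print $1}')

# Only the hash (plus query shape) is ever logged or used as a metric label.
echo "query_hash=${HASH}"
```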

## Pipelines/configs
- The Grafana dashboard (`ops/devops/vuln/dashboards/vuln-explorer.json`) reads the Prometheus metrics defined above.
- Alert rules already live in `ops/devops/vuln/alerts.yaml`; confirm that no additional rules are required for PII drops (logs-only).

## Docs
- Update deploy docs (`deploy/README.md`) to mention PII-safe logging in Vuln Explorer and query-hash metrics.
- Add a runbook entry under `docs/modules/vuln-explorer/observability.md` (create it if absent) summarizing the metrics and how to interpret query hashes.

## CI checks
- Unit test asserting the logging middleware hashes queries and strips PII (to be implemented in API tests).
- Add a static check in the pipeline ensuring `vuln_query_hashes_total` and the payload histograms are scraped (Prometheus snapshot test); see the sketch below.
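A sketch of that scrape check, assuming the CI job can reach Prometheus at `$PROM_URL`; the histogram is probed via its `_bucket` series, and the URL is an assumption:

```bash
#!/usr/bin/env bash
# Illustrative "metrics are actually scraped" gate (not part of this commit).
set -euo pipefail
PROM_URL=${PROM_URL:-http://prometheus:9090}

for metric in vuln_query_hashes_total vuln_api_payload_bytes_bucket; do
  series=$(curl -sf "$PROM_URL/api/v1/query" --data-urlencode "query=${metric}" \
    | jq '.data.result | length')
  if [ "${series:-0}" -eq 0 ]; then
    echo "metric not scraped: ${metric}" >&2
    exit 1
  fi
done
echo "all Vuln Explorer analytics metrics present"
```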
ops/devops/vuln/dashboards/README.md (new file, 4 lines)
@@ -0,0 +1,4 @@
# Vuln Explorer dashboards

- `vuln-explorer.json`: p95 latency, projection lag, error rate, query budget enforcement.
- Import into Grafana (folder `StellaOps / Vuln Explorer`). Data source: Prometheus scrape with `service="vuln-explorer"` labels.
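Import can also be done non-interactively through the Grafana HTTP API; a sketch, where `GRAFANA_URL`, `GRAFANA_TOKEN`, and the folder UID are assumptions:

```bash
#!/usr/bin/env bash
# Illustrative non-interactive dashboard import (not part of this commit).
set -euo pipefail
GRAFANA_URL=${GRAFANA_URL:-http://grafana:3000}

# Wrap the dashboard JSON in the import payload expected by POST /api/dashboards/db.
jq -n --slurpfile dash ops/devops/vuln/dashboards/vuln-explorer.json \
  '{dashboard: $dash[0], folderUid: "stellaops-vuln-explorer", overwrite: true}' \
  | curl -sf -X POST "$GRAFANA_URL/api/dashboards/db" \
      -H "Authorization: Bearer $GRAFANA_TOKEN" \
      -H 'Content-Type: application/json' \
      --data-binary @-
```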
ops/devops/vuln/dashboards/vuln-explorer.json (new file, 30 lines)
@@ -0,0 +1,30 @@
{
  "title": "Vuln Explorer",
  "timezone": "utc",
  "panels": [
    {
      "type": "timeseries",
      "title": "API latency p95/p99",
      "targets": [
        { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"vuln-explorer\",path=~\"/findings.*\"}[5m]))" },
        { "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service=\"vuln-explorer\",path=~\"/findings.*\"}[5m]))" }
      ]
    },
    {
      "type": "timeseries",
      "title": "Projection lag (s)",
      "targets": [ { "expr": "vuln_projection_lag_seconds" } ]
    },
    {
      "type": "stat",
      "title": "Error rate",
      "targets": [ { "expr": "sum(rate(http_requests_total{service=\"vuln-explorer\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"vuln-explorer\"}[5m]))" } ],
      "options": { "reduceOptions": { "calcs": ["lastNotNull"] } }
    },
    {
      "type": "timeseries",
      "title": "Query budget enforcement hits",
      "targets": [ { "expr": "rate(vuln_query_budget_enforced_total[5m])" } ]
    }
  ]
}
ops/devops/vuln/expected_projection.sha256 (new file, 1 line)
@@ -0,0 +1 @@
d89271fddb12115b3610b8cd476c85318cd56c44f7e019793c947bf57c8f86ef  samples/vuln/events/projection.json
ops/devops/vuln/k6-vuln-explorer.js (new file, 47 lines)
@@ -0,0 +1,47 @@
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Trend, Rate } from 'k6/metrics';

const latency = new Trend('vuln_api_latency');
const errors = new Rate('vuln_api_errors');

const BASE = __ENV.VULN_BASE || 'http://localhost:8449';
const TENANT = __ENV.VULN_TENANT || 'alpha';
const TOKEN = __ENV.VULN_TOKEN || '';
const HEADERS = TOKEN ? { 'Authorization': `Bearer ${TOKEN}`, 'X-StellaOps-Tenant': TENANT } : { 'X-StellaOps-Tenant': TENANT };

export const options = {
  scenarios: {
    ramp: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '5m', target: 200 },
        { duration: '10m', target: 200 },
        { duration: '2m', target: 0 },
      ],
      gracefulRampDown: '30s',
    },
  },
  thresholds: {
    vuln_api_latency: ['p(95)<250'],
    vuln_api_errors: ['rate<0.005'],
  },
};

function req(path, params = {}) {
  const res = http.get(`${BASE}${path}`, { headers: HEADERS, tags: params.tags });
  latency.add(res.timings.duration, params.tags);
  errors.add(res.status >= 400, params.tags);
  check(res, {
    'status is 2xx': (r) => r.status >= 200 && r.status < 300,
  });
  return res;
}

export default function () {
  req(`/findings?tenant=${TENANT}&page=1&pageSize=50`, { tags: { endpoint: 'list' } });
  req(`/findings?tenant=${TENANT}&status=open&page=1&pageSize=50`, { tags: { endpoint: 'filter_open' } });
  req(`/findings/stats?tenant=${TENANT}`, { tags: { endpoint: 'stats' } });
  sleep(1);
}
@@ -20,18 +20,17 @@ Assumptions: Vuln Explorer API uses MongoDB + Redis; ledger projector performs r
 - Alert when last anchored root age > 15m or mismatch detected.

 ## Verification Automation
-- Script `ops/devops/vuln/verify_projection.sh` (to be added) should:
-  - Run projector against fixture events and compute hash of materialized view snapshot (`sha256sum` over canonical JSON export).
-  - Compare with expected hash stored in `ops/devops/vuln/expected_projection.sha256`.
-  - Exit non-zero on mismatch.
+- Script `ops/devops/vuln/verify_projection.sh` runs hash check:
+  - Input projection export (`samples/vuln/events/projection.json` default) compared to `ops/devops/vuln/expected_projection.sha256`.
+  - Exits non-zero on mismatch; use in CI after projector replay.

 ## Fixtures
 - Store deterministic replay fixture under `samples/vuln/events/replay.ndjson` (generated offline, includes mixed tenants, disputed findings, remediation states).
 - Export canonical projection snapshot to `samples/vuln/events/projection.json` and hash to `ops/devops/vuln/expected_projection.sha256`.

 ## Dashboards / Alerts (DEVOPS-VULN-29-002/003)
-- Dashboard panels: projection lag, replay throughput, API latency (`/findings`, `/findings/{id}`), query budget enforcement hits, and Merkle anchoring status.
-- Alerts: `vuln_projection_lag_gt_60s`, `vuln_projection_error_rate_gt_1pct`, `vuln_api_latency_p95_gt_300ms`, `merkle_anchor_stale_gt_15m`.
+- Dashboard JSON: `ops/devops/vuln/dashboards/vuln-explorer.json` (latency, projection lag, error rate, budget enforcement).
+- Alerts: `ops/devops/vuln/alerts.yaml` defining `vuln_api_latency_p95_gt_300ms`, `vuln_projection_lag_gt_60s`, `vuln_projection_error_rate_gt_1pct`, `vuln_query_budget_enforced_gt_50_per_min`.

 ## Offline posture
 - CI and verification use in-repo fixtures; no external downloads.
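The `verify_projection.sh` contract described in the hunk above (compare the projection export against `expected_projection.sha256`, exit non-zero on mismatch) could be as small as the following sketch; the script is referenced by the plan but not included in this commit:

```bash
#!/usr/bin/env bash
# Illustrative ops/devops/vuln/verify_projection.sh (not part of this commit).
set -euo pipefail

# Assumes expected_projection.sha256 uses sha256sum's "<hash>  <path>" format and the
# script runs from the repo root: -c recomputes the hash of
# samples/vuln/events/projection.json and exits non-zero on mismatch.
sha256sum --check --strict ops/devops/vuln/expected_projection.sha256
```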