# Zastava Runtime Operations Runbook
This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with Sprint 12 – Runtime Guardrails and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).
## 1. Prerequisites
- Authority client credentials – service principal `zastava-runtime` with scopes `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client certs before rollout.
- Scanner/WebService reachability – cluster DNS entry (e.g. `scanner.internal`) resolvable from every node running Observer/Webhook (a quick check follows this list).
- Host mounts – read-only access to `/proc`, container runtime state (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space (`/var/run/zastava`).
- Offline kit bundle – operators staging air-gapped installs must download `offline-kit/zastava-runtime-{version}.tar.zst`, which contains the container images, Grafana dashboards, and Prometheus rules referenced below.
- Secrets – Authority OpTok cache directory, DPoP private keys, and webhook TLS secrets live outside git. For air-gapped installs, copy them to the sealed secrets vault.
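Scanner reachability can be spot-checked from a throwaway pod on any node; the utility image below is a placeholder for whatever your registry mirrors:

```bash
# Any image with nslookup works; registry.internal/tools/busybox is a placeholder.
kubectl -n stellaops run zastava-dns-check --rm -it --restart=Never \
  --image=registry.internal/tools/busybox -- nslookup scanner.internal
```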
### 1.1 Telemetry quick reference
| Metric | Description | Notes |
|---|---|---|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect > 0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer task 12-004 completes. |
## 2. Deployment workflows
### 2.1 Fresh install (Helm overlay)
- Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
- Render values (a sketch follows this list):
  - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
  - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
  - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
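A minimal overlay sketch follows; the key names come from the list above, but the values are placeholders and the exact schema should be checked against the chart's `values.yaml` before applying:

```bash
# Sketch only: values are placeholders; confirm key names against the chart schema.
cat > values/zastava-runtime.yaml <<'EOF'
zastava:
  runtime:
    tenant: tenant-01
    environment: prod
    deployment: cluster-a
    authority:
      issuer: https://authority.internal
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true
    metrics:
      commonTags:
        cluster: cluster-a
EOF
```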
 
- Pre-create secrets (a sketch follows this list):
  - `zastava-authority-dpop` (JWK + private key).
  - `zastava-authority-mtls` (client cert/key chain).
  - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
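A hedged sketch for creating the three secrets; the file names and key layout are assumptions, so match them to your PKI and to what the chart expects:

```bash
# File names are illustrative; the secret names come from the list above.
kubectl -n stellaops create secret generic zastava-authority-dpop \
  --from-file=dpop.jwk=./dpop.jwk --from-file=dpop.key=./dpop-private.pem
kubectl -n stellaops create secret tls zastava-authority-mtls \
  --cert=./authority-client.crt --key=./authority-client.key
kubectl -n stellaops create secret tls zastava-webhook-tls \
  --cert=./webhook-serving.crt --key=./webhook-serving.key
```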
 
- Deploy Observer DaemonSet and Webhook chart:
```bash
helm upgrade --install zastava-runtime deploy/helm/zastava \
  -f values/zastava-runtime.yaml \
  --namespace stellaops \
  --create-namespace
```
- Verify:
  - `kubectl -n stellaops get pods -l app=zastava-observer` reports the pods ready.
  - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows the `Issued runtime OpTok` audit line with the DPoP token type.
  - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
 
### 2.2 Upgrades
- Scale the webhook deployment to `--replicas=3` (rolling).
- Drain one node per AZ to ensure Observer tolerates disruption.
- Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (< 250 ms).
- Post-upgrade, run smoke tests (a sketch follows this list):
  - Apply an unsigned Pod manifest → expect `deny` (policy fail).
  - Apply a signed Pod manifest → expect `allow`.
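A minimal sketch of the smoke tests, assuming you keep known-good signed and unsigned sample manifests alongside the runbook (`pod-unsigned.yaml`, `pod-signed.yaml`, and the `smoke` namespace are placeholders):

```bash
# Placeholder manifests and namespace; the admission verdict is what matters.
kubectl -n smoke apply -f pod-unsigned.yaml   # expect the webhook to deny
kubectl -n smoke apply -f pod-signed.yaml     # expect allow
kubectl -n smoke delete pod --all             # clean up after the check
```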
 
- Record upgrade in ops log with Git SHA + Helm chart version.
### 2.3 Rollback
- Use the Helm revision history: `helm history zastava-runtime`.
- Roll back: `helm rollback zastava-runtime <revision>`.
- Invalidate cached OpToks:

```bash
kubectl -n stellaops exec deploy/zastava-webhook -- \
  zastava-webhook invalidate-op-token --audience scanner
```
- Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).
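The reconnect signal can also be checked from the CLI by querying Prometheus directly; the endpoint below is a placeholder for your monitoring stack:

```bash
# Placeholder Prometheus endpoint; the expression matches the rate() check above.
curl -sG http://prometheus.internal/api/v1/query \
  --data-urlencode 'query=rate(zastava_runtime_events_total[5m])'
```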
## 3. Authority & security guardrails
- Tokens must be of `DPoP` type when `requireDpop=true`. Logs emit the `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during initial bootstrap. Rotate them nightly and disable the fallback once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management. Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'` to confirm key rotation.
## 4. Incident response
### 4.1 Authority offline
- Check the Prometheus alert `ZastavaAuthorityTokenStale`.
- Inspect Observer logs for the `authority.token.fallback` scope (a log filter sketch follows this list).
- If the fallback engaged, verify the static token's validity duration; rotate the secret if it is older than 24 h.
- Once Authority is restored, delete the static fallback secret and restart the pods to rebind DPoP keys.
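A quick filter for fallback activity, assuming Observer logs are readable via `kubectl logs` as elsewhere in this runbook:

```bash
# grep is the lowest-common-denominator filter; swap in jq if your log pipeline emits structured JSON.
kubectl -n stellaops logs ds/zastava-observer --since=30m | grep 'authority.token.fallback'
```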
### 4.2 Scanner/WebService latency spike
- The alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes (a query sketch follows this list).
- Run a backend health check: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
- If the backend is degraded, the Observer buffer may throttle; confirm the disk-backed queue size via `kubectl logs ds/zastava-observer | grep buffer.drops`.
- Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).
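To reproduce the alert condition manually, query the latency histogram from Prometheus; the Prometheus endpoint is a placeholder, and the underscored metric name assumes the usual dot-to-underscore export convention, so verify it against your scrape config:

```bash
# Assumption: the histogram is exported as zastava_runtime_backend_latency_ms_bucket.
curl -sG http://prometheus.internal/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(zastava_runtime_backend_latency_ms_bucket[5m])) by (le))'
```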
### 4.3 Admission deny storm
- The alert `ZastavaAdmissionDenySpike` indicates more than 20 denies per minute.
- Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'` (an aggregation sketch follows this list).
- Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.
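To see which verdicts dominate during a storm, aggregate the sampled decisions; this assumes the webhook log lines are JSON objects with a `decision` field, as the jq sample above implies:

```bash
# Counts decisions observed in the last 10 minutes, most frequent first.
kubectl -n stellaops logs deploy/zastava-webhook --since=10m \
  | jq -r '.decision' | sort | uniq -c | sort -rn
```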
## 5. Offline kit & air-gapped notes
- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - A sample `zastava-runtime.values.yaml`.
 
- Verification:
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`); an extraction sketch follows this list.
  - Import the Grafana dashboard via `grafana-cli --config ...`.
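A hedged extraction sketch; the bundle layout is assumed to mirror the repository paths listed above, so inspect the archive and adjust the copy paths if they differ:

```bash
# Verify the signature first (cosign command above), then unpack and stage the monitoring assets.
mkdir -p /tmp/zastava-runtime
tar --zstd -xf offline-kit/zastava-runtime-*.tar.zst -C /tmp/zastava-runtime
cp /tmp/zastava-runtime/docs/ops/zastava-runtime-prometheus-rules.yaml /etc/prometheus/rules.d/
```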
 
## 6. Observability assets
- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in the Offline Kit manifest.
## 7. Build-id correlation & symbol retrieval
Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note), and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:
- Capture the hash from the CLI/webhook/Scanner API. For example, copy one of the `Build IDs` reported by `stellaops-cli runtime policy test --image <digest> --namespace <ns>` (e.g. `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
- Derive the debug path (`<aa>/<rest>` under `.build-id`) and check it exists: `ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug`
- If the file is missing, rehydrate it from Offline Kit bundles or the `debug-store` object bucket (a mirror of release artefacts): `oras cp oci://registry.internal/debug-store:latest . --include "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"`
- Confirm the running process advertises the same GNU build-id before symbolising: `readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'`
- Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip` (see the sketch after this list), or cache it in `debuginfod` for fleet-wide symbol resolution: `debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 > /tmp/payments-api.debug`
- For musl-based images, expect shorter build-id footprints. Missing hashes in runtime events indicate stripped binaries without the GNU note; schedule a rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store allowlist so the scanner can surface a fallback symbol package.
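For offline analysis, one option is to merge the stripped binary with its `.debug` companion using `eu-unstrip` from elfutils; the process name and paths below simply reuse the example above and are placeholders:

```bash
# Produces a fully symbolised ELF for gdb/lldb without modifying the running container.
eu-unstrip -o /tmp/payments-api.symbolised \
  /proc/$(pgrep -f payments-api | head -n1)/exe \
  /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
```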
Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.