# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with Sprint 12 – Runtime Guardrails and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).
## 1. Prerequisites
- **Authority client credentials** – service principal `zastava-runtime` with scopes `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`) resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`) and scratch space (`/var/run/zastava`); the preflight sketch below checks these.
- **Offline kit bundle** – operators staging air-gapped installs must download `offline-kit/zastava-runtime-{version}.tar.zst` containing container images, Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets live outside git. For air-gapped installs copy them to the sealed secrets vault.
 
### 1.1 Telemetry quick reference
| Metric | Description | Notes |
|---|---|---|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |
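A quick spot-check of these metrics via the Prometheus HTTP API. The endpoint (`prometheus.internal:9090`) is an assumption; the Prometheus-exposed metric names use underscores, matching the `zastava_runtime_events_total` form used later in this runbook.

```bash
# Prometheus endpoint is an assumption - substitute your monitoring cluster's.
PROM=http://prometheus.internal:9090

# Event rate per node - expect >0 on busy nodes.
curl -fsG "$PROM/api/v1/query" \
  --data-urlencode 'query=rate(zastava_runtime_events_total[5m])'

# Denies over the last hour - watch for spikes.
curl -fsG "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(increase(zastava_admission_decisions_total{decision="deny"}[1h]))'
```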
## 2. Deployment workflows
### 2.1 Fresh install (Helm overlay)
- Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
- Render values (a values sketch follows this list):
  - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
  - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
  - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
- Pre-create secrets:
  - `zastava-authority-dpop` (JWK + private key).
  - `zastava-authority-mtls` (client cert/key chain).
  - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
- Deploy Observer DaemonSet and Webhook chart:

  ```bash
  helm upgrade --install zastava-runtime deploy/helm/zastava \
    -f values/zastava-runtime.yaml \
    --namespace stellaops \
    --create-namespace
  ```

- Verify:
  - `kubectl -n stellaops get pods -l app=zastava-observer` ready.
  - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows the `Issued runtime OpTok` audit line with DPoP token type.
  - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
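A sketch of the values overlay referenced in the render step above. The key names follow the settings listed there; the exact nesting and the example values (tenant, issuer URL, cluster identifier) are assumptions, so check the chart's values schema before use.

```bash
# Hypothetical shape of values/zastava-runtime.yaml - verify against the chart.
cat > values/zastava-runtime.yaml <<'EOF'
zastava:
  runtime:
    tenant: acme                  # tenant identifier (example value)
    environment: production
    deployment: prod-cluster-01   # cluster identifier
    authority:
      issuer: https://authority.internal   # assumption
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true           # see section 3 for guardrail toggles
    metrics:
      commonTags:
        cluster: prod-cluster-01  # Prometheus label
EOF
```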
 
### 2.2 Upgrades
- Scale the webhook deployment to `--replicas=3` (rolling).
- Drain one node per AZ to ensure Observer tolerates disruption.
- Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
- Post-upgrade, run smoke tests (a sketch follows this list):
  - Apply an unsigned Pod manifest → expect `deny` (policy fail).
  - Apply a signed Pod manifest → expect `allow`.
- Record the upgrade in the ops log with Git SHA + Helm chart version.
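A smoke-test sketch for the two admission cases above. The manifest paths are placeholders for whatever workloads your policy treats as unsigned versus signed.

```bash
#!/usr/bin/env bash
# Post-upgrade admission smoke test. Manifest file names are placeholders.
set -u

# Unsigned image: the webhook should reject the Pod, so kubectl apply fails.
if kubectl apply -f smoke/unsigned-pod.yaml 2>/dev/null; then
  echo "FAIL: unsigned Pod was admitted"; exit 1
else
  echo "OK: unsigned Pod denied"
fi

# Signed image: the Pod should be admitted.
kubectl apply -f smoke/signed-pod.yaml && echo "OK: signed Pod allowed"
kubectl delete -f smoke/signed-pod.yaml --ignore-not-found
```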
 
### 2.3 Rollback
- Use Helm revision history: `helm history zastava-runtime`.
- Roll back: `helm rollback zastava-runtime <revision>`.
- Invalidate cached OpToks:

  ```bash
  kubectl -n stellaops exec deploy/zastava-webhook -- \
    zastava-webhook invalidate-op-token --audience scanner
  ```

- Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`); a polling sketch follows this list.
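A small polling loop for the reconnect check, querying the event rate until it recovers. The Prometheus endpoint is the same assumption as in section 1.1.

```bash
# Poll event throughput for ~6 minutes after rollback.
PROM=http://prometheus.internal:9090   # assumption
for i in $(seq 1 12); do
  curl -fsG "$PROM/api/v1/query" \
    --data-urlencode 'query=sum(rate(zastava_runtime_events_total[5m]))'
  echo
  sleep 30
done
```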
## 3. Authority & security guardrails
- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during initial bootstrap. Rotate them nightly; prefer disabling the fallback once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management. Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'` to confirm key rotation. A values sketch of these toggles follows this list.
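A sketch of the three guardrail flags as a values overlay. Only the flag names are taken from this runbook; their placement under `zastava.runtime.authority` is an assumption to verify against the chart.

```bash
# Hypothetical guardrail overlay - confirm key nesting before applying.
cat > values/authority-guardrails.yaml <<'EOF'
zastava:
  runtime:
    authority:
      requireDpop: true                # reject non-DPoP tokens
      requireMutualTls: true           # enforce mTLS during token acquisition
      allowStaticTokenFallback: false  # bootstrap only; disable once Authority is reachable
EOF
# Apply alongside the main overlay:
# helm upgrade zastava-runtime deploy/helm/zastava \
#   -f values/zastava-runtime.yaml -f values/authority-guardrails.yaml
```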
## 4. Incident response
### 4.1 Authority offline
- Check the Prometheus alert `ZastavaAuthorityTokenStale`.
- Inspect Observer logs for the `authority.token.fallback` scope.
- If the fallback engaged, verify the static token's validity duration; rotate the secret if older than 24 h (a rotation sketch follows this list).
- Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys.
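A rotation sketch for the static fallback secret. The secret name (`zastava-static-token`), its key, and the token source file are assumptions; use whatever your bootstrap provisioned.

```bash
# Rotate the static fallback token in place (secret name is hypothetical).
kubectl -n stellaops create secret generic zastava-static-token \
  --from-literal=token="$(cat new-token.txt)" \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart runtime-plane pods so they pick up the rotated secret
# (and, after Authority recovery, rebind DPoP keys).
kubectl -n stellaops rollout restart ds/zastava-observer deploy/zastava-webhook
```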
 
### 4.2 Scanner/WebService latency spike
- The alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes (a rule sketch follows this list).
- Run a backend health check: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
- If the backend is degraded, the buffer may throttle automatically. Confirm the disk-backed queue size via `kubectl logs ds/zastava-observer | grep buffer.drops`.
- Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).
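An illustrative shape of that alert, using the threshold stated above. The shipped `docs/ops/zastava-runtime-prometheus-rules.yaml` is authoritative; the `_bucket` metric name below is an assumption about how the histogram is exported.

```bash
# Display-only sketch of the latency alert rule.
cat <<'EOF'
groups:
  - name: zastava-runtime
    rules:
      - alert: ZastavaRuntimeBackendLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(zastava_runtime_backend_latency_ms_bucket[5m])) by (le)) > 750
        for: 5m
        labels:
          severity: warning
EOF
```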
 
### 4.3 Admission deny storm
- The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
- Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'` (a triage sketch follows this list).
- Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.
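A triage sketch that groups recent denies. Only the `.decision` field comes from the sampling command above; `.namespace` and `.reason` are assumptions about the webhook's log schema.

```bash
# Top deny sources in the last 10 minutes (log field names are assumptions).
kubectl logs deploy/zastava-webhook --since=10m \
  | jq -r 'select(.decision? == "deny") | [.namespace, .reason] | @tsv' \
  | sort | uniq -c | sort -rn | head -20
```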
## 5. Offline kit & air-gapped notes
- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification (an end-to-end sketch follows this list):
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.
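An end-to-end sketch combining the verification steps. The `.sig` file name and the rules path inside the bundle are assumptions; adjust to the kit's actual layout.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Verify the bundle signature (signature file name is an assumption).
cosign verify-blob offline-kit/zastava-runtime-*.tar.zst \
  --certificate offline-kit/zastava-runtime.cert \
  --signature offline-kit/zastava-runtime.sig

# Unpack and stage monitoring assets on the offline cluster.
mkdir -p /tmp/zastava-kit
tar --zstd -xf offline-kit/zastava-runtime-*.tar.zst -C /tmp/zastava-kit
cp /tmp/zastava-kit/docs/ops/zastava-runtime-prometheus-rules.yaml \
   /etc/prometheus/rules.d/
```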
 
## 6. Observability assets
- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in the Offline Kit manifest.