Zastava Runtime Operations Runbook
This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with Sprint 12 – Runtime Guardrails and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).
1. Prerequisites
- Authority client credentials – service principal `zastava-runtime` with scopes `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client certs before rollout.
- Scanner/WebService reachability – cluster DNS entry (e.g. `scanner.internal`) resolvable from every node running Observer/Webhook.
- Host mounts – read-only access to `/proc`, container runtime state (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space (`/var/run/zastava`).
- Offline kit bundle – operators staging air-gapped installs must download `offline-kit/zastava-runtime-{version}.tar.zst`, containing container images, Grafana dashboards, and the Prometheus rules referenced below.
- Secrets – Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets live outside git. For air-gapped installs, copy them to the sealed secrets vault.
1.1 Telemetry quick reference
| Metric | Description | Notes |
|---|---|---|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |
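As an illustration, the alerts referenced in sections 4.2 and 4.3 could be expressed against these metrics roughly as follows. This is a sketch only: rule names and thresholds come from this runbook, but the `_bucket` series name assumes a standard Prometheus histogram export; `docs/ops/zastava-runtime-prometheus-rules.yaml` is authoritative.

```yaml
groups:
  - name: zastava-runtime
    rules:
      - alert: ZastavaRuntimeBackendLatencyHigh
        # P95 of backend call latency above 750 ms for 5 minutes (section 4.2)
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m])))
          > 750
        for: 5m
      - alert: ZastavaAdmissionDenySpike
        # more than 20 denies per minute (section 4.3)
        expr: >
          sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60
          > 20
```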
2. Deployment workflows
2.1 Fresh install (Helm overlay)
- Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
- Render values:
  - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
  - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
  - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
- Pre-create secrets:
  - `zastava-authority-dpop` (JWK + private key).
  - `zastava-authority-mtls` (client cert/key chain).
  - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
- Deploy Observer DaemonSet and Webhook chart:

  ```
  helm upgrade --install zastava-runtime deploy/helm/zastava \
    -f values/zastava-runtime.yaml \
    --namespace stellaops \
    --create-namespace
  ```

- Verify:
  - `kubectl -n stellaops get pods -l app=zastava-observer` ready.
  - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows the `Issued runtime OpTok` audit line with DPoP token type.
  - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
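A minimal values overlay for the render step might look like the sketch below. The key names follow the paths listed above; the placeholder values (tenant id, issuer URL, cluster name) are hypothetical, and the chart's own `values.yaml` remains the source of truth for the schema.

```yaml
zastava:
  runtime:
    tenant: acme              # hypothetical tenant id
    environment: prod
    deployment: cluster-eu1   # cluster identifier
    authority:
      issuer: https://authority.internal   # hypothetical issuer URL
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true
    metrics:
      commonTags:
        cluster: cluster-eu1
```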
2.2 Upgrades
- Scale webhook deployment to `--replicas=3` (rolling).
- Drain one node per AZ to ensure Observer tolerates disruption.
- Apply chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
- Post-upgrade, run smoke tests:
  - Apply unsigned Pod manifest → expect `deny` (policy fail).
  - Apply signed Pod manifest → expect `allow`.
- Record upgrade in ops log with Git SHA + Helm chart version.
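The final record-keeping step can be scripted. A sketch, assuming a plain-text ops log (the log path is an assumption; in practice, read the chart version from `helm list -n stellaops` rather than hard-coding it):

```shell
# Sketch: append an upgrade record to a plain-text ops log.
# 'chart' is hard-coded here for illustration; read it from `helm list` in practice.
chart="zastava-0.0.0"                              # hypothetical chart version
sha=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
printf '%s upgrade chart=%s sha=%s\n' \
  "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$chart" "$sha" >> ops-log.txt
```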
2.3 Rollback
- Use Helm revision history: `helm history zastava-runtime`.
- Rollback: `helm rollback zastava-runtime <revision>`.
- Invalidate cached OpToks:

  ```
  kubectl -n stellaops exec deploy/zastava-webhook -- \
    zastava-webhook invalidate-op-token --audience scanner
  ```

- Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).
3. Authority & security guardrails
- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the `authority.token.issue` scope with decision data; its absence indicates misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable only in lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during initial bootstrap. Rotate nightly; prefer disabling them once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management. Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'` to confirm key rotation.
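The guardrails above map onto toggles in the `zastava.runtime.authority` block. A hardened-profile sketch (option names are those named in this section; the surrounding schema is an assumption):

```yaml
zastava:
  runtime:
    authority:
      issuer: https://authority.internal   # hypothetical issuer URL
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true              # reject non-DPoP tokens
      requireMutualTls: true         # disable only in lab clusters
      allowStaticTokenFallback: false  # bootstrap only; off once Authority is reachable
```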
4. Incident response
4.1 Authority offline
- Check Prometheus alert `ZastavaAuthorityTokenStale`.
- Inspect Observer logs for the `authority.token.fallback` scope.
- If fallback engaged, verify static token validity duration; rotate the secret if older than 24 h.
- Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys.
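The 24 h age check lends itself to a small script. A sketch using GNU `date`; the timestamp is hard-coded here for illustration, and the secret name is hypothetical:

```shell
# Sketch: decide whether the static fallback secret is due for rotation.
# In a real cluster, fetch the timestamp instead of hard-coding it, e.g.
#   created=$(kubectl -n stellaops get secret <fallback-secret> \
#     -o jsonpath='{.metadata.creationTimestamp}')
created="2024-01-01T00:00:00Z"   # hypothetical creationTimestamp
age_h=$(( ( $(date -u +%s) - $(date -u -d "$created" +%s) ) / 3600 ))
if [ "$age_h" -gt 24 ]; then
  echo "rotate: static token is ${age_h}h old"
else
  echo "ok: ${age_h}h old"
fi
```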
4.2 Scanner/WebService latency spike
- Alert
ZastavaRuntimeBackendLatencyHighfires at P95 > 750 ms for 5 minutes. - Run backend health:
kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready. - If backend degraded, auto buffer may throttle. Confirm disk-backed queue size via
kubectl logs ds/zastava-observer | grep buffer.drops. - Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).
4.3 Admission deny storm
- Alert
ZastavaAdmissionDenySpikeindicates >20 denies/minute. - Pull sample:
kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'. - Cross-check policy backlog in Scanner (
/policy/runtimelogs). Engage application owner; optionally set namespace tofailOpenNamespacesafter risk assessment.
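The sampled log lines can be turned into a quick per-verdict tally by extending the `jq` pipeline above. A sketch run against a synthetic log file; in the cluster, pipe `kubectl logs deploy/zastava-webhook --since=10m` into `jq` instead:

```shell
# Sketch: tally admission decisions from structured webhook logs.
# The sample file stands in for the real stream (one JSON object per line;
# field names beyond .decision are hypothetical).
cat > /tmp/webhook-sample.log <<'EOF'
{"decision":"deny","namespace":"payments"}
{"decision":"deny","namespace":"batch"}
{"decision":"allow","namespace":"payments"}
EOF
jq -r '.decision' /tmp/webhook-sample.log | sort | uniq -c | sort -rn
```

During a deny storm, the top line of the tally identifies the dominant verdict at a glance.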
5. Offline kit & air-gapped notes
- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.
6. Observability assets
- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in the Offline Kit manifest.