# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with Sprint 12 – Runtime Guardrails and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites
- **Authority client credentials** – service principal `zastava-runtime` with scopes `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`) resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`) and scratch space (`/var/run/zastava`). See the volume sketch below.
- **Offline kit bundle** – operators staging air-gapped installs must download `offline-kit/zastava-runtime-{version}.tar.zst` containing container images, Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets live outside git. For air-gapped installs, copy them to the sealed secrets vault.
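
The host mounts above usually surface as read-only `hostPath` volumes on the Observer DaemonSet. A minimal sketch assuming the paths listed in the prerequisites; the volume names and container mount points are illustrative, so confirm them against the shipped Helm chart:

```yaml
# Illustrative Observer DaemonSet excerpt (volume names and mount points are assumptions).
spec:
  volumes:
    - name: proc
      hostPath: { path: /proc, type: Directory }
    - name: containerd-state
      hostPath: { path: /var/lib/containerd, type: Directory }
    - name: containerd-sock
      hostPath: { path: /var/run/containerd/containerd.sock, type: Socket }
    - name: zastava-scratch
      hostPath: { path: /var/run/zastava, type: DirectoryOrCreate }
  containers:
    - name: zastava-observer
      volumeMounts:
        - { name: proc, mountPath: /host/proc, readOnly: true }
        - { name: containerd-state, mountPath: /host/var/lib/containerd, readOnly: true }
        - { name: containerd-sock, mountPath: /run/containerd/containerd.sock, readOnly: true }
        - { name: zastava-scratch, mountPath: /var/run/zastava }
```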
### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|---|---|---|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |
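
Prometheus exporters typically translate the metric-name dots to underscores; the spot-check queries below assume that convention plus standard histogram `_bucket` series, so adjust names to what your deployment actually exposes:

```promql
# Events flowing from Observers to Scanner, per tenant (expect > 0 on busy nodes).
sum by (tenant) (rate(zastava_runtime_events_total[5m]))

# P95 backend latency for /runtime/events and /policy/runtime calls.
histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m])))

# Deny share of admission decisions over the last 15 minutes.
sum(rate(zastava_admission_decisions_total{decision="deny"}[15m]))
  / sum(rate(zastava_admission_decisions_total[15m]))
```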
## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

- Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
- Render values (a sample values snippet follows this list):
  - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
  - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
  - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
- Pre-create secrets:
  - `zastava-authority-dpop` (JWK + private key).
  - `zastava-authority-mtls` (client cert/key chain).
  - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
- Deploy the Observer DaemonSet and Webhook chart:

  ```bash
  helm upgrade --install zastava-runtime deploy/helm/zastava \
    -f values/zastava-runtime.yaml \
    --namespace stellaops \
    --create-namespace
  ```

- Verify:
  - `kubectl -n stellaops get pods -l app=zastava-observer` shows pods ready.
  - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows the `Issued runtime OpTok` audit line with DPoP token type.
  - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
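
A sample `values/zastava-runtime.yaml` covering the keys above; the nesting mirrors the bullet list, but the concrete values are placeholders and the exact schema should be checked against the chart's default `values.yaml`:

```yaml
# Placeholder values; adjust tenant/cluster identifiers and the Authority issuer.
zastava:
  runtime:
    tenant: tenant-01
    environment: production
    deployment: prod-eu-1              # cluster identifier
    authority:
      issuer: https://authority.internal
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true                # DPoP toggle (see section 3)
    metrics:
      commonTags:
        cluster: prod-eu-1
```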
### 2.2 Upgrades

- Scale the webhook deployment to `--replicas=3` (rolling).
- Drain one node per AZ to ensure Observer tolerates disruption.
- Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
- Post-upgrade, run smoke tests (sketched below):
  - Apply an unsigned Pod manifest → expect `deny` (policy fail).
  - Apply a signed Pod manifest → expect `allow`.
- Record the upgrade in the ops log with Git SHA + Helm chart version.
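
A smoke-test sketch, assuming `registry.internal/demo/unsigned-app:latest` and `registry.internal/demo/signed-app:1.0.0` stand in for known-unsigned and known-signed images in your environment; the namespace is likewise a placeholder:

```bash
# Create a throwaway namespace for the smoke test.
kubectl create namespace smoke-tests --dry-run=client -o yaml | kubectl apply -f -

# Unsigned image: the admission webhook should reject the Pod (kubectl returns an error).
kubectl -n smoke-tests run smoke-unsigned \
  --image=registry.internal/demo/unsigned-app:latest --restart=Never \
  && echo "UNEXPECTED: unsigned pod was admitted"

# Signed image: expect the Pod to be admitted and become Ready.
kubectl -n smoke-tests run smoke-signed \
  --image=registry.internal/demo/signed-app:1.0.0 --restart=Never
kubectl -n smoke-tests wait pod/smoke-signed --for=condition=Ready --timeout=60s

# Clean up.
kubectl -n smoke-tests delete pod smoke-signed --ignore-not-found
```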
### 2.3 Rollback

- Use Helm revision history: `helm history zastava-runtime`.
- Roll back: `helm rollback zastava-runtime <revision>`.
- Invalidate cached OpToks:

  ```bash
  kubectl -n stellaops exec deploy/zastava-webhook -- \
    zastava-webhook invalidate-op-token --audience scanner
  ```

- Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`); a verification sketch follows below.
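
A post-rollback verification sketch; the Prometheus endpoint is a placeholder and the metric name assumes the usual dot-to-underscore translation:

```bash
# Confirm the active Helm revision and that both workloads rolled back cleanly.
helm -n stellaops history zastava-runtime | tail -n 3
kubectl -n stellaops rollout status ds/zastava-observer --timeout=120s
kubectl -n stellaops rollout status deploy/zastava-webhook --timeout=120s

# Observers should resume sending runtime events within a few minutes.
curl -sG http://prometheus.internal:9090/api/v1/query \
  --data-urlencode 'query=sum by (component) (rate(zastava_runtime_events_total[5m]))' \
  | jq '.data.result'
```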
## 3. Authority & security guardrails

- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during initial bootstrap. Rotate them nightly and disable the fallback once Authority is reachable (a values sketch of these toggles follows this list).
- Audit every change to `zastava.runtime.authority` through change management. Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'` to confirm key rotation.
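
The guardrail toggles referenced above, collected as a values sketch; the nesting under `zastava.runtime.authority` is an assumption to validate against the chart schema:

```yaml
zastava:
  runtime:
    authority:
      requireDpop: true                # reject plain bearer tokens
      requireMutualTls: true           # disable only in lab clusters
      allowStaticTokenFallback: false  # bootstrap-only escape hatch; rotate nightly while enabled
```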
## 4. Incident response

### 4.1 Authority offline

- Check the Prometheus alert `ZastavaAuthorityTokenStale`.
- Inspect Observer logs for the `authority.token.fallback` scope (see the helper sketch below).
- If the fallback engaged, verify the static token's validity duration; rotate the secret if it is older than 24 h.
- Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys.
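
A triage helper sketch, assuming structured JSON logs with a `scope` field and a static fallback secret named `zastava-authority-static` (the secret name is hypothetical; substitute your actual name):

```bash
# 1. Is the fallback path active? (log field names depend on the structured-log schema)
kubectl -n stellaops logs ds/zastava-observer --since=30m \
  | jq -c 'select(.scope == "authority.token.fallback")' | tail -n 5

# 2. How old is the static fallback secret? Rotate it if older than 24 h.
kubectl -n stellaops get secret zastava-authority-static \
  -o jsonpath='{.metadata.creationTimestamp}'

# 3. After Authority recovers, restart the runtime plane to rebind DPoP keys.
kubectl -n stellaops rollout restart ds/zastava-observer deploy/zastava-webhook
```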
### 4.2 Scanner/WebService latency spike

- The alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes (a rule sketch follows this list).
- Run a backend health check: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
- If the backend is degraded, the auto buffer may throttle. Confirm the disk-backed queue size via `kubectl logs ds/zastava-observer | grep buffer.drops`.
- Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary measure).
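
A sketch of the latency alert mirroring the threshold above; the shipped `docs/ops/zastava-runtime-prometheus-rules.yaml` remains the source of truth, and the metric name assumes dot-to-underscore translation:

```yaml
groups:
  - name: zastava-runtime
    rules:
      - alert: ZastavaRuntimeBackendLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Zastava backend P95 latency above 750 ms for 5 minutes"
```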
### 4.3 Admission deny storm

- The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
- Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'` (a triage sketch follows this list).
- Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.
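
A triage sketch that groups recent denies by namespace and image; the `.namespace` and `.image` fields are assumptions about the webhook's log schema (only `.decision` is documented above):

```bash
kubectl logs deploy/zastava-webhook --since=10m \
  | jq -r 'select(.decision == "deny") | "\(.namespace) \(.image)"' \
  | sort | uniq -c | sort -rn | head -n 20
```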
## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification (a staging sketch follows this list):
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.
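
An end-to-end staging sketch for the offline monitoring host; the directory layout inside the bundle, the target paths, and the Prometheus unit name are assumptions:

```bash
# Verify, unpack, and stage the Prometheus rules from the offline kit.
cosign verify-blob offline-kit/zastava-runtime-*.tar.zst \
  --certificate offline-kit/zastava-runtime.cert
mkdir -p /tmp/zastava-kit
tar --zstd -xf offline-kit/zastava-runtime-*.tar.zst -C /tmp/zastava-kit
cp /tmp/zastava-kit/docs/ops/zastava-runtime-prometheus-rules.yaml /etc/prometheus/rules.d/
promtool check rules /etc/prometheus/rules.d/zastava-runtime-prometheus-rules.yaml
systemctl reload prometheus
```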
## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in the Offline Kit manifest.
## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note), and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:
- Capture the hash from the CLI/webhook/Scanner API, for example
  `stellaops-cli runtime policy test --image <digest> --namespace <ns>`, and copy
  one of the `Build IDs` (e.g. `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
- Derive the debug path (`<aa>/<rest>` under `.build-id`) and check it exists (a
  helper sketch follows this list):
  `ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug`
- If the file is missing, rehydrate it from Offline Kit bundles or the `debug-store`
  object bucket (mirror of release artefacts):

  ```bash
  oras cp oci://registry.internal/debug-store:latest . --include \
    "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
  ```

- Confirm the running process advertises the same GNU build-id before symbolising:
  `readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'`
- Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it in
  `debuginfod` for fleet-wide symbol resolution:
  `debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug`
- For musl-based images, expect shorter build-id footprints. Missing hashes in
  runtime events indicate stripped binaries without the GNU note; schedule a
  rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store
  allowlist so the scanner can surface a fallback symbol package.
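
The `<aa>/<rest>` split is the standard `.build-id` layout: the first two hex characters become the directory, the remainder the file name. A small helper sketch using the `/var/opt/debug` root from the example above:

```bash
#!/usr/bin/env bash
# Usage: debug-path.sh <build-id-hex>
# Prints the expected .debug path and reports whether it is present locally.
set -euo pipefail

build_id="${1:?usage: debug-path.sh <build-id-hex>}"
debug_root="/var/opt/debug/.build-id"   # root used elsewhere in this runbook

prefix="${build_id:0:2}"                # first two hex chars -> directory
rest="${build_id:2}"                    # remainder -> file name
path="${debug_root}/${prefix}/${rest}.debug"

if [[ -f "${path}" ]]; then
  echo "present: ${path}"
else
  echo "missing: ${path} (rehydrate from the Offline Kit or debug-store bucket)" >&2
  exit 1
fi
```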
Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.