# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook). It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume `StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client certificates before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`) resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download `offline-kit/zastava-runtime-{version}.tar.zst`, which contains the container images, Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – the Authority OpTok cache directory, DPoP private keys, and webhook TLS secrets live outside git. For air-gapped installs, copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 and P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes and fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batching lands | Placeholder until Observer task 12-004 completes. |

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values (a sample overlay follows this section):
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy the Observer DaemonSet and Webhook chart:

   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```

5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` reports pods Ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows an `Issued runtime OpTok` audit line with the DPoP token type.
   - The admission webhook is registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
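For reference, a minimal sketch of the overlay rendered in step 2. The key layout mirrors the fields named there and the guardrails in §3; every concrete value (tenant, cluster label, issuer URL) is an illustrative assumption — consult the chart's own `values.yaml` for the authoritative schema.

```yaml
# Hypothetical values/zastava-runtime.yaml overlay; keys mirror the fields
# called out in step 2 and §3, values are placeholders for your environment.
zastava:
  runtime:
    tenant: acme-prod                      # assumption: example tenant identifier
    environment: production
    deployment: k8s-east-1                 # cluster identifier
    authority:
      issuer: https://authority.internal   # assumption: example issuer URL
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false      # enable only during bootstrap (see §3)
    metrics:
      commonTags:
        cluster: k8s-east-1
```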
### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests:
   - Apply an unsigned Pod manifest → expect `deny` (policy fail).
   - Apply a signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with the Git SHA and Helm chart version.

### 2.3 Rollback

1. Review the Helm revision history: `helm history zastava-runtime`.
2. Roll back: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:

   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```

4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).

## 3. Authority & security guardrails

- Tokens must be of type `DPoP` when `requireDpop=true`. Logs emit the `authority.token.issue` scope with decision data; its absence indicates misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during initial bootstrap. Rotate them nightly and disable the fallback once Authority is reachable.
- Route every change to `zastava.runtime.authority` through change management. Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'` to confirm key rotation.

## 4. Incident response

### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If the fallback engaged, verify the static token's validity duration; rotate the secret if it is older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart the pods to rebind DPoP keys.

### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes.
2. Check backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the event buffer may throttle. Confirm the disk-backed queue size via `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for the namespaces listed in runbook Appendix B (temporary).

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.

## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in the Offline Kit manifest.
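The shipped rules file is authoritative, but the thresholds quoted in §4 translate into rules along these lines. A sketch, assuming the metrics are exported with Prometheus-style underscores (the runbook itself queries `zastava_runtime_events_total`), so `zastava_runtime_backend_latency_ms` and `zastava_admission_decisions_total` are assumed exporter names — verify against `zastava-runtime-prometheus-rules.yaml` before deploying.

```yaml
groups:
  - name: zastava-runtime
    rules:
      - alert: ZastavaRuntimeBackendLatencyHigh
        # P95 backend latency above 750 ms for 5 minutes (§4.2).
        expr: |
          histogram_quantile(0.95,
            sum(rate(zastava_runtime_backend_latency_ms_bucket[5m])) by (le)
          ) > 750
        for: 5m
        labels:
          severity: warning
      - alert: ZastavaAdmissionDenySpike
        # More than 20 denies per minute, sustained (§4.3).
        expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60 > 20
        for: 5m
        labels:
          severity: warning
```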
## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF `NT_GNU_BUILD_ID` note), and Scanner `/policy/runtime` surfaces the most recent `buildIds` list per digest. Operators can use these hashes to locate debug artifacts during incident response:

1. Capture the hash from the CLI/webhook/Scanner API—for example:

   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <namespace>
   ```

   Copy one of the `Build IDs` (e.g. `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).

2. Derive the debug path (`<first-two-hex-chars>/<remaining-hash>.debug` under `.build-id`) and check that it exists:

   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```

3. If the file is missing, rehydrate it from Offline Kit bundles or the `debug-store` object bucket (a mirror of release artefacts):

   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```

4. Confirm the running process advertises the same GNU build-id before symbolising:

   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```

5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it in `debuginfod` for fleet-wide symbol resolution:

   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```

6. For musl-based images, expect shorter build-id footprints. Missing hashes in runtime events indicate stripped binaries without the GNU note—schedule a rebuild with `-Wl,--build-id` enabled, or add the binary to the debug-store allowlist so the scanner can surface a fallback symbol package.

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of data after ZASTAVA-OBS-17-005 implies containers launched before the Observer upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart Observer to trigger a fresh capture if symbol parity is required.
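The directory convention in steps 2–3 (the first two hex characters form a directory, the rest plus `.debug` the file name) lends itself to a small helper. A minimal sketch, assuming the `/var/opt/debug` root from step 2 and the `debug-store` mirror from step 3; `resolve-debug.sh` and `DEBUG_ROOT` are hypothetical names.

```bash
#!/usr/bin/env bash
# resolve-debug.sh — map a GNU build-id to its local .debug artifact,
# pulling it from the debug-store mirror when absent.
set -euo pipefail

build_id="${1:?usage: resolve-debug.sh <build-id>}"
debug_root="${DEBUG_ROOT:-/var/opt/debug}"   # assumption: local debug root from step 2

prefix="${build_id:0:2}"                     # first two hex chars become the directory
rest="${build_id:2}"                         # remainder becomes the file name
path="${debug_root}/.build-id/${prefix}/${rest}.debug"

if [[ ! -f "$path" ]]; then
  # Mirrors the oras invocation in step 3; the registry path is an assumption,
  # and the artifact is assumed to land at ./<prefix>/<rest>.debug in the cwd.
  oras cp oci://registry.internal/debug-store:latest . \
    --include "${prefix}/${rest}.debug"
  mkdir -p "${debug_root}/.build-id/${prefix}"
  mv "${prefix}/${rest}.debug" "$path"
fi

echo "$path"
```

Running `./resolve-debug.sh 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789` prints the local path, ready to hand to `gdb` or `debuginfod` as in step 5.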