# Zastava Runtime Operations Runbook
This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).
## 1. Prerequisites
- **Authority client credentials:** service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability:** a cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook (see the preflight sketch below).
- **Host mounts:** read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle:** operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst`, which contains the container images,
  Grafana dashboards, and Prometheus rules referenced below.
- **Secrets:** the Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
  live outside git. For air-gapped installs, copy them to the sealed secrets vault.
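A quick preflight helps confirm these prerequisites before rollout. The sketch below is illustrative: the probe image and DNS name are assumptions, and the `ls` check must run on the node itself (for example via `kubectl debug node/...` or SSH).
```sh
# Resolve the Scanner endpoint from inside the cluster (image/name are examples).
kubectl -n stellaops run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup scanner.internal

# On each node, confirm the runtime socket, state dir, and scratch space exist.
ls -ld /var/run/containerd/containerd.sock /var/lib/containerd /var/run/zastava
```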
### 1.1 Telemetry quick reference
| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer task 12-004 completes. |
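The queries below sketch how these series are typically checked; metric names follow the Prometheus convention of replacing dots with underscores (as in `zastava_runtime_events_total` used later in this runbook), and the endpoint and label shapes are assumptions to adapt to your monitoring cluster.
```sh
# Event throughput per component; expect a non-zero rate on busy nodes.
curl -fsS 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=sum by (component) (rate(zastava_runtime_events_total[5m]))'

# P95 backend latency, assuming the histogram exports standard _bucket series.
curl -fsS 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m])))'
```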
## 2. Deployment workflows
### 2.1 Fresh install (Helm overlay)
1. Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
- `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
- `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
- `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets (a creation sketch follows this list):
- `zastava-authority-dpop` (JWK + private key).
- `zastava-authority-mtls` (client cert/key chain).
- `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy Observer DaemonSet and Webhook chart:
```sh
helm upgrade --install zastava-runtime deploy/helm/zastava \
  -f values/zastava-runtime.yaml \
  --namespace stellaops \
  --create-namespace
```
5. Verify:
- `kubectl -n stellaops get pods -l app=zastava-observer` ready.
- `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows
`Issued runtime OpTok` audit line with DPoP token type.
- Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
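For step 3, a minimal secret-creation sketch follows. The file names are placeholders for the key material provisioned in the prerequisites; adjust paths to your vault tooling.
```sh
# DPoP signing material (JWK + private key); file names are assumptions.
kubectl -n stellaops create secret generic zastava-authority-dpop \
  --from-file=dpop.jwk=./dpop.jwk --from-file=dpop.key=./dpop.key

# mTLS client chain for token acquisition.
kubectl -n stellaops create secret generic zastava-authority-mtls \
  --from-file=client.crt=./client.crt --from-file=client.key=./client.key \
  --from-file=ca.crt=./ca.crt

# Webhook serving certificate.
kubectl -n stellaops create secret tls zastava-webhook-tls \
  --cert=./webhook.crt --key=./webhook.key
```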
### 2.2 Upgrades
1. Scale webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests (see the sketch after this list):
- Apply unsigned Pod manifest → expect `deny` (policy fail).
- Apply signed Pod manifest → expect `allow`.
5. Record upgrade in ops log with Git SHA + Helm chart version.
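A minimal deny-path smoke test might look like the sketch below; the namespace and image reference are placeholders, and the exact denial message depends on your policy configuration.
```sh
# Apply an unsigned Pod in a namespace the webhook guards; expect a denial.
cat <<'EOF' | kubectl -n default apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: unsigned-probe
spec:
  containers:
  - name: app
    image: registry.internal/demo/app:unsigned   # placeholder unsigned image
EOF
# Expect: the apply fails with an admission denial from zastava-webhook.
# Repeat with a signed image and expect the Pod to be admitted.
```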
### 2.3 Rollback
1. Use Helm revision history: `helm history zastava-runtime`.
2. Rollback: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
```sh
kubectl -n stellaops exec deploy/zastava-webhook -- \
zastava-webhook invalidate-op-token --audience scanner
```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).
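If Prometheus scraping is unavailable during the rollback window, a direct scrape of one Observer pod works as a spot check; the metrics port below is an assumption.
```sh
# Port-forward one Observer pod and scrape its metrics endpoint directly.
kubectl -n stellaops port-forward ds/zastava-observer 9090:9090 & PF=$!
sleep 2
curl -s http://127.0.0.1:9090/metrics | grep '^zastava_runtime_events_total'
kill $PF
```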
## 3. Authority & security guardrails
- Tokens must be of type `DPoP` when `requireDpop=true`. Logs emit the
  `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in
  lab clusters and expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate them nightly and disable the fallback once Authority is reachable.
- Audit every change in `zastava.runtime.authority` through change management.
Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
to confirm key rotation (a rotation sketch follows).
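A rotation might be stamped and picked up as in the sketch below; the `revision` annotation matches the check above, and the restart targets are the runtime components shipped by this chart.
```sh
# Record the rotation timestamp, then restart consumers to re-read key material.
kubectl -n stellaops annotate secret zastava-authority-dpop \
  revision="$(date -u +%Y%m%dT%H%M%SZ)" --overwrite
kubectl -n stellaops rollout restart ds/zastava-observer deploy/zastava-webhook
```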
## 4. Incident response
### 4.1 Authority offline
1. Check Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for `authority.token.fallback` scope.
3. If the fallback engaged, verify the static token's validity window; rotate the secret if it is older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys (sketch below).
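The cleanup in step 4 could look like this; the fallback secret name is illustrative, so substitute whatever holds your static token.
```sh
# Remove the bootstrap-only static token and rebind DPoP keys on restart.
kubectl -n stellaops delete secret zastava-static-token   # illustrative name
kubectl -n stellaops rollout restart ds/zastava-observer deploy/zastava-webhook
```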
### 4.2 Scanner/WebService latency spike
1. Alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes.
2. Run backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the Observer's disk-backed buffer may throttle delivery. Confirm
queue size and drops via `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).
### 4.3 Admission deny storm
1. Alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application
owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment (see the
sketch below).
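If the risk assessment approves fail-open, a values override might look like the sketch below; the exact values path is an assumption, so confirm it against the chart's schema before applying.
```sh
# Add namespaces to the fail-open list without disturbing other values.
helm -n stellaops upgrade zastava-runtime deploy/helm/zastava \
  --reuse-values \
  --set 'zastava.webhook.failOpenNamespaces={payments,checkout}'
```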
## 5. Offline kit & air-gapped notes
- Bundle contents:
- Observer/Webhook container images (multi-arch).
- `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
- Sample `zastava-runtime.values.yaml`.
- Verification:
- Validate the signature (assuming a detached `.sig` file ships alongside the bundle): `cosign verify-blob --certificate offline-kit/zastava-runtime.cert --signature offline-kit/zastava-runtime.sig offline-kit/zastava-runtime-*.tar.zst`.
- Extract Prometheus rules into offline monitoring cluster (`/etc/prometheus/rules.d`).
- Import Grafana dashboard via `grafana-cli --config ...`.
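Before loading the rules into the offline Prometheus, validating them catches copy or transport damage early:
```sh
# Lint the shipped alert rules prior to installation.
promtool check rules docs/ops/zastava-runtime-prometheus-rules.yaml
```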
## 6. Observability assets
- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
the Offline Kit manifest.
## 7. Build-id correlation & symbol retrieval
Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note) and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:
1. Capture the hash from the CLI, webhook, or Scanner API, for example:
```bash
stellaops-cli runtime policy test --image <digest> --namespace <ns>
```
Copy one of the `Build IDs` (e.g.
`5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check it exists:
```bash
ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
```
3. If the file is missing, rehydrate it from Offline Kit bundles or the
`debug-store` object bucket (mirror of release artefacts):
```bash
oras cp oci://registry.internal/debug-store:latest . --include \
"5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
```
4. Confirm the running process advertises the same GNU build-id before
symbolising:
```bash
readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
in `debuginfod` for fleet-wide symbol resolution (see the serving sketch after this list):
```bash
debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
runtime events indicate stripped binaries without the GNU note—schedule a
rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store
allowlist so the scanner can surface a fallback symbol package.
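For step 5, one way to serve the mirrored store fleet-wide is elfutils `debuginfod` pointed at the debug tree; 8002 is the tool's default port, and the exported URL makes clients resolve build-ids automatically.
```bash
# Scan and serve /var/opt/debug over HTTP (default port 8002).
debuginfod -F /var/opt/debug &
export DEBUGINFOD_URLS="http://localhost:8002"
# gdb, eu-unstrip, and debuginfod-find now fetch symbols by build-id on demand.
debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789
```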
Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.