# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst`, which contains the container images,
  Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – the Authority OpTok cache directory, DPoP private keys, and webhook TLS secrets
  live outside git. For air-gapped installs, copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer task 12-004 completes. |
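
The latency histogram drives the `ZastavaRuntimeBackendLatencyHigh` alert described in section 4.2 (P95 > 750 ms for 5 minutes). A minimal Prometheus rule sketch, assuming the histogram is exported as `zastava_runtime_backend_latency_ms_bucket` (verify the exported name against your `/metrics` endpoint):

```yaml
groups:
  - name: zastava-runtime
    rules:
      - alert: ZastavaRuntimeBackendLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(zastava_runtime_backend_latency_ms_bucket[5m])) by (le)
          ) > 750
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Zastava runtime backend P95 latency above 750 ms"
```

The shipped rules file (`docs/ops/zastava-runtime-prometheus-rules.yaml`) is authoritative; this sketch is for orientation only.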

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy the Observer DaemonSet and Webhook chart:

   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```

5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` shows pods ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows an
     `Issued runtime OpTok` audit line with the DPoP token type.
   - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
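
The secrets in step 3 can be pre-created from manifests. A hedged sketch for `zastava-authority-dpop` (the key file names inside the secret are assumptions; match whatever your values file actually references):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: zastava-authority-dpop
  namespace: stellaops
  annotations:
    revision: "2025-10-01"    # bumped on each rotation; checked in section 3
type: Opaque
stringData:
  dpop-jwk.json: |            # assumed key name: public JWK for DPoP proofs
    {"kty":"EC","crv":"P-256","x":"...","y":"..."}
  dpop-private.pem: |         # assumed key name: matching private key
    -----BEGIN PRIVATE KEY-----
    ...
    -----END PRIVATE KEY-----
```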

### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to confirm Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests:
   - Apply an unsigned Pod manifest → expect `deny` (policy fail).
   - Apply a signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with the Git SHA + Helm chart version.

### 2.3 Rollback

1. Review the Helm revision history: `helm history zastava-runtime`.
2. Roll back: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:

   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```

4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).

## 3. Authority & security guardrails

- Tokens must be of type `DPoP` when `requireDpop=true`. Logs emit the
  `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in
  lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate them nightly; prefer disabling the fallback once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management.
  Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
  to confirm key rotation.

## 4. Incident response

### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If the fallback engaged, verify the static token's validity window; rotate the secret if it is older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys.
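
The 24 h check in step 3 can be scripted. A sketch assuming GNU `date`; feed it the `creationTimestamp` of the fallback secret (e.g. from `kubectl get secret <name> -o jsonpath='{.metadata.creationTimestamp}'`):

```shell
#!/bin/sh
# Returns success (0) when the timestamp is more than 24 hours in the past.
secret_older_than_24h() {
  created=$(date -d "$1" +%s)   # GNU date; BSD/macOS needs `date -j -f` instead
  now=$(date +%s)
  [ $(( now - created )) -gt 86400 ]
}

if secret_older_than_24h "2000-01-01T00:00:00Z"; then
  echo "rotate the static fallback secret"
fi
```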

### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires when P95 exceeds 750 ms for 5 minutes.
2. Check backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the auto buffer may throttle. Confirm the disk-backed queue size via
   `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for the namespaces listed in Appendix B (temporary).
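
Temporary fail-open is configured through the webhook values. The exact key path below is an assumption modelled on the `failOpenNamespaces` setting referenced in section 4.3; confirm it against the chart's `values.yaml` before applying:

```yaml
# values/zastava-runtime.yaml — temporary incident mitigation; revert afterwards
zastava:
  webhook:
    failOpenNamespaces:
      - payments        # example namespace; take the actual list from Appendix B
```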

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application
   owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.
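
To see the deny/allow split at a glance, the sampled log lines can be tallied. A sketch using only POSIX tools (the inlined sample stands in for `kubectl logs deploy/zastava-webhook --since=10m`, and assumes one JSON object per line with a `decision` field):

```shell
#!/bin/sh
# Count admission decisions in JSON log lines; deny storms show up immediately.
count_decisions() {
  sed -n 's/.*"decision" *: *"\([a-zA-Z]*\)".*/\1/p' | sort | uniq -c | sort -rn
}

count_decisions <<'EOF'
{"ts":"2025-01-01T00:00:00Z","decision":"deny","namespace":"payments"}
{"ts":"2025-01-01T00:00:01Z","decision":"allow","namespace":"batch"}
{"ts":"2025-01-01T00:00:02Z","decision":"deny","namespace":"payments"}
EOF
```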

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.

## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
  the Offline Kit manifest.

## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note), and Scanner's `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:

1. Capture the hash from the CLI, webhook, or Scanner API—for example:

   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```

   Copy one of the `Build IDs` (e.g.
   `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check that it exists:

   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```

3. If the file is missing, rehydrate it from Offline Kit bundles or the
   `debug-store` object bucket (a mirror of release artefacts):

   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```

4. Confirm the running process advertises the same GNU build-id before
   symbolising:

   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```

5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
   in `debuginfod` for fleet-wide symbol resolution:

   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```

6. For musl-based images, expect shorter build-id footprints. Missing hashes in
   runtime events indicate stripped binaries without the GNU note—schedule a
   rebuild with `-Wl,--build-id` enabled, or add the binary to the debug-store
   allowlist so the scanner can surface a fallback symbol package.
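
The `<aa>/<rest>` split in step 2 is easy to get wrong by hand. A small POSIX helper (a sketch; `/var/opt/debug` is the debug root used in the examples above):

```shell
#!/bin/sh
# Turn a GNU build-id hash into its debug-store path:
# the first two hex chars become the directory, the rest the file name.
build_id_to_path() {
  id=$1
  rest=${id#??}           # hash minus its first two characters
  prefix=${id%"$rest"}    # the first two characters
  printf '/var/opt/debug/.build-id/%s/%s.debug\n' "$prefix" "$rest"
}

build_id_to_path 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789
# → /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
```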

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.