# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst`, which contains the container images,
  Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – the Authority OpTok cache directory, DPoP private keys, and webhook TLS
  secrets live outside git. For air-gapped installs, copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer task 12-004 completes. |
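
To spot-check these series from a terminal, `promtool query instant` works against any
reachable Prometheus. A minimal sketch, assuming the server URL
`http://prometheus.internal:9090` (substitute your monitoring endpoint); the underscored
series names are the assumed Prometheus rendering of the dotted metric names above:

```bash
# Observer event rate per tenant/component over the last 5 minutes.
promtool query instant http://prometheus.internal:9090 \
  'sum by (tenant, component) (rate(zastava_runtime_events_total[5m]))'

# Admission verdict mix; a deny-heavy split warrants the checks in section 4.3.
promtool query instant http://prometheus.internal:9090 \
  'sum by (decision) (rate(zastava_admission_decisions_total[5m]))'
```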

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets (a `kubectl` sketch follows this list):
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy the Observer DaemonSet and Webhook chart:
   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```
5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` reports Ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows an
     `Issued runtime OpTok` audit line with the DPoP token type.
   - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
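
The secret names in step 3 come from the chart contract; the file paths and key names in
this sketch are illustrative assumptions, so adapt them to wherever your DPoP JWK and
certificates are staged:

```bash
# Sketch: pre-create the three secrets from step 3 (file names are assumptions).
kubectl -n stellaops create secret generic zastava-authority-dpop \
  --from-file=dpop.jwk=./secrets/zastava-dpop.jwk
kubectl -n stellaops create secret tls zastava-authority-mtls \
  --cert=./secrets/zastava-client.crt \
  --key=./secrets/zastava-client.key
kubectl -n stellaops create secret tls zastava-webhook-tls \
  --cert=./secrets/webhook.crt \
  --key=./secrets/webhook.key
```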

### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (target <250 ms).
4. Post-upgrade, run smoke tests (see the sketch after this list):
   - Apply an unsigned Pod manifest → expect `deny` (policy fail).
   - Apply a signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with Git SHA + Helm chart version.
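
A minimal smoke-test sketch for step 4. The image references are placeholders, and it
assumes a server-side dry-run is enough to exercise the validating webhook (i.e. the
webhook declares `sideEffects: None`), so no objects are left behind:

```bash
# Unsigned image: expect the request to be rejected by zastava-webhook.
kubectl -n stellaops run smoke-unsigned \
  --image=registry.internal/smoke/unsigned:latest \
  --restart=Never --dry-run=server

# Signed image: expect "pod/smoke-signed created (server dry run)".
kubectl -n stellaops run smoke-signed \
  --image=registry.internal/smoke/signed:latest \
  --restart=Never --dry-run=server
```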

### 2.3 Rollback

1. Review the Helm revision history: `helm history zastava-runtime`.
2. Roll back: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).
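
To watch the recovery from a terminal (same assumed Prometheus URL as in §1.1):

```bash
# Sketch: poll the fleet-wide event rate until it returns to its pre-rollback baseline.
watch -n 30 \
  "promtool query instant http://prometheus.internal:9090 'sum(rate(zastava_runtime_events_total[5m]))'"
```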

## 3. Authority & security guardrails

- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the
  `authority.token.issue` scope with decision data; its absence indicates misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in
  lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate them nightly, and disable the fallback once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management.
  Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
  to confirm key rotation.
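
For reference, the hardened posture can be pinned at deploy time. A sketch, assuming the
chart maps the `zastava.runtime.authority` settings above onto values keys of the same
shape (verify the exact key paths against your chart version):

```bash
helm upgrade --install zastava-runtime deploy/helm/zastava \
  -f values/zastava-runtime.yaml \
  --namespace stellaops \
  --set zastava.runtime.authority.requireDpop=true \
  --set zastava.runtime.authority.requireMutualTls=true \
  --set zastava.runtime.authority.allowStaticTokenFallback=false
```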

## 4. Incident response

### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If the fallback engaged, verify the static token's validity window; rotate the secret if it is older than 24 h (see the sketch after this list).
4. Once Authority is restored, delete the static fallback secret and restart the pods to rebind DPoP keys.
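
A quick way to read a secret's age for step 3. The fallback secret name here is an
assumption; substitute whatever name your bootstrap created:

```bash
# Sketch: print the creation timestamp of the static fallback secret (name assumed).
kubectl -n stellaops get secret zastava-static-fallback \
  -o jsonpath='{.metadata.creationTimestamp}'
```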

### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes (reproduced as an ad-hoc query below).
2. Check backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the Observer's disk-backed buffer may throttle delivery. Confirm
   queue pressure via `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for the namespaces listed in runbook Appendix B (temporary measure).
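
To reproduce the alert condition ad hoc (same assumed Prometheus URL as in §1.1; the
`_bucket` series name assumes the histogram is exported with that suffix):

```bash
# Sketch: returns a sample only while P95 latency breaches the 750 ms threshold.
promtool query instant http://prometheus.internal:9090 \
  'histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750'
```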

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'` (a tallying variant follows this list).
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application
   owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.
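
Building on the step 2 command, a sketch that tallies verdicts so the dominant decision
stands out (assumes the webhook logs one JSON object per line):

```bash
# Count admission decisions from the last 10 minutes, most frequent first.
kubectl -n stellaops logs deploy/zastava-webhook --since=10m \
  | jq -r '.decision' | sort | uniq -c | sort -rn
```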

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification (an end-to-end sketch follows this list):
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.
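
An end-to-end sketch of the verification flow. Note that current cosign releases also
require a `--signature` argument for `verify-blob`; the `.sig` path here is an assumption
about the bundle layout, as is the extraction target directory:

```bash
# Verify the detached signature, then unpack the zstd-compressed bundle.
cosign verify-blob offline-kit/zastava-runtime-*.tar.zst \
  --certificate offline-kit/zastava-runtime.cert \
  --signature offline-kit/zastava-runtime.sig
mkdir -p /opt/zastava-offline
tar --zstd -xf offline-kit/zastava-runtime-*.tar.zst -C /opt/zastava-offline
```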

## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
  the Offline Kit manifest.

## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note), and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:

1. Capture the hash from the CLI, webhook, or Scanner API; for example:
   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```
   Copy one of the `Build IDs` (e.g.
   `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check that it exists:
   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```
3. If the file is missing, rehydrate it from Offline Kit bundles or the
   `debug-store` object bucket (a mirror of release artefacts):
   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```
4. Confirm the running process advertises the same GNU build-id before
   symbolising:
   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
   in `debuginfod` for fleet-wide symbol resolution (a client-side sketch follows this list):
   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
   runtime events indicate stripped binaries without the GNU note; schedule a
   rebuild with `-Wl,--build-id` enabled, or add the binary to the debug-store
   allowlist so the scanner can surface a fallback symbol package.
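
If a `debuginfod` server fronts the debug-store, clients can resolve symbols
automatically. A sketch, assuming a hypothetical internal endpoint
`http://debuginfod.internal:8002`:

```bash
# Point debuginfod-aware tools (gdb, eu-stack, debuginfod-find) at the internal mirror.
export DEBUGINFOD_URLS="http://debuginfod.internal:8002"
# gdb now fetches matching .debug files on demand; manual lookups still work too:
debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789
```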

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.