# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`) and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst` containing container images,
  Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
  live outside git. For air-gapped installs copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |
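
For ad-hoc checks, the queries below are a sketch; they assume the metric names above are exported to Prometheus with dots flattened to underscores (the `zastava_runtime_events_total` form used later in this runbook), so adjust them to your exporter's naming.

```promql
# Observer event throughput by component over the last 5 minutes
sum by (component) (rate(zastava_runtime_events_total[5m]))

# Backend latency P95 in milliseconds for /runtime/events and /policy/runtime calls
histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m])))

# Admission verdict mix over the last 15 minutes
sum by (decision) (increase(zastava_admission_decisions_total[15m]))
```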

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values (a hedged sketch follows this list):
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy Observer DaemonSet and Webhook chart:
   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```
5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` shows the pods ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows an
     `Issued runtime OpTok` audit line with the DPoP token type.
   - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
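
For steps 2 and 3, a minimal sketch follows; the value keys mirror the names listed above but should be verified against the chart's bundled `values.yaml`, and every path, hostname, and tenant shown here is a placeholder.

```yaml
# values/zastava-runtime.yaml (illustrative only; verify key names against the chart)
zastava:
  runtime:
    tenant: tenant-01              # placeholder tenant
    environment: production
    deployment: cluster-prod-eu1   # cluster identifier
    authority:
      issuer: https://authority.internal   # placeholder issuer
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true
    metrics:
      commonTags:
        cluster: prod-eu1
```

The matching secrets from step 3 can be pre-created along these lines:

```bash
# Step 3 secrets; key and file names below are placeholders
kubectl -n stellaops create secret generic zastava-authority-dpop \
  --from-file=dpop.jwk=./dpop.jwk --from-file=dpop.key=./dpop.key
kubectl -n stellaops create secret tls zastava-authority-mtls \
  --cert=./authority-client.crt --key=./authority-client.key
kubectl -n stellaops create secret tls zastava-webhook-tls \
  --cert=./webhook.crt --key=./webhook.key
```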

### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests (scripted sketch below):
   - Apply an unsigned Pod manifest → expect `deny` (policy fail).
   - Apply a signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with the Git SHA + Helm chart version.
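
The smoke tests from step 4 can be scripted roughly as follows; the manifest and namespace names are placeholders for whatever signed/unsigned fixtures you keep for this purpose.

```bash
# Unsigned workload: the admission webhook should reject it (placeholder manifest/namespace)
kubectl -n smoke apply -f pod-unsigned.yaml \
  && echo "UNEXPECTED: unsigned pod admitted" \
  || echo "OK: unsigned pod denied"

# Signed workload: admission should succeed
kubectl -n smoke apply -f pod-signed.yaml \
  && echo "OK: signed pod admitted" \
  || echo "UNEXPECTED: signed pod denied"
```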

### 2.3 Rollback

1. Use Helm revision history: `helm history zastava-runtime`.
2. Rollback: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).

## 3. Authority & security guardrails

- Tokens must be of the `DPoP` type when `requireDpop=true`. Logs emit the
  `authority.token.issue` scope with decision data; its absence indicates a misconfiguration
  (a configuration sketch follows this list).
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in
  lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate them nightly and prefer to disable the fallback once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management.
  Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
  to confirm key rotation.
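
The toggles above typically sit under the `zastava.runtime.authority` block; the sketch below is illustrative only, and the key names are assumptions to be checked against the shipped configuration schema.

```yaml
zastava:
  runtime:
    authority:
      issuer: https://authority.internal   # placeholder issuer
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true                # reject non-DPoP token types
      requireMutualTls: true           # disable only in lab clusters
      allowStaticTokenFallback: false  # enable only during bootstrap; rotate nightly if enabled
```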

## 4. Incident response

### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If the fallback engaged, verify the static token's validity duration; rotate the secret if it is older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys (command sketch below).
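
A rough command sequence for steps 3 and 4; the fallback secret name is a placeholder, so substitute whatever your overlay provisions.

```bash
# Remove the static fallback token once Authority is reachable (placeholder secret name)
kubectl -n stellaops delete secret zastava-static-fallback-token

# Restart runtime components so they rebind DPoP keys against Authority
kubectl -n stellaops rollout restart daemonset/zastava-observer deployment/zastava-webhook
kubectl -n stellaops rollout status daemonset/zastava-observer --timeout=5m
```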

### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes.
2. Check backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the Observer's disk-backed buffer may throttle delivery. Confirm the drop volume via
   `kubectl logs ds/zastava-observer | grep buffer.drops` (a per-pod sketch follows this list).
4. Consider enabling fail-open for the namespaces listed in runbook Appendix B (temporary measure).
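
To judge whether throttling is cluster-wide or limited to a few nodes, a loop along these lines counts `buffer.drops` log lines per Observer pod (label selector taken from section 2.1; adjust if your overlay differs):

```bash
# Count buffer.drops log lines per Observer pod over the last 10 minutes
for pod in $(kubectl -n stellaops get pods -l app=zastava-observer -o name); do
  echo "$pod: $(kubectl -n stellaops logs "$pod" --since=10m | grep -c 'buffer.drops')"
done
```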

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute (see the query sketch below).
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application
   owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.
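
To quantify the storm before engaging the application owner, a query along these lines mirrors the alert threshold (assumes the metric is exported with dots flattened to underscores):

```promql
# Denies per minute across the cluster; sustained values above 20 correspond to ZastavaAdmissionDenySpike
sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60
```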

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.

## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
  the Offline Kit manifest.
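
If the shipped rules file is temporarily unavailable, a minimal stopgap sketch of two of the alerts referenced in section 4 looks roughly like this; the thresholds come from section 4, the metric names assume dot-to-underscore flattening, and the shipped `zastava-runtime-prometheus-rules.yaml` remains the source of truth.

```yaml
groups:
  - name: zastava-runtime
    rules:
      - alert: ZastavaRuntimeBackendLatencyHigh
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Zastava runtime backend P95 latency above 750 ms
      - alert: ZastavaAdmissionDenySpike
        expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60 > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Admission webhook denying more than 20 requests per minute
      # ZastavaAuthorityTokenStale is also defined in the shipped rules; its expression is not reproduced here.
```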

## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note) and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:

1. Capture the hash from the CLI/webhook/Scanner API, for example:
   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```
   Copy one of the `Build IDs` (e.g.
   `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check that it exists:
   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```
3. If the file is missing, rehydrate it from Offline Kit bundles or the
   `debug-store` object bucket (mirror of release artefacts):
   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```
4. Confirm the running process advertises the same GNU build-id before
   symbolising:
   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
   in `debuginfod` for fleet-wide symbol resolution (see the sketch after this list):
   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
   runtime events indicate stripped binaries without the GNU note; schedule a
   rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store
   allowlist so the scanner can surface a fallback symbol package.
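
Where an internal `debuginfod` instance fronts the debug store, debuginfod-aware tools can resolve symbols automatically; the server URL below is an assumption for such a deployment.

```bash
# Point debuginfod-aware tools (gdb, eu-unstrip, debuginfod-find) at the internal server (placeholder URL)
export DEBUGINFOD_URLS="https://debuginfod.internal"

# gdb then fetches the matching .debug file by GNU build-id automatically when attaching
gdb -p "$(pgrep -f payments-api | head -n1)"
```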

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.