# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook). It aligns with Sprint 12 Runtime Guardrails and assumes components consume `StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials**: service principal `zastava-runtime` with scopes `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client certs before rollout.
- **Scanner/WebService reachability**: cluster DNS entry (e.g. `scanner.internal`) resolvable from every node running Observer/Webhook.
- **Host mounts**: read-only access to `/proc`, container runtime state (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space (`/var/run/zastava`).
- **Offline kit bundle**: operators staging air-gapped installs must download `offline-kit/zastava-runtime-{version}.tar.zst`, which contains the container images, Grafana dashboards, and Prometheus rules referenced below.
- **Secrets**: the Authority OpTok cache directory, DPoP private keys, and webhook TLS secrets live outside git. For air-gapped installs, copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
| --- | --- | --- |
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` (future) | Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |
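
For ad-hoc checks without Grafana, the same metrics can be queried straight from the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at `http://prometheus.monitoring:9090` (adjust for your cluster) and that the dotted meter names above are exported with underscores (as the `zastava_runtime_events_total` example in section 2.3 suggests); the histogram bucket name and the `decision="deny"` label value are assumptions:

```bash
#!/usr/bin/env bash
# Ad-hoc telemetry checks against the Prometheus HTTP API (sketch).
# PROM endpoint, histogram metric name, and label values are assumptions.
set -euo pipefail

PROM="${PROM:-http://prometheus.monitoring:9090}"

query() {
  # URL-encode a PromQL expression and return the raw JSON result.
  curl -fsSG "${PROM}/api/v1/query" --data-urlencode "query=$1"
}

# Observer event rate over the last 5 minutes (expect >0 on busy nodes).
query 'sum by (component) (rate(zastava_runtime_events_total[5m]))' | jq '.data.result'

# Backend latency P95 in milliseconds (this runbook alerts at 750 ms).
query 'histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m])))' | jq '.data.result'

# Admission deny rate per minute (deny-spike alert trips above 20/min).
query 'sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60' | jq '.data.result'
```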

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values (a sample overlay sketch follows this list):
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy the Observer DaemonSet and Webhook chart:

   ```bash
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```

5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` shows pods ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows the `Issued runtime OpTok` audit line with DPoP token type.
   - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
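
A minimal overlay and secret-creation sketch for steps 2 and 3. The YAML key layout and secret file names are assumptions drawn from the option names above; confirm them against the chart's `values.yaml` before applying.

```bash
#!/usr/bin/env bash
# Sketch: render a runtime overlay (step 2) and pre-create secrets (step 3).
# Key layout and file names are illustrative; verify against deploy/helm/zastava.
set -euo pipefail

mkdir -p values secrets
cat > values/zastava-runtime.yaml <<'YAML'
zastava:
  runtime:
    tenant: acme              # tenant identifier
    environment: prod
    deployment: prod-eu1      # cluster identifier
    authority:
      issuer: https://authority.internal
      clientId: zastava-runtime
      audience: scanner
      requireDpop: true
    metrics:
      commonTags:
        cluster: prod-eu1     # Prometheus label
YAML

kubectl -n stellaops create secret generic zastava-authority-dpop \
  --from-file=dpop.jwk=secrets/zastava-dpop.jwk
kubectl -n stellaops create secret tls zastava-authority-mtls \
  --cert=secrets/zastava-mtls.crt --key=secrets/zastava-mtls.key
kubectl -n stellaops create secret tls zastava-webhook-tls \
  --cert=secrets/webhook.crt --key=secrets/webhook.key
```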

### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (should stay <250 ms).
4. Post-upgrade, run smoke tests (see the sketch after this list):
   - Apply an unsigned Pod manifest → expect deny (policy fail).
   - Apply a signed Pod manifest → expect allow.
5. Record the upgrade in the ops log with the Git SHA + Helm chart version.
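
A smoke-test sketch for step 4. The manifest paths are placeholders; the script only asserts that the webhook denies the unsigned Pod and admits the signed one.

```bash
#!/usr/bin/env bash
# Post-upgrade smoke test sketch: unsigned Pod denied, signed Pod admitted.
# smoke/unsigned-pod.yaml and smoke/signed-pod.yaml are placeholder manifests.
set -uo pipefail

NS=zastava-smoke
kubectl create namespace "$NS" --dry-run=client -o yaml | kubectl apply -f -

# Expect the admission webhook to reject this manifest (policy fail).
if kubectl -n "$NS" apply -f smoke/unsigned-pod.yaml; then
  echo "FAIL: unsigned pod was admitted" >&2
else
  echo "OK: unsigned pod denied"
fi

# Expect this manifest to be admitted.
if kubectl -n "$NS" apply -f smoke/signed-pod.yaml; then
  echo "OK: signed pod admitted"
else
  echo "FAIL: signed pod was denied" >&2
fi

kubectl delete namespace "$NS" --wait=false
```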

### 2.3 Rollback

1. Use the Helm revision history: `helm history zastava-runtime`.
2. Roll back: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:

   ```bash
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```

4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`); a query sketch follows this list.
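
A verification sketch for step 4, assuming a Prometheus service named `prometheus` in the `monitoring` namespace (adjust to your setup):

```bash
#!/usr/bin/env bash
# Sketch: confirm observer event flow resumed after the rollback.
# Prometheus service name and namespace are assumptions.
set -euo pipefail

kubectl -n monitoring port-forward svc/prometheus 9090:9090 >/dev/null 2>&1 &
PF_PID=$!
trap 'kill "$PF_PID"' EXIT
sleep 3

curl -fsSG http://127.0.0.1:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(zastava_runtime_events_total[5m]))' |
  jq -r '.data.result[0].value[1] // "0"' |
  awk '{ if ($1 + 0 > 0) print "OK: events flowing (" $1 "/s)"; else print "WARN: no runtime events observed" }'
```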

## 3. Authority & security guardrails

- Tokens must be DPoP-typed when `requireDpop=true`. Logs emit the `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during initial bootstrap. Rotate them nightly and disable the fallback once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management. Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'` to confirm key rotation. A combined check sketch follows this list.
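
A combined sketch of the hardened authority profile and the rotation check. The YAML key layout mirrors the overlay from section 2.1 and is an assumption; align it with your chart.

```bash
#!/usr/bin/env bash
# Sketch: expected hardened authority toggles plus DPoP key-rotation evidence.
# YAML layout below is illustrative; verify against your values overlay.
set -euo pipefail

cat <<'YAML'
zastava:
  runtime:
    authority:
      requireDpop: true               # reject non-DPoP tokens
      requireMutualTls: true          # enforce mTLS during token acquisition
      allowStaticTokenFallback: false # bootstrap-only escape hatch, keep off
YAML

# Key-rotation evidence: the revision annotation on the DPoP secret.
kubectl -n stellaops get secret zastava-authority-dpop \
  -o jsonpath='{.metadata.annotations.revision}{"\n"}'
```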

## 4. Incident response

### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If the fallback is engaged, verify the static token validity duration; rotate the secret if it is older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys (a recovery sketch follows this list).
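
A recovery sketch for steps 2-4. The static-fallback secret name is hypothetical; substitute whatever your overlay references.

```bash
#!/usr/bin/env bash
# Sketch: confirm fallback usage, then clean up once Authority is back.
# zastava-static-fallback is a hypothetical secret name.
set -euo pipefail

# Step 2: look for fallback token issuance in Observer logs.
kubectl -n stellaops logs ds/zastava-observer --since=1h | grep 'authority.token.fallback' || true

# Step 3: check the fallback secret age (rotate if older than 24 h).
kubectl -n stellaops get secret zastava-static-fallback \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}'

# Step 4: once Authority is restored, remove the fallback and rebind DPoP keys.
kubectl -n stellaops delete secret zastava-static-fallback
kubectl -n stellaops rollout restart ds/zastava-observer deploy/zastava-webhook
```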

### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires when P95 > 750 ms for 5 minutes.
2. Check backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the Observer's disk-backed buffer may throttle. Confirm the queue pressure via `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for the namespaces listed in runbook Appendix B (temporary). A triage sketch follows this list.
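
A triage sketch for steps 2-3. The `buffer.drops` log token comes from the step above; the 100-entry threshold is an arbitrary illustration.

```bash
#!/usr/bin/env bash
# Sketch: backend readiness plus a rough count of Observer buffer drops.
set -euo pipefail

# Step 2: backend readiness.
kubectl -n scanner exec deploy/scanner-web -- curl -fsS localhost:8080/healthz/ready && echo "backend ready"

# Step 3: disk-backed queue pressure over the last 15 minutes.
DROPS=$(kubectl -n stellaops logs ds/zastava-observer --since=15m | grep -c 'buffer.drops' || true)
DROPS=${DROPS:-0}
echo "buffer.drops log entries in last 15m: ${DROPS}"
if [ "${DROPS}" -gt 100 ]; then
  echo "WARN: sustained drops; consider temporary fail-open per Appendix B" >&2
fi
```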

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment. A verdict-tally sketch follows this list.
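
A sketch for steps 2-3: tally recent verdicts per namespace, then (after risk assessment only) add the affected namespace to the fail-open list. The webhook log field names and the Helm values path for `failOpenNamespaces` are assumptions; verify them against your chart.

```bash
#!/usr/bin/env bash
# Sketch: summarise recent admission verdicts and optionally extend fail-open.
# Log field names and the Helm values path are assumptions.
set -euo pipefail

# Tally decisions per namespace over the last 10 minutes.
kubectl -n stellaops logs deploy/zastava-webhook --since=10m |
  jq -Rr 'fromjson? | select(.decision != null) | "\(.namespace // "unknown") \(.decision)"' |
  sort | uniq -c | sort -rn

# After risk assessment only: temporarily fail open for a specific namespace.
helm -n stellaops upgrade zastava-runtime deploy/helm/zastava \
  -f values/zastava-runtime.yaml \
  --set 'zastava.webhook.failOpenNamespaces={payments-staging}'
```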

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification (see the sketch below):
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.
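
An end-to-end verification sketch combining the bullets above. The detached signature file name and the paths inside the bundle are assumptions; check the kit manifest, and note that newer cosign releases may also require identity flags.

```bash
#!/usr/bin/env bash
# Sketch: verify and unpack the runtime offline kit, then stage monitoring assets.
# Signature file name and in-bundle paths are assumptions.
set -euo pipefail

BUNDLE=$(ls offline-kit/zastava-runtime-*.tar.zst)

cosign verify-blob "${BUNDLE}" \
  --certificate offline-kit/zastava-runtime.cert \
  --signature "${BUNDLE}.sig"

mkdir -p /tmp/zastava-kit
tar --zstd -xf "${BUNDLE}" -C /tmp/zastava-kit

# Stage Prometheus rules; the Grafana dashboard JSON is imported separately.
cp /tmp/zastava-kit/docs/ops/zastava-runtime-prometheus-rules.yaml /etc/prometheus/rules.d/
ls /tmp/zastava-kit/docs/ops/zastava-runtime-grafana-dashboard.json
```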

## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in the Offline Kit manifest.

## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF `NT_GNU_BUILD_ID` note), and Scanner `/policy/runtime` surfaces the most recent `buildIds` list per digest. Operators can use these hashes to locate debug artifacts during incident response:

1. Capture the hash from the CLI/webhook/Scanner API, for example:

   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```

   Copy one of the Build IDs (e.g. `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check that it exists:

   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```

3. If the file is missing, rehydrate it from Offline Kit bundles or the debug-store object bucket (a mirror of release artefacts):

   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```

4. Confirm the running process advertises the same GNU build-id before symbolising:

   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```

5. Attach the `.debug` file in gdb/lldb, feed it to `eu-unstrip`, or cache it in `debuginfod` for fleet-wide symbol resolution:

   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```

6. For musl-based images, expect shorter build-id footprints. Missing hashes in runtime events indicate stripped binaries without the GNU note; schedule a rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store allowlist so the scanner can surface a fallback symbol package.

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of data after ZASTAVA-OBS-17-005 implies containers launched before the Observer upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart Observer to trigger a fresh capture if symbol parity is required. A query sketch follows.
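
A minimal sketch for that check. The request shape (query parameter and response layout) is an assumption for illustration; adapt it to whatever your Scanner deployment actually exposes on `/policy/runtime`.

```bash
#!/usr/bin/env bash
# Sketch: flag runtime policy responses that carry no buildIds.
# SCANNER URL, query parameter, and response shape are assumptions.
set -euo pipefail

SCANNER="${SCANNER:-https://scanner.internal}"
DIGEST="$1"   # e.g. sha256:...

curl -fsS "${SCANNER}/policy/runtime?imageDigest=${DIGEST}" |
  jq -r 'if ((.buildIds // []) | length) > 0
         then "buildIds: " + (.buildIds | join(", "))
         else "no buildIds captured; see the ZASTAVA-OBS-17-005 note above"
         end'
```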