feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established a similar documentation structure for the Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.

docs/modules/zastava/AGENTS.md (new file, +22)
@@ -0,0 +1,22 @@
# Zastava agent guide

## Mission
Zastava monitors running workloads, verifies supply chain posture, and enforces runtime policy via Kubernetes admission webhooks.

## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)

## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.

## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

docs/modules/zastava/README.md (new file, +33)
@@ -0,0 +1,33 @@
# StellaOps Zastava

Zastava monitors running workloads, verifies supply chain posture, and enforces runtime policy via Kubernetes admission webhooks.

## Responsibilities
- Observe node/container activity and emit runtime events.
- Validate signatures, SBOM presence, and backend verdicts before allowing containers.
- Buffer and replay events during disconnections.
- Trigger delta scans when runtime posture drifts.

## Key components
- `StellaOps.Zastava.Observer` DaemonSet.
- `StellaOps.Zastava.Webhook` admission controller.
- Shared contracts in `StellaOps.Zastava.Core`.

## Integrations & dependencies
- Authority for OpToks and mTLS.
- Scanner/Scheduler for remediation triggers.
- Notify/UI for runtime alerts and dashboards.

## Operational notes
- Runbook at ./operations/runtime.md, with Grafana/Prometheus assets.
- Offline Kit assets bundling webhook charts.
- DPoP/mTLS rotation guidance shared with Authority.

## Related resources
- ./operations/runtime.md
- ./operations/runtime-grafana-dashboard.json
- ./operations/runtime-prometheus-rules.yaml

## Backlog references
- ZASTAVA runtime tasks in ../../TASKS.md.
- Webhook smoke tests tracked in src/Zastava/**/TASKS.md.

docs/modules/zastava/TASKS.md (new file, +9)
@@ -0,0 +1,9 @@
# Task board — Zastava

> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.

| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| ZASTAVA-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| ZASTAVA-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| ZASTAVA-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |

docs/modules/zastava/architecture.md (new file, +496)
@@ -0,0 +1,496 @@
# component_architecture_zastava.md — **Stella Ops Zastava** (2025Q4)

> **Scope.** Implementation‑ready architecture for **Zastava**: the **runtime inspector/enforcer** that watches real workloads, detects drift from the scanned baseline, verifies image/SBOM/attestation posture, and (optionally) **admits/blocks** deployments. Includes Kubernetes & plain‑Docker topologies, data contracts, APIs, security posture, performance targets, test matrices, and failure modes.

---

## 0) Mission & boundaries

**Mission.** Give operators **ground‑truth** from running environments and a **fast guardrail** before workloads land:

* **Observer:** inventory containers, entrypoints actually executed, and DSOs actually loaded; verify **image signature**, **SBOM referrers**, and **attestation** presence; detect **drift** (unexpected processes/paths) and **policy violations**; publish **runtime events** to Scanner.WebService.
* **Admission (optional):** Kubernetes ValidatingAdmissionWebhook that enforces minimal posture (signed images, SBOM availability, known base images, policy PASS) **pre‑flight**.

**Boundaries.**

* Zastava **does not** compute SBOMs and does not sign; it **consumes** Scanner/WebService outputs and **enforces** backend policy verdicts.
* Zastava can **request** a delta scan when the baseline is missing/stale, but scanning is done by **Scanner.Worker**.
* On non‑K8s Docker hosts, Zastava runs as a host service with **observer‑only** features.

---

## 1) Topology & processes

### 1.1 Components (Kubernetes)

```
stellaops/zastava-observer    # DaemonSet on every node (read-only host mounts)
stellaops/zastava-webhook     # ValidatingAdmissionWebhook (Deployment, 2+ replicas)
```

### 1.2 Components (Docker/VM)

```
stellaops/zastava-agent       # System service; watch Docker events; observer only
```

### 1.3 Dependencies

* **Authority** (OIDC): short OpToks (DPoP/mTLS) for API calls to Scanner.WebService.
* **Scanner.WebService**: `/runtime/events` ingestion; `/policy/runtime` fetch.
* **OCI Registry** (optional): for direct referrers/sig checks if not delegated to backend.
* **Container runtime**: containerd/CRI‑O/Docker (read interfaces only).
* **Kubernetes API** (watch Pods in cluster; validating webhook).
* **Host mounts** (K8s DaemonSet): `/proc`, `/var/lib/containerd` (or CRI‑O), `/run/containerd/containerd.sock` (optional, read‑only).

---

## 2) Data contracts

### 2.1 Runtime event (observer → Scanner.WebService)

```json
{
  "eventId": "9f6a…",
  "when": "2025-10-17T12:34:56Z",
  "kind": "CONTAINER_START|CONTAINER_STOP|DRIFT|POLICY_VIOLATION|ATTESTATION_STATUS",
  "tenant": "tenant-01",
  "node": "ip-10-0-1-23",
  "runtime": { "engine": "containerd", "version": "1.7.19" },
  "workload": {
    "platform": "kubernetes",
    "namespace": "payments",
    "pod": "api-7c9fbbd8b7-ktd84",
    "container": "api",
    "containerId": "containerd://...",
    "imageRef": "ghcr.io/acme/api@sha256:abcd…",
    "owner": { "kind": "Deployment", "name": "api" }
  },
  "process": {
    "pid": 12345,
    "entrypoint": ["/entrypoint.sh", "--serve"],
    "entryTrace": [
      {"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
      {"file":"<argv>","op":"python","target":"/opt/app/server.py"}
    ],
    "buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
  },
  "loadedLibs": [
    { "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
    { "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
  ],
  "posture": {
    "imageSigned": true,
    "sbomReferrer": "present|missing",
    "attestation": { "uuid": "rekor-uuid", "verified": true }
  },
  "delta": {
    "baselineImageDigest": "sha256:abcd…",
    "changedFiles": ["/opt/app/server.py"],           // optional quick signal
    "newBinaries": [{ "path":"/usr/local/bin/helper","sha256":"…" }]
  },
  "evidence": [
    {"signal":"procfs.maps","value":"/lib/.../libssl.so.3@0x7f..."},
    {"signal":"cri.task.inspect","value":"pid=12345"},
    {"signal":"registry.referrers","value":"sbom: application/vnd.cyclonedx+json"}
  ]
}
```

### 2.2 Admission decision (webhook → API server)

```json
{
  "admissionId": "…",
  "namespace": "payments",
  "podSpecDigest": "sha256:…",
  "images": [
    {
      "name": "ghcr.io/acme/api:1.2.3",
      "resolved": "ghcr.io/acme/api@sha256:abcd…",
      "signed": true,
      "hasSbomReferrers": true,
      "policyVerdict": "pass|warn|fail",
      "reasons": ["unsigned base image", "missing SBOM"]
    }
  ],
  "decision": "Allow|Deny",
  "ttlSeconds": 300
}
```

### 2.3 Schema negotiation & hashing guarantees

* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
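
The negotiation rule is compact enough to sketch. A minimal Go illustration, assuming the `<schema>@v<major>.<minor>` wire format above; the `Version`/`Negotiate` names are placeholders, not the actual `ZastavaContractVersions` surface:

```go
package contracts

import (
	"fmt"
	"strconv"
	"strings"
)

// Version models the "<major>.<minor>" part of a schemaVersion tag.
type Version struct{ Major, Minor int }

// Parse splits e.g. "zastava.runtime.event@v1.3" into schema name and version.
func Parse(s string) (schema string, v Version, err error) {
	name, tag, ok := strings.Cut(s, "@v")
	if !ok {
		return "", Version{}, fmt.Errorf("missing @v tag: %q", s)
	}
	major, minor, ok := strings.Cut(tag, ".")
	if !ok {
		return "", Version{}, fmt.Errorf("missing minor version: %q", s)
	}
	if v.Major, err = strconv.Atoi(major); err != nil {
		return "", Version{}, err
	}
	if v.Minor, err = strconv.Atoi(minor); err != nil {
		return "", Version{}, err
	}
	return name, v, nil
}

// Negotiate keeps the major line in lockstep and picks the highest minor both
// sides support; callers fall back to the local default (v1.0) on mismatch.
func Negotiate(local, remote Version) (Version, bool) {
	if local.Major != remote.Major {
		return Version{}, false
	}
	minor := local.Minor
	if remote.Minor < minor {
		minor = remote.Minor
	}
	return Version{Major: local.Major, Minor: minor}, true
}
```

Negotiating `v1.3` against `v1.1` yields `v1.1`; a `v2.x` peer produces no overlap, so the caller falls back to `@v1.0`.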

---

## 3) Observer — node agent (DaemonSet)

### 3.1 Responsibilities

* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC read‑only) or `/var/log/containers/*.log` tail fallback.
* **Resolve** container → image digest, mount point rootfs.
* **Trace entrypoint**: attach **short‑lived** nsenter/exec to PID 1 in container, parse shell for `exec` chain (bounded depth), record **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size); see the sampling sketch after this list.
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):

  * Image signature presence (if cosign policies are local; else ask backend).
  * SBOM **referrers** presence (HEAD to registry, optional).
  * Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.
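
As a concrete reference for the loaded-libs sampling, a minimal Go sketch that walks `/proc/<pid>/maps` and returns the distinct file-backed mappings; hashing and the §3.2 byte/file caps are deliberately omitted, and `procRoot` stands in for the `/host/proc` mount:

```go
package observer

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

// LoadedLibs returns the sorted set of file paths mapped into pid's address
// space, skipping pseudo-entries such as [heap], [stack] and [vdso].
func LoadedLibs(procRoot string, pid int) ([]string, error) {
	f, err := os.Open(fmt.Sprintf("%s/%d/maps", procRoot, pid))
	if err != nil {
		return nil, err
	}
	defer f.Close()

	seen := map[string]struct{}{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line: address perms offset dev inode [pathname]
		fields := strings.Fields(sc.Text())
		if len(fields) < 6 {
			continue // anonymous mapping with no backing file
		}
		path := strings.Join(fields[5:], " ")
		if strings.HasPrefix(path, "[") || strings.HasPrefix(path, "anon_inode:") {
			continue // [vdso], [stack], and friends
		}
		seen[path] = struct{}{}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	libs := make([]string, 0, len(seen))
	for p := range seen {
		libs = append(libs, p)
	}
	sort.Strings(libs) // deterministic ordering, per the module guardrails
	return libs, nil
}
```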

### 3.2 Privileges & mounts (K8s)

* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
* **Capabilities:** `CAP_SYS_PTRACE` (optional if using nsenter trace), `CAP_DAC_READ_SEARCH`.
* **Host mounts (read‑only):**

  * `/proc` (host) → `/host/proc`
  * `/run/containerd/containerd.sock` (or CRI‑O socket)
  * `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** cluster‑internal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.

### 3.3 Event batching

* Buffer ND‑JSON; flush by **N events** or **2 s**.
* Backpressure: local disk ring buffer (50 MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.
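
The flush-by-N-or-2s rule maps naturally onto a select loop. A minimal Go sketch under those assumptions; the channel/callback shapes are illustrative, and the disk ring buffer, compression, and retry paths are left out:

```go
package observer

import "time"

// batchLoop drains pre-serialised ND-JSON events and flushes them either when
// the batch reaches maxBatch entries or when the 2-second ticker fires.
func batchLoop(events <-chan []byte, flush func(batch [][]byte), maxBatch int) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	var buf [][]byte
	emit := func() {
		if len(buf) > 0 {
			flush(buf)
			buf = nil
		}
	}
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				emit() // drain remaining events on shutdown
				return
			}
			buf = append(buf, ev)
			if len(buf) >= maxBatch {
				emit()
			}
		case <-ticker.C:
			emit()
		}
	}
}
```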

### 3.4 Build-id capture & validation workflow

1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope (see the extraction sketch after this list).
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:
   * calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
   * piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.
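
Go's `debug/elf` exposes `PT_NOTE` segments but not parsed notes, so a sketch of step 1 has to walk the note stream by hand; `NT_GNU_BUILD_ID` is note type 3 with owner name `GNU`. A hedged, self-contained illustration:

```go
package observer

import (
	"bytes"
	"debug/elf"
	"encoding/hex"
	"errors"
	"io"
)

// GNUBuildID extracts the NT_GNU_BUILD_ID note from an ELF binary, e.g.
// /proc/<pid>/exe, and returns it as normalised lower-case hex.
func GNUBuildID(exePath string) (string, error) {
	f, err := elf.Open(exePath)
	if err != nil {
		return "", err
	}
	defer f.Close()

	for _, prog := range f.Progs {
		if prog.Type != elf.PT_NOTE {
			continue
		}
		data, err := io.ReadAll(prog.Open())
		if err != nil {
			return "", err
		}
		pad := func(n uint32) uint32 { return (n + 3) &^ 3 } // 4-byte alignment
		for len(data) >= 12 {
			// Note header: namesz, descsz, type (4 bytes each).
			namesz := f.ByteOrder.Uint32(data[0:4])
			descsz := f.ByteOrder.Uint32(data[4:8])
			typ := f.ByteOrder.Uint32(data[8:12])
			data = data[12:]
			if uint32(len(data)) < pad(namesz)+pad(descsz) {
				break // truncated note segment
			}
			name := data[:namesz]
			desc := data[pad(namesz) : pad(namesz)+descsz]
			data = data[pad(namesz)+pad(descsz):]
			if typ == 3 /* NT_GNU_BUILD_ID */ && bytes.Equal(name, []byte("GNU\x00")) {
				return hex.EncodeToString(desc), nil // already lower-case hex
			}
		}
	}
	return "", errors.New("no NT_GNU_BUILD_ID note")
}
```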

---

## 4) Admission Webhook (Kubernetes)

### 4.1 Gate criteria

Configurable policy (fetched from backend and cached):

* **Image signature**: must be cosign‑verifiable to configured key(s) or keyless identities.
* **SBOM availability**: at least one **CycloneDX** referrer or **Scanner.WebService** catalog entry.
* **Scanner policy verdict**: backend `PASS` required for namespaces/labels matching rules; allow `WARN` if configured.
* **Registry allowlists/denylists**.
* **Tag bans** (e.g., `:latest`).
* **Base image allowlists** (by digest).

### 4.2 Flow

```mermaid
sequenceDiagram
  autonumber
  participant K8s as API Server
  participant WH as Zastava Webhook
  participant SW as Scanner.WebService

  K8s->>WH: AdmissionReview(Pod)
  WH->>WH: Resolve images to digests (remote HEAD/pull if needed)
  WH->>SW: POST /policy/runtime { digests, namespace, labels }
  SW-->>WH: { per-image: {signed, hasSbom, verdict, reasons}, ttl }
  alt All pass
    WH-->>K8s: AdmissionResponse(Allow, ttl)
  else Any fail (enforce=true)
    WH-->>K8s: AdmissionResponse(Deny, message)
  end
```

**Caching:** Per‑digest result cached `ttlSeconds` (default 300 s). **Fail‑open** or **fail‑closed** is configurable per namespace.
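
A minimal sketch of that per-digest cache, assuming a mutex-guarded map keyed by image digest; field names are illustrative and eviction is lazy (expired entries are dropped on read):

```go
package webhook

import (
	"sync"
	"time"
)

// Verdict mirrors the per-image fields returned by /policy/runtime.
type Verdict struct {
	Signed, HasSbom bool
	PolicyVerdict   string
	Reasons         []string
}

type entry struct {
	v       Verdict
	expires time.Time
}

// VerdictCache caches backend verdicts for the ttlSeconds the backend granted.
type VerdictCache struct {
	mu sync.Mutex
	m  map[string]entry // keyed by image digest
}

func NewVerdictCache() *VerdictCache { return &VerdictCache{m: map[string]entry{}} }

func (c *VerdictCache) Get(digest string) (Verdict, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[digest]
	if !ok || time.Now().After(e.expires) {
		delete(c.m, digest) // lazy eviction of expired entries
		return Verdict{}, false
	}
	return e.v, true
}

func (c *VerdictCache) Put(digest string, v Verdict, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[digest] = entry{v: v, expires: time.Now().Add(ttl)}
}
```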

### 4.3 TLS & HA

* Webhook has its own **serving cert** signed by the cluster CA (or a custom cert + CA bundle supplied in configuration).
* Deployment ≥ 2 replicas; **leaderless**; stateless.
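
For orientation, a bare-bones sketch of the `/admit` endpoint shape, using hand-rolled `AdmissionReview` structs instead of the `k8s.io` client libraries and a stubbed `decide` callback; the production webhook adds TLS, metrics, the verdict cache above, and fail-open/fail-closed handling:

```go
package webhook

import (
	"encoding/json"
	"net/http"
)

// Hand-rolled subset of the AdmissionReview wire format; enough for a sketch.
type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID       string          `json:"uid"`
	Namespace string          `json:"namespace"`
	Object    json.RawMessage `json:"object"` // the Pod spec under review
}

type admissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Result  *status `json:"result,omitempty"`
}

type status struct {
	Message string `json:"message,omitempty"`
}

// admitHandler wires a decide callback (digest resolution + policy lookup)
// into the AdmissionReview request/response envelope.
func admitHandler(decide func(ns string, pod json.RawMessage) (allowed bool, msg string)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var review admissionReview
		if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
			http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
			return
		}
		allowed, msg := decide(review.Request.Namespace, review.Request.Object)
		resp := &admissionResponse{UID: review.Request.UID, Allowed: allowed}
		if !allowed {
			resp.Result = &status{Message: msg}
		}
		review.Response, review.Request = resp, nil
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(review)
	}
}
```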

---

## 5) Backend integration (Scanner.WebService)

### 5.1 Ingestion endpoint

`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*

* Validates event schema; enforces rate caps by tenant/node; persists to **Mongo** (`runtime.events` capped collection or regular with TTL).
* Performs **correlation**:

  * Attach nearest **image SBOM** (inventory/usage) and **BOM‑Index** if known.
  * If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).

### 5.2 Policy decision API (for webhook)

`POST /api/v1/scanner/policy/runtime`

The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.

Request:

```json
{
  "namespace": "payments",
  "labels": { "app": "api", "env": "prod" },
  "images": ["ghcr.io/acme/api@sha256:...", "ghcr.io/acme/nginx@sha256:..."]
}
```

Response:

```json
{
  "ttlSeconds": 300,
  "results": {
    "ghcr.io/acme/api@sha256:...": {
      "signed": true,
      "hasSbom": true,
      "policyVerdict": "pass",
      "reasons": [],
      "rekor": { "uuid": "..." }
    },
    "ghcr.io/acme/nginx@sha256:...": {
      "signed": false,
      "hasSbom": false,
      "policyVerdict": "fail",
      "reasons": ["unsigned", "missing SBOM"]
    }
  }
}
```
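
A typed Go sketch of this exchange, with struct tags mirroring the documented JSON; the HTTP client, OpTok/DPoP headers, and retries are assumed to be supplied by the caller:

```go
package webhook

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

type runtimePolicyRequest struct {
	Namespace string            `json:"namespace"`
	Labels    map[string]string `json:"labels"`
	Images    []string          `json:"images"`
}

type runtimePolicyResult struct {
	Signed        bool     `json:"signed"`
	HasSbom       bool     `json:"hasSbom"`
	PolicyVerdict string   `json:"policyVerdict"`
	Reasons       []string `json:"reasons"`
}

type runtimePolicyResponse struct {
	TTLSeconds int                            `json:"ttlSeconds"`
	Results    map[string]runtimePolicyResult `json:"results"`
}

// fetchVerdicts posts the digests for one AdmissionReview and decodes the
// per-image verdicts; auth headers are expected to be set by hc's transport.
func fetchVerdicts(ctx context.Context, hc *http.Client, base string, req runtimePolicyRequest) (*runtimePolicyResponse, error) {
	body, _ := json.Marshal(req) // marshalling these plain structs cannot fail
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost,
		base+"/api/v1/scanner/policy/runtime", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(httpReq)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("policy endpoint returned %s", resp.Status)
	}
	var out runtimePolicyResponse
	return &out, json.NewDecoder(resp.Body).Decode(&out)
}
```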

---

## 6) Configuration (YAML)

```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null      # Optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
    engine: "auto"    # containerd|cri-o|docker|auto
    procfs: "/host/proc"
    collect:
      entryTrace: true
      loadedLibs: true
      maxLibs: 256
      maxHashBytesPerContainer: 64_000_000
      maxDepth: 48
  admission:
    enforce: true
    failOpenNamespaces: ["dev", "test"]
    verify:
      imageSignature: true
      sbomReferrer: true
      scannerPolicyPass: true
    cacheTtlSeconds: 300
    resolveTags: true          # do remote digest resolution for tag-only images
  limits:
    eventsPerSecond: 50
    burst: 200
    perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
```

> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.

---

## 7) Security posture

* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; the webhook does **not** require client auth from the API server (Kubernetes handles that).
* **Least privileges**: read‑only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per‑node caps; per‑tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without the `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.

---

## 8) Metrics, logs, tracing

**Observer**

* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`

**Webhook**

* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`

**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.

---

## 9) Performance & scale targets

* **Observer**: ≤ **30 ms** to sample `/proc/<pid>/maps` and compute quick hashes for ≤ 64 files; ≤ **200 ms** for the full library set (256 libs).
* **Webhook**: P95 ≤ **8 ms** with warm cache; ≤ **50 ms** with one backend round‑trip.
* **Throughput**: 1k admission requests/min/replica; 5k runtime events/min/node with batching.

---

## 10) Drift detection model

**Signals**

* **Process drift**: terminal program differs from the **EntryTrace** baseline.
* **Library drift**: loaded DSOs not present in the **Usage** SBOM view.
* **Filesystem drift**: new executable files under `/usr/local/bin`, `/opt`, `/app` with **mtime** after image creation.
* **Network drift** (optional): listening sockets on unexpected ports (from policy).

**Action**

* Emit `DRIFT` event with evidence; backend can **auto‑queue** a delta scan; policy may **escalate** to alert/block (Admission cannot block already‑running pods; rely on K8s policies/PodSecurity or operator action).
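
The library-drift signal reduces to a set difference between the runtime observation and the Usage SBOM view, as in this small sketch (names illustrative):

```go
package observer

// libraryDrift flags any DSO loaded at runtime that is absent from the Usage
// SBOM view; each hit becomes candidate evidence for a DRIFT event.
func libraryDrift(loaded []string, usageView map[string]struct{}) []string {
	var drifted []string
	for _, lib := range loaded {
		if _, ok := usageView[lib]; !ok {
			drifted = append(drifted, lib)
		}
	}
	return drifted
}
```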

---

## 11) Test matrix

* **Engines**: containerd, CRI‑O, Docker; ensure PID resolution and rootfs mapping.
* **EntryTrace**: bash features (case, if, run‑parts, `.`/`source`), language launchers (python/node/java).
* **Procfs**: multiple arches, musl/glibc images; static binaries (maps minimal).
* **Admission**: unsigned images, missing SBOM referrers, tag‑only images, digest resolution, backend latency, cache TTL.
* **Perf/soak**: 500 Pods/node churn; webhook under HPA growth.
* **Security**: attempted privilege escalation disabled, read‑only mounts enforced, rate‑limit abuse.
* **Failure injection**: backend down (observer buffers, webhook fail‑open/closed), registry throttling, containerd socket unavailable.

---

## 12) Failure modes & responses

| Condition                       | Observer behavior                                | Webhook behavior                                       |
| ------------------------------- | ------------------------------------------------ | ------------------------------------------------------ |
| Backend unreachable             | Buffer to disk; drop after cap; emit metric      | **Fail‑open/closed** per namespace config              |
| PID vanished mid‑sample         | Retry once; emit partial evidence                | N/A                                                    |
| CRI socket missing              | Fall back to K8s events only (reduced fidelity)  | N/A                                                    |
| Registry digest resolve blocked | Defer to backend; mark `resolve=unknown`         | Deny or allow per `resolveTags` & `failOpenNamespaces` |
| Excessive events                | Apply local rate limit, coalesce                 | N/A                                                    |

---

## 13) Deployment notes (K8s)

**DaemonSet (snippet):**

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: zastava-observer, namespace: stellaops }
spec:
  template:
    spec:
      serviceAccountName: zastava
      hostPID: true
      containers:
      - name: observer
        image: stellaops/zastava-observer:2.3
        securityContext:
          runAsUser: 0
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities: { add: ["SYS_PTRACE","DAC_READ_SEARCH"] }
        volumeMounts:
        - { name: proc, mountPath: /host/proc, readOnly: true }
        - { name: containerd-sock, mountPath: /run/containerd/containerd.sock, readOnly: true }
        - { name: containerd-state, mountPath: /var/lib/containerd, readOnly: true }
      volumes:
      - { name: proc, hostPath: { path: /proc } }
      - { name: containerd-sock, hostPath: { path: /run/containerd/containerd.sock } }
      - { name: containerd-state, hostPath: { path: /var/lib/containerd } }
```

**Webhook (snippet):**

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata: { name: zastava-webhook }
webhooks:
- name: gate.zastava.stella-ops.org
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore   # or Fail
  rules:
  - operations: ["CREATE","UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: stellaops
      name: zastava-webhook
      path: /admit
    caBundle: <base64 CA>
```

---

## 14) Implementation notes

* **Language**: Rust (observer) for low‑latency `/proc` parsing; Go/.NET are viable too. The webhook can be .NET 10 for parity with the backend.
* **CRI drivers**: pluggable (`containerd`, `cri-o`, `docker`). Prefer CRI over parsing logs.
* **Shell parser**: re‑use the Scanner.EntryTrace grammar for consistent results (compile to WASM if the observer is Rust/Go).
* **Hashing**: `BLAKE3` for fast local checks and dedupe, then compute `sha256` for published evidence (or hash with `sha256` directly when the budget allows).
* **Resilience**: never block container start; the observer is **passive**; only the webhook decides allow/deny.

---

## 15) Roadmap

* **eBPF** option for syscall/library load tracing (kernel‑level, opt‑in).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live used‑by‑entrypoint synthesis**: send a compact bitset diff to the backend to tighten the Usage view.
* **Admission dry‑run** dashboards (simulate block lists before enforcing).

docs/modules/zastava/implementation_plan.md (new file, +19)
@@ -0,0 +1,19 @@
# Implementation plan — Zastava

## Current objectives
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.

## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.

## Backlog references
- ZASTAVA runtime tasks in ../../TASKS.md.
- Webhook smoke tests tracked in src/Zastava/**/TASKS.md.

## Coordination
- Review ./AGENTS.md before picking up new work.
- Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md.
- Update this plan whenever scope, dependencies, or guardrails change.

docs/modules/zastava/operations/runtime-grafana-dashboard.json (new file, +205)
@@ -0,0 +1,205 @@
{
  "title": "Zastava Runtime Plane",
  "uid": "zastava-runtime",
  "timezone": "utc",
  "schemaVersion": 38,
  "version": 1,
  "refresh": "30s",
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "panels": [
    {
      "id": 1,
      "type": "timeseries",
      "title": "Observer Event Rate",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "targets": [
        {
          "expr": "sum by (tenant,component,kind) (rate(zastava_runtime_events_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{tenant}}/{{component}}/{{kind}}"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "showLegend": true,
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 2,
      "type": "timeseries",
      "title": "Admission Decisions",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "targets": [
        {
          "expr": "sum by (decision) (rate(zastava_admission_decisions_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{decision}}"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 20
              }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "showLegend": true,
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 3,
      "type": "timeseries",
      "title": "Backend Latency P95",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket{tenant=~\"$tenant\"}[5m])))",
          "legendFormat": "p95 latency"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "orange",
                "value": 500
              },
              {
                "color": "red",
                "value": 750
              }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "showLegend": true,
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    }
  ],
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "label": "Prometheus",
        "current": {
          "text": "Prometheus",
          "value": "Prometheus"
        }
      },
      {
        "name": "tenant",
        "type": "query",
        "datasource": {
          "type": "prometheus",
          "uid": "${datasource}"
        },
        "definition": "label_values(zastava_runtime_events_total, tenant)",
        "refresh": 1,
        "hide": 0,
        "current": {
          "text": ".*",
          "value": ".*"
        },
        "regex": "",
        "includeAll": true,
        "multi": true,
        "sort": 1
      }
    ]
  },
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "type": "tags",
        "datasource": {
          "type": "prometheus",
          "uid": "${datasource}"
        },
        "enable": true,
        "iconColor": "rgba(255, 96, 96, 1)"
      }
    ]
  }
}

docs/modules/zastava/operations/runtime-prometheus-rules.yaml (new file, +31)
@@ -0,0 +1,31 @@
groups:
  - name: zastava-runtime
    interval: 30s
    rules:
      - alert: ZastavaRuntimeEventsSilent
        expr: sum(rate(zastava_runtime_events_total[10m])) == 0
        for: 15m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Observer events stalled"
          description: "No runtime events emitted in the last 15 minutes. Check observer DaemonSet health and container runtime mounts."
      - alert: ZastavaRuntimeBackendLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
        for: 10m
        labels:
          severity: critical
          service: zastava-runtime
        annotations:
          summary: "Runtime backend latency p95 above 750 ms"
          description: "Latency to Scanner runtime APIs is elevated. Inspect Scanner.WebService readiness, Authority OpTok issuance, and cluster network."
      - alert: ZastavaAdmissionDenySpike
        expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60 > 20
        for: 5m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Admission webhook denies exceeding threshold"
          description: "Webhook is denying more than 20 pod admissions per minute. Confirm policy verdicts and consider a fail-open exception for impacted namespaces."

docs/modules/zastava/operations/runtime.md (new file, +174)
@@ -0,0 +1,174 @@
# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`) and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst` containing the container images,
  Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – the Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
  live outside git. For air-gapped installs copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer task 12-004 completes. |

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy the Observer DaemonSet and Webhook chart:
   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```
5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows an
     `Issued runtime OpTok` audit line with the DPoP token type.
   - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.

### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests:
   - Apply an unsigned Pod manifest → expect `deny` (policy fail).
   - Apply a signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with Git SHA + Helm chart version.

### 2.3 Rollback

1. Use the Helm revision history: `helm history zastava-runtime`.
2. Roll back: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).

## 3. Authority & security guardrails

- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the
  `authority.token.issue` scope with decision data; its absence indicates misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable only in
  lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate nightly; the preference is to disable them once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management.
  Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
  to confirm key rotation.

## 4. Incident response

### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If fallback engaged, verify the static token's validity duration; rotate the secret if older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys.

### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires when P95 exceeds 750 ms for 10 minutes.
2. Check backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the observer buffer may throttle. Confirm the disk-backed queue size via
   `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application
   owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/modules/zastava/operations/runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.

## 6. Observability assets

- Prometheus alert rules: `docs/modules/zastava/operations/runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/modules/zastava/operations/runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
  the Offline Kit manifest.

## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note) and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:

1. Capture the hash from CLI/webhook/Scanner API—for example:
   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```
   Copy one of the `Build IDs` (e.g.
   `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`; see the helper sketch after this list) and check it exists:
   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```
3. If the file is missing, rehydrate it from Offline Kit bundles or the
   `debug-store` object bucket (mirror of release artefacts):
   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```
4. Confirm the running process advertises the same GNU build-id before
   symbolising:
   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
   in `debuginfod` for fleet-wide symbol resolution:
   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
   runtime events indicate stripped binaries without the GNU note—schedule a
   rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store
   allowlist so the scanner can surface a fallback symbol package.
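
A small Go helper for the step-2 path derivation, assuming the `debug/.build-id/<aa>/<rest>.debug` layout from the architecture doc; the `root` argument would be e.g. `/var/opt/debug`:

```go
package debugstore

import "fmt"

// debugPath maps a GNU build-id hash to its .build-id debug file location:
// the first two hex characters form the directory, the remainder the name.
func debugPath(root, buildID string) (string, error) {
	if len(buildID) < 3 {
		return "", fmt.Errorf("build-id too short: %q", buildID)
	}
	return fmt.Sprintf("%s/.build-id/%s/%s.debug", root, buildID[:2], buildID[2:]), nil
}
```

For example, `debugPath("/var/opt/debug", "5f0c7c3c…")` yields `/var/opt/debug/.build-id/5f/0c7c3c….debug`, matching the `ls` check above.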

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.