# component_architecture_zastava.md — **Stella Ops Zastava** (2025Q4)

> **Scope.** Implementation‑ready architecture for **Zastava**: the **runtime inspector/enforcer** that watches real workloads, detects drift from the scanned baseline, verifies image/SBOM/attestation posture, and (optionally) **admits/blocks** deployments. Includes Kubernetes & plain‑Docker topologies, data contracts, APIs, security posture, performance targets, test matrices, and failure modes.

---

## 0) Mission & boundaries

**Mission.** Give operators **ground‑truth** from running environments and a **fast guardrail** before workloads land:

* **Observer:** inventory containers, entrypoints actually executed, and DSOs actually loaded; verify **image signature**, **SBOM referrers**, and **attestation** presence; detect **drift** (unexpected processes/paths) and **policy violations**; publish **runtime events** to Scanner.WebService.
* **Admission (optional):** Kubernetes ValidatingAdmissionWebhook that enforces minimal posture (signed images, SBOM availability, known base images, policy PASS) **pre‑flight**.

**Boundaries.**

* Zastava **does not** compute SBOMs and does not sign; it **consumes** Scanner/WebService outputs and **enforces** backend policy verdicts.
* Zastava can **request** a delta scan when the baseline is missing/stale, but scanning is done by **Scanner.Worker**.
* On non‑K8s Docker hosts, Zastava runs as a host service with **observer‑only** features.

---

## 1) Topology & processes

### 1.1 Components (Kubernetes)

```
stellaops/zastava-observer    # DaemonSet on every node (read-only host mounts)
stellaops/zastava-webhook     # ValidatingAdmissionWebhook (Deployment, 2+ replicas)
```

### 1.2 Components (Docker/VM)

```
stellaops/zastava-agent       # System service; watch Docker events; observer only
```

### 1.3 Dependencies

* **Authority** (OIDC): short OpToks (DPoP/mTLS) for API calls to Scanner.WebService.
* **Scanner.WebService**: `/runtime/events` ingestion; `/policy/runtime` fetch.
* **OCI Registry** (optional): for direct referrers/sig checks if not delegated to backend.
* **Container runtime**: containerd/CRI‑O/Docker (read interfaces only).
* **Kubernetes API** (watch Pods in cluster; validating webhook).
* **Host mounts** (K8s DaemonSet): `/proc`, `/var/lib/containerd` (or CRI‑O), `/run/containerd/containerd.sock` (optional, read‑only).

---

## 2) Data contracts

### 2.1 Runtime event (observer → Scanner.WebService)

```json
{
  "eventId": "9f6a…",
  "when": "2025-10-17T12:34:56Z",
  "kind": "CONTAINER_START|CONTAINER_STOP|DRIFT|POLICY_VIOLATION|ATTESTATION_STATUS",
  "tenant": "tenant-01",
  "node": "ip-10-0-1-23",
  "runtime": { "engine": "containerd", "version": "1.7.19" },
  "workload": {
    "platform": "kubernetes",
    "namespace": "payments",
    "pod": "api-7c9fbbd8b7-ktd84",
    "container": "api",
    "containerId": "containerd://...",
    "imageRef": "ghcr.io/acme/api@sha256:abcd…",
    "owner": { "kind": "Deployment", "name": "api" }
  },
  "process": {
    "pid": 12345,
    "entrypoint": ["/entrypoint.sh", "--serve"],
    "entryTrace": [
      {"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
      {"file":"","op":"python","target":"/opt/app/server.py"}
    ],
    "buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
  },
  "loadedLibs": [
    { "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
    { "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
  ],
  "posture": {
    "imageSigned": true,
    "sbomReferrer": "present|missing",
    "attestation": { "uuid": "rekor-uuid", "verified": true }
  },
  "delta": {
    "baselineImageDigest": "sha256:abcd…",
    "changedFiles": ["/opt/app/server.py"],        // optional quick signal
    "newBinaries": [{ "path":"/usr/local/bin/helper","sha256":"…" }]
  },
  "evidence": [
    {"signal":"procfs.maps","value":"/lib/.../libssl.so.3@0x7f..."},
    {"signal":"cri.task.inspect","value":"pid=12345"},
    {"signal":"registry.referrers","value":"sbom: application/vnd.cyclonedx+json"}
  ]
}
```

### 2.2 Admission decision (webhook → API server)

```json
{
  "admissionId": "…",
  "namespace": "payments",
  "podSpecDigest": "sha256:…",
  "images": [
    {
      "name": "ghcr.io/acme/api:1.2.3",
      "resolved": "ghcr.io/acme/api@sha256:abcd…",
      "signed": true,
      "hasSbomReferrers": true,
      "policyVerdict": "pass|warn|fail",
      "reasons": ["unsigned base image", "missing SBOM"]
    }
  ],
  "decision": "Allow|Deny",
  "ttlSeconds": 300
}
```

### 2.3 Schema negotiation & hashing guarantees

* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.

---

## 3) Observer — node agent (DaemonSet)

### 3.1 Responsibilities

* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC read‑only) or `/var/log/containers/*.log` tail fallback.
* **Resolve** container → image digest, mount point rootfs.
* **Trace entrypoint**: attach **short‑lived** nsenter/exec to PID 1 in container, parse shell for `exec` chain (bounded depth), record **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size).
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):

  * Image signature presence (if cosign policies are local; else ask backend).
  * SBOM **referrers** presence (HEAD to registry, optional).
  * Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.

### 3.2 Privileges & mounts (K8s)

* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
* **Capabilities:** `CAP_SYS_PTRACE` (optional if using nsenter trace), `CAP_DAC_READ_SEARCH`.
* **Host mounts (read‑only):**

  * `/proc` (host) → `/host/proc`
  * `/run/containerd/containerd.sock` (or CRI‑O socket)
  * `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** cluster‑internal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.

### 3.3 Event batching

* Buffer ND‑JSON; flush by **N events** or **2 s**.
* Backpressure: local disk ring buffer (50 MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.

---

## 4) Admission Webhook (Kubernetes)

### 4.1 Gate criteria

Configurable policy (fetched from backend and cached):

* **Image signature**: must be cosign‑verifiable to configured key(s) or keyless identities.
* **SBOM availability**: at least one **CycloneDX** referrer or **Scanner.WebService** catalog entry.
* **Scanner policy verdict**: backend `PASS` required for namespaces/labels matching rules; allow `WARN` if configured.
* **Registry allowlists/denylists**.
* **Tag bans** (e.g., `:latest`).
* **Base image allowlists** (by digest).
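
As a rough illustration of how the gate criteria above could combine into one decision, here is a minimal Python sketch. `ImageFacts`, `GatePolicy`, `split_ref`, and `evaluate` are hypothetical names, and the policy shape is an assumption because the backend policy format is not shown here. The sketch assumes fully qualified image references and leaves out the base image allowlist, which needs layer data the webhook would get from the backend.

```python
from dataclasses import dataclass, field


@dataclass
class ImageFacts:
    """Per-image facts, mirroring fields of the admission decision contract."""
    name: str            # image reference as written in the Pod spec
    signed: bool
    has_sbom: bool
    policy_verdict: str  # "pass" | "warn" | "fail"


@dataclass
class GatePolicy:
    """Hypothetical policy shape (assumption, not the real backend format)."""
    allowed_registries: set = field(default_factory=set)  # empty = no allowlist
    banned_tags: set = field(default_factory=lambda: {"latest"})
    allow_warn: bool = False


def split_ref(ref):
    """Return (registry, tag); tag is None for digest-only references.

    Assumes a fully qualified reference such as ghcr.io/acme/api:1.2.3.
    """
    registry = ref.split("/", 1)[0]
    last = ref.rsplit("/", 1)[-1].split("@", 1)[0]
    tag = last.split(":", 1)[1] if ":" in last else None
    return registry, tag


def evaluate(image, policy):
    """Collect every failed criterion so the Deny message lists all reasons."""
    reasons = []
    registry, tag = split_ref(image.name)
    if policy.allowed_registries and registry not in policy.allowed_registries:
        reasons.append(f"registry not allowed: {registry}")
    if tag in policy.banned_tags:
        reasons.append(f"banned tag: {tag}")
    if not image.signed:
        reasons.append("unsigned")
    if not image.has_sbom:
        reasons.append("missing SBOM")
    verdict = image.policy_verdict
    if verdict == "fail" or (verdict == "warn" and not policy.allow_warn):
        reasons.append(f"policy verdict: {verdict}")
    return ("Deny" if reasons else "Allow", reasons)
```

Collecting all reasons instead of stopping at the first failure keeps the admission message useful to the operator who has to fix the Pod spec.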

### 4.2 Flow

```mermaid
sequenceDiagram
  autonumber
  participant K8s as API Server
  participant WH as Zastava Webhook
  participant SW as Scanner.WebService

  K8s->>WH: AdmissionReview(Pod)
  WH->>WH: Resolve images to digests (remote HEAD/pull if needed)
  WH->>SW: POST /policy/runtime { digests, namespace, labels }
  SW-->>WH: { per-image: {signed, hasSbom, verdict, reasons}, ttl }
  alt All pass
    WH-->>K8s: AdmissionResponse(Allow, ttl)
  else Any fail (enforce=true)
    WH-->>K8s: AdmissionResponse(Deny, message)
  end
```

**Caching:** Per‑digest result cached `ttlSeconds` (default 300 s). **Fail‑open** or **fail‑closed** is configurable per namespace.

### 4.3 TLS & HA

* Webhook has its own **serving cert** signed by cluster CA (or custom cert + CA bundle in the webhook configuration).
* Deployment ≥ 2 replicas; **leaderless**; stateless.

---

## 5) Backend integration (Scanner.WebService)

### 5.1 Ingestion endpoint

`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*

* Validates event schema; enforces rate caps by tenant/node; persists to **Mongo** (`runtime.events` capped collection or regular with TTL).
* Performs **correlation**:

  * Attach nearest **image SBOM** (inventory/usage) and **BOM‑Index** if known.
  * If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).

### 5.2 Policy decision API (for webhook)

`POST /api/v1/scanner/policy/runtime`

The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.

Request:

```json
{
  "namespace": "payments",
  "labels": { "app": "api", "env": "prod" },
  "images": ["ghcr.io/acme/api@sha256:...", "ghcr.io/acme/nginx@sha256:..."]
}
```

Response:

```json
{
  "ttlSeconds": 300,
  "results": {
    "ghcr.io/acme/api@sha256:...": {
      "signed": true,
      "hasSbom": true,
      "policyVerdict": "pass",
      "reasons": [],
      "rekor": { "uuid": "..." }
    },
    "ghcr.io/acme/nginx@sha256:...": {
      "signed": false,
      "hasSbom": false,
      "policyVerdict": "fail",
      "reasons": ["unsigned", "missing SBOM"]
    }
  }
}
```

---

## 6) Configuration (YAML)

```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null # Optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
    engine: "auto"    # containerd|cri-o|docker|auto
    procfs: "/host/proc"
    collect:
      entryTrace: true
      loadedLibs: true
      maxLibs: 256
      maxHashBytesPerContainer: 64_000_000
      maxDepth: 48
  admission:
    enforce: true
    failOpenNamespaces: ["dev", "test"]
    verify:
      imageSignature: true
      sbomReferrer: true
      scannerPolicyPass: true
    cacheTtlSeconds: 300
    resolveTags: true          # do remote digest resolution for tag-only images
  limits:
    eventsPerSecond: 50
    burst: 200
    perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
```

> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.

---

## 7) Security posture

* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from the API server (Kubernetes handles that).
* **Least privilege**: read‑only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per‑node caps; per‑tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.

---

## 8) Metrics, logs, tracing

**Observer**

* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`

**Webhook**

* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`

**Logs** (structured): node, pod, image digest, decision, reasons.

**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.

---

## 9) Performance & scale targets

* **Observer**: ≤ **30 ms** to sample `/proc/<pid>/maps` and compute quick hashes for ≤ 64 files; ≤ **200 ms** for full library set (256 libs).
* **Webhook**: P95 ≤ **8 ms** with warm cache; ≤ **50 ms** with one backend round‑trip.
* **Throughput**: 1k admission requests/min/replica; 5k runtime events/min/node with batching.

---

## 10) Drift detection model

**Signals**

* **Process drift**: terminal program differs from **EntryTrace** baseline.
* **Library drift**: loaded DSOs not present in **Usage** SBOM view.
* **Filesystem drift**: new executable files under `/usr/local/bin`, `/opt`, `/app` with **mtime** after image creation.
* **Network drift** (optional): listening sockets on unexpected ports (from policy).

**Action**

* Emit `DRIFT` event with evidence; backend can **auto‑queue** a delta scan; policy may **escalate** to alert/block (Admission cannot block already‑running pods; rely on K8s policies/PodSecurity or operator action).

---

## 11) Test matrix

* **Engines**: containerd, CRI‑O, Docker; ensure PID resolution and rootfs mapping.
* **EntryTrace**: bash features (case, if, run‑parts, `.`/`source`), language launchers (python/node/java).
* **Procfs**: multiple arches, musl/glibc images; static binaries (maps minimal).
* **Admission**: unsigned images, missing SBOM referrers, tag‑only images, digest resolution, backend latency, cache TTL.
* **Perf/soak**: 500 Pods/node churn; webhook under HPA growth.
* **Security**: privilege‑escalation attempts blocked, read‑only mounts enforced, rate‑limit abuse.
* **Failure injection**: backend down (observer buffers, webhook fail‑open/closed), registry throttling, containerd socket unavailable.
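
The backend‑down case in the failure‑injection row can be exercised against a small model of the observer's buffer. The sketch below is an in‑memory stand‑in for the disk ring buffer described in section 3.3, not the actual implementation: `EventBuffer` and its methods are hypothetical names, events are plain byte strings, and the `drops` counter stands in for `zastava.buffer.drops.total`.

```python
from collections import deque


class EventBuffer:
    """Byte-capped FIFO that drops the oldest events once the cap is exceeded."""

    def __init__(self, cap_bytes):
        self.cap_bytes = cap_bytes
        self.size = 0
        self.drops = 0  # stand-in for the zastava.buffer.drops.total metric
        self._queue = deque()

    def push(self, event):
        """Buffer one serialized event, evicting oldest entries to stay under the cap."""
        if len(event) > self.cap_bytes:
            self.drops += 1  # an event larger than the cap can never be stored
            return
        self._queue.append(event)
        self.size += len(event)
        while self.size > self.cap_bytes:
            dropped = self._queue.popleft()
            self.size -= len(dropped)
            self.drops += 1

    def drain(self, send):
        """Send oldest-first; stop at the first failure and keep the rest buffered."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self.size -= len(self._queue.popleft())
            sent += 1
        return sent


# Failure injection: the backend is down while three events arrive.
buf = EventBuffer(cap_bytes=10)
for event in (b"aaaa", b"bbbb", b"cccc"):
    buf.push(event)
backend_down = lambda event: False
print(buf.drain(backend_down), buf.drops, buf.size)
```

A test built on this model would assert that nothing is lost while the buffer is under its cap, that the oldest events go first once it overflows, and that every drop is counted.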

---

## 12) Failure modes & responses

| Condition                       | Observer behavior                              | Webhook behavior                                       |
| ------------------------------- | ---------------------------------------------- | ------------------------------------------------------ |
| Backend unreachable             | Buffer to disk; drop after cap; emit metric    | **Fail‑open/closed** per namespace config              |
| PID vanished mid‑sample         | Retry once; emit partial evidence              | N/A                                                    |
| CRI socket missing              | Fallback to K8s events only (reduced fidelity) | N/A                                                    |
| Registry digest resolve blocked | Defer to backend; mark `resolve=unknown`       | Deny or allow per `resolveTags` & `failOpenNamespaces` |
| Excessive events                | Apply local rate limit, coalesce               | N/A                                                    |

---

## 13) Deployment notes (K8s)

**DaemonSet (snippet):**

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: zastava-observer, namespace: stellaops }
spec:
  template:
    spec:
      serviceAccountName: zastava
      hostPID: true
      containers:
        - name: observer
          image: stellaops/zastava-observer:2.3
          securityContext:
            runAsUser: 0
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities: { add: ["SYS_PTRACE","DAC_READ_SEARCH"] }
          volumeMounts:
            - { name: proc, mountPath: /host/proc, readOnly: true }
            - { name: containerd-sock, mountPath: /run/containerd/containerd.sock, readOnly: true }
            - { name: containerd-state, mountPath: /var/lib/containerd, readOnly: true }
      volumes:
        - { name: proc, hostPath: { path: /proc } }
        - { name: containerd-sock, hostPath: { path: /run/containerd/containerd.sock } }
        - { name: containerd-state, hostPath: { path: /var/lib/containerd } }
```

**Webhook (snippet):**

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
  - name: gate.zastava.stella-ops.org
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore  # or Fail
    rules:
      - operations: ["CREATE","UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: stellaops
        name: zastava-webhook
        path: /admit
      caBundle:
```

---

## 14) Implementation notes

* **Language**: Rust (observer) for low‑latency `/proc` parsing; Go/.NET viable too. Webhook can be .NET 10 for parity with backend.
* **CRI drivers**: pluggable (`containerd`, `cri-o`, `docker`). Prefer CRI over parsing logs.
* **Shell parser**: re‑use Scanner.EntryTrace grammar for consistent results (compile to WASM if observer is Rust/Go).
* **Hashing**: `BLAKE3` for fast local change detection; any digest reported in events is computed as `sha256` (a BLAKE3 digest cannot be converted to one), or `sha256` is computed directly when budget allows.
* **Resilience**: never block container start; observer is **passive**; only webhook decides allow/deny.

---

## 15) Roadmap

* **eBPF** option for syscall/library load tracing (kernel‑level, opt‑in).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live used‑by‑entrypoint synthesis**: send compact bitset diff to backend to tighten Usage view.
* **Admission dry‑run** dashboards (simulate block lists before enforcing).
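
To make the `/proc` parsing mentioned in the implementation notes concrete, the following Python sketch extracts candidate DSOs from the text of a `/proc/<pid>/maps` file. It illustrates the idea only and is not the observer's code: `loaded_libs` is a hypothetical name, keeping only executable file‑backed mappings is an assumed heuristic, and the sample content is invented for the example. The `max_libs` default mirrors `collect.maxLibs` from the configuration.

```python
def loaded_libs(maps_text, max_libs=256):
    """Return unique file-backed executable mappings, in first-seen order.

    Each maps line has the form: address perms offset dev inode [pathname]
    """
    seen = {}
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no pathname column
        perms, inode, path = fields[1], int(fields[4]), fields[5]
        if inode == 0 or not path.startswith("/"):
            continue  # pseudo entries such as [heap], [stack], [vdso]
        if "x" not in perms:
            continue  # assumed heuristic: only executable segments count as code
        if path not in seen:
            if len(seen) >= max_libs:
                break  # hard cap keeps per-container work bounded
            seen[path] = inode
    return [{"path": p, "inode": i} for p, i in seen.items()]


# Invented sample content for illustration only.
SAMPLE = """\
55d0c0a00000-55d0c0a02000 r-xp 00000000 08:01 111 /usr/bin/python3
7f00a0000000-7f00a0020000 r--p 00000000 08:01 123456 /lib/x86_64-linux-gnu/libssl.so.3
7f00a0020000-7f00a0080000 r-xp 00020000 08:01 123456 /lib/x86_64-linux-gnu/libssl.so.3
7f00a1000000-7f00a1021000 rw-p 00000000 00:00 0 [heap]
7f00a2000000-7f00a2002000 r-xp 00000000 00:00 0 [vdso]
7f00a3000000-7f00a3001000 rw-p 00000000 00:00 0
"""

for lib in loaded_libs(SAMPLE):
    print(lib["inode"], lib["path"])
```

In a DaemonSet the input would be read from under the configured `procfs` root (`/host/proc` in the example configuration), and hashing each returned path would be subject to the byte cap.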