	component_architecture_zastava.md — Stella Ops Zastava (2025Q4)
Scope. Implementation‑ready architecture for Zastava: the runtime inspector/enforcer that watches real workloads, detects drift from the scanned baseline, verifies image/SBOM/attestation posture, and (optionally) admits/blocks deployments. Includes Kubernetes & plain‑Docker topologies, data contracts, APIs, security posture, performance targets, test matrices, and failure modes.
0) Mission & boundaries
Mission. Give operators ground‑truth from running environments and a fast guardrail before workloads land:
- Observer: inventory containers, entrypoints actually executed, and DSOs actually loaded; verify image signature, SBOM referrers, and attestation presence; detect drift (unexpected processes/paths) and policy violations; publish runtime events to Scanner.WebService.
- Admission (optional): Kubernetes ValidatingAdmissionWebhook that enforces minimal posture (signed images, SBOM availability, known base images, policy PASS) pre‑flight.
Boundaries.
- Zastava does not compute SBOMs and does not sign; it consumes Scanner.WebService outputs and enforces backend policy verdicts.
- Zastava can request a delta scan when the baseline is missing/stale, but scanning is done by Scanner.Worker.
- On non‑K8s Docker hosts, Zastava runs as a host service with observer‑only features.
1) Topology & processes
1.1 Components (Kubernetes)
stellaops/zastava-observer    # DaemonSet on every node (read-only host mounts)
stellaops/zastava-webhook     # ValidatingAdmissionWebhook (Deployment, 2+ replicas)
1.2 Components (Docker/VM)
stellaops/zastava-agent       # System service; watch Docker events; observer only
1.3 Dependencies
- Authority (OIDC): short OpToks (DPoP/mTLS) for API calls to Scanner.WebService.
- Scanner.WebService: `/runtime/events` ingestion; `/policy/runtime` fetch.
- OCI Registry (optional): for direct referrers/sig checks if not delegated to backend.
- Container runtime: containerd/CRI‑O/Docker (read interfaces only).
- Kubernetes API (watch Pods in cluster; validating webhook).
- Host mounts (K8s DaemonSet): `/proc`, `/var/lib/containerd` (or the CRI‑O equivalent), `/run/containerd/containerd.sock` (optional, read‑only).
2) Data contracts
2.1 Runtime event (observer → Scanner.WebService)
{
  "eventId": "9f6a…",
  "when": "2025-10-17T12:34:56Z",
  "kind": "CONTAINER_START|CONTAINER_STOP|DRIFT|POLICY_VIOLATION|ATTESTATION_STATUS",
  "tenant": "tenant-01",
  "node": "ip-10-0-1-23",
  "runtime": { "engine": "containerd", "version": "1.7.19" },
  "workload": {
    "platform": "kubernetes",
    "namespace": "payments",
    "pod": "api-7c9fbbd8b7-ktd84",
    "container": "api",
    "containerId": "containerd://...",
    "imageRef": "ghcr.io/acme/api@sha256:abcd…",
    "owner": { "kind": "Deployment", "name": "api" }
  },
  "process": {
    "pid": 12345,
    "entrypoint": ["/entrypoint.sh", "--serve"],
    "entryTrace": [
      {"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
      {"file":"<argv>","op":"python","target":"/opt/app/server.py"}
    ],
    "buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
  },
  "loadedLibs": [
    { "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
    { "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
  ],
  "posture": {
    "imageSigned": true,
    "sbomReferrer": "present|missing",
    "attestation": { "uuid": "rekor-uuid", "verified": true }
  },
  "delta": {
    "baselineImageDigest": "sha256:abcd…",
    "changedFiles": ["/opt/app/server.py"],           // optional quick signal
    "newBinaries": [{ "path":"/usr/local/bin/helper","sha256":"…" }]
  },
  "evidence": [
    {"signal":"procfs.maps","value":"/lib/.../libssl.so.3@0x7f..."},
    {"signal":"cri.task.inspect","value":"pid=12345"},
    {"signal":"registry.referrers","value":"sbom: application/vnd.cyclonedx+json"}
  ]
}
2.2 Admission decision (webhook → API server)
{
  "admissionId": "…",
  "namespace": "payments",
  "podSpecDigest": "sha256:…",
  "images": [
    {
      "name": "ghcr.io/acme/api:1.2.3",
      "resolved": "ghcr.io/acme/api@sha256:abcd…",
      "signed": true,
      "hasSbomReferrers": true,
      "policyVerdict": "pass|warn|fail",
      "reasons": ["unsigned base image", "missing SBOM"]
    }
  ],
  "decision": "Allow|Deny",
  "ttlSeconds": 300
}
2.3 Schema negotiation & hashing guarantees
- Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the major line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported minor. If no overlap exists, the local default (`@v1.0`) is used.
- Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
- Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the major and require a dual-writer/reader rollout per the deployment playbook.
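For illustration, the reproducible-ID guarantee can be sketched in Python. This is not the shipped `ZastavaContractVersions` helper (which is .NET); sorted keys and compact separators stand in here for the serializer's real canonical property order:

```python
import base64
import hashlib
import json

def canonical_json(payload: dict) -> bytes:
    # Assumption: canonical form = UTF-8, deterministic key order, no
    # insignificant whitespace. The real serializer fixes an explicit
    # property order; sorted keys are a stand-in for this sketch.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def event_id(payload: dict) -> str:
    # Multihash-style ID: sha256-<base64url, unpadded> over the canonical bytes.
    digest = hashlib.sha256(canonical_json(payload)).digest()
    return "sha256-" + base64.urlsafe_b64encode(digest).decode().rstrip("=")

a = {"kind": "CONTAINER_START", "tenant": "tenant-01"}
b = {"tenant": "tenant-01", "kind": "CONTAINER_START"}  # same data, different order
assert event_id(a) == event_id(b)  # canonical bytes are identical, so IDs match
```

Because the serializer pins the byte sequence, any observer, webhook, or backend job hashing the same payload arrives at the same ID.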
3) Observer — node agent (DaemonSet)
3.1 Responsibilities
- Watch container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC, read‑only) or a `/var/log/containers/*.log` tail fallback.
- Resolve container → image digest and mount‑point rootfs.
- Trace entrypoint: attach a short‑lived nsenter/exec to PID 1 in the container, parse the shell for the `exec` chain (bounded depth), and record the terminal program.
- Sample loaded libs: read `/proc/<pid>/maps` and the `exe` symlink to collect actually loaded DSOs; compute sha256 for each mapped file (bounded count/size).
- Record GNU build-id: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug‑store correlation.
- Posture check (cheap):
  - Image signature presence (if cosign policies are local; otherwise ask the backend).
  - SBOM referrers presence (HEAD to the registry, optional).
  - Rekor UUID known (query Scanner.WebService by image digest).
- Publish runtime events to Scanner.WebService `/runtime/events` (batched & compressed).
- Request a delta scan if there is no SBOM in the catalog or the base differs from the known baseline.
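The loaded-libs sampling step boils down to filtering `/proc/<pid>/maps` for file-backed executable mappings. A minimal sketch (the fictional `sample` string mimics the maps format; the real observer reads `/host/proc/<pid>/maps` and then hashes each path under its byte budget):

```python
def loaded_dsos(maps_text: str) -> list[str]:
    """Collect distinct file-backed executable mappings from /proc/<pid>/maps."""
    seen: set[str] = set()
    out: list[str] = []
    for line in maps_text.splitlines():
        # maps fields: address perms offset dev inode pathname
        parts = line.split(maxsplit=5)
        if len(parts) < 6:
            continue  # anonymous mapping: no backing file to report
        perms, path = parts[1], parts[5]
        if "x" in perms and path.startswith("/") and path not in seen:
            seen.add(path)
            out.append(path)
    return out

sample = (
    "7f1c2000-7f1c3000 r-xp 00000000 08:01 123456 /lib/x86_64-linux-gnu/libssl.so.3\n"
    "7f1c3000-7f1c4000 rw-p 00000000 00:00 0\n"
    "7f1c4000-7f1c5000 r--p 00000000 08:01 123457 /usr/lib/x86_64-linux-gnu/libcrypto.so.3\n"
)
print(loaded_dsos(sample))  # only the executable, file-backed mapping survives
```

Filtering on the execute bit keeps the sample focused on code actually mapped for execution, which is what the Usage SBOM view cares about.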
3.2 Privileges & mounts (K8s)
- SecurityContext: `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
- Capabilities: `CAP_SYS_PTRACE` (optional, if using the nsenter trace), `CAP_DAC_READ_SEARCH`.
- Host mounts (read‑only):
  - `/proc` (host) → `/host/proc`
  - `/run/containerd/containerd.sock` (or the CRI‑O socket)
  - `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & PIDs)
- Networking: cluster‑internal egress to Scanner.WebService only.
- Rate limits: hard caps on bytes hashed and file count per container to avoid noisy tenants.
3.3 Event batching
- Buffer ND‑JSON; flush by N events or 2 s.
- Backpressure: local disk ring buffer (50 MB default) if Scanner is temporarily unavailable; drop oldest after cap with metrics and warning event.
4) Admission Webhook (Kubernetes)
4.1 Gate criteria
Configurable policy (fetched from backend and cached):
- Image signature: must be cosign‑verifiable to configured key(s) or keyless identities.
- SBOM availability: at least one CycloneDX referrer or Scanner.WebService catalog entry.
- Scanner policy verdict: backend `PASS` required for namespaces/labels matching rules; allow `WARN` if configured.
- Registry allowlists/denylists.
- Tag bans (e.g., :latest).
- Base image allowlists (by digest).
4.2 Flow
sequenceDiagram
  autonumber
  participant K8s as API Server
  participant WH as Zastava Webhook
  participant SW as Scanner.WebService
  K8s->>WH: AdmissionReview(Pod)
  WH->>WH: Resolve images to digests (remote HEAD/pull if needed)
  WH->>SW: POST /policy/runtime { digests, namespace, labels }
  SW-->>WH: { per-image: {signed, hasSbom, verdict, reasons}, ttl }
  alt All pass
    WH-->>K8s: AdmissionResponse(Allow, ttl)
  else Any fail (enforce=true)
    WH-->>K8s: AdmissionResponse(Deny, message)
  end
Caching: per‑digest results are cached for `ttlSeconds` (default 300 s). Fail‑open or fail‑closed is configurable per namespace.
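A minimal sketch of that per-digest TTL cache, honoring the backend-supplied `ttlSeconds` (the `VerdictCache` name and injected clock are hypothetical, used here so expiry is testable):

```python
import time

class VerdictCache:
    """Per-digest admission results, evicted once the backend-supplied TTL lapses."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries: dict[str, tuple[dict, float]] = {}

    def put(self, digest: str, result: dict, ttl_seconds: int):
        self.entries[digest] = (result, self.clock() + ttl_seconds)

    def get(self, digest: str):
        hit = self.entries.get(digest)
        if hit is None:
            return None
        result, expires = hit
        if self.clock() >= expires:
            del self.entries[digest]  # expired: next lookup forces a backend round-trip
            return None
        return result

t = [0.0]                             # fake clock so the test can fast-forward time
cache = VerdictCache(clock=lambda: t[0])
cache.put("sha256:abcd", {"policyVerdict": "pass"}, ttl_seconds=300)
assert cache.get("sha256:abcd") == {"policyVerdict": "pass"}
t[0] = 301.0
assert cache.get("sha256:abcd") is None   # stale entries are evicted on read
```

Keying on the resolved digest (not the tag) is what lets the webhook skip the backend round-trip for repeated deployments of the same image.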
4.3 TLS & HA
- Webhook has its own serving cert signed by cluster CA (or custom cert + CA bundle on configuration).
- Deployment ≥ 2 replicas; leaderless; stateless.
5) Backend integration (Scanner.WebService)
5.1 Ingestion endpoint
POST /api/v1/scanner/runtime/events (OpTok + DPoP/mTLS)
- Validates event schema; enforces rate caps by tenant/node; persists to Mongo (`runtime.events` capped collection, or a regular collection with a TTL index).
- Performs correlation:
  - Attaches the nearest image SBOM (inventory/usage) and BOM‑Index if known.
  - If unknown/missing, schedules a delta scan and returns `202 Accepted`.
- Emits derived signals (`usedByEntrypoint` per component, based on `/proc/<pid>/maps`).
5.2 Policy decision API (for webhook)
POST /api/v1/scanner/policy/runtime
The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.
Request:
{
  "namespace": "payments",
  "labels": { "app": "api", "env": "prod" },
  "images": ["ghcr.io/acme/api@sha256:...", "ghcr.io/acme/nginx@sha256:..."]
}
Response:
{
  "ttlSeconds": 300,
  "results": {
    "ghcr.io/acme/api@sha256:...": {
      "signed": true,
      "hasSbom": true,
      "policyVerdict": "pass",
      "reasons": [],
      "rekor": { "uuid": "..." }
    },
    "ghcr.io/acme/nginx@sha256:...": {
      "signed": false,
      "hasSbom": false,
      "policyVerdict": "fail",
      "reasons": ["unsigned", "missing SBOM"]
    }
  }
}
6) Configuration (YAML)
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null      # Optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
    engine: "auto"    # containerd|cri-o|docker|auto
    procfs: "/host/proc"
    collect:
      entryTrace: true
      loadedLibs: true
      maxLibs: 256
      maxHashBytesPerContainer: 64_000_000
      maxDepth: 48
  admission:
    enforce: true
    failOpenNamespaces: ["dev", "test"]
    verify:
      imageSignature: true
      sbomReferrer: true
      scannerPolicyPass: true
    cacheTtlSeconds: 300
    resolveTags: true          # do remote digest resolution for tag-only images
  limits:
    eventsPerSecond: 50
    burst: 200
    perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start‑up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes and meters.
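The `limits` block above (`eventsPerSecond`, `burst`) maps naturally onto a token bucket. A minimal sketch — the `TokenBucket` class is hypothetical, not the shipped rate limiter:

```python
class TokenBucket:
    """Refill at `rate` tokens/sec; `burst` bounds the bucket (cf. zastava.limits)."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = float(burst)
        self.tokens = float(burst)    # start full: an idle node may burst immediately
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # caller coalesces or drops the event

tb = TokenBucket(rate=50, burst=200)
allowed = sum(tb.allow(now=0.0) for _ in range(250))
print(allowed)  # the initial burst admits 200; the remaining 50 are rejected
```

Sustained load above 50 events/s drains the bucket faster than it refills, which is exactly the "apply local rate limit, coalesce" behavior listed in the failure-mode table.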
7) Security posture
- AuthN/Z: Authority OpToks (DPoP preferred) to backend; webhook does not require client auth from API server (K8s handles).
- Least privileges: read‑only host mounts; optional CAP_SYS_PTRACE; no host networking; no write mounts.
- Isolation: never exec untrusted code; nsenter only to read /proc/<pid>.
- Data minimization: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
- Rate limiting: per‑node caps; per‑tenant caps at backend.
- Hard caps: bytes hashed, files inspected, depth of shell parsing.
- Authority guardrails: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without the `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.
8) Metrics, logs, tracing
Observer
- zastava.runtime.events.total{kind}
- zastava.runtime.backend.latency.ms{endpoint="events"}
- zastava.proc_maps.samples.total{result}
- zastava.entrytrace.depth{p99}
- zastava.hash.bytes.total
- zastava.buffer.drops.total
Webhook
- zastava.admission.decisions.total{decision}
- zastava.runtime.backend.latency.ms{endpoint="policy"}
- zastava.admission.cache.hits.total
- zastava.backend.failures.total
Logs (structured): node, pod, image digest, decision, reasons. Tracing: spans for observe→batch→post; webhook request→resolve→respond.
9) Performance & scale targets
- Observer: ≤ 30 ms to sample `/proc/<pid>/maps` and compute quick hashes for ≤ 64 files; ≤ 200 ms for the full library set (256 libs).
- Webhook: P95 ≤ 8 ms with warm cache; ≤ 50 ms with one backend round‑trip.
- Throughput: 1k admission requests/min/replica; 5k runtime events/min/node with batching.
10) Drift detection model
Signals
- Process drift: the terminal program differs from the EntryTrace baseline.
- Library drift: loaded DSOs not present in the Usage SBOM view.
- Filesystem drift: new executable files under `/usr/local/bin`, `/opt`, or `/app` with an mtime after image creation.
- Network drift (optional): listening sockets on unexpected ports (from policy).
Action
- Emit a `DRIFT` event with evidence; the backend can auto‑queue a delta scan; policy may escalate to alert/block (Admission cannot block already‑running pods; rely on K8s policies/PodSecurity or operator action).
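At its core, the library-drift signal is a set difference between what `/proc/<pid>/maps` shows loaded and what the Usage SBOM view recorded. A sketch (paths below are fabricated examples):

```python
def library_drift(loaded: list[str], usage_view: set[str]) -> list[str]:
    """DSOs observed at runtime that the Usage SBOM view never recorded."""
    return sorted(set(loaded) - usage_view)

# Baseline from the scanner's Usage view for this image digest.
usage = {
    "/lib/x86_64-linux-gnu/libssl.so.3",
    "/usr/lib/x86_64-linux-gnu/libcrypto.so.3",
}
# What the observer actually sampled from the running container.
loaded = ["/lib/x86_64-linux-gnu/libssl.so.3", "/opt/injected/evil.so"]

drift = library_drift(loaded, usage)
print(drift)  # any non-empty result becomes evidence on a DRIFT event
```

An empty result means the runtime stayed inside its scanned baseline; anything else is attached as evidence and can trigger the delta-scan path.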
11) Test matrix
- Engines: containerd, CRI‑O, Docker; ensure PID resolution and rootfs mapping.
- EntryTrace: bash features (`case`, `if`, `run-parts`, `.`/`source`), language launchers (python/node/java).
- Procfs: multiple arches, musl/glibc images; static binaries (maps minimal).
- Admission: unsigned images, missing SBOM referrers, tag‑only images, digest resolution, backend latency, cache TTL.
- Perf/soak: 500 Pods/node churn; webhook under HPA growth.
- Security: attempt privilege escalation disabled, read‑only mounts enforced, rate‑limit abuse.
- Failure injection: backend down (observer buffers, webhook fail‑open/closed), registry throttling, containerd socket unavailable.
12) Failure modes & responses
| Condition | Observer behavior | Webhook behavior | 
|---|---|---|
| Backend unreachable | Buffer to disk; drop after cap; emit metric | Fail‑open/closed per namespace config | 
| PID vanished mid‑sample | Retry once; emit partial evidence | N/A | 
| CRI socket missing | Fallback to K8s events only (reduced fidelity) | N/A | 
| Registry digest resolve blocked | Defer to backend; mark resolve=unknown | Deny or allow per `resolveTags` & `failOpenNamespaces` | 
| Excessive events | Apply local rate limit, coalesce | N/A | 
13) Deployment notes (K8s)
DaemonSet (snippet):
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: zastava-observer, namespace: stellaops }
spec:
  template:
    spec:
      serviceAccountName: zastava
      hostPID: true
      containers:
      - name: observer
        image: stellaops/zastava-observer:2.3
        securityContext:
          runAsUser: 0
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities: { add: ["SYS_PTRACE","DAC_READ_SEARCH"] }
        volumeMounts:
        - { name: proc, mountPath: /host/proc, readOnly: true }
        - { name: containerd-sock, mountPath: /run/containerd/containerd.sock, readOnly: true }
        - { name: containerd-state, mountPath: /var/lib/containerd, readOnly: true }
      volumes:
      - { name: proc, hostPath: { path: /proc } }
      - { name: containerd-sock, hostPath: { path: /run/containerd/containerd.sock } }
      - { name: containerd-state, hostPath: { path: /var/lib/containerd } }
Webhook (snippet):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
- name: gate.zastava.stella-ops.org
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore   # or Fail
  rules:
  - operations: ["CREATE","UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: stellaops
      name: zastava-webhook
      path: /admit
    caBundle: <base64 CA>
14) Implementation notes
- Language: Rust (observer) for low‑latency `/proc` parsing; Go/.NET are viable too. The webhook can be .NET 10 for parity with the backend.
- CRI drivers: pluggable (`containerd`, `cri-o`, `docker`). Prefer CRI APIs over parsing logs.
- Shell parser: re‑use Scanner.EntryTrace grammar for consistent results (compile to WASM if observer is Rust/Go).
- Hashing: BLAKE3 locally for speed-sensitive change detection, with sha256 computed separately for any published value (digests cannot be converted between algorithms), or sha256 directly when the budget allows.
- Resilience: never block container start; observer is passive; only webhook decides allow/deny.
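The per-container hashing budget (`maxHashBytesPerContainer` in §6) can be enforced with a simple spend counter. A sketch — `hash_within_budget` is a hypothetical helper, and file contents are passed as bytes here for testability, whereas the real observer streams from the rootfs:

```python
import hashlib

def hash_within_budget(files: dict[str, bytes], budget: int) -> dict[str, str]:
    """sha256 each file in order until the per-container byte budget is exhausted."""
    out: dict[str, str] = {}
    spent = 0
    for path, data in files.items():
        if spent + len(data) > budget:
            break                      # remaining files are reported unhashed
        out[path] = hashlib.sha256(data).hexdigest()
        spent += len(data)
    return out

files = {
    "/lib/a.so": b"x" * 400,
    "/lib/b.so": b"y" * 400,
    "/lib/c.so": b"z" * 400,
}
hashed = hash_within_budget(files, budget=1000)
print(sorted(hashed))  # the 1000-byte budget admits only the first two files
```

Stopping at the first over-budget file (rather than skipping it and continuing) keeps the spend deterministic, which matters for reproducible event payloads.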
15) Roadmap
- eBPF option for syscall/library load tracing (kernel‑level, opt‑in).
- Windows containers support (ETW providers, loaded modules).
- Network posture checks: listening ports vs policy.
- Live used‑by‑entrypoint synthesis: send compact bitset diff to backend to tighten Usage view.
- Admission dry‑run dashboards (simulate block lists before enforcing).