feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.

docs/modules/zastava/architecture.md (Normal file, 496 lines added)

@@ -0,0 +1,496 @@

# component_architecture_zastava.md: **Stella Ops Zastava** (2025Q4)

> **Scope.** Implementation-ready architecture for **Zastava**: the **runtime inspector/enforcer** that watches real workloads, detects drift from the scanned baseline, verifies image/SBOM/attestation posture, and (optionally) **admits/blocks** deployments. Includes Kubernetes & plain-Docker topologies, data contracts, APIs, security posture, performance targets, test matrices, and failure modes.

---

## 0) Mission & boundaries

**Mission.** Give operators **ground truth** from running environments and a **fast guardrail** before workloads land:

* **Observer:** inventory containers, the entrypoints actually executed, and the DSOs actually loaded; verify **image signature**, **SBOM referrers**, and **attestation** presence; detect **drift** (unexpected processes/paths) and **policy violations**; publish **runtime events** to Scanner.WebService.
* **Admission (optional):** Kubernetes ValidatingAdmissionWebhook that enforces minimal posture (signed images, SBOM availability, known base images, policy PASS) **pre-flight**.

**Boundaries.**

* Zastava **does not** compute SBOMs and does not sign; it **consumes** Scanner.WebService outputs and **enforces** backend policy verdicts.
* Zastava can **request** a delta scan when the baseline is missing/stale, but scanning is done by **Scanner.Worker**.
* On non-K8s Docker hosts, Zastava runs as a host service with **observer-only** features.

---

## 1) Topology & processes

### 1.1 Components (Kubernetes)

```
stellaops/zastava-observer    # DaemonSet on every node (read-only host mounts)
stellaops/zastava-webhook     # ValidatingAdmissionWebhook (Deployment, 2+ replicas)
```

### 1.2 Components (Docker/VM)

```
stellaops/zastava-agent       # System service; watch Docker events; observer only
```

### 1.3 Dependencies

* **Authority** (OIDC): short-lived OpToks (DPoP/mTLS) for API calls to Scanner.WebService.
* **Scanner.WebService**: `/runtime/events` ingestion; `/policy/runtime` fetch.
* **OCI Registry** (optional): for direct referrers/sig checks if not delegated to backend.
* **Container runtime**: containerd/CRI-O/Docker (read interfaces only).
* **Kubernetes API**: watch Pods in the cluster; serve the validating webhook.
* **Host mounts** (K8s DaemonSet): `/proc`, `/var/lib/containerd` (or CRI-O), `/run/containerd/containerd.sock` (optional, read-only).

---

## 2) Data contracts

### 2.1 Runtime event (observer → Scanner.WebService)

```json
{
  "eventId": "9f6a…",
  "when": "2025-10-17T12:34:56Z",
  "kind": "CONTAINER_START|CONTAINER_STOP|DRIFT|POLICY_VIOLATION|ATTESTATION_STATUS",
  "tenant": "tenant-01",
  "node": "ip-10-0-1-23",
  "runtime": { "engine": "containerd", "version": "1.7.19" },
  "workload": {
    "platform": "kubernetes",
    "namespace": "payments",
    "pod": "api-7c9fbbd8b7-ktd84",
    "container": "api",
    "containerId": "containerd://...",
    "imageRef": "ghcr.io/acme/api@sha256:abcd…",
    "owner": { "kind": "Deployment", "name": "api" }
  },
  "process": {
    "pid": 12345,
    "entrypoint": ["/entrypoint.sh", "--serve"],
    "entryTrace": [
      {"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
      {"file":"<argv>","op":"python","target":"/opt/app/server.py"}
    ],
    "buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
  },
  "loadedLibs": [
    { "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
    { "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
  ],
  "posture": {
    "imageSigned": true,
    "sbomReferrer": "present|missing",
    "attestation": { "uuid": "rekor-uuid", "verified": true }
  },
  "delta": {
    "baselineImageDigest": "sha256:abcd…",
    "changedFiles": ["/opt/app/server.py"],
    "newBinaries": [{ "path":"/usr/local/bin/helper","sha256":"…" }]
  },
  "evidence": [
    {"signal":"procfs.maps","value":"/lib/.../libssl.so.3@0x7f..."},
    {"signal":"cri.task.inspect","value":"pid=12345"},
    {"signal":"registry.referrers","value":"sbom: application/vnd.cyclonedx+json"}
  ]
}
```

Pipe-separated values (for example in `kind` and `sbomReferrer`) list the allowed alternatives. `delta.changedFiles` is an optional quick signal.

### 2.2 Admission decision (webhook → API server)

```json
{
  "admissionId": "…",
  "namespace": "payments",
  "podSpecDigest": "sha256:…",
  "images": [
    {
      "name": "ghcr.io/acme/api:1.2.3",
      "resolved": "ghcr.io/acme/api@sha256:abcd…",
      "signed": true,
      "hasSbomReferrers": true,
      "policyVerdict": "pass|warn|fail",
      "reasons": ["unsigned base image", "missing SBOM"]
    }
  ],
  "decision": "Allow|Deny",
  "ttlSeconds": 300
}
```

### 2.3 Schema negotiation & hashing guarantees

* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
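
The negotiation and hashing rules above can be sketched as follows. This is a minimal Python illustration, not the `ZastavaContractVersions` helper itself: the function names `parse_version`, `negotiate`, and `multihash_id` are hypothetical, and sorted-key compact JSON is only a stand-in for the canonical serializer, whose real property order is defined by the schema.

```python
import base64
import hashlib
import json
import re

_VERSION_RE = re.compile(r"^(?P<schema>[a-z0-9.]+)@v(?P<major>\d+)\.(?P<minor>\d+)$")


def parse_version(value: str) -> tuple[str, int, int]:
    """Split '<schema>@v<major>.<minor>' into (schema, major, minor)."""
    match = _VERSION_RE.match(value)
    if match is None:
        raise ValueError(f"malformed schemaVersion: {value!r}")
    return match["schema"], int(match["major"]), int(match["minor"])


def negotiate(local: list[str], remote: list[str], default: str) -> str:
    """Pick the highest minor both sides support on the default's schema and major."""
    schema, major, _ = parse_version(default)

    def minors(offers: list[str]) -> set[int]:
        parsed = (parse_version(v) for v in offers)
        return {m for s, mj, m in parsed if (s, mj) == (schema, major)}

    shared = minors(local) & minors(remote)
    if not shared:
        return default  # no overlap: fall back to the local default
    return f"{schema}@v{major}.{max(shared)}"


def multihash_id(payload: dict) -> str:
    """Hash a deterministic JSON encoding and render it as 'sha256-<base64url>'."""
    # Assumption: sorted keys + compact separators approximate the canonical form.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    return "sha256-" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

For example, `negotiate(["zastava.runtime.event@v1.0", "zastava.runtime.event@v1.2"], ["zastava.runtime.event@v1.1", "zastava.runtime.event@v1.2"], "zastava.runtime.event@v1.0")` selects `zastava.runtime.event@v1.2`, and two dictionaries with the same keys in a different insertion order produce the same `multihash_id`.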

---

## 3) Observer: node agent (DaemonSet)

### 3.1 Responsibilities

* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC read-only) or `/var/log/containers/*.log` tail fallback.
* **Resolve** each container to its image digest and rootfs mount point.
* **Trace entrypoint**: attach a **short-lived** nsenter/exec to PID 1 in the container, parse the shell for the `exec` chain (bounded depth), and record the **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and the `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size).
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):

  * Image signature presence (if cosign policies are local; else ask backend).
  * SBOM **referrers** presence (HEAD to registry, optional).
  * Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if there is no SBOM in the catalog OR the base differs from the known baseline.
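
The "sample loaded libs" step above can be illustrated with a small parser for the text of `/proc/<pid>/maps`. This is a sketch under stated assumptions: the function name `mapped_files` is hypothetical, it works on already-read text so it stays testable, and hashing with byte budgets is left out.

```python
def mapped_files(maps_text: str, max_libs: int = 256) -> list[str]:
    """Return unique file-backed paths from /proc/<pid>/maps text, in first-seen order."""
    seen: dict[str, None] = {}
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        inode, path = fields[4], fields[5].strip()
        if inode == "0" or not path.startswith("/"):
            continue  # pseudo-mappings such as [heap], [stack], [vdso]
        seen.setdefault(path, None)
        if len(seen) >= max_libs:
            break  # bounded count, mirroring the maxLibs cap
    return list(seen)
```

A caller would read `/host/proc/<pid>/maps`, pass the text in, and then hash each returned path within the configured byte budget.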

### 3.2 Privileges & mounts (K8s)

* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
* **Capabilities:** `CAP_SYS_PTRACE` (optional; used for the nsenter trace), `CAP_DAC_READ_SEARCH`.
* **Host mounts (read-only):**

  * `/proc` (host) → `/host/proc`
  * `/run/containerd/containerd.sock` (or CRI-O socket)
  * `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** cluster-internal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.

### 3.3 Event batching

* Buffer ND-JSON; flush by **N events** or **2 s**, whichever comes first.
* Backpressure: local disk ring buffer (50 MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.
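
The batching rules above can be sketched as a small class. This is an illustration only: `EventBatcher` and its parameters are hypothetical names, the backlog is an in-memory deque counted in batches rather than the 50 MB on-disk ring buffer, and the clock is injectable so the behaviour can be tested without sleeping.

```python
import json
import time
from collections import deque


class EventBatcher:
    """Flush buffered events as ND-JSON after max_events or max_age_s, whichever is first."""

    def __init__(self, send, max_events=100, max_age_s=2.0, backlog_cap=1000,
                 clock=time.monotonic):
        self._send = send              # callable(bytes) -> bool; False = backend unavailable
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._clock = clock
        self._pending = []
        self._first_at = 0.0
        self._backlog = deque(maxlen=backlog_cap)  # oldest batch is evicted when full
        self.drops = 0                 # would feed a drop counter metric

    def add(self, event: dict) -> None:
        if not self._pending:
            self._first_at = self._clock()
        self._pending.append(event)
        self.tick()

    def tick(self) -> None:
        """Call periodically; flushes when the size or age threshold is reached."""
        if not self._pending:
            return
        too_many = len(self._pending) >= self._max_events
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_many or too_old:
            self._flush()

    def _flush(self) -> None:
        lines = (json.dumps(e, separators=(",", ":")) for e in self._pending)
        body = "\n".join(lines).encode("utf-8")
        self._pending = []
        if len(self._backlog) == self._backlog.maxlen:
            self.drops += 1            # appending now evicts the oldest batch
        self._backlog.append(body)
        # Drain oldest-first; stop at the first failed send and retry on a later flush.
        while self._backlog and self._send(self._backlog[0]):
            self._backlog.popleft()
```

When `send` keeps returning `False`, batches accumulate up to `backlog_cap` and the oldest are dropped, which is the same drop-oldest policy the section describes for the disk buffer.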

### 3.4 Build-id capture & validation workflow

1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope.
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:

   * calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
   * piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.
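
The mapping from a build-id to its location in the debug store (step 3) is simple enough to show directly. The helper names below are hypothetical; only the `debug/.build-id/<aa>/<rest>.debug` layout and the lower-case hex normalisation come from the workflow above.

```python
import re

_HEX_RE = re.compile(r"^[0-9a-f]+$")


def normalize_build_id(raw: str) -> str:
    """Lower-case and validate a GNU build-id given as hex text."""
    value = raw.strip().lower()
    if len(value) < 4 or len(value) % 2 or not _HEX_RE.match(value):
        raise ValueError(f"not a hex build-id: {raw!r}")
    return value


def debug_path(build_id: str, root: str = "debug") -> str:
    """Return '<root>/.build-id/<aa>/<rest>.debug' for a build-id."""
    value = normalize_build_id(build_id)
    return f"{root}/.build-id/{value[:2]}/{value[2:]}.debug"
```

With the `buildId` from the sample runtime event, `debug_path("9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1")` yields `debug/.build-id/9f/3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1.debug`.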

---

## 4) Admission Webhook (Kubernetes)

### 4.1 Gate criteria

Configurable policy (fetched from backend and cached):

* **Image signature**: must be cosign-verifiable to configured key(s) or keyless identities.
* **SBOM availability**: at least one **CycloneDX** referrer or **Scanner.WebService** catalog entry.
* **Scanner policy verdict**: backend `PASS` required for namespaces/labels matching rules; allow `WARN` if configured.
* **Registry allowlists/denylists**.
* **Tag bans** (e.g., `:latest`).
* **Base image allowlists** (by digest).
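
Two of the gate criteria above, registry allowlists and tag bans, need nothing but the image reference, so they can be sketched locally. The parser below is deliberately simplified and the names `split_image_ref` and `gate_violations` are hypothetical; it treats references without a registry host as `docker.io` and references without tag or digest as `latest`, following the usual Docker convention.

```python
from typing import Optional


def split_image_ref(ref: str) -> tuple[str, Optional[str], Optional[str]]:
    """Split an image reference into (registry, tag, digest). Simplified parser."""
    name, _, digest = ref.partition("@")
    first, slash, _rest = name.partition("/")
    # The first path component is a registry only if it looks like a host name.
    is_host = bool(slash) and ("." in first or ":" in first or first == "localhost")
    registry = first if is_host else "docker.io"
    last = name.rsplit("/", 1)[-1]
    tag = last.split(":", 1)[1] if ":" in last else None
    if tag is None and not digest:
        tag = "latest"  # implicit default tag
    return registry, tag, digest or None


def gate_violations(ref: str, allowed_registries: set[str],
                    banned_tags: frozenset = frozenset({"latest"})) -> list[str]:
    """Return human-readable reasons for the static checks that failed."""
    registry, tag, _digest = split_image_ref(ref)
    reasons = []
    if registry not in allowed_registries:
        reasons.append(f"registry not allowed: {registry}")
    if tag in banned_tags:
        reasons.append(f"banned tag: {tag}")
    return reasons
```

Signature, SBOM, and policy verdict checks are not covered here because they depend on the backend or the registry.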

### 4.2 Flow

```mermaid
sequenceDiagram
  autonumber
  participant K8s as API Server
  participant WH as Zastava Webhook
  participant SW as Scanner.WebService

  K8s->>WH: AdmissionReview(Pod)
  WH->>WH: Resolve images to digests (remote HEAD/pull if needed)
  WH->>SW: POST /policy/runtime { digests, namespace, labels }
  SW-->>WH: { per-image: {signed, hasSbom, verdict, reasons}, ttl }
  alt All pass
    WH-->>K8s: AdmissionResponse(Allow, ttl)
  else Any fail (enforce=true)
    WH-->>K8s: AdmissionResponse(Deny, message)
  end
```

**Caching:** Per-digest result cached `ttlSeconds` (default 300 s). **Fail-open** or **fail-closed** is configurable per namespace.
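
The allow/deny branch of the flow can be sketched as a pure function over the per-image results. This is an assumption-laden illustration: `decide` and `decide_on_backend_error` are hypothetical names, the field names `policyVerdict` and `reasons` follow the policy API examples in this document, and `allow_warn` models the "allow `WARN` if configured" option.

```python
def decide(results: dict, *, enforce: bool = True,
           allow_warn: bool = False) -> tuple[str, list[str]]:
    """Aggregate per-image policy results into ('Allow' | 'Deny', reasons)."""
    accepted = {"pass", "warn"} if allow_warn else {"pass"}
    reasons = []
    for image, result in sorted(results.items()):
        verdict = result["policyVerdict"]
        if verdict not in accepted:
            detail = ", ".join(result.get("reasons", [])) or verdict
            reasons.append(f"{image}: {detail}")
    if reasons and enforce:
        return "Deny", reasons
    return "Allow", reasons  # with enforce=False, reasons are informational only


def decide_on_backend_error(namespace: str, fail_open_namespaces: set[str]) -> str:
    """Fail-open for listed namespaces, fail-closed everywhere else."""
    return "Allow" if namespace in fail_open_namespaces else "Deny"
```

Keeping the decision free of I/O makes it straightforward to cache per digest for `ttlSeconds` and to unit-test the fail-open and fail-closed paths separately.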

### 4.3 TLS & HA

* Webhook has its own **serving cert** signed by the cluster CA (or a custom cert with its CA bundle set in the webhook configuration).
* Deployment ≥ 2 replicas; **leaderless**; stateless.

---

## 5) Backend integration (Scanner.WebService)

### 5.1 Ingestion endpoint

`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*

* Validates event schema; enforces rate caps by tenant/node; persists to **Mongo** (`runtime.events`, either a capped collection or a regular collection with TTL).
* Performs **correlation**:

  * Attach nearest **image SBOM** (inventory/usage) and **BOM-Index** if known.
  * If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).

### 5.2 Policy decision API (for webhook)

`POST /api/v1/scanner/policy/runtime`

The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.

Request:

```json
{
  "namespace": "payments",
  "labels": { "app": "api", "env": "prod" },
  "images": ["ghcr.io/acme/api@sha256:...", "ghcr.io/acme/nginx@sha256:..."]
}
```

Response:

```json
{
  "ttlSeconds": 300,
  "results": {
    "ghcr.io/acme/api@sha256:...": {
      "signed": true,
      "hasSbom": true,
      "policyVerdict": "pass",
      "reasons": [],
      "rekor": { "uuid": "..." }
    },
    "ghcr.io/acme/nginx@sha256:...": {
      "signed": false,
      "hasSbom": false,
      "policyVerdict": "fail",
      "reasons": ["unsigned", "missing SBOM"]
    }
  }
}
```

---

## 6) Configuration (YAML)

```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null      # Optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
    engine: "auto"    # containerd|cri-o|docker|auto
    procfs: "/host/proc"
    collect:
      entryTrace: true
      loadedLibs: true
      maxLibs: 256
      maxHashBytesPerContainer: 64_000_000
      maxDepth: 48
  admission:
    enforce: true
    failOpenNamespaces: ["dev", "test"]
    verify:
      imageSignature: true
      sbomReferrer: true
      scannerPolicyPass: true
    cacheTtlSeconds: 300
    resolveTags: true          # do remote digest resolution for tag-only images
  limits:
    eventsPerSecond: 50
    burst: 200
    perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
```

> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
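
One plausible reading of the `limits` block in the configuration above is a token bucket: `eventsPerSecond` is the refill rate and `burst` is the bucket capacity. The document does not state which algorithm is used, so the sketch below is an assumption, and `TokenBucket` is a hypothetical name.

```python
import time


class TokenBucket:
    """Allow short bursts up to `burst` while sustaining `rate_per_s` on average."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self._rate = rate_per_s
        self._capacity = float(burst)
        self._tokens = float(burst)   # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self, n: int = 1) -> bool:
        """Take n tokens if available; never blocks."""
        now = self._clock()
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False
```

With `TokenBucket(50, 200)`, a node can emit 200 events at once and then 50 per second; events refused by `try_acquire` would be coalesced or queued up to `perNodeQueue`.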

---

## 7) Security posture

* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from the API server (Kubernetes handles it).
* **Least privileges**: read-only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per-node caps; per-tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava:runtime:authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.

---

## 8) Metrics, logs, tracing

**Observer**

* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`

**Webhook**

* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`

**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.

---

## 9) Performance & scale targets

* **Observer**: ≤ **30 ms** to sample `/proc/<pid>/maps` and compute quick hashes for ≤ 64 files; ≤ **200 ms** for full library set (256 libs).
* **Webhook**: P95 ≤ **8 ms** with warm cache; ≤ **50 ms** with one backend round-trip.
* **Throughput**: 1k admission requests/min/replica; 5k runtime events/min/node with batching.

---

## 10) Drift detection model

**Signals**

* **Process drift**: terminal program differs from **EntryTrace** baseline.
* **Library drift**: loaded DSOs not present in **Usage** SBOM view.
* **Filesystem drift**: new executable files under `/usr/local/bin`, `/opt`, `/app` with **mtime** after image creation.
* **Network drift** (optional): listening sockets on unexpected ports (from policy).

**Action**

* Emit `DRIFT` event with evidence; backend can **auto-queue** a delta scan; policy may **escalate** to alert/block (Admission cannot block already-running pods; rely on K8s policies/PodSecurity or operator action).
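
The process and library drift signals above reduce to comparisons between a runtime event and baseline data. The sketch below assumes a hypothetical baseline shape (`terminalProgram`, `usageSbomPaths`) and a hypothetical function name `detect_drift`; the event fields follow the runtime event contract in section 2.1. Filesystem and network drift are omitted.

```python
def detect_drift(event: dict, baseline: dict) -> list[dict]:
    """Compare a runtime event against baseline data and return drift findings."""
    findings = []

    # Process drift: last entryTrace target vs. the expected terminal program.
    trace = event.get("process", {}).get("entryTrace", [])
    terminal = trace[-1]["target"] if trace else None
    expected = baseline.get("terminalProgram")
    if terminal and expected and terminal != expected:
        findings.append({"signal": "process", "observed": terminal, "expected": expected})

    # Library drift: loaded DSOs that the Usage SBOM view does not list.
    known = set(baseline.get("usageSbomPaths", []))
    loaded = {lib["path"] for lib in event.get("loadedLibs", [])}
    unknown = sorted(loaded - known)
    if unknown:
        findings.append({"signal": "library", "observed": unknown})

    return findings
```

A non-empty result would be attached as evidence to a `DRIFT` event; an empty list means neither signal fired for this sample.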

---

## 11) Test matrix

* **Engines**: containerd, CRI-O, Docker; ensure PID resolution and rootfs mapping.
* **EntryTrace**: bash features (case, if, run-parts, `.`/`source`), language launchers (python/node/java).
* **Procfs**: multiple arches, musl/glibc images; static binaries (maps minimal).
* **Admission**: unsigned images, missing SBOM referrers, tag-only images, digest resolution, backend latency, cache TTL.
* **Perf/soak**: 500 Pods/node churn; webhook under HPA growth.
* **Security**: privilege escalation attempts blocked, read-only mounts enforced, rate-limit abuse.
* **Failure injection**: backend down (observer buffers, webhook fail-open/closed), registry throttling, containerd socket unavailable.

---

## 12) Failure modes & responses

| Condition                       | Observer behavior                              | Webhook behavior                                       |
| ------------------------------- | ---------------------------------------------- | ------------------------------------------------------ |
| Backend unreachable             | Buffer to disk; drop after cap; emit metric    | **Fail-open/closed** per namespace config              |
| PID vanished mid-sample         | Retry once; emit partial evidence              | N/A                                                    |
| CRI socket missing              | Fallback to K8s events only (reduced fidelity) | N/A                                                    |
| Registry digest resolve blocked | Defer to backend; mark `resolve=unknown`       | Deny or allow per `resolveTags` & `failOpenNamespaces` |
| Excessive events                | Apply local rate limit, coalesce               | N/A                                                    |

---

## 13) Deployment notes (K8s)

**DaemonSet (snippet):**

```yaml
# Snippet: selector and pod template labels are omitted.
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: zastava-observer, namespace: stellaops }
spec:
  template:
    spec:
      serviceAccountName: zastava
      hostPID: true
      containers:
      - name: observer
        image: stellaops/zastava-observer:2.3
        securityContext:
          runAsUser: 0
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities: { add: ["SYS_PTRACE","DAC_READ_SEARCH"] }
        volumeMounts:
        - { name: proc, mountPath: /host/proc, readOnly: true }
        - { name: containerd-sock, mountPath: /run/containerd/containerd.sock, readOnly: true }
        - { name: containerd-state, mountPath: /var/lib/containerd, readOnly: true }
      volumes:
      - { name: proc, hostPath: { path: /proc } }
      - { name: containerd-sock, hostPath: { path: /run/containerd/containerd.sock } }
      - { name: containerd-state, hostPath: { path: /var/lib/containerd } }
```

**Webhook (snippet):**

```yaml
# Snippet: metadata is omitted.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
- name: gate.zastava.stella-ops.org
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore   # or Fail
  rules:
  - operations: ["CREATE","UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: stellaops
      name: zastava-webhook
      path: /admit
    caBundle: <base64 CA>
```

---

## 14) Implementation notes

* **Language**: Rust (observer) for low-latency `/proc` parsing; Go/.NET viable too. Webhook can be .NET 10 for parity with backend.
* **CRI drivers**: pluggable (`containerd`, `cri-o`, `docker`). Prefer CRI over parsing logs.
* **Shell parser**: re-use Scanner.EntryTrace grammar for consistent results (compile to WASM if observer is Rust/Go).
* **Hashing**: `BLAKE3` for fast local comparison; anything reported to the backend carries a `sha256` computed over the same bytes, since one digest cannot be derived from the other (or compute `sha256` directly when budget allows).
* **Resilience**: never block container start; observer is **passive**; only webhook decides allow/deny.

---

## 15) Roadmap

* **eBPF** option for syscall/library load tracing (kernel-level, opt-in).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live used-by-entrypoint synthesis**: send compact bitset diff to backend to tighten Usage view.
* **Admission dry-run** dashboards (simulate block lists before enforcing).