docs consolidation work

Zastava monitors running workloads, verifies supply chain posture, and enforces admission policy.

4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails

- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts (see the sketch after this list).
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
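
A minimal sketch of the determinism rule in .NET terms; the helper is illustrative, not a shipped API:

```csharp
using System;
using System.Globalization;

// Sort outputs with an ordinal comparer and pin timestamps to UTC ISO-8601 so
// repeated runs emit byte-identical artefacts regardless of host locale/zone.
static class Determinism
{
    public static string NormaliseTimestamp(DateTimeOffset value) =>
        value.ToUniversalTime().ToString("yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture);

    public static void SortOutputs(string[] lines) =>
        Array.Sort(lines, StringComparer.Ordinal);
}
```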
---

Runtime event payload (excerpt):

```json
  "imageRef": "ghcr.io/acme/api@sha256:abcd…",
  "owner": { "kind": "Deployment", "name": "api" }
},
"process": {
  "pid": 12345,
  "entrypoint": ["/entrypoint.sh", "--serve"],
  "entryTrace": [
    {"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
    {"file":"<argv>","op":"python","target":"/opt/app/server.py"}
  ],
  "buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
},
"loadedLibs": [
  { "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…" },
  { "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…" }
```

Admission decision payload (excerpt):

```json
  ],
  "decision": "Allow|Deny",
  "ttlSeconds": 300
}
```
### 2.3 Schema negotiation & hashing guarantees
* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used (see the sketch below).
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
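
A sketch of the negotiation rule above, assuming each side supports every minor up to its advertised maximum; the shipped logic lives in `ZastavaContractVersions`, and these helper names are illustrative:

```csharp
using System;

// Illustrative only: the shipped helper is ZastavaContractVersions.
static class SchemaNegotiation
{
    // "zastava.runtime.event@v1.3" -> ("zastava.runtime.event", 1, 3)
    static (string Schema, int Major, int Minor) Parse(string value)
    {
        var at = value.LastIndexOf("@v", StringComparison.Ordinal);
        var nums = value[(at + 2)..].Split('.');
        return (value[..at], int.Parse(nums[0]), int.Parse(nums[1]));
    }

    // Majors must match; take the highest minor both sides support.
    // No overlap (schema or major mismatch) falls back to the local default @v<major>.0.
    public static string Negotiate(string local, string remote)
    {
        var l = Parse(local);
        var r = Parse(remote);
        if (l.Schema != r.Schema || l.Major != r.Major)
            return $"{l.Schema}@v{l.Major}.0";
        return $"{l.Schema}@v{l.Major}.{Math.Min(l.Minor, r.Minor)}";
    }
}
```

For example, `Negotiate("zastava.runtime.event@v1.3", "zastava.runtime.event@v1.1")` yields `zastava.runtime.event@v1.1`.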
---
## 3) Observer — node agent (DaemonSet)
### 3.1 Responsibilities
* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC read‑only) or `/var/log/containers/*.log` tail fallback.
* **Resolve** container → image digest, mount point rootfs.
* **Trace entrypoint**: attach **short‑lived** nsenter/exec to PID 1 in container, parse shell for `exec` chain (bounded depth), record **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size; see the sketch after this list).
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):
  * Image signature presence (if cosign policies are local; else ask backend).
  * SBOM **referrers** presence (HEAD to registry, optional).
  * Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.
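
The sketch below illustrates the loaded-library sampling step under the documented caps. It assumes the observer's procfs mount at `/host/proc` and resolves paths through `/proc/<pid>/root` so host paths are never trusted; types and names are illustrative, not the shipped observer code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class LoadedLibSampler
{
    // Caps mirror collect.maxLibs and collect.maxHashBytesPerContainer (section 6).
    public static IEnumerable<(string Path, string Sha256)> Sample(
        int pid, string procRoot = "/host/proc",
        int maxLibs = 256, long maxHashBytes = 64_000_000)
    {
        long hashedBytes = 0;
        var seen = new HashSet<string>();
        foreach (var line in File.ReadLines($"{procRoot}/{pid}/maps"))
        {
            // maps columns: address perms offset dev inode [path]
            var cols = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
            if (cols.Length < 6 || !cols[5].StartsWith('/') || !cols[5].Contains(".so"))
                continue;                       // keep only mapped shared objects
            if (!seen.Add(cols[5]))
                continue;                       // record each DSO once
            if (seen.Count > maxLibs)
                yield break;                    // bounded file count
            // Resolve through the container's root, not the host filesystem.
            var info = new FileInfo($"{procRoot}/{pid}/root{cols[5]}");
            if (!info.Exists || hashedBytes + info.Length > maxHashBytes)
                continue;                       // bounded bytes hashed
            hashedBytes += info.Length;
            using var stream = info.OpenRead();
            yield return (cols[5], Convert.ToHexString(SHA256.HashData(stream)).ToLowerInvariant());
        }
    }
}
```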

### 3.2 Privileges & mounts (K8s)

* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
* **Capabilities:** `CAP_SYS_PTRACE` (optional if using nsenter trace), `CAP_DAC_READ_SEARCH`.
* **Mounts (read‑only):**
  * `/run/containerd/containerd.sock` (or CRI‑O socket)
  * `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** cluster‑internal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.
### 3.3 Event batching
* Buffer ND‑JSON; flush by **N events** or **2 s** (see the sketch after this list).
* Backpressure: local disk ring buffer (50 MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.
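
A sketch of the flush rule, assuming a bounded in-memory channel standing in for the disk ring buffer; `publish`, the type names, and the drop-oldest policy are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative batcher: flush whenever maxBatch events accumulate or 2 s elapse.
sealed class EventBatcher
{
    private readonly Channel<string> _queue =
        Channel.CreateBounded<string>(new BoundedChannelOptions(10_000)
        {
            FullMode = BoundedChannelFullMode.DropOldest // ring-buffer-like drop-oldest
        });
    private readonly Func<IReadOnlyList<string>, Task> _publish;
    private readonly int _maxBatch;

    public EventBatcher(Func<IReadOnlyList<string>, Task> publish, int maxBatch = 100)
    {
        _publish = publish;
        _maxBatch = maxBatch;
    }

    public bool TryEnqueue(string ndjsonLine) => _queue.Writer.TryWrite(ndjsonLine);

    public async Task PumpAsync(CancellationToken ct)
    {
        var batch = new List<string>(_maxBatch);
        while (!ct.IsCancellationRequested)
        {
            using var window = CancellationTokenSource.CreateLinkedTokenSource(ct);
            window.CancelAfter(TimeSpan.FromSeconds(2)); // time-based flush bound
            try
            {
                while (batch.Count < _maxBatch &&
                       await _queue.Reader.WaitToReadAsync(window.Token))
                {
                    while (batch.Count < _maxBatch && _queue.Reader.TryRead(out var item))
                        batch.Add(item);
                }
            }
            catch (OperationCanceledException) when (!ct.IsCancellationRequested)
            {
                // 2 s window elapsed: flush whatever is buffered.
            }
            if (batch.Count > 0)
            {
                await _publish(batch);
                batch.Clear();
            }
        }
    }
}
```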
### 3.4 Build-id capture & validation workflow
1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope.
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`; see the path sketch below) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:
   * calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
   * piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.
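
For step 3, the bundle path for a given build-id follows the documented debug-store layout; a tiny illustrative helper:

```csharp
using System;

// Map a normalised build-id to its bundle path: debug/.build-id/<aa>/<rest>.debug
static string DebugBundlePath(string buildId)
{
    if (buildId.Length < 3)
        throw new ArgumentException("build-id too short", nameof(buildId));
    return $"debug/.build-id/{buildId[..2]}/{buildId[2..]}.debug";
}
```

For the build-id in the event excerpt above, `DebugBundlePath("9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1")` yields `debug/.build-id/9f/3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1.debug`.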
---
`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*

* Validates event schema; enforces rate caps by tenant/node; persists to **PostgreSQL** (`runtime.events` table with TTL-based retention). A client-side posting sketch follows this list.
* Performs **correlation**:
  * Attach nearest **image SBOM** (inventory/usage) and **BOM‑Index** if known.
  * If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).
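
A hedged sketch of the client side of this endpoint: gzip an ND-JSON batch and POST it with an Authority OpTok. Token acquisition and the DPoP proof are elided, and the content type is an assumption:

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

static async Task PostRuntimeEventsAsync(
    HttpClient client, string opTok, IEnumerable<string> ndjsonLines)
{
    // Compress the ND-JSON batch before sending.
    using var buffer = new MemoryStream();
    using (var gzip = new GZipStream(buffer, CompressionLevel.Fastest, leaveOpen: true))
    using (var writer = new StreamWriter(gzip, Encoding.UTF8))
        foreach (var line in ndjsonLines)
            await writer.WriteLineAsync(line);

    buffer.Position = 0;
    var content = new StreamContent(buffer);
    content.Headers.ContentType = new MediaTypeHeaderValue("application/x-ndjson");
    content.Headers.ContentEncoding.Add("gzip");

    using var request = new HttpRequestMessage(
        HttpMethod.Post, "/api/v1/scanner/runtime/events") { Content = content };
    request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", opTok);

    using var response = await client.SendAsync(request);
    response.EnsureSuccessStatusCode(); // 202 Accepted may signal a scheduled delta scan
}
```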
### 5.2 Policy decision API (for webhook)

`POST /api/v1/scanner/policy/runtime`

The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.

Request:

```json
{
  …
}
```

## 6) Configuration

```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null        # Optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
  engine: "auto"                   # containerd|cri-o|docker|auto
  procfs: "/host/proc"
  collect:
    entryTrace: true
    loadedLibs: true
    maxLibs: 256
    maxHashBytesPerContainer: 64_000_000
    maxDepth: 48
  # …
  eventsPerSecond: 50
  burst: 200
  perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
```
> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
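
A minimal hosting sketch of that wiring, assuming the ASP.NET minimal-hosting model; `AddZastavaRuntimeCore` is the extension named above:

```csharp
using Microsoft.AspNetCore.Builder;

// Binds the zastava:runtime section, enforces validation, and registers
// canonical log scopes + meters; the component name feeds telemetry tags.
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddZastavaRuntimeCore(builder.Configuration, "zastava-observer");
var app = builder.Build();
app.Run();
```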

---

## 7) Security posture
* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from API server (K8s handles).
* **Least privileges**: read‑only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per‑node caps; per‑tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed (see the sketch below).
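
An illustrative reduction of the audience guardrail; the shipped check runs inside the runtime core when tokens are validated:

```csharp
using System;
using System.Linq;

// Hypothetical helper: an OpTok must carry the aud:<tenant> scope
// before any runtime call is honoured.
static bool HasTenantAudience(string[] scopes, string tenant) =>
    scopes.Contains($"aud:{tenant}", StringComparer.Ordinal);
```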

---

## 8) Metrics, logs, tracing
**Observer**
* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`
**Webhook**
* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`
**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond. (A meter-registration sketch follows.)
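
A registration sketch for the counters above using `System.Diagnostics.Metrics`, with the meter name/version taken from the section 6 configuration:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Meter identity matches zastava.runtime.metrics in the configuration.
var meter = new Meter("StellaOps.Zastava", "1.0.0");
var runtimeEvents = meter.CreateCounter<long>("zastava.runtime.events.total");
var bufferDrops   = meter.CreateCounter<long>("zastava.buffer.drops.total");

// Labels follow the documented tag sets, e.g. {kind} on the events counter.
runtimeEvents.Add(1, new KeyValuePair<string, object?>("kind", "CONTAINER_START"));
bufferDrops.Add(1);
```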

---

## 15) Roadmap
* **eBPF** option for syscall/library load tracing (kernel‑level, opt‑in).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live used‑by‑entrypoint synthesis**: send compact bitset diff to backend to tighten the Usage view.
* **Admission dry‑run** dashboards (simulate block lists before enforcing).
---
## 16) Observability (stub)
- Runbook + dashboard placeholder for offline import: `operations/observability.md`, `operations/dashboards/zastava-observability.json`.
- Metrics to surface: admission latency p95/p99, allow/deny counts, Surface.Env miss rate, Surface.Secrets failures, Surface.FS cache freshness, drift events.
- Health endpoints: `/health/liveness`, `/health/readiness`, `/status`, `/surface/fs/cache/status` (see runbook).
- Alert hints: deny spikes, latency > 800ms p99, cache freshness lag > 10m, any secrets failure.