docs consolidation work

StellaOps Bot
2025-12-25 10:53:53 +02:00
parent b9f71fc7e9
commit deb82b4f03
117 changed files with 852 additions and 847 deletions


@@ -24,7 +24,7 @@ Zastava monitors running workloads, verifies supply chain posture, and enforces
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- - Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
+ - Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.


@@ -66,15 +66,15 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
"imageRef": "ghcr.io/acme/api@sha256:abcd…",
"owner": { "kind": "Deployment", "name": "api" }
},
"process": {
"pid": 12345,
"entrypoint": ["/entrypoint.sh", "--serve"],
"entryTrace": [
{"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
{"file":"<argv>","op":"python","target":"/opt/app/server.py"}
],
"buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
},
"process": {
"pid": 12345,
"entrypoint": ["/entrypoint.sh", "--serve"],
"entryTrace": [
{"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
{"file":"<argv>","op":"python","target":"/opt/app/server.py"}
],
"buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
},
"loadedLibs": [
{ "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
{ "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
@@ -116,35 +116,35 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
],
"decision": "Allow|Deny",
"ttlSeconds": 300
}
```
### 2.3 Schema negotiation & hashing guarantees
* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs; both rules are sketched after this list.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
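For concreteness, a compact sketch of both rules under stated assumptions — only `ZastavaContractVersions` is a real name here; the type, method names, and the caller-supplied canonical JSON bytes are illustrative:
```csharp
// Sketch of the negotiation + canonical-hash rules above. Only the rules are
// from the spec; names and shapes are assumptions, not the shipped helper.
using System;
using System.Security.Cryptography;

public static class ContractVersionSketch
{
    // "zastava.runtime.event@v1.2" -> ("zastava.runtime.event", 1, 2)
    public static (string Schema, int Major, int Minor) Parse(string value)
    {
        var at = value.LastIndexOf("@v", StringComparison.Ordinal);
        var nums = value[(at + 2)..].Split('.');
        return (value[..at], int.Parse(nums[0]), int.Parse(nums[1]));
    }

    // Majors must match; the highest mutually supported minor wins;
    // otherwise fall back to the local default @v1.0.
    public static string Negotiate(string local, string remote)
    {
        var (schema, major, localMinor) = Parse(local);
        var r = Parse(remote);
        if (r.Schema != schema || r.Major != major)
            return $"{schema}@v{major}.0";
        return $"{schema}@v{major}.{Math.Min(localMinor, r.Minor)}";
    }

    // Multihash-style ID over canonical JSON bytes: "sha256-<base64url>".
    public static string HashId(byte[] canonicalJson)
    {
        var digest = SHA256.HashData(canonicalJson);
        var b64url = Convert.ToBase64String(digest)
            .TrimEnd('=').Replace('+', '-').Replace('/', '_');
        return $"sha256-{b64url}";
    }
}
```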
---
## 3) Observer — node agent (DaemonSet)
### 3.1 Responsibilities
* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC, read-only) or `/var/log/containers/*.log` tail fallback.
* **Resolve** container → image digest and rootfs mount point.
* **Trace entrypoint**: attach **short-lived** nsenter/exec to PID 1 in the container, parse the shell for the `exec` chain (bounded depth), record the **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and the `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size; see the sketch after this list).
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):
* Image signature presence (if cosign policies are local; else ask backend).
* SBOM **referrers** presence (HEAD to registry, optional).
* Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.
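A hedged sketch of the library-sampling step, bounded by the `collect` limits shown in the configuration section; the parsing is simplified (file-backed `.so` mappings only) and is not the shipped collector:
```csharp
// Bounded /proc/<pid>/maps sampling sketch. Limits mirror collect.maxLibs
// and collect.maxHashBytesPerContainer; map-line parsing is simplified.
using System.Security.Cryptography;

public static class LoadedLibSampler
{
    public static IEnumerable<(string Path, string Sha256)> Sample(
        int pid, int maxLibs = 256, long maxHashBytes = 64_000_000)
    {
        long hashed = 0;
        var seen = new HashSet<string>();
        foreach (var line in File.ReadLines($"/proc/{pid}/maps"))
        {
            // File-backed mappings end with an absolute path column.
            var slash = line.IndexOf('/');
            if (slash < 0) continue;
            var path = line[slash..];
            if (!path.Contains(".so") || !seen.Add(path)) continue;
            if (seen.Count > maxLibs) yield break;           // count cap

            var length = new FileInfo(path).Length;
            if (hashed + length > maxHashBytes) yield break; // bytes cap
            hashed += length;

            using var fs = File.OpenRead(path);
            yield return (path, Convert.ToHexString(SHA256.HashData(fs)).ToLowerInvariant());
        }
    }
}
```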
### 3.2 Privileges & mounts (K8s)
* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false` (manifest sketch after this list).
* **Capabilities:** `CAP_SYS_PTRACE` (optional if using nsenter trace), `CAP_DAC_READ_SEARCH`.
@@ -154,22 +154,22 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
* `/run/containerd/containerd.sock` (or CRIO socket)
* `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** cluster-internal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.
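An illustrative DaemonSet fragment consistent with the list above; volume names and layout are assumptions, and only the securityContext values, capabilities, and host paths come from this section (Kubernetes capability names drop the `CAP_` prefix):
```yaml
containers:
  - name: zastava-observer
    securityContext:
      runAsUser: 0
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        add: ["SYS_PTRACE", "DAC_READ_SEARCH"]
    volumeMounts:
      - name: containerd-sock
        mountPath: /run/containerd/containerd.sock
        readOnly: true
      - name: containerd-state
        mountPath: /var/lib/containerd/io.containerd.runtime.v2.task
        readOnly: true
volumes:
  - name: containerd-sock
    hostPath: { path: /run/containerd/containerd.sock, type: Socket }
  - name: containerd-state
    hostPath: { path: /var/lib/containerd/io.containerd.runtime.v2.task }
```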
### 3.3 Event batching
* Buffer NDJSON; flush by **N events** or **2s** (flush loop sketched below).
* Backpressure: local disk ring buffer (50MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.
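A minimal flush-loop sketch of the "N events or 2 s" rule, assuming a channel-fed pipeline; all names are illustrative, not the observer's actual plumbing:
```csharp
// Flush when the batch reaches maxBatch events or when the 2 s window
// elapses, whichever comes first.
using System.Threading.Channels;

public static class EventBatcher
{
    public static async Task PumpAsync(
        ChannelReader<string> events,             // buffered NDJSON lines
        Func<IReadOnlyList<string>, Task> flush,  // POST batch to /runtime/events
        int maxBatch = 100,
        TimeSpan? interval = null)
    {
        var window = interval ?? TimeSpan.FromSeconds(2);
        var batch = new List<string>(maxBatch);
        while (await events.WaitToReadAsync())    // block until work exists
        {
            using var timer = new CancellationTokenSource(window);
            try
            {
                while (batch.Count < maxBatch &&
                       await events.WaitToReadAsync(timer.Token))
                {
                    while (batch.Count < maxBatch && events.TryRead(out var line))
                        batch.Add(line);
                }
            }
            catch (OperationCanceledException)
            {
                // Window elapsed: fall through and flush what accumulated.
            }

            if (batch.Count > 0)
            {
                await flush(batch.ToArray());
                batch.Clear();
            }
        }
    }
}
```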
### 3.4 Build-id capture & validation workflow
1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope (a parsing sketch follows this list).
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:
* calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
* piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.
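Step 1's note extraction could look like the following sketch for 64-bit little-endian ELF binaries such as `/proc/<pid>/exe`; production code must also cover 32-bit, big-endian, and binaries whose notes live only in section headers:
```csharp
// NT_GNU_BUILD_ID extraction sketch (ELF64, little-endian only).
using System.Buffers.Binary;
using System.Text;

public static class BuildIdReader
{
    public static string? Read(string elfPath)
    {
        var b = File.ReadAllBytes(elfPath);
        if (b.Length < 64 || b[0] != 0x7F || b[1] != (byte)'E') return null; // not ELF

        var phOff   = (int)BinaryPrimitives.ReadUInt64LittleEndian(b.AsSpan(0x20)); // e_phoff
        var phSize  = BinaryPrimitives.ReadUInt16LittleEndian(b.AsSpan(0x36));      // e_phentsize
        var phCount = BinaryPrimitives.ReadUInt16LittleEndian(b.AsSpan(0x38));      // e_phnum

        for (var i = 0; i < phCount; i++)
        {
            var ph = phOff + i * phSize;
            if (BinaryPrimitives.ReadUInt32LittleEndian(b.AsSpan(ph)) != 4) continue; // PT_NOTE

            var off = (int)BinaryPrimitives.ReadUInt64LittleEndian(b.AsSpan(ph + 0x08)); // p_offset
            var end = off + (int)BinaryPrimitives.ReadUInt64LittleEndian(b.AsSpan(ph + 0x20)); // p_filesz

            while (off + 12 <= end) // note records are 4-byte aligned
            {
                var nameSz = (int)BinaryPrimitives.ReadUInt32LittleEndian(b.AsSpan(off));
                var descSz = (int)BinaryPrimitives.ReadUInt32LittleEndian(b.AsSpan(off + 4));
                var type   = BinaryPrimitives.ReadUInt32LittleEndian(b.AsSpan(off + 8));
                var name   = Encoding.ASCII.GetString(b, off + 12, Math.Max(nameSz - 1, 0));
                var desc   = off + 12 + ((nameSz + 3) & ~3);
                if (type == 3 && name == "GNU") // NT_GNU_BUILD_ID
                    return Convert.ToHexString(b, desc, descSz).ToLowerInvariant(); // normalised hex
                off = desc + ((descSz + 3) & ~3);
            }
        }
        return null;
    }
}
```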
---
@@ -221,20 +221,20 @@ sequenceDiagram
`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*
- * Validates event schema; enforces rate caps by tenant/node; persists to **Mongo** (`runtime.events` capped collection or regular with TTL).
+ * Validates event schema; enforces rate caps by tenant/node; persists to **PostgreSQL** (`runtime.events` table with TTL-based retention).
* Performs **correlation**:
* Attach nearest **image SBOM** (inventory/usage) and **BOMIndex** if known.
* If unknown/missing, schedule **delta scan** and return `202 Accepted` (decision sketched below).
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).
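A hedged sketch of that correlate-or-schedule decision; every type below is a hypothetical stand-in for the service's real catalog and event models:
```csharp
// Correlate the event with a known SBOM, or fall back to delta scan + 202.
public sealed record RuntimeEvent(string ImageDigest);

public interface IRuntimeCatalog
{
    Task<string?> FindSbomAsync(string imageDigest);   // SBOM id, or null if unknown
    Task ScheduleDeltaScanAsync(string imageDigest);
    Task AttachAsync(RuntimeEvent evt, string sbomId); // nearest SBOM + BOMIndex
}

public static class RuntimeIngest
{
    public static async Task<int> IngestAsync(RuntimeEvent evt, IRuntimeCatalog catalog)
    {
        var sbomId = await catalog.FindSbomAsync(evt.ImageDigest);
        if (sbomId is null)
        {
            await catalog.ScheduleDeltaScanAsync(evt.ImageDigest);
            return 202; // Accepted: SBOM not yet in catalog, scan scheduled
        }
        await catalog.AttachAsync(evt, sbomId);
        return 200;
    }
}
```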
### 5.2 Policy decision API (for webhook)
`POST /api/v1/scanner/policy/runtime`
The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.
Request:
```json
{
@@ -273,44 +273,44 @@ Response:
```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null        # Optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
  engine: "auto"                   # containerd|cri-o|docker|auto
  procfs: "/host/proc"
  collect:
    entryTrace: true
    loadedLibs: true
    maxLibs: 256
    maxHashBytesPerContainer: 64_000_000
    maxDepth: 48
@@ -327,49 +327,49 @@ zastava:
    eventsPerSecond: 50
    burst: 200
    perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
```
> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
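For orientation, a minimal bootstrap sketch around the documented call; apart from `AddZastavaRuntimeCore` itself, the host wiring below is assumed standard ASP.NET Core:
```csharp
// Program.cs sketch. Only the AddZastavaRuntimeCore call is documented above;
// the rest is assumed minimal-host wiring.
var builder = WebApplication.CreateBuilder(args);

// Binds zastava:runtime, enforces validation, and registers the canonical
// log scopes + meters (per the implementation note above).
builder.Services.AddZastavaRuntimeCore(builder.Configuration, "zastava-webhook");
builder.Services.AddHealthChecks();

var app = builder.Build();
app.MapHealthChecks("/health/liveness");   // endpoint name per the observability stub
app.Run();
```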
---
## 7) Security posture
* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from API server (K8s handles).
* **Least privileges**: read-only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per-node caps; per-tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed (scope check sketched below).
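A minimal sketch of the audience guardrail, assuming token scopes arrive as a plain string collection; the real middleware is richer:
```csharp
// Refuse any OpTok whose scope list lacks "aud:<tenant>".
using System.Linq;

public static class AudienceScopeGuard
{
    public static void EnsureTenantAudience(
        IReadOnlyCollection<string> tokenScopes, string tenant)
    {
        if (!tokenScopes.Contains($"aud:{tenant}", StringComparer.Ordinal))
        {
            throw new UnauthorizedAccessException(
                $"OpTok rejected: missing required scope 'aud:{tenant}'.");
        }
    }
}
```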
---
## 8) Metrics, logs, tracing
**Observer** *(counter registration sketched at the end of this section)*
* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`
**Webhook**
* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`
**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.
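The Observer counters could be published with `System.Diagnostics.Metrics`, reusing the meter name/version from the `zastava.runtime.metrics` configuration shown earlier; the wrapper type below is illustrative:
```csharp
// Sketch: two of the Observer counters above, registered on the shared meter.
using System.Diagnostics.Metrics;

public sealed class ObserverMetrics
{
    private readonly Counter<long> _events;
    private readonly Counter<long> _bufferDrops;

    public ObserverMetrics()
    {
        var meter = new Meter("StellaOps.Zastava", "1.0.0");
        _events      = meter.CreateCounter<long>("zastava.runtime.events.total");
        _bufferDrops = meter.CreateCounter<long>("zastava.buffer.drops.total");
    }

    public void RecordEvent(string kind) =>
        _events.Add(1, new KeyValuePair<string, object?>("kind", kind));

    public void RecordDrop() => _bufferDrops.Add(1);
}
```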
---
@@ -486,20 +486,20 @@ webhooks:
---
## 15) Roadmap
* **eBPF** option for syscall/library load tracing (kernel-level, opt-in).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live `usedByEntrypoint` synthesis**: send compact bitset diff to backend to tighten the Usage view.
* **Admission dry-run** dashboards (simulate block lists before enforcing).
---
## 16) Observability (stub)
- Runbook + dashboard placeholder for offline import: `operations/observability.md`, `operations/dashboards/zastava-observability.json`.
- Metrics to surface: admission latency p95/p99, allow/deny counts, Surface.Env miss rate, Surface.Secrets failures, Surface.FS cache freshness, drift events.
- Health endpoints: `/health/liveness`, `/health/readiness`, `/status`, `/surface/fs/cache/status` (see runbook).
- Alert hints: deny spikes, latency > 800ms p99, cache freshness lag > 10m, any secrets failure.