feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions


@@ -0,0 +1,22 @@
# Zastava agent guide
## Mission
Zastava monitors running workloads, verifies supply chain posture, and enforces runtime policy via Kubernetes admission webhooks.
## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.


@@ -0,0 +1,33 @@
# StellaOps Zastava
Zastava monitors running workloads, verifies supply chain posture, and enforces runtime policy via Kubernetes admission webhooks.
## Responsibilities
- Observe node/container activity and emit runtime events.
- Validate signatures, SBOM presence, and backend verdicts before allowing containers.
- Buffer and replay events during disconnections.
- Trigger delta scans when runtime posture drifts.
## Key components
- `StellaOps.Zastava.Observer` daemonset.
- `StellaOps.Zastava.Webhook` admission controller.
- Shared contracts in `StellaOps.Zastava.Core`.
## Integrations & dependencies
- Authority for OpToks and mTLS.
- Scanner/Scheduler for remediation triggers.
- Notify/UI for runtime alerts and dashboards.
## Operational notes
- Runbook at ./operations/runtime.md with Grafana/Prometheus assets.
- Offline kit assets bundling webhook charts.
- DPoP/mTLS rotation guidance shared with Authority.
## Related resources
- ./operations/runtime.md
- ./operations/runtime-grafana-dashboard.json
- ./operations/runtime-prometheus-rules.yaml
## Backlog references
- ZASTAVA runtime tasks in ../../TASKS.md.
- Webhook smoke tests tracked in src/Zastava/**/TASKS.md.


@@ -0,0 +1,9 @@
# Task board — Zastava
> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| ZASTAVA-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| ZASTAVA-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| ZASTAVA-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |


@@ -0,0 +1,496 @@
# component_architecture_zastava.md — **StellaOps Zastava** (2025-Q4)
> **Scope.** Implementation-ready architecture for **Zastava**: the **runtime inspector/enforcer** that watches real workloads, detects drift from the scanned baseline, verifies image/SBOM/attestation posture, and (optionally) **admits/blocks** deployments. Includes Kubernetes & plain-Docker topologies, data contracts, APIs, security posture, performance targets, test matrices, and failure modes.
---
## 0) Mission & boundaries
**Mission.** Give operators **ground-truth** from running environments and a **fast guardrail** before workloads land:
* **Observer:** inventory containers, entrypoints actually executed, and DSOs actually loaded; verify **image signature**, **SBOM referrers**, and **attestation** presence; detect **drift** (unexpected processes/paths) and **policy violations**; publish **runtime events** to Scanner.WebService.
* **Admission (optional):** Kubernetes ValidatingAdmissionWebhook that enforces minimal posture (signed images, SBOM availability, known base images, policy PASS) **pre-flight**.
**Boundaries.**
* Zastava **does not** compute SBOMs and does not sign; it **consumes** Scanner/WebService outputs and **enforces** backend policy verdicts.
* Zastava can **request** a delta scan when the baseline is missing/stale, but scanning is done by **Scanner.Worker**.
* On non-K8s Docker hosts, Zastava runs as a host service with **observer-only** features.
---
## 1) Topology & processes
### 1.1 Components (Kubernetes)
```
stellaops/zastava-observer # DaemonSet on every node (read-only host mounts)
stellaops/zastava-webhook # ValidatingAdmissionWebhook (Deployment, 2+ replicas)
```
### 1.2 Components (Docker/VM)
```
stellaops/zastava-agent # System service; watch Docker events; observer only
```
### 1.3 Dependencies
* **Authority** (OIDC): short OpToks (DPoP/mTLS) for API calls to Scanner.WebService.
* **Scanner.WebService**: `/runtime/events` ingestion; `/policy/runtime` fetch.
* **OCI Registry** (optional): for direct referrers/sig checks if not delegated to backend.
* **Container runtime**: containerd/CRI-O/Docker (read interfaces only).
* **Kubernetes API** (watch Pods in cluster; validating webhook).
* **Host mounts** (K8s DaemonSet): `/proc`, `/var/lib/containerd` (or CRI-O), `/run/containerd/containerd.sock` (optional, read-only).
---
## 2) Data contracts
### 2.1 Runtime event (observer → Scanner.WebService)
```json
{
"eventId": "9f6a…",
"when": "2025-10-17T12:34:56Z",
"kind": "CONTAINER_START|CONTAINER_STOP|DRIFT|POLICY_VIOLATION|ATTESTATION_STATUS",
"tenant": "tenant-01",
"node": "ip-10-0-1-23",
"runtime": { "engine": "containerd", "version": "1.7.19" },
"workload": {
"platform": "kubernetes",
"namespace": "payments",
"pod": "api-7c9fbbd8b7-ktd84",
"container": "api",
"containerId": "containerd://...",
"imageRef": "ghcr.io/acme/api@sha256:abcd…",
"owner": { "kind": "Deployment", "name": "api" }
},
"process": {
"pid": 12345,
"entrypoint": ["/entrypoint.sh", "--serve"],
"entryTrace": [
{"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
{"file":"<argv>","op":"python","target":"/opt/app/server.py"}
],
"buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
},
"loadedLibs": [
{ "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
{ "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
],
"posture": {
"imageSigned": true,
"sbomReferrer": "present|missing",
"attestation": { "uuid": "rekor-uuid", "verified": true }
},
"delta": {
"baselineImageDigest": "sha256:abcd…",
"changedFiles": ["/opt/app/server.py"], // optional quick signal
"newBinaries": [{ "path":"/usr/local/bin/helper","sha256":"…" }]
},
"evidence": [
{"signal":"procfs.maps","value":"/lib/.../libssl.so.3@0x7f..."},
{"signal":"cri.task.inspect","value":"pid=12345"},
{"signal":"registry.referrers","value":"sbom: application/vnd.cyclonedx+json"}
]
}
```
### 2.2 Admission decision (webhook → API server)
```json
{
"admissionId": "…",
"namespace": "payments",
"podSpecDigest": "sha256:…",
"images": [
{
"name": "ghcr.io/acme/api:1.2.3",
"resolved": "ghcr.io/acme/api@sha256:abcd…",
"signed": true,
"hasSbomReferrers": true,
"policyVerdict": "pass|warn|fail",
"reasons": ["unsigned base image", "missing SBOM"]
}
],
"decision": "Allow|Deny",
"ttlSeconds": 300
}
```
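For teams wiring their own tooling against this contract, an illustrative set of Go types mirroring the JSON above is sketched below; the type and field names are assumptions, not the canonical `StellaOps.Zastava.Core` records.
```go
// Illustrative types only; the canonical contract lives in StellaOps.Zastava.Core.
package contracts

// AdmissionDecision mirrors the webhook -> API server payload shown above.
type AdmissionDecision struct {
	AdmissionID   string          `json:"admissionId"`
	Namespace     string          `json:"namespace"`
	PodSpecDigest string          `json:"podSpecDigest"`
	Images        []ImageDecision `json:"images"`
	Decision      string          `json:"decision"` // "Allow" | "Deny"
	TTLSeconds    int             `json:"ttlSeconds"`
}

// ImageDecision captures the per-image posture behind the verdict.
type ImageDecision struct {
	Name             string   `json:"name"`
	Resolved         string   `json:"resolved"`
	Signed           bool     `json:"signed"`
	HasSbomReferrers bool     `json:"hasSbomReferrers"`
	PolicyVerdict    string   `json:"policyVerdict"` // "pass" | "warn" | "fail"
	Reasons          []string `json:"reasons"`
}
```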
### 2.3 Schema negotiation & hashing guarantees
* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs (a negotiation sketch follows this list).
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
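A minimal sketch of the negotiation rule described above, assuming hypothetical Go types (the shipped implementation is the shared `ZastavaContractVersions` helper):
```go
// Minimal negotiation sketch; real components use the shared ZastavaContractVersions helper.
package contracts

import (
	"fmt"
	"strconv"
	"strings"
)

type SchemaVersion struct {
	Schema string
	Major  int
	Minor  int
}

// Parse splits "<schema>@v<major>.<minor>", e.g. "zastava.runtime.event@v1.3".
func Parse(s string) (SchemaVersion, error) {
	name, ver, ok := strings.Cut(s, "@v")
	if !ok {
		return SchemaVersion{}, fmt.Errorf("missing @v marker: %q", s)
	}
	majStr, minStr, ok := strings.Cut(ver, ".")
	if !ok {
		return SchemaVersion{}, fmt.Errorf("missing minor version: %q", s)
	}
	major, err := strconv.Atoi(majStr)
	if err != nil {
		return SchemaVersion{}, err
	}
	minor, err := strconv.Atoi(minStr)
	if err != nil {
		return SchemaVersion{}, err
	}
	return SchemaVersion{Schema: name, Major: major, Minor: minor}, nil
}

// Negotiate keeps the major line in lockstep and picks the highest minor both
// sides support; on any mismatch it falls back to the local default minor 0 (v1.0).
func Negotiate(local, remote SchemaVersion) SchemaVersion {
	if local.Schema != remote.Schema || local.Major != remote.Major {
		return SchemaVersion{Schema: local.Schema, Major: local.Major, Minor: 0}
	}
	minor := local.Minor
	if remote.Minor < minor {
		minor = remote.Minor
	}
	return SchemaVersion{Schema: local.Schema, Major: local.Major, Minor: minor}
}
```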
---
## 3) Observer — node agent (DaemonSet)
### 3.1 Responsibilities
* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC read-only) or `/var/log/containers/*.log` tail fallback.
* **Resolve** container → image digest, mount point rootfs.
* **Trace entrypoint**: attach **short-lived** nsenter/exec to PID 1 in the container, parse shell for `exec` chain (bounded depth), record **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and the `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size); a sampling sketch follows this list.
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):
* Image signature presence (if cosign policies are local; else ask backend).
* SBOM **referrers** presence (HEAD to registry, optional).
* Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.
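A simplified sketch of the loaded-library sampling step referenced above, assuming the host `/proc` is mounted at `/host/proc` as in the DaemonSet manifest in §13; caps and error handling are trimmed relative to the real Observer:
```go
// Sampling sketch only; the real Observer enforces additional caps and
// engine-specific rootfs resolution. Paths and limits here are assumptions.
package observer

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

type LoadedLib struct {
	Path   string // path as seen inside the container
	SHA256 string
}

// SampleLoadedLibs reads /host/proc/<pid>/maps and hashes each distinct
// file-backed mapping, stopping at maxLibs entries or maxBytes hashed in total.
func SampleLoadedLibs(pid, maxLibs int, maxBytes int64) ([]LoadedLib, error) {
	maps, err := os.Open(fmt.Sprintf("/host/proc/%d/maps", pid))
	if err != nil {
		return nil, err
	}
	defer maps.Close()

	seen := map[string]bool{}
	var libs []LoadedLib
	var hashed int64

	scanner := bufio.NewScanner(maps)
	for scanner.Scan() && len(libs) < maxLibs {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 6 {
			continue // anonymous mapping with no backing file
		}
		path := fields[5]
		if !strings.HasPrefix(path, "/") || seen[path] {
			continue // skip [vdso], [heap], [stack] and duplicates
		}
		seen[path] = true

		// Resolve through the process's root so container rootfs paths hash correctly.
		hostPath := filepath.Join(fmt.Sprintf("/host/proc/%d/root", pid), path)
		sum, n, err := hashFile(hostPath, maxBytes-hashed)
		if err != nil {
			continue // file may have vanished mid-sample; skip it
		}
		hashed += n
		libs = append(libs, LoadedLib{Path: path, SHA256: sum})
		if hashed >= maxBytes {
			break // hard cap on bytes hashed per container
		}
	}
	return libs, scanner.Err()
}

// hashFile hashes at most budget bytes of the file (a truncated hash simply
// means the cap was hit; the real Observer records that as partial evidence).
func hashFile(path string, budget int64) (string, int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", 0, err
	}
	defer f.Close()
	h := sha256.New()
	n, err := io.Copy(h, io.LimitReader(f, budget))
	if err != nil {
		return "", n, err
	}
	return hex.EncodeToString(h.Sum(nil)), n, nil
}
```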
### 3.2 Privileges & mounts (K8s)
* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
* **Capabilities:** `CAP_SYS_PTRACE` (optional if using nsenter trace), `CAP_DAC_READ_SEARCH`.
* **Host mounts (read-only):**
* `/proc` (host) → `/host/proc`
* `/run/containerd/containerd.sock` (or CRIO socket)
* `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** cluster-internal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.
### 3.3 Event batching
* Buffer NDJSON; flush by **N events** or every **2 s** (a batching sketch follows this list).
* Backpressure: local disk ring buffer (50 MB default) if Scanner is temporarily unavailable; drop oldest after the cap with **metrics** and a **warning** event.
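A sketch of that flush loop; the endpoint wiring, auth, and the disk ring buffer are elided, and the names are assumptions rather than the Observer's actual API:
```go
// Batching sketch; auth headers, retries, and the disk ring buffer are omitted.
package observer

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

type Batcher struct {
	events   chan json.RawMessage
	maxBatch int
	endpoint string // runtime events ingestion URL on Scanner.WebService
	client   *http.Client
}

func NewBatcher(endpoint string, maxBatch int) *Batcher {
	return &Batcher{
		events:   make(chan json.RawMessage, 1024),
		maxBatch: maxBatch,
		endpoint: endpoint,
		client:   &http.Client{Timeout: 5 * time.Second},
	}
}

func (b *Batcher) Enqueue(ev json.RawMessage) { b.events <- ev }

// Run drains the queue, flushing when maxBatch events accumulate or every 2 s.
func (b *Batcher) Run() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	batch := make([]json.RawMessage, 0, b.maxBatch)
	for {
		select {
		case ev := <-b.events:
			batch = append(batch, ev)
			if len(batch) >= b.maxBatch {
				b.flush(batch)
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 {
				b.flush(batch)
				batch = batch[:0]
			}
		}
	}
}

func (b *Batcher) flush(batch []json.RawMessage) {
	var buf bytes.Buffer
	for _, ev := range batch {
		buf.Write(ev)
		buf.WriteByte('\n') // NDJSON: one event per line
	}
	resp, err := b.client.Post(b.endpoint, "application/x-ndjson", &buf)
	if err != nil {
		return // the real Observer spills to the 50 MB disk ring buffer here
	}
	resp.Body.Close()
}
```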
### 3.4 Build-id capture & validation workflow
1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope (a parsing sketch follows this list).
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:
* calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
* piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.
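A sketch of step 1 plus the debug-path derivation used in steps 3 and 4, in Go for illustration only (the Observer itself may be implemented differently; see §14):
```go
// Build-id sketch; the production Observer handles PT_NOTE segments and
// stripped sections more defensively.
package observer

import (
	"debug/elf"
	"encoding/hex"
	"fmt"
)

// GNUBuildID returns the lower-case hex build-id of the ELF at path, or "" when
// the binary carries no .note.gnu.build-id section (stripped / non-GNU toolchain).
func GNUBuildID(path string) (string, error) {
	f, err := elf.Open(path) // e.g. "/host/proc/<pid>/exe"
	if err != nil {
		return "", err
	}
	defer f.Close()

	sec := f.Section(".note.gnu.build-id")
	if sec == nil {
		return "", nil
	}
	data, err := sec.Data()
	if err != nil {
		return "", err
	}
	// ELF note layout: namesz, descsz, type (4 bytes each, file byte order),
	// then the name ("GNU\0") and desc, each padded to 4-byte boundaries.
	// For NT_GNU_BUILD_ID the desc holds the raw build-id bytes.
	if len(data) < 12 {
		return "", fmt.Errorf("short note section in %s", path)
	}
	bo := f.ByteOrder
	namesz := int(bo.Uint32(data[0:4]))
	descsz := int(bo.Uint32(data[4:8]))
	descOff := 12 + ((namesz + 3) &^ 3)
	if descOff+descsz > len(data) {
		return "", fmt.Errorf("malformed note section in %s", path)
	}
	return hex.EncodeToString(data[descOff : descOff+descsz]), nil
}

// DebugPath maps a build-id onto the debug-store layout used by the bundles:
// debug/.build-id/<aa>/<rest>.debug
func DebugPath(buildID string) string {
	if len(buildID) < 3 {
		return ""
	}
	return fmt.Sprintf("debug/.build-id/%s/%s.debug", buildID[:2], buildID[2:])
}
```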
---
## 4) Admission Webhook (Kubernetes)
### 4.1 Gate criteria
Configurable policy (fetched from backend and cached):
* **Image signature**: must be cosign-verifiable against configured key(s) or keyless identities.
* **SBOM availability**: at least one **CycloneDX** referrer or **Scanner.WebService** catalog entry.
* **Scanner policy verdict**: backend `PASS` required for namespaces/labels matching rules; allow `WARN` if configured.
* **Registry allowlists/denylists**.
* **Tag bans** (e.g., `:latest`).
* **Base image allowlists** (by digest).
### 4.2 Flow
```mermaid
sequenceDiagram
autonumber
participant K8s as API Server
participant WH as Zastava Webhook
participant SW as Scanner.WebService
K8s->>WH: AdmissionReview(Pod)
WH->>WH: Resolve images to digests (remote HEAD/pull if needed)
WH->>SW: POST /policy/runtime { digests, namespace, labels }
SW-->>WH: { per-image: {signed, hasSbom, verdict, reasons}, ttl }
alt All pass
WH-->>K8s: AdmissionResponse(Allow, ttl)
else Any fail (enforce=true)
WH-->>K8s: AdmissionResponse(Deny, message)
end
```
**Caching:** Per-digest results are cached for `ttlSeconds` (default 300 s). **Fail-open** or **fail-closed** behaviour is configurable per namespace.
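A sketch of the per-digest TTL cache and the enforce/fail-open decision; all names are assumed for illustration:
```go
// Cache/gate sketch; enforcement semantics here are a simplification of the
// documented enforce + failOpenNamespaces settings.
package webhook

import (
	"sync"
	"time"
)

type Verdict struct {
	Signed        bool
	HasSbom       bool
	PolicyVerdict string // "pass" | "warn" | "fail"
}

type cacheEntry struct {
	verdict Verdict
	expires time.Time
}

// VerdictCache keeps per-digest results for ttl (default 300 s).
type VerdictCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]cacheEntry
}

func NewVerdictCache(ttl time.Duration) *VerdictCache {
	return &VerdictCache{ttl: ttl, m: map[string]cacheEntry{}}
}

func (c *VerdictCache) Get(digest string) (Verdict, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[digest]
	if !ok || time.Now().After(e.expires) {
		return Verdict{}, false
	}
	return e.verdict, true
}

func (c *VerdictCache) Put(digest string, v Verdict) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[digest] = cacheEntry{verdict: v, expires: time.Now().Add(c.ttl)}
}

// Decide applies the gate: deny on any failing image when enforcement is on,
// and fall back to the namespace's fail-open/fail-closed setting when the
// backend could not be reached.
func Decide(verdicts []Verdict, backendErr error, enforce, failOpen bool) (bool, string) {
	if backendErr != nil {
		if failOpen {
			return true, "backend unreachable; namespace is fail-open"
		}
		return false, "backend unreachable; namespace is fail-closed"
	}
	for _, v := range verdicts {
		if v.Signed && v.HasSbom && v.PolicyVerdict != "fail" {
			continue
		}
		if !enforce {
			return true, "posture gate failed; enforcement disabled (warn only)"
		}
		return false, "unsigned image, missing SBOM, or policy fail"
	}
	return true, ""
}
```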
### 4.3 TLS & HA
* Webhook has its own **serving cert** signed by the cluster CA (or a custom cert + CA bundle supplied via configuration).
* Deployment ≥ 2 replicas; **leaderless**; stateless.
---
## 5) Backend integration (Scanner.WebService)
### 5.1 Ingestion endpoint
`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*
* Validates event schema; enforces rate caps by tenant/node; persists to **Mongo** (`runtime.events` capped collection or regular with TTL).
* Performs **correlation**:
* Attach nearest **image SBOM** (inventory/usage) and **BOMIndex** if known.
* If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).
### 5.2 Policy decision API (for webhook)
`POST /api/v1/scanner/policy/runtime`
The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.
Request:
```json
{
"namespace": "payments",
"labels": { "app": "api", "env": "prod" },
"images": ["ghcr.io/acme/api@sha256:...", "ghcr.io/acme/nginx@sha256:..."]
}
```
Response:
```json
{
"ttlSeconds": 300,
"results": {
"ghcr.io/acme/api@sha256:...": {
"signed": true,
"hasSbom": true,
"policyVerdict": "pass",
"reasons": [],
"rekor": { "uuid": "..." }
},
"ghcr.io/acme/nginx@sha256:...": {
"signed": false,
"hasSbom": false,
"policyVerdict": "fail",
"reasons": ["unsigned", "missing SBOM"]
}
}
}
```
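A client-side sketch of this call; the path matches the endpoint above, while the types, OpTok acquisition, and the DPoP proof header are simplified assumptions:
```go
// Client sketch; the real webhook uses the shared runtime stack
// (AddZastavaRuntimeCore + IZastavaAuthorityTokenProvider) for auth and telemetry.
package webhook

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type RuntimePolicyRequest struct {
	Namespace string            `json:"namespace"`
	Labels    map[string]string `json:"labels"`
	Images    []string          `json:"images"`
}

type RuntimePolicyResult struct {
	Signed        bool     `json:"signed"`
	HasSbom       bool     `json:"hasSbom"`
	PolicyVerdict string   `json:"policyVerdict"`
	Reasons       []string `json:"reasons"`
}

type RuntimePolicyResponse struct {
	TTLSeconds int                            `json:"ttlSeconds"`
	Results    map[string]RuntimePolicyResult `json:"results"`
}

// FetchRuntimePolicy posts the digest batch and decodes the per-image results.
func FetchRuntimePolicy(ctx context.Context, baseURL, opTok string, req RuntimePolicyRequest) (*RuntimePolicyResponse, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost,
		baseURL+"/api/v1/scanner/policy/runtime", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	httpReq.Header.Set("Authorization", "DPoP "+opTok) // DPoP proof header omitted in this sketch

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(httpReq)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("policy endpoint returned %s", resp.Status)
	}
	var out RuntimePolicyResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```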
---
## 6) Configuration (YAML)
```yaml
zastava:
mode:
observer: true
webhook: true
backend:
baseAddress: "https://scanner-web.internal"
policyPath: "/api/v1/scanner/policy/runtime"
requestTimeoutSeconds: 5
allowInsecureHttp: false
runtime:
authority:
issuer: "https://authority.internal"
clientId: "zastava-observer"
audience: ["scanner","zastava"]
scopes:
- "api:scanner.runtime.write"
refreshSkewSeconds: 120
requireDpop: true
requireMutualTls: true
allowStaticTokenFallback: false
staticTokenPath: null # Optional bootstrap secret
tenant: "tenant-01"
environment: "prod"
deployment: "cluster-a"
logging:
includeScopes: true
includeActivityTracking: true
staticScope:
plane: "runtime"
metrics:
meterName: "StellaOps.Zastava"
meterVersion: "1.0.0"
commonTags:
cluster: "prod-cluster"
engine: "auto" # containerd|cri-o|docker|auto
procfs: "/host/proc"
collect:
entryTrace: true
loadedLibs: true
maxLibs: 256
maxHashBytesPerContainer: 64_000_000
maxDepth: 48
admission:
enforce: true
failOpenNamespaces: ["dev", "test"]
verify:
imageSignature: true
sbomReferrer: true
scannerPolicyPass: true
cacheTtlSeconds: 300
resolveTags: true # do remote digest resolution for tag-only images
limits:
eventsPerSecond: 50
burst: 200
perNodeQueue: 10_000
security:
mounts:
containerdSock: "/run/containerd/containerd.sock:ro"
proc: "/proc:/host/proc:ro"
runtimeState: "/var/lib/containerd:ro"
```
> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
---
## 7) Security posture
* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from API server (K8s handles).
* **Least privileges**: read-only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per-node caps; per-tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.
---
## 8) Metrics, logs, tracing
**Observer**
* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`
**Webhook**
* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`
**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.
---
## 9) Performance & scale targets
* **Observer**: ≤ **30 ms** to sample `/proc/<pid>/maps` and compute quick hashes for ≤ 64 files; ≤ **200 ms** for the full library set (256 libs).
* **Webhook**: P95 ≤ **8 ms** with warm cache; ≤ **50 ms** with one backend round-trip.
* **Throughput**: 1k admission requests/min/replica; 5k runtime events/min/node with batching.
---
## 10) Drift detection model
**Signals**
* **Process drift**: terminal program differs from **EntryTrace** baseline.
* **Library drift**: loaded DSOs not present in **Usage** SBOM view.
* **Filesystem drift**: new executable files under `/usr/local/bin`, `/opt`, `/app` with **mtime** after image creation.
* **Network drift** (optional): listening sockets on unexpected ports (from policy).
**Action**
* Emit `DRIFT` event with evidence; backend can **auto-queue** a delta scan; policy may **escalate** to alert/block (Admission cannot block already-running pods; rely on K8s policies/PodSecurity or operator action). A drift-check sketch follows.
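A sketch of the process- and library-drift checks, assuming loaded paths come from the `/proc` maps sample and the Usage view is available as a set of paths:
```go
// Drift-check sketch; baseline lookup and event emission are out of scope here.
package drift

// LibraryDrift returns loaded DSO paths that are absent from the Usage SBOM
// view of the baseline image; a non-empty result feeds a DRIFT runtime event.
func LibraryDrift(loadedPaths []string, usageView map[string]bool) []string {
	var drifted []string
	for _, p := range loadedPaths {
		if !usageView[p] {
			drifted = append(drifted, p)
		}
	}
	return drifted
}

// ProcessDrift flags a terminal program that differs from the EntryTrace baseline.
func ProcessDrift(terminalProgram, baselineProgram string) bool {
	return terminalProgram != baselineProgram
}
```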
---
## 11) Test matrix
* **Engines**: containerd, CRI-O, Docker; ensure PID resolution and rootfs mapping.
* **EntryTrace**: bash features (case, if, run-parts, `.`/`source`), language launchers (python/node/java).
* **Procfs**: multiple arches, musl/glibc images; static binaries (maps minimal).
* **Admission**: unsigned images, missing SBOM referrers, tag-only images, digest resolution, backend latency, cache TTL.
* **Perf/soak**: 500 Pods/node churn; webhook under HPA growth.
* **Security**: privilege-escalation attempts blocked (`allowPrivilegeEscalation: false`), read-only mounts enforced, rate-limit abuse.
* **Failure injection**: backend down (observer buffers, webhook fail-open/closed), registry throttling, containerd socket unavailable.
---
## 12) Failure modes & responses
| Condition | Observer behavior | Webhook behavior |
| ------------------------------- | ---------------------------------------------- | ------------------------------------------------------ |
| Backend unreachable | Buffer to disk; drop after cap; emit metric | **Fail-open/closed** per namespace config |
| PID vanished midsample | Retry once; emit partial evidence | N/A |
| CRI socket missing | Fallback to K8s events only (reduced fidelity) | N/A |
| Registry digest resolve blocked | Defer to backend; mark `resolve=unknown` | Deny or allow per `resolveTags` & `failOpenNamespaces` |
| Excessive events | Apply local rate limit, coalesce | N/A |
---
## 13) Deployment notes (K8s)
**DaemonSet (snippet):**
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: zastava-observer, namespace: stellaops }
spec:
template:
spec:
serviceAccountName: zastava
hostPID: true
containers:
- name: observer
image: stellaops/zastava-observer:2.3
securityContext:
runAsUser: 0
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities: { add: ["SYS_PTRACE","DAC_READ_SEARCH"] }
volumeMounts:
- { name: proc, mountPath: /host/proc, readOnly: true }
- { name: containerd-sock, mountPath: /run/containerd/containerd.sock, readOnly: true }
- { name: containerd-state, mountPath: /var/lib/containerd, readOnly: true }
volumes:
- { name: proc, hostPath: { path: /proc } }
- { name: containerd-sock, hostPath: { path: /run/containerd/containerd.sock } }
- { name: containerd-state, hostPath: { path: /var/lib/containerd } }
```
**Webhook (snippet):**
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
- name: gate.zastava.stella-ops.org
admissionReviewVersions: ["v1"]
sideEffects: None
failurePolicy: Ignore # or Fail
rules:
- operations: ["CREATE","UPDATE"]
apiGroups: [""]
apiVersions: ["v1"]
resources: ["pods"]
clientConfig:
service:
namespace: stellaops
name: zastava-webhook
path: /admit
caBundle: <base64 CA>
```
---
## 14) Implementation notes
* **Language**: Rust (observer) for low-latency `/proc` parsing; Go/.NET viable too. Webhook can be .NET 10 for parity with backend.
* **CRI drivers**: pluggable (`containerd`, `cri-o`, `docker`). Prefer CRI over parsing logs.
* **Shell parser**: reuse Scanner.EntryTrace grammar for consistent results (compile to WASM if observer is Rust/Go).
* **Hashing**: `BLAKE3` for fast local checks, then compute `sha256` where the canonical digest is required (or hash with `sha256` directly when the budget allows).
* **Resilience**: never block container start; observer is **passive**; only webhook decides allow/deny.
---
## 15) Roadmap
* **eBPF** option for syscall/library load tracing (kernel-level, opt-in).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live used-by-entrypoint synthesis**: send a compact bitset diff to the backend to tighten the Usage view.
* **Admission dry-run** dashboards (simulate block lists before enforcing).


@@ -0,0 +1,19 @@
# Implementation plan — Zastava
## Current objectives
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.
## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.
## Backlog references
- ZASTAVA runtime tasks in ../../TASKS.md.
- Webhook smoke tests tracked in src/Zastava/**/TASKS.md.
## Coordination
- Review ./AGENTS.md before picking up new work.
- Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md.
- Update this plan whenever scope, dependencies, or guardrails change.


@@ -0,0 +1,205 @@
{
"title": "Zastava Runtime Plane",
"uid": "zastava-runtime",
"timezone": "utc",
"schemaVersion": 38,
"version": 1,
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
},
"panels": [
{
"id": 1,
"type": "timeseries",
"title": "Observer Event Rate",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"targets": [
{
"expr": "sum by (tenant,component,kind) (rate(zastava_runtime_events_total{tenant=~\"$tenant\"}[5m]))",
"legendFormat": "{{tenant}}/{{component}}/{{kind}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"fieldConfig": {
"defaults": {
"unit": "1/s",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
}
]
}
},
"overrides": []
},
"options": {
"legend": {
"showLegend": true,
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 2,
"type": "timeseries",
"title": "Admission Decisions",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"targets": [
{
"expr": "sum by (decision) (rate(zastava_admission_decisions_total{tenant=~\"$tenant\"}[5m]))",
"legendFormat": "{{decision}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"fieldConfig": {
"defaults": {
"unit": "1/s",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
},
{
"color": "red",
"value": 20
}
]
}
},
"overrides": []
},
"options": {
"legend": {
"showLegend": true,
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 3,
"type": "timeseries",
"title": "Backend Latency P95",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket{tenant=~\"$tenant\"}[5m])))",
"legendFormat": "p95 latency"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"fieldConfig": {
"defaults": {
"unit": "ms",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
},
{
"color": "orange",
"value": 500
},
{
"color": "red",
"value": 750
}
]
}
},
"overrides": []
},
"options": {
"legend": {
"showLegend": true,
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
}
],
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"label": "Prometheus",
"current": {
"text": "Prometheus",
"value": "Prometheus"
}
},
{
"name": "tenant",
"type": "query",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"definition": "label_values(zastava_runtime_events_total, tenant)",
"refresh": 1,
"hide": 0,
"current": {
"text": ".*",
"value": ".*"
},
"regex": "",
"includeAll": true,
"multi": true,
"sort": 1
}
]
},
"annotations": {
"list": [
{
"name": "Deployments",
"type": "tags",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"enable": true,
"iconColor": "rgba(255, 96, 96, 1)"
}
]
}
}


@@ -0,0 +1,31 @@
groups:
- name: zastava-runtime
interval: 30s
rules:
- alert: ZastavaRuntimeEventsSilent
expr: sum(rate(zastava_runtime_events_total[10m])) == 0
for: 15m
labels:
severity: warning
service: zastava-runtime
annotations:
summary: "Observer events stalled"
description: "No runtime events emitted in the last 15 minutes. Check observer DaemonSet health and container runtime mounts."
- alert: ZastavaRuntimeBackendLatencyHigh
expr: histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 0.75
for: 10m
labels:
severity: critical
service: zastava-runtime
annotations:
summary: "Runtime backend latency p95 above 750 ms"
description: "Latency to Scanner runtime APIs is elevated. Inspect Scanner.WebService readiness, Authority OpTok issuance, and cluster network."
- alert: ZastavaAdmissionDenySpike
expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) > 20
for: 5m
labels:
severity: warning
service: zastava-runtime
annotations:
summary: "Admission webhook denies exceeding threshold"
description: "Webhook is denying more than 20 pod admissions per minute. Confirm policy verdicts and consider fail-open exception for impacted namespaces."


@@ -0,0 +1,174 @@
# Zastava Runtime Operations Runbook
This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).
## 1. Prerequisites
- **Authority client credentials:** service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability:** cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook.
- **Host mounts:** read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle:** operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst` containing container images,
  Grafana dashboards, and the Prometheus rules referenced below.
- **Secrets:** Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
  live outside git. For air-gapped installs, copy them to the sealed secrets vault.
### 1.1 Telemetry quick reference
| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |
## 2. Deployment workflows
### 2.1 Fresh install (Helm overlay)
1. Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
- `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
- `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
- `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
- `zastava-authority-dpop` (JWK + private key).
- `zastava-authority-mtls` (client cert/key chain).
- `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy Observer DaemonSet and Webhook chart:
```sh
helm upgrade --install zastava-runtime deploy/helm/zastava \
-f values/zastava-runtime.yaml \
--namespace stellaops \
--create-namespace
```
5. Verify:
- `kubectl -n stellaops get pods -l app=zastava-observer` ready.
- `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows
`Issued runtime OpTok` audit line with DPoP token type.
- Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
### 2.2 Upgrades
1. Scale webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests:
- Apply unsigned Pod manifest → expect `deny` (policy fail).
- Apply signed Pod manifest → expect `allow`.
5. Record upgrade in ops log with Git SHA + Helm chart version.
### 2.3 Rollback
1. Use Helm revision history: `helm history zastava-runtime`.
2. Rollback: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
```sh
kubectl -n stellaops exec deploy/zastava-webhook -- \
zastava-webhook invalidate-op-token --audience scanner
```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).
## 3. Authority & security guardrails
- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit
`authority.token.issue` scope with decision data; absence indicates misconfig.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable only in
lab clusters; expect warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate them nightly and disable the fallback once Authority is reachable.
- Audit every change in `zastava.runtime.authority` through change management.
Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
to confirm key rotation.
## 4. Incident response
### 4.1 Authority offline
1. Check Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for `authority.token.fallback` scope.
3. If fallback engaged, verify static token validity duration; rotate secret if older than 24 h.
4. Once Authority restored, delete static fallback secret and restart pods to rebind DPoP keys.
### 4.2 Scanner/WebService latency spike
1. Alert `ZastavaRuntimeBackendLatencyHigh` fires when P95 stays above 750 ms for 10 minutes.
2. Run backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the observer's disk-backed buffer may throttle or drop events. Check for drops via
`kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).
### 4.3 Admission deny storm
1. Alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check policy backlog in Scanner (`/policy/runtime` logs). Engage application
owner; optionally set namespace to `failOpenNamespaces` after risk assessment.
## 5. Offline kit & air-gapped notes
- Bundle contents:
- Observer/Webhook container images (multi-arch).
- `docs/modules/zastava/operations/runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
- Sample `zastava-runtime.values.yaml`.
- Verification:
- Validate signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
- Extract Prometheus rules into offline monitoring cluster (`/etc/prometheus/rules.d`).
- Import Grafana dashboard via `grafana-cli --config ...`.
## 6. Observability assets
- Prometheus alert rules: `docs/modules/zastava/operations/runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/modules/zastava/operations/runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
the Offline Kit manifest.
## 7. Build-id correlation & symbol retrieval
Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note) and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:
1. Capture the hash from CLI/webhook/Scanner API—for example:
```bash
stellaops-cli runtime policy test --image <digest> --namespace <ns>
```
Copy one of the `Build IDs` (e.g.
`5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check it exists:
```bash
ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
```
3. If the file is missing, rehydrate it from Offline Kit bundles or the
`debug-store` object bucket (mirror of release artefacts):
```bash
oras cp oci://registry.internal/debug-store:latest . --include \
"5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
```
4. Confirm the running process advertises the same GNU build-id before
symbolising:
```bash
readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
in `debuginfod` for fleet-wide symbol resolution:
```bash
debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
runtime events indicate stripped binaries without the GNU note—schedule a
rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store
allowlist so the scanner can surface a fallback symbol package.
Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.