feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
22
docs/modules/zastava/AGENTS.md
Normal file
@@ -0,0 +1,22 @@
# Zastava agent guide

## Mission
Zastava monitors running workloads, verifies supply chain posture, and enforces runtime policy via Kubernetes admission webhooks.

## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)

## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.

## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.

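The determinism guardrail above can be illustrated with a minimal sketch. The helper names `normalise_timestamp` and `deterministic_dump` are hypothetical and not taken from the Zastava codebase; the sketch assumes that naive datetimes are already in UTC and that second precision is sufficient.

```python
import json
from datetime import datetime, timezone


def normalise_timestamp(value: datetime) -> str:
    """Render a datetime as UTC ISO-8601 with a trailing 'Z' (second precision)."""
    if value.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def deterministic_dump(records: list[dict], sort_key: str) -> str:
    """Serialise records in a stable order, with sorted keys and fixed separators."""
    ordered = sorted(records, key=lambda record: record[sort_key])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
```

Two runs over the same input then yield byte-identical output regardless of input order or the host's local timezone.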
33
docs/modules/zastava/README.md
Normal file
@@ -0,0 +1,33 @@
# StellaOps Zastava

Zastava monitors running workloads, verifies supply chain posture, and enforces runtime policy via Kubernetes admission webhooks.

## Responsibilities
- Observe node/container activity and emit runtime events.
- Validate signatures, SBOM presence, and backend verdicts before allowing containers.
- Buffer and replay events during disconnections.
- Trigger delta scans when runtime posture drifts.

## Key components
- `StellaOps.Zastava.Observer` daemonset.
- `StellaOps.Zastava.Webhook` admission controller.
- Shared contracts in `StellaOps.Zastava.Core`.

## Integrations & dependencies
- Authority for OpToks and mTLS.
- Scanner/Scheduler for remediation triggers.
- Notify/UI for runtime alerts and dashboards.

## Operational notes
- Runbook ./operations/runtime.md with Grafana/Prometheus assets.
- Offline kit assets bundling webhook charts.
- DPoP/mTLS rotation guidance shared with Authority.

## Related resources
- ./operations/runtime.md
- ./operations/runtime-grafana-dashboard.json
- ./operations/runtime-prometheus-rules.yaml

## Backlog references
- ZASTAVA runtime tasks in ../../TASKS.md.
- Webhook smoke tests tracked in src/Zastava/**/TASKS.md.
9
docs/modules/zastava/TASKS.md
Normal file
@@ -0,0 +1,9 @@
# Task board — Zastava

> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.

| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| ZASTAVA-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| ZASTAVA-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| ZASTAVA-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |
496
docs/modules/zastava/architecture.md
Normal file
@@ -0,0 +1,496 @@
# component_architecture_zastava.md — **Stella Ops Zastava** (2025Q4)

> **Scope.** Implementation‑ready architecture for **Zastava**: the **runtime inspector/enforcer** that watches real workloads, detects drift from the scanned baseline, verifies image/SBOM/attestation posture, and (optionally) **admits/blocks** deployments. Includes Kubernetes & plain‑Docker topologies, data contracts, APIs, security posture, performance targets, test matrices, and failure modes.

---

## 0) Mission & boundaries

**Mission.** Give operators **ground‑truth** from running environments and a **fast guardrail** before workloads land:

* **Observer:** inventory containers, entrypoints actually executed, and DSOs actually loaded; verify **image signature**, **SBOM referrers**, and **attestation** presence; detect **drift** (unexpected processes/paths) and **policy violations**; publish **runtime events** to Scanner.WebService.
* **Admission (optional):** Kubernetes ValidatingAdmissionWebhook that enforces minimal posture (signed images, SBOM availability, known base images, policy PASS) **pre‑flight**.

**Boundaries.**

* Zastava **does not** compute SBOMs and does not sign; it **consumes** Scanner/WebService outputs and **enforces** backend policy verdicts.
* Zastava can **request** a delta scan when the baseline is missing/stale, but scanning is done by **Scanner.Worker**.
* On non‑K8s Docker hosts, Zastava runs as a host service with **observer‑only** features.

---

## 1) Topology & processes

### 1.1 Components (Kubernetes)

```
stellaops/zastava-observer    # DaemonSet on every node (read-only host mounts)
stellaops/zastava-webhook     # ValidatingAdmissionWebhook (Deployment, 2+ replicas)
```

### 1.2 Components (Docker/VM)

```
stellaops/zastava-agent       # System service; watch Docker events; observer only
```

### 1.3 Dependencies

* **Authority** (OIDC): short OpToks (DPoP/mTLS) for API calls to Scanner.WebService.
* **Scanner.WebService**: `/runtime/events` ingestion; `/policy/runtime` fetch.
* **OCI Registry** (optional): for direct referrers/sig checks if not delegated to backend.
* **Container runtime**: containerd/CRI‑O/Docker (read interfaces only).
* **Kubernetes API** (watch Pods in cluster; validating webhook).
* **Host mounts** (K8s DaemonSet): `/proc`, `/var/lib/containerd` (or CRI‑O), `/run/containerd/containerd.sock` (optional, read‑only).

---

## 2) Data contracts

### 2.1 Runtime event (observer → Scanner.WebService)

```json
{
  "eventId": "9f6a…",
  "when": "2025-10-17T12:34:56Z",
  "kind": "CONTAINER_START|CONTAINER_STOP|DRIFT|POLICY_VIOLATION|ATTESTATION_STATUS",
  "tenant": "tenant-01",
  "node": "ip-10-0-1-23",
  "runtime": { "engine": "containerd", "version": "1.7.19" },
  "workload": {
    "platform": "kubernetes",
    "namespace": "payments",
    "pod": "api-7c9fbbd8b7-ktd84",
    "container": "api",
    "containerId": "containerd://...",
    "imageRef": "ghcr.io/acme/api@sha256:abcd…",
    "owner": { "kind": "Deployment", "name": "api" }
  },
  "process": {
    "pid": 12345,
    "entrypoint": ["/entrypoint.sh", "--serve"],
    "entryTrace": [
      {"file":"/entrypoint.sh","line":3,"op":"exec","target":"/usr/bin/python3"},
      {"file":"<argv>","op":"python","target":"/opt/app/server.py"}
    ],
    "buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
  },
  "loadedLibs": [
    { "path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "…"},
    { "path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "…"}
  ],
  "posture": {
    "imageSigned": true,
    "sbomReferrer": "present|missing",
    "attestation": { "uuid": "rekor-uuid", "verified": true }
  },
  "delta": {
    "baselineImageDigest": "sha256:abcd…",
    "changedFiles": ["/opt/app/server.py"],   // optional quick signal
    "newBinaries": [{ "path":"/usr/local/bin/helper","sha256":"…" }]
  },
  "evidence": [
    {"signal":"procfs.maps","value":"/lib/.../libssl.so.3@0x7f..."},
    {"signal":"cri.task.inspect","value":"pid=12345"},
    {"signal":"registry.referrers","value":"sbom: application/vnd.cyclonedx+json"}
  ]
}
```

### 2.2 Admission decision (webhook → API server)

```json
{
  "admissionId": "…",
  "namespace": "payments",
  "podSpecDigest": "sha256:…",
  "images": [
    {
      "name": "ghcr.io/acme/api:1.2.3",
      "resolved": "ghcr.io/acme/api@sha256:abcd…",
      "signed": true,
      "hasSbomReferrers": true,
      "policyVerdict": "pass|warn|fail",
      "reasons": ["unsigned base image", "missing SBOM"]
    }
  ],
  "decision": "Allow|Deny",
  "ttlSeconds": 300
}
```

### 2.3 Schema negotiation & hashing guarantees

* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.

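The negotiation and hashing rules above can be sketched roughly as follows. This is not the `ZastavaContractVersions` helper itself: the function names are hypothetical, and sorted-key JSON stands in for the canonical serializer, whose real property order (append-at-end, per the evolution rule) is not reproduced here.

```python
import base64
import hashlib
import json
import re

_VERSION_RE = re.compile(r"^(?P<schema>[a-z.]+)@v(?P<major>\d+)\.(?P<minor>\d+)$")


def parse_version(value: str) -> tuple[str, int, int]:
    """Split '<schema>@v<major>.<minor>' into its three parts."""
    match = _VERSION_RE.match(value)
    if match is None:
        raise ValueError(f"malformed schema version: {value!r}")
    return match["schema"], int(match["major"]), int(match["minor"])


def negotiate(local: list[str], remote: list[str], default: str) -> str:
    """Pick the highest version both sides support, else fall back to the local default."""
    common = set(map(parse_version, local)) & set(map(parse_version, remote))
    if not common:
        return default
    schema, major, minor = max(common, key=lambda v: (v[1], v[2]))
    return f"{schema}@v{major}.{minor}"


def content_id(payload: dict) -> str:
    """Hash a canonical rendering and format it as sha256-<base64url> (unpadded)."""
    # Assumption: sorted keys and fixed separators approximate the canonical form.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return "sha256-" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```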
---

## 3) Observer — node agent (DaemonSet)

### 3.1 Responsibilities

* **Watch** container lifecycle (start/stop) via CRI (`/run/containerd/containerd.sock` gRPC read‑only) or `/var/log/containers/*.log` tail fallback.
* **Resolve** container → image digest, mount point rootfs.
* **Trace entrypoint**: attach **short‑lived** nsenter/exec to PID 1 in container, parse shell for `exec` chain (bounded depth), record **terminal program**.
* **Sample loaded libs**: read `/proc/<pid>/maps` and `exe` symlink to collect **actually loaded** DSOs; compute **sha256** for each mapped file (bounded count/size).
* **Record GNU build-id**: parse `NT_GNU_BUILD_ID` from `/proc/<pid>/exe` and attach the normalized hex to runtime events for symbol/debug-store correlation.
* **Posture check** (cheap):

  * Image signature presence (if cosign policies are local; else ask backend).
  * SBOM **referrers** presence (HEAD to registry, optional).
  * Rekor UUID known (query Scanner.WebService by image digest).
* **Publish runtime events** to Scanner.WebService `/runtime/events` (batch & compress).
* **Request delta scan** if: no SBOM in catalog OR base differs from known baseline.

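The **Sample loaded libs** step above can be sketched as a parser over the text of `/proc/<pid>/maps`. The function name `loaded_files` is hypothetical, and restricting the result to executable, file-backed mappings is an assumption of this sketch rather than something the design states. Hashing is left out; only the bounded count is shown.

```python
def loaded_files(maps_text: str, max_libs: int = 256) -> list[dict]:
    """Return unique executable file mappings as {path, inode}, in first-seen order."""
    seen: set[tuple[str, int]] = set()
    result: list[dict] = []
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname column
        _addr, perms, _offset, _dev, inode, path = fields
        if not path.startswith("/") or "x" not in perms:
            continue  # skip [heap], [stack], [vdso] and non-executable segments
        key = (path, int(inode))
        if key in seen:
            continue
        seen.add(key)
        result.append({"path": path, "inode": int(inode)})
        if len(result) >= max_libs:
            break  # hard cap, mirroring the bounded count in the text
    return result
```

In the observer the input would come from the host-mounted procfs (for example `/host/proc/<pid>/maps`), read with read-only access.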
### 3.2 Privileges & mounts (K8s)

* **SecurityContext:** `runAsUser: 0`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`.
* **Capabilities:** `CAP_SYS_PTRACE` (optional if using nsenter trace), `CAP_DAC_READ_SEARCH`.
* **Host mounts (read‑only):**

  * `/proc` (host) → `/host/proc`
  * `/run/containerd/containerd.sock` (or CRI‑O socket)
  * `/var/lib/containerd/io.containerd.runtime.v2.task` (rootfs paths & pids)
* **Networking:** cluster‑internal egress to Scanner.WebService only.
* **Rate limits:** hard caps for bytes hashed and file count per container to avoid noisy tenants.

### 3.3 Event batching

* Buffer ND‑JSON; flush by **N events** or **2 s**.
* Backpressure: local disk ring buffer (50 MB default) if Scanner is temporarily unavailable; drop oldest after cap with **metrics** and **warning** event.

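The batching and backpressure behaviour above can be sketched as follows. This is an in-memory approximation under stated assumptions: the class name `EventBatcher` is hypothetical, the backlog is capped by event count rather than by the 50 MB on-disk ring buffer, and a transport failure is modelled as `OSError`.

```python
import time
from collections import deque


class EventBatcher:
    """Flush by count or age; keep a bounded backlog that drops the oldest events."""

    def __init__(self, max_events=100, max_age_s=2.0, backlog_cap=1000, clock=time.monotonic):
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._backlog_cap = backlog_cap
        self._clock = clock
        self._pending = deque()
        self._first_at = None
        self._backlog = deque(maxlen=backlog_cap)  # a full deque discards from the left
        self.dropped = 0  # would feed a drop counter metric

    def add(self, event: dict) -> None:
        if not self._pending:
            self._first_at = self._clock()
        self._pending.append(event)

    def due(self) -> bool:
        """True once N events are queued or the oldest queued event is max_age_s old."""
        if not self._pending:
            return False
        too_many = len(self._pending) >= self._max_events
        too_old = self._clock() - self._first_at >= self._max_age_s
        return too_many or too_old

    def flush(self, send) -> bool:
        """Send backlog plus pending events; on failure, park them in the backlog."""
        batch = list(self._backlog) + list(self._pending)
        self._pending.clear()
        self._backlog.clear()
        if not batch:
            return True
        try:
            send(batch)
            return True
        except OSError:
            self.dropped += max(0, len(batch) - self._backlog_cap)
            self._backlog.extend(batch)  # oldest entries fall off when over the cap
            return False
```

A caller would poll `due()` from its event loop and pass a function that posts the batch to the backend as `send`.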
### 3.4 Build-id capture & validation workflow

1. When Observer sees a `CONTAINER_START` it dereferences `/proc/<pid>/exe`, extracts the `NT_GNU_BUILD_ID` note, normalises it to lower-case hex, and sends it as `process.buildId` in the runtime envelope.
2. Scanner.WebService persists the observation and propagates the most recent hashes into `/policy/runtime` responses (`buildIds` list) and policy caches consumed by the webhook/CLI.
3. Release engineering copies the matching `.debug` files into the bundle (`debug/.build-id/<aa>/<rest>.debug`) and publishes `debug/debug-manifest.json` with per-hash digests. Offline Kit packaging reuses those artefacts verbatim (see `ops/offline-kit/mirror_debug_store.py`).
4. Operators resolve symbols by either:
   * calling `stellaops-cli runtime policy test --image <digest>` to read the current `buildIds` and then fetching the corresponding `.debug` file from the bundle/offline mirror, or
   * piping the hash into `debuginfod-find debuginfo <buildId>` when a `debuginfod` service is wired against the mirrored tree.
5. Missing hashes indicate stripped binaries without GNU notes; operators should trigger a rebuild with `-Wl,--build-id` or register a fallback symbol package as described in the runtime operations runbook.

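Steps 1 and 3 above imply a simple mapping from a build-id to its location in the debug store. A minimal sketch, assuming the `debug/.build-id/<aa>/<rest>.debug` layout quoted in the text; the function names are hypothetical, and extracting the ELF note itself is out of scope here.

```python
import re
from pathlib import PurePosixPath

_HEX_RE = re.compile(r"^[0-9a-f]+$")


def normalise_build_id(raw: str) -> str:
    """Lower-case the hex string and reject anything that is not whole hex bytes."""
    value = raw.strip().lower()
    if len(value) < 4 or len(value) % 2 or not _HEX_RE.match(value):
        raise ValueError(f"not a hex build-id: {raw!r}")
    return value


def debug_file_path(build_id: str, root: str = "debug") -> PurePosixPath:
    """Map a build-id to <root>/.build-id/<first byte>/<remaining hex>.debug."""
    value = normalise_build_id(build_id)
    return PurePosixPath(root) / ".build-id" / value[:2] / f"{value[2:]}.debug"
```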
---

## 4) Admission Webhook (Kubernetes)

### 4.1 Gate criteria

Configurable policy (fetched from backend and cached):

* **Image signature**: must be cosign‑verifiable to configured key(s) or keyless identities.
* **SBOM availability**: at least one **CycloneDX** referrer or **Scanner.WebService** catalog entry.
* **Scanner policy verdict**: backend `PASS` required for namespaces/labels matching rules; allow `WARN` if configured.
* **Registry allowlists/denylists**.
* **Tag bans** (e.g., `:latest`).
* **Base image allowlists** (by digest).

### 4.2 Flow

```mermaid
sequenceDiagram
  autonumber
  participant K8s as API Server
  participant WH as Zastava Webhook
  participant SW as Scanner.WebService

  K8s->>WH: AdmissionReview(Pod)
  WH->>WH: Resolve images to digests (remote HEAD/pull if needed)
  WH->>SW: POST /policy/runtime { digests, namespace, labels }
  SW-->>WH: { per-image: {signed, hasSbom, verdict, reasons}, ttl }
  alt All pass
    WH-->>K8s: AdmissionResponse(Allow, ttl)
  else Any fail (enforce=true)
    WH-->>K8s: AdmissionResponse(Deny, message)
  end
```

**Caching:** Per‑digest result cached `ttlSeconds` (default 300 s). **Fail‑open** or **fail‑closed** is configurable per namespace.

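The caching and fail-open/fail-closed behaviour described above can be sketched as follows. `VerdictCache` and `on_backend_failure` are hypothetical names; the sketch assumes an in-process dictionary keyed by image digest and a set of namespaces that opted into fail-open.

```python
import time


class VerdictCache:
    """Per-digest cache whose entries expire after a TTL (default 300 s)."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, digest: str):
        entry = self._entries.get(digest)
        if entry is None:
            return None
        expires_at, verdict = entry
        if self._clock() >= expires_at:
            del self._entries[digest]  # expired: force a fresh backend lookup
            return None
        return verdict

    def put(self, digest: str, verdict: dict, ttl_seconds=None) -> None:
        ttl = self._ttl if ttl_seconds is None else ttl_seconds
        self._entries[digest] = (self._clock() + ttl, verdict)


def on_backend_failure(namespace: str, fail_open_namespaces: set[str]) -> str:
    """Fail open only for namespaces that opted in; fail closed everywhere else."""
    return "Allow" if namespace in fail_open_namespaces else "Deny"
```

Passing the backend's `ttlSeconds` to `put` lets the server shorten or extend the lifetime of an individual result.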
### 4.3 TLS & HA

* Webhook has its own **serving cert** signed by cluster CA (or custom cert + CA bundle on configuration).
* Deployment ≥ 2 replicas; **leaderless**; stateless.

---

## 5) Backend integration (Scanner.WebService)

### 5.1 Ingestion endpoint

`POST /api/v1/scanner/runtime/events` *(OpTok + DPoP/mTLS)*

* Validates event schema; enforces rate caps by tenant/node; persists to **Mongo** (`runtime.events` capped collection or regular with TTL).
* Performs **correlation**:

  * Attach nearest **image SBOM** (inventory/usage) and **BOM‑Index** if known.
  * If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).

### 5.2 Policy decision API (for webhook)

`POST /api/v1/scanner/policy/runtime`

The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.

Request:

```json
{
  "namespace": "payments",
  "labels": { "app": "api", "env": "prod" },
  "images": ["ghcr.io/acme/api@sha256:...", "ghcr.io/acme/nginx@sha256:..."]
}
```

Response:

```json
{
  "ttlSeconds": 300,
  "results": {
    "ghcr.io/acme/api@sha256:...": {
      "signed": true,
      "hasSbom": true,
      "policyVerdict": "pass",
      "reasons": [],
      "rekor": { "uuid": "..." }
    },
    "ghcr.io/acme/nginx@sha256:...": {
      "signed": false,
      "hasSbom": false,
      "policyVerdict": "fail",
      "reasons": ["unsigned", "missing SBOM"]
    }
  }
}
```

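A webhook could fold a response of this shape into a single admission decision roughly as follows. The function `decide` and its `allow_warn` flag are hypothetical; the sketch only assumes the response fields shown above and the `enforce` switch from the flow diagram.

```python
def decide(response: dict, enforce: bool = True, allow_warn: bool = False) -> dict:
    """Allow when every image has an accepted verdict; otherwise deny if enforcing."""
    accepted = {"pass", "warn"} if allow_warn else {"pass"}
    reasons: list[str] = []
    for image in sorted(response["results"]):  # sorted for deterministic messages
        result = response["results"][image]
        if result["policyVerdict"] not in accepted:
            detail = ", ".join(result.get("reasons", [])) or result["policyVerdict"]
            reasons.append(f"{image}: {detail}")
    denied = bool(reasons) and enforce
    return {
        "decision": "Deny" if denied else "Allow",
        "reasons": reasons,
        "ttlSeconds": response.get("ttlSeconds", 300),
    }
```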
---

## 6) Configuration (YAML)

```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null # Optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
    engine: "auto"    # containerd|cri-o|docker|auto
    procfs: "/host/proc"
    collect:
      entryTrace: true
      loadedLibs: true
      maxLibs: 256
      maxHashBytesPerContainer: 64_000_000
      maxDepth: 48
  admission:
    enforce: true
    failOpenNamespaces: ["dev", "test"]
    verify:
      imageSignature: true
      sbomReferrer: true
      scannerPolicyPass: true
    cacheTtlSeconds: 300
    resolveTags: true              # do remote digest resolution for tag-only images
  limits:
    eventsPerSecond: 50
    burst: 200
    perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
```
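The `limits` block (`eventsPerSecond: 50`, `burst: 200`) reads like the parameters of a token bucket. The sketch below shows that interpretation only; the document does not say which algorithm Zastava actually uses, and the class name `TokenBucket` is hypothetical.

```python
import time


class TokenBucket:
    """Allow short bursts up to `burst`, then sustain `rate_per_s` events per second."""

    def __init__(self, rate_per_s: float = 50.0, burst: int = 200, clock=time.monotonic):
        self._rate = rate_per_s
        self._capacity = float(burst)
        self._tokens = float(burst)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Events rejected by `try_acquire` would be coalesced or dropped by the caller, in line with the "Excessive events" row of the failure-mode table.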

> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.

---

## 7) Security posture

* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from API server (Kubernetes handles that).
* **Least privilege**: read‑only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per‑node caps; per‑tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.

---

## 8) Metrics, logs, tracing

**Observer**

* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`

**Webhook**

* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`

**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.

---

## 9) Performance & scale targets

* **Observer**: ≤ **30 ms** to sample `/proc/<pid>/maps` and compute quick hashes for ≤ 64 files; ≤ **200 ms** for full library set (256 libs).
* **Webhook**: P95 ≤ **8 ms** with warm cache; ≤ **50 ms** with one backend round‑trip.
* **Throughput**: 1k admission requests/min/replica; 5k runtime events/min/node with batching.

---

## 10) Drift detection model

**Signals**

* **Process drift**: terminal program differs from **EntryTrace** baseline.
* **Library drift**: loaded DSOs not present in **Usage** SBOM view.
* **Filesystem drift**: new executable files under `/usr/local/bin`, `/opt`, `/app` with **mtime** after image creation.
* **Network drift** (optional): listening sockets on unexpected ports (from policy).

**Action**

* Emit `DRIFT` event with evidence; backend can **auto‑queue** a delta scan; policy may **escalate** to alert/block (Admission cannot block already‑running pods; rely on K8s policies/PodSecurity or operator action).

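The process and library drift signals above reduce to comparisons against the scanned baseline. A minimal sketch, assuming libraries are matched by `sha256` against the Usage SBOM view; the function names and the `library.drift` / `process.drift` signal labels are hypothetical and not part of the event contract.

```python
def library_drift(loaded_libs: list[dict], usage_digests: set[str]) -> list[dict]:
    """Return loaded DSOs whose digest is absent from the Usage view, sorted by path."""
    unexpected = [lib for lib in loaded_libs if lib["sha256"] not in usage_digests]
    return sorted(unexpected, key=lambda lib: lib["path"])


def drift_event(loaded_libs, usage_digests, terminal, baseline_terminal):
    """Build a DRIFT payload fragment, or None when nothing drifted."""
    libs = library_drift(loaded_libs, usage_digests)
    process_drifted = terminal != baseline_terminal
    if not libs and not process_drifted:
        return None
    evidence = [{"signal": "library.drift", "value": lib["path"]} for lib in libs]
    if process_drifted:
        evidence.append({"signal": "process.drift", "value": terminal})
    return {"kind": "DRIFT", "evidence": evidence}
```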
---

## 11) Test matrix

* **Engines**: containerd, CRI‑O, Docker; ensure PID resolution and rootfs mapping.
* **EntryTrace**: bash features (case, if, run‑parts, `.`/`source`), language launchers (python/node/java).
* **Procfs**: multiple arches, musl/glibc images; static binaries (maps minimal).
* **Admission**: unsigned images, missing SBOM referrers, tag‑only images, digest resolution, backend latency, cache TTL.
* **Perf/soak**: 500 Pods/node churn; webhook under HPA growth.
* **Security**: verify that privilege escalation is disabled, read‑only mounts are enforced, and rate‑limit abuse is throttled.
* **Failure injection**: backend down (observer buffers, webhook fail‑open/closed), registry throttling, containerd socket unavailable.

---

## 12) Failure modes & responses

| Condition | Observer behavior | Webhook behavior |
| ------------------------------- | ---------------------------------------------- | ------------------------------------------------------ |
| Backend unreachable | Buffer to disk; drop after cap; emit metric | **Fail‑open/closed** per namespace config |
| PID vanished mid‑sample | Retry once; emit partial evidence | N/A |
| CRI socket missing | Fallback to K8s events only (reduced fidelity) | N/A |
| Registry digest resolve blocked | Defer to backend; mark `resolve=unknown` | Deny or allow per `resolveTags` & `failOpenNamespaces` |
| Excessive events | Apply local rate limit, coalesce | N/A |

---

## 13) Deployment notes (K8s)

**DaemonSet (snippet):**

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: zastava-observer, namespace: stellaops }
spec:
  template:
    spec:
      serviceAccountName: zastava
      hostPID: true
      containers:
      - name: observer
        image: stellaops/zastava-observer:2.3
        securityContext:
          runAsUser: 0
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities: { add: ["SYS_PTRACE","DAC_READ_SEARCH"] }
        volumeMounts:
        - { name: proc, mountPath: /host/proc, readOnly: true }
        - { name: containerd-sock, mountPath: /run/containerd/containerd.sock, readOnly: true }
        - { name: containerd-state, mountPath: /var/lib/containerd, readOnly: true }
      volumes:
      - { name: proc, hostPath: { path: /proc } }
      - { name: containerd-sock, hostPath: { path: /run/containerd/containerd.sock } }
      - { name: containerd-state, hostPath: { path: /var/lib/containerd } }
```

**Webhook (snippet):**

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
- name: gate.zastava.stella-ops.org
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore  # or Fail
  rules:
  - operations: ["CREATE","UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: stellaops
      name: zastava-webhook
      path: /admit
    caBundle: <base64 CA>
```

---

## 14) Implementation notes

* **Language**: Rust (observer) for low‑latency `/proc` parsing; Go/.NET viable too. Webhook can be .NET 10 for parity with backend.
* **CRI drivers**: pluggable (`containerd`, `cri-o`, `docker`). Prefer CRI over parsing logs.
* **Shell parser**: re‑use Scanner.EntryTrace grammar for consistent results (compile to WASM if observer is Rust/Go).
* **Hashing**: `BLAKE3` for fast local comparison; reported digests are `sha256`, computed separately because one hash cannot be derived from the other (or compute `sha256` directly when budget allows).
* **Resilience**: never block container start; observer is **passive**; only webhook decides allow/deny.

---

## 15) Roadmap

* **eBPF** option for syscall/library load tracing (kernel‑level, opt‑in).
* **Windows containers** support (ETW providers, loaded modules).
* **Network posture** checks: listening ports vs policy.
* **Live used‑by‑entrypoint synthesis**: send compact bitset diff to backend to tighten Usage view.
* **Admission dry‑run** dashboards (simulate block lists before enforcing).

19
docs/modules/zastava/implementation_plan.md
Normal file
@@ -0,0 +1,19 @@
# Implementation plan — Zastava

## Current objectives
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.

## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.

## Backlog references
- ZASTAVA runtime tasks in ../../TASKS.md.
- Webhook smoke tests tracked in src/Zastava/**/TASKS.md.

## Coordination
- Review ./AGENTS.md before picking up new work.
- Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md.
- Update this plan whenever scope, dependencies, or guardrails change.
205
docs/modules/zastava/operations/runtime-grafana-dashboard.json
Normal file
@@ -0,0 +1,205 @@
{
  "title": "Zastava Runtime Plane",
  "uid": "zastava-runtime",
  "timezone": "utc",
  "schemaVersion": 38,
  "version": 1,
  "refresh": "30s",
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "panels": [
    {
      "id": 1,
      "type": "timeseries",
      "title": "Observer Event Rate",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "targets": [
        {
          "expr": "sum by (tenant,component,kind) (rate(zastava_runtime_events_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{tenant}}/{{component}}/{{kind}}"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "showLegend": true,
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 2,
      "type": "timeseries",
      "title": "Admission Decisions",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "targets": [
        {
          "expr": "sum by (decision) (rate(zastava_admission_decisions_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{decision}}"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 20
              }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "showLegend": true,
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 3,
      "type": "timeseries",
      "title": "Backend Latency P95",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket{tenant=~\"$tenant\"}[5m])))",
          "legendFormat": "p95 latency"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "orange",
                "value": 500
              },
              {
                "color": "red",
                "value": 750
              }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "showLegend": true,
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    }
  ],
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "label": "Prometheus",
        "current": {
          "text": "Prometheus",
          "value": "Prometheus"
        }
      },
      {
        "name": "tenant",
        "type": "query",
        "datasource": {
          "type": "prometheus",
          "uid": "${datasource}"
        },
        "definition": "label_values(zastava_runtime_events_total, tenant)",
        "refresh": 1,
        "hide": 0,
        "current": {
          "text": ".*",
          "value": ".*"
        },
        "regex": "",
        "includeAll": true,
        "multi": true,
        "sort": 1
      }
    ]
  },
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "type": "tags",
        "datasource": {
          "type": "prometheus",
          "uid": "${datasource}"
        },
        "enable": true,
        "iconColor": "rgba(255, 96, 96, 1)"
      }
    ]
  }
}
@@ -0,0 +1,31 @@
groups:
  - name: zastava-runtime
    interval: 30s
    rules:
      - alert: ZastavaRuntimeEventsSilent
        expr: sum(rate(zastava_runtime_events_total[10m])) == 0
        for: 15m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Observer events stalled"
          description: "No runtime events emitted in the last 15 minutes. Check observer DaemonSet health and container runtime mounts."
      - alert: ZastavaRuntimeBackendLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
        for: 10m
        labels:
          severity: critical
          service: zastava-runtime
        annotations:
          summary: "Runtime backend latency p95 above 750 ms"
          description: "Latency to Scanner runtime APIs is elevated. Inspect Scanner.WebService readiness, Authority OpTok issuance, and cluster network."
      - alert: ZastavaAdmissionDenySpike
        expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) > 20
        for: 5m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Admission webhook denies exceeding threshold"
          description: "Webhook is denying more than 20 pod admissions per second (5-minute rate). Confirm policy verdicts and consider fail-open exception for impacted namespaces."
174
docs/modules/zastava/operations/runtime.md
Normal file
@@ -0,0 +1,174 @@
# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`) and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst` containing container images,
  Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
  live outside git. For air-gapped installs, copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Counter of Observer events sent to Scanner; alerting uses its rate | Expect a rate >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |

Prometheus exposes these names with underscores (for example
`zastava_runtime_events_total`), which is the form used in the alert rules and
dashboard queries.

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy Observer DaemonSet and Webhook chart:
   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```
5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` shows every pod ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows the
     `Issued runtime OpTok` audit line with DPoP token type.
   - Admission webhook is registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.

### 2.2 Upgrades

1. Scale webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply the chart upgrade and watch `zastava.runtime.backend.latency.ms`; P95 should stay below 250 ms.
4. Post-upgrade, run smoke tests:
   - Apply unsigned Pod manifest → expect `deny` (policy fail).
   - Apply signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with Git SHA + Helm chart version.

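The smoke tests in step 4 can be scripted. The helper below is a minimal sketch, not part of the chart: it assumes a rejected admission makes the wrapped command exit non-zero, and the manifest names in the usage comments (`unsigned-pod.yaml`, `signed-pod.yaml`) are placeholders. Server-side dry runs only reach the webhook if it declares `sideEffects: None` or `NoneOnDryRun`, which this runbook does not confirm.

```sh
#!/bin/sh
# expect_verdict <allow|deny> <command...>
# Runs the command and maps its exit status to a verdict:
# exit 0 -> allow, anything else -> deny.
# Returns 0 when the observed verdict matches the expected one.
expect_verdict() {
  expected=$1
  shift
  if "$@" >/dev/null 2>&1; then
    actual=allow
  else
    actual=deny
  fi
  if [ "$actual" = "$expected" ]; then
    echo "ok: got $actual"
  else
    echo "FAIL: expected $expected, got $actual" >&2
    return 1
  fi
}

# Usage (assumed manifests, run against the upgraded cluster):
#   expect_verdict deny  kubectl apply --dry-run=server -f unsigned-pod.yaml
#   expect_verdict allow kubectl apply --dry-run=server -f signed-pod.yaml
```

Any non-zero exit counts as `deny`, including connection errors, so check the webhook logs before treating a `deny` result as a policy verdict.
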
### 2.3 Rollback

1. Use Helm revision history: `helm history zastava-runtime`.
2. Rollback: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).

## 3. Authority & security guardrails

- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the
  `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable only in
  lab clusters; expect warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate them nightly, and prefer disabling the fallback once Authority is reachable.
- Audit every change in `zastava.runtime.authority` through change management.
  Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
  to confirm key rotation.

## 4. Incident response

### 4.1 Authority offline

1. Check Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for `authority.token.fallback` scope.
3. If fallback is engaged, verify the static token's validity duration; rotate the secret if it is older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys.

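The 24 h check in step 3 can be approximated from a timestamp. This is a sketch under assumptions: it needs GNU `date` (for `-d`), it treats the secret's `metadata.creationTimestamp` as a proxy for the last rotation, and the secret name in the usage comment is a placeholder because the runbook does not name the static fallback secret.

```sh
#!/bin/sh
# age_seconds <RFC 3339 timestamp> [now_epoch]
# Prints the seconds elapsed since the timestamp. Requires GNU date.
age_seconds() {
  then_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  echo $(( now_epoch - then_epoch ))
}

# needs_rotation <timestamp> [now_epoch]
# Succeeds (exit 0) when the timestamp is more than 24 h old.
needs_rotation() {
  age=$(age_seconds "$@") || return 2
  [ "$age" -gt 86400 ]
}

# Usage (secret name is a placeholder):
#   ts=$(kubectl -n stellaops get secret <static-token-secret> \
#          -o jsonpath='{.metadata.creationTimestamp}')
#   needs_rotation "$ts" && echo "rotate static fallback token"
```
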
### 4.2 Scanner/WebService latency spike

1. Alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 10 minutes.
2. Run backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the auto buffer may throttle. Confirm disk-backed queue size via
   `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).

### 4.3 Admission deny storm

1. Alert `ZastavaAdmissionDenySpike` indicates >20 denies/second (5-minute rate, sustained for 5 minutes).
2. Pull sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check policy backlog in Scanner (`/policy/runtime` logs). Engage the application
   owner; optionally add the namespace to `failOpenNamespaces` after risk assessment.

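To turn the sample from step 2 into counts per verdict, the output can be piped through a small tally. This sketch assumes, as the `jq '.decision'` command above implies, that the webhook writes one JSON object per log line with a `decision` field; `tally` is a helper introduced here, not a Zastava tool.

```sh
#!/bin/sh
# tally: count identical lines on stdin and print "<count> <value>",
# most frequent first.
tally() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage (jq -r strips the JSON quotes so values group cleanly):
#   kubectl logs deploy/zastava-webhook --since=10m | jq -r '.decision' | tally
```
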
## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/modules/zastava/operations/runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract Prometheus rules into offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import Grafana dashboard via `grafana-cli --config ...`.

## 6. Observability assets

- Prometheus alert rules: `docs/modules/zastava/operations/runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/modules/zastava/operations/runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
  the Offline Kit manifest.

## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF
`NT_GNU_BUILD_ID` note) and Scanner `/policy/runtime` surfaces the most recent
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:

1. Capture the hash from the CLI, webhook, or Scanner API, for example:
   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```
   Copy one of the `Build IDs` (e.g.
   `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check it exists:
   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```
3. If the file is missing, rehydrate it from Offline Kit bundles or the
   `debug-store` object bucket (mirror of release artefacts):
   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```
4. Confirm the running process advertises the same GNU build-id before
   symbolising:
   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
   in `debuginfod` for fleet-wide symbol resolution:
   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
   runtime events indicate stripped binaries without the GNU note. In that case,
   schedule a rebuild with `-Wl,--build-id` enabled or add the binary to the
   debug-store allowlist so the scanner can surface a fallback symbol package.

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
Observer to trigger a fresh capture if symbol parity is required.
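
The path derivation in step 2 (first two hex characters as the directory, the remainder as the file name) can be wrapped in a small helper. This is a sketch: `build_id_debug_path` is a name introduced here, and the default root `/var/opt/debug/.build-id` is taken from the example in step 2 rather than from any confirmed configuration.

```sh
#!/bin/sh
# build_id_debug_path <build-id> [root]
# Prints <root>/<first two hex chars>/<remaining hex chars>.debug
build_id_debug_path() {
  # Normalise to lowercase hex.
  id=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  root=${2:-/var/opt/debug/.build-id}
  case $id in
    ''|*[!0-9a-f]*)
      echo "invalid build-id: $1" >&2
      return 1
      ;;
  esac
  if [ "${#id}" -le 2 ]; then
    echo "build-id too short: $1" >&2
    return 1
  fi
  rest=${id#??}            # everything after the first two characters
  prefix=${id%"$rest"}     # the first two characters
  printf '%s/%s/%s.debug\n' "$root" "$prefix" "$rest"
}

# Usage:
#   ls "$(build_id_debug_path 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789)"
```
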