@@ -57,11 +57,16 @@ graph LR
| **Publish** | Push to `registry.git.stella-ops.org`. |
| **E2E** | Kind‑based Kubernetes test incl. Zastava DaemonSet; verify sub‑5 s scan SLA. |
| **Notify** | Report to Mattermost & GitLab Slack app. |
| **OfflineToken** | Call `JwtIssuer.Generate(exp=30d)` → store `client.jwt` artefact → attach to OUK build context. |

*All stages run in parallel where possible; max wall‑time < 15 min.*
**Implementation note.** `.gitea/workflows/release.yml` executes
`ops/devops/release/build_release.py` to build multi-arch images, attach
CycloneDX SBOMs and SLSA provenance with Cosign, and emit
`out/release/release.yaml` for downstream packaging (Helm, Compose, Offline Kit).
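For orientation, a hedged sketch of how a workflow step might invoke that script; the trigger, job name, and runner label are assumptions, not contents of the actual `release.yml`:

```yaml
# Illustrative fragment only — the real .gitea/workflows/release.yml
# defines its own triggers, matrix, and signing steps.
on:
  push:
    tags: ["v*"]            # assumed trigger
jobs:
  release:
    runs-on: ubuntu-latest  # assumed runner label
    steps:
      - uses: actions/checkout@v4
      - name: Build and describe release artefacts
        run: python ops/devops/release/build_release.py  # emits out/release/release.yaml
```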
---
## 3 Container Image Strategy
@@ -115,12 +115,18 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
  ],
  "decision": "Allow|Deny",
  "ttlSeconds": 300
}
```

### 2.3 Schema negotiation & hashing guarantees

* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
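
To make the negotiation rules concrete, here is a minimal hypothetical sketch (C#). The real logic lives in the shared `ZastavaContractVersions` helper; the type and method names below are illustrative only. Calling `Negotiate("zastava.runtime.event", local, remote)` returns, e.g., `zastava.runtime.event@v1.1` when v1.1 is the highest minor both sides advertise, and falls back to `@v1.0` when nothing overlaps.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the negotiation rules above; not the actual helper.
static class ContractVersionSketch
{
    // Parses "zastava.runtime.event@v1.3" into (schema, major, minor).
    static (string Schema, int Major, int Minor) Parse(string value)
    {
        var at = value.LastIndexOf("@v", StringComparison.Ordinal);
        var parts = value[(at + 2)..].Split('.');
        return (value[..at], int.Parse(parts[0]), int.Parse(parts[1]));
    }

    // Picks the highest mutually supported minor on a shared major line;
    // falls back to the local default @v1.0 when there is no overlap.
    public static string Negotiate(string schema, IEnumerable<string> local, IEnumerable<string> remote)
    {
        var mutual = local.Select(Parse)
            .Intersect(remote.Select(Parse))   // value-tuple equality
            .Where(v => v.Schema == schema)
            .OrderByDescending(v => v.Major)
            .ThenByDescending(v => v.Minor)
            .ToList();

        return mutual.Count > 0
            ? $"{schema}@v{mutual[0].Major}.{mutual[0].Minor}"
            : $"{schema}@v1.0";
    }
}
```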
---
## 3) Observer — node agent (DaemonSet)
### 3.1 Responsibilities
@@ -210,11 +216,13 @@ sequenceDiagram
* If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).

### 5.2 Policy decision API (for webhook)

`POST /api/v1/scanner/policy/runtime`

The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.

Request:

```json
{
@@ -253,23 +261,44 @@ Response:
```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
    allowInsecureHttp: false
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner","zastava"]
      scopes:
        - "api:scanner.runtime.write"
      refreshSkewSeconds: 120
      requireDpop: true
      requireMutualTls: true
      allowStaticTokenFallback: false
      staticTokenPath: null      # optional bootstrap secret
    tenant: "tenant-01"
    environment: "prod"
    deployment: "cluster-a"
    logging:
      includeScopes: true
      includeActivityTracking: true
      staticScope:
        plane: "runtime"
    metrics:
      meterName: "StellaOps.Zastava"
      meterVersion: "1.0.0"
      commonTags:
        cluster: "prod-cluster"
    engine: "auto"               # containerd|cri-o|docker|auto
    procfs: "/host/proc"
    collect:
      entryTrace: true
      loadedLibs: true
      maxLibs: 256
      maxHashBytesPerContainer: 64_000_000
      maxDepth: 48
@@ -286,45 +315,49 @@ zastava:
      eventsPerSecond: 50
      burst: 200
      perNodeQueue: 10_000
  security:
    mounts:
      containerdSock: "/run/containerd/containerd.sock:ro"
      proc: "/proc:/host/proc:ro"
      runtimeState: "/var/lib/containerd:ro"
```

> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
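
A minimal start-up sketch, assuming an ASP.NET Core host; only `AddZastavaRuntimeCore` is confirmed by this document — the remaining scaffolding is illustrative:

```csharp
// Program.cs (sketch) — binds the zastava:runtime section, validates options,
// and registers the canonical log scopes + meters via the shared extension.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddZastavaRuntimeCore(builder.Configuration, "zastava-webhook");

var app = builder.Build();
app.MapGet("/healthz", () => Results.Ok());  // illustrative liveness endpoint
app.Run();
```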
---
## 7) Security posture
* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from API server (K8s handles).
* **Least privileges**: read‑only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per‑node caps; per‑tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.

---
## 8) Metrics, logs, tracing

**Observer**

* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`

**Webhook**

* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`

**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.
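
As a reference point, a hedged sketch of registering these instruments with .NET's `System.Diagnostics.Metrics` — the meter name and version mirror the `metrics` config above, but the exact instrument set in `StellaOps.Zastava.Core` may differ:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only — instrument names follow the lists above.
var meter = new Meter("StellaOps.Zastava", "1.0.0");

var events    = meter.CreateCounter<long>("zastava.runtime.events.total");
var latency   = meter.CreateHistogram<double>("zastava.runtime.backend.latency.ms");
var decisions = meter.CreateCounter<long>("zastava.admission.decisions.total");

// Example emission with the tags used by the dashboards and alert rules.
events.Add(1, new KeyValuePair<string, object?>("kind", "start"));
decisions.Add(1, new KeyValuePair<string, object?>("decision", "deny"));
latency.Record(42.0, new KeyValuePair<string, object?>("endpoint", "events"));
```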
---
@@ -82,6 +82,7 @@ Everything here is open‑source and versioned — when you check out a git ta
- **31 – [Concelier MSRC Connector – AAD Onboarding](ops/concelier-msrc-operations.md)**
- **32 – [Scanner Analyzer Bench Operations](ops/scanner-analyzers-operations.md)**
- **33 – [Scanner Artifact Store Migration](ops/scanner-rustfs-migration.md)**
- **34 – [Zastava Runtime Operations Runbook](ops/zastava-runtime-operations.md)**
### Legal & licence
- **32 – [Legal & Quota FAQ](29_LEGAL_FAQ_QUOTA.md)**
docs/ops/ui-auth-smoke.md (new file)
@@ -0,0 +1,32 @@
# UI Auth Smoke Job (Playwright)

The DevOps Guild tracks **DEVOPS-UI-13-006** to wire the new Playwright auth
smoke checks into CI and the Offline Kit pipeline. These tests exercise the
Angular UI login flow against a stubbed Authority instance to verify that
`/config.json` is discovered, DPoP proofs are minted, and error handling is
surfaced when the backend rejects a request.

## What the job does

1. Builds the UI bundle (or consumes the artifact from the release pipeline).
2. Copies the environment stub from `src/config/config.sample.json` into the
   runtime directory as `config.json` so the UI can bootstrap without a live
   gateway.
3. Runs `npm run test:e2e`, which launches Playwright with the auth fixtures
   under `tests/e2e/auth.spec.ts`:
   - Validates that the Sign-in button generates an Authorization Code + PKCE
     redirect to `https://authority.local/connect/authorize`.
   - Confirms the callback view shows an actionable error when the redirect is
     missing the pending login state.
4. Publishes JUnit + Playwright traces (retain-on-failure) for troubleshooting.

## Pipeline integration notes

- Chromium must already be available (`npx playwright install --with-deps`).
- Set `PLAYWRIGHT_BASE_URL` if the UI serves on a non-default host/port.
- For Offline Kit packaging, bundle the Playwright browser cache under
  `.cache/ms-playwright/` so the job runs without network access.
- Failures should block release promotion; export the traces to the artifacts
  tab for debugging.
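
Putting those notes together, a typical invocation could look like the following; the base URL value is an assumption for a non-default dev serve:

```sh
# One-time browser provisioning (bundle .cache/ms-playwright/ for offline runs)
npx playwright install --with-deps

# Run the auth smoke suite against a non-default UI origin (port assumed)
PLAYWRIGHT_BASE_URL="http://127.0.0.1:4400" npm run test:e2e
```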
Refer to `ops/devops/TASKS.md` (DEVOPS-UI-13-006) for progress and ownership.
docs/ops/zastava-runtime-grafana-dashboard.json (new file)
@@ -0,0 +1,205 @@
{
  "title": "Zastava Runtime Plane",
  "uid": "zastava-runtime",
  "timezone": "utc",
  "schemaVersion": 38,
  "version": 1,
  "refresh": "30s",
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "panels": [
    {
      "id": 1,
      "type": "timeseries",
      "title": "Observer Event Rate",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "targets": [
        {
          "expr": "sum by (tenant,component,kind) (rate(zastava_runtime_events_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{tenant}}/{{component}}/{{kind}}"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green" }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": { "showLegend": true, "placement": "bottom" },
        "tooltip": { "mode": "multi" }
      }
    },
    {
      "id": 2,
      "type": "timeseries",
      "title": "Admission Decisions",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "targets": [
        {
          "expr": "sum by (decision) (rate(zastava_admission_decisions_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{decision}}"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green" },
              { "color": "red", "value": 20 }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": { "showLegend": true, "placement": "bottom" },
        "tooltip": { "mode": "multi" }
      }
    },
    {
      "id": 3,
      "type": "timeseries",
      "title": "Backend Latency P95",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket{tenant=~\"$tenant\"}[5m])))",
          "legendFormat": "p95 latency"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green" },
              { "color": "orange", "value": 500 },
              { "color": "red", "value": 750 }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": { "showLegend": true, "placement": "bottom" },
        "tooltip": { "mode": "multi" }
      }
    }
  ],
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "label": "Prometheus",
        "current": { "text": "Prometheus", "value": "Prometheus" }
      },
      {
        "name": "tenant",
        "type": "query",
        "datasource": { "type": "prometheus", "uid": "${datasource}" },
        "definition": "label_values(zastava_runtime_events_total, tenant)",
        "refresh": 1,
        "hide": 0,
        "current": { "text": ".*", "value": ".*" },
        "regex": "",
        "includeAll": true,
        "multi": true,
        "sort": 1
      }
    ]
  },
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "type": "tags",
        "datasource": { "type": "prometheus", "uid": "${datasource}" },
        "enable": true,
        "iconColor": "rgba(255, 96, 96, 1)"
      }
    ]
  }
}
docs/ops/zastava-runtime-operations.md (new file)
@@ -0,0 +1,131 @@
# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
  certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`)
  resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space
  (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download
  `offline-kit/zastava-runtime-{version}.tar.zst`, which contains the container images,
  Grafana dashboards, and Prometheus rules referenced below.
- **Secrets** – the Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
  live outside git. For air-gapped installs, copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |

## 2. Deployment workflows
### 2.1 Fresh install (Helm overlay)

1. Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy the Observer DaemonSet and Webhook chart:
   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```
5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` ready.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows an
     `Issued runtime OpTok` audit line with the DPoP token type.
   - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.

### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to confirm Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (target < 250 ms).
4. Post-upgrade, run smoke tests (see the sketch below):
   - Apply an unsigned Pod manifest → expect `deny` (policy fail).
   - Apply a signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with the Git SHA + Helm chart version.
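
A sketch of the step-4 smoke tests; the manifest paths are placeholders for your own signed/unsigned fixtures:

```sh
kubectl apply -f smoke/unsigned-pod.yaml   # expect: denied by zastava-webhook
kubectl apply -f smoke/signed-pod.yaml     # expect: admitted
kubectl -n stellaops logs deploy/zastava-webhook --since=5m | jq '.decision'
```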
### 2.3 Rollback

1. Review the Helm revision history: `helm history zastava-runtime`.
2. Roll back: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).

## 3. Authority & security guardrails

- Tokens must be of type `DPoP` when `requireDpop=true`. Logs emit the
  `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in
  lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
  initial bootstrap. Rotate them nightly; prefer disabling the fallback once Authority
  is reachable.
- Audit every change to `zastava.runtime.authority` through change management.
  Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
  to confirm key rotation.

## 4. Incident response
### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If the fallback engaged, verify the static token's validity window; rotate the secret if it is older than 24 h.
4. Once Authority is restored, delete the static fallback secret and restart the pods to rebind DPoP keys (see the sketch below).
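
A hedged recovery sketch for step 4; the fallback secret name is an assumption — match whatever your deployment provisioned:

```sh
kubectl -n stellaops delete secret zastava-static-fallback   # assumed secret name
kubectl -n stellaops rollout restart ds/zastava-observer deploy/zastava-webhook
```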
### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires when P95 exceeds 750 ms for 5 minutes.
2. Check backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the observer's buffer may throttle automatically. Confirm the disk-backed queue size via
   `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for the namespaces listed in runbook Appendix B (temporary measure).

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates more than 20 denies per minute.
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application
   owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + the Grafana dashboard JSON.
  - A sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.

## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
  the Offline Kit manifest.

docs/ops/zastava-runtime-prometheus-rules.yaml (new file)
@@ -0,0 +1,31 @@
groups:
  - name: zastava-runtime
    interval: 30s
    rules:
      - alert: ZastavaRuntimeEventsSilent
        expr: sum(rate(zastava_runtime_events_total[10m])) == 0
        for: 15m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Observer events stalled"
          description: "No runtime events emitted in the last 15 minutes. Check observer DaemonSet health and container runtime mounts."
      - alert: ZastavaRuntimeBackendLatencyHigh
        # Metric is recorded in milliseconds, so the p95 threshold is 750 (ms), not 0.75.
        expr: histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
        for: 10m
        labels:
          severity: critical
          service: zastava-runtime
        annotations:
          summary: "Runtime backend latency p95 above 750 ms"
          description: "Latency to Scanner runtime APIs is elevated. Inspect Scanner.WebService readiness, Authority OpTok issuance, and cluster network."
      - alert: ZastavaAdmissionDenySpike
        # rate() is per second; scale by 60 to match the per-minute threshold described below.
        expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60 > 20
        for: 5m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Admission webhook denies exceeding threshold"
          description: "Webhook is denying more than 20 pod admissions per minute. Confirm policy verdicts and consider a fail-open exception for impacted namespaces."