
commit 17d861e4ab (parent f4d7a15a00), 2025-10-24 09:15:37 +03:00
163 changed files with 14269 additions and 452 deletions


@@ -57,11 +57,16 @@ graph LR
| **Publish** | Push to `registry.git.stella-ops.org`. |
| **E2E** | Kind-based Kubernetes test incl. Zastava DaemonSet; verify sub-5 s scan SLA. |
| **Notify** | Report to Mattermost & GitLab Slack app. |
| **OfflineToken** | Call `JwtIssuer.Generate(exp=30d)` → store `client.jwt` artefact → attach to OUK build context |
*All stages run in parallel where possible; max wall-time < 15 min.*
**Implementation note.** `.gitea/workflows/release.yml` executes
`ops/devops/release/build_release.py` to build multi-arch images, attach
CycloneDX SBOMs and SLSA provenance with Cosign, and emit
`out/release/release.yaml` for downstream packaging (Helm, Compose, Offline Kit).
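For orientation, a minimal shell sketch of the sign-and-attest flow the script wraps; the image name, key path, and predicate file names here are assumptions, not the script's actual options:
```sh
# Illustrative only: build_release.py drives the equivalent steps per image.
IMAGE="registry.git.stella-ops.org/stellaops/scanner:1.2.3"

# Build and push a multi-arch image.
docker buildx build --platform linux/amd64,linux/arm64 -t "$IMAGE" --push .

# Sign the image, then attach the CycloneDX SBOM and SLSA provenance as attestations.
cosign sign --key cosign.key "$IMAGE"
cosign attest --key cosign.key --type cyclonedx --predicate sbom.cdx.json "$IMAGE"
cosign attest --key cosign.key --type slsaprovenance --predicate provenance.json "$IMAGE"
```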
---
## 3) Container Image Strategy


@@ -115,12 +115,18 @@ stellaops/zastava-agent # System service; watch Docker events; observer on
],
"decision": "Allow|Deny",
"ttlSeconds": 300
}
```
### 2.3 Schema negotiation & hashing guarantees
* Every payload is wrapped in an envelope with `schemaVersion` set to `"<schema>@v<major>.<minor>"`. Version negotiation keeps the **major** line in lockstep (`zastava.runtime.event@v1.x`, `zastava.admission.decision@v1.x`) and selects the highest mutually supported **minor**. If no overlap exists, the local default (`@v1.0`) is used.
* Components use the shared `ZastavaContractVersions` helper for parsing/negotiation and the canonical JSON serializer to guarantee identical byte sequences prior to hashing, ensuring multihash IDs such as `sha256-<base64url>` are reproducible across observers, webhooks, and backend jobs.
* Schema evolution rules: backwards-compatible fields append to the end of the canonical property order; breaking changes bump the **major** and require dual-writer/reader rollout per deployment playbook.
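A minimal sketch of the negotiation rule under these assumptions: a free-standing `Negotiate` helper (the shipped `ZastavaContractVersions` API may differ in shape), with each side advertising its supported `(major, minor)` pairs:
```csharp
using System;
using System.Linq;

static string Negotiate(string schema, (int Major, int Minor)[] local, (int Major, int Minor)[] remote)
{
    // Intersect keeps only versions both sides support (same major line),
    // then we take the highest mutually supported minor.
    var overlap = local.Intersect(remote)
                       .OrderByDescending(v => v.Major)
                       .ThenByDescending(v => v.Minor)
                       .ToArray();

    // No overlap: fall back to the local default @v1.0.
    var chosen = overlap.Length > 0 ? overlap[0] : (Major: 1, Minor: 0);

    return $"{schema}@v{chosen.Major}.{chosen.Minor}";
}

// Negotiate("zastava.runtime.event", new[] { (1, 0), (1, 2) }, new[] { (1, 0), (1, 1) })
// → "zastava.runtime.event@v1.1"
```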
---
## 3) Observer — node agent (DaemonSet)
### 3.1 Responsibilities
@@ -210,11 +216,13 @@ sequenceDiagram
* If unknown/missing, schedule **delta scan** and return `202 Accepted`.
* Emits **derived signals** (usedByEntrypoint per component based on `/proc/<pid>/maps`).
### 5.2 Policy decision API (for webhook)
`POST /api/v1/scanner/policy/runtime`
The webhook reuses the shared runtime stack (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so OpTok caching, DPoP enforcement, and telemetry behave identically to the observer plane.
Request:
```json
{
@@ -253,23 +261,44 @@ Response:
```yaml
zastava:
mode:
observer: true
webhook: true
backend:
baseAddress: "https://scanner-web.internal"
policyPath: "/api/v1/scanner/policy/runtime"
requestTimeoutSeconds: 5
allowInsecureHttp: false
runtime:
authority:
issuer: "https://authority.internal"
clientId: "zastava-observer"
audience: ["scanner","zastava"]
scopes:
- "api:scanner.runtime.write"
refreshSkewSeconds: 120
requireDpop: true
requireMutualTls: true
allowStaticTokenFallback: false
staticTokenPath: null # Optional bootstrap secret
tenant: "tenant-01"
environment: "prod"
deployment: "cluster-a"
logging:
includeScopes: true
includeActivityTracking: true
staticScope:
plane: "runtime"
metrics:
meterName: "StellaOps.Zastava"
meterVersion: "1.0.0"
commonTags:
cluster: "prod-cluster"
engine: "auto" # containerd|cri-o|docker|auto
procfs: "/host/proc"
collect:
entryTrace: true
loadedLibs: true
maxLibs: 256
maxHashBytesPerContainer: 64_000_000
maxDepth: 48
@@ -286,45 +315,49 @@ zastava:
eventsPerSecond: 50
burst: 200
perNodeQueue: 10_000
security:
mounts:
containerdSock: "/run/containerd/containerd.sock:ro"
proc: "/proc:/host/proc:ro"
runtimeState: "/var/lib/containerd:ro"
```
> Implementation note: both `zastava-observer` and `zastava-webhook` call `services.AddZastavaRuntimeCore(configuration, "<component>")` during start-up to bind the `zastava:runtime` section, enforce validation, and register canonical log scopes + meters.
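A minimal host wiring sketch; only the `AddZastavaRuntimeCore` call is documented above, so the hosting scaffolding here is an assumption:
```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// Binds and validates the zastava:runtime section, and registers the
// canonical log scopes and the StellaOps.Zastava meter.
builder.Services.AddZastavaRuntimeCore(builder.Configuration, "zastava-observer");

await builder.Build().RunAsync();
```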
---
## 7) Security posture
* **AuthN/Z**: Authority OpToks (DPoP preferred) to backend; webhook does **not** require client auth from API server (K8s handles).
* **Least privileges**: read-only host mounts; optional `CAP_SYS_PTRACE`; **no** host networking; **no** write mounts.
* **Isolation**: never exec untrusted code; nsenter only to **read** `/proc/<pid>`.
* **Data minimization**: do not exfiltrate env vars or command arguments unless policy explicitly enables diagnostic mode.
* **Rate limiting**: per-node caps; per-tenant caps at backend.
* **Hard caps**: bytes hashed, files inspected, depth of shell parsing.
* **Authority guardrails**: `AddZastavaRuntimeCore` binds `zastava.runtime.authority` and refuses tokens without `aud:<tenant>` scope; optional knobs (`requireDpop`, `requireMutualTls`, `allowStaticTokenFallback`) emit structured warnings when relaxed.
---
## 8) Metrics, logs, tracing
**Observer**
* `zastava.runtime.events.total{kind}`
* `zastava.runtime.backend.latency.ms{endpoint="events"}`
* `zastava.proc_maps.samples.total{result}`
* `zastava.entrytrace.depth{p99}`
* `zastava.hash.bytes.total`
* `zastava.buffer.drops.total`
**Webhook**
* `zastava.admission.decisions.total{decision}`
* `zastava.runtime.backend.latency.ms{endpoint="policy"}`
* `zastava.admission.cache.hits.total`
* `zastava.backend.failures.total`
**Logs** (structured): node, pod, image digest, decision, reasons.
**Tracing**: spans for observe→batch→post; webhook request→resolve→respond.
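A minimal emission sketch using `System.Diagnostics.Metrics` with the meter name from the configuration above; tag values are illustrative, and the real registration happens inside `AddZastavaRuntimeCore`:
```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("StellaOps.Zastava", "1.0.0");

var events = meter.CreateCounter<long>("zastava.runtime.events.total");
var latency = meter.CreateHistogram<double>("zastava.runtime.backend.latency.ms");

// One observer event posted to the backend, plus its round-trip latency.
events.Add(1, new KeyValuePair<string, object?>("kind", "CONTAINER_START"));
latency.Record(42.5, new KeyValuePair<string, object?>("endpoint", "events"));
```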
---


@@ -82,6 +82,7 @@ Everything here is open-source and versioned — when you check out a git tag
- **31. [Concelier MSRC Connector – AAD Onboarding](ops/concelier-msrc-operations.md)**
- **32. [Scanner Analyzer Bench Operations](ops/scanner-analyzers-operations.md)**
- **33. [Scanner Artifact Store Migration](ops/scanner-rustfs-migration.md)**
- **34. [Zastava Runtime Operations Runbook](ops/zastava-runtime-operations.md)**
### Legal & licence
- **32. [Legal & Quota FAQ](29_LEGAL_FAQ_QUOTA.md)**

docs/ops/ui-auth-smoke.md (new file)

@@ -0,0 +1,32 @@
# UI Auth Smoke Job (Playwright)
The DevOps Guild tracks **DEVOPS-UI-13-006** to wire the new Playwright auth
smoke checks into CI and the Offline Kit pipeline. These tests exercise the
Angular UI login flow against a stubbed Authority instance to verify that
`/config.json` is discovered, DPoP proofs are minted, and error handling is
surfaced when the backend rejects a request.
## What the job does
1. Builds the UI bundle (or consumes the artifact from the release pipeline).
2. Copies the environment stub from `src/config/config.sample.json` into the
runtime directory as `config.json` so the UI can bootstrap without a live
gateway.
3. Runs `npm run test:e2e`, which launches Playwright with the auth fixtures
under `tests/e2e/auth.spec.ts`:
- Validates that the Sign-in button generates an Authorization Code + PKCE
redirect to `https://authority.local/connect/authorize`.
- Confirms the callback view shows an actionable error when the redirect is
missing the pending login state.
4. Publishes JUnit + Playwright traces (retain-on-failure) for troubleshooting.
## Pipeline integration notes
- Chromium must already be available (`npx playwright install --with-deps`).
- Set `PLAYWRIGHT_BASE_URL` if the UI serves on a non-default host/port.
- For Offline Kit packaging, bundle the Playwright browser cache under
`.cache/ms-playwright/` so the job runs without network access.
- Failures should block release promotion; export the traces to the artifacts
tab for debugging.
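A minimal invocation sketch for a CI step, assuming the UI bundle is already served locally; the config destination path and port are assumptions:
```sh
# Install the browser once per runner image (needs network or a warmed cache).
npx playwright install --with-deps chromium

# Stage the stubbed runtime config so the UI can bootstrap without a gateway.
cp src/config/config.sample.json dist/config.json

# Point Playwright at the served UI and run the auth fixtures.
PLAYWRIGHT_BASE_URL="http://127.0.0.1:4400" npm run test:e2e
```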
Refer to `ops/devops/TASKS.md` (DEVOPS-UI-13-006) for progress and ownership.


@@ -0,0 +1,205 @@
{
"title": "Zastava Runtime Plane",
"uid": "zastava-runtime",
"timezone": "utc",
"schemaVersion": 38,
"version": 1,
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
},
"panels": [
{
"id": 1,
"type": "timeseries",
"title": "Observer Event Rate",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"targets": [
{
"expr": "sum by (tenant,component,kind) (rate(zastava_runtime_events_total{tenant=~\"$tenant\"}[5m]))",
"legendFormat": "{{tenant}}/{{component}}/{{kind}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"fieldConfig": {
"defaults": {
"unit": "1/s",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
}
]
}
},
"overrides": []
},
"options": {
"legend": {
"showLegend": true,
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 2,
"type": "timeseries",
"title": "Admission Decisions",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"targets": [
{
"expr": "sum by (decision) (rate(zastava_admission_decisions_total{tenant=~\"$tenant\"}[5m]))",
"legendFormat": "{{decision}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"fieldConfig": {
"defaults": {
"unit": "1/s",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
},
{
"color": "red",
"value": 20
}
]
}
},
"overrides": []
},
"options": {
"legend": {
"showLegend": true,
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 3,
"type": "timeseries",
"title": "Backend Latency P95",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket{tenant=~\"$tenant\"}[5m])))",
"legendFormat": "p95 latency"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"fieldConfig": {
"defaults": {
"unit": "ms",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
},
{
"color": "orange",
"value": 500
},
{
"color": "red",
"value": 750
}
]
}
},
"overrides": []
},
"options": {
"legend": {
"showLegend": true,
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
}
],
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"label": "Prometheus",
"current": {
"text": "Prometheus",
"value": "Prometheus"
}
},
{
"name": "tenant",
"type": "query",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"definition": "label_values(zastava_runtime_events_total, tenant)",
"refresh": 1,
"hide": 0,
"current": {
"text": ".*",
"value": ".*"
},
"regex": "",
"includeAll": true,
"multi": true,
"sort": 1
}
]
},
"annotations": {
"list": [
{
"name": "Deployments",
"type": "tags",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"enable": true,
"iconColor": "rgba(255, 96, 96, 1)"
}
]
}
}


@@ -0,0 +1,131 @@
# Zastava Runtime Operations Runbook
This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
It aligns with `Sprint 12 Runtime Guardrails` and assumes components consume
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).
## 1. Prerequisites
- **Authority client credentials**: service principal `zastava-runtime` with scopes
`aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
certs before rollout.
- **Scanner/WebService reachability**: cluster DNS entry (e.g. `scanner.internal`)
resolvable from every node running Observer/Webhook.
- **Host mounts**: read-only access to `/proc`, container runtime state
(`/var/lib/containerd`, `/var/run/containerd/containerd.sock`) and scratch space
(`/var/run/zastava`).
- **Offline kit bundle**: operators staging air-gapped installs must download
`offline-kit/zastava-runtime-{version}.tar.zst` containing container images,
Grafana dashboards, and Prometheus rules referenced below.
- **Secrets**: Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
live outside git. For air-gapped installs copy them to the sealed secrets vault.
### 1.1 Telemetry quick reference
| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer task 12-004 completes. |
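Example PromQL spot checks against these series (windows and groupings are illustrative):
```promql
# Per-tenant observer event rate over the last five minutes.
sum by (tenant) (rate(zastava_runtime_events_total[5m]))

# P95 backend latency in milliseconds.
histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m])))

# Admission deny rate (per second).
sum(rate(zastava_admission_decisions_total{decision="deny"}[5m]))
```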
## 2. Deployment workflows
### 2.1 Fresh install (Helm overlay)
1. Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
- `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
- `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
- `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
- `zastava-authority-dpop` (JWK + private key).
- `zastava-authority-mtls` (client cert/key chain).
- `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy Observer DaemonSet and Webhook chart:
```sh
helm upgrade --install zastava-runtime deploy/helm/zastava \
-f values/zastava-runtime.yaml \
--namespace stellaops \
--create-namespace
```
5. Verify:
- `kubectl -n stellaops get pods -l app=zastava-observer` ready.
- `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows
`Issued runtime OpTok` audit line with DPoP token type.
- Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
### 2.2 Upgrades
1. Scale webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests (see the sketch after this list):
- Apply unsigned Pod manifest → expect `deny` (policy fail).
- Apply signed Pod manifest → expect `allow`.
5. Record upgrade in ops log with Git SHA + Helm chart version.
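A minimal sketch of the step 4 smoke tests; the manifest paths are assumptions:
```sh
# Unsigned image should be rejected by the admission webhook.
kubectl apply -f smoke/unsigned-pod.yaml \
  && echo "UNEXPECTED: unsigned pod admitted" \
  || echo "OK: unsigned pod denied"

# Signed image should be admitted, then cleaned up.
kubectl apply -f smoke/signed-pod.yaml && echo "OK: signed pod allowed"
kubectl delete -f smoke/signed-pod.yaml --ignore-not-found
```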
### 2.3 Rollback
1. Use Helm revision history: `helm history zastava-runtime`.
2. Rollback: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
```sh
kubectl -n stellaops exec deploy/zastava-webhook -- \
zastava-webhook invalidate-op-token --audience scanner
```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).
## 3. Authority & security guardrails
- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the
`authority.token.issue` scope with decision data; its absence indicates a
misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in
lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
initial bootstrap. Rotate them nightly and disable the fallback once Authority is
reachable.
- Audit every change in `zastava.runtime.authority` through change management.
Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
to confirm key rotation.
## 4. Incident response
### 4.1 Authority offline
1. Check Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for `authority.token.fallback` scope.
3. If fallback engaged, verify static token validity duration; rotate secret if older than 24 h.
4. Once Authority restored, delete static fallback secret and restart pods to rebind DPoP keys.
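A recovery sketch for step 4; the fallback secret name is an assumption, so substitute whatever `staticTokenPath` points at in your deployment:
```sh
# Remove the bootstrap fallback and force pods to re-acquire DPoP-bound OpToks.
kubectl -n stellaops delete secret zastava-static-fallback --ignore-not-found
kubectl -n stellaops rollout restart ds/zastava-observer deploy/zastava-webhook
```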
### 4.2 Scanner/WebService latency spike
1. Alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes.
2. Run backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the observer's disk-backed buffer may throttle intake. Confirm
queue drops via `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).
### 4.3 Admission deny storm
1. Alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application
owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.
## 5. Offline kit & air-gapped notes
- Bundle contents:
- Observer/Webhook container images (multi-arch).
- `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
- Sample `zastava-runtime.values.yaml`.
- Verification:
- Validate signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
- Extract Prometheus rules into offline monitoring cluster (`/etc/prometheus/rules.d`).
- Import Grafana dashboard via `grafana-cli --config ...`.
## 6. Observability assets
- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
the Offline Kit manifest.


@@ -0,0 +1,31 @@
groups:
- name: zastava-runtime
interval: 30s
rules:
- alert: ZastavaRuntimeEventsSilent
expr: sum(rate(zastava_runtime_events_total[10m])) == 0
for: 15m
labels:
severity: warning
service: zastava-runtime
annotations:
summary: "Observer events stalled"
description: "No runtime events emitted in the last 15 minutes. Check observer DaemonSet health and container runtime mounts."
- alert: ZastavaRuntimeBackendLatencyHigh
expr: histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
for: 10m
labels:
severity: critical
service: zastava-runtime
annotations:
summary: "Runtime backend latency p95 above 750 ms"
description: "Latency to Scanner runtime APIs is elevated. Inspect Scanner.WebService readiness, Authority OpTok issuance, and cluster network."
- alert: ZastavaAdmissionDenySpike
expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) * 60 > 20
for: 5m
labels:
severity: warning
service: zastava-runtime
annotations:
summary: "Admission webhook denies exceeding threshold"
description: "Webhook is denying more than 20 pod admissions per minute. Confirm policy verdicts and consider fail-open exception for impacted namespaces."