# Zastava Webhook · Wave 0 Implementation Notes
> Authored 2025-10-19 by Zastava Webhook Guild.
## ZASTAVA-WEBHOOK-12-101 — Admission Controller Host (TLS bootstrap + Authority auth)
**Objectives**

- Provide a deterministic, restart-safe .NET 10 host that exposes a Kubernetes ValidatingAdmissionWebhook endpoint.
- Load serving certificates at start-up only (per the restart-time plug-in rule) and surface reload guidance via documentation rather than hot reload.
- Authenticate outbound calls to Authority/Scanner using OpTok + DPoP as defined in `docs/modules/zastava/ARCHITECTURE.md`.
**Plan**

1. **Project scaffolding**
   - Create the `StellaOps.Zastava.Webhook` project with a minimal API pipeline (`Program.cs`, `Startup` equivalent via extension methods).
   - Reference shared helpers once `ZASTAVA-CORE-12-201/202` land; temporarily stub interfaces behind `IZastavaAdmissionRequest`/`IZastavaAdmissionResult`.
2. **TLS bootstrap**
   - Support two certificate sources:
     1. Mounted secret path (`/var/run/secrets/zastava-webhook/tls.{crt,key}`) with an optional CA bundle.
     2. CSR workflow: generate a CSR + private key, submit to the Kubernetes Certificates API when `admission.tls.autoApprove` is enabled; persist the signed cert/key to a mounted emptyDir for reuse across replicas.
   - Validate the cert/key pair on boot; abort start-up if it is invalid to preserve deterministic behaviour.
   - Configure Kestrel with mutual TLS disabled (the API server already provides client auth) but enforce a TLS 1.3 minimum, a strong cipher-suite list, and HTTP/2 disabled (the Kubernetes API server calls webhooks over HTTP/1.1).
3. **Authority auth**
   - Bootstrap the Authority client via the shared runtime core (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so the webhook reuses multitenant OpTok caching and guardrails.
   - Implement a DPoP proof generator bound to the webhook host keypair (prefer Ed25519) with a configurable rotation period (default 24 h, triggered at restart).
   - Add a background health check verifying token freshness and surfacing metrics (`zastava.authority_token_renew_failures_total`).
4. **Hosting concerns**
   - Configure structured logging with a correlation id taken from the AdmissionReview UID.
   - Expose `/healthz` (reads cert expiry and Authority token status) and `/metrics` (Prometheus).
   - Add a readiness gate that requires the initial TLS and Authority bootstrap to succeed.
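The DPoP proof generator in step 3 produces a JWS whose claims bind each request to an HTTP method and URI. A minimal sketch of the serialization (Python for brevity; the host itself is .NET, and the real signature comes from the Ed25519 keypair via a JOSE library — here `sign` is a stub):

```python
import base64
import json
import time
import uuid

def b64url(data: bytes) -> str:
    """Base64url without padding, as JWS serialization requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def dpop_proof(method: str, url: str, jwk: dict, sign) -> str:
    """Build a DPoP proof JWT (RFC 9449 shape): header.claims.signature.

    `sign` stands in for the host keypair's signer; this sketch only
    shows the claim structure, not real crypto.
    """
    header = {"typ": "dpop+jwt", "alg": "EdDSA", "jwk": jwk}
    claims = {
        "htm": method,             # HTTP method the proof is bound to
        "htu": url,                # target URI without query/fragment
        "iat": int(time.time()),   # issued-at, checked for freshness
        "jti": str(uuid.uuid4()),  # unique id so proofs cannot be replayed
    }
    signing_input = (
        b64url(json.dumps(header).encode())
        + "."
        + b64url(json.dumps(claims).encode())
    )
    return signing_input + "." + b64url(sign(signing_input.encode()))

proof = dpop_proof(
    "POST",
    "https://authority.internal/token",   # illustrative Authority endpoint
    jwk={"kty": "OKP", "crv": "Ed25519", "x": "stub"},
    sign=lambda data: b"signature-placeholder",
)
```

Rotating the keypair (default 24 h, at restart) simply swaps the `jwk` and `sign` pair; outstanding proofs stay valid only for their `iat` freshness window.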
**Deliverables**

- Compilable host project with integration tests covering TLS load (mounted files + CSR mock) and Authority token acquisition.
- Documentation snippet for deploy charts describing secret/CSR wiring.
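The deploy-chart snippet above might wire the two certificate sources like this. All key names other than `admission.tls.autoApprove` (which appears in the plan) are hypothetical placeholders, not the final chart contract:

```yaml
# Hypothetical Helm values sketch — illustrative key names only.
admission:
  tls:
    # Option 1: mounted secret (tls.crt / tls.key, optional ca.crt)
    secretName: zastava-webhook-tls
    mountPath: /var/run/secrets/zastava-webhook
    # Option 2: CSR workflow against the Kubernetes Certificates API
    autoApprove: false   # when true, submit and auto-approve a CSR at boot
```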
**Open Questions**

- Need confirmation from the Core guild on DTO naming (`AdmissionReviewEnvelope`, `AdmissionDecision`) to avoid rework.
- Determine whether CSR auto-approval is acceptable for air-gapped clusters without Kubernetes cert-manager; a fallback manual cert-import path may be required.
## ZASTAVA-WEBHOOK-12-102 — Backend policy query & digest resolution
**Objectives**

- Resolve all images within the AdmissionReview to immutable digests before policy evaluation.
- Call the Scanner WebService `/api/v1/scanner/policy/runtime` with a namespace/labels/images payload and enforce verdicts with deterministic error messaging.
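The digest-first resolution the objectives call for can be sketched as a small strategy chain (Python for brevity; the host itself is .NET). `resolve_tag` stands in for the registry HEAD request or Observer lookup; it is a placeholder, not a real API:

```python
import re

# An image reference is already immutable if it carries a sha256 digest.
DIGEST_RE = re.compile(r"@(sha256:[0-9a-f]{64})$")

def resolve_image(ref: str, resolve_tag) -> str:
    """Digest-first resolution: keep an existing digest, otherwise ask
    the registry (or Observer) to map the tag to one."""
    match = DIGEST_RE.search(ref)
    if match:
        return match.group(1)     # already immutable — use as-is
    return resolve_tag(ref)       # e.g. registry HEAD when tag resolution is enabled

digest = resolve_image(
    "registry.local/nginx@sha256:" + "a" * 64,
    resolve_tag=lambda ref: "sha256:" + "b" * 64,  # stubbed registry lookup
)
```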
**Plan**

1. **Image resolution**
   - Implement a resolver service with pluggable strategies:
     - Use the existing digest if present.
     - Resolve tags via registry HEAD requests (respecting the `admission.resolveTags` flag); fall back to Observer-provided digests once core DTOs are available.
   - Cache per-registry auth to minimise latency; adhere to allow/deny lists from configuration.
2. **Scanner client**
   - Define typed request/response models mirroring the `docs/modules/zastava/ARCHITECTURE.md` structure (`ttlSeconds`, `results[digest] -> { signed, hasSbom, policyVerdict, reasons, rekor }`).
   - Implement a retry policy (3 attempts, exponential backoff) and map HTTP errors to webhook fail-open/closed behaviour depending on namespace configuration.
   - Instrument latency (`zastava.backend_latency_seconds`) and failure counts.
3. **Verdict enforcement**
   - Evaluate per-image results: if any `policyVerdict != pass` (treating `warn` as passing when `enforceWarnings=false`), deny with aggregated reasons.
   - Attach `ttlSeconds` to admission response annotations for auditing.
   - Record structured logs with namespace, pod, image digest, decision, reasons, and backend latency.
4. **Contract coordination**
   - Schedule a joint review with the Scanner WebService guild once the SCANNER-RUNTIME-12-302 schema stabilises; track in TASKS sub-items.
   - Provide sample payload fixtures for the CLI team (`CLI-RUNTIME-13-005`) to validate table output; ensure field names stay aligned.
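The per-image evaluation in step 3 reduces to a pure aggregation over the `results[digest]` map. A minimal sketch (Python for brevity; the host itself is .NET), with reasons joined in digest order so the denial message stays deterministic:

```python
def admit(results: dict, enforce_warnings: bool = True) -> dict:
    """Aggregate per-image verdicts into one admission decision.

    `results` maps digest -> {"policyVerdict": ..., "reasons": [...]},
    mirroring the results[digest] shape referenced in the plan.
    """
    failing = {}
    for digest, result in results.items():
        verdict = result["policyVerdict"]
        if verdict == "pass":
            continue
        if verdict == "warn" and not enforce_warnings:
            continue                       # warnings tolerated by config
        failing[digest] = result.get("reasons", [])
    if failing:
        # Sort by digest so repeated evaluations emit identical messages.
        message = "; ".join(
            f"{digest}: {', '.join(reasons) or 'policy verdict not pass'}"
            for digest, reasons in sorted(failing.items())
        )
        return {"allowed": False, "message": message}
    return {"allowed": True, "message": ""}
```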
**Deliverables**

- Registry resolver unit tests (tag → digest) with deterministic fixtures.
- HTTP client integration tests using a Scanner stub returning varied verdict combinations.
- Documentation update summarising the contract and failure handling.
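For the HTTP client tests above, the retry policy from step 2 of the plan (3 attempts, exponential backoff) could be sketched as follows (Python for brevity; the .NET client would express the same shape with Polly or a handler). The injectable `sleep` keeps tests deterministic:

```python
import time

def call_with_retry(op, attempts: int = 3, base_delay: float = 0.2, sleep=time.sleep):
    """Retry `op` up to `attempts` times with exponential backoff
    (base_delay, 2*base_delay, ...). The final error propagates so the
    caller can apply the namespace fail-open/fail-closed policy."""
    last_error = None
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:   # a real client would catch transport errors only
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```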
**Open Questions**

- Confirm the expected policy verdict enumeration (`pass|warn|fail|error`?) and textual reason codes.
- Clarify TTL behaviour: should the webhook reduce the TTL when the backend returns a value greater than the configured maximum?
## ZASTAVA-WEBHOOK-12-103 — Caching, fail-open/closed toggles, metrics/logging
**Objectives**

- Provide a deterministic caching layer that respects backend TTLs while ensuring eviction on policy mutation.
- Allow namespace-scoped fail-open behaviour with explicit metrics and alerts.
- Surface actionable metrics and logging aligned with the architecture doc.
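The caching objective above (TTL-bounded, evicted on policy mutation, LRU-bounded per replica) can be sketched as a small data structure (Python for brevity; the .NET host would likely build on `MemoryCache`). The `revision` parameter stands in for the `policyRevision` hook discussed in the plan:

```python
from collections import OrderedDict

class VerdictCache:
    """Bounded LRU keyed by image digest; entries expire at `now + ttl`
    and are dropped when the policy revision changes."""

    def __init__(self, max_entries: int = 5000):
        self._entries = OrderedDict()   # digest -> (verdict, expires_at, revision)
        self._max = max_entries

    def put(self, digest, verdict, ttl_seconds, revision, now):
        self._entries[digest] = (verdict, now + ttl_seconds, revision)
        self._entries.move_to_end(digest)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)   # evict least recently used

    def get(self, digest, revision, now):
        entry = self._entries.get(digest)
        if entry is None:
            return None
        verdict, expires_at, entry_revision = entry
        if now >= expires_at or entry_revision != revision:
            del self._entries[digest]           # TTL expiry or policy mutation
            return None
        self._entries.move_to_end(digest)       # refresh LRU position
        return verdict
```

Passing `now` explicitly keeps the cache deterministic under test, in line with the module's determinism requirement.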
**Plan**

1. **Cache design**
   - In-memory LRU keyed by image digest; each value carries the verdict payload plus an expiry timestamp.
   - Support an optional persistent seed (read-only) to prime hot digests for offline clusters (config: `admission.cache.seedPath`).
   - On startup, load the seed file and emit the metric `zastava.cache_seed_entries_total`.
   - Evict entries on TTL expiry or when the `policyRevision` annotation in the AdmissionReview changes (requires a hook from the Core DTO).
2. **Fail-open/closed toggles**
   - Configuration: global default plus namespace overrides through `admission.failOpenNamespaces` and `admission.failClosedNamespaces`.
   - Decision matrix:
     - Backend success + verdict PASS → allow.
     - Backend success + non-pass → deny unless the namespace override allows warnings.
     - Backend failure → allow if the namespace is fail-open, deny otherwise; annotate the response with `zastava.ops/fail-open=true`.
   - Implement a policy change event hook (future) to clear the cache if the Observer signals revocation.
3. **Metrics & logging**
   - Counters: `zastava.admission_requests_total{decision}`, `zastava.cache_hits_total{result=hit|miss}`, `zastava.fail_open_total`, `zastava.backend_failures_total{stage}`.
   - Histograms: `zastava.admission_latency_seconds` (overall), `zastava.resolve_latency_seconds`.
   - Logs: structured JSON with `decision`, `namespace`, `pod`, `imageDigest`, `reasons`, `cacheStatus`, `failMode`.
   - Optionally emit an OpenTelemetry span for the admission path with attributes capturing backend latency and cache path.
4. **Testing & ops hooks**
   - Unit tests for cache TTL, namespace override logic, and fail-open metric increments.
   - Integration test simulating a backend outage to ensure fail-open/closed behaviour matches configuration.
   - Document a runbook snippet describing how to interpret metrics and toggle namespaces.
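The decision matrix in step 2 is deliberately a pure function of backend outcome, verdict, and namespace overrides, which makes it trivial to unit-test. A minimal sketch (Python for brevity; the host itself is .NET — the set parameters model the `failOpenNamespaces` override and a hypothetical warn-allowed override):

```python
def decide(backend_ok: bool, verdict, namespace: str,
           fail_open: set, warn_allowed: set):
    """Pure decision matrix: (backend outcome, verdict, namespace
    overrides) -> (allow, response annotations)."""
    annotations = {}
    if not backend_ok:
        # Backend failure: namespace override picks fail-open vs fail-closed.
        if namespace in fail_open:
            annotations["zastava.ops/fail-open"] = "true"
            return True, annotations
        return False, annotations
    if verdict == "pass":
        return True, annotations
    if verdict == "warn" and namespace in warn_allowed:
        return True, annotations        # warnings tolerated by override
    return False, annotations           # any other non-pass verdict denies
```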
**Open Questions**

- Confirm whether cache entries should include `policyRevision` to detect backend policy updates; requires coordination with the Policy guild.
- Need guidance on a maximum cache size (default suggestion: 5k entries per replica?) to avoid memory blow-up.