# Zastava Webhook · Wave 0 Implementation Notes

> Authored 2025-10-19 by Zastava Webhook Guild.

## ZASTAVA-WEBHOOK-12-101 — Admission Controller Host (TLS bootstrap + Authority auth)

**Objectives**
- Provide a deterministic, restart-safe .NET 10 host that exposes a Kubernetes ValidatingAdmissionWebhook endpoint.
- Load serving certificates at start-up only (per restart-time plug-in rule) and surface reload guidance via documentation rather than hot-reload.
- Authenticate outbound calls to Authority/Scanner using OpTok + DPoP as defined in `docs/modules/zastava/ARCHITECTURE.md`.

**Plan**
1. **Project scaffolding**
   - Create the `StellaOps.Zastava.Webhook` project with a minimal API pipeline (`Program.cs`, `Startup` equivalent via extension methods).
   - Reference shared helpers once `ZASTAVA-CORE-12-201/202` land; until then, stub the interfaces behind `IZastavaAdmissionRequest`/`IZastavaAdmissionResult`.
2. **TLS bootstrap**
   - Support two certificate sources:
     1. Mounted secret path (`/var/run/secrets/zastava-webhook/tls.{crt,key}`) with an optional CA bundle.
     2. CSR workflow: generate a CSR + private key and submit it to the Kubernetes Certificates API when `admission.tls.autoApprove` is enabled; persist the signed cert/key to a mounted emptyDir for reuse. Note that an emptyDir is scoped to a single pod and only survives container restarts, so sharing the cert across replicas would need a shared volume or Secret.
   - Validate the cert/key pair on boot; abort start-up if it is invalid to preserve deterministic behavior.
   - Configure Kestrel with mutual TLS off (API Server already provides client auth), but enforce a minimum of TLS 1.3, a strong cipher suite list, and HTTP/2 disabled (K8s uses HTTP/1.1).
3. **Authority auth**
   - Bootstrap the Authority client via the shared runtime core (`AddZastavaRuntimeCore` + `IZastavaAuthorityTokenProvider`) so the webhook reuses multitenant OpTok caching and guardrails.
   - Implement a DPoP proof generator bound to the webhook host keypair (prefer Ed25519) with a configurable rotation period (default 24h, triggered at restart).
   - Add a background health check that verifies token freshness and surfaces metrics (`zastava.authority_token_renew_failures_total`).
4. **Hosting concerns**
   - Configure structured logging with a correlation ID taken from the AdmissionReview UID.
   - Expose `/healthz` (reads cert expiry, Authority token status) and `/metrics` (Prometheus).
   - Add a readiness gate that requires the initial TLS and Authority bootstrap to succeed.
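
The "validate on boot, abort if invalid" rule from the TLS bootstrap step can be illustrated with a short sketch. It is written in Python purely for brevity and is not the Kestrel configuration; `load_serving_context` and `TlsBootstrapError` are hypothetical names, and the only behaviour relied on is that the standard library's `load_cert_chain` rejects missing files, malformed PEM, and mismatched key pairs.

```python
import ssl


class TlsBootstrapError(RuntimeError):
    """Raised when the serving certificate cannot be loaded at start-up."""


def load_serving_context(cert_path: str, key_path: str) -> ssl.SSLContext:
    """Build a server-side TLS context, or fail fast so the host never starts.

    load_cert_chain() raises if a file is missing, is not valid PEM, or if
    the private key does not match the certificate.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # plan: enforce TLS 1.3 minimum
    ctx.verify_mode = ssl.CERT_NONE               # plan: mutual TLS off
    try:
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except OSError as exc:  # ssl.SSLError and FileNotFoundError are both OSErrors
        raise TlsBootstrapError(f"invalid serving cert/key: {exc}") from exc
    return ctx
```

A host following this pattern would call the loader once during start-up and let the exception terminate the process, which keeps a bad secret mount visible as a crash loop rather than a half-working webhook.
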

**Deliverables**
- Compilable host project with integration tests covering TLS load (mounted files + CSR mock) and Authority token acquisition.
- Documentation snippet for deploy charts describing secret/CSR wiring.

**Open Questions**
- Need confirmation from the Core guild on DTO naming (`AdmissionReviewEnvelope`, `AdmissionDecision`) to avoid rework.
- Determine whether CSR auto-approval is acceptable for air-gapped clusters without cert-manager; a fallback manual cert import path may be required.

## ZASTAVA-WEBHOOK-12-102 — Backend policy query & digest resolution

**Objectives**
- Resolve all images within the AdmissionReview to immutable digests before policy evaluation.
- Call Scanner WebService `/api/v1/scanner/policy/runtime` with a namespace/labels/images payload and enforce verdicts with deterministic error messaging.

**Plan**
1. **Image resolution**
   - Implement a resolver service with pluggable strategies:
     - Use the existing digest if present.
     - Resolve tags via registry HEAD (respecting the `admission.resolveTags` flag); fall back to the Observer-provided digest once core DTOs are available.
   - Cache per-registry auth to minimise latency; adhere to allow/deny lists from configuration.
2. **Scanner client**
   - Define typed request/response models mirroring the `docs/modules/zastava/ARCHITECTURE.md` structure (`ttlSeconds`, `results[digest] -> { signed, hasSbom, policyVerdict, reasons, rekor }`).
   - Implement a retry policy (3 attempts, exponential backoff) and map HTTP errors to webhook fail-open/closed depending on namespace configuration.
   - Instrument latency (`zastava.backend_latency_seconds`) and failure counts.
3. **Verdict enforcement**
   - Evaluate per-image results: deny with aggregated reasons if any `policyVerdict` is not `pass`; `warn` is also accepted when `enforceWarnings=false`.
   - Attach `ttlSeconds` to admission response annotations for auditing.
   - Record structured logs with namespace, pod, image digest, decision, reasons, and backend latency.
4. **Contract coordination**
   - Schedule a joint review with the Scanner WebService guild once the SCANNER-RUNTIME-12-302 schema stabilises; track in TASKS sub-items.
   - Provide sample payload fixtures for the CLI team (`CLI-RUNTIME-13-005`) to validate table output; ensure field names stay aligned.
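
The verdict enforcement step above can be sketched as a small pure function. This is an illustration in Python under stated assumptions, not the webhook's implementation: `ImageResult`, `Decision`, and `enforce` are hypothetical names, the verdict values are the tentative `pass|warn|fail|error` set, and `enforce_warnings` is read as "treat `warn` as a denial when true".

```python
from dataclasses import dataclass, field


@dataclass
class ImageResult:
    """Hypothetical per-digest entry from the Scanner response."""
    policy_verdict: str  # assumed values: pass | warn | fail | error
    reasons: list[str] = field(default_factory=list)


@dataclass
class Decision:
    allowed: bool
    reasons: list[str]


def enforce(results: dict[str, ImageResult], enforce_warnings: bool) -> Decision:
    """Deny if any image has an unacceptable verdict; aggregate the reasons."""
    accepted = {"pass"} if enforce_warnings else {"pass", "warn"}
    reasons: list[str] = []
    # Iterate in digest order so the denial message is identical across runs.
    for digest in sorted(results):
        result = results[digest]
        if result.policy_verdict not in accepted:
            detail = ", ".join(result.reasons) or "no reason given"
            reasons.append(f"{digest}: {result.policy_verdict} ({detail})")
    return Decision(allowed=not reasons, reasons=reasons)
```

Sorting by digest is one way to get the deterministic error messaging the objectives ask for; any stable ordering would serve the same purpose.
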

**Deliverables**
- Registry resolver unit tests (tag->digest) with deterministic fixtures.
- HTTP client integration tests using Scanner stub returning varied verdict combinations.
- Documentation update summarising contract and failure handling.
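
For the Scanner client's retry policy (3 attempts, exponential backoff), a minimal sketch follows. It is written in Python for illustration only; `call_with_retry`, `BackendUnavailable`, and the 0.2 s base delay are assumptions, and the `sleep` parameter is injectable so that tests against a Scanner stub stay deterministic.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BackendUnavailable(Exception):
    """Raised once all attempts are exhausted."""


def call_with_retry(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, waiting base, 2*base, 4*base, ... between failed attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:  # a real client would retry transient errors only
            last_error = exc
            if attempt < attempts - 1:  # no wait after the final attempt
                sleep(base_delay * (2 ** attempt))
    raise BackendUnavailable(
        f"scanner call failed after {attempts} attempts"
    ) from last_error
```

The caller would catch `BackendUnavailable` and hand the request to the namespace's fail-open or fail-closed path.
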

**Open Questions**
- Confirm the expected policy verdict enumeration (`pass|warn|fail|error`?) and textual reason codes.
- TTL behaviour needs a decision: should the webhook reduce the TTL when the backend returns a value above the configured maximum?

## ZASTAVA-WEBHOOK-12-103 — Caching, fail-open/closed toggles, metrics/logging

**Objectives**
- Provide a deterministic caching layer that respects the backend TTL while ensuring eviction on policy mutation.
- Allow namespace-scoped fail-open behaviour with explicit metrics and alerts.
- Surface actionable metrics/logging aligned with the Architecture doc.

**Plan**
1. **Cache design**
   - In-memory LRU keyed by image digest; the value carries the verdict payload + expiry timestamp.
   - Support an optional persistent seed (read-only) to prime hot digests for offline clusters (config: `admission.cache.seedPath`).
   - On startup, load the seed file and emit the metric `zastava.cache_seed_entries_total`.
   - Evict entries on TTL expiry or when the `policyRevision` annotation in the AdmissionReview changes (requires a hook from the Core DTO).
2. **Fail-open/closed toggles**
   - Configuration: global default + namespace overrides through `admission.failOpenNamespaces`, `admission.failClosedNamespaces`.
   - Decision matrix:
     - Backend success + verdict PASS → allow.
     - Backend success + non-pass → deny unless a namespace override allows `warn`.
     - Backend failure → allow if the namespace is fail-open, deny otherwise; annotate the response with `zastava.ops/fail-open=true` when admitting via fail-open.
   - Implement a policy change event hook (future) to clear the cache if the observer signals revocation.
3. **Metrics & logging**
   - Counters: `zastava.admission_requests_total{decision}`, `zastava.cache_hits_total{result=hit|miss}`, `zastava.fail_open_total`, `zastava.backend_failures_total{stage}`.
   - Histograms: `zastava.admission_latency_seconds` (overall), `zastava.resolve_latency_seconds`.
   - Logs: structured JSON with `decision`, `namespace`, `pod`, `imageDigest`, `reasons`, `cacheStatus`, `failMode`.
   - Optionally emit an OpenTelemetry span for the admission path with attributes capturing backend latency + cache path.
4. **Testing & ops hooks**
   - Unit tests for cache TTL, namespace override logic, and fail-open metric increments.
   - Integration test simulating a backend outage to ensure fail-open/closed behaviour matches config.
   - Document a runbook snippet describing how to interpret the metrics and toggle namespaces.
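
The fail-open/closed decision matrix in the plan can be written down as a pure function, which also makes the namespace override logic easy to unit test. The sketch below is in Python and rests on several assumptions: `AdmissionConfig`, `decide`, and `warn_allowed_namespaces` are hypothetical names, an explicit fail-closed entry is assumed to win over a fail-open one, and a backend failure is modelled as a `None` verdict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AdmissionConfig:
    fail_open_default: bool = False
    fail_open_namespaces: set[str] = field(default_factory=set)
    fail_closed_namespaces: set[str] = field(default_factory=set)
    warn_allowed_namespaces: set[str] = field(default_factory=set)  # hypothetical

    def is_fail_open(self, namespace: str) -> bool:
        # Assumption: an explicit fail-closed entry wins over fail-open.
        if namespace in self.fail_closed_namespaces:
            return False
        if namespace in self.fail_open_namespaces:
            return True
        return self.fail_open_default


def decide(namespace: str, verdict: Optional[str], cfg: AdmissionConfig):
    """Return (allowed, annotations); `verdict` is None when the backend failed."""
    if verdict is None:
        if cfg.is_fail_open(namespace):
            return True, {"zastava.ops/fail-open": "true"}
        return False, {}
    if verdict == "pass":
        return True, {}
    if verdict == "warn" and namespace in cfg.warn_allowed_namespaces:
        return True, {}
    return False, {}
```

Keeping the matrix free of I/O means the outage integration test only has to verify that a backend failure reaches this function as a failure, rather than re-testing every row.
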

**Open Questions**
- Confirm whether cache entries should include `policyRevision` to detect backend policy updates; this requires coordination with the Policy guild.
- Need guidance on the maximum cache size (suggested default: 5k entries per replica?) to avoid memory blow-up.
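
To make the cache-size question concrete, here is a minimal sketch of a bounded LRU with per-entry expiry, matching the cache design in the plan. It is written in Python for illustration; `VerdictCache` and its methods are hypothetical, the 5000-entry default simply mirrors the suggestion above, and the seed file and `policyRevision` hook are reduced to a `clear()` method.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class VerdictCache:
    """Bounded LRU keyed by image digest; each entry carries an expiry time."""

    def __init__(self, max_entries: int = 5000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._clock = clock  # injectable so TTL tests need no real waiting
        self._entries = OrderedDict()  # digest -> (expires_at, verdict)

    def put(self, digest: str, verdict: Any, ttl_seconds: float) -> None:
        self._entries[digest] = (self._clock() + ttl_seconds, verdict)
        self._entries.move_to_end(digest)      # mark as most recently used
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # drop the least recently used

    def get(self, digest: str) -> Optional[Any]:
        entry = self._entries.get(digest)
        if entry is None:
            return None
        expires_at, verdict = entry
        if self._clock() >= expires_at:
            del self._entries[digest]          # expired: treat as a miss
            return None
        self._entries.move_to_end(digest)
        return verdict

    def clear(self) -> None:
        """Hook for a policy-revision change: drop every entry."""
        self._entries.clear()

    def __len__(self) -> int:
        return len(self._entries)
```

With a hard entry cap, memory use per replica is bounded by the cap times the verdict payload size, so the open question reduces to choosing a cap that fits the expected number of distinct digests.
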