up

2025-11-25 22:09:44 +02:00
parent 6bee1fdcf5
commit 9f6e6f7fb3
116 changed files with 4495 additions and 730 deletions
--- a/docs/operations/notifier-runbook.md
+++ b/docs/operations/notifier-runbook.md
@@ -0,0 +1,58 @@
+# Notifier Runbook
+
+Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
+
+## Purpose
+Operational steps to deploy, monitor, and recover the Notifications service (WebService + Worker).
+
+## Pre-flight
+- Secrets stored in Authority: SMTP creds, Slack/Teams hooks, webhook HMAC keys.
+- Outbound allowlist updated for target channels.
+- Mongo and Redis reachable; health checks pass.
+- Offline kit loaded: channel manifests, default templates, rule seeds.
+
+## Deploy
+1. Apply Kubernetes manifests/Compose stack from `ops/notify/` with image digests pinned.
+2. Set env:
+   - `Notify__Mongo__ConnectionString`
+   - `Notify__Redis__ConnectionString`
+   - `Notify__Authority__BaseUrl`
+   - `Notify__ChannelAllowlist`
+   - `ASPNETCORE_URLS=http://0.0.0.0:8080`
+3. Optional: warm caches with `POST /api/v1/notify/admin/warm` (loads rules/templates into memory).
+4. Verify `GET /api/v1/notify/health` returns `ready=true`.
+
+## Monitor
+- Metrics (Prometheus):
+  - `notify_delivery_attempts_total` by status/channel/tenant.
+  - `notify_escalation_stage_total` by policy/stage.
+  - `notify_rule_eval_seconds_bucket` for worker latency.
+- Logs: structured JSON with `tenant`, `ruleId`, `deliveryId`, `channel`, `status`.
+- Traces: span `notify.delivery`, linked to the originating event's `traceparent` when provided.
+
+## Common operations
+- **List stuck deliveries**: `GET /api/v1/notify/deliveries?status=failed&from=<utc>`.
+- **Replay delivery**: `POST /api/v1/notify/deliveries/{id}:replay` (idempotent; only re-renders if inputs unchanged).
+- **Pause a tenant**: set tenant state `paused=true` via the admin API; the worker stops sending but keeps evaluating rules for audit.
+- **Rotate secrets**: update Authority secret, then `POST /api/v1/notify/channels/{id}:refresh-secret`.
+
+## Failure recovery
+- Worker crash loop: check Redis connectivity and template compile errors, then run `notify-worker --validate-only` against the current config.
+- Mongo outage: the worker backs off with exponential retry; after recovery, replay affected deliveries via `:replay` or digests as needed.
+- Channel outage (e.g., Slack 5xx): throttling and the retry policy handle transient errors; for extended outages, disable the channel or switch to a backup policy.
+
+## Auditing
+- Delivery ledger retains attempt hashes and signatures; export via `/deliveries?from=...&to=...&format=ndjson` for offline review.
+- Ack events stored with actor, timestamp, source IP.
+
+## Determinism safeguards
+- Rule snapshots are versioned per tenant; upgrades swap snapshots atomically.
+- Template rendering uses deterministic helpers only; no live lookups.
+- Time sources are UTC; quiet hours evaluated using tenant timezone from config.
+
+## On-call checklist
+- [ ] Health endpoints green.
+- [ ] Delivery failure rate < 0.5% over last hour.
+- [ ] Escalation backlog empty or within SLO.
+- [ ] Redis memory < 75% and Mongo primary healthy.
+- [ ] Latest release notes applied and channels validated.
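
Note on the on-call checklist above: the two numeric thresholds (delivery failure rate < 0.5% over the last hour, Redis memory < 75%) can be checked mechanically. The following is a minimal sketch, not the service's own tooling. It assumes the hourly attempt counts have already been read from `notify_delivery_attempts_total` grouped by status, and that failed attempts carry the status label `failed`; the `Snapshot` type and the `failure_rate` and `checklist` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

FAILURE_RATE_LIMIT = 0.005  # checklist: delivery failure rate < 0.5% over the last hour
REDIS_MEMORY_LIMIT = 0.75   # checklist: Redis memory < 75%


@dataclass
class Snapshot:
    """Hypothetical container for values an operator has already collected."""
    attempts_by_status: dict  # e.g. {"delivered": 996, "failed": 4} for the last hour
    redis_used_bytes: int
    redis_max_bytes: int


def failure_rate(attempts_by_status):
    """Fraction of attempts with status 'failed' (assumed label); 0.0 when there were no attempts."""
    total = sum(attempts_by_status.values())
    if total == 0:
        return 0.0
    return attempts_by_status.get("failed", 0) / total


def checklist(snapshot):
    """Evaluate the two numeric checklist items; True means the item passes."""
    rate = failure_rate(snapshot.attempts_by_status)
    if snapshot.redis_max_bytes > 0:
        redis_ratio = snapshot.redis_used_bytes / snapshot.redis_max_bytes
    else:
        # No configured limit: treat as failing so a human looks at it.
        redis_ratio = 1.0
    return {
        "delivery_failure_rate_ok": rate < FAILURE_RATE_LIMIT,
        "redis_memory_ok": redis_ratio < REDIS_MEMORY_LIMIT,
    }
```

Both comparisons are strict, matching the `<` in the checklist, so a failure rate of exactly 0.5% is reported as failing.
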
--- a/docs/operations/orchestrator-runbook.md
+++ b/docs/operations/orchestrator-runbook.md
@@ -0,0 +1,39 @@
+# Orchestrator Runbook (DOCS-ORCH-34-003)
+
+Last updated: 2025-11-25
+
+## Pre-flight
+- Ensure Mongo and the queue backend are reachable and the health endpoint `/api/v1/orchestrator/admin/health` is green.
+- Verify that the tenant allowlist and scopes (`orchestrator:*`) are configured in Authority.
+- Confirm that plugin bundles are present and their signatures verified.
+
+## Common operations
+- **Start a run**: `POST /api/v1/orchestrator/runs` or `stella orch run start ...`.
+- **Cancel a run**: `POST /runs/{runId}:cancel`; best-effort, idempotent.
+- **Stream status**: WebSocket `/runs/stream` or CLI `stella orch run stream`.
+- **Export ledger**: NDJSON export by time window for audits.
+
+## Incident response
+- **Queue backlog**: Check queue depth; scale workers or pause schedulers; drain oldest first. Verify that no plugin is stuck.
+- **Repeated failures**: Inspect the run ledger for `error.code`; compare `inputsHash` and plugin version; roll back the DAG version if the failure is a regression.
+- **Plugin auth errors**: rotate `secretRef` in Authority; warm worker cache; re-run impacted DAGs.
+- **Scheduler runaway**: disable offending DAG version; clear scheduled triggers; confirm queue drains.
+
+## Health checks
+- `GET /api/v1/orchestrator/admin/health`: liveness/readiness plus queue depth.
+- Metrics: `orchestrator_runs_total`, `orchestrator_queue_depth`, `orchestrator_step_retries_total`, `orchestrator_run_duration_seconds`.
+- Logs: structured JSON with `tenant`, `dagId`, `runId`, `status`; check for redaction markers.
+
+## Determinism/immutability
+- Runs are append-only; do not mutate ledger entries. Use new DAG versions for fixes.
+- Idempotency relies on `runToken`: a rerun that repeats the same intended work should reuse the same token.
+
+## Offline/air-gap
+- Keep plugin bundles and DAG specs in sealed storage; no remote fetch.
+- Export logs/metrics/traces as NDJSON for offline analysis; include manifest/hash.
+
+## Quick checks
+- [ ] Health green, queue depth normal.
+- [ ] Latest plugin bundle signatures valid.
+- [ ] No secrets in logs (spot-check redaction).
+- [ ] Error budget within SLO (see `docs/observability/metrics-and-slos.md`).
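
Note on the offline/air-gap section above: the runbook asks for NDJSON exports that include a manifest/hash but does not say what the manifest looks like. The sketch below shows one way to produce and verify such a pair under stated assumptions. The manifest fields (`file`, `sha256`, `bytes`, `records`) and the helper names are hypothetical, not the orchestrator's actual export format.

```python
import hashlib
import json


def to_ndjson(records):
    """Serialize records as NDJSON bytes: one JSON object per line.

    Keys are sorted and separators fixed so the same records always
    yield the same bytes, and therefore the same hash.
    """
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records]
    if not lines:
        return b""
    return ("\n".join(lines) + "\n").encode("utf-8")


def build_manifest(name, payload, record_count):
    """Describe an export file; the field names are illustrative assumptions."""
    return {
        "file": name,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
        "records": record_count,
    }


def verify(payload, manifest):
    """Check on the offline side that the payload matches its manifest."""
    return (
        len(payload) == manifest["bytes"]
        and hashlib.sha256(payload).hexdigest() == manifest["sha256"]
    )
```

A typical flow would be to call `to_ndjson` on the exported ledger rows, write the bytes and the manifest side by side, and run `verify` after transferring both into the air-gapped environment.
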