docs consolidation and others
docs/operations/runbooks/assistant-ops.md (Normal file, +42)
@@ -0,0 +1,42 @@
# Assistant Ops Runbook (DOCS-AIAI-31-009)

_Updated: 2025-11-24 · Owners: DevOps Guild · Advisory AI Guild · Sprint 0111_

This runbook covers day-2 operations for Advisory AI (web + worker), with emphasis on cache priming, guardrail verification, and outage handling in offline/air-gapped installs.

## 1) Warmup & cache priming

- Ensure Offline Kit fixtures are staged:
  - CLI guardrail bundles: `out/console/guardrails/cli-vuln-29-001/`, `out/console/guardrails/cli-vex-30-001/`.
  - SBOM context fixtures: copy into `data/advisory-ai/fixtures/sbom/` and record hashes in `SHA256SUMS`.
  - Profiles/prompts manifests: ensure `profiles.catalog.json` and `prompts.manifest` hashes match the `AdvisoryAI:Provenance` settings.
- Start services and prime caches using cache-only calls (see the probe sketch after this list):
  - `stella advise run summary --advisory-key <id> --timeout 0 --json` (should return cached/empty context and exit 0).
  - `stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json` (verifies SBOM clamps without executing inference).

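A small wrapper can make the priming probes repeatable. This is a minimal sketch assuming the `stella` CLI from this section; the sample advisory keys and the `jq` check on `context.planCacheKey` (the field named in the section 6 checklist) are illustrative.

```bash
#!/usr/bin/env bash
# Cache-priming probe sketch; aborts on the first non-zero CLI exit.
set -euo pipefail

ADVISORY_KEYS=("CVE-2025-0001" "CVE-2025-0002")   # hypothetical sample keys

for key in "${ADVISORY_KEYS[@]}"; do
  out=$(stella advise run summary --advisory-key "$key" --timeout 0 --json)
  # A populated planCacheKey indicates the plan cache is primed for this key.
  echo "$key -> planCacheKey=$(jq -r '.context.planCacheKey // "MISSING"' <<<"$out")"
done
```
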
## 2) Guardrail & provenance verification

- Run the guardrail self-test: `dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail` (offline-safe).
- Validate DSSE bundles:
  - `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/prompts.manifest.dsse --source prompts.manifest`
  - `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/policy-bundle.intoto.jsonl --digest <policy-digest>`
- Confirm the `AdvisoryAI:Guardrails:BlockedPhrases` file matches the hash captured during pack build; diff against `prompts.manifest`.

## 3) Scaling & queue health

- Defaults: queue capacity 1024, dequeue wait 1 s (see `docs/modules/policy/guides/assistant-parameters.md`). For bursty tenants, scale workers horizontally before increasing queue size to preserve determinism.
- Metrics to watch: `advisory_ai_queue_depth`, `advisory_ai_latency_seconds`, `advisory_ai_guardrail_blocks_total`. A saturation query sketch follows this list.
- If queue depth stays above 75% of capacity for 5 minutes, add one worker pod or increase `Queue:Capacity` by 25% (record the change in the ops log).

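The 75%-for-5-minutes rule can be checked directly against Prometheus. A sketch using the metric name from this runbook; the Prometheus endpoint and the default capacity of 1024 are assumptions to adjust per install.

```bash
# Instant query: true when queue depth stayed above 75% of capacity for 5 minutes.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=min_over_time(advisory_ai_queue_depth[5m]) > (0.75 * 1024)' \
  | jq '.data.result'
```
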
## 4) Outage handling

- **SBOM service down**: switch to `NullSbomContextClient` by unsetting `ADVISORYAI__SBOM__BASEADDRESS`; Advisory AI returns deterministic responses with `sbomSummary` counts at 0.
- **Policy Engine unavailable**: pin the last-known `policyVersion`; set `AdvisoryAI:Guardrails:RequireCitations=true` to avoid drift; raise `advisory.remediation.policyHold` in responses.
- **Remote profile disabled**: keep `profile=cloud-openai` blocked; return `advisory.inference.remoteDisabled` with exit code 12 in the CLI (see `docs/modules/advisory-ai/guides/cli.md`).

## 5) Air-gap / offline posture

- All external calls are disabled by default. To re-enable remote inference, set `ADVISORYAI__INFERENCE__MODE=Remote` and provide an allowlisted `Remote.BaseAddress`; record the consent in Authority and in the ops log.
- Mirror the guardrail artefact folders and `hashes.sha256` into the Offline Kit; re-run the guardrail self-test after mirroring (a mirroring sketch follows).

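A minimal mirroring sketch, reusing the guardrail paths and self-test command from sections 1-2; the destination layout inside the Offline Kit is an assumption.

```bash
# Mirror guardrail artefacts into the Offline Kit (destination path is illustrative).
rsync -a out/console/guardrails/ offline-kit/advisory-ai/guardrails/
# Verify the mirrored artefacts against the recorded hashes, then re-run the self-test.
(cd offline-kit/advisory-ai/guardrails && sha256sum -c hashes.sha256)
dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail
```
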
## 6) Checklist before declaring healthy

- [ ] Guardrail self-test suite green.
- [ ] Cache-only CLI probes return 0 with the correct `context.planCacheKey`.
- [ ] DSSE verifications logged for prompts, profiles, and the policy bundle.
- [ ] Metrics scrape shows queue depth < 75% and latency within SLO.
- [ ] Ops log updated with any config overrides (queue size, clamps, remote inference toggles).

docs/operations/runbooks/concelier-airgap-bundle-deploy.md (Normal file, +44)
@@ -0,0 +1,44 @@
# Concelier Air-Gap Bundle Deploy Runbook (CONCELIER-AIRGAP-56-003)

Status: draft · 2025-11-24
Scope: deploy sealed-mode Concelier evidence bundles using deterministic NDJSON + manifest/entry-trace outputs.

## Inputs

- Bundle: `concelier-airgap.ndjson`
- Manifest: `bundle.manifest.json`
- Entry trace: `bundle.entry-trace.json`
- Hashes: SHA-256 digests recorded in the manifest and entry trace; verify before import.

## Preconditions

- Concelier WebService running with `concelier:features:airgap` enabled.
- No external egress; only the local file system is allowed for the bundle path.
- PostgreSQL indexes applied (`advisory_observations`, `advisory_linksets` tables).

## Steps

1) Transfer the bundle directory to the offline controller host.
2) Verify hashes:

```bash
# Compare the bundle digest against the manifest (cut keeps only the digest field,
# since sha256sum emits "<digest>  <filename>").
sha256sum concelier-airgap.ndjson | cut -d' ' -f1 \
  | diff - <(jq -r .bundleSha256 bundle.manifest.json)

# List per-entry digests with their index for spot checks against the entry trace.
jq -r '.[].sha256' bundle.entry-trace.json | nl -ba > entry.hashes
```

3) Import:

```bash
curl -sSf -X POST \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @concelier-airgap.ndjson \
  http://localhost:5000/internal/airgap/import
```

4) Validate the import:

```bash
curl -sSf http://localhost:5000/internal/airgap/status | jq
```

5) Record evidence:
- Store the manifest + entry trace alongside TRX/logs in `artifacts/airgap/<date>/`.

## Determinism notes

- NDJSON ordering is lexicographic; do not re-sort downstream.
- Entry-trace hashes must match post-transfer; any mismatch aborts the import.

## Rollback

- Delete the imported batch by `bundleId` from `advisory_observations` and `advisory_linksets` (requires DBA approval); rerun the import after fixing the hash. A deletion sketch follows.

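A hypothetical deletion sketch for the rollback above. The `bundle_id` column name and connection string are assumptions; adapt to the actual schema, and run only with DBA approval.

```bash
# Remove one imported batch transactionally; psql substitutes :'bundle_id' safely.
psql "$CONCELIER_DB" -v bundle_id="$BUNDLE_ID" <<'SQL'
BEGIN;
DELETE FROM advisory_observations WHERE bundle_id = :'bundle_id';
DELETE FROM advisory_linksets     WHERE bundle_id = :'bundle_id';
COMMIT;
SQL
```
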
docs/operations/runbooks/incidents.md (Normal file, +17)
@@ -0,0 +1,17 @@
# Incident Mode Runbook (outline)

- Activation, escalation, retention, and verification checklist TBD from the Ops Guild.

## Pending Inputs

- See the sprint SPRINT_0309_0001_0009_docs_tasks_md_ix action tracker; inputs due 2025-12-09 through 2025-12-12 from the owning guilds.

## Determinism Checklist

- [ ] Hash any inbound assets/payloads; place sums alongside artifacts (e.g., `SHA256SUMS` in this folder; see the sketch below).
- [ ] Keep examples offline-friendly and deterministic (fixed seeds, pinned versions, stable ordering).
- [ ] Note the source/approver for any provided captures or schemas.

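A minimal sketch for the hashing item above; the `artifacts/` path is illustrative, and `sort -z` keeps the ordering stable (deterministic).

```bash
# Record checksums for inbound assets next to the artifacts, then verify.
find artifacts/ -type f ! -name SHA256SUMS -print0 | sort -z \
  | xargs -0 sha256sum > artifacts/SHA256SUMS
sha256sum -c artifacts/SHA256SUMS
```
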
## Sections to fill (once inputs arrive)

- Activation criteria and toggle steps.
- Escalation paths and roles.
- Retention/cleanup impacts.
- Verification checklist and imposed-rule banner text.

docs/operations/runbooks/policy-incident.md (Normal file, +50)
@@ -0,0 +1,50 @@
# Policy Publish / Incident Runbook (draft)

Status: DRAFT — pending the policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.

## Scope

- Policy Registry publish/promote workflows (canary → full rollout).
- Emergency freeze for publish endpoints.
- Evidence capture for audits and postmortems.

## Pre-flight checks (dev vs. prod)

1) Validate manifests
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
   - Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
2) Render the deployment plan (no apply yet)
   - Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
   - Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
3) Backups
   - Run `deploy/compose/scripts/backup.sh` before a production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.

## Canary publish → promote

1) Prepare a temporary override
   - Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
   - Keep `mock.enabled=false`; only use real digests when available.
2) Dry-run render (then diff it against the prod-only render; see the sketch after this list)
   - `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
3) Apply the canary
   - `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
   - Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes; roll back on errors.
4) Promote
   - Remove the canary override from the release branch; rerender with `values-prod.yaml` only and redeploy.
   - Update the release manifest with the final policy digests and rerun `release-manifest-verify`.

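Before applying, it helps to confirm exactly what the canary override changes. A short sketch using only files already referenced in this runbook:

```bash
# Render the prod-only plan, then diff it against the canary render from step 2.
helm template stellaops ./deploy/helm/stellaops \
  -f deploy/helm/stellaops/values-prod.yaml > /tmp/policy-prod.yaml
diff -u /tmp/policy-prod.yaml /tmp/policy-canary.yaml | less
```
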
## Emergency freeze

- Hard stop for publishes
  - `kubectl scale deployment/policy-registry -n stellaops --replicas=0` (note: this also takes down any status/read paths served by the same deployment).
  - To keep read access, block the publish endpoint at the ingress/gateway layer instead; a plain NetworkPolicy filters at L3/L4 and cannot distinguish URL paths, so it only helps if publish traffic uses a dedicated Service or port.
- Manifest gate
  - Remove policy entries from the target `deploy/releases/*.yaml` and rerun `.gitea/workflows/release-manifest-verify.yml` so pipelines fail closed until the issue is cleared.

## Evidence capture

- Release artefacts: copy the exact release manifest, `/tmp/policy-canary.yaml`, and `/tmp/policy-compose.yaml` used for the rollout.
- Runtime state: `kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml`.
- Logs: `kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt`.
- Package as `tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt` and store in the audit bucket.

## Open items (blockers)

- Replace mock digests with production pins in `deploy/releases/*` once provided.
- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
- Add Grafana/Prometheus dashboard references once policy metrics are exposed.

docs/operations/runbooks/reachability-runtime.md (Normal file, +63)
@@ -0,0 +1,63 @@
# Reachability Runtime Ingestion Runbook

> **Imposed rule:** Runtime traces must never bypass CAS/DSSE verification; ingest only CAS-addressed NDJSON with hashes logged to Timeline and the Evidence Locker.

This runbook guides operators through ingesting runtime reachability evidence (EntryTrace, probes, Signals ingestion) and wiring it into the reachability evidence chain.

## 1. Prerequisites

- Services: `Signals` API, `Zastava Observer` (or other probes), `Evidence Locker`, optional `Attestor` for DSSE.
- Reachability schema: `docs/modules/reach-graph/guides/function-level-evidence.md`, `docs/modules/reach-graph/guides/evidence-schema.md`.
- CAS: configured bucket/path for `cas://reachability/runtime/*` and `.../graphs/*`.
- Time sync: AirGap Time anchor if sealed; otherwise NTP with drift < 200 ms.

## 2. Ingestion workflow (online)

1) **Capture traces** from Observer/probes → NDJSON (`runtime-trace.ndjson.gz`) with `symbol_id`, `purl`, `timestamp`, `pid`, `container`, `count`.
2) **Stage to CAS**: upload the file, record its `sha256`, and store it at `cas://reachability/runtime/<sha256>` (see the staging sketch after this list).
3) **Optionally sign**: wrap the CAS digest in DSSE (`stella attest runtime --bundle runtime.dsse.json`).
4) **Ingest** via the Signals API:

```sh
curl -H "X-Stella-Tenant: acme" \
     -H "Content-Type: application/x-ndjson" \
     --data-binary @runtime-trace.ndjson.gz \
     "https://signals.example/api/v1/runtime-facts?graph_hash=<graph>"
```

   Headers returned: `Content-SHA256`, `X-Graph-Hash`, `X-Ingest-Id`.
5) **Emit timeline**: ensure the Timeline event `reach.runtime.ingested` carries the CAS digest and ingest id.
6) **Verify**: run `stella graph verify --runtime runtime-trace.ndjson.gz --graph <graph_hash>` to confirm the edges mapped.

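A minimal staging sketch for step 2. The upload command depends on your CAS tooling and is intentionally not shown; only the digest computation and address derivation are sketched.

```sh
# Compute the digest and derive the CAS address used throughout this runbook.
sha=$(sha256sum runtime-trace.ndjson.gz | cut -d' ' -f1)
echo "store at: cas://reachability/runtime/${sha}"
```
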
## 3. Ingestion workflow (air-gap)

1) Receive the runtime bundle containing `runtime-trace.ndjson.gz`, `manifest.json` (hashes), and an optional DSSE envelope.
2) Validate hashes against the manifest; if present, verify the DSSE bundle.
3) Import into the CAS path `cas://reachability/runtime/<sha256>` using the offline loader.
4) Run the Signals offline ingest tool:

```sh
signals-offline ingest-runtime \
  --tenant acme \
  --graph-hash <graph_hash> \
  --runtime runtime-trace.ndjson.gz \
  --manifest manifest.json
```

5) Export the ingest receipt and add it to the Evidence Locker; update Timeline when reconnected.

## 4. Checks & alerts

- **Drift**: block ingest if the time-anchor age exceeds the configured budget; surface `staleness_seconds` (a check sketch follows this list).
- **Hash mismatch**: fail the ingest; write a `runtime.ingest.failed` event with the reason.
- **Orphan traces**: if there is no matching `graph_hash`, queue for retry and alert on the `reachability.orphan_traces` counter.

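A staleness-check sketch for the drift gate above. The anchor file location/format and the default budget are assumptions; sealed installs should read the AirGap Time anchor instead.

```sh
# Block ingest when the time anchor is older than the drift budget.
anchor_epoch=$(date -d "$(cat /etc/stella/time-anchor)" +%s)
staleness=$(( $(date +%s) - anchor_epoch ))
if [ "$staleness" -gt "${DRIFT_BUDGET_SECONDS:-900}" ]; then
  echo "staleness_seconds=$staleness exceeds budget; blocking ingest" >&2
  exit 1
fi
```
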
## 5. Troubleshooting

- **400 Bad Request**: validate the NDJSON schema; run `scripts/reachability/validate_runtime_trace.py`.
- **Hash mismatch**: recompute `sha256sum runtime-trace.ndjson.gz`; compare to the manifest.
- **Missing symbols**: ensure the symbol manifest is ingested (see `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`); rerun `stella graph verify`.
- **High drift**: refresh the time anchor (AirGap Time service) or resync NTP; retry the ingest.

## 6. Artefact checklist

- `runtime-trace.ndjson.gz` (or `.json`), `sha256` recorded.
- Optional `runtime.dsse.json` DSSE bundle.
- Ingest receipt (ingest id, graph hash, CAS digest, tenant).
- Timeline event `reach.runtime.ingested` and Evidence Locker record (bundle + receipt).

## 7. References

- `docs/modules/reach-graph/guides/DELIVERY_GUIDE.md`
- `docs/modules/reach-graph/guides/function-level-evidence.md`
- `docs/modules/reach-graph/guides/evidence-schema.md`
- `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`

docs/operations/runbooks/replay_ops.md (Normal file, +96)
@@ -0,0 +1,96 @@
# Runbook - Replay Operations

> **Audience:** Ops Guild / Evidence Locker Guild / Scanner Guild / Authority/Signer / Attestor
> **Prereqs:** `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`, `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`, `docs/modules/replay/guides/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md`

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`.

---

## 1 Terminology

- **Replay Manifest** - `manifest.json` describing scan inputs, outputs, and signatures.
- **Input Bundle** - `inputbundle.tar.zst` containing feeds, policies, tools, and env.
- **Output Bundle** - `outputbundle.tar.zst` with SBOM, findings, VEX, and logs.
- **DSSE Envelope** - Signed metadata produced by Authority/Signer.
- **RootPack** - Trusted key bundle used to validate DSSE signatures offline.

---

## 2 Normal operations

1. **Ingestion**
   - Scanner WebService writes manifest metadata to `replay_runs`.
   - Bundles are uploaded to CAS (`cas://replay/...`) and mirrored into the Evidence Locker (`evidence.replay_bundles`).
   - Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
2. **Verification**
   - A nightly job runs `stella verify` on the latest N replay manifests per tenant.
   - Metrics `replay_verify_total{result}` and `replay_bundle_size_bytes` are recorded in the Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
   - Failures alert `#ops-replay` via PagerDuty with a runbook link.
3. **Retention**
   - Hot CAS retention: 180 days (configurable per tenant). The `replay-retention` cron job prunes expired digests and writes audit entries.
   - Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds are recorded in `timeline.events` with type `replay.hold.created`.
   - Retention declaration: validate against `docs/schemas/replay-retention.schema.json` (frozen 2025-12-10). Include `retention_policy_id`, `tenant_id`, `bundle_type`, `retention_days`, `legal_hold`, `purge_after`, `checksum`, `created_at`. Audit the checksum via the DSSE envelope when persisting (a validation sketch follows this list).
4. **Access control**
   - Only service identities with the `replay:read` scope may fetch bundles. The CLI requires the device or client-credential flow with DPoP.

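A validation sketch for the retention declaration above. `check-jsonschema` is one option (any JSON Schema validator works); the declaration filename is illustrative.

```bash
# Validate a retention declaration against the frozen schema from this runbook.
check-jsonschema --schemafile docs/schemas/replay-retention.schema.json \
  retention-declaration.json
```
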
---

## 3 Incident response (Replay Integrity)

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via the `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference the incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach the diff JSON to the incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if the ledger mismatches or is stale |
| 5 | If tool hash drift -> coordinate with Signer for rotation | Authority/Signer | Rotate the DSSE profile, update RootPack |
| 6 | Update the incident timeline (`docs/operations/runbooks/replay_ops.md` -> Incident log) | Ops Guild | Record timestamps and decisions |
| 7 | Close the hold once resolved, publish a postmortem | Ops + Docs | Postmortem must reference the replay spec sections |

---

## 4 Air-gapped workflow

1. Receive the Offline Kit bundle containing:
   - `offline/replay/<scan-id>/manifest.json`
   - Bundles + DSSE signatures
   - RootPack snapshot
2. Run `stella replay manifest.json --strict --offline` using the local CLI.
3. Load feed/policy snapshots from the kit; never hit external networks.
4. Store verification logs under `ops/offline/replay/<scan-id>/`.
5. Sync results back to the Evidence Locker once connectivity is restored.

---

## 5 Maintenance checklist

- [ ] RootPack rotated quarterly; CLI/Evidence Locker updated with the new fingerprints.
- [ ] CAS retention job executed successfully in the past 24 hours.
- [ ] Replay verification metrics present in dashboards (x64 + arm64 lanes).
- [ ] Runbook incident log updated (see section 6) for the last drill.
- [ ] Offline Kit instructions verified against the current CLI version.

---

## 6 Incident log

| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| _TBD_ | | | | |

---

## 7 References

- `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`
- `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`
- `docs/modules/replay/guides/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` section 5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`

---

*Created: 2025-11-03 - Update alongside replay task status changes.*

docs/operations/runbooks/vex-ops.md (Normal file, +35)
@@ -0,0 +1,35 @@
# VEX Ops Runbook (dev-mock ready)

Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production rollouts wait on final policy/VEX digests.

## Pre-flight (dev vs. prod)

1) Release manifest guard
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once the VEX digests land.
2) Render plan
   - Helm (mock overlay): `helm template vex-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
   - Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
3) Backups (when touching prod data) — not required for mock, but in prod take PostgreSQL snapshots for issuer-directory and VEX state before rollout.

## Deploy (mock path)

- The Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
- Observe VEX Lens pod logs: `kubectl logs deploy/vex-lens -n stellaops --tail=200 -f`.
- Issuer Directory seed: ensure the `issuer-directory-config` ConfigMap includes `csaf-publishers.json`; the mock overlay already mounts the default seed.

## Rollback

- Helm: `helm rollback stellaops <revision>` (pick the previous revision, e.g. `1`). The mock overlay uses `stellaops.dev/mock: "true"` annotations; it is safe to tear down after tests.
- Compose: `docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml down`.

## Troubleshooting

- Recompute storms: throttle via the `VEX_LENS__MAX_PARALLELISM` env (set it in values once the schema lands); for now, scale the deployment down to 1 replica to reduce concurrency (see the sketch after this list).
- Mapping failures: capture the request/response with `kubectl logs ... --since=10m`; rerun after clearing the queue.
- Signature errors: confirm the Authority token audience/issuer; the mock overlay uses the same auth settings as dev compose.

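A throttle sketch for the recompute-storm item above, using only names already in this runbook:

```bash
# Reduce VEX Lens concurrency until the parallelism knob ships, then wait for it to settle.
kubectl scale deploy/vex-lens -n stellaops --replicas=1
kubectl rollout status deploy/vex-lens -n stellaops --timeout=120s
```
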
## Evidence capture

- Save `/tmp/vex-mock.yaml` and `/tmp/vex-compose.yaml` together with the manifest used.
- `kubectl get deploy,po,svc -n stellaops -l app=vex-lens -o yaml > /tmp/vex-live.yaml`.
- Tarball: `tar -czf vex-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/vex-*`.

## Open TODOs

- Replace mock digests with production pins and add env/schema knobs for VEX Lens once published.
- Add Grafana panels for recompute throughput and mapping failure rate after the metrics are exposed.

docs/operations/runbooks/vuln-ops.md (Normal file, +40)
@@ -0,0 +1,40 @@
# Vuln / Findings Ops Runbook (dev-mock ready)

Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production steps need the final digests and schema from DEPLOY-VULN-29-001.

## Scope

- Findings Ledger + projector + Vuln Explorer API deployment/rollback, plus common incident drills (lag, storms, export failures).

## Pre-flight (dev vs. prod)

1) Release manifest guard
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once the ledger/API digests land.
2) Render plan
   - Helm (mock overlay): `helm template vuln-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
   - Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
3) Backups (prod only)
   - PostgreSQL dump for the Findings Ledger DB; copy the object-store buckets tied to projector anchors.

## Deploy (mock path)

- Helm apply (dev): `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
- Compose: quickstart already starts the ledger + vuln API with mock pins; validate health at `https://localhost:8443/swagger` (dev certs; a probe sketch follows).

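A one-line health probe for the dev endpoint above; `-k` tolerates the self-signed dev certificate.

```bash
# Expect HTTP 200 from the swagger endpoint once the vuln API is up.
curl -sk -o /dev/null -w 'swagger HTTP %{http_code}\n' https://localhost:8443/swagger
```
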
## Incident drills

- Projector lag: scale the projector worker up (`kubectl scale deploy/findings-ledger -n stellaops --replicas=2`) then back down; monitor the queue length (metric hook pending).
- Resolver storms: temporarily raise `ASPNETCORE_THREADPOOL_MINTHREADS` or scale the API horizontally; in compose, run `docker compose restart vuln-explorer-api` after bumping the `VULNEXPLORER__MAX_CONCURRENCY` env once the schema lands.
- Export failures: re-run the export job after verifying hashes in `deploy/releases/*`; the mock path skips signing but still exercises checksum validation via `ops/devops/release/check_release_manifest.py`.

## Rollback

- Helm: `helm rollback stellaops <revision>` (pick the previous revision, e.g. `1`).
- Compose: `docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml down`.

## Evidence capture

- Keep `/tmp/vuln-mock.yaml`, `/tmp/vuln-compose.yaml`, and the release manifest used.
- `kubectl logs deployment/findings-ledger -n stellaops --since=30m > /tmp/ledger-logs.txt`
- DB snapshot checksums if taken; bundle everything into `vuln-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz`.

## Open TODOs

- Replace mock digests with production pins; add concrete env knobs for the projector and API when the schemas publish.
- Hook up Prometheus counters for projector lag and resolver-storm dashboards once the metrics are exported.

_Last updated: 2025-12-06 (UTC)_