docs consolidation and others
docs/operations/runbooks/assistant-ops.md (Normal file, +42)
@@ -0,0 +1,42 @@
# Assistant Ops Runbook (DOCS-AIAI-31-009)

_Updated: 2025-11-24 · Owners: DevOps Guild · Advisory AI Guild · Sprint 0111_

This runbook covers day-2 operations for Advisory AI (web + worker), with emphasis on cache priming, guardrail verification, and outage handling in offline/air-gapped installs.

## 1) Warmup & cache priming

- Ensure Offline Kit fixtures are staged:
  - CLI guardrail bundles: `out/console/guardrails/cli-vuln-29-001/`, `out/console/guardrails/cli-vex-30-001/`.
  - SBOM context fixtures: copy into `data/advisory-ai/fixtures/sbom/` and record hashes in `SHA256SUMS`.
  - Profiles/prompts manifests: ensure `profiles.catalog.json` and `prompts.manifest` hashes match the `AdvisoryAI:Provenance` settings.
- Start services and prime caches using cache-only calls (see the probe sketch after this list):
  - `stella advise run summary --advisory-key <id> --timeout 0 --json` (should return cached/empty context and exit 0).
  - `stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json` (verifies SBOM clamps without executing inference).

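A small wrapper can make the priming probes repeatable. This is a minimal sketch assuming the `stella` CLI from this section; the sample advisory keys and the `jq` check on `context.planCacheKey` (the field named in the section 6 checklist) are illustrative.

```bash
#!/usr/bin/env bash
# Cache-priming probe sketch; aborts on the first non-zero CLI exit.
set -euo pipefail

ADVISORY_KEYS=("CVE-2025-0001" "CVE-2025-0002")   # hypothetical sample keys

for key in "${ADVISORY_KEYS[@]}"; do
  out=$(stella advise run summary --advisory-key "$key" --timeout 0 --json)
  # A populated planCacheKey indicates the plan cache is primed for this key.
  echo "$key -> planCacheKey=$(jq -r '.context.planCacheKey // "MISSING"' <<<"$out")"
done
```
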
## 2) Guardrail & provenance verification

- Run the guardrail self-test: `dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail` (offline-safe).
- Validate DSSE bundles:
  - `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/prompts.manifest.dsse --source prompts.manifest`
  - `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/policy-bundle.intoto.jsonl --digest <policy-digest>`
- Confirm the `AdvisoryAI:Guardrails:BlockedPhrases` file matches the hash captured during pack build; diff against `prompts.manifest`.

## 3) Scaling & queue health

- Defaults: queue capacity 1024, dequeue wait 1 s (see `docs/modules/policy/guides/assistant-parameters.md`). For bursty tenants, scale workers horizontally before increasing queue size to preserve determinism.
- Metrics to watch: `advisory_ai_queue_depth`, `advisory_ai_latency_seconds`, `advisory_ai_guardrail_blocks_total`. A saturation query sketch follows this list.
- If queue depth stays above 75% of capacity for 5 minutes, add one worker pod or increase `Queue:Capacity` by 25% (record the change in the ops log).

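The 75%-for-5-minutes rule can be checked directly against Prometheus. A sketch using the metric name from this runbook; the Prometheus endpoint and the default capacity of 1024 are assumptions to adjust per install.

```bash
# Instant query: true when queue depth stayed above 75% of capacity for 5 minutes.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=min_over_time(advisory_ai_queue_depth[5m]) > (0.75 * 1024)' \
  | jq '.data.result'
```
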
## 4) Outage handling

- **SBOM service down**: switch to `NullSbomContextClient` by unsetting `ADVISORYAI__SBOM__BASEADDRESS`; Advisory AI returns deterministic responses with `sbomSummary` counts at 0.
- **Policy Engine unavailable**: pin the last-known `policyVersion`; set `AdvisoryAI:Guardrails:RequireCitations=true` to avoid drift; raise `advisory.remediation.policyHold` in responses.
- **Remote profile disabled**: keep `profile=cloud-openai` blocked; return `advisory.inference.remoteDisabled` with exit code 12 in the CLI (see `docs/modules/advisory-ai/guides/cli.md`).

## 5) Air-gap / offline posture

- All external calls are disabled by default. To re-enable remote inference, set `ADVISORYAI__INFERENCE__MODE=Remote` and provide an allowlisted `Remote.BaseAddress`; record the consent in Authority and in the ops log.
- Mirror the guardrail artefact folders and `hashes.sha256` into the Offline Kit; re-run the guardrail self-test after mirroring (a mirroring sketch follows).

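A minimal mirroring sketch, reusing the guardrail paths and self-test command from sections 1-2; the destination layout inside the Offline Kit is an assumption.

```bash
# Mirror guardrail artefacts into the Offline Kit (destination path is illustrative).
rsync -a out/console/guardrails/ offline-kit/advisory-ai/guardrails/
# Verify the mirrored artefacts against the recorded hashes, then re-run the self-test.
(cd offline-kit/advisory-ai/guardrails && sha256sum -c hashes.sha256)
dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail
```
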
## 6) Checklist before declaring healthy

- [ ] Guardrail self-test suite green.
- [ ] Cache-only CLI probes return 0 with the correct `context.planCacheKey`.
- [ ] DSSE verifications logged for prompts, profiles, and the policy bundle.
- [ ] Metrics scrape shows queue depth < 75% and latency within SLO.
- [ ] Ops log updated with any config overrides (queue size, clamps, remote inference toggles).

docs/operations/runbooks/concelier-airgap-bundle-deploy.md (Normal file, +44)
@@ -0,0 +1,44 @@
# Concelier Air-Gap Bundle Deploy Runbook (CONCELIER-AIRGAP-56-003)

Status: draft · 2025-11-24
Scope: deploy sealed-mode Concelier evidence bundles using deterministic NDJSON + manifest/entry-trace outputs.

## Inputs

- Bundle: `concelier-airgap.ndjson`
- Manifest: `bundle.manifest.json`
- Entry trace: `bundle.entry-trace.json`
- Hashes: SHA-256 digests recorded in the manifest and entry trace; verify before import.

## Preconditions

- Concelier WebService running with `concelier:features:airgap` enabled.
- No external egress; only the local file system is allowed for the bundle path.
- PostgreSQL indexes applied (`advisory_observations`, `advisory_linksets` tables).

## Steps

1) Transfer the bundle directory to the offline controller host.
2) Verify hashes:

```bash
# Compare the bundle digest against the manifest (cut keeps only the digest field,
# since sha256sum emits "<digest>  <filename>").
sha256sum concelier-airgap.ndjson | cut -d' ' -f1 \
  | diff - <(jq -r .bundleSha256 bundle.manifest.json)

# List per-entry digests with their index for spot checks against the entry trace.
jq -r '.[].sha256' bundle.entry-trace.json | nl -ba > entry.hashes
```

3) Import:

```bash
curl -sSf -X POST \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @concelier-airgap.ndjson \
  http://localhost:5000/internal/airgap/import
```

4) Validate the import:

```bash
curl -sSf http://localhost:5000/internal/airgap/status | jq
```

5) Record evidence:
- Store the manifest + entry trace alongside TRX/logs in `artifacts/airgap/<date>/`.

## Determinism notes

- NDJSON ordering is lexicographic; do not re-sort downstream.
- Entry-trace hashes must match post-transfer; any mismatch aborts the import.

## Rollback

- Delete the imported batch by `bundleId` from `advisory_observations` and `advisory_linksets` (requires DBA approval); rerun the import after fixing the hash. A deletion sketch follows.

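A hypothetical deletion sketch for the rollback above. The `bundle_id` column name and connection string are assumptions; adapt to the actual schema, and run only with DBA approval.

```bash
# Remove one imported batch transactionally; psql substitutes :'bundle_id' safely.
psql "$CONCELIER_DB" -v bundle_id="$BUNDLE_ID" <<'SQL'
BEGIN;
DELETE FROM advisory_observations WHERE bundle_id = :'bundle_id';
DELETE FROM advisory_linksets     WHERE bundle_id = :'bundle_id';
COMMIT;
SQL
```
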
docs/operations/runbooks/incidents.md (Normal file, +17)
@@ -0,0 +1,17 @@
# Incident Mode Runbook (outline)

- Activation, escalation, retention, and verification checklist TBD from the Ops Guild.

## Pending Inputs

- See the sprint SPRINT_0309_0001_0009_docs_tasks_md_ix action tracker; inputs due 2025-12-09 through 2025-12-12 from the owning guilds.

## Determinism Checklist

- [ ] Hash any inbound assets/payloads; place sums alongside artifacts (e.g., `SHA256SUMS` in this folder; see the sketch below).
- [ ] Keep examples offline-friendly and deterministic (fixed seeds, pinned versions, stable ordering).
- [ ] Note the source/approver for any provided captures or schemas.

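A minimal sketch for the hashing item above; the `artifacts/` path is illustrative, and `sort -z` keeps the ordering stable (deterministic).

```bash
# Record checksums for inbound assets next to the artifacts, then verify.
find artifacts/ -type f ! -name SHA256SUMS -print0 | sort -z \
  | xargs -0 sha256sum > artifacts/SHA256SUMS
sha256sum -c artifacts/SHA256SUMS
```
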
## Sections to fill (once inputs arrive)

- Activation criteria and toggle steps.
- Escalation paths and roles.
- Retention/cleanup impacts.
- Verification checklist and imposed-rule banner text.

docs/operations/runbooks/policy-incident.md (Normal file, +50)
@@ -0,0 +1,50 @@
# Policy Publish / Incident Runbook (draft)

Status: DRAFT — pending the policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.

## Scope

- Policy Registry publish/promote workflows (canary → full rollout).
- Emergency freeze for publish endpoints.
- Evidence capture for audits and postmortems.

## Pre-flight checks (dev vs. prod)

1) Validate manifests
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
   - Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
2) Render the deployment plan (no apply yet)
   - Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
   - Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
3) Backups
   - Run `deploy/compose/scripts/backup.sh` before a production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.

## Canary publish → promote

1) Prepare a temporary override
   - Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
   - Keep `mock.enabled=false`; only use real digests when available.
2) Dry-run render (then diff it against the prod-only render; see the sketch after this list)
   - `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
3) Apply the canary
   - `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
   - Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes; roll back on errors.
4) Promote
   - Remove the canary override from the release branch; rerender with `values-prod.yaml` only and redeploy.
   - Update the release manifest with the final policy digests and rerun `release-manifest-verify`.

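Before applying, it helps to confirm exactly what the canary override changes. A short sketch using only files already referenced in this runbook:

```bash
# Render the prod-only plan, then diff it against the canary render from step 2.
helm template stellaops ./deploy/helm/stellaops \
  -f deploy/helm/stellaops/values-prod.yaml > /tmp/policy-prod.yaml
diff -u /tmp/policy-prod.yaml /tmp/policy-canary.yaml | less
```
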
## Emergency freeze

- Hard stop for publishes
  - `kubectl scale deployment/policy-registry -n stellaops --replicas=0` (note: this also takes down any status/read paths served by the same deployment).
  - To keep read access, block the publish endpoint at the ingress/gateway layer instead; a plain NetworkPolicy filters at L3/L4 and cannot distinguish URL paths, so it only helps if publish traffic uses a dedicated Service or port.
- Manifest gate
  - Remove policy entries from the target `deploy/releases/*.yaml` and rerun `.gitea/workflows/release-manifest-verify.yml` so pipelines fail closed until the issue is cleared.

## Evidence capture

- Release artefacts: copy the exact release manifest, `/tmp/policy-canary.yaml`, and `/tmp/policy-compose.yaml` used for the rollout.
- Runtime state: `kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml`.
- Logs: `kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt`.
- Package as `tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt` and store in the audit bucket.

## Open items (blockers)

- Replace mock digests with production pins in `deploy/releases/*` once provided.
- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
- Add Grafana/Prometheus dashboard references once policy metrics are exposed.

docs/operations/runbooks/reachability-runtime.md (Normal file, +63)
@@ -0,0 +1,63 @@
# Reachability Runtime Ingestion Runbook

> **Imposed rule:** Runtime traces must never bypass CAS/DSSE verification; ingest only CAS-addressed NDJSON with hashes logged to Timeline and the Evidence Locker.

This runbook guides operators through ingesting runtime reachability evidence (EntryTrace, probes, Signals ingestion) and wiring it into the reachability evidence chain.

## 1. Prerequisites

- Services: `Signals` API, `Zastava Observer` (or other probes), `Evidence Locker`, optional `Attestor` for DSSE.
- Reachability schema: `docs/modules/reach-graph/guides/function-level-evidence.md`, `docs/modules/reach-graph/guides/evidence-schema.md`.
- CAS: configured bucket/path for `cas://reachability/runtime/*` and `.../graphs/*`.
- Time sync: AirGap Time anchor if sealed; otherwise NTP with drift < 200 ms.

## 2. Ingestion workflow (online)

1) **Capture traces** from Observer/probes → NDJSON (`runtime-trace.ndjson.gz`) with `symbol_id`, `purl`, `timestamp`, `pid`, `container`, `count`.
2) **Stage to CAS**: upload the file, record its `sha256`, and store it at `cas://reachability/runtime/<sha256>` (see the staging sketch after this list).
3) **Optionally sign**: wrap the CAS digest in DSSE (`stella attest runtime --bundle runtime.dsse.json`).
4) **Ingest** via the Signals API:

```sh
curl -H "X-Stella-Tenant: acme" \
     -H "Content-Type: application/x-ndjson" \
     --data-binary @runtime-trace.ndjson.gz \
     "https://signals.example/api/v1/runtime-facts?graph_hash=<graph>"
```

   Headers returned: `Content-SHA256`, `X-Graph-Hash`, `X-Ingest-Id`.
5) **Emit timeline**: ensure the Timeline event `reach.runtime.ingested` carries the CAS digest and ingest id.
6) **Verify**: run `stella graph verify --runtime runtime-trace.ndjson.gz --graph <graph_hash>` to confirm the edges mapped.

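A minimal staging sketch for step 2. The upload command depends on your CAS tooling and is intentionally not shown; only the digest computation and address derivation are sketched.

```sh
# Compute the digest and derive the CAS address used throughout this runbook.
sha=$(sha256sum runtime-trace.ndjson.gz | cut -d' ' -f1)
echo "store at: cas://reachability/runtime/${sha}"
```
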
## 3. Ingestion workflow (air-gap)

1) Receive the runtime bundle containing `runtime-trace.ndjson.gz`, `manifest.json` (hashes), and an optional DSSE envelope.
2) Validate hashes against the manifest; if present, verify the DSSE bundle.
3) Import into the CAS path `cas://reachability/runtime/<sha256>` using the offline loader.
4) Run the Signals offline ingest tool:

```sh
signals-offline ingest-runtime \
  --tenant acme \
  --graph-hash <graph_hash> \
  --runtime runtime-trace.ndjson.gz \
  --manifest manifest.json
```

5) Export the ingest receipt and add it to the Evidence Locker; update Timeline when reconnected.

## 4. Checks & alerts

- **Drift**: block ingest if the time-anchor age exceeds the configured budget; surface `staleness_seconds` (a check sketch follows this list).
- **Hash mismatch**: fail the ingest; write a `runtime.ingest.failed` event with the reason.
- **Orphan traces**: if there is no matching `graph_hash`, queue for retry and alert on the `reachability.orphan_traces` counter.

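A staleness-check sketch for the drift gate above. The anchor file location/format and the default budget are assumptions; sealed installs should read the AirGap Time anchor instead.

```sh
# Block ingest when the time anchor is older than the drift budget.
anchor_epoch=$(date -d "$(cat /etc/stella/time-anchor)" +%s)
staleness=$(( $(date +%s) - anchor_epoch ))
if [ "$staleness" -gt "${DRIFT_BUDGET_SECONDS:-900}" ]; then
  echo "staleness_seconds=$staleness exceeds budget; blocking ingest" >&2
  exit 1
fi
```
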
## 5. Troubleshooting

- **400 Bad Request**: validate the NDJSON schema; run `scripts/reachability/validate_runtime_trace.py`.
- **Hash mismatch**: recompute `sha256sum runtime-trace.ndjson.gz`; compare to the manifest.
- **Missing symbols**: ensure the symbol manifest is ingested (see `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`); rerun `stella graph verify`.
- **High drift**: refresh the time anchor (AirGap Time service) or resync NTP; retry the ingest.

## 6. Artefact checklist

- `runtime-trace.ndjson.gz` (or `.json`), `sha256` recorded.
- Optional `runtime.dsse.json` DSSE bundle.
- Ingest receipt (ingest id, graph hash, CAS digest, tenant).
- Timeline event `reach.runtime.ingested` and Evidence Locker record (bundle + receipt).

## 7. References

- `docs/modules/reach-graph/guides/DELIVERY_GUIDE.md`
- `docs/modules/reach-graph/guides/function-level-evidence.md`
- `docs/modules/reach-graph/guides/evidence-schema.md`
- `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`

docs/operations/runbooks/replay_ops.md (Normal file, +96)
@@ -0,0 +1,96 @@
# Runbook - Replay Operations

> **Audience:** Ops Guild / Evidence Locker Guild / Scanner Guild / Authority/Signer / Attestor
> **Prereqs:** `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`, `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`, `docs/modules/replay/guides/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md`

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`.

---

## 1 Terminology

- **Replay Manifest** - `manifest.json` describing scan inputs, outputs, and signatures.
- **Input Bundle** - `inputbundle.tar.zst` containing feeds, policies, tools, and env.
- **Output Bundle** - `outputbundle.tar.zst` with SBOM, findings, VEX, and logs.
- **DSSE Envelope** - Signed metadata produced by Authority/Signer.
- **RootPack** - Trusted key bundle used to validate DSSE signatures offline.

---

## 2 Normal operations

1. **Ingestion**
   - Scanner WebService writes manifest metadata to `replay_runs`.
   - Bundles are uploaded to CAS (`cas://replay/...`) and mirrored into the Evidence Locker (`evidence.replay_bundles`).
   - Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
2. **Verification**
   - A nightly job runs `stella verify` on the latest N replay manifests per tenant.
   - Metrics `replay_verify_total{result}` and `replay_bundle_size_bytes` are recorded in the Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
   - Failures alert `#ops-replay` via PagerDuty with a runbook link.
3. **Retention**
   - Hot CAS retention: 180 days (configurable per tenant). The `replay-retention` cron job prunes expired digests and writes audit entries.
   - Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds are recorded in `timeline.events` with type `replay.hold.created`.
   - Retention declaration: validate against `docs/schemas/replay-retention.schema.json` (frozen 2025-12-10). Include `retention_policy_id`, `tenant_id`, `bundle_type`, `retention_days`, `legal_hold`, `purge_after`, `checksum`, `created_at`. Audit the checksum via the DSSE envelope when persisting (a validation sketch follows this list).
4. **Access control**
   - Only service identities with the `replay:read` scope may fetch bundles. The CLI requires the device or client-credential flow with DPoP.

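A validation sketch for the retention declaration above. `check-jsonschema` is one option (any JSON Schema validator works); the declaration filename is illustrative.

```bash
# Validate a retention declaration against the frozen schema from this runbook.
check-jsonschema --schemafile docs/schemas/replay-retention.schema.json \
  retention-declaration.json
```
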
---

## 3 Incident response (Replay Integrity)

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via the `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference the incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach the diff JSON to the incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if the ledger mismatches or is stale |
| 5 | If tool hash drift -> coordinate with Signer for rotation | Authority/Signer | Rotate the DSSE profile, update RootPack |
| 6 | Update the incident timeline (`docs/operations/runbooks/replay_ops.md` -> Incident log) | Ops Guild | Record timestamps and decisions |
| 7 | Close the hold once resolved, publish a postmortem | Ops + Docs | Postmortem must reference the replay spec sections |

---

## 4 Air-gapped workflow

1. Receive the Offline Kit bundle containing:
   - `offline/replay/<scan-id>/manifest.json`
   - Bundles + DSSE signatures
   - RootPack snapshot
2. Run `stella replay manifest.json --strict --offline` using the local CLI.
3. Load feed/policy snapshots from the kit; never hit external networks.
4. Store verification logs under `ops/offline/replay/<scan-id>/`.
5. Sync results back to the Evidence Locker once connectivity is restored.

---

## 5 Maintenance checklist

- [ ] RootPack rotated quarterly; CLI/Evidence Locker updated with the new fingerprints.
- [ ] CAS retention job executed successfully in the past 24 hours.
- [ ] Replay verification metrics present in dashboards (x64 + arm64 lanes).
- [ ] Runbook incident log updated (see section 6) for the last drill.
- [ ] Offline Kit instructions verified against the current CLI version.

---

## 6 Incident log

| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| _TBD_ | | | | |

---

## 7 References

- `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`
- `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`
- `docs/modules/replay/guides/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` section 5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`

---

*Created: 2025-11-03 - Update alongside replay task status changes.*

docs/operations/runbooks/vex-ops.md (Normal file, +35)
@@ -0,0 +1,35 @@
# VEX Ops Runbook (dev-mock ready)

Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production rollouts wait on final policy/VEX digests.

## Pre-flight (dev vs. prod)

1) Release manifest guard
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once the VEX digests land.
2) Render plan
   - Helm (mock overlay): `helm template vex-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
   - Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
3) Backups (when touching prod data) — not required for mock, but in prod take PostgreSQL snapshots for issuer-directory and VEX state before rollout.

## Deploy (mock path)

- The Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
- Observe VEX Lens pod logs: `kubectl logs deploy/vex-lens -n stellaops --tail=200 -f`.
- Issuer Directory seed: ensure the `issuer-directory-config` ConfigMap includes `csaf-publishers.json`; the mock overlay already mounts the default seed.

## Rollback

- Helm: `helm rollback stellaops <revision>` (pick the previous revision, e.g. `1`). The mock overlay uses `stellaops.dev/mock: "true"` annotations; it is safe to tear down after tests.
- Compose: `docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml down`.

## Troubleshooting

- Recompute storms: throttle via the `VEX_LENS__MAX_PARALLELISM` env (set it in values once the schema lands); for now, scale the deployment down to 1 replica to reduce concurrency (see the sketch after this list).
- Mapping failures: capture the request/response with `kubectl logs ... --since=10m`; rerun after clearing the queue.
- Signature errors: confirm the Authority token audience/issuer; the mock overlay uses the same auth settings as dev compose.

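A throttle sketch for the recompute-storm item above, using only names already in this runbook:

```bash
# Reduce VEX Lens concurrency until the parallelism knob ships, then wait for it to settle.
kubectl scale deploy/vex-lens -n stellaops --replicas=1
kubectl rollout status deploy/vex-lens -n stellaops --timeout=120s
```
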
## Evidence capture

- Save `/tmp/vex-mock.yaml` and `/tmp/vex-compose.yaml` together with the manifest used.
- `kubectl get deploy,po,svc -n stellaops -l app=vex-lens -o yaml > /tmp/vex-live.yaml`.
- Tarball: `tar -czf vex-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/vex-*`.

## Open TODOs

- Replace mock digests with production pins and add env/schema knobs for VEX Lens once published.
- Add Grafana panels for recompute throughput and mapping failure rate after the metrics are exposed.

docs/operations/runbooks/vuln-ops.md (Normal file, +40)
@@ -0,0 +1,40 @@
# Vuln / Findings Ops Runbook (dev-mock ready)

Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production steps need the final digests and schema from DEPLOY-VULN-29-001.

## Scope

- Findings Ledger + projector + Vuln Explorer API deployment/rollback, plus common incident drills (lag, storms, export failures).

## Pre-flight (dev vs. prod)

1) Release manifest guard
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once the ledger/API digests land.
2) Render plan
   - Helm (mock overlay): `helm template vuln-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
   - Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
3) Backups (prod only)
   - PostgreSQL dump for the Findings Ledger DB; copy the object-store buckets tied to projector anchors.

## Deploy (mock path)

- Helm apply (dev): `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
- Compose: quickstart already starts the ledger + vuln API with mock pins; validate health at `https://localhost:8443/swagger` (dev certs; a probe sketch follows).

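A one-line health probe for the dev endpoint above; `-k` tolerates the self-signed dev certificate.

```bash
# Expect HTTP 200 from the swagger endpoint once the vuln API is up.
curl -sk -o /dev/null -w 'swagger HTTP %{http_code}\n' https://localhost:8443/swagger
```
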
## Incident drills

- Projector lag: scale the projector worker up (`kubectl scale deploy/findings-ledger -n stellaops --replicas=2`) then back down; monitor the queue length (metric hook pending).
- Resolver storms: temporarily raise `ASPNETCORE_THREADPOOL_MINTHREADS` or scale the API horizontally; in compose, run `docker compose restart vuln-explorer-api` after bumping the `VULNEXPLORER__MAX_CONCURRENCY` env once the schema lands.
- Export failures: re-run the export job after verifying hashes in `deploy/releases/*`; the mock path skips signing but still exercises checksum validation via `ops/devops/release/check_release_manifest.py`.

## Rollback

- Helm: `helm rollback stellaops <revision>` (pick the previous revision, e.g. `1`).
- Compose: `docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml down`.

## Evidence capture

- Keep `/tmp/vuln-mock.yaml`, `/tmp/vuln-compose.yaml`, and the release manifest used.
- `kubectl logs deployment/findings-ledger -n stellaops --since=30m > /tmp/ledger-logs.txt`
- DB snapshot checksums if taken; bundle everything into `vuln-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz`.

## Open TODOs

- Replace mock digests with production pins; add concrete env knobs for the projector and API when the schemas publish.
- Hook up Prometheus counters for projector lag and resolver-storm dashboards once the metrics are exported.

_Last updated: 2025-12-06 (UTC)_