Files
git.stella-ops.org/docs/migration/no-merge.md
master b1e78fe412
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
feat: Implement vulnerability token signing and verification utilities
- Added VulnTokenSigner for signing JWT tokens with specified algorithms and keys.
- Introduced VulnTokenUtilities for resolving tenant and subject claims, and sanitizing context dictionaries.
- Created VulnTokenVerificationUtilities for parsing tokens, verifying signatures, and deserializing payloads.
- Developed VulnWorkflowAntiForgeryTokenIssuer for issuing anti-forgery tokens with configurable options.
- Implemented VulnWorkflowAntiForgeryTokenVerifier for verifying anti-forgery tokens and validating payloads.
- Added AuthorityVulnerabilityExplorerOptions to manage configuration for vulnerability explorer features.
- Included tests for FilesystemPackRunDispatcher to ensure proper job handling under egress policy restrictions.
2025-11-03 10:04:10 +02:00

142 lines
10 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# No-Merge Migration Playbook
_Last updated: 2025-11-03_
This playbook guides the full retirement of the legacy Merge service (`AdvisoryMergeService`) in favour of Link-Not-Merge (LNM) observations plus linksets. It is written for the BE-Merge, Architecture, DevOps, and Docs guilds coordinating Sprint110 (Ingestion & Evidence) deliverables, and it feeds CONCELIER-LNM-21-101 / MERGE-LNM-21-001 and downstream DOCS-LNM-22-008.
## 0. Scope & objectives
- **Primary goal:** cut over all advisory pipelines to Link-Not-Merge with no residual dependencies on `AdvisoryMergeService`.
- **Secondary goals:** maintain deterministic evidence, zero data loss, and reversible deployment across online and offline tenants.
- **Success criteria:**
- All connectors emit observation `affected.versions[]` with provenance and pass LNM guardrails.
- Linkset dashboards show zero `missing_version_entries_total` and no `Normalized version rules missing…` warnings.
- Policy, Export Center, and CLI consumers operate solely on observations/linksets.
- Rollback playbook validated and rehearsed in staging.
## 1. Prerequisites checklist
| Item | Owner | Notes |
| --- | --- | --- |
| Normalized version ranges emitted for all Sprint110 connectors (`Acsc`, `Cccs`, `CertBund`, `CertCc`, `Cve`, `Ghsa`, `Ics.Cisa`, `Kisa`, `Ru.Bdu`, `Ru.Nkcki`, `Vndr.Apple`, `Vndr.Cisco`, `Vndr.Msrc`). | Connector guilds | Follow `docs/dev/normalized-rule-recipes.md`; update fixtures with `UPDATE_*_FIXTURES=1`. |
| Metrics dashboards (`LinksetVersionCoverage`, `Normalized version rules missing`) available in Grafana/CI snapshots. | Observability guild | Publish baseline before shadow rollout. |
| Concelier WebService exposes `linkset` and `observation` read APIs for policy/CLI consumers. | BE-Merge / Platform | Confirm contract parity with Merge outputs. |
| Export Center / Offline Kit aware of new manifests. | Export Center guild | Provide beta bundle for QA verification. |
| Docs guild aligned on public migration messaging. | Docs guild | Update `docs/dev`, `docs/modules/concelier`, and release notes once cutover date is locked. |
Do not proceed to Phase1 until all prerequisites are checked or explicitly waived by Architecture guild.
## 2. Feature flag & configuration plan
| Toggle | Default | Purpose | Notes |
| --- | --- | --- | --- |
| `concelier:features:noMergeEnabled` | `false` | Master switch to disable legacy Merge job scheduling/execution. | Applies to WebService + Worker; gate `AdvisoryMergeService` DI registration. |
| `concelier:features:lnmShadowWrites` | `true` | Enables dual-write of linksets while Merge remains active. | Keep enabled throughout Phase01 to validate parity. |
| `concelier:jobs:merge:allowlist` | `[]` | Explicit allowlist for Merge jobs when noMergeEnabled is `false`. | Set to empty during Phase2+ to prevent accidental restarts. |
| `policy:overlays:requireLinksetEvidence` | `false` | Policy engine safety net to require linkset-backed findings. | Flip to `true` only after cutover (Phase2). |
> **Configuration hygiene:** Document the toggle values per environment in `ops/devops/configuration/staging.md` and `ops/devops/configuration/production.md`. Air-gapped customers receive defaults through the Offline Kit release notes.
## 3. Rollout phases
| Phase | Goal | Duration | Key actions |
| --- | --- | --- | --- |
| **0 Preparation** | Ensure readiness | 23 days | Finalise prerequisites, snapshot Merge metrics, dry-run backfill scripts in dev. |
| **1 Shadow / Dual Write** | Validate parity | 57 days | Enable `lnmShadowWrites`, keep Merge primary. Compare linkset vs merged outputs using `stella concelier diff-merge --snapshot <date>`; fix discrepancies. |
| **2 Cutover** | Switch to LNM | 1 day (per env) | Enable `noMergeEnabled`, disable Merge job schedules, update Policy/Export configs, run post-cutover smoke tests. |
| **3 Harden** | Decommission Merge | 23 days | Remove Merge background services, delete `merge_event` retention jobs, clean dashboards, notify operators. |
### 3.1 Environment sequencing
1. **Dev/Test clusters:** Validate all automation. Run full regression suite (`dotnet test src/Concelier/...`).
2. **Staging:** Execute complete backfill (see §4) and collect 24h of telemetry before sign-off.
3. **Production:** Perform cutover during low-ingest window; communicate via Slack/email + status page two days in advance.
4. **Offline kit:** Package new Observer snapshots with LNM-only data; ensure instructions cover flag toggles for air-gapped deployments.
### 3.2 Smoke test matrix
- `stella concelier status --include linkset` returns healthy and shows zero Merge workers.
- `stella policy evaluate` against sample tenants produces identical findings pre/post cutover.
- Export Center bundle diff shows only expected metadata changes (manifest ID, timestamps).
- Grafana dashboards: `linkset_insert_duration_ms` steady, `merge.identity.conflicts` flatlined.
## 4. Backfill strategy
1. **Freeze Merge writes:** Pause Merge job scheduler (`MergeJobScheduler.PauseAsync`) to prevent new merge events while snapshots are taken.
2. **Generate linkset baseline:** Run `dotnet run --project src/Concelier/StellaOps.Concelier.WebService -- linkset backfill --from 2024-01-01` (or equivalent CLI job) to rebuild linksets from `advisory_raw`. Capture job output artefacts and attach to the sprint issue.
3. **Validate parity:** Use the internal diff tool (`tools/concelier/compare-linkset-merge.ps1`) to compare sample advisories. Any diffs must be triaged before production cutover.
4. **Publish evidence:** For air-gapped tenants, create a one-off Offline Kit slice (`export profile linkset-backfill`) and push to staging mirror.
5. **Tag snapshot:** Record Mongo `oplog` timestamp and S3/object storage manifests in `ops/devops/runbooks/concelier/no-merge.md` (new section) so rollback knows the safe point.
> **Determinism:** rerunning the backfill with identical inputs must produce byte-identical linkset documents. Use the `--verify-determinism` flag where available and archive the checksum report under `artifacts/lnm-backfill/<date>/`.
## 5. Validation gates
- **Metrics:** `linkset_insert_duration_ms`, `linkset_documents_total`, `normalized_version_rules_missing`, `merge.identity.conflicts`.
- Gate: `normalized_version_rules_missing == 0` for 48h before enabling `noMergeEnabled`.
- **Logs:** Ensure no occurrences of `Fallbacking to merge service` after cutover.
- **Change streams:** Policy and Scheduler should observe only `advisory.linkset.updated` events; monitor for stragglers referencing merge IDs.
- **QA:** Golden tests in `StellaOps.Concelier.Merge.Tests` updated to assert absence of merge outputs, plus integration tests verifying LNM-only exports.
Capture validation evidence in the sprint journal (attach Grafana screenshots + CLI output).
## 6. Rollback plan
1. **Toggle sequence:**
- Set `concelier:features:noMergeEnabled=false`.
- Re-enable Merge job schedules (`concelier:jobs:merge:allowlist=["merge:default"]`).
- Disable `policy:overlays:requireLinksetEvidence`.
2. **Data considerations:**
- Linkset writes continue, so no data is lost; ensure Policy consumers ignore linkset-only fields during rollback window.
- If Merge pipeline was fully removed (Phase3 complete), redeploy the Merge service container image from the `rollback` tag published before cutover.
3. **Verification:**
- Run `stella concelier status` to confirm Merge workers active.
- Monitor `merge.identity.conflicts` for spikes; if present, roll forward and re-open incident with Architecture guild.
4. **Communication:**
- Post incident note in #release-infra and customer status page.
- Log rollback reason, window, and configs in `ops/devops/incidents/<yyyy-mm-dd>-no-merge.md`.
Rollback window should not exceed 4hours; beyond that, plan to roll forward with a hotfix rather than reintroducing Merge.
## 7. Documentation & communications
- Update `docs/modules/concelier/architecture.md` appendix to mark Merge deprecated and link back to this playbook.
- Coordinate with Docs guild to publish operator-facing guidance (`docs/releases/2025-q4.md`) and update CLI help text.
- Notify product/CS teams with a short FAQ covering timelines, customer impact, and steps for self-hosted installations.
## 8. Responsibilities matrix
| Area | Lead guild(s) | Supporting |
| --- | --- | --- |
| Feature flags & config | BE-Merge | DevOps |
| Backfill scripting | BE-Merge | Tools |
| Observability dashboards | Observability | QA |
| Offline kit packaging | Export Center | AirGap |
| Customer comms | Docs | Product, Support |
## 9. Deliverables & artefacts
- Config diff per environment (stored in GitOps repo).
- Backfill checksum report (`artifacts/lnm-backfill/<date>/checksums.json`).
- Grafana export (PDF) showing validation metrics.
- QA test run attesting to LNM-only regressions passing.
- Updated runbook entry in `ops/devops/runbooks/concelier/`.
---
## 10. Migration readiness checklist
| Item | Primary owner | Status notes |
| --- | --- | --- |
| Capture Linkset coverage baselines (`version_entries_total`, `missing_version_entries_total`) and archive Grafana export. | Observability Guild | [ ] Pending |
| Stage and verify linkset backfill using `linkset backfill` job; store checksum report under `artifacts/lnm-backfill/<date>/`. | BE-Merge, DevOps Guild | [ ] Pending |
| Confirm feature flags per environment (`noMergeEnabled`, `lnmShadowWrites`, `policy:overlays:requireLinksetEvidence`) match Phase 03 plan. | DevOps Guild | [ ] Pending |
| Publish operator comms (status page, Slack/email) with cutover + rollback windows. | Docs Guild, Product | [ ] Pending |
| Execute rollback rehearsal in staging and log results in `ops/devops/incidents/<date>-no-merge.md`. | DevOps Guild, Architecture Guild | [ ] Pending |
> Update the checklist as each item completes; completion of every row is required before moving to Phase2 (Cutover).
---
With this playbook completed, proceed to MERGE-LNM-21-002 to remove the Merge service code paths and enforce compile-time analyzers that block new merge dependencies.