No-Merge Migration Playbook

Last updated: 2025-11-03

This playbook guides the full retirement of the legacy Merge service (AdvisoryMergeService) in favour of Link-Not-Merge (LNM) observations plus linksets. It is written for the BE-Merge, Architecture, DevOps, and Docs guilds coordinating Sprint 110 (Ingestion & Evidence) deliverables, and it feeds CONCELIER-LNM-21-101 / MERGE-LNM-21-001 and downstream DOCS-LNM-22-008.

0. Scope & objectives

  • Primary goal: cut over all advisory pipelines to Link-Not-Merge with no residual dependencies on AdvisoryMergeService.
  • Secondary goals: maintain deterministic evidence, zero data loss, and reversible deployment across online and offline tenants.
  • Success criteria:
    • All connectors emit observation affected.versions[] with provenance and pass LNM guardrails.
    • Linkset dashboards show zero missing_version_entries_total and no Normalized version rules missing… warnings.
    • Policy, Export Center, and CLI consumers operate solely on observations/linksets.
    • Rollback playbook validated and rehearsed in staging.

1. Prerequisites checklist

| Item | Owner | Notes |
|------|-------|-------|
| Normalized version ranges emitted for all Sprint 110 connectors (Acsc, Cccs, CertBund, CertCc, Cve, Ghsa, Ics.Cisa, Kisa, Ru.Bdu, Ru.Nkcki, Vndr.Apple, Vndr.Cisco, Vndr.Msrc). | Connector guilds | Follow docs/dev/normalized-rule-recipes.md; update fixtures with UPDATE_*_FIXTURES=1. |
| Metrics dashboards (LinksetVersionCoverage, Normalized version rules missing) available in Grafana/CI snapshots. | Observability guild | Publish baseline before shadow rollout. |
| Concelier WebService exposes linkset and observation read APIs for policy/CLI consumers. | BE-Merge / Platform | Confirm contract parity with Merge outputs. |
| Export Center / Offline Kit aware of new manifests. | Export Center guild | Provide beta bundle for QA verification. |
| Docs guild aligned on public migration messaging. | Docs guild | Update docs/dev, docs/modules/concelier, and release notes once cutover date is locked. |

Do not proceed to Phase 1 until all prerequisites are checked or explicitly waived by the Architecture guild.

2. Feature flag & configuration plan

| Toggle | Default | Purpose | Notes |
|--------|---------|---------|-------|
| concelier:features:noMergeEnabled | false | Master switch to disable legacy Merge job scheduling/execution. | Applies to WebService + Worker; gate AdvisoryMergeService DI registration. |
| concelier:features:lnmShadowWrites | true | Enables dual-write of linksets while Merge remains active. | Keep enabled throughout Phase 0–1 to validate parity. |
| concelier:jobs:merge:allowlist | [] | Explicit allowlist for Merge jobs when noMergeEnabled is false. | Set to empty during Phase 2+ to prevent accidental restarts. |
| policy:overlays:requireLinksetEvidence | false | Policy engine safety net to require linkset-backed findings. | Flip to true only after cutover (Phase 2). |

Configuration hygiene: Document the toggle values per environment in ops/devops/configuration/staging.md and ops/devops/configuration/production.md. Air-gapped customers receive defaults through the Offline Kit release notes.
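The toggle plan above can be checked mechanically before each phase. The following Python sketch is illustrative only: the `EXPECTED` mapping is an assumption inferred from the Notes column (Merge stays primary with shadow writes before cutover; Merge is disabled and linkset evidence is required afterwards), and `check_flags` is a hypothetical helper, not part of any Stella Ops tooling. Toggles the playbook does not pin for a stage are deliberately left unchecked.

```python
# Hypothetical per-stage expectations, inferred from the toggle table.
# Only values the playbook states explicitly are listed.
EXPECTED = {
    "shadow": {  # before cutover: Merge stays primary, linksets are dual-written
        "concelier:features:noMergeEnabled": False,
        "concelier:features:lnmShadowWrites": True,
        "policy:overlays:requireLinksetEvidence": False,
    },
    "post_cutover": {  # after cutover: Merge disabled, allowlist emptied
        "concelier:features:noMergeEnabled": True,
        "concelier:jobs:merge:allowlist": [],
        "policy:overlays:requireLinksetEvidence": True,
    },
}


def check_flags(stage, actual):
    """Return a list of human-readable mismatches for the given stage."""
    problems = []
    for key, want in EXPECTED[stage].items():
        got = actual.get(key, "<unset>")
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    return problems


if __name__ == "__main__":
    # Example: a staging config where the Merge allowlist was not emptied.
    staging = {
        "concelier:features:noMergeEnabled": True,
        "concelier:jobs:merge:allowlist": ["merge:default"],
        "policy:overlays:requireLinksetEvidence": True,
    }
    for line in check_flags("post_cutover", staging):
        print(line)
```

A check like this could run against the per-environment values recorded in the configuration documents, so that drift is caught before a phase change rather than after it.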

3. Rollout phases

| Phase | Goal | Duration | Key actions |
|-------|------|----------|-------------|
| 0 – Preparation | Ensure readiness | 2–3 days | Finalise prerequisites, snapshot Merge metrics, dry-run backfill scripts in dev. |
| 1 – Shadow / Dual Write | Validate parity | 5–7 days | Enable lnmShadowWrites, keep Merge primary. Compare linkset vs merged outputs using stella concelier diff-merge --snapshot <date>; fix discrepancies. |
| 2 – Cutover | Switch to LNM | 1 day (per env) | Enable noMergeEnabled, disable Merge job schedules, update Policy/Export configs, run post-cutover smoke tests. |
| 3 – Harden | Decommission Merge | 2–3 days | Remove Merge background services, delete merge_event retention jobs, clean dashboards, notify operators. |

3.1 Environment sequencing

  1. Dev/Test clusters: Validate all automation. Run full regression suite (dotnet test src/Concelier/...).
  2. Staging: Execute complete backfill (see §4) and collect 24 h of telemetry before sign-off.
  3. Production: Perform cutover during low-ingest window; communicate via Slack/email + status page two days in advance.
  4. Offline kit: Package new Observer snapshots with LNM-only data; ensure instructions cover flag toggles for air-gapped deployments.

3.2 Smoke test matrix

  • stella concelier status --include linkset returns healthy and shows zero Merge workers.
  • stella policy evaluate against sample tenants produces identical findings pre/post cutover.
  • Export Center bundle diff shows only expected metadata changes (manifest ID, timestamps).
  • Grafana dashboards: linkset_insert_duration_ms steady, merge.identity.conflicts flatlined.
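The bundle-diff check in the matrix above amounts to comparing two manifests while ignoring the fields that are expected to change. A minimal Python sketch follows, assuming the manifests have already been loaded as dictionaries; the key names `manifestId` and `createdAt` are placeholders for "manifest ID, timestamps" and are not taken from the real Export Center schema.

```python
# Keys expected to differ between pre- and post-cutover bundles.
# These names are illustrative assumptions, not the real manifest schema.
EXPECTED_CHANGES = frozenset({"manifestId", "createdAt"})

_MISSING = object()  # sentinel so a missing key differs from an explicit None


def unexpected_changes(before, after, allowed=EXPECTED_CHANGES):
    """Return the sorted top-level keys that differ and are not allowed to."""
    changed = {
        key
        for key in before.keys() | after.keys()
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    }
    return sorted(changed - set(allowed))


if __name__ == "__main__":
    pre = {"manifestId": "a", "createdAt": "t1", "advisories": 10}
    post = {"manifestId": "b", "createdAt": "t2", "advisories": 9}
    # Any key listed here would fail the smoke test.
    print(unexpected_changes(pre, post))
```

The comparison is top-level only; nested manifest sections would need a recursive walk or a canonical serialisation before diffing.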

4. Backfill strategy

  1. Freeze Merge writes: Pause Merge job scheduler (MergeJobScheduler.PauseAsync) to prevent new merge events while snapshots are taken.
  2. Generate linkset baseline: Run dotnet run --project src/Concelier/StellaOps.Concelier.WebService -- linkset backfill --from 2024-01-01 (or equivalent CLI job) to rebuild linksets from advisory_raw. Capture job output artefacts and attach to the sprint issue.
  3. Validate parity: Use the internal diff tool (tools/concelier/compare-linkset-merge.ps1) to compare sample advisories. Any diffs must be triaged before production cutover.
  4. Publish evidence: For air-gapped tenants, create a one-off Offline Kit slice (export profile linkset-backfill) and push to staging mirror.
  5. Tag snapshot: Record Mongo oplog timestamp and S3/object storage manifests in ops/devops/runbooks/concelier/no-merge.md (new section) so rollback knows the safe point.

Determinism: rerunning the backfill with identical inputs must produce byte-identical linkset documents. Use the --verify-determinism flag where available and archive the checksum report under artifacts/lnm-backfill/<date>/.
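Where the --verify-determinism flag is not available, byte-identity can be checked by hashing the output of two backfill runs and comparing the digests. The Python sketch below assumes each run wrote its linkset documents as files under its own directory; that layout, the function names, and the report format are illustrative assumptions rather than a description of the actual backfill job.

```python
import hashlib
import json
from pathlib import Path


def checksum_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_runs(first_dir, second_dir):
    """Return paths whose bytes differ, or that exist in only one run."""
    first, second = checksum_tree(first_dir), checksum_tree(second_dir)
    return sorted(
        name
        for name in first.keys() | second.keys()
        if first.get(name) != second.get(name)
    )


def write_report(run_dir, report_path):
    """Write a sorted-key JSON checksum report so the report itself is stable."""
    report = checksum_tree(run_dir)
    Path(report_path).write_text(json.dumps(report, indent=2, sort_keys=True) + "\n")
    return report
```

An empty list from `compare_runs` indicates the two runs are byte-identical; `write_report` produces a file that could be archived as the checksum report, though its exact schema here is a guess.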

5. Validation gates

  • Metrics: linkset_insert_duration_ms, linkset_documents_total, normalized_version_rules_missing, merge.identity.conflicts.
    • Gate: normalized_version_rules_missing == 0 for 48 h before enabling noMergeEnabled.
  • Logs: Ensure no occurrences of Fallbacking to merge service after cutover.
  • Change streams: Policy and Scheduler should observe only advisory.linkset.updated events; monitor for stragglers referencing merge IDs.
  • QA: Golden tests in StellaOps.Concelier.Merge.Tests updated to assert absence of merge outputs, plus integration tests verifying LNM-only exports.

Capture validation evidence in the sprint journal (attach Grafana screenshots + CLI output).
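The metrics gate above (normalized_version_rules_missing held at zero for 48 hours) can be evaluated from exported samples. This Python sketch assumes the metric has been exported as (timestamp, value) pairs, for example from a Grafana CSV export; the function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def gate_passes(samples, now, window=timedelta(hours=48)):
    """Check the 48-hour zero gate for normalized_version_rules_missing.

    samples: iterable of (timestamp, value) pairs.
    Passes only if the samples reach back to the start of the window
    and every sample inside the window is zero.
    """
    start = now - window
    ordered = sorted(samples)
    # Without a sample at or before the window start we cannot claim
    # the metric was zero for the full window.
    if not ordered or ordered[0][0] > start:
        return False
    return all(value == 0 for ts, value in ordered if ts >= start)


if __name__ == "__main__":
    now = datetime(2025, 11, 3, tzinfo=timezone.utc)
    hourly = [(now - timedelta(hours=h), 0) for h in range(50)]
    print(gate_passes(hourly, now))
```

The sketch does not detect scrape gaps inside the window; a stricter gate would also require a maximum spacing between consecutive samples.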

6. Rollback plan

  1. Toggle sequence:
    • Set concelier:features:noMergeEnabled=false.
    • Re-enable Merge job schedules (concelier:jobs:merge:allowlist=["merge:default"]).
    • Disable policy:overlays:requireLinksetEvidence.
  2. Data considerations:
    • Linkset writes continue, so no data is lost; ensure Policy consumers ignore linkset-only fields during rollback window.
    • If the Merge pipeline was fully removed (Phase 3 complete), redeploy the Merge service container image from the rollback tag published before cutover.
  3. Verification:
    • Run stella concelier status to confirm Merge workers active.
    • Monitor merge.identity.conflicts for spikes; if present, roll forward and re-open incident with Architecture guild.
  4. Communication:
    • Post incident note in #release-infra and customer status page.
    • Log rollback reason, window, and configs in ops/devops/incidents/<yyyy-mm-dd>-no-merge.md.

The rollback window should not exceed 4 hours; beyond that, plan to roll forward with a hotfix rather than reintroducing Merge.
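The toggle sequence and the 4-hour limit can be captured as data so that a rehearsal and a real rollback follow the same steps. The Python sketch below is a minimal illustration: the toggle keys and values come from the rollback plan above, while `apply_rollback`, `should_roll_forward`, and the in-memory config dictionary are hypothetical stand-ins for whatever mechanism actually applies configuration.

```python
import copy
from datetime import datetime, timedelta

MAX_ROLLBACK_WINDOW = timedelta(hours=4)

# Ordered toggle changes from the rollback plan.
ROLLBACK_STEPS = [
    ("concelier:features:noMergeEnabled", False),
    ("concelier:jobs:merge:allowlist", ["merge:default"]),
    ("policy:overlays:requireLinksetEvidence", False),
]


def apply_rollback(config):
    """Return a new config dict with the rollback toggles applied in order."""
    updated = dict(config)
    for key, value in ROLLBACK_STEPS:
        updated[key] = copy.deepcopy(value)  # avoid sharing the list constant
    return updated


def should_roll_forward(started_at, now):
    """True once the rollback has run longer than the 4-hour window."""
    return now - started_at > MAX_ROLLBACK_WINDOW


if __name__ == "__main__":
    live = {
        "concelier:features:noMergeEnabled": True,
        "concelier:jobs:merge:allowlist": [],
        "policy:overlays:requireLinksetEvidence": True,
    }
    print(apply_rollback(live))
```

`apply_rollback` leaves other keys untouched, which matches the plan's note that linkset writes continue during the rollback window.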

7. Documentation & communications

  • Update docs/modules/concelier/architecture.md appendix to mark Merge deprecated and link back to this playbook.
  • Coordinate with Docs guild to publish operator-facing guidance (docs/releases/2025-q4.md) and update CLI help text.
  • Notify product/CS teams with a short FAQ covering timelines, customer impact, and steps for self-hosted installations.

8. Responsibilities matrix

| Area | Lead guild(s) | Supporting |
|------|---------------|------------|
| Feature flags & config | BE-Merge | DevOps |
| Backfill scripting | BE-Merge | Tools |
| Observability dashboards | Observability | QA |
| Offline kit packaging | Export Center | AirGap |
| Customer comms | Docs | Product, Support |

9. Deliverables & artefacts

  • Config diff per environment (stored in GitOps repo).
  • Backfill checksum report (artifacts/lnm-backfill/<date>/checksums.json).
  • Grafana export (PDF) showing validation metrics.
  • QA test run attesting that the LNM-only regression suite passes.
  • Updated runbook entry in ops/devops/runbooks/concelier/.

10. Migration readiness checklist

| Item | Primary owner | Status / notes |
|------|---------------|----------------|
| Capture Linkset coverage baselines (version_entries_total, missing_version_entries_total) and archive Grafana export. | Observability Guild | [ ] Pending |
| Stage and verify linkset backfill using linkset backfill job; store checksum report under artifacts/lnm-backfill/<date>/. | BE-Merge, DevOps Guild | [ ] Pending |
| Confirm feature flags per environment (noMergeEnabled, lnmShadowWrites, policy:overlays:requireLinksetEvidence) match Phase 0–3 plan. | DevOps Guild | [ ] Pending |
| Publish operator comms (status page, Slack/email) with cutover + rollback windows. | Docs Guild, Product | [ ] Pending |
| Execute rollback rehearsal in staging and log results in ops/devops/incidents/<date>-no-merge.md. | DevOps Guild, Architecture Guild | [ ] Pending |

Update the checklist as each item completes; completion of every row is required before moving to Phase 2 (Cutover).


With this playbook completed, proceed to MERGE-LNM-21-002 to remove the Merge service code paths and enforce compile-time analyzers that block new merge dependencies.