No-Merge Migration Playbook
Last updated: 2025-11-03
This playbook guides the full retirement of the legacy Merge service (AdvisoryMergeService) in favour of Link-Not-Merge (LNM) observations plus linksets. It is written for the BE-Merge, Architecture, DevOps, and Docs guilds coordinating Sprint 110 (Ingestion & Evidence) deliverables, and it feeds CONCELIER-LNM-21-101 / MERGE-LNM-21-001 and downstream DOCS-LNM-22-008.
0. Scope & objectives
- Primary goal: cut over all advisory pipelines to Link-Not-Merge with no residual dependencies on `AdvisoryMergeService`.
- Secondary goals: maintain deterministic evidence, zero data loss, and reversible deployment across online and offline tenants.
- Success criteria:
  - All connectors emit observation `affected.versions[]` with provenance and pass LNM guardrails.
  - Linkset dashboards show zero `missing_version_entries_total` and no `Normalized version rules missing…` warnings.
  - Policy, Export Center, and CLI consumers operate solely on observations/linksets.
  - Rollback playbook validated and rehearsed in staging.
1. Prerequisites checklist
| Item | Owner | Notes |
|---|---|---|
| Normalized version ranges emitted for all Sprint 110 connectors (Acsc, Cccs, CertBund, CertCc, Cve, Ghsa, Ics.Cisa, Kisa, Ru.Bdu, Ru.Nkcki, Vndr.Apple, Vndr.Cisco, Vndr.Msrc). | Connector guilds | Follow `docs/dev/normalized-rule-recipes.md`; update fixtures with `UPDATE_*_FIXTURES=1`. |
| Metrics dashboards (LinksetVersionCoverage, Normalized version rules missing) available in Grafana/CI snapshots. | Observability guild | Publish baseline before shadow rollout. |
| Concelier WebService exposes linkset and observation read APIs for policy/CLI consumers. | BE-Merge / Platform | Confirm contract parity with Merge outputs. |
| Export Center / Offline Kit aware of new manifests. | Export Center guild | Provide beta bundle for QA verification. |
| Docs guild aligned on public migration messaging. | Docs guild | Update docs/dev, docs/modules/concelier, and release notes once cutover date is locked. |
Do not proceed to Phase 1 until all prerequisites are checked or explicitly waived by Architecture guild.
2. Feature flag & configuration plan
| Toggle | Default | Purpose | Notes |
|---|---|---|---|
| `concelier:features:noMergeEnabled` | `false` | Master switch to disable legacy Merge job scheduling/execution. | Applies to WebService + Worker; gate `AdvisoryMergeService` DI registration. |
| `concelier:features:lnmShadowWrites` | `true` | Enables dual-write of linksets while Merge remains active. | Keep enabled throughout Phase 0–1 to validate parity. |
| `concelier:jobs:merge:allowlist` | `[]` | Explicit allowlist for Merge jobs when `noMergeEnabled` is `false`. | Set to empty during Phase 2+ to prevent accidental restarts. |
| `policy:overlays:requireLinksetEvidence` | `false` | Policy engine safety net to require linkset-backed findings. | Flip to `true` only after cutover (Phase 2). |
Configuration hygiene: Document the toggle values per environment in `ops/devops/configuration/staging.md` and `ops/devops/configuration/production.md`. Air-gapped customers receive defaults through the Offline Kit release notes.
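
The toggle plan can be checked mechanically before each phase change. The following is a minimal sketch, assuming the effective configuration of an environment has been flattened into a Python dict keyed by the toggle names from the table. The `EXPECTED_BY_PHASE` mapping and the `check_flags` helper are hypothetical illustrations, not part of Concelier; toggles that the playbook does not pin for a phase are deliberately left unchecked.

```python
"""Sketch: compare an environment's toggles with the values expected per phase."""

# Expected values are taken from the toggle table in this section.
# Anything the playbook does not pin for a phase is omitted (not checked).
EXPECTED_BY_PHASE = {
    "shadow": {  # Phase 0-1: Merge stays primary, linksets are dual-written
        "concelier:features:noMergeEnabled": False,
        "concelier:features:lnmShadowWrites": True,
        "policy:overlays:requireLinksetEvidence": False,
    },
    "cutover": {  # Phase 2+: Merge disabled, allowlist emptied
        "concelier:features:noMergeEnabled": True,
        "concelier:jobs:merge:allowlist": [],
        # requireLinksetEvidence may be flipped after cutover, so it is not pinned.
    },
}


def check_flags(phase, actual):
    """Return a list of human-readable mismatches (empty when the config matches)."""
    problems = []
    for key, expected in EXPECTED_BY_PHASE[phase].items():
        if key not in actual:
            problems.append(f"{key}: missing (expected {expected!r})")
        elif actual[key] != expected:
            problems.append(f"{key}: got {actual[key]!r}, expected {expected!r}")
    return problems
```

A check like this could run in CI against the per-environment values recorded in the configuration documents, failing the pipeline when `check_flags` returns a non-empty list.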
3. Rollout phases
| Phase | Goal | Duration | Key actions |
|---|---|---|---|
| 0 – Preparation | Ensure readiness | 2–3 days | Finalise prerequisites, snapshot Merge metrics, dry-run backfill scripts in dev. |
| 1 – Shadow / Dual Write | Validate parity | 5–7 days | Enable lnmShadowWrites, keep Merge primary. Compare linkset vs merged outputs using stella concelier diff-merge --snapshot <date>; fix discrepancies. |
| 2 – Cutover | Switch to LNM | 1 day (per env) | Enable noMergeEnabled, disable Merge job schedules, update Policy/Export configs, run post-cutover smoke tests. |
| 3 – Harden | Decommission Merge | 2–3 days | Remove Merge background services, delete merge_event retention jobs, clean dashboards, notify operators. |
3.1 Environment sequencing
- Dev/Test clusters: Validate all automation. Run full regression suite (`dotnet test src/Concelier/...`).
- Staging: Execute complete backfill (see §4) and collect 24 h of telemetry before sign-off.
- Production: Perform cutover during low-ingest window; communicate via Slack/email + status page two days in advance.
- Offline kit: Package new Observer snapshots with LNM-only data; ensure instructions cover flag toggles for air-gapped deployments.
3.2 Smoke test matrix
- `stella concelier status --include linkset` returns healthy and shows zero Merge workers.
- `stella policy evaluate` against sample tenants produces identical findings pre/post cutover.
- Export Center bundle diff shows only expected metadata changes (manifest ID, timestamps).
- Grafana dashboards: `linkset_insert_duration_ms` steady, `merge.identity.conflicts` flatlined.
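
The bundle-diff check in the matrix can be approximated with a small comparison helper. This is a sketch only: it assumes each bundle manifest has been loaded as a Python dict, and the field names `manifestId` and `createdAt` are hypothetical stand-ins for "manifest ID" and "timestamps", since the actual manifest schema is not described here.

```python
"""Sketch: confirm two export manifests differ only in expected metadata fields."""

# Hypothetical field names; substitute the real manifest keys.
EXPECTED_TO_CHANGE = {"manifestId", "createdAt"}

_MISSING = object()  # sentinel so that an absent key never equals a stored None


def unexpected_changes(before, after):
    """Return the sorted top-level keys that differ and are not expected metadata."""
    changed = {
        key
        for key in before.keys() | after.keys()
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    }
    return sorted(changed - EXPECTED_TO_CHANGE)
```

An empty result means the pre- and post-cutover bundles differ only in the fields expected to change; any other key in the result warrants triage before sign-off.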
4. Backfill strategy
- Freeze Merge writes: Pause Merge job scheduler (`MergeJobScheduler.PauseAsync`) to prevent new merge events while snapshots are taken.
- Generate linkset baseline: Run `dotnet run --project src/Concelier/StellaOps.Concelier.WebService -- linkset backfill --from 2024-01-01` (or equivalent CLI job) to rebuild linksets from `advisory_raw`. Capture job output artefacts and attach to the sprint issue.
- Validate parity: Use the internal diff tool (`tools/concelier/compare-linkset-merge.ps1`) to compare sample advisories. Any diffs must be triaged before production cutover.
- Publish evidence: For air-gapped tenants, create a one-off Offline Kit slice (`export profile linkset-backfill`) and push to staging mirror.
- Tag snapshot: Record Mongo `oplog` timestamp and S3/object storage manifests in `ops/devops/runbooks/concelier/no-merge.md` (new section) so rollback knows the safe point.
Determinism: rerunning the backfill with identical inputs must produce byte-identical linkset documents. Use the `--verify-determinism` flag where available and archive the checksum report under `artifacts/lnm-backfill/<date>/`.
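
Where the `--verify-determinism` flag is not available, byte-identity can be checked independently by hashing the output of two backfill runs. The sketch below assumes each run wrote its linkset documents as files under its own directory; the helper names are hypothetical and the format of the real checksum report is not specified here.

```python
"""Sketch: checksum two backfill output directories and report differences."""
import hashlib
from pathlib import Path


def checksum_tree(root):
    """Map each file path (relative to root, POSIX style) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def nondeterministic_files(run_a, run_b):
    """Return relative paths whose bytes differ or that exist in only one run."""
    a, b = checksum_tree(run_a), checksum_tree(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `nondeterministic_files` indicates the two runs are byte-identical. The mapping returned by `checksum_tree` could be serialised as the basis of a checksum report, though the exact layout of `checksums.json` is left to the owning guild.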
5. Validation gates
- Metrics: `linkset_insert_duration_ms`, `linkset_documents_total`, `normalized_version_rules_missing`, `merge.identity.conflicts`.
  - Gate: `normalized_version_rules_missing == 0` for 48 h before enabling `noMergeEnabled`.
- Logs: Ensure no occurrences of `Fallbacking to merge service` after cutover.
- Change streams: Policy and Scheduler should observe only `advisory.linkset.updated` events; monitor for stragglers referencing merge IDs.
- QA: Golden tests in `StellaOps.Concelier.Merge.Tests` updated to assert absence of merge outputs, plus integration tests verifying LNM-only exports.
Capture validation evidence in the sprint journal (attach Grafana screenshots + CLI output).
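
The 48 h metric gate can be evaluated from exported samples rather than by eye. This is a minimal sketch, assuming the `normalized_version_rules_missing` series has been exported as `(timestamp, value)` pairs; the `gate_passes` helper is hypothetical. It treats insufficient history as a failure, which is a conservative choice of this sketch rather than a rule stated in the playbook.

```python
"""Sketch: evaluate the 48 h zero-value gate over sampled metric values."""
from datetime import timedelta

GATE_WINDOW = timedelta(hours=48)


def gate_passes(samples, now):
    """samples: iterable of (timestamp, value) for normalized_version_rules_missing.

    Passes only if the history reaches back at least 48 h and every sample
    inside the 48 h window ending at `now` is zero.
    """
    ordered = sorted(samples, key=lambda sample: sample[0])
    if not ordered or now - ordered[0][0] < GATE_WINDOW:
        return False  # not enough history to support the claim
    window_start = now - GATE_WINDOW
    in_window = [value for ts, value in ordered if ts >= window_start]
    # Require at least one sample in the window so the check is not vacuous.
    return bool(in_window) and all(value == 0 for value in in_window)
```

The function does not detect gaps in sampling inside the window; if the scrape interval is irregular, a density check would need to be added before relying on the result.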
6. Rollback plan
- Toggle sequence:
  - Set `concelier:features:noMergeEnabled=false`.
  - Re-enable Merge job schedules (`concelier:jobs:merge:allowlist=["merge:default"]`).
  - Disable `policy:overlays:requireLinksetEvidence`.
- Data considerations:
  - Linkset writes continue, so no data is lost; ensure Policy consumers ignore linkset-only fields during rollback window.
  - If Merge pipeline was fully removed (Phase 3 complete), redeploy the Merge service container image from the `rollback` tag published before cutover.
- Verification:
  - Run `stella concelier status` to confirm Merge workers active.
  - Monitor `merge.identity.conflicts` for spikes; if present, roll forward and re-open incident with Architecture guild.
- Communication:
  - Post incident note in #release-infra and customer status page.
  - Log rollback reason, window, and configs in `ops/devops/incidents/<yyyy-mm-dd>-no-merge.md`.
Rollback window should not exceed 4 hours; beyond that, plan to roll forward with a hotfix rather than reintroducing Merge.
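
The toggle sequence and the 4 hour limit can be captured in a small helper for rehearsals. This is a sketch under stated assumptions: configuration is modelled as a flat Python dict, the helper names are hypothetical, and the window is assumed to be measured from the moment the rollback starts, which this playbook does not state explicitly.

```python
"""Sketch: apply the rollback toggle sequence and check the 4 h window."""
import copy
from datetime import timedelta

MAX_ROLLBACK_WINDOW = timedelta(hours=4)

# Order follows the toggle sequence in this section.
ROLLBACK_STEPS = [
    ("concelier:features:noMergeEnabled", False),
    ("concelier:jobs:merge:allowlist", ["merge:default"]),
    ("policy:overlays:requireLinksetEvidence", False),
]


def apply_rollback(config):
    """Return a copy of `config` with the rollback toggles applied in order."""
    rolled_back = dict(config)
    for key, value in ROLLBACK_STEPS:
        rolled_back[key] = copy.deepcopy(value)  # avoid sharing the list literal
    return rolled_back


def should_roll_forward(rollback_started_at, now):
    """Assumption: the window is measured from when the rollback started."""
    return now - rollback_started_at > MAX_ROLLBACK_WINDOW
```

Toggles that are not part of the sequence, such as `concelier:features:lnmShadowWrites`, are left untouched, which matches the note that linkset writes continue during rollback.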
7. Documentation & communications
- Update `docs/modules/concelier/architecture.md` appendix to mark Merge deprecated and link back to this playbook.
- Coordinate with Docs guild to publish operator-facing guidance (`docs/releases/2025-q4.md`) and update CLI help text.
- Notify product/CS teams with a short FAQ covering timelines, customer impact, and steps for self-hosted installations.
8. Responsibilities matrix
| Area | Lead guild(s) | Supporting |
|---|---|---|
| Feature flags & config | BE-Merge | DevOps |
| Backfill scripting | BE-Merge | Tools |
| Observability dashboards | Observability | QA |
| Offline kit packaging | Export Center | AirGap |
| Customer comms | Docs | Product, Support |
9. Deliverables & artefacts
- Config diff per environment (stored in GitOps repo).
- Backfill checksum report (`artifacts/lnm-backfill/<date>/checksums.json`).
- Grafana export (PDF) showing validation metrics.
- QA test run attesting to LNM-only regressions passing.
- Updated runbook entry in `ops/devops/runbooks/concelier/`.
10. Migration readiness checklist
| Item | Primary owner | Status notes |
|---|---|---|
| Capture Linkset coverage baselines (`version_entries_total`, `missing_version_entries_total`) and archive Grafana export. | Observability Guild | [ ] Pending |
| Stage and verify linkset backfill using `linkset backfill` job; store checksum report under `artifacts/lnm-backfill/<date>/`. | BE-Merge, DevOps Guild | [ ] Pending |
| Confirm feature flags per environment (`noMergeEnabled`, `lnmShadowWrites`, `policy:overlays:requireLinksetEvidence`) match Phase 0–3 plan. | DevOps Guild | [ ] Pending |
| Publish operator comms (status page, Slack/email) with cutover + rollback windows. | Docs Guild, Product | [ ] Pending |
| Execute rollback rehearsal in staging and log results in `ops/devops/incidents/<date>-no-merge.md`. | DevOps Guild, Architecture Guild | [ ] Pending |
Update the checklist as each item completes; completion of every row is required before moving to Phase 2 (Cutover).
With this playbook completed, proceed to MERGE-LNM-21-002 to remove the Merge service code paths and enforce compile-time analyzers that block new merge dependencies.