# Concelier Conflict Resolution Runbook (Sprint 3)

This runbook equips Concelier operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine has landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.

---

## 1. Precedence Model (recap)

- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `concelier:merge:precedence:ranks` to override the rank per source when incident response requires it.
- **Freshness override:** If a lower-ranked source is at least 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
- **Tie-breakers:** When precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
- **Audit trail:** Each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
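
The rules above can be modelled as a small selection function. The following Python sketch is illustrative only and is not the `AdvisoryPrecedenceMerger` implementation: the `Candidate` type, the `pick_field` helper, the whitespace normalization, and the convention that a lower rank number means higher precedence are all assumptions made for the example. The "primary source order" tie-breaker is left out because this runbook does not define it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import hashlib

FRESHNESS_WINDOW = timedelta(hours=48)  # threshold taken from the runbook


@dataclass(frozen=True)
class Candidate:
    source: str         # e.g. "ghsa", "nvd", "osv"
    rank: int           # assumption: lower number means higher precedence
    modified: datetime  # last-modified time reported by the source
    value: str          # field text (title, summary, ...)


def _normalize(text: str) -> str:
    # Assumed normalization: collapse runs of whitespace.
    return " ".join(text.split())


def _tie_key(c: Candidate) -> tuple[int, str]:
    # Shortest normalized text first, then lowest SHA-256 hex digest.
    norm = _normalize(c.value)
    return (len(norm), hashlib.sha256(norm.encode("utf-8")).hexdigest())


def pick_field(candidates: list[Candidate]) -> tuple[Candidate, str]:
    """Return (winner, decision_reason) for one freshness-sensitive field."""
    best_rank = min(c.rank for c in candidates)
    top = [c for c in candidates if c.rank == best_rank]
    winner = min(top, key=_tie_key)
    reason = "tie-breaker" if len(top) > 1 else "precedence"

    # Freshness override: a lower-ranked source that is >= 48 h newer wins.
    fresher = [
        c for c in candidates
        if c.rank > best_rank and c.modified - winner.modified >= FRESHNESS_WINDOW
    ]
    if fresher:
        winner, reason = max(fresher, key=lambda c: c.modified), "freshness"
    return winner, reason


t0 = datetime(2025, 10, 1, tzinfo=timezone.utc)
ghsa = Candidate("ghsa", 0, t0, "Prototype pollution in example-lib")
osv = Candidate("osv", 2, t0 + timedelta(hours=72), "Prototype pollution in example-lib < 1.2.3")
winner, reason = pick_field([ghsa, osv])
print(winner.source, reason)  # osv freshness
```

With the OSV record only 24 hours newer, the same call would return the GHSA candidate with `precedence`, which matches the `decisionReason` values operators should expect to see in provenance.
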

---

## 2. Telemetry Shipped This Sprint

| Instrument | Type | Key Tags | Purpose |
|------------|------|----------|---------|
| `concelier.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
| `concelier.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
| `concelier.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
| `concelier.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
| `concelier.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |

### Structured logs

- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
- `Alias collision ...` (no EventId) - emitted when `concelier.merge.identity_conflicts` increments.

Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Concelier.Merge`.

---

## 3. Detection & Alerting

1. **Dashboard panels**
   - `concelier.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15-minute window.
   - `concelier.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
   - `concelier.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
   - `concelier.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
2. **Log-based alerts**
   - `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
   - `eventId=1002` with `reason="mismatch"` - severity disagreement; open a connector bug if sustained.
3. **Job health**
   - `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #concelier-ops.

### Threshold updates (2025-10-12)

- `concelier.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
- `concelier.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (the canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
- `concelier.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3, but annotate dashboards to note that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
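
The updated thresholds can be sanity-checked offline before they are encoded in the monitoring backend. The sketch below is a minimal Python model under stated assumptions: the `THRESHOLDS` table, `window_sum`, and `breached` are hypothetical helpers, and samples are assumed to be `(timestamp, increment)` pairs exported from the counters. Production alerting would normally live in the alert-rule syntax of whatever backend receives the OTEL metrics.

```python
from datetime import datetime, timedelta, timezone

# (window, limit, comparison) per instrument, copied from the threshold list above.
THRESHOLDS = {
    "concelier.merge.conflicts": (timedelta(minutes=30), 2, "ge"),       # page at >= 2
    "concelier.merge.overrides": (timedelta(minutes=30), 10, "gt"),      # warn above 10
    "concelier.merge.range_overrides": (timedelta(minutes=15), 3, "ge"),  # alert at >= 3
}


def window_sum(samples, now, window):
    """Sum counter increments whose timestamp falls inside (now - window, now]."""
    return sum(n for ts, n in samples if now - window < ts <= now)


def breached(metric, samples, now):
    """Return True when the windowed sum crosses the metric's threshold."""
    window, limit, op = THRESHOLDS[metric]
    total = window_sum(samples, now, window)
    return total >= limit if op == "ge" else total > limit


now = datetime(2025, 10, 12, 12, 0, tzinfo=timezone.utc)
conflicts = [(now - timedelta(minutes=25), 1), (now - timedelta(minutes=5), 1)]
print(breached("concelier.merge.conflicts", conflicts, now))      # True
print(breached("concelier.merge.conflicts", conflicts[1:], now))  # False
```

A single conflict event stays below the paging threshold, which mirrors the "first event routes to Slack" behaviour described above.
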

---

## 4. Triage Workflow

1. **Confirm job context**
   - Run `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
2. **Inspect metrics**
   - Correlate spikes in `concelier.merge.conflicts` with `primary_source`/`suppressed_source` tags from `concelier.merge.overrides`.
3. **Pull structured logs**
   - Example (vector output):
     ```bash
     jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-concelier.log
     ```
4. **Review merge events**
   - `mongosh`:
     ```javascript
     use concelier;
     db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
     ```
   - Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
5. **Interrogate provenance**
   - `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
   - Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
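
When the log volume is too large to eyeball, the `PrecedenceConflict` events from step 3 can be rolled up by type and reason. This is a minimal Python sketch, assuming the log is JSON Lines and uses the same field names as the `jq` filter above (`EventId.Name`, `ConflictType`, `Reason`); `summarize_conflicts` is a hypothetical helper, and the sample records are invented for illustration.

```python
import json
from collections import Counter


def summarize_conflicts(lines):
    """Count PrecedenceConflict events by (type, reason); skip unparsable lines."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the log stream
        if not isinstance(event, dict):
            continue
        event_id = event.get("EventId")
        name = event_id.get("Name") if isinstance(event_id, dict) else None
        if name != "PrecedenceConflict":
            continue
        counts[(event.get("ConflictType"), event.get("Reason"))] += 1
    return counts


# Invented sample records; real payloads may carry more fields.
sample = [
    '{"EventId": {"Id": 1002, "Name": "PrecedenceConflict"}, "ConflictType": "severity", "Reason": "mismatch"}',
    '{"EventId": {"Id": 1002, "Name": "PrecedenceConflict"}, "ConflictType": "precedence_tie", "Reason": "equal_rank"}',
    '{"EventId": {"Id": 1000, "Name": "AdvisoryOverride"}}',
    "not json",
]
for (conflict_type, reason), n in sorted(summarize_conflicts(sample).items()):
    print(conflict_type, reason, n)
```

Passing an open file handle (for example `open("stellaops-concelier.log")`) instead of `sample` works the same way, because the function only iterates over lines.
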

---

## 5. Conflict Classification Matrix

| Signal | Likely Cause | Immediate Action |
|--------|--------------|------------------|
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `concelier:merge:precedence:ranks` to break the tie; restart merge job. |
| Rising `concelier.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
| `concelier.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |

---

## 6. Resolution Playbook

1. **Connector data fix**
   - Rerun the offending connector stages (for example, `stellaops-cli db fetch --source ghsa --stage map`).
   - Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
2. **Temporary precedence override**
   - Edit `etc/concelier.yaml`:
     ```yaml
     concelier:
       merge:
         precedence:
           ranks:
             osv: 1
             ghsa: 0
     ```
   - Restart Concelier workers; confirm tags in `concelier.merge.overrides` show the new ranks.
   - Document the override with expiry in the change log.
3. **Alias remediation**
   - Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
   - Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive; coordinate with Storage before issuing).
4. **Escalation**
   - If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
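
Because a shared rank is what triggers `reason="equal_rank"`, it can help to check a proposed `ranks` mapping for ties before restarting workers. The Python sketch below assumes the mapping has already been loaded from `etc/concelier.yaml` into a plain dictionary (YAML parsing is omitted); `find_rank_ties` and the `proposed` values are hypothetical and only illustrate the check.

```python
from collections import defaultdict


def find_rank_ties(ranks):
    """Return {rank: [sources]} for every rank shared by more than one source."""
    by_rank = defaultdict(list)
    for source, rank in ranks.items():
        by_rank[rank].append(source)
    return {rank: sorted(srcs) for rank, srcs in by_rank.items() if len(srcs) > 1}


# Hypothetical override under review; "nvd" and "osv" collide on rank 1.
proposed = {"ghsa": 0, "osv": 1, "nvd": 1}
print(find_rank_ties(proposed))  # {1: ['nvd', 'osv']}
```

An empty result means the mapping itself introduces no ties; sources that are absent from the mapping still fall back to whatever defaults the engine applies, which this check cannot see.
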

---

## 7. Validation Checklist

- [ ] Merge job rerun returns exit code `0`.
- [ ] `concelier.merge.conflicts` baseline returns to zero after corrective action.
- [ ] Latest `merge_event` entry shows expected hash delta.
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
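
The first three checklist items are mechanical and can be evaluated from values collected during triage. The following Python sketch is one possible way to do that; `validation_failures` is a hypothetical helper, and it assumes the `merge_event` document exposes `beforeHash` and `afterHash` as directly comparable values, as the field names in §4 suggest. The provenance and change-log items still need a human review.

```python
def validation_failures(exit_code, conflict_count, merge_event, expect_change=True):
    """Return a list of checklist items that are not yet satisfied."""
    failures = []
    if exit_code != 0:
        failures.append("merge job exit code is not 0")
    if conflict_count != 0:
        failures.append("concelier.merge.conflicts is not back to zero")
    # A hash delta exists when the canonical output changed during the merge.
    changed = merge_event["beforeHash"] != merge_event["afterHash"]
    if changed != expect_change:
        failures.append("merge_event hash delta differs from expectation")
    return failures


# Placeholder hash values for illustration only.
event = {"beforeHash": "aa11", "afterHash": "bb22"}
print(validation_failures(0, 0, event))  # []
```

Set `expect_change=False` when the corrective action should leave the canonical advisory untouched, for example after reverting a temporary override.
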

---

## 8. Reference Material

- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
- Merge engine internals: `src/StellaOps.Concelier.Merge/Services/AdvisoryPrecedenceMerger.cs`.
- Metrics definitions: `src/StellaOps.Concelier.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
- Storage audit trail: `src/StellaOps.Concelier.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Concelier.Storage.Mongo/MergeEvents`.

Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.

---

## 9. Synthetic Regression Fixtures

- **Locations** – Canonical conflict snapshots now live at `src/StellaOps.Concelier.Source.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/StellaOps.Concelier.Source.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/StellaOps.Concelier.Source.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
- **Validation commands** – To regenerate and verify the fixtures offline, run:

```bash
dotnet test src/StellaOps.Concelier.Source.Ghsa.Tests/StellaOps.Concelier.Source.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
dotnet test src/StellaOps.Concelier.Source.Nvd.Tests/StellaOps.Concelier.Source.Nvd.Tests.csproj --filter NvdConflictFixtureTests
dotnet test src/StellaOps.Concelier.Source.Osv.Tests/StellaOps.Concelier.Source.Osv.Tests.csproj --filter OsvConflictFixtureTests
dotnet test src/StellaOps.Concelier.Merge.Tests/StellaOps.Concelier.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
```

- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `concelier.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.

---

## 10. Change Log

| Date (UTC) | Change | Notes |
|------------|--------|-------|
| 2025-10-16 | Ops review signed off after connector expansion (CCCS, CERT-Bund, KISA, ICS CISA, MSRC) landed. Alert thresholds from §3 reaffirmed; dashboards updated to watch attachment signals emitted by ICS CISA connector. | Ops sign-off recorded by Concelier Ops Guild; no additional overrides required. |