Files
git.stella-ops.org/docs/ops/feedser-conflict-resolution.md
Vladimir Moushkov ea1106ce7c up
2025-10-15 10:03:56 +03:00

161 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Feedser Conflict Resolution Runbook (Sprint 3)
This runbook equips Feedser operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.
---
## 1. Precedence Model (recap)
- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `feedser:merge:precedence:ranks` to override per source when incident response requires it.
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
---
## 2. Telemetry Shipped This Sprint
| Instrument | Type | Key Tags | Purpose |
|------------|------|----------|---------|
| `feedser.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
| `feedser.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
| `feedser.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
| `feedser.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
| `feedser.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |
### Structured logs
- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
- `Alias collision ...` (no EventId) - emitted when `feedser.merge.identity_conflicts` increments.
Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Feedser.Merge`.
---
## 3. Detection & Alerting
1. **Dashboard panels**
- `feedser.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15 minute window.
- `feedser.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
- `feedser.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
- `feedser.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
2. **Log based alerts**
- `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
- `eventId=1002` with `reason="mismatch"` - severity disagreement; open connector bug if sustained.
3. **Job health**
- `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #feedser-ops.
### Threshold updates (2025-10-12)
- `feedser.merge.conflicts` Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
- `feedser.merge.overrides` Raise a warning when the 30-minute sum exceeds 10 (canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
- `feedser.merge.range_overrides` Maintain the 15-minute alert at ≥ 3 but annotate dashboards that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
---
## 4. Triage Workflow
1. **Confirm job context**
- `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
2. **Inspect metrics**
- Correlate spikes in `feedser.merge.conflicts` with `primary_source`/`suppressed_source` tags from `feedser.merge.overrides`.
3. **Pull structured logs**
- Example (vector output):
```
jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-feedser.log
```
4. **Review merge events**
- `mongosh`:
```javascript
use feedser;
db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
```
- Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
5. **Interrogate provenance**
- `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
- Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
---
## 5. Conflict Classification Matrix
| Signal | Likely Cause | Immediate Action |
|--------|--------------|------------------|
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `feedser:merge:precedence:ranks` to break the tie; restart merge job. |
| Rising `feedser.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
| `feedser.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
---
## 6. Resolution Playbook
1. **Connector data fix**
- Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map` etc.).
- Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
2. **Temporary precedence override**
- Edit `etc/feedser.yaml`:
```yaml
feedser:
merge:
precedence:
ranks:
osv: 1
ghsa: 0
```
- Restart Feedser workers; confirm tags in `feedser.merge.overrides` show the new ranks.
- Document the override with expiry in the change log.
3. **Alias remediation**
- Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
- Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive-coordinate with Storage before issuing).
4. **Escalation**
- If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
---
## 7. Validation Checklist
- [ ] Merge job rerun returns exit code `0`.
- [ ] `feedser.merge.conflicts` baseline returns to zero after corrective action.
- [ ] Latest `merge_event` entry shows expected hash delta.
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
---
## 8. Reference Material
- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
- Merge engine internals: `src/StellaOps.Feedser.Merge/Services/AdvisoryPrecedenceMerger.cs`.
- Metrics definitions: `src/StellaOps.Feedser.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
- Storage audit trail: `src/StellaOps.Feedser.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Feedser.Storage.Mongo/MergeEvents`.
Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
---
## 9. Synthetic Regression Fixtures
- **Locations** Canonical conflict snapshots now live at `src/StellaOps.Feedser.Source.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/StellaOps.Feedser.Source.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/StellaOps.Feedser.Source.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
- **Validation commands** To regenerate and verify the fixtures offline, run:
```bash
dotnet test src/StellaOps.Feedser.Source.Ghsa.Tests/StellaOps.Feedser.Source.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
dotnet test src/StellaOps.Feedser.Source.Nvd.Tests/StellaOps.Feedser.Source.Nvd.Tests.csproj --filter NvdConflictFixtureTests
dotnet test src/StellaOps.Feedser.Source.Osv.Tests/StellaOps.Feedser.Source.Osv.Tests.csproj --filter OsvConflictFixtureTests
dotnet test src/StellaOps.Feedser.Merge.Tests/StellaOps.Feedser.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
```
- **Expected signals** The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `feedser.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
---
## 10. Change Log
| Date (UTC) | Change | Notes |
|------------|--------|-------|
| 2025-10-16 | Ops review signed off after connector expansion (CCCS, CERT-Bund, KISA, ICS CISA, MSRC) landed. Alert thresholds from §3 reaffirmed; dashboards updated to watch attachment signals emitted by ICS CISA connector. | Ops sign-off recorded by Feedser Ops Guild; no additional overrides required. |