Initial commit (history squashed)

`docs/ops/authority-backup-restore.md` (new file, 97 lines)
# Authority Backup & Restore Runbook

## Scope

- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24 h in production. Store snapshots in an encrypted, access-controlled vault.

## Inventory Checklist

| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |

> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.
## Hot Backup (no downtime)

1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**

   ```bash
   STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
   docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
     mongodump --archive=/dump/authority-"$STAMP".gz \
     --gzip --db stellaops-authority
   docker compose -f ops/authority/docker-compose.authority.yaml cp \
     mongo:/dump/authority-"$STAMP".gz backup/
   ```

   The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`. Capturing the timestamp once in `STAMP` keeps the dump and the copy pointing at the same file.

3. **Capture configuration + manifests:**

   ```bash
   cp etc/authority.yaml backup/
   rsync -a etc/authority.plugins/ backup/authority.plugins/
   ```

4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:

   ```bash
   docker run --rm \
     -v authority-keys:/keys \
     -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date -u +%Y%m%dT%H%M%SZ).tar.gz -C /keys .
   ```

5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts.
6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault.
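Step 5 can be scripted so the digests are generated and verified in one pass. A minimal sketch (the `backup/` layout and the placeholder file are assumptions for illustration):

```shell
# Hash every artefact in today's backup folder into a SHA256SUMS manifest,
# then verify the manifest before encrypting/uploading.
BACKUP_DIR="backup/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"
echo demo > "$BACKUP_DIR/authority.yaml"   # placeholder artefact so the sketch runs standalone
(
  cd "$BACKUP_DIR"
  find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
  sha256sum -c SHA256SUMS                  # exits non-zero on any mismatch
)
```

Keep `SHA256SUMS` inside the encrypted bundle so integrity can be re-checked after transport to the vault.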
## Cold Backup (planned downtime)

1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:

   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml down
   ```

3. Back up volumes directly using `tar`:

   ```bash
   docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
   ```

4. Copy configuration + manifests as in the hot backup (steps 3–6).
5. Restart services and verify health:

   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml up -d
   curl -fsS http://localhost:8080/ready
   ```
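Before shipping the archives offsite (or relying on them for restore), it is worth confirming each tarball is a readable gzip archive. A sketch using a stand-in archive (the demo paths are illustrative; substitute the real `/backup/*.tar.gz` files):

```shell
# Build a throwaway archive that stands in for mongo-data-YYYYMMDD.tar.gz,
# then list its entries without extracting. A truncated/corrupt archive fails here.
mkdir -p demo-volume backup
echo 'collection data' > demo-volume/records.bson
tar czf backup/mongo-data-demo.tar.gz -C demo-volume .
tar -tzf backup/mongo-data-demo.tar.gz > /dev/null && echo "archive OK"
```

Listing with `tar -tzf` reads the whole stream, so it catches truncation that a mere file-size check would miss.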
## Restore Procedure

1. **Provision clean volumes:** remove existing volumes if you’re rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**

   ```bash
   docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
   ```

   Use `--drop` to replace collections; omit it for a partial restore.

3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:

   ```bash
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
   ```

   Ensure private key files remain `600`. Note that a blanket `chmod -R 600` would strip the execute bit directories need; `chmod -R u=rwX,go=` keeps directories traversable while locking files down.

5. **Start services & validate:**

   ```bash
   docker compose up -d
   curl -fsS http://localhost:8080/health
   ```

6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations.
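The permission note in step 4 can be exercised locally before touching the real volume. A sketch with throwaway paths (everything here is a stand-in for the `authority-keys` contents):

```shell
# Restore a key archive into a scratch directory, then normalise permissions:
# files become 600, directories stay traversable (700).
mkdir -p keys-src/sub restore-keys
echo 'PRIVATE KEY MATERIAL' > keys-src/sub/signing.pem
tar czf authority-keys-demo.tar.gz -C keys-src .
tar xzf authority-keys-demo.tar.gz -C restore-keys
# u=rwX: owner read/write everywhere, execute only on directories;
# go=: strip all group/other access.
chmod -R u=rwX,go= restore-keys
stat -c '%a %n' restore-keys/sub restore-keys/sub/signing.pem
```

The `X` (capital) permission is what makes this safe to run recursively: it grants execute only where it makes sense, unlike a literal `chmod -R 600` or `700`.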
## Disaster Recovery Notes

- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (key rotation tooling), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Restoring across major versions requires a compatibility review.

## Verification Checklist

- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).
`docs/ops/authority-grafana-dashboard.json` (new file, 174 lines)
```json
{
  "title": "StellaOps Authority - Token & Access Monitoring",
  "uid": "authority-token-monitoring",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0,
        "current": {}
      }
    ]
  },
  "panels": [
    {
      "id": 1,
      "title": "Token Requests – Success vs Failure",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{grant_type}} ({{status}})"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
          "legendFormat": "{{grant_type}} {{status}}"
        }
      ],
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 2,
      "title": "Rate Limiter Rejections",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{limiter}}"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
          "legendFormat": "{{limiter}}"
        }
      ]
    },
    {
      "id": 3,
      "title": "Bypass Events (5m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 4,
      "title": "Lockout Events (15m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 5 },
              { "color": "red", "value": 10 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 5,
      "title": "Trace Explorer Shortcut",
      "type": "text",
      "options": {
        "mode": "markdown",
        "content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
      }
    }
  ],
  "links": []
}
```
`docs/ops/authority-monitoring.md` (new file, 81 lines)
# Authority Monitoring & Alerting Playbook

## Telemetry Sources

- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).
## Prometheus Metrics to Collect

| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` arrives via the `authority.grant_type` span attribute → exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when > 5 % for 10 min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate-limiter saturation (requires the OTEL ASP.NET rate-limiting exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from a Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |

> **Exporter note:** Enable the `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the series shown above.
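The remap mentioned in the exporter note could be done with the Collector's `transform` processor. A hedged sketch only; OTTL statement syntax varies across Collector releases, so validate against your deployed version:

```yaml
processors:
  transform/token_metrics:
    metric_statements:
      - context: metric
        statements:
          # Rename the histogram series so dashboards can query a stable name;
          # the Prometheus exporter then emits token_requests_total from its count.
          - set(name, "token_requests") where name == "http.server.request.duration"
```

Wire `transform/token_metrics` into the `metrics` pipeline alongside `batch` (see the Collector snippet later in this playbook).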
## Alert Rules

1. **Token Failure Surge**
   - _Expression_: `token_failure_ratio > 0.05`
   - _For_: `10m`
   - _Labels_: `severity="critical"`
   - _Annotations_: include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as a diagnostic hint (requires the span → metric transformation).
2. **Lockout Spike**
   - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
   - _For_: `15m`
   - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
   - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
   - _For_: `5m`
   - Alert severity `warning`; verify the calling host list.
4. **Rate Limiter Saturation**
   - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.
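Rule 1 above depends on `token_failure_ratio` existing as a series, which is most naturally done with a Prometheus recording rule. A sketch of a rule file (group and alert names are illustrative):

```yaml
groups:
  - name: authority-token-alerts
    rules:
      # Recording rule so the alert can reference a stable series name.
      - record: token_failure_ratio
        expr: >
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority",
            http_route="/token", http_status_code=~"4..|5.."}[5m]))
          /
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority",
            http_route="/token"}[5m]))
      - alert: AuthorityTokenFailureSurge
        expr: token_failure_ratio > 0.05
        for: 10m
        labels:
          severity: critical
```

The remaining rules follow the same shape, substituting the expressions listed in items 2–4.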
## Grafana Dashboard

- Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels:
  - **Token Success vs Failure** – stacked rate visualization split by grant type.
  - **Rate Limiter Hits** – bar chart showing `authority-token` and `authority-authorize`.
  - **Bypass & Lockout Events** – dual-stat panel using Loki-derived counters.
  - **Trace Explorer Link** – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
## Collector Configuration Snippets

```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"   # adjust to your Loki ingest endpoint
processors:
  batch:
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

Note the `logs` pipeline references a `loki` exporter, so it must be declared under `exporters` (as above) or the Collector will fail config validation.
## Operational Checklist

- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include the dashboard JSON in the Offline Kit for air-gapped clusters; update the version header when metrics change.
`docs/ops/feedser-conflict-resolution.md` (new file, 130 lines)
# Feedser Conflict Resolution Runbook (Sprint 3)

This runbook equips Feedser operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine has landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.

---

## 1. Precedence Model (recap)

- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `feedser:merge:precedence:ranks` to override per source when incident response requires it.
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources, plus a `merge_event` record with canonical before/after SHA-256 hashes.

---
## 2. Telemetry Shipped This Sprint

| Instrument | Type | Key Tags | Purpose |
|------------|------|----------|---------|
| `feedser.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
| `feedser.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
| `feedser.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
| `feedser.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
| `feedser.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |

### Structured logs

- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
- `Alias collision ...` (no EventId) - emitted when `feedser.merge.identity_conflicts` increments.

Expect all of these logs at `Information` level. Ensure OTEL exporters include the scope `StellaOps.Feedser.Merge`.

---
## 3. Detection & Alerting

1. **Dashboard panels**
   - `feedser.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15-minute window.
   - `feedser.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
   - `feedser.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
   - `feedser.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
2. **Log-based alerts**
   - `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
   - `eventId=1002` with `reason="mismatch"` - severity disagreement; open a connector bug if sustained.
3. **Job health**
   - A `stellaops-cli db merge` exit code of `1` signifies unresolved conflicts. Pipe it to automation that captures logs and notifies #feedser-ops.
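The panel thresholds in item 1 translate into a Prometheus rule along these lines. A sketch only: the exact exported series name depends on how your exporter maps the OTEL counter `feedser.merge.conflicts` (here assumed to become `feedser_merge_conflicts_total`):

```yaml
groups:
  - name: feedser-merge-alerts
    rules:
      - alert: FeedserMergeConflicts
        # any conflict inside a 15-minute window warrants operator review
        expr: sum by (type, reason) (increase(feedser_merge_conflicts_total[15m])) > 0
        labels:
          severity: warning
        annotations:
          summary: "Feedser merge produced {{ $value }} conflicts ({{ $labels.type }}/{{ $labels.reason }})"
```

The identity-conflict alert follows the same pattern with a `[1d]` window and a `> 1` threshold.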
---

## 4. Triage Workflow

1. **Confirm job context**
   - Run `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
2. **Inspect metrics**
   - Correlate spikes in `feedser.merge.conflicts` with the `primary_source`/`suppressed_source` tags from `feedser.merge.overrides`.
3. **Pull structured logs**
   - Example (vector output):

     ```
     jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-feedser.log
     ```

4. **Review merge events**
   - `mongosh`:

     ```javascript
     use feedser;
     db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
     ```

   - Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
5. **Interrogate provenance**
   - `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
   - Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
---

## 5. Conflict Classification Matrix

| Signal | Likely Cause | Immediate Action |
|--------|--------------|------------------|
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `feedser:merge:precedence:ranks` to break the tie; restart merge job. |
| Rising `feedser.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
| `feedser.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect the `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
---

## 6. Resolution Playbook

1. **Connector data fix**
   - Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map`, etc.).
   - Once fixed, rerun the merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
2. **Temporary precedence override**
   - Edit `etc/feedser.yaml`:

     ```yaml
     feedser:
       merge:
         precedence:
           ranks:
             osv: 1
             ghsa: 0
     ```

   - Restart Feedser workers; confirm tags in `feedser.merge.overrides` show the new ranks.
   - Document the override, with an expiry, in the change log.
3. **Alias remediation**
   - Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
   - Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive; coordinate with Storage before issuing it).
4. **Escalation**
   - If override metrics spike due to an upstream regression, open an incident with the Security Guild, referencing merge logs and `merge_event` IDs.
---

## 7. Validation Checklist

- [ ] Merge job rerun returns exit code `0`.
- [ ] `feedser.merge.conflicts` returns to its zero baseline after corrective action.
- [ ] Latest `merge_event` entry shows the expected hash delta.
- [ ] The affected advisory document shows updated `provenance[].decisionReason` values.
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.

---

## 8. Reference Material

- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
- Merge engine internals: `src/StellaOps.Feedser.Merge/Services/AdvisoryPrecedenceMerger.cs`.
- Metrics definitions: `src/StellaOps.Feedser.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
- Storage audit trail: `src/StellaOps.Feedser.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Feedser.Storage.Mongo/MergeEvents`.

Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
`docs/ops/feedser-cve-kev-operations.md` (new file, 104 lines)
# Feedser CVE & KEV Connector Operations

This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.

## 1. CVE Services Connector (`source:cve:*`)

### 1.1 Prerequisites

- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Feedser workers.
- An updated `feedser.yaml` (or the matching environment variables) with the following section:

```yaml
feedser:
  sources:
    cve:
      baseEndpoint: "https://cveawg.mitre.org/api/"
      apiOrg: "ORG123"
      apiUser: "user@example.org"
      apiKeyFile: "/var/run/secrets/feedser/cve-api-key"
      pageSize: 200
      maxPagesPerFetch: 5
      initialBackfill: "30.00:00:00"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:10:00"
```

> ℹ️ Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively, supply `apiKey` via `FEEDSER_SOURCES__CVE__APIKEY`.
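A minimal sketch of the secret-file pattern from the note above. The temp directory stands in for the real secrets mount, and `example-api-key` is a placeholder value:

```shell
# Stand-in for /var/run/secrets/feedser: a mounted secrets directory.
SECRET_DIR="$(mktemp -d)"
printf '%s' 'example-api-key' > "$SECRET_DIR/cve-api-key"
chmod 600 "$SECRET_DIR/cve-api-key"   # keep the key unreadable to other users
# Environment-variable alternative: double underscores map to the config
# hierarchy, so FEEDSER_SOURCES__CVE__APIKEY overrides feedser.sources.cve.apiKey.
export FEEDSER_SOURCES__CVE__APIKEY="$(cat "$SECRET_DIR/cve-api-key")"
```

Prefer the file form in production; the environment-variable form is convenient for ad-hoc staging runs but leaks more easily via process listings and crash dumps.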
### 1.2 Smoke Test (staging)

1. Deploy the updated configuration and restart the Feedser service so the connector picks up the credentials.
2. Trigger one end-to-end cycle:
   - Feedser CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
   - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
3. Observe the following metrics (exported via OTEL meter `StellaOps.Feedser.Source.Cve`):
   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.failures`, `cve.fetch.unchanged`
   - `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
   - `cve.map.success`
4. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.

### 1.3 Production Monitoring

- **Dashboards** – Add the counters above plus `feedser.range.primitives` (filtered by `scheme=semver` or `scheme=vendor`) to the Feedser overview board. Alert when:
  - `rate(cve.fetch.failures[5m]) > 0`
  - `rate(cve.map.success[15m]) == 0` while fetch attempts continue
  - `sum_over_time(cve.parse.quarantine[1h]) > 0`
- **Logs** – Watch for `CveConnector` warnings such as `Failed fetching CVE record` or schema validation errors (`Malformed CVE JSON`). These are emitted with the CVE ID and document identifier for triage.
- **Backfill window** – Operators can tighten or widen the `initialBackfill` / `maxPagesPerFetch` values after validating baseline throughput. Update the config and restart the worker to apply changes.
## 2. CISA KEV Connector (`source:kev:*`)

### 2.1 Prerequisites

- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
- Confirm the following snippet in `feedser.yaml` (defaults shown; tune as needed):

```yaml
feedser:
  sources:
    kev:
      feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
      requestTimeout: "00:01:00"
      failureBackoff: "00:05:00"
```

### 2.2 Schema validation & anomaly handling

As of this sprint the connector validates the KEV JSON payload against `Schemas/kev-catalog.schema.json`. Malformed documents are quarantined, and entries missing a CVE ID are dropped with a warning (`reason=missingCveId`). Operators should treat repeated schema failures as an upstream regression and coordinate with CISA or mirror maintainers.
### 2.3 Smoke Test (staging)

1. Deploy the configuration and restart Feedser.
2. Trigger a pipeline run:
   - CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
   - REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
3. Verify the metrics exposed by meter `StellaOps.Feedser.Source.Kev`:
   - `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
   - `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
   - `kev.map.advisories` (tag `catalogVersion`)
4. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with the prefix `kev/` are written.
### 2.4 Production Monitoring

- Alert when `kev.fetch.success` drops to zero for longer than the expected daily cadence (default: trigger if `rate(kev.fetch.success[8h]) == 0` during business hours).
- Track anomaly spikes via `kev.parse.anomalies{reason="missingCveId"}`. A sustained non-zero rate means the upstream catalog contains unexpected records.
- The connector logs each validated catalog: `Parsed KEV catalog document … entries=X`. Absence of that log alongside consecutive `kev.fetch.success` counts suggests schema validation failures; correlate with warning-level events in the `StellaOps.Feedser.Source.Kev` logger.
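The two alerting bullets above can be expressed as Prometheus rules. A sketch under the assumption that the OTEL counters export as `kev_fetch_success_total` and `kev_parse_anomalies_total` (confirm the names your exporter actually emits):

```yaml
groups:
  - name: feedser-kev-alerts
    rules:
      - alert: KevFetchStalled
        # no successful fetch inside the expected daily cadence window
        expr: increase(kev_fetch_success_total[8h]) == 0
        for: 1h
        labels:
          severity: warning
      - alert: KevMissingCveIds
        # sustained anomalies mean the upstream catalog shape changed
        expr: increase(kev_parse_anomalies_total{reason="missingCveId"}[1d]) > 0
        labels:
          severity: warning
```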
### 2.5 Known good dashboard tiles

Add the following panels to the Feedser observability board:

| Metric | Recommended visualisation |
|--------|---------------------------|
| `kev.fetch.success` | Single-stat (last 24 h) with threshold alert |
| `rate(kev.parse.entries[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
| `sum_over_time(kev.parse.anomalies[1d])` by `reason` | Table – anomaly breakdown |
## 3. Runbook updates

- Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
`docs/ops/migrations/SEMVER_STYLE.md` (new file, 50 lines)
# SemVer Style Backfill Runbook

_Last updated: 2025-10-11_

## Overview

The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures provenance `decisionReason` values are preserved during future reads. The migration is idempotent and only runs when the feature flag `feedser:storage:enableSemVerStyle` is enabled.

## Preconditions

1. **Review configuration** – set `feedser.storage.enableSemVerStyle` to `true` on all Feedser services.
2. **Confirm batch size** – adjust `feedser.storage.backfillBatchSize` if you need smaller batches for older deployments (default: `250`).
3. **Back up** – capture a fresh snapshot of the `advisory` collection or a full MongoDB backup.
4. **Staging dry-run** – enable the flag in a staging environment and observe the migration output before rolling to production.
## Execution

No manual command is required. After deploying the configuration change, restart the Feedser WebService or any component that hosts the Mongo migration runner. During startup you will see log entries similar to:

```
Applying Mongo migration 20251011-semver-style-backfill: Populate advisory.normalizedVersions for existing documents when SemVer style storage is enabled.
Mongo migration 20251011-semver-style-backfill applied
```

The migration reads advisories in batches (`feedser.storage.backfillBatchSize`) and writes flattened `normalizedVersions` arrays. Existing documents without SemVer ranges remain untouched.
## Post-checks

1. Verify the new indexes exist:

   ```
   db.advisory.getIndexes()
   ```

   You should see `advisory_normalizedVersions_pkg_scheme_type` and `advisory_normalizedVersions_value`.

2. Spot check a few advisories to confirm the top-level `normalizedVersions` array exists and matches the embedded package data.
3. Run `dotnet test` for `StellaOps.Feedser.Storage.Mongo.Tests` (optional but recommended) in CI to confirm the storage suite passes with the feature flag enabled.
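The spot check in step 2 might look like this in `mongosh`. The projection fields beyond `normalizedVersions` are assumptions about the document shape, so adjust to what your deployment actually stores:

```javascript
// Sample a handful of migrated advisories and compare the flattened
// top-level array against the embedded package data by eye.
use feedser;
db.advisory
  .find({ normalizedVersions: { $exists: true, $ne: [] } },
        { advisoryKey: 1, normalizedVersions: 1, "affectedPackages.provenance": 1 })
  .limit(5);
```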
## Rollback

Set `feedser.storage.enableSemVerStyle` back to `false` and redeploy. The migration will be skipped on subsequent startups. You can leave the populated `normalizedVersions` arrays in place; they are ignored when the feature flag is off. If you must remove them entirely, restore from the backup captured during preparation.