docs/ops/authority-backup-restore.md
@@ -80,12 +80,12 @@
    docker compose up -d
    curl -fsS http://localhost:8080/health
    ```
-6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations.
+6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations. If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`.

 ## Disaster Recovery Notes

 - **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
 - **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
-- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (key rotation tooling), and publish a revocation notice.
+- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and `docs/11_AUTHORITY.md`), and publish a revocation notice.
 - **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Restoring across major versions requires a compatibility review.

 ## Verification Checklist
docs/ops/authority-key-rotation.md (new file, 83 lines)
@@ -0,0 +1,83 @@
# Authority Signing Key Rotation Playbook

> **Status:** Authored 2025-10-12 as part of the OPS3.KEY-ROTATION rollout.
> Use together with `docs/11_AUTHORITY.md` (the Authority service guide) and the automation shipped under `ops/authority/`.

## 1. Overview

Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide:

- **Automation script:** `ops/authority/key-rotation.sh`
  A shell helper that POSTs to `/internal/signing/rotate`, supports metadata and dry runs, and confirms the JWKS afterwards.
- **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`
  A manual-dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. It works across staging/production via the `environment` input.

This playbook documents the repeatable sequence for all environments.

## 2. Prerequisites

1. **Generate a new PEM key (per environment)**

   ```bash
   openssl ecparam -name prime256v1 -genkey -noout \
     -out certificates/authority-signing-<env>-<year>.pem
   chmod 600 certificates/authority-signing-<env>-<year>.pem
   ```

2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation.
3. **Ensure secrets/vars exist in Gitea**
   - `<ENV>_AUTHORITY_BOOTSTRAP_KEY`
   - `<ENV>_AUTHORITY_URL`
   - Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY` and `AUTHORITY_URL`.
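
Before wiring the new key into the rotation call, it is worth sanity-checking the curve and deriving the public half for later comparison against the JWKS output. A minimal sketch with standard OpenSSL commands (paths follow step 1):

```bash
# Confirm the key uses the expected P-256 curve (prime256v1) for ES256.
openssl ec -in certificates/authority-signing-<env>-<year>.pem -noout -text | head -n 5

# Print the public key; compare its parameters against the JWKS entry after rotation.
openssl ec -in certificates/authority-signing-<env>-<year>.pem -pubout
```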
## 3. Executing the rotation

### Option A – via the CI workflow (recommended)

1. Navigate to **Actions → Authority Key Rotation**.
2. Provide the inputs:
   - `environment`: `staging`, `production`, etc.
   - `key_id`: the new `kid` (e.g. `authority-signing-2025-dev`).
   - `key_path`: the path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`).
   - Optional `metadata`: comma-separated `key=value` pairs (for audit trails).
3. Trigger the run. The workflow:
   - Reads the bootstrap key/URL from secrets.
   - Runs `ops/authority/key-rotation.sh`.
   - Prints the JWKS response for verification.

### Option B – manual shell invocation

```bash
AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \
./ops/authority/key-rotation.sh \
  --authority-url https://authority.example.com \
  --key-id authority-signing-2025-dev \
  --key-path ../certificates/authority-signing-2025-dev.pem \
  --meta rotatedBy=ops --meta changeTicket=OPS-1234
```

Use `--dry-run` to inspect the payload before execution.
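
For reference, the call the script wraps is roughly the following; the payload field names and the auth header shape are assumptions here, so treat `ops/authority/key-rotation.sh` as the source of truth:

```bash
# Hypothetical sketch of the rotation request; field/header names are assumptions.
curl -fsS -X POST "https://authority.example.com/internal/signing/rotate" \
  -H "Authorization: Bearer ${AUTHORITY_BOOTSTRAP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"keyId":"authority-signing-2025-dev","keyPath":"../certificates/authority-signing-2025-dev.pem","metadata":{"rotatedBy":"ops","changeTicket":"OPS-1234"}}'

# The script then re-reads /jwks to confirm the new kid is served.
curl -fsS "https://authority.example.com/jwks"
```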
## 4. Post-rotation checklist

1. Update `authority.yaml` (or the environment-specific overrides); a config sketch follows this list:
   - Set `signing.activeKeyId` to the new key.
   - Set `signing.keyPath` to the new PEM.
   - Append the previous key to `signing.additionalKeys`.
   - Ensure `keySource`/`provider` match the values passed to the script.
2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key.
3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired`.
4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired.
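
A minimal sketch of the resulting `signing` block, using the dev key ids from section 5; the exact shape of `additionalKeys` entries is an assumption, so mirror whatever your existing `authority.yaml` uses:

```yaml
signing:
  activeKeyId: "authority-signing-2025-dev"                  # new kid
  keyPath: "../certificates/authority-signing-2025-dev.pem"  # new PEM
  additionalKeys:                                            # previous key stays available for verification
    - keyId: "authority-signing-dev"
      keyPath: "../certificates/authority-signing-dev.pem"
```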
## 5. Development key state

For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key:

- Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`)
- Retired: `authority-signing-dev`

Treat these as examples; real environments must maintain their own PEM material.

## 6. References

- `docs/11_AUTHORITY.md` – Architecture and rotation SOP (Section 5).
- `docs/ops/authority-backup-restore.md` – Recovery flow referencing this playbook.
- `ops/authority/README.md` – CLI usage and examples.
docs/ops/feedser-apple-operations.md (new file, 77 lines)
@@ -0,0 +1,77 @@
# Feedser Apple Security Update Connector Operations

This runbook covers staging and production rollout for the Apple security updates connector (`source:vndr-apple:*`), including observability checks and fixture maintenance.

## 1. Prerequisites

- Network egress (or a mirrored cache) for `https://gdmf.apple.com/v2/pmv` and the Apple Support domain (`https://support.apple.com/`).
- Optional: corporate proxy exclusions for the Apple hosts if outbound traffic is normally filtered.
- Updated configuration (environment variables or `feedser.yaml`) with an `apple` section. Example baseline:

```yaml
feedser:
  sources:
    apple:
      softwareLookupUri: "https://gdmf.apple.com/v2/pmv"
      advisoryBaseUri: "https://support.apple.com/"
      localeSegment: "en-us"
      maxAdvisoriesPerFetch: 25
      initialBackfill: "120.00:00:00"
      modifiedTolerance: "02:00:00"
      failureBackoff: "00:05:00"
```

> ℹ️ `softwareLookupUri` and `advisoryBaseUri` must stay absolute and aligned with the HTTP allow-list; Feedser automatically adds both hosts to the connector HttpClient.
## 2. Staging Smoke Test

1. Deploy the configuration and restart the Feedser workers so the Apple connector options are bound.
2. Trigger a full connector cycle:
   - CLI: `stella db jobs run source:vndr-apple:fetch --and-then source:vndr-apple:parse --and-then source:vndr-apple:map`
   - REST: `POST /jobs/run { "kind": "source:vndr-apple:fetch", "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"] }`
3. Validate the metrics exported under meter `StellaOps.Feedser.Source.Vndr.Apple`:
   - `apple.fetch.items` (documents fetched)
   - `apple.fetch.failures`
   - `apple.fetch.unchanged`
   - `apple.parse.failures`
   - `apple.map.affected.count` (histogram of affected package counts)
4. Cross-check the shared HTTP counters:
   - `feedser.source.http.requests_total{feedser_source="vndr-apple"}` should increase for both index and detail phases.
   - `feedser.source.http.failures_total{feedser_source="vndr-apple"}` should remain flat (0) during a healthy run.
5. Inspect the info logs:
   - `Apple software index fetch … processed=X newDocuments=Y`
   - `Apple advisory parse complete … aliases=… affected=…`
   - `Mapped Apple advisory … pendingMappings=0`
6. Confirm the MongoDB state (a spot-check sketch follows this list):
   - The `raw_documents` store contains the HT article HTML with metadata (`apple.articleId`, `apple.postingDate`).
   - The `dtos` store has `schemaVersion="apple.security.update.v1"`.
   - The `advisories` collection includes keys `HTxxxxxx` with normalized SemVer rules.
   - The `source_states` entry for `apple` shows a recent `cursor.lastPosted`.
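
A quick way to run the step 6 spot-checks from a shell; the database name (`feedser`) and the exact field names in the queries are assumptions, while the collection names follow the bullets above:

```bash
# mongosh spot-checks for the Apple connector state; adjust DB/field names to your deployment.
mongosh feedser --quiet --eval '
  printjson(db.source_states.findOne({ source: "apple" }, { "cursor.lastPosted": 1 }));
  print("apple DTOs:", db.dtos.countDocuments({ schemaVersion: "apple.security.update.v1" }));
  printjson(db.advisories.findOne({ advisoryKey: /^HT/ }, { advisoryKey: 1 }));
'
```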
## 3. Production Monitoring

- **Dashboards** – Add the following expressions to your Feedser Grafana board (OTLP/Prometheus naming assumed):
  - `rate(apple_fetch_items_total[15m])` vs `rate(feedser_source_http_requests_total{feedser_source="vndr-apple"}[15m])`
  - `rate(apple_fetch_failures_total[5m])` for error spikes (`severity=warning` at `>0`)
  - `histogram_quantile(0.95, rate(apple_map_affected_count_bucket[1h]))` to watch affected-package fan-out
  - `increase(apple_parse_failures_total[6h])` to catch parser drift (alerts at `>0`)
- **Alerts** – Page if `rate(apple_fetch_items_total[2h]) == 0` during business hours while other connectors are active; this often indicates lookup feed failures or a misconfigured allow-list. (A packaged rule sketch follows this section.)
- **Logs** – Surface the warnings `Apple document {DocumentId} missing GridFS payload` and `Apple parse failed`; repeated hits imply storage issues or HTML regressions.
- **Telemetry pipeline** – `StellaOps.Feedser.WebService` now exports `StellaOps.Feedser.Source.Vndr.Apple` alongside the existing Feedser meters; ensure your OTEL collector or Prometheus scraper includes it.
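
The stalled-connector page above, expressed as a Prometheus alerting rule; the group name and annotation text are assumptions for your rule files:

```yaml
groups:
  - name: feedser-apple
    rules:
      - alert: AppleConnectorStalled
        expr: rate(apple_fetch_items_total[2h]) == 0
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Apple connector fetched no items for 2h; check the lookup feed and HTTP allow-list."
```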
## 4. Fixture Maintenance

Regression fixtures live under `src/StellaOps.Feedser.Source.Vndr.Apple.Tests/Apple/Fixtures`. Refresh them whenever Apple reshapes the HT layout or when new platforms appear.

1. Run the helper script matching your platform:
   - Bash: `./scripts/update-apple-fixtures.sh`
   - PowerShell: `./scripts/update-apple-fixtures.ps1`
2. Each script exports `UPDATE_APPLE_FIXTURES=1`, updates the `WSLENV` passthrough, and touches `.update-apple-fixtures` so WSL + VS Code test runs observe the flag. The subsequent test execution fetches the live HT articles listed in `AppleFixtureManager`, sanitises the HTML, and rewrites the `.expected.json` DTO snapshots.
3. Review the diff for localisation or nav noise. Once satisfied, re-run the tests without the env var (`dotnet test src/StellaOps.Feedser.Source.Vndr.Apple.Tests/StellaOps.Feedser.Source.Vndr.Apple.Tests.csproj`) to verify determinism.
4. Commit fixture updates together with any parser/mapping changes that motivated them. (A one-off refresh sketch follows this list.)
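
If you prefer to skip the helper scripts, a one-off refresh presumably reduces to setting the flag inline; this assumes the test project reads `UPDATE_APPLE_FIXTURES` directly from the environment:

```bash
# One-off fixture refresh without the helper scripts (assumption: the flag alone is honoured).
UPDATE_APPLE_FIXTURES=1 dotnet test \
  src/StellaOps.Feedser.Source.Vndr.Apple.Tests/StellaOps.Feedser.Source.Vndr.Apple.Tests.csproj

# Then verify determinism with the flag unset, as in step 3.
dotnet test src/StellaOps.Feedser.Source.Vndr.Apple.Tests/StellaOps.Feedser.Source.Vndr.Apple.Tests.csproj
```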
## 5. Known Issues & Follow-up Tasks

- Apple occasionally throttles anonymous requests after bursts. The connector backs off automatically, but persistent `apple.fetch.failures` spikes might require mirroring the HT content or scheduling wider fetch windows.
- Rapid Security Responses may appear before the general patch notes surface in the lookup JSON. When that happens, the fetch run will log `detailFailures>0`. Collect sample HTML and refresh fixtures to confirm parser coverage.
- Multi-locale content is still under regression sweep (`src/StellaOps.Feedser.Source.Vndr.Apple/TASKS.md`). Capture non-`en-us` snapshots once the fixture tooling stabilises.
docs/ops/feedser-authority-audit-runbook.md (new file, 150 lines)
@@ -0,0 +1,150 @@
# Feedser Authority Audit Runbook

_Last updated: 2025-10-12_

This runbook helps operators verify and monitor the StellaOps Feedser ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

## 1. Prerequisites

- Authority integration is enabled in `feedser.yaml` (or via `FEEDSER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
- OTLP metrics/log exporters are configured (`feedser.telemetry.*`) or container stdout is shipped to your SIEM.
- Operators have access to the Feedser job trigger endpoints via CLI or REST for smoke tests.

### Configuration snippet

```yaml
feedser:
  authority:
    enabled: true
    allowAnonymousFallback: false   # keep true only during initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://feedser"
    requiredScopes:
      - "feedser.jobs.trigger"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "feedser-jobs"
    clientSecretFile: "/run/secrets/feedser_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"
```

> Store secrets outside source control. Feedser reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.

### Resilience tuning

- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Feedser retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Feedser will fail fast but keep deterministic logs.
- Feedser resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `feedser.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.
## 2. Key Signals

### 2.1 Audit log channel

Feedser emits structured audit entries via the `Feedser.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.

```
Feedser authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=feedser-cli scopes=feedser.jobs.trigger bypass=False remote=10.1.4.7
```

| Field      | Sample value           | Meaning |
|------------|------------------------|---------|
| `route`    | `/jobs/definitions`    | Endpoint that processed the request. |
| `status`   | `200` / `401` / `409`  | Final HTTP status code returned to the caller. |
| `subject`  | `ops@example.com`      | User or service principal subject (falls back to `(anonymous)` when unauthenticated). |
| `clientId` | `feedser-cli`          | OAuth client ID provided by Authority (`(none)` if the token lacked the claim). |
| `scopes`   | `feedser.jobs.trigger` | Normalised scope list extracted from token claims; `(none)` if the token carried none. |
| `bypass`   | `True` / `False`       | Whether the request succeeded because its source IP matched a bypass CIDR. |
| `remote`   | `10.1.4.7`             | Remote IP recorded from the connection / forwarded-header test hooks. |

Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations:

- `status=401 AND bypass=True` – a bypass network accepted an unauthenticated call (should be temporary during rollout).
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
- A spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.
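
With Loki, the first combination above can be expressed roughly as follows; the stream selector label (`app="feedser"`) is an assumption for your deployment:

```
{app="feedser"} |= "Feedser authorization audit" |= "status=401" |= "bypass=True"
```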
### 2.2 Metrics

Feedser publishes counters under the OTEL meter `StellaOps.Feedser.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.

| Metric name | Description | PromQL example |
|-------------------------------|----------------------------------------------------|----------------|
| `web.jobs.triggered` | Accepted job trigger requests. | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
| `web.jobs.trigger.conflict` | Rejected triggers (already running, disabled, …). | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
| `web.jobs.trigger.failed` | Server-side job failures. | `sum(rate(web_jobs_trigger_failed_total[5m]))` |

> Prometheus/OTEL collectors typically surface counters with a `_total` suffix. Adjust queries to match your pipeline’s generated metric names.

Correlate audit logs with the following global meter exported via `Feedser.SourceDiagnostics`:

- `feedser.source.http.requests_total{feedser_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
- If Grafana dashboards are deployed, extend the “Feedser Jobs” board with the above counters plus a table of recent audit log entries.
## 3. Alerting Guidance

1. **Unauthorized bypass attempt**
   - Query: `sum(rate(log_messages_total{logger="Feedser.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`
   - Action: verify the `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.

2. **Missing scopes**
   - Query: `sum(rate(log_messages_total{logger="Feedser.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`
   - Action: audit the Authority client registration; ensure `requiredScopes` includes `feedser.jobs.trigger`.

3. **Trigger failure surge** (a packaged rule sketch follows this list)
   - Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.
   - Action: inspect correlated audit entries and `Feedser.Telemetry` traces for job execution errors.

4. **Conflict spike**
   - Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune the threshold).
   - Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.

5. **Authority offline**
   - Watch the `Feedser.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.
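
As a packaged example of alert 3, a Prometheus rule sketch; the group name and annotation text are assumptions:

```yaml
groups:
  - name: feedser-authority-audit
    rules:
      - alert: FeedserJobTriggerFailures
        expr: sum(rate(web_jobs_trigger_failed_total[10m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Feedser job triggers are failing server-side; check audit logs and Feedser.Telemetry traces."
```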
## 4. Rollout & Verification Procedure

1. **Pre-checks**
   - Confirm `allowAnonymousFallback` is `false` in production; keep it `true` only during staged validation.
   - Validate that the Authority issuer metadata is reachable from Feedser (`curl https://authority.internal/.well-known/openid-configuration` from the host).

2. **Smoke test with a valid token**
   - Obtain a token via the CLI: `stella auth login --scope feedser.jobs.trigger`.
   - Trigger a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://feedser.internal/jobs/definitions`.
   - Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=feedser.jobs.trigger`.

3. **Negative test without a token**
   - Call the same endpoint without a token. Expect HTTP 401, `bypass=False`.
   - If the request succeeds, double-check `bypassNetworks` and ensure fallback is disabled.

4. **Bypass check (if applicable)**
   - From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review the business justification and expiry date for such entries.

5. **Metrics validation**
   - Ensure the `web.jobs.triggered` counter increments during accepted runs.
   - Exporters should show corresponding spans (`feedser.job.trigger`) if tracing is enabled.
## 5. Troubleshooting

| Symptom | Probable cause | Remediation |
|---------|----------------|-------------|
| Audit log shows `clientId=(none)` for all requests | Authority not issuing the `client_id` claim, or the CLI is outdated | Update the StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`), or upgrade the CLI token acquisition flow. |
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks`, or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Feedser. |
| HTTP 401 with a valid token | `requiredScopes` missing from the client registration, or token audience mismatch | Verify the Authority client scopes (`feedser.jobs.trigger`) and ensure the token audience matches the `audiences` config. |
| Metrics missing from Prometheus | Telemetry exporters disabled, or the filter is missing the OTEL meter | Set `feedser.telemetry.enableMetrics=true` and ensure the collector includes the `StellaOps.Feedser.WebService.Jobs` meter. |
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or an Authority timeout mid-request | Inspect Feedser job logs, re-run with tracing enabled, validate Authority latency. |

## 6. References

- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
- `docs/ops/authority-monitoring.md` – Authority-side monitoring and alerting playbook.
- `StellaOps.Feedser.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of the audit log fields.
@@ -47,6 +47,12 @@ Expect all logs at `Information`. Ensure OTEL exporters include the scope `Stell
 3. **Job health**
    - `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #feedser-ops.

+### Threshold updates (2025-10-12)
+
+- `feedser.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
+- `feedser.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (the canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
+- `feedser.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3 but annotate dashboards that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
+
 ---

 ## 4. Triage Workflow
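
The paging condition added above translates to roughly this PromQL, assuming the usual dot-to-underscore counter renaming in your exporter:

```
increase(feedser_merge_conflicts_total[30m]) >= 2
```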
@@ -128,3 +134,19 @@ Expect all logs at `Information`. Ensure OTEL exporters include the scope `Stell
 - Storage audit trail: `src/StellaOps.Feedser.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Feedser.Storage.Mongo/MergeEvents`.

 Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
+
+---
+
+## 9. Synthetic Regression Fixtures
+
+- **Locations** – Canonical conflict snapshots now live at `src/StellaOps.Feedser.Source.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/StellaOps.Feedser.Source.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/StellaOps.Feedser.Source.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
+- **Validation commands** – To regenerate and verify the fixtures offline, run:
+
+  ```bash
+  dotnet test src/StellaOps.Feedser.Source.Ghsa.Tests/StellaOps.Feedser.Source.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
+  dotnet test src/StellaOps.Feedser.Source.Nvd.Tests/StellaOps.Feedser.Source.Nvd.Tests.csproj --filter NvdConflictFixtureTests
+  dotnet test src/StellaOps.Feedser.Source.Osv.Tests/StellaOps.Feedser.Source.Osv.Tests.csproj --filter OsvConflictFixtureTests
+  dotnet test src/StellaOps.Feedser.Merge.Tests/StellaOps.Feedser.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
+  ```
+
+- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `feedser.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
docs/ops/feedser-cve-kev-grafana-dashboard.json (new file, 151 lines)
@@ -0,0 +1,151 @@
{
  "title": "Feedser CVE & KEV Observability",
  "uid": "feedser-cve-kev",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "refresh": "5m",
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0
      }
    ]
  },
  "panels": [
    {
      "type": "timeseries",
      "title": "CVE fetch success vs failure",
      "gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "custom": {
            "drawStyle": "line",
            "lineWidth": 2,
            "fillOpacity": 10
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "rate(cve_fetch_success_total[5m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "success"
        },
        {
          "refId": "B",
          "expr": "rate(cve_fetch_failures_total[5m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "failure"
        }
      ]
    },
    {
      "type": "timeseries",
      "title": "KEV fetch cadence",
      "gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "custom": {
            "drawStyle": "line",
            "lineWidth": 2,
            "fillOpacity": 10
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "rate(kev_fetch_success_total[30m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "success"
        },
        {
          "refId": "B",
          "expr": "rate(kev_fetch_failures_total[30m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "failure"
        },
        {
          "refId": "C",
          "expr": "rate(kev_fetch_unchanged_total[30m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "unchanged"
        }
      ]
    },
    {
      "type": "table",
      "title": "KEV parse anomalies (24h)",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 9 },
      "fieldConfig": {
        "defaults": {
          "unit": "short"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (reason) (increase(kev_parse_anomalies_total[24h]))",
          "format": "table",
          "datasource": { "type": "prometheus", "uid": "${datasource}" }
        }
      ],
      "transformations": [
        {
          "id": "organize",
          "options": {
            "renameByName": {
              "Value": "count"
            }
          }
        }
      ]
    },
    {
      "type": "timeseries",
      "title": "Advisories emitted",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 9 },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "custom": {
            "drawStyle": "line",
            "lineWidth": 2,
            "fillOpacity": 10
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "rate(cve_map_success_total[15m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "CVE"
        },
        {
          "refId": "B",
          "expr": "rate(kev_map_advisories_total[24h])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "KEV"
        }
      ]
    }
  ]
}
@@ -34,19 +34,22 @@ feedser:
    - Feedser CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
    - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
 3. Observe the following metrics (exported via OTEL meter `StellaOps.Feedser.Source.Cve`):
-   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.failures`, `cve.fetch.unchanged`
+   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged`
    - `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
    - `cve.map.success`
-4. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
+4. Verify Prometheus shows matching `feedser.source.http.requests_total{feedser_source="cve"}` deltas (list vs detail phases) while `feedser.source.http.failures_total{feedser_source="cve"}` stays flat.
+5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`.
+6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.

 ### 1.3 Production Monitoring

-- **Dashboards** – Add the counters above plus `feedser.range.primitives` (filtered by `scheme=semver` or `scheme=vendor`) to the Feedser overview board. Alert when:
-  - `rate(cve.fetch.failures[5m]) > 0`
-  - `rate(cve.map.success[15m]) == 0` while fetch attempts continue
-  - `sum_over_time(cve.parse.quarantine[1h]) > 0`
-- **Logs** – Watch for `CveConnector` warnings such as `Failed fetching CVE record` or schema validation errors (`Malformed CVE JSON`). These are emitted with the CVE ID and document identifier for triage.
-- **Backfill window** – operators can tighten or widen the `initialBackfill` / `maxPagesPerFetch` values after validating baseline throughput. Update the config and restart the worker to apply changes.
+- **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `feedser_source_http_requests_total{feedser_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `feedser.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts:
+  - `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`)
+  - `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`)
+  - `sum_over_time(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies
+- **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests.
+- **Grafana pack** – Import `docs/ops/feedser-cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout.
+- **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Feedser to apply changes.

 ## 2. CISA KEV Connector (`source:kev:*`)
@@ -67,7 +70,15 @@ feedser:

 ### 2.2 Schema validation & anomaly handling

-From this sprint the connector validates the KEV JSON payload against `Schemas/kev-catalog.schema.json`. Malformed documents are quarantined, and entries missing a CVE ID are dropped with a warning (`reason=missingCveId`). Operators should treat repeated schema failures as an upstream regression and coordinate with CISA or mirror maintainers.
+The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev.parse.failures_total{reason="schema"}` and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev.parse.anomalies_total` with reasons:
+
+| Reason | Meaning |
+| --- | --- |
+| `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. |
+| `countMismatch` | Catalog `count` field disagreed with the actual entry total. |
+| `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). |
+
+Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.

 ### 2.3 Smoke Test (staging)
@@ -79,13 +90,16 @@ From this sprint the connector validates the KEV JSON payload against `Schemas/k
    - `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
    - `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
    - `kev.map.advisories` (tag `catalogVersion`)
-4. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.
+4. Confirm `feedser.source.http.requests_total{feedser_source="kev"}` increments once per fetch and that the paired `feedser.source.http.failures_total` stays flat (zero increase).
+5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`; they should appear exactly once per run, and `Mapped X/Y… skipped=0` should match the `kev.map.advisories` delta.
+6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.

 ### 2.4 Production Monitoring

-- Alert when `kev.fetch.success` goes to zero for longer than the expected daily cadence (default: trigger if `rate(kev.fetch.success[8h]) == 0` during business hours).
-- Track anomaly spikes via `kev.parse.anomalies{reason="missingCveId"}`. A sustained non-zero rate means the upstream catalog contains unexpected records.
-- The connector logs each validated catalog: `Parsed KEV catalog document … entries=X`. Absence of that log alongside consecutive `kev.fetch.success` counts suggests schema validation failures; correlate with warning-level events in the `StellaOps.Feedser.Source.Kev` logger.
+- Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`.
+- Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`; this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as networking issues to the mirror.
+- Track anomaly spikes through `sum_over_time(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs.
+- Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.

 ### 2.5 Known good dashboard tiles
@@ -93,12 +107,14 @@ Add the following panels to the Feedser observability board:

 | Metric | Recommended visualisation |
 |--------|---------------------------|
-| `kev.fetch.success` | Single-stat (last 24 h) with threshold alert |
-| `rate(kev.parse.entries[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
-| `sum_over_time(kev.parse.anomalies[1d])` by `reason` | Table – anomaly breakdown |
+| `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` |
+| `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
+| `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) |
+| `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted |

 ## 3. Runbook updates

 - Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
 - Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
 - Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
+- Version-control dashboard tweaks alongside `docs/ops/feedser-cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores.
docs/ops/feedser-ghsa-operations.md (new file, 111 lines)
@@ -0,0 +1,111 @@
# Feedser GHSA Connector – Operations Runbook

_Last updated: 2025-10-12_

## 1. Overview

The GitHub Security Advisories (GHSA) connector pulls advisory metadata from the GitHub REST API `/security/advisories` endpoint. GitHub enforces both primary and secondary rate limits, so operators must monitor usage and configure retries to avoid throttling incidents.

## 2. Rate-limit telemetry

The connector now surfaces rate-limit headers on every fetch and exposes the following metrics via OpenTelemetry:

| Metric | Description | Tags |
|--------|-------------|------|
| `ghsa.ratelimit.limit` (histogram) | Samples the reported request quota at fetch time. | `phase` = `list` or `detail`, `resource` (e.g., `core`). |
| `ghsa.ratelimit.remaining` (histogram) | Remaining requests returned by `X-RateLimit-Remaining`. | `phase`, `resource`. |
| `ghsa.ratelimit.reset_seconds` (histogram) | Seconds until `X-RateLimit-Reset`. | `phase`, `resource`. |
| `ghsa.ratelimit.exhausted` (counter) | Incremented whenever GitHub returns a zero remaining quota and the connector delays before retrying. | `phase`. |

### Dashboards & alerts
- Plot `ghsa.ratelimit.remaining` as the latest value to watch the runway. Alert when the value stays below **`RateLimitWarningThreshold`** (default `500`) for more than 5 minutes.
- Raise a separate alert on `increase(ghsa.ratelimit.exhausted[15m]) > 0` to catch hard throttles.
- Overlay `ghsa.fetch.attempts` vs `ghsa.fetch.failures` to confirm retries are effective. (A packaged rule sketch follows.)
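
Those two alert conditions can ship as a Prometheus rule group; the metric names assume the exporter's dot-to-underscore renaming, and the group/label names are placeholders:

```yaml
groups:
  - name: feedser-ghsa
    rules:
      - alert: GhsaRateLimitLow
        # Median remaining quota below the default warning threshold for 5 minutes.
        expr: histogram_quantile(0.5, sum by (le) (rate(ghsa_ratelimit_remaining_bucket[5m]))) < 500
        for: 5m
        labels:
          severity: warning
      - alert: GhsaRateLimitExhausted
        expr: increase(ghsa_ratelimit_exhausted_total[15m]) > 0
        labels:
          severity: critical
```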
## 3. Logging signals

When `X-RateLimit-Remaining` falls below `RateLimitWarningThreshold`, the connector emits:

```
GHSA rate limit warning: remaining {Remaining}/{Limit} for {Phase} {Resource}
```

When GitHub reports zero remaining calls, the connector logs and sleeps for the reported `Retry-After`/`X-RateLimit-Reset` interval (falling back to `SecondaryRateLimitBackoff`).

## 4. Configuration knobs (`feedser.yaml`)

```yaml
feedser:
  sources:
    ghsa:
      apiToken: "${GITHUB_PAT}"
      pageSize: 50
      requestDelay: "00:00:00.200"
      failureBackoff: "00:05:00"
      rateLimitWarningThreshold: 500        # warn below this many remaining calls
      secondaryRateLimitBackoff: "00:02:00" # fallback delay when GitHub omits Retry-After
```

### Recommendations
- Increase `requestDelay` in air-gapped or burst-heavy deployments to smooth token consumption.
- Lower `rateLimitWarningThreshold` only if your dashboards already page on the new histogram; never set it negative.
- For bots using a low-privilege PAT, keep `secondaryRateLimitBackoff` at ≥ 60 seconds to respect GitHub’s secondary-limit guidance.

#### Default job schedule

| Job kind | Cron | Timeout | Lease |
|----------|------|---------|-------|
| `source:ghsa:fetch` | `1,11,21,31,41,51 * * * *` | 6 minutes | 4 minutes |
| `source:ghsa:parse` | `3,13,23,33,43,53 * * * *` | 5 minutes | 4 minutes |
| `source:ghsa:map` | `5,15,25,35,45,55 * * * *` | 5 minutes | 4 minutes |

These defaults spread the GHSA stages across the hour so fetch completes before parse/map fire. Override them via `feedser.jobs.definitions[...]` when coordinating multiple connectors on the same runner (see the sketch below).
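
An override sketch for `feedser.yaml`; the exact shape of `feedser.jobs.definitions` entries is an assumption based on the key named above, so verify against your deployment's schema:

```yaml
feedser:
  jobs:
    definitions:
      "source:ghsa:fetch":
        cron: "7,17,27,37,47,57 * * * *"   # shift fetch so it does not collide with another connector
        timeout: "00:06:00"
        lease: "00:04:00"
```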
## 5. Provisioning credentials

Feedser requires a GitHub personal access token (classic) with the **`read:org`** and **`security_events`** scopes to pull GHSA data. Store it as a secret and reference it via `feedser.sources.ghsa.apiToken`.

### Docker Compose (stack operators)
```yaml
services:
  feedser:
    environment:
      FEEDSER__SOURCES__GHSA__APITOKEN: /run/secrets/ghsa_pat
    secrets:
      - ghsa_pat

secrets:
  ghsa_pat:
    file: ./secrets/ghsa_pat.txt # contains only the PAT value
```

### Helm values (cluster operators)
```yaml
feedser:
  extraEnv:
    - name: FEEDSER__SOURCES__GHSA__APITOKEN
      valueFrom:
        secretKeyRef:
          name: feedser-ghsa
          key: apiToken

extraSecrets:
  feedser-ghsa:
    apiToken: "<paste PAT here or source from external secret store>"
```

After rotating the PAT, restart the Feedser workers (or run `kubectl rollout restart deployment/feedser`) to ensure the configuration reloads.
When enabling GHSA for the first time, run a staged backfill:

1. Trigger `source:ghsa:fetch` manually (CLI or API) outside of peak hours.
2. Watch `feedser.jobs.health` for the GHSA jobs until they report `healthy`.
3. Allow the scheduled cron cadence to resume once the initial backlog drains (typically < 30 minutes).

## 6. Runbook steps when throttled
1. Check `ghsa.ratelimit.exhausted` for the affected phase (`list` vs `detail`).
2. Confirm the connector is delaying: the logs will show `GHSA rate limit exhausted...` with the chosen backoff.
3. If rate limits stay exhausted:
   - Verify no other jobs are sharing the PAT.
   - Temporarily reduce `maxPagesPerFetch` or `pageSize` to shrink burst size.
   - Consider provisioning a dedicated PAT (GHSA permissions only) for Feedser.
4. After the quota resets, restore `rateLimitWarningThreshold`/`requestDelay` to their normal values and monitor the histograms for at least one hour.
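
While diagnosing, you can also query the PAT's quota out-of-band via GitHub's rate-limit endpoint, which does not itself count against the core limit:

```bash
# Inspect the PAT's remaining quota directly; look at the "core" resource.
curl -fsS -H "Authorization: Bearer ${GITHUB_PAT}" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/rate_limit | jq '.resources.core'
```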
## 7. Alert integration quick reference
- Prometheus: `ghsa_ratelimit_remaining_bucket` (from the histogram) – use `histogram_quantile(0.99, ...)` to trend capacity.
- VictoriaMetrics: `last_over_time(ghsa_ratelimit_remaining_sum[5m])` for simple last-value graphs.
- Grafana: stack remaining + used to visualise the total limit per resource.