docs/ops/authority-backup-restore.md
@@ -80,12 +80,12 @@
    docker compose up -d
    curl -fsS http://localhost:8080/health
    ```
-6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations.
+6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations. If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`.

 ## Disaster Recovery Notes

 - **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
 - **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
-- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (key rotation tooling), and publish a revocation notice.
+- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and `docs/11_AUTHORITY.md`), and publish a revocation notice.
 - **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Restoring across major versions requires a compatibility review.

 ## Verification Checklist
docs/ops/authority-key-rotation.md (new file, 83 lines)
@@ -0,0 +1,83 @@
# Authority Signing Key Rotation Playbook

> **Status:** Authored 2025-10-12 as part of the OPS3.KEY-ROTATION rollout.
> Use together with `docs/11_AUTHORITY.md` (the Authority service guide) and the automation shipped under `ops/authority/`.

## 1. Overview

Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide:

- **Automation script:** `ops/authority/key-rotation.sh`
  A shell helper that POSTs to `/internal/signing/rotate`, supports metadata and dry runs, and confirms the JWKS afterwards.
- **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`
  A manual-dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. It works across staging/production via the `environment` input.

This playbook documents the repeatable sequence for all environments.

## 2. Prerequisites

1. **Generate a new PEM key (per environment)**

   ```bash
   openssl ecparam -name prime256v1 -genkey -noout \
     -out certificates/authority-signing-<env>-<year>.pem
   chmod 600 certificates/authority-signing-<env>-<year>.pem
   ```

2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation.
3. **Ensure secrets/vars exist in Gitea**
   - `<ENV>_AUTHORITY_BOOTSTRAP_KEY`
   - `<ENV>_AUTHORITY_URL`
   - Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY` and `AUTHORITY_URL`.
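
Before wiring the new key into the rotation call, it is worth sanity-checking the curve and deriving the public half for later comparison against the JWKS output. A minimal sketch with standard OpenSSL commands (paths follow step 1):

```bash
# Confirm the key uses the expected P-256 curve (prime256v1) for ES256.
openssl ec -in certificates/authority-signing-<env>-<year>.pem -noout -text | head -n 5

# Print the public key; compare its parameters against the JWKS entry after rotation.
openssl ec -in certificates/authority-signing-<env>-<year>.pem -pubout
```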
## 3. Executing the rotation

### Option A – via the CI workflow (recommended)

1. Navigate to **Actions → Authority Key Rotation**.
2. Provide the inputs:
   - `environment`: `staging`, `production`, etc.
   - `key_id`: the new `kid` (e.g. `authority-signing-2025-dev`).
   - `key_path`: the path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`).
   - Optional `metadata`: comma-separated `key=value` pairs (for audit trails).
3. Trigger the run. The workflow:
   - Reads the bootstrap key/URL from secrets.
   - Runs `ops/authority/key-rotation.sh`.
   - Prints the JWKS response for verification.

### Option B – manual shell invocation

```bash
AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \
./ops/authority/key-rotation.sh \
  --authority-url https://authority.example.com \
  --key-id authority-signing-2025-dev \
  --key-path ../certificates/authority-signing-2025-dev.pem \
  --meta rotatedBy=ops --meta changeTicket=OPS-1234
```

Use `--dry-run` to inspect the payload before execution.
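
For reference, the call the script wraps is roughly the following; the payload field names and the auth header shape are assumptions here, so treat `ops/authority/key-rotation.sh` as the source of truth:

```bash
# Hypothetical sketch of the rotation request; field/header names are assumptions.
curl -fsS -X POST "https://authority.example.com/internal/signing/rotate" \
  -H "Authorization: Bearer ${AUTHORITY_BOOTSTRAP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"keyId":"authority-signing-2025-dev","keyPath":"../certificates/authority-signing-2025-dev.pem","metadata":{"rotatedBy":"ops","changeTicket":"OPS-1234"}}'

# The script then re-reads /jwks to confirm the new kid is served.
curl -fsS "https://authority.example.com/jwks"
```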
## 4. Post-rotation checklist

1. Update `authority.yaml` (or the environment-specific overrides); a config sketch follows this list:
   - Set `signing.activeKeyId` to the new key.
   - Set `signing.keyPath` to the new PEM.
   - Append the previous key to `signing.additionalKeys`.
   - Ensure `keySource`/`provider` match the values passed to the script.
2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key.
3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired`.
4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired.
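
A minimal sketch of the resulting `signing` block, using the dev key ids from section 5; the exact shape of `additionalKeys` entries is an assumption, so mirror whatever your existing `authority.yaml` uses:

```yaml
signing:
  activeKeyId: "authority-signing-2025-dev"                  # new kid
  keyPath: "../certificates/authority-signing-2025-dev.pem"  # new PEM
  additionalKeys:                                            # previous key stays available for verification
    - keyId: "authority-signing-dev"
      keyPath: "../certificates/authority-signing-dev.pem"
```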
## 5. Development key state

For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key:

- Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`)
- Retired: `authority-signing-dev`

Treat these as examples; real environments must maintain their own PEM material.

## 6. References

- `docs/11_AUTHORITY.md` – Architecture and rotation SOP (Section 5).
- `docs/ops/authority-backup-restore.md` – Recovery flow referencing this playbook.
- `ops/authority/README.md` – CLI usage and examples.
docs/ops/feedser-apple-operations.md (new file, 77 lines)
@@ -0,0 +1,77 @@
# Feedser Apple Security Update Connector Operations

This runbook covers staging and production rollout for the Apple security updates connector (`source:vndr-apple:*`), including observability checks and fixture maintenance.

## 1. Prerequisites

- Network egress (or a mirrored cache) for `https://gdmf.apple.com/v2/pmv` and the Apple Support domain (`https://support.apple.com/`).
- Optional: corporate proxy exclusions for the Apple hosts if outbound traffic is normally filtered.
- Updated configuration (environment variables or `feedser.yaml`) with an `apple` section. Example baseline:

```yaml
feedser:
  sources:
    apple:
      softwareLookupUri: "https://gdmf.apple.com/v2/pmv"
      advisoryBaseUri: "https://support.apple.com/"
      localeSegment: "en-us"
      maxAdvisoriesPerFetch: 25
      initialBackfill: "120.00:00:00"
      modifiedTolerance: "02:00:00"
      failureBackoff: "00:05:00"
```

> ℹ️ `softwareLookupUri` and `advisoryBaseUri` must stay absolute and aligned with the HTTP allow-list; Feedser automatically adds both hosts to the connector HttpClient.
## 2. Staging Smoke Test

1. Deploy the configuration and restart the Feedser workers so the Apple connector options are bound.
2. Trigger a full connector cycle:
   - CLI: `stella db jobs run source:vndr-apple:fetch --and-then source:vndr-apple:parse --and-then source:vndr-apple:map`
   - REST: `POST /jobs/run { "kind": "source:vndr-apple:fetch", "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"] }`
3. Validate the metrics exported under meter `StellaOps.Feedser.Source.Vndr.Apple`:
   - `apple.fetch.items` (documents fetched)
   - `apple.fetch.failures`
   - `apple.fetch.unchanged`
   - `apple.parse.failures`
   - `apple.map.affected.count` (histogram of affected package counts)
4. Cross-check the shared HTTP counters:
   - `feedser.source.http.requests_total{feedser_source="vndr-apple"}` should increase for both index and detail phases.
   - `feedser.source.http.failures_total{feedser_source="vndr-apple"}` should remain flat (0) during a healthy run.
5. Inspect the info logs:
   - `Apple software index fetch … processed=X newDocuments=Y`
   - `Apple advisory parse complete … aliases=… affected=…`
   - `Mapped Apple advisory … pendingMappings=0`
6. Confirm the MongoDB state (a spot-check sketch follows this list):
   - The `raw_documents` store contains the HT article HTML with metadata (`apple.articleId`, `apple.postingDate`).
   - The `dtos` store has `schemaVersion="apple.security.update.v1"`.
   - The `advisories` collection includes keys `HTxxxxxx` with normalized SemVer rules.
   - The `source_states` entry for `apple` shows a recent `cursor.lastPosted`.
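
A quick way to run the step 6 spot-checks from a shell; the database name (`feedser`) and the exact field names in the queries are assumptions, while the collection names follow the bullets above:

```bash
# mongosh spot-checks for the Apple connector state; adjust DB/field names to your deployment.
mongosh feedser --quiet --eval '
  printjson(db.source_states.findOne({ source: "apple" }, { "cursor.lastPosted": 1 }));
  print("apple DTOs:", db.dtos.countDocuments({ schemaVersion: "apple.security.update.v1" }));
  printjson(db.advisories.findOne({ advisoryKey: /^HT/ }, { advisoryKey: 1 }));
'
```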
## 3. Production Monitoring

- **Dashboards** – Add the following expressions to your Feedser Grafana board (OTLP/Prometheus naming assumed):
  - `rate(apple_fetch_items_total[15m])` vs `rate(feedser_source_http_requests_total{feedser_source="vndr-apple"}[15m])`
  - `rate(apple_fetch_failures_total[5m])` for error spikes (`severity=warning` at `>0`)
  - `histogram_quantile(0.95, rate(apple_map_affected_count_bucket[1h]))` to watch affected-package fan-out
  - `increase(apple_parse_failures_total[6h])` to catch parser drift (alerts at `>0`)
- **Alerts** – Page if `rate(apple_fetch_items_total[2h]) == 0` during business hours while other connectors are active; this often indicates lookup feed failures or a misconfigured allow-list. (A packaged rule sketch follows this section.)
- **Logs** – Surface the warnings `Apple document {DocumentId} missing GridFS payload` and `Apple parse failed`; repeated hits imply storage issues or HTML regressions.
- **Telemetry pipeline** – `StellaOps.Feedser.WebService` now exports `StellaOps.Feedser.Source.Vndr.Apple` alongside the existing Feedser meters; ensure your OTEL collector or Prometheus scraper includes it.
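
The stalled-connector page above, expressed as a Prometheus alerting rule; the group name and annotation text are assumptions for your rule files:

```yaml
groups:
  - name: feedser-apple
    rules:
      - alert: AppleConnectorStalled
        expr: rate(apple_fetch_items_total[2h]) == 0
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Apple connector fetched no items for 2h; check the lookup feed and HTTP allow-list."
```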
## 4. Fixture Maintenance

Regression fixtures live under `src/StellaOps.Feedser.Source.Vndr.Apple.Tests/Apple/Fixtures`. Refresh them whenever Apple reshapes the HT layout or when new platforms appear.

1. Run the helper script matching your platform:
   - Bash: `./scripts/update-apple-fixtures.sh`
   - PowerShell: `./scripts/update-apple-fixtures.ps1`
2. Each script exports `UPDATE_APPLE_FIXTURES=1`, updates the `WSLENV` passthrough, and touches `.update-apple-fixtures` so WSL + VS Code test runs observe the flag. The subsequent test execution fetches the live HT articles listed in `AppleFixtureManager`, sanitises the HTML, and rewrites the `.expected.json` DTO snapshots.
3. Review the diff for localisation or nav noise. Once satisfied, re-run the tests without the env var (`dotnet test src/StellaOps.Feedser.Source.Vndr.Apple.Tests/StellaOps.Feedser.Source.Vndr.Apple.Tests.csproj`) to verify determinism.
4. Commit fixture updates together with any parser/mapping changes that motivated them. (A one-off refresh sketch follows this list.)
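
If you prefer to skip the helper scripts, a one-off refresh presumably reduces to setting the flag inline; this assumes the test project reads `UPDATE_APPLE_FIXTURES` directly from the environment:

```bash
# One-off fixture refresh without the helper scripts (assumption: the flag alone is honoured).
UPDATE_APPLE_FIXTURES=1 dotnet test \
  src/StellaOps.Feedser.Source.Vndr.Apple.Tests/StellaOps.Feedser.Source.Vndr.Apple.Tests.csproj

# Then verify determinism with the flag unset, as in step 3.
dotnet test src/StellaOps.Feedser.Source.Vndr.Apple.Tests/StellaOps.Feedser.Source.Vndr.Apple.Tests.csproj
```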
## 5. Known Issues & Follow-up Tasks

- Apple occasionally throttles anonymous requests after bursts. The connector backs off automatically, but persistent `apple.fetch.failures` spikes might require mirroring the HT content or scheduling wider fetch windows.
- Rapid Security Responses may appear before the general patch notes surface in the lookup JSON. When that happens, the fetch run will log `detailFailures>0`. Collect sample HTML and refresh fixtures to confirm parser coverage.
- Multi-locale content is still under regression sweep (`src/StellaOps.Feedser.Source.Vndr.Apple/TASKS.md`). Capture non-`en-us` snapshots once the fixture tooling stabilises.
docs/ops/feedser-authority-audit-runbook.md (new file, 150 lines)
@@ -0,0 +1,150 @@
# Feedser Authority Audit Runbook

_Last updated: 2025-10-12_

This runbook helps operators verify and monitor the StellaOps Feedser ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

## 1. Prerequisites

- Authority integration is enabled in `feedser.yaml` (or via `FEEDSER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
- OTLP metrics/log exporters are configured (`feedser.telemetry.*`) or container stdout is shipped to your SIEM.
- Operators have access to the Feedser job trigger endpoints via CLI or REST for smoke tests.

### Configuration snippet

```yaml
feedser:
  authority:
    enabled: true
    allowAnonymousFallback: false   # keep true only during initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://feedser"
    requiredScopes:
      - "feedser.jobs.trigger"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "feedser-jobs"
    clientSecretFile: "/run/secrets/feedser_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"
```

> Store secrets outside source control. Feedser reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.

### Resilience tuning

- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Feedser retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Feedser will fail fast but keep deterministic logs.
- Feedser resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `feedser.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.
## 2. Key Signals

### 2.1 Audit log channel

Feedser emits structured audit entries via the `Feedser.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.

```
Feedser authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=feedser-cli scopes=feedser.jobs.trigger bypass=False remote=10.1.4.7
```

| Field      | Sample value           | Meaning |
|------------|------------------------|---------|
| `route`    | `/jobs/definitions`    | Endpoint that processed the request. |
| `status`   | `200` / `401` / `409`  | Final HTTP status code returned to the caller. |
| `subject`  | `ops@example.com`      | User or service principal subject (falls back to `(anonymous)` when unauthenticated). |
| `clientId` | `feedser-cli`          | OAuth client ID provided by Authority (`(none)` if the token lacked the claim). |
| `scopes`   | `feedser.jobs.trigger` | Normalised scope list extracted from token claims; `(none)` if the token carried none. |
| `bypass`   | `True` / `False`       | Whether the request succeeded because its source IP matched a bypass CIDR. |
| `remote`   | `10.1.4.7`             | Remote IP recorded from the connection / forwarded-header test hooks. |

Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations:

- `status=401 AND bypass=True` – a bypass network accepted an unauthenticated call (should be temporary during rollout).
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
- A spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.
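
With Loki, the first combination above can be expressed roughly as follows; the stream selector label (`app="feedser"`) is an assumption for your deployment:

```
{app="feedser"} |= "Feedser authorization audit" |= "status=401" |= "bypass=True"
```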
### 2.2 Metrics

Feedser publishes counters under the OTEL meter `StellaOps.Feedser.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.

| Metric name | Description | PromQL example |
|-------------------------------|----------------------------------------------------|----------------|
| `web.jobs.triggered` | Accepted job trigger requests. | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
| `web.jobs.trigger.conflict` | Rejected triggers (already running, disabled, …). | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
| `web.jobs.trigger.failed` | Server-side job failures. | `sum(rate(web_jobs_trigger_failed_total[5m]))` |

> Prometheus/OTEL collectors typically surface counters with a `_total` suffix. Adjust queries to match your pipeline’s generated metric names.

Correlate audit logs with the following global meter exported via `Feedser.SourceDiagnostics`:

- `feedser.source.http.requests_total{feedser_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
- If Grafana dashboards are deployed, extend the “Feedser Jobs” board with the above counters plus a table of recent audit log entries.
## 3. Alerting Guidance

1. **Unauthorized bypass attempt**
   - Query: `sum(rate(log_messages_total{logger="Feedser.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`
   - Action: verify the `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.

2. **Missing scopes**
   - Query: `sum(rate(log_messages_total{logger="Feedser.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`
   - Action: audit the Authority client registration; ensure `requiredScopes` includes `feedser.jobs.trigger`.

3. **Trigger failure surge** (a packaged rule sketch follows this list)
   - Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.
   - Action: inspect correlated audit entries and `Feedser.Telemetry` traces for job execution errors.

4. **Conflict spike**
   - Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune the threshold).
   - Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.

5. **Authority offline**
   - Watch the `Feedser.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.
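
As a packaged example of alert 3, a Prometheus rule sketch; the group name and annotation text are assumptions:

```yaml
groups:
  - name: feedser-authority-audit
    rules:
      - alert: FeedserJobTriggerFailures
        expr: sum(rate(web_jobs_trigger_failed_total[10m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Feedser job triggers are failing server-side; check audit logs and Feedser.Telemetry traces."
```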
## 4. Rollout & Verification Procedure

1. **Pre-checks**
   - Confirm `allowAnonymousFallback` is `false` in production; keep it `true` only during staged validation.
   - Validate that the Authority issuer metadata is reachable from Feedser (`curl https://authority.internal/.well-known/openid-configuration` from the host).

2. **Smoke test with a valid token**
   - Obtain a token via the CLI: `stella auth login --scope feedser.jobs.trigger`.
   - Trigger a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://feedser.internal/jobs/definitions`.
   - Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=feedser.jobs.trigger`.

3. **Negative test without a token**
   - Call the same endpoint without a token. Expect HTTP 401, `bypass=False`.
   - If the request succeeds, double-check `bypassNetworks` and ensure fallback is disabled.

4. **Bypass check (if applicable)**
   - From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review the business justification and expiry date for such entries.

5. **Metrics validation**
   - Ensure the `web.jobs.triggered` counter increments during accepted runs.
   - Exporters should show corresponding spans (`feedser.job.trigger`) if tracing is enabled.
## 5. Troubleshooting

| Symptom | Probable cause | Remediation |
|---------|----------------|-------------|
| Audit log shows `clientId=(none)` for all requests | Authority not issuing the `client_id` claim, or the CLI is outdated | Update the StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`), or upgrade the CLI token acquisition flow. |
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks`, or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Feedser. |
| HTTP 401 with a valid token | `requiredScopes` missing from the client registration, or token audience mismatch | Verify the Authority client scopes (`feedser.jobs.trigger`) and ensure the token audience matches the `audiences` config. |
| Metrics missing from Prometheus | Telemetry exporters disabled, or the filter is missing the OTEL meter | Set `feedser.telemetry.enableMetrics=true` and ensure the collector includes the `StellaOps.Feedser.WebService.Jobs` meter. |
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or an Authority timeout mid-request | Inspect Feedser job logs, re-run with tracing enabled, validate Authority latency. |

## 6. References

- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
- `docs/ops/authority-monitoring.md` – Authority-side monitoring and alerting playbook.
- `StellaOps.Feedser.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of the audit log fields.
@@ -47,6 +47,12 @@ Expect all logs at `Information`. Ensure OTEL exporters include the scope `Stell
 3. **Job health**
    - `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #feedser-ops.

+### Threshold updates (2025-10-12)
+
+- `feedser.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
+- `feedser.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (the canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
+- `feedser.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3 but annotate dashboards that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
+
 ---

 ## 4. Triage Workflow
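
The paging condition added above translates to roughly this PromQL, assuming the usual dot-to-underscore counter renaming in your exporter:

```
increase(feedser_merge_conflicts_total[30m]) >= 2
```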
@@ -128,3 +134,19 @@ Expect all logs at `Information`. Ensure OTEL exporters include the scope `Stell
 - Storage audit trail: `src/StellaOps.Feedser.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Feedser.Storage.Mongo/MergeEvents`.

 Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
+
+---
+
+## 9. Synthetic Regression Fixtures
+
+- **Locations** – Canonical conflict snapshots now live at `src/StellaOps.Feedser.Source.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/StellaOps.Feedser.Source.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/StellaOps.Feedser.Source.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
+- **Validation commands** – To regenerate and verify the fixtures offline, run:
+
+  ```bash
+  dotnet test src/StellaOps.Feedser.Source.Ghsa.Tests/StellaOps.Feedser.Source.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
+  dotnet test src/StellaOps.Feedser.Source.Nvd.Tests/StellaOps.Feedser.Source.Nvd.Tests.csproj --filter NvdConflictFixtureTests
+  dotnet test src/StellaOps.Feedser.Source.Osv.Tests/StellaOps.Feedser.Source.Osv.Tests.csproj --filter OsvConflictFixtureTests
+  dotnet test src/StellaOps.Feedser.Merge.Tests/StellaOps.Feedser.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
+  ```
+
+- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `feedser.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
docs/ops/feedser-cve-kev-grafana-dashboard.json (new file, 151 lines)
@@ -0,0 +1,151 @@
{
  "title": "Feedser CVE & KEV Observability",
  "uid": "feedser-cve-kev",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "refresh": "5m",
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0
      }
    ]
  },
  "panels": [
    {
      "type": "timeseries",
      "title": "CVE fetch success vs failure",
      "gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "custom": {
            "drawStyle": "line",
            "lineWidth": 2,
            "fillOpacity": 10
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "rate(cve_fetch_success_total[5m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "success"
        },
        {
          "refId": "B",
          "expr": "rate(cve_fetch_failures_total[5m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "failure"
        }
      ]
    },
    {
      "type": "timeseries",
      "title": "KEV fetch cadence",
      "gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "custom": {
            "drawStyle": "line",
            "lineWidth": 2,
            "fillOpacity": 10
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "rate(kev_fetch_success_total[30m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "success"
        },
        {
          "refId": "B",
          "expr": "rate(kev_fetch_failures_total[30m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "failure"
        },
        {
          "refId": "C",
          "expr": "rate(kev_fetch_unchanged_total[30m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "unchanged"
        }
      ]
    },
    {
      "type": "table",
      "title": "KEV parse anomalies (24h)",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 9 },
      "fieldConfig": {
        "defaults": {
          "unit": "short"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (reason) (increase(kev_parse_anomalies_total[24h]))",
          "format": "table",
          "datasource": { "type": "prometheus", "uid": "${datasource}" }
        }
      ],
      "transformations": [
        {
          "id": "organize",
          "options": {
            "renameByName": {
              "Value": "count"
            }
          }
        }
      ]
    },
    {
      "type": "timeseries",
      "title": "Advisories emitted",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 9 },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "custom": {
            "drawStyle": "line",
            "lineWidth": 2,
            "fillOpacity": 10
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "rate(cve_map_success_total[15m])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "CVE"
        },
        {
          "refId": "B",
          "expr": "rate(kev_map_advisories_total[24h])",
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "legendFormat": "KEV"
        }
      ]
    }
  ]
}
@@ -34,19 +34,22 @@ feedser:
    - Feedser CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
    - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
 3. Observe the following metrics (exported via OTEL meter `StellaOps.Feedser.Source.Cve`):
-   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.failures`, `cve.fetch.unchanged`
+   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged`
    - `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
    - `cve.map.success`
-4. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
+4. Verify Prometheus shows matching `feedser.source.http.requests_total{feedser_source="cve"}` deltas (list vs detail phases) while `feedser.source.http.failures_total{feedser_source="cve"}` stays flat.
+5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`.
+6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.

 ### 1.3 Production Monitoring

-- **Dashboards** – Add the counters above plus `feedser.range.primitives` (filtered by `scheme=semver` or `scheme=vendor`) to the Feedser overview board. Alert when:
-  - `rate(cve.fetch.failures[5m]) > 0`
-  - `rate(cve.map.success[15m]) == 0` while fetch attempts continue
-  - `sum_over_time(cve.parse.quarantine[1h]) > 0`
-- **Logs** – Watch for `CveConnector` warnings such as `Failed fetching CVE record` or schema validation errors (`Malformed CVE JSON`). These are emitted with the CVE ID and document identifier for triage.
-- **Backfill window** – operators can tighten or widen the `initialBackfill` / `maxPagesPerFetch` values after validating baseline throughput. Update the config and restart the worker to apply changes.
+- **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `feedser_source_http_requests_total{feedser_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `feedser.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts:
+  - `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`)
+  - `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`)
+  - `sum_over_time(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies
+- **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests.
+- **Grafana pack** – Import `docs/ops/feedser-cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout.
+- **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Feedser to apply changes.

 ## 2. CISA KEV Connector (`source:kev:*`)
@@ -67,7 +70,15 @@ feedser:

 ### 2.2 Schema validation & anomaly handling

-From this sprint the connector validates the KEV JSON payload against `Schemas/kev-catalog.schema.json`. Malformed documents are quarantined, and entries missing a CVE ID are dropped with a warning (`reason=missingCveId`). Operators should treat repeated schema failures as an upstream regression and coordinate with CISA or mirror maintainers.
+The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev.parse.failures_total{reason="schema"}` and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev.parse.anomalies_total` with reasons:
+
+| Reason | Meaning |
+| --- | --- |
+| `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. |
+| `countMismatch` | Catalog `count` field disagreed with the actual entry total. |
+| `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). |
+
+Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.

 ### 2.3 Smoke Test (staging)
@@ -79,13 +90,16 @@ From this sprint the connector validates the KEV JSON payload against `Schemas/k
    - `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
    - `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
    - `kev.map.advisories` (tag `catalogVersion`)
-4. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.
+4. Confirm `feedser.source.http.requests_total{feedser_source="kev"}` increments once per fetch and that the paired `feedser.source.http.failures_total` stays flat (zero increase).
+5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`; they should appear exactly once per run, and `Mapped X/Y… skipped=0` should match the `kev.map.advisories` delta.
+6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.

 ### 2.4 Production Monitoring

-- Alert when `kev.fetch.success` goes to zero for longer than the expected daily cadence (default: trigger if `rate(kev.fetch.success[8h]) == 0` during business hours).
-- Track anomaly spikes via `kev.parse.anomalies{reason="missingCveId"}`. A sustained non-zero rate means the upstream catalog contains unexpected records.
-- The connector logs each validated catalog: `Parsed KEV catalog document … entries=X`. Absence of that log alongside consecutive `kev.fetch.success` counts suggests schema validation failures; correlate with warning-level events in the `StellaOps.Feedser.Source.Kev` logger.
+- Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`.
+- Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`; this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as networking issues to the mirror.
+- Track anomaly spikes through `sum_over_time(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs.
+- Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.

 ### 2.5 Known good dashboard tiles
@@ -93,12 +107,14 @@ Add the following panels to the Feedser observability board:

 | Metric | Recommended visualisation |
 |--------|---------------------------|
-| `kev.fetch.success` | Single-stat (last 24 h) with threshold alert |
-| `rate(kev.parse.entries[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
-| `sum_over_time(kev.parse.anomalies[1d])` by `reason` | Table – anomaly breakdown |
+| `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` |
+| `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
+| `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) |
+| `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted |

 ## 3. Runbook updates

 - Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
 - Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
 - Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
+- Version-control dashboard tweaks alongside `docs/ops/feedser-cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores.
docs/ops/feedser-ghsa-operations.md (new file, 111 lines)
@@ -0,0 +1,111 @@
# Feedser GHSA Connector – Operations Runbook

_Last updated: 2025-10-12_

## 1. Overview

The GitHub Security Advisories (GHSA) connector pulls advisory metadata from the GitHub REST API `/security/advisories` endpoint. GitHub enforces both primary and secondary rate limits, so operators must monitor usage and configure retries to avoid throttling incidents.

## 2. Rate-limit telemetry

The connector now surfaces rate-limit headers on every fetch and exposes the following metrics via OpenTelemetry:

| Metric | Description | Tags |
|--------|-------------|------|
| `ghsa.ratelimit.limit` (histogram) | Samples the reported request quota at fetch time. | `phase` = `list` or `detail`, `resource` (e.g., `core`). |
| `ghsa.ratelimit.remaining` (histogram) | Remaining requests returned by `X-RateLimit-Remaining`. | `phase`, `resource`. |
| `ghsa.ratelimit.reset_seconds` (histogram) | Seconds until `X-RateLimit-Reset`. | `phase`, `resource`. |
| `ghsa.ratelimit.exhausted` (counter) | Incremented whenever GitHub returns a zero remaining quota and the connector delays before retrying. | `phase`. |

### Dashboards & alerts
- Plot `ghsa.ratelimit.remaining` as the latest value to watch the runway. Alert when the value stays below **`RateLimitWarningThreshold`** (default `500`) for more than 5 minutes.
- Raise a separate alert on `increase(ghsa.ratelimit.exhausted[15m]) > 0` to catch hard throttles.
- Overlay `ghsa.fetch.attempts` vs `ghsa.fetch.failures` to confirm retries are effective. (A packaged rule sketch follows.)
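
Those two alert conditions can ship as a Prometheus rule group; the metric names assume the exporter's dot-to-underscore renaming, and the group/label names are placeholders:

```yaml
groups:
  - name: feedser-ghsa
    rules:
      - alert: GhsaRateLimitLow
        # Median remaining quota below the default warning threshold for 5 minutes.
        expr: histogram_quantile(0.5, sum by (le) (rate(ghsa_ratelimit_remaining_bucket[5m]))) < 500
        for: 5m
        labels:
          severity: warning
      - alert: GhsaRateLimitExhausted
        expr: increase(ghsa_ratelimit_exhausted_total[15m]) > 0
        labels:
          severity: critical
```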
## 3. Logging signals

When `X-RateLimit-Remaining` falls below `RateLimitWarningThreshold`, the connector emits:

```
GHSA rate limit warning: remaining {Remaining}/{Limit} for {Phase} {Resource}
```

When GitHub reports zero remaining calls, the connector logs and sleeps for the reported `Retry-After`/`X-RateLimit-Reset` interval (falling back to `SecondaryRateLimitBackoff`).

## 4. Configuration knobs (`feedser.yaml`)

```yaml
feedser:
  sources:
    ghsa:
      apiToken: "${GITHUB_PAT}"
      pageSize: 50
      requestDelay: "00:00:00.200"
      failureBackoff: "00:05:00"
      rateLimitWarningThreshold: 500        # warn below this many remaining calls
      secondaryRateLimitBackoff: "00:02:00" # fallback delay when GitHub omits Retry-After
```

### Recommendations
- Increase `requestDelay` in air-gapped or burst-heavy deployments to smooth token consumption.
- Lower `rateLimitWarningThreshold` only if your dashboards already page on the new histogram; never set it negative.
- For bots using a low-privilege PAT, keep `secondaryRateLimitBackoff` at ≥ 60 seconds to respect GitHub’s secondary-limit guidance.

#### Default job schedule

| Job kind | Cron | Timeout | Lease |
|----------|------|---------|-------|
| `source:ghsa:fetch` | `1,11,21,31,41,51 * * * *` | 6 minutes | 4 minutes |
| `source:ghsa:parse` | `3,13,23,33,43,53 * * * *` | 5 minutes | 4 minutes |
| `source:ghsa:map` | `5,15,25,35,45,55 * * * *` | 5 minutes | 4 minutes |

These defaults spread the GHSA stages across the hour so fetch completes before parse/map fire. Override them via `feedser.jobs.definitions[...]` when coordinating multiple connectors on the same runner (see the sketch below).
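
An override sketch for `feedser.yaml`; the exact shape of `feedser.jobs.definitions` entries is an assumption based on the key named above, so verify against your deployment's schema:

```yaml
feedser:
  jobs:
    definitions:
      "source:ghsa:fetch":
        cron: "7,17,27,37,47,57 * * * *"   # shift fetch so it does not collide with another connector
        timeout: "00:06:00"
        lease: "00:04:00"
```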
## 5. Provisioning credentials

Feedser requires a GitHub personal access token (classic) with the **`read:org`** and **`security_events`** scopes to pull GHSA data. Store it as a secret and reference it via `feedser.sources.ghsa.apiToken`.

### Docker Compose (stack operators)
```yaml
services:
  feedser:
    environment:
      FEEDSER__SOURCES__GHSA__APITOKEN: /run/secrets/ghsa_pat
    secrets:
      - ghsa_pat

secrets:
  ghsa_pat:
    file: ./secrets/ghsa_pat.txt # contains only the PAT value
```

### Helm values (cluster operators)
```yaml
feedser:
  extraEnv:
    - name: FEEDSER__SOURCES__GHSA__APITOKEN
      valueFrom:
        secretKeyRef:
          name: feedser-ghsa
          key: apiToken

extraSecrets:
  feedser-ghsa:
    apiToken: "<paste PAT here or source from external secret store>"
```

After rotating the PAT, restart the Feedser workers (or run `kubectl rollout restart deployment/feedser`) to ensure the configuration reloads.
When enabling GHSA for the first time, run a staged backfill:

1. Trigger `source:ghsa:fetch` manually (CLI or API) outside of peak hours.
2. Watch `feedser.jobs.health` for the GHSA jobs until they report `healthy`.
3. Allow the scheduled cron cadence to resume once the initial backlog drains (typically < 30 minutes).

## 6. Runbook steps when throttled
1. Check `ghsa.ratelimit.exhausted` for the affected phase (`list` vs `detail`).
2. Confirm the connector is delaying: the logs will show `GHSA rate limit exhausted...` with the chosen backoff.
3. If rate limits stay exhausted:
   - Verify no other jobs are sharing the PAT.
   - Temporarily reduce `maxPagesPerFetch` or `pageSize` to shrink burst size.
   - Consider provisioning a dedicated PAT (GHSA permissions only) for Feedser.
4. After the quota resets, restore `rateLimitWarningThreshold`/`requestDelay` to their normal values and monitor the histograms for at least one hour.
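
While diagnosing, you can also query the PAT's quota out-of-band via GitHub's rate-limit endpoint, which does not itself count against the core limit:

```bash
# Inspect the PAT's remaining quota directly; look at the "core" resource.
curl -fsS -H "Authorization: Bearer ${GITHUB_PAT}" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/rate_limit | jq '.resources.core'
```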
## 7. Alert integration quick reference
- Prometheus: `ghsa_ratelimit_remaining_bucket` (from the histogram) – use `histogram_quantile(0.99, ...)` to trend capacity.
- VictoriaMetrics: `last_over_time(ghsa_ratelimit_remaining_sum[5m])` for simple last-value graphs.
- Grafana: stack remaining + used to visualise the total limit per resource.