Initial commit (history squashed)

commit 016c5a3fe7 (branch: master, 2025-10-07 10:14:21 +03:00)
1132 changed files with 117842 additions and 0 deletions


@@ -0,0 +1,97 @@
# Authority Backup & Restore Runbook
## Scope
- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24h in production. Store snapshots in an encrypted, access-controlled vault.
## Inventory Checklist
| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |
> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.
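To confirm which directory is actually in use before snapshotting, read the manifest directly (the path shown is the compose default):
```bash
# Print the configured signing-key directory; a relative value resolves against the manifest location.
grep -n 'keyDirectory' etc/authority.plugins/standard.yaml
```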
## Hot Backup (no downtime)
1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**
```bash
   STAMP=$(date +%Y%m%dT%H%M%SZ)   # capture once so the dump and the copy reference the same file
   docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
     mongodump --archive=/dump/authority-${STAMP}.gz \
     --gzip --db stellaops-authority
   docker compose -f ops/authority/docker-compose.authority.yaml cp \
     "mongo:/dump/authority-${STAMP}.gz" backup/
```
The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`.
3. **Capture configuration + manifests:**
```bash
cp etc/authority.yaml backup/
rsync -a etc/authority.plugins/ backup/authority.plugins/
```
4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:
```bash
   docker run --rm \
     -v authority-keys:/keys \
     -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%dT%H%M%SZ).tar.gz -C /keys .
```
5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts.
6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault.
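The following sketch covers steps 5 and 6, assuming `age` is available and `backup.pub` holds your recipient keys (both are illustrative; substitute your standard tooling):
```bash
# Step 5: SHA-256 digests for every artefact, written alongside them.
( cd backup && find . -type f ! -name SHA256SUMS -print0 | xargs -0 sha256sum > SHA256SUMS )

# Step 6: bundle and encrypt the folder before uploading it to the offline vault.
STAMP=$(date +%Y-%m-%d)
tar czf - backup | age -R backup.pub -o "authority-backup-${STAMP}.tar.gz.age"
```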
## Cold Backup (planned downtime)
1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:
```bash
docker compose -f ops/authority/docker-compose.authority.yaml down
```
3. Back up volumes directly using `tar`:
```bash
   docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
```
4. Copy configuration + manifests as in the hot backup (steps 3–6).
5. Restart services and verify health:
```bash
docker compose -f ops/authority/docker-compose.authority.yaml up -d
curl -fsS http://localhost:8080/ready
```
## Restore Procedure
1. **Provision clean volumes:** remove existing volumes if you're rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**
```bash
docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
```
Use `--drop` to replace collections; omit if doing a partial restore.
3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:
```bash
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
```
   Ensure private keys keep `600` permissions after extraction (for example `chmod 600` on each key file; leave the key directory itself at `700` so the service can still traverse it).
5. **Start services & validate:**
```bash
docker compose up -d
curl -fsS http://localhost:8080/health
```
6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations.
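A minimal validation sketch; the `/jwks` and `/token` paths come from this runbook, while the client credentials are placeholders for whichever test client your deployment defines:
```bash
# Confirm the restored signing keys are being served.
curl -fsS http://localhost:8080/jwks | jq '.keys[] | {kid, alg}'

# Issue a short-lived token with a known test client (client_credentials shown as an example).
curl -fsS http://localhost:8080/token \
  -d grant_type=client_credentials \
  -d client_id="$TEST_CLIENT_ID" \
  -d client_secret="$TEST_CLIENT_SECRET" | jq '{token_type, expires_in}'
```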
## Disaster Recovery Notes
- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (key rotation tooling), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Restoring across major versions requires a compatibility review.
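When restoring onto rebuilt infrastructure, a quick check against the running container helps confirm the dump and server share a major version (sketch uses the compose stack from this runbook):
```bash
docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
  mongosh --quiet --eval 'db.version()'
```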
## Verification Checklist
- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).


@@ -0,0 +1,174 @@
{
"title": "StellaOps Authority - Token & Access Monitoring",
"uid": "authority-token-monitoring",
"schemaVersion": 38,
"version": 1,
"editable": true,
"timezone": "",
"graphTooltip": 0,
"time": {
"from": "now-6h",
"to": "now"
},
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"refresh": 1,
"hide": 0,
"current": {}
}
]
},
"panels": [
{
"id": 1,
"title": "Token Requests Success vs Failure",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "req/s",
"displayName": "{{grant_type}} ({{status}})"
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
"legendFormat": "{{grant_type}} {{status}}"
}
],
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 2,
"title": "Rate Limiter Rejections",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "req/s",
"displayName": "{{limiter}}"
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
"legendFormat": "{{limiter}}"
}
]
},
{
"id": 3,
"title": "Bypass Events (5m)",
"type": "stat",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "short",
"color": {
"mode": "thresholds"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 1 },
{ "color": "red", "value": 5 }
]
}
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
}
],
"options": {
"reduceOptions": {
"calcs": ["last"],
"fields": "",
"values": false
},
"orientation": "horizontal",
"textMode": "auto"
}
},
{
"id": 4,
"title": "Lockout Events (15m)",
"type": "stat",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "short",
"color": {
"mode": "thresholds"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 5 },
{ "color": "red", "value": 10 }
]
}
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
}
],
"options": {
"reduceOptions": {
"calcs": ["last"],
"fields": "",
"values": false
},
"orientation": "horizontal",
"textMode": "auto"
}
},
{
"id": 5,
"title": "Trace Explorer Shortcut",
"type": "text",
"options": {
"mode": "markdown",
"content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
}
}
],
"links": []
}


@@ -0,0 +1,81 @@
# Authority Monitoring & Alerting Playbook
## Telemetry Sources
- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
- `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
- `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
- `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
- `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
- `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).
## Prometheus Metrics to Collect
| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when >5% for 10min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiting saturations (requires OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |
> **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.
## Alert Rules
1. **Token Failure Surge**
- _Expression_: `token_failure_ratio > 0.05`
- _For_: `10m`
- _Labels_: `severity="critical"`
- _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as diagnostic hint (requires span → metric transformation).
2. **Lockout Spike**
- _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
- _For_: `15m`
- Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
- _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
- _For_: `5m`
- Alert severity `warning` — verify the calling host list.
4. **Rate Limiter Saturation**
- _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 minutes; confirm trusted clients aren't misconfigured.
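Before wiring an expression into an alert rule, it can be evaluated ad hoc against the Prometheus HTTP API; the Prometheus URL below is an assumption for your environment:
```bash
PROM_URL="http://prometheus:9090"   # adjust to your Prometheus instance

curl -fsS -G "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=
    sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..|5.."}[5m]))
    /
    sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))' \
  | jq '.data.result'
```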
## Grafana Dashboard
- Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels:
- **Token Success vs Failure** stacked rate visualization split by grant type.
- **Rate Limiter Hits** bar chart showing `authority-token` and `authority-authorize`.
- **Bypass & Lockout Events** dual-stat panel using Loki-derived counters.
- **Trace Explorer Link** panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
## Collector Configuration Snippets
```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:
    # Target for the logs pipeline below; the endpoint is an assumption - point it at your Loki instance.
    endpoint: "http://loki:3100/loki/api/v1/push"
processors:
  batch:
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```
## Operational Checklist
- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.


@@ -0,0 +1,130 @@
# Feedser Conflict Resolution Runbook (Sprint 3)
This runbook equips Feedser operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.
---
## 1. Precedence Model (recap)
- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `feedser:merge:precedence:ranks` to override per source when incident response requires it.
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
---
## 2. Telemetry Shipped This Sprint
| Instrument | Type | Key Tags | Purpose |
|------------|------|----------|---------|
| `feedser.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
| `feedser.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
| `feedser.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
| `feedser.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
| `feedser.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |
### Structured logs
- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
- `Alias collision ...` (no EventId) - emitted when `feedser.merge.identity_conflicts` increments.
Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Feedser.Merge`.
---
## 3. Detection & Alerting
1. **Dashboard panels**
   - `feedser.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15-minute window.
- `feedser.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
- `feedser.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
- `feedser.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
2. **Log based alerts**
- `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
- `eventId=1002` with `reason="mismatch"` - severity disagreement; open connector bug if sustained.
3. **Job health**
- `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #feedser-ops.
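A minimal wrapper for that automation, assuming a generic incoming-webhook URL for #feedser-ops and a writable log directory (both placeholders):
```bash
#!/usr/bin/env bash
# FEEDSER_OPS_WEBHOOK and the log path are assumptions for your environment.
set -uo pipefail

LOG="/var/log/feedser/merge-$(date +%Y%m%dT%H%M%S).log"

stellaops-cli db merge --verbose 2>&1 | tee "$LOG"
rc=${PIPESTATUS[0]}

if [ "$rc" -eq 1 ]; then
  # Exit code 1 = unresolved conflicts (see above); notify #feedser-ops with the captured log path.
  curl -fsS -H 'Content-Type: application/json' \
    -d "{\"text\": \"Feedser merge reported unresolved conflicts; log: ${LOG}\"}" \
    "$FEEDSER_OPS_WEBHOOK"
fi
exit "$rc"
```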
---
## 4. Triage Workflow
1. **Confirm job context**
- `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
2. **Inspect metrics**
- Correlate spikes in `feedser.merge.conflicts` with `primary_source`/`suppressed_source` tags from `feedser.merge.overrides`.
3. **Pull structured logs**
- Example (vector output):
```
jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-feedser.log
```
4. **Review merge events**
- `mongosh`:
```javascript
use feedser;
db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
```
- Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
5. **Interrogate provenance**
- `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
- Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
---
## 5. Conflict Classification Matrix
| Signal | Likely Cause | Immediate Action |
|--------|--------------|------------------|
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `feedser:merge:precedence:ranks` to break the tie; restart merge job. |
| Rising `feedser.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
| `feedser.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
---
## 6. Resolution Playbook
1. **Connector data fix**
- Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map` etc.).
- Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
2. **Temporary precedence override**
- Edit `etc/feedser.yaml`:
```yaml
   feedser:
     merge:
       precedence:
         ranks:
           osv: 1
           ghsa: 0
```
- Restart Feedser workers; confirm tags in `feedser.merge.overrides` show the new ranks.
- Document the override with expiry in the change log.
3. **Alias remediation**
- Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
   - Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive; coordinate with Storage before issuing it).
4. **Escalation**
- If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
---
## 7. Validation Checklist
- [ ] Merge job rerun returns exit code `0`.
- [ ] `feedser.merge.conflicts` baseline returns to zero after corrective action.
- [ ] Latest `merge_event` entry shows expected hash delta.
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
---
## 8. Reference Material
- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
- Merge engine internals: `src/StellaOps.Feedser.Merge/Services/AdvisoryPrecedenceMerger.cs`.
- Metrics definitions: `src/StellaOps.Feedser.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
- Storage audit trail: `src/StellaOps.Feedser.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Feedser.Storage.Mongo/MergeEvents`.
Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.


@@ -0,0 +1,104 @@
# Feedser CVE & KEV Connector Operations
This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.
## 1. CVE Services Connector (`source:cve:*`)
### 1.1 Prerequisites
- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Feedser workers.
- Updated `feedser.yaml` (or the matching environment variables) with the following section:
```yaml
feedser:
  sources:
    cve:
      baseEndpoint: "https://cveawg.mitre.org/api/"
      apiOrg: "ORG123"
      apiUser: "user@example.org"
      apiKeyFile: "/var/run/secrets/feedser/cve-api-key"
      pageSize: 200
      maxPagesPerFetch: 5
      initialBackfill: "30.00:00:00"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:10:00"
```
> Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively supply `apiKey` via `FEEDSER_SOURCES__CVE__APIKEY`.
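One way to provision the key file referenced by `apiKeyFile` on the host (the path follows the snippet above; adjust ownership to whichever account runs Feedser):
```bash
sudo install -d -m 700 /var/run/secrets/feedser
printf '%s' "$CVE_API_KEY" | sudo tee /var/run/secrets/feedser/cve-api-key > /dev/null
sudo chmod 600 /var/run/secrets/feedser/cve-api-key
```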
### 1.2 Smoke Test (staging)
1. Deploy the updated configuration and restart the Feedser service so the connector picks up the credentials.
2. Trigger one end-to-end cycle:
- Feedser CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
   - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }` (see the curl sketch after this list)
3. Observe the following metrics (exported via OTEL meter `StellaOps.Feedser.Source.Cve`):
- `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.failures`, `cve.fetch.unchanged`
- `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
- `cve.map.success`
4. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
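A curl sketch for the REST fallback from step 2; the base URL and bearer token are placeholders for your Feedser WebService endpoint and credentials:
```bash
curl -fsS "https://feedser.example.internal/jobs/run" \
  -H "Authorization: Bearer $FEEDSER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }'
```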
### 1.3 Production Monitoring
- **Dashboards:** Add the counters above plus `feedser.range.primitives` (filtered by `scheme=semver` or `scheme=vendor`) to the Feedser overview board. Alert when:
- `rate(cve.fetch.failures[5m]) > 0`
- `rate(cve.map.success[15m]) == 0` while fetch attempts continue
- `sum_over_time(cve.parse.quarantine[1h]) > 0`
- **Logs:** Watch for `CveConnector` warnings such as `Failed fetching CVE record` or schema validation errors (`Malformed CVE JSON`). These are emitted with the CVE ID and document identifier for triage.
- **Backfill window:** Operators can tighten or widen the `initialBackfill` / `maxPagesPerFetch` values after validating baseline throughput. Update the config and restart the worker to apply changes.
## 2. CISA KEV Connector (`source:kev:*`)
### 2.1 Prerequisites
- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
- Confirm the following snippet in `feedser.yaml` (defaults shown; tune as needed):
```yaml
feedser:
  sources:
    kev:
      feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
      requestTimeout: "00:01:00"
      failureBackoff: "00:05:00"
```
### 2.2 Schema validation & anomaly handling
As of this sprint, the connector validates the KEV JSON payload against `Schemas/kev-catalog.schema.json`. Malformed documents are quarantined, and entries missing a CVE ID are dropped with a warning (`reason=missingCveId`). Treat repeated schema failures as an upstream regression and coordinate with CISA or the mirror maintainers.
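When failures persist, a quick upstream check helps separate a CISA-side regression from a connector bug; the sketch assumes the public catalog's `vulnerabilities[]`/`cveID` field names:
```bash
curl -fsS https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json \
  | jq '{catalogVersion, count,
         missingCveId: ([.vulnerabilities[] | select(.cveID == null or .cveID == "")] | length)}'
```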
### 2.3 Smoke Test (staging)
1. Deploy the configuration and restart Feedser.
2. Trigger a pipeline run:
- CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
- REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
3. Verify the metrics exposed by meter `StellaOps.Feedser.Source.Kev`:
- `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
- `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
- `kev.map.advisories` (tag `catalogVersion`)
4. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.
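A quick data check via `mongosh`; the `feedser` database name and the `advisories` collection follow the other Feedser runbooks and are assumptions for your deployment:
```bash
mongosh feedser --quiet --eval '
  print("raw_documents:", db.raw_documents.countDocuments({}));
  print("dtos:", db.dtos.countDocuments({}));
  print("kev advisories:", db.advisories.countDocuments({ advisoryKey: { $regex: "^kev/" } }));
'
```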
### 2.4 Production Monitoring
- Alert when `kev.fetch.success` goes to zero for longer than the expected daily cadence (default: trigger if `rate(kev.fetch.success[8h]) == 0` during business hours).
- Track anomaly spikes via `kev.parse.anomalies{reason="missingCveId"}`. A sustained non-zero rate means the upstream catalog contains unexpected records.
- The connector logs each validated catalog: `Parsed KEV catalog document … entries=X`. Absence of that log alongside consecutive `kev.fetch.success` counts suggests schema validation failures—correlate with warning-level events in the `StellaOps.Feedser.Source.Kev` logger.
### 2.5 Known good dashboard tiles
Add the following panels to the Feedser observability board:
| Metric | Recommended visualisation |
|--------|---------------------------|
| `kev.fetch.success` | Single-stat (last 24h) with threshold alert |
| `rate(kev.parse.entries[1h])` by `catalogVersion` | Stacked area highlights daily release size |
| `sum_over_time(kev.parse.anomalies[1d])` by `reason` | Table anomaly breakdown |
## 3. Runbook updates
- Record staging/production smoke test results (date, catalog version, advisory counts) in your team's change log.
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).


@@ -0,0 +1,50 @@
# SemVer Style Backfill Runbook
_Last updated: 2025-10-11_
## Overview
The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures
provenance `decisionReason` values are preserved during future reads. The migration is idempotent and only
runs when the feature flag `feedser:storage:enableSemVerStyle` is enabled.
## Preconditions
1. **Review configuration:** set `feedser.storage.enableSemVerStyle` to `true` on all Feedser services.
2. **Confirm batch size:** adjust `feedser.storage.backfillBatchSize` if you need smaller batches for older
   deployments (default: `250`).
3. **Back up:** capture a fresh snapshot of the `advisory` collection or a full MongoDB backup.
4. **Staging dry-run:** enable the flag in a staging environment and observe the migration output before
   rolling to production.
## Execution
No manual command is required. After deploying the configuration change, restart the Feedser WebService or
any component that hosts the Mongo migration runner. During startup you will see log entries similar to:
```
Applying Mongo migration 20251011-semver-style-backfill: Populate advisory.normalizedVersions for existing documents when SemVer style storage is enabled.
Mongo migration 20251011-semver-style-backfill applied
```
The migration reads advisories in batches (`feedser.storage.backfillBatchSize`) and writes flattened
`normalizedVersions` arrays. Existing documents without SemVer ranges remain untouched.
## Post-checks
1. Verify the new indexes exist:
```
db.advisory.getIndexes()
```
You should see `advisory_normalizedVersions_pkg_scheme_type` and `advisory_normalizedVersions_value`.
2. Spot check a few advisories to confirm the top-level `normalizedVersions` array exists and matches
   the embedded package data (a `mongosh` spot-check is sketched after this list).
3. Run `dotnet test` for `StellaOps.Feedser.Storage.Mongo.Tests` (optional but recommended) in CI to confirm
the storage suite passes with the feature flag enabled.
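A spot-check sketch for step 2; the `feedser` database name, the sample advisory key, and the embedded `affectedPackages.normalizedVersions` path are assumptions, so substitute values you know exist:
```bash
mongosh feedser --quiet --eval '
  const doc = db.advisory.findOne(
    { advisoryKey: "CVE-2025-1234" },   // sample key only
    { advisoryKey: 1, normalizedVersions: 1, "affectedPackages.normalizedVersions": 1 }  // embedded path assumed
  );
  printjson(doc);
'
```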
## Rollback
Set `feedser.storage.enableSemVerStyle` back to `false` and redeploy. The migration will be skipped on
subsequent startups. You can leave the populated `normalizedVersions` arrays in place; they are ignored when
the feature flag is off. If you must remove them entirely, restore from the backup captured during
preparation.