Initial commit (history squashed)

`docs/ops/authority-backup-restore.md` (new file, 97 lines)
# Authority Backup & Restore Runbook

## Scope

- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24 h in production. Store snapshots in an encrypted, access-controlled vault.

## Inventory Checklist

| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |

> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.
## Hot Backup (no downtime)

1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**

   ```bash
   STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
   docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
     mongodump --archive=/dump/authority-"$STAMP".gz \
     --gzip --db stellaops-authority
   docker compose -f ops/authority/docker-compose.authority.yaml cp \
     mongo:/dump/authority-"$STAMP".gz backup/
   ```

   The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`. Capturing the timestamp once in `STAMP` keeps the dump and the copy pointing at the same file.

3. **Capture configuration + manifests:**

   ```bash
   cp etc/authority.yaml backup/
   rsync -a etc/authority.plugins/ backup/authority.plugins/
   ```

4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:

   ```bash
   docker run --rm \
     -v authority-keys:/keys \
     -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date -u +%Y%m%dT%H%M%SZ).tar.gz -C /keys .
   ```

5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts.
6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault.
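Step 5 can be scripted so the digests are generated and verified in one pass. A minimal sketch (the `backup/` layout and the placeholder file are assumptions for illustration):

```shell
# Hash every artefact in today's backup folder into a SHA256SUMS manifest,
# then verify the manifest before encrypting/uploading.
BACKUP_DIR="backup/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"
echo demo > "$BACKUP_DIR/authority.yaml"   # placeholder artefact so the sketch runs standalone
(
  cd "$BACKUP_DIR"
  find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
  sha256sum -c SHA256SUMS                  # exits non-zero on any mismatch
)
```

Keep `SHA256SUMS` inside the encrypted bundle so integrity can be re-checked after transport to the vault.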
## Cold Backup (planned downtime)

1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:

   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml down
   ```

3. Back up volumes directly using `tar`:

   ```bash
   docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
   ```

4. Copy configuration + manifests as in the hot backup (steps 3–6).
5. Restart services and verify health:

   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml up -d
   curl -fsS http://localhost:8080/ready
   ```
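Before shipping the archives offsite (or relying on them for restore), it is worth confirming each tarball is a readable gzip archive. A sketch using a stand-in archive (the demo paths are illustrative; substitute the real `/backup/*.tar.gz` files):

```shell
# Build a throwaway archive that stands in for mongo-data-YYYYMMDD.tar.gz,
# then list its entries without extracting. A truncated/corrupt archive fails here.
mkdir -p demo-volume backup
echo 'collection data' > demo-volume/records.bson
tar czf backup/mongo-data-demo.tar.gz -C demo-volume .
tar -tzf backup/mongo-data-demo.tar.gz > /dev/null && echo "archive OK"
```

Listing with `tar -tzf` reads the whole stream, so it catches truncation that a mere file-size check would miss.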
## Restore Procedure

1. **Provision clean volumes:** remove existing volumes if you’re rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**

   ```bash
   docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
   ```

   Use `--drop` to replace collections; omit it for a partial restore.

3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:

   ```bash
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
   ```

   Ensure private key files remain `600`. Note that a blanket `chmod -R 600` would strip the execute bit directories need; `chmod -R u=rwX,go=` keeps directories traversable while locking files down.

5. **Start services & validate:**

   ```bash
   docker compose up -d
   curl -fsS http://localhost:8080/health
   ```

6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations.
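The permission note in step 4 can be exercised locally before touching the real volume. A sketch with throwaway paths (everything here is a stand-in for the `authority-keys` contents):

```shell
# Restore a key archive into a scratch directory, then normalise permissions:
# files become 600, directories stay traversable (700).
mkdir -p keys-src/sub restore-keys
echo 'PRIVATE KEY MATERIAL' > keys-src/sub/signing.pem
tar czf authority-keys-demo.tar.gz -C keys-src .
tar xzf authority-keys-demo.tar.gz -C restore-keys
# u=rwX: owner read/write everywhere, execute only on directories;
# go=: strip all group/other access.
chmod -R u=rwX,go= restore-keys
stat -c '%a %n' restore-keys/sub restore-keys/sub/signing.pem
```

The `X` (capital) permission is what makes this safe to run recursively: it grants execute only where it makes sense, unlike a literal `chmod -R 600` or `700`.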
## Disaster Recovery Notes

- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (key rotation tooling), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Restoring across major versions requires a compatibility review.

## Verification Checklist

- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).
`docs/ops/authority-grafana-dashboard.json` (new file, 174 lines)
```json
{
  "title": "StellaOps Authority - Token & Access Monitoring",
  "uid": "authority-token-monitoring",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0,
        "current": {}
      }
    ]
  },
  "panels": [
    {
      "id": 1,
      "title": "Token Requests – Success vs Failure",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{grant_type}} ({{status}})"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
          "legendFormat": "{{grant_type}} {{status}}"
        }
      ],
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 2,
      "title": "Rate Limiter Rejections",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{limiter}}"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
          "legendFormat": "{{limiter}}"
        }
      ]
    },
    {
      "id": 3,
      "title": "Bypass Events (5m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 4,
      "title": "Lockout Events (15m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 5 },
              { "color": "red", "value": 10 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 5,
      "title": "Trace Explorer Shortcut",
      "type": "text",
      "options": {
        "mode": "markdown",
        "content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
      }
    }
  ],
  "links": []
}
```
`docs/ops/authority-monitoring.md` (new file, 81 lines)
# Authority Monitoring & Alerting Playbook

## Telemetry Sources

- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).
## Prometheus Metrics to Collect

| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` arrives via the `authority.grant_type` span attribute → exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when > 5 % for 10 min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate-limiter saturation (requires the OTEL ASP.NET rate-limiting exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from a Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |

> **Exporter note:** Enable the `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the series shown above.
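The remap mentioned in the exporter note could be done with the Collector's `transform` processor. A hedged sketch only; OTTL statement syntax varies across Collector releases, so validate against your deployed version:

```yaml
processors:
  transform/token_metrics:
    metric_statements:
      - context: metric
        statements:
          # Rename the histogram series so dashboards can query a stable name;
          # the Prometheus exporter then emits token_requests_total from its count.
          - set(name, "token_requests") where name == "http.server.request.duration"
```

Wire `transform/token_metrics` into the `metrics` pipeline alongside `batch` (see the Collector snippet later in this playbook).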
## Alert Rules

1. **Token Failure Surge**
   - _Expression_: `token_failure_ratio > 0.05`
   - _For_: `10m`
   - _Labels_: `severity="critical"`
   - _Annotations_: include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as a diagnostic hint (requires the span → metric transformation).
2. **Lockout Spike**
   - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
   - _For_: `15m`
   - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
   - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
   - _For_: `5m`
   - Alert severity `warning`; verify the calling host list.
4. **Rate Limiter Saturation**
   - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.
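Rule 1 above depends on `token_failure_ratio` existing as a series, which is most naturally done with a Prometheus recording rule. A sketch of a rule file (group and alert names are illustrative):

```yaml
groups:
  - name: authority-token-alerts
    rules:
      # Recording rule so the alert can reference a stable series name.
      - record: token_failure_ratio
        expr: >
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority",
            http_route="/token", http_status_code=~"4..|5.."}[5m]))
          /
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority",
            http_route="/token"}[5m]))
      - alert: AuthorityTokenFailureSurge
        expr: token_failure_ratio > 0.05
        for: 10m
        labels:
          severity: critical
```

The remaining rules follow the same shape, substituting the expressions listed in items 2–4.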
## Grafana Dashboard

- Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels:
  - **Token Success vs Failure** – stacked rate visualization split by grant type.
  - **Rate Limiter Hits** – bar chart showing `authority-token` and `authority-authorize`.
  - **Bypass & Lockout Events** – dual-stat panel using Loki-derived counters.
  - **Trace Explorer Link** – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
## Collector Configuration Snippets

```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"   # adjust to your Loki ingest endpoint
processors:
  batch:
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

Note the `logs` pipeline references a `loki` exporter, so it must be declared under `exporters` (as above) or the Collector will fail config validation.
## Operational Checklist

- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include the dashboard JSON in the Offline Kit for air-gapped clusters; update the version header when metrics change.
`docs/ops/feedser-conflict-resolution.md` (new file, 130 lines)
# Feedser Conflict Resolution Runbook (Sprint 3)

This runbook equips Feedser operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine has landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.

---

## 1. Precedence Model (recap)

- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `feedser:merge:precedence:ranks` to override per source when incident response requires it.
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources, plus a `merge_event` record with canonical before/after SHA-256 hashes.

---
## 2. Telemetry Shipped This Sprint

| Instrument | Type | Key Tags | Purpose |
|------------|------|----------|---------|
| `feedser.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
| `feedser.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
| `feedser.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
| `feedser.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
| `feedser.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |

### Structured logs

- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
- `Alias collision ...` (no EventId) - emitted when `feedser.merge.identity_conflicts` increments.

Expect all of these logs at `Information` level. Ensure OTEL exporters include the scope `StellaOps.Feedser.Merge`.

---
## 3. Detection & Alerting

1. **Dashboard panels**
   - `feedser.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15-minute window.
   - `feedser.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
   - `feedser.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
   - `feedser.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
2. **Log-based alerts**
   - `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
   - `eventId=1002` with `reason="mismatch"` - severity disagreement; open a connector bug if sustained.
3. **Job health**
   - A `stellaops-cli db merge` exit code of `1` signifies unresolved conflicts. Pipe it to automation that captures logs and notifies #feedser-ops.
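The panel thresholds in item 1 translate into a Prometheus rule along these lines. A sketch only: the exact exported series name depends on how your exporter maps the OTEL counter `feedser.merge.conflicts` (here assumed to become `feedser_merge_conflicts_total`):

```yaml
groups:
  - name: feedser-merge-alerts
    rules:
      - alert: FeedserMergeConflicts
        # any conflict inside a 15-minute window warrants operator review
        expr: sum by (type, reason) (increase(feedser_merge_conflicts_total[15m])) > 0
        labels:
          severity: warning
        annotations:
          summary: "Feedser merge produced {{ $value }} conflicts ({{ $labels.type }}/{{ $labels.reason }})"
```

The identity-conflict alert follows the same pattern with a `[1d]` window and a `> 1` threshold.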
---

## 4. Triage Workflow

1. **Confirm job context**
   - Run `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
2. **Inspect metrics**
   - Correlate spikes in `feedser.merge.conflicts` with the `primary_source`/`suppressed_source` tags from `feedser.merge.overrides`.
3. **Pull structured logs**
   - Example (vector output):

     ```
     jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-feedser.log
     ```

4. **Review merge events**
   - `mongosh`:

     ```javascript
     use feedser;
     db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
     ```

   - Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
5. **Interrogate provenance**
   - `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
   - Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
---

## 5. Conflict Classification Matrix

| Signal | Likely Cause | Immediate Action |
|--------|--------------|------------------|
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `feedser:merge:precedence:ranks` to break the tie; restart merge job. |
| Rising `feedser.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
| `feedser.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect the `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
---

## 6. Resolution Playbook

1. **Connector data fix**
   - Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map`, etc.).
   - Once fixed, rerun the merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
2. **Temporary precedence override**
   - Edit `etc/feedser.yaml`:

     ```yaml
     feedser:
       merge:
         precedence:
           ranks:
             osv: 1
             ghsa: 0
     ```

   - Restart Feedser workers; confirm tags in `feedser.merge.overrides` show the new ranks.
   - Document the override, with an expiry, in the change log.
3. **Alias remediation**
   - Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
   - Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive; coordinate with Storage before issuing it).
4. **Escalation**
   - If override metrics spike due to an upstream regression, open an incident with the Security Guild, referencing merge logs and `merge_event` IDs.
---

## 7. Validation Checklist

- [ ] Merge job rerun returns exit code `0`.
- [ ] `feedser.merge.conflicts` returns to its zero baseline after corrective action.
- [ ] Latest `merge_event` entry shows the expected hash delta.
- [ ] The affected advisory document shows updated `provenance[].decisionReason` values.
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.

---

## 8. Reference Material

- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
- Merge engine internals: `src/StellaOps.Feedser.Merge/Services/AdvisoryPrecedenceMerger.cs`.
- Metrics definitions: `src/StellaOps.Feedser.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
- Storage audit trail: `src/StellaOps.Feedser.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Feedser.Storage.Mongo/MergeEvents`.

Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
`docs/ops/feedser-cve-kev-operations.md` (new file, 104 lines)
# Feedser CVE & KEV Connector Operations

This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.

## 1. CVE Services Connector (`source:cve:*`)

### 1.1 Prerequisites

- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Feedser workers.
- An updated `feedser.yaml` (or the matching environment variables) with the following section:

```yaml
feedser:
  sources:
    cve:
      baseEndpoint: "https://cveawg.mitre.org/api/"
      apiOrg: "ORG123"
      apiUser: "user@example.org"
      apiKeyFile: "/var/run/secrets/feedser/cve-api-key"
      pageSize: 200
      maxPagesPerFetch: 5
      initialBackfill: "30.00:00:00"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:10:00"
```

> ℹ️ Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively, supply `apiKey` via `FEEDSER_SOURCES__CVE__APIKEY`.
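A minimal sketch of the secret-file pattern from the note above. The temp directory stands in for the real secrets mount, and `example-api-key` is a placeholder value:

```shell
# Stand-in for /var/run/secrets/feedser: a mounted secrets directory.
SECRET_DIR="$(mktemp -d)"
printf '%s' 'example-api-key' > "$SECRET_DIR/cve-api-key"
chmod 600 "$SECRET_DIR/cve-api-key"   # keep the key unreadable to other users
# Environment-variable alternative: double underscores map to the config
# hierarchy, so FEEDSER_SOURCES__CVE__APIKEY overrides feedser.sources.cve.apiKey.
export FEEDSER_SOURCES__CVE__APIKEY="$(cat "$SECRET_DIR/cve-api-key")"
```

Prefer the file form in production; the environment-variable form is convenient for ad-hoc staging runs but leaks more easily via process listings and crash dumps.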
### 1.2 Smoke Test (staging)

1. Deploy the updated configuration and restart the Feedser service so the connector picks up the credentials.
2. Trigger one end-to-end cycle:
   - Feedser CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
   - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
3. Observe the following metrics (exported via OTEL meter `StellaOps.Feedser.Source.Cve`):
   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.failures`, `cve.fetch.unchanged`
   - `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
   - `cve.map.success`
4. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.

### 1.3 Production Monitoring

- **Dashboards** – Add the counters above plus `feedser.range.primitives` (filtered by `scheme=semver` or `scheme=vendor`) to the Feedser overview board. Alert when:
  - `rate(cve.fetch.failures[5m]) > 0`
  - `rate(cve.map.success[15m]) == 0` while fetch attempts continue
  - `sum_over_time(cve.parse.quarantine[1h]) > 0`
- **Logs** – Watch for `CveConnector` warnings such as `Failed fetching CVE record` or schema validation errors (`Malformed CVE JSON`). These are emitted with the CVE ID and document identifier for triage.
- **Backfill window** – Operators can tighten or widen the `initialBackfill` / `maxPagesPerFetch` values after validating baseline throughput. Update the config and restart the worker to apply changes.
## 2. CISA KEV Connector (`source:kev:*`)

### 2.1 Prerequisites

- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
- Confirm the following snippet in `feedser.yaml` (defaults shown; tune as needed):

```yaml
feedser:
  sources:
    kev:
      feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
      requestTimeout: "00:01:00"
      failureBackoff: "00:05:00"
```

### 2.2 Schema validation & anomaly handling

As of this sprint the connector validates the KEV JSON payload against `Schemas/kev-catalog.schema.json`. Malformed documents are quarantined, and entries missing a CVE ID are dropped with a warning (`reason=missingCveId`). Operators should treat repeated schema failures as an upstream regression and coordinate with CISA or mirror maintainers.
### 2.3 Smoke Test (staging)

1. Deploy the configuration and restart Feedser.
2. Trigger a pipeline run:
   - CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
   - REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
3. Verify the metrics exposed by meter `StellaOps.Feedser.Source.Kev`:
   - `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
   - `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
   - `kev.map.advisories` (tag `catalogVersion`)
4. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with the prefix `kev/` are written.
### 2.4 Production Monitoring

- Alert when `kev.fetch.success` drops to zero for longer than the expected daily cadence (default: trigger if `rate(kev.fetch.success[8h]) == 0` during business hours).
- Track anomaly spikes via `kev.parse.anomalies{reason="missingCveId"}`. A sustained non-zero rate means the upstream catalog contains unexpected records.
- The connector logs each validated catalog: `Parsed KEV catalog document … entries=X`. Absence of that log alongside consecutive `kev.fetch.success` counts suggests schema validation failures; correlate with warning-level events in the `StellaOps.Feedser.Source.Kev` logger.
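The two alerting bullets above can be expressed as Prometheus rules. A sketch under the assumption that the OTEL counters export as `kev_fetch_success_total` and `kev_parse_anomalies_total` (confirm the names your exporter actually emits):

```yaml
groups:
  - name: feedser-kev-alerts
    rules:
      - alert: KevFetchStalled
        # no successful fetch inside the expected daily cadence window
        expr: increase(kev_fetch_success_total[8h]) == 0
        for: 1h
        labels:
          severity: warning
      - alert: KevMissingCveIds
        # sustained anomalies mean the upstream catalog shape changed
        expr: increase(kev_parse_anomalies_total{reason="missingCveId"}[1d]) > 0
        labels:
          severity: warning
```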
### 2.5 Known good dashboard tiles

Add the following panels to the Feedser observability board:

| Metric | Recommended visualisation |
|--------|---------------------------|
| `kev.fetch.success` | Single-stat (last 24 h) with threshold alert |
| `rate(kev.parse.entries[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
| `sum_over_time(kev.parse.anomalies[1d])` by `reason` | Table – anomaly breakdown |
## 3. Runbook updates

- Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
`docs/ops/migrations/SEMVER_STYLE.md` (new file, 50 lines)
# SemVer Style Backfill Runbook

_Last updated: 2025-10-11_

## Overview

The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures provenance `decisionReason` values are preserved during future reads. The migration is idempotent and only runs when the feature flag `feedser:storage:enableSemVerStyle` is enabled.

## Preconditions

1. **Review configuration** – set `feedser.storage.enableSemVerStyle` to `true` on all Feedser services.
2. **Confirm batch size** – adjust `feedser.storage.backfillBatchSize` if you need smaller batches for older deployments (default: `250`).
3. **Back up** – capture a fresh snapshot of the `advisory` collection or a full MongoDB backup.
4. **Staging dry-run** – enable the flag in a staging environment and observe the migration output before rolling to production.
## Execution

No manual command is required. After deploying the configuration change, restart the Feedser WebService or any component that hosts the Mongo migration runner. During startup you will see log entries similar to:

```
Applying Mongo migration 20251011-semver-style-backfill: Populate advisory.normalizedVersions for existing documents when SemVer style storage is enabled.
Mongo migration 20251011-semver-style-backfill applied
```

The migration reads advisories in batches (`feedser.storage.backfillBatchSize`) and writes flattened `normalizedVersions` arrays. Existing documents without SemVer ranges remain untouched.
## Post-checks

1. Verify the new indexes exist:

   ```
   db.advisory.getIndexes()
   ```

   You should see `advisory_normalizedVersions_pkg_scheme_type` and `advisory_normalizedVersions_value`.

2. Spot check a few advisories to confirm the top-level `normalizedVersions` array exists and matches the embedded package data.
3. Run `dotnet test` for `StellaOps.Feedser.Storage.Mongo.Tests` (optional but recommended) in CI to confirm the storage suite passes with the feature flag enabled.
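The spot check in step 2 might look like this in `mongosh`. The projection fields beyond `normalizedVersions` are assumptions about the document shape, so adjust to what your deployment actually stores:

```javascript
// Sample a handful of migrated advisories and compare the flattened
// top-level array against the embedded package data by eye.
use feedser;
db.advisory
  .find({ normalizedVersions: { $exists: true, $ne: [] } },
        { advisoryKey: 1, normalizedVersions: 1, "affectedPackages.provenance": 1 })
  .limit(5);
```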
## Rollback

Set `feedser.storage.enableSemVerStyle` back to `false` and redeploy. The migration will be skipped on subsequent startups. You can leave the populated `normalizedVersions` arrays in place; they are ignored when the feature flag is off. If you must remove them entirely, restore from the backup captured during preparation.