Rename Feedser to Concelier
docs/ops/concelier-cve-kev-operations.md
							| @@ -0,0 +1,143 @@ | ||||
| # Concelier CVE & KEV Connector Operations | ||||
|  | ||||
| This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments. | ||||
|  | ||||
| ## 1. CVE Services Connector (`source:cve:*`) | ||||
|  | ||||
| ### 1.1 Prerequisites | ||||
|  | ||||
| - CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API. | ||||
| - Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Concelier workers. | ||||
| - Updated `concelier.yaml` (or the matching environment variables) with the following section: | ||||
|  | ||||
| ```yaml | ||||
| concelier: | ||||
|   sources: | ||||
|     cve: | ||||
|       baseEndpoint: "https://cveawg.mitre.org/api/" | ||||
|       apiOrg: "ORG123" | ||||
|       apiUser: "user@example.org" | ||||
|       apiKeyFile: "/var/run/secrets/concelier/cve-api-key" | ||||
|       seedDirectory: "./seed-data/cve" | ||||
|       pageSize: 200 | ||||
|       maxPagesPerFetch: 5 | ||||
|       initialBackfill: "30.00:00:00" | ||||
|       requestDelay: "00:00:00.250" | ||||
|       failureBackoff: "00:10:00" | ||||
| ``` | ||||
|  | ||||
| > ℹ️  Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively supply `apiKey` via `CONCELIER_SOURCES__CVE__APIKEY`. | ||||
|  | ||||
| > 🪙  When credentials are not yet available, configure `seedDirectory` to point at mirrored CVE JSON (for example, the repo’s `seed-data/cve/` bundle). The connector will ingest those records and log a warning instead of failing the job; live fetching resumes automatically once `apiOrg` / `apiUser` / `apiKey` are supplied. | ||||
|  | ||||
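The credential options described in the notes above can be sketched as a small lookup. This is a minimal illustration, not the connector's implementation: the helper name `resolve_api_key` is hypothetical, and the precedence (environment variable first, then the mounted file) is an assumption. Only `CONCELIER_SOURCES__CVE__APIKEY` and the `apiKeyFile` path come from this playbook.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Environment variable named in the note above.
API_KEY_ENV = "CONCELIER_SOURCES__CVE__APIKEY"


def resolve_api_key(
    api_key_file: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the CVE Services API key, or None when nothing is configured.

    Hypothetical helper: the environment variable is checked first, then the
    mounted secret file. That ordering is an assumption for this sketch.
    """
    from_env = env.get(API_KEY_ENV, "").strip()
    if from_env:
        return from_env
    if api_key_file:
        path = Path(api_key_file)
        if path.is_file():
            # Secret files often end with a newline; strip it.
            key = path.read_text(encoding="utf-8").strip()
            return key or None
    return None
```

A `None` result corresponds to the "credentials not yet available" case, where the playbook falls back to `seedDirectory`.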
| ### 1.2 Smoke Test (staging) | ||||
|  | ||||
| 1. Deploy the updated configuration and restart the Concelier service so the connector picks up the credentials. | ||||
| 2. Trigger one end-to-end cycle: | ||||
|    - Concelier CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map` | ||||
|    - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }` | ||||
| 3. Observe the following metrics (exported via OTEL meter `StellaOps.Concelier.Source.Cve`): | ||||
|    - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged` | ||||
|    - `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine` | ||||
|    - `cve.map.success` | ||||
| 4. Verify Prometheus shows matching `concelier_source_http_requests_total{concelier_source="cve"}` deltas across the list and detail phases while `concelier_source_http_failures_total{concelier_source="cve"}` stays flat. | ||||
| 5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`. | ||||
| 6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced. | ||||
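The REST fallback from step 2 can be scripted with the standard library. The sketch below only builds and sends the request body shown above. The base URL is a placeholder, the function names are hypothetical, and any authentication your deployment requires is not covered here.

```python
import json
import urllib.request


def build_job_request(base_url: str, kind: str, chain: list[str]) -> urllib.request.Request:
    """Build the POST /jobs/run request for a job kind plus its chained follow-ups."""
    body = json.dumps({"kind": kind, "chain": chain}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/jobs/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_job(base_url: str, kind: str, chain: list[str], timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code (raises on HTTP errors)."""
    request = build_job_request(base_url, kind, chain)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For the CVE cycle, call `run_job(base_url, "source:cve:fetch", ["source:cve:parse", "source:cve:map"])` with the base URL of your Concelier instance.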
|  | ||||
| ### 1.3 Production Monitoring | ||||
|  | ||||
| - **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `concelier_source_http_requests_total{concelier_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `concelier.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts: | ||||
|   - `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`) | ||||
|   - `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`) | ||||
|   - `increase(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies | ||||
| - **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests. | ||||
| - **Grafana pack** – Import `docs/ops/concelier-cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout. | ||||
| - **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Concelier to apply changes. | ||||
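When tuning the backfill window, it helps to translate the configured values into concrete numbers. The sketch below assumes the duration strings follow the .NET `TimeSpan` layout (`[d.]hh:mm:ss[.fff]`), which is how the sample values in section 1.1 read. The helper names are hypothetical. The delay estimate further assumes one delayed request per document, which this playbook does not confirm.

```python
import re
from datetime import timedelta

# [d.]hh:mm:ss[.fffffff], the layout the sample config values appear to use.
_TIMESPAN = re.compile(
    r"^(?:(?P<days>\d+)\.)?(?P<hours>\d{1,2}):(?P<minutes>\d{2}):(?P<seconds>\d{2})"
    r"(?:\.(?P<fraction>\d{1,7}))?$"
)


def parse_timespan(value: str) -> timedelta:
    """Parse a duration such as "30.00:00:00" or "00:00:00.250"."""
    match = _TIMESPAN.match(value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    fraction = match.group("fraction") or ""
    # Pad the fraction to 7 digits (ticks), then keep microsecond precision.
    microseconds = int(fraction.ljust(7, "0")[:6]) if fraction else 0
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("hours")),
        minutes=int(match.group("minutes")),
        seconds=int(match.group("seconds")),
        microseconds=microseconds,
    )


def max_documents_per_fetch(page_size: int, max_pages: int) -> int:
    """Upper bound on list entries a single fetch run can cover."""
    return page_size * max_pages


def request_delay_floor(page_size: int, max_pages: int, request_delay: str) -> timedelta:
    """Minimum time spent waiting, assuming one delayed request per document."""
    return parse_timespan(request_delay) * max_documents_per_fetch(page_size, max_pages)
```

With the sample values (`pageSize: 200`, `maxPagesPerFetch: 5`, `requestDelay: "00:00:00.250"`), a full run covers at most 1000 documents and, under the stated assumption, waits at least 250 seconds between requests in total.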
|  | ||||
| ### 1.4 Staging smoke log (2025-10-15) | ||||
|  | ||||
| While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests: | ||||
|  | ||||
| - Command: `dotnet test src/StellaOps.Concelier.Source.Cve.Tests/StellaOps.Concelier.Source.Cve.Tests.csproj -l "console;verbosity=detailed"` | ||||
| - Summary log emitted by the connector: | ||||
|   ``` | ||||
|   CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1 | ||||
|   ``` | ||||
| - Telemetry captured by `Meter` `StellaOps.Concelier.Source.Cve`: | ||||
|   | Metric | Value | | ||||
|   |--------|-------| | ||||
|   | `cve.fetch.attempts` | 1 | | ||||
|   | `cve.fetch.success` | 1 | | ||||
|   | `cve.fetch.documents` | 1 | | ||||
|   | `cve.parse.success` | 1 | | ||||
|   | `cve.map.success` | 1 | | ||||
|  | ||||
| The Grafana pack `docs/ops/concelier-cve-kev-grafana-dashboard.json` has been imported into staging so the panels referenced above render against these counters once the live API keys are in place. | ||||
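The summary log is a flat list of `key=value` tokens, so it can be checked mechanically during a smoke test. The following is a minimal sketch with hypothetical helper names. It assumes values never contain spaces, which holds for the sample line recorded above.

```python
import re

# key=value tokens; the window bounds before "pages=" carry no "=" and are skipped.
_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_summary(line: str) -> dict[str, str]:
    """Extract the key=value fields from a "CVEs fetch window" summary line."""
    return dict(_FIELD.findall(line))


def fetch_is_clean(line: str) -> bool:
    """True when the summary line reports no failed detail requests."""
    return parse_summary(line).get("detailFailures") == "0"
```

Values are kept as strings so that fields such as `pendingDocuments=0->1` and `nextWindowEnd=(none)` survive unchanged.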
|  | ||||
| ## 2. CISA KEV Connector (`source:kev:*`) | ||||
|  | ||||
| ### 2.1 Prerequisites | ||||
|  | ||||
| - Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`. | ||||
| - No credentials are required, but the HTTP allow-list must include `www.cisa.gov`. | ||||
| - Confirm the following snippet in `concelier.yaml` (defaults shown; tune as needed): | ||||
|  | ||||
| ```yaml | ||||
| concelier: | ||||
|   sources: | ||||
|     kev: | ||||
|       feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json" | ||||
|       requestTimeout: "00:01:00" | ||||
|       failureBackoff: "00:05:00" | ||||
| ``` | ||||
|  | ||||
| ### 2.2 Schema validation & anomaly handling | ||||
|  | ||||
| The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev_parse_failures_total{reason="schema"}` and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev_parse_anomalies_total` with reasons: | ||||
|  | ||||
| | Reason | Meaning | | ||||
| | --- | --- | | ||||
| | `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. | | ||||
| | `countMismatch` | Catalog `count` field disagreed with the actual entry total. | | ||||
| | `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). | | ||||
|  | ||||
| Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers. | ||||
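To reproduce the anomaly reasons against a downloaded or mirrored catalog, a check along these lines may help. It is a sketch, not the connector's logic: the function name is hypothetical, the `vulnerabilities` array name is taken from the published CISA feed, and counting `null` entries toward the actual total is an assumption.

```python
from collections import Counter
from typing import Any


def count_anomalies(catalog: dict[str, Any]) -> Counter:
    """Tally entry-level anomalies using the reason names from the table above."""
    anomalies: Counter = Counter()
    entries = catalog.get("vulnerabilities") or []
    for entry in entries:
        if entry is None:
            # Upstream emitted a null entry object.
            anomalies["nullEntry"] += 1
        elif not entry.get("cveID"):
            # Entry present but without a CVE identifier.
            anomalies["missingCveId"] += 1
    declared = catalog.get("count")
    # Assumption: null entries still count toward the actual total.
    if declared is not None and declared != len(entries):
        anomalies["countMismatch"] += 1
    return anomalies
```

Running this against the same catalog version the connector ingested gives a quick cross-check for the `reason` breakdown on the anomaly counter.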
|  | ||||
| ### 2.3 Smoke Test (staging) | ||||
|  | ||||
| 1. Deploy the configuration and restart Concelier. | ||||
| 2. Trigger a pipeline run: | ||||
|    - CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map` | ||||
|    - REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }` | ||||
| 3. Verify the metrics exposed by meter `StellaOps.Concelier.Source.Kev`: | ||||
|    - `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures` | ||||
|    - `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`) | ||||
|    - `kev.map.advisories` (tag `catalogVersion`) | ||||
| 4. Confirm `concelier_source_http_requests_total{concelier_source="kev"}` increments once per fetch and that the paired `concelier_source_http_failures_total{concelier_source="kev"}` stays flat (zero increase). | ||||
| 5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`; each should appear exactly once per run. The `Mapped X/Y… skipped=0` log should agree with the `kev.map.advisories` delta. | ||||
| 6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written. | ||||
|  | ||||
| ### 2.4 Production Monitoring | ||||
|  | ||||
| - Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`. | ||||
| - Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`—this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as networking issues to the mirror. | ||||
| - Track anomaly spikes through `increase(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs. | ||||
| - Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run. | ||||
|  | ||||
| ### 2.5 Known good dashboard tiles | ||||
|  | ||||
| Add the following panels to the Concelier observability board: | ||||
|  | ||||
| | Metric | Recommended visualisation | | ||||
| |--------|---------------------------| | ||||
| | `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` | | ||||
| | `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size | | ||||
| | `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) | | ||||
| | `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted | | ||||
|  | ||||
| ## 3. Runbook updates | ||||
|  | ||||
| - Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log. | ||||
| - Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime. | ||||
| - Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics). | ||||
| - Version-control dashboard tweaks alongside `docs/ops/concelier-cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores. | ||||