# Concelier CVE & KEV Connector Operations

This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.

## 1. CVE Services Connector (`source:cve:*`)

### 1.1 Prerequisites

- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Concelier workers.
- Updated `concelier.yaml` (or the matching environment variables) with the following section:

```yaml
concelier:
  sources:
    cve:
      baseEndpoint: "https://cveawg.mitre.org/api/"
      apiOrg: "ORG123"
      apiUser: "user@example.org"
      apiKeyFile: "/var/run/secrets/concelier/cve-api-key"
      seedDirectory: "./seed-data/cve"
      pageSize: 200
      maxPagesPerFetch: 5
      initialBackfill: "30.00:00:00"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:10:00"
```

> ℹ️  Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively supply `apiKey` via `CONCELIER_SOURCES__CVE__APIKEY`.

> 🪙  When credentials are not yet available, configure `seedDirectory` to point at mirrored CVE JSON (for example, the repo’s `seed-data/cve/` bundle). The connector will ingest those records and log a warning instead of failing the job; live fetching resumes automatically once `apiOrg` / `apiUser` / `apiKey` are supplied.
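
The duration values in the configuration (`initialBackfill`, `requestDelay`, `failureBackoff`) look like .NET `TimeSpan` strings in the `[d.]hh:mm:ss[.fraction]` layout. That reading is an assumption, since this playbook does not state the format. The Python sketch below uses a hypothetical helper, `parse_timespan`, to show how the three sample values would be interpreted under that assumption. It is a sanity-check aid for operators, not part of Concelier.

```python
import re
from datetime import timedelta

# Assumed layout: [d.]hh:mm:ss[.fraction], as used by .NET TimeSpan strings.
_TIMESPAN = re.compile(
    r"^(?:(?P<days>\d+)\.)?"
    r"(?P<hours>\d{1,2}):(?P<minutes>\d{2}):(?P<seconds>\d{2})"
    r"(?:\.(?P<fraction>\d{1,7}))?$"
)


def parse_timespan(value: str) -> timedelta:
    """Convert a d.hh:mm:ss.fff string into a timedelta (hypothetical helper)."""
    match = _TIMESPAN.match(value)
    if match is None:
        raise ValueError(f"not a [d.]hh:mm:ss[.fraction] duration: {value!r}")
    fraction = match["fraction"] or ""
    # Pad or truncate the fractional part to microsecond precision.
    microseconds = int(fraction.ljust(6, "0")[:6]) if fraction else 0
    return timedelta(
        days=int(match["days"] or 0),
        hours=int(match["hours"]),
        minutes=int(match["minutes"]),
        seconds=int(match["seconds"]),
        microseconds=microseconds,
    )


if __name__ == "__main__":
    for key, raw in [
        ("initialBackfill", "30.00:00:00"),
        ("requestDelay", "00:00:00.250"),
        ("failureBackoff", "00:10:00"),
    ]:
        print(key, "=", parse_timespan(raw))
```

Read this way, the sample configuration asks for a 30-day initial backfill, a 250 ms pause between requests, and a 10-minute backoff after a failure.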

### 1.2 Smoke Test (staging)

1. Deploy the updated configuration and restart the Concelier service so the connector picks up the credentials.
2. Trigger one end-to-end cycle:
   - Concelier CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
   - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
3. Observe the following metrics (exported via OTEL meter `StellaOps.Concelier.Source.Cve`):
   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged`
   - `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
   - `cve.map.success`
4. Verify Prometheus shows matching `concelier.source.http.requests_total{concelier_source="cve"}` deltas (list vs detail phases) while `concelier.source.http.failures_total{concelier_source="cve"}` stays flat.
5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`.
6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
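
The REST fallback in step 2 can be scripted. The sketch below builds the `POST /jobs/run` request with the body shape shown above. The base URL, the `Content-Type` header, and the helper names (`build_job_request`, `make_request`) are assumptions for illustration; any authentication your deployment requires is not covered here. The script only prints the request, so sending it (for example with `urllib.request.urlopen`) is left to the operator.

```python
import json
import urllib.request


def build_job_request(source: str) -> dict:
    """Body for a fetch -> parse -> map chain, mirroring the REST example."""
    return {
        "kind": f"source:{source}:fetch",
        "chain": [f"source:{source}:parse", f"source:{source}:map"],
    }


def make_request(base_url: str, source: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST /jobs/run request."""
    body = json.dumps(build_job_request(source)).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/jobs/run",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host name; substitute the Concelier endpoint for your environment.
    request = make_request("http://concelier.example.internal", "cve")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing `"kev"` instead of `"cve"` yields the equivalent request for the KEV connector.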

### 1.3 Production Monitoring

- **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `concelier_source_http_requests_total{concelier_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `concelier.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts:
  - `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`)
  - `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`)
  - `sum_over_time(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies
- **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests.
- **Grafana pack** – Import `docs/ops/concelier-cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout.
- **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Concelier to apply changes.
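
Before wiring the example alert expressions into an alerting rule, it can help to spot-check them by hand. The sketch below builds a URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and interprets the JSON response. A comparison such as `... > 0` returns only the series that satisfy it, so a non-empty result vector means the condition currently holds. The Prometheus base URL and the helper names are hypothetical, and the sample responses are hand-written rather than captured from a live server.

```python
import urllib.parse


def instant_query_url(base_url: str, expression: str) -> str:
    """URL for a Prometheus instant query evaluating `expression`."""
    query = urllib.parse.urlencode({"query": expression})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def condition_holds(response: dict) -> bool:
    """True when the comparison expression matched at least one series."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return len(response["data"]["result"]) > 0


if __name__ == "__main__":
    # Hypothetical Prometheus address; the expression is taken from the alert list above.
    print(instant_query_url(
        "http://prometheus.example.internal:9090",
        "rate(cve_fetch_failures_total[5m]) > 0",
    ))

    # Hand-written sample responses in the documented instant-query shape.
    quiet = {"status": "success", "data": {"resultType": "vector", "result": []}}
    noisy = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {}, "value": [1700000000, "0.2"]}],
        },
    }
    print(condition_holds(quiet), condition_holds(noisy))
```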

### 1.4 Staging smoke log (2025-10-15)

While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests:

- Command: `dotnet test src/StellaOps.Concelier.Source.Cve.Tests/StellaOps.Concelier.Source.Cve.Tests.csproj -l "console;verbosity=detailed"`
- Summary log emitted by the connector:
  ```
  CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1
  ```
- Telemetry captured by `Meter` `StellaOps.Concelier.Source.Cve`:
  | Metric | Value |
  |--------|-------|
  | `cve.fetch.attempts` | 1 |
  | `cve.fetch.success` | 1 |
  | `cve.fetch.documents` | 1 |
  | `cve.parse.success` | 1 |
  | `cve.map.success` | 1 |

The Grafana pack `docs/ops/concelier-cve-kev-grafana-dashboard.json` has been imported into staging so the panels referenced above render against these counters once the live API keys are in place.
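
The summary log line recorded above is a flat sequence of `key=value` tokens, so a smoke-test script can check it mechanically. The sketch below is a minimal parser written against that one sample line. The helper names are hypothetical, and the regular expression assumes values never contain whitespace, which holds for the sample but is not guaranteed by this playbook.

```python
import re

# key=value tokens; values run until the next whitespace (assumption from the sample).
_FIELD = re.compile(r"(\w+)=(\S+)")

SAMPLE = (
    "CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 "
    "listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 "
    "pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False "
    "nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1"
)


def parse_summary(line: str) -> dict:
    """Extract the key=value fields of a fetch summary log line as strings."""
    return dict(_FIELD.findall(line))


def smoke_test_passed(line: str) -> bool:
    """Smoke-test criterion from section 1.2: detailFailures must be 0."""
    fields = parse_summary(line)
    return int(fields["detailFailures"]) == 0


if __name__ == "__main__":
    for key, value in parse_summary(SAMPLE).items():
        print(f"{key:>18} = {value}")
    print("smoke test passed:", smoke_test_passed(SAMPLE))
```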

## 2. CISA KEV Connector (`source:kev:*`)

### 2.1 Prerequisites

- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
- Confirm the following snippet in `concelier.yaml` (defaults shown; tune as needed):

```yaml
concelier:
  sources:
    kev:
      feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
      requestTimeout: "00:01:00"
      failureBackoff: "00:05:00"
```

### 2.2 Schema validation & anomaly handling

The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev.parse.failures_total{reason="schema"}` and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev.parse.anomalies_total` with reasons:

| Reason | Meaning |
| --- | --- |
| `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. |
| `countMismatch` | Catalog `count` field disagreed with the actual entry total. |
| `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). |

Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.
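
To make the three anomaly reasons concrete, the sketch below counts them for a KEV-shaped catalog. It illustrates the definitions in the table and is not the connector's implementation. The `vulnerabilities`, `cveID`, and `count` field names follow the public KEV JSON feed. The decision to compare `count` against every entry, including `null` ones, is an assumption, and the sample catalog is invented for the example.

```python
from collections import Counter


def count_anomalies(catalog: dict) -> Counter:
    """Tally entry-level anomalies using the reasons from the table above."""
    anomalies = Counter()
    entries = catalog.get("vulnerabilities") or []
    for entry in entries:
        if entry is None:
            anomalies["nullEntry"] += 1       # upstream emitted a null entry
        elif not entry.get("cveID"):
            anomalies["missingCveId"] += 1    # entry would be skipped
    declared = catalog.get("count")
    # Assumption: `count` is compared against all entries, null ones included.
    if declared is not None and declared != len(entries):
        anomalies["countMismatch"] += 1
    return anomalies


if __name__ == "__main__":
    # Invented sample: one valid entry, one null entry, one entry without cveID.
    sample = {
        "catalogVersion": "2025.10.15",
        "count": 5,
        "vulnerabilities": [
            {"cveID": "CVE-2024-0001", "vendorProject": "Example"},
            None,
            {"vendorProject": "Example"},
        ],
    }
    print(dict(count_anomalies(sample)))
```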

### 2.3 Smoke Test (staging)

1. Deploy the configuration and restart Concelier.
2. Trigger a pipeline run:
   - CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
   - REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
3. Verify the metrics exposed by meter `StellaOps.Concelier.Source.Kev`:
   - `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
   - `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
   - `kev.map.advisories` (tag `catalogVersion`)
4. Confirm `concelier.source.http.requests_total{concelier_source="kev"}` increments once per fetch and that the paired `concelier.source.http.failures_total` stays flat (zero increase).
5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`. Each should appear exactly once per run, and the `Mapped X/Y… skipped=0` line should match the `kev.map.advisories` delta.
6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.

### 2.4 Production Monitoring

- Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`.
- Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`; this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as networking issues to the mirror.
- Track anomaly spikes through `sum_over_time(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs.
- Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.

### 2.5 Known good dashboard tiles

Add the following panels to the Concelier observability board:

| Metric | Recommended visualisation |
|--------|---------------------------|
| `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` |
| `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
| `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) |
| `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted |

## 3. Runbook updates

- Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
- Version-control dashboard tweaks alongside `docs/ops/concelier-cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores.