# Concelier CVE & KEV Connector Operations

This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.

## 1. CVE Services Connector (`source:cve:*`)

### 1.1 Prerequisites

- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Concelier workers.
- Updated `concelier.yaml` (or the matching environment variables) with the following section:

```yaml
concelier:
  sources:
    cve:
      baseEndpoint: "https://cveawg.mitre.org/api/"
      apiOrg: "ORG123"
      apiUser: "user@example.org"
      apiKeyFile: "/var/run/secrets/concelier/cve-api-key"
      seedDirectory: "./seed-data/cve"
      pageSize: 200
      maxPagesPerFetch: 5
      initialBackfill: "30.00:00:00"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:10:00"
```
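
The duration values above (`initialBackfill`, `requestDelay`, `failureBackoff`) look like .NET `TimeSpan` strings in the `[d.]hh:mm:ss[.fff]` form, so `30.00:00:00` would mean thirty days and `00:00:00.250` a quarter of a second. The following is a minimal Python sketch for sanity-checking such values before a rollout. It assumes that format and is not the parser Concelier itself uses; `parse_timespan` is a hypothetical helper name.

```python
import re
from datetime import timedelta

# Assumed shape: optional "d." day prefix, hh:mm:ss, optional fractional seconds.
_TIMESPAN = re.compile(
    r"^(?:(?P<days>\d+)\.)?(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2})"
    r"(?:\.(?P<frac>\d{1,7}))?$"
)


def parse_timespan(text: str) -> timedelta:
    """Convert a d.hh:mm:ss[.fff] string into a timedelta."""
    match = _TIMESPAN.match(text.strip())
    if match is None:
        raise ValueError(f"not a d.hh:mm:ss[.fff] duration: {text!r}")
    # Pad or truncate the fraction to microsecond precision (6 digits).
    frac = (match.group("frac") or "").ljust(6, "0")[:6]
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("h")),
        minutes=int(match.group("m")),
        seconds=int(match.group("s")),
        microseconds=int(frac),
    )


if __name__ == "__main__":
    # Values copied from the configuration snippet above.
    for value in ("30.00:00:00", "00:00:00.250", "00:10:00"):
        print(value, "->", parse_timespan(value))
```

Running a check like this in a deployment pipeline can catch a mistyped value such as `30:00:00` (thirty hours rather than thirty days) before it reaches the workers.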

> ℹ️  Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively supply `apiKey` via `CONCELIER_SOURCES__CVE__APIKEY`.

> 🪙  When credentials are not yet available, configure `seedDirectory` to point at mirrored CVE JSON (for example, the repo’s `seed-data/cve/` bundle). The connector will ingest those records and log a warning instead of failing the job; live fetching resumes automatically once `apiOrg` / `apiUser` / `apiKey` are supplied.

### 1.2 Smoke Test (staging)

1. Deploy the updated configuration and restart the Concelier service so the connector picks up the credentials.
2. Trigger one end-to-end cycle:
   - Concelier CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
   - REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
3. Observe the following metrics (exported via OTEL meter `StellaOps.Concelier.Source.Cve`):
   - `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged`
   - `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
   - `cve.map.success`
4. Verify that Prometheus shows matching `concelier_source_http_requests_total{concelier_source="cve"}` deltas (list vs detail phases) while `concelier_source_http_failures_total{concelier_source="cve"}` stays flat.
5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`.
6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
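
The REST fallback in step 2 can be scripted for repeatable smoke tests. The sketch below builds the request body shown above and posts it with the Python standard library. The base URL, the `Content-Type` header, and the absence of authentication are assumptions to adapt to your deployment; `build_job_request` and `trigger_pipeline` are hypothetical helper names, not part of the Concelier tooling.

```python
import json
import urllib.request


def build_job_request(source: str) -> dict:
    """Return the fetch -> parse -> map chain body for a source such as 'cve'."""
    return {
        "kind": f"source:{source}:fetch",
        "chain": [f"source:{source}:parse", f"source:{source}:map"],
    }


def trigger_pipeline(base_url: str, source: str, timeout: float = 30.0) -> int:
    """POST the job request to <base_url>/jobs/run and return the HTTP status."""
    body = json.dumps(build_job_request(source)).encode("utf-8")
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/jobs/run",
        data=body,
        # Assumption: the endpoint accepts JSON without extra auth headers.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Print the payload only; call trigger_pipeline() against a real host.
    print(json.dumps(build_job_request("cve"), indent=2))
```

Passing `"kev"` instead of `"cve"` yields the equivalent body for the KEV connector.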

### 1.3 Production Monitoring

- **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `concelier_source_http_requests_total{concelier_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `concelier.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts:
  - `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`)
  - `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`)
  - `increase(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies
- **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests.
- **Grafana pack** – Import `docs/ops/concelier-cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout.
- **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Concelier to apply changes.

### 1.4 Staging smoke log (2025-10-15)

While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests:

- Command: `dotnet test src/StellaOps.Concelier.Source.Cve.Tests/StellaOps.Concelier.Source.Cve.Tests.csproj -l "console;verbosity=detailed"`
- Summary log emitted by the connector:
  ```
  CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1
  ```
- Telemetry captured by `Meter` `StellaOps.Concelier.Source.Cve`:
  | Metric | Value |
  |--------|-------|
  | `cve.fetch.attempts` | 1 |
  | `cve.fetch.success` | 1 |
  | `cve.fetch.documents` | 1 |
  | `cve.parse.success` | 1 |
  | `cve.map.success` | 1 |

The Grafana pack `docs/ops/concelier-cve-kev-grafana-dashboard.json` has been imported into staging so the panels referenced above render against these counters once the live API keys are in place.
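
The summary line shown above is a flat run of `key=value` tokens, so a log pipeline can extract `detailFailures` without a dedicated parser. The following Python sketch illustrates one way to do that, assuming the token layout stays as in the recorded sample; `parse_summary` and `detail_failures` are hypothetical helper names.

```python
import re

# Each field is a word, "=", then a run of non-whitespace characters.
_TOKEN = re.compile(r"(\w+)=(\S+)")

# Sample copied from the staging smoke log above.
SAMPLE = (
    "CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 "
    "listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 "
    "pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False "
    "nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1"
)


def parse_summary(line: str) -> dict:
    """Return every key=value token of the summary line as strings."""
    return dict(_TOKEN.findall(line))


def detail_failures(line: str) -> int:
    """Return the detailFailures count; raises KeyError if the field is absent."""
    return int(parse_summary(line)["detailFailures"])


if __name__ == "__main__":
    fields = parse_summary(SAMPLE)
    print(fields["pages"], fields["detailDocuments"], detail_failures(SAMPLE))
```

Raising on a missing `detailFailures` field, rather than defaulting to zero, keeps a changed log format from silently passing the check.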

## 2. CISA KEV Connector (`source:kev:*`)

### 2.1 Prerequisites

- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
- Confirm the following snippet in `concelier.yaml` (defaults shown; tune as needed):

```yaml
concelier:
  sources:
    kev:
      feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
      requestTimeout: "00:01:00"
      failureBackoff: "00:05:00"
```

### 2.2 Schema validation & anomaly handling

The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev_parse_failures_total{reason="schema"}`, and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev_parse_anomalies_total` with the following reasons:

| Reason | Meaning |
| --- | --- |
| `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. |
| `countMismatch` | Catalog `count` field disagreed with the actual entry total. |
| `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). |

Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.
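
To make the three anomaly reasons concrete, the sketch below tallies them for a catalog that has already been loaded as JSON. It is an illustration of the rules in the table, not the connector's implementation. It assumes the entries live under a `vulnerabilities` array and that `count` is compared against the raw array length, including `null` entries; `count_anomalies` is a hypothetical helper name.

```python
from collections import Counter


def count_anomalies(catalog: dict) -> Counter:
    """Tally entry-level anomalies using the reason names from the table above."""
    anomalies = Counter()
    # Assumption: entries are stored under the "vulnerabilities" key.
    entries = catalog.get("vulnerabilities") or []
    for entry in entries:
        if entry is None:
            anomalies["nullEntry"] += 1
        elif not entry.get("cveID"):
            anomalies["missingCveId"] += 1
    declared = catalog.get("count")
    # Assumption: "count" is checked against the raw entry total.
    if declared is not None and declared != len(entries):
        anomalies["countMismatch"] += 1
    return anomalies


if __name__ == "__main__":
    sample = {
        "count": 4,
        "vulnerabilities": [
            {"cveID": "CVE-2024-0001"},
            {"vendorProject": "Example"},  # no cveID
            None,                           # null entry
        ],
    }
    print(dict(count_anomalies(sample)))
```

A check like this can be run against a mirrored copy of the feed to confirm whether an anomaly spike originates upstream or in the mirror.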

### 2.3 Smoke Test (staging)

1. Deploy the configuration and restart Concelier.
2. Trigger a pipeline run:
   - CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
   - REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
3. Verify the metrics exposed by meter `StellaOps.Concelier.Source.Kev`:
   - `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
   - `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
   - `kev.map.advisories` (tag `catalogVersion`)
4. Confirm `concelier_source_http_requests_total{concelier_source="kev"}` increments once per fetch and that the paired `concelier_source_http_failures_total{concelier_source="kev"}` stays flat (zero increase).
5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`; each should appear exactly once per run. The `Mapped X/Y… skipped=0` log should match the `kev.map.advisories` delta.
6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.

### 2.4 Production Monitoring

- Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`.
- Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`; this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as network issues between the workers and CISA or the mirror.
- Track anomaly spikes through `increase(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs.
- Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.

### 2.5 Known good dashboard tiles

Add the following panels to the Concelier observability board:

| Metric | Recommended visualisation |
|--------|---------------------------|
| `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` |
| `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
| `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) |
| `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted |

## 3. Runbook updates

- Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
- Version-control dashboard tweaks alongside `docs/ops/concelier-cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores.