Concelier CVE & KEV Connector Operations

This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.

1. CVE Services Connector (source:cve:*)

1.1 Prerequisites

  • CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
  • Network egress to https://cveawg.mitre.org (or a mirrored endpoint) from the Concelier workers.
  • Updated concelier.yaml (or the matching environment variables) with the following section:
concelier:
  sources:
    cve:
      baseEndpoint: "https://cveawg.mitre.org/api/"
      apiOrg: "ORG123"
      apiUser: "user@example.org"
      apiKeyFile: "/var/run/secrets/concelier/cve-api-key"
      seedDirectory: "./seed-data/cve"
      pageSize: 200
      maxPagesPerFetch: 5
      initialBackfill: "30.00:00:00"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:10:00"

Store the API key outside source control. When using apiKeyFile, mount the secret file into the container/host; alternatively supply apiKey via CONCELIER_SOURCES__CVE__APIKEY.

🪙 When credentials are not yet available, configure seedDirectory to point at mirrored CVE JSON (for example, the repo's seed-data/cve/ bundle). The connector will ingest those records and log a warning instead of failing the job; live fetching resumes automatically once apiOrg / apiUser / apiKey are supplied.
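
The duration settings in the sample configuration (initialBackfill, requestDelay, failureBackoff) look like .NET TimeSpan strings of the form [d.]hh:mm:ss[.fff]. That reading is an assumption based on the value shapes, not something this document states. Under that assumption, a minimal Python sketch for sanity-checking a value before deploying it could look like this (parse_timespan is a hypothetical helper, not part of Concelier):

```python
import re
from datetime import timedelta

# Assumed shape: optional "days." prefix, then hh:mm:ss, then optional ".fraction".
_TIMESPAN = re.compile(
    r"^(?:(?P<days>\d+)\.)?(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2})"
    r"(?:\.(?P<frac>\d{1,7}))?$"
)


def parse_timespan(text: str) -> timedelta:
    """Convert a [d.]hh:mm:ss[.fff] string into a timedelta (hypothetical helper)."""
    match = _TIMESPAN.match(text.strip())
    if match is None:
        raise ValueError(f"not a [d.]hh:mm:ss[.fff] duration: {text!r}")
    frac = match.group("frac") or ""
    # Pad the fraction to 7 digits, then keep 6 so it fits timedelta's microseconds.
    micros = int(frac.ljust(7, "0")[:6]) if frac else 0
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("h")),
        minutes=int(match.group("m")),
        seconds=int(match.group("s")),
        microseconds=micros,
    )


# Values taken from the sample configuration above.
initial_backfill = parse_timespan("30.00:00:00")   # 30 days
request_delay = parse_timespan("00:00:00.250")     # 250 milliseconds
failure_backoff = parse_timespan("00:10:00")       # 10 minutes
```

Read this way, the sample backfills 30 days of records, waits a quarter of a second between requests, and backs off for ten minutes after a failure.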

1.2 Smoke Test (staging)

  1. Deploy the updated configuration and restart the Concelier service so the connector picks up the credentials.
  2. Trigger one end-to-end cycle:
    • Concelier CLI: stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map
    • REST fallback: POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }
  3. Observe the following metrics (exported via OTEL meter StellaOps.Concelier.Connector.Cve):
    • cve.fetch.attempts, cve.fetch.success, cve.fetch.documents, cve.fetch.failures, cve.fetch.unchanged
    • cve.parse.success, cve.parse.failures, cve.parse.quarantine
    • cve.map.success
  4. Verify Prometheus shows matching concelier.source.http.requests_total{concelier_source="cve"} deltas (list vs detail phases) while concelier.source.http.failures_total{concelier_source="cve"} stays flat.
  5. Confirm the info-level summary log CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z appears once per fetch run and shows detailFailures=0.
  6. Verify the MongoDB advisory store contains fresh CVE advisories (advisoryKey prefix cve/) and that the source cursor (source_states collection) advanced.
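
The REST fallback in step 2 can be scripted when the CLI is not installed on the operator's machine. The sketch below only builds the request; the base URL, the Content-Type header, and the absence of authentication are assumptions for illustration, and build_job_request is a hypothetical helper. Only the /jobs/run path and the kind/chain body come from this playbook.

```python
import json
import urllib.request


def build_job_request(base_url: str, kind: str, chain: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST /jobs/run request described in step 2."""
    body = json.dumps({"kind": kind, "chain": chain}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/jobs/run",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


# "http://localhost:8080" is a placeholder; substitute the real Concelier endpoint.
request = build_job_request(
    "http://localhost:8080",
    "source:cve:fetch",
    ["source:cve:parse", "source:cve:map"],
)
# urllib.request.urlopen(request) would submit the job; it is omitted here
# so the sketch has no side effects.
```

The same helper covers the KEV pipeline by swapping in the source:kev:* job kinds.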

1.3 Production Monitoring

  • Dashboards: Plot rate(cve_fetch_success_total[5m]), rate(cve_fetch_failures_total[5m]), and rate(cve_fetch_documents_total[5m]) alongside concelier_source_http_requests_total{concelier_source="cve"} to confirm HTTP and connector counters stay aligned. Keep concelier.range.primitives{scheme=~"semver|vendor"} on the same board for range coverage. Example alerts:
    • rate(cve_fetch_failures_total[5m]) > 0 for 10 minutes (severity=warning)
    • rate(cve_map_success_total[15m]) == 0 while rate(cve_fetch_success_total[15m]) > 0 (severity=critical)
    • sum_over_time(cve_parse_quarantine_total[1h]) > 0 to catch schema anomalies
  • Logs: Monitor warnings such as Failed fetching CVE record {CveId} and Malformed CVE JSON, and surface the summary info log CVEs fetch window … detailFailures=0 detailUnchanged=0 on dashboards. A non-zero detailFailures usually indicates rate-limit or auth issues on detail requests.
  • Grafana pack: Import docs/ops/concelier-cve-kev-grafana-dashboard.json and filter by panel legend (CVE, KEV) to reuse the canned layout.
  • Backfill window: Operators can tighten or widen initialBackfill / maxPagesPerFetch after validating throughput. Update config and restart Concelier to apply changes.
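
Before changing the backfill settings, it can help to estimate how much work a single fetch run may do. The sketch below is a rough estimate only. It assumes one list request per page plus one detail request per record, and that requestDelay applies before every request. Neither assumption is confirmed by this playbook, and fetch_run_ceiling is a hypothetical helper.

```python
from datetime import timedelta


def fetch_run_ceiling(page_size: int, max_pages: int, request_delay: timedelta):
    """Return (max records, minimum time spent in delays) for one fetch run."""
    max_records = page_size * max_pages
    # Assumption: each page costs one list request and each record one detail request.
    max_requests = max_pages + max_records
    min_delay = request_delay * max_requests
    return max_records, min_delay


# Values from the sample configuration: pageSize=200, maxPagesPerFetch=5, requestDelay=250 ms.
records, delay = fetch_run_ceiling(200, 5, timedelta(milliseconds=250))
print(records, delay.total_seconds())
```

Under these assumptions, raising maxPagesPerFetch increases both the record ceiling and the time a run spends waiting between requests, which is worth checking against the job schedule.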

1.4 Staging smoke log (2025-10-15)

While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests:

  • Command: dotnet test src/StellaOps.Concelier.Connector.Cve.Tests/StellaOps.Concelier.Connector.Cve.Tests.csproj -l "console;verbosity=detailed"
  • Summary log emitted by the connector:
    CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1
    
  • Telemetry captured by Meter StellaOps.Concelier.Connector.Cve:
    | Metric | Value |
    | --- | --- |
    | cve.fetch.attempts | 1 |
    | cve.fetch.success | 1 |
    | cve.fetch.documents | 1 |
    | cve.parse.success | 1 |
    | cve.map.success | 1 |

The Grafana pack docs/ops/concelier-cve-kev-grafana-dashboard.json has been imported into staging so the panels referenced above render against these counters once the live API keys are in place.
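
The summary log recorded in the smoke log above is a flat run of key=value tokens, so a smoke test can check it mechanically. The following is a minimal sketch, assuming values never contain spaces (true of the recorded sample); parse_summary is a hypothetical helper, not connector code.

```python
import re


def parse_summary(line: str) -> dict[str, str]:
    """Extract key=value tokens from a 'CVEs fetch window' summary line."""
    # \S+ keeps values such as "0->1" and "(none)" intact.
    return dict(re.findall(r"(\w+)=(\S+)", line))


# Summary line copied from the staging smoke log above.
sample = (
    "CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 "
    "listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 "
    "pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False "
    "nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1"
)

fields = parse_summary(sample)
if fields["detailFailures"] != "0":
    raise SystemExit("detail requests failed; check rate limits and credentials")
```

The fetch window itself has no key, so it is not captured; only the named counters are returned.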

2. CISA KEV Connector (source:kev:*)

2.1 Prerequisites

  • Network egress (or mirrored content) for https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json.
  • No credentials are required, but the HTTP allow-list must include www.cisa.gov.
  • Confirm the following snippet in concelier.yaml (defaults shown; tune as needed):
concelier:
  sources:
    kev:
      feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
      requestTimeout: "00:01:00"
      failureBackoff: "00:05:00"

2.2 Schema validation & anomaly handling

The connector validates each catalog against Schemas/kev-catalog.schema.json. Failures increment kev.parse.failures_total{reason="schema"} and the document is quarantined (status Failed). Additional failure reasons include download, invalidJson, deserialize, missingPayload, and emptyCatalog. Entry-level anomalies are surfaced through kev.parse.anomalies_total with reasons:

| Reason | Meaning |
| --- | --- |
| missingCveId | Catalog entry omitted cveID; the entry is skipped. |
| countMismatch | Catalog count field disagreed with the actual entry total. |
| nullEntry | Upstream emitted a null entry object (rare upstream defect). |

Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.
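
To triage a suspect catalog outside the connector, the three anomaly reasons can be approximated offline. The sketch below illustrates the checks described in the table and is not the connector's implementation. It relies on the public KEV feed layout (a top-level count and a vulnerabilities array whose entries carry cveID). Whether the connector counts null entries towards the total is an assumption, and scan_catalog is a hypothetical helper.

```python
from collections import Counter


def scan_catalog(catalog: dict) -> Counter:
    """Tally entry-level anomalies in a parsed KEV catalog document."""
    anomalies: Counter = Counter()
    entries = catalog.get("vulnerabilities") or []
    for entry in entries:
        if entry is None:
            anomalies["nullEntry"] += 1
        elif not entry.get("cveID"):
            anomalies["missingCveId"] += 1
    declared = catalog.get("count")
    # Assumption: the declared count is compared against every array element.
    if declared is not None and declared != len(entries):
        anomalies["countMismatch"] += 1
    return anomalies


# Hand-made sample that triggers each reason once.
sample_catalog = {
    "catalogVersion": "example",
    "count": 4,
    "vulnerabilities": [
        {"cveID": "CVE-2024-0001"},
        None,
        {"vendorProject": "Example"},
    ],
}
anomalies = scan_catalog(sample_catalog)
```

Running it against a downloaded copy of the feed gives a quick view of which reason is driving an anomaly spike before escalating upstream.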

2.3 Smoke Test (staging)

  1. Deploy the configuration and restart Concelier.
  2. Trigger a pipeline run:
    • CLI: stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map
    • REST: POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }
  3. Verify the metrics exposed by meter StellaOps.Concelier.Connector.Kev:
    • kev.fetch.attempts, kev.fetch.success, kev.fetch.unchanged, kev.fetch.failures
    • kev.parse.entries (tag catalogVersion), kev.parse.failures, kev.parse.anomalies (tag reason)
    • kev.map.advisories (tag catalogVersion)
  4. Confirm concelier.source.http.requests_total{concelier_source="kev"} increments once per fetch and that the paired concelier.source.http.failures_total stays flat (zero increase).
  5. Inspect the info logs Fetched KEV catalog document … pendingDocuments=… and Parsed KEV catalog document … entries=…; each should appear exactly once per run, and the Mapped X/Y… skipped=0 line should match the kev.map.advisories delta.
  6. Confirm MongoDB documents exist for the catalog JSON (raw_documents & dtos) and that advisories with prefix kev/ are written.
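
Step 4 compares counter values before and after a run. If the counters are scraped in Prometheus text exposition format, the delta can be checked with a small script. This is a simplified sketch: it assumes underscore-style metric names as used in the monitoring queries, ignores optional timestamps, and uses made-up sample values. counter_value is a hypothetical helper.

```python
def counter_value(exposition: str, name: str, label: str) -> float:
    """Sum the samples of `name` whose label set contains `label`."""
    total = 0.0
    for line in exposition.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.startswith(name + "{") and label in series:
            total += float(value)
    return total


# Illustrative scrapes taken before and after one KEV fetch (values are invented).
before = (
    'concelier_source_http_requests_total{concelier_source="kev"} 41\n'
    'concelier_source_http_failures_total{concelier_source="kev"} 2\n'
)
after = (
    'concelier_source_http_requests_total{concelier_source="kev"} 42\n'
    'concelier_source_http_failures_total{concelier_source="kev"} 2\n'
)

kev = 'concelier_source="kev"'
requests_delta = (
    counter_value(after, "concelier_source_http_requests_total", kev)
    - counter_value(before, "concelier_source_http_requests_total", kev)
)
failures_delta = (
    counter_value(after, "concelier_source_http_failures_total", kev)
    - counter_value(before, "concelier_source_http_failures_total", kev)
)
```

A passing smoke test in this sketch means requests_delta is 1 and failures_delta is 0.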

2.4 Production Monitoring

  • Alert when rate(kev_fetch_success_total[8h]) == 0 during working hours (daily cadence breach) and when increase(kev_fetch_failures_total[1h]) > 0.
  • Page the on-call if increase(kev_parse_failures_total{reason="schema"}[6h]) > 0; this usually signals an upstream payload change. Treat repeated reason="download" spikes as networking issues to the mirror.
  • Track anomaly spikes through sum_over_time(kev_parse_anomalies_total{reason="missingCveId"}[24h]). Rising countMismatch trends point to catalog publishing bugs.
  • Surface the fetch/mapping info logs (Fetched KEV catalog document … and Mapped X/Y KEV advisories … skipped=S) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.
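
The alerts above rely on increase() over counters, which has to tolerate counter resets when Concelier restarts. The sketch below shows the basic idea on raw samples. It is a simplification for intuition only: Prometheus additionally extrapolates to the window boundaries, which this does not attempt. counter_increase is a hypothetical helper.

```python
def counter_increase(samples: list[float]) -> float:
    """Approximate a counter's increase across ordered samples, tolerating resets."""
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current < previous:
            # A drop means the counter restarted from zero after a process restart.
            increase += current
        else:
            increase += current - previous
    return increase


# Failure counter sampled over one hour; the drop from 7 to 1 models a restart.
failures_last_hour = counter_increase([5.0, 7.0, 1.0, 2.0])
should_alert = failures_last_hour > 0
```

A restart therefore does not hide failures that happened before it, which is why increase() is a reasonable basis for the failure alerts.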

2.5 Known good dashboard tiles

Add the following panels to the Concelier observability board:

| Metric | Recommended visualisation |
| --- | --- |
| rate(kev_fetch_success_total[30m]) | Single-stat (last 24h) with warning threshold >0 |
| rate(kev_parse_entries_total[1h]) by catalogVersion | Stacked area: highlights daily release size |
| sum_over_time(kev_parse_anomalies_total[1d]) by reason | Table: anomaly breakdown (matches dashboard panel) |
| rate(cve_map_success_total[15m]) vs rate(kev_map_advisories_total[24h]) | Comparative timeseries for advisories emitted |

3. Runbook updates

  • Record staging/production smoke test results (date, catalog version, advisory counts) in your team's change log.
  • Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
  • Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
  • Version-control dashboard tweaks alongside docs/ops/concelier-cve-kev-grafana-dashboard.json so operations can re-import the observability pack during restores.