Restructure solution layout by module
# Concelier CCCS Connector Operations

This runbook covers day‑to‑day operation of the Canadian Centre for Cyber Security (`source:cccs:*`) connector, including configuration, telemetry, and historical backfill guidance for English/French advisories.

## 1. Configuration Checklist

- Network egress (or mirrored cache) for `https://www.cyber.gc.ca/` and the JSON API endpoints under `/api/cccs/`.
- Set the Concelier options before restarting workers. Example `concelier.yaml` snippet:

```yaml
concelier:
  sources:
    cccs:
      feeds:
        - language: "en"
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=en&content_type=cccs_threat"
        - language: "fr"
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=fr&content_type=cccs_threat"
      maxEntriesPerFetch: 80        # increase temporarily for backfill runs
      maxKnownEntries: 512
      requestTimeout: "00:00:30"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:05:00"
```

> ℹ️  The `/api/cccs/threats/v1/get` endpoint returns thousands of records per language (≈5 100 rows each as of 2025‑10‑14). The connector honours `maxEntriesPerFetch`, so leave it low for steady‑state operation and raise it for planned backfills.

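For planning, the figures above give a quick estimate of how many fetch passes a full walk of one language feed takes. A minimal sketch (the row count and batch sizes are the ones quoted in this runbook, not live values):

```python
import math

def backfill_passes(total_rows: int, max_entries_per_fetch: int) -> int:
    """Fetch cycles needed to walk one language's feed end to end."""
    return math.ceil(total_rows / max_entries_per_fetch)

# ~5 100 rows per language (figure quoted above, as of 2025-10-14)
backfill_passes(5100, 80)   # steady-state default -> 64 passes
backfill_passes(5100, 500)  # temporarily raised for backfill -> 11 passes
```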
## 2. Telemetry & Logging

- **Metrics (Meter `StellaOps.Concelier.Connector.Cccs`):**
  - `cccs.fetch.attempts`, `cccs.fetch.success`, `cccs.fetch.failures`
  - `cccs.fetch.documents`, `cccs.fetch.unchanged`
  - `cccs.parse.success`, `cccs.parse.failures`, `cccs.parse.quarantine`
  - `cccs.map.success`, `cccs.map.failures`
- **Shared HTTP metrics** via `SourceDiagnostics`:
  - `concelier.source.http.requests{concelier.source="cccs"}`
  - `concelier.source.http.failures{concelier.source="cccs"}`
  - `concelier.source.http.duration{concelier.source="cccs"}`
- **Structured logs:**
  - `CCCS fetch completed feeds=… items=… newDocuments=… pendingDocuments=…`
  - `CCCS parse completed parsed=… failures=…`
  - `CCCS map completed mapped=… failures=…`
  - Warnings fire when GridFS payloads or DTOs go missing, or when parser sanitisation fails.

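When triaging from logs alone (e.g. without a metrics backend), the `key=value` tail of these structured messages can be extracted generically. A sketch, where the field names follow the log templates above but the parsing approach itself is an assumption, not connector code:

```python
import re

# Generic key=value extractor for structured progress log lines.
KV = re.compile(r"(\w+)=(\S+)")

def parse_counters(log_line: str) -> dict:
    """Pull key=value pairs out of a CCCS progress log message,
    converting purely numeric values to int."""
    return {k: int(v) if v.isdigit() else v for k, v in KV.findall(log_line)}

parse_counters("CCCS fetch completed feeds=2 items=160 newDocuments=12 pendingDocuments=3")
```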
Suggested Grafana alerts (metric names shown after the usual OTLP-to-Prometheus renaming):
- `increase(cccs_fetch_failures_total[15m]) > 0`
- `rate(cccs_map_success_total[1h]) == 0` while other connectors are active
- `histogram_quantile(0.95, rate(concelier_source_http_duration_bucket{concelier_source="cccs"}[1h])) > 5`

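The same expressions can be wired into a Prometheus rule group. A hedged sketch, in which the rule names, `for` durations, and severity labels are illustrative choices, not an existing rules file:

```yaml
groups:
  - name: concelier-cccs
    rules:
      - alert: CccsFetchFailures
        expr: increase(cccs_fetch_failures_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "CCCS connector fetch failures in the last 15 minutes"
      - alert: CccsMapStalled
        expr: rate(cccs_map_success_total[1h]) == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "CCCS connector mapped no advisories for an hour"
```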
## 3. Historical Backfill Plan

1. **Snapshot the source** – the API accepts `page=<n>` and `lang=<en|fr>` query parameters. `page=0` returns the full dataset (observed earliest `date_created`: 2018‑06‑08 for both EN and FR). Mirror those responses into Offline Kit storage when operating air‑gapped.
2. **Stage ingestion**:
   - Temporarily raise `maxEntriesPerFetch` (e.g. to 500) and restart the Concelier workers.
   - Run chained jobs until `pendingDocuments` drains:  
     `stella db jobs run source:cccs:fetch --and-then source:cccs:parse --and-then source:cccs:map`
   - Monitor `cccs.fetch.unchanged`; once it approaches the dataset size, the backfill is complete.
3. **Optional pagination sweep** – for incremental mirrors, iterate `page=<n>` (0…N) while `response.Count == 50`, persisting the JSON to disk. Store it alongside metadata (`language`, `page`, SHA256) so repeated runs detect drift.
4. **Language split** – keep EN/FR payloads separate to preserve canonical language fields. The connector emits `Language` directly from the feed entry, so mixed ingestion simply produces parallel advisories keyed by the same serial number.
5. **Throttle planning** – schedule backfills during maintenance windows; the API tolerates burst downloads, but keep (or raise) the 250 ms request delay whenever a mirrored copy is not available.

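The pagination sweep in step 3 can be sketched as follows. This is a minimal sketch: `fetch_page` is a hypothetical callable you supply (e.g. wrapping the `page=<n>` API call or reading from a mirror), and the page size of 50 is the observed value quoted above:

```python
import hashlib
import json
from pathlib import Path

PAGE_SIZE = 50  # observed page size; a short page marks the end of the sweep

def sweep_language(fetch_page, out_dir, language):
    """Iterate page=0..N for one language, persisting each page of JSON
    plus a manifest of (language, page, sha256) records so repeated runs
    can detect upstream drift."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = []
    page = 0
    while True:
        records = fetch_page(language, page)
        if not records:
            break
        payload = json.dumps(records, sort_keys=True).encode("utf-8")
        (out / f"cccs-{language}-page{page}.json").write_bytes(payload)
        manifest.append({
            "language": language,
            "page": page,
            "sha256": hashlib.sha256(payload).hexdigest(),
        })
        if len(records) < PAGE_SIZE:
            break  # short page: last page reached
        page += 1
    (out / f"cccs-{language}-manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Re-running the sweep and diffing the manifest (rather than the raw pages) is a cheap way to spot upstream edits to already-mirrored advisories.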
## 4. Selector & Sanitiser Notes

- `CccsHtmlParser` now parses the **unsanitised DOM** (via AngleSharp) and only sanitises when persisting `ContentHtml`.
- Product extraction walks headings (`Affected Products`, `Produits touchés`, `Mesures recommandées`) and consumes nested lists within `div`/`section`/`article` containers.
- `HtmlContentSanitizer` allows `<h1>`…`<h6>` and `<section>` so stored HTML keeps headings for UI rendering and downstream summarisation.

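The allow-list idea can be illustrated in Python. This is an illustrative sketch only: the real `HtmlContentSanitizer` is part of the .NET connector and permits a broader tag set, and the class and tag lists here are hypothetical:

```python
from html.parser import HTMLParser

# Hypothetical allow-list for illustration; the real sanitiser keeps more.
ALLOWED = {"h1", "h2", "h3", "h4", "h5", "h6", "section", "p", "ul", "ol", "li"}
DROP_WITH_CONTENT = {"script", "style"}  # drop these tags *and* their text

class AllowListSanitizer(HTMLParser):
    """Parse the full DOM first, then emit only allow-listed tags,
    mirroring the 'sanitise on persist' approach described above.
    Attributes are dropped entirely in this sketch."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip += 1
        elif tag in ALLOWED:
            self.out.append(f"<{tag}>")
    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")
    def handle_data(self, data):
        if not self._skip:
            self.out.append(data)

def sanitize(html: str) -> str:
    parser = AllowListSanitizer()
    parser.feed(html)
    return "".join(parser.out)
```

Note how disallowed containers (`div` above) are unwrapped rather than deleted, so nested product lists survive, while `script`/`style` content is removed outright.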
## 5. Fixture Maintenance

- Regression fixtures live in `src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/Fixtures`.
- Refresh them via `UPDATE_CCCS_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/StellaOps.Concelier.Connector.Cccs.Tests.csproj`.
- Fixtures capture both EN and FR advisories with nested lists to guard against sanitiser regressions; review diffs for heading/list changes before committing.