Align AOC tasks for Excititor and Concelier
This commit is contained in:
@@ -1,159 +1,159 @@
|
||||
# Concelier Authority Audit Runbook
|
||||
|
||||
_Last updated: 2025-10-22_
|
||||
|
||||
This runbook helps operators verify and monitor the StellaOps Concelier ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- Authority integration is enabled in `concelier.yaml` (or via `CONCELIER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
|
||||
- OTLP metrics/log exporters are configured (`concelier.telemetry.*`) or container stdout is shipped to your SIEM.
|
||||
- Operators have access to the Concelier job trigger endpoints via CLI or REST for smoke tests.
|
||||
- The rollout table in `docs/10_CONCELIER_CLI_QUICKSTART.md` has been reviewed so stakeholders align on the staged → enforced toggle timeline.
|
||||
|
||||
### Configuration snippet
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
authority:
|
||||
enabled: true
|
||||
allowAnonymousFallback: false # keep true only during initial rollout
|
||||
issuer: "https://authority.internal"
|
||||
audiences:
|
||||
- "api://concelier"
|
||||
requiredScopes:
|
||||
- "concelier.jobs.trigger"
|
||||
- "advisory:read"
|
||||
- "advisory:ingest"
|
||||
requiredTenants:
|
||||
- "tenant-default"
|
||||
bypassNetworks:
|
||||
- "127.0.0.1/32"
|
||||
- "::1/128"
|
||||
clientId: "concelier-jobs"
|
||||
clientSecretFile: "/run/secrets/concelier_authority_client"
|
||||
tokenClockSkewSeconds: 60
|
||||
resilience:
|
||||
enableRetries: true
|
||||
retryDelays:
|
||||
- "00:00:01"
|
||||
- "00:00:02"
|
||||
- "00:00:05"
|
||||
allowOfflineCacheFallback: true
|
||||
offlineCacheTolerance: "00:10:00"
|
||||
```
|
||||
|
||||
> Store secrets outside source control. Concelier reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.
|
||||
|
||||
### Resilience tuning
|
||||
|
||||
- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Concelier retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
|
||||
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Concelier will fail fast but keep deterministic logs.
|
||||
- Concelier resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `concelier.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.
|
||||
|
||||
## 2. Key Signals
|
||||
|
||||
### 2.1 Audit log channel
|
||||
|
||||
Concelier emits structured audit entries via the `Concelier.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.
|
||||
|
||||
```
|
||||
Concelier authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=concelier-cli scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7
|
||||
```
|
||||
|
||||
| Field | Sample value | Meaning |
|
||||
|--------------|-------------------------|------------------------------------------------------------------------------------------|
|
||||
| `route` | `/jobs/definitions` | Endpoint that processed the request. |
|
||||
| `status` | `200` / `401` / `409` | Final HTTP status code returned to the caller. |
|
||||
| `subject` | `ops@example.com` | User or service principal subject (falls back to `(anonymous)` when unauthenticated). |
|
||||
| `clientId` | `concelier-cli` | OAuth client ID provided by Authority (`(none)` if the token lacked the claim). |
|
||||
| `scopes` | `concelier.jobs.trigger advisory:ingest advisory:read` | Normalised scope list extracted from token claims; `(none)` if the token carried none. |
|
||||
| `tenant` | `tenant-default` | Tenant claim extracted from the Authority token (`(none)` when the token lacked it). |
|
||||
| `bypass` | `True` / `False` | Indicates whether the request succeeded because its source IP matched a bypass CIDR. |
|
||||
| `remote` | `10.1.4.7` | Remote IP recorded from the connection / forwarded header test hooks. |
|
||||
|
||||
Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations:
|
||||
|
||||
- `status=401 AND bypass=True` – bypass network accepted an unauthenticated call (should be temporary during rollout).
|
||||
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
|
||||
- `status=202 AND NOT contains(scopes,"advisory:ingest")` – ingestion attempted without the new AOC scopes; confirm the Authority client registration matches the sample above.
|
||||
- `tenant!=(tenant-default)` – indicates a cross-tenant token was accepted. Ensure Concelier `requiredTenants` is aligned with Authority client registration.
|
||||
- Spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.
|
||||
|
||||
### 2.2 Metrics
|
||||
|
||||
Concelier publishes counters under the OTEL meter `StellaOps.Concelier.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.
|
||||
|
||||
| Metric name | Description | PromQL example |
|
||||
|-------------------------------|----------------------------------------------------|----------------|
|
||||
| `web.jobs.triggered` | Accepted job trigger requests. | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
|
||||
| `web.jobs.trigger.conflict` | Rejected triggers (already running, disabled…). | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
|
||||
| `web.jobs.trigger.failed` | Server-side job failures. | `sum(rate(web_jobs_trigger_failed_total[5m]))` |
|
||||
|
||||
> Prometheus/OTEL collectors typically surface counters with `_total` suffix. Adjust queries to match your pipeline’s generated metric names.
|
||||
|
||||
Correlate audit logs with the following global meter exported via `Concelier.SourceDiagnostics`:
|
||||
|
||||
- `concelier.source.http.requests_total{concelier_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
|
||||
- If Grafana dashboards are deployed, extend the “Concelier Jobs” board with the above counters plus a table of recent audit log entries.
|
||||
|
||||
## 3. Alerting Guidance
|
||||
|
||||
1. **Unauthorized bypass attempt**
|
||||
- Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`
|
||||
- Action: verify `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.
|
||||
|
||||
2. **Missing scopes**
|
||||
- Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`
|
||||
- Action: audit Authority client registration; ensure `requiredScopes` includes `concelier.jobs.trigger`, `advisory:ingest`, and `advisory:read`.
|
||||
|
||||
3. **Trigger failure surge**
|
||||
- Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.
|
||||
- Action: inspect correlated audit entries and `Concelier.Telemetry` traces for job execution errors.
|
||||
|
||||
4. **Conflict spike**
|
||||
- Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune threshold).
|
||||
- Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.
|
||||
|
||||
5. **Authority offline**
|
||||
- Watch `Concelier.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.
|
||||
|
||||
## 4. Rollout & Verification Procedure
|
||||
|
||||
1. **Pre-checks**
|
||||
- Align with the rollout phases documented in `docs/10_CONCELIER_CLI_QUICKSTART.md` (validation → rehearsal → enforced) and record the target dates in your change request.
|
||||
- Confirm `allowAnonymousFallback` is `false` in production; keep `true` only during staged validation.
|
||||
- Validate Authority issuer metadata is reachable from Concelier (`curl https://authority.internal/.well-known/openid-configuration` from the host).
|
||||
|
||||
2. **Smoke test with valid token**
|
||||
- Obtain a token via CLI: `stella auth login --scope "concelier.jobs.trigger advisory:ingest" --scope advisory:read`.
|
||||
- Trigger a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://concelier.internal/jobs/definitions`.
|
||||
- Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=concelier.jobs.trigger advisory:ingest advisory:read`, and `tenant=tenant-default`.
|
||||
|
||||
3. **Negative test without token**
|
||||
- Call the same endpoint without a token. Expect HTTP 401, `bypass=False`.
|
||||
- If the request succeeds, double-check `bypassNetworks` and ensure fallback is disabled.
|
||||
|
||||
4. **Bypass check (if applicable)**
|
||||
- From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review business justification and expiry date for such entries.
|
||||
|
||||
5. **Metrics validation**
|
||||
- Ensure `web.jobs.triggered` counter increments during accepted runs.
|
||||
- Exporters should show corresponding spans (`concelier.job.trigger`) if tracing is enabled.
|
||||
|
||||
## 5. Troubleshooting
|
||||
|
||||
| Symptom | Probable cause | Remediation |
|
||||
|---------|----------------|-------------|
|
||||
| Audit log shows `clientId=(none)` for all requests | Authority not issuing `client_id` claim or CLI outdated | Update StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`), or upgrade the CLI token acquisition flow. |
|
||||
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks` or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Concelier. |
|
||||
| HTTP 401 with valid token | `requiredScopes` missing from client registration or token audience mismatch | Verify Authority client scopes (`concelier.jobs.trigger`) and ensure the token audience matches `audiences` config. |
|
||||
| Metrics missing from Prometheus | Telemetry exporters disabled or filter missing OTEL meter | Set `concelier.telemetry.enableMetrics=true`, ensure collector includes `StellaOps.Concelier.WebService.Jobs` meter. |
|
||||
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or Authority timeout mid-request | Inspect Concelier job logs, re-run with tracing enabled, validate Authority latency. |
|
||||
|
||||
## 6. References
|
||||
|
||||
- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
|
||||
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
|
||||
- `docs/modules/authority/operations/monitoring.md` – Authority-side monitoring and alerting playbook.
|
||||
- `StellaOps.Concelier.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of audit log fields.
|
||||
# Concelier Authority Audit Runbook
|
||||
|
||||
_Last updated: 2025-10-22_
|
||||
|
||||
This runbook helps operators verify and monitor the StellaOps Concelier ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- Authority integration is enabled in `concelier.yaml` (or via `CONCELIER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
|
||||
- OTLP metrics/log exporters are configured (`concelier.telemetry.*`) or container stdout is shipped to your SIEM.
|
||||
- Operators have access to the Concelier job trigger endpoints via CLI or REST for smoke tests.
|
||||
- The rollout table in `docs/10_CONCELIER_CLI_QUICKSTART.md` has been reviewed so stakeholders align on the staged → enforced toggle timeline.
|
||||
|
||||
### Configuration snippet
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
authority:
|
||||
enabled: true
|
||||
allowAnonymousFallback: false # keep true only during initial rollout
|
||||
issuer: "https://authority.internal"
|
||||
audiences:
|
||||
- "api://concelier"
|
||||
requiredScopes:
|
||||
- "concelier.jobs.trigger"
|
||||
- "advisory:read"
|
||||
- "advisory:ingest"
|
||||
requiredTenants:
|
||||
- "tenant-default"
|
||||
bypassNetworks:
|
||||
- "127.0.0.1/32"
|
||||
- "::1/128"
|
||||
clientId: "concelier-jobs"
|
||||
clientSecretFile: "/run/secrets/concelier_authority_client"
|
||||
tokenClockSkewSeconds: 60
|
||||
resilience:
|
||||
enableRetries: true
|
||||
retryDelays:
|
||||
- "00:00:01"
|
||||
- "00:00:02"
|
||||
- "00:00:05"
|
||||
allowOfflineCacheFallback: true
|
||||
offlineCacheTolerance: "00:10:00"
|
||||
```
|
||||
|
||||
> Store secrets outside source control. Concelier reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.
|
||||
|
||||
### Resilience tuning
|
||||
|
||||
- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Concelier retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
|
||||
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Concelier will fail fast but keep deterministic logs.
|
||||
- Concelier resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `concelier.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.
|
||||
|
||||
## 2. Key Signals
|
||||
|
||||
### 2.1 Audit log channel
|
||||
|
||||
Concelier emits structured audit entries via the `Concelier.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.
|
||||
|
||||
```
|
||||
Concelier authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=concelier-cli scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7
|
||||
```
|
||||
|
||||
| Field | Sample value | Meaning |
|
||||
|--------------|-------------------------|------------------------------------------------------------------------------------------|
|
||||
| `route` | `/jobs/definitions` | Endpoint that processed the request. |
|
||||
| `status` | `200` / `401` / `409` | Final HTTP status code returned to the caller. |
|
||||
| `subject` | `ops@example.com` | User or service principal subject (falls back to `(anonymous)` when unauthenticated). |
|
||||
| `clientId` | `concelier-cli` | OAuth client ID provided by Authority (`(none)` if the token lacked the claim). |
|
||||
| `scopes` | `concelier.jobs.trigger advisory:ingest advisory:read` | Normalised scope list extracted from token claims; `(none)` if the token carried none. |
|
||||
| `tenant` | `tenant-default` | Tenant claim extracted from the Authority token (`(none)` when the token lacked it). |
|
||||
| `bypass` | `True` / `False` | Indicates whether the request succeeded because its source IP matched a bypass CIDR. |
|
||||
| `remote` | `10.1.4.7` | Remote IP recorded from the connection / forwarded header test hooks. |
|
||||
|
||||
Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations:
|
||||
|
||||
- `status=401 AND bypass=True` – bypass network accepted an unauthenticated call (should be temporary during rollout).
|
||||
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
|
||||
- `status=202 AND NOT contains(scopes,"advisory:ingest")` – ingestion attempted without the new AOC scopes; confirm the Authority client registration matches the sample above.
|
||||
- `tenant!=(tenant-default)` – indicates a cross-tenant token was accepted. Ensure Concelier `requiredTenants` is aligned with Authority client registration.
|
||||
- Spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.
|
||||
|
||||
### 2.2 Metrics
|
||||
|
||||
Concelier publishes counters under the OTEL meter `StellaOps.Concelier.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.
|
||||
|
||||
| Metric name | Description | PromQL example |
|
||||
|-------------------------------|----------------------------------------------------|----------------|
|
||||
| `web.jobs.triggered` | Accepted job trigger requests. | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
|
||||
| `web.jobs.trigger.conflict` | Rejected triggers (already running, disabled…). | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
|
||||
| `web.jobs.trigger.failed` | Server-side job failures. | `sum(rate(web_jobs_trigger_failed_total[5m]))` |
|
||||
|
||||
> Prometheus/OTEL collectors typically surface counters with `_total` suffix. Adjust queries to match your pipeline’s generated metric names.
|
||||
|
||||
Correlate audit logs with the following global meter exported via `Concelier.SourceDiagnostics`:
|
||||
|
||||
- `concelier.source.http.requests_total{concelier_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
|
||||
- If Grafana dashboards are deployed, extend the “Concelier Jobs” board with the above counters plus a table of recent audit log entries.
|
||||
|
||||
## 3. Alerting Guidance
|
||||
|
||||
1. **Unauthorized bypass attempt**
|
||||
- Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`
|
||||
- Action: verify `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.
|
||||
|
||||
2. **Missing scopes**
|
||||
- Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`
|
||||
- Action: audit Authority client registration; ensure `requiredScopes` includes `concelier.jobs.trigger`, `advisory:ingest`, and `advisory:read`.
|
||||
|
||||
3. **Trigger failure surge**
|
||||
- Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.
|
||||
- Action: inspect correlated audit entries and `Concelier.Telemetry` traces for job execution errors.
|
||||
|
||||
4. **Conflict spike**
|
||||
- Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune threshold).
|
||||
- Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.
|
||||
|
||||
5. **Authority offline**
|
||||
- Watch `Concelier.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.
|
||||
|
||||
## 4. Rollout & Verification Procedure
|
||||
|
||||
1. **Pre-checks**
|
||||
- Align with the rollout phases documented in `docs/10_CONCELIER_CLI_QUICKSTART.md` (validation → rehearsal → enforced) and record the target dates in your change request.
|
||||
- Confirm `allowAnonymousFallback` is `false` in production; keep `true` only during staged validation.
|
||||
- Validate Authority issuer metadata is reachable from Concelier (`curl https://authority.internal/.well-known/openid-configuration` from the host).
|
||||
|
||||
2. **Smoke test with valid token**
|
||||
- Obtain a token via CLI: `stella auth login --scope "concelier.jobs.trigger advisory:ingest" --scope advisory:read`.
|
||||
- Trigger a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://concelier.internal/jobs/definitions`.
|
||||
- Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=concelier.jobs.trigger advisory:ingest advisory:read`, and `tenant=tenant-default`.
|
||||
|
||||
3. **Negative test without token**
|
||||
- Call the same endpoint without a token. Expect HTTP 401, `bypass=False`.
|
||||
- If the request succeeds, double-check `bypassNetworks` and ensure fallback is disabled.
|
||||
|
||||
4. **Bypass check (if applicable)**
|
||||
- From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review business justification and expiry date for such entries.
|
||||
|
||||
5. **Metrics validation**
|
||||
- Ensure `web.jobs.triggered` counter increments during accepted runs.
|
||||
- Exporters should show corresponding spans (`concelier.job.trigger`) if tracing is enabled.
|
||||
|
||||
## 5. Troubleshooting
|
||||
|
||||
| Symptom | Probable cause | Remediation |
|
||||
|---------|----------------|-------------|
|
||||
| Audit log shows `clientId=(none)` for all requests | Authority not issuing `client_id` claim or CLI outdated | Update StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`), or upgrade the CLI token acquisition flow. |
|
||||
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks` or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Concelier. |
|
||||
| HTTP 401 with valid token | `requiredScopes` missing from client registration or token audience mismatch | Verify Authority client scopes (`concelier.jobs.trigger`) and ensure the token audience matches `audiences` config. |
|
||||
| Metrics missing from Prometheus | Telemetry exporters disabled or filter missing OTEL meter | Set `concelier.telemetry.enableMetrics=true`, ensure collector includes `StellaOps.Concelier.WebService.Jobs` meter. |
|
||||
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or Authority timeout mid-request | Inspect Concelier job logs, re-run with tracing enabled, validate Authority latency. |
|
||||
|
||||
## 6. References
|
||||
|
||||
- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
|
||||
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
|
||||
- `docs/modules/authority/operations/monitoring.md` – Authority-side monitoring and alerting playbook.
|
||||
- `StellaOps.Concelier.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of audit log fields.
|
||||
|
||||
@@ -1,160 +1,160 @@
|
||||
# Concelier Conflict Resolution Runbook (Sprint 3)
|
||||
|
||||
This runbook equips Concelier operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.
|
||||
|
||||
---
|
||||
|
||||
## 1. Precedence Model (recap)
|
||||
|
||||
- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `concelier:merge:precedence:ranks` to override per source when incident response requires it.
|
||||
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
|
||||
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
|
||||
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
|
||||
|
||||
---
|
||||
|
||||
## 2. Telemetry Shipped This Sprint
|
||||
|
||||
| Instrument | Type | Key Tags | Purpose |
|
||||
|------------|------|----------|---------|
|
||||
| `concelier.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
|
||||
| `concelier.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
|
||||
| `concelier.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
|
||||
| `concelier.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
|
||||
| `concelier.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |
|
||||
|
||||
### Structured logs
|
||||
|
||||
- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
|
||||
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
|
||||
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
|
||||
- `Alias collision ...` (no EventId) - emitted when `concelier.merge.identity_conflicts` increments.
|
||||
|
||||
Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Concelier.Merge`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Detection & Alerting
|
||||
|
||||
1. **Dashboard panels**
|
||||
- `concelier.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15 minute window.
|
||||
- `concelier.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
|
||||
- `concelier.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
|
||||
- `concelier.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
|
||||
2. **Log based alerts**
|
||||
- `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
|
||||
- `eventId=1002` with `reason="mismatch"` - severity disagreement; open connector bug if sustained.
|
||||
3. **Job health**
|
||||
- `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #concelier-ops.
|
||||
|
||||
### Threshold updates (2025-10-12)
|
||||
|
||||
- `concelier.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
|
||||
- `concelier.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
|
||||
- `concelier.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3 but annotate dashboards that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
|
||||
|
||||
---
|
||||
|
||||
## 4. Triage Workflow
|
||||
|
||||
1. **Confirm job context**
|
||||
- `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
|
||||
2. **Inspect metrics**
|
||||
- Correlate spikes in `concelier.merge.conflicts` with `primary_source`/`suppressed_source` tags from `concelier.merge.overrides`.
|
||||
3. **Pull structured logs**
|
||||
- Example (vector output):
|
||||
```
|
||||
jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-concelier.log
|
||||
```
|
||||
4. **Review merge events**
|
||||
- `mongosh`:
|
||||
```javascript
|
||||
use concelier;
|
||||
db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
|
||||
```
|
||||
- Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
|
||||
5. **Interrogate provenance**
|
||||
- `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
|
||||
- Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
|
||||
|
||||
---
|
||||
|
||||
## 5. Conflict Classification Matrix
|
||||
|
||||
| Signal | Likely Cause | Immediate Action |
|
||||
|--------|--------------|------------------|
|
||||
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
|
||||
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
|
||||
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `concelier:merge:precedence:ranks` to break the tie; restart merge job. |
|
||||
| Rising `concelier.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
|
||||
| `concelier.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
|
||||
|
||||
---
|
||||
|
||||
## 6. Resolution Playbook
|
||||
|
||||
1. **Connector data fix**
|
||||
- Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map` etc.).
|
||||
- Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
|
||||
2. **Temporary precedence override**
|
||||
- Edit `etc/concelier.yaml`:
|
||||
```yaml
|
||||
concelier:
|
||||
merge:
|
||||
precedence:
|
||||
ranks:
|
||||
osv: 1
|
||||
ghsa: 0
|
||||
```
|
||||
- Restart Concelier workers; confirm tags in `concelier.merge.overrides` show the new ranks.
|
||||
- Document the override with expiry in the change log.
|
||||
3. **Alias remediation**
|
||||
- Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
|
||||
- Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive-coordinate with Storage before issuing).
|
||||
4. **Escalation**
|
||||
- If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
|
||||
|
||||
---
|
||||
|
||||
## 7. Validation Checklist
|
||||
|
||||
- [ ] Merge job rerun returns exit code `0`.
|
||||
- [ ] `concelier.merge.conflicts` baseline returns to zero after corrective action.
|
||||
- [ ] Latest `merge_event` entry shows expected hash delta.
|
||||
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
|
||||
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
|
||||
|
||||
---
|
||||
|
||||
## 8. Reference Material
|
||||
|
||||
- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
|
||||
- Merge engine internals: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryPrecedenceMerger.cs`.
|
||||
- Metrics definitions: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
|
||||
- Storage audit trail: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/MergeEventWriter.cs`, `src/Concelier/__Libraries/StellaOps.Concelier.Storage.Mongo/MergeEvents`.
|
||||
|
||||
Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
|
||||
|
||||
---
|
||||
|
||||
## 9. Synthetic Regression Fixtures
|
||||
|
||||
- **Locations** – Canonical conflict snapshots now live at `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
|
||||
- **Validation commands** – To regenerate and verify the fixtures offline, run:
|
||||
|
||||
```bash
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/StellaOps.Concelier.Connector.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/StellaOps.Concelier.Connector.Nvd.Tests.csproj --filter NvdConflictFixtureTests
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj --filter OsvConflictFixtureTests
|
||||
dotnet test src/Concelier/__Tests/StellaOps.Concelier.Merge.Tests/StellaOps.Concelier.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
|
||||
```
|
||||
|
||||
- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `concelier.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
|
||||
|
||||
---
|
||||
|
||||
## 10. Change Log
|
||||
|
||||
| Date (UTC) | Change | Notes |
|
||||
|------------|--------|-------|
|
||||
| 2025-10-16 | Ops review signed off after connector expansion (CCCS, CERT-Bund, KISA, ICS CISA, MSRC) landed. Alert thresholds from §3 reaffirmed; dashboards updated to watch attachment signals emitted by ICS CISA connector. | Ops sign-off recorded by Concelier Ops Guild; no additional overrides required. |
|
||||
# Concelier Conflict Resolution Runbook (Sprint 3)
|
||||
|
||||
This runbook equips Concelier operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.
|
||||
|
||||
---
|
||||
|
||||
## 1. Precedence Model (recap)
|
||||
|
||||
- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `concelier:merge:precedence:ranks` to override per source when incident response requires it.
|
||||
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
|
||||
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
|
||||
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
|
||||
|
||||
---
|
||||
|
||||
## 2. Telemetry Shipped This Sprint
|
||||
|
||||
| Instrument | Type | Key Tags | Purpose |
|
||||
|------------|------|----------|---------|
|
||||
| `concelier.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
|
||||
| `concelier.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
|
||||
| `concelier.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
|
||||
| `concelier.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
|
||||
| `concelier.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |
|
||||
|
||||
### Structured logs
|
||||
|
||||
- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
|
||||
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
|
||||
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
|
||||
- `Alias collision ...` (no EventId) - emitted when `concelier.merge.identity_conflicts` increments.
|
||||
|
||||
Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Concelier.Merge`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Detection & Alerting
|
||||
|
||||
1. **Dashboard panels**
|
||||
- `concelier.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15 minute window.
|
||||
- `concelier.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
|
||||
- `concelier.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
|
||||
- `concelier.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
|
||||
2. **Log based alerts**
|
||||
- `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
|
||||
- `eventId=1002` with `reason="mismatch"` - severity disagreement; open connector bug if sustained.
|
||||
3. **Job health**
|
||||
- `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #concelier-ops.
|
||||
|
||||
### Threshold updates (2025-10-12)
|
||||
|
||||
- `concelier.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
|
||||
- `concelier.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
|
||||
- `concelier.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3 but annotate dashboards that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
|
||||
|
||||
---
|
||||
|
||||
## 4. Triage Workflow
|
||||
|
||||
1. **Confirm job context**
|
||||
- `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
|
||||
2. **Inspect metrics**
|
||||
- Correlate spikes in `concelier.merge.conflicts` with `primary_source`/`suppressed_source` tags from `concelier.merge.overrides`.
|
||||
3. **Pull structured logs**
|
||||
- Example (vector output):
|
||||
```
|
||||
jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-concelier.log
|
||||
```
|
||||
4. **Review merge events**
|
||||
- `mongosh`:
|
||||
```javascript
|
||||
use concelier;
|
||||
db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
|
||||
```
|
||||
- Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
|
||||
5. **Interrogate provenance**
|
||||
- `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
|
||||
- Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
|
||||
|
||||
---
|
||||
|
||||
## 5. Conflict Classification Matrix
|
||||
|
||||
| Signal | Likely Cause | Immediate Action |
|
||||
|--------|--------------|------------------|
|
||||
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
|
||||
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
|
||||
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `concelier:merge:precedence:ranks` to break the tie; restart merge job. |
|
||||
| Rising `concelier.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
|
||||
| `concelier.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
|
||||
|
||||
---
|
||||
|
||||
## 6. Resolution Playbook
|
||||
|
||||
1. **Connector data fix**
|
||||
- Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map` etc.).
|
||||
- Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
|
||||
2. **Temporary precedence override**
|
||||
- Edit `etc/concelier.yaml`:
|
||||
```yaml
|
||||
concelier:
|
||||
merge:
|
||||
precedence:
|
||||
ranks:
|
||||
osv: 1
|
||||
ghsa: 0
|
||||
```
|
||||
- Restart Concelier workers; confirm tags in `concelier.merge.overrides` show the new ranks.
|
||||
- Document the override with expiry in the change log.
|
||||
3. **Alias remediation**
|
||||
- Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
|
||||
- Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive-coordinate with Storage before issuing).
|
||||
4. **Escalation**
|
||||
- If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
|
||||
|
||||
---
|
||||
|
||||
## 7. Validation Checklist
|
||||
|
||||
- [ ] Merge job rerun returns exit code `0`.
|
||||
- [ ] `concelier.merge.conflicts` baseline returns to zero after corrective action.
|
||||
- [ ] Latest `merge_event` entry shows expected hash delta.
|
||||
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
|
||||
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
|
||||
|
||||
---
|
||||
|
||||
## 8. Reference Material
|
||||
|
||||
- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
|
||||
- Merge engine internals: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryPrecedenceMerger.cs`.
|
||||
- Metrics definitions: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
|
||||
- Storage audit trail: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/MergeEventWriter.cs`, `src/Concelier/__Libraries/StellaOps.Concelier.Storage.Mongo/MergeEvents`.
|
||||
|
||||
Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
|
||||
|
||||
---
|
||||
|
||||
## 9. Synthetic Regression Fixtures
|
||||
|
||||
- **Locations** – Canonical conflict snapshots now live at `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
|
||||
- **Validation commands** – To regenerate and verify the fixtures offline, run:
|
||||
|
||||
```bash
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/StellaOps.Concelier.Connector.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/StellaOps.Concelier.Connector.Nvd.Tests.csproj --filter NvdConflictFixtureTests
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj --filter OsvConflictFixtureTests
|
||||
dotnet test src/Concelier/__Tests/StellaOps.Concelier.Merge.Tests/StellaOps.Concelier.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
|
||||
```
|
||||
|
||||
- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `concelier.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
|
||||
|
||||
---
|
||||
|
||||
## 10. Change Log
|
||||
|
||||
| Date (UTC) | Change | Notes |
|
||||
|------------|--------|-------|
|
||||
| 2025-10-16 | Ops review signed off after connector expansion (CCCS, CERT-Bund, KISA, ICS CISA, MSRC) landed. Alert thresholds from §3 reaffirmed; dashboards updated to watch attachment signals emitted by ICS CISA connector. | Ops sign-off recorded by Concelier Ops Guild; no additional overrides required. |
|
||||
|
||||
@@ -1,77 +1,77 @@
|
||||
# Concelier Apple Security Update Connector Operations
|
||||
|
||||
This runbook covers staging and production rollout for the Apple security updates connector (`source:vndr-apple:*`), including observability checks and fixture maintenance.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- Network egress (or mirrored cache) for `https://gdmf.apple.com/v2/pmv` and the Apple Support domain (`https://support.apple.com/`).
|
||||
- Optional: corporate proxy exclusions for the Apple hosts if outbound traffic is normally filtered.
|
||||
- Updated configuration (environment variables or `concelier.yaml`) with an `apple` section. Example baseline:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
apple:
|
||||
softwareLookupUri: "https://gdmf.apple.com/v2/pmv"
|
||||
advisoryBaseUri: "https://support.apple.com/"
|
||||
localeSegment: "en-us"
|
||||
maxAdvisoriesPerFetch: 25
|
||||
initialBackfill: "120.00:00:00"
|
||||
modifiedTolerance: "02:00:00"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
> ℹ️ `softwareLookupUri` and `advisoryBaseUri` must stay absolute and aligned with the HTTP allow-list; Concelier automatically adds both hosts to the connector HttpClient.
|
||||
|
||||
## 2. Staging Smoke Test
|
||||
|
||||
1. Deploy the configuration and restart the Concelier workers to ensure the Apple connector options are bound.
|
||||
2. Trigger a full connector cycle:
|
||||
- CLI: `stella db jobs run source:vndr-apple:fetch --and-then source:vndr-apple:parse --and-then source:vndr-apple:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:vndr-apple:fetch", "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"] }`
|
||||
3. Validate metrics exported under meter `StellaOps.Concelier.Connector.Vndr.Apple`:
|
||||
- `apple.fetch.items` (documents fetched)
|
||||
- `apple.fetch.failures`
|
||||
- `apple.fetch.unchanged`
|
||||
- `apple.parse.failures`
|
||||
- `apple.map.affected.count` (histogram of affected package counts)
|
||||
4. Cross-check the shared HTTP counters:
|
||||
- `concelier.source.http.requests_total{concelier_source="vndr-apple"}` should increase for both index and detail phases.
|
||||
- `concelier.source.http.failures_total{concelier_source="vndr-apple"}` should remain flat (0) during a healthy run.
|
||||
5. Inspect the info logs:
|
||||
- `Apple software index fetch … processed=X newDocuments=Y`
|
||||
- `Apple advisory parse complete … aliases=… affected=…`
|
||||
- `Mapped Apple advisory … pendingMappings=0`
|
||||
6. Confirm MongoDB state:
|
||||
- `raw_documents` store contains the HT article HTML with metadata (`apple.articleId`, `apple.postingDate`).
|
||||
- `dtos` store has `schemaVersion="apple.security.update.v1"`.
|
||||
- `advisories` collection includes keys `HTxxxxxx` with normalized SemVer rules.
|
||||
- `source_states` entry for `apple` shows a recent `cursor.lastPosted`.
|
||||
|
||||
## 3. Production Monitoring
|
||||
|
||||
- **Dashboards** – Add the following expressions to your Concelier Grafana board (OTLP/Prometheus naming assumed):
|
||||
- `rate(apple_fetch_items_total[15m])` vs `rate(concelier_source_http_requests_total{concelier_source="vndr-apple"}[15m])`
|
||||
- `rate(apple_fetch_failures_total[5m])` for error spikes (`severity=warning` at `>0`)
|
||||
- `histogram_quantile(0.95, rate(apple_map_affected_count_bucket[1h]))` to watch affected-package fan-out
|
||||
- `increase(apple_parse_failures_total[6h])` to catch parser drift (alerts at `>0`)
|
||||
- **Alerts** – Page if `rate(apple_fetch_items_total[2h]) == 0` during business hours while other connectors are active. This often indicates lookup feed failures or misconfigured allow-lists.
|
||||
- **Logs** – Surface warnings `Apple document {DocumentId} missing GridFS payload` or `Apple parse failed`—repeated hits imply storage issues or HTML regressions.
|
||||
- **Telemetry pipeline** – `StellaOps.Concelier.WebService` now exports `StellaOps.Concelier.Connector.Vndr.Apple` alongside existing Concelier meters; ensure your OTEL collector or Prometheus scraper includes it.
|
||||
|
||||
## 4. Fixture Maintenance
|
||||
|
||||
Regression fixtures live under `src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/Apple/Fixtures`. Refresh them whenever Apple reshapes the HT layout or when new platforms appear.
|
||||
|
||||
1. Run the helper script matching your platform:
|
||||
- Bash: `./scripts/update-apple-fixtures.sh`
|
||||
- PowerShell: `./scripts/update-apple-fixtures.ps1`
|
||||
2. Each script exports `UPDATE_APPLE_FIXTURES=1`, updates the `WSLENV` passthrough, and touches `.update-apple-fixtures` so WSL+VS Code test runs observe the flag. The subsequent test execution fetches the live HT articles listed in `AppleFixtureManager`, sanitises the HTML, and rewrites the `.expected.json` DTO snapshots.
|
||||
3. Review the diff for localisation or nav noise. Once satisfied, re-run the tests without the env var (`dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests.csproj`) to verify determinism.
|
||||
4. Commit fixture updates together with any parser/mapping changes that motivated them.
|
||||
|
||||
## 5. Known Issues & Follow-up Tasks
|
||||
|
||||
- Apple occasionally throttles anonymous requests after bursts. The connector backs off automatically, but persistent `apple.fetch.failures` spikes might require mirroring the HT content or scheduling wider fetch windows.
|
||||
- Rapid Security Responses may appear before the general patch notes surface in the lookup JSON. When that happens, the fetch run will log `detailFailures>0`. Collect sample HTML and refresh fixtures to confirm parser coverage.
|
||||
- Multi-locale content is still under regression sweep (`src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Vndr.Apple/TASKS.md`). Capture non-`en-us` snapshots once the fixture tooling stabilises.
|
||||
# Concelier Apple Security Update Connector Operations
|
||||
|
||||
This runbook covers staging and production rollout for the Apple security updates connector (`source:vndr-apple:*`), including observability checks and fixture maintenance.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- Network egress (or mirrored cache) for `https://gdmf.apple.com/v2/pmv` and the Apple Support domain (`https://support.apple.com/`).
|
||||
- Optional: corporate proxy exclusions for the Apple hosts if outbound traffic is normally filtered.
|
||||
- Updated configuration (environment variables or `concelier.yaml`) with an `apple` section. Example baseline:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
apple:
|
||||
softwareLookupUri: "https://gdmf.apple.com/v2/pmv"
|
||||
advisoryBaseUri: "https://support.apple.com/"
|
||||
localeSegment: "en-us"
|
||||
maxAdvisoriesPerFetch: 25
|
||||
initialBackfill: "120.00:00:00"
|
||||
modifiedTolerance: "02:00:00"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
> ℹ️ `softwareLookupUri` and `advisoryBaseUri` must stay absolute and aligned with the HTTP allow-list; Concelier automatically adds both hosts to the connector HttpClient.
|
||||
|
||||
## 2. Staging Smoke Test
|
||||
|
||||
1. Deploy the configuration and restart the Concelier workers to ensure the Apple connector options are bound.
|
||||
2. Trigger a full connector cycle:
|
||||
- CLI: `stella db jobs run source:vndr-apple:fetch --and-then source:vndr-apple:parse --and-then source:vndr-apple:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:vndr-apple:fetch", "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"] }`
|
||||
3. Validate metrics exported under meter `StellaOps.Concelier.Connector.Vndr.Apple`:
|
||||
- `apple.fetch.items` (documents fetched)
|
||||
- `apple.fetch.failures`
|
||||
- `apple.fetch.unchanged`
|
||||
- `apple.parse.failures`
|
||||
- `apple.map.affected.count` (histogram of affected package counts)
|
||||
4. Cross-check the shared HTTP counters:
|
||||
- `concelier.source.http.requests_total{concelier_source="vndr-apple"}` should increase for both index and detail phases.
|
||||
- `concelier.source.http.failures_total{concelier_source="vndr-apple"}` should remain flat (0) during a healthy run.
|
||||
5. Inspect the info logs:
|
||||
- `Apple software index fetch … processed=X newDocuments=Y`
|
||||
- `Apple advisory parse complete … aliases=… affected=…`
|
||||
- `Mapped Apple advisory … pendingMappings=0`
|
||||
6. Confirm MongoDB state:
|
||||
- `raw_documents` store contains the HT article HTML with metadata (`apple.articleId`, `apple.postingDate`).
|
||||
- `dtos` store has `schemaVersion="apple.security.update.v1"`.
|
||||
- `advisories` collection includes keys `HTxxxxxx` with normalized SemVer rules.
|
||||
- `source_states` entry for `apple` shows a recent `cursor.lastPosted`.
|
||||
|
||||
## 3. Production Monitoring
|
||||
|
||||
- **Dashboards** – Add the following expressions to your Concelier Grafana board (OTLP/Prometheus naming assumed):
|
||||
- `rate(apple_fetch_items_total[15m])` vs `rate(concelier_source_http_requests_total{concelier_source="vndr-apple"}[15m])`
|
||||
- `rate(apple_fetch_failures_total[5m])` for error spikes (`severity=warning` at `>0`)
|
||||
- `histogram_quantile(0.95, rate(apple_map_affected_count_bucket[1h]))` to watch affected-package fan-out
|
||||
- `increase(apple_parse_failures_total[6h])` to catch parser drift (alerts at `>0`)
|
||||
- **Alerts** – Page if `rate(apple_fetch_items_total[2h]) == 0` during business hours while other connectors are active. This often indicates lookup feed failures or misconfigured allow-lists.
|
||||
- **Logs** – Surface warnings `Apple document {DocumentId} missing GridFS payload` or `Apple parse failed`—repeated hits imply storage issues or HTML regressions.
|
||||
- **Telemetry pipeline** – `StellaOps.Concelier.WebService` now exports `StellaOps.Concelier.Connector.Vndr.Apple` alongside existing Concelier meters; ensure your OTEL collector or Prometheus scraper includes it.
|
||||
|
||||
## 4. Fixture Maintenance
|
||||
|
||||
Regression fixtures live under `src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/Apple/Fixtures`. Refresh them whenever Apple reshapes the HT layout or when new platforms appear.
|
||||
|
||||
1. Run the helper script matching your platform:
|
||||
- Bash: `./scripts/update-apple-fixtures.sh`
|
||||
- PowerShell: `./scripts/update-apple-fixtures.ps1`
|
||||
2. Each script exports `UPDATE_APPLE_FIXTURES=1`, updates the `WSLENV` passthrough, and touches `.update-apple-fixtures` so WSL+VS Code test runs observe the flag. The subsequent test execution fetches the live HT articles listed in `AppleFixtureManager`, sanitises the HTML, and rewrites the `.expected.json` DTO snapshots.
|
||||
3. Review the diff for localisation or nav noise. Once satisfied, re-run the tests without the env var (`dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests.csproj`) to verify determinism.
|
||||
4. Commit fixture updates together with any parser/mapping changes that motivated them.
|
||||
|
||||
## 5. Known Issues & Follow-up Tasks
|
||||
|
||||
- Apple occasionally throttles anonymous requests after bursts. The connector backs off automatically, but persistent `apple.fetch.failures` spikes might require mirroring the HT content or scheduling wider fetch windows.
|
||||
- Rapid Security Responses may appear before the general patch notes surface in the lookup JSON. When that happens, the fetch run will log `detailFailures>0`. Collect sample HTML and refresh fixtures to confirm parser coverage.
|
||||
- Multi-locale content is still under regression sweep (`src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Vndr.Apple/TASKS.md`). Capture non-`en-us` snapshots once the fixture tooling stabilises.
|
||||
|
||||
@@ -1,72 +1,72 @@
|
||||
# Concelier CCCS Connector Operations
|
||||
|
||||
This runbook covers day‑to‑day operation of the Canadian Centre for Cyber Security (`source:cccs:*`) connector, including configuration, telemetry, and historical backfill guidance for English/French advisories.
|
||||
|
||||
## 1. Configuration Checklist
|
||||
|
||||
- Network egress (or mirrored cache) for `https://www.cyber.gc.ca/` and the JSON API endpoints under `/api/cccs/`.
|
||||
- Set the Concelier options before restarting workers. Example `concelier.yaml` snippet:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
cccs:
|
||||
feeds:
|
||||
- language: "en"
|
||||
uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=en&content_type=cccs_threat"
|
||||
- language: "fr"
|
||||
uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=fr&content_type=cccs_threat"
|
||||
maxEntriesPerFetch: 80 # increase temporarily for backfill runs
|
||||
maxKnownEntries: 512
|
||||
requestTimeout: "00:00:30"
|
||||
requestDelay: "00:00:00.250"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
> ℹ️ The `/api/cccs/threats/v1/get` endpoint returns thousands of records per language (≈5 100 rows each as of 2025‑10‑14). The connector honours `maxEntriesPerFetch`, so leave it low for steady‑state and raise it for planned backfills.
|
||||
|
||||
## 2. Telemetry & Logging
|
||||
|
||||
- **Metrics (Meter `StellaOps.Concelier.Connector.Cccs`):**
|
||||
- `cccs.fetch.attempts`, `cccs.fetch.success`, `cccs.fetch.failures`
|
||||
- `cccs.fetch.documents`, `cccs.fetch.unchanged`
|
||||
- `cccs.parse.success`, `cccs.parse.failures`, `cccs.parse.quarantine`
|
||||
- `cccs.map.success`, `cccs.map.failures`
|
||||
- **Shared HTTP metrics** via `SourceDiagnostics`:
|
||||
- `concelier.source.http.requests{concelier.source="cccs"}`
|
||||
- `concelier.source.http.failures{concelier.source="cccs"}`
|
||||
- `concelier.source.http.duration{concelier.source="cccs"}`
|
||||
- **Structured logs**
|
||||
- `CCCS fetch completed feeds=… items=… newDocuments=… pendingDocuments=…`
|
||||
- `CCCS parse completed parsed=… failures=…`
|
||||
- `CCCS map completed mapped=… failures=…`
|
||||
- Warnings fire when GridFS payloads/DTOs go missing or parser sanitisation fails.
|
||||
|
||||
Suggested Grafana alerts:
|
||||
- `increase(cccs.fetch.failures_total[15m]) > 0`
|
||||
- `rate(cccs.map.success_total[1h]) == 0` while other connectors are active
|
||||
- `histogram_quantile(0.95, rate(concelier_source_http_duration_bucket{concelier_source="cccs"}[1h])) > 5s`
|
||||
|
||||
## 3. Historical Backfill Plan
|
||||
|
||||
1. **Snapshot the source** – the API accepts `page=<n>` and `lang=<en|fr>` query parameters. `page=0` returns the full dataset (observed earliest `date_created`: 2018‑06‑08 for EN, 2018‑06‑08 for FR). Mirror those responses into Offline Kit storage when operating air‑gapped.
|
||||
2. **Stage ingestion**:
|
||||
- Temporarily raise `maxEntriesPerFetch` (e.g. 500) and restart Concelier workers.
|
||||
- Run chained jobs until `pendingDocuments` drains:
|
||||
`stella db jobs run source:cccs:fetch --and-then source:cccs:parse --and-then source:cccs:map`
|
||||
- Monitor `cccs.fetch.unchanged` growth; once it approaches dataset size the backfill is complete.
|
||||
3. **Optional pagination sweep** – for incremental mirrors, iterate `page=<n>` (0…N) while `response.Count == 50`, persisting JSON to disk. Store alongside metadata (`language`, `page`, SHA256) so repeated runs detect drift.
|
||||
4. **Language split** – keep EN/FR payloads separate to preserve canonical language fields. The connector emits `Language` directly from the feed entry, so mixed ingestion simply produces parallel advisories keyed by the same serial number.
|
||||
5. **Throttle planning** – schedule backfills during maintenance windows; the API tolerates burst downloads but respect the 250 ms request delay or raise it if mirrored traffic is not available.
|
||||
|
||||
## 4. Selector & Sanitiser Notes
|
||||
|
||||
- `CccsHtmlParser` now parses the **unsanitised DOM** (via AngleSharp) and only sanitises when persisting `ContentHtml`.
|
||||
- Product extraction walks headings (`Affected Products`, `Produits touchés`, `Mesures recommandées`) and consumes nested lists within `div/section/article` containers.
|
||||
- `HtmlContentSanitizer` allows `<h1>…<h6>` and `<section>` so stored HTML keeps headings for UI rendering and downstream summarisation.
|
||||
|
||||
## 5. Fixture Maintenance
|
||||
|
||||
- Regression fixtures live in `src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/Fixtures`.
|
||||
- Refresh via `UPDATE_CCCS_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/StellaOps.Concelier.Connector.Cccs.Tests.csproj`.
|
||||
- Fixtures capture both EN/FR advisories with nested lists to guard against sanitiser regressions; review diffs for heading/list changes before committing.
|
||||
# Concelier CCCS Connector Operations
|
||||
|
||||
This runbook covers day‑to‑day operation of the Canadian Centre for Cyber Security (`source:cccs:*`) connector, including configuration, telemetry, and historical backfill guidance for English/French advisories.
|
||||
|
||||
## 1. Configuration Checklist
|
||||
|
||||
- Network egress (or mirrored cache) for `https://www.cyber.gc.ca/` and the JSON API endpoints under `/api/cccs/`.
|
||||
- Set the Concelier options before restarting workers. Example `concelier.yaml` snippet:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
cccs:
|
||||
feeds:
|
||||
- language: "en"
|
||||
uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=en&content_type=cccs_threat"
|
||||
- language: "fr"
|
||||
uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=fr&content_type=cccs_threat"
|
||||
maxEntriesPerFetch: 80 # increase temporarily for backfill runs
|
||||
maxKnownEntries: 512
|
||||
requestTimeout: "00:00:30"
|
||||
requestDelay: "00:00:00.250"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
> ℹ️ The `/api/cccs/threats/v1/get` endpoint returns thousands of records per language (≈5 100 rows each as of 2025‑10‑14). The connector honours `maxEntriesPerFetch`, so leave it low for steady‑state and raise it for planned backfills.
|
||||
|
||||
## 2. Telemetry & Logging
|
||||
|
||||
- **Metrics (Meter `StellaOps.Concelier.Connector.Cccs`):**
|
||||
- `cccs.fetch.attempts`, `cccs.fetch.success`, `cccs.fetch.failures`
|
||||
- `cccs.fetch.documents`, `cccs.fetch.unchanged`
|
||||
- `cccs.parse.success`, `cccs.parse.failures`, `cccs.parse.quarantine`
|
||||
- `cccs.map.success`, `cccs.map.failures`
|
||||
- **Shared HTTP metrics** via `SourceDiagnostics`:
|
||||
- `concelier.source.http.requests{concelier.source="cccs"}`
|
||||
- `concelier.source.http.failures{concelier.source="cccs"}`
|
||||
- `concelier.source.http.duration{concelier.source="cccs"}`
|
||||
- **Structured logs**
|
||||
- `CCCS fetch completed feeds=… items=… newDocuments=… pendingDocuments=…`
|
||||
- `CCCS parse completed parsed=… failures=…`
|
||||
- `CCCS map completed mapped=… failures=…`
|
||||
- Warnings fire when GridFS payloads/DTOs go missing or parser sanitisation fails.
|
||||
|
||||
Suggested Grafana alerts:
|
||||
- `increase(cccs.fetch.failures_total[15m]) > 0`
|
||||
- `rate(cccs.map.success_total[1h]) == 0` while other connectors are active
|
||||
- `histogram_quantile(0.95, rate(concelier_source_http_duration_bucket{concelier_source="cccs"}[1h])) > 5s`
|
||||
|
||||
## 3. Historical Backfill Plan
|
||||
|
||||
1. **Snapshot the source** – the API accepts `page=<n>` and `lang=<en|fr>` query parameters. `page=0` returns the full dataset (observed earliest `date_created`: 2018‑06‑08 for EN, 2018‑06‑08 for FR). Mirror those responses into Offline Kit storage when operating air‑gapped.
|
||||
2. **Stage ingestion**:
|
||||
- Temporarily raise `maxEntriesPerFetch` (e.g. 500) and restart Concelier workers.
|
||||
- Run chained jobs until `pendingDocuments` drains:
|
||||
`stella db jobs run source:cccs:fetch --and-then source:cccs:parse --and-then source:cccs:map`
|
||||
- Monitor `cccs.fetch.unchanged` growth; once it approaches dataset size the backfill is complete.
|
||||
3. **Optional pagination sweep** – for incremental mirrors, iterate `page=<n>` (0…N) while `response.Count == 50`, persisting JSON to disk. Store alongside metadata (`language`, `page`, SHA256) so repeated runs detect drift.
|
||||
4. **Language split** – keep EN/FR payloads separate to preserve canonical language fields. The connector emits `Language` directly from the feed entry, so mixed ingestion simply produces parallel advisories keyed by the same serial number.
|
||||
5. **Throttle planning** – schedule backfills during maintenance windows; the API tolerates burst downloads but respect the 250 ms request delay or raise it if mirrored traffic is not available.
|
||||
|
||||
## 4. Selector & Sanitiser Notes
|
||||
|
||||
- `CccsHtmlParser` now parses the **unsanitised DOM** (via AngleSharp) and only sanitises when persisting `ContentHtml`.
|
||||
- Product extraction walks headings (`Affected Products`, `Produits touchés`, `Mesures recommandées`) and consumes nested lists within `div/section/article` containers.
|
||||
- `HtmlContentSanitizer` allows `<h1>…<h6>` and `<section>` so stored HTML keeps headings for UI rendering and downstream summarisation.
|
||||
|
||||
## 5. Fixture Maintenance
|
||||
|
||||
- Regression fixtures live in `src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/Fixtures`.
|
||||
- Refresh via `UPDATE_CCCS_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/StellaOps.Concelier.Connector.Cccs.Tests.csproj`.
|
||||
- Fixtures capture both EN/FR advisories with nested lists to guard against sanitiser regressions; review diffs for heading/list changes before committing.
|
||||
|
||||
@@ -1,143 +1,143 @@
|
||||
# Concelier CVE & KEV Connector Operations
|
||||
|
||||
This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.
|
||||
|
||||
## 1. CVE Services Connector (`source:cve:*`)
|
||||
|
||||
### 1.1 Prerequisites
|
||||
|
||||
- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
|
||||
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Concelier workers.
|
||||
- Updated `concelier.yaml` (or the matching environment variables) with the following section:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
cve:
|
||||
baseEndpoint: "https://cveawg.mitre.org/api/"
|
||||
apiOrg: "ORG123"
|
||||
apiUser: "user@example.org"
|
||||
apiKeyFile: "/var/run/secrets/concelier/cve-api-key"
|
||||
seedDirectory: "./seed-data/cve"
|
||||
pageSize: 200
|
||||
maxPagesPerFetch: 5
|
||||
initialBackfill: "30.00:00:00"
|
||||
requestDelay: "00:00:00.250"
|
||||
failureBackoff: "00:10:00"
|
||||
```
|
||||
|
||||
> ℹ️ Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively supply `apiKey` via `CONCELIER_SOURCES__CVE__APIKEY`.
|
||||
|
||||
> 🪙 When credentials are not yet available, configure `seedDirectory` to point at mirrored CVE JSON (for example, the repo’s `seed-data/cve/` bundle). The connector will ingest those records and log a warning instead of failing the job; live fetching resumes automatically once `apiOrg` / `apiUser` / `apiKey` are supplied.
|
||||
|
||||
### 1.2 Smoke Test (staging)
|
||||
|
||||
1. Deploy the updated configuration and restart the Concelier service so the connector picks up the credentials.
|
||||
2. Trigger one end-to-end cycle:
|
||||
- Concelier CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
|
||||
- REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
|
||||
3. Observe the following metrics (exported via OTEL meter `StellaOps.Concelier.Connector.Cve`):
|
||||
- `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged`
|
||||
- `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
|
||||
- `cve.map.success`
|
||||
4. Verify Prometheus shows matching `concelier.source.http.requests_total{concelier_source="cve"}` deltas (list vs detail phases) while `concelier.source.http.failures_total{concelier_source="cve"}` stays flat.
|
||||
5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`.
|
||||
6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
|
||||
|
||||
### 1.3 Production Monitoring
|
||||
|
||||
- **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `concelier_source_http_requests_total{concelier_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `concelier.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts:
|
||||
- `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`)
|
||||
- `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`)
|
||||
- `sum_over_time(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies
|
||||
- **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests.
|
||||
- **Grafana pack** – Import `docs/modules/concelier/operations/connectors/cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout.
|
||||
- **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Concelier to apply changes.
|
||||
|
||||
### 1.4 Staging smoke log (2025-10-15)
|
||||
|
||||
While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests:
|
||||
|
||||
- Command: `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Cve.Tests/StellaOps.Concelier.Connector.Cve.Tests.csproj -l "console;verbosity=detailed"`
|
||||
- Summary log emitted by the connector:
|
||||
```
|
||||
CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1
|
||||
```
|
||||
- Telemetry captured by `Meter` `StellaOps.Concelier.Connector.Cve`:
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| `cve.fetch.attempts` | 1 |
|
||||
| `cve.fetch.success` | 1 |
|
||||
| `cve.fetch.documents` | 1 |
|
||||
| `cve.parse.success` | 1 |
|
||||
| `cve.map.success` | 1 |
|
||||
|
||||
The Grafana pack `docs/modules/concelier/operations/connectors/cve-kev-grafana-dashboard.json` has been imported into staging so the panels referenced above render against these counters once the live API keys are in place.
|
||||
|
||||
## 2. CISA KEV Connector (`source:kev:*`)
|
||||
|
||||
### 2.1 Prerequisites
|
||||
|
||||
- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
|
||||
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
|
||||
- Confirm the following snippet in `concelier.yaml` (defaults shown; tune as needed):
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
kev:
|
||||
feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
|
||||
requestTimeout: "00:01:00"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
### 2.2 Schema validation & anomaly handling
|
||||
|
||||
The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev.parse.failures_total{reason="schema"}` and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev.parse.anomalies_total` with reasons:
|
||||
|
||||
| Reason | Meaning |
|
||||
| --- | --- |
|
||||
| `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. |
|
||||
| `countMismatch` | Catalog `count` field disagreed with the actual entry total. |
|
||||
| `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). |
|
||||
|
||||
Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.
|
||||
|
||||
### 2.3 Smoke Test (staging)
|
||||
|
||||
1. Deploy the configuration and restart Concelier.
|
||||
2. Trigger a pipeline run:
|
||||
- CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
|
||||
3. Verify the metrics exposed by meter `StellaOps.Concelier.Connector.Kev`:
|
||||
- `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
|
||||
- `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
|
||||
- `kev.map.advisories` (tag `catalogVersion`)
|
||||
4. Confirm `concelier.source.http.requests_total{concelier_source="kev"}` increments once per fetch and that the paired `concelier.source.http.failures_total` stays flat (zero increase).
|
||||
5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`—they should appear exactly once per run and `Mapped X/Y… skipped=0` should match the `kev.map.advisories` delta.
|
||||
6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.
|
||||
|
||||
### 2.4 Production Monitoring
|
||||
|
||||
- Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`.
|
||||
- Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`—this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as networking issues to the mirror.
|
||||
- Track anomaly spikes through `sum_over_time(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs.
|
||||
- Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.
|
||||
|
||||
### 2.5 Known good dashboard tiles
|
||||
|
||||
Add the following panels to the Concelier observability board:
|
||||
|
||||
| Metric | Recommended visualisation |
|
||||
|--------|---------------------------|
|
||||
| `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` |
|
||||
| `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
|
||||
| `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) |
|
||||
| `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted |
|
||||
|
||||
## 3. Runbook updates
|
||||
|
||||
- Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
|
||||
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
|
||||
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
|
||||
- Version-control dashboard tweaks alongside `docs/modules/concelier/operations/connectors/cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores.
|
||||
# Concelier CVE & KEV Connector Operations
|
||||
|
||||
This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.
|
||||
|
||||
## 1. CVE Services Connector (`source:cve:*`)
|
||||
|
||||
### 1.1 Prerequisites
|
||||
|
||||
- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
|
||||
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Concelier workers.
|
||||
- Updated `concelier.yaml` (or the matching environment variables) with the following section:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
cve:
|
||||
baseEndpoint: "https://cveawg.mitre.org/api/"
|
||||
apiOrg: "ORG123"
|
||||
apiUser: "user@example.org"
|
||||
apiKeyFile: "/var/run/secrets/concelier/cve-api-key"
|
||||
seedDirectory: "./seed-data/cve"
|
||||
pageSize: 200
|
||||
maxPagesPerFetch: 5
|
||||
initialBackfill: "30.00:00:00"
|
||||
requestDelay: "00:00:00.250"
|
||||
failureBackoff: "00:10:00"
|
||||
```
|
||||
|
||||
> ℹ️ Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively supply `apiKey` via `CONCELIER_SOURCES__CVE__APIKEY`.
|
||||
|
||||
> 🪙 When credentials are not yet available, configure `seedDirectory` to point at mirrored CVE JSON (for example, the repo’s `seed-data/cve/` bundle). The connector will ingest those records and log a warning instead of failing the job; live fetching resumes automatically once `apiOrg` / `apiUser` / `apiKey` are supplied.
|
||||
|
||||
### 1.2 Smoke Test (staging)
|
||||
|
||||
1. Deploy the updated configuration and restart the Concelier service so the connector picks up the credentials.
|
||||
2. Trigger one end-to-end cycle:
|
||||
- Concelier CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
|
||||
- REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
|
||||
3. Observe the following metrics (exported via OTEL meter `StellaOps.Concelier.Connector.Cve`):
|
||||
- `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged`
|
||||
- `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
|
||||
- `cve.map.success`
|
||||
4. Verify Prometheus shows matching `concelier.source.http.requests_total{concelier_source="cve"}` deltas (list vs detail phases) while `concelier.source.http.failures_total{concelier_source="cve"}` stays flat.
|
||||
5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`.
|
||||
6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
|
||||
|
||||
### 1.3 Production Monitoring
|
||||
|
||||
- **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `concelier_source_http_requests_total{concelier_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `concelier.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts:
|
||||
- `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`)
|
||||
- `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`)
|
||||
- `sum_over_time(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies
|
||||
- **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests.
|
||||
- **Grafana pack** – Import `docs/modules/concelier/operations/connectors/cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout.
|
||||
- **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Concelier to apply changes.
|
||||
|
||||
### 1.4 Staging smoke log (2025-10-15)
|
||||
|
||||
While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests:
|
||||
|
||||
- Command: `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Cve.Tests/StellaOps.Concelier.Connector.Cve.Tests.csproj -l "console;verbosity=detailed"`
|
||||
- Summary log emitted by the connector:
|
||||
```
|
||||
CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1
|
||||
```
|
||||
- Telemetry captured by `Meter` `StellaOps.Concelier.Connector.Cve`:
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| `cve.fetch.attempts` | 1 |
|
||||
| `cve.fetch.success` | 1 |
|
||||
| `cve.fetch.documents` | 1 |
|
||||
| `cve.parse.success` | 1 |
|
||||
| `cve.map.success` | 1 |
|
||||
|
||||
The Grafana pack `docs/modules/concelier/operations/connectors/cve-kev-grafana-dashboard.json` has been imported into staging so the panels referenced above render against these counters once the live API keys are in place.
|
||||
|
||||
## 2. CISA KEV Connector (`source:kev:*`)
|
||||
|
||||
### 2.1 Prerequisites
|
||||
|
||||
- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
|
||||
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
|
||||
- Confirm the following snippet in `concelier.yaml` (defaults shown; tune as needed):
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
kev:
|
||||
feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
|
||||
requestTimeout: "00:01:00"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
### 2.2 Schema validation & anomaly handling
|
||||
|
||||
The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev.parse.failures_total{reason="schema"}` and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev.parse.anomalies_total` with reasons:
|
||||
|
||||
| Reason | Meaning |
|
||||
| --- | --- |
|
||||
| `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. |
|
||||
| `countMismatch` | Catalog `count` field disagreed with the actual entry total. |
|
||||
| `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). |
|
||||
|
||||
Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.
|
||||
|
||||
### 2.3 Smoke Test (staging)
|
||||
|
||||
1. Deploy the configuration and restart Concelier.
|
||||
2. Trigger a pipeline run:
|
||||
- CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
|
||||
3. Verify the metrics exposed by meter `StellaOps.Concelier.Connector.Kev`:
|
||||
- `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
|
||||
- `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
|
||||
- `kev.map.advisories` (tag `catalogVersion`)
|
||||
4. Confirm `concelier.source.http.requests_total{concelier_source="kev"}` increments once per fetch and that the paired `concelier.source.http.failures_total` stays flat (zero increase).
|
||||
5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`—they should appear exactly once per run and `Mapped X/Y… skipped=0` should match the `kev.map.advisories` delta.
|
||||
6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.
|
||||
|
||||
### 2.4 Production Monitoring
|
||||
|
||||
- Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`.
|
||||
- Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`—this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as networking issues to the mirror.
|
||||
- Track anomaly spikes through `sum_over_time(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs.
|
||||
- Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.
|
||||
|
||||
### 2.5 Known good dashboard tiles
|
||||
|
||||
Add the following panels to the Concelier observability board:
|
||||
|
||||
| Metric | Recommended visualisation |
|
||||
|--------|---------------------------|
|
||||
| `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` |
|
||||
| `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
|
||||
| `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) |
|
||||
| `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted |
|
||||
|
||||
## 3. Runbook updates
|
||||
|
||||
- Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
|
||||
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
|
||||
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
|
||||
- Version-control dashboard tweaks alongside `docs/modules/concelier/operations/connectors/cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores.
|
||||
|
||||
@@ -1,74 +1,74 @@
|
||||
# Concelier KISA Connector Operations
|
||||
|
||||
Operational guidance for the Korea Internet & Security Agency (KISA / KNVD) connector (`source:kisa:*`). Pair this with the engineering brief in `docs/dev/kisa_connector_notes.md`.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- Outbound HTTPS (or mirrored cache) for `https://knvd.krcert.or.kr/`.
|
||||
- Connector options defined under `concelier:sources:kisa`:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
kisa:
|
||||
feedUri: "https://knvd.krcert.or.kr/rss/securityInfo.do"
|
||||
detailApiUri: "https://knvd.krcert.or.kr/rssDetailData.do"
|
||||
detailPageUri: "https://knvd.krcert.or.kr/detailDos.do"
|
||||
maxAdvisoriesPerFetch: 10
|
||||
requestDelay: "00:00:01"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
> Ensure the URIs stay absolute—Concelier adds the `feedUri`/`detailApiUri` hosts to the HttpClient allow-list automatically.
|
||||
|
||||
## 2. Staging Smoke Test
|
||||
|
||||
1. Restart the Concelier workers so the KISA options bind.
|
||||
2. Run a full connector cycle:
|
||||
- CLI: `stella db jobs run source:kisa:fetch --and-then source:kisa:parse --and-then source:kisa:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:kisa:fetch", "chain": ["source:kisa:parse", "source:kisa:map"] }`
|
||||
3. Confirm telemetry (Meter `StellaOps.Concelier.Connector.Kisa`):
|
||||
- `kisa.feed.success`, `kisa.feed.items`
|
||||
- `kisa.detail.success` / `.failures`
|
||||
- `kisa.parse.success` / `.failures`
|
||||
- `kisa.map.success` / `.failures`
|
||||
- `kisa.cursor.updates`
|
||||
4. Inspect logs for structured entries:
|
||||
- `KISA feed returned {ItemCount}`
|
||||
- `KISA fetched detail for {Idx} … category={Category}`
|
||||
- `KISA mapped advisory {AdvisoryId} (severity={Severity})`
|
||||
- Absence of warnings such as `document missing GridFS payload`.
|
||||
5. Validate MongoDB state:
|
||||
- `raw_documents.metadata` has `kisa.idx`, `kisa.category`, `kisa.title`.
|
||||
- DTO store contains `schemaVersion="kisa.detail.v1"`.
|
||||
- Advisories include aliases (`IDX`, CVE) and `language="ko"`.
|
||||
- `source_states` entry for `kisa` shows recent `cursor.lastFetchAt`.
|
||||
|
||||
## 3. Production Monitoring
|
||||
|
||||
- **Dashboards** – Add the following Prometheus/OTEL expressions:
|
||||
- `rate(kisa_feed_items_total[15m])` versus `rate(concelier_source_http_requests_total{concelier_source="kisa"}[15m])`
|
||||
- `increase(kisa_detail_failures_total{reason!="empty-document"}[1h])` alert at `>0`
|
||||
- `increase(kisa_parse_failures_total[1h])` for storage/JSON issues
|
||||
- `increase(kisa_map_failures_total[1h])` to flag schema drift
|
||||
- `increase(kisa_cursor_updates_total[6h]) == 0` during active windows → warn
|
||||
- **Alerts** – Page when `rate(kisa_feed_success_total[2h]) == 0` while other connectors are active; back off for maintenance windows announced on `https://knvd.krcert.or.kr/`.
|
||||
- **Logs** – Watch for repeated warnings (`document missing`, `DTO missing`) or errors with reason tags `HttpRequestException`, `download`, `parse`, `map`.
|
||||
|
||||
## 4. Localisation Handling
|
||||
|
||||
- Hangul categories (for example `취약점정보`) flow into telemetry tags (`category=…`) and logs. Dashboards must render UTF‑8 and avoid transliteration.
|
||||
- HTML content is sanitised before storage; translation teams can consume the `ContentHtml` field safely.
|
||||
- Advisory severity remains as provided by KISA (`High`, `Medium`, etc.). Map-level failures include the severity tag for filtering.
|
||||
|
||||
## 5. Fixture & Regression Maintenance
|
||||
|
||||
- Regression fixtures: `src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/Fixtures/kisa-feed.xml` and `kisa-detail.json`.
|
||||
- Refresh via `UPDATE_KISA_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/StellaOps.Concelier.Connector.Kisa.Tests.csproj`.
|
||||
- The telemetry regression (`KisaConnectorTests.Telemetry_RecordsMetrics`) will fail if counters/log wiring drifts—treat failures as gating.
|
||||
|
||||
## 6. Known Issues
|
||||
|
||||
- RSS feeds only expose the latest 10 advisories; long outages require replay via archived feeds or manual IDX seeds.
|
||||
- Detail endpoint occasionally throttles; the connector honours `requestDelay` and reports failures with reason `HttpRequestException`. Consider increasing delay for weekend backfills.
|
||||
- If `kisa.category` tags suddenly appear as `unknown`, verify KISA has not renamed RSS elements; update the parser fixtures before production rollout.
|
||||
# Concelier KISA Connector Operations
|
||||
|
||||
Operational guidance for the Korea Internet & Security Agency (KISA / KNVD) connector (`source:kisa:*`). Pair this with the engineering brief in `docs/dev/kisa_connector_notes.md`.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- Outbound HTTPS (or mirrored cache) for `https://knvd.krcert.or.kr/`.
|
||||
- Connector options defined under `concelier:sources:kisa`:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
kisa:
|
||||
feedUri: "https://knvd.krcert.or.kr/rss/securityInfo.do"
|
||||
detailApiUri: "https://knvd.krcert.or.kr/rssDetailData.do"
|
||||
detailPageUri: "https://knvd.krcert.or.kr/detailDos.do"
|
||||
maxAdvisoriesPerFetch: 10
|
||||
requestDelay: "00:00:01"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
> Ensure the URIs stay absolute—Concelier adds the `feedUri`/`detailApiUri` hosts to the HttpClient allow-list automatically.
|
||||
|
||||
## 2. Staging Smoke Test
|
||||
|
||||
1. Restart the Concelier workers so the KISA options bind.
|
||||
2. Run a full connector cycle:
|
||||
- CLI: `stella db jobs run source:kisa:fetch --and-then source:kisa:parse --and-then source:kisa:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:kisa:fetch", "chain": ["source:kisa:parse", "source:kisa:map"] }`
|
||||
3. Confirm telemetry (Meter `StellaOps.Concelier.Connector.Kisa`):
|
||||
- `kisa.feed.success`, `kisa.feed.items`
|
||||
- `kisa.detail.success` / `.failures`
|
||||
- `kisa.parse.success` / `.failures`
|
||||
- `kisa.map.success` / `.failures`
|
||||
- `kisa.cursor.updates`
|
||||
4. Inspect logs for structured entries:
|
||||
- `KISA feed returned {ItemCount}`
|
||||
- `KISA fetched detail for {Idx} … category={Category}`
|
||||
- `KISA mapped advisory {AdvisoryId} (severity={Severity})`
|
||||
- Absence of warnings such as `document missing GridFS payload`.
|
||||
5. Validate MongoDB state:
|
||||
- `raw_documents.metadata` has `kisa.idx`, `kisa.category`, `kisa.title`.
|
||||
- DTO store contains `schemaVersion="kisa.detail.v1"`.
|
||||
- Advisories include aliases (`IDX`, CVE) and `language="ko"`.
|
||||
- `source_states` entry for `kisa` shows recent `cursor.lastFetchAt`.
|
||||
|
||||
## 3. Production Monitoring
|
||||
|
||||
- **Dashboards** – Add the following Prometheus/OTEL expressions:
|
||||
- `rate(kisa_feed_items_total[15m])` versus `rate(concelier_source_http_requests_total{concelier_source="kisa"}[15m])`
|
||||
- `increase(kisa_detail_failures_total{reason!="empty-document"}[1h])` alert at `>0`
|
||||
- `increase(kisa_parse_failures_total[1h])` for storage/JSON issues
|
||||
- `increase(kisa_map_failures_total[1h])` to flag schema drift
|
||||
- `increase(kisa_cursor_updates_total[6h]) == 0` during active windows → warn
|
||||
- **Alerts** – Page when `rate(kisa_feed_success_total[2h]) == 0` while other connectors are active; back off for maintenance windows announced on `https://knvd.krcert.or.kr/`.
|
||||
- **Logs** – Watch for repeated warnings (`document missing`, `DTO missing`) or errors with reason tags `HttpRequestException`, `download`, `parse`, `map`.
|
||||
|
||||
## 4. Localisation Handling
|
||||
|
||||
- Hangul categories (for example `취약점정보`) flow into telemetry tags (`category=…`) and logs. Dashboards must render UTF‑8 and avoid transliteration.
|
||||
- HTML content is sanitised before storage; translation teams can consume the `ContentHtml` field safely.
|
||||
- Advisory severity remains as provided by KISA (`High`, `Medium`, etc.). Map-level failures include the severity tag for filtering.
|
||||
|
||||
## 5. Fixture & Regression Maintenance
|
||||
|
||||
- Regression fixtures: `src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/Fixtures/kisa-feed.xml` and `kisa-detail.json`.
|
||||
- Refresh via `UPDATE_KISA_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/StellaOps.Concelier.Connector.Kisa.Tests.csproj`.
|
||||
- The telemetry regression (`KisaConnectorTests.Telemetry_RecordsMetrics`) will fail if counters/log wiring drifts—treat failures as gating.
|
||||
|
||||
## 6. Known Issues
|
||||
|
||||
- RSS feeds only expose the latest 10 advisories; long outages require replay via archived feeds or manual IDX seeds.
|
||||
- Detail endpoint occasionally throttles; the connector honours `requestDelay` and reports failures with reason `HttpRequestException`. Consider increasing delay for weekend backfills.
|
||||
- If `kisa.category` tags suddenly appear as `unknown`, verify KISA has not renamed RSS elements; update the parser fixtures before production rollout.
|
||||
|
||||
@@ -1,48 +1,48 @@
|
||||
# NKCKI Connector Operations Guide
|
||||
|
||||
## Overview
|
||||
|
||||
The NKCKI connector ingests JSON bulletin archives from cert.gov.ru, expanding each `*.json.zip` attachment into per-vulnerability DTOs before canonical mapping. The fetch pipeline now supports cache-backed recovery, deterministic pagination, and telemetry suitable for production monitoring.
|
||||
|
||||
## Configuration
|
||||
|
||||
Key options exposed through `concelier:sources:ru-nkcki:http`:
|
||||
|
||||
- `maxBulletinsPerFetch` – limits new bulletin downloads in a single run (default `5`).
|
||||
- `maxListingPagesPerFetch` – maximum listing pages visited during pagination (default `3`).
|
||||
- `listingCacheDuration` – minimum interval between listing fetches before falling back to cached artefacts (default `00:10:00`).
|
||||
- `cacheDirectory` – optional path for persisted bulletin archives used during offline or failure scenarios.
|
||||
- `requestDelay` – delay inserted between bulletin downloads to respect upstream politeness.
|
||||
|
||||
When operating in offline-first mode, set `cacheDirectory` to a writable path (e.g. `/var/lib/concelier/cache/ru-nkcki`) and pre-populate bulletin archives via the offline kit.
|
||||
|
||||
## Telemetry
|
||||
|
||||
`RuNkckiDiagnostics` emits the following metrics under meter `StellaOps.Concelier.Connector.Ru.Nkcki`:
|
||||
|
||||
- `nkcki.listing.fetch.attempts` / `nkcki.listing.fetch.success` / `nkcki.listing.fetch.failures`
|
||||
- `nkcki.listing.pages.visited` (histogram, `pages`)
|
||||
- `nkcki.listing.attachments.discovered` / `nkcki.listing.attachments.new`
|
||||
- `nkcki.bulletin.fetch.success` / `nkcki.bulletin.fetch.cached` / `nkcki.bulletin.fetch.failures`
|
||||
- `nkcki.entries.processed` (histogram, `entries`)
|
||||
|
||||
Integrate these counters into standard Concelier observability dashboards to track crawl coverage and cache hit rates.
|
||||
|
||||
## Archive Backfill Strategy
|
||||
|
||||
Bitrix pagination surfaces archives via `?PAGEN_1=n`. The connector now walks up to `maxListingPagesPerFetch` pages, deduplicating bulletin IDs and maintaining a rolling `knownBulletins` window. Backfill strategy:
|
||||
|
||||
1. Enumerate pages from newest to oldest, respecting `maxListingPagesPerFetch` and `listingCacheDuration` to avoid refetch storms.
|
||||
2. Persist every `*.json.zip` attachment to the configured cache directory. This enables replay when listing access is temporarily blocked.
|
||||
3. During archive replay, `ProcessCachedBulletinsAsync` enqueues missing documents while respecting `maxVulnerabilitiesPerFetch`.
|
||||
4. For historical HTML-only advisories, collect page URLs and metadata while offline (future work: HTML and PDF extraction pipeline documented in `docs/concelier-connector-research-20251011.md`).
|
||||
|
||||
For large migrations, seed caches with archived zip bundles, then run fetch/parse/map cycles in chronological order to maintain deterministic outputs.
|
||||
|
||||
## Failure Handling
|
||||
|
||||
- Listing failures mark the source state with exponential backoff while attempting cache replay.
|
||||
- Bulletin fetches fall back to cached copies before surfacing an error.
|
||||
- Mongo integration tests rely on bundled OpenSSL 1.1 libraries (`src/Tools/openssl/linux-x64`) to keep `Mongo2Go` operational on modern distros.
|
||||
|
||||
Refer to `ru-nkcki` entries in `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ru.Nkcki/TASKS.md` for outstanding items.
|
||||
# NKCKI Connector Operations Guide
|
||||
|
||||
## Overview
|
||||
|
||||
The NKCKI connector ingests JSON bulletin archives from cert.gov.ru, expanding each `*.json.zip` attachment into per-vulnerability DTOs before canonical mapping. The fetch pipeline now supports cache-backed recovery, deterministic pagination, and telemetry suitable for production monitoring.
|
||||
|
||||
## Configuration
|
||||
|
||||
Key options exposed through `concelier:sources:ru-nkcki:http`:
|
||||
|
||||
- `maxBulletinsPerFetch` – limits new bulletin downloads in a single run (default `5`).
|
||||
- `maxListingPagesPerFetch` – maximum listing pages visited during pagination (default `3`).
|
||||
- `listingCacheDuration` – minimum interval between listing fetches before falling back to cached artefacts (default `00:10:00`).
|
||||
- `cacheDirectory` – optional path for persisted bulletin archives used during offline or failure scenarios.
|
||||
- `requestDelay` – delay inserted between bulletin downloads to respect upstream politeness.
|
||||
|
||||
When operating in offline-first mode, set `cacheDirectory` to a writable path (e.g. `/var/lib/concelier/cache/ru-nkcki`) and pre-populate bulletin archives via the offline kit.
|
||||
|
||||
## Telemetry
|
||||
|
||||
`RuNkckiDiagnostics` emits the following metrics under meter `StellaOps.Concelier.Connector.Ru.Nkcki`:
|
||||
|
||||
- `nkcki.listing.fetch.attempts` / `nkcki.listing.fetch.success` / `nkcki.listing.fetch.failures`
|
||||
- `nkcki.listing.pages.visited` (histogram, `pages`)
|
||||
- `nkcki.listing.attachments.discovered` / `nkcki.listing.attachments.new`
|
||||
- `nkcki.bulletin.fetch.success` / `nkcki.bulletin.fetch.cached` / `nkcki.bulletin.fetch.failures`
|
||||
- `nkcki.entries.processed` (histogram, `entries`)
|
||||
|
||||
Integrate these counters into standard Concelier observability dashboards to track crawl coverage and cache hit rates.
|
||||
|
||||
## Archive Backfill Strategy
|
||||
|
||||
Bitrix pagination surfaces archives via `?PAGEN_1=n`. The connector now walks up to `maxListingPagesPerFetch` pages, deduplicating bulletin IDs and maintaining a rolling `knownBulletins` window. Backfill strategy:
|
||||
|
||||
1. Enumerate pages from newest to oldest, respecting `maxListingPagesPerFetch` and `listingCacheDuration` to avoid refetch storms.
|
||||
2. Persist every `*.json.zip` attachment to the configured cache directory. This enables replay when listing access is temporarily blocked.
|
||||
3. During archive replay, `ProcessCachedBulletinsAsync` enqueues missing documents while respecting `maxVulnerabilitiesPerFetch`.
|
||||
4. For historical HTML-only advisories, collect page URLs and metadata while offline (future work: HTML and PDF extraction pipeline documented in `docs/concelier-connector-research-20251011.md`).
|
||||
|
||||
For large migrations, seed caches with archived zip bundles, then run fetch/parse/map cycles in chronological order to maintain deterministic outputs.
|
||||
|
||||
## Failure Handling
|
||||
|
||||
- Listing failures mark the source state with exponential backoff while attempting cache replay.
|
||||
- Bulletin fetches fall back to cached copies before surfacing an error.
|
||||
- Mongo integration tests rely on bundled OpenSSL 1.1 libraries (`src/Tools/openssl/linux-x64`) to keep `Mongo2Go` operational on modern distros.
|
||||
|
||||
Refer to `ru-nkcki` entries in `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ru.Nkcki/TASKS.md` for outstanding items.
|
||||
|
||||
@@ -1,24 +1,24 @@
|
||||
# Concelier OSV Connector – Operations Notes
|
||||
|
||||
_Last updated: 2025-10-16_
|
||||
|
||||
The OSV connector ingests advisories from OSV.dev across OSS ecosystems. This note highlights the additional merge/export expectations introduced with the canonical metric fallback work in Sprint 4.
|
||||
|
||||
## 1. Canonical metric fallbacks
|
||||
- When OSV omits CVSS vectors (common for CVSS v4-only payloads) the mapper now emits a deterministic canonical metric id in the form `osv:severity/<level>` and normalises the advisory severity to the same `<level>`.
|
||||
- Metric: `osv.map.canonical_metric_fallbacks` (counter) with tags `severity`, `canonical_metric_id`, `ecosystem`, `reason=no_cvss`. Watch this alongside merge parity dashboards to catch spikes where OSV publishes severity-only advisories.
|
||||
- Merge precedence still prefers GHSA over OSV; the shared severity-based canonical id keeps Merge/export parity deterministic even when only OSV supplies severity data.
|
||||
|
||||
## 2. CWE provenance
|
||||
- `database_specific.cwe_ids` now populates provenance decision reasons for every mapped weakness. Expect `decisionReason="database_specific.cwe_ids"` on OSV weakness provenance and confirm exporters preserve the value.
|
||||
- If OSV ever attaches `database_specific.cwe_notes`, the connector will surface the joined note string in `decisionReason` instead of the default marker.
|
||||
|
||||
## 3. Dashboards & alerts
|
||||
- Extend existing merge dashboards with the new counter:
|
||||
- Overlay `sum(osv.map.canonical_metric_fallbacks{ecosystem=~".+"})` with Merge severity overrides to confirm fallback advisories are reconciling cleanly.
|
||||
- Alert when the 1-hour sum exceeds 50 for any ecosystem; baseline volume is currently <5 per day (mostly GHSA mirrors emitting CVSS v4 only).
|
||||
- Exporters already surface `canonicalMetricId`; no schema change is required, but ORAS/Trivy bundles should be spot-checked after deploying the connector update.
|
||||
|
||||
## 4. Runbook updates
|
||||
- Fixture parity suites (`osv-ghsa.*`) now assert the fallback id and provenance notes. Regenerate via `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj`.
|
||||
- When investigating merge severity conflicts, include the fallback counter and confirm OSV advisories carry the expected `osv:severity/<level>` id before raising connector bugs.
|
||||
# Concelier OSV Connector – Operations Notes
|
||||
|
||||
_Last updated: 2025-10-16_
|
||||
|
||||
The OSV connector ingests advisories from OSV.dev across OSS ecosystems. This note highlights the additional merge/export expectations introduced with the canonical metric fallback work in Sprint 4.
|
||||
|
||||
## 1. Canonical metric fallbacks
|
||||
- When OSV omits CVSS vectors (common for CVSS v4-only payloads) the mapper now emits a deterministic canonical metric id in the form `osv:severity/<level>` and normalises the advisory severity to the same `<level>`.
|
||||
- Metric: `osv.map.canonical_metric_fallbacks` (counter) with tags `severity`, `canonical_metric_id`, `ecosystem`, `reason=no_cvss`. Watch this alongside merge parity dashboards to catch spikes where OSV publishes severity-only advisories.
|
||||
- Merge precedence still prefers GHSA over OSV; the shared severity-based canonical id keeps Merge/export parity deterministic even when only OSV supplies severity data.
|
||||
|
||||
## 2. CWE provenance
|
||||
- `database_specific.cwe_ids` now populates provenance decision reasons for every mapped weakness. Expect `decisionReason="database_specific.cwe_ids"` on OSV weakness provenance and confirm exporters preserve the value.
|
||||
- If OSV ever attaches `database_specific.cwe_notes`, the connector will surface the joined note string in `decisionReason` instead of the default marker.
|
||||
|
||||
## 3. Dashboards & alerts
|
||||
- Extend existing merge dashboards with the new counter:
|
||||
- Overlay `sum(osv.map.canonical_metric_fallbacks{ecosystem=~".+"})` with Merge severity overrides to confirm fallback advisories are reconciling cleanly.
|
||||
- Alert when the 1-hour sum exceeds 50 for any ecosystem; baseline volume is currently <5 per day (mostly GHSA mirrors emitting CVSS v4 only).
|
||||
- Exporters already surface `canonicalMetricId`; no schema change is required, but ORAS/Trivy bundles should be spot-checked after deploying the connector update.
|
||||
|
||||
## 4. Runbook updates
|
||||
- Fixture parity suites (`osv-ghsa.*`) now assert the fallback id and provenance notes. Regenerate via `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj`.
|
||||
- When investigating merge severity conflicts, include the fallback counter and confirm OSV advisories carry the expected `osv:severity/<level>` id before raising connector bugs.
|
||||
|
||||
@@ -1,238 +1,238 @@
|
||||
# Concelier & Excititor Mirror Operations
|
||||
|
||||
This runbook describes how Stella Ops operates the managed mirrors under `*.stella-ops.org`.
|
||||
It covers Docker Compose and Helm deployment overlays, secret handling for multi-tenant
|
||||
authn, CDN fronting, and the recurring sync pipeline that keeps mirror bundles current.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- **Authority access** – client credentials (`client_id` + secret) authorised for
|
||||
`concelier.mirror.read` and `excititor.mirror.read` scopes. Secrets live outside git.
|
||||
- **Signed TLS certificates** – wildcard or per-domain (`mirror-primary`, `mirror-community`).
|
||||
Store them under `deploy/compose/mirror-gateway/tls/` or in Kubernetes secrets.
|
||||
- **Mirror gateway credentials** – Basic Auth htpasswd files per domain. Generate with
|
||||
`htpasswd -B`. Operators distribute credentials to downstream consumers.
|
||||
- **Export artifact source** – read access to the canonical S3 buckets (or rsync share)
|
||||
that hold `concelier` JSON bundles and `excititor` VEX exports.
|
||||
- **Persistent volumes** – storage for Concelier job metadata and mirror export trees.
|
||||
For Helm, provision PVCs (`concelier-mirror-jobs`, `concelier-mirror-exports`,
|
||||
`excititor-mirror-exports`, `mirror-mongo-data`, `mirror-minio-data`) before rollout.
|
||||
|
||||
### 1.1 Service configuration quick reference
|
||||
|
||||
Concelier.WebService exposes the mirror HTTP endpoints once `CONCELIER__MIRROR__ENABLED=true`.
|
||||
Key knobs:
|
||||
|
||||
- `CONCELIER__MIRROR__EXPORTROOT` – root folder containing export snapshots (`<exportId>/mirror/*`).
|
||||
- `CONCELIER__MIRROR__ACTIVEEXPORTID` – optional explicit export id; otherwise the service auto-falls back to the `latest/` symlink or newest directory.
|
||||
- `CONCELIER__MIRROR__REQUIREAUTHENTICATION` – default auth requirement; override per domain with `CONCELIER__MIRROR__DOMAINS__{n}__REQUIREAUTHENTICATION`.
|
||||
- `CONCELIER__MIRROR__MAXINDEXREQUESTSPERHOUR` – budget for `/concelier/exports/index.json`. Domains inherit this value unless they define `__MAXDOWNLOADREQUESTSPERHOUR`.
|
||||
- `CONCELIER__MIRROR__DOMAINS__{n}__ID` – domain identifier matching the exporter manifest; additional keys configure display name and rate budgets.
|
||||
|
||||
> The service honours Stella Ops Authority when `CONCELIER__AUTHORITY__ENABLED=true` and `ALLOWANONYMOUSFALLBACK=false`. Use the bypass CIDR list (`CONCELIER__AUTHORITY__BYPASSNETWORKS__*`) for in-cluster ingress gateways that terminate Basic Auth. Unauthorized requests emit `WWW-Authenticate: Bearer` so downstream automation can detect token failures.
|
||||
|
||||
Mirror responses carry deterministic cache headers: `/index.json` returns `Cache-Control: public, max-age=60`, while per-domain manifests/bundles include `Cache-Control: public, max-age=300, immutable`. Rate limiting surfaces `Retry-After` when quotas are exceeded.
|
||||
|
||||
### 1.2 Mirror connector configuration
|
||||
|
||||
Downstream Concelier instances ingest published bundles using the `StellaOpsMirrorConnector`. Operators running the connector in air‑gapped or limited connectivity environments can tune the following options (environment prefix `CONCELIER__SOURCES__STELLAOPSMIRROR__`):
|
||||
|
||||
- `BASEADDRESS` – absolute mirror root (e.g., `https://mirror-primary.stella-ops.org`).
|
||||
- `INDEXPATH` – relative path to the mirror index (`/concelier/exports/index.json` by default).
|
||||
- `DOMAINID` – mirror domain identifier from the index (`primary`, `community`, etc.).
|
||||
- `HTTPTIMEOUT` – request timeout; raise when mirrors sit behind slow WAN links.
|
||||
- `SIGNATURE__ENABLED` – require detached JWS verification for `bundle.json`.
|
||||
- `SIGNATURE__KEYID` / `SIGNATURE__PROVIDER` – expected signing key metadata.
|
||||
- `SIGNATURE__PUBLICKEYPATH` – PEM fallback used when the mirror key registry is offline.
|
||||
|
||||
The connector keeps a per-export fingerprint (bundle digest + generated-at timestamp) and tracks outstanding document IDs. If a scan is interrupted, the next run resumes parse/map work using the stored fingerprint and pending document lists—no network requests are reissued unless the upstream digest changes.
|
||||
|
||||
## 2. Secret & certificate layout
|
||||
|
||||
### Docker Compose (`deploy/compose/docker-compose.mirror.yaml`)
|
||||
|
||||
- `deploy/compose/env/mirror.env.example` – copy to `.env` and adjust quotas or domain IDs.
|
||||
- `deploy/compose/mirror-secrets/` – mount read-only into `/run/secrets`. Place:
|
||||
- `concelier-authority-client` – Authority client secret.
|
||||
- `excititor-authority-client` (optional) – reserve for future authn.
|
||||
- `deploy/compose/mirror-gateway/tls/` – PEM-encoded cert/key pairs:
|
||||
- `mirror-primary.crt`, `mirror-primary.key`
|
||||
- `mirror-community.crt`, `mirror-community.key`
|
||||
- `deploy/compose/mirror-gateway/secrets/` – htpasswd files:
|
||||
- `mirror-primary.htpasswd`
|
||||
- `mirror-community.htpasswd`
|
||||
|
||||
### Helm (`deploy/helm/stellaops/values-mirror.yaml`)
|
||||
|
||||
Create secrets in the target namespace:
|
||||
|
||||
```bash
|
||||
kubectl create secret generic concelier-mirror-auth \
|
||||
--from-file=concelier-authority-client=concelier-authority-client
|
||||
|
||||
kubectl create secret generic excititor-mirror-auth \
|
||||
--from-file=excititor-authority-client=excititor-authority-client
|
||||
|
||||
kubectl create secret tls mirror-gateway-tls \
|
||||
--cert=mirror-primary.crt --key=mirror-primary.key
|
||||
|
||||
kubectl create secret generic mirror-gateway-htpasswd \
|
||||
--from-file=mirror-primary.htpasswd --from-file=mirror-community.htpasswd
|
||||
```
|
||||
|
||||
> Keep Basic Auth lists short-lived (rotate quarterly) and document credential recipients.
|
||||
|
||||
## 3. Deployment
|
||||
|
||||
### 3.1 Docker Compose (edge mirrors, lab validation)
|
||||
|
||||
1. `cp deploy/compose/env/mirror.env.example deploy/compose/env/mirror.env`
|
||||
2. Populate secrets/tls directories as described above.
|
||||
3. Sync mirror bundles (see §4) into `deploy/compose/mirror-data/…` and ensure they are mounted
|
||||
on the host path backing the `concelier-exports` and `excititor-exports` volumes.
|
||||
4. Run the profile validator: `deploy/tools/validate-profiles.sh`.
|
||||
5. Launch: `docker compose --env-file env/mirror.env -f docker-compose.mirror.yaml up -d`.
|
||||
|
||||
### 3.2 Helm (production mirrors)
|
||||
|
||||
1. Provision PVCs sized for mirror bundles (baseline: 20 GiB per domain).
|
||||
2. Create secrets/tls config maps (§2).
|
||||
3. `helm upgrade --install mirror deploy/helm/stellaops -f deploy/helm/stellaops/values-mirror.yaml`.
|
||||
4. Annotate the `stellaops-mirror-gateway` service with ingress/LoadBalancer metadata required by
|
||||
your CDN (e.g., AWS load balancer scheme internal + NLB idle timeout).
|
||||
|
||||
## 4. Artifact sync workflow
|
||||
|
||||
Mirrors never generate exports—they ingest signed bundles produced by the Concelier and Excititor
|
||||
export jobs. Recommended sync pattern:
|
||||
|
||||
### 4.1 Compose host (systemd timer)
|
||||
|
||||
`/usr/local/bin/mirror-sync.sh`:
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
export AWS_ACCESS_KEY_ID=…
|
||||
export AWS_SECRET_ACCESS_KEY=…
|
||||
|
||||
aws s3 sync s3://mirror-stellaops/concelier/latest \
|
||||
/opt/stellaops/mirror-data/concelier --delete --size-only
|
||||
|
||||
aws s3 sync s3://mirror-stellaops/excititor/latest \
|
||||
/opt/stellaops/mirror-data/excititor --delete --size-only
|
||||
```
|
||||
|
||||
Schedule with a systemd timer every 5 minutes. The Compose volumes mount `/opt/stellaops/mirror-data/*`
|
||||
into the containers read-only, matching `CONCELIER__MIRROR__EXPORTROOT=/exports/json` and
|
||||
`EXCITITOR__ARTIFACTS__FILESYSTEM__ROOT=/exports`.
|
||||
|
||||
### 4.2 Kubernetes (CronJob)
|
||||
|
||||
Create a CronJob running the AWS CLI (or rclone) in the same namespace, writing into the PVCs:
|
||||
|
||||
```yaml
|
||||
apiVersion: batch/v1
|
||||
kind: CronJob
|
||||
metadata:
|
||||
name: mirror-sync
|
||||
spec:
|
||||
schedule: "*/5 * * * *"
|
||||
jobTemplate:
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
containers:
|
||||
- name: sync
|
||||
image: public.ecr.aws/aws-cli/aws-cli@sha256:5df5f52c29f5e3ba46d0ad9e0e3afc98701c4a0f879400b4c5f80d943b5fadea
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- >
|
||||
aws s3 sync s3://mirror-stellaops/concelier/latest /exports/concelier --delete --size-only &&
|
||||
aws s3 sync s3://mirror-stellaops/excititor/latest /exports/excititor --delete --size-only
|
||||
volumeMounts:
|
||||
- name: concelier-exports
|
||||
mountPath: /exports/concelier
|
||||
- name: excititor-exports
|
||||
mountPath: /exports/excititor
|
||||
envFrom:
|
||||
- secretRef:
|
||||
name: mirror-sync-aws
|
||||
restartPolicy: OnFailure
|
||||
volumes:
|
||||
- name: concelier-exports
|
||||
persistentVolumeClaim:
|
||||
claimName: concelier-mirror-exports
|
||||
- name: excititor-exports
|
||||
persistentVolumeClaim:
|
||||
claimName: excititor-mirror-exports
|
||||
```
|
||||
|
||||
## 5. CDN integration
|
||||
|
||||
1. Point the CDN origin at the mirror gateway (Compose host or Kubernetes LoadBalancer).
|
||||
2. Honour the response headers emitted by the gateway and Concelier/Excititor:
|
||||
`Cache-Control: public, max-age=300, immutable` for mirror payloads.
|
||||
3. Configure origin shields in the CDN to prevent cache stampedes. Recommended TTLs:
|
||||
- Index (`/concelier/exports/index.json`, `/excititor/mirror/*/index`) → 60 s.
|
||||
- Bundle/manifest payloads → 300 s.
|
||||
4. Forward the `Authorization` header—Basic Auth terminates at the gateway.
|
||||
5. Enforce per-domain rate limits at the CDN (matching gateway budgets) and enable logging
|
||||
to SIEM for anomaly detection.
|
||||
|
||||
## 6. Smoke tests
|
||||
|
||||
After each deployment or sync cycle (temporarily set low budgets if you need to observe 429 responses):
|
||||
|
||||
```bash
|
||||
# Index with Basic Auth
|
||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/index.json | jq 'keys'
|
||||
|
||||
# Mirror manifest signature and cache headers
|
||||
curl -u $PRIMARY_CREDS -I https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/manifest.json \
|
||||
| tee /tmp/manifest-headers.txt
|
||||
grep -E '^Cache-Control: ' /tmp/manifest-headers.txt # expect public, max-age=300, immutable
|
||||
|
||||
# Excititor consensus bundle metadata
|
||||
curl -u $COMMUNITY_CREDS https://mirror-community.stella-ops.org/excititor/mirror/community/index \
|
||||
| jq '.exports[].exportKey'
|
||||
|
||||
# Signed bundle + detached JWS (spot check digests)
|
||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/bundle.json.jws \
|
||||
-o bundle.json.jws
|
||||
cosign verify-blob --signature bundle.json.jws --key mirror-key.pub bundle.json
|
||||
|
||||
# Service-level auth check (inside cluster – no gateway credentials)
|
||||
kubectl exec deploy/stellaops-concelier -- curl -si http://localhost:8443/concelier/exports/mirror/primary/manifest.json \
|
||||
| head -n 5 # expect HTTP/1.1 401 with WWW-Authenticate: Bearer
|
||||
|
||||
# Rate limit smoke (repeat quickly; second call should return 429 + Retry-After)
|
||||
for i in 1 2; do
|
||||
curl -s -o /dev/null -D - https://mirror-primary.stella-ops.org/concelier/exports/index.json \
|
||||
-u $PRIMARY_CREDS | grep -E '^(HTTP/|Retry-After:)'
|
||||
sleep 1
|
||||
done
|
||||
```
|
||||
|
||||
Watch the gateway metrics (`nginx_vts` or access logs) for cache hits. In Kubernetes, `kubectl logs deploy/stellaops-mirror-gateway`
|
||||
should show `X-Cache-Status: HIT/MISS`.
|
||||
|
||||
## 7. Maintenance & rotation
|
||||
|
||||
- **Bundle freshness** – alert if sync job lag exceeds 15 minutes or if `concelier` logs
|
||||
`Mirror export root is not configured`.
|
||||
- **Secret rotation** – change Authority client secrets and Basic Auth credentials quarterly.
|
||||
Update the mounted secrets and restart deployments (`docker compose restart concelier` or
|
||||
`kubectl rollout restart deploy/stellaops-concelier`).
|
||||
- **TLS renewal** – reissue certificates, place new files, and reload gateway (`docker compose exec mirror-gateway nginx -s reload`).
|
||||
- **Quota tuning** – adjust per-domain `MAXDOWNLOADREQUESTSPERHOUR` in `.env` or values file.
|
||||
Align CDN rate limits and inform downstreams.
|
||||
|
||||
## 8. References
|
||||
|
||||
- Deployment profiles: `deploy/compose/docker-compose.mirror.yaml`,
|
||||
`deploy/helm/stellaops/values-mirror.yaml`
|
||||
- Mirror architecture dossiers: `docs/modules/concelier/architecture.md`,
|
||||
`docs/modules/excititor/mirrors.md`
|
||||
- Export bundling: `docs/modules/devops/architecture.md` §3, `docs/modules/excititor/architecture.md` §7
|
||||
# Concelier & Excititor Mirror Operations
|
||||
|
||||
This runbook describes how Stella Ops operates the managed mirrors under `*.stella-ops.org`.
|
||||
It covers Docker Compose and Helm deployment overlays, secret handling for multi-tenant
|
||||
authn, CDN fronting, and the recurring sync pipeline that keeps mirror bundles current.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- **Authority access** – client credentials (`client_id` + secret) authorised for
|
||||
`concelier.mirror.read` and `excititor.mirror.read` scopes. Secrets live outside git.
|
||||
- **Signed TLS certificates** – wildcard or per-domain (`mirror-primary`, `mirror-community`).
|
||||
Store them under `deploy/compose/mirror-gateway/tls/` or in Kubernetes secrets.
|
||||
- **Mirror gateway credentials** – Basic Auth htpasswd files per domain. Generate with
|
||||
`htpasswd -B`. Operators distribute credentials to downstream consumers.
|
||||
- **Export artifact source** – read access to the canonical S3 buckets (or rsync share)
|
||||
that hold `concelier` JSON bundles and `excititor` VEX exports.
|
||||
- **Persistent volumes** – storage for Concelier job metadata and mirror export trees.
|
||||
For Helm, provision PVCs (`concelier-mirror-jobs`, `concelier-mirror-exports`,
|
||||
`excititor-mirror-exports`, `mirror-mongo-data`, `mirror-minio-data`) before rollout.
|
||||
|
||||
### 1.1 Service configuration quick reference
|
||||
|
||||
Concelier.WebService exposes the mirror HTTP endpoints once `CONCELIER__MIRROR__ENABLED=true`.
|
||||
Key knobs:
|
||||
|
||||
- `CONCELIER__MIRROR__EXPORTROOT` – root folder containing export snapshots (`<exportId>/mirror/*`).
|
||||
- `CONCELIER__MIRROR__ACTIVEEXPORTID` – optional explicit export id; otherwise the service auto-falls back to the `latest/` symlink or newest directory.
|
||||
- `CONCELIER__MIRROR__REQUIREAUTHENTICATION` – default auth requirement; override per domain with `CONCELIER__MIRROR__DOMAINS__{n}__REQUIREAUTHENTICATION`.
|
||||
- `CONCELIER__MIRROR__MAXINDEXREQUESTSPERHOUR` – budget for `/concelier/exports/index.json`. Domains inherit this value unless they define `__MAXDOWNLOADREQUESTSPERHOUR`.
|
||||
- `CONCELIER__MIRROR__DOMAINS__{n}__ID` – domain identifier matching the exporter manifest; additional keys configure display name and rate budgets.
|
||||
|
||||
> The service honours Stella Ops Authority when `CONCELIER__AUTHORITY__ENABLED=true` and `ALLOWANONYMOUSFALLBACK=false`. Use the bypass CIDR list (`CONCELIER__AUTHORITY__BYPASSNETWORKS__*`) for in-cluster ingress gateways that terminate Basic Auth. Unauthorized requests emit `WWW-Authenticate: Bearer` so downstream automation can detect token failures.
|
||||
|
||||
Mirror responses carry deterministic cache headers: `/index.json` returns `Cache-Control: public, max-age=60`, while per-domain manifests/bundles include `Cache-Control: public, max-age=300, immutable`. Rate limiting surfaces `Retry-After` when quotas are exceeded.
|
||||
|
||||
### 1.2 Mirror connector configuration
|
||||
|
||||
Downstream Concelier instances ingest published bundles using the `StellaOpsMirrorConnector`. Operators running the connector in air‑gapped or limited connectivity environments can tune the following options (environment prefix `CONCELIER__SOURCES__STELLAOPSMIRROR__`):
|
||||
|
||||
- `BASEADDRESS` – absolute mirror root (e.g., `https://mirror-primary.stella-ops.org`).
|
||||
- `INDEXPATH` – relative path to the mirror index (`/concelier/exports/index.json` by default).
|
||||
- `DOMAINID` – mirror domain identifier from the index (`primary`, `community`, etc.).
|
||||
- `HTTPTIMEOUT` – request timeout; raise when mirrors sit behind slow WAN links.
|
||||
- `SIGNATURE__ENABLED` – require detached JWS verification for `bundle.json`.
|
||||
- `SIGNATURE__KEYID` / `SIGNATURE__PROVIDER` – expected signing key metadata.
|
||||
- `SIGNATURE__PUBLICKEYPATH` – PEM fallback used when the mirror key registry is offline.
|
||||
|
||||
The connector keeps a per-export fingerprint (bundle digest + generated-at timestamp) and tracks outstanding document IDs. If a scan is interrupted, the next run resumes parse/map work using the stored fingerprint and pending document lists—no network requests are reissued unless the upstream digest changes.
|
||||
|
||||
## 2. Secret & certificate layout
|
||||
|
||||
### Docker Compose (`deploy/compose/docker-compose.mirror.yaml`)
|
||||
|
||||
- `deploy/compose/env/mirror.env.example` – copy to `.env` and adjust quotas or domain IDs.
|
||||
- `deploy/compose/mirror-secrets/` – mount read-only into `/run/secrets`. Place:
|
||||
- `concelier-authority-client` – Authority client secret.
|
||||
- `excititor-authority-client` (optional) – reserve for future authn.
|
||||
- `deploy/compose/mirror-gateway/tls/` – PEM-encoded cert/key pairs:
|
||||
- `mirror-primary.crt`, `mirror-primary.key`
|
||||
- `mirror-community.crt`, `mirror-community.key`
|
||||
- `deploy/compose/mirror-gateway/secrets/` – htpasswd files:
|
||||
- `mirror-primary.htpasswd`
|
||||
- `mirror-community.htpasswd`
|
||||
|
||||
### Helm (`deploy/helm/stellaops/values-mirror.yaml`)
|
||||
|
||||
Create secrets in the target namespace:
|
||||
|
||||
```bash
|
||||
kubectl create secret generic concelier-mirror-auth \
|
||||
--from-file=concelier-authority-client=concelier-authority-client
|
||||
|
||||
kubectl create secret generic excititor-mirror-auth \
|
||||
--from-file=excititor-authority-client=excititor-authority-client
|
||||
|
||||
kubectl create secret tls mirror-gateway-tls \
|
||||
--cert=mirror-primary.crt --key=mirror-primary.key
|
||||
|
||||
kubectl create secret generic mirror-gateway-htpasswd \
|
||||
--from-file=mirror-primary.htpasswd --from-file=mirror-community.htpasswd
|
||||
```
|
||||
|
||||
> Keep Basic Auth lists short-lived (rotate quarterly) and document credential recipients.
|
||||
|
||||
## 3. Deployment
|
||||
|
||||
### 3.1 Docker Compose (edge mirrors, lab validation)
|
||||
|
||||
1. `cp deploy/compose/env/mirror.env.example deploy/compose/env/mirror.env`
|
||||
2. Populate secrets/tls directories as described above.
|
||||
3. Sync mirror bundles (see §4) into `deploy/compose/mirror-data/…` and ensure they are mounted
|
||||
on the host path backing the `concelier-exports` and `excititor-exports` volumes.
|
||||
4. Run the profile validator: `deploy/tools/validate-profiles.sh`.
|
||||
5. Launch: `docker compose --env-file env/mirror.env -f docker-compose.mirror.yaml up -d`.
|
||||
|
||||
### 3.2 Helm (production mirrors)
|
||||
|
||||
1. Provision PVCs sized for mirror bundles (baseline: 20 GiB per domain).
|
||||
2. Create secrets/tls config maps (§2).
|
||||
3. `helm upgrade --install mirror deploy/helm/stellaops -f deploy/helm/stellaops/values-mirror.yaml`.
|
||||
4. Annotate the `stellaops-mirror-gateway` service with ingress/LoadBalancer metadata required by
|
||||
your CDN (e.g., AWS load balancer scheme internal + NLB idle timeout).
|
||||
|
||||
## 4. Artifact sync workflow
|
||||
|
||||
Mirrors never generate exports—they ingest signed bundles produced by the Concelier and Excititor
|
||||
export jobs. Recommended sync pattern:
|
||||
|
||||
### 4.1 Compose host (systemd timer)
|
||||
|
||||
`/usr/local/bin/mirror-sync.sh`:
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
export AWS_ACCESS_KEY_ID=…
|
||||
export AWS_SECRET_ACCESS_KEY=…
|
||||
|
||||
aws s3 sync s3://mirror-stellaops/concelier/latest \
|
||||
/opt/stellaops/mirror-data/concelier --delete --size-only
|
||||
|
||||
aws s3 sync s3://mirror-stellaops/excititor/latest \
|
||||
/opt/stellaops/mirror-data/excititor --delete --size-only
|
||||
```
|
||||
|
||||
Schedule with a systemd timer every 5 minutes. The Compose volumes mount `/opt/stellaops/mirror-data/*`
|
||||
into the containers read-only, matching `CONCELIER__MIRROR__EXPORTROOT=/exports/json` and
|
||||
`EXCITITOR__ARTIFACTS__FILESYSTEM__ROOT=/exports`.
|
||||
|
||||
### 4.2 Kubernetes (CronJob)
|
||||
|
||||
Create a CronJob running the AWS CLI (or rclone) in the same namespace, writing into the PVCs:
|
||||
|
||||
```yaml
|
||||
apiVersion: batch/v1
|
||||
kind: CronJob
|
||||
metadata:
|
||||
name: mirror-sync
|
||||
spec:
|
||||
schedule: "*/5 * * * *"
|
||||
jobTemplate:
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
containers:
|
||||
- name: sync
|
||||
image: public.ecr.aws/aws-cli/aws-cli@sha256:5df5f52c29f5e3ba46d0ad9e0e3afc98701c4a0f879400b4c5f80d943b5fadea
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- >
|
||||
aws s3 sync s3://mirror-stellaops/concelier/latest /exports/concelier --delete --size-only &&
|
||||
aws s3 sync s3://mirror-stellaops/excititor/latest /exports/excititor --delete --size-only
|
||||
volumeMounts:
|
||||
- name: concelier-exports
|
||||
mountPath: /exports/concelier
|
||||
- name: excititor-exports
|
||||
mountPath: /exports/excititor
|
||||
envFrom:
|
||||
- secretRef:
|
||||
name: mirror-sync-aws
|
||||
restartPolicy: OnFailure
|
||||
volumes:
|
||||
- name: concelier-exports
|
||||
persistentVolumeClaim:
|
||||
claimName: concelier-mirror-exports
|
||||
- name: excititor-exports
|
||||
persistentVolumeClaim:
|
||||
claimName: excititor-mirror-exports
|
||||
```
|
||||
|
||||
## 5. CDN integration
|
||||
|
||||
1. Point the CDN origin at the mirror gateway (Compose host or Kubernetes LoadBalancer).
|
||||
2. Honour the response headers emitted by the gateway and Concelier/Excititor:
|
||||
`Cache-Control: public, max-age=300, immutable` for mirror payloads.
|
||||
3. Configure origin shields in the CDN to prevent cache stampedes. Recommended TTLs:
|
||||
- Index (`/concelier/exports/index.json`, `/excititor/mirror/*/index`) → 60 s.
|
||||
- Bundle/manifest payloads → 300 s.
|
||||
4. Forward the `Authorization` header—Basic Auth terminates at the gateway.
|
||||
5. Enforce per-domain rate limits at the CDN (matching gateway budgets) and enable logging
|
||||
to SIEM for anomaly detection.
|
||||
|
||||
## 6. Smoke tests
|
||||
|
||||
After each deployment or sync cycle (temporarily set low budgets if you need to observe 429 responses):
|
||||
|
||||
```bash
|
||||
# Index with Basic Auth
|
||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/index.json | jq 'keys'
|
||||
|
||||
# Mirror manifest signature and cache headers
|
||||
curl -u $PRIMARY_CREDS -I https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/manifest.json \
|
||||
| tee /tmp/manifest-headers.txt
|
||||
grep -E '^Cache-Control: ' /tmp/manifest-headers.txt # expect public, max-age=300, immutable
|
||||
|
||||
# Excititor consensus bundle metadata
|
||||
curl -u $COMMUNITY_CREDS https://mirror-community.stella-ops.org/excititor/mirror/community/index \
|
||||
| jq '.exports[].exportKey'
|
||||
|
||||
# Signed bundle + detached JWS (spot check digests)
|
||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/bundle.json.jws \
|
||||
-o bundle.json.jws
|
||||
cosign verify-blob --signature bundle.json.jws --key mirror-key.pub bundle.json
|
||||
|
||||
# Service-level auth check (inside cluster – no gateway credentials)
|
||||
kubectl exec deploy/stellaops-concelier -- curl -si http://localhost:8443/concelier/exports/mirror/primary/manifest.json \
|
||||
| head -n 5 # expect HTTP/1.1 401 with WWW-Authenticate: Bearer
|
||||
|
||||
# Rate limit smoke (repeat quickly; second call should return 429 + Retry-After)
|
||||
for i in 1 2; do
|
||||
curl -s -o /dev/null -D - https://mirror-primary.stella-ops.org/concelier/exports/index.json \
|
||||
-u $PRIMARY_CREDS | grep -E '^(HTTP/|Retry-After:)'
|
||||
sleep 1
|
||||
done
|
||||
```
|
||||
|
||||
Watch the gateway metrics (`nginx_vts` or access logs) for cache hits. In Kubernetes, `kubectl logs deploy/stellaops-mirror-gateway`
|
||||
should show `X-Cache-Status: HIT/MISS`.
|
||||
|
||||
## 7. Maintenance & rotation
|
||||
|
||||
- **Bundle freshness** – alert if sync job lag exceeds 15 minutes or if `concelier` logs
|
||||
`Mirror export root is not configured`.
|
||||
- **Secret rotation** – change Authority client secrets and Basic Auth credentials quarterly.
|
||||
Update the mounted secrets and restart deployments (`docker compose restart concelier` or
|
||||
`kubectl rollout restart deploy/stellaops-concelier`).
|
||||
- **TLS renewal** – reissue certificates, place new files, and reload gateway (`docker compose exec mirror-gateway nginx -s reload`).
|
||||
- **Quota tuning** – adjust per-domain `MAXDOWNLOADREQUESTSPERHOUR` in `.env` or values file.
|
||||
Align CDN rate limits and inform downstreams.
|
||||
|
||||
## 8. References
|
||||
|
||||
- Deployment profiles: `deploy/compose/docker-compose.mirror.yaml`,
|
||||
`deploy/helm/stellaops/values-mirror.yaml`
|
||||
- Mirror architecture dossiers: `docs/modules/concelier/architecture.md`,
|
||||
`docs/modules/excititor/mirrors.md`
|
||||
- Export bundling: `docs/modules/devops/architecture.md` §3, `docs/modules/excititor/architecture.md` §7
|
||||
|
||||
Reference in New Issue
Block a user