Rename Feedser to Concelier
docs/ops/concelier-ghsa-operations.md
							| @@ -0,0 +1,123 @@ | ||||
| # Concelier GHSA Connector – Operations Runbook | ||||
|  | ||||
| _Last updated: 2025-10-16_ | ||||
|  | ||||
| ## 1. Overview | ||||
| The GitHub Security Advisories (GHSA) connector pulls advisory metadata from the GitHub REST API `/security/advisories` endpoint. GitHub enforces both primary and secondary rate limits, so operators must monitor usage and configure retries to avoid throttling incidents. | ||||
|  | ||||
| ## 2. Rate-limit telemetry | ||||
| The connector now surfaces rate-limit headers on every fetch and exposes the following metrics via OpenTelemetry: | ||||
|  | ||||
| | Metric | Description | Tags | | ||||
| |--------|-------------|------| | ||||
| | `ghsa.ratelimit.limit` (histogram) | Samples the reported request quota at fetch time. | `phase` = `list` or `detail`, `resource` (e.g., `core`). | | ||||
| | `ghsa.ratelimit.remaining` (histogram) | Remaining requests returned by `X-RateLimit-Remaining`. | `phase`, `resource`. | | ||||
| | `ghsa.ratelimit.reset_seconds` (histogram) | Seconds until `X-RateLimit-Reset`. | `phase`, `resource`. | | ||||
| | `ghsa.ratelimit.headroom_pct` (histogram) | Percentage of the quota still available (`remaining / limit * 100`). | `phase`, `resource`. | | ||||
| | `ghsa.ratelimit.headroom_pct_current` (observable gauge) | Latest headroom percentage reported per resource. | `phase`, `resource`. | | ||||
| | `ghsa.ratelimit.exhausted` (counter) | Incremented whenever GitHub returns a zero remaining quota and the connector delays before retrying. | `phase`. | | ||||
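
The headroom figure in the table is simple arithmetic over GitHub's rate-limit response headers. The following is a minimal sketch of that calculation, not the connector's actual code. The function name `headroom_pct` and the sample header values are hypothetical; `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Resource` are the header names GitHub documents.

```python
from typing import Mapping, Optional


def headroom_pct(headers: Mapping[str, str]) -> Optional[float]:
    """Return remaining / limit * 100, or None when the headers are unusable."""
    try:
        limit = int(headers["X-RateLimit-Limit"])
        remaining = int(headers["X-RateLimit-Remaining"])
    except (KeyError, ValueError):
        return None  # header missing or not an integer
    if limit <= 0:
        return None  # avoid dividing by zero on a malformed quota
    return 100.0 * remaining / limit


# Hypothetical response headers for the `core` resource.
sample = {
    "X-RateLimit-Limit": "5000",
    "X-RateLimit-Remaining": "1250",
    "X-RateLimit-Resource": "core",
}
print(headroom_pct(sample))  # 25.0
```

Returning `None` for unusable headers keeps a bad response from being recorded as 0 % headroom, which would otherwise look like an exhausted quota.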
|  | ||||
| ### Dashboards & alerts | ||||
| - Plot `ghsa.ratelimit.remaining` as the latest value to watch the runway. Alert when the value stays below **`RateLimitWarningThreshold`** (default `500`) for more than 5 minutes. | ||||
| - Use `ghsa.ratelimit.headroom_pct_current` to visualise the remaining quota as a percentage. Paging once it sits below **10 %** for longer than a single reset window helps avoid secondary limits. | ||||
| - Raise a separate alert on `increase(ghsa.ratelimit.exhausted[15m]) > 0` to catch hard throttles. | ||||
| - Overlay `ghsa.fetch.attempts` vs `ghsa.fetch.failures` to confirm retries are effective. | ||||
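
The "stays below the threshold for more than 5 minutes" condition in the first bullet can be checked against a series of samples as sketched below. This is an illustration of the alert logic only. The `sustained_below` helper and the sample data are hypothetical, and a real deployment would normally express this as a rule in the alerting system rather than in Python.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, remaining requests)


def sustained_below(samples: Iterable[Sample], threshold: float, hold: timedelta) -> bool:
    """True if the latest run of samples has stayed below threshold for longer than hold."""
    below_since: Optional[datetime] = None
    last_ts: Optional[datetime] = None
    for ts, value in sorted(samples):
        if value < threshold:
            if below_since is None:
                below_since = ts  # start of the current low run
        else:
            below_since = None  # any recovery resets the timer
        last_ts = ts
    if below_since is None or last_ts is None:
        return False
    return last_ts - below_since > hold


start = datetime(2025, 10, 16, 12, 0)
# Seven one-minute samples, all below the default threshold of 500.
low_run = [(start + timedelta(minutes=i), 400.0) for i in range(7)]
print(sustained_below(low_run, threshold=500, hold=timedelta(minutes=5)))  # True
```

Resetting the timer on any recovery mirrors the intent of the alert: a brief dip should not page anyone, while a sustained low runway should.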
|  | ||||
| ## 3. Logging signals | ||||
| When `X-RateLimit-Remaining` falls below `RateLimitWarningThreshold`, the connector emits: | ||||
| ``` | ||||
| GHSA rate limit warning: remaining {Remaining}/{Limit} for {Phase} {Resource} (headroom {Headroom}%) | ||||
| ``` | ||||
| When GitHub reports zero remaining calls, the connector logs and sleeps for the reported `Retry-After`/`X-RateLimit-Reset` interval (falling back to `SecondaryRateLimitBackoff`). | ||||
|  | ||||
| After the quota recovers above the warning threshold the connector writes an informational log with the refreshed remaining/headroom, letting operators clear alerts quickly. | ||||
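
The choice of sleep interval described above can be sketched as follows. The runbook does not state which header wins when both are present, so this sketch assumes `Retry-After` is checked first, then `X-RateLimit-Reset` (which GitHub reports as UTC epoch seconds), then the configured fallback. The `pick_backoff` name is hypothetical, and only the integer-seconds form of `Retry-After` is handled.

```python
from datetime import timedelta
from typing import Mapping


def pick_backoff(headers: Mapping[str, str], now_epoch: float, fallback: timedelta) -> timedelta:
    """Choose how long to wait after GitHub reports zero remaining calls."""
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        # Assumed precedence: an explicit Retry-After wins.
        return timedelta(seconds=int(retry_after))

    reset = headers.get("X-RateLimit-Reset", "")
    if reset.isdigit():
        wait = int(reset) - now_epoch
        if wait > 0:
            return timedelta(seconds=wait)

    # Neither header is usable: use secondaryRateLimitBackoff.
    return fallback


fallback = timedelta(minutes=2)  # matches the sample secondaryRateLimitBackoff
print(pick_backoff({"Retry-After": "30"}, now_epoch=1000.0, fallback=fallback))
print(pick_backoff({}, now_epoch=1000.0, fallback=fallback))
```

Passing the current time in as `now_epoch` rather than reading the clock inside the function keeps the logic easy to test.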
|  | ||||
| ## 4. Configuration knobs (`concelier.yaml`) | ||||
| ```yaml | ||||
| concelier: | ||||
|   sources: | ||||
|     ghsa: | ||||
|       apiToken: "${GITHUB_PAT}" | ||||
|       pageSize: 50 | ||||
|       requestDelay: "00:00:00.200" | ||||
|       failureBackoff: "00:05:00" | ||||
|       rateLimitWarningThreshold: 500    # warn below this many remaining calls | ||||
|       secondaryRateLimitBackoff: "00:02:00"  # fallback delay when GitHub omits Retry-After | ||||
| ``` | ||||
|  | ||||
| ### Recommendations | ||||
| - Increase `requestDelay` in air-gapped or burst-heavy deployments to smooth token consumption. | ||||
| - Lower `rateLimitWarningThreshold` only if your dashboards already page on the new histogram; never set it negative. | ||||
| - For bots using a low-privilege PAT, keep `secondaryRateLimitBackoff` at ≥60 seconds to respect GitHub’s secondary-limit guidance. | ||||
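
The recommendations above can be turned into a quick pre-deployment check. The sketch below assumes the duration strings follow an `hh:mm:ss[.fraction]` layout, as the sample values suggest. `parse_timespan` and `check_ghsa_settings` are hypothetical helpers, not part of Concelier. The 60-second floor is the advisory value from the last bullet and may not apply to every token type.

```python
import re
from datetime import timedelta
from typing import Dict, List

# Assumed layout: two-digit hours, minutes, seconds, optional fractional seconds.
_TIMESPAN = re.compile(r"^(\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,7}))?$")


def parse_timespan(text: str) -> timedelta:
    """Parse an hh:mm:ss[.fraction] string into a timedelta."""
    match = _TIMESPAN.match(text)
    if match is None:
        raise ValueError(f"not an hh:mm:ss[.fraction] value: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    # Pad the fraction to 7 digits, then keep microsecond precision.
    micros = int((fraction or "0").ljust(7, "0")[:6])
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), microseconds=micros)


def check_ghsa_settings(settings: Dict[str, object]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if int(settings["rateLimitWarningThreshold"]) < 0:
        problems.append("rateLimitWarningThreshold must not be negative")
    backoff = parse_timespan(str(settings["secondaryRateLimitBackoff"]))
    if backoff < timedelta(seconds=60):
        problems.append("secondaryRateLimitBackoff is below 60 seconds")
    return problems


defaults = {
    "requestDelay": "00:00:00.200",
    "rateLimitWarningThreshold": 500,
    "secondaryRateLimitBackoff": "00:02:00",
}
print(parse_timespan(defaults["requestDelay"]))  # 0:00:00.200000
print(check_ghsa_settings(defaults))             # []
```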
|  | ||||
| #### Default job schedule | ||||
|  | ||||
| | Job kind | Cron | Timeout | Lease | | ||||
| |----------|------|---------|-------| | ||||
| | `source:ghsa:fetch` | `1,11,21,31,41,51 * * * *` | 6 minutes | 4 minutes | | ||||
| | `source:ghsa:parse` | `3,13,23,33,43,53 * * * *` | 5 minutes | 4 minutes | | ||||
| | `source:ghsa:map` | `5,15,25,35,45,55 * * * *` | 5 minutes | 4 minutes | | ||||
|  | ||||
| These defaults spread GHSA stages across the hour so fetch completes before parse/map fire. Override them via `concelier.jobs.definitions[...]` when coordinating multiple connectors on the same runner. | ||||
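
When overriding the schedule, it helps to confirm that the stages stay staggered. The sketch below expands the minute fields from the table and reports the gap between consecutive stages. `minute_list` and `gaps` are hypothetical helpers that only understand comma-separated minute lists, not ranges or step syntax.

```python
from typing import List, Set


def minute_list(cron: str) -> List[int]:
    """Expand the minute field of a five-field cron expression given as a comma list."""
    return [int(part) for part in cron.split()[0].split(",")]


def gaps(earlier: str, later: str) -> Set[int]:
    """Minutes between each firing of `earlier` and the matching firing of `later`."""
    return {b - a for a, b in zip(minute_list(earlier), minute_list(later))}


schedule = {
    "source:ghsa:fetch": "1,11,21,31,41,51 * * * *",
    "source:ghsa:parse": "3,13,23,33,43,53 * * * *",
    "source:ghsa:map": "5,15,25,35,45,55 * * * *",
}

print(gaps(schedule["source:ghsa:fetch"], schedule["source:ghsa:parse"]))  # {2}
print(gaps(schedule["source:ghsa:parse"], schedule["source:ghsa:map"]))    # {2}
```

With the defaults each stage starts two minutes after the previous one. Since the fetch timeout in the table is 6 minutes, a fetch that runs longer than that two-minute gap would still be in flight when parse starts, which is worth keeping in mind when tuning the cadence.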
|  | ||||
| ## 5. Provisioning credentials | ||||
|  | ||||
| Concelier requires a GitHub personal access token (classic) with the **`read:org`** and **`security_events`** scopes to pull GHSA data. Store it as a secret and reference it via `concelier.sources.ghsa.apiToken`. | ||||
|  | ||||
| ### Docker Compose (stack operators) | ||||
| ```yaml | ||||
| services: | ||||
|   concelier: | ||||
|     environment: | ||||
|       CONCELIER__SOURCES__GHSA__APITOKEN: /run/secrets/ghsa_pat | ||||
|     secrets: | ||||
|       - ghsa_pat | ||||
|  | ||||
| secrets: | ||||
|   ghsa_pat: | ||||
|     file: ./secrets/ghsa_pat.txt  # contains only the PAT value | ||||
| ``` | ||||
|  | ||||
| ### Helm values (cluster operators) | ||||
| ```yaml | ||||
| concelier: | ||||
|   extraEnv: | ||||
|     - name: CONCELIER__SOURCES__GHSA__APITOKEN | ||||
|       valueFrom: | ||||
|         secretKeyRef: | ||||
|           name: concelier-ghsa | ||||
|           key: apiToken | ||||
|  | ||||
| extraSecrets: | ||||
|   concelier-ghsa: | ||||
|     apiToken: "<paste PAT here or source from external secret store>" | ||||
| ``` | ||||
|  | ||||
| After rotating the PAT, restart the Concelier workers (or run `kubectl rollout restart deployment/concelier`) to ensure the configuration reloads. | ||||
|  | ||||
| When enabling GHSA for the first time, run a staged backfill: | ||||
|  | ||||
| 1. Trigger `source:ghsa:fetch` manually (CLI or API) outside of peak hours. | ||||
| 2. Watch `concelier.jobs.health` for the GHSA jobs until they report `healthy`. | ||||
| 3. Allow the scheduled cron cadence to resume once the initial backlog drains (typically < 30 minutes). | ||||
|  | ||||
| ## 6. Runbook steps when throttled | ||||
| 1. Check `ghsa.ratelimit.exhausted` for the affected phase (`list` vs `detail`). | ||||
| 2. Confirm the connector is delaying: the logs will show `GHSA rate limit exhausted...` with the chosen backoff. | ||||
| 3. If rate limits stay exhausted: | ||||
|    - Verify no other jobs are sharing the PAT. | ||||
|    - Temporarily reduce `MaxPagesPerFetch` or `PageSize` to shrink burst size. | ||||
|    - Consider provisioning a dedicated PAT (GHSA permissions only) for Concelier. | ||||
| 4. After the quota resets, restore `rateLimitWarningThreshold`, `requestDelay`, and any other settings changed during the incident to their normal values, then monitor the histograms for at least one hour. | ||||
|  | ||||
| ## 7. Alert integration quick reference | ||||
| - Prometheus: `ghsa_ratelimit_remaining_bucket` (from histogram) – use `histogram_quantile(0.99, ...)` to trend capacity. | ||||
| - VictoriaMetrics: `last_over_time(ghsa_ratelimit_remaining_sum[5m])` for simple last-value graphs. | ||||
| - Grafana: stack remaining + used to visualise total limit per resource. | ||||
|  | ||||
| ## 8. Canonical metric fallback analytics | ||||
| When GitHub omits CVSS vectors/scores, the connector now assigns a deterministic canonical metric id in the form `ghsa:severity/<level>` and publishes it to Merge so severity precedence still resolves against GHSA even without CVSS data. | ||||
|  | ||||
| - Metric: `ghsa.map.canonical_metric_fallbacks` (counter) with tags `severity`, `canonical_metric_id`, `reason=no_cvss`. | ||||
| - Monitor the counter alongside Merge parity checks; a sudden spike suggests GitHub is shipping advisories without vectors and warrants cross-checking downstream exporters. | ||||
| - Because the canonical id feeds Merge, parity dashboards should overlay this metric to confirm fallback advisories continue to merge ahead of downstream sources when GHSA supplies more recent data. | ||||
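
The fallback id and its counter tags can be illustrated as follows. This is a sketch of the `ghsa:severity/<level>` shape described above, not the connector's mapping code. The function names are hypothetical, and lower-casing the severity level is an assumption rather than documented behaviour.

```python
from typing import Dict, Optional


def canonical_metric_id(severity: Optional[str], cvss_vector: Optional[str]) -> Optional[str]:
    """Return ghsa:severity/<level> when the advisory carries no CVSS vector."""
    if cvss_vector:
        return None  # CVSS data is present, so no fallback is needed
    if not severity or not severity.strip():
        return None  # nothing to derive an id from
    # Assumption: levels are normalised to lower case.
    return f"ghsa:severity/{severity.strip().lower()}"


def fallback_tags(severity: str) -> Dict[str, str]:
    """Tags for ghsa.map.canonical_metric_fallbacks, following the list above."""
    level = severity.strip().lower()
    return {
        "severity": level,
        "canonical_metric_id": f"ghsa:severity/{level}",
        "reason": "no_cvss",
    }


print(canonical_metric_id("High", None))  # ghsa:severity/high
print(fallback_tags("High"))
```

Because the id depends only on the severity level, the same advisory always maps to the same id, which is what makes the fallback deterministic.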