Concelier GHSA Connector Operations Runbook

Last updated: 2025-10-16

1. Overview

The GitHub Security Advisories (GHSA) connector pulls advisory metadata from the GitHub REST API /security/advisories endpoint. GitHub enforces both primary and secondary rate limits, so operators must monitor usage and configure retries to avoid throttling incidents.

2. Rate-limit telemetry

The connector now surfaces rate-limit headers on every fetch and exposes the following metrics via OpenTelemetry:

| Metric | Description | Tags |
| --- | --- | --- |
| ghsa.ratelimit.limit (histogram) | Samples the reported request quota at fetch time. | phase = list or detail, resource (e.g., core). |
| ghsa.ratelimit.remaining (histogram) | Remaining requests returned by X-RateLimit-Remaining. | phase, resource. |
| ghsa.ratelimit.reset_seconds (histogram) | Seconds until X-RateLimit-Reset. | phase, resource. |
| ghsa.ratelimit.headroom_pct (histogram) | Percentage of the quota still available (remaining / limit * 100). | phase, resource. |
| ghsa.ratelimit.headroom_pct_current (observable gauge) | Latest headroom percentage reported per resource. | phase, resource. |
| ghsa.ratelimit.exhausted (counter) | Incremented whenever GitHub returns a zero remaining quota and the connector delays before retrying. | phase. |
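
To make the headroom formula concrete, here is a minimal Python sketch of how a headroom percentage could be derived from the rate-limit headers. The function name `headroom_pct` and the plain-dict header input are assumptions for illustration; this is not the connector's actual code.

```python
from typing import Mapping, Optional


def headroom_pct(headers: Mapping[str, str]) -> Optional[float]:
    """Return remaining / limit * 100, or None when the headers are unusable."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        limit = int(lowered["x-ratelimit-limit"])
        remaining = int(lowered["x-ratelimit-remaining"])
    except (KeyError, ValueError):
        return None
    if limit <= 0:
        return None  # avoid dividing by zero on a malformed quota
    return remaining / limit * 100.0


if __name__ == "__main__":
    sample = {"X-RateLimit-Limit": "5000", "X-RateLimit-Remaining": "450"}
    print(headroom_pct(sample))
```

With a limit of 5000 and 450 calls remaining, the headroom is roughly 9%, which would sit below the 10% paging level suggested in this runbook.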

Dashboards & alerts

  • Plot ghsa.ratelimit.remaining as the latest value to watch the runway. Alert when the value stays below RateLimitWarningThreshold (default 500) for more than 5 minutes.
  • Use ghsa.ratelimit.headroom_pct_current to visualise the remaining quota as a percentage. Paging once it sits below 10% for longer than a single reset window helps avoid secondary limits.
  • Raise a separate alert on increase(ghsa.ratelimit.exhausted[15m]) > 0 to catch hard throttles.
  • Overlay ghsa.fetch.attempts vs ghsa.fetch.failures to confirm retries are effective.
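
The "stays below the threshold for more than 5 minutes" condition from the first bullet can be prototyped offline before it is wired into an alerting backend. The following Python sketch is illustrative only: `sustained_below` is a hypothetical helper, and the sample list stands in for values of ghsa.ratelimit.remaining scraped over time.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def sustained_below(
    samples: Iterable[Tuple[datetime, int]],
    threshold: int = 500,
    hold: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if, at the last sample, the value has been below
    `threshold` continuously for longer than `hold`.

    `samples` must be ordered by timestamp.
    """
    breach_start: Optional[datetime] = None
    firing = False
    for ts, remaining in samples:
        if remaining < threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            firing = ts - breach_start > hold
        else:
            breach_start = None  # recovery resets the timer
            firing = False
    return firing


if __name__ == "__main__":
    t0 = datetime(2025, 10, 16, 12, 0)
    series = [(t0 + timedelta(minutes=2 * i), 400) for i in range(4)]
    print(sustained_below(series))
```

A brief dip that recovers within the hold period does not fire, which mirrors the intent of alerting only on sustained pressure.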

3. Logging signals

When X-RateLimit-Remaining falls below RateLimitWarningThreshold, the connector emits:

GHSA rate limit warning: remaining {Remaining}/{Limit} for {Phase} {Resource} (headroom {Headroom}%)

When GitHub reports zero remaining calls, the connector logs and sleeps for the reported Retry-After/X-RateLimit-Reset interval (falling back to SecondaryRateLimitBackoff).

After the quota recovers above the warning threshold, the connector writes an informational log with the refreshed remaining and headroom values, letting operators clear alerts quickly.
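
The delay selection described above can be sketched as follows. This is an assumption-laden illustration, not the connector's implementation: the helper name `pick_backoff` is hypothetical, and the precedence (Retry-After first, then X-RateLimit-Reset, then the configured fallback) is inferred from the wording of this section. X-RateLimit-Reset is treated as a UTC epoch timestamp in seconds, which is how GitHub documents it.

```python
from datetime import timedelta
from typing import Mapping


def pick_backoff(
    headers: Mapping[str, str],
    now_epoch: float,
    fallback: timedelta = timedelta(minutes=2),  # mirrors secondaryRateLimitBackoff
) -> timedelta:
    """Choose how long to sleep after GitHub reports an exhausted quota."""
    lowered = {k.lower(): v for k, v in headers.items()}

    # 1. Retry-After: a relative delay in seconds (assumed to take precedence).
    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        try:
            return timedelta(seconds=max(0, int(retry_after)))
        except ValueError:
            pass  # unparsable value: fall through to the next signal

    # 2. X-RateLimit-Reset: absolute epoch seconds when the quota refills.
    reset = lowered.get("x-ratelimit-reset")
    if reset is not None:
        try:
            return timedelta(seconds=max(0.0, int(reset) - now_epoch))
        except ValueError:
            pass

    # 3. Neither header is usable: use the configured fallback.
    return fallback


if __name__ == "__main__":
    print(pick_backoff({"Retry-After": "30"}, now_epoch=1_000))
    print(pick_backoff({"X-RateLimit-Reset": "1090"}, now_epoch=1_000))
    print(pick_backoff({}, now_epoch=1_000))
```

Clamping at zero guards against clock skew producing a negative sleep when the reset time is already in the past.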

4. Configuration knobs (concelier.yaml)

concelier:
  sources:
    ghsa:
      apiToken: "${GITHUB_PAT}"
      pageSize: 50
      requestDelay: "00:00:00.200"
      failureBackoff: "00:05:00"
      rateLimitWarningThreshold: 500    # warn below this many remaining calls
      secondaryRateLimitBackoff: "00:02:00"  # fallback delay when GitHub omits Retry-After

Recommendations

  • Increase requestDelay in air-gapped or burst-heavy deployments to smooth token consumption.
  • Lower rateLimitWarningThreshold only if your dashboards already page on the new histogram; never set it negative.
  • For bots using a low-privilege PAT, keep secondaryRateLimitBackoff at ≥60 seconds to respect GitHub's secondary-limit guidance.
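
As a pre-deployment sanity check, the recommendations above could be encoded in a small script. The sketch below is hypothetical: `parse_timespan` and `check_ghsa_settings` are illustrative names, the input is assumed to be the already-parsed `ghsa` mapping from concelier.yaml, and only the plain hh:mm:ss[.fff] duration form shown in the sample configuration is handled.

```python
from datetime import timedelta
from typing import List, Mapping


def parse_timespan(text: str) -> timedelta:
    """Parse 'hh:mm:ss' or 'hh:mm:ss.fff' into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def check_ghsa_settings(cfg: Mapping[str, object]) -> List[str]:
    """Return human-readable problems; an empty list means no issue found."""
    problems: List[str] = []

    # "never set it negative"
    if int(cfg.get("rateLimitWarningThreshold", 500)) < 0:
        problems.append("rateLimitWarningThreshold must not be negative")

    # "keep secondaryRateLimitBackoff at >= 60 seconds"
    backoff = parse_timespan(str(cfg.get("secondaryRateLimitBackoff", "00:02:00")))
    if backoff < timedelta(seconds=60):
        problems.append("secondaryRateLimitBackoff should be at least 60 seconds")

    return problems


if __name__ == "__main__":
    sample = {
        "requestDelay": "00:00:00.200",
        "rateLimitWarningThreshold": 500,
        "secondaryRateLimitBackoff": "00:02:00",
    }
    print(check_ghsa_settings(sample))
```

Running a check like this in CI catches a mistyped duration or a negative threshold before the connector starts.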

Default job schedule

| Job kind | Cron | Timeout | Lease |
| --- | --- | --- | --- |
| source:ghsa:fetch | 1,11,21,31,41,51 * * * * | 6 minutes | 4 minutes |
| source:ghsa:parse | 3,13,23,33,43,53 * * * * | 5 minutes | 4 minutes |
| source:ghsa:map | 5,15,25,35,45,55 * * * * | 5 minutes | 4 minutes |

These defaults spread GHSA stages across the hour so fetch completes before parse/map fire. Override them via concelier.jobs.definitions[...] when coordinating multiple connectors on the same runner.
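
When overriding the schedule, it can help to confirm the stages still fire in order with a consistent offset. The Python sketch below expands the minute field of the default cron expressions and prints the gap between consecutive stages. It is illustrative only: `minutes_of` is a hypothetical helper that handles plain comma-separated minute lists, not ranges or step values.

```python
from typing import Dict, List


def minutes_of(cron: str) -> List[int]:
    """Expand the comma-separated minute field of a five-field cron expression."""
    minute_field = cron.split()[0]
    return sorted(int(m) for m in minute_field.split(","))


def stage_gaps(earlier: str, later: str) -> List[int]:
    """Minutes between each firing of `earlier` and the matching firing of `later`."""
    return [b - a for a, b in zip(minutes_of(earlier), minutes_of(later))]


SCHEDULE: Dict[str, str] = {
    "source:ghsa:fetch": "1,11,21,31,41,51 * * * *",
    "source:ghsa:parse": "3,13,23,33,43,53 * * * *",
    "source:ghsa:map": "5,15,25,35,45,55 * * * *",
}

if __name__ == "__main__":
    print(stage_gaps(SCHEDULE["source:ghsa:fetch"], SCHEDULE["source:ghsa:parse"]))
    print(stage_gaps(SCHEDULE["source:ghsa:parse"], SCHEDULE["source:ghsa:map"]))
```

With the defaults, each stage starts two minutes after the previous one in every ten-minute cycle, so a custom schedule should preserve at least that ordering.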

5. Provisioning credentials

Concelier requires a GitHub personal access token (classic) with the read:org and security_events scopes to pull GHSA data. Store it as a secret and reference it via concelier.sources.ghsa.apiToken.

Docker Compose (stack operators)

services:
  concelier:
    environment:
      CONCELIER__SOURCES__GHSA__APITOKEN: /run/secrets/ghsa_pat
    secrets:
      - ghsa_pat

secrets:
  ghsa_pat:
    file: ./secrets/ghsa_pat.txt  # contains only the PAT value

Helm values (cluster operators)

concelier:
  extraEnv:
    - name: CONCELIER__SOURCES__GHSA__APITOKEN
      valueFrom:
        secretKeyRef:
          name: concelier-ghsa
          key: apiToken

extraSecrets:
  concelier-ghsa:
    apiToken: "<paste PAT here or source from external secret store>"

After rotating the PAT, restart the Concelier workers (or run kubectl rollout restart deployment/concelier) to ensure the configuration reloads.

When enabling GHSA the first time, run a staged backfill:

  1. Trigger source:ghsa:fetch manually (CLI or API) outside of peak hours.
  2. Watch concelier.jobs.health for the GHSA jobs until they report healthy.
  3. Allow the scheduled cron cadence to resume once the initial backlog drains (typically < 30 minutes).

6. Runbook steps when throttled

  1. Check ghsa.ratelimit.exhausted for the affected phase (list vs detail).
  2. Confirm the connector is delaying: the logs will show GHSA rate limit exhausted... with the chosen backoff.
  3. If rate limits stay exhausted:
    • Verify no other jobs are sharing the PAT.
    • Temporarily reduce MaxPagesPerFetch or PageSize to shrink burst size.
    • Consider provisioning a dedicated PAT (GHSA permissions only) for Concelier.
  4. After the quota resets, restore rateLimitWarningThreshold/requestDelay to their normal values and monitor the histograms for at least one hour.
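
To check whether something else is consuming the same PAT (step 3), one option is to query GitHub's `GET /rate_limit` REST endpoint directly and watch the `used` count while Concelier is backing off. The sketch below is a standalone diagnostic under stated assumptions, not part of Concelier: `fetch_rate_limit` and `summarize` are hypothetical helper names, and the payload shape follows GitHub's documented `resources.<name>` structure.

```python
import json
import urllib.request
from typing import Dict


def fetch_rate_limit(token: str) -> dict:
    """Call GitHub's rate-limit status endpoint with the given PAT."""
    req = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize(payload: dict, resource: str = "core") -> Dict[str, int]:
    """Extract the quota figures for one resource bucket."""
    bucket = payload["resources"][resource]
    return {
        "limit": bucket["limit"],
        "used": bucket["used"],
        "remaining": bucket["remaining"],
        "reset_epoch": bucket["reset"],
    }


if __name__ == "__main__":
    import os

    # GITHUB_PAT is the same variable name used in the sample concelier.yaml.
    print(summarize(fetch_rate_limit(os.environ["GITHUB_PAT"])))
```

If `used` keeps climbing between two calls made while the connector is sleeping, another consumer is likely sharing the token.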

7. Alert integration quick reference

  • Prometheus: query ghsa_ratelimit_remaining_bucket (from the histogram) with histogram_quantile(0.99, ...) to trend capacity.
  • VictoriaMetrics: last_over_time(ghsa_ratelimit_remaining_sum[5m]) for simple last-value graphs.
  • Grafana: stack remaining + used to visualise total limit per resource.

8. Canonical metric fallback analytics

When GitHub omits CVSS vectors/scores, the connector now assigns a deterministic canonical metric id in the form ghsa:severity/<level> and publishes it to Merge so severity precedence still resolves against GHSA even without CVSS data.

  • Metric: ghsa.map.canonical_metric_fallbacks (counter) with tags severity, canonical_metric_id, reason=no_cvss.
  • Monitor the counter alongside Merge parity checks; a sudden spike suggests GitHub is shipping advisories without vectors and warrants cross-checking downstream exporters.
  • Because the canonical id feeds Merge, parity dashboards should overlay this metric to confirm fallback advisories continue to merge ahead of downstream sources when GHSA supplies more recent data.
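
For reference, the fallback id format and counter tags described above can be sketched as below. This is a hypothetical illustration: the function names are invented, and lower-casing the severity level is an assumption, since the source only specifies the ghsa:severity/<level> shape.

```python
from typing import Dict


def canonical_metric_id(severity: str) -> str:
    """Build a deterministic fallback id of the form ghsa:severity/<level>."""
    level = severity.strip().lower()  # assumption: levels are normalised to lower case
    if not level:
        raise ValueError("severity must not be empty")
    return f"ghsa:severity/{level}"


def fallback_tags(severity: str) -> Dict[str, str]:
    """Tags for the ghsa.map.canonical_metric_fallbacks counter."""
    return {
        "severity": severity.strip().lower(),
        "canonical_metric_id": canonical_metric_id(severity),
        "reason": "no_cvss",
    }


if __name__ == "__main__":
    print(fallback_tags("High"))
```

Because the id depends only on the severity level, the same advisory always maps to the same fallback id, which keeps Merge precedence stable across re-runs.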