Feedser Authority Audit Runbook

Last updated: 2025-10-12

This runbook helps operators verify and monitor the StellaOps Feedser ⇆ Authority integration. It focuses on the /jobs* surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

1. Prerequisites

  • Authority integration is enabled in feedser.yaml (or via FEEDSER_AUTHORITY__* environment variables) with a valid clientId, secret, audience, and required scopes.
  • OTLP metrics/log exporters are configured (feedser.telemetry.*) or container stdout is shipped to your SIEM.
  • Operators have access to the Feedser job trigger endpoints via CLI or REST for smoke tests.

Configuration snippet

feedser:
  authority:
    enabled: true
    allowAnonymousFallback: false          # set to true only during initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://feedser"
    requiredScopes:
      - "feedser.jobs.trigger"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "feedser-jobs"
    clientSecretFile: "/run/secrets/feedser_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"

Store secrets outside source control. Feedser reads clientSecretFile on startup; rotate by updating the mounted file and restarting the service.
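Because the bypassNetworks entries are CIDR ranges, it can help to check in advance which source addresses would match them. The following is a minimal Python sketch using only the standard ipaddress module. The helper name matches_bypass is hypothetical, and the sketch illustrates CIDR matching in general rather than how Feedser itself evaluates the list.

```python
import ipaddress

# CIDR ranges copied from the configuration snippet above.
BYPASS_NETWORKS = ["127.0.0.1/32", "::1/128"]


def matches_bypass(remote_ip, networks=BYPASS_NETWORKS):
    """Return True if remote_ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(remote_ip)
    for cidr in networks:
        network = ipaddress.ip_network(cidr, strict=False)
        # An IPv4 address never matches an IPv6 range, and vice versa.
        if address.version == network.version and address in network:
            return True
    return False


for ip in ("127.0.0.1", "::1", "10.1.4.7"):
    print(ip, matches_bypass(ip))
```

With the two loopback ranges shown in the snippet, only 127.0.0.1 and ::1 match; any routable address such as 10.1.4.7 does not.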

Resilience tuning

  • Connected sites: keep the default 1s / 2s / 5s retry ladder so Feedser retries transient Authority hiccups but still surfaces outages quickly. Leave allowOfflineCacheFallback=true so cached discovery/JWKS data can bridge short Pathfinder restarts.
  • Air-gapped/Offline Kit installs: extend offlineCacheTolerance (15 to 30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (enableRetries=false) if infrastructure teams prefer to handle exponential backoff at the network layer; Feedser will fail fast but keep deterministic logs.
  • Feedser resolves these knobs through IOptionsMonitor<StellaOpsAuthClientOptions>. Edits to feedser.yaml are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.
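To reason about how quickly an outage surfaces, the retry ladder can be summed. The sketch below assumes the delays are applied one after another and ignores the time spent in each request attempt, so it gives a lower bound on the added wait. The function names are hypothetical.

```python
from datetime import timedelta


def parse_timespan(value):
    """Parse an "hh:mm:ss" string as used in the configuration snippet."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def total_retry_wait(retry_delays):
    """Sum the configured delays; excludes time spent in each request."""
    return sum((parse_timespan(d) for d in retry_delays), timedelta())


ladder = ["00:00:01", "00:00:02", "00:00:05"]
print(total_retry_wait(ladder))    # 0:00:08
print(parse_timespan("00:10:00"))  # 0:10:00
```

Under these assumptions the default ladder adds 8 seconds of waiting before a failure is reported, while the offlineCacheTolerance in the snippet is 10 minutes.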

2. Key Signals

2.1 Audit log channel

Feedser emits structured audit entries via the Feedser.Authorization.Audit logger for every /jobs* request once Authority enforcement is active.

Feedser authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=feedser-cli scopes=feedser.jobs.trigger bypass=False remote=10.1.4.7

| Field | Sample value | Meaning |
| --- | --- | --- |
| route | /jobs/definitions | Endpoint that processed the request. |
| status | 200 / 401 / 409 | Final HTTP status code returned to the caller. |
| subject | ops@example.com | User or service principal subject (falls back to (anonymous) when unauthenticated). |
| clientId | feedser-cli | OAuth client ID provided by Authority ((none) if the token lacked the claim). |
| scopes | feedser.jobs.trigger | Normalised scope list extracted from token claims; (none) if the token carried none. |
| bypass | True / False | Indicates whether the request succeeded because its source IP matched a bypass CIDR. |
| remote | 10.1.4.7 | Remote IP recorded from the connection / forwarded header test hooks. |

Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations:

  • status=401 AND bypass=True: a bypass network accepted an unauthenticated call (should be temporary during rollout).
  • status=202 AND scopes="(none)": a token without scopes triggered a job; tighten client configuration.
  • Spike in clientId="(none)": indicates upstream Authority is not issuing client_id claims or the CLI is outdated.
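If your logging backend cannot index the fields directly, the same checks can be run over raw log lines. The following Python sketch assumes the message layout matches the sample line above (space-separated key=value pairs after a fixed prefix). The function names are hypothetical and the parser is not derived from the Feedser source.

```python
import re

AUDIT_PREFIX = "Feedser authorization audit "
# key=value pairs; a value runs until the next " key=" or the end of line.
PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_audit_line(line):
    """Split one audit message into a dict of its key=value fields."""
    tail = line.split(AUDIT_PREFIX, 1)[-1]
    return {key: value.strip().strip('"') for key, value in PAIR.findall(tail)}


def suspicious_reasons(entry):
    """Return the reasons, if any, why an audit entry deserves a look."""
    reasons = []
    if entry.get("status") == "401" and entry.get("bypass") == "True":
        reasons.append("unauthenticated call on a bypass network")
    if entry.get("status") == "202" and entry.get("scopes") == "(none)":
        reasons.append("job triggered by a token without scopes")
    if entry.get("clientId") == "(none)":
        reasons.append("token carried no client_id claim")
    return reasons


sample = (
    "Feedser authorization audit route=/jobs/definitions status=200 "
    "subject=ops@example.com clientId=feedser-cli "
    "scopes=feedser.jobs.trigger bypass=False remote=10.1.4.7"
)
entry = parse_audit_line(sample)
print(entry["route"], suspicious_reasons(entry))
```

The sample entry produces no reasons. Feeding the function a stream of parsed lines and counting non-empty results gives a rough equivalent of the filters listed above.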

2.2 Metrics

Feedser publishes counters under the OTEL meter StellaOps.Feedser.WebService.Jobs. Tags: job.kind, job.trigger, job.outcome.

| Metric name | Description | PromQL example |
| --- | --- | --- |
| web.jobs.triggered | Accepted job trigger requests. | sum by (job_kind) (rate(web_jobs_triggered_total[5m])) |
| web.jobs.trigger.conflict | Rejected triggers (already running, disabled…). | sum(rate(web_jobs_trigger_conflict_total[5m])) |
| web.jobs.trigger.failed | Server-side job failures. | sum(rate(web_jobs_trigger_failed_total[5m])) |

Prometheus/OTEL collectors typically surface counters with a _total suffix. Adjust queries to match your pipeline's generated metric names.
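The naming convention described above can be sketched as a small helper: dots become underscores and counters gain a _total suffix. This reflects the common default only. Your collector may add a namespace prefix or unit suffix, so treat the output as a starting point. The function name is hypothetical.

```python
def prometheus_counter_name(otel_name, add_total=True):
    """Map an OTEL counter name to its usual Prometheus spelling."""
    name = otel_name.replace(".", "_")
    if add_total and not name.endswith("_total"):
        name += "_total"
    return name


for metric in (
    "web.jobs.triggered",
    "web.jobs.trigger.conflict",
    "web.jobs.trigger.failed",
):
    print(metric, "->", prometheus_counter_name(metric))

# Tag names follow the same dot-to-underscore rule, without the suffix.
print(prometheus_counter_name("job.kind", add_total=False))  # job_kind
```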

Correlate audit logs with the following global meter exported via Feedser.SourceDiagnostics:

  • feedser.source.http.requests_total{feedser_source="jobs-run"}: confirms that REST/manual triggers route through Authority.
  • If Grafana dashboards are deployed, extend the “Feedser Jobs” board with the above counters plus a table of recent audit log entries.

3. Alerting Guidance

  1. Unauthorized bypass attempt

    • Query: sum(rate(log_messages_total{logger="Feedser.Authorization.Audit", status="401", bypass="True"}[5m])) > 0
    • Action: verify bypassNetworks list; confirm expected maintenance windows; rotate credentials if suspicious.
  2. Missing scopes

    • Query: sum(rate(log_messages_total{logger="Feedser.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0
    • Action: audit Authority client registration; ensure requiredScopes includes feedser.jobs.trigger.
  3. Trigger failure surge

    • Query: sum(rate(web_jobs_trigger_failed_total[10m])) > 0, raised at warning severity if sustained for 10 minutes.
    • Action: inspect correlated audit entries and Feedser.Telemetry traces for job execution errors.
  4. Conflict spike

    • Query: sum(rate(web_jobs_trigger_conflict_total[10m])) > 5 (tune threshold).
    • Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.
  5. Authority offline

    • Watch Feedser.Authorization.Audit logs for status=503 or status=500 along with clientId="(none)". Investigate Authority availability before re-enabling anonymous fallback.
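When tuning the thresholds above, keep in mind that PromQL rate() returns a per-second average over the window, so a threshold of 5 means five events per second, not five events per window. The Python sketch below works through that arithmetic. It is a simplification (two samples, basic reset handling, no extrapolation) and not a reimplementation of Prometheus.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second increase of a counter across a window (simplified)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    increase = last_value - first_value
    if increase < 0:
        # Counter reset (for example a restart): count from zero again.
        increase = last_value
    return increase / window_seconds


def events_needed(threshold_per_second, window_seconds):
    """How many events in the window correspond to the rate threshold."""
    return threshold_per_second * window_seconds


# A rate threshold of 5 over a 10 minute window corresponds to 3000 conflicts.
print(events_needed(5, 600))             # 3000
print(per_second_rate(100, 3100, 600))   # 5.0
```

If the intent is "more than five conflicts in ten minutes", a query built on increase(...[10m]) expresses that more directly than rate().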

4. Rollout & Verification Procedure

  1. Pre-checks

    • Confirm allowAnonymousFallback is false in production; keep true only during staged validation.
    • Validate Authority issuer metadata is reachable from Feedser (curl https://authority.internal/.well-known/openid-configuration from the host).
  2. Smoke test with valid token

    • Obtain a token via CLI: stella auth login --scope feedser.jobs.trigger.
    • Call a read-only endpoint: curl -H "Authorization: Bearer $TOKEN" https://feedser.internal/jobs/definitions.
    • Expect HTTP 200/202 and an audit log with bypass=False, scopes=feedser.jobs.trigger.
  3. Negative test without token

    • Call the same endpoint without a token. Expect HTTP 401, bypass=False.
    • If the request succeeds, double-check bypassNetworks and ensure fallback is disabled.
  4. Bypass check (if applicable)

    • From an allowed maintenance IP, call /jobs/definitions without a token. Confirm the audit log shows bypass=True. Review business justification and expiry date for such entries.
  5. Metrics validation

    • Ensure web.jobs.triggered counter increments during accepted runs.
    • Exporters should show corresponding spans (feedser.job.trigger) if tracing is enabled.
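The positive and negative checks in steps 2 and 3 can be scripted so they are repeatable after each rollout. The following is a minimal Python sketch using only the standard library. The URL comes from the curl example above, the function names are hypothetical, and the status fetcher is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request

# URL taken from the curl example above; adjust for your environment.
JOBS_URL = "https://feedser.internal/jobs/definitions"


def http_status(url, token=None, timeout=10):
    """Return the HTTP status for a GET, with an optional bearer token."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses arrive as exceptions


def run_smoke_checks(token, get_status=http_status, url=JOBS_URL):
    """Run the with-token and without-token checks; return any failures."""
    failures = []
    with_token = get_status(url, token)
    if with_token not in (200, 202):
        failures.append(f"with token: expected 200/202, got {with_token}")
    without_token = get_status(url, None)
    if without_token != 401:
        failures.append(f"without token: expected 401, got {without_token}")
    return failures
```

Calling run_smoke_checks with the token obtained in step 2 returns an empty list when both checks pass. Run it from a host outside bypassNetworks, otherwise the tokenless call is expected to succeed and will be reported as a failure.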

5. Troubleshooting

| Symptom | Probable cause | Remediation |
| --- | --- | --- |
| Audit log shows clientId=(none) for all requests | Authority not issuing client_id claim or CLI outdated | Update StellaOps Authority configuration (StellaOpsAuthorityOptions.Token.Claims.ClientId), or upgrade the CLI token acquisition flow. |
| Requests succeed with bypass=True unexpectedly | Local network added to bypassNetworks or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Feedser. |
| HTTP 401 with valid token | requiredScopes missing from client registration or token audience mismatch | Verify Authority client scopes (feedser.jobs.trigger) and ensure the token audience matches audiences config. |
| Metrics missing from Prometheus | Telemetry exporters disabled or filter missing OTEL meter | Set feedser.telemetry.enableMetrics=true, ensure collector includes StellaOps.Feedser.WebService.Jobs meter. |
| Sudden spike in web.jobs.trigger.failed | Downstream job failure or Authority timeout mid-request | Inspect Feedser job logs, re-run with tracing enabled, validate Authority latency. |
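For the "HTTP 401 with valid token" case, inspecting the token's claims often settles whether the audience or the scope is at fault. The sketch below decodes a JWT payload without verifying the signature, so use it for diagnosis only. It assumes Authority issues JWT access tokens and places scopes in a space-delimited scope claim. Both are assumptions, and the function names are hypothetical.

```python
import base64
import json


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def token_problems(claims, audience="api://feedser",
                   scope="feedser.jobs.trigger"):
    """Compare claims with the audience and scope from the config snippet."""
    problems = []
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)
    if audience not in audiences:
        problems.append(f"audience {audience!r} not in {audiences!r}")
    # Assumption: scopes live in a space-delimited "scope" claim.
    granted = str(claims.get("scope", "")).split()
    if scope not in granted:
        problems.append(f"scope {scope!r} not in {granted!r}")
    return problems
```

An empty result from token_problems(decode_jwt_payload(token)) suggests the 401 has another cause, such as an issuer mismatch or clock skew beyond tokenClockSkewSeconds.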

6. References

  • docs/21_INSTALL_GUIDE.md: Authority configuration quick start.
  • docs/17_SECURITY_HARDENING_GUIDE.md: Security guardrails and enforcement deadlines.
  • docs/ops/authority-monitoring.md: Authority-side monitoring and alerting playbook.
  • StellaOps.Feedser.WebService/Filters/JobAuthorizationAuditFilter.cs: source of audit log fields.