Authority Monitoring & Alerting Playbook

Telemetry Sources

  • Traces: Activity source StellaOps.Authority emits spans for every token flow (authority.token.validate_*, authority.token.handle_*, authority.token.validate_access). Key tags include authority.endpoint, authority.grant_type, authority.username, authority.client_id, and authority.identity_provider.
  • Metrics: OpenTelemetry instrumentation (AddAspNetCoreInstrumentation, AddHttpClientInstrumentation, custom meter StellaOps.Authority) exports:
    • http.server.request.duration histogram (http_route, http_status_code, authority.endpoint tag via aspnetcore enrichment).
    • process.runtime.gc.*, process.runtime.dotnet.* (from AddRuntimeInstrumentation).
  • Logs: Serilog writes structured events to stdout. Notable templates:
    • "Password grant verification failed ..." and "Plugin {PluginName} denied access ... due to lockout" (lockout spike detector).
    • "Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals." (identifies users attempting exceptions:approve without MFA support; tie to fresh-auth errors).
    • "Client credentials validation failed for {ClientId}: exception scopes require tenant assignment." (signals misconfigured exception service identities).
    • "Granting StellaOps bypass for remote {RemoteIp}" (bypass usage).
    • "Rate limit exceeded for path {Path} from {RemoteIp}" (limiter alerts).

Prometheus Metrics to Collect

| Metric | Query | Purpose |
| --- | --- | --- |
| token_requests_total | sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m])) | Token issuance volume per grant type (grant_type comes via authority.grant_type span attribute → Exemplars in Grafana). |
| token_failure_ratio | sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m])) | Share of /token requests that return 4xx or 5xx; drives the Token Failure Surge alert. |
| authorize_rate_limit_hits | sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m])) | Detects rate limiter saturation (requires OTEL ASP.NET rate limiter exporter). |
| lockout_events | sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m])) | Derived from Loki/Promtail log counter. |
| bypass_usage_total | sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) | Tracks trusted bypass invocations. |

Exporter note: Enable the aspnetcore meters (dotnet-counters name Microsoft.AspNetCore.Hosting), or add a transform processor with metric_statements to the OpenTelemetry Collector metrics pipeline to remap histogram counts into the series shown above.
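
The named metrics above are query shorthands rather than series the service emits, so anything that refers to them by name (such as an alert on token_failure_ratio) needs them recorded first. A minimal sketch of Prometheus recording rules follows, assuming the underlying series already exist under the names used in the table. The group name and interval are illustrative, not taken from the StellaOps repository.

```yaml
groups:
  - name: stellaops-authority-recording   # hypothetical group name
    interval: 1m                          # assumed evaluation interval
    rules:
      # Share of /token requests answered with a 4xx or 5xx status.
      - record: token_failure_ratio
        expr: |
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m]))
          /
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))
      # Rejections by the authority-token rate limiter.
      - record: authorize_rate_limit_hits
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))
      # Lockout denials per identity provider plugin, from the log-derived counter.
      - record: lockout_events
        expr: |
          sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))
```

Running `promtool check rules <file>` against the file catches syntax errors before the rules are loaded.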

Alert Rules

  1. Token Failure Surge
    • Expression: token_failure_ratio > 0.05
    • For: 10m
    • Labels: severity="critical"
    • Annotations: Include topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m]))) as a diagnostic hint (requires a span → metric transformation to produce authority_token_rejections_total).
  2. Lockout Spike
    • Expression: sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10
    • For: 15m
    • Investigate credential stuffing; consider temporarily tightening RateLimiting.Token.
  3. Bypass Threshold
    • Expression: sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1
    • For: 5m
    • Labels: severity="warning". Verify the calling host list.
  4. Rate Limiter Saturation
    • Expression: sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0
    • Escalate if sustained for 5 minutes; confirm trusted clients aren't misconfigured.
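
The four rules above could be expressed as a Prometheus rule file roughly as follows. This is a sketch, not the project's shipped rule set. The group and alert names are hypothetical, and the severities marked as assumptions are not stated in the playbook. The first rule assumes token_failure_ratio is available as a recorded series. The last rule reads "sustained for 5 minutes" as `for: 5m`.

```yaml
groups:
  - name: stellaops-authority-alerts          # hypothetical group name
    rules:
      - alert: AuthorityTokenFailureSurge     # hypothetical alert name
        expr: token_failure_ratio > 0.05      # assumes a recorded series of this name
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Token failure ratio above 5% for 10 minutes."
          # Query to run by hand while triaging; needs a span-to-metric transformation.
          diagnostic_query: >-
            topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))
      - alert: AuthorityLockoutSpike          # hypothetical alert name
        expr: |
          sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10
        for: 15m
        labels:
          severity: warning                   # assumption: severity not stated in the playbook
        annotations:
          summary: "Lockout spike; investigate credential stuffing."
      - alert: AuthorityBypassThreshold       # hypothetical alert name
        expr: |
          sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Bypass usage above threshold; verify the calling host list."
      - alert: AuthorityRateLimiterSaturation # hypothetical alert name
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0
        for: 5m
        labels:
          severity: warning                   # assumption: severity not stated in the playbook
        annotations:
          summary: "Rate limiter rejecting requests; check trusted client configuration."
```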

Grafana Dashboard

  • Import docs/ops/authority-grafana-dashboard.json to provision baseline panels:
    • Token Success vs Failure: stacked rate visualization split by grant type.
    • Rate Limiter Hits: bar chart showing authority-token and authority-authorize.
    • Bypass & Lockout Events: dual-stat panel using Loki-derived counters.
    • Trace Explorer Link: panel that links to the StellaOps.Authority span search, pre-filtered by authority.grant_type.
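
Where the dashboard should load from disk rather than through a manual import (for example in air-gapped clusters), Grafana's file-based dashboard provisioning is one option. The sketch below assumes the JSON file has been copied into the directory given under `path`. The provider name, folder, and path are illustrative.

```yaml
apiVersion: 1
providers:
  - name: stellaops-authority        # hypothetical provider name
    folder: StellaOps                # assumed Grafana folder for the dashboard
    type: file
    disableDeletion: true            # keep the dashboard if the file is temporarily missing
    updateIntervalSeconds: 60        # how often Grafana rescans the directory
    options:
      # Assumed location; copy authority-grafana-dashboard.json here.
      path: /var/lib/grafana/dashboards/stellaops
```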

Collector Configuration Snippets

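receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # The logs pipeline below references a loki exporter, so it must be defined here.
  # The endpoint is a placeholder (assumption); point it at your Loki push API.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
processors:
  batch:
  # Copy authority.grant_type into a grant_type attribute for use in PromQL.
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]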

Operational Checklist

  • Confirm STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS enables OTLP in production builds.
  • Ensure Promtail captures container stdout with Serilog structured formatting.
  • Periodically validate alert noise by running load tests that trigger the rate limiter.
  • Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.
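
The log_messages_total counter used in the queries and alerts has to be produced somewhere. One way is a Promtail pipeline with a metrics stage. The sketch below makes several assumptions: Serilog writes one JSON object per line using the field names of its JsonFormatter (Level, MessageTemplate, Properties), the container runtime uses CRI log framing, and the job name and log path are placeholders. Promtail prepends `promtail_custom_` to custom metric names by default, so the sketch sets an explicit prefix to arrive at the name used in this playbook.

```yaml
scrape_configs:
  - job_name: stellaops-authority              # hypothetical job name
    static_configs:
      - targets: [localhost]
        labels:
          app: stellaops-authority
          __path__: /var/log/containers/*stellaops-authority*.log   # assumed log location
    pipeline_stages:
      - cri: {}                                # unwrap CRI framing; use docker: {} for Docker JSON logs
      - json:
          expressions:
            level: Level                       # assumes Serilog JsonFormatter field names
            message_template: MessageTemplate
            plugin: Properties.PluginName      # only present on plugin log events
      - labels:
          level:
          message_template:
          plugin:
      - metrics:
          messages_total:
            type: Counter
            description: "Authority log events by level and message template"
            prefix: log_                       # exposes log_messages_total instead of promtail_custom_messages_total
            config:
              match_all: true                  # count every line that reaches this stage
              action: inc
```

The counter is exposed on Promtail's own /metrics endpoint, so Prometheus has to scrape Promtail for these series to appear. Using message_template as a label is only workable while the number of distinct templates stays small.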