Authority Monitoring & Alerting Playbook
Telemetry Sources
- Traces: Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- Metrics: OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- Logs: Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).
Prometheus Metrics to Collect
| Metric | Query | Purpose |
|---|---|---|
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Fraction of `/token` requests returning 4xx or 5xx; feeds the Token Failure Surge alert. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiter saturation (requires OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |
Exporter note: Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.
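
The alert rules in this playbook refer to `token_failure_ratio` as if it were a series of its own. One way to make that work is a Prometheus recording rule file that evaluates the queries from the table and stores the results under the table's metric names. The following is a minimal sketch, not the project's shipped configuration: the group name and the 30s evaluation interval are assumptions, and the expressions are copied from the table above.

```yaml
# Hypothetical recording rules file, e.g. loaded through rule_files in prometheus.yml.
groups:
  - name: stellaops-authority-recording   # assumed group name
    interval: 30s                         # assumed evaluation interval
    rules:
      # Share of /token requests that ended in a 4xx or 5xx status.
      - record: token_failure_ratio
        expr: |
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..|5.."}[5m]))
          /
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))
      # Rejections from the authority-token limiter.
      - record: authorize_rate_limit_hits
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))
```

Inside a rule file the regex alternation is written as a plain `|`; the backslash escape is only needed when the query sits in a Markdown table cell.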
Alert Rules
- Token Failure Surge
  - Expression: `token_failure_ratio > 0.05`
  - For: `10m`
  - Labels: `severity="critical"`
  - Annotations: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as diagnostic hint (requires span → metric transformation).
- Lockout Spike
  - Expression: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
  - For: `15m`
  - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
- Bypass Threshold
  - Expression: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
  - For: `5m`
  - Alert severity `warning`; verify the calling host list.
- Rate Limiter Saturation
  - Expression: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
  - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.
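
The four rules above could be written as a Prometheus alerting rule file along the following lines. This is a sketch under stated assumptions: the group name, alert names, and `summary` texts are hypothetical, `token_failure_ratio` is assumed to exist as a recorded series, and severity labels appear only where this playbook states one.

```yaml
groups:
  - name: stellaops-authority-alerts   # assumed group name
    rules:
      - alert: AuthorityTokenFailureSurge   # hypothetical alert name
        # Assumes token_failure_ratio is available as a recorded series.
        expr: token_failure_ratio > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of /token requests have failed for 10 minutes"
          # Query to run by hand for a per-provider breakdown
          # (requires the span-to-metric transformation).
          diagnostic_query: 'topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))'

      - alert: AuthorityLockoutSpike
        expr: |
          sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10
        for: 15m
        annotations:
          summary: "Lockout spike: investigate credential stuffing, consider tightening RateLimiting.Token"

      - alert: AuthorityBypassThreshold
        expr: |
          sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Trusted bypass rate above threshold: verify the calling host list"

      - alert: AuthorityRateLimiterSaturation
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0
        for: 5m   # "sustained for 5 min" from the rule description
        annotations:
          summary: "Rate limiter is rejecting requests: confirm trusted clients are not misconfigured"
```

The Lockout Spike and Rate Limiter Saturation rules carry no `severity` label here because the playbook does not assign one; add whatever your routing tree expects.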
Grafana Dashboard
- Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels:
  - Token Success vs Failure – stacked rate visualization split by grant type.
  - Rate Limiter Hits – bar chart showing `authority-token` and `authority-authorize`.
  - Bypass & Lockout Events – dual-stat panel using Loki-derived counters.
  - Trace Explorer Link – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
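
If the dashboard should be loaded from disk rather than imported by hand through the UI, Grafana's file-based dashboard provisioning is one option. The sketch below assumes that `authority-grafana-dashboard.json` has been copied into the directory named under `path`; the provider name, folder, and path are all hypothetical.

```yaml
# Hypothetical provider file, e.g. /etc/grafana/provisioning/dashboards/stellaops.yaml
apiVersion: 1
providers:
  - name: stellaops-authority        # assumed provider name
    folder: StellaOps                # assumed Grafana folder
    type: file
    disableDeletion: true            # keep the dashboard if the file disappears
    updateIntervalSeconds: 60        # how often Grafana rescans the directory
    options:
      # Assumed location of the copied authority-grafana-dashboard.json
      path: /var/lib/grafana/dashboards/stellaops
```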
Collector Configuration Snippets
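```yaml
receivers:
  otlp:
    protocols:
      http:

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # The logs pipeline below references a loki exporter, so it must be defined.
  # The endpoint is an assumed placeholder; point it at your Loki instance.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

processors:
  batch:
  attributes/token_grant:
    actions:
      # Copy authority.grant_type into a shorter grant_type label.
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```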
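
The `prometheus` exporter above only exposes a scrape endpoint on port 9464; Prometheus still has to be told to pull from it. A minimal scrape job might look like the following, where the job name, scrape interval, and the `otel-collector` hostname are assumptions for illustration.

```yaml
# Fragment of prometheus.yml (hypothetical job name and target host).
scrape_configs:
  - job_name: stellaops-authority-otel
    scrape_interval: 15s
    static_configs:
      # Port matches the collector's prometheus exporter endpoint.
      - targets: ["otel-collector:9464"]
```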
Operational Checklist
- Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- Ensure Promtail captures container stdout with Serilog structured formatting.
- Periodically validate alert noise by running load tests that trigger the rate limiter.
- Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.