Authority Monitoring & Alerting Playbook
Telemetry Sources
- Traces: Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- Metrics: OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- Logs: Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors).
  - `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).
 
Prometheus Metrics to Collect
| Metric | Query | Purpose |
|---|---|---|
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via the `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Fraction of `/token` requests that return a 4xx or 5xx status. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiter saturation (requires the OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from the Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |
Exporter note: Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.
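
The alert rules in this playbook refer to `token_failure_ratio` by name, which suggests the table entries are meant to be materialized as Prometheus recording rules. The fragment below is a minimal sketch of that idea, not a tested rule file: the file name, group name, and evaluation interval are assumptions, and only two of the table rows are shown.

```yaml
# authority-recording-rules.yml (hypothetical file name)
groups:
  - name: stellaops-authority-recording   # assumed group name
    interval: 1m                          # assumed evaluation interval
    rules:
      # Fraction of /token requests that returned 4xx or 5xx (query copied from the table).
      - record: token_failure_ratio
        expr: |
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..|5.."}[5m]))
          /
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))
      # Rate limiter rejections for the authority-token limiter (query copied from the table).
      - record: authorize_rate_limit_hits
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))
```

When `/token` receives no traffic in the window, the ratio has no usable value, so an alert built on it stays silent rather than firing.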
Alert Rules
- Token Failure Surge
  - Expression: `token_failure_ratio > 0.05`
  - For: `10m`
  - Labels: `severity="critical"`
  - Annotations: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as diagnostic hint (requires span → metric transformation).
- Lockout Spike
  - Expression: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
  - For: `15m`
  - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
- Bypass Threshold
  - Expression: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
  - For: `5m`
  - Alert severity `warning`; verify the calling host list.
- Rate Limiter Saturation
  - Expression: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
  - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.
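
The rules above map fairly directly onto a Prometheus alerting rule file. The following is a sketch under assumptions rather than a shipped configuration: the group name and alert names are hypothetical, `token_failure_ratio` is assumed to exist as a recording rule, and the `for: 5m` on the last rule is one reading of "escalate if sustained for 5 min".

```yaml
groups:
  - name: stellaops-authority-alerts        # assumed group name
    rules:
      - alert: AuthorityTokenFailureSurge   # hypothetical alert name
        expr: token_failure_ratio > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          description: 'Diagnostic hint: topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))'
      - alert: AuthorityLockoutSpike        # hypothetical alert name
        expr: |
          sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10
        for: 15m
      - alert: AuthorityBypassThreshold     # hypothetical alert name
        expr: |
          sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1
        for: 5m
        labels:
          severity: warning
      - alert: AuthorityRateLimiterSaturation   # hypothetical alert name
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0
        for: 5m   # assumption: treats "sustained for 5 min" as the pending period
```

Severity labels are set only where the playbook states them; the other two rules would need labels that match your routing tree.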
 
Grafana Dashboard
- Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels:
  - Token Success vs Failure – stacked rate visualization split by grant type.
  - Rate Limiter Hits – bar chart showing `authority-token` and `authority-authorize`.
  - Bypass & Lockout Events – dual-stat panel using Loki-derived counters.
  - Trace Explorer Link – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
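
If the dashboard is provisioned from disk rather than imported through the UI, a Grafana dashboard provider file along these lines could load it. This is a sketch only: the provider name, folder, and filesystem path are assumptions, and the dashboard JSON is presumed to have been copied into that directory.

```yaml
# Grafana dashboard provider (placed under provisioning/dashboards/)
apiVersion: 1
providers:
  - name: stellaops-authority       # assumed provider name
    folder: StellaOps               # assumed Grafana folder
    type: file
    options:
      # Assumed directory containing a copy of authority-grafana-dashboard.json
      path: /var/lib/grafana/dashboards/stellaops
```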
 
Collector Configuration Snippets
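```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # The logs pipeline below references a loki exporter, so one must be defined.
  # Placeholder definition: adjust the exporter type and endpoint to your
  # Collector distribution and Loki instance.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
processors:
  batch:
  # Copy authority.grant_type into the grant_type label used by the Prometheus queries.
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```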
Operational Checklist
- Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- Ensure Promtail captures container stdout with Serilog structured formatting.
- Periodically validate alert noise by running load tests that trigger the rate limiter.
- Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.
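
For the Promtail item, the log-derived counters used in the queries need a pipeline that extracts the level and message template and counts matching lines. The fragment below is an untested sketch of such a pipeline. It assumes Serilog writes compact JSON (CLEF), where `@mt` holds the message template and `@l` the level; the job name and the counter description are also assumptions.

```yaml
scrape_configs:
  - job_name: stellaops-authority        # assumed job name
    pipeline_stages:
      # Parse the JSON log line. Keys starting with "@" must be quoted in JMESPath.
      - json:
          expressions:
            level: '"@l"'                # CLEF omits @l for Information-level events
            message_template: '"@mt"'
      # Promote the extracted values to labels so they can be used in selectors.
      - labels:
          level:
          message_template:
      # Count every log line; the resulting counter carries the stream labels.
      - metrics:
          log_messages_total:
            type: Counter
            description: "Authority log events by level and message template"
            config:
              match_all: true
              action: inc
```

Promtail prefixes custom metrics with `promtail_custom_` by default, so the series would need the `prefix` option or a relabeling step to match the `log_messages_total` name used in the queries. Because CLEF leaves out `@l` for Information events, the `level="Information"` selector also needs a default to be assigned somewhere in the pipeline.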