Authority Monitoring & Alerting Playbook
Telemetry Sources
- Traces: Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- Metrics: OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- Logs: Serilog writes structured events to stdout. Notable templates:
  - "Password grant verification failed ..." and "Plugin {PluginName} denied access ... due to lockout" (lockout spike detector).
  - "Granting StellaOps bypass for remote {RemoteIp}" (bypass usage).
  - "Rate limit exceeded for path {Path} from {RemoteIp}" (limiter alerts).
Prometheus Metrics to Collect
| Metric | Query | Purpose | 
|---|---|---|
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via the `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Share of `/token` requests that return a 4xx or 5xx status. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiter saturation (requires OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |
Exporter note: Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.
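One possible reading of the Collector option in the exporter note is a `transform` processor that renames the histogram before export, so that the Prometheus exporter emits the `http_server_duration_seconds_*` series used in the queries above. This is a minimal sketch, not the project's shipped configuration. It assumes the service reports `http.server.request.duration` in seconds and that the Collector build includes the contrib `transform` processor. The processor name `transform/authority_http` is hypothetical.

```yaml
processors:
  # Hypothetical processor name; needs the contrib "transform" processor.
  transform/authority_http:
    metric_statements:
      - context: metric
        statements:
          # Assumption: renaming the histogram lets the Prometheus exporter
          # produce http_server_duration_seconds_{bucket,sum,count}.
          - set(name, "http.server.duration") where name == "http.server.request.duration"
```

To take effect, the processor also has to be listed in the `processors` array of the `metrics` pipeline, ahead of `batch`.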
Alert Rules
- Token Failure Surge
  - Expression: `token_failure_ratio > 0.05`
  - For: `10m`
  - Labels: `severity="critical"`
  - Annotations: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as a diagnostic hint (requires span → metric transformation).
- Lockout Spike
  - Expression: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
  - For: `15m`
  - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
- Bypass Threshold
  - Expression: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
  - For: `5m`
  - Alert severity `warning`; verify the calling host list.
- Rate Limiter Saturation
  - Expression: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
  - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.
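The rules above could be written as a Prometheus rule file along the following lines. This is a sketch under stated assumptions, not the project's shipped rules: the group name and alert names are hypothetical, `token_failure_ratio` is defined here as a recording rule so the alert can refer to it by name, and the 5 minute hold on the rate limiter alert is one reading of "escalate if sustained for 5 min". Severities are set only where the playbook states them.

```yaml
groups:
  - name: stellaops-authority        # hypothetical group name
    rules:
      # Recording rule backing the token_failure_ratio name used by the alert.
      - record: token_failure_ratio
        expr: |
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m]))
          /
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))

      - alert: AuthorityTokenFailureSurge      # hypothetical alert name
        expr: token_failure_ratio > 0.05
        for: 10m
        labels:
          severity: critical

      - alert: AuthorityLockoutSpike           # hypothetical alert name
        expr: |
          sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10
        for: 15m

      - alert: AuthorityBypassThreshold        # hypothetical alert name
        expr: |
          sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1
        for: 5m
        labels:
          severity: warning

      - alert: AuthorityRateLimiterSaturation  # hypothetical alert name
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0
        # Assumption: "sustained for 5 min" expressed as a hold duration.
        for: 5m
```

Block scalars (`expr: |`) keep the braces and quotes in the message templates from needing YAML escaping.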
Grafana Dashboard
- Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels:
  - Token Success vs Failure – stacked rate visualization split by grant type.
  - Rate Limiter Hits – bar chart showing `authority-token` and `authority-authorize`.
  - Bypass & Lockout Events – dual-stat panel using Loki-derived counters.
  - Trace Explorer Link – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
Collector Configuration Snippets
receivers:
  otlp:
    protocols:
      http:
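exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # The logs pipeline references a loki exporter, so it must be defined here.
  # Placeholder endpoint (assumption); point it at your Loki push API.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"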
processors:
  batch:
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
Operational Checklist
- Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- Ensure Promtail captures container stdout with Serilog structured formatting.
- Periodically validate alert noise by running load tests that trigger the rate limiter.
- Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.