Initial commit (history squashed)
	
		
			
	
		
	
	
		
	
		
Some checks failed
Build Test Deploy / authority-container (push) Has been cancelled
Build Test Deploy / docs (push) Has been cancelled
Build Test Deploy / deploy (push) Has been cancelled
Build Test Deploy / build-test (push) Has been cancelled
Docs CI / lint-and-preview (push) Has been cancelled
				
			
		
		
	
	
				
					
				
			
		
		
							
								
								
									
										81
									
								
								docs/ops/authority-monitoring.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,81 @@ | ||||
| # Authority Monitoring & Alerting Playbook | ||||
|  | ||||
| ## Telemetry Sources | ||||
| - **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`. | ||||
| - **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports: | ||||
|   - `http.server.request.duration` histogram, labelled with `http_route` and `http_status_code`, plus an `authority.endpoint` tag added via `aspnetcore` enrichment. | ||||
|   - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`). | ||||
| - **Logs:** Serilog writes structured events to stdout. Notable templates: | ||||
|   - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector). | ||||
|   - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage). | ||||
|   - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts). | ||||
|  | ||||
| ## Prometheus Metrics to Collect | ||||
| | Metric | Query | Purpose | | ||||
| | --- | --- | --- | | ||||
| | `token_requests_total` | `sum by (grant_type, http_status_code) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type and status code (`grant_type` is derived from the `authority.grant_type` span attribute and surfaces as exemplars in Grafana). | | ||||
| | `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Share of `/token` requests answered with 4xx or 5xx; alert when above 5 % for 10 min. | | ||||
| | `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detects rate limiter saturation (requires OTEL ASP.NET rate limiter exporter). | | ||||
| | `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. | | ||||
| | `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. | | ||||
|  | ||||
| > **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or add a `transform` processor (`metric_statements`) to the OpenTelemetry Collector `metrics` pipeline to remap histogram counts into the series names used in the queries above. | ||||
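|  | ||||
| The table pairs metric names with queries, which reads naturally as a set of Prometheus recording rules. The following is a minimal sketch of such a rules file, not a file shipped with this commit; the group name and evaluation interval are assumptions, and only the HTTP and rate limiter series are covered because the log-derived counters depend on the Loki/Promtail setup. | ||||
|  | ||||
| ```yaml | ||||
| # Hypothetical rules file; load it via `rule_files` in prometheus.yml. | ||||
| groups: | ||||
|   - name: stellaops-authority-recording   # assumed group name | ||||
|     interval: 1m                          # assumed evaluation interval | ||||
|     rules: | ||||
|       # Fraction of /token requests that returned 4xx or 5xx over 5 minutes. | ||||
|       - record: token_failure_ratio | ||||
|         expr: >- | ||||
|           sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m])) | ||||
|           / | ||||
|           sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m])) | ||||
|       # Per-second rate of requests rejected by the token rate limiter. | ||||
|       - record: authorize_rate_limit_hits | ||||
|         expr: >- | ||||
|           sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m])) | ||||
| ``` | ||||
|  | ||||
| Recording the ratio once keeps alert expressions and dashboard panels reading the same series instead of re-evaluating the full query in each place. | ||||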
|  | ||||
| ## Alert Rules | ||||
| 1. **Token Failure Surge** | ||||
|    - _Expression_: `token_failure_ratio > 0.05` | ||||
|    - _For_: `10m` | ||||
|    - _Labels_: `severity="critical"` | ||||
|    - _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as a diagnostic hint (requires span → metric transformation). | ||||
| 2. **Lockout Spike** | ||||
|    - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10` | ||||
|    - _For_: `15m` | ||||
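|    - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`. Note that `rate()` returns a per-second value, so this threshold means more than 10 lockouts per second averaged over 15 minutes; use `increase()` if a per-window count is intended. | ||||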
| 3. **Bypass Threshold** | ||||
|    - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1` | ||||
|    - _For_: `5m` | ||||
|    - _Labels_: `severity="warning"` | ||||
|    - Verify the calling host list. | ||||
| 4. **Rate Limiter Saturation** | ||||
|    - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0` | ||||
|    - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured. | ||||
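|  | ||||
| The four rules above can be written as a Prometheus alerting rules file roughly as follows. This is a sketch under stated assumptions: the alert names, group name, and annotation text are hypothetical, `token_failure_ratio` is assumed to exist as a recorded series, and the `for: 5m` on the rate limiter rule is one reading of "sustained for 5 min". Severity labels are set only where the playbook states them. | ||||
|  | ||||
| ```yaml | ||||
| groups: | ||||
|   - name: stellaops-authority-alerts   # assumed group name | ||||
|     rules: | ||||
|       - alert: AuthorityTokenFailureSurge   # hypothetical alert name | ||||
|         expr: token_failure_ratio > 0.05    # assumes a recorded series of this name | ||||
|         for: 10m | ||||
|         labels: | ||||
|           severity: critical | ||||
|         annotations: | ||||
|           summary: Token endpoint failure ratio above 5% for 10 minutes | ||||
|       - alert: AuthorityLockoutSpike        # hypothetical alert name | ||||
|         expr: >- | ||||
|           sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10 | ||||
|         for: 15m | ||||
|       - alert: AuthorityBypassThreshold     # hypothetical alert name | ||||
|         expr: >- | ||||
|           sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1 | ||||
|         for: 5m | ||||
|         labels: | ||||
|           severity: warning | ||||
|       - alert: AuthorityRateLimiterSaturation   # hypothetical alert name | ||||
|         expr: sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0 | ||||
|         for: 5m   # assumption: interprets "sustained for 5 min" as the pending duration | ||||
| ``` | ||||
|  | ||||
| The `expr` field is not templated, so the single-brace Serilog placeholders in the label matchers are passed to PromQL as literal text. | ||||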
|  | ||||
| ## Grafana Dashboard | ||||
| - Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels: | ||||
|   - **Token Success vs Failure** – stacked rate visualization split by grant type. | ||||
|   - **Rate Limiter Hits** – bar chart showing `authority-token` and `authority-authorize`. | ||||
|   - **Bypass & Lockout Events** – dual-stat panel using Loki-derived counters. | ||||
|   - **Trace Explorer Link** – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`. | ||||
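|  | ||||
| Besides a manual import, the dashboard JSON could be loaded through Grafana's file-based dashboard provisioning. The provider below is a minimal sketch; the provider name and directory are assumptions, and the directory would need to contain a copy of `authority-grafana-dashboard.json`. | ||||
|  | ||||
| ```yaml | ||||
| # Hypothetical file: <grafana provisioning dir>/dashboards/authority.yaml | ||||
| apiVersion: 1 | ||||
| providers: | ||||
|   - name: stellaops-authority          # assumed provider name | ||||
|     type: file | ||||
|     disableDeletion: false | ||||
|     updateIntervalSeconds: 30          # how often Grafana rescans the directory | ||||
|     options: | ||||
|       path: /var/lib/grafana/dashboards/authority   # assumed directory holding the dashboard JSON | ||||
| ``` | ||||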
|  | ||||
| ## Collector Configuration Snippets | ||||
| ```yaml | ||||
| receivers: | ||||
|   otlp: | ||||
|     protocols: | ||||
|       http: | ||||
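| exporters: | ||||
|   prometheus: | ||||
|     endpoint: "0.0.0.0:9464" | ||||
|     # Copies resource attributes (e.g. service.name) onto each series so | ||||
|     # that label matchers such as service_name="..." can select them. | ||||
|     resource_to_telemetry_conversion: | ||||
|       enabled: true | ||||
|   # The logs pipeline references this exporter, so it must be defined here. | ||||
|   loki: | ||||
|     endpoint: "http://loki:3100/loki/api/v1/push"  # placeholder host; point at your Loki instance | ||||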
| processors: | ||||
|   batch: | ||||
|   attributes/token_grant: | ||||
|     actions: | ||||
|       - key: grant_type | ||||
|         action: upsert | ||||
|         from_attribute: authority.grant_type | ||||
| service: | ||||
|   pipelines: | ||||
|     metrics: | ||||
|       receivers: [otlp] | ||||
|       processors: [attributes/token_grant, batch] | ||||
|       exporters: [prometheus] | ||||
|     logs: | ||||
|       receivers: [otlp] | ||||
|       processors: [batch] | ||||
|       exporters: [loki] | ||||
| ``` | ||||
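|  | ||||
| The `prometheus` exporter above only exposes a scrape endpoint on port 9464; Prometheus still has to be told to scrape it. A minimal scrape job might look like the following, where the job name, interval, and collector hostname are assumptions for illustration. | ||||
|  | ||||
| ```yaml | ||||
| # Fragment of prometheus.yml | ||||
| scrape_configs: | ||||
|   - job_name: stellaops-authority-collector   # assumed job name | ||||
|     scrape_interval: 30s                      # assumed interval | ||||
|     static_configs: | ||||
|       - targets: ["otel-collector:9464"]      # assumed hostname; port matches the exporter endpoint | ||||
| ``` | ||||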
|  | ||||
| ## Operational Checklist | ||||
| - [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds. | ||||
| - [ ] Ensure Promtail captures container stdout with Serilog structured formatting. | ||||
| - [ ] Periodically validate alert noise by running load tests that trigger the rate limiter. | ||||
| - [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change. | ||||