Add authority bootstrap flows and Concelier ops runbooks

2025-10-15 10:03:56 +03:00
parent ea8226120c
commit 0ba025022f
276 changed files with 21674 additions and 934 deletions
--- a/docs/security/rate-limits.md
+++ b/docs/security/rate-limits.md
@@ -0,0 +1,76 @@
+# StellaOps Authority Rate Limit Guidance
+
+StellaOps Authority applies fixed-window rate limiting to critical endpoints so that brute-force and burst traffic are throttled before they can exhaust downstream resources. This guide complements the lockout policy documentation and captures the recommended defaults, override scenarios, and monitoring practices for `/token`, `/authorize`, and `/internal/*` routes.
+
+## Configuration Overview
+
+Rate limits live under `security.rateLimiting` in `authority.yaml` (and map to the same hierarchy for environment variables). Each endpoint exposes:
+
+- `enabled` &mdash; toggles the limiter.
+- `permitLimit` &mdash; maximum requests per fixed window.
+- `window` &mdash; window duration expressed as an `hh:mm:ss` timespan (e.g., `00:01:00` for one minute).
+- `queueLimit` &mdash; number of requests allowed to queue when the window is exhausted.
+
+```yaml
+security:
+  rateLimiting:
+    token:
+      enabled: true
+      permitLimit: 30
+      window: 00:01:00
+      queueLimit: 0
+    authorize:
+      enabled: true
+      permitLimit: 60
+      window: 00:01:00
+      queueLimit: 10
+    internal:
+      enabled: false
+      permitLimit: 5
+      window: 00:01:00
+      queueLimit: 0
+```
+
+When a limit triggers, the middleware adds a `Retry-After` header to the response and tags the corresponding log entry with `authority.endpoint`, `authority.client_id`, and `authority.remote_ip` so operators can correlate events with clients and source IPs.
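
The fixed-window semantics behind `permitLimit` and `window` can be illustrated with a small sketch. This is a simplified model, not Authority's implementation: the class and method names are hypothetical, queuing (`queueLimit`) is left out, and the window is assumed to restart on the first request after it elapses.

```python
from dataclasses import dataclass


@dataclass
class FixedWindowLimiter:
    """Hypothetical model of a fixed-window limiter (queueLimit = 0)."""

    permit_limit: int        # corresponds to permitLimit
    window_seconds: float    # corresponds to window, in seconds
    window_start: float = 0.0
    used: int = 0

    def try_acquire(self, now: float) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for a request at time `now`."""
        # Start a fresh window once the current one has fully elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        if self.used < self.permit_limit:
            self.used += 1
            return True, 0.0
        # Rejected: the caller may retry when the current window ends.
        return False, self.window_start + self.window_seconds - now


# The `token` defaults above: 30 requests per 60 seconds, no queue.
limiter = FixedWindowLimiter(permit_limit=30, window_seconds=60)
outcomes = [limiter.try_acquire(now=10.0) for _ in range(31)]
```

In this model the 31st request inside the same window is rejected, and the remaining window time is what a `Retry-After` header would carry.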
+
+Environment overrides follow the same hierarchy. For example:
+
+```
+STELLAOPS_AUTHORITY__SECURITY__RATELIMITING__TOKEN__PERMITLIMIT=60
+STELLAOPS_AUTHORITY__SECURITY__RATELIMITING__TOKEN__WINDOW=00:01:00
+```
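
The mapping from a YAML path to an environment variable name can be sketched as follows. The `STELLAOPS_AUTHORITY` prefix and the double-underscore separator are inferred from the two examples above; the helper name `env_var_name` is hypothetical.

```python
def env_var_name(config_path: str, prefix: str = "STELLAOPS_AUTHORITY") -> str:
    """Translate a dotted config path into the environment variable form.

    Assumes each path segment is upper-cased and joined with a double
    underscore, as in the examples above.
    """
    segments = [segment.upper() for segment in config_path.split(".")]
    return "__".join([prefix] + segments)


name = env_var_name("security.rateLimiting.authorize.queueLimit")
# name == "STELLAOPS_AUTHORITY__SECURITY__RATELIMITING__AUTHORIZE__QUEUELIMIT"
```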
+
+## Recommended Profiles
+
+| Scenario | permitLimit | window | queueLimit | Notes |
+|----------|-------------|--------|------------|-------|
+| Default production | 30 | 60s | 0 | Balances anonymous quota (33 scans/day) with headroom for tenant bursts. |
+| High-trust clustered IPs | 60 | 60s | 5 | Requires a WAF allowlist plus an alert confirming that `aspnetcore_rate_limiting_rejections_total{limiter="authority-token"}` stays at or below 1% of requests, sustained. |
+| Air-gapped lab | 10 | 120s | 0 | A lower request rate reduces noise when running from shared bastion hosts. |
+| Incident lockdown | 5 | 300s | 0 | Pair with credential lockout limit of 3 attempts and SOC paging for each denial. |
+
+### Lockout Interplay
+
+- Rate limiting throttles by IP/client; lockout policies apply per subject. Keep both enabled.
+- During lockdown scenarios, reduce `security.lockout.maxFailures` alongside the rate limits above so that subjects face quicker escalation.
+- Map support playbooks to the observed `Retry-After` value: anything above 120 seconds should trigger manual investigation before re-enabling clients.
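
A support playbook could apply the 120-second rule with a check like the one below. This is a sketch only: the function names are hypothetical, and it assumes the header may arrive either as delta-seconds or as an HTTP-date, the two forms HTTP allows for `Retry-After`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value: str, now: datetime) -> float:
    """Convert a Retry-After header (delta-seconds or HTTP-date) to seconds."""
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    # HTTP-date form: measure the distance from `now`, never negative.
    return max(0.0, (parsedate_to_datetime(value) - now).total_seconds())


def needs_manual_investigation(header_value: str, now: datetime,
                               threshold_seconds: float = 120.0) -> bool:
    """True when the observed Retry-After exceeds the playbook threshold."""
    return retry_after_seconds(header_value, now) > threshold_seconds


now = datetime(2025, 10, 15, 7, 0, 0, tzinfo=timezone.utc)
escalate = needs_manual_investigation("Wed, 15 Oct 2025 07:05:00 GMT", now)
```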
+
+## Monitoring and Alerts
+
+1. **Metrics**
+   - `aspnetcore_rate_limiting_rejections_total{limiter="authority-token"}` for `/token`.
+   - `aspnetcore_rate_limiting_rejections_total{limiter="authority-authorize"}` for `/authorize`.
+   - Custom counters derived from the structured log tags (`authority.remote_ip`, `authority.client_id`).
+2. **Dashboards**
+   - Requests vs. rejections per endpoint.
+   - Top offending clients/IP ranges in the current window.
+   - Heatmap of retry-after durations to spot persistent throttling.
+3. **Alerts**
+   - Notify SOC when the 429 rate exceeds 25% for five consecutive minutes on any limiter.
+   - Trigger client-specific alerts when a single `client_id` produces more than 100 throttle events per hour.
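
The two alert conditions above can be expressed as plain predicates, which is handy for unit-testing thresholds before encoding them in a monitoring system. This is a sketch with hypothetical function names and input shapes; a real deployment would express the same logic in its alerting rules.

```python
from collections import Counter


def should_page_soc(per_minute, threshold=0.25, consecutive=5):
    """per_minute: (total_requests, rejected_requests) pairs, oldest first.

    Returns True once the 429 rate exceeds `threshold` for `consecutive`
    minutes in a row.
    """
    streak = 0
    for total, rejected in per_minute:
        rate = rejected / total if total else 0.0
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


def noisy_clients(throttled_client_ids, limit=100):
    """Given the client_id of every throttle event in the last hour,
    return the clients that produced more than `limit` events."""
    counts = Counter(throttled_client_ids)
    return sorted(cid for cid, n in counts.items() if n > limit)
```

Both comparisons are strict: a minute at exactly 25% does not extend the streak, and a client with exactly 100 events is not flagged, which matches the wording "exceeds" and "more than".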
+
+## Operational Checklist
+
+- Validate updated limits in staging before production rollout; smoke-test with a representative workload.
+- When raising limits, confirm audit events continue to capture `authority.client_id`, `authority.remote_ip`, and correlation IDs for throttle responses.
+- Document any overrides in the change log, including justification and expiry review date.