feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
docs/modules/authority/AGENTS.md
| @@ -0,0 +1,22 @@ | ||||
| # Authority agent guide | ||||
|  | ||||
| ## Mission | ||||
| Authority is the platform OIDC/OAuth2 control plane that mints short-lived, sender-constrained operational tokens (OpToks) for every StellaOps service and tool. | ||||
|  | ||||
| ## Key docs | ||||
| - [Module README](./README.md) | ||||
| - [Architecture](./architecture.md) | ||||
| - [Implementation plan](./implementation_plan.md) | ||||
| - [Task board](./TASKS.md) | ||||
|  | ||||
| ## How to get started | ||||
| 1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module. | ||||
| 2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED). | ||||
| 3. Read the architecture and README for domain context before editing code or docs. | ||||
| 4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan. | ||||
|  | ||||
| ## Guardrails | ||||
| - Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md). | ||||
| - Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts. | ||||
| - Keep Offline Kit parity in mind—document air-gapped workflows for any new feature. | ||||
| - Update runbooks/observability assets when operational characteristics change. | ||||
							
								
								
									
docs/modules/authority/README.md
| @@ -0,0 +1,40 @@ | ||||
| # StellaOps Authority | ||||
|  | ||||
| Authority is the platform OIDC/OAuth2 control plane that mints short-lived, sender-constrained operational tokens (OpToks) for every StellaOps service and tool. | ||||
|  | ||||
| ## Responsibilities | ||||
| - Expose device-code, auth-code, and client-credential flows with DPoP or mTLS binding. | ||||
| - Manage signing keys, JWKS rotation, and PoE integration for plan enforcement. | ||||
| - Emit structured audit events and enforce tenant-aware scope policies. | ||||
| - Provide plugin surface for custom identity providers and credential validators. | ||||
|  | ||||
| ## Key components | ||||
| - `StellaOps.Authority` web host. | ||||
| - `StellaOps.Authority.Plugin.*` extensions for secret stores, identity bridges, and OpTok validation. | ||||
| - Telemetry and audit pipeline feeding Security/Observability stacks. | ||||
|  | ||||
| ## Integrations & dependencies | ||||
| - Signer/Attestor for PoE and OpTok introspection. | ||||
| - CLI/UI for login flows and token management. | ||||
| - Scheduler/Scanner for machine-to-machine scope enforcement. | ||||
|  | ||||
| ## Operational notes | ||||
| - MongoDB for tenant, client, and token state. | ||||
| - Key material in KMS/HSM with rotation runbooks (see ./operations/key-rotation.md). | ||||
| - Grafana/Prometheus dashboards for auth latency/issuance. | ||||
|  | ||||
| ## Related resources | ||||
| - ./operations/backup-restore.md | ||||
| - ./operations/key-rotation.md | ||||
| - ./operations/monitoring.md | ||||
| - ./operations/grafana-dashboard.json | ||||
|  | ||||
| ## Backlog references | ||||
| - DOCS-SEC-62-001 (scope hardening doc) in ../../TASKS.md. | ||||
| - AUTH-POLICY-20-001/002 follow-ups in src/Authority/StellaOps.Authority/TASKS.md. | ||||
|  | ||||
| ## Epic alignment | ||||
| - **Epic 1 – AOC enforcement:** enforce OpTok scopes and guardrails supporting raw ingestion boundaries. | ||||
| - **Epic 2 – Policy Engine & Editor:** supply policy evaluation/principal scopes and short-lived tokens for evaluator workflows. | ||||
| - **Epic 4 – Policy Studio:** integrate approval/promotion signatures and policy registry access controls. | ||||
| - **Epic 14 – Identity & Tenancy:** deliver tenant isolation, RBAC hierarchies, and governance tooling for authentication. | ||||
							
								
								
									
docs/modules/authority/TASKS.md
| @@ -0,0 +1,9 @@ | ||||
| # Task board — Authority | ||||
|  | ||||
| > Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable. | ||||
|  | ||||
| | ID | Status | Owner(s) | Description | Notes | | ||||
| |----|--------|----------|-------------|-------| | ||||
| | AUTHORITY-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md | | ||||
| | AUTHORITY-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md | | ||||
| | AUTHORITY-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow | | ||||
							
								
								
									
docs/modules/authority/architecture.md
| @@ -0,0 +1,445 @@ | ||||
| # component_architecture_authority.md — **Stella Ops Authority** (2025Q4) | ||||
|  | ||||
| > Consolidates identity and tenancy requirements documented across the AOC, Policy, and Platform guides, along with the dedicated Authority implementation plan. | ||||
|  | ||||
| > **Scope.** Implementation‑ready architecture for **Stella Ops Authority**: the on‑prem **OIDC/OAuth2** service that issues **short‑lived, sender‑constrained operational tokens (OpToks)** to first‑party services and tools. Covers protocols (DPoP & mTLS binding), token shapes, endpoints, storage, rotation, HA, RBAC, audit, and testing. This component is the trust anchor for *who* is calling inside a Stella Ops installation. (Entitlement is proven separately by **PoE** from the cloud Licensing Service; Authority does not issue PoE.) | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 0) Mission & boundaries | ||||
|  | ||||
| **Mission.** Provide **fast, local, verifiable** authentication for Stella Ops microservices and tools by minting **very short‑lived** OAuth2/OIDC tokens that are **sender‑constrained** (DPoP or mTLS‑bound). Support RBAC scopes, multi‑tenant claims, and deterministic validation for APIs (Scanner, Signer, Attestor, Excititor, Concelier, UI, CLI, Zastava). | ||||
|  | ||||
| **Boundaries.** | ||||
|  | ||||
| * Authority **does not** validate entitlements/licensing. That’s enforced by **Signer** using **PoE** with the cloud Licensing Service. | ||||
| * Authority tokens are **operational only** (2–5 min TTL) and must not be embedded in long‑lived artifacts or stored in SBOMs. | ||||
| * Authority is **stateless for validation** (JWT) and offers **optional introspection** for services that prefer online checks. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1) Protocols & cryptography | ||||
|  | ||||
| * **OIDC Discovery**: `/.well-known/openid-configuration` | ||||
| * **OAuth2** grant types: | ||||
|  | ||||
|   * **Client Credentials** (service↔service, with mTLS or private_key_jwt) | ||||
|   * **Device Code** (CLI login on headless agents; optional) | ||||
|   * **Authorization Code + PKCE** (browser login for UI; optional) | ||||
| * **Sender constraint options** (choose per caller or per audience): | ||||
|  | ||||
|   * **DPoP** (Demonstration of Proof‑of‑Possession): proof JWT on each HTTP request, bound to the access token via `cnf.jkt`. | ||||
|   * **OAuth 2.0 mTLS** (certificate‑bound tokens): token bound to client certificate thumbprint via `cnf.x5t#S256`. | ||||
| * **Signing algorithms**: **EdDSA (Ed25519)** preferred; fallback **ES256 (P‑256)**. Rotation is supported via **kid** in JWKS. | ||||
| * **Token format**: **JWT** access tokens (compact), optionally opaque reference tokens for services that insist on introspection. | ||||
| * **Clock skew tolerance**: ±60 s; issue `nbf`, `iat`, `exp` accordingly. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 2) Token model | ||||
|  | ||||
| ### 2.1 Access token (OpTok) — short‑lived (120–300 s) | ||||
|  | ||||
| **Registered claims** | ||||
|  | ||||
| ``` | ||||
| iss   = https://authority.<domain> | ||||
| sub   = <client_id or user_id> | ||||
| aud   = <service audience: signer|scanner|attestor|concelier|excititor|ui|zastava> | ||||
| exp   = <unix ts>  (<= 300 s from iat) | ||||
| iat   = <unix ts> | ||||
| nbf   = iat - 30 | ||||
| jti   = <uuid> | ||||
| scope = "scanner.scan scanner.export signer.sign ..." | ||||
| ``` | ||||
|  | ||||
| **Sender‑constraint (`cnf`)** | ||||
|  | ||||
| * **DPoP**: | ||||
|  | ||||
|   ```json | ||||
|   "cnf": { "jkt": "<base64url(SHA-256(JWK))>" } | ||||
|   ``` | ||||
| * **mTLS**: | ||||
|  | ||||
|   ```json | ||||
|   "cnf": { "x5t#S256": "<base64url(SHA-256(client_cert_der))>" } | ||||
|   ``` | ||||
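The two `cnf` values above are plain SHA-256 digests, so they can be reproduced with the standard library. The following is a minimal sketch, assuming the `jkt` value follows the RFC 7638 JWK thumbprint rules (required members only, sorted, no whitespace); the helper names `b64url`, `jwk_thumbprint`, and `cert_thumbprint` are illustrative and not part of Authority.

```python
import base64
import hashlib
import json

# Required members per key type, as listed in RFC 7638 (OKP comes from RFC 8037).
_REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "OKP": ("crv", "kty", "x"),
    "RSA": ("e", "kty", "n"),
}


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used throughout JOSE."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """Value for cnf.jkt: SHA-256 over the canonical JSON of the required members."""
    members = _REQUIRED_MEMBERS[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def cert_thumbprint(cert_der: bytes) -> str:
    """Value for cnf.x5t#S256: SHA-256 over the DER-encoded client certificate."""
    return b64url(hashlib.sha256(cert_der).digest())
```

Because only the required members are hashed, optional JWK fields such as `kid` or `use` do not change the thumbprint, which keeps the binding stable across re-serialisation of the same key.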
|  | ||||
| **Install/tenant context (custom claims)** | ||||
|  | ||||
| ``` | ||||
| tid          = <tenant id>               // multi-tenant | ||||
| inst         = <installation id>        // unique installation | ||||
| roles        = [ "svc.scanner", "svc.signer", "ui.admin", ... ] | ||||
| plan?        = <plan name>              // optional hint for UIs; not used for enforcement | ||||
| ``` | ||||
|  | ||||
| > **Note**: Do **not** copy PoE claims into OpTok; OpTok ≠ entitlement. Only **Signer** checks PoE. | ||||
|  | ||||
| ### 2.2 Refresh tokens (optional) | ||||
|  | ||||
| * Default **disabled**. If enabled (for UI interactive logins), pair with **DPoP‑bound** refresh tokens or **mTLS** client sessions; short TTL (≤ 8 h), rotating on use (replay‑safe). | ||||
|  | ||||
| ### 2.3 ID tokens (optional) | ||||
|  | ||||
| * Issued for UI/browser OIDC flows (Authorization Code + PKCE); not used for service auth. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 3) Endpoints & flows | ||||
|  | ||||
| ### 3.1 OIDC discovery & keys | ||||
|  | ||||
| * `GET /.well-known/openid-configuration` → endpoints, algs, jwks_uri | ||||
| * `GET /jwks` → JSON Web Key Set (rotating, at least 2 active keys during transition) | ||||
|  | ||||
| ### 3.2 Token issuance | ||||
|  | ||||
| * `POST /oauth/token` | ||||
|  | ||||
|   * **Client Credentials** (service→service): | ||||
|  | ||||
|     * **mTLS**: mutual TLS + `client_id` → bound token (`cnf.x5t#S256`) | ||||
|       * `security.senderConstraints.mtls.enforceForAudiences` forces the mTLS path when requested `aud`/`resource` values intersect high-value audiences (defaults include `signer`). Authority rejects clients attempting to use DPoP/basic secrets for these audiences. | ||||
|       * Stored `certificateBindings` are authoritative: thumbprint, subject, issuer, serial number, and SAN values are matched against the presented certificate, with rotation grace applied to activation windows. Failures surface deterministic error codes (e.g. `certificate_binding_subject_mismatch`). | ||||
|     * **private_key_jwt**: JWT‑based client auth + **DPoP** header (preferred for tools and CLI) | ||||
|   * **Device Code** (CLI): `POST /oauth/device/code` + `POST /oauth/token` poll | ||||
|   * **Authorization Code + PKCE** (UI): standard | ||||
|  | ||||
| **DPoP handshake (example)** | ||||
|  | ||||
| 1. Client prepares **JWK** (ephemeral keypair). | ||||
| 2. Client sends **DPoP proof** header with fields: | ||||
|  | ||||
|    ``` | ||||
|    htm=POST | ||||
|    htu=https://authority.../oauth/token | ||||
|    iat=<now> | ||||
|    jti=<uuid> | ||||
|    ``` | ||||
|  | ||||
|    signed with the DPoP private key; header carries JWK. | ||||
| 3. Authority validates proof; issues access token with `cnf.jkt=<thumbprint(JWK)>`. | ||||
| 4. Client uses the same DPoP key to sign **every subsequent API request** to services (Signer, Scanner, …). | ||||
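The proof described in steps 1 to 3 can be sketched as follows. This only assembles the unsigned part of the proof (the JWS signing input); producing the signature needs a crypto library and is left out. The `typ` value `dpop+jwt` and the optional `nonce` claim follow RFC 9449; the function name `dpop_signing_input` and the example URL are assumptions for illustration.

```python
import base64
import json
import time
import uuid


def _b64url_json(obj: dict) -> str:
    """Serialise to compact JSON and base64url-encode without padding."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def dpop_signing_input(public_jwk, method, url, alg="ES256", nonce=None, now=None):
    """Return '<header>.<payload>' for a DPoP proof; the caller signs this string."""
    header = {"typ": "dpop+jwt", "alg": alg, "jwk": public_jwk}
    payload = {
        "jti": str(uuid.uuid4()),   # unique per proof, feeds the replay cache
        "htm": method.upper(),      # HTTP method of the request being proven
        "htu": url,                 # target URI of the request being proven
        "iat": int(time.time() if now is None else now),
    }
    if nonce is not None:
        payload["nonce"] = nonce    # echoed back after a server nonce challenge
    return _b64url_json(header) + "." + _b64url_json(payload)
```

A caller would sign the returned string with the private half of `public_jwk` and append the base64url signature as the third segment before sending it in the `DPoP` request header.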
|  | ||||
| **mTLS flow** | ||||
|  | ||||
| * Mutual TLS at the connection; Authority extracts client cert, validates chain; token carries `cnf.x5t#S256`. | ||||
|  | ||||
| ### 3.3 Introspection & revocation (optional) | ||||
|  | ||||
| * `POST /oauth/introspect` → `{ active, sub, scope, aud, exp, cnf, ... }` | ||||
| * `POST /oauth/revoke` → revokes refresh tokens or opaque access tokens. | ||||
| * **Replay prevention**: maintain **DPoP `jti` cache** (TTL ≤ 10 min) to reject duplicate proofs when services supply DPoP nonces (Signer requires nonce for high‑value operations). | ||||
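The `jti` cache can be pictured with a small in-memory stand-in. This is a sketch under stated assumptions: a real deployment would use the shared cache (Redis in this document) rather than a per-process dictionary, and the class name `JtiReplayCache` is hypothetical. The default TTL of 600 s mirrors the "≤ 10 min" bound above.

```python
import time


class JtiReplayCache:
    """In-memory stand-in for the shared jti cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # jti -> expiry time on the injected clock

    def register(self, jti: str) -> bool:
        """Return True if the jti is fresh, False if it is a replay."""
        now = self._clock()
        # Drop expired entries so the cache does not grow without bound.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if jti in self._seen:
            return False
        self._seen[jti] = now + self._ttl
        return True
```

Injecting the clock keeps the behaviour deterministic in tests, in line with the determinism guardrail in the agent guide.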
|  | ||||
| ### 3.4 UserInfo (optional for UI) | ||||
|  | ||||
| * `GET /userinfo` (ID token context). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 4) Audiences, scopes & RBAC | ||||
|  | ||||
| ### 4.1 Audiences | ||||
|  | ||||
| * `signer` — only the **Signer** service should accept tokens with `aud=signer`. | ||||
| * `attestor`, `scanner`, `concelier`, `excititor`, `ui`, `zastava` similarly. | ||||
|  | ||||
| Services **must** verify `aud` and **sender constraint** (DPoP/mTLS) per their policy. | ||||
|  | ||||
| ### 4.2 Core scopes | ||||
|  | ||||
| | Scope                              | Service            | Operation                  | | ||||
| | ---------------------------------- | ------------------ | -------------------------- | | ||||
| | `signer.sign`                      | Signer             | Request DSSE signing       | | ||||
| | `attestor.write`                   | Attestor           | Submit Rekor entries       | | ||||
| | `scanner.scan`                     | Scanner.WebService | Submit scan jobs           | | ||||
| | `scanner.export`                   | Scanner.WebService | Export SBOMs               | | ||||
| | `scanner.read`                     | Scanner.WebService | Read catalog/SBOMs         | | ||||
| | `vex.read` / `vex.admin`           | Excititor              | Query/operate              | | ||||
| | `concelier.read` / `concelier.export`  | Concelier            | Query/exports              | | ||||
| | `ui.read` / `ui.admin`             | UI                 | View/admin                 | | ||||
| | `zastava.emit` / `zastava.enforce` | Scanner/Zastava    | Runtime events / admission | | ||||
|  | ||||
| **Roles → scopes mapping** is configured centrally (Authority policy) and pushed during token issuance. | ||||
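One way to picture the role to scope resolution at issuance is the sketch below. The mapping contents are hypothetical examples built from the role and scope names used in this document, and the choice to silently drop unrequested or unauthorised scopes (rather than fail with `invalid_scope`) is an assumption, not confirmed Authority behaviour.

```python
# Hypothetical role -> scope policy; real mappings live in Authority policy.
ROLE_SCOPES = {
    "svc.scanner": {"scanner.scan", "scanner.export", "scanner.read"},
    "svc.signer": {"signer.sign"},
    "ui.admin": {"ui.read", "ui.admin"},
}


def grant_scopes(roles, requested):
    """Return the space-delimited scope claim: requested scopes the roles allow."""
    allowed = set()
    for role in roles:
        allowed |= ROLE_SCOPES.get(role, set())
    # Sort so the emitted claim is deterministic across runs.
    return " ".join(sorted(allowed & set(requested)))
```

For example, a client holding only `svc.scanner` that requests `scanner.scan` and `signer.sign` would receive just `scanner.scan` under this sketch.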
|  | ||||
| --- | ||||
|  | ||||
| ## 5) Storage & state | ||||
|  | ||||
| * **Configuration DB** (PostgreSQL/MySQL): clients, audiences, role→scope maps, tenant/installation registry, device code grants, persistent consents (if any). | ||||
| * **Cache** (Redis): | ||||
|  | ||||
|   * DPoP **jti** replay cache (short TTL) | ||||
|   * **Nonce** store (per resource server, if they demand nonce) | ||||
|   * Device code pollers, rate limiting buckets | ||||
| * **JWKS**: key material in HSM/KMS or encrypted at rest; JWKS served from memory. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6) Key management & rotation | ||||
|  | ||||
| * Maintain **at least 2 signing keys** active during rotation; tokens carry `kid`. | ||||
| * Prefer **Ed25519** for compact tokens; maintain **ES256** fallback for FIPS contexts. | ||||
| * Rotation cadence: 30–90 days; emergency rotation supported. | ||||
| * Publish new JWKS **before** issuing tokens with the new `kid` to avoid cold‑start validation misses. | ||||
| * Keep **old keys** available **at least** for max token TTL + 5 minutes. | ||||
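The retention rule in the last bullet is simple arithmetic; a worked sketch follows, assuming the 300 s maximum OpTok TTL stated in the token model. The function name `earliest_key_removal` is illustrative only.

```python
from datetime import datetime, timedelta, timezone


def earliest_key_removal(
    last_issued_at: datetime,
    max_token_ttl: timedelta = timedelta(seconds=300),
    margin: timedelta = timedelta(minutes=5),
) -> datetime:
    """Earliest time an old key may leave JWKS: last issuance + max TTL + margin."""
    return last_issued_at + max_token_ttl + margin


# The old key signed its last token at midnight UTC, so it stays published
# until at least ten minutes past midnight.
cutoff = earliest_key_removal(datetime(2025, 1, 1, tzinfo=timezone.utc))
```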
|  | ||||
| --- | ||||
|  | ||||
| ## 7) HA & performance | ||||
|  | ||||
| * **Stateless issuance** (except device codes/refresh) → scale horizontally behind a load‑balancer. | ||||
| * **DB** only for client metadata and optional flows; token checks are JWT‑local; introspection endpoints hit cache/DB minimally. | ||||
| * **Targets**: | ||||
|  | ||||
|   * Token issuance P95 ≤ **20 ms** under warm cache. | ||||
|   * DPoP proof validation ≤ **1 ms** extra per request at resource servers (Signer/Scanner). | ||||
|   * 99.9% uptime; HPA on CPU/latency. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 8) Security posture | ||||
|  | ||||
| * **Strict TLS** (1.3 preferred); HSTS; modern cipher suites. | ||||
| * **mTLS** enabled where required (Signer/Attestor paths). | ||||
| * **Replay protection**: DPoP `jti` cache, nonce support for **Signer** (add `DPoP-Nonce` header on 401; clients re‑sign). | ||||
| * **Rate limits** per client & per IP; exponential backoff on failures. | ||||
| * **Secrets**: clients use **private_key_jwt** or **mTLS**; never basic secrets over the wire. | ||||
| * **CSP/CSRF** hardening on UI flows; `SameSite=Lax` cookies; PKCE enforced. | ||||
| * **Logs** redact `Authorization` and DPoP proofs; store `sub`, `aud`, `scopes`, `inst`, `tid`, `cnf` thumbprints, not full keys. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 9) Multi‑tenancy & installations | ||||
|  | ||||
| * **Tenant (`tid`)** and **Installation (`inst`)** registries define which audiences/scopes a client can request. | ||||
| * Cross‑tenant isolation enforced at issuance (disallow rogue `aud`), and resource servers **must** check that `tid` matches their configured tenant. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 10) Admin & operations APIs | ||||
|  | ||||
| All under `/admin` (mTLS + `authority.admin` scope). | ||||
|  | ||||
| ``` | ||||
| POST /admin/clients                 # create/update client (confidential/public) | ||||
| POST /admin/audiences               # register audience resource URIs | ||||
| POST /admin/roles                   # define role→scope mappings | ||||
| POST /admin/tenants                 # create tenant/install entries | ||||
| POST /admin/keys/rotate             # rotate signing key (zero-downtime) | ||||
| GET  /admin/metrics                 # Prometheus exposition (token issue rates, errors) | ||||
| GET  /admin/healthz|readyz          # health/readiness | ||||
| ``` | ||||
|  | ||||
| Declared client `audiences` flow through to the issued JWT `aud` claim and the token request's `resource` indicators. Authority relies on this metadata to enforce DPoP nonce challenges for `signer`, `attestor`, and other high-value services without requiring clients to repeat the audience parameter on every request. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 11) Integration hard lines (what resource servers must enforce) | ||||
|  | ||||
| Every Stella Ops service that consumes Authority tokens **must**: | ||||
|  | ||||
| 1. Verify JWT signature (`kid` in JWKS), `iss`, `aud`, `exp`, `nbf`. | ||||
| 2. Enforce **sender‑constraint**: | ||||
|  | ||||
|    * **DPoP**: validate DPoP proof (`htu`, `htm`, `iat`, `jti`) and match `cnf.jkt`; cache `jti` for replay defense; honor nonce challenges. | ||||
|    * **mTLS**: match presented client cert thumbprint to token `cnf.x5t#S256`. | ||||
| 3. Check **scopes**; optionally map to internal roles. | ||||
| 4. Check **tenant** (`tid`) and **installation** (`inst`) as appropriate. | ||||
| 5. For **Signer** only: require **both** OpTok and **PoE** in the request (enforced by Signer, not Authority). | ||||
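The claim-level part of this checklist can be sketched as below. It assumes the JWT signature has already been verified and the payload decoded into a dictionary; signature verification and the DPoP/mTLS proof check themselves are out of scope here. `check_claims` and `TokenRejected` are hypothetical names, not part of any StellaOps SDK.

```python
import time

CLOCK_SKEW_SECONDS = 60  # matches the ±60 s tolerance under Protocols & cryptography


class TokenRejected(Exception):
    """Raised when a claim check fails; the message names the failing check."""


def check_claims(claims, *, issuer, audience, required_scope, tenant, now=None):
    """Apply the iss/aud/exp/nbf/scope/tid checks to an already-verified payload."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        raise TokenRejected("iss mismatch")
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        raise TokenRejected("aud mismatch")
    if now > claims.get("exp", 0) + CLOCK_SKEW_SECONDS:
        raise TokenRejected("expired")
    if now < claims.get("nbf", 0) - CLOCK_SKEW_SECONDS:
        raise TokenRejected("not yet valid")
    if required_scope not in claims.get("scope", "").split():
        raise TokenRejected("missing scope")
    if claims.get("tid") != tenant:
        raise TokenRejected("tenant mismatch")
    if "cnf" not in claims:
        # Bearer tokens without a sender constraint are refused outright.
        raise TokenRejected("missing cnf")
```

Raising a dedicated exception with a short reason keeps failures deterministic and easy to map onto the `invalid_token` error surface.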
|  | ||||
| --- | ||||
|  | ||||
| ## 12) Error surfaces & UX | ||||
|  | ||||
| * Token endpoint errors follow OAuth2 (`invalid_client`, `invalid_grant`, `invalid_scope`, `unauthorized_client`). | ||||
| * Resource servers use RFC 6750-style challenges (`WWW-Authenticate: DPoP error="invalid_token", error_description="…", dpop_nonce="…"`). | ||||
| * For DPoP nonce challenges, clients retry with the server‑supplied nonce once. | ||||
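The single-retry rule can be sketched on the client side as follows. Both callables are hypothetical stand-ins: `send(proof)` is assumed to return a `(status, headers, body)` tuple, and `make_proof(nonce)` is assumed to build a fresh signed proof, optionally embedding the nonce. Reading the nonce from a `DPoP-Nonce` response header follows the security posture section; a real HTTP client would also need case-insensitive header lookup.

```python
def call_with_nonce_retry(send, make_proof):
    """Send a request; on a 401 that carries a nonce, re-sign and retry exactly once."""
    status, headers, body = send(make_proof(None))
    nonce = headers.get("DPoP-Nonce")
    if status == 401 and nonce:
        # One retry only: a second challenge is surfaced to the caller as-is.
        status, headers, body = send(make_proof(nonce))
    return status, body
```

Capping the retry at one attempt avoids a loop when the server keeps rejecting the proof for a reason other than a missing nonce.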
|  | ||||
| --- | ||||
|  | ||||
| ## 13) Observability & audit | ||||
|  | ||||
| * **Metrics**: | ||||
|  | ||||
|   * `authority.tokens_issued_total{grant,aud}` | ||||
|   * `authority.dpop_validations_total{result}` | ||||
|   * `authority.mtls_bindings_total{result}` | ||||
|   * `authority.jwks_rotations_total` | ||||
|   * `authority.errors_total{type}` | ||||
| * **Audit log** (immutable sink): token issuance (`sub`, `aud`, `scopes`, `tid`, `inst`, `cnf thumbprint`, `jti`), revocations, admin changes. | ||||
| * **Tracing**: token flows, DB reads, JWKS cache. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 14) Configuration (YAML) | ||||
|  | ||||
| ```yaml | ||||
| authority: | ||||
|   issuer: "https://authority.internal" | ||||
|   signing: | ||||
|     enabled: true | ||||
|     activeKeyId: "authority-signing-2025" | ||||
|     keyPath: "../certificates/authority-signing-2025.pem" | ||||
|     algorithm: "ES256" | ||||
|     keySource: "file" | ||||
|   security: | ||||
|     rateLimiting: | ||||
|       token: | ||||
|         enabled: true | ||||
|         permitLimit: 30 | ||||
|         window: "00:01:00" | ||||
|         queueLimit: 0 | ||||
|       authorize: | ||||
|         enabled: true | ||||
|         permitLimit: 60 | ||||
|         window: "00:01:00" | ||||
|         queueLimit: 10 | ||||
|       internal: | ||||
|         enabled: false | ||||
|         permitLimit: 5 | ||||
|         window: "00:01:00" | ||||
|         queueLimit: 0 | ||||
|     senderConstraints: | ||||
|       dpop: | ||||
|         enabled: true | ||||
|         allowedAlgorithms: [ "ES256", "ES384" ] | ||||
|         proofLifetime: "00:02:00" | ||||
|         allowedClockSkew: "00:00:30" | ||||
|         replayWindow: "00:05:00" | ||||
|         nonce: | ||||
|           enabled: true | ||||
|           ttl: "00:10:00" | ||||
|           maxIssuancePerMinute: 120 | ||||
|           store: "redis" | ||||
|           redisConnectionString: "redis://authority-redis:6379?ssl=false" | ||||
|           requiredAudiences: | ||||
|             - "signer" | ||||
|             - "attestor" | ||||
|       mtls: | ||||
|         enabled: true | ||||
|         requireChainValidation: true | ||||
|         rotationGrace: "00:15:00" | ||||
|         enforceForAudiences: | ||||
|           - "signer" | ||||
|         allowedSanTypes: | ||||
|           - "dns" | ||||
|           - "uri" | ||||
|         allowedCertificateAuthorities: | ||||
|           - "/etc/ssl/mtls/clients-ca.pem" | ||||
|   clients: | ||||
|     - clientId: scanner-web | ||||
|       grantTypes: [ "client_credentials" ] | ||||
|       audiences: [ "scanner" ] | ||||
|       auth: { type: "private_key_jwt", jwkFile: "/secrets/scanner-web.jwk" } | ||||
|       senderConstraint: "dpop" | ||||
|       scopes: [ "scanner.scan", "scanner.export", "scanner.read" ] | ||||
|     - clientId: signer | ||||
|       grantTypes: [ "client_credentials" ] | ||||
|       audiences: [ "signer" ] | ||||
|       auth: { type: "mtls" } | ||||
|       senderConstraint: "mtls" | ||||
|       scopes: [ "signer.sign" ] | ||||
|     - clientId: notify-web-dev | ||||
|       grantTypes: [ "client_credentials" ] | ||||
|       audiences: [ "notify.dev" ] | ||||
|       auth: { type: "client_secret", secretFile: "/secrets/notify-web-dev.secret" } | ||||
|       senderConstraint: "dpop" | ||||
|       scopes: [ "notify.read", "notify.admin" ] | ||||
|     - clientId: notify-web | ||||
|       grantTypes: [ "client_credentials" ] | ||||
|       audiences: [ "notify" ] | ||||
|       auth: { type: "client_secret", secretFile: "/secrets/notify-web.secret" } | ||||
|       senderConstraint: "dpop" | ||||
|       scopes: [ "notify.read", "notify.admin" ] | ||||
| ``` | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 15) Testing matrix | ||||
|  | ||||
| * **JWT validation**: wrong `aud`, expired `exp`, skewed `nbf`, stale `kid`. | ||||
| * **DPoP**: invalid `htu`/`htm`, replayed `jti`, stale `iat`, wrong `jkt`, nonce dance. | ||||
| * **mTLS**: wrong client cert, wrong CA, thumbprint mismatch. | ||||
| * **RBAC**: scope enforcement per audience; over‑privileged client denied. | ||||
| * **Rotation**: JWKS rotation while load‑testing; zero‑downtime verification. | ||||
| * **HA**: kill one Authority instance; verify issuance continues; JWKS served by peers. | ||||
| * **Performance**: 1k token issuance/sec on 2 cores with Redis enabled for jti caching. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 16) Threat model & mitigations (summary) | ||||
|  | ||||
| | Threat              | Vector           | Mitigation                                                                                 | | ||||
| | ------------------- | ---------------- | ------------------------------------------------------------------------------------------ | | ||||
| | Token theft         | Copy of JWT      | **Short TTL**, **sender‑constraint** (DPoP/mTLS); replay blocked by `jti` cache and nonces | | ||||
| | Replay across hosts | Reuse DPoP proof | Enforce `htu`/`htm`, `iat` freshness, `jti` uniqueness; services may require **nonce**     | | ||||
| | Impersonation       | Fake client      | mTLS or `private_key_jwt` with pinned JWK; client registration & rotation                  | | ||||
| | Key compromise      | Signing key leak | HSM/KMS storage, key rotation, audit; emergency key revoke path; narrow token TTL          | | ||||
| | Cross‑tenant abuse  | Scope elevation  | Enforce `aud`, `tid`, `inst` at issuance and resource servers                              | | ||||
| | Downgrade to bearer | Strip DPoP       | Resource servers require DPoP/mTLS based on `aud`; reject bearer without `cnf`             | | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 17) Deployment & HA | ||||
|  | ||||
| * **Stateless** microservice, containerized; run ≥ 2 replicas behind LB. | ||||
| * **DB**: HA Postgres (or MySQL) for clients/roles; **Redis** for device codes, DPoP nonces/jtis. | ||||
| * **Secrets**: mount client JWKs via K8s Secrets/HashiCorp Vault; signing keys via KMS. | ||||
| * **Backups**: DB daily; Redis not critical (ephemeral). | ||||
| * **Disaster recovery**: export/import of client registry; JWKS rehydrate from KMS. | ||||
| * **Compliance**: TLS audit; penetration testing for OIDC flows. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 18) Implementation notes | ||||
|  | ||||
| * Reference stack: **.NET 10** + **OpenIddict 6** (or IdentityServer if licensed) with custom DPoP validator and mTLS binding middleware. | ||||
| * Keep the DPoP/JTI cache pluggable; allow Redis/Memcached. | ||||
| * Provide **client SDKs** for C# and Go: DPoP key mgmt, proof generation, nonce handling, token refresh helper. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 19) Quick reference — wire examples | ||||
|  | ||||
| **Access token (payload excerpt)** | ||||
|  | ||||
| ```json | ||||
| { | ||||
|   "iss": "https://authority.internal", | ||||
|   "sub": "scanner-web", | ||||
|   "aud": "signer", | ||||
|   "exp": 1760668800, | ||||
|   "iat": 1760668620, | ||||
|   "nbf": 1760668590, | ||||
|   "jti": "9d9c3f01-6e1a-49f1-8f77-9b7e6f7e3c50", | ||||
|   "scope": "signer.sign", | ||||
|   "tid": "tenant-01", | ||||
|   "inst": "install-7A2B", | ||||
|   "cnf": { "jkt": "KcVb2V...base64url..." } | ||||
| } | ||||
| ``` | ||||
|  | ||||
| **DPoP proof payload claims (for POST /sign/dsse)** | ||||
|  | ||||
| ```json | ||||
| { | ||||
|   "htu": "https://signer.internal/sign/dsse", | ||||
|   "htm": "POST", | ||||
|   "iat": 1760668620, | ||||
|   "jti": "4b1c9b3c-8a95-4c58-8a92-9c6cfb4a6a0b" | ||||
| } | ||||
| ``` | ||||
|  | ||||
| Signer validates that `hash(JWK)` in the proof matches `cnf.jkt` in the token. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 20) Rollout plan | ||||
|  | ||||
| 1. **MVP**: Client Credentials (private_key_jwt + DPoP), JWKS, short OpToks, per‑audience scopes. | ||||
| 2. **Add**: mTLS‑bound tokens for Signer/Attestor; device code for CLI; optional introspection. | ||||
| 3. **Hardening**: DPoP nonce support; full audit pipeline; HA tuning. | ||||
| 4. **UX**: Tenant/installation admin UI; role→scope editors; client bootstrap wizards. | ||||
							
								
								
									
docs/modules/authority/implementation_plan.md
| @@ -0,0 +1,22 @@ | ||||
| # Implementation plan — Authority | ||||
|  | ||||
| ## Current objectives | ||||
| - Maintain deterministic behaviour and offline parity across releases. | ||||
| - Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes. | ||||
|  | ||||
| ## Workstreams | ||||
| - Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap. | ||||
| - Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs. | ||||
| - Validation: extend tests/fixtures to preserve determinism and provenance requirements. | ||||
|  | ||||
| ## Epic milestones | ||||
| - **Epic 1 – AOC enforcement:** deliver OpTok scopes, guardrails, and AOC verifier hooks for ingestion services. | ||||
| - **Epic 2 – Policy Engine & Editor:** support policy evaluator flows (device-code, client credentials, scope sandboxing). | ||||
| - **Epic 4 – Policy Studio:** provide registry/promotion signing, approvals, and fresh-auth prompts. | ||||
| - **Epic 14 – Identity & Tenancy:** implement tenant isolation, RBAC hierarchies, audit trails, and PoE integration. | ||||
| - Track additional work (DOCS-SEC-62-001, AUTH-POLICY-20-001/002) in ../../TASKS.md and src/Authority/**/TASKS.md. | ||||
|  | ||||
| ## Coordination | ||||
| - Review ./AGENTS.md before picking up new work. | ||||
| - Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md. | ||||
| - Update this plan whenever scope, dependencies, or guardrails change. | ||||
							
								
								
									
docs/modules/authority/operations/backup-restore.md
| @@ -0,0 +1,97 @@ | ||||
| # Authority Backup & Restore Runbook | ||||
|  | ||||
| ## Scope | ||||
| - **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging. | ||||
| - **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container). | ||||
| - **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24 h in production. Store snapshots in an encrypted, access-controlled vault. | ||||
|  | ||||
| ## Inventory Checklist | ||||
| | Component | Location (compose default) | Notes | | ||||
| | --- | --- | --- | | ||||
| | Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). | | ||||
| | Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. | | ||||
| | Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. | | ||||
| | Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). | | ||||
|  | ||||
| > **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`. | ||||
|  | ||||
| ## Hot Backup (no downtime) | ||||
| 1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host. | ||||
| 2. **Dump Mongo:** | ||||
|    ```bash | ||||
|    # Capture one UTC timestamp so the dump and the copy use the same file name. | ||||
|    TS=$(date -u +%Y%m%dT%H%M%SZ) | ||||
|    docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \ | ||||
|      mongodump --archive="/dump/authority-${TS}.gz" \ | ||||
|      --gzip --db stellaops-authority | ||||
|    docker compose -f ops/authority/docker-compose.authority.yaml cp \ | ||||
|      "mongo:/dump/authority-${TS}.gz" backup/ | ||||
|    ``` | ||||
|    The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`. | ||||
| 3. **Capture configuration + manifests:** | ||||
|    ```bash | ||||
|    cp etc/authority.yaml backup/ | ||||
|    rsync -a etc/authority.plugins/ backup/authority.plugins/ | ||||
|    ``` | ||||
| 4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service: | ||||
|    ```bash | ||||
|    docker run --rm \ | ||||
|      -v authority-keys:/keys \ | ||||
|      -v "$(pwd)/backup:/backup" \ | ||||
|      busybox tar czf /backup/authority-keys-$(date -u +%Y%m%dT%H%M%SZ).tar.gz -C /keys . | ||||
|    ``` | ||||
| 5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts. | ||||
| 6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault. | ||||
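
Step 5 can be sketched as follows. This is a minimal example, assuming GNU coreutils (`sha256sum`, `sort -z`, `xargs -r`) are available on the host. The helper names `write_checksums` and `verify_checksums` and the `SHA256SUMS` file name are illustrative and not part of the Authority tooling.

```bash
# Sketch only: helper names and the SHA256SUMS file name are assumptions.

# Write SHA-256 digests for every regular file under <dir> into
# <dir>/SHA256SUMS, using paths relative to that directory.
write_checksums() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # Exclude the manifest itself; sort for a stable, diff-friendly order.
    find . -type f ! -name SHA256SUMS -print0 \
      | sort -z \
      | xargs -0 -r sha256sum > SHA256SUMS
  )
}

# Re-check every digest recorded in <dir>/SHA256SUMS.
# Returns non-zero if any file is missing or has changed.
verify_checksums() {
  local dir="$1"
  ( cd "$dir" && sha256sum --check --quiet SHA256SUMS )
}
```

Running `write_checksums backup` before encryption and `verify_checksums backup` after retrieving the archive from the vault gives a quick integrity check ahead of a restore.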
|  | ||||
| ## Cold Backup (planned downtime) | ||||
| 1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards). | ||||
| 2. Stop services: | ||||
|    ```bash | ||||
|    docker compose -f ops/authority/docker-compose.authority.yaml down | ||||
|    ``` | ||||
| 3. Back up volumes directly using `tar`: | ||||
|    ```bash | ||||
|    docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \ | ||||
|      busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data . | ||||
|    docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \ | ||||
|      busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys . | ||||
|    ``` | ||||
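| 4. Copy configuration + manifests, then checksum and encrypt the artefacts, as in the hot backup (steps 3, 5, and 6); the signing keys are already captured by the `tar` step above. | ||||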
| 5. Restart services and verify health: | ||||
|    ```bash | ||||
|    docker compose -f ops/authority/docker-compose.authority.yaml up -d | ||||
|    curl -fsS http://localhost:8080/ready | ||||
|    ``` | ||||
|  | ||||
| ## Restore Procedure | ||||
| 1. **Provision clean volumes:** remove existing volumes if you’re rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist. | ||||
| 2. **Restore Mongo:** | ||||
|    ```bash | ||||
|    docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz | ||||
|    ``` | ||||
|    Use `--drop` to replace collections; omit if doing a partial restore. | ||||
| 3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container. | ||||
| 4. **Restore signing keys:** untar into the mounted volume: | ||||
|    ```bash | ||||
|    docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \ | ||||
|      busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys | ||||
|    ``` | ||||
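|    Ensure private key files keep mode `600`. Avoid a blanket `chmod -R 600` on the key directory: it also strips the execute bit from directories and makes them untraversable. Apply the mode to files only, for example `find <key-dir> -type f -exec chmod 600 {} +`. | ||||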
| 5. **Start services & validate:** | ||||
|    ```bash | ||||
|    docker compose up -d | ||||
|    curl -fsS http://localhost:8080/health | ||||
|    ``` | ||||
| 6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations. If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../../../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`. | ||||
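
The permission requirement for restored signing keys can be checked after extraction. The following is a minimal sketch, assuming private keys are stored as `*.pem` files under the key directory. Both the file pattern and the helper name `fix_key_permissions` are assumptions, not part of the Authority tooling.

```bash
# Sketch only: fix_key_permissions is an illustrative helper; the *.pem
# pattern is an assumption about how private keys are named.
fix_key_permissions() {
  local key_dir="$1"
  # Restrict key files to owner read/write without touching directories,
  # which need their execute bit to remain traversable.
  find "$key_dir" -type f -name '*.pem' -exec chmod 600 {} +
  # Report any key file whose mode is still not exactly 600.
  local loose
  loose=$(find "$key_dir" -type f -name '*.pem' ! -perm 600)
  if [ -n "$loose" ]; then
    printf 'unexpected permissions:\n%s\n' "$loose" >&2
    return 1
  fi
}
```

Call it with the directory where the key volume is mounted, for example `fix_key_permissions /path/to/keys`.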
|  | ||||
| ## Disaster Recovery Notes | ||||
| - **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning. | ||||
| - **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually. | ||||
| - **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and `docs/11_AUTHORITY.md`), and publish a revocation notice. | ||||
| - **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Driver 3.5.0 requires MongoDB **4.2+**; clusters still on 4.0 must be upgraded before restore, and future driver releases will drop 4.0 entirely. | ||||
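
The daily part of the retention rule could be enforced with a small prune helper. This is a minimal sketch, assuming daily snapshots live in directories named `YYYY-MM-DD` under one root (the layout produced by `mkdir -p backup/$(date +%Y-%m-%d)` in the hot backup). The function name `prune_daily_snapshots` is hypothetical, and monthly archival copies are assumed to be kept elsewhere and are not touched.

```bash
# Sketch only: prune_daily_snapshots is an illustrative helper.
# Keeps the newest <keep> directories named YYYY-MM-DD under <root>
# and deletes the older ones.
prune_daily_snapshots() {
  local root="$1" keep="${2:-30}"
  local -a snapshots=()
  local dir i
  # ISO dates sort chronologically under the glob's lexical ordering.
  for dir in "$root"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
    [ -d "$dir" ] && snapshots+=("$dir")
  done
  local excess=$(( ${#snapshots[@]} - keep ))
  for (( i = 0; i < excess; i++ )); do
    echo "pruning ${snapshots[i]}"
    rm -rf -- "${snapshots[i]}"
  done
  return 0
}
```

For example, `prune_daily_snapshots backup 30` would keep the 30 most recent dated directories under `backup/`.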
|  | ||||
| ## Verification Checklist | ||||
| - [ ] `/ready` reports all identity providers ready. | ||||
| - [ ] OAuth flows issue tokens signed by the restored keys. | ||||
| - [ ] `PluginRegistrationSummary` logs expected providers on startup. | ||||
| - [ ] Revocation manifest export (`dotnet run --project src/Authority/StellaOps.Authority`) succeeds. | ||||
| - [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables). | ||||
|  | ||||
							
								
								
									
										174
									
								
								docs/modules/authority/operations/grafana-dashboard.json
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,174 @@ | ||||
| { | ||||
|   "title": "StellaOps Authority - Token & Access Monitoring", | ||||
|   "uid": "authority-token-monitoring", | ||||
|   "schemaVersion": 38, | ||||
|   "version": 1, | ||||
|   "editable": true, | ||||
|   "timezone": "", | ||||
|   "graphTooltip": 0, | ||||
|   "time": { | ||||
|     "from": "now-6h", | ||||
|     "to": "now" | ||||
|   }, | ||||
|   "templating": { | ||||
|     "list": [ | ||||
|       { | ||||
|         "name": "datasource", | ||||
|         "type": "datasource", | ||||
|         "query": "prometheus", | ||||
|         "refresh": 1, | ||||
|         "hide": 0, | ||||
|         "current": {} | ||||
|       } | ||||
|     ] | ||||
|   }, | ||||
|   "panels": [ | ||||
|     { | ||||
|       "id": 1, | ||||
|       "title": "Token Requests – Success vs Failure", | ||||
|       "type": "timeseries", | ||||
|       "datasource": { | ||||
|         "type": "prometheus", | ||||
|         "uid": "${datasource}" | ||||
|       }, | ||||
|       "fieldConfig": { | ||||
|         "defaults": { | ||||
|           "unit": "req/s", | ||||
|           "displayName": "{{grant_type}} ({{status}})" | ||||
|         }, | ||||
|         "overrides": [] | ||||
|       }, | ||||
|       "targets": [ | ||||
|         { | ||||
|           "refId": "A", | ||||
|           "expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))", | ||||
|           "legendFormat": "{{grant_type}} {{status}}" | ||||
|         } | ||||
|       ], | ||||
|       "options": { | ||||
|         "legend": { | ||||
|           "displayMode": "table", | ||||
|           "placement": "bottom" | ||||
|         }, | ||||
|         "tooltip": { | ||||
|           "mode": "multi" | ||||
|         } | ||||
|       } | ||||
|     }, | ||||
|     { | ||||
|       "id": 2, | ||||
|       "title": "Rate Limiter Rejections", | ||||
|       "type": "timeseries", | ||||
|       "datasource": { | ||||
|         "type": "prometheus", | ||||
|         "uid": "${datasource}" | ||||
|       }, | ||||
|       "fieldConfig": { | ||||
|         "defaults": { | ||||
|           "unit": "req/s", | ||||
|           "displayName": "{{limiter}}" | ||||
|         }, | ||||
|         "overrides": [] | ||||
|       }, | ||||
|       "targets": [ | ||||
|         { | ||||
|           "refId": "A", | ||||
|           "expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))", | ||||
|           "legendFormat": "{{limiter}}" | ||||
|         } | ||||
|       ] | ||||
|     }, | ||||
|     { | ||||
|       "id": 3, | ||||
|       "title": "Bypass Events (5m)", | ||||
|       "type": "stat", | ||||
|       "datasource": { | ||||
|         "type": "prometheus", | ||||
|         "uid": "${datasource}" | ||||
|       }, | ||||
|       "fieldConfig": { | ||||
|         "defaults": { | ||||
|           "unit": "short", | ||||
|           "color": { | ||||
|             "mode": "thresholds" | ||||
|           }, | ||||
|           "thresholds": { | ||||
|             "mode": "absolute", | ||||
|             "steps": [ | ||||
|               { "color": "green", "value": null }, | ||||
|               { "color": "orange", "value": 1 }, | ||||
|               { "color": "red", "value": 5 } | ||||
|             ] | ||||
|           } | ||||
|         }, | ||||
|         "overrides": [] | ||||
|       }, | ||||
|       "targets": [ | ||||
|         { | ||||
|           "refId": "A", | ||||
|           "expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))" | ||||
|         } | ||||
|       ], | ||||
|       "options": { | ||||
|         "reduceOptions": { | ||||
|           "calcs": ["last"], | ||||
|           "fields": "", | ||||
|           "values": false | ||||
|         }, | ||||
|         "orientation": "horizontal", | ||||
|         "textMode": "auto" | ||||
|       } | ||||
|     }, | ||||
|     { | ||||
|       "id": 4, | ||||
|       "title": "Lockout Events (15m)", | ||||
|       "type": "stat", | ||||
|       "datasource": { | ||||
|         "type": "prometheus", | ||||
|         "uid": "${datasource}" | ||||
|       }, | ||||
|       "fieldConfig": { | ||||
|         "defaults": { | ||||
|           "unit": "short", | ||||
|           "color": { | ||||
|             "mode": "thresholds" | ||||
|           }, | ||||
|           "thresholds": { | ||||
|             "mode": "absolute", | ||||
|             "steps": [ | ||||
|               { "color": "green", "value": null }, | ||||
|               { "color": "orange", "value": 5 }, | ||||
|               { "color": "red", "value": 10 } | ||||
|             ] | ||||
|           } | ||||
|         }, | ||||
|         "overrides": [] | ||||
|       }, | ||||
|       "targets": [ | ||||
|         { | ||||
|           "refId": "A", | ||||
|           "expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))" | ||||
|         } | ||||
|       ], | ||||
|       "options": { | ||||
|         "reduceOptions": { | ||||
|           "calcs": ["last"], | ||||
|           "fields": "", | ||||
|           "values": false | ||||
|         }, | ||||
|         "orientation": "horizontal", | ||||
|         "textMode": "auto" | ||||
|       } | ||||
|     }, | ||||
|     { | ||||
|       "id": 5, | ||||
|       "title": "Trace Explorer Shortcut", | ||||
|       "type": "text", | ||||
|       "options": { | ||||
|         "mode": "markdown", | ||||
|         "content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})" | ||||
|       } | ||||
|     } | ||||
|   ], | ||||
|   "links": [] | ||||
| } | ||||
							
								
								
									
										94
									
								
								docs/modules/authority/operations/key-rotation.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,94 @@ | ||||
| # Authority Signing Key Rotation Playbook | ||||
|  | ||||
| > **Status:** Authored 2025-10-12 as part of OPS3.KEY-ROTATION rollout.   | ||||
| > Use together with `docs/11_AUTHORITY.md` (Authority service guide) and the automation shipped under `ops/authority/`. | ||||
|  | ||||
| ## 1. Overview | ||||
|  | ||||
| Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide: | ||||
|  | ||||
| - **Automation script:** `ops/authority/key-rotation.sh`   | ||||
|   Shell helper that POSTs to `/internal/signing/rotate`, supports metadata and dry-run, and confirms JWKS afterwards. | ||||
| - **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`   | ||||
|   Manual dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. Works across staging/production by passing the `environment` input. | ||||
|  | ||||
| This playbook documents the repeatable sequence for all environments. | ||||
|  | ||||
| ## 2. Pre-requisites | ||||
|  | ||||
| 1. **Generate a new PEM key (per environment)** | ||||
|    ```bash | ||||
|    openssl ecparam -name prime256v1 -genkey -noout \ | ||||
|      -out certificates/authority-signing-<env>-<year>.pem | ||||
|    chmod 600 certificates/authority-signing-<env>-<year>.pem | ||||
|    ``` | ||||
| 2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation. | ||||
| 3. **Ensure secrets/vars exist in Gitea** | ||||
|    - `<ENV>_AUTHORITY_BOOTSTRAP_KEY` | ||||
|    - `<ENV>_AUTHORITY_URL` | ||||
|    - Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY`, `AUTHORITY_URL`. | ||||
|  | ||||
| ## 3. Executing the rotation | ||||
|  | ||||
| ### Option A – via CI workflow (recommended) | ||||
|  | ||||
| 1. Navigate to **Actions → Authority Key Rotation**. | ||||
| 2. Provide inputs: | ||||
|    - `environment`: `staging`, `production`, etc. | ||||
|    - `key_id`: new `kid` (e.g. `authority-signing-2025-dev`). | ||||
|    - `key_path`: path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`). | ||||
|    - Optional `metadata`: comma-separated `key=value` pairs (for audit trails). | ||||
| 3. Trigger. The workflow: | ||||
|    - Reads the bootstrap key/URL from secrets. | ||||
|    - Runs `ops/authority/key-rotation.sh`. | ||||
|    - Prints the JWKS response for verification. | ||||
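
The optional `metadata` input is comma-separated, whereas the script takes repeated `--meta key=value` flags (see Option B). How the workflow converts between the two is not shown here. One possible approach, as a sketch with a hypothetical helper named `meta_flags`, is:

```bash
# Sketch only: meta_flags is a hypothetical helper, not taken from the
# actual workflow. It turns "k1=v1,k2=v2" into one "--meta" / "k=v" pair
# per entry, printed one argument per line.
meta_flags() {
  local input="$1" pair
  local -a pairs=()
  IFS=',' read -r -a pairs <<< "$input"
  for pair in "${pairs[@]}"; do
    # Skip empty entries and anything that is not key=value.
    case "$pair" in
      ?*=*) printf '%s\n%s\n' '--meta' "$pair" ;;
    esac
  done
}

# Possible usage (shown as comments so nothing runs against a live service):
#   mapfile -t flags < <(meta_flags "rotatedBy=ops,changeTicket=OPS-1234")
#   ./ops/authority/key-rotation.sh --key-id ... --key-path ... "${flags[@]}"
```

Emitting one argument per line keeps values intact when they are read back into an array, as long as the values themselves contain no newlines or commas.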
|  | ||||
| ### Option B – manual shell invocation | ||||
|  | ||||
| ```bash | ||||
| AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \ | ||||
| ./ops/authority/key-rotation.sh \ | ||||
|   --authority-url https://authority.example.com \ | ||||
|   --key-id authority-signing-2025-dev \ | ||||
|   --key-path ../certificates/authority-signing-2025-dev.pem \ | ||||
|   --meta rotatedBy=ops --meta changeTicket=OPS-1234 | ||||
| ``` | ||||
|  | ||||
| Use `--dry-run` to inspect the payload before execution. | ||||
|  | ||||
| ## 4. Post-rotation checklist | ||||
|  | ||||
| 1. Update `authority.yaml` (or environment-specific overrides): | ||||
|    - Set `signing.activeKeyId` to the new key. | ||||
|    - Set `signing.keyPath` to the new PEM. | ||||
|    - Append the previous key into `signing.additionalKeys`. | ||||
|    - Ensure `keySource`/`provider` match the values passed to the script. | ||||
| 2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key. | ||||
| 3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired`. | ||||
| 4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired. | ||||
|  | ||||
| ## 5. Development key state | ||||
|  | ||||
| For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key: | ||||
|  | ||||
| - Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`) | ||||
| - Retired: `authority-signing-dev` | ||||
|  | ||||
| Treat these as examples; real environments must maintain their own PEM material. | ||||
|  | ||||
| ## 6. References | ||||
|  | ||||
| - `docs/11_AUTHORITY.md` – Architecture and rotation SOP (Section 5). | ||||
| - `docs/modules/authority/operations/backup-restore.md` – Recovery flow referencing this playbook. | ||||
| - `ops/authority/README.md` – CLI usage and examples. | ||||
| - `scripts/rotate-policy-cli-secret.sh` – Helper to mint new `policy-cli` shared secrets when policy scope bundles change. | ||||
|  | ||||
| ## 7. Appendix — Policy CLI secret rotation | ||||
|  | ||||
| Scope migrations such as AUTH-POLICY-23-004 require issuing fresh credentials for the `policy-cli` client. Use the helper script committed with the repo to keep secrets deterministic across environments. | ||||
|  | ||||
| ```bash | ||||
| ./scripts/rotate-policy-cli-secret.sh --output etc/secrets/policy-cli.secret | ||||
| ``` | ||||
|  | ||||
| The script writes a timestamped header and a random secret into the target file. Use `--dry-run` when generating material for external secret stores. After updating secrets in staging/production, recycle the Authority pods and confirm the new client credentials work before the next release freeze. | ||||
							
								
								
									
										83
									
								
								docs/modules/authority/operations/monitoring.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,83 @@ | ||||
| # Authority Monitoring & Alerting Playbook | ||||
|  | ||||
| ## Telemetry Sources | ||||
| - **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`. | ||||
| - **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports: | ||||
|   - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment). | ||||
|   - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`). | ||||
| - **Logs:** Serilog writes structured events to stdout. Notable templates: | ||||
|   - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector). | ||||
|   - `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors). | ||||
|   - `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities). | ||||
|   - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage). | ||||
|   - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts). | ||||
|  | ||||
| ## Prometheus Metrics to Collect | ||||
| | Metric | Query | Purpose | | ||||
| | --- | --- | --- | | ||||
| | `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). | | ||||
| | `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when > 5 % for 10 min. | ||||
| | `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiter saturation (requires OTEL ASP.NET rate limiter exporter). | ||||
| | `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. | | ||||
| | `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. | | ||||
|  | ||||
| > **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series. | ||||
|  | ||||
| ## Alert Rules | ||||
| 1. **Token Failure Surge** | ||||
|    - _Expression_: `token_failure_ratio > 0.05` | ||||
|    - _For_: `10m` | ||||
|    - _Labels_: `severity="critical"` | ||||
|    - _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as diagnostic hint (requires span → metric transformation). | ||||
| 2. **Lockout Spike** | ||||
|    - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10` | ||||
|    - _For_: `15m` | ||||
|    - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`. | ||||
| 3. **Bypass Threshold** | ||||
|    - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1` | ||||
|    - _For_: `5m` | ||||
|    - Alert severity `warning` — verify the calling host list. | ||||
| 4. **Rate Limiter Saturation** | ||||
|    - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0` | ||||
|    - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured. | ||||
|  | ||||
| ## Grafana Dashboard | ||||
| - Import `docs/modules/authority/operations/grafana-dashboard.json` to provision baseline panels: | ||||
|   - **Token Requests – Success vs Failure** – time series of `/token` request rate split by grant type and status. | ||||
|   - **Rate Limiter Rejections** – time series of rejections per limiter (for example `authority-token` and `authority-authorize`). | ||||
|   - **Bypass Events (5m)** and **Lockout Events (15m)** – two stat panels using Loki-derived counters. | ||||
|   - **Trace Explorer Shortcut** – text panel linking to a `stellaops-authority` span search pre-filtered to `authority.token.*` spans. | ||||
|  | ||||
| ## Collector Configuration Snippets | ||||
| ```yaml | ||||
| receivers: | ||||
|   otlp: | ||||
|     protocols: | ||||
|       http: | ||||
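| exporters: | ||||
|   prometheus: | ||||
|     endpoint: "0.0.0.0:9464" | ||||
|   # The logs pipeline below references a loki exporter, so it must be defined. | ||||
|   # Placeholder endpoint (assumption); point it at your Loki push API. | ||||
|   loki: | ||||
|     endpoint: "http://loki:3100/loki/api/v1/push" | ||||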
| processors: | ||||
|   batch: | ||||
|   attributes/token_grant: | ||||
|     actions: | ||||
|       - key: grant_type | ||||
|         action: upsert | ||||
|         from_attribute: authority.grant_type | ||||
| service: | ||||
|   pipelines: | ||||
|     metrics: | ||||
|       receivers: [otlp] | ||||
|       processors: [attributes/token_grant, batch] | ||||
|       exporters: [prometheus] | ||||
|     logs: | ||||
|       receivers: [otlp] | ||||
|       processors: [batch] | ||||
|       exporters: [loki] | ||||
| ``` | ||||
|  | ||||
| ## Operational Checklist | ||||
| - [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds. | ||||
| - [ ] Ensure Promtail captures container stdout with Serilog structured formatting. | ||||
| - [ ] Periodically validate alert noise by running load tests that trigger the rate limiter. | ||||
| - [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change. | ||||