feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
22
docs/modules/authority/AGENTS.md
Normal file
@@ -0,0 +1,22 @@
# Authority agent guide

## Mission
Authority is the platform OIDC/OAuth2 control plane that mints short-lived, sender-constrained operational tokens (OpToks) for every StellaOps service and tool.

## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)

## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.

## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind: document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
40
docs/modules/authority/README.md
Normal file
@@ -0,0 +1,40 @@
# StellaOps Authority

Authority is the platform OIDC/OAuth2 control plane that mints short-lived, sender-constrained operational tokens (OpToks) for every StellaOps service and tool.

## Responsibilities
- Expose device-code, auth-code, and client-credential flows with DPoP or mTLS binding.
- Manage signing keys, JWKS rotation, and PoE integration for plan enforcement.
- Emit structured audit events and enforce tenant-aware scope policies.
- Provide plugin surface for custom identity providers and credential validators.

## Key components
- `StellaOps.Authority` web host.
- `StellaOps.Authority.Plugin.*` extensions for secret stores, identity bridges, and OpTok validation.
- Telemetry and audit pipeline feeding Security/Observability stacks.

## Integrations & dependencies
- Signer/Attestor for PoE and OpTok introspection.
- CLI/UI for login flows and token management.
- Scheduler/Scanner for machine-to-machine scope enforcement.

## Operational notes
- MongoDB for tenant, client, and token state.
- Key material in KMS/HSM with rotation runbooks (see ./operations/key-rotation.md).
- Grafana/Prometheus dashboards for auth latency/issuance.

## Related resources
- ./operations/backup-restore.md
- ./operations/key-rotation.md
- ./operations/monitoring.md
- ./operations/grafana-dashboard.json

## Backlog references
- DOCS-SEC-62-001 (scope hardening doc) in ../../TASKS.md.
- AUTH-POLICY-20-001/002 follow-ups in src/Authority/StellaOps.Authority/TASKS.md.

## Epic alignment
- **Epic 1 – AOC enforcement:** enforce OpTok scopes and guardrails supporting raw ingestion boundaries.
- **Epic 2 – Policy Engine & Editor:** supply policy evaluation/principal scopes and short-lived tokens for evaluator workflows.
- **Epic 4 – Policy Studio:** integrate approval/promotion signatures and policy registry access controls.
- **Epic 14 – Identity & Tenancy:** deliver tenant isolation, RBAC hierarchies, and governance tooling for authentication.
9
docs/modules/authority/TASKS.md
Normal file
@@ -0,0 +1,9 @@
# Task board — Authority

> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.

| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| AUTHORITY-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| AUTHORITY-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| AUTHORITY-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |
445
docs/modules/authority/architecture.md
Normal file
@@ -0,0 +1,445 @@
# component_architecture_authority.md — **Stella Ops Authority** (2025Q4)

> Consolidates identity and tenancy requirements documented across the AOC, Policy, and Platform guides, along with the dedicated Authority implementation plan.

> **Scope.** Implementation-ready architecture for **Stella Ops Authority**: the on-prem **OIDC/OAuth2** service that issues **short-lived, sender-constrained operational tokens (OpToks)** to first-party services and tools. Covers protocols (DPoP & mTLS binding), token shapes, endpoints, storage, rotation, HA, RBAC, audit, and testing. This component is the trust anchor for *who* is calling inside a Stella Ops installation. (Entitlement is proven separately by **PoE** from the cloud Licensing Service; Authority does not issue PoE.)

---

## 0) Mission & boundaries

**Mission.** Provide **fast, local, verifiable** authentication for Stella Ops microservices and tools by minting **very short-lived** OAuth2/OIDC tokens that are **sender-constrained** (DPoP or mTLS-bound). Support RBAC scopes, multi-tenant claims, and deterministic validation for APIs (Scanner, Signer, Attestor, Excititor, Concelier, UI, CLI, Zastava).

**Boundaries.**

* Authority **does not** validate entitlements/licensing. That’s enforced by **Signer** using **PoE** with the cloud Licensing Service.
* Authority tokens are **operational only** (2–5 min TTL) and must not be embedded in long-lived artifacts or stored in SBOMs.
* Authority is **stateless for validation** (JWTs are verified locally) and offers **optional introspection** for services that prefer online checks.

---

## 1) Protocols & cryptography

* **OIDC Discovery**: `/.well-known/openid-configuration`
* **OAuth2** grant types:

  * **Client Credentials** (service↔service, with mTLS or private_key_jwt)
  * **Device Code** (CLI login on headless agents; optional)
  * **Authorization Code + PKCE** (browser login for UI; optional)
* **Sender constraint options** (choose per caller or per audience):

  * **DPoP** (Demonstrating Proof of Possession): proof JWT on each HTTP request, bound to the access token via `cnf.jkt`.
  * **OAuth 2.0 mTLS** (certificate-bound tokens): token bound to client certificate thumbprint via `cnf.x5t#S256`.
* **Signing algorithms**: **EdDSA (Ed25519)** preferred; fallback **ES256 (P-256)**. Rotation is supported via **kid** in JWKS.
* **Token format**: **JWT** access tokens (compact), optionally opaque reference tokens for services that insist on introspection.
* **Clock skew tolerance**: ±60 s; issue `nbf`, `iat`, `exp` accordingly.

---

## 2) Token model

### 2.1 Access token (OpTok) — short-lived (120–300 s)

**Registered claims**

```
iss   = https://authority.<domain>
sub   = <client_id or user_id>
aud   = <service audience: signer|scanner|attestor|concelier|excititor|ui|zastava>
exp   = <unix ts>  (<= 300 s from iat)
iat   = <unix ts>
nbf   = iat - 30
jti   = <uuid>
scope = "scanner.scan scanner.export signer.sign ..."
```

**Sender-constraint (`cnf`)**

* **DPoP**:

  ```json
  "cnf": { "jkt": "<base64url(SHA-256(JWK))>" }
  ```
* **mTLS**:

  ```json
  "cnf": { "x5t#S256": "<base64url(SHA-256(client_cert_der))>" }
  ```

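The two `cnf` values above can be computed as in the following minimal sketch. It assumes the `jkt` value follows the RFC 7638 JWK thumbprint rules (required members only, sorted, no whitespace) and that `x5t#S256` is the SHA-256 of the DER-encoded certificate. The function names and the example key value are illustrative, not part of Authority.

```python
import base64
import hashlib
import json

# Required JWK members per key type (RFC 7638; OKP keys per RFC 8037).
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "OKP": ("crv", "kty", "x"),
    "RSA": ("e", "kty", "n"),
}


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JOSE."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """Compute the value expected in cnf.jkt for a public JWK."""
    members = REQUIRED_MEMBERS[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        sort_keys=True,
        separators=(",", ":"),  # no whitespace in the hash input
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def cert_thumbprint(cert_der: bytes) -> str:
    """Compute the value expected in cnf.x5t#S256 for a DER certificate."""
    return b64url(hashlib.sha256(cert_der).digest())


# Illustrative public key; optional members such as "use" do not affect the result.
example_jwk = {
    "kty": "OKP",
    "crv": "Ed25519",
    "x": "11qYAYKxCrfVS_7TyWQHOg7hcvPapiMlrwIaaPcHURo",
    "use": "sig",
}
print(jwk_thumbprint(example_jwk))
```

Because only the required members enter the hash, the same key yields the same thumbprint regardless of member order or extra metadata, which is what lets a resource server compare the proof's JWK against `cnf.jkt`.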
**Install/tenant context (custom claims)**

```
tid   = <tenant id>          // multi-tenant
inst  = <installation id>    // unique installation
roles = [ "svc.scanner", "svc.signer", "ui.admin", ... ]
plan? = <plan name>          // optional hint for UIs; not used for enforcement
```

> **Note**: Do **not** copy PoE claims into OpTok; OpTok ≠ entitlement. Only **Signer** checks PoE.

### 2.2 Refresh tokens (optional)

* Default **disabled**. If enabled (for UI interactive logins), pair with **DPoP-bound** refresh tokens or **mTLS** client sessions; short TTL (≤ 8 h), rotating on use (replay-safe).

### 2.3 ID tokens (optional)

* Issued for UI/browser OIDC flows (Authorization Code + PKCE); not used for service auth.

---

## 3) Endpoints & flows

### 3.1 OIDC discovery & keys

* `GET /.well-known/openid-configuration` → endpoints, algs, jwks_uri
* `GET /jwks` → JSON Web Key Set (rotating, at least 2 active keys during transition)

### 3.2 Token issuance

* `POST /oauth/token`

  * **Client Credentials** (service→service):

    * **mTLS**: mutual TLS + `client_id` → bound token (`cnf.x5t#S256`)
      * `security.senderConstraints.mtls.enforceForAudiences` forces the mTLS path when requested `aud`/`resource` values intersect high-value audiences (defaults include `signer`). Authority rejects clients attempting to use DPoP/basic secrets for these audiences.
      * Stored `certificateBindings` are authoritative: thumbprint, subject, issuer, serial number, and SAN values are matched against the presented certificate, with rotation grace applied to activation windows. Failures surface deterministic error codes (e.g. `certificate_binding_subject_mismatch`).
    * **private_key_jwt**: JWT-based client auth + **DPoP** header (preferred for tools and CLI)
  * **Device Code** (CLI): `POST /oauth/device/code` + `POST /oauth/token` poll
  * **Authorization Code + PKCE** (UI): standard

**DPoP handshake (example)**

1. Client prepares **JWK** (ephemeral keypair).
2. Client sends **DPoP proof** header with fields:

   ```
   htm=POST
   htu=https://authority.../oauth/token
   iat=<now>
   jti=<uuid>
   ```

   signed with the DPoP private key; header carries JWK.
3. Authority validates proof; issues access token with `cnf.jkt=<thumbprint(JWK)>`.
4. Client uses the same DPoP key to sign **every subsequent API request** to services (Signer, Scanner, …).

**mTLS flow**

* Mutual TLS at the connection; Authority extracts client cert, validates chain; token carries `cnf.x5t#S256`.

### 3.3 Introspection & revocation (optional)

* `POST /oauth/introspect` → `{ active, sub, scope, aud, exp, cnf, ... }`
* `POST /oauth/revoke` → revokes refresh tokens or opaque access tokens.
* **Replay prevention**: maintain **DPoP `jti` cache** (TTL ≤ 10 min) to reject duplicate proofs when services supply DPoP nonces (Signer requires nonce for high-value operations).

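The `jti` replay cache can be illustrated with a small in-memory sketch. This is an assumption-laden stand-in for the Redis-backed cache the document describes: the class name, the injected clock, and the 600 s default (the upper bound of "TTL ≤ 10 min") are hypothetical choices for illustration.

```python
import time


class JtiReplayCache:
    """In-memory stand-in for a shared jti cache (a real deployment would use Redis)."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._seen = {}              # jti -> expiry timestamp

    def _purge(self, now):
        """Drop entries whose TTL has elapsed."""
        expired = [jti for jti, expiry in self._seen.items() if expiry <= now]
        for jti in expired:
            del self._seen[jti]

    def check_and_store(self, jti):
        """Return True if the jti is fresh (and record it), False if it is a replay."""
        now = self._clock()
        self._purge(now)
        if jti in self._seen:
            return False
        self._seen[jti] = now + self._ttl
        return True
```

A resource server would call `check_and_store(proof_jti)` after verifying the proof signature and reject the request when it returns `False`. With several replicas the check-and-store step has to be atomic in the shared store, which this single-process sketch does not attempt to model.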
### 3.4 UserInfo (optional for UI)

* `GET /userinfo` (ID token context).

---

## 4) Audiences, scopes & RBAC

### 4.1 Audiences

* `signer`: only the **Signer** service should accept tokens with `aud=signer`.
* `attestor`, `scanner`, `concelier`, `excititor`, `ui`, `zastava` similarly.

Services **must** verify `aud` and **sender constraint** (DPoP/mTLS) per their policy.

### 4.2 Core scopes

| Scope                                 | Service            | Operation                  |
| ------------------------------------- | ------------------ | -------------------------- |
| `signer.sign`                         | Signer             | Request DSSE signing       |
| `attestor.write`                      | Attestor           | Submit Rekor entries       |
| `scanner.scan`                        | Scanner.WebService | Submit scan jobs           |
| `scanner.export`                      | Scanner.WebService | Export SBOMs               |
| `scanner.read`                        | Scanner.WebService | Read catalog/SBOMs         |
| `vex.read` / `vex.admin`              | Excititor          | Query/operate              |
| `concelier.read` / `concelier.export` | Concelier          | Query/exports              |
| `ui.read` / `ui.admin`                | UI                 | View/admin                 |
| `zastava.emit` / `zastava.enforce`    | Scanner/Zastava    | Runtime events / admission |

**Roles → scopes mapping** is configured centrally (Authority policy) and pushed during token issuance.

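As a rough illustration of how a central roles → scopes policy could be applied at issuance, the sketch below intersects the scopes a client requests with those its roles allow. The role and scope names come from the examples in this document, but the mapping table and function names are hypothetical, not Authority's actual policy format.

```python
# Hypothetical role -> scope policy; the real mapping lives in Authority policy.
ROLE_SCOPES = {
    "svc.scanner": {"scanner.scan", "scanner.export", "scanner.read"},
    "svc.signer": {"signer.sign"},
    "ui.admin": {"ui.read", "ui.admin"},
}


def granted_scopes(roles, requested):
    """Return the requested scopes that the caller's roles allow, sorted."""
    allowed = set()
    for role in roles:
        allowed |= ROLE_SCOPES.get(role, set())
    return sorted(allowed & set(requested))


def scope_claim(roles, requested):
    """Build the space-delimited `scope` claim, or fail if nothing is allowed."""
    scopes = granted_scopes(roles, requested)
    if not scopes:
        raise PermissionError("invalid_scope")
    return " ".join(scopes)


# A scanner service asking for a signer scope only receives what its role permits.
print(scope_claim(["svc.scanner"], ["scanner.scan", "signer.sign"]))  # prints "scanner.scan"
```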
---

## 5) Storage & state

* **Configuration DB** (PostgreSQL/MySQL): clients, audiences, role→scope maps, tenant/installation registry, device code grants, persistent consents (if any).
* **Cache** (Redis):

  * DPoP **jti** replay cache (short TTL)
  * **Nonce** store (per resource server, if they demand nonce)
  * Device code pollers, rate limiting buckets
* **JWKS**: key material in HSM/KMS or encrypted at rest; JWKS served from memory.

---

## 6) Key management & rotation

* Maintain **at least 2 signing keys** active during rotation; tokens carry `kid`.
* Prefer **Ed25519** for compact tokens; maintain **ES256** fallback for FIPS contexts.
* Rotation cadence: 30–90 days; emergency rotation supported.
* Publish new JWKS **before** issuing tokens with the new `kid` to avoid cold-start validation misses.
* Keep **old keys** available **at least** for max token TTL + 5 minutes.

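The retention rule for old keys is simple arithmetic; the sketch below works it through under the assumption that the maximum token TTL is the 300 s upper bound from section 2.1. The function names and the example timestamp are illustrative.

```python
from datetime import datetime, timedelta, timezone

MAX_TOKEN_TTL = timedelta(seconds=300)  # upper bound on OpTok lifetime (section 2.1)
SAFETY_MARGIN = timedelta(minutes=5)    # the "+ 5 minutes" in the rule above


def earliest_retirement(last_issued_with_old_kid: datetime) -> datetime:
    """Earliest time the old key may be removed from the published JWKS."""
    return last_issued_with_old_kid + MAX_TOKEN_TTL + SAFETY_MARGIN


def can_retire(last_issued_with_old_kid: datetime, now: datetime) -> bool:
    """True once every token signed with the old key has expired, plus margin."""
    return now >= earliest_retirement(last_issued_with_old_kid)


# Hypothetical moment at which issuance switched to the new kid.
switch = datetime(2025, 10, 17, 12, 0, 0, tzinfo=timezone.utc)
print(earliest_retirement(switch).isoformat())  # 2025-10-17T12:10:00+00:00
```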
---

## 7) HA & performance

* **Stateless issuance** (except device codes/refresh) → scale horizontally behind a load-balancer.
* **DB** only for client metadata and optional flows; token checks are JWT-local; introspection endpoints hit cache/DB minimally.
* **Targets**:

  * Token issuance P95 ≤ **20 ms** under warm cache.
  * DPoP proof validation ≤ **1 ms** extra per request at resource servers (Signer/Scanner).
  * 99.9% uptime; HPA on CPU/latency.

---

## 8) Security posture

* **Strict TLS** (1.3 preferred); HSTS; modern cipher suites.
* **mTLS** enabled where required (Signer/Attestor paths).
* **Replay protection**: DPoP `jti` cache, nonce support for **Signer** (add `DPoP-Nonce` header on 401; clients re-sign).
* **Rate limits** per client & per IP; exponential backoff on failures.
* **Secrets**: clients use **private_key_jwt** or **mTLS**; never basic secrets over the wire.
* **CSP/CSRF** hardening on UI flows; `SameSite=Lax` cookies; PKCE enforced.
* **Logs** redact `Authorization` and DPoP proofs; store `sub`, `aud`, `scopes`, `inst`, `tid`, `cnf` thumbprints, not full keys.

---

## 9) Multi-tenancy & installations

* **Tenant (`tid`)** and **Installation (`inst`)** registries define which audiences/scopes a client can request.
* Cross-tenant isolation enforced at issuance (disallow rogue `aud`), and resource servers **must** check that `tid` matches their configured tenant.

---

## 10) Admin & operations APIs

All under `/admin` (mTLS + `authority.admin` scope).

```
POST /admin/clients          # create/update client (confidential/public)
POST /admin/audiences        # register audience resource URIs
POST /admin/roles            # define role→scope mappings
POST /admin/tenants          # create tenant/install entries
POST /admin/keys/rotate      # rotate signing key (zero-downtime)
GET  /admin/metrics          # Prometheus exposition (token issue rates, errors)
GET  /admin/healthz|readyz   # health/readiness
```

Declared client `audiences` flow through to the issued JWT `aud` claim and the token request's `resource` indicators. Authority relies on this metadata to enforce DPoP nonce challenges for `signer`, `attestor`, and other high-value services without requiring clients to repeat the audience parameter on every request.

---

## 11) Integration hard lines (what resource servers must enforce)

Every Stella Ops service that consumes Authority tokens **must**:

1. Verify JWT signature (`kid` in JWKS), `iss`, `aud`, `exp`, `nbf`.
2. Enforce **sender-constraint**:

   * **DPoP**: validate DPoP proof (`htu`, `htm`, `iat`, `jti`) and match `cnf.jkt`; cache `jti` for replay defense; honor nonce challenges.
   * **mTLS**: match presented client cert thumbprint to token `cnf.x5t#S256`.
3. Check **scopes**; optionally map to internal roles.
4. Check **tenant** (`tid`) and **installation** (`inst`) as appropriate.
5. For **Signer** only: require **both** OpTok and **PoE** in the request (enforced by Signer, not Authority).

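The claim checks in steps 1, 3, and 4 can be sketched as below. This assumes the JWT signature has already been verified against the JWKS (signature verification and the DPoP/mTLS proof check are out of scope here) and uses the ±60 s skew from section 1. The exception type, function name, and parameter names are hypothetical; the sample claims mirror the wire example in section 19.

```python
import time


class TokenRejected(Exception):
    """Raised when a claim check fails."""


def check_claims(claims, *, issuer, audience, required_scope, tenant,
                 now=None, skew=60):
    """Validate claims of an OpTok whose signature was already verified."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        raise TokenRejected("iss mismatch")
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        raise TokenRejected("aud mismatch")
    exp, nbf = claims.get("exp"), claims.get("nbf")
    if exp is None or now > exp + skew:
        raise TokenRejected("expired or missing exp")
    if nbf is not None and now < nbf - skew:
        raise TokenRejected("not yet valid")
    if required_scope not in claims.get("scope", "").split():
        raise TokenRejected("missing scope")
    if claims.get("tid") != tenant:
        raise TokenRejected("tenant mismatch")
    if not claims.get("cnf"):
        # A token without a confirmation claim is a plain bearer token.
        raise TokenRejected("bearer token without cnf")


claims = {
    "iss": "https://authority.internal",
    "sub": "scanner-web",
    "aud": "signer",
    "exp": 1760668800,
    "iat": 1760668620,
    "nbf": 1760668620,
    "scope": "signer.sign",
    "tid": "tenant-01",
    "inst": "install-7A2B",
    "cnf": {"jkt": "KcVb2V...base64url..."},
}
# Passes silently at a time inside the validity window.
check_claims(claims, issuer="https://authority.internal", audience="signer",
             required_scope="signer.sign", tenant="tenant-01", now=1760668700)
```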
---

## 12) Error surfaces & UX

* Token endpoint errors follow OAuth2 (`invalid_client`, `invalid_grant`, `invalid_scope`, `unauthorized_client`).
* Resource servers use RFC 6750 style (`WWW-Authenticate: DPoP error="invalid_token", error_description="…", dpop_nonce="…"`).
* For DPoP nonce challenges, clients retry with the server-supplied nonce once.

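The "retry once" rule for nonce challenges could look like the following client-side sketch. Both callables are hypothetical stand-ins: `send(proof)` represents an HTTP call returning `(status, headers, body)`, and `make_proof(nonce)` represents a DPoP signer. The sketch reads the nonce from the `DPoP-Nonce` response header mentioned in section 8 and assumes header names are matched exactly as written.

```python
def call_with_nonce_retry(send, make_proof):
    """Send a request; on a 401 carrying a nonce, re-sign and retry exactly once."""
    status, headers, body = send(make_proof(None))
    nonce = headers.get("DPoP-Nonce")
    if status == 401 and nonce:
        # Single retry with the server-supplied nonce; a second challenge is
        # returned to the caller instead of looping.
        status, headers, body = send(make_proof(nonce))
    return status, body
```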
---

## 13) Observability & audit

* **Metrics**:

  * `authority.tokens_issued_total{grant,aud}`
  * `authority.dpop_validations_total{result}`
  * `authority.mtls_bindings_total{result}`
  * `authority.jwks_rotations_total`
  * `authority.errors_total{type}`
* **Audit log** (immutable sink): token issuance (`sub`, `aud`, `scopes`, `tid`, `inst`, `cnf thumbprint`, `jti`), revocations, admin changes.
* **Tracing**: token flows, DB reads, JWKS cache.

---

## 14) Configuration (YAML)

```yaml
authority:
  issuer: "https://authority.internal"
  signing:
    enabled: true
    activeKeyId: "authority-signing-2025"
    keyPath: "../certificates/authority-signing-2025.pem"
    algorithm: "ES256"
    keySource: "file"
  security:
    rateLimiting:
      token:
        enabled: true
        permitLimit: 30
        window: "00:01:00"
        queueLimit: 0
      authorize:
        enabled: true
        permitLimit: 60
        window: "00:01:00"
        queueLimit: 10
      internal:
        enabled: false
        permitLimit: 5
        window: "00:01:00"
        queueLimit: 0
    senderConstraints:
      dpop:
        enabled: true
        allowedAlgorithms: [ "ES256", "ES384" ]
        proofLifetime: "00:02:00"
        allowedClockSkew: "00:00:30"
        replayWindow: "00:05:00"
        nonce:
          enabled: true
          ttl: "00:10:00"
          maxIssuancePerMinute: 120
          store: "redis"
          redisConnectionString: "redis://authority-redis:6379?ssl=false"
          requiredAudiences:
            - "signer"
            - "attestor"
      mtls:
        enabled: true
        requireChainValidation: true
        rotationGrace: "00:15:00"
        enforceForAudiences:
          - "signer"
        allowedSanTypes:
          - "dns"
          - "uri"
        allowedCertificateAuthorities:
          - "/etc/ssl/mtls/clients-ca.pem"
  clients:
    - clientId: scanner-web
      grantTypes: [ "client_credentials" ]
      audiences: [ "scanner" ]
      auth: { type: "private_key_jwt", jwkFile: "/secrets/scanner-web.jwk" }
      senderConstraint: "dpop"
      scopes: [ "scanner.scan", "scanner.export", "scanner.read" ]
    - clientId: signer
      grantTypes: [ "client_credentials" ]
      audiences: [ "signer" ]
      auth: { type: "mtls" }
      senderConstraint: "mtls"
      scopes: [ "signer.sign" ]
    - clientId: notify-web-dev
      grantTypes: [ "client_credentials" ]
      audiences: [ "notify.dev" ]
      auth: { type: "client_secret", secretFile: "/secrets/notify-web-dev.secret" }
      senderConstraint: "dpop"
      scopes: [ "notify.read", "notify.admin" ]
    - clientId: notify-web
      grantTypes: [ "client_credentials" ]
      audiences: [ "notify" ]
      auth: { type: "client_secret", secretFile: "/secrets/notify-web.secret" }
      senderConstraint: "dpop"
      scopes: [ "notify.read", "notify.admin" ]
```

---

## 15) Testing matrix

* **JWT validation**: wrong `aud`, expired `exp`, skewed `nbf`, stale `kid`.
* **DPoP**: invalid `htu`/`htm`, replayed `jti`, stale `iat`, wrong `jkt`, nonce dance.
* **mTLS**: wrong client cert, wrong CA, thumbprint mismatch.
* **RBAC**: scope enforcement per audience; over-privileged client denied.
* **Rotation**: JWKS rotation while load-testing; zero-downtime verification.
* **HA**: kill one Authority instance; verify issuance continues; JWKS served by peers.
* **Performance**: 1k token issuance/sec on 2 cores with Redis enabled for jti caching.

---

## 16) Threat model & mitigations (summary)

| Threat              | Vector           | Mitigation                                                                                 |
| ------------------- | ---------------- | ------------------------------------------------------------------------------------------ |
| Token theft         | Copy of JWT      | **Short TTL**, **sender-constraint** (DPoP/mTLS); replay blocked by `jti` cache and nonces |
| Replay across hosts | Reuse DPoP proof | Enforce `htu`/`htm`, `iat` freshness, `jti` uniqueness; services may require **nonce**     |
| Impersonation       | Fake client      | mTLS or `private_key_jwt` with pinned JWK; client registration & rotation                  |
| Key compromise      | Signing key leak | HSM/KMS storage, key rotation, audit; emergency key revoke path; narrow token TTL          |
| Cross-tenant abuse  | Scope elevation  | Enforce `aud`, `tid`, `inst` at issuance and resource servers                              |
| Downgrade to bearer | Strip DPoP       | Resource servers require DPoP/mTLS based on `aud`; reject bearer without `cnf`             |

---

## 17) Deployment & HA

* **Stateless** microservice, containerized; run ≥ 2 replicas behind LB.
* **DB**: HA Postgres (or MySQL) for clients/roles; **Redis** for device codes, DPoP nonces/jtis.
* **Secrets**: mount client JWKs via K8s Secrets/HashiCorp Vault; signing keys via KMS.
* **Backups**: DB daily; Redis not critical (ephemeral).
* **Disaster recovery**: export/import of client registry; JWKS rehydrate from KMS.
* **Compliance**: TLS audit; penetration testing for OIDC flows.

---

## 18) Implementation notes

* Reference stack: **.NET 10** + **OpenIddict 6** (or IdentityServer if licensed) with custom DPoP validator and mTLS binding middleware.
* Keep the DPoP/JTI cache pluggable; allow Redis/Memcached.
* Provide **client SDKs** for C# and Go: DPoP key mgmt, proof generation, nonce handling, token refresh helper.

---

## 19) Quick reference — wire examples

**Access token (payload excerpt)**

```json
{
  "iss": "https://authority.internal",
  "sub": "scanner-web",
  "aud": "signer",
  "exp": 1760668800,
  "iat": 1760668620,
  "nbf": 1760668620,
  "jti": "9d9c3f01-6e1a-49f1-8f77-9b7e6f7e3c50",
  "scope": "signer.sign",
  "tid": "tenant-01",
  "inst": "install-7A2B",
  "cnf": { "jkt": "KcVb2V...base64url..." }
}
```

**DPoP proof header fields (for POST /sign/dsse)**

```json
{
  "htu": "https://signer.internal/sign/dsse",
  "htm": "POST",
  "iat": 1760668620,
  "jti": "4b1c9b3c-8a95-4c58-8a92-9c6cfb4a6a0b"
}
```

Signer validates that `hash(JWK)` in the proof matches `cnf.jkt` in the token.

---

## 20) Rollout plan

1. **MVP**: Client Credentials (private_key_jwt + DPoP), JWKS, short OpToks, per-audience scopes.
2. **Add**: mTLS-bound tokens for Signer/Attestor; device code for CLI; optional introspection.
3. **Hardening**: DPoP nonce support; full audit pipeline; HA tuning.
4. **UX**: Tenant/installation admin UI; role→scope editors; client bootstrap wizards.
22
docs/modules/authority/implementation_plan.md
Normal file
@@ -0,0 +1,22 @@
# Implementation plan — Authority

## Current objectives
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.

## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.

## Epic milestones
- **Epic 1 – AOC enforcement:** deliver OpTok scopes, guardrails, and AOC verifier hooks for ingestion services.
- **Epic 2 – Policy Engine & Editor:** support policy evaluator flows (device-code, client credentials, scope sandboxing).
- **Epic 4 – Policy Studio:** provide registry/promotion signing, approvals, and fresh-auth prompts.
- **Epic 14 – Identity & Tenancy:** implement tenant isolation, RBAC hierarchies, audit trails, and PoE integration.
- Track additional work (DOCS-SEC-62-001, AUTH-POLICY-20-001/002) in ../../TASKS.md and src/Authority/**/TASKS.md.

## Coordination
- Review ./AGENTS.md before picking up new work.
- Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md.
- Update this plan whenever scope, dependencies, or guardrails change.
97
docs/modules/authority/operations/backup-restore.md
Normal file
@@ -0,0 +1,97 @@
# Authority Backup & Restore Runbook

## Scope
- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24 h in production. Store snapshots in an encrypted, access-controlled vault.

## Inventory Checklist
| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |

> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.

## Hot Backup (no downtime)
1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**
   ```bash
   # Capture the timestamp once so the dump and the copy use the same file name.
   TS=$(date -u +%Y%m%dT%H%M%SZ)
   docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
     mongodump --archive=/dump/authority-${TS}.gz \
     --gzip --db stellaops-authority
   docker compose -f ops/authority/docker-compose.authority.yaml cp \
     mongo:/dump/authority-${TS}.gz backup/
   ```
   The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`.
3. **Capture configuration + manifests:**
   ```bash
   cp etc/authority.yaml backup/
   rsync -a etc/authority.plugins/ backup/authority.plugins/
   ```
4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:
   ```bash
   docker run --rm \
     -v authority-keys:/keys \
     -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date -u +%Y%m%dT%H%M%SZ).tar.gz -C /keys .
   ```
5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts.
6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault.

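The checksum step (step 5) can be scripted; the sketch below is one possible approach, not part of the official tooling. It walks a backup directory and writes a manifest in the `<hex digest>  <relative path>` layout that `sha256sum` emits. The manifest name `SHA256SUMS` and the function names are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dump archives are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir: Path, manifest_name: str = "SHA256SUMS") -> Path:
    """Write one '<digest>  <relative path>' line per file, sorted for determinism."""
    manifest = backup_dir / manifest_name
    files = sorted(
        p for p in backup_dir.rglob("*") if p.is_file() and p != manifest
    )
    lines = [
        f"{sha256_file(p)}  {p.relative_to(backup_dir).as_posix()}" for p in files
    ]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Calling `write_manifest(Path("backup"))` after steps 2 to 4 would place the manifest next to the artefacts; run it before the encrypt-and-upload step so the digests travel with the archive.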
## Cold Backup (planned downtime)
1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:
   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml down
   ```
3. Back up volumes directly using `tar`:
   ```bash
   docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
   ```
4. Copy configuration + manifests as in the hot backup (steps 3–6).
5. Restart services and verify health:
   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml up -d
   curl -fsS http://localhost:8080/ready
   ```

## Restore Procedure
1. **Provision clean volumes:** remove existing volumes if you’re rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**
   ```bash
   docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
   ```
   Use `--drop` to replace collections; omit if doing a partial restore.
3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:
   ```bash
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
   ```
   Ensure private key files keep mode `600` (apply `chmod 600` to the key files themselves; a recursive `chmod -R 600` on the directory also strips the execute bit that directories need in order to be traversed).
5. **Start services & validate:**
   ```bash
   docker compose up -d
   curl -fsS http://localhost:8080/health
   ```
6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations. If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../../../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`.

|
||||
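The Mongo archive name in step 2 embeds a UTC timestamp (`YYYYMMDDTHHMMSSZ`), so lexical order matches chronological order. A small sketch that picks the newest archive from a backup directory is shown below; `latest_archive` is a hypothetical helper, and it assumes every archive follows the `authority-<timestamp>Z.gz` pattern.

```bash
# latest_archive DIR: print the newest authority-*Z.gz archive in DIR.
# Returns non-zero when no archive matches.
latest_archive() {
  local dir="$1" newest="" f
  for f in "$dir"/authority-*Z.gz; do
    [ -e "$f" ] || continue   # the glob did not match anything
    # Glob results are sorted, so the last match has the latest timestamp.
    newest="$f"
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

It could feed the restore command directly, for example `docker compose exec -T mongo mongorestore --archive --gzip --drop < "$(latest_archive backup)"`.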
## Disaster Recovery Notes

- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and `docs/11_AUTHORITY.md`), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Driver 3.5.0 requires MongoDB **4.2+**; clusters still on 4.0 must be upgraded before restore, and future driver releases will drop 4.0 entirely.

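The daily half of the retention rule can be sketched as a listing helper. This is only an illustration: `prune_candidates` is a hypothetical name, it assumes one timestamped `authority-*Z.gz` archive per day, it deliberately prints candidates instead of deleting them, and it does not implement the 12 monthly archival copies.

```bash
# prune_candidates DIR [KEEP]: list archives older than the newest KEEP (default 30).
# Nothing is deleted; review the list before removing anything.
prune_candidates() {
  local dir="$1" keep="${2:-30}"
  # Timestamped names sort chronologically; reverse to put the newest first,
  # then print everything after the first $keep entries.
  find "$dir" -maxdepth 1 -type f -name 'authority-*Z.gz' \
    | sort -r | tail -n "+$((keep + 1))"
}
```

Monthly copies would need to be set aside before acting on this list.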
## Verification Checklist

- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/Authority/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).

174
docs/modules/authority/operations/grafana-dashboard.json
Normal file
@@ -0,0 +1,174 @@
{
  "title": "StellaOps Authority - Token & Access Monitoring",
  "uid": "authority-token-monitoring",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0,
        "current": {}
      }
    ]
  },
  "panels": [
    {
      "id": 1,
      "title": "Token Requests – Success vs Failure",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{grant_type}} ({{status}})"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
          "legendFormat": "{{grant_type}} {{status}}"
        }
      ],
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 2,
      "title": "Rate Limiter Rejections",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{limiter}}"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
          "legendFormat": "{{limiter}}"
        }
      ]
    },
    {
      "id": 3,
      "title": "Bypass Events (5m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 4,
      "title": "Lockout Events (15m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 5 },
              { "color": "red", "value": 10 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 5,
      "title": "Trace Explorer Shortcut",
      "type": "text",
      "options": {
        "mode": "markdown",
        "content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
      }
    }
  ],
  "links": []
}
94
docs/modules/authority/operations/key-rotation.md
Normal file
@@ -0,0 +1,94 @@
# Authority Signing Key Rotation Playbook

> **Status:** Authored 2025-10-12 as part of OPS3.KEY-ROTATION rollout.
> Use together with `docs/11_AUTHORITY.md` (Authority service guide) and the automation shipped under `ops/authority/`.

## 1. Overview

Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide:

- **Automation script:** `ops/authority/key-rotation.sh`
  Shell helper that POSTs to `/internal/signing/rotate`, supports metadata and dry-run, and confirms the JWKS afterwards.
- **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`
  Manual dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. Works across staging/production by passing the `environment` input.

This playbook documents the repeatable sequence for all environments.

## 2. Pre-requisites

1. **Generate a new PEM key (per environment)**
   ```bash
   openssl ecparam -name prime256v1 -genkey -noout \
     -out certificates/authority-signing-<env>-<year>.pem
   chmod 600 certificates/authority-signing-<env>-<year>.pem
   ```
2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation.
3. **Ensure secrets/vars exist in Gitea**
   - `<ENV>_AUTHORITY_BOOTSTRAP_KEY`
   - `<ENV>_AUTHORITY_URL`
   - Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY`, `AUTHORITY_URL`.

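ES256 signs with the NIST P-256 curve, which OpenSSL names `prime256v1`. Before handing a PEM to the rotation script it can be worth confirming the curve; the sketch below does that by inspecting the text dump from `openssl ec`. The helper name `is_p256_key` is hypothetical, and the check assumes the key was generated with a named curve, as in step 1.

```bash
# is_p256_key FILE: succeed only if FILE is an EC private key on prime256v1.
is_p256_key() {
  local pem="$1" text
  # `openssl ec -text` prints the curve as "ASN1 OID: <name>" for named-curve keys.
  text=$(openssl ec -in "$pem" -noout -text 2>/dev/null) || return 1
  case "$text" in
    *"ASN1 OID: prime256v1"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper could refuse to rotate when `is_p256_key certificates/authority-signing-<env>-<year>.pem` fails.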
## 3. Executing the rotation

### Option A – via CI workflow (recommended)

1. Navigate to **Actions → Authority Key Rotation**.
2. Provide inputs:
   - `environment`: `staging`, `production`, etc.
   - `key_id`: new `kid` (e.g. `authority-signing-2025-dev`).
   - `key_path`: path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`).
   - Optional `metadata`: comma-separated `key=value` pairs (for audit trails).
3. Trigger. The workflow:
   - Reads the bootstrap key/URL from secrets.
   - Runs `ops/authority/key-rotation.sh`.
   - Prints the JWKS response for verification.

### Option B – manual shell invocation

```bash
AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \
./ops/authority/key-rotation.sh \
  --authority-url https://authority.example.com \
  --key-id authority-signing-2025-dev \
  --key-path ../certificates/authority-signing-2025-dev.pem \
  --meta rotatedBy=ops --meta changeTicket=OPS-1234
```

Use `--dry-run` to inspect the payload before execution.

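Option A takes metadata as one comma-separated string, while the script in Option B accepts repeated `--meta` flags. How the workflow bridges the two is not shown in this playbook; the following is one possible sketch, with `meta_to_flags` as a hypothetical helper name. It assumes values never contain commas.

```bash
# meta_to_flags "k1=v1,k2=v2": print "--meta" and each pair on separate lines.
meta_to_flags() {
  local csv="$1" pair
  local -a pairs=() flags=()
  [ -n "$csv" ] || return 0          # no metadata supplied
  IFS=',' read -r -a pairs <<< "$csv"
  for pair in "${pairs[@]}"; do
    # Reject anything that is not key=value so typos fail before the POST.
    [[ "$pair" == ?*=* ]] || { echo "invalid metadata pair: $pair" >&2; return 1; }
    flags+=(--meta "$pair")
  done
  printf '%s\n' "${flags[@]}"
}
```

The output can be collected with `mapfile -t meta_flags < <(meta_to_flags "$METADATA")` and appended to the script invocation as `"${meta_flags[@]}"`.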
## 4. Post-rotation checklist

1. Update `authority.yaml` (or environment-specific overrides):
   - Set `signing.activeKeyId` to the new key.
   - Set `signing.keyPath` to the new PEM.
   - Append the previous key into `signing.additionalKeys`.
   - Ensure `keySource`/`provider` match the values passed to the script.
2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key.
3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired`.
4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired.

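The JWKS check in step 3 can be scripted. The sketch below assumes `jq` is installed and that each entry in the JWKS `keys` array carries `kid` and `status` members, as the step describes; `kid_is_active` is a hypothetical helper that reads the JWKS document from standard input.

```bash
# kid_is_active KID: read a JWKS document on stdin and succeed
# only if a key with that kid is present and marked active.
kid_is_active() {
  local kid="$1"
  # `any` yields true/false; jq -e maps false to a non-zero exit status.
  jq -e --arg kid "$kid" \
    'any(.keys[]; .kid == $kid and .status == "active")' > /dev/null
}
```

For example, `curl -fsS "$AUTHORITY_URL/jwks" | kid_is_active authority-signing-2025-dev`, where `AUTHORITY_URL` stands for the base URL of the environment being rotated.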
## 5. Development key state

For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key:

- Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`)
- Retired: `authority-signing-dev`

Treat these as examples; real environments must maintain their own PEM material.

## 6. References

- `docs/11_AUTHORITY.md` – Architecture and rotation SOP (Section 5).
- `docs/modules/authority/operations/backup-restore.md` – Recovery flow referencing this playbook.
- `ops/authority/README.md` – CLI usage and examples.
- `scripts/rotate-policy-cli-secret.sh` – Helper to mint new `policy-cli` shared secrets when policy scope bundles change.

## 7. Appendix – Policy CLI secret rotation

Scope migrations such as AUTH-POLICY-23-004 require issuing fresh credentials for the `policy-cli` client. Use the helper script committed with the repo to keep secrets deterministic across environments.

```bash
./scripts/rotate-policy-cli-secret.sh --output etc/secrets/policy-cli.secret
```

The script writes a timestamped header and a random secret into the target file. Use `--dry-run` when generating material for external secret stores. After updating secrets in staging/production, recycle the Authority pods and confirm the new client credentials work before the next release freeze.
83
docs/modules/authority/operations/monitoring.md
Normal file
@@ -0,0 +1,83 @@
# Authority Monitoring & Alerting Playbook

## Telemetry Sources
- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors).
  - `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).

## Prometheus Metrics to Collect
| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"[45].."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when > 5 % for 10 min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate-limiter saturation (requires OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |

> **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.

## Alert Rules
1. **Token Failure Surge**
   - _Expression_: `token_failure_ratio > 0.05`
   - _For_: `10m`
   - _Labels_: `severity="critical"`
   - _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as diagnostic hint (requires span → metric transformation).
2. **Lockout Spike**
   - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
   - _For_: `15m`
   - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
   - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
   - _For_: `5m`
   - Alert severity `warning`; verify the calling host list.
4. **Rate Limiter Saturation**
   - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.

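Before wiring a threshold into alerting, it can help to evaluate the expression by hand against the Prometheus HTTP API. The sketch below parses an instant-query response and compares the first sample with a threshold. `ratio_exceeds` is a hypothetical helper, `jq` and `awk` are assumed to be installed, and querying `token_failure_ratio` by name assumes it exists as a recording rule.

```bash
# ratio_exceeds THRESHOLD: read a Prometheus instant-query response on stdin.
# Exit 0 if the first sample is above THRESHOLD, 1 if not, 2 if there is no sample.
ratio_exceeds() {
  local threshold="$1" value
  # Instant vectors carry each sample as [timestamp, "value"].
  value=$(jq -r '.data.result[0].value[1] // empty') || return 2
  [ -n "$value" ] || return 2
  awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v > t) }'
}
```

For example, `curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode 'query=token_failure_ratio' | ratio_exceeds 0.05`, where `PROM_URL` is a placeholder for the Prometheus base URL.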
## Grafana Dashboard
- Import `docs/modules/authority/operations/grafana-dashboard.json` to provision baseline panels:
  - **Token Success vs Failure** – stacked rate visualization split by grant type.
  - **Rate Limiter Hits** – bar chart showing `authority-token` and `authority-authorize`.
  - **Bypass & Lockout Events** – dual-stat panel using Loki-derived counters.
  - **Trace Explorer Link** – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.

## Collector Configuration Snippets
```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # The logs pipeline below references `loki`, so the exporter must be declared.
  # The endpoint is a placeholder; point it at your own Loki push URL.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
processors:
  batch:
  attributes/token_grant:
    actions:
      # Copy the span attribute into the `grant_type` label used by the queries.
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

## Operational Checklist
- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.