Align AOC tasks for Excititor and Concelier
@@ -1,22 +1,22 @@
# Authority agent guide

## Mission
Authority is the platform OIDC/OAuth2 control plane that mints short-lived, sender-constrained operational tokens (OpToks) for every StellaOps service and tool.

## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)

## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.

## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
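
The determinism guardrail above can be illustrated with a small shell sketch. It is only an illustration, not part of the Authority tooling: the helper names `iso_utc` and `stable_sort` are hypothetical, and the `-d @EPOCH` form assumes GNU `date`.

```bash
# Render a Unix epoch as a UTC ISO-8601 timestamp.
# Assumption: GNU date (the `-d @EPOCH` form is not portable to BSD date).
iso_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Sort lines bytewise so the order does not depend on the machine's locale.
stable_sort() {
  LC_ALL=C sort
}
```

For example, `some_command | stable_sort` yields the same ordering on every host, and `iso_utc "$(date +%s)"` stamps output in UTC regardless of the local timezone.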
@@ -1,40 +1,40 @@
# StellaOps Authority

Authority is the platform OIDC/OAuth2 control plane that mints short-lived, sender-constrained operational tokens (OpToks) for every StellaOps service and tool.

## Responsibilities
- Expose device-code, auth-code, and client-credential flows with DPoP or mTLS binding.
- Manage signing keys, JWKS rotation, and PoE integration for plan enforcement.
- Emit structured audit events and enforce tenant-aware scope policies.
- Provide plugin surface for custom identity providers and credential validators.

## Key components
- `StellaOps.Authority` web host.
- `StellaOps.Authority.Plugin.*` extensions for secret stores, identity bridges, and OpTok validation.
- Telemetry and audit pipeline feeding Security/Observability stacks.

## Integrations & dependencies
- Signer/Attestor for PoE and OpTok introspection.
- CLI/UI for login flows and token management.
- Scheduler/Scanner for machine-to-machine scope enforcement.

## Operational notes
- MongoDB for tenant, client, and token state.
- Key material in KMS/HSM with rotation runbooks (see ./operations/key-rotation.md).
- Grafana/Prometheus dashboards for auth latency/issuance.

## Related resources
- ./operations/backup-restore.md
- ./operations/key-rotation.md
- ./operations/monitoring.md
- ./operations/grafana-dashboard.json

## Backlog references
- DOCS-SEC-62-001 (scope hardening doc) in ../../TASKS.md.
- AUTH-POLICY-20-001/002 follow-ups in src/Authority/StellaOps.Authority/TASKS.md.

## Epic alignment
- **Epic 1 – AOC enforcement:** enforce OpTok scopes and guardrails supporting raw ingestion boundaries.
- **Epic 2 – Policy Engine & Editor:** supply policy evaluation/principal scopes and short-lived tokens for evaluator workflows.
- **Epic 4 – Policy Studio:** integrate approval/promotion signatures and policy registry access controls.
- **Epic 14 – Identity & Tenancy:** deliver tenant isolation, RBAC hierarchies, and governance tooling for authentication.
@@ -1,9 +1,9 @@
# Task board — Authority

> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.

| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| AUTHORITY-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| AUTHORITY-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| AUTHORITY-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |
@@ -1,22 +1,22 @@
# Implementation plan — Authority

## Current objectives
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.

## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.

## Epic milestones
- **Epic 1 – AOC enforcement:** deliver OpTok scopes, guardrails, and AOC verifier hooks for ingestion services.
- **Epic 2 – Policy Engine & Editor:** support policy evaluator flows (device-code, client credentials, scope sandboxing).
- **Epic 4 – Policy Studio:** provide registry/promotion signing, approvals, and fresh-auth prompts.
- **Epic 14 – Identity & Tenancy:** implement tenant isolation, RBAC hierarchies, audit trails, and PoE integration.
- Track additional work (DOCS-SEC-62-001, AUTH-POLICY-20-001/002) in ../../TASKS.md and src/Authority/**/TASKS.md.

## Coordination
- Review ./AGENTS.md before picking up new work.
- Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md.
- Update this plan whenever scope, dependencies, or guardrails change.
@@ -1,97 +1,97 @@
# Authority Backup & Restore Runbook

## Scope
- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24 h in production. Store snapshots in an encrypted, access-controlled vault.

## Inventory Checklist
| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |

> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.

## Hot Backup (no downtime)
1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**
   ```bash
   # Capture one UTC timestamp so the dump and the copy refer to the same file.
   TS="$(date -u +%Y%m%dT%H%M%SZ)"
   docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
     mongodump --archive="/dump/authority-${TS}.gz" \
     --gzip --db stellaops-authority
   docker compose -f ops/authority/docker-compose.authority.yaml cp \
     "mongo:/dump/authority-${TS}.gz" backup/
   ```
   The `mongodump` archive includes index definitions, which `mongorestore --archive --gzip` rebuilds on restore.
3. **Capture configuration + manifests:**
   ```bash
   cp etc/authority.yaml backup/
   rsync -a etc/authority.plugins/ backup/authority.plugins/
   ```
4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:
   ```bash
   # -u keeps the timestamp in UTC so the trailing Z is accurate.
   docker run --rm \
     -v authority-keys:/keys \
     -v "$(pwd)/backup:/backup" \
     busybox tar czf "/backup/authority-keys-$(date -u +%Y%m%dT%H%M%SZ).tar.gz" -C /keys .
   ```
5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts.
6. **Encrypt & upload:** encrypt the backup folder with your standard tooling (e.g., age, GPG) and upload it to the designated offline vault.
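
The checksum step above could be scripted roughly as follows. This is a minimal sketch, assuming GNU coreutils and findutils; the function names and the `SHA256SUMS` manifest name are illustrative choices, not something the runbook prescribes.

```bash
# Write SHA-256 digests for every regular file under a backup directory
# into SHA256SUMS inside that directory (manifest name is an assumption).
write_checksums() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # Skip the manifest itself; sort so the file order is deterministic.
    find . -type f ! -name SHA256SUMS -print0 \
      | LC_ALL=C sort -z \
      | xargs -0 -r sha256sum > SHA256SUMS
  )
}

# Re-check every digest; returns non-zero if a file changed or is missing.
verify_checksums() {
  local dir="$1"
  ( cd "$dir" && sha256sum --quiet -c SHA256SUMS )
}
```

Running `write_checksums backup` before encryption and `verify_checksums backup` after decryption on the restore host gives a simple end-to-end integrity check.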

## Cold Backup (planned downtime)
1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:
   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml down
   ```
3. Back up volumes directly using `tar`:
   ```bash
   docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
   ```
4. Copy configuration + manifests as in the hot backup (steps 3–6).
5. Restart services and verify health:
   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml up -d
   curl -fsS http://localhost:8080/ready
   ```
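
Immediately after `up -d` the service may still be starting, so a single `curl` against `/ready` can fail spuriously. A hedged sketch of a retry loop follows; the helper name `wait_until_ready` and its default attempt count and delay are assumptions, not values taken from the Authority tooling.

```bash
# Poll a readiness URL until curl succeeds (no HTTP error) or attempts run out.
# Arguments: URL, max attempts (default 30), delay in seconds (default 2).
wait_until_ready() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i
  for ((i = 1; i <= attempts; i++)); do
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  echo "service at $url not ready after $attempts attempts" >&2
  return 1
}
```

Usage would look like `wait_until_ready http://localhost:8080/ready`, failing the runbook step with a non-zero exit code if the service never becomes ready.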

## Restore Procedure
1. **Provision clean volumes:** remove existing volumes if you’re rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**
   ```bash
   docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
   ```
   Use `--drop` to replace collections; omit it for a partial restore.
3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:
   ```bash
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
   ```
   Ensure private key files keep mode `600`. Apply `chmod 600` to the key files only; `chmod -R 600` would also strip the execute bit from directories and make them untraversable.
5. **Start services & validate:**
   ```bash
   docker compose up -d
   curl -fsS http://localhost:8080/health
   ```
6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations. If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../../../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`.
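
For the permission check in step 4, one way to tighten a restored key directory without breaking directory traversal is sketched below. It assumes everything under the directory is private key material and that the command runs wherever the volume is mounted; the helper name `secure_key_dir` is hypothetical.

```bash
# Tighten permissions on a restored key directory:
# directories stay traversable by the owner (700), files become owner-only (600).
secure_key_dir() {
  local dir="$1"
  find "$dir" -type d -exec chmod 700 {} +
  find "$dir" -type f -exec chmod 600 {} +
}
```

Splitting the work by file type is what a recursive `chmod` with a single mode cannot do. Installations that keep public material in the same directory may want looser modes for those files.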

## Disaster Recovery Notes
- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and [`docs/11_AUTHORITY.md`](../../../11_AUTHORITY.md)), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Driver 3.5.0 requires MongoDB **4.2+**; clusters still on 4.0 must be upgraded before restore, and future driver releases will drop 4.0 entirely.
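
The daily half of the retention rule above could be checked with a sketch like the following. It assumes GNU `find` and archive names that embed a sortable UTC stamp (for example `authority-20250101T000000Z.gz`), so lexical order matches chronological order. The helper name is hypothetical, and the 12 monthly copies are not handled here.

```bash
# List daily archives that fall outside the retention window (newest N kept).
# Assumption: lexical order of the file names equals chronological order.
list_expired_dailies() {
  local dir="$1" keep="${2:-30}"
  find "$dir" -maxdepth 1 -type f -name 'authority-*.gz' -printf '%f\n' \
    | LC_ALL=C sort -r \
    | tail -n "+$((keep + 1))"
}
```

The function only prints candidates. Reviewing that list before deleting anything keeps an operator in the loop, which seems prudent for backup material.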

## Verification Checklist
- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/Authority/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).
@@ -1,174 +1,174 @@
{
  "title": "StellaOps Authority - Token & Access Monitoring",
  "uid": "authority-token-monitoring",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0,
        "current": {}
      }
    ]
  },
  "panels": [
    {
      "id": 1,
      "title": "Token Requests – Success vs Failure",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{grant_type}} ({{status}})"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
          "legendFormat": "{{grant_type}} {{status}}"
        }
      ],
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 2,
      "title": "Rate Limiter Rejections",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{limiter}}"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
          "legendFormat": "{{limiter}}"
        }
      ]
    },
    {
      "id": 3,
      "title": "Bypass Events (5m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 4,
      "title": "Lockout Events (15m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 5 },
              { "color": "red", "value": 10 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 5,
      "title": "Trace Explorer Shortcut",
      "type": "text",
      "options": {
        "mode": "markdown",
        "content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
      }
    }
  ],
  "links": []
}
|
||||
{
|
||||
"title": "StellaOps Authority - Token & Access Monitoring",
|
||||
"uid": "authority-token-monitoring",
|
||||
"schemaVersion": 38,
|
||||
"version": 1,
|
||||
"editable": true,
|
||||
"timezone": "",
|
||||
"graphTooltip": 0,
|
||||
"time": {
|
||||
"from": "now-6h",
|
||||
"to": "now"
|
||||
},
|
||||
"templating": {
|
||||
"list": [
|
||||
{
|
||||
"name": "datasource",
|
||||
"type": "datasource",
|
||||
"query": "prometheus",
|
||||
"refresh": 1,
|
||||
"hide": 0,
|
||||
"current": {}
|
||||
}
|
||||
]
|
||||
},
|
||||
"panels": [
|
||||
{
|
||||
"id": 1,
|
||||
"title": "Token Requests – Success vs Failure",
|
||||
"type": "timeseries",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "${datasource}"
|
||||
},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "req/s",
|
||||
"displayName": "{{grant_type}} ({{status}})"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
|
||||
"legendFormat": "{{grant_type}} {{status}}"
|
||||
}
|
||||
],
|
||||
"options": {
|
||||
"legend": {
|
||||
"displayMode": "table",
|
||||
"placement": "bottom"
|
||||
},
|
||||
"tooltip": {
|
||||
"mode": "multi"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 2,
|
||||
"title": "Rate Limiter Rejections",
|
||||
"type": "timeseries",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "${datasource}"
|
||||
},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "req/s",
|
||||
"displayName": "{{limiter}}"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
|
||||
"legendFormat": "{{limiter}}"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 3,
|
||||
"title": "Bypass Events (5m)",
|
||||
"type": "stat",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "${datasource}"
|
||||
},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"color": {
|
||||
"mode": "thresholds"
|
||||
},
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "color": "green", "value": null },
|
||||
{ "color": "orange", "value": 1 },
|
||||
{ "color": "red", "value": 5 }
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
|
||||
}
|
||||
],
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"calcs": ["last"],
|
||||
"fields": "",
|
||||
"values": false
|
||||
},
|
||||
"orientation": "horizontal",
|
||||
"textMode": "auto"
|
||||
}
|
||||
},
    {
      "id": 4,
      "title": "Lockout Events (15m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 5 },
              { "color": "red", "value": 10 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 5,
      "title": "Trace Explorer Shortcut",
      "type": "text",
      "options": {
        "mode": "markdown",
        "content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
      }
    }
  ],
  "links": []
}

@@ -1,94 +1,94 @@
# Authority Signing Key Rotation Playbook

> **Status:** Authored 2025-10-12 as part of OPS3.KEY-ROTATION rollout.
> Use together with `docs/11_AUTHORITY.md` (Authority service guide) and the automation shipped under `ops/authority/`.

## 1. Overview

Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide:

- **Automation script:** `ops/authority/key-rotation.sh`
  Shell helper that POSTs to `/internal/signing/rotate`, supports metadata, dry-run, and confirms JWKS afterwards.
- **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`
  Manual dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. Works across staging/production by passing the `environment` input.

This playbook documents the repeatable sequence for all environments.

## 2. Pre-requisites

1. **Generate a new PEM key (per environment)**
   ```bash
   openssl ecparam -name prime256v1 -genkey -noout \
     -out certificates/authority-signing-<env>-<year>.pem
   chmod 600 certificates/authority-signing-<env>-<year>.pem
   ```
2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation.
3. **Ensure secrets/vars exist in Gitea**
   - `<ENV>_AUTHORITY_BOOTSTRAP_KEY`
   - `<ENV>_AUTHORITY_URL`
   - Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY`, `AUTHORITY_URL`.
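
The per-environment secrets with optional shared defaults imply a lookup order: environment-specific value first, shared default second. The sketch below shows one way a shell step could resolve that order. It is an assumption for illustration, not the workflow's actual implementation, and `resolve_setting` is a hypothetical helper name.

```bash
# Hypothetical helper: look up "<ENV>_<NAME>" first, then fall back to the
# shared "<NAME>" default. Prints the value, or returns non-zero if neither
# variable is set.
resolve_setting() {
  env_prefix=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  name=$2
  # Only allow identifier characters before handing the names to eval.
  case "${env_prefix}_${name}" in
    *[!A-Z0-9_]*) echo "invalid variable name" >&2; return 2 ;;
  esac
  eval "value=\${${env_prefix}_${name}:-}"
  if [ -z "$value" ]; then
    eval "value=\${${name}:-}"
  fi
  [ -n "$value" ] || return 1
  printf '%s\n' "$value"
}
```

A step could then call `resolve_setting "$environment" AUTHORITY_URL` and abort when the function returns non-zero, so a missing secret fails early rather than during the rotation request.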

## 3. Executing the rotation

### Option A – via CI workflow (recommended)

1. Navigate to **Actions → Authority Key Rotation**.
2. Provide inputs:
   - `environment`: `staging`, `production`, etc.
   - `key_id`: new `kid` (e.g. `authority-signing-2025-dev`).
   - `key_path`: path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`).
   - Optional `metadata`: comma-separated `key=value` pairs (for audit trails).
3. Trigger. The workflow:
   - Reads the bootstrap key/URL from secrets.
   - Runs `ops/authority/key-rotation.sh`.
   - Prints the JWKS response for verification.
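
The workflow's `metadata` input is a single comma-separated string, while the script takes repeated `--meta key=value` flags. The source does not show how the workflow bridges the two; the sketch below is one plausible conversion, and `metadata_to_flags` is a hypothetical name.

```bash
# Hypothetical helper: turn "k1=v1,k2=v2" into "--meta k1=v1 --meta k2=v2".
# Runs in a subshell so the IFS and globbing changes do not leak to the caller.
metadata_to_flags() (
  set -f            # disable globbing while the input is split
  IFS=,
  flags=
  for pair in $1; do
    case $pair in
      ?*=*) flags="${flags:+$flags }--meta $pair" ;;
      *) echo "invalid metadata pair: $pair" >&2; exit 1 ;;
    esac
  done
  printf '%s\n' "$flags"
)
```

The sketch rejects entries without a key or an `=` sign and does not trim whitespace around commas, so inputs such as `a=b, c=d` would need normalising first.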

### Option B – manual shell invocation

```bash
AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \
./ops/authority/key-rotation.sh \
  --authority-url https://authority.example.com \
  --key-id authority-signing-2025-dev \
  --key-path ../certificates/authority-signing-2025-dev.pem \
  --meta rotatedBy=ops --meta changeTicket=OPS-1234
```

Use `--dry-run` to inspect the payload before execution.

## 4. Post-rotation checklist

1. Update `authority.yaml` (or environment-specific overrides):
   - Set `signing.activeKeyId` to the new key.
   - Set `signing.keyPath` to the new PEM.
   - Append the previous key into `signing.additionalKeys`.
   - Ensure `keySource`/`provider` match the values passed to the script.
2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key.
3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired`.
4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired.

## 5. Development key state

For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key:

- Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`)
- Retired: `authority-signing-dev`

Treat these as examples; real environments must maintain their own PEM material.

## 6. References

- `docs/11_AUTHORITY.md` – Architecture and rotation SOP (Section 5).
- `docs/modules/authority/operations/backup-restore.md` – Recovery flow referencing this playbook.
- `ops/authority/README.md` – CLI usage and examples.
- `scripts/rotate-policy-cli-secret.sh` – Helper to mint new `policy-cli` shared secrets when policy scope bundles change.

## 7. Appendix: Policy CLI secret rotation

Scope migrations such as AUTH-POLICY-23-004 require issuing fresh credentials for the `policy-cli` client. Use the helper script committed with the repo to keep secrets deterministic across environments.

```bash
./scripts/rotate-policy-cli-secret.sh --output etc/secrets/policy-cli.secret
```

The script writes a timestamped header and a random secret into the target file. Use `--dry-run` when generating material for external secret stores. After updating secrets in staging/production, recycle the Authority pods and confirm the new client credentials work before the next release freeze.
@@ -1,83 +1,83 @@
# Authority Monitoring & Alerting Playbook

## Telemetry Sources
- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors).
  - `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).

## Prometheus Metrics to Collect

| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when > 5 % for 10 min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiting saturations (requires OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |

> **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.

## Alert Rules
1. **Token Failure Surge**
   - _Expression_: `token_failure_ratio > 0.05`
   - _For_: `10m`
   - _Labels_: `severity="critical"`
   - _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as diagnostic hint (requires span → metric transformation).
2. **Lockout Spike**
   - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
   - _For_: `15m`
   - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
   - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
   - _For_: `5m`
   - Alert severity `warning`; verify the calling host list.
4. **Rate Limiter Saturation**
   - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.
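
To make the Token Failure Surge threshold concrete, the sketch below mirrors the `token_failure_ratio > 0.05` comparison using integer arithmetic. The function name and the sample counts are illustrative assumptions; in a real deployment Prometheus evaluates the expression, not a shell script.

```bash
# Illustrative only: succeeds (exit 0) when failed/total is strictly above 5 %.
# Cross-multiplying avoids floating point: failed/total > 5/100 is
# equivalent to failed*100 > total*5 whenever total > 0.
token_failure_surge() {
  failed=$1
  total=$2
  [ "$total" -gt 0 ] || return 1   # no traffic, so there is no ratio to evaluate
  [ $((failed * 100)) -gt $((total * 5)) ]
}
```

With 6 failed requests out of 100 the check succeeds; with exactly 5 out of 100 it does not, because the rule uses a strict `>`. The alert itself fires only after the condition has held for the full `10m` window.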

## Grafana Dashboard
- Import `docs/modules/authority/operations/grafana-dashboard.json` to provision baseline panels:
  - **Token Success vs Failure** – stacked rate visualization split by grant type.
  - **Rate Limiter Hits** – bar chart showing `authority-token` and `authority-authorize`.
  - **Bypass & Lockout Events** – dual-stat panel using Loki-derived counters.
  - **Trace Explorer Link** – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.

## Collector Configuration Snippets
```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # The logs pipeline below references a loki exporter, so it must be declared.
  # The endpoint is a placeholder (an assumption, not part of the original snippet).
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
processors:
  batch:
  # Copy the authority.grant_type attribute into a plain grant_type label.
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

## Operational Checklist
- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.