feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
This commit is contained in:
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions

View File

@@ -0,0 +1,97 @@
# Authority Backup & Restore Runbook
## Scope
- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24h in production. Store snapshots in an encrypted, access-controlled vault.
## Inventory Checklist
| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |
> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.
## Hot Backup (no downtime)
1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**
```bash
docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
mongodump --archive=/dump/authority-$(date +%Y%m%dT%H%M%SZ).gz \
--gzip --db stellaops-authority
docker compose -f ops/authority/docker-compose.authority.yaml cp \
mongo:/dump/authority-$(date +%Y%m%dT%H%M%SZ).gz backup/
```
The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`.
3. **Capture configuration + manifests:**
```bash
cp etc/authority.yaml backup/
rsync -a etc/authority.plugins/ backup/authority.plugins/
```
4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:
```bash
docker run --rm \
-v authority-keys:/keys \
-v "$(pwd)/backup:/backup" \
busybox tar czf /backup/authority-keys-$(date +%Y%m%dT%H%M%SZ).tar.gz -C /keys .
```
5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts.
6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault.
## Cold Backup (planned downtime)
1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:
```bash
docker compose -f ops/authority/docker-compose.authority.yaml down
```
3. Back up volumes directly using `tar`:
```bash
docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
```
4. Copy configuration + manifests as in the hot backup (steps 36).
5. Restart services and verify health:
```bash
docker compose -f ops/authority/docker-compose.authority.yaml up -d
curl -fsS http://localhost:8080/ready
```
## Restore Procedure
1. **Provision clean volumes:** remove existing volumes if youre rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**
```bash
docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
```
Use `--drop` to replace collections; omit if doing a partial restore.
3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:
```bash
docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
```
Ensure file permissions remain `600` for private keys (`chmod -R 600`).
5. **Start services & validate:**
```bash
docker compose up -d
curl -fsS http://localhost:8080/health
```
6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations. If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`.
## Disaster Recovery Notes
- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and `docs/11_AUTHORITY.md`), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Driver 3.5.0 requires MongoDB **4.2+**—clusters still on 4.0 must be upgraded before restore, and future driver releases will drop 4.0 entirely. citeturn1open1
## Verification Checklist
- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/Authority/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).

View File

@@ -0,0 +1,174 @@
{
"title": "StellaOps Authority - Token & Access Monitoring",
"uid": "authority-token-monitoring",
"schemaVersion": 38,
"version": 1,
"editable": true,
"timezone": "",
"graphTooltip": 0,
"time": {
"from": "now-6h",
"to": "now"
},
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"refresh": 1,
"hide": 0,
"current": {}
}
]
},
"panels": [
{
"id": 1,
"title": "Token Requests Success vs Failure",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "req/s",
"displayName": "{{grant_type}} ({{status}})"
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
"legendFormat": "{{grant_type}} {{status}}"
}
],
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 2,
"title": "Rate Limiter Rejections",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "req/s",
"displayName": "{{limiter}}"
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
"legendFormat": "{{limiter}}"
}
]
},
{
"id": 3,
"title": "Bypass Events (5m)",
"type": "stat",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "short",
"color": {
"mode": "thresholds"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 1 },
{ "color": "red", "value": 5 }
]
}
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
}
],
"options": {
"reduceOptions": {
"calcs": ["last"],
"fields": "",
"values": false
},
"orientation": "horizontal",
"textMode": "auto"
}
},
{
"id": 4,
"title": "Lockout Events (15m)",
"type": "stat",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "short",
"color": {
"mode": "thresholds"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 5 },
{ "color": "red", "value": 10 }
]
}
},
"overrides": []
},
"targets": [
{
"refId": "A",
"expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
}
],
"options": {
"reduceOptions": {
"calcs": ["last"],
"fields": "",
"values": false
},
"orientation": "horizontal",
"textMode": "auto"
}
},
{
"id": 5,
"title": "Trace Explorer Shortcut",
"type": "text",
"options": {
"mode": "markdown",
"content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
}
}
],
"links": []
}

View File

@@ -0,0 +1,94 @@
# Authority Signing Key Rotation Playbook
> **Status:** Authored 2025-10-12 as part of OPS3.KEY-ROTATION rollout.
> Use together with `docs/11_AUTHORITY.md` (Authority service guide) and the automation shipped under `ops/authority/`.
## 1. Overview
Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide:
- **Automation script:** `ops/authority/key-rotation.sh`
Shell helper that POSTS to `/internal/signing/rotate`, supports metadata, dry-run, and confirms JWKS afterwards.
- **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`
Manual dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. Works across staging/production by passing the `environment` input.
This playbook documents the repeatable sequence for all environments.
## 2. Pre-requisites
1. **Generate a new PEM key (per environment)**
```bash
openssl ecparam -name prime256v1 -genkey -noout \
-out certificates/authority-signing-<env>-<year>.pem
chmod 600 certificates/authority-signing-<env>-<year>.pem
```
2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation.
3. **Ensure secrets/vars exist in Gitea**
- `<ENV>_AUTHORITY_BOOTSTRAP_KEY`
- `<ENV>_AUTHORITY_URL`
- Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY`, `AUTHORITY_URL`.
## 3. Executing the rotation
### Option A via CI workflow (recommended)
1. Navigate to **Actions → Authority Key Rotation**.
2. Provide inputs:
- `environment`: `staging`, `production`, etc.
- `key_id`: new `kid` (e.g. `authority-signing-2025-dev`).
- `key_path`: path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`).
- Optional `metadata`: comma-separated `key=value` pairs (for audit trails).
3. Trigger. The workflow:
- Reads the bootstrap key/URL from secrets.
- Runs `ops/authority/key-rotation.sh`.
- Prints the JWKS response for verification.
### Option B manual shell invocation
```bash
AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \
./ops/authority/key-rotation.sh \
--authority-url https://authority.example.com \
--key-id authority-signing-2025-dev \
--key-path ../certificates/authority-signing-2025-dev.pem \
--meta rotatedBy=ops --meta changeTicket=OPS-1234
```
Use `--dry-run` to inspect the payload before execution.
## 4. Post-rotation checklist
1. Update `authority.yaml` (or environment-specific overrides):
- Set `signing.activeKeyId` to the new key.
- Set `signing.keyPath` to the new PEM.
- Append the previous key into `signing.additionalKeys`.
- Ensure `keySource`/`provider` match the values passed to the script.
2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key.
3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired`.
4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired.
## 5. Development key state
For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key:
- Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`)
- Retired: `authority-signing-dev`
Treat these as examples; real environments must maintain their own PEM material.
## 6. References
- `docs/11_AUTHORITY.md` Architecture and rotation SOP (Section 5).
- `docs/modules/authority/operations/backup-restore.md` Recovery flow referencing this playbook.
- `ops/authority/README.md` CLI usage and examples.
- `scripts/rotate-policy-cli-secret.sh` Helper to mint new `policy-cli` shared secrets when policy scope bundles change.
## 7. Appendix — Policy CLI secret rotation
Scope migrations such as AUTH-POLICY-23-004 require issuing fresh credentials for the `policy-cli` client. Use the helper script committed with the repo to keep secrets deterministic across environments.
```bash
./scripts/rotate-policy-cli-secret.sh --output etc/secrets/policy-cli.secret
```
The script writes a timestamped header and a random secret into the target file. Use `--dry-run` when generating material for external secret stores. After updating secrets in staging/production, recycle the Authority pods and confirm the new client credentials work before the next release freeze.

View File

@@ -0,0 +1,83 @@
# Authority Monitoring & Alerting Playbook
## Telemetry Sources
- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
- `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
- `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
- `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
- `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors).
- `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities).
- `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
- `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).
## Prometheus Metrics to Collect
| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when >5% for 10min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiting saturations (requires OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |
> **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.
## Alert Rules
1. **Token Failure Surge**
- _Expression_: `token_failure_ratio > 0.05`
- _For_: `10m`
- _Labels_: `severity="critical"`
- _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as diagnostic hint (requires span → metric transformation).
2. **Lockout Spike**
- _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
- _For_: `15m`
- Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
- _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
- _For_: `5m`
- Alert severity `warning` — verify the calling host list.
4. **Rate Limiter Saturation**
- _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
- Escalate if sustained for 5min; confirm trusted clients arent misconfigured.
## Grafana Dashboard
- Import `docs/modules/authority/operations/grafana-dashboard.json` to provision baseline panels:
- **Token Success vs Failure** stacked rate visualization split by grant type.
- **Rate Limiter Hits** bar chart showing `authority-token` and `authority-authorize`.
- **Bypass & Lockout Events** dual-stat panel using Loki-derived counters.
- **Trace Explorer Link** panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
## Collector Configuration Snippets
```yaml
receivers:
otlp:
protocols:
http:
exporters:
prometheus:
endpoint: "0.0.0.0:9464"
processors:
batch:
attributes/token_grant:
actions:
- key: grant_type
action: upsert
from_attribute: authority.grant_type
service:
pipelines:
metrics:
receivers: [otlp]
processors: [attributes/token_grant, batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [batch]
exporters: [loki]
```
## Operational Checklist
- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.