# Concelier Authority Audit Runbook

_Last updated: 2025-10-22_

This runbook helps operators verify and monitor the StellaOps Concelier ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

## 1. Prerequisites

- Authority integration is enabled in `concelier.yaml` (or via `CONCELIER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
- OTLP metrics/log exporters are configured (`concelier.telemetry.*`) or container stdout is shipped to your SIEM.
- Operators have access to the Concelier job trigger endpoints via CLI or REST for smoke tests.
- The rollout table in `docs/10_CONCELIER_CLI_QUICKSTART.md` has been reviewed so stakeholders align on the staged → enforced toggle timeline.

### Configuration snippet

```yaml
concelier:
  authority:
    enabled: true
    allowAnonymousFallback: false          # keep true only during initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://concelier"
    requiredScopes:
      - "concelier.jobs.trigger"
      - "advisory:read"
      - "advisory:ingest"
    requiredTenants:
      - "tenant-default"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "concelier-jobs"
    clientSecretFile: "/run/secrets/concelier_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"
```

> Store secrets outside source control. Concelier reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.

### Resilience tuning

- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Concelier retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Concelier will fail fast but keep deterministic logs.
- Concelier resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `concelier.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.

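To reason about how long the retry ladder can delay an outage signal, the following minimal sketch sums the configured delays. It assumes the `hh:mm:ss` strings from the sample configuration and counts only the waits between attempts, not the time each request takes. The helper names `parse_timespan` and `worst_case_retry_wait` are hypothetical and not part of Concelier.

```python
from datetime import timedelta


def parse_timespan(value: str) -> timedelta:
    """Parse an 'hh:mm:ss' string as used in the sample configuration."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def worst_case_retry_wait(retry_delays: list[str], enable_retries: bool = True) -> timedelta:
    """Total time spent waiting between attempts if every retry is used.

    Assumption: delays are applied once each, in order. How Concelier
    schedules them internally is not confirmed here.
    """
    if not enable_retries:
        return timedelta(0)
    return sum((parse_timespan(delay) for delay in retry_delays), timedelta(0))


if __name__ == "__main__":
    ladder = ["00:00:01", "00:00:02", "00:00:05"]
    print(worst_case_retry_wait(ladder).total_seconds())         # 8.0
    print(worst_case_retry_wait(ladder, False).total_seconds())  # 0.0
```

With the default ladder the added wait is 8 seconds, which is small compared with the 10-minute `offlineCacheTolerance` in the sample.
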
## 2. Key Signals

### 2.1 Audit log channel

Concelier emits structured audit entries via the `Concelier.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.

```
Concelier authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=concelier-cli scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7
```

| Field        | Sample value            | Meaning                                                                                  |
|--------------|-------------------------|------------------------------------------------------------------------------------------|
| `route`      | `/jobs/definitions`     | Endpoint that processed the request.                                                     |
| `status`     | `200` / `401` / `409`   | Final HTTP status code returned to the caller.                                           |
| `subject`    | `ops@example.com`       | User or service principal subject (falls back to `(anonymous)` when unauthenticated).    |
| `clientId`   | `concelier-cli`         | OAuth client ID provided by Authority (`(none)` if the token lacked the claim).         |
| `scopes`     | `concelier.jobs.trigger advisory:ingest advisory:read` | Normalised scope list extracted from token claims; `(none)` if the token carried none.   |
| `tenant`     | `tenant-default`        | Tenant claim extracted from the Authority token (`(none)` when the token lacked it).     |
| `bypass`     | `True` / `False`        | Indicates whether the request succeeded because its source IP matched a bypass CIDR.    |
| `remote`     | `10.1.4.7`              | Remote IP recorded from the connection / forwarded header test hooks.                    |

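If your log backend does not extract these fields for you, a small parser can split the message into key/value pairs. The sketch below assumes the `key=value` layout of the sample line above and treats the `scopes` value as everything up to the next known key, because scopes are space-separated. `parse_audit_line` is a hypothetical helper, not a Concelier API.

```python
import re

# Field names taken from the table above.
FIELDS = ("route", "status", "subject", "clientId", "scopes", "tenant", "bypass", "remote")
_KEY = re.compile(r"(?:^|\s)(" + "|".join(FIELDS) + r")=")


def parse_audit_line(line: str) -> dict[str, str]:
    """Split an audit message into a dict of the known fields it contains."""
    matches = list(_KEY.finditer(line))
    entry: dict[str, str] = {}
    for index, match in enumerate(matches):
        # A value runs until the next known key (or the end of the line).
        end = matches[index + 1].start() if index + 1 < len(matches) else len(line)
        entry[match.group(1)] = line[match.end():end].strip()
    return entry


if __name__ == "__main__":
    sample = (
        "Concelier authorization audit route=/jobs/definitions status=200 "
        "subject=ops@example.com clientId=concelier-cli "
        "scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7"
    )
    print(parse_audit_line(sample))
```

Fields that are absent from a line (such as `tenant` in the sample) are simply missing from the resulting dict.
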
Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations:

- `status=401 AND bypass=True` – bypass network accepted an unauthenticated call (should be temporary during rollout).
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
- `status=202 AND NOT contains(scopes,"advisory:ingest")` – ingestion attempted without the new AOC scopes; confirm the Authority client registration matches the sample above.
- `tenant!="tenant-default"` – indicates a cross-tenant token was accepted. Ensure Concelier `requiredTenants` is aligned with Authority client registration.
- Spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.

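The combinations above can also be checked offline against exported log entries. The following sketch assumes each entry is already a dict of string fields named as in the audit table; `suspicious_reasons` and its reason strings are illustrative only. It flags single entries, so detecting a spike in `clientId="(none)"` still requires counting in your log backend.

```python
def suspicious_reasons(entry: dict[str, str], expected_tenant: str = "tenant-default") -> list[str]:
    """Return the reasons (possibly none) an audit entry deserves a closer look."""
    reasons: list[str] = []
    status = entry.get("status")
    scopes = entry.get("scopes", "(none)")
    scope_set = set() if scopes == "(none)" else set(scopes.split())

    if status == "401" and entry.get("bypass") == "True":
        reasons.append("401 with bypass=True")
    if status == "202" and not scope_set:
        reasons.append("job triggered without scopes")
    elif status == "202" and "advisory:ingest" not in scope_set:
        reasons.append("job triggered without advisory:ingest")

    tenant = entry.get("tenant")
    if tenant is not None and tenant != expected_tenant:
        reasons.append("unexpected tenant")
    if entry.get("clientId") == "(none)":
        reasons.append("missing clientId")
    return reasons


if __name__ == "__main__":
    entry = {
        "status": "202",
        "scopes": "(none)",
        "bypass": "False",
        "clientId": "concelier-cli",
        "tenant": "tenant-default",
    }
    print(suspicious_reasons(entry))  # ['job triggered without scopes']
```
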
### 2.2 Metrics

Concelier publishes counters under the OTEL meter `StellaOps.Concelier.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.

| Metric name                   | Description                                        | PromQL example |
|-------------------------------|----------------------------------------------------|----------------|
| `web.jobs.triggered`          | Accepted job trigger requests.                     | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
| `web.jobs.trigger.conflict`   | Rejected triggers (already running, disabled…).    | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
| `web.jobs.trigger.failed`     | Server-side job failures.                          | `sum(rate(web_jobs_trigger_failed_total[5m]))` |

> Prometheus/OTEL collectors typically surface counters with a `_total` suffix. Adjust queries to match your pipeline’s generated metric names.

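The naming convention used in the PromQL examples (dots become underscores, counters gain `_total`) can be sketched as below. This mirrors only the pattern visible in the table; real collectors may add unit suffixes or namespace prefixes, so treat `prometheus_counter_name` and `prometheus_label_name` as hypothetical helpers for building queries, not as the exporter's exact behaviour.

```python
def prometheus_label_name(otel_tag: str) -> str:
    """Map an OTEL tag such as 'job.kind' to the label form used in the examples."""
    return otel_tag.replace(".", "_")


def prometheus_counter_name(otel_name: str) -> str:
    """Map an OTEL counter name to the '_total' form used in the examples."""
    name = otel_name.replace(".", "_")
    return name if name.endswith("_total") else name + "_total"


if __name__ == "__main__":
    for metric in ("web.jobs.triggered", "web.jobs.trigger.conflict", "web.jobs.trigger.failed"):
        print(metric, "->", prometheus_counter_name(metric))
```
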
Correlate audit logs with the following global meter exported via `Concelier.SourceDiagnostics`:

- `concelier.source.http.requests_total{concelier_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
- If Grafana dashboards are deployed, extend the “Concelier Jobs” board with the above counters plus a table of recent audit log entries.

## 3. Alerting Guidance

1. **Unauthorized bypass attempt**
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`
   - Action: verify `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.

2. **Missing scopes**
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`
   - Action: audit Authority client registration; ensure `requiredScopes` includes `concelier.jobs.trigger`, `advisory:ingest`, and `advisory:read`.

3. **Trigger failure surge**
   - Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.
   - Action: inspect correlated audit entries and `Concelier.Telemetry` traces for job execution errors.

4. **Conflict spike**
   - Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune threshold).
   - Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.

5. **Authority offline**
   - Watch `Concelier.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.

## 4. Rollout & Verification Procedure

1. **Pre-checks**
   - Align with the rollout phases documented in `docs/10_CONCELIER_CLI_QUICKSTART.md` (validation → rehearsal → enforced) and record the target dates in your change request.
   - Confirm `allowAnonymousFallback` is `false` in production; keep `true` only during staged validation.
   - Validate Authority issuer metadata is reachable from Concelier (`curl https://authority.internal/.well-known/openid-configuration` from the host).

2. **Smoke test with valid token**
   - Obtain a token via CLI: `stella auth login --scope "concelier.jobs.trigger advisory:ingest" --scope advisory:read`.
   - Call a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://concelier.internal/jobs/definitions`.
   - Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=concelier.jobs.trigger advisory:ingest advisory:read`, and `tenant=tenant-default`.

3. **Negative test without token**
   - Call the same endpoint without a token. Expect HTTP 401, `bypass=False`.
   - If the request succeeds, double-check `bypassNetworks` and ensure fallback is disabled.

4. **Bypass check (if applicable)**
   - From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review business justification and expiry date for such entries.

5. **Metrics validation**
   - Ensure the `web.jobs.triggered` counter increments during accepted runs.
   - Exporters should show corresponding spans (`concelier.job.trigger`) if tracing is enabled.

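When recording results for a change request, the expectations from steps 2 to 4 can be encoded as a small checklist. The sketch below assumes you have already collected the HTTP status and the matching audit entry (as a dict of strings) for each step. The step names and `check_step` are hypothetical. Step 4 states no HTTP status, so only the `bypass` flag is checked there.

```python
# Expected outcomes per verification step, taken from the procedure above.
# 'statuses' is None where the runbook does not state an expected HTTP status.
EXPECTED = {
    "valid_token": {"statuses": {200, 202}, "bypass": "False"},
    "no_token": {"statuses": {401}, "bypass": "False"},
    "bypass_ip": {"statuses": None, "bypass": "True"},
}


def check_step(step: str, http_status: int, audit: dict[str, str]) -> list[str]:
    """Return a list of mismatches between observed and expected results."""
    expected = EXPECTED[step]
    problems: list[str] = []
    if expected["statuses"] is not None and http_status not in expected["statuses"]:
        problems.append(f"{step}: unexpected HTTP status {http_status}")
    if audit.get("bypass") != expected["bypass"]:
        problems.append(f"{step}: expected bypass={expected['bypass']}, got {audit.get('bypass')}")
    return problems


if __name__ == "__main__":
    print(check_step("valid_token", 200, {"bypass": "False"}))  # []
    print(check_step("no_token", 200, {"bypass": "True"}))
```

An empty list means the step matched the runbook's expectations. Any entries point at the `bypassNetworks` or fallback settings to re-check.
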
## 5. Troubleshooting

| Symptom | Probable cause | Remediation |
|---------|----------------|-------------|
| Audit log shows `clientId=(none)` for all requests | Authority not issuing `client_id` claim or CLI outdated | Update StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`), or upgrade the CLI token acquisition flow. |
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks` or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Concelier. |
| HTTP 401 with valid token | `requiredScopes` missing from client registration or token audience mismatch | Verify Authority client scopes (`concelier.jobs.trigger`) and ensure the token audience matches `audiences` config. |
| Metrics missing from Prometheus | Telemetry exporters disabled or filter missing OTEL meter | Set `concelier.telemetry.enableMetrics=true`, ensure collector includes `StellaOps.Concelier.WebService.Jobs` meter. |
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or Authority timeout mid-request | Inspect Concelier job logs, re-run with tracing enabled, validate Authority latency. |

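For the unexpected `bypass=True` row, it helps to confirm which configured CIDR a logged `remote` address actually falls into. The sketch below uses Python's standard `ipaddress` module. `matching_bypass_networks` is a hypothetical helper, and Concelier's own matching (for example its handling of forwarded headers or IPv4-mapped IPv6 addresses) may differ.

```python
import ipaddress


def matching_bypass_networks(remote: str, bypass_networks: list[str]) -> list[str]:
    """Return the configured CIDRs that contain the given remote address."""
    address = ipaddress.ip_address(remote)
    hits: list[str] = []
    for cidr in bypass_networks:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only within the same IP version (IPv4 vs IPv6).
        if address.version == network.version and address in network:
            hits.append(cidr)
    return hits


if __name__ == "__main__":
    configured = ["127.0.0.1/32", "::1/128"]  # values from the sample configuration
    print(matching_bypass_networks("127.0.0.1", configured))  # ['127.0.0.1/32']
    print(matching_bypass_networks("10.1.4.7", configured))   # []
```

An empty result for an address that still logs `bypass=True` suggests the running configuration differs from the file you inspected, or that anonymous fallback is still enabled.
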
## 6. References

- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
- `docs/ops/authority-monitoring.md` – Authority-side monitoring and alerting playbook.
- `StellaOps.Concelier.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of audit log fields.