Add Policy DSL Validator, Schema Exporter, and Simulation Smoke tools
- Implemented PolicyDslValidator with command-line options for strict mode and JSON output.
- Created PolicySchemaExporter to generate JSON schemas for policy-related models.
- Developed PolicySimulationSmoke tool to validate policy simulations against expected outcomes.
- Added project files and necessary dependencies for each tool.
- Ensured proper error handling and usage instructions across tools.

# Concelier Authority Audit Runbook

_Last updated: 2025-10-22_

This runbook helps operators verify and monitor the StellaOps Concelier ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

## 1. Prerequisites

- Authority integration is enabled in `concelier.yaml` (or via `CONCELIER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
- OTLP metrics/log exporters are configured (`concelier.telemetry.*`) or container stdout is shipped to your SIEM.
- Operators have access to the Concelier job trigger endpoints via CLI or REST for smoke tests.
- The rollout table in `docs/10_CONCELIER_CLI_QUICKSTART.md` has been reviewed so stakeholders align on the staged → enforced toggle timeline.

### Configuration snippet

```yaml
concelier:
  authority:
    enabled: true
    allowAnonymousFallback: false          # keep true only during initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://concelier"
    requiredScopes:
      - "concelier.jobs.trigger"
      - "advisory:read"
      - "advisory:ingest"
    requiredTenants:
      - "tenant-default"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "concelier-jobs"
    clientSecretFile: "/run/secrets/concelier_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"
```

> Store secrets outside source control. Concelier reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.

### Resilience tuning

- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Concelier retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Concelier will fail fast but keep deterministic logs.
- Concelier resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `concelier.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled (an illustrative environment-variable override set follows).

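For environment-driven deployments, the same resilience knobs can be expressed through the `CONCELIER_AUTHORITY__*` convention mentioned in the prerequisites. The exact key names below assume the standard double-underscore section mapping and should be checked against your build before use.

```bash
# Illustrative only: assumes the conventional double-underscore mapping for nested sections.
export CONCELIER_AUTHORITY__RESILIENCE__ENABLERETRIES=true
export CONCELIER_AUTHORITY__RESILIENCE__ALLOWOFFLINECACHEFALLBACK=true
export CONCELIER_AUTHORITY__RESILIENCE__OFFLINECACHETOLERANCE="00:30:00"   # extended tolerance for air-gapped installs
```
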
## 2. Key Signals

### 2.1 Audit log channel

Concelier emits structured audit entries via the `Concelier.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.

```
Concelier authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=concelier-cli scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7
```

| Field      | Sample value | Meaning |
|------------|--------------|---------|
| `route`    | `/jobs/definitions` | Endpoint that processed the request. |
| `status`   | `200` / `401` / `409` | Final HTTP status code returned to the caller. |
| `subject`  | `ops@example.com` | User or service principal subject (falls back to `(anonymous)` when unauthenticated). |
| `clientId` | `concelier-cli` | OAuth client ID provided by Authority (`(none)` if the token lacked the claim). |
| `scopes`   | `concelier.jobs.trigger advisory:ingest advisory:read` | Normalised scope list extracted from token claims; `(none)` if the token carried none. |
| `tenant`   | `tenant-default` | Tenant claim extracted from the Authority token (`(none)` when the token lacked it). |
| `bypass`   | `True` / `False` | Indicates whether the request succeeded because its source IP matched a bypass CIDR. |
| `remote`   | `10.1.4.7` | Remote IP recorded from the connection / forwarded header test hooks. |

Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations (a LogQL sketch follows the list):

- `status=401 AND bypass=True` – a bypass network accepted an unauthenticated call (should be temporary during rollout).
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
- `status=202 AND NOT contains(scopes,"advisory:ingest")` – ingestion attempted without the new AOC scopes; confirm the Authority client registration matches the sample above.
- `tenant!=(tenant-default)` – indicates a cross-tenant token was accepted. Ensure Concelier `requiredTenants` is aligned with the Authority client registration.
- Spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.

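If Loki is the backend, a line-filter query along these lines surfaces the first combination; the `app` label is illustrative and depends on how your log agent labels Concelier output.

```
{app="concelier"} |= "Concelier authorization audit" |= "status=401" |= "bypass=True"
```
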
### 2.2 Metrics

Concelier publishes counters under the OTEL meter `StellaOps.Concelier.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.

| Metric name                 | Description                                     | PromQL example |
|-----------------------------|-------------------------------------------------|----------------|
| `web.jobs.triggered`        | Accepted job trigger requests.                  | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
| `web.jobs.trigger.conflict` | Rejected triggers (already running, disabled…). | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
| `web.jobs.trigger.failed`   | Server-side job failures.                       | `sum(rate(web_jobs_trigger_failed_total[5m]))` |

> Prometheus/OTEL collectors typically surface counters with a `_total` suffix. Adjust queries to match your pipeline’s generated metric names.

Correlate audit logs with the following global meter exported via `Concelier.SourceDiagnostics`:

- `concelier.source.http.requests_total{concelier_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
- If Grafana dashboards are deployed, extend the “Concelier Jobs” board with the above counters plus a table of recent audit log entries.

## 3. Alerting Guidance

1. **Unauthorized bypass attempt**
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`
   - Action: verify the `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.

2. **Missing scopes**
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`
   - Action: audit the Authority client registration; ensure `requiredScopes` includes `concelier.jobs.trigger`, `advisory:ingest`, and `advisory:read`.

3. **Trigger failure surge**
   - Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.
   - Action: inspect correlated audit entries and `Concelier.Telemetry` traces for job execution errors.

4. **Conflict spike**
   - Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune the threshold).
   - Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.

5. **Authority offline**
   - Watch `Concelier.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.

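As a starting point, the first two conditions above can be wired into a Prometheus rule group. The metric and label names mirror the queries shown here and must be adapted to whatever your log-to-metrics pipeline actually emits.

```yaml
groups:
  - name: concelier-authority-audit
    rules:
      - alert: ConcelierUnauthorizedBypassAttempt
        expr: sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Concelier bypass network accepted an unauthenticated call"
      - alert: ConcelierScopelessJobTrigger
        expr: sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Concelier job triggered by a token without scopes"
```
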
## 4. Rollout & Verification Procedure

1. **Pre-checks**
   - Align with the rollout phases documented in `docs/10_CONCELIER_CLI_QUICKSTART.md` (validation → rehearsal → enforced) and record the target dates in your change request.
   - Confirm `allowAnonymousFallback` is `false` in production; keep it `true` only during staged validation.
   - Validate that Authority issuer metadata is reachable from Concelier (`curl https://authority.internal/.well-known/openid-configuration` from the host).

2. **Smoke test with valid token**
   - Obtain a token via CLI: `stella auth login --scope "concelier.jobs.trigger advisory:ingest" --scope advisory:read`.
   - Trigger a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://concelier.internal/jobs/definitions`.
   - Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=concelier.jobs.trigger advisory:ingest advisory:read`, and `tenant=tenant-default`.

3. **Negative test without token**
   - Call the same endpoint without a token. Expect HTTP 401 and `bypass=False`.
   - If the request succeeds, double-check `bypassNetworks` and ensure anonymous fallback is disabled.

4. **Bypass check (if applicable)**
   - From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review the business justification and expiry date for such entries.

5. **Metrics validation**
   - Ensure the `web.jobs.triggered` counter increments during accepted runs.
   - Exporters should show corresponding spans (`concelier.job.trigger`) if tracing is enabled.

## 5. Troubleshooting

| Symptom | Probable cause | Remediation |
|---------|----------------|-------------|
| Audit log shows `clientId=(none)` for all requests | Authority not issuing the `client_id` claim, or the CLI is outdated | Update the StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`) or upgrade the CLI token acquisition flow. |
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks`, or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Concelier. |
| HTTP 401 with a valid token | `requiredScopes` missing from the client registration, or token audience mismatch | Verify the Authority client scopes (`concelier.jobs.trigger`) and ensure the token audience matches the `audiences` config. |
| Metrics missing from Prometheus | Telemetry exporters disabled, or the collector filter misses the OTEL meter | Set `concelier.telemetry.enableMetrics=true` and ensure the collector includes the `StellaOps.Concelier.WebService.Jobs` meter. |
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or Authority timeout mid-request | Inspect Concelier job logs, re-run with tracing enabled, validate Authority latency. |

## 6. References

- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
- `docs/ops/authority-monitoring.md` – Authority-side monitoring and alerting playbook.
- `StellaOps.Concelier.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of audit log fields.

---

`docs/ops/deployment-upgrade-runbook.md` (new file, 151 lines)

# Stella Ops Deployment Upgrade & Rollback Runbook

_Last updated: 2025-10-26 (Sprint 14 – DEVOPS-OPS-14-003)._

This runbook describes how to promote a new release across the supported deployment profiles (Helm and Docker Compose), how to roll back safely, and how to keep channels (`edge`, `stable`, `airgap`) aligned. All steps assume you are working from a clean checkout of the release branch/tag.

---

## 1. Channel overview

| Channel  | Release manifest | Helm values | Compose profile |
|----------|------------------|-------------|-----------------|
| `edge`   | `deploy/releases/2025.10-edge.yaml` | `deploy/helm/stellaops/values-dev.yaml` | `deploy/compose/docker-compose.dev.yaml` |
| `stable` | `deploy/releases/2025.09-stable.yaml` | `deploy/helm/stellaops/values-stage.yaml`, `deploy/helm/stellaops/values-prod.yaml` | `deploy/compose/docker-compose.stage.yaml`, `deploy/compose/docker-compose.prod.yaml` |
| `airgap` | `deploy/releases/2025.09-airgap.yaml` | `deploy/helm/stellaops/values-airgap.yaml` | `deploy/compose/docker-compose.airgap.yaml` |

Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set.

---

## 2. Pre-flight checklist

1. **Refresh release manifest**  
   Pull the latest manifest for the channel you are promoting (`deploy/releases/<version>-<channel>.yaml`).

2. **Align deployment bundles with the manifest**  
   Run the alignment checker for every profile that should pick up the release (a wrapper covering all three channels is sketched after this checklist). Pass `--ignore-repo nats` to skip auxiliary services.

   ```bash
   ./deploy/tools/check-channel-alignment.py \
       --release deploy/releases/2025.10-edge.yaml \
       --target deploy/helm/stellaops/values-dev.yaml \
       --target deploy/compose/docker-compose.dev.yaml \
       --ignore-repo nats
   ```

   Repeat for other channels (`stable`, `airgap`), substituting the manifest and target files.

3. **Lint and template profiles**

   ```bash
   ./deploy/tools/validate-profiles.sh
   ```

4. **Smoke the Offline Kit debug store (edge/stable only)**  
   When the release pipeline has generated `out/release/debug/.build-id/**`, mirror the assets into the Offline Kit staging tree:

   ```bash
   ./ops/offline-kit/mirror_debug_store.py \
       --release-dir out/release \
       --offline-kit-dir out/offline-kit
   ```

   Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle.

5. **Review compatibility matrix**  
   Confirm the MongoDB, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `mongo@sha256:c258…`, `minio@sha256:14ce…`, and `rustfs:2025.10.0-edge`.

6. **Create a rollback bookmark**  
   Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes.

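Where several channels need re-checking in one pass, a wrapper along the following lines keeps the invocations consistent. The manifest/values pairings are taken from the channel table in Section 1; adjust the paths if your layout differs.

```bash
#!/usr/bin/env bash
# Sketch: run the alignment checker for every channel using the pairings from Section 1.
set -euo pipefail

declare -A manifests=(
  [edge]="deploy/releases/2025.10-edge.yaml"
  [stable]="deploy/releases/2025.09-stable.yaml"
  [airgap]="deploy/releases/2025.09-airgap.yaml"
)
declare -A targets=(
  [edge]="deploy/helm/stellaops/values-dev.yaml deploy/compose/docker-compose.dev.yaml"
  [stable]="deploy/helm/stellaops/values-stage.yaml deploy/helm/stellaops/values-prod.yaml deploy/compose/docker-compose.stage.yaml deploy/compose/docker-compose.prod.yaml"
  [airgap]="deploy/helm/stellaops/values-airgap.yaml deploy/compose/docker-compose.airgap.yaml"
)

for channel in edge stable airgap; do
  args=()
  for target in ${targets[$channel]}; do
    args+=(--target "$target")
  done
  ./deploy/tools/check-channel-alignment.py \
      --release "${manifests[$channel]}" \
      "${args[@]}" \
      --ignore-repo nats
done
```
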
---

## 3. Helm upgrade procedure (staging → production)

1. Switch to the deployment branch and ensure secrets/config maps are current.
2. Apply the upgrade in the staging cluster:

   ```bash
   helm upgrade stellaops deploy/helm/stellaops \
     -f deploy/helm/stellaops/values-stage.yaml \
     --namespace stellaops \
     --atomic \
     --timeout 15m
   ```

3. Run smoke tests (`scripts/smoke-tests.sh` or environment-specific checks).
4. Promote to production using the prod values file and the same command.
5. Record the new revision number and Git SHA in the change log.

### Rollback (Helm)

1. Identify the previous revision: `helm history stellaops -n stellaops`.
2. Execute:

   ```bash
   helm rollback stellaops <revision> \
     --namespace stellaops \
     --wait \
     --timeout 10m
   ```

3. Verify `kubectl get pods` returns healthy workloads; rerun smoke tests.
4. Update the incident/operations log with root cause and rollback details.

---

## 4. Docker Compose upgrade procedure

1. Update environment files (`deploy/compose/env/*.env.example`) with any new settings and sync secrets to hosts.
2. Pull the tagged repository state corresponding to the release (e.g. `git checkout 2025.09.2` for stable).
3. Apply the upgrade:

   ```bash
   docker compose \
     --env-file deploy/compose/env/prod.env \
     -f deploy/compose/docker-compose.prod.yaml \
     pull

   docker compose \
     --env-file deploy/compose/env/prod.env \
     -f deploy/compose/docker-compose.prod.yaml \
     up -d
   ```

4. Tail logs for critical services (`docker compose logs -f authority concelier`).
5. Update monitoring dashboards/alerts to confirm normal operation.

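To confirm each host actually picked up the digests pinned in the release manifest after step 3, a quick listing such as the following can be compared against `deploy/releases/<version>-<channel>.yaml`.

```bash
# List the images (and their digests) the production project is now running.
docker compose \
  --env-file deploy/compose/env/prod.env \
  -f deploy/compose/docker-compose.prod.yaml \
  images
```
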
### Rollback (Compose)

1. Check out the previous release tag (e.g. `git checkout 2025.09.1`).
2. Re-run `docker compose pull` and `docker compose up -d` with that profile. Docker will restore the prior digests.
3. If reverting to a known-good snapshot is required, restore volume backups (see `docs/ops/authority-backup-restore.md` and the associated service guides).
4. Log the rollback in the operations journal.

---

## 5. Channel promotion workflow

1. Author or update the channel manifest under `deploy/releases/`.
2. Mirror the new digests into Helm/Compose values and run the alignment script for each profile.
3. Commit the changes with a message that references the release version and channel (e.g. `deploy: promote 2025.10.0-edge`).
4. Publish release notes and update `deploy/releases/README.md` (if applicable).
5. Tag the repository when promoting stable or airgap builds.

---

## 6. Upgrade rehearsal & rollback drill log

Maintain rehearsal notes in `docs/ops/launch-cutover.md` or the relevant sprint planning document. After each drill capture:

- Release version tested
- Date/time
- Participants
- Issues encountered & fixes
- Rollback duration (if executed)

Attach the log to the sprint retro or operational wiki.

| Date (UTC) | Channel | Outcome | Notes |
|------------|---------|---------|-------|
| 2025-10-26 | Documentation dry-run | Planned | Runbook refreshed; next live drill scheduled for 2025-11 edge → stable promotion. |

---

## 7. References

- `deploy/README.md` – structure and validation workflow for deployment bundles.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release automation and signing pipeline.
- `docs/ARCHITECTURE_DEVOPS.md` – high-level DevOps architecture, SLOs, and compliance requirements.
- `ops/offline-kit/mirror_debug_store.py` – debug-store mirroring helper.
- `deploy/tools/check-channel-alignment.py` – release vs deployment digest alignment checker.

---

`docs/ops/launch-cutover.md` (new file, 128 lines)

# Launch Cutover Runbook - Stella Ops

_Document owner: DevOps Guild (2025-10-26)_  
_Scope:_ Full-platform launch from staging to production for release `2025.09.2`.

## 1. Roles and Communication

| Role | Primary | Backup | Contact |
| --- | --- | --- | --- |
| Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) |
| Authority stack | Authority Core guild rep | Security guild rep | `#authority` |
| Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` |
| Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation |
| Observability | Telemetry guild rep | SRE on-call | `#telemetry` |
| Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket |

Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes.

## 2. Timeline Overview (UTC)

| Time | Activity | Owner |
| --- | --- | --- |
| T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead |
| T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer |
| T-6h | Freeze non-launch deployments; notify guild leads. | Product owner |
| T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps |
| T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead |
| T0 | Execute production cutover steps (Section 4). | Cutover team |
| T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead |
| T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner |

## 3. Rehearsal (Staging) Checklist

1. `docker network create stellaops_frontdoor || true` (if not present on the staging jump host).
2. Run `deploy/tools/validate-profiles.sh` and archive the output.
3. Apply staging secrets (`kubectl apply -f secrets/stage/*.yaml` or `helm secrets upgrade`), ensuring `stellaops-stage` credentials align with `values-stage.yaml`.
4. Perform `helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml` in the staging cluster.
5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`.
6. Execute the smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm the report status in the UI.
7. Document total wall time and any deviations in the rehearsal log.

Rehearsal must complete without manual interventions before proceeding to production.

## 4. Production Cutover Steps

### 4.1 Pre-flight
- Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`.
- Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host.
- Back up current configuration and data:
  - Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`.
  - MinIO policy export: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`.

### 4.2 Apply Updates (Compose)
1. On each compose node, pull the updated images for release `2025.09.2`:

   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull
   ```

2. Deploy changes:

   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d
   ```

3. Confirm containers are healthy via `docker compose ps` and `docker logs <service> --tail 50`.

### 4.3 Apply Updates (Helm/Kubernetes)
If using Kubernetes, perform:

```bash
helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --atomic --timeout 15m
```

Monitor the rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service>`.

### 4.4 Configuration Validation
- Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`.
- Validate the Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`.
- Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (returns success).
- Ensure Notify (legacy) is still accessible while the Notifier migration is pending.

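To gate the go/no-go on the endpoint checks above, a small wrapper like this sketch can be run from the jump host; `AUTHORITY_URL` and `SCANNER_URL` are placeholders for your production hostnames.

```bash
#!/usr/bin/env bash
# Fail fast if any externally validated endpoint is unreachable or unhealthy.
set -euo pipefail

for url in "${AUTHORITY_URL}/.well-known/openid-configuration" "${SCANNER_URL}/healthz"; do
  code=$(curl -fsS -o /dev/null -w '%{http_code}' "$url")
  echo "${url} -> HTTP ${code}"
done
```
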
## 5. Smoke Tests

| Test | Command / Action | Expected Result |
| --- | --- | --- |
| API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` |
| Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes < 5 minutes; report accessible with signed DSSE |
| Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` |
| Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata |
| Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` |
| Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent |

Log results in the change ticket with timestamps and screenshots where applicable.

## 6. Rollback Procedure

1. Assess the failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
2. For Compose:

   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down
   docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d
   ```

3. For Helm:

   ```bash
   helm rollback stellaops <previous-release-number> --namespace stellaops
   ```

4. Restore the Mongo snapshot if data inconsistency is detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`.
5. Restore the MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`.
6. Notify stakeholders of the rollback and capture root-cause notes in the incident ticket.

## 7. Post-cutover Actions

- Keep heightened monitoring for 4 hours post cutover; track latency, error rates, and queue depth.
- Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
- Update `docs/ops/launch-readiness.md` if any new gaps or follow-ups are discovered.
- Schedule a retrospective within 48 hours; include DevOps, module guilds, and the product owner.

## 8. Approval Matrix

| Step | Required Approvers | Record Location |
| --- | --- | --- |
| Production deployment plan | CTO + DevOps lead | Change ticket comment |
| Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary |
| Post-smoke success | DevOps lead + product owner | Change ticket closure |
| Rollback (if invoked) | DevOps lead + CTO | Incident ticket |

Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.

## 9. Rehearsal Log

| Date (UTC) | What We Exercised | Outcome | Follow-up |
| --- | --- | --- | --- |
| 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule a full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. |

---

`docs/ops/launch-readiness.md` (new file, 49 lines)

# Launch Readiness Record - Stella Ops

_Updated: 2025-10-26 (UTC)_

This document captures production launch sign-offs, deployment readiness checkpoints, and any open risks that must be tracked before GA cutover.

## 1. Sign-off Summary

| Module / Service | Guild / Point of Contact | Evidence (Task or Runbook) | Status | Timestamp (UTC) | Notes |
| --- | --- | --- | --- | --- | --- |
| Authority (Issuer) | Authority Core Guild | `AUTH-AOC-19-001` - scope issuance & configuration complete (DONE 2025-10-26) | READY | 2025-10-26T14:05Z | Tenant scope propagation follow-up (`AUTH-AOC-19-002`) tracked in gaps section. |
| Signer | Signer Guild | `SIGNER-API-11-101` / `SIGNER-REF-11-102` / `SIGNER-QUOTA-11-103` (DONE 2025-10-21) | READY | 2025-10-26T14:07Z | DSSE signing, referrer verification, and quota enforcement validated in CI. |
| Attestor | Attestor Guild | `ATTESTOR-API-11-201` / `ATTESTOR-VERIFY-11-202` / `ATTESTOR-OBS-11-203` (DONE 2025-10-19) | READY | 2025-10-26T14:10Z | Rekor submission/verification pipeline green; telemetry pack published. |
| Scanner Web + Worker | Scanner WebService Guild | `SCANNER-WEB-09-10x`, `SCANNER-RUNTIME-12-30x` (DONE 2025-10-18 -> 2025-10-24) | READY* | 2025-10-26T14:20Z | Orchestrator envelope work (`SCANNER-EVENTS-16-301/302`) still open; see gaps. |
| Concelier Core & Connectors | Concelier Core / Ops Guild | Ops runbook sign-off in `docs/ops/concelier-conflict-resolution.md` (2025-10-16) | READY | 2025-10-26T14:25Z | Conflict resolution & connector coverage accepted; Mongo schema hardening pending (see gaps). |
| Excititor API | Excititor Core Guild | Wave 0 connector ingest sign-offs (EXECPLAN.Section Wave 0) | READY | 2025-10-26T14:28Z | VEX linkset publishing complete for launch datasets. |
| Notify Web (legacy) | Notify Guild | Existing stack carried forward; Notifier program tracked separately (Sprint 38-40) | PENDING | 2025-10-26T14:32Z | Legacy notify web remains operational; migration to Notifier blocked on `SCANNER-EVENTS-16-301`. |
| Web UI | UI Guild | Stable build `registry.stella-ops.org/.../web-ui@sha256:10d9248...` deployed in stage and smoke-tested | READY | 2025-10-26T14:35Z | Policy editor GA items (Sprint 20) outside launch scope. |
| DevOps / Release | DevOps Guild | `deploy/tools/validate-profiles.sh` run (2025-10-26) covering dev/stage/prod/airgap/mirror | READY | 2025-10-26T15:02Z | Compose/Helm lint + docker compose config validated; see Section 2 for details. |
| Offline Kit | Offline Kit Guild | `DEVOPS-OFFLINE-18-004` (Go analyzer) and `DEVOPS-OFFLINE-18-005` (Python analyzer) complete; debug-store mirror pending (`DEVOPS-OFFLINE-17-004`). | PENDING | 2025-10-26T15:05Z | Awaiting release debug artefacts to finalise `DEVOPS-OFFLINE-17-004`; tracked in Section 3. |

_\* READY with caveat - remaining work noted in Section 3._

## 2. Deployment Readiness Checklist

- **Production profiles committed:** `deploy/compose/docker-compose.prod.yaml` and `deploy/helm/stellaops/values-prod.yaml` added with front-door network hand-off and secret references for Mongo/MinIO/core services.
- **Secrets placeholders documented:** `deploy/compose/env/prod.env.example` enumerates required credentials (`MONGO_INITDB_ROOT_PASSWORD`, `MINIO_ROOT_PASSWORD`, Redis/NATS endpoints, `FRONTDOOR_NETWORK`). Helm values reference Kubernetes secrets (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`).
- **Static validation executed:** `deploy/tools/validate-profiles.sh` run on 2025-10-26 (docker compose config + helm lint/template) with all profiles passing.
- **Ingress model defined:** the production compose profile introduces an external `frontdoor` network; README updated with creation instructions and the scope of externally reachable services.
- **Observability hooks:** Authority/Signer/Attestor telemetry packs verified; scanner runtime build-id metrics landed (`SCANNER-RUNTIME-17-401`). Grafana dashboards referenced in component runbooks.
- **Rollback assets:** the stage Compose profile remains aligned (`docker-compose.stage.yaml`), enabling rehearsals before prod cutover; release manifests (`deploy/releases/2025.09-stable.yaml`) map digests for reproducible rollback.
- **Rehearsal status:** 2025-10-26 validation dry-run executed (`deploy/tools/validate-profiles.sh` across dev/stage/prod/airgap/mirror). Full stage Helm rollout pending access to the managed staging cluster; target to complete once credentials are provisioned.

## 3. Outstanding Gaps & Follow-ups

| Item | Owner | Tracking Ref | Target / Next Step | Impact |
| --- | --- | --- | --- | --- |
| Tenant scope propagation and audit coverage | Authority Core Guild | `AUTH-AOC-19-002` (DOING 2025-10-26) | Land enforcement + audit fixtures by Sprint 19 freeze | Medium - required for multi-tenant GA but does not block initial cutover if tenants are scoped manually. |
| Orchestrator event envelopes + Notifier handshake | Scanner WebService Guild | `SCANNER-EVENTS-16-301` (BLOCKED), `SCANNER-EVENTS-16-302` (DOING) | Coordinate with Gateway/Notifier owners on preview package replacement or binding redirects; rerun `dotnet test` once the patch lands and refresh schema docs. Share envelope samples in `docs/events/` after tests pass. | High - gating Notifier migration; the legacy notify path remains functional meanwhile. |
| Offline Kit Python analyzer bundle | Offline Kit Guild + Scanner Guild | `DEVOPS-OFFLINE-18-005` (DONE 2025-10-26) | Monitor for follow-up manifest updates and rerun the smoke script when analyzers change. | Medium - ensures language analyzer coverage stays current for offline installs. |
| Offline Kit debug store mirror | Offline Kit Guild + DevOps Guild | `DEVOPS-OFFLINE-17-004` (BLOCKED 2025-10-26) | Release pipeline must publish `out/release/debug` artefacts; once available, run `mirror_debug_store.py` and commit `metadata/debug-store.json`. | Low - symbol lookup remains accessible from staging assets but required before the next Offline Kit tag. |
| Mongo schema validators for advisory ingestion | Concelier Storage Guild | `CONCELIER-STORE-AOC-19-001` (TODO) | Finalize JSON schema + migration toggles; coordinate with Ops for a rollout window | Low - current validation handled in the app layer; the schema guard adds defense-in-depth. |
| Authority plugin telemetry alignment | Security Guild | `SEC2.PLG`, `SEC3.PLG`, `SEC5.PLG` (BLOCKED pending AUTH DPoP/MTLS tasks) | Resume once upstream auth surfacing stabilises | Low - the plugin remains optional; launch uses the default Authority configuration. |

## 4. Approvals & Distribution

- Record shared in `#launch-readiness` (Mattermost) 2025-10-26 15:15 UTC with DevOps + Guild leads for acknowledgement.
- Updates to this document require dual sign-off from the DevOps Guild (owner) and the impacted module guild lead; retain the change log via Git history.
- Cutover rehearsal and rollback drills are tracked separately in `docs/ops/launch-cutover.md` (see associated task `DEVOPS-LAUNCH-18-001`).

---

# NuGet Preview Bootstrap (Offline-Friendly)

The StellaOps build relies on .NET 10 RC2 packages (Microsoft.Extensions.*, JwtBearer 10.0 RC).

`NuGet.config` now wires three sources:

1. `local` → `./local-nuget` (preferred, air-gapped mirror)

[…]

The `packageSourceMapping` section keeps `Microsoft.Extensions.*`, `Microsoft.AspNetCore.*`, and `Microsoft.Data.Sqlite` bound to `local`/`dotnet-public`, so `dotnet restore` never has to reach out to nuget.org when mirrors are populated.

Before committing changes (or when wiring up a new environment) run:

```bash
python3 ops/devops/validate_restore_sources.py
```

The validator asserts:

- `NuGet.config` lists `local` → `dotnet-public` → `nuget.org` in that order.
- `Directory.Build.props` pins `RestoreSources` so every project prioritises the local mirror.
- No stray `NuGet.config` files shadow the repo root configuration.

CI executes the validator in both the `build-test-deploy` and `release` workflows, so regressions trip before any restore/build begins.

If you run fully air-gapped, remember to clear the cache between SDK upgrades:

```bash
# Standard NuGet cache clear; run after each SDK upgrade.
dotnet nuget locals all --clear
```

---

`docs/ops/registry-token-service.md` (new file, 66 lines)

# Registry Token Service Operations

_Component_: `src/StellaOps.Registry.TokenService`

The registry token service issues short-lived Docker registry bearer tokens after validating an Authority OpTok (DPoP/mTLS sender constraint) and the customer’s plan entitlements. It is fronted by the Docker registry’s `Bearer realm` flow.

## Configuration

Configuration lives in `etc/registry-token.yaml` and can be overridden through environment variables prefixed with `REGISTRY_TOKEN_`. Key sections:

| Section | Purpose |
| ------- | ------- |
| `authority` | Authority issuer/metadata URL, audience list, and scopes required to request tokens (default `registry.token.issue`). |
| `signing` | JWT issuer, signing key (PEM or PFX), optional key ID, and token lifetime (default five minutes). The repository ships **`etc/registry-signing-sample.pem`** for local testing only – replace it with a private key generated and stored per-environment before going live. |
| `registry` | Registry realm URL and optional allow-list of `service` values accepted from the registry challenge. |
| `plans` | Plan catalogue mapping plan name → repository patterns and allowed actions. Wildcards (`*`) are supported per path segment. |
| `defaultPlan` | Applied when the caller’s token omits `stellaops:plan`. |
| `revokedLicenses` | Blocks issuance when the caller presents a matching `stellaops:license` claim. |

Plan entries must cover every private repository namespace. Actions default to `pull` if omitted; an illustrative catalogue follows.

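A plan catalogue entry might look like the sketch below; the key names are illustrative and should be verified against the `etc/registry-token.yaml` shipped with the service.

```yaml
# Illustrative plan catalogue: one read-only plan plus a default assignment.
plans:
  free:
    repositories:
      - "stella-ops/public/*"
    actions: [pull]
  enterprise:
    repositories:
      - "stella-ops/*/*"
    actions: [pull, push]
defaultPlan: free
```
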

## Request flow

1. Docker/OCI client contacts the registry and receives a `401` with
   `WWW-Authenticate: Bearer realm=...,service=...,scope=repository:...`.
2. Client acquires an OpTok from Authority (DPoP/mTLS bound) with the
   `registry.token.issue` scope.
3. Client calls `GET /token?service=<service>&scope=repository:<name>:<actions>`
   against the token service, presenting the OpTok and matching DPoP proof.
4. The service validates the token, plan, and requested scopes, then issues a
   JWT containing an `access` claim conforming to the Docker registry spec.

All denial paths return RFC 6750-style problem responses (HTTP 400 for malformed scopes, 403 for plan or revocation failures).
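
To watch step 1 happen, you can inspect the challenge the registry returns for an unauthenticated request; the host and repository below are placeholders borrowed from the sample deployment further down.

```bash
# Show the Bearer challenge (realm/service/scope) returned for an anonymous
# manifest request; host and repository are placeholders.
curl -skI "https://registry.localhost/v2/stella-ops/public/base/manifests/latest" \
  | grep -i '^www-authenticate'
```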

## Monitoring

The service emits OpenTelemetry metrics via `registry_token_issued_total` and `registry_token_rejected_total`. Suggested Prometheus alerts:

| Metric | Condition | Action |
|--------|-----------|--------|
| `registry_token_rejected_total` | `increase(...) > 0` over 5 minutes | Investigate plan misconfiguration or licence revocation. |
| `registry_token_issued_total` | Sudden drop compared to baseline | Confirm registry is still challenging with the expected realm/service. |

Enable the built-in `/healthz` endpoint for liveness checks. Authentication and DPoP failures surface via the service logs (Serilog console output).
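
Outside of alerting, the same counters can be queried ad hoc; the Prometheus URL below is an assumption for whichever instance scrapes the token service.

```bash
# Ad-hoc check: any rejected issuances in the last five minutes?
# Replace the Prometheus URL with the instance scraping this service.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=increase(registry_token_rejected_total[5m])' | jq .
```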

## Sample deployment

```bash
dotnet run --project src/StellaOps.Registry.TokenService \
  --urls "http://0.0.0.0:8085"

curl -H "Authorization: Bearer <OpTok>" \
     -H "DPoP: $(dpop-proof ...)" \
     "http://localhost:8085/token?service=registry.localhost&scope=repository:stella-ops/public/base:pull"
```

Replace `<OpTok>` and `DPoP` with tokens issued by Authority. The response contains `token`, `expires_in`, and `issued_at` fields suitable for Docker/OCI clients.
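
If you save that response to a file, the bearer token can be pulled out and replayed against the registry; the file name and registry host are illustrative.

```bash
# Extract the issued token from a saved response (file name is illustrative)
# and replay it against the registry challenge that started the flow.
TOKEN=$(jq -r '.token' response.json)
curl -sI -H "Authorization: Bearer ${TOKEN}" \
  "https://registry.localhost/v2/stella-ops/public/base/manifests/latest"
```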
							
								
								
									
docs/ops/telemetry-collector.md (new file, 113 lines)
@@ -0,0 +1,113 @@

# Telemetry Collector Deployment Guide

> **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003).

This guide describes how to deploy the default OpenTelemetry Collector packaged with Stella Ops, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments.

---

## 1. Overview

The collector terminates OTLP traffic from Stella Ops services and exports metrics, traces, and logs.

| Endpoint | Purpose | TLS | Authentication |
| -------- | ------- | --- | -------------- |
| `:4317`  | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA |
| `:4318`  | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA |
| `:9464`  | Prometheus scrape | mTLS | Same client certificate |
| `:13133` | Health check | mTLS | Same client certificate |
| `:1777`  | pprof diagnostics | mTLS | Same client certificate |

The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart.

---

## 2. Local validation (Compose)

```bash
# 1. Generate dev certificates (CA + collector + client)
./ops/devops/telemetry/generate_dev_tls.sh

# 2. Start the collector overlay
cd deploy/compose
docker compose -f docker-compose.telemetry.yaml up -d

# 3. Start the storage overlay (Prometheus, Tempo, Loki)
docker compose -f docker-compose.telemetry-storage.yaml up -d

# 4. Run the smoke test (OTLP HTTP)
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

The smoke test posts sample traces, metrics, and logs and verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay gives you a local Prometheus/Tempo/Loki stack to confirm end-to-end wiring. The same client certificate can be used by local services to weave traces together. See [`Telemetry Storage Deployment`](telemetry-storage.md) for the storage configuration guidelines used in staging/production.
---

## 3. Kubernetes deployment

Enable the collector in Helm by setting the following values (example shown for the dev profile):

```yaml
telemetry:
  collector:
    enabled: true
    defaultTenant: <tenant>
    tls:
      secretName: stellaops-otel-tls-<env>
```

Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt`. The secret must contain the collector certificate, private key, and issuing CA respectively. Example:

```bash
kubectl create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector.crt \
  --from-file=tls.key=collector.key \
  --from-file=ca.crt=ca.crt
```

Helm renders the collector deployment, service, and config map automatically:

```bash
helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml
```

Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA.
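
Before rolling a workload certificate out, it is worth confirming it actually chains to that CA; the file names below are placeholders for whatever your PKI emits.

```bash
# Verify a client certificate chains back to the collector CA
# (file names are placeholders).
openssl verify -CAfile ca.crt client.crt
```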

---

## 4. Offline packaging (DEVOPS-OBS-50-003)

Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites:

```bash
python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz
```

The script gathers:

- `deploy/telemetry/README.md`
- Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and Helm copy)
- Helm template/values for the collector
- Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`)

The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` env vars (or use the `--cosign-key` flag).
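
On the receiving side, verifying the tarball against its checksum before import is a cheap sanity check; this assumes the `.sha256` file uses the conventional `sha256sum` manifest format.

```bash
# Verify the bundle before mirroring it into the Offline Kit
# (assumes a standard sha256sum-style manifest).
cd out/telemetry && sha256sum -c telemetry-bundle.tar.gz.sha256
```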

Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls` secret.

---

## 5. Operational checks

1. **Health probes** – `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`.
2. **Metrics scrape** – confirm Prometheus ingests `otelcol_receiver_accepted_*` counters.
3. **Trace correlation** – ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for expected spans.
4. **Certificate rotation** – when rotating the CA, update the secret and restart the collector; roll out new client certificates before enabling `require_client_certificate` if staged.

---

## 6. Related references

- `deploy/telemetry/README.md` – source configuration and local workflow.
- `ops/devops/telemetry/smoke_otel_collector.py` – OTLP smoke test.
- `docs/observability/observability.md` – metrics/traces/logs taxonomy.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release checklist for telemetry assets.
							
								
								
									
docs/ops/telemetry-storage.md (new file, 172 lines)
@@ -0,0 +1,172 @@

# Telemetry Storage Deployment (DEVOPS-OBS-50-002)

> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.

---

## 1. Components & Ports

| Service   | Port | Purpose | TLS |
|-----------|------|---------|-----|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo      | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki       | 3100 | Log ingest + API | mTLS (client cert required) |

The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector’s `/metrics` endpoint, and Loki is used for log search.
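
Once the stack is up, a quick readiness sweep confirms each backend answers over mTLS; the hostnames are the Compose service names and stand in for your Kubernetes service DNS, and the certificate paths are the dev TLS outputs.

```bash
# Readiness sweep across the storage backends (hostnames and certificate
# paths are placeholders for your environment).
curl -sk --cert client.crt --key client.key --cacert ca.crt https://prometheus:9090/-/ready
curl -sk --cert client.crt --key client.key --cacert ca.crt https://tempo:3200/ready
curl -sk --cert client.crt --key client.key --cacert ca.crt https://loki:3100/ready
```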

---

## 2. Local validation (Compose)

```bash
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.

---

## 3. Kubernetes blueprint

Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (charts not yet versioned—define them in the observability repo):

```yaml
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
  additionalScrapeConfigsSecret: stellaops-prometheus-scrape
  extraSecretMounts:
    - name: otel-mtls
      secretName: stellaops-otel-tls-stage
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: otel-token
      secretName: stellaops-prometheus-token
      mountPath: /etc/telemetry/auth
      readOnly: true

loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides

tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
```

### Staging bootstrap commands

```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml

kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```

Provision the following secrets/configs (names can be overridden via Helm values):

| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-loki-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |

---

## 4. Authority & tenancy integration

1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`).
   ```bash
   stella authority client create observability-prometheus \
     --scopes obs:read \
     --audience observability --description "Prometheus collector scrape"
   stella authority client create observability-loki \
     --scopes obs:logs timeline:read \
     --audience observability --description "Loki ingestion"
   stella authority client create observability-tempo \
     --scopes obs:traces \
     --audience observability --description "Tempo ingestion"
   ```
2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
   ```bash
   stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
   ```
3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).

---

## 5. Retention & isolation

- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import inside the Offline Kit staging directory.

---

## 6. Operational checklist

- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).

---

## 7. References

- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/ops/telemetry-collector.md`
- `docs/observability/observability.md`

@@ -137,19 +137,33 @@ Runtime events emitted by Observer now include `process.buildId` (from the ELF
`buildIds` list per digest. Operators can use these hashes to locate debug
artifacts during incident response:

1. Capture the hash from CLI/webhook/Scanner API (example:
1. Capture the hash from CLI/webhook/Scanner API—for example:
   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```
   Copy one of the `Build IDs` (e.g.
   `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the path: `<hash[0:2]>/<hash[2:]>` under the debug store, e.g.
   `/var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug`.
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check it exists:
   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```
3. If the file is missing, rehydrate it from Offline Kit bundles or the
   `debug-store` object bucket (mirror of release artefacts). Use:
   ```sh
   `debug-store` object bucket (mirror of release artefacts):
   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```
4. Attach the `.debug` file in `gdb`/`lldb` or feed it to `eu-unstrip` when
   preparing symbolized traces.
5. For musl-based images, expect shorter build-id footprints. Missing hashes in
4. Confirm the running process advertises the same GNU build-id before
   symbolising:
   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
   in `debuginfod` for fleet-wide symbol resolution:
   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
   runtime events indicate stripped binaries without the GNU note—schedule a
   rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store
   allowlist so the scanner can surface a fallback symbol package.
