Restructure solution layout by module

# Authority Backup & Restore Runbook

## Scope
- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24 h in production. Store snapshots in an encrypted, access-controlled vault.

## Inventory Checklist
| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |

> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.

## Hot Backup (no downtime)
1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**
   ```bash
   STAMP=$(date +%Y%m%dT%H%M%SZ)   # capture once so the dump and the copy reference the same file
   docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
     mongodump --archive=/dump/authority-${STAMP}.gz \
     --gzip --db stellaops-authority
   docker compose -f ops/authority/docker-compose.authority.yaml cp \
     mongo:/dump/authority-${STAMP}.gz backup/
   ```
   The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`.
3. **Capture configuration + manifests:**
   ```bash
   cp etc/authority.yaml backup/
   rsync -a etc/authority.plugins/ backup/authority.plugins/
   ```
4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:
   ```bash
   docker run --rm \
     -v authority-keys:/keys \
     -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%dT%H%M%SZ).tar.gz -C /keys .
   ```
5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts (see the sketch after this list).
6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault.
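
A minimal sketch for steps 5 and 6, assuming the artefacts landed under `backup/`, GNU coreutils is available, and `age` is the chosen encryption tool (the recipients file path is illustrative; GPG works equivalently):

```bash
# Step 5: write SHA-256 digests for every artefact and verify them before shipping.
cd backup
find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
sha256sum --check SHA256SUMS

# Step 6: wrap the folder into a single encrypted archive.
cd ..
tar czf - backup \
  | age -R /etc/stellaops/backup-recipients.txt \
  > "backup-$(date +%Y-%m-%d).tar.gz.age"
```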

## Cold Backup (planned downtime)
1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:
   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml down
   ```
3. Back up volumes directly using `tar`:
   ```bash
   docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
   ```
4. Copy configuration + manifests as in the hot backup (steps 3–6).
5. Restart services and verify health:
   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml up -d
   curl -fsS http://localhost:8080/ready
   ```

## Restore Procedure
1. **Provision clean volumes:** remove existing volumes if you’re rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**
   ```bash
   docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
   ```
   Use `--drop` to replace collections; omit it if doing a partial restore.
3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:
   ```bash
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
   ```
   Ensure private key files keep `600` permissions (`chmod 600` on the key files; leave the directories executable so the service can traverse them).
5. **Start services & validate:**
   ```bash
   docker compose up -d
   curl -fsS http://localhost:8080/health
   ```
6. **Validate JWKS and tokens:** call `/jwks` (see the spot-check after this list) and issue a short-lived token via the CLI to confirm key material matches expectations. If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`.
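
A hedged spot-check for step 6, assuming the restored Authority listens on `localhost:8080` (as in the health checks above) and that `jq` is available:

```bash
# List the key IDs served by the restored instance; the active signing kid
# should match the one recorded before the backup was taken.
curl -fsS http://localhost:8080/jwks | jq -r '.keys[].kid'
```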

## Disaster Recovery Notes
- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and `docs/11_AUTHORITY.md`), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Driver 3.5.0 requires MongoDB **4.2+**—clusters still on 4.0 must be upgraded before restore, and future driver releases will drop 4.0 entirely.
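
Before running `mongorestore`, a quick way to confirm the running server matches the pinned image (the `mongosh` shell ships with the `mongo:7` image; adjust the compose file path to your install):

```bash
# Prints the server version inside the running container, e.g. 7.0.x.
docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
  mongosh --quiet --eval 'db.version()'
```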

## Verification Checklist
- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/Authority/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).

# Authority Signing Key Rotation Playbook

> **Status:** Authored 2025-10-12 as part of OPS3.KEY-ROTATION rollout.  
> Use together with `docs/11_AUTHORITY.md` (Authority service guide) and the automation shipped under `ops/authority/`.

## 1. Overview

Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide:

- **Automation script:** `ops/authority/key-rotation.sh`  
  Shell helper that POSTs to `/internal/signing/rotate`, supports metadata and dry-run mode, and confirms the JWKS afterwards.
- **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`  
  Manual dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. Works across staging/production by passing the `environment` input.

This playbook documents the repeatable sequence for all environments.

## 2. Pre-requisites

1. **Generate a new PEM key (per environment)** (see the verification sketch after this list)
   ```bash
   openssl ecparam -name prime256v1 -genkey -noout \
     -out certificates/authority-signing-<env>-<year>.pem
   chmod 600 certificates/authority-signing-<env>-<year>.pem
   ```
2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation.
3. **Ensure secrets/vars exist in Gitea**
   - `<ENV>_AUTHORITY_BOOTSTRAP_KEY`
   - `<ENV>_AUTHORITY_URL`
   - Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY`, `AUTHORITY_URL`.
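
Before wiring the new PEM into the rotation call, it is worth confirming the file really is an unencrypted P-256 key; a minimal check, assuming the path from step 1:

```bash
# Expect "ASN1 OID: prime256v1" (and no passphrase prompt).
openssl ec -in certificates/authority-signing-<env>-<year>.pem -noout -text | grep -i 'ASN1 OID'
```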

## 3. Executing the rotation

### Option A – via CI workflow (recommended)

1. Navigate to **Actions → Authority Key Rotation**.
2. Provide inputs:
   - `environment`: `staging`, `production`, etc.
   - `key_id`: new `kid` (e.g. `authority-signing-2025-dev`).
   - `key_path`: path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`).
   - Optional `metadata`: comma-separated `key=value` pairs (for audit trails).
3. Trigger. The workflow:
   - Reads the bootstrap key/URL from secrets.
   - Runs `ops/authority/key-rotation.sh`.
   - Prints the JWKS response for verification.

### Option B – manual shell invocation

```bash
AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \
./ops/authority/key-rotation.sh \
  --authority-url https://authority.example.com \
  --key-id authority-signing-2025-dev \
  --key-path ../certificates/authority-signing-2025-dev.pem \
  --meta rotatedBy=ops --meta changeTicket=OPS-1234
```

Use `--dry-run` to inspect the payload before execution.

## 4. Post-rotation checklist

1. Update `authority.yaml` (or environment-specific overrides):
   - Set `signing.activeKeyId` to the new key.
   - Set `signing.keyPath` to the new PEM.
   - Append the previous key into `signing.additionalKeys`.
   - Ensure `keySource`/`provider` match the values passed to the script.
2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key.
3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired` (a quick check follows this list).
4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired.
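
A quick way to verify item 3, assuming `jq` is available and using the Authority URL passed to the rotation script; the per-key `status` field is the one Authority reports in its JWKS:

```bash
# Expect the new kid with status "active" and the previous kid marked "retired".
curl -fsS https://authority.example.com/jwks \
  | jq -r '.keys[] | "\(.kid)\t\(.status // "n/a")"'
```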

## 5. Development key state

For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key:

- Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`)
- Retired: `authority-signing-dev`

Treat these as examples; real environments must maintain their own PEM material.

## 6. References

- `docs/11_AUTHORITY.md` – Architecture and rotation SOP (Section 5).
- `docs/ops/authority-backup-restore.md` – Recovery flow referencing this playbook.
- `ops/authority/README.md` – CLI usage and examples.
- `scripts/rotate-policy-cli-secret.sh` – Helper to mint new `policy-cli` shared secrets when policy scope bundles change.

## 7. Appendix — Policy CLI secret rotation

Scope migrations such as AUTH-POLICY-23-004 require issuing fresh credentials for the `policy-cli` client. Use the helper script committed with the repo to keep secrets deterministic across environments.

```bash
./scripts/rotate-policy-cli-secret.sh --output etc/secrets/policy-cli.secret
```

The script writes a timestamped header and a random secret into the target file. Use `--dry-run` when generating material for external secret stores. After updating secrets in staging/production, recycle the Authority pods and confirm the new client credentials work before the next release freeze.

# Authority Monitoring & Alerting Playbook

## Telemetry Sources
- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors).
  - `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).

## Prometheus Metrics to Collect
| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when > 5 % for 10 min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiting saturations (requires OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |

> **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.

## Alert Rules
1. **Token Failure Surge**
   - _Expression_: `token_failure_ratio > 0.05`
   - _For_: `10m`
   - _Labels_: `severity="critical"`
   - _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as a diagnostic hint (requires span → metric transformation). A spot-check sketch follows this list.
2. **Lockout Spike**
   - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
   - _For_: `15m`
   - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
   - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
   - _For_: `5m`
   - Alert severity `warning` — verify the calling host list.
4. **Rate Limiter Saturation**
   - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 min; confirm trusted clients aren’t misconfigured.
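
To sanity-check an expression before wiring it into Alertmanager, the Prometheus HTTP API can evaluate it directly. A sketch, assuming Prometheus is reachable at `http://prometheus:9090` and `token_failure_ratio` is provisioned as a recording rule built from the query in the table above:

```bash
# Instant-query the current failure ratio; values above 0.05 sustained for
# 10 minutes should also surface as a pending/firing alert.
curl -fsS 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=token_failure_ratio' \
  | jq '.data.result[] | {metric, value}'
```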

## Grafana Dashboard
- Import `docs/ops/authority-grafana-dashboard.json` to provision baseline panels:
  - **Token Success vs Failure** – stacked rate visualization split by grant type.
  - **Rate Limiter Hits** – bar chart showing `authority-token` and `authority-authorize`.
  - **Bypass & Lockout Events** – dual-stat panel using Loki-derived counters.
  - **Trace Explorer Link** – panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.

## Collector Configuration Snippets
```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:                # referenced by the logs pipeline below; point the endpoint at your log backend
    endpoint: "http://loki:3100/loki/api/v1/push"
processors:
  batch:
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```
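
Before rolling the snippet out, the collector's `validate` subcommand catches undefined exporters or malformed pipelines; the binary name and config path below are assumptions (e.g. `otelcol` vs `otelcol-contrib`):

```bash
# Fails fast if a pipeline references an exporter or processor that is not defined.
otelcol-contrib validate --config=/etc/otelcol/config.yaml
```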

## Operational Checklist
- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.

# Concelier Apple Security Update Connector Operations

This runbook covers staging and production rollout for the Apple security updates connector (`source:vndr-apple:*`), including observability checks and fixture maintenance.

## 1. Prerequisites

- Network egress (or mirrored cache) for `https://gdmf.apple.com/v2/pmv` and the Apple Support domain (`https://support.apple.com/`).
- Optional: corporate proxy exclusions for the Apple hosts if outbound traffic is normally filtered.
- Updated configuration (environment variables or `concelier.yaml`) with an `apple` section. Example baseline:

```yaml
concelier:
  sources:
    apple:
      softwareLookupUri: "https://gdmf.apple.com/v2/pmv"
      advisoryBaseUri: "https://support.apple.com/"
      localeSegment: "en-us"
      maxAdvisoriesPerFetch: 25
      initialBackfill: "120.00:00:00"
      modifiedTolerance: "02:00:00"
      failureBackoff: "00:05:00"
```

> ℹ️  `softwareLookupUri` and `advisoryBaseUri` must stay absolute and aligned with the HTTP allow-list; Concelier automatically adds both hosts to the connector HttpClient.

## 2. Staging Smoke Test

1. Deploy the configuration and restart the Concelier workers to ensure the Apple connector options are bound.
2. Trigger a full connector cycle:
   - CLI: `stella db jobs run source:vndr-apple:fetch --and-then source:vndr-apple:parse --and-then source:vndr-apple:map`
   - REST: `POST /jobs/run { "kind": "source:vndr-apple:fetch", "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"] }` (a curl sketch follows this list)
3. Validate metrics exported under meter `StellaOps.Concelier.Connector.Vndr.Apple`:
   - `apple.fetch.items` (documents fetched)
   - `apple.fetch.failures`
   - `apple.fetch.unchanged`
   - `apple.parse.failures`
   - `apple.map.affected.count` (histogram of affected package counts)
4. Cross-check the shared HTTP counters:
   - `concelier.source.http.requests_total{concelier_source="vndr-apple"}` should increase for both index and detail phases.
   - `concelier.source.http.failures_total{concelier_source="vndr-apple"}` should remain flat (0) during a healthy run.
5. Inspect the info logs:
   - `Apple software index fetch … processed=X newDocuments=Y`
   - `Apple advisory parse complete … aliases=… affected=…`
   - `Mapped Apple advisory … pendingMappings=0`
6. Confirm MongoDB state:
   - `raw_documents` store contains the HT article HTML with metadata (`apple.articleId`, `apple.postingDate`).
   - `dtos` store has `schemaVersion="apple.security.update.v1"`.
   - `advisories` collection includes keys `HTxxxxxx` with normalized SemVer rules.
   - `source_states` entry for `apple` shows a recent `cursor.lastPosted`.
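
For the REST variant in step 2, a raw `curl` sketch of the same trigger; the Concelier host is a placeholder, and any auth headers your deployment enforces still need to be added:

```bash
curl -fsS -X POST "https://concelier.example.internal/jobs/run" \
  -H "Content-Type: application/json" \
  -d '{
        "kind": "source:vndr-apple:fetch",
        "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"]
      }'
```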

## 3. Production Monitoring

- **Dashboards** – Add the following expressions to your Concelier Grafana board (OTLP/Prometheus naming assumed):
  - `rate(apple_fetch_items_total[15m])` vs `rate(concelier_source_http_requests_total{concelier_source="vndr-apple"}[15m])`
  - `rate(apple_fetch_failures_total[5m])` for error spikes (`severity=warning` at `>0`)
  - `histogram_quantile(0.95, rate(apple_map_affected_count_bucket[1h]))` to watch affected-package fan-out
  - `increase(apple_parse_failures_total[6h])` to catch parser drift (alerts at `>0`)
- **Alerts** – Page if `rate(apple_fetch_items_total[2h]) == 0` during business hours while other connectors are active. This often indicates lookup feed failures or misconfigured allow-lists.
- **Logs** – Surface warnings `Apple document {DocumentId} missing GridFS payload` or `Apple parse failed`—repeated hits imply storage issues or HTML regressions.
- **Telemetry pipeline** – `StellaOps.Concelier.WebService` now exports `StellaOps.Concelier.Connector.Vndr.Apple` alongside existing Concelier meters; ensure your OTEL collector or Prometheus scraper includes it.

## 4. Fixture Maintenance

Regression fixtures live under `src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/Apple/Fixtures`. Refresh them whenever Apple reshapes the HT layout or when new platforms appear.

1. Run the helper script matching your platform:
   - Bash: `./scripts/update-apple-fixtures.sh`
   - PowerShell: `./scripts/update-apple-fixtures.ps1`
2. Each script exports `UPDATE_APPLE_FIXTURES=1`, updates the `WSLENV` passthrough, and touches `.update-apple-fixtures` so WSL+VS Code test runs observe the flag. The subsequent test execution fetches the live HT articles listed in `AppleFixtureManager`, sanitises the HTML, and rewrites the `.expected.json` DTO snapshots.
3. Review the diff for localisation or nav noise. Once satisfied, re-run the tests without the env var (`dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests.csproj`) to verify determinism.
4. Commit fixture updates together with any parser/mapping changes that motivated them.

## 5. Known Issues & Follow-up Tasks

- Apple occasionally throttles anonymous requests after bursts. The connector backs off automatically, but persistent `apple.fetch.failures` spikes might require mirroring the HT content or scheduling wider fetch windows.
- Rapid Security Responses may appear before the general patch notes surface in the lookup JSON. When that happens, the fetch run will log `detailFailures>0`. Collect sample HTML and refresh fixtures to confirm parser coverage.
- Multi-locale content is still under regression sweep (`src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Vndr.Apple/TASKS.md`). Capture non-`en-us` snapshots once the fixture tooling stabilises.

# Concelier Authority Audit Runbook

_Last updated: 2025-10-22_

This runbook helps operators verify and monitor the StellaOps Concelier ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

## 1. Prerequisites

- Authority integration is enabled in `concelier.yaml` (or via `CONCELIER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
- OTLP metrics/log exporters are configured (`concelier.telemetry.*`) or container stdout is shipped to your SIEM.
- Operators have access to the Concelier job trigger endpoints via CLI or REST for smoke tests.
- The rollout table in `docs/10_CONCELIER_CLI_QUICKSTART.md` has been reviewed so stakeholders align on the staged → enforced toggle timeline.

### Configuration snippet

```yaml
concelier:
  authority:
    enabled: true
    allowAnonymousFallback: false          # keep true only during initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://concelier"
    requiredScopes:
      - "concelier.jobs.trigger"
      - "advisory:read"
      - "advisory:ingest"
    requiredTenants:
      - "tenant-default"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "concelier-jobs"
    clientSecretFile: "/run/secrets/concelier_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"
```

> Store secrets outside source control. Concelier reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.
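
For a quick end-to-end check that enforcement behaves as configured, the sketch below requests a client-credentials token from Authority and calls a `/jobs` endpoint with and without it. Hostnames and the `/token` endpoint path are assumptions; adapt them to your environment:

```bash
AUTHORITY=https://authority.internal      # issuer from the snippet above
CONCELIER=https://concelier.internal      # adjust to your deployment

# 1. Client-credentials token for the concelier-jobs client (scopes per the snippet).
TOKEN=$(curl -fsS "${AUTHORITY}/token" \
  -d grant_type=client_credentials \
  -d client_id=concelier-jobs \
  -d client_secret="$(cat /run/secrets/concelier_authority_client)" \
  -d scope="concelier.jobs.trigger advisory:read advisory:ingest" \
  | jq -r '.access_token')

# 2. Authenticated call: expect 200 plus an audit entry with bypass=False.
curl -fsS -H "Authorization: Bearer ${TOKEN}" "${CONCELIER}/jobs/definitions" >/dev/null && echo "authenticated OK"

# 3. Anonymous call: expect 401 once allowAnonymousFallback=false.
curl -s -o /dev/null -w '%{http_code}\n' "${CONCELIER}/jobs/definitions"
```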
### Resilience tuning
 | 
			
		||||
 | 
			
		||||
- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Concelier retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
 | 
			
		||||
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Concelier will fail fast but keep deterministic logs.
 | 
			
		||||
- Concelier resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `concelier.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.
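For sites that pin these knobs at deploy time instead of editing `concelier.yaml`, a minimal sketch of the environment-variable form is shown below. It assumes the `CONCELIER_AUTHORITY__*` prefix referenced in the prerequisites and the usual double-underscore nesting for configuration keys; the exact variable names and values are illustrative, so verify them against your configuration binder before relying on them.

```bash
# Illustrative only: environment-variable equivalents of the resilience keys above.
# Naming assumes the CONCELIER_AUTHORITY__* prefix and double-underscore nesting.
export CONCELIER_AUTHORITY__RESILIENCE__ENABLERETRIES=true
export CONCELIER_AUTHORITY__RESILIENCE__ALLOWOFFLINECACHEFALLBACK=true
export CONCELIER_AUTHORITY__RESILIENCE__OFFLINECACHETOLERANCE=00:20:00   # wider window for Offline Kit sites
export CONCELIER_AUTHORITY__RESILIENCE__RETRYDELAYS__0=00:00:01
export CONCELIER_AUTHORITY__RESILIENCE__RETRYDELAYS__1=00:00:02
export CONCELIER_AUTHORITY__RESILIENCE__RETRYDELAYS__2=00:00:05
```

As noted above, environment-variable changes only take effect after the container restarts.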
 | 
			
		||||
 | 
			
		||||
## 2. Key Signals
 | 
			
		||||
 | 
			
		||||
### 2.1 Audit log channel
 | 
			
		||||
 | 
			
		||||
Concelier emits structured audit entries via the `Concelier.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.
 | 
			
		||||
 | 
			
		||||
```
 | 
			
		||||
Concelier authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=concelier-cli scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
| Field        | Sample value            | Meaning                                                                                  |
 | 
			
		||||
|--------------|-------------------------|------------------------------------------------------------------------------------------|
 | 
			
		||||
| `route`      | `/jobs/definitions`     | Endpoint that processed the request.                                                     |
 | 
			
		||||
| `status`     | `200` / `401` / `409`   | Final HTTP status code returned to the caller.                                           |
 | 
			
		||||
| `subject`    | `ops@example.com`       | User or service principal subject (falls back to `(anonymous)` when unauthenticated).    |
 | 
			
		||||
| `clientId`   | `concelier-cli`         | OAuth client ID provided by Authority (`(none)` if the token lacked the claim).         |
 | 
			
		||||
| `scopes`     | `concelier.jobs.trigger advisory:ingest advisory:read` | Normalised scope list extracted from token claims; `(none)` if the token carried none.   |
 | 
			
		||||
| `tenant`     | `tenant-default`        | Tenant claim extracted from the Authority token (`(none)` when the token lacked it).     |
 | 
			
		||||
| `bypass`     | `True` / `False`        | Indicates whether the request succeeded because its source IP matched a bypass CIDR.    |
 | 
			
		||||
| `remote`     | `10.1.4.7`              | Remote IP recorded from the connection or from the forwarded-header test hooks.         |
 | 
			
		||||
 | 
			
		||||
Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations (a LogQL sketch follows the list):
 | 
			
		||||
 | 
			
		||||
- `status=401 AND bypass=True` – bypass network accepted an unauthenticated call (should be temporary during rollout).
 | 
			
		||||
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
 | 
			
		||||
- `status=202 AND NOT contains(scopes,"advisory:ingest")` – ingestion attempted without the new AOC scopes; confirm the Authority client registration matches the sample above.
 | 
			
		||||
- `tenant!=(tenant-default)` – indicates a cross-tenant token was accepted. Ensure Concelier `requiredTenants` is aligned with Authority client registration.
 | 
			
		||||
- Spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.
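If Loki is the backend, the first two bullet filters can be written as LogQL roughly as follows. The stream selector (`app="concelier"`) is an assumption about your labelling, and the `logfmt` stage relies on the key=value layout of the audit message shown earlier; treat these as starting points, not canonical queries.

```
# Bypass network accepted an unauthenticated call (assumed stream selector)
{app="concelier"} |= "Concelier authorization audit" | logfmt | status="401" | bypass="True"

# Job accepted for a token that carried no scopes
{app="concelier"} |= "Concelier authorization audit" | logfmt | status="202" | scopes="(none)"
```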
 | 
			
		||||
 | 
			
		||||
### 2.2 Metrics
 | 
			
		||||
 | 
			
		||||
Concelier publishes counters under the OTEL meter `StellaOps.Concelier.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.
 | 
			
		||||
 | 
			
		||||
| Metric name                   | Description                                        | PromQL example |
 | 
			
		||||
|-------------------------------|----------------------------------------------------|----------------|
 | 
			
		||||
| `web.jobs.triggered`          | Accepted job trigger requests.                     | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
 | 
			
		||||
| `web.jobs.trigger.conflict`   | Rejected triggers (already running, disabled…).    | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
 | 
			
		||||
| `web.jobs.trigger.failed`     | Server-side job failures.                          | `sum(rate(web_jobs_trigger_failed_total[5m]))` |
 | 
			
		||||
 | 
			
		||||
> Prometheus/OTEL collectors typically surface counters with `_total` suffix. Adjust queries to match your pipeline’s generated metric names.
 | 
			
		||||
 | 
			
		||||
Correlate audit logs with the following global meter exported via `Concelier.SourceDiagnostics`:
 | 
			
		||||
 | 
			
		||||
- `concelier.source.http.requests_total{concelier_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
 | 
			
		||||
- If Grafana dashboards are deployed, extend the “Concelier Jobs” board with the above counters plus a table of recent audit log entries.
 | 
			
		||||
 | 
			
		||||
## 3. Alerting Guidance
 | 
			
		||||
 | 
			
		||||
1. **Unauthorized bypass attempt**  
 | 
			
		||||
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`  
 | 
			
		||||
   - Action: verify `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.
 | 
			
		||||
 | 
			
		||||
2. **Missing scopes**  
 | 
			
		||||
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`  
 | 
			
		||||
   - Action: audit Authority client registration; ensure `requiredScopes` includes `concelier.jobs.trigger`, `advisory:ingest`, and `advisory:read`.
 | 
			
		||||
 | 
			
		||||
3. **Trigger failure surge**  
 | 
			
		||||
   - Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.  
 | 
			
		||||
   - Action: inspect correlated audit entries and `Concelier.Telemetry` traces for job execution errors (a Prometheus rule-file sketch for this alert follows the list).
 | 
			
		||||
 | 
			
		||||
4. **Conflict spike**  
 | 
			
		||||
   - Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune threshold).  
 | 
			
		||||
   - Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.
 | 
			
		||||
 | 
			
		||||
5. **Authority offline**  
 | 
			
		||||
   - Watch `Concelier.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.
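If alert rules are managed as Prometheus rule files, the trigger-failure query from item 3 can be wrapped roughly as below. The group name, annotations, and runbook path are placeholders, and the metric name assumes the `_total` suffix convention noted in §2.2.

```yaml
groups:
  - name: concelier-jobs-alerts        # placeholder group name
    rules:
      - alert: ConcelierJobTriggerFailures
        expr: sum(rate(web_jobs_trigger_failed_total[10m])) > 0
        for: 10m                       # matches the "sustained for 10 minutes" guidance above
        labels:
          severity: warning
        annotations:
          summary: "Concelier job triggers are failing"
          runbook: "docs/ops/concelier-authority-audit-runbook.md"   # placeholder path
```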
 | 
			
		||||
 | 
			
		||||
## 4. Rollout & Verification Procedure
 | 
			
		||||
 | 
			
		||||
1. **Pre-checks**
 | 
			
		||||
   - Align with the rollout phases documented in `docs/10_CONCELIER_CLI_QUICKSTART.md` (validation → rehearsal → enforced) and record the target dates in your change request.
 | 
			
		||||
   - Confirm `allowAnonymousFallback` is `false` in production; keep `true` only during staged validation.
 | 
			
		||||
   - Validate Authority issuer metadata is reachable from Concelier (`curl https://authority.internal/.well-known/openid-configuration` from the host).
 | 
			
		||||
 | 
			
		||||
2. **Smoke test with valid token**
 | 
			
		||||
   - Obtain a token via CLI: `stella auth login --scope "concelier.jobs.trigger advisory:ingest" --scope advisory:read`.
 | 
			
		||||
   - Trigger a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://concelier.internal/jobs/definitions`.
 | 
			
		||||
   - Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=concelier.jobs.trigger advisory:ingest advisory:read`, and `tenant=tenant-default`.
 | 
			
		||||
 | 
			
		||||
3. **Negative test without token**
 | 
			
		||||
   - Call the same endpoint without a token. Expect HTTP 401, `bypass=False`.
 | 
			
		||||
   - If the request succeeds, double-check `bypassNetworks` and ensure fallback is disabled.
 | 
			
		||||
 | 
			
		||||
4. **Bypass check (if applicable)**
 | 
			
		||||
   - From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review business justification and expiry date for such entries.
 | 
			
		||||
 | 
			
		||||
5. **Metrics validation**
 | 
			
		||||
   - Ensure `web.jobs.triggered` counter increments during accepted runs.
 | 
			
		||||
   - Exporters should show corresponding spans (`concelier.job.trigger`) if tracing is enabled.
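A quick PromQL spot-check run before and after the smoke test makes the increment visible; it assumes the exporter applies the `_total` suffix discussed in §2.2.

```
sum by (job_kind) (increase(web_jobs_triggered_total[15m]))
```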
 | 
			
		||||
 | 
			
		||||
## 5. Troubleshooting
 | 
			
		||||
 | 
			
		||||
| Symptom | Probable cause | Remediation |
 | 
			
		||||
|---------|----------------|-------------|
 | 
			
		||||
| Audit log shows `clientId=(none)` for all requests | Authority not issuing `client_id` claim or CLI outdated | Update StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`), or upgrade the CLI token acquisition flow. |
 | 
			
		||||
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks` or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Concelier. |
 | 
			
		||||
| HTTP 401 with valid token | `requiredScopes` missing from client registration or token audience mismatch | Verify Authority client scopes (`concelier.jobs.trigger`) and ensure the token audience matches `audiences` config. |
 | 
			
		||||
| Metrics missing from Prometheus | Telemetry exporters disabled or filter missing OTEL meter | Set `concelier.telemetry.enableMetrics=true`, ensure collector includes `StellaOps.Concelier.WebService.Jobs` meter. |
 | 
			
		||||
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or Authority timeout mid-request | Inspect Concelier job logs, re-run with tracing enabled, validate Authority latency. |
 | 
			
		||||
 | 
			
		||||
## 6. References
 | 
			
		||||
 | 
			
		||||
- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
 | 
			
		||||
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
 | 
			
		||||
- `docs/ops/authority-monitoring.md` – Authority-side monitoring and alerting playbook.
 | 
			
		||||
- `StellaOps.Concelier.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of audit log fields.
 | 
			
		||||
 | 
			
		||||
@@ -1,72 +1,72 @@
 | 
			
		||||
# Concelier CCCS Connector Operations
 | 
			
		||||
 | 
			
		||||
This runbook covers day‑to‑day operation of the Canadian Centre for Cyber Security (`source:cccs:*`) connector, including configuration, telemetry, and historical backfill guidance for English/French advisories.
 | 
			
		||||
 | 
			
		||||
## 1. Configuration Checklist
 | 
			
		||||
 | 
			
		||||
- Network egress (or mirrored cache) for `https://www.cyber.gc.ca/` and the JSON API endpoints under `/api/cccs/`.
 | 
			
		||||
- Set the Concelier options before restarting workers. Example `concelier.yaml` snippet:
 | 
			
		||||
 | 
			
		||||
```yaml
 | 
			
		||||
concelier:
 | 
			
		||||
  sources:
 | 
			
		||||
    cccs:
 | 
			
		||||
      feeds:
 | 
			
		||||
        - language: "en"
 | 
			
		||||
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=en&content_type=cccs_threat"
 | 
			
		||||
        - language: "fr"
 | 
			
		||||
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=fr&content_type=cccs_threat"
 | 
			
		||||
      maxEntriesPerFetch: 80        # increase temporarily for backfill runs
 | 
			
		||||
      maxKnownEntries: 512
 | 
			
		||||
      requestTimeout: "00:00:30"
 | 
			
		||||
      requestDelay: "00:00:00.250"
 | 
			
		||||
      failureBackoff: "00:05:00"
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
> ℹ️  The `/api/cccs/threats/v1/get` endpoint returns thousands of records per language (≈5 100 rows each as of 2025‑10‑14). The connector honours `maxEntriesPerFetch`, so leave it low for steady‑state and raise it for planned backfills.
 | 
			
		||||
 | 
			
		||||
## 2. Telemetry & Logging
 | 
			
		||||
 | 
			
		||||
- **Metrics (Meter `StellaOps.Concelier.Connector.Cccs`):**
 | 
			
		||||
  - `cccs.fetch.attempts`, `cccs.fetch.success`, `cccs.fetch.failures`
 | 
			
		||||
  - `cccs.fetch.documents`, `cccs.fetch.unchanged`
 | 
			
		||||
  - `cccs.parse.success`, `cccs.parse.failures`, `cccs.parse.quarantine`
 | 
			
		||||
  - `cccs.map.success`, `cccs.map.failures`
 | 
			
		||||
- **Shared HTTP metrics** via `SourceDiagnostics`:
 | 
			
		||||
  - `concelier.source.http.requests{concelier.source="cccs"}`
 | 
			
		||||
  - `concelier.source.http.failures{concelier.source="cccs"}`
 | 
			
		||||
  - `concelier.source.http.duration{concelier.source="cccs"}`
 | 
			
		||||
- **Structured logs**
 | 
			
		||||
  - `CCCS fetch completed feeds=… items=… newDocuments=… pendingDocuments=…`
 | 
			
		||||
  - `CCCS parse completed parsed=… failures=…`
 | 
			
		||||
  - `CCCS map completed mapped=… failures=…`
 | 
			
		||||
  - Warnings fire when GridFS payloads/DTOs go missing or parser sanitisation fails.
 | 
			
		||||
 | 
			
		||||
Suggested Grafana alerts:
 | 
			
		||||
- `increase(cccs_fetch_failures_total[15m]) > 0`
 | 
			
		||||
- `rate(cccs_map_success_total[1h]) == 0` while other connectors are active
 | 
			
		||||
- `histogram_quantile(0.95, rate(concelier_source_http_duration_bucket{concelier_source="cccs"}[1h])) > 5s`
 | 
			
		||||
 | 
			
		||||
## 3. Historical Backfill Plan
 | 
			
		||||
 | 
			
		||||
1. **Snapshot the source** – the API accepts `page=<n>` and `lang=<en|fr>` query parameters. `page=0` returns the full dataset (observed earliest `date_created`: 2018‑06‑08 for EN, 2018‑06‑08 for FR). Mirror those responses into Offline Kit storage when operating air‑gapped.
 | 
			
		||||
2. **Stage ingestion**:
 | 
			
		||||
   - Temporarily raise `maxEntriesPerFetch` (e.g. 500) and restart Concelier workers.
 | 
			
		||||
   - Run chained jobs until `pendingDocuments` drains:  
 | 
			
		||||
     `stella db jobs run source:cccs:fetch --and-then source:cccs:parse --and-then source:cccs:map`
 | 
			
		||||
   - Monitor `cccs.fetch.unchanged` growth; once it approaches dataset size the backfill is complete.
 | 
			
		||||
3. **Optional pagination sweep** – for incremental mirrors, iterate `page=<n>` (0…N) while `response.Count == 50`, persisting JSON to disk. Store alongside metadata (`language`, `page`, SHA256) so repeated runs detect drift.
 | 
			
		||||
4. **Language split** – keep EN/FR payloads separate to preserve canonical language fields. The connector emits `Language` directly from the feed entry, so mixed ingestion simply produces parallel advisories keyed by the same serial number.
 | 
			
		||||
5. **Throttle planning** – schedule backfills during maintenance windows; the API tolerates burst downloads, but keep the 250 ms request delay (or raise it) whenever you are hitting the live API rather than a mirrored cache.
 | 
			
		||||
 | 
			
		||||
## 4. Selector & Sanitiser Notes
 | 
			
		||||
 | 
			
		||||
- `CccsHtmlParser` now parses the **unsanitised DOM** (via AngleSharp) and only sanitises when persisting `ContentHtml`.
 | 
			
		||||
- Product extraction walks headings (`Affected Products`, `Produits touchés`, `Mesures recommandées`) and consumes nested lists within `div/section/article` containers.
 | 
			
		||||
- `HtmlContentSanitizer` allows `<h1>…<h6>` and `<section>` so stored HTML keeps headings for UI rendering and downstream summarisation.
 | 
			
		||||
 | 
			
		||||
## 5. Fixture Maintenance
 | 
			
		||||
 | 
			
		||||
- Regression fixtures live in `src/StellaOps.Concelier.Connector.Cccs.Tests/Fixtures`.
 | 
			
		||||
- Refresh via `UPDATE_CCCS_FIXTURES=1 dotnet test src/StellaOps.Concelier.Connector.Cccs.Tests/StellaOps.Concelier.Connector.Cccs.Tests.csproj`.
 | 
			
		||||
- Fixtures capture both EN/FR advisories with nested lists to guard against sanitiser regressions; review diffs for heading/list changes before committing.
 | 
			
		||||
# Concelier CCCS Connector Operations
 | 
			
		||||
 | 
			
		||||
This runbook covers day‑to‑day operation of the Canadian Centre for Cyber Security (`source:cccs:*`) connector, including configuration, telemetry, and historical backfill guidance for English/French advisories.
 | 
			
		||||
 | 
			
		||||
## 1. Configuration Checklist
 | 
			
		||||
 | 
			
		||||
- Network egress (or mirrored cache) for `https://www.cyber.gc.ca/` and the JSON API endpoints under `/api/cccs/`.
 | 
			
		||||
- Set the Concelier options before restarting workers. Example `concelier.yaml` snippet:
 | 
			
		||||
 | 
			
		||||
```yaml
 | 
			
		||||
concelier:
 | 
			
		||||
  sources:
 | 
			
		||||
    cccs:
 | 
			
		||||
      feeds:
 | 
			
		||||
        - language: "en"
 | 
			
		||||
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=en&content_type=cccs_threat"
 | 
			
		||||
        - language: "fr"
 | 
			
		||||
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=fr&content_type=cccs_threat"
 | 
			
		||||
      maxEntriesPerFetch: 80        # increase temporarily for backfill runs
 | 
			
		||||
      maxKnownEntries: 512
 | 
			
		||||
      requestTimeout: "00:00:30"
 | 
			
		||||
      requestDelay: "00:00:00.250"
 | 
			
		||||
      failureBackoff: "00:05:00"
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
> ℹ️  The `/api/cccs/threats/v1/get` endpoint returns thousands of records per language (≈5 100 rows each as of 2025‑10‑14). The connector honours `maxEntriesPerFetch`, so leave it low for steady‑state and raise it for planned backfills.
 | 
			
		||||
 | 
			
		||||
## 2. Telemetry & Logging
 | 
			
		||||
 | 
			
		||||
- **Metrics (Meter `StellaOps.Concelier.Connector.Cccs`):**
 | 
			
		||||
  - `cccs.fetch.attempts`, `cccs.fetch.success`, `cccs.fetch.failures`
 | 
			
		||||
  - `cccs.fetch.documents`, `cccs.fetch.unchanged`
 | 
			
		||||
  - `cccs.parse.success`, `cccs.parse.failures`, `cccs.parse.quarantine`
 | 
			
		||||
  - `cccs.map.success`, `cccs.map.failures`
 | 
			
		||||
- **Shared HTTP metrics** via `SourceDiagnostics`:
 | 
			
		||||
  - `concelier.source.http.requests{concelier.source="cccs"}`
 | 
			
		||||
  - `concelier.source.http.failures{concelier.source="cccs"}`
 | 
			
		||||
  - `concelier.source.http.duration{concelier.source="cccs"}`
 | 
			
		||||
- **Structured logs**
 | 
			
		||||
  - `CCCS fetch completed feeds=… items=… newDocuments=… pendingDocuments=…`
 | 
			
		||||
  - `CCCS parse completed parsed=… failures=…`
 | 
			
		||||
  - `CCCS map completed mapped=… failures=…`
 | 
			
		||||
  - Warnings fire when GridFS payloads/DTOs go missing or parser sanitisation fails.
 | 
			
		||||
 | 
			
		||||
Suggested Grafana alerts:
 | 
			
		||||
- `increase(cccs_fetch_failures_total[15m]) > 0`
 | 
			
		||||
- `rate(cccs_map_success_total[1h]) == 0` while other connectors are active
 | 
			
		||||
- `histogram_quantile(0.95, rate(concelier_source_http_duration_bucket{concelier_source="cccs"}[1h])) > 5s`
 | 
			
		||||
 | 
			
		||||
## 3. Historical Backfill Plan
 | 
			
		||||
 | 
			
		||||
1. **Snapshot the source** – the API accepts `page=<n>` and `lang=<en|fr>` query parameters. `page=0` returns the full dataset (observed earliest `date_created`: 2018‑06‑08 for EN, 2018‑06‑08 for FR). Mirror those responses into Offline Kit storage when operating air‑gapped.
 | 
			
		||||
2. **Stage ingestion**:
 | 
			
		||||
   - Temporarily raise `maxEntriesPerFetch` (e.g. 500) and restart Concelier workers.
 | 
			
		||||
   - Run chained jobs until `pendingDocuments` drains:  
 | 
			
		||||
     `stella db jobs run source:cccs:fetch --and-then source:cccs:parse --and-then source:cccs:map`
 | 
			
		||||
   - Monitor `cccs.fetch.unchanged` growth; once it approaches dataset size the backfill is complete.
 | 
			
		||||
3. **Optional pagination sweep** – for incremental mirrors, iterate `page=<n>` (0…N) while `response.Count == 50`, persisting JSON to disk. Store alongside metadata (`language`, `page`, SHA256) so repeated runs detect drift; a sketch of this sweep follows the list.
 | 
			
		||||
4. **Language split** – keep EN/FR payloads separate to preserve canonical language fields. The connector emits `Language` directly from the feed entry, so mixed ingestion simply produces parallel advisories keyed by the same serial number.
 | 
			
		||||
5. **Throttle planning** – schedule backfills during maintenance windows; the API tolerates burst downloads, but keep the 250 ms request delay (or raise it) whenever you are hitting the live API rather than a mirrored cache.
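A minimal sketch of the pagination sweep from step 3 is below. It assumes `curl`, `jq`, and `sha256sum` are available and that the endpoint returns a top-level JSON array; if the payload is wrapped in an envelope, adjust the `jq 'length'` probe accordingly.

```bash
#!/usr/bin/env bash
# Sketch only: mirror CCCS advisories page by page until a short page signals the end.
set -euo pipefail

lang="${1:-en}"                       # en or fr
outdir="cccs-mirror/${lang}"
mkdir -p "${outdir}"

page=0
while :; do
  url="https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=${lang}&content_type=cccs_threat&page=${page}"
  file="${outdir}/page-${page}.json"
  curl -sS --fail "${url}" -o "${file}"

  count="$(jq 'length' "${file}")"    # assumes a top-level JSON array; adapt if the shape differs
  sha256sum "${file}" >> "${outdir}/checksums.sha256"
  echo "lang=${lang} page=${page} count=${count}"

  # Full pages carry 50 records; a shorter page means the sweep is done.
  [ "${count}" -eq 50 ] || break

  sleep 0.25                          # mirror the connector's 250 ms request delay
  page=$((page + 1))
done
```

Store the resulting directory alongside the Offline Kit snapshot so the per-page checksums can be compared on subsequent runs.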
 | 
			
		||||
 | 
			
		||||
## 4. Selector & Sanitiser Notes
 | 
			
		||||
 | 
			
		||||
- `CccsHtmlParser` now parses the **unsanitised DOM** (via AngleSharp) and only sanitises when persisting `ContentHtml`.
 | 
			
		||||
- Product extraction walks headings (`Affected Products`, `Produits touchés`, `Mesures recommandées`) and consumes nested lists within `div/section/article` containers.
 | 
			
		||||
- `HtmlContentSanitizer` allows `<h1>…<h6>` and `<section>` so stored HTML keeps headings for UI rendering and downstream summarisation.
 | 
			
		||||
 | 
			
		||||
## 5. Fixture Maintenance
 | 
			
		||||
 | 
			
		||||
- Regression fixtures live in `src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/Fixtures`.
 | 
			
		||||
- Refresh via `UPDATE_CCCS_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/StellaOps.Concelier.Connector.Cccs.Tests.csproj`.
 | 
			
		||||
- Fixtures capture both EN/FR advisories with nested lists to guard against sanitiser regressions; review diffs for heading/list changes before committing.
 | 
			
		||||
 
 | 
			
		||||
@@ -1,160 +1,160 @@
 | 
			
		||||
# Concelier Conflict Resolution Runbook (Sprint 3)
 | 
			
		||||
 | 
			
		||||
This runbook equips Concelier operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 1. Precedence Model (recap)
 | 
			
		||||
 | 
			
		||||
- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `concelier:merge:precedence:ranks` to override per source when incident response requires it.
 | 
			
		||||
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
 | 
			
		||||
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
 | 
			
		||||
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 2. Telemetry Shipped This Sprint
 | 
			
		||||
 | 
			
		||||
| Instrument | Type | Key Tags | Purpose |
 | 
			
		||||
|------------|------|----------|---------|
 | 
			
		||||
| `concelier.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
 | 
			
		||||
| `concelier.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
 | 
			
		||||
| `concelier.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
 | 
			
		||||
| `concelier.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
 | 
			
		||||
| `concelier.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |
 | 
			
		||||
 | 
			
		||||
### Structured logs
 | 
			
		||||
 | 
			
		||||
- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
 | 
			
		||||
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
 | 
			
		||||
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
 | 
			
		||||
- `Alias collision ...` (no EventId) - emitted when `concelier.merge.identity_conflicts` increments.
 | 
			
		||||
 | 
			
		||||
Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Concelier.Merge`.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 3. Detection & Alerting
 | 
			
		||||
 | 
			
		||||
1. **Dashboard panels**
 | 
			
		||||
   - `concelier.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15 minute window.
 | 
			
		||||
   - `concelier.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
 | 
			
		||||
   - `concelier.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
 | 
			
		||||
   - `concelier.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
 | 
			
		||||
2. **Log based alerts**
 | 
			
		||||
   - `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
 | 
			
		||||
   - `eventId=1002` with `reason="mismatch"` - severity disagreement; open connector bug if sustained.
 | 
			
		||||
3. **Job health**
 | 
			
		||||
   - `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #concelier-ops.
 | 
			
		||||
 | 
			
		||||
### Threshold updates (2025-10-12)
 | 
			
		||||
 | 
			
		||||
- `concelier.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
 | 
			
		||||
- `concelier.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
 | 
			
		||||
- `concelier.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3, and annotate dashboards to note that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 4. Triage Workflow
 | 
			
		||||
 | 
			
		||||
1. **Confirm job context**
 | 
			
		||||
   - `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
 | 
			
		||||
2. **Inspect metrics**
 | 
			
		||||
   - Correlate spikes in `concelier.merge.conflicts` with `primary_source`/`suppressed_source` tags from `concelier.merge.overrides`.
 | 
			
		||||
3. **Pull structured logs**
 | 
			
		||||
   - Example (vector output):
 | 
			
		||||
     ```
 | 
			
		||||
     jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-concelier.log
 | 
			
		||||
     ```
 | 
			
		||||
4. **Review merge events**
 | 
			
		||||
   - `mongosh`:
 | 
			
		||||
     ```javascript
 | 
			
		||||
     use concelier;
 | 
			
		||||
     db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
 | 
			
		||||
     ```
 | 
			
		||||
   - Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
 | 
			
		||||
5. **Interrogate provenance**
 | 
			
		||||
   - `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
 | 
			
		||||
   - Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 5. Conflict Classification Matrix
 | 
			
		||||
 | 
			
		||||
| Signal | Likely Cause | Immediate Action |
 | 
			
		||||
|--------|--------------|------------------|
 | 
			
		||||
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
 | 
			
		||||
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
 | 
			
		||||
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `concelier:merge:precedence:ranks` to break the tie; restart merge job. |
 | 
			
		||||
| Rising `concelier.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
 | 
			
		||||
| `concelier.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 6. Resolution Playbook
 | 
			
		||||
 | 
			
		||||
1. **Connector data fix**
 | 
			
		||||
   - Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map` etc.).
 | 
			
		||||
   - Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
 | 
			
		||||
2. **Temporary precedence override**
 | 
			
		||||
   - Edit `etc/concelier.yaml`:
 | 
			
		||||
     ```yaml
 | 
			
		||||
     concelier:
 | 
			
		||||
       merge:
 | 
			
		||||
         precedence:
 | 
			
		||||
           ranks:
 | 
			
		||||
             osv: 1
 | 
			
		||||
             ghsa: 0
 | 
			
		||||
     ```
 | 
			
		||||
   - Restart Concelier workers; confirm tags in `concelier.merge.overrides` show the new ranks.
 | 
			
		||||
   - Document the override with expiry in the change log.
 | 
			
		||||
3. **Alias remediation**
 | 
			
		||||
   - Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
 | 
			
		||||
   - Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive; coordinate with Storage before issuing it).
 | 
			
		||||
4. **Escalation**
 | 
			
		||||
   - If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 7. Validation Checklist
 | 
			
		||||
 | 
			
		||||
- [ ] Merge job rerun returns exit code `0`.
 | 
			
		||||
- [ ] `concelier.merge.conflicts` baseline returns to zero after corrective action.
 | 
			
		||||
- [ ] Latest `merge_event` entry shows expected hash delta.
 | 
			
		||||
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
 | 
			
		||||
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 8. Reference Material
 | 
			
		||||
 | 
			
		||||
- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
 | 
			
		||||
- Merge engine internals: `src/StellaOps.Concelier.Merge/Services/AdvisoryPrecedenceMerger.cs`.
 | 
			
		||||
- Metrics definitions: `src/StellaOps.Concelier.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
 | 
			
		||||
- Storage audit trail: `src/StellaOps.Concelier.Merge/Services/MergeEventWriter.cs`, `src/StellaOps.Concelier.Storage.Mongo/MergeEvents`.
 | 
			
		||||
 | 
			
		||||
Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 9. Synthetic Regression Fixtures
 | 
			
		||||
 | 
			
		||||
- **Locations** – Canonical conflict snapshots now live at `src/StellaOps.Concelier.Connector.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/StellaOps.Concelier.Connector.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/StellaOps.Concelier.Connector.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
 | 
			
		||||
- **Validation commands** – To regenerate and verify the fixtures offline, run:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
dotnet test src/StellaOps.Concelier.Connector.Ghsa.Tests/StellaOps.Concelier.Connector.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
 | 
			
		||||
dotnet test src/StellaOps.Concelier.Connector.Nvd.Tests/StellaOps.Concelier.Connector.Nvd.Tests.csproj --filter NvdConflictFixtureTests
 | 
			
		||||
dotnet test src/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj --filter OsvConflictFixtureTests
 | 
			
		||||
dotnet test src/StellaOps.Concelier.Merge.Tests/StellaOps.Concelier.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `concelier.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 10. Change Log
 | 
			
		||||
 | 
			
		||||
| Date (UTC) | Change | Notes |
 | 
			
		||||
|------------|--------|-------|
 | 
			
		||||
| 2025-10-16 | Ops review signed off after connector expansion (CCCS, CERT-Bund, KISA, ICS CISA, MSRC) landed. Alert thresholds from §3 reaffirmed; dashboards updated to watch attachment signals emitted by ICS CISA connector. | Ops sign-off recorded by Concelier Ops Guild; no additional overrides required. |
 | 
			
		||||
# Concelier Conflict Resolution Runbook (Sprint 3)
 | 
			
		||||
 | 
			
		||||
This runbook equips Concelier operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 1. Precedence Model (recap)
 | 
			
		||||
 | 
			
		||||
- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `concelier:merge:precedence:ranks` to override per source when incident response requires it.
 | 
			
		||||
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
 | 
			
		||||
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
 | 
			
		||||
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 2. Telemetry Shipped This Sprint
 | 
			
		||||
 | 
			
		||||
| Instrument | Type | Key Tags | Purpose |
 | 
			
		||||
|------------|------|----------|---------|
 | 
			
		||||
| `concelier.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
 | 
			
		||||
| `concelier.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
 | 
			
		||||
| `concelier.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
 | 
			
		||||
| `concelier.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
 | 
			
		||||
| `concelier.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |
 | 
			
		||||
 | 
			
		||||
### Structured logs
 | 
			
		||||
 | 
			
		||||
- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
 | 
			
		||||
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
 | 
			
		||||
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
 | 
			
		||||
- `Alias collision ...` (no EventId) - emitted when `concelier.merge.identity_conflicts` increments.
 | 
			
		||||
 | 
			
		||||
Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Concelier.Merge`.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 3. Detection & Alerting
 | 
			
		||||
 | 
			
		||||
1. **Dashboard panels**
 | 
			
		||||
   - `concelier.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15 minute window.
 | 
			
		||||
   - `concelier.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
 | 
			
		||||
   - `concelier.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
 | 
			
		||||
   - `concelier.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
 | 
			
		||||
2. **Log based alerts**
 | 
			
		||||
   - `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
 | 
			
		||||
   - `eventId=1002` with `reason="mismatch"` - severity disagreement; open connector bug if sustained.
 | 
			
		||||
3. **Job health**
 | 
			
		||||
   - `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #concelier-ops.
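A simple automation wrapper for that convention is sketched below; the log path and Slack webhook are placeholders, and the only behaviour taken from this runbook is the exit-code-1 signal from `stellaops-cli db merge`.

```bash
#!/usr/bin/env bash
# Sketch: run the merge job, keep its output, and notify #concelier-ops on unresolved conflicts.
set -uo pipefail

log="/var/log/concelier/merge-$(date -u +%Y%m%dT%H%M%SZ).log"   # placeholder path
slack_webhook="${SLACK_WEBHOOK_URL:?set SLACK_WEBHOOK_URL}"      # placeholder secret

stellaops-cli db merge --verbose | tee "${log}"
status=${PIPESTATUS[0]}

if [ "${status}" -eq 1 ]; then
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\":\"Concelier merge reported unresolved conflicts; see ${log}\"}" \
    "${slack_webhook}"
fi

exit "${status}"
```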
 | 
			
		||||
 | 
			
		||||
### Threshold updates (2025-10-12)
 | 
			
		||||
 | 
			
		||||
- `concelier.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
 | 
			
		||||
- `concelier.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
 | 
			
		||||
- `concelier.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3, and annotate dashboards to note that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
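Expressed as Prometheus rules (exported counter names are assumed to follow the usual dots-to-underscores plus `_total` convention; adjust to your pipeline), the thresholds above translate roughly to:

```yaml
groups:
  - name: concelier-merge-thresholds      # placeholder group name
    rules:
      - alert: ConcelierMergeConflicts
        expr: sum(increase(concelier_merge_conflicts_total[30m])) >= 2
        labels: { severity: page }
      - alert: ConcelierMergeOverrideVolume
        expr: sum(increase(concelier_merge_overrides_total[30m])) > 10
        labels: { severity: warning }
      - alert: ConcelierMergeRangeOverrides
        expr: sum(increase(concelier_merge_range_overrides_total[15m])) >= 3
        labels: { severity: warning }
```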
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 4. Triage Workflow
 | 
			
		||||
 | 
			
		||||
1. **Confirm job context**
 | 
			
		||||
   - `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
 | 
			
		||||
2. **Inspect metrics**
 | 
			
		||||
   - Correlate spikes in `concelier.merge.conflicts` with `primary_source`/`suppressed_source` tags from `concelier.merge.overrides`.
 | 
			
		||||
3. **Pull structured logs**
 | 
			
		||||
   - Example (vector output):
 | 
			
		||||
     ```
 | 
			
		||||
     jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-concelier.log
 | 
			
		||||
     ```
 | 
			
		||||
4. **Review merge events**
 | 
			
		||||
   - `mongosh`:
 | 
			
		||||
     ```javascript
 | 
			
		||||
     use concelier;
 | 
			
		||||
     db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
 | 
			
		||||
     ```
 | 
			
		||||
   - Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
 | 
			
		||||
5. **Interrogate provenance**
 | 
			
		||||
   - `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
 | 
			
		||||
   - Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 5. Conflict Classification Matrix
 | 
			
		||||
 | 
			
		||||
| Signal | Likely Cause | Immediate Action |
 | 
			
		||||
|--------|--------------|------------------|
 | 
			
		||||
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
 | 
			
		||||
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
 | 
			
		||||
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `concelier:merge:precedence:ranks` to break the tie; restart merge job. |
 | 
			
		||||
| Rising `concelier.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
 | 
			
		||||
| `concelier.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 6. Resolution Playbook
 | 
			
		||||
 | 
			
		||||
1. **Connector data fix**
 | 
			
		||||
   - Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map` etc.).
 | 
			
		||||
   - Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
 | 
			
		||||
2. **Temporary precedence override**
 | 
			
		||||
   - Edit `etc/concelier.yaml`:
 | 
			
		||||
     ```yaml
 | 
			
		||||
     concelier:
 | 
			
		||||
       merge:
 | 
			
		||||
         precedence:
 | 
			
		||||
           ranks:
 | 
			
		||||
             osv: 1
 | 
			
		||||
             ghsa: 0
 | 
			
		||||
     ```
 | 
			
		||||
   - Restart Concelier workers; confirm tags in `concelier.merge.overrides` show the new ranks.
 | 
			
		||||
   - Document the override with expiry in the change log.
 | 
			
		||||
3. **Alias remediation**
 | 
			
		||||
   - Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
 | 
			
		||||
   - Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive; coordinate with Storage before issuing it).
 | 
			
		||||
4. **Escalation**
 | 
			
		||||
   - If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
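
For container deployments where editing `etc/concelier.yaml` is inconvenient, the override from step 2 can usually be supplied through configuration environment variables instead. This is a sketch under two assumptions: the standard `__` key mapping used elsewhere in this document applies, and the service name to restart matches your compose/Helm profile.

```bash
# Mirror the YAML precedence override from step 2 via environment variables
# (key names assume the conventional "__" configuration mapping).
export CONCELIER__MERGE__PRECEDENCE__RANKS__OSV=1
export CONCELIER__MERGE__PRECEDENCE__RANKS__GHSA=0

# Restart the merge workers so the new ranks bind (service name is illustrative).
docker compose restart concelier
# or: kubectl rollout restart deploy/stellaops-concelier
```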
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 7. Validation Checklist
 | 
			
		||||
 | 
			
		||||
- [ ] Merge job rerun returns exit code `0`.
 | 
			
		||||
- [ ] `concelier.merge.conflicts` baseline returns to zero after corrective action.
 | 
			
		||||
- [ ] Latest `merge_event` entry shows the expected hash delta (see the spot-check sketch after this list).
 | 
			
		||||
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
 | 
			
		||||
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
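
A minimal spot-check for the `merge_event` item above, assuming `mongosh` access to the `concelier` database used in section 4; the advisory key is an example.

```bash
# Verify the newest merge_event for an advisory actually changed canonical output.
mongosh concelier --quiet --eval '
  const events = db.merge_event.find({ advisoryKey: "CVE-2025-1234" })
    .sort({ mergedAt: -1 }).limit(1).toArray();
  if (events.length === 0) {
    print("no merge_event found");
  } else {
    print(events[0].beforeHash !== events[0].afterHash
      ? "hash delta present"
      : "no canonical change recorded");
  }
'
```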
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 8. Reference Material
 | 
			
		||||
 | 
			
		||||
- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
 | 
			
		||||
- Merge engine internals: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryPrecedenceMerger.cs`.
 | 
			
		||||
- Metrics definitions: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
 | 
			
		||||
- Storage audit trail: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/MergeEventWriter.cs`, `src/Concelier/__Libraries/StellaOps.Concelier.Storage.Mongo/MergeEvents`.
 | 
			
		||||
 | 
			
		||||
Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 9. Synthetic Regression Fixtures
 | 
			
		||||
 | 
			
		||||
- **Locations** – Canonical conflict snapshots now live at `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
 | 
			
		||||
- **Validation commands** – To regenerate and verify the fixtures offline, run:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/StellaOps.Concelier.Connector.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
 | 
			
		||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/StellaOps.Concelier.Connector.Nvd.Tests.csproj --filter NvdConflictFixtureTests
 | 
			
		||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj --filter OsvConflictFixtureTests
 | 
			
		||||
dotnet test src/Concelier/__Tests/StellaOps.Concelier.Merge.Tests/StellaOps.Concelier.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `concelier.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 10. Change Log
 | 
			
		||||
 | 
			
		||||
| Date (UTC) | Change | Notes |
 | 
			
		||||
|------------|--------|-------|
 | 
			
		||||
| 2025-10-16 | Ops review signed off after connector expansion (CCCS, CERT-Bund, KISA, ICS CISA, MSRC) landed. Alert thresholds from §3 reaffirmed; dashboards updated to watch attachment signals emitted by ICS CISA connector. | Ops sign-off recorded by Concelier Ops Guild; no additional overrides required. |
 | 
			
		||||
 
 | 
			
		||||
@@ -58,7 +58,7 @@ concelier:
 | 
			
		||||
 | 
			
		||||
While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests:
 | 
			
		||||
 | 
			
		||||
- Command: `dotnet test src/StellaOps.Concelier.Connector.Cve.Tests/StellaOps.Concelier.Connector.Cve.Tests.csproj -l "console;verbosity=detailed"`
 | 
			
		||||
- Command: `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Cve.Tests/StellaOps.Concelier.Connector.Cve.Tests.csproj -l "console;verbosity=detailed"`
 | 
			
		||||
- Summary log emitted by the connector:
 | 
			
		||||
  ```
 | 
			
		||||
  CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1
  ```
 | 
			
		||||
 
 | 
			
		||||
@@ -1,74 +1,74 @@
 | 
			
		||||
# Concelier KISA Connector Operations
 | 
			
		||||
 | 
			
		||||
Operational guidance for the Korea Internet & Security Agency (KISA / KNVD) connector (`source:kisa:*`). Pair this with the engineering brief in `docs/dev/kisa_connector_notes.md`.
 | 
			
		||||
 | 
			
		||||
## 1. Prerequisites
 | 
			
		||||
 | 
			
		||||
- Outbound HTTPS (or mirrored cache) for `https://knvd.krcert.or.kr/`.
 | 
			
		||||
- Connector options defined under `concelier:sources:kisa`:
 | 
			
		||||
 | 
			
		||||
```yaml
 | 
			
		||||
concelier:
 | 
			
		||||
  sources:
 | 
			
		||||
    kisa:
 | 
			
		||||
      feedUri: "https://knvd.krcert.or.kr/rss/securityInfo.do"
 | 
			
		||||
      detailApiUri: "https://knvd.krcert.or.kr/rssDetailData.do"
 | 
			
		||||
      detailPageUri: "https://knvd.krcert.or.kr/detailDos.do"
 | 
			
		||||
      maxAdvisoriesPerFetch: 10
 | 
			
		||||
      requestDelay: "00:00:01"
 | 
			
		||||
      failureBackoff: "00:05:00"
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
> Ensure the URIs stay absolute—Concelier adds the `feedUri`/`detailApiUri` hosts to the HttpClient allow-list automatically.
 | 
			
		||||
 | 
			
		||||
## 2. Staging Smoke Test
 | 
			
		||||
 | 
			
		||||
1. Restart the Concelier workers so the KISA options bind.
 | 
			
		||||
2. Run a full connector cycle:
 | 
			
		||||
   - CLI: `stella db jobs run source:kisa:fetch --and-then source:kisa:parse --and-then source:kisa:map`
 | 
			
		||||
   - REST: `POST /jobs/run { "kind": "source:kisa:fetch", "chain": ["source:kisa:parse", "source:kisa:map"] }` (a curl sketch follows this list)
 | 
			
		||||
3. Confirm telemetry (Meter `StellaOps.Concelier.Connector.Kisa`):
 | 
			
		||||
   - `kisa.feed.success`, `kisa.feed.items`
 | 
			
		||||
   - `kisa.detail.success` / `.failures`
 | 
			
		||||
   - `kisa.parse.success` / `.failures`
 | 
			
		||||
   - `kisa.map.success` / `.failures`
 | 
			
		||||
   - `kisa.cursor.updates`
 | 
			
		||||
4. Inspect logs for structured entries:
 | 
			
		||||
   - `KISA feed returned {ItemCount}`
 | 
			
		||||
   - `KISA fetched detail for {Idx} … category={Category}`
 | 
			
		||||
   - `KISA mapped advisory {AdvisoryId} (severity={Severity})`
 | 
			
		||||
   - Absence of warnings such as `document missing GridFS payload`.
 | 
			
		||||
5. Validate MongoDB state:
 | 
			
		||||
   - `raw_documents.metadata` has `kisa.idx`, `kisa.category`, `kisa.title`.
 | 
			
		||||
   - DTO store contains `schemaVersion="kisa.detail.v1"`.
 | 
			
		||||
   - Advisories include aliases (`IDX`, CVE) and `language="ko"`.
 | 
			
		||||
   - `source_states` entry for `kisa` shows recent `cursor.lastFetchAt`.
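
For reference, the REST trigger from step 2 can be exercised with `curl`. The gateway host and bearer token below are placeholders; include the `Authorization` header only if your deployment requires Authority tokens for the jobs API.

```bash
# Chained KISA run via the jobs API (host and token are placeholders).
curl -sS -X POST "https://concelier.example.internal/jobs/run" \
  -H "Authorization: Bearer ${CONCELIER_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{ "kind": "source:kisa:fetch", "chain": ["source:kisa:parse", "source:kisa:map"] }'
```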
 | 
			
		||||
 | 
			
		||||
## 3. Production Monitoring
 | 
			
		||||
 | 
			
		||||
- **Dashboards** – Add the following Prometheus/OTEL expressions:
 | 
			
		||||
  - `rate(kisa_feed_items_total[15m])` versus `rate(concelier_source_http_requests_total{concelier_source="kisa"}[15m])`
 | 
			
		||||
  - `increase(kisa_detail_failures_total{reason!="empty-document"}[1h])` alert at `>0`
 | 
			
		||||
  - `increase(kisa_parse_failures_total[1h])` for storage/JSON issues
 | 
			
		||||
  - `increase(kisa_map_failures_total[1h])` to flag schema drift
 | 
			
		||||
  - `increase(kisa_cursor_updates_total[6h]) == 0` during active windows → warn
 | 
			
		||||
- **Alerts** – Page when `rate(kisa_feed_success_total[2h]) == 0` while other connectors are active; back off for maintenance windows announced on `https://knvd.krcert.or.kr/`.
 | 
			
		||||
- **Logs** – Watch for repeated warnings (`document missing`, `DTO missing`) or errors with reason tags `HttpRequestException`, `download`, `parse`, `map`.
 | 
			
		||||
 | 
			
		||||
## 4. Localisation Handling
 | 
			
		||||
 | 
			
		||||
- Hangul categories (for example `취약점정보`, "vulnerability information") flow into telemetry tags (`category=…`) and logs. Dashboards must render UTF‑8 and avoid transliteration.
 | 
			
		||||
- HTML content is sanitised before storage; translation teams can consume the `ContentHtml` field safely.
 | 
			
		||||
- Advisory severity remains as provided by KISA (`High`, `Medium`, etc.). Map-level failures include the severity tag for filtering.
 | 
			
		||||
 | 
			
		||||
## 5. Fixture & Regression Maintenance
 | 
			
		||||
 | 
			
		||||
- Regression fixtures: `src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/Fixtures/kisa-feed.xml` and `kisa-detail.json`.
 | 
			
		||||
- Refresh via `UPDATE_KISA_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/StellaOps.Concelier.Connector.Kisa.Tests.csproj`.
 | 
			
		||||
- The telemetry regression (`KisaConnectorTests.Telemetry_RecordsMetrics`) will fail if counters/log wiring drifts—treat failures as gating.
 | 
			
		||||
 | 
			
		||||
## 6. Known Issues
 | 
			
		||||
 | 
			
		||||
- RSS feeds only expose the latest 10 advisories; long outages require replay via archived feeds or manual IDX seeds.
 | 
			
		||||
- Detail endpoint occasionally throttles; the connector honours `requestDelay` and reports failures with reason `HttpRequestException`. Consider increasing delay for weekend backfills.
 | 
			
		||||
- If `kisa.category` tags suddenly appear as `unknown`, verify KISA has not renamed RSS elements; update the parser fixtures before production rollout.
 | 
			
		||||
 
 | 
			
		||||
@@ -1,238 +1,238 @@
 | 
			
		||||
# Concelier & Excititor Mirror Operations
 | 
			
		||||
 | 
			
		||||
This runbook describes how Stella Ops operates the managed mirrors under `*.stella-ops.org`.
 | 
			
		||||
It covers Docker Compose and Helm deployment overlays, secret handling for multi-tenant
 | 
			
		||||
authn, CDN fronting, and the recurring sync pipeline that keeps mirror bundles current.
 | 
			
		||||
 | 
			
		||||
## 1. Prerequisites
 | 
			
		||||
 | 
			
		||||
- **Authority access** – client credentials (`client_id` + secret) authorised for
 | 
			
		||||
  `concelier.mirror.read` and `excititor.mirror.read` scopes. Secrets live outside git.
 | 
			
		||||
- **Signed TLS certificates** – wildcard or per-domain (`mirror-primary`, `mirror-community`).
 | 
			
		||||
  Store them under `deploy/compose/mirror-gateway/tls/` or in Kubernetes secrets.
 | 
			
		||||
- **Mirror gateway credentials** – Basic Auth htpasswd files per domain. Generate with
 | 
			
		||||
  `htpasswd -B`. Operators distribute credentials to downstream consumers.
 | 
			
		||||
- **Export artifact source** – read access to the canonical S3 buckets (or rsync share)
 | 
			
		||||
  that hold `concelier` JSON bundles and `excititor` VEX exports.
 | 
			
		||||
- **Persistent volumes** – storage for Concelier job metadata and mirror export trees.
 | 
			
		||||
  For Helm, provision PVCs (`concelier-mirror-jobs`, `concelier-mirror-exports`,
 | 
			
		||||
  `excititor-mirror-exports`, `mirror-mongo-data`, `mirror-minio-data`) before rollout.
 | 
			
		||||
 | 
			
		||||
### 1.1 Service configuration quick reference
 | 
			
		||||
 | 
			
		||||
Concelier.WebService exposes the mirror HTTP endpoints once `CONCELIER__MIRROR__ENABLED=true`.
 | 
			
		||||
Key knobs:
 | 
			
		||||
 | 
			
		||||
- `CONCELIER__MIRROR__EXPORTROOT` – root folder containing export snapshots (`<exportId>/mirror/*`).
 | 
			
		||||
- `CONCELIER__MIRROR__ACTIVEEXPORTID` – optional explicit export id; otherwise the service auto-falls back to the `latest/` symlink or newest directory.
 | 
			
		||||
- `CONCELIER__MIRROR__REQUIREAUTHENTICATION` – default auth requirement; override per domain with `CONCELIER__MIRROR__DOMAINS__{n}__REQUIREAUTHENTICATION`.
 | 
			
		||||
- `CONCELIER__MIRROR__MAXINDEXREQUESTSPERHOUR` – budget for `/concelier/exports/index.json`. Domains inherit this value unless they define `__MAXDOWNLOADREQUESTSPERHOUR`.
 | 
			
		||||
- `CONCELIER__MIRROR__DOMAINS__{n}__ID` – domain identifier matching the exporter manifest; additional keys configure display name and rate budgets.
 | 
			
		||||
 | 
			
		||||
> The service honours Stella Ops Authority when `CONCELIER__AUTHORITY__ENABLED=true` and `ALLOWANONYMOUSFALLBACK=false`. Use the bypass CIDR list (`CONCELIER__AUTHORITY__BYPASSNETWORKS__*`) for in-cluster ingress gateways that terminate Basic Auth. Unauthorized requests emit `WWW-Authenticate: Bearer` so downstream automation can detect token failures.
 | 
			
		||||
 | 
			
		||||
Mirror responses carry deterministic cache headers: `/index.json` returns `Cache-Control: public, max-age=60`, while per-domain manifests/bundles include `Cache-Control: public, max-age=300, immutable`. Rate limiting surfaces `Retry-After` when quotas are exceeded.
 | 
			
		||||
 | 
			
		||||
### 1.2 Mirror connector configuration
 | 
			
		||||
 | 
			
		||||
Downstream Concelier instances ingest published bundles using the `StellaOpsMirrorConnector`. Operators running the connector in air‑gapped or limited connectivity environments can tune the following options (environment prefix `CONCELIER__SOURCES__STELLAOPSMIRROR__`):
 | 
			
		||||
 | 
			
		||||
- `BASEADDRESS` – absolute mirror root (e.g., `https://mirror-primary.stella-ops.org`).
 | 
			
		||||
- `INDEXPATH` – relative path to the mirror index (`/concelier/exports/index.json` by default).
 | 
			
		||||
- `DOMAINID` – mirror domain identifier from the index (`primary`, `community`, etc.).
 | 
			
		||||
- `HTTPTIMEOUT` – request timeout; raise when mirrors sit behind slow WAN links.
 | 
			
		||||
- `SIGNATURE__ENABLED` – require detached JWS verification for `bundle.json`.
 | 
			
		||||
- `SIGNATURE__KEYID` / `SIGNATURE__PROVIDER` – expected signing key metadata.
 | 
			
		||||
- `SIGNATURE__PUBLICKEYPATH` – PEM fallback used when the mirror key registry is offline.
 | 
			
		||||
 | 
			
		||||
The connector keeps a per-export fingerprint (bundle digest + generated-at timestamp) and tracks outstanding document IDs. If a scan is interrupted, the next run resumes parse/map work using the stored fingerprint and pending document lists—no network requests are reissued unless the upstream digest changes.
 | 
			
		||||
 | 
			
		||||
## 2. Secret & certificate layout
 | 
			
		||||
 | 
			
		||||
### Docker Compose (`deploy/compose/docker-compose.mirror.yaml`)
 | 
			
		||||
 | 
			
		||||
- `deploy/compose/env/mirror.env.example` – copy to `.env` and adjust quotas or domain IDs.
 | 
			
		||||
- `deploy/compose/mirror-secrets/` – mount read-only into `/run/secrets`. Place:
 | 
			
		||||
  - `concelier-authority-client` – Authority client secret.
 | 
			
		||||
  - `excititor-authority-client` (optional) – reserve for future authn.
 | 
			
		||||
- `deploy/compose/mirror-gateway/tls/` – PEM-encoded cert/key pairs:
 | 
			
		||||
  - `mirror-primary.crt`, `mirror-primary.key`
 | 
			
		||||
  - `mirror-community.crt`, `mirror-community.key`
 | 
			
		||||
- `deploy/compose/mirror-gateway/secrets/` – htpasswd files:
 | 
			
		||||
  - `mirror-primary.htpasswd`
 | 
			
		||||
  - `mirror-community.htpasswd`
 | 
			
		||||
 | 
			
		||||
### Helm (`deploy/helm/stellaops/values-mirror.yaml`)
 | 
			
		||||
 | 
			
		||||
Create secrets in the target namespace:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
kubectl create secret generic concelier-mirror-auth \
 | 
			
		||||
  --from-file=concelier-authority-client=concelier-authority-client
 | 
			
		||||
 | 
			
		||||
kubectl create secret generic excititor-mirror-auth \
 | 
			
		||||
  --from-file=excititor-authority-client=excititor-authority-client
 | 
			
		||||
 | 
			
		||||
kubectl create secret tls mirror-gateway-tls \
 | 
			
		||||
  --cert=mirror-primary.crt --key=mirror-primary.key
 | 
			
		||||
 | 
			
		||||
kubectl create secret generic mirror-gateway-htpasswd \
 | 
			
		||||
  --from-file=mirror-primary.htpasswd --from-file=mirror-community.htpasswd
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
> Keep Basic Auth lists short-lived (rotate quarterly) and document credential recipients.
 | 
			
		||||
 | 
			
		||||
## 3. Deployment
 | 
			
		||||
 | 
			
		||||
### 3.1 Docker Compose (edge mirrors, lab validation)
 | 
			
		||||
 | 
			
		||||
1. `cp deploy/compose/env/mirror.env.example deploy/compose/env/mirror.env`
 | 
			
		||||
2. Populate secrets/tls directories as described above.
 | 
			
		||||
3. Sync mirror bundles (see §4) into `deploy/compose/mirror-data/…` and ensure they are mounted
 | 
			
		||||
   on the host path backing the `concelier-exports` and `excititor-exports` volumes.
 | 
			
		||||
4. Run the profile validator: `deploy/tools/validate-profiles.sh`.
 | 
			
		||||
5. Launch: `docker compose --env-file env/mirror.env -f docker-compose.mirror.yaml up -d`.
 | 
			
		||||
 | 
			
		||||
### 3.2 Helm (production mirrors)
 | 
			
		||||
 | 
			
		||||
1. Provision PVCs sized for mirror bundles (baseline: 20 GiB per domain).
 | 
			
		||||
2. Create secrets/tls config maps (§2).
 | 
			
		||||
3. `helm upgrade --install mirror deploy/helm/stellaops -f deploy/helm/stellaops/values-mirror.yaml`.
 | 
			
		||||
4. Annotate the `stellaops-mirror-gateway` service with ingress/LoadBalancer metadata required by
 | 
			
		||||
   your CDN (e.g., AWS load balancer scheme internal + NLB idle timeout).
 | 
			
		||||
 | 
			
		||||
## 4. Artifact sync workflow
 | 
			
		||||
 | 
			
		||||
Mirrors never generate exports—they ingest signed bundles produced by the Concelier and Excititor
 | 
			
		||||
export jobs. Recommended sync pattern:
 | 
			
		||||
 | 
			
		||||
### 4.1 Compose host (systemd timer)
 | 
			
		||||
 | 
			
		||||
`/usr/local/bin/mirror-sync.sh`:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
#!/usr/bin/env bash
 | 
			
		||||
set -euo pipefail
 | 
			
		||||
export AWS_ACCESS_KEY_ID=…
 | 
			
		||||
export AWS_SECRET_ACCESS_KEY=…
 | 
			
		||||
 | 
			
		||||
aws s3 sync s3://mirror-stellaops/concelier/latest \
 | 
			
		||||
  /opt/stellaops/mirror-data/concelier --delete --size-only
 | 
			
		||||
 | 
			
		||||
aws s3 sync s3://mirror-stellaops/excititor/latest \
 | 
			
		||||
  /opt/stellaops/mirror-data/excititor --delete --size-only
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Schedule with a systemd timer every 5 minutes. The Compose volumes mount `/opt/stellaops/mirror-data/*`
 | 
			
		||||
into the containers read-only, matching `CONCELIER__MIRROR__EXPORTROOT=/exports/json` and
 | 
			
		||||
`EXCITITOR__ARTIFACTS__FILESYSTEM__ROOT=/exports`.
 | 
			
		||||
 | 
			
		||||
### 4.2 Kubernetes (CronJob)
 | 
			
		||||
 | 
			
		||||
Create a CronJob running the AWS CLI (or rclone) in the same namespace, writing into the PVCs:
 | 
			
		||||
 | 
			
		||||
```yaml
 | 
			
		||||
apiVersion: batch/v1
 | 
			
		||||
kind: CronJob
 | 
			
		||||
metadata:
 | 
			
		||||
  name: mirror-sync
 | 
			
		||||
spec:
 | 
			
		||||
  schedule: "*/5 * * * *"
 | 
			
		||||
  jobTemplate:
 | 
			
		||||
    spec:
 | 
			
		||||
      template:
 | 
			
		||||
        spec:
 | 
			
		||||
          containers:
 | 
			
		||||
          - name: sync
 | 
			
		||||
            image: public.ecr.aws/aws-cli/aws-cli@sha256:5df5f52c29f5e3ba46d0ad9e0e3afc98701c4a0f879400b4c5f80d943b5fadea
 | 
			
		||||
            command:
 | 
			
		||||
              - /bin/sh
 | 
			
		||||
              - -c
 | 
			
		||||
              - >
 | 
			
		||||
                aws s3 sync s3://mirror-stellaops/concelier/latest /exports/concelier --delete --size-only &&
 | 
			
		||||
                aws s3 sync s3://mirror-stellaops/excititor/latest /exports/excititor --delete --size-only
 | 
			
		||||
            volumeMounts:
 | 
			
		||||
              - name: concelier-exports
 | 
			
		||||
                mountPath: /exports/concelier
 | 
			
		||||
              - name: excititor-exports
 | 
			
		||||
                mountPath: /exports/excititor
 | 
			
		||||
            envFrom:
 | 
			
		||||
              - secretRef:
 | 
			
		||||
                  name: mirror-sync-aws
 | 
			
		||||
          restartPolicy: OnFailure
 | 
			
		||||
          volumes:
 | 
			
		||||
            - name: concelier-exports
 | 
			
		||||
              persistentVolumeClaim:
 | 
			
		||||
                claimName: concelier-mirror-exports
 | 
			
		||||
            - name: excititor-exports
 | 
			
		||||
              persistentVolumeClaim:
 | 
			
		||||
                claimName: excititor-mirror-exports
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
## 5. CDN integration
 | 
			
		||||
 | 
			
		||||
1. Point the CDN origin at the mirror gateway (Compose host or Kubernetes LoadBalancer).
 | 
			
		||||
2. Honour the response headers emitted by the gateway and Concelier/Excititor:
 | 
			
		||||
   `Cache-Control: public, max-age=300, immutable` for mirror payloads.
 | 
			
		||||
3. Configure origin shields in the CDN to prevent cache stampedes. Recommended TTLs:
 | 
			
		||||
   - Index (`/concelier/exports/index.json`, `/excititor/mirror/*/index`) → 60 s.
 | 
			
		||||
   - Bundle/manifest payloads → 300 s.
 | 
			
		||||
4. Forward the `Authorization` header—Basic Auth terminates at the gateway.
 | 
			
		||||
5. Enforce per-domain rate limits at the CDN (matching gateway budgets) and enable logging
 | 
			
		||||
   to SIEM for anomaly detection.
 | 
			
		||||
 | 
			
		||||
## 6. Smoke tests
 | 
			
		||||
 | 
			
		||||
After each deployment or sync cycle (temporarily set low budgets if you need to observe 429 responses):
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
# Index with Basic Auth
 | 
			
		||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/index.json | jq 'keys'
 | 
			
		||||
 | 
			
		||||
# Mirror manifest signature and cache headers
 | 
			
		||||
curl -u $PRIMARY_CREDS -I https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/manifest.json \
 | 
			
		||||
  | tee /tmp/manifest-headers.txt
 | 
			
		||||
grep -E '^Cache-Control: ' /tmp/manifest-headers.txt   # expect public, max-age=300, immutable
 | 
			
		||||
 | 
			
		||||
# Excititor consensus bundle metadata
 | 
			
		||||
curl -u $COMMUNITY_CREDS https://mirror-community.stella-ops.org/excititor/mirror/community/index \
 | 
			
		||||
  | jq '.exports[].exportKey'
 | 
			
		||||
 | 
			
		||||
# Signed bundle + detached JWS (spot check digests)
 | 
			
		||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/bundle.json.jws \
 | 
			
		||||
  -o bundle.json.jws
 | 
			
		||||
cosign verify-blob --signature bundle.json.jws --key mirror-key.pub bundle.json
 | 
			
		||||
 | 
			
		||||
# Service-level auth check (inside cluster – no gateway credentials)
 | 
			
		||||
kubectl exec deploy/stellaops-concelier -- curl -si http://localhost:8443/concelier/exports/mirror/primary/manifest.json \
 | 
			
		||||
  | head -n 5   # expect HTTP/1.1 401 with WWW-Authenticate: Bearer
 | 
			
		||||
 | 
			
		||||
# Rate limit smoke (repeat quickly; second call should return 429 + Retry-After)
 | 
			
		||||
for i in 1 2; do
 | 
			
		||||
  curl -s -o /dev/null -D - https://mirror-primary.stella-ops.org/concelier/exports/index.json \
 | 
			
		||||
    -u $PRIMARY_CREDS | grep -E '^(HTTP/|Retry-After:)'
 | 
			
		||||
  sleep 1
 | 
			
		||||
done
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Watch the gateway metrics (`nginx_vts` or access logs) for cache hits. In Kubernetes, `kubectl logs deploy/stellaops-mirror-gateway`
 | 
			
		||||
should show `X-Cache-Status: HIT/MISS`.
 | 
			
		||||
 | 
			
		||||
## 7. Maintenance & rotation
 | 
			
		||||
 | 
			
		||||
- **Bundle freshness** – alert if sync job lag exceeds 15 minutes or if `concelier` logs
 | 
			
		||||
  `Mirror export root is not configured`.
 | 
			
		||||
- **Secret rotation** – change Authority client secrets and Basic Auth credentials quarterly.
 | 
			
		||||
  Update the mounted secrets and restart deployments (`docker compose restart concelier` or
 | 
			
		||||
  `kubectl rollout restart deploy/stellaops-concelier`).
 | 
			
		||||
- **TLS renewal** – reissue certificates, place new files, and reload gateway (`docker compose exec mirror-gateway nginx -s reload`).
 | 
			
		||||
- **Quota tuning** – adjust per-domain `MAXDOWNLOADREQUESTSPERHOUR` in `.env` or values file.
 | 
			
		||||
  Align CDN rate limits and inform downstreams.
 | 
			
		||||
 | 
			
		||||
## 8. References
 | 
			
		||||
 | 
			
		||||
- Deployment profiles: `deploy/compose/docker-compose.mirror.yaml`,
 | 
			
		||||
  `deploy/helm/stellaops/values-mirror.yaml`
 | 
			
		||||
- Mirror architecture dossiers: `docs/ARCHITECTURE_CONCELIER.md`,
 | 
			
		||||
  `docs/ARCHITECTURE_EXCITITOR_MIRRORS.md`
 | 
			
		||||
- Export bundling: `docs/ARCHITECTURE_DEVOPS.md` §3, `docs/ARCHITECTURE_EXCITITOR.md` §7
 | 
			
		||||
# Concelier & Excititor Mirror Operations
 | 
			
		||||
 | 
			
		||||
This runbook describes how Stella Ops operates the managed mirrors under `*.stella-ops.org`.
 | 
			
		||||
It covers Docker Compose and Helm deployment overlays, secret handling for multi-tenant
 | 
			
		||||
authn, CDN fronting, and the recurring sync pipeline that keeps mirror bundles current.
 | 
			
		||||
 | 
			
		||||
## 1. Prerequisites
 | 
			
		||||
 | 
			
		||||
- **Authority access** – client credentials (`client_id` + secret) authorised for
 | 
			
		||||
  `concelier.mirror.read` and `excititor.mirror.read` scopes. Secrets live outside git.
 | 
			
		||||
- **Signed TLS certificates** – wildcard or per-domain (`mirror-primary`, `mirror-community`).
 | 
			
		||||
  Store them under `deploy/compose/mirror-gateway/tls/` or in Kubernetes secrets.
 | 
			
		||||
- **Mirror gateway credentials** – Basic Auth htpasswd files per domain. Generate with
 | 
			
		||||
  `htpasswd -B`. Operators distribute credentials to downstream consumers.
 | 
			
		||||
- **Export artifact source** – read access to the canonical S3 buckets (or rsync share)
 | 
			
		||||
  that hold `concelier` JSON bundles and `excititor` VEX exports.
 | 
			
		||||
- **Persistent volumes** – storage for Concelier job metadata and mirror export trees.
 | 
			
		||||
  For Helm, provision PVCs (`concelier-mirror-jobs`, `concelier-mirror-exports`,
 | 
			
		||||
  `excititor-mirror-exports`, `mirror-mongo-data`, `mirror-minio-data`) before rollout.
 | 
			
		||||
 | 
			
		||||
### 1.1 Service configuration quick reference
 | 
			
		||||
 | 
			
		||||
Concelier.WebService exposes the mirror HTTP endpoints once `CONCELIER__MIRROR__ENABLED=true`.
 | 
			
		||||
Key knobs:
 | 
			
		||||
 | 
			
		||||
- `CONCELIER__MIRROR__EXPORTROOT` – root folder containing export snapshots (`<exportId>/mirror/*`).
 | 
			
		||||
- `CONCELIER__MIRROR__ACTIVEEXPORTID` – optional explicit export id; otherwise the service auto-falls back to the `latest/` symlink or newest directory.
 | 
			
		||||
- `CONCELIER__MIRROR__REQUIREAUTHENTICATION` – default auth requirement; override per domain with `CONCELIER__MIRROR__DOMAINS__{n}__REQUIREAUTHENTICATION`.
 | 
			
		||||
- `CONCELIER__MIRROR__MAXINDEXREQUESTSPERHOUR` – budget for `/concelier/exports/index.json`. Domains inherit this value unless they define `__MAXDOWNLOADREQUESTSPERHOUR`.
 | 
			
		||||
- `CONCELIER__MIRROR__DOMAINS__{n}__ID` – domain identifier matching the exporter manifest; additional keys configure display name and rate budgets.
 | 
			
		||||
 | 
			
		||||
> The service honours Stella Ops Authority when `CONCELIER__AUTHORITY__ENABLED=true` and `ALLOWANONYMOUSFALLBACK=false`. Use the bypass CIDR list (`CONCELIER__AUTHORITY__BYPASSNETWORKS__*`) for in-cluster ingress gateways that terminate Basic Auth. Unauthorized requests emit `WWW-Authenticate: Bearer` so downstream automation can detect token failures.
 | 
			
		||||
 | 
			
		||||
Mirror responses carry deterministic cache headers: `/index.json` returns `Cache-Control: public, max-age=60`, while per-domain manifests/bundles include `Cache-Control: public, max-age=300, immutable`. Rate limiting surfaces `Retry-After` when quotas are exceeded.
 | 
			
		||||
 | 
			
		||||
### 1.2 Mirror connector configuration
 | 
			
		||||
 | 
			
		||||
Downstream Concelier instances ingest published bundles using the `StellaOpsMirrorConnector`. Operators running the connector in air‑gapped or limited connectivity environments can tune the following options (environment prefix `CONCELIER__SOURCES__STELLAOPSMIRROR__`):
 | 
			
		||||
 | 
			
		||||
- `BASEADDRESS` – absolute mirror root (e.g., `https://mirror-primary.stella-ops.org`).
 | 
			
		||||
- `INDEXPATH` – relative path to the mirror index (`/concelier/exports/index.json` by default).
 | 
			
		||||
- `DOMAINID` – mirror domain identifier from the index (`primary`, `community`, etc.).
 | 
			
		||||
- `HTTPTIMEOUT` – request timeout; raise when mirrors sit behind slow WAN links.
 | 
			
		||||
- `SIGNATURE__ENABLED` – require detached JWS verification for `bundle.json`.
 | 
			
		||||
- `SIGNATURE__KEYID` / `SIGNATURE__PROVIDER` – expected signing key metadata.
 | 
			
		||||
- `SIGNATURE__PUBLICKEYPATH` – PEM fallback used when the mirror key registry is offline.
 | 
			
		||||
 | 
			
		||||
The connector keeps a per-export fingerprint (bundle digest + generated-at timestamp) and tracks outstanding document IDs. If a scan is interrupted, the next run resumes parse/map work using the stored fingerprint and pending document lists—no network requests are reissued unless the upstream digest changes.
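
As a sketch, the options above can be supplied as environment variables using the stated prefix. The values are placeholders, and the `HTTPTIMEOUT` format assumes the same TimeSpan notation used by other connectors in this document.

```bash
# Example downstream mirror connector settings (values are placeholders).
export CONCELIER__SOURCES__STELLAOPSMIRROR__BASEADDRESS="https://mirror-primary.stella-ops.org"
export CONCELIER__SOURCES__STELLAOPSMIRROR__INDEXPATH="/concelier/exports/index.json"
export CONCELIER__SOURCES__STELLAOPSMIRROR__DOMAINID="primary"
export CONCELIER__SOURCES__STELLAOPSMIRROR__HTTPTIMEOUT="00:02:00"
export CONCELIER__SOURCES__STELLAOPSMIRROR__SIGNATURE__ENABLED="true"
export CONCELIER__SOURCES__STELLAOPSMIRROR__SIGNATURE__KEYID="mirror-signing-key"
export CONCELIER__SOURCES__STELLAOPSMIRROR__SIGNATURE__PUBLICKEYPATH="/run/secrets/mirror-key.pub"
```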
 | 
			
		||||
 | 
			
		||||
## 2. Secret & certificate layout
 | 
			
		||||
 | 
			
		||||
### Docker Compose (`deploy/compose/docker-compose.mirror.yaml`)
 | 
			
		||||
 | 
			
		||||
- `deploy/compose/env/mirror.env.example` – copy to `.env` and adjust quotas or domain IDs.
 | 
			
		||||
- `deploy/compose/mirror-secrets/` – mount read-only into `/run/secrets`. Place:
 | 
			
		||||
  - `concelier-authority-client` – Authority client secret.
 | 
			
		||||
  - `excititor-authority-client` (optional) – reserve for future authn.
 | 
			
		||||
- `deploy/compose/mirror-gateway/tls/` – PEM-encoded cert/key pairs:
 | 
			
		||||
  - `mirror-primary.crt`, `mirror-primary.key`
 | 
			
		||||
  - `mirror-community.crt`, `mirror-community.key`
 | 
			
		||||
- `deploy/compose/mirror-gateway/secrets/` – htpasswd files:
 | 
			
		||||
  - `mirror-primary.htpasswd`
 | 
			
		||||
  - `mirror-community.htpasswd`
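
To seed these files, `htpasswd -B` (bcrypt, from apache2-utils) works; the account name below is an example, and `-c` creates or overwrites the file, so omit it when appending additional consumers.

```bash
# Create the primary-domain credential file with one downstream account
# (account name is illustrative; drop -c to append instead of overwrite).
htpasswd -B -c deploy/compose/mirror-gateway/secrets/mirror-primary.htpasswd downstream-org-1
```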
 | 
			
		||||
 | 
			
		||||
### Helm (`deploy/helm/stellaops/values-mirror.yaml`)
 | 
			
		||||
 | 
			
		||||
Create secrets in the target namespace:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
kubectl create secret generic concelier-mirror-auth \
 | 
			
		||||
  --from-file=concelier-authority-client=concelier-authority-client
 | 
			
		||||
 | 
			
		||||
kubectl create secret generic excititor-mirror-auth \
 | 
			
		||||
  --from-file=excititor-authority-client=excititor-authority-client
 | 
			
		||||
 | 
			
		||||
kubectl create secret tls mirror-gateway-tls \
 | 
			
		||||
  --cert=mirror-primary.crt --key=mirror-primary.key
 | 
			
		||||
 | 
			
		||||
kubectl create secret generic mirror-gateway-htpasswd \
 | 
			
		||||
  --from-file=mirror-primary.htpasswd --from-file=mirror-community.htpasswd
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
> Keep Basic Auth lists short-lived (rotate quarterly) and document credential recipients.
 | 
			
		||||
 | 
			
		||||
## 3. Deployment
 | 
			
		||||
 | 
			
		||||
### 3.1 Docker Compose (edge mirrors, lab validation)
 | 
			
		||||
 | 
			
		||||
1. `cp deploy/compose/env/mirror.env.example deploy/compose/env/mirror.env`
 | 
			
		||||
2. Populate secrets/tls directories as described above.
 | 
			
		||||
3. Sync mirror bundles (see §4) into `deploy/compose/mirror-data/…` and ensure they are mounted
 | 
			
		||||
   on the host path backing the `concelier-exports` and `excititor-exports` volumes.
 | 
			
		||||
4. Run the profile validator: `deploy/tools/validate-profiles.sh`.
 | 
			
		||||
5. Launch: `docker compose --env-file env/mirror.env -f docker-compose.mirror.yaml up -d`.
 | 
			
		||||
 | 
			
		||||
### 3.2 Helm (production mirrors)
 | 
			
		||||
 | 
			
		||||
1. Provision PVCs sized for mirror bundles (baseline: 20 GiB per domain).
 | 
			
		||||
2. Create secrets/tls config maps (§2).
 | 
			
		||||
3. `helm upgrade --install mirror deploy/helm/stellaops -f deploy/helm/stellaops/values-mirror.yaml`.
 | 
			
		||||
4. Annotate the `stellaops-mirror-gateway` service with ingress/LoadBalancer metadata required by
 | 
			
		||||
   your CDN (e.g., AWS load balancer scheme internal + NLB idle timeout).
 | 
			
		||||
 | 
			
		||||
## 4. Artifact sync workflow
 | 
			
		||||
 | 
			
		||||
Mirrors never generate exports—they ingest signed bundles produced by the Concelier and Excititor
 | 
			
		||||
export jobs. Recommended sync pattern:
 | 
			
		||||
 | 
			
		||||
### 4.1 Compose host (systemd timer)
 | 
			
		||||
 | 
			
		||||
`/usr/local/bin/mirror-sync.sh`:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
#!/usr/bin/env bash
 | 
			
		||||
set -euo pipefail
 | 
			
		||||
export AWS_ACCESS_KEY_ID=…
 | 
			
		||||
export AWS_SECRET_ACCESS_KEY=…
 | 
			
		||||
 | 
			
		||||
aws s3 sync s3://mirror-stellaops/concelier/latest \
 | 
			
		||||
  /opt/stellaops/mirror-data/concelier --delete --size-only
 | 
			
		||||
 | 
			
		||||
aws s3 sync s3://mirror-stellaops/excititor/latest \
 | 
			
		||||
  /opt/stellaops/mirror-data/excititor --delete --size-only
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Schedule with a systemd timer every 5 minutes. The Compose volumes mount `/opt/stellaops/mirror-data/*`
 | 
			
		||||
into the containers read-only, matching `CONCELIER__MIRROR__EXPORTROOT=/exports/json` and
 | 
			
		||||
`EXCITITOR__ARTIFACTS__FILESYSTEM__ROOT=/exports`.
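
One possible shape for that timer, assuming the script path above; unit names and the exact cadence are yours to adjust.

```bash
# Sketch of the systemd units backing the 5-minute sync cadence.
sudo tee /etc/systemd/system/mirror-sync.service >/dev/null <<'EOF'
[Unit]
Description=Sync StellaOps mirror bundles

[Service]
Type=oneshot
ExecStart=/usr/local/bin/mirror-sync.sh
EOF

sudo tee /etc/systemd/system/mirror-sync.timer >/dev/null <<'EOF'
[Unit]
Description=Run mirror-sync every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now mirror-sync.timer
```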
 | 
			
		||||
 | 
			
		||||
### 4.2 Kubernetes (CronJob)
 | 
			
		||||
 | 
			
		||||
Create a CronJob running the AWS CLI (or rclone) in the same namespace, writing into the PVCs:
 | 
			
		||||
 | 
			
		||||
```yaml
 | 
			
		||||
apiVersion: batch/v1
 | 
			
		||||
kind: CronJob
 | 
			
		||||
metadata:
 | 
			
		||||
  name: mirror-sync
 | 
			
		||||
spec:
 | 
			
		||||
  schedule: "*/5 * * * *"
 | 
			
		||||
  jobTemplate:
 | 
			
		||||
    spec:
 | 
			
		||||
      template:
 | 
			
		||||
        spec:
 | 
			
		||||
          containers:
 | 
			
		||||
          - name: sync
 | 
			
		||||
            image: public.ecr.aws/aws-cli/aws-cli@sha256:5df5f52c29f5e3ba46d0ad9e0e3afc98701c4a0f879400b4c5f80d943b5fadea
 | 
			
		||||
            command:
 | 
			
		||||
              - /bin/sh
 | 
			
		||||
              - -c
 | 
			
		||||
              - >
 | 
			
		||||
                aws s3 sync s3://mirror-stellaops/concelier/latest /exports/concelier --delete --size-only &&
 | 
			
		||||
                aws s3 sync s3://mirror-stellaops/excititor/latest /exports/excititor --delete --size-only
 | 
			
		||||
            volumeMounts:
 | 
			
		||||
              - name: concelier-exports
 | 
			
		||||
                mountPath: /exports/concelier
 | 
			
		||||
              - name: excititor-exports
 | 
			
		||||
                mountPath: /exports/excititor
 | 
			
		||||
            envFrom:
 | 
			
		||||
              - secretRef:
 | 
			
		||||
                  name: mirror-sync-aws
 | 
			
		||||
          restartPolicy: OnFailure
 | 
			
		||||
          volumes:
 | 
			
		||||
            - name: concelier-exports
 | 
			
		||||
              persistentVolumeClaim:
 | 
			
		||||
                claimName: concelier-mirror-exports
 | 
			
		||||
            - name: excititor-exports
 | 
			
		||||
              persistentVolumeClaim:
 | 
			
		||||
                claimName: excititor-mirror-exports
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
## 5. CDN integration
 | 
			
		||||
 | 
			
		||||
1. Point the CDN origin at the mirror gateway (Compose host or Kubernetes LoadBalancer).
 | 
			
		||||
2. Honour the response headers emitted by the gateway and Concelier/Excititor:
 | 
			
		||||
   `Cache-Control: public, max-age=300, immutable` for mirror payloads.
 | 
			
		||||
3. Configure origin shields in the CDN to prevent cache stampedes. Recommended TTLs:
 | 
			
		||||
   - Index (`/concelier/exports/index.json`, `/excititor/mirror/*/index`) → 60 s.
 | 
			
		||||
   - Bundle/manifest payloads → 300 s.
 | 
			
		||||
4. Forward the `Authorization` header—Basic Auth terminates at the gateway.
 | 
			
		||||
5. Enforce per-domain rate limits at the CDN (matching gateway budgets) and enable logging
 | 
			
		||||
   to SIEM for anomaly detection.
 | 
			
		||||
 | 
			
		||||
## 6. Smoke tests
 | 
			
		||||
 | 
			
		||||
After each deployment or sync cycle (temporarily set low budgets if you need to observe 429 responses):
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
# Index with Basic Auth
 | 
			
		||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/index.json | jq 'keys'
 | 
			
		||||
 | 
			
		||||
# Mirror manifest signature and cache headers
 | 
			
		||||
curl -u $PRIMARY_CREDS -I https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/manifest.json \
 | 
			
		||||
  | tee /tmp/manifest-headers.txt
 | 
			
		||||
grep -E '^Cache-Control: ' /tmp/manifest-headers.txt   # expect public, max-age=300, immutable
 | 
			
		||||
 | 
			
		||||
# Excititor consensus bundle metadata
 | 
			
		||||
curl -u $COMMUNITY_CREDS https://mirror-community.stella-ops.org/excititor/mirror/community/index \
 | 
			
		||||
  | jq '.exports[].exportKey'
 | 
			
		||||
 | 
			
		||||
# Signed bundle + detached JWS (spot check digests)
 | 
			
		||||
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/bundle.json.jws \
 | 
			
		||||
  -o bundle.json.jws
 | 
			
		||||
cosign verify-blob --signature bundle.json.jws --key mirror-key.pub bundle.json
 | 
			
		||||
 | 
			
		||||
# Service-level auth check (inside cluster – no gateway credentials)
 | 
			
		||||
kubectl exec deploy/stellaops-concelier -- curl -si http://localhost:8443/concelier/exports/mirror/primary/manifest.json \
 | 
			
		||||
  | head -n 5   # expect HTTP/1.1 401 with WWW-Authenticate: Bearer
 | 
			
		||||
 | 
			
		||||
# Rate limit smoke (repeat quickly; second call should return 429 + Retry-After)
 | 
			
		||||
for i in 1 2; do
 | 
			
		||||
  curl -s -o /dev/null -D - https://mirror-primary.stella-ops.org/concelier/exports/index.json \
 | 
			
		||||
    -u $PRIMARY_CREDS | grep -E '^(HTTP/|Retry-After:)'
 | 
			
		||||
  sleep 1
 | 
			
		||||
done
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Watch the gateway metrics (`nginx_vts` or access logs) for cache hits. In Kubernetes, `kubectl logs deploy/stellaops-mirror-gateway`
 | 
			
		||||
should show `X-Cache-Status: HIT/MISS`.
 | 
			
		||||
 | 
			
		||||
## 7. Maintenance & rotation
 | 
			
		||||
 | 
			
		||||
- **Bundle freshness** – alert if sync job lag exceeds 15 minutes or if `concelier` logs
 | 
			
		||||
  `Mirror export root is not configured`.
 | 
			
		||||
- **Secret rotation** – change Authority client secrets and Basic Auth credentials quarterly.
 | 
			
		||||
  Update the mounted secrets and restart deployments (`docker compose restart concelier` or
 | 
			
		||||
  `kubectl rollout restart deploy/stellaops-concelier`).
 | 
			
		||||
- **TLS renewal** – reissue certificates, place new files, and reload gateway (`docker compose exec mirror-gateway nginx -s reload`).
 | 
			
		||||
- **Quota tuning** – adjust per-domain `MAXDOWNLOADREQUESTSPERHOUR` in `.env` or values file.
 | 
			
		||||
  Align CDN rate limits and inform downstreams.
 | 
			
		||||
 | 
			
		||||
## 8. References
 | 
			
		||||
 | 
			
		||||
- Deployment profiles: `deploy/compose/docker-compose.mirror.yaml`,
 | 
			
		||||
  `deploy/helm/stellaops/values-mirror.yaml`
 | 
			
		||||
- Mirror architecture dossiers: `docs/ARCHITECTURE_CONCELIER.md`,
 | 
			
		||||
  `docs/ARCHITECTURE_EXCITITOR_MIRRORS.md`
 | 
			
		||||
- Export bundling: `docs/ARCHITECTURE_DEVOPS.md` §3, `docs/ARCHITECTURE_EXCITITOR.md` §7
 | 
			
		||||
 
 | 
			
		||||
@@ -1,48 +1,48 @@
 | 
			
		||||
# NKCKI Connector Operations Guide
 | 
			
		||||
 | 
			
		||||
## Overview
 | 
			
		||||
 | 
			
		||||
The NKCKI connector ingests JSON bulletin archives from cert.gov.ru, expanding each `*.json.zip` attachment into per-vulnerability DTOs before canonical mapping. The fetch pipeline now supports cache-backed recovery, deterministic pagination, and telemetry suitable for production monitoring.
 | 
			
		||||
 | 
			
		||||
## Configuration
 | 
			
		||||
 | 
			
		||||
Key options exposed through `concelier:sources:ru-nkcki:http`:
 | 
			
		||||
 | 
			
		||||
- `maxBulletinsPerFetch` – limits new bulletin downloads in a single run (default `5`).
 | 
			
		||||
- `maxListingPagesPerFetch` – maximum listing pages visited during pagination (default `3`).
 | 
			
		||||
- `listingCacheDuration` – minimum interval between listing fetches before falling back to cached artefacts (default `00:10:00`).
 | 
			
		||||
- `cacheDirectory` – optional path for persisted bulletin archives used during offline or failure scenarios.
 | 
			
		||||
- `requestDelay` – delay inserted between bulletin downloads to respect upstream politeness.
 | 
			
		||||
 | 
			
		||||
When operating in offline-first mode, set `cacheDirectory` to a writable path (e.g. `/var/lib/concelier/cache/ru-nkcki`) and pre-populate bulletin archives via the offline kit.
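
A minimal pre-seed sketch, assuming the archives arrive via the offline kit and that a `concelier` service account owns the cache path; adjust both to your environment.

```bash
# Pre-populate the NKCKI cache for offline-first operation
# (source path and ownership are assumptions).
sudo mkdir -p /var/lib/concelier/cache/ru-nkcki
sudo cp /media/offline-kit/ru-nkcki/*.json.zip /var/lib/concelier/cache/ru-nkcki/
sudo chown -R concelier:concelier /var/lib/concelier/cache/ru-nkcki
```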
 | 
			
		||||
 | 
			
		||||
## Telemetry
 | 
			
		||||
 | 
			
		||||
`RuNkckiDiagnostics` emits the following metrics under meter `StellaOps.Concelier.Connector.Ru.Nkcki`:
 | 
			
		||||
 | 
			
		||||
- `nkcki.listing.fetch.attempts` / `nkcki.listing.fetch.success` / `nkcki.listing.fetch.failures`
 | 
			
		||||
- `nkcki.listing.pages.visited` (histogram, `pages`)
 | 
			
		||||
- `nkcki.listing.attachments.discovered` / `nkcki.listing.attachments.new`
 | 
			
		||||
- `nkcki.bulletin.fetch.success` / `nkcki.bulletin.fetch.cached` / `nkcki.bulletin.fetch.failures`
 | 
			
		||||
- `nkcki.entries.processed` (histogram, `entries`)
 | 
			
		||||
 | 
			
		||||
Integrate these counters into standard Concelier observability dashboards to track crawl coverage and cache hit rates.
 | 
			
		||||
 | 
			
		||||
## Archive Backfill Strategy
 | 
			
		||||
 | 
			
		||||
Bitrix pagination surfaces archives via `?PAGEN_1=n`. The connector now walks up to `maxListingPagesPerFetch` pages, deduplicating bulletin IDs and maintaining a rolling `knownBulletins` window. Backfill strategy:
 | 
			
		||||
 | 
			
		||||
1. Enumerate pages from newest to oldest, respecting `maxListingPagesPerFetch` and `listingCacheDuration` to avoid refetch storms.
 | 
			
		||||
2. Persist every `*.json.zip` attachment to the configured cache directory. This enables replay when listing access is temporarily blocked.
 | 
			
		||||
3. During archive replay, `ProcessCachedBulletinsAsync` enqueues missing documents while respecting `maxVulnerabilitiesPerFetch`.
 | 
			
		||||
4. For historical HTML-only advisories, collect page URLs and metadata while offline (future work: HTML and PDF extraction pipeline documented in `docs/concelier-connector-research-20251011.md`).
 | 
			
		||||
 | 
			
		||||
For large migrations, seed caches with archived zip bundles, then run fetch/parse/map cycles in chronological order to maintain deterministic outputs.
 | 
			
		||||
 | 
			
		||||
## Failure Handling
 | 
			
		||||
 | 
			
		||||
- Listing failures mark the source state with exponential backoff while attempting cache replay.
 | 
			
		||||
- Bulletin fetches fall back to cached copies before surfacing an error.
 | 
			
		||||
- Mongo integration tests rely on bundled OpenSSL 1.1 libraries (`tools/openssl/linux-x64`) to keep `Mongo2Go` operational on modern distros.
 | 
			
		||||
 | 
			
		||||
Refer to `ru-nkcki` entries in `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ru.Nkcki/TASKS.md` for outstanding items.
 | 
			
		||||
 
 | 
			
		||||
@@ -1,24 +1,24 @@
 | 
			
		||||
# Concelier OSV Connector – Operations Notes
 | 
			
		||||
 | 
			
		||||
_Last updated: 2025-10-16_
 | 
			
		||||
 | 
			
		||||
The OSV connector ingests advisories from OSV.dev across OSS ecosystems. This note highlights the additional merge/export expectations introduced with the canonical metric fallback work in Sprint 4.
 | 
			
		||||
 | 
			
		||||
## 1. Canonical metric fallbacks
 | 
			
		||||
- When OSV omits CVSS vectors (common for CVSS v4-only payloads) the mapper now emits a deterministic canonical metric id in the form `osv:severity/<level>` and normalises the advisory severity to the same `<level>`.
 | 
			
		||||
- Metric: `osv.map.canonical_metric_fallbacks` (counter) with tags `severity`, `canonical_metric_id`, `ecosystem`, `reason=no_cvss`. Watch this alongside merge parity dashboards to catch spikes where OSV publishes severity-only advisories.
 | 
			
		||||
- Merge precedence still prefers GHSA over OSV; the shared severity-based canonical id keeps Merge/export parity deterministic even when only OSV supplies severity data.
 | 
			
		||||
 | 
			
		||||
## 2. CWE provenance
 | 
			
		||||
- `database_specific.cwe_ids` now populates provenance decision reasons for every mapped weakness. Expect `decisionReason="database_specific.cwe_ids"` on OSV weakness provenance and confirm exporters preserve the value.
 | 
			
		||||
- If OSV ever attaches `database_specific.cwe_notes`, the connector will surface the joined note string in `decisionReason` instead of the default marker.
 | 
			
		||||
 | 
			
		||||
## 3. Dashboards & alerts
 | 
			
		||||
- Extend existing merge dashboards with the new counter:
 | 
			
		||||
  - Overlay `sum(osv.map.canonical_metric_fallbacks{ecosystem=~".+"})` with Merge severity overrides to confirm fallback advisories are reconciling cleanly.
 | 
			
		||||
  - Alert when the 1-hour sum exceeds 50 for any ecosystem; baseline volume is currently <5 per day (mostly GHSA mirrors emitting CVSS v4 only).
 | 
			
		||||
- Exporters already surface `canonicalMetricId`; no schema change is required, but ORAS/Trivy bundles should be spot-checked after deploying the connector update.
 | 
			
		||||
 | 
			
		||||
## 4. Runbook updates
 | 
			
		||||
- Fixture parity suites (`osv-ghsa.*`) now assert the fallback id and provenance notes. Regenerate via `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj`.
 | 
			
		||||
- When investigating merge severity conflicts, include the fallback counter and confirm OSV advisories carry the expected `osv:severity/<level>` id before raising connector bugs.
 | 
			
		||||
 
 | 
			
		||||
@@ -1,151 +1,151 @@
 | 
			
		||||
# Stella Ops Deployment Upgrade & Rollback Runbook
 | 
			
		||||
 | 
			
		||||
_Last updated: 2025-10-26 (Sprint 14 – DEVOPS-OPS-14-003)._
 | 
			
		||||
 | 
			
		||||
This runbook describes how to promote a new release across the supported deployment profiles (Helm and Docker Compose), how to roll back safely, and how to keep channels (`edge`, `stable`, `airgap`) aligned. All steps assume you are working from a clean checkout of the release branch/tag.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 1. Channel overview
 | 
			
		||||
 | 
			
		||||
| Channel | Release manifest | Helm values | Compose profile |
 | 
			
		||||
|---------|------------------|-------------|-----------------|
 | 
			
		||||
| `edge`  | `deploy/releases/2025.10-edge.yaml` | `deploy/helm/stellaops/values-dev.yaml` | `deploy/compose/docker-compose.dev.yaml` |
 | 
			
		||||
| `stable` | `deploy/releases/2025.09-stable.yaml` | `deploy/helm/stellaops/values-stage.yaml`, `deploy/helm/stellaops/values-prod.yaml` | `deploy/compose/docker-compose.stage.yaml`, `deploy/compose/docker-compose.prod.yaml` |
 | 
			
		||||
| `airgap` | `deploy/releases/2025.09-airgap.yaml` | `deploy/helm/stellaops/values-airgap.yaml` | `deploy/compose/docker-compose.airgap.yaml` |
 | 
			
		||||
 | 
			
		||||
Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 2. Pre-flight checklist
 | 
			
		||||
 | 
			
		||||
1. **Refresh release manifest**  
 | 
			
		||||
   Pull the latest manifest for the channel you are promoting (`deploy/releases/<version>-<channel>.yaml`).
 | 
			
		||||
 | 
			
		||||
2. **Align deployment bundles with the manifest**  
 | 
			
		||||
   Run the alignment checker for every profile that should pick up the release. Pass `--ignore-repo nats` to skip auxiliary services.
 | 
			
		||||
   ```bash
 | 
			
		||||
   ./deploy/tools/check-channel-alignment.py \
 | 
			
		||||
       --release deploy/releases/2025.10-edge.yaml \
 | 
			
		||||
       --target deploy/helm/stellaops/values-dev.yaml \
 | 
			
		||||
       --target deploy/compose/docker-compose.dev.yaml \
 | 
			
		||||
       --ignore-repo nats
 | 
			
		||||
   ```
 | 
			
		||||
   Repeat for other channels (`stable`, `airgap`), substituting the manifest and target files.
 | 
			
		||||
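   For example, the `stable` channel pairs `deploy/releases/2025.09-stable.yaml` with the stage and prod targets listed in Section 1 (one plausible invocation; trim the `--target` list to the profiles you actually ship):
   ```bash
   ./deploy/tools/check-channel-alignment.py \
       --release deploy/releases/2025.09-stable.yaml \
       --target deploy/helm/stellaops/values-stage.yaml \
       --target deploy/helm/stellaops/values-prod.yaml \
       --target deploy/compose/docker-compose.stage.yaml \
       --target deploy/compose/docker-compose.prod.yaml \
       --ignore-repo nats
   ```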
 | 
			
		||||
3. **Lint and template profiles**
 | 
			
		||||
   ```bash
 | 
			
		||||
   ./deploy/tools/validate-profiles.sh
 | 
			
		||||
   ```
 | 
			
		||||
 | 
			
		||||
4. **Smoke the Offline Kit debug store (edge/stable only)**  
 | 
			
		||||
   When the release pipeline has generated `out/release/debug/.build-id/**`, mirror the assets into the Offline Kit staging tree:
 | 
			
		||||
   ```bash
 | 
			
		||||
   ./ops/offline-kit/mirror_debug_store.py \
 | 
			
		||||
       --release-dir out/release \
 | 
			
		||||
       --offline-kit-dir out/offline-kit
 | 
			
		||||
   ```
 | 
			
		||||
   Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle.
 | 
			
		||||
 | 
			
		||||
5. **Review compatibility matrix**  
 | 
			
		||||
   Confirm MongoDB, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `mongo@sha256:c258…`, `minio@sha256:14ce…`, `rustfs:2025.10.0-edge`.
 | 
			
		||||
 | 
			
		||||
6. **Create a rollback bookmark**  
 | 
			
		||||
   Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes.
 | 
			
		||||
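   A small sketch that captures both into a dated bookmark file (the filename and location are arbitrary choices, not a project convention):
   ```bash
   BOOKMARK="rollback-bookmark-$(date +%Y%m%dT%H%M%SZ).txt"
   {
     echo "== current helm revision =="
     helm history stellaops -n stellaops --max 1
     echo "== compose tag: $(git describe --tags) =="
   } | tee "$BOOKMARK"
   ```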
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 3. Helm upgrade procedure (staging → production)
 | 
			
		||||
 | 
			
		||||
1. Switch to the deployment branch and ensure secrets/config maps are current.
 | 
			
		||||
2. Apply the upgrade in the staging cluster:
 | 
			
		||||
   ```bash
 | 
			
		||||
   helm upgrade stellaops deploy/helm/stellaops \
 | 
			
		||||
     -f deploy/helm/stellaops/values-stage.yaml \
 | 
			
		||||
     --namespace stellaops \
 | 
			
		||||
     --atomic \
 | 
			
		||||
     --timeout 15m
 | 
			
		||||
   ```
 | 
			
		||||
3. Run smoke tests (`scripts/smoke-tests.sh` or environment-specific checks).
 | 
			
		||||
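   One possible shape for those checks, assuming the suite needs no extra arguments and substituting real hostnames for the placeholders:
   ```bash
   ./scripts/smoke-tests.sh   # repo-provided suite; pass environment-specific flags if it requires any
   for url in https://authority.stage.example.internal/healthz \
              https://scanner.stage.example.internal/healthz; do   # placeholder hostnames
     curl -fsS "$url" > /dev/null && echo "OK   $url" || echo "FAIL $url"
   done
   ```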
4. Promote to production using the prod values file and the same command.
 | 
			
		||||
5. Record the new revision number and Git SHA in the change log.
 | 
			
		||||
 | 
			
		||||
### Rollback (Helm)
 | 
			
		||||
 | 
			
		||||
1. Identify the previous revision: `helm history stellaops -n stellaops`.
 | 
			
		||||
2. Execute:
 | 
			
		||||
   ```bash
 | 
			
		||||
   helm rollback stellaops <revision> \
 | 
			
		||||
     --namespace stellaops \
 | 
			
		||||
     --wait \
 | 
			
		||||
     --timeout 10m
 | 
			
		||||
   ```
 | 
			
		||||
3. Verify `kubectl get pods` returns healthy workloads; rerun smoke tests.
 | 
			
		||||
4. Update the incident/operations log with root cause and rollback details.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 4. Docker Compose upgrade procedure
 | 
			
		||||
 | 
			
		||||
1. Update environment files (`deploy/compose/env/*.env.example`) with any new settings and sync secrets to hosts.
 | 
			
		||||
2. Pull the tagged repository state corresponding to the release (e.g. `git checkout 2025.09.2` for stable).
 | 
			
		||||
3. Apply the upgrade:
 | 
			
		||||
   ```bash
 | 
			
		||||
   docker compose \
 | 
			
		||||
     --env-file deploy/compose/env/prod.env \
 | 
			
		||||
     -f deploy/compose/docker-compose.prod.yaml \
 | 
			
		||||
     pull
 | 
			
		||||
 | 
			
		||||
   docker compose \
 | 
			
		||||
     --env-file deploy/compose/env/prod.env \
 | 
			
		||||
     -f deploy/compose/docker-compose.prod.yaml \
 | 
			
		||||
     up -d
 | 
			
		||||
   ```
 | 
			
		||||
4. Tail logs for critical services (`docker compose logs -f authority concelier`).
 | 
			
		||||
5. Update monitoring dashboards/alerts to confirm normal operation.
 | 
			
		||||
 | 
			
		||||
### Rollback (Compose)
 | 
			
		||||
 | 
			
		||||
1. Check out the previous release tag (e.g. `git checkout 2025.09.1`).
 | 
			
		||||
2. Re-run `docker compose pull` and `docker compose up -d` with that profile. Docker will restore the prior digests.
 | 
			
		||||
3. If reverting to a known-good snapshot is required, restore volume backups (see `docs/ops/authority-backup-restore.md` and associated service guides).
 | 
			
		||||
4. Log the rollback in the operations journal.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 5. Channel promotion workflow
 | 
			
		||||
 | 
			
		||||
1. Author or update the channel manifest under `deploy/releases/`.
 | 
			
		||||
2. Mirror the new digests into Helm/Compose values and run the alignment script for each profile.
 | 
			
		||||
3. Commit the changes with a message that references the release version and channel (e.g. `deploy: promote 2025.10.0-edge`).
 | 
			
		||||
4. Publish release notes and update `deploy/releases/README.md` (if applicable).
 | 
			
		||||
5. Tag the repository when promoting stable or airgap builds (see the combined commit/tag sketch after this list).
 | 
			
		||||
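A combined sketch for steps 3 and 5, using the 2025.09.2 stable promotion as the example and assuming the tag name simply mirrors the release version:

```bash
git add deploy/releases/ deploy/helm/ deploy/compose/
git commit -m "deploy: promote 2025.09.2 stable"

# Only when promoting stable or airgap builds:
git tag -a 2025.09.2 -m "deploy: promote 2025.09.2 stable"
git push origin HEAD --tags
```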
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 6. Upgrade rehearsal & rollback drill log
 | 
			
		||||
 | 
			
		||||
Maintain rehearsal notes in `docs/ops/launch-cutover.md` or the relevant sprint planning document. After each drill capture:
 | 
			
		||||
 | 
			
		||||
- Release version tested
 | 
			
		||||
- Date/time
 | 
			
		||||
- Participants
 | 
			
		||||
- Issues encountered & fixes
 | 
			
		||||
- Rollback duration (if executed)
 | 
			
		||||
 | 
			
		||||
Attach the log to the sprint retro or operational wiki.
 | 
			
		||||
 | 
			
		||||
| Date (UTC) | Channel | Outcome | Notes |
 | 
			
		||||
|------------|---------|---------|-------|
 | 
			
		||||
| 2025-10-26 | Documentation dry-run | Planned | Runbook refreshed; next live drill scheduled for 2025-11 edge → stable promotion. |
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 7. References
 | 
			
		||||
 | 
			
		||||
- `deploy/README.md` – structure and validation workflow for deployment bundles.
 | 
			
		||||
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release automation and signing pipeline.
 | 
			
		||||
- `docs/ARCHITECTURE_DEVOPS.md` – high-level DevOps architecture, SLOs, and compliance requirements.
 | 
			
		||||
- `ops/offline-kit/mirror_debug_store.py` – debug-store mirroring helper.
 | 
			
		||||
- `deploy/tools/check-channel-alignment.py` – release vs deployment digest alignment checker.
 | 
			
		||||
 
 | 
			
		||||
@@ -1,128 +1,128 @@
 | 
			
		||||
# Launch Cutover Runbook - Stella Ops
 | 
			
		||||
 | 
			
		||||
_Document owner: DevOps Guild (2025-10-26)_  
 | 
			
		||||
_Scope:_ Full-platform launch from staging to production for release `2025.09.2`.
 | 
			
		||||
 | 
			
		||||
## 1. Roles and Communication
 | 
			
		||||
 | 
			
		||||
| Role | Primary | Backup | Contact |
 | 
			
		||||
| --- | --- | --- | --- |
 | 
			
		||||
| Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) |
 | 
			
		||||
| Authority stack | Authority Core guild rep | Security guild rep | `#authority` |
 | 
			
		||||
| Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` |
 | 
			
		||||
| Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation |
 | 
			
		||||
| Observability | Telemetry guild rep | SRE on-call | `#telemetry` |
 | 
			
		||||
| Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket |
 | 
			
		||||
 | 
			
		||||
Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes.
 | 
			
		||||
 | 
			
		||||
## 2. Timeline Overview (UTC)
 | 
			
		||||
 | 
			
		||||
| Time | Activity | Owner |
 | 
			
		||||
| --- | --- | --- |
 | 
			
		||||
| T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead |
 | 
			
		||||
| T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer |
 | 
			
		||||
| T-6h | Freeze non-launch deployments; notify guild leads. | Product owner |
 | 
			
		||||
| T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps |
 | 
			
		||||
| T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead |
 | 
			
		||||
| T0 | Execute production cutover steps (Section 4). | Cutover team |
 | 
			
		||||
| T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead |
 | 
			
		||||
| T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner |
 | 
			
		||||
 | 
			
		||||
## 3. Rehearsal (Staging) Checklist
 | 
			
		||||
 | 
			
		||||
1. `docker network create stellaops_frontdoor || true` (if not present on staging jump host).
 | 
			
		||||
2. Run `deploy/tools/validate-profiles.sh` and archive output.
 | 
			
		||||
3. Apply staging secrets (`kubectl apply -f secrets/stage/*.yaml` or `helm secrets upgrade`) ensuring `stellaops-stage` credentials align with `values-stage.yaml`.
 | 
			
		||||
4. Perform `helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml` in staging cluster.
 | 
			
		||||
5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`.
 | 
			
		||||
6. Execute smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm report status in UI.
 | 
			
		||||
7. Document total wall time and any deviations in the rehearsal log.
 | 
			
		||||
 | 
			
		||||
Rehearsal must complete without manual interventions before proceeding to production.
 | 
			
		||||
 | 
			
		||||
## 4. Production Cutover Steps
 | 
			
		||||
 | 
			
		||||
### 4.1 Pre-flight
 | 
			
		||||
- Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`.
 | 
			
		||||
- Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host.
 | 
			
		||||
- Back up current configuration and data:
 | 
			
		||||
  - Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`.
 | 
			
		||||
  - MinIO policy export: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`.
 | 
			
		||||
 | 
			
		||||
### 4.2 Apply Updates (Compose)
 | 
			
		||||
1. On each compose node, pull updated images for release `2025.09.2`:
 | 
			
		||||
   ```bash
 | 
			
		||||
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull
 | 
			
		||||
   ```
 | 
			
		||||
2. Deploy changes:
 | 
			
		||||
   ```bash
 | 
			
		||||
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d
 | 
			
		||||
   ```
 | 
			
		||||
3. Confirm containers healthy via `docker compose ps` and `docker logs <service> --tail 50`.
 | 
			
		||||
 | 
			
		||||
### 4.3 Apply Updates (Helm/Kubernetes)
 | 
			
		||||
If using Kubernetes, perform:
 | 
			
		||||
```bash
 | 
			
		||||
helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --atomic --timeout 15m
 | 
			
		||||
```
 | 
			
		||||
Monitor rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service>`.
 | 
			
		||||
 | 
			
		||||
### 4.4 Configuration Validation
 | 
			
		||||
- Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`.
 | 
			
		||||
- Validate Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`.
 | 
			
		||||
- Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (returns success).
 | 
			
		||||
- Ensure Notify (legacy) is still reachable while the Notifier migration is pending.
 | 
			
		||||
 | 
			
		||||
## 5. Smoke Tests
 | 
			
		||||
 | 
			
		||||
| Test | Command / Action | Expected Result |
 | 
			
		||||
| --- | --- | --- |
 | 
			
		||||
| API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` in the body |
 | 
			
		||||
| Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes < 5 minutes; report accessible with signed DSSE |
 | 
			
		||||
| Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` |
 | 
			
		||||
| Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata |
 | 
			
		||||
| Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` |
 | 
			
		||||
| Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent |
 | 
			
		||||
 | 
			
		||||
Log results in the change ticket with timestamps and screenshots where applicable.
 | 
			
		||||
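A scripted pass over the first two rows, assuming `SCANNER_BASE_URL` points at the production scanner (placeholder hostname) and reusing the CLI invocation from the table:

```bash
SCANNER_BASE_URL="https://scanner.prod.example.internal"   # placeholder hostname

# API health: expect HTTP 200 and a "Healthy" status in the body.
body=$(curl -fsS "$SCANNER_BASE_URL/healthz")
echo "$body" | grep -q '"status":"Healthy"' \
  && echo "healthz OK" \
  || { echo "healthz FAILED: $body"; exit 1; }

# Scan submit: reuse the demo SBOM and time the round trip against the < 5 minute budget.
time stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json
```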
 | 
			
		||||
## 6. Rollback Procedure
 | 
			
		||||
 | 
			
		||||
1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
 | 
			
		||||
2. For Compose:
 | 
			
		||||
   ```bash
 | 
			
		||||
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down
 | 
			
		||||
   docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d
 | 
			
		||||
   ```
 | 
			
		||||
3. For Helm:
 | 
			
		||||
   ```bash
 | 
			
		||||
   helm rollback stellaops <previous-release-number> --namespace stellaops
 | 
			
		||||
   ```
 | 
			
		||||
4. Restore Mongo snapshot if data inconsistency detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`.
 | 
			
		||||
5. Restore MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`.
 | 
			
		||||
6. Notify stakeholders of rollback and capture root cause notes in incident ticket.
 | 
			
		||||
 | 
			
		||||
## 7. Post-cutover Actions
 | 
			
		||||
 | 
			
		||||
- Keep heightened monitoring for 4 hours post cutover; track latency, error rates, and queue depth.
 | 
			
		||||
- Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
 | 
			
		||||
- Update `docs/ops/launch-readiness.md` if any new gaps or follow-ups discovered.
 | 
			
		||||
- Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner.
 | 
			
		||||
 | 
			
		||||
## 8. Approval Matrix
 | 
			
		||||
 | 
			
		||||
| Step | Required Approvers | Record Location |
 | 
			
		||||
| --- | --- | --- |
 | 
			
		||||
| Production deployment plan | CTO + DevOps lead | Change ticket comment |
 | 
			
		||||
| Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary |
 | 
			
		||||
| Post-smoke success | DevOps lead + product owner | Change ticket closure |
 | 
			
		||||
| Rollback (if invoked) | DevOps lead + CTO | Incident ticket |
 | 
			
		||||
 | 
			
		||||
Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.
 | 
			
		||||
 | 
			
		||||
## 9. Rehearsal Log
 | 
			
		||||
 | 
			
		||||
| Date (UTC) | What We Exercised | Outcome | Follow-up |
 | 
			
		||||
| --- | --- | --- | --- |
 | 
			
		||||
| 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. |
 | 
			
		||||
 
 | 
			
		||||
@@ -1,49 +1,49 @@
 | 
			
		||||
# Launch Readiness Record - Stella Ops
 | 
			
		||||
 | 
			
		||||
_Updated: 2025-10-26 (UTC)_
 | 
			
		||||
 | 
			
		||||
This document captures production launch sign-offs, deployment readiness checkpoints, and any open risks that must be tracked before GA cutover.
 | 
			
		||||
 | 
			
		||||
## 1. Sign-off Summary
 | 
			
		||||
 | 
			
		||||
| Module / Service | Guild / Point of Contact | Evidence (Task or Runbook) | Status | Timestamp (UTC) | Notes |
 | 
			
		||||
| --- | --- | --- | --- | --- | --- |
 | 
			
		||||
| Authority (Issuer) | Authority Core Guild | `AUTH-AOC-19-001` - scope issuance & configuration complete (DONE 2025-10-26) | READY | 2025-10-26T14:05Z | Tenant scope propagation follow-up (`AUTH-AOC-19-002`) tracked in gaps section. |
 | 
			
		||||
| Signer | Signer Guild | `SIGNER-API-11-101` / `SIGNER-REF-11-102` / `SIGNER-QUOTA-11-103` (DONE 2025-10-21) | READY | 2025-10-26T14:07Z | DSSE signing, referrer verification, and quota enforcement validated in CI. |
 | 
			
		||||
| Attestor | Attestor Guild | `ATTESTOR-API-11-201` / `ATTESTOR-VERIFY-11-202` / `ATTESTOR-OBS-11-203` (DONE 2025-10-19) | READY | 2025-10-26T14:10Z | Rekor submission/verification pipeline green; telemetry pack published. |
 | 
			
		||||
| Scanner Web + Worker | Scanner WebService Guild | `SCANNER-WEB-09-10x`, `SCANNER-RUNTIME-12-30x` (DONE 2025-10-18 -> 2025-10-24) | READY* | 2025-10-26T14:20Z | Orchestrator envelope work (`SCANNER-EVENTS-16-301/302`) still open; see gaps. |
 | 
			
		||||
| Concelier Core & Connectors | Concelier Core / Ops Guild | Ops runbook sign-off in `docs/ops/concelier-conflict-resolution.md` (2025-10-16) | READY | 2025-10-26T14:25Z | Conflict resolution & connector coverage accepted; Mongo schema hardening pending (see gaps). |
 | 
			
		||||
| Excititor API | Excititor Core Guild | Wave 0 connector ingest sign-offs (EXECPLAN.Section  Wave 0) | READY | 2025-10-26T14:28Z | VEX linkset publishing complete for launch datasets. |
 | 
			
		||||
| Notify Web (legacy) | Notify Guild | Existing stack carried forward; Notifier program tracked separately (Sprint 38-40) | PENDING | 2025-10-26T14:32Z | Legacy notify web remains operational; migration to Notifier blocked on `SCANNER-EVENTS-16-301`. |
 | 
			
		||||
| Web UI | UI Guild | Stable build `registry.stella-ops.org/.../web-ui@sha256:10d9248...` deployed in stage and smoke-tested | READY | 2025-10-26T14:35Z | Policy editor GA items (Sprint 20) outside launch scope. |
 | 
			
		||||
| DevOps / Release | DevOps Guild | `deploy/tools/validate-profiles.sh` run (2025-10-26) covering dev/stage/prod/airgap/mirror | READY | 2025-10-26T15:02Z | Compose/Helm lint + docker compose config validated; see Section 2 for details. |
 | 
			
		||||
| Offline Kit | Offline Kit Guild | `DEVOPS-OFFLINE-18-004` (Go analyzer) and `DEVOPS-OFFLINE-18-005` (Python analyzer) complete; debug-store mirror pending (`DEVOPS-OFFLINE-17-004`). | PENDING | 2025-10-26T15:05Z | Awaiting release debug artefacts to finalise `DEVOPS-OFFLINE-17-004`; tracked in Section 3. |
 | 
			
		||||
 | 
			
		||||
_\* READY with caveat - remaining work noted in Section 3._
 | 
			
		||||
 | 
			
		||||
## 2. Deployment Readiness Checklist
 | 
			
		||||
 | 
			
		||||
- **Production profiles committed:** `deploy/compose/docker-compose.prod.yaml` and `deploy/helm/stellaops/values-prod.yaml` added with front-door network hand-off and secret references for Mongo/MinIO/core services.
 | 
			
		||||
- **Secrets placeholders documented:** `deploy/compose/env/prod.env.example` enumerates required credentials (`MONGO_INITDB_ROOT_PASSWORD`, `MINIO_ROOT_PASSWORD`, Redis/NATS endpoints, `FRONTDOOR_NETWORK`). Helm values reference Kubernetes secrets (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`).
 | 
			
		||||
- **Static validation executed:** `deploy/tools/validate-profiles.sh` run on 2025-10-26 (docker compose config + helm lint/template) with all profiles passing.
 | 
			
		||||
- **Ingress model defined:** Production compose profile introduces external `frontdoor` network; README updated with creation instructions and scope of externally reachable services.
 | 
			
		||||
- **Observability hooks:** Authority/Signer/Attestor telemetry packs verified; scanner runtime build-id metrics landed (`SCANNER-RUNTIME-17-401`). Grafana dashboards referenced in component runbooks.
 | 
			
		||||
- **Rollback assets:** Stage Compose profile remains aligned (`docker-compose.stage.yaml`), enabling rehearsals before prod cutover; release manifests (`deploy/releases/2025.09-stable.yaml`) map digests for reproducible rollback.
 | 
			
		||||
- **Rehearsal status:** 2025-10-26 validation dry-run executed (`deploy/tools/validate-profiles.sh` across dev/stage/prod/airgap/mirror). Full stage Helm rollout pending access to the managed staging cluster; target to complete once credentials are provisioned.
 | 
			
		||||
 | 
			
		||||
## 3. Outstanding Gaps & Follow-ups
 | 
			
		||||
 | 
			
		||||
| Item | Owner | Tracking Ref | Target / Next Step | Impact |
 | 
			
		||||
| --- | --- | --- | --- | --- |
 | 
			
		||||
| Tenant scope propagation and audit coverage | Authority Core Guild | `AUTH-AOC-19-002` (DOING 2025-10-26) | Land enforcement + audit fixtures by Sprint 19 freeze | Medium - required for multi-tenant GA but does not block initial cutover if tenants scoped manually. |
 | 
			
		||||
| Orchestrator event envelopes + Notifier handshake | Scanner WebService Guild | `SCANNER-EVENTS-16-301` (BLOCKED), `SCANNER-EVENTS-16-302` (DOING) | Coordinate with Gateway/Notifier owners on preview package replacement or binding redirects; rerun `dotnet test` once patch lands and refresh schema docs. Share envelope samples in `docs/events/` after tests pass. | High — gating Notifier migration; legacy notify path remains functional meanwhile. |
 | 
			
		||||
| Offline Kit Python analyzer bundle | Offline Kit Guild + Scanner Guild | `DEVOPS-OFFLINE-18-005` (DONE 2025-10-26) | Monitor for follow-up manifest updates and rerun smoke script when analyzers change. | Medium - ensures language analyzer coverage stays current for offline installs. |
 | 
			
		||||
| Offline Kit debug store mirror | Offline Kit Guild + DevOps Guild | `DEVOPS-OFFLINE-17-004` (BLOCKED 2025-10-26) | Release pipeline must publish `out/release/debug` artefacts; once available, run `mirror_debug_store.py` and commit `metadata/debug-store.json`. | Low - symbol lookup remains accessible from staging assets but required before next Offline Kit tag. |
 | 
			
		||||
| Mongo schema validators for advisory ingestion | Concelier Storage Guild | `CONCELIER-STORE-AOC-19-001` (TODO) | Finalize JSON schema + migration toggles; coordinate with Ops for rollout window | Low - current validation handled in app layer; schema guard adds defense-in-depth. |
 | 
			
		||||
| Authority plugin telemetry alignment | Security Guild | `SEC2.PLG`, `SEC3.PLG`, `SEC5.PLG` (BLOCKED pending AUTH DPoP/MTLS tasks) | Resume once upstream auth surfacing stabilises | Low - plugin remains optional; launch uses default Authority configuration. |
 | 
			
		||||
 | 
			
		||||
## 4. Approvals & Distribution
 | 
			
		||||
 | 
			
		||||
- Record shared in `#launch-readiness` (Mattermost) 2025-10-26 15:15 UTC with DevOps + Guild leads for acknowledgement.
 | 
			
		||||
- Updates to this document require dual sign-off from DevOps Guild (owner) and impacted module guild lead; retain change log via Git history.
 | 
			
		||||
- Cutover rehearsal and rollback drills are tracked separately in `docs/ops/launch-cutover.md` (see associated Task `DEVOPS-LAUNCH-18-001`).
 | 
			
		||||
 
 | 
			
		||||
@@ -26,7 +26,7 @@ Follow the steps below whenever you refresh the repo or roll a new Offline Kit d
 | 
			
		||||
From the repo root:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
DOTNET_NOLOGO=1 dotnet restore src/StellaOps.Excititor.Connectors.Abstractions/StellaOps.Excititor.Connectors.Abstractions.csproj \
 | 
			
		||||
DOTNET_NOLOGO=1 dotnet restore src/Excititor/__Libraries/StellaOps.Excititor.Connectors.Abstractions/StellaOps.Excititor.Connectors.Abstractions.csproj \
 | 
			
		||||
  --configfile NuGet.config
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
 
 | 
			
		||||
@@ -1,6 +1,6 @@
# Registry Token Service Operations

-_Component_: `src/StellaOps.Registry.TokenService`
+_Component_: `src/Registry/StellaOps.Registry.TokenService`

The registry token service issues short-lived Docker registry bearer tokens after
validating an Authority OpTok (DPoP/mTLS sender constraint) and the customer’s
@@ -53,7 +53,7 @@ DPoP failures surface via the service logs (Serilog console output).
## Sample deployment

```bash
-dotnet run --project src/StellaOps.Registry.TokenService \
+dotnet run --project src/Registry/StellaOps.Registry.TokenService \
   --urls "http://0.0.0.0:8085"

curl -H "Authorization: Bearer <OpTok>" \

@@ -9,10 +9,10 @@ Keep the language analyzer microbench under the < 5 s SBOM pledge. CI emits
1. CI (or engineers running locally) execute:
   ```bash
   dotnet run \
-     --project src/StellaOps.Bench/Scanner.Analyzers/StellaOps.Bench.ScannerAnalyzers/StellaOps.Bench.ScannerAnalyzers.csproj \
+     --project src/Bench/StellaOps.Bench/Scanner.Analyzers/StellaOps.Bench.ScannerAnalyzers/StellaOps.Bench.ScannerAnalyzers.csproj \
     -- \
     --repo-root . \
-     --out src/StellaOps.Bench/Scanner.Analyzers/baseline.csv \
+     --out src/Bench/StellaOps.Bench/Scanner.Analyzers/baseline.csv \
     --json out/bench/scanner-analyzers/latest.json \
     --prom out/bench/scanner-analyzers/latest.prom \
     --commit "$(git rev-parse HEAD)" \

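A quick local sanity check after applying the path change above is to build the relocated project directly; this invocation is an assumption, not part of the bench harness itself.

```bash
# Assumed sanity check: confirm the relocated bench project still restores and builds from the repo root
dotnet build src/Bench/StellaOps.Bench/Scanner.Analyzers/StellaOps.Bench.ScannerAnalyzers/StellaOps.Bench.ScannerAnalyzers.csproj
```
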
@@ -1,113 +1,113 @@
 | 
			
		||||
# Telemetry Collector Deployment Guide
 | 
			
		||||
 | 
			
		||||
> **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003).
 | 
			
		||||
 | 
			
		||||
This guide describes how to deploy the default OpenTelemetry Collector packaged with Stella Ops, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 1. Overview
 | 
			
		||||
 | 
			
		||||
The collector terminates OTLP traffic from Stella Ops services and exports metrics, traces, and logs.
 | 
			
		||||
 | 
			
		||||
| Endpoint | Purpose | TLS | Authentication |
 | 
			
		||||
| -------- | ------- | --- | -------------- |
 | 
			
		||||
| `:4317`  | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA |
 | 
			
		||||
| `:4318`  | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA |
 | 
			
		||||
| `:9464`  | Prometheus scrape | mTLS | Same client certificate |
 | 
			
		||||
| `:13133` | Health check | mTLS | Same client certificate |
 | 
			
		||||
| `:1777`  | pprof diagnostics | mTLS | Same client certificate |
 | 
			
		||||
 | 
			
		||||
The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 2. Local validation (Compose)
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
# 1. Generate dev certificates (CA + collector + client)
 | 
			
		||||
./ops/devops/telemetry/generate_dev_tls.sh
 | 
			
		||||
 | 
			
		||||
# 2. Start the collector overlay
 | 
			
		||||
cd deploy/compose
 | 
			
		||||
docker compose -f docker-compose.telemetry.yaml up -d
 | 
			
		||||
 | 
			
		||||
# 3. Start the storage overlay (Prometheus, Tempo, Loki)
 | 
			
		||||
docker compose -f docker-compose.telemetry-storage.yaml up -d
 | 
			
		||||
 | 
			
		||||
# 4. Run the smoke test (OTLP HTTP)
 | 
			
		||||
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
The smoke test posts sample traces, metrics, and logs and verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay gives you a local Prometheus/Tempo/Loki stack to confirm end-to-end wiring. The same client certificate can be used by local services to weave traces together. See [`Telemetry Storage Deployment`](telemetry-storage.md) for the storage configuration guidelines used in staging/production.
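To eyeball those counters without the smoke script, you can scrape the `:9464` exporter directly; the certificate file names and the localhost port mapping below are assumptions about what `generate_dev_tls.sh` and the compose overlay produce.

```bash
# Manual spot-check of the Prometheus exporter (client.crt/client.key/ca.crt are assumed file names)
curl -fsS --cert client.crt --key client.key --cacert ca.crt \
  https://localhost:9464/metrics | grep otelcol_receiver_accepted
```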
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 3. Kubernetes deployment
 | 
			
		||||
 | 
			
		||||
Enable the collector in Helm by setting the following values (example shown for the dev profile):
 | 
			
		||||
 | 
			
		||||
```yaml
 | 
			
		||||
telemetry:
 | 
			
		||||
  collector:
 | 
			
		||||
    enabled: true
 | 
			
		||||
    defaultTenant: <tenant>
 | 
			
		||||
    tls:
 | 
			
		||||
      secretName: stellaops-otel-tls-<env>
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt`. The secret must contain the collector certificate, private key, and issuing CA respectively. Example:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
kubectl create secret generic stellaops-otel-tls-stage \
 | 
			
		||||
  --from-file=tls.crt=collector.crt \
 | 
			
		||||
  --from-file=tls.key=collector.key \
 | 
			
		||||
  --from-file=ca.crt=ca.crt
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Helm renders the collector deployment, service, and config map automatically:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA.
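A quick pre-flight for that trust chain, assuming the certificate files are available locally:

```bash
# Confirm the client certificate chains back to the collector CA (file names are assumptions)
openssl verify -CAfile ca.crt client.crt
```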
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 4. Offline packaging (DEVOPS-OBS-50-003)
 | 
			
		||||
 | 
			
		||||
Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites:
 | 
			
		||||
 | 
			
		||||
```bash
 | 
			
		||||
python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
The script gathers:
 | 
			
		||||
 | 
			
		||||
- `deploy/telemetry/README.md`
 | 
			
		||||
- Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and Helm copy)
 | 
			
		||||
- Helm template/values for the collector
 | 
			
		||||
- Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`)
 | 
			
		||||
 | 
			
		||||
The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` env vars (or use the `--cosign-key` flag).
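A hedged sketch of verifying the checksum and producing a signed bundle; the checksum file name and key path are assumptions, while `--output`, `--sign`, and `COSIGN_KEY_REF` come from the packaging helper described above.

```bash
# Verify the bundle checksum from the output directory (checksum file name is an assumption)
(cd out/telemetry && sha256sum -c telemetry-bundle.tar.gz.sha256)

# Re-run the packaging helper with signing enabled; the key path is a placeholder
COSIGN_KEY_REF=cosign.key python ops/devops/telemetry/package_offline_bundle.py \
  --output out/telemetry/telemetry-bundle.tar.gz --sign
```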
 | 
			
		||||
 | 
			
		||||
Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls` secret.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 5. Operational checks
 | 
			
		||||
 | 
			
		||||
1. **Health probes** – `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`.
 | 
			
		||||
2. **Metrics scrape** – confirm Prometheus ingests `otelcol_receiver_accepted_*` counters.
 | 
			
		||||
3. **Trace correlation** – ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for expected spans.
 | 
			
		||||
4. **Certificate rotation** – when rotating the CA, update the secret and restart the collector; roll out new client certificates before enabling `require_client_certificate` if staged.
 | 
			
		||||
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
## 6. Related references
 | 
			
		||||
 | 
			
		||||
- `deploy/telemetry/README.md` – source configuration and local workflow.
 | 
			
		||||
- `ops/devops/telemetry/smoke_otel_collector.py` – OTLP smoke test.
 | 
			
		||||
- `docs/observability/observability.md` – metrics/traces/logs taxonomy.
 | 
			
		||||
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release checklist for telemetry assets.
 | 
			
		||||
@@ -1,32 +1,32 @@
 | 
			
		||||
# UI Auth Smoke Job (Playwright)

The DevOps Guild tracks **DEVOPS-UI-13-006** to wire the new Playwright auth
smoke checks into CI and the Offline Kit pipeline. These tests exercise the
Angular UI login flow against a stubbed Authority instance to verify that
`/config.json` is discovered, DPoP proofs are minted, and error handling is
surfaced when the backend rejects a request.

## What the job does

1. Builds the UI bundle (or consumes the artifact from the release pipeline).
2. Copies the environment stub from `src/config/config.sample.json` into the
   runtime directory as `config.json` so the UI can bootstrap without a live
   gateway.
3. Runs `npm run test:e2e`, which launches Playwright with the auth fixtures
   under `tests/e2e/auth.spec.ts`:
   - Validates that the Sign-in button generates an Authorization Code + PKCE
     redirect to `https://authority.local/connect/authorize`.
   - Confirms the callback view shows an actionable error when the redirect is
     missing the pending login state.
4. Publishes JUnit + Playwright traces (retain-on-failure) for troubleshooting.

## Pipeline integration notes

- Chromium must already be available (`npx playwright install --with-deps`).
- Set `PLAYWRIGHT_BASE_URL` if the UI serves on a non-default host/port.
- For Offline Kit packaging, bundle the Playwright browser cache under
  `.cache/ms-playwright/` so the job runs without network access.
- Failures should block release promotion; export the traces to the artifacts
  tab for debugging.

Refer to `ops/devops/TASKS.md` (DEVOPS-UI-13-006) for progress and ownership.
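A minimal sketch of how a CI job might chain these steps; the build command, output directory, and port are placeholders, and only the commands named on this page are taken as given.

```bash
# Hypothetical CI recipe for the auth smoke job (paths and port are placeholders)
npm ci
npx playwright install --with-deps
npm run build                                     # or reuse the release pipeline artifact
cp src/config/config.sample.json dist/config.json
PLAYWRIGHT_BASE_URL="http://127.0.0.1:4200" npm run test:e2e
```
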
@@ -1,205 +1,205 @@
 | 
			
		||||
{
 | 
			
		||||
  "title": "Zastava Runtime Plane",
 | 
			
		||||
  "uid": "zastava-runtime",
 | 
			
		||||
  "timezone": "utc",
 | 
			
		||||
  "schemaVersion": 38,
 | 
			
		||||
  "version": 1,
 | 
			
		||||
  "refresh": "30s",
 | 
			
		||||
  "time": {
 | 
			
		||||
    "from": "now-6h",
 | 
			
		||||
    "to": "now"
 | 
			
		||||
  },
 | 
			
		||||
  "panels": [
 | 
			
		||||
    {
 | 
			
		||||
      "id": 1,
 | 
			
		||||
      "type": "timeseries",
 | 
			
		||||
      "title": "Observer Event Rate",
 | 
			
		||||
      "datasource": {
 | 
			
		||||
        "type": "prometheus",
 | 
			
		||||
        "uid": "${datasource}"
 | 
			
		||||
      },
 | 
			
		||||
      "targets": [
 | 
			
		||||
        {
 | 
			
		||||
          "expr": "sum by (tenant,component,kind) (rate(zastava_runtime_events_total{tenant=~\"$tenant\"}[5m]))",
 | 
			
		||||
          "legendFormat": "{{tenant}}/{{component}}/{{kind}}"
 | 
			
		||||
        }
 | 
			
		||||
      ],
 | 
			
		||||
      "gridPos": {
 | 
			
		||||
        "h": 8,
 | 
			
		||||
        "w": 12,
 | 
			
		||||
        "x": 0,
 | 
			
		||||
        "y": 0
 | 
			
		||||
      },
 | 
			
		||||
      "fieldConfig": {
 | 
			
		||||
        "defaults": {
 | 
			
		||||
          "unit": "1/s",
 | 
			
		||||
          "thresholds": {
 | 
			
		||||
            "mode": "absolute",
 | 
			
		||||
            "steps": [
 | 
			
		||||
              {
 | 
			
		||||
                "color": "green"
 | 
			
		||||
              }
 | 
			
		||||
            ]
 | 
			
		||||
          }
 | 
			
		||||
        },
 | 
			
		||||
        "overrides": []
 | 
			
		||||
      },
 | 
			
		||||
      "options": {
 | 
			
		||||
        "legend": {
 | 
			
		||||
          "showLegend": true,
 | 
			
		||||
          "placement": "bottom"
 | 
			
		||||
        },
 | 
			
		||||
        "tooltip": {
 | 
			
		||||
          "mode": "multi"
 | 
			
		||||
        }
 | 
			
		||||
      }
 | 
			
		||||
    },
 | 
			
		||||
    {
 | 
			
		||||
      "id": 2,
 | 
			
		||||
      "type": "timeseries",
 | 
			
		||||
      "title": "Admission Decisions",
 | 
			
		||||
      "datasource": {
 | 
			
		||||
        "type": "prometheus",
 | 
			
		||||
        "uid": "${datasource}"
 | 
			
		||||
      },
 | 
			
		||||
      "targets": [
 | 
			
		||||
        {
 | 
			
		||||
          "expr": "sum by (decision) (rate(zastava_admission_decisions_total{tenant=~\"$tenant\"}[5m]))",
 | 
			
		||||
          "legendFormat": "{{decision}}"
 | 
			
		||||
        }
 | 
			
		||||
      ],
 | 
			
		||||
      "gridPos": {
 | 
			
		||||
        "h": 8,
 | 
			
		||||
        "w": 12,
 | 
			
		||||
        "x": 12,
 | 
			
		||||
        "y": 0
 | 
			
		||||
      },
 | 
			
		||||
      "fieldConfig": {
 | 
			
		||||
        "defaults": {
 | 
			
		||||
          "unit": "1/s",
 | 
			
		||||
          "thresholds": {
 | 
			
		||||
            "mode": "absolute",
 | 
			
		||||
            "steps": [
 | 
			
		||||
              {
 | 
			
		||||
                "color": "green"
 | 
			
		||||
              },
 | 
			
		||||
              {
 | 
			
		||||
                "color": "red",
 | 
			
		||||
                "value": 20
 | 
			
		||||
              }
 | 
			
		||||
            ]
 | 
			
		||||
          }
 | 
			
		||||
        },
 | 
			
		||||
        "overrides": []
 | 
			
		||||
      },
 | 
			
		||||
      "options": {
 | 
			
		||||
        "legend": {
 | 
			
		||||
          "showLegend": true,
 | 
			
		||||
          "placement": "bottom"
 | 
			
		||||
        },
 | 
			
		||||
        "tooltip": {
 | 
			
		||||
          "mode": "multi"
 | 
			
		||||
        }
 | 
			
		||||
      }
 | 
			
		||||
    },
 | 
			
		||||
    {
 | 
			
		||||
      "id": 3,
 | 
			
		||||
      "type": "timeseries",
 | 
			
		||||
      "title": "Backend Latency P95",
 | 
			
		||||
      "datasource": {
 | 
			
		||||
        "type": "prometheus",
 | 
			
		||||
        "uid": "${datasource}"
 | 
			
		||||
      },
 | 
			
		||||
      "targets": [
 | 
			
		||||
        {
 | 
			
		||||
          "expr": "histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket{tenant=~\"$tenant\"}[5m])))",
 | 
			
		||||
          "legendFormat": "p95 latency"
 | 
			
		||||
        }
 | 
			
		||||
      ],
 | 
			
		||||
      "gridPos": {
 | 
			
		||||
        "h": 8,
 | 
			
		||||
        "w": 12,
 | 
			
		||||
        "x": 0,
 | 
			
		||||
        "y": 8
 | 
			
		||||
      },
 | 
			
		||||
      "fieldConfig": {
 | 
			
		||||
        "defaults": {
 | 
			
		||||
          "unit": "ms",
 | 
			
		||||
          "thresholds": {
 | 
			
		||||
            "mode": "absolute",
 | 
			
		||||
            "steps": [
 | 
			
		||||
              {
 | 
			
		||||
                "color": "green"
 | 
			
		||||
              },
 | 
			
		||||
              {
 | 
			
		||||
                "color": "orange",
 | 
			
		||||
                "value": 500
 | 
			
		||||
              },
 | 
			
		||||
              {
 | 
			
		||||
                "color": "red",
 | 
			
		||||
                "value": 750
 | 
			
		||||
              }
 | 
			
		||||
            ]
 | 
			
		||||
          }
 | 
			
		||||
        },
 | 
			
		||||
        "overrides": []
 | 
			
		||||
      },
 | 
			
		||||
      "options": {
 | 
			
		||||
        "legend": {
 | 
			
		||||
          "showLegend": true,
 | 
			
		||||
          "placement": "bottom"
 | 
			
		||||
        },
 | 
			
		||||
        "tooltip": {
 | 
			
		||||
          "mode": "multi"
 | 
			
		||||
        }
 | 
			
		||||
      }
 | 
			
		||||
    }
 | 
			
		||||
  ],
 | 
			
		||||
  "templating": {
 | 
			
		||||
    "list": [
 | 
			
		||||
      {
 | 
			
		||||
        "name": "datasource",
 | 
			
		||||
        "type": "datasource",
 | 
			
		||||
        "query": "prometheus",
 | 
			
		||||
        "label": "Prometheus",
 | 
			
		||||
        "current": {
 | 
			
		||||
          "text": "Prometheus",
 | 
			
		||||
          "value": "Prometheus"
 | 
			
		||||
        }
 | 
			
		||||
      },
 | 
			
		||||
      {
 | 
			
		||||
        "name": "tenant",
 | 
			
		||||
        "type": "query",
 | 
			
		||||
        "datasource": {
 | 
			
		||||
          "type": "prometheus",
 | 
			
		||||
          "uid": "${datasource}"
 | 
			
		||||
        },
 | 
			
		||||
        "definition": "label_values(zastava_runtime_events_total, tenant)",
 | 
			
		||||
        "refresh": 1,
 | 
			
		||||
        "hide": 0,
 | 
			
		||||
        "current": {
 | 
			
		||||
          "text": ".*",
 | 
			
		||||
          "value": ".*"
 | 
			
		||||
        },
 | 
			
		||||
        "regex": "",
 | 
			
		||||
        "includeAll": true,
 | 
			
		||||
        "multi": true,
 | 
			
		||||
        "sort": 1
 | 
			
		||||
      }
 | 
			
		||||
    ]
 | 
			
		||||
  },
 | 
			
		||||
  "annotations": {
 | 
			
		||||
    "list": [
 | 
			
		||||
      {
 | 
			
		||||
        "name": "Deployments",
 | 
			
		||||
        "type": "tags",
 | 
			
		||||
        "datasource": {
 | 
			
		||||
          "type": "prometheus",
 | 
			
		||||
          "uid": "${datasource}"
 | 
			
		||||
        },
 | 
			
		||||
        "enable": true,
 | 
			
		||||
        "iconColor": "rgba(255, 96, 96, 1)"
 | 
			
		||||
      }
 | 
			
		||||
    ]
 | 
			
		||||
  }
 | 
			
		||||
}
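If the dashboard above needs to be pushed into a Grafana instance without the UI, a sketch using the standard dashboards HTTP API follows; the host and credentials are placeholders, and the file path matches where the runbook stores this JSON.

```bash
# Hypothetical import via the Grafana HTTP API (host/credentials are placeholders)
jq '{dashboard: ., overwrite: true}' docs/ops/zastava-runtime-grafana-dashboard.json \
  | curl -s -X POST -u admin:admin -H 'Content-Type: application/json' \
      --data-binary @- http://grafana.internal:3000/api/dashboards/db
```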
 | 
			
		||||
@@ -1,174 +1,174 @@
 | 
			
		||||
# Zastava Runtime Operations Runbook
 | 
			
		||||
 | 
			
		||||
This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook).
 | 
			
		||||
It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume
 | 
			
		||||
`StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).
 | 
			
		||||
 | 
			
		||||
## 1. Prerequisites
 | 
			
		||||
 | 
			
		||||
- **Authority client credentials** – service principal `zastava-runtime` with scopes
 | 
			
		||||
  `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client
 | 
			
		||||
  certs before rollout.
 | 
			
		||||
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`)
 | 
			
		||||
  resolvable from every node running Observer/Webhook.
 | 
			
		||||
- **Host mounts** – read-only access to `/proc`, container runtime state
 | 
			
		||||
  (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`) and scratch space
 | 
			
		||||
  (`/var/run/zastava`).
 | 
			
		||||
- **Offline kit bundle** – operators staging air-gapped installs must download
 | 
			
		||||
  `offline-kit/zastava-runtime-{version}.tar.zst` containing container images,
 | 
			
		||||
  Grafana dashboards, and Prometheus rules referenced below.
 | 
			
		||||
- **Secrets** – Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets
 | 
			
		||||
  live outside git. For air-gapped installs copy them to the sealed secrets vault.
 | 
			
		||||
 | 
			
		||||
### 1.1 Telemetry quick reference
 | 
			
		||||
 | 
			
		||||
| Metric | Description | Notes |
 | 
			
		||||
|--------|-------------|-------|
 | 
			
		||||
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
 | 
			
		||||
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
 | 
			
		||||
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
 | 
			
		||||
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |
 | 
			
		||||
 | 
			
		||||
## 2. Deployment workflows
 | 
			
		||||
 | 
			
		||||
### 2.1 Fresh install (Helm overlay)
 | 
			
		||||
 | 
			
		||||
1. Load offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
 | 
			
		||||
2. Render values:
 | 
			
		||||
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
 | 
			
		||||
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
 | 
			
		||||
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
 | 
			
		||||
3. Pre-create secrets:
 | 
			
		||||
   - `zastava-authority-dpop` (JWK + private key).
 | 
			
		||||
   - `zastava-authority-mtls` (client cert/key chain).
 | 
			
		||||
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
 | 
			
		||||
4. Deploy Observer DaemonSet and Webhook chart:
 | 
			
		||||
   ```sh
 | 
			
		||||
   helm upgrade --install zastava-runtime deploy/helm/zastava \
 | 
			
		||||
     -f values/zastava-runtime.yaml \
 | 
			
		||||
     --namespace stellaops \
 | 
			
		||||
     --create-namespace
 | 
			
		||||
   ```
 | 
			
		||||
5. Verify:
 | 
			
		||||
   - `kubectl -n stellaops get pods -l app=zastava-observer` ready.
 | 
			
		||||
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows
 | 
			
		||||
     `Issued runtime OpTok` audit line with DPoP token type.
 | 
			
		||||
   - Admission webhook registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.
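For step 3 in the list above, a minimal sketch of the secret creation, assuming the key material has already been exported to local files (all file and key names are placeholders):

```sh
# Placeholder file/key names; adjust to whatever your PKI exports
kubectl -n stellaops create secret generic zastava-authority-dpop \
  --from-file=dpop.jwk=./dpop.jwk --from-file=dpop.key=./dpop.key
kubectl -n stellaops create secret generic zastava-authority-mtls \
  --from-file=client.crt=./client.crt --from-file=client.key=./client.key
kubectl -n stellaops create secret tls zastava-webhook-tls \
  --cert=./webhook.crt --key=./webhook.key
```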
 | 
			
		||||
 | 
			
		||||
### 2.2 Upgrades
 | 
			
		||||
 | 
			
		||||
1. Scale webhook deployment to `--replicas=3` (rolling).
 | 
			
		||||
2. Drain one node per AZ to ensure Observer tolerates disruption.
 | 
			
		||||
3. Apply chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
 | 
			
		||||
4. Post-upgrade, run smoke tests:
 | 
			
		||||
   - Apply unsigned Pod manifest → expect `deny` (policy fail).
 | 
			
		||||
   - Apply signed Pod manifest → expect `allow`.
 | 
			
		||||
5. Record upgrade in ops log with Git SHA + Helm chart version.
 | 
			
		||||
 | 
			
		||||
### 2.3 Rollback
 | 
			
		||||
 | 
			
		||||
1. Use Helm revision history: `helm history zastava-runtime`.
 | 
			
		||||
2. Rollback: `helm rollback zastava-runtime <revision>`.
 | 
			
		||||
3. Invalidate cached OpToks:
 | 
			
		||||
   ```sh
 | 
			
		||||
   kubectl -n stellaops exec deploy/zastava-webhook -- \
 | 
			
		||||
     zastava-webhook invalidate-op-token --audience scanner
 | 
			
		||||
   ```
 | 
			
		||||
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).
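One way to confirm the reconnect from a terminal, assuming Prometheus is reachable at the placeholder URL below; the query itself matches the metric referenced in step 4.

```sh
# Placeholder Prometheus endpoint; expect a non-zero value once observers reconnect
curl -s 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(zastava_runtime_events_total[5m]))'
```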
 | 
			
		||||
 | 
			
		||||
## 3. Authority & security guardrails
 | 
			
		||||
 | 
			
		||||
- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit
 | 
			
		||||
  `authority.token.issue` scope with decision data; absence indicates misconfig.
 | 
			
		||||
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable only in
 | 
			
		||||
  lab clusters; expect warning log `Mutual TLS requirement disabled`.
 | 
			
		||||
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during
 | 
			
		||||
  initial bootstrap. Rotate nightly; preference is to disable once Authority reachable.
 | 
			
		||||
- Audit every change in `zastava.runtime.authority` through change management.
 | 
			
		||||
  Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'`
 | 
			
		||||
  to confirm key rotation.
 | 
			
		||||
 | 
			
		||||
## 4. Incident response
 | 
			
		||||
 | 
			
		||||
### 4.1 Authority offline
 | 
			
		||||
 | 
			
		||||
1. Check Prometheus alert `ZastavaAuthorityTokenStale`.
 | 
			
		||||
2. Inspect Observer logs for `authority.token.fallback` scope.
 | 
			
		||||
3. If fallback engaged, verify static token validity duration; rotate secret if older than 24 h.
 | 
			
		||||
4. Once Authority restored, delete static fallback secret and restart pods to rebind DPoP keys.
 | 
			
		||||
 | 
			
		||||
### 4.2 Scanner/WebService latency spike
 | 
			
		||||
 | 
			
		||||
1. Alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes.
 | 
			
		||||
2. Run backend health: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
 | 
			
		||||
3. If backend degraded, auto buffer may throttle. Confirm disk-backed queue size via
 | 
			
		||||
   `kubectl logs ds/zastava-observer | grep buffer.drops`.
 | 
			
		||||
4. Consider enabling fail-open for namespaces listed in runbook Appendix B (temporary).
 | 
			
		||||
 | 
			
		||||
### 4.3 Admission deny storm
 | 
			
		||||
 | 
			
		||||
1. Alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
 | 
			
		||||
2. Pull sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
 | 
			
		||||
3. Cross-check policy backlog in Scanner (`/policy/runtime` logs). Engage application
 | 
			
		||||
   owner; optionally set namespace to `failOpenNamespaces` after risk assessment.
 | 
			
		||||
 | 
			
		||||
## 5. Offline kit & air-gapped notes
 | 
			
		||||
 | 
			
		||||
- Bundle contents:
 | 
			
		||||
  - Observer/Webhook container images (multi-arch).
 | 
			
		||||
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
 | 
			
		||||
  - Sample `zastava-runtime.values.yaml`.
 | 
			
		||||
- Verification:
 | 
			
		||||
  - Validate signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
 | 
			
		||||
  - Extract Prometheus rules into offline monitoring cluster (`/etc/prometheus/rules.d`).
 | 
			
		||||
  - Import Grafana dashboard via `grafana-cli --config ...`.
 | 
			
		||||
 | 
			
		||||
## 6. Observability assets
 | 
			
		||||
 | 
			
		||||
- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
 | 
			
		||||
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
 | 
			
		||||
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in
 | 
			
		||||
  the Offline Kit manifest.
 | 
			
		||||
 | 
			
		||||
## 7. Build-id correlation & symbol retrieval
 | 
			
		||||
 | 
			
		||||
Runtime events emitted by Observer now include `process.buildId` (from the ELF
 | 
			
		||||
`NT_GNU_BUILD_ID` note) and Scanner `/policy/runtime` surfaces the most recent
 | 
			
		||||
`buildIds` list per digest. Operators can use these hashes to locate debug
 | 
			
		||||
artifacts during incident response:
 | 
			
		||||
 | 
			
		||||
1. Capture the hash from CLI/webhook/Scanner API—for example:
 | 
			
		||||
   ```bash
 | 
			
		||||
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
 | 
			
		||||
   ```
 | 
			
		||||
   Copy one of the `Build IDs` (e.g.
 | 
			
		||||
   `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
 | 
			
		||||
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check it exists:
 | 
			
		||||
   ```bash
 | 
			
		||||
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
 | 
			
		||||
   ```
 | 
			
		||||
3. If the file is missing, rehydrate it from Offline Kit bundles or the
 | 
			
		||||
   `debug-store` object bucket (mirror of release artefacts):
 | 
			
		||||
   ```bash
 | 
			
		||||
   oras cp oci://registry.internal/debug-store:latest . --include \
 | 
			
		||||
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
 | 
			
		||||
   ```
 | 
			
		||||
4. Confirm the running process advertises the same GNU build-id before
 | 
			
		||||
   symbolising:
 | 
			
		||||
   ```bash
 | 
			
		||||
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
 | 
			
		||||
   ```
 | 
			
		||||
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it
 | 
			
		||||
   in `debuginfod` for fleet-wide symbol resolution:
 | 
			
		||||
   ```bash
 | 
			
		||||
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
 | 
			
		||||
   ```
 | 
			
		||||
6. For musl-based images, expect shorter build-id footprints. Missing hashes in
 | 
			
		||||
   runtime events indicate stripped binaries without the GNU note—schedule a
 | 
			
		||||
   rebuild with `-Wl,--build-id` enabled or add the binary to the debug-store
 | 
			
		||||
   allowlist so the scanner can surface a fallback symbol package.
 | 
			
		||||
 | 
			
		||||
Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of
 | 
			
		||||
data after ZASTAVA-OBS-17-005 implies containers launched before the Observer
 | 
			
		||||
upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart
 | 
			
		||||
Observer to trigger a fresh capture if symbol parity is required.
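As a concrete, hedged example of consuming the mirrored debug store during an incident, gdb can be pointed at the directory that holds the `.build-id` tree; the process selector mirrors step 4 above.

```bash
# Attach with symbols resolved from the mirrored debug store (paths as used earlier in this section)
gdb -ex 'set debug-file-directory /var/opt/debug' -p "$(pgrep -f payments-api | head -n1)"
```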
 | 
			
		||||
@@ -1,31 +1,31 @@
 | 
			
		||||
groups:
  - name: zastava-runtime
    interval: 30s
    rules:
      - alert: ZastavaRuntimeEventsSilent
        expr: sum(rate(zastava_runtime_events_total[10m])) == 0
        for: 15m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Observer events stalled"
          description: "No runtime events emitted in the last 15 minutes. Check observer DaemonSet health and container runtime mounts."
      - alert: ZastavaRuntimeBackendLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
        for: 10m
        labels:
          severity: critical
          service: zastava-runtime
        annotations:
          summary: "Runtime backend latency p95 above 750 ms"
          description: "Latency to Scanner runtime APIs is elevated. Inspect Scanner.WebService readiness, Authority OpTok issuance, and cluster network."
      - alert: ZastavaAdmissionDenySpike
        expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) > 20
        for: 5m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Admission webhook denies exceeding threshold"
          description: "Webhook is denying more than 20 pod admissions per minute. Confirm policy verdicts and consider fail-open exception for impacted namespaces."
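Before mirroring the rules into an offline monitoring cluster, it is worth validating the file with `promtool`; the path matches where the runbook stores these rules.

```bash
promtool check rules docs/ops/zastava-runtime-prometheus-rules.yaml
```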