feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established a similar documentation structure for the Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
@@ -1,97 +0,0 @@
# Authority Backup & Restore Runbook

## Scope

- **Applies to:** StellaOps Authority deployments running the official `ops/authority/docker-compose.authority.yaml` stack or equivalent Kubernetes packaging.
- **Artifacts covered:** MongoDB (`stellaops-authority` database), Authority configuration (`etc/authority.yaml`), plugin manifests under `etc/authority.plugins/`, and signing key material stored in the `authority-keys` volume (defaults to `/app/keys` inside the container).
- **Frequency:** Run the full procedure prior to upgrades, before rotating keys, and at least once per 24 h in production. Store snapshots in an encrypted, access-controlled vault.

## Inventory Checklist

| Component | Location (compose default) | Notes |
| --- | --- | --- |
| Mongo data | `mongo-data` volume (`/var/lib/docker/volumes/.../mongo-data`) | Contains all Authority collections (`AuthorityUser`, `AuthorityClient`, `AuthorityToken`, etc.). |
| Configuration | `etc/authority.yaml` | Mounted read-only into the container at `/etc/authority.yaml`. |
| Plugin manifests | `etc/authority.plugins/*.yaml` | Includes `standard.yaml` with `tokenSigning.keyDirectory`. |
| Signing keys | `authority-keys` volume -> `/app/keys` | Path is derived from `tokenSigning.keyDirectory` (defaults to `../keys` relative to the manifest). |

> **TIP:** Confirm the deployed key directory via `tokenSigning.keyDirectory` in `etc/authority.plugins/standard.yaml`; some installations relocate keys to `/var/lib/stellaops/authority/keys`.

## Hot Backup (no downtime)

1. **Create output directory:** `mkdir -p backup/$(date +%Y-%m-%d)` on the host.
2. **Dump Mongo:**

   ```bash
   STAMP=$(date +%Y%m%dT%H%M%SZ)   # capture once so the dump and the copy reference the same archive
   docker compose -f ops/authority/docker-compose.authority.yaml exec mongo \
     mongodump --archive=/dump/authority-${STAMP}.gz \
     --gzip --db stellaops-authority
   docker compose -f ops/authority/docker-compose.authority.yaml cp \
     mongo:/dump/authority-${STAMP}.gz backup/
   ```

   The `mongodump` archive preserves indexes and can be restored with `mongorestore --archive --gzip`.
3. **Capture configuration + manifests:**

   ```bash
   cp etc/authority.yaml backup/
   rsync -a etc/authority.plugins/ backup/authority.plugins/
   ```

4. **Export signing keys:** the compose file maps `authority-keys` to a local Docker volume. Snapshot it without stopping the service:

   ```bash
   docker run --rm \
     -v authority-keys:/keys \
     -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%dT%H%M%SZ).tar.gz -C /keys .
   ```

5. **Checksum:** generate SHA-256 digests for every file and store them alongside the artefacts.
6. **Encrypt & upload:** wrap the backup folder using your secrets management standard (e.g., age, GPG) and upload to the designated offline vault. A minimal sketch for steps 5–6 follows below.
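A minimal sketch of the checksum and encryption steps, assuming `age` is the chosen wrapper and `/etc/stellaops/ops-backup.agekey.pub` holds the recipient key (both the tool choice and the key path are assumptions, not mandated by this runbook):

```bash
cd backup
# SHA-256 digests for every artefact, stored next to them
find . -type f ! -name 'SHA256SUMS' -exec sha256sum {} + > SHA256SUMS

# Wrap the whole folder for the offline vault (age shown; GPG works the same way)
tar czf - . | age -R /etc/stellaops/ops-backup.agekey.pub \
  > ../authority-backup-$(date +%Y-%m-%d).tar.gz.age
```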
## Cold Backup (planned downtime)

1. Notify stakeholders and drain traffic (CLI clients should refresh tokens afterwards).
2. Stop services:

   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml down
   ```

3. Back up volumes directly using `tar`:

   ```bash
   docker run --rm -v mongo-data:/data -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/mongo-data-$(date +%Y%m%d).tar.gz -C /data .
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar czf /backup/authority-keys-$(date +%Y%m%d).tar.gz -C /keys .
   ```

4. Copy configuration + manifests as in the hot backup (steps 3–6).
5. Restart services and verify health:

   ```bash
   docker compose -f ops/authority/docker-compose.authority.yaml up -d
   curl -fsS http://localhost:8080/ready
   ```

## Restore Procedure

1. **Provision clean volumes:** remove existing volumes if you're rebuilding a node (`docker volume rm mongo-data authority-keys`), then recreate the compose stack so empty volumes exist.
2. **Restore Mongo:**

   ```bash
   docker compose exec -T mongo mongorestore --archive --gzip --drop < backup/authority-YYYYMMDDTHHMMSSZ.gz
   ```

   Use `--drop` to replace collections; omit it when doing a partial restore.
3. **Restore configuration/manifests:** copy `authority.yaml` and `authority.plugins/*` into place before starting the Authority container.
4. **Restore signing keys:** untar into the mounted volume:

   ```bash
   docker run --rm -v authority-keys:/keys -v "$(pwd)/backup:/backup" \
     busybox tar xzf /backup/authority-keys-YYYYMMDD.tar.gz -C /keys
   ```

   Ensure private key files keep `600` permissions (e.g. `find /keys -type f -exec chmod 600 {} +`; directories need the execute bit).
5. **Start services & validate:**

   ```bash
   docker compose up -d
   curl -fsS http://localhost:8080/health
   ```

6. **Validate JWKS and tokens:** call `/jwks` and issue a short-lived token via the CLI to confirm key material matches expectations (a quick check is sketched below). If the restored environment requires a fresh signing key, follow the rotation SOP in [`docs/11_AUTHORITY.md`](../11_AUTHORITY.md) using `ops/authority/key-rotation.sh` to invoke `/internal/signing/rotate`.
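A quick post-restore check, assuming the service listens on `localhost:8080` as in the examples above and `jq` is installed; the CLI login invocation mirrors the one used in the Concelier audit runbook and is illustrative:

```bash
# The restored JWKS should advertise the key ids you expect (active + retired)
curl -fsS http://localhost:8080/jwks | jq -r '.keys[].kid'

# Obtain a short-lived token against the restored deployment and confirm it is accepted
stella auth login --scope "concelier.jobs.trigger advisory:read"
```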
## Disaster Recovery Notes

- **Air-gapped replication:** replicate archives via the Offline Update Kit transport channels; never attach USB devices without scanning.
- **Retention:** maintain 30 daily snapshots + 12 monthly archival copies. Rotate encryption keys annually.
- **Key compromise:** if signing keys are suspected compromised, restore from the latest clean backup, rotate via OPS3 (see `ops/authority/key-rotation.sh` and `docs/11_AUTHORITY.md`), and publish a revocation notice.
- **Mongo version:** keep dump/restore images pinned to the deployment version (compose uses `mongo:7`). Driver 3.5.0 requires MongoDB **4.2+**; clusters still on 4.0 must be upgraded before restore, and future driver releases will drop 4.0 entirely.

## Verification Checklist

- [ ] `/ready` reports all identity providers ready.
- [ ] OAuth flows issue tokens signed by the restored keys.
- [ ] `PluginRegistrationSummary` logs expected providers on startup.
- [ ] Revocation manifest export (`dotnet run --project src/Authority/StellaOps.Authority`) succeeds.
- [ ] Monitoring dashboards show metrics resuming (see OPS5 deliverables).
@@ -1,174 +0,0 @@
{
  "title": "StellaOps Authority - Token & Access Monitoring",
  "uid": "authority-token-monitoring",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0,
        "current": {}
      }
    ]
  },
  "panels": [
    {
      "id": 1,
      "title": "Token Requests – Success vs Failure",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{grant_type}} ({{status}})"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name=\"stellaops-authority\", http_route=\"/token\"}[5m]))",
          "legendFormat": "{{grant_type}} {{status}}"
        }
      ],
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi"
        }
      }
    },
    {
      "id": 2,
      "title": "Rate Limiter Rejections",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "req/s",
          "displayName": "{{limiter}}"
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (limiter) (rate(aspnetcore_rate_limiting_rejections_total{service_name=\"stellaops-authority\"}[5m]))",
          "legendFormat": "{{limiter}}"
        }
      ]
    },
    {
      "id": 3,
      "title": "Bypass Events (5m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}.\"}[5m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 4,
      "title": "Lockout Events (15m)",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short",
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 5 },
              { "color": "red", "value": 10 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(log_messages_total{message_template=\"Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter}).\"}[15m]))"
        }
      ],
      "options": {
        "reduceOptions": {
          "calcs": ["last"],
          "fields": "",
          "values": false
        },
        "orientation": "horizontal",
        "textMode": "auto"
      }
    },
    {
      "id": 5,
      "title": "Trace Explorer Shortcut",
      "type": "text",
      "options": {
        "mode": "markdown",
        "content": "[Open Trace Explorer](#/explore?left={\"datasource\":\"tempo\",\"queries\":[{\"query\":\"{service.name=\\\"stellaops-authority\\\", span_name=~\\\"authority.token.*\\\"}\",\"refId\":\"A\"}]})"
      }
    }
  ],
  "links": []
}
@@ -1,94 +0,0 @@
# Authority Signing Key Rotation Playbook

> **Status:** Authored 2025-10-12 as part of OPS3.KEY-ROTATION rollout.
> Use together with `docs/11_AUTHORITY.md` (Authority service guide) and the automation shipped under `ops/authority/`.

## 1. Overview

Authority publishes JWKS and revocation bundles signed with ES256 keys. To rotate those keys without downtime we now provide:

- **Automation script:** `ops/authority/key-rotation.sh`
  Shell helper that POSTs to `/internal/signing/rotate`, supports metadata and dry-run mode, and confirms the JWKS afterwards.
- **CI workflow:** `.gitea/workflows/authority-key-rotation.yml`
  Manual dispatch workflow that pulls environment-specific secrets, runs the script, and records the result. Works across staging/production by passing the `environment` input.

This playbook documents the repeatable sequence for all environments.

## 2. Pre-requisites

1. **Generate a new PEM key (per environment)**

   ```bash
   openssl ecparam -name prime256v1 -genkey -noout \
     -out certificates/authority-signing-<env>-<year>.pem
   chmod 600 certificates/authority-signing-<env>-<year>.pem
   ```

2. **Stash the previous key** under the same volume so it can be referenced in `signing.additionalKeys` after rotation.
3. **Ensure secrets/vars exist in Gitea**
   - `<ENV>_AUTHORITY_BOOTSTRAP_KEY`
   - `<ENV>_AUTHORITY_URL`
   - Optional shared defaults `AUTHORITY_BOOTSTRAP_KEY`, `AUTHORITY_URL`.

## 3. Executing the rotation

### Option A – via CI workflow (recommended)

1. Navigate to **Actions → Authority Key Rotation**.
2. Provide inputs:
   - `environment`: `staging`, `production`, etc.
   - `key_id`: new `kid` (e.g. `authority-signing-2025-dev`).
   - `key_path`: path as seen by the Authority service (e.g. `../certificates/authority-signing-2025-dev.pem`).
   - Optional `metadata`: comma-separated `key=value` pairs (for audit trails).
3. Trigger the workflow. It:
   - Reads the bootstrap key/URL from secrets.
   - Runs `ops/authority/key-rotation.sh`.
   - Prints the JWKS response for verification.

### Option B – manual shell invocation

```bash
AUTHORITY_BOOTSTRAP_KEY=$(cat /secure/authority-bootstrap.key) \
./ops/authority/key-rotation.sh \
  --authority-url https://authority.example.com \
  --key-id authority-signing-2025-dev \
  --key-path ../certificates/authority-signing-2025-dev.pem \
  --meta rotatedBy=ops --meta changeTicket=OPS-1234
```

Use `--dry-run` to inspect the payload before execution.

## 4. Post-rotation checklist

1. Update `authority.yaml` (or environment-specific overrides) as sketched after this checklist:
   - Set `signing.activeKeyId` to the new key.
   - Set `signing.keyPath` to the new PEM.
   - Append the previous key into `signing.additionalKeys`.
   - Ensure `keySource`/`provider` match the values passed to the script.
2. Run `stellaops-cli auth revoke export` so revocation bundles are re-signed with the new key.
3. Confirm `/jwks` lists the new `kid` with `status: "active"` and the previous one as `retired`.
4. Archive the old key securely; keep it available until all tokens/bundles signed with it have expired.
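A minimal sketch of the resulting `signing` block, using the dev key ids from section 5 below; the exact shape of `additionalKeys` entries is an assumption and should be checked against `etc/authority.yaml.sample`:

```yaml
signing:
  activeKeyId: "authority-signing-2025-dev"
  keyPath: "../certificates/authority-signing-2025-dev.pem"
  additionalKeys:
    - keyId: "authority-signing-dev"        # previous key stays listed until everything it signed has expired
      path: "../certificates/authority-signing-dev.pem"
  # keySource / provider must match the values passed to key-rotation.sh
```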
## 5. Development key state

For the sample configuration (`etc/authority.yaml.sample`) we minted a placeholder dev key:

- Active: `authority-signing-2025-dev` (`certificates/authority-signing-2025-dev.pem`)
- Retired: `authority-signing-dev`

Treat these as examples; real environments must maintain their own PEM material.

## 6. References

- `docs/11_AUTHORITY.md` – Architecture and rotation SOP (Section 5).
- `docs/ops/authority-backup-restore.md` – Recovery flow referencing this playbook.
- `ops/authority/README.md` – CLI usage and examples.
- `scripts/rotate-policy-cli-secret.sh` – Helper to mint new `policy-cli` shared secrets when policy scope bundles change.

## 7. Appendix — Policy CLI secret rotation

Scope migrations such as AUTH-POLICY-23-004 require issuing fresh credentials for the `policy-cli` client. Use the helper script committed with the repo to keep secrets deterministic across environments.

```bash
./scripts/rotate-policy-cli-secret.sh --output etc/secrets/policy-cli.secret
```

The script writes a timestamped header and a random secret into the target file. Use `--dry-run` when generating material for external secret stores. After updating secrets in staging/production, recycle the Authority pods and confirm the new client credentials work before the next release freeze.
@@ -1,83 +0,0 @@
# Authority Monitoring & Alerting Playbook

## Telemetry Sources

- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors).
  - `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).

## Prometheus Metrics to Collect

| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` arrives via the `authority.grant_type` span attribute → exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Alert when > 5 % for 10 min. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiter saturation (requires the OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from a Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |

> **Exporter note:** Enable the `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the series shown above.

## Alert Rules

1. **Token Failure Surge**
   - _Expression_: `token_failure_ratio > 0.05`
   - _For_: `10m`
   - _Labels_: `severity="critical"`
   - _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as a diagnostic hint (requires span → metric transformation).
2. **Lockout Spike**
   - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
   - _For_: `15m`
   - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
   - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
   - _For_: `5m`
   - Alert severity `warning`; verify the calling host list.
4. **Rate Limiter Saturation**
   - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 min; confirm trusted clients aren't misconfigured.
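A minimal Prometheus rule-file sketch covering alerts 1 and 4, assuming `token_failure_ratio` exists as a recording rule and `aspnetcore_rate_limiting_rejections_total` is exposed under that exact name (adjust to your pipeline's generated metric names):

```yaml
groups:
  - name: authority-token
    rules:
      - alert: AuthorityTokenFailureSurge
        expr: token_failure_ratio > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Token failure ratio above 5% for 10 minutes"
      - alert: AuthorityRateLimiterSaturation
        expr: sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Authority rate limiter is rejecting requests"
```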
## Grafana Dashboard

- Import `docs/ops/authority-grafana-dashboard.json` to provision the baseline panels:
  - **Token Success vs Failure** – stacked rate visualization split by grant type.
  - **Rate Limiter Hits** – bar chart showing `authority-token` and `authority-authorize`.
  - **Bypass & Lockout Events** – dual-stat panel using Loki-derived counters.
  - **Trace Explorer Link** – panel links to the `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.

## Collector Configuration Snippets

```yaml
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
processors:
  batch:
  attributes/token_grant:
    actions:
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

## Operational Checklist

- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include the dashboard JSON in the Offline Kit for air-gapped clusters; update the version header when metrics change.
@@ -1,77 +0,0 @@
# Concelier Apple Security Update Connector Operations

This runbook covers staging and production rollout for the Apple security updates connector (`source:vndr-apple:*`), including observability checks and fixture maintenance.

## 1. Prerequisites

- Network egress (or a mirrored cache) for `https://gdmf.apple.com/v2/pmv` and the Apple Support domain (`https://support.apple.com/`).
- Optional: corporate proxy exclusions for the Apple hosts if outbound traffic is normally filtered.
- Updated configuration (environment variables or `concelier.yaml`) with an `apple` section. Example baseline:

```yaml
concelier:
  sources:
    apple:
      softwareLookupUri: "https://gdmf.apple.com/v2/pmv"
      advisoryBaseUri: "https://support.apple.com/"
      localeSegment: "en-us"
      maxAdvisoriesPerFetch: 25
      initialBackfill: "120.00:00:00"
      modifiedTolerance: "02:00:00"
      failureBackoff: "00:05:00"
```

> ℹ️ `softwareLookupUri` and `advisoryBaseUri` must stay absolute and aligned with the HTTP allow-list; Concelier automatically adds both hosts to the connector HttpClient.

## 2. Staging Smoke Test

1. Deploy the configuration and restart the Concelier workers so the Apple connector options are bound.
2. Trigger a full connector cycle (a curl sketch for the REST variant follows this list):
   - CLI: `stella db jobs run source:vndr-apple:fetch --and-then source:vndr-apple:parse --and-then source:vndr-apple:map`
   - REST: `POST /jobs/run { "kind": "source:vndr-apple:fetch", "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"] }`
3. Validate metrics exported under the meter `StellaOps.Concelier.Connector.Vndr.Apple`:
   - `apple.fetch.items` (documents fetched)
   - `apple.fetch.failures`
   - `apple.fetch.unchanged`
   - `apple.parse.failures`
   - `apple.map.affected.count` (histogram of affected package counts)
4. Cross-check the shared HTTP counters:
   - `concelier.source.http.requests_total{concelier_source="vndr-apple"}` should increase for both index and detail phases.
   - `concelier.source.http.failures_total{concelier_source="vndr-apple"}` should remain flat (0) during a healthy run.
5. Inspect the info logs:
   - `Apple software index fetch … processed=X newDocuments=Y`
   - `Apple advisory parse complete … aliases=… affected=…`
   - `Mapped Apple advisory … pendingMappings=0`
6. Confirm MongoDB state:
   - `raw_documents` store contains the HT article HTML with metadata (`apple.articleId`, `apple.postingDate`).
   - `dtos` store has `schemaVersion="apple.security.update.v1"`.
   - `advisories` collection includes keys `HTxxxxxx` with normalized SemVer rules.
   - `source_states` entry for `apple` shows a recent `cursor.lastPosted`.
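A sketch of the REST trigger from step 2, assuming the Concelier WebService is reachable at `https://concelier.internal` and `$TOKEN` carries the job-trigger scope (host name and token handling are assumptions):

```bash
curl -fsS -X POST "https://concelier.internal/jobs/run" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "kind": "source:vndr-apple:fetch",
        "chain": ["source:vndr-apple:parse", "source:vndr-apple:map"]
      }'
```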
## 3. Production Monitoring

- **Dashboards** – Add the following expressions to your Concelier Grafana board (OTLP/Prometheus naming assumed):
  - `rate(apple_fetch_items_total[15m])` vs `rate(concelier_source_http_requests_total{concelier_source="vndr-apple"}[15m])`
  - `rate(apple_fetch_failures_total[5m])` for error spikes (`severity=warning` at `>0`)
  - `histogram_quantile(0.95, rate(apple_map_affected_count_bucket[1h]))` to watch affected-package fan-out
  - `increase(apple_parse_failures_total[6h])` to catch parser drift (alerts at `>0`)
- **Alerts** – Page if `rate(apple_fetch_items_total[2h]) == 0` during business hours while other connectors are active. This often indicates lookup feed failures or misconfigured allow-lists.
- **Logs** – Surface warnings `Apple document {DocumentId} missing GridFS payload` or `Apple parse failed`; repeated hits imply storage issues or HTML regressions.
- **Telemetry pipeline** – `StellaOps.Concelier.WebService` now exports `StellaOps.Concelier.Connector.Vndr.Apple` alongside the existing Concelier meters; ensure your OTEL collector or Prometheus scraper includes it.

## 4. Fixture Maintenance

Regression fixtures live under `src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/Apple/Fixtures`. Refresh them whenever Apple reshapes the HT layout or when new platforms appear.

1. Run the helper script matching your platform:
   - Bash: `./scripts/update-apple-fixtures.sh`
   - PowerShell: `./scripts/update-apple-fixtures.ps1`
2. Each script exports `UPDATE_APPLE_FIXTURES=1`, updates the `WSLENV` passthrough, and touches `.update-apple-fixtures` so WSL+VS Code test runs observe the flag. The subsequent test execution fetches the live HT articles listed in `AppleFixtureManager`, sanitises the HTML, and rewrites the `.expected.json` DTO snapshots.
3. Review the diff for localisation or nav noise. Once satisfied, re-run the tests without the env var (`dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests/StellaOps.Concelier.Connector.Vndr.Apple.Tests.csproj`) to verify determinism.
4. Commit fixture updates together with any parser/mapping changes that motivated them.

## 5. Known Issues & Follow-up Tasks

- Apple occasionally throttles anonymous requests after bursts. The connector backs off automatically, but persistent `apple.fetch.failures` spikes might require mirroring the HT content or scheduling wider fetch windows.
- Rapid Security Responses may appear before the general patch notes surface in the lookup JSON. When that happens, the fetch run will log `detailFailures>0`. Collect sample HTML and refresh fixtures to confirm parser coverage.
- Multi-locale content is still under regression sweep (`src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Vndr.Apple/TASKS.md`). Capture non-`en-us` snapshots once the fixture tooling stabilises.
@@ -1,159 +0,0 @@
# Concelier Authority Audit Runbook

_Last updated: 2025-10-22_

This runbook helps operators verify and monitor the StellaOps Concelier ⇆ Authority integration. It focuses on the `/jobs*` surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

## 1. Prerequisites

- Authority integration is enabled in `concelier.yaml` (or via `CONCELIER_AUTHORITY__*` environment variables) with a valid `clientId`, secret, audience, and required scopes.
- OTLP metrics/log exporters are configured (`concelier.telemetry.*`) or container stdout is shipped to your SIEM.
- Operators have access to the Concelier job trigger endpoints via CLI or REST for smoke tests.
- The rollout table in `docs/10_CONCELIER_CLI_QUICKSTART.md` has been reviewed so stakeholders align on the staged → enforced toggle timeline.

### Configuration snippet

```yaml
concelier:
  authority:
    enabled: true
    allowAnonymousFallback: false   # keep true only during the initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://concelier"
    requiredScopes:
      - "concelier.jobs.trigger"
      - "advisory:read"
      - "advisory:ingest"
    requiredTenants:
      - "tenant-default"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "concelier-jobs"
    clientSecretFile: "/run/secrets/concelier_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"
```

> Store secrets outside source control. Concelier reads `clientSecretFile` on startup; rotate by updating the mounted file and restarting the service.

### Resilience tuning

- **Connected sites:** keep the default 1 s / 2 s / 5 s retry ladder so Concelier retries transient Authority hiccups but still surfaces outages quickly. Leave `allowOfflineCacheFallback=true` so cached discovery/JWKS data can bridge short Pathfinder restarts.
- **Air-gapped/Offline Kit installs:** extend `offlineCacheTolerance` (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (`enableRetries=false`) if infrastructure teams prefer to handle exponential backoff at the network layer; Concelier will fail fast but keep deterministic logs.
- Concelier resolves these knobs through `IOptionsMonitor<StellaOpsAuthClientOptions>`. Edits to `concelier.yaml` are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.

## 2. Key Signals

### 2.1 Audit log channel

Concelier emits structured audit entries via the `Concelier.Authorization.Audit` logger for every `/jobs*` request once Authority enforcement is active.

```
Concelier authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=concelier-cli scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7
```

| Field | Sample value | Meaning |
|--------------|-------------------------|------------------------------------------------------------------------------------------|
| `route` | `/jobs/definitions` | Endpoint that processed the request. |
| `status` | `200` / `401` / `409` | Final HTTP status code returned to the caller. |
| `subject` | `ops@example.com` | User or service principal subject (falls back to `(anonymous)` when unauthenticated). |
| `clientId` | `concelier-cli` | OAuth client ID provided by Authority (`(none)` if the token lacked the claim). |
| `scopes` | `concelier.jobs.trigger advisory:ingest advisory:read` | Normalised scope list extracted from token claims; `(none)` if the token carried none. |
| `tenant` | `tenant-default` | Tenant claim extracted from the Authority token (`(none)` when the token lacked it). |
| `bypass` | `True` / `False` | Indicates whether the request succeeded because its source IP matched a bypass CIDR. |
| `remote` | `10.1.4.7` | Remote IP recorded from the connection / forwarded-header test hooks. |

Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations (a query sketch follows this list):

- `status=401 AND bypass=True` – a bypass network accepted an unauthenticated call (should be temporary during rollout).
- `status=202 AND scopes="(none)"` – a token without scopes triggered a job; tighten client configuration.
- `status=202 AND NOT contains(scopes,"advisory:ingest")` – ingestion attempted without the new AOC scopes; confirm the Authority client registration matches the sample above.
- `tenant!=(tenant-default)` – indicates a cross-tenant token was accepted. Ensure Concelier `requiredTenants` is aligned with the Authority client registration.
- Spike in `clientId="(none)"` – indicates upstream Authority is not issuing `client_id` claims or the CLI is outdated.
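A hedged `logcli` sketch of the first two filters, assuming the audit lines land in Loki under a `stellaops-concelier` job label and keep the key=value layout shown above (the label name is an assumption):

```bash
# Bypass network accepted an unauthenticated call (should only happen during rollout)
logcli query '{job="stellaops-concelier"} |= "Concelier authorization audit" |= "status=401" |= "bypass=True"' --since 24h

# Job accepted for a token that carried no scopes
logcli query '{job="stellaops-concelier"} |= "Concelier authorization audit" |= "status=202" |= "scopes=(none)"' --since 24h
```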
### 2.2 Metrics

Concelier publishes counters under the OTEL meter `StellaOps.Concelier.WebService.Jobs`. Tags: `job.kind`, `job.trigger`, `job.outcome`.

| Metric name | Description | PromQL example |
|-------------------------------|----------------------------------------------------|----------------|
| `web.jobs.triggered` | Accepted job trigger requests. | `sum by (job_kind) (rate(web_jobs_triggered_total[5m]))` |
| `web.jobs.trigger.conflict` | Rejected triggers (already running, disabled…). | `sum(rate(web_jobs_trigger_conflict_total[5m]))` |
| `web.jobs.trigger.failed` | Server-side job failures. | `sum(rate(web_jobs_trigger_failed_total[5m]))` |

> Prometheus/OTEL collectors typically surface counters with a `_total` suffix. Adjust queries to match your pipeline's generated metric names.

Correlate audit logs with the following global meter exported via `Concelier.SourceDiagnostics`:

- `concelier.source.http.requests_total{concelier_source="jobs-run"}` – ensures REST/manual triggers route through Authority.
- If Grafana dashboards are deployed, extend the "Concelier Jobs" board with the above counters plus a table of recent audit log entries.

## 3. Alerting Guidance

1. **Unauthorized bypass attempt**
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0`
   - Action: verify the `bypassNetworks` list; confirm expected maintenance windows; rotate credentials if suspicious.

2. **Missing scopes**
   - Query: `sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0`
   - Action: audit the Authority client registration; ensure `requiredScopes` includes `concelier.jobs.trigger`, `advisory:ingest`, and `advisory:read`.

3. **Trigger failure surge**
   - Query: `sum(rate(web_jobs_trigger_failed_total[10m])) > 0` with severity `warning` if sustained for 10 minutes.
   - Action: inspect correlated audit entries and `Concelier.Telemetry` traces for job execution errors.

4. **Conflict spike**
   - Query: `sum(rate(web_jobs_trigger_conflict_total[10m])) > 5` (tune threshold).
   - Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.

5. **Authority offline**
   - Watch `Concelier.Authorization.Audit` logs for `status=503` or `status=500` along with `clientId="(none)"`. Investigate Authority availability before re-enabling anonymous fallback.

## 4. Rollout & Verification Procedure

1. **Pre-checks**
   - Align with the rollout phases documented in `docs/10_CONCELIER_CLI_QUICKSTART.md` (validation → rehearsal → enforced) and record the target dates in your change request.
   - Confirm `allowAnonymousFallback` is `false` in production; keep it `true` only during staged validation.
   - Validate that Authority issuer metadata is reachable from Concelier (`curl https://authority.internal/.well-known/openid-configuration` from the host).

2. **Smoke test with valid token**
   - Obtain a token via CLI: `stella auth login --scope "concelier.jobs.trigger advisory:ingest" --scope advisory:read`.
   - Trigger a read-only endpoint: `curl -H "Authorization: Bearer $TOKEN" https://concelier.internal/jobs/definitions`.
   - Expect HTTP 200/202 and an audit log with `bypass=False`, `scopes=concelier.jobs.trigger advisory:ingest advisory:read`, and `tenant=tenant-default`.

3. **Negative test without token**
   - Call the same endpoint without a token. Expect HTTP 401, `bypass=False`.
   - If the request succeeds, double-check `bypassNetworks` and ensure the anonymous fallback is disabled.

4. **Bypass check (if applicable)**
   - From an allowed maintenance IP, call `/jobs/definitions` without a token. Confirm the audit log shows `bypass=True`. Review the business justification and expiry date for such entries.

5. **Metrics validation**
   - Ensure the `web.jobs.triggered` counter increments during accepted runs.
   - Exporters should show corresponding spans (`concelier.job.trigger`) if tracing is enabled.

## 5. Troubleshooting

| Symptom | Probable cause | Remediation |
|---------|----------------|-------------|
| Audit log shows `clientId=(none)` for all requests | Authority not issuing the `client_id` claim or CLI outdated | Update the StellaOps Authority configuration (`StellaOpsAuthorityOptions.Token.Claims.ClientId`), or upgrade the CLI token acquisition flow. |
| Requests succeed with `bypass=True` unexpectedly | Local network added to `bypassNetworks` or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Concelier. |
| HTTP 401 with valid token | `requiredScopes` missing from the client registration or token audience mismatch | Verify the Authority client scopes (`concelier.jobs.trigger`) and ensure the token audience matches the `audiences` config. |
| Metrics missing from Prometheus | Telemetry exporters disabled or collector filter missing the OTEL meter | Set `concelier.telemetry.enableMetrics=true`, ensure the collector includes the `StellaOps.Concelier.WebService.Jobs` meter. |
| Sudden spike in `web.jobs.trigger.failed` | Downstream job failure or Authority timeout mid-request | Inspect Concelier job logs, re-run with tracing enabled, validate Authority latency. |

## 6. References

- `docs/21_INSTALL_GUIDE.md` – Authority configuration quick start.
- `docs/17_SECURITY_HARDENING_GUIDE.md` – Security guardrails and enforcement deadlines.
- `docs/ops/authority-monitoring.md` – Authority-side monitoring and alerting playbook.
- `StellaOps.Concelier.WebService/Filters/JobAuthorizationAuditFilter.cs` – source of the audit log fields.
@@ -1,72 +0,0 @@
# Concelier CCCS Connector Operations

This runbook covers day-to-day operation of the Canadian Centre for Cyber Security (`source:cccs:*`) connector, including configuration, telemetry, and historical backfill guidance for English/French advisories.

## 1. Configuration Checklist

- Network egress (or a mirrored cache) for `https://www.cyber.gc.ca/` and the JSON API endpoints under `/api/cccs/`.
- Set the Concelier options before restarting workers. Example `concelier.yaml` snippet:

```yaml
concelier:
  sources:
    cccs:
      feeds:
        - language: "en"
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=en&content_type=cccs_threat"
        - language: "fr"
          uri: "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=fr&content_type=cccs_threat"
      maxEntriesPerFetch: 80   # increase temporarily for backfill runs
      maxKnownEntries: 512
      requestTimeout: "00:00:30"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:05:00"
```

> ℹ️ The `/api/cccs/threats/v1/get` endpoint returns thousands of records per language (≈5 100 rows each as of 2025-10-14). The connector honours `maxEntriesPerFetch`, so leave it low for steady state and raise it for planned backfills.

## 2. Telemetry & Logging

- **Metrics (Meter `StellaOps.Concelier.Connector.Cccs`):**
  - `cccs.fetch.attempts`, `cccs.fetch.success`, `cccs.fetch.failures`
  - `cccs.fetch.documents`, `cccs.fetch.unchanged`
  - `cccs.parse.success`, `cccs.parse.failures`, `cccs.parse.quarantine`
  - `cccs.map.success`, `cccs.map.failures`
- **Shared HTTP metrics** via `SourceDiagnostics`:
  - `concelier.source.http.requests{concelier.source="cccs"}`
  - `concelier.source.http.failures{concelier.source="cccs"}`
  - `concelier.source.http.duration{concelier.source="cccs"}`
- **Structured logs**
  - `CCCS fetch completed feeds=… items=… newDocuments=… pendingDocuments=…`
  - `CCCS parse completed parsed=… failures=…`
  - `CCCS map completed mapped=… failures=…`
  - Warnings fire when GridFS payloads/DTOs go missing or parser sanitisation fails.

Suggested Grafana alerts:

- `increase(cccs.fetch.failures_total[15m]) > 0`
- `rate(cccs.map.success_total[1h]) == 0` while other connectors are active
- `histogram_quantile(0.95, rate(concelier_source_http_duration_bucket{concelier_source="cccs"}[1h])) > 5s`

## 3. Historical Backfill Plan

1. **Snapshot the source** – the API accepts `page=<n>` and `lang=<en|fr>` query parameters. `page=0` returns the full dataset (observed earliest `date_created`: 2018-06-08 for EN, 2018-06-08 for FR). Mirror those responses into Offline Kit storage when operating air-gapped.
2. **Stage ingestion:**
   - Temporarily raise `maxEntriesPerFetch` (e.g. 500) and restart the Concelier workers.
   - Run chained jobs until `pendingDocuments` drains:
     `stella db jobs run source:cccs:fetch --and-then source:cccs:parse --and-then source:cccs:map`
   - Monitor `cccs.fetch.unchanged` growth; once it approaches the dataset size the backfill is complete.
3. **Optional pagination sweep** – for incremental mirrors, iterate `page=<n>` (0…N) while `response.Count == 50`, persisting the JSON to disk (a sketch follows this list). Store it alongside metadata (`language`, `page`, SHA256) so repeated runs detect drift.
4. **Language split** – keep EN/FR payloads separate to preserve canonical language fields. The connector emits `Language` directly from the feed entry, so mixed ingestion simply produces parallel advisories keyed by the same serial number.
5. **Throttle planning** – schedule backfills during maintenance windows; the API tolerates burst downloads, but respect the 250 ms request delay or raise it if mirrored traffic is not available.
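A minimal pagination-sweep sketch for step 3, assuming `jq` is available and that the top-level `Count` field reflects the page size as described above (the field name and 50-row page size come from that description):

```bash
lang=en
page=0
mkdir -p mirror/cccs/${lang}
while :; do
  out="mirror/cccs/${lang}/page-${page}.json"
  curl -fsS "https://www.cyber.gc.ca/api/cccs/threats/v1/get?lang=${lang}&content_type=cccs_threat&page=${page}" -o "${out}"
  sha256sum "${out}" >> mirror/cccs/${lang}/SHA256SUMS
  count=$(jq -r '.Count // 0' "${out}")
  [ "${count}" -eq 50 ] || break   # stop once a page comes back short
  page=$((page + 1))
  sleep 0.25                       # respect the connector's 250 ms request delay
done
```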
## 4. Selector & Sanitiser Notes

- `CccsHtmlParser` now parses the **unsanitised DOM** (via AngleSharp) and only sanitises when persisting `ContentHtml`.
- Product extraction walks headings (`Affected Products`, `Produits touchés`, `Mesures recommandées`) and consumes nested lists within `div/section/article` containers.
- `HtmlContentSanitizer` allows `<h1>…<h6>` and `<section>` so stored HTML keeps headings for UI rendering and downstream summarisation.

## 5. Fixture Maintenance

- Regression fixtures live in `src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/Fixtures`.
- Refresh via `UPDATE_CCCS_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Cccs.Tests/StellaOps.Concelier.Connector.Cccs.Tests.csproj`.
- Fixtures capture both EN/FR advisories with nested lists to guard against sanitiser regressions; review diffs for heading/list changes before committing.
@@ -1,146 +0,0 @@
# Concelier CERT-Bund Connector Operations

_Last updated: 2025-10-17_

Germany's Federal Office for Information Security (BSI) operates the Warn- und Informationsdienst (WID) portal. The Concelier CERT-Bund connector (`source:cert-bund:*`) ingests the public RSS feed, hydrates the portal's JSON detail endpoint, and maps the result into canonical advisories while preserving the original German content.

---

## 1. Configuration Checklist

- Allow outbound access (or stage mirrors) for:
  - `https://wid.cert-bund.de/content/public/securityAdvisory/rss`
  - `https://wid.cert-bund.de/portal/` (session/bootstrap)
  - `https://wid.cert-bund.de/portal/api/securityadvisory` (detail/search/export JSON)
- Ensure the HTTP client reuses a cookie container (the connector's dependency injection wiring already sets this up).

Example `concelier.yaml` fragment:

```yaml
concelier:
  sources:
    cert-bund:
      feedUri: "https://wid.cert-bund.de/content/public/securityAdvisory/rss"
      portalBootstrapUri: "https://wid.cert-bund.de/portal/"
      detailApiUri: "https://wid.cert-bund.de/portal/api/securityadvisory"
      maxAdvisoriesPerFetch: 50
      maxKnownAdvisories: 512
      requestTimeout: "00:00:30"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:05:00"
```

> Leave `maxAdvisoriesPerFetch` at 50 during normal operation. Raise it only for controlled backfills, then restore the default to avoid overwhelming the portal.

---

## 2. Telemetry & Logging

- **Meter**: `StellaOps.Concelier.Connector.CertBund`
- **Counters / histograms**:
  - `certbund.feed.fetch.attempts|success|failures`
  - `certbund.feed.items.count`
  - `certbund.feed.enqueued.count`
  - `certbund.feed.coverage.days`
  - `certbund.detail.fetch.attempts|success|not_modified|failures{reason}`
  - `certbund.parse.success|failures{reason}`
  - `certbund.parse.products.count`, `certbund.parse.cve.count`
  - `certbund.map.success|failures{reason}`
  - `certbund.map.affected.count`, `certbund.map.aliases.count`
- Shared HTTP metrics remain available through `concelier.source.http.*`.

**Structured logs** (all emitted at information level when work occurs):

- `CERT-Bund fetch cycle: … truncated {Truncated}, coverageDays={CoverageDays}`
- `CERT-Bund parse cycle: parsed {Parsed}, failures {Failures}, …`
- `CERT-Bund map cycle: mapped {Mapped}, failures {Failures}, …`

Alerting ideas:

1. `increase(certbund.detail.fetch.failures_total[10m]) > 0`
2. `rate(certbund.map.success_total[30m]) == 0`
3. `histogram_quantile(0.95, rate(concelier_source_http_duration_bucket{concelier_source="cert-bund"}[15m])) > 5s`

The WebService now registers the meter, so metrics surface automatically once OpenTelemetry metrics are enabled.

---

## 3. Historical Backfill & Export Strategy

### 3.1 Retention snapshot

- RSS window: ~250 advisories (≈90 days at the current cadence).
- Older advisories are accessible through the JSON search/export APIs once the anti-CSRF token is supplied.

### 3.2 JSON search pagination

```bash
# 1. Bootstrap cookies (client_config + XSRF-TOKEN)
curl -s -c cookies.txt "https://wid.cert-bund.de/portal/" > /dev/null
curl -s -b cookies.txt -c cookies.txt \
  -H "X-Requested-With: XMLHttpRequest" \
  "https://wid.cert-bund.de/portal/api/security/csrf" > /dev/null

XSRF=$(awk '/XSRF-TOKEN/ {print $7}' cookies.txt)

# 2. Page search results
curl -s -b cookies.txt \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "X-XSRF-TOKEN: ${XSRF}" \
  -X POST \
  --data '{"page":4,"size":100,"sort":["published,desc"]}' \
  "https://wid.cert-bund.de/portal/api/securityadvisory/search" \
  > certbund-page4.json
```

Iterate `page` until the response `content` array is empty. Pages 0–9 currently cover 2014→present. Persist the JSON responses (plus SHA256) for Offline Kit parity.

> **Shortcut** – run `python tools/certbund_offline_snapshot.py --output seed-data/cert-bund`
> to bootstrap the session, capture the paginated search responses, and regenerate
> the manifest/checksum files automatically. Supply `--cookie-file` and `--xsrf-token`
> if the portal requires a browser-derived session (see options via `--help`).

### 3.3 Export bundles

```bash
python tools/certbund_offline_snapshot.py \
  --output seed-data/cert-bund \
  --start-year 2014 \
  --end-year "$(date -u +%Y)"
```

The helper stores yearly exports under `seed-data/cert-bund/export/`, captures paginated search snapshots in `seed-data/cert-bund/search/`, and generates the manifest + SHA files in `seed-data/cert-bund/manifest/`. Split ranges according to your compliance window (default: one file per calendar year). Concelier can ingest these JSON payloads directly when operating offline.

> When automatic bootstrap fails (e.g. the portal introduces a CAPTCHA), run the
> manual `curl` flow above, then rerun the helper with `--skip-fetch` to
> rebuild the manifest from the existing files.

### 3.4 Connector-driven catch-up

1. Temporarily raise `maxAdvisoriesPerFetch` (e.g. 150) and reduce `requestDelay`.
2. Run `stella db jobs run source:cert-bund:fetch --and-then source:cert-bund:parse --and-then source:cert-bund:map` until the fetch log reports `enqueued=0`.
3. Restore the defaults and capture the cursor snapshot for audit.

---

## 4. Locale & Translation Guidance

- Advisories remain in German (`language: "de"`). Preserve the wording for provenance and legal accuracy.
- UI localisation: enable the translation bundles documented in `docs/15_UI_GUIDE.md` if English UI copy is required. Operators can overlay machine or human translations, but the canonical database stores the source text.
- The Docs guild is compiling a CERT-Bund terminology glossary under `docs/locale/certbund-glossary.md` so downstream teams can reference consistent English equivalents without altering the stored advisories.

---

## 5. Verification Checklist

1. Observe `certbund.feed.fetch.success` and `certbund.detail.fetch.success` increments after runs; `certbund.feed.coverage.days` should hover near the observed RSS window.
2. Ensure summary logs report `truncated=false` in steady state; `true` indicates the fetch cap was hit.
3. During backfills, watch `certbund.feed.enqueued.count` trend to zero.
4. Spot-check stored advisories in Mongo to confirm `language="de"` and that reference URLs match the portal detail endpoint.
5. For Offline Kit exports, validate SHA256 hashes before distribution, as sketched below.
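A minimal verification sketch for item 5, assuming the snapshot helper leaves `*.sha256` files in `seed-data/cert-bund/manifest/` with paths relative to that directory (the exact file layout is an assumption):

```bash
cd seed-data/cert-bund/manifest
# Verify every recorded digest before the bundle leaves the staging host
for sums in *.sha256; do
  sha256sum -c "${sums}"
done
```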
@@ -1,94 +0,0 @@
# Concelier Cisco PSIRT Connector – OAuth Provisioning SOP

_Last updated: 2025-10-14_

## 1. Scope

This runbook describes how Ops provisions, rotates, and distributes Cisco PSIRT openVuln OAuth client credentials for the Concelier Cisco connector. It covers online and air-gapped (Offline Kit) environments, quota-aware execution, and escalation paths.

## 2. Prerequisites

- Active Cisco.com (CCO) account with access to the Cisco API Console.
- Cisco PSIRT openVuln API entitlement (visible under "My Apps & Keys" once granted).
- Concelier configuration location (typically `/etc/stella/concelier.yaml` in production) or the Offline Kit secret bundle staging directory.

## 3. Provisioning workflow

1. **Register the application**
   - Sign in at <https://apiconsole.cisco.com>.
   - Select **Register a New App** → Application Type: `Service`, Grant Type: `Client Credentials`, API: `Cisco PSIRT openVuln API`.
   - Record the generated `clientId` and `clientSecret` in the Ops vault.
2. **Verify token issuance**
   - Request an access token with:

     ```bash
     curl -s https://id.cisco.com/oauth2/default/v1/token \
       -H "Content-Type: application/x-www-form-urlencoded" \
       -d "grant_type=client_credentials" \
       -d "client_id=${CLIENT_ID}" \
       -d "client_secret=${CLIENT_SECRET}"
     ```

   - Confirm HTTP 200 and an `expires_in` value of 3600 seconds (tokens live for one hour).
   - Preserve the response only long enough to validate syntax; do **not** persist tokens.
3. **Authorize Concelier runtime**
   - Update `concelier:sources:cisco:auth` (or the module-specific secret template) with the stored credentials.
   - For Offline Kit delivery, export encrypted secrets into `offline-kit/secrets/cisco-openvuln.json` using the platform's sealed secret format.
4. **Connectivity validation**
   - From the Concelier control plane, run `stella db jobs run source:vndr-cisco:fetch --dry-run`.
   - Ensure the Source HTTP diagnostics record `Bearer` authorization headers and no 401/403 responses.

## 4. Rotation SOP

| Step | Owner | Notes |
| --- | --- | --- |
| 1. Schedule rotation | Ops (monthly board) | Rotate every 90 days or immediately after suspected credential exposure. |
| 2. Create replacement app | Ops | Repeat §3.1 with a "-next" suffix; verify token issuance. |
| 3. Stage dual credentials | Ops + Concelier On-Call | Publish new credentials to the secret store alongside the current pair. |
| 4. Cut over | Concelier On-Call | Restart connector workers during a low-traffic window (<10 min) to pick up the new secret. |
| 5. Deactivate legacy app | Ops | Delete the prior app in the Cisco API Console once telemetry confirms successful fetch/parse cycles for 2 consecutive hours. |

**Automation hooks**

- Rotation reminders are tracked on the OpsRunbookOps board (`OPS-RUN-KEYS` swim lane); add checklist items for Concelier Cisco when opening a rotation task.
- Use the secret management pipeline (`ops/secrets/rotate.sh --connector cisco`) to template vault updates; the script renders a redacted diff for audit.

## 5. Offline Kit packaging

1. Generate the credential bundle using the Offline Kit CLI:
   `offline-kit secrets add cisco-openvuln --client-id … --client-secret …`
2. Store the encrypted payload under `offline-kit/secrets/cisco-openvuln.enc`.
3. Distribute via the Offline Kit channel; update `offline-kit/MANIFEST.md` with the credential fingerprint (SHA256 of the plaintext concatenated with metadata).
4. Document validation steps for the receiving site (token request from an air-gapped relay or a cached token mirror).

## 6. Quota and throttling guidance

- Cisco enforces combined limits of 5 requests/second, 30 requests/minute, and 5 000 requests/day per application.
- Concelier fetch jobs must respect `Retry-After` headers on HTTP 429 responses (see the sketch after this list); Ops should monitor for sustained quota saturation and consider paging window adjustments.
- Telemetry to watch: `concelier.source.http.requests{concelier.source="vndr-cisco"}`, `concelier.source.http.failures{...}`, and connector-specific metrics once implemented.
|
||||
|
||||
## 7. Telemetry & Monitoring
|
||||
|
||||
- **Metrics (Meter `StellaOps.Concelier.Connector.Vndr.Cisco`)**
|
||||
- `cisco.fetch.documents`, `cisco.fetch.failures`, `cisco.fetch.unchanged`
|
||||
- `cisco.parse.success`, `cisco.parse.failures`
|
||||
- `cisco.map.success`, `cisco.map.failures`, `cisco.map.affected.packages`
|
||||
- **Shared HTTP metrics** via `SourceDiagnostics`:
|
||||
- `concelier.source.http.requests{concelier.source="vndr-cisco"}`
|
||||
- `concelier.source.http.failures{concelier.source="vndr-cisco"}`
|
||||
- `concelier.source.http.duration{concelier.source="vndr-cisco"}`
|
||||
- **Structured logs**
|
||||
- `Cisco fetch completed date=… pages=… added=…` (info)
|
||||
- `Cisco parse completed parsed=… failures=…` (info)
|
||||
- `Cisco map completed mapped=… failures=…` (info)
|
||||
- Warnings surface when DTO serialization fails or GridFS payload is missing.
|
||||
- Suggested alerts: non-zero `cisco.fetch.failures` in 15m, or `cisco.map.success` flatlines while fetch continues.
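One way to encode the suggested alerts as Prometheus rules, assuming the OTEL counters above are exported with dots translated to underscores and a `_total` suffix (verify the exact series names your exporter emits):

```yaml
groups:
  - name: concelier-cisco-connector
    rules:
      - alert: CiscoFetchFailures
        expr: increase(cisco_fetch_failures_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Cisco openVuln fetch failures in the last 15 minutes"
      - alert: CiscoMapFlatline
        expr: rate(cisco_map_success_total[30m]) == 0 and rate(cisco_fetch_documents_total[30m]) > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Cisco mapping flatlined while fetch continues"
```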
|
||||
|
||||
## 8. Incident response
|
||||
|
||||
- **Token compromise** – revoke the application in the Cisco API Console, purge cached secrets, rotate immediately per §4.
|
||||
- **Persistent 401/403** – confirm credentials in vault, then validate token issuance; if unresolved, open a Cisco DevNet support ticket referencing the application ID.
|
||||
- **429 spikes** – inspect job scheduler cadence and adjust connector options (`maxRequestsPerWindow`) before requesting higher quotas from Cisco.
|
||||
|
||||
## 9. References
|
||||
|
||||
- Cisco PSIRT openVuln API Authentication Guide.
|
||||
- Accessing the openVuln API using curl (token lifetime).
|
||||
- openVuln API rate limit documentation.
|
||||
@@ -1,160 +0,0 @@
|
||||
# Concelier Conflict Resolution Runbook (Sprint 3)
|
||||
|
||||
This runbook equips Concelier operators to detect, triage, and resolve advisory conflicts now that the Sprint 3 merge engine landed (`AdvisoryPrecedenceMerger`, merge-event hashing, and telemetry counters). It builds on the canonical rules defined in `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md` and the metrics/logging instrumentation delivered this sprint.
|
||||
|
||||
---
|
||||
|
||||
## 1. Precedence Model (recap)
|
||||
|
||||
- **Default ranking:** `GHSA -> NVD -> OSV`, with distro/vendor PSIRTs outranking ecosystem feeds (`AdvisoryPrecedenceDefaults`). Use `concelier:merge:precedence:ranks` to override per source when incident response requires it.
|
||||
- **Freshness override:** if a lower-ranked source is >= 48 hours newer for a freshness-sensitive field (title, summary, affected ranges, references, credits), it wins. Every override stamps `provenance[].decisionReason = freshness`.
|
||||
- **Tie-breakers:** when precedence and freshness tie, the engine falls back to (1) primary source order, (2) shortest normalized text, (3) lowest stable hash. Merge-generated provenance records set `decisionReason = tie-breaker`.
|
||||
- **Audit trail:** each merged advisory receives a `merge` provenance entry listing the participating sources plus a `merge_event` record with canonical before/after SHA-256 hashes.
|
||||
|
||||
---
|
||||
|
||||
## 2. Telemetry Shipped This Sprint
|
||||
|
||||
| Instrument | Type | Key Tags | Purpose |
|
||||
|------------|------|----------|---------|
|
||||
| `concelier.merge.operations` | Counter | `inputs` | Total precedence merges executed. |
|
||||
| `concelier.merge.overrides` | Counter | `primary_source`, `suppressed_source`, `primary_rank`, `suppressed_rank` | Field-level overrides chosen by precedence. |
|
||||
| `concelier.merge.range_overrides` | Counter | `advisory_key`, `package_type`, `primary_source`, `suppressed_source`, `primary_range_count`, `suppressed_range_count` | Package range overrides emitted by `AffectedPackagePrecedenceResolver`. |
|
||||
| `concelier.merge.conflicts` | Counter | `type` (`severity`, `precedence_tie`), `reason` (`mismatch`, `primary_missing`, `equal_rank`) | Conflicts requiring operator review. |
|
||||
| `concelier.merge.identity_conflicts` | Counter | `scheme`, `alias_value`, `advisory_count` | Alias collisions surfaced by the identity graph. |
|
||||
|
||||
### Structured logs
|
||||
|
||||
- `AdvisoryOverride` (EventId 1000) - logs merge suppressions with alias/provenance counts.
|
||||
- `PackageRangeOverride` (EventId 1001) - logs package-level precedence decisions.
|
||||
- `PrecedenceConflict` (EventId 1002) - logs mismatched severity or equal-rank scenarios.
|
||||
- `Alias collision ...` (no EventId) - emitted when `concelier.merge.identity_conflicts` increments.
|
||||
|
||||
Expect all logs at `Information`. Ensure OTEL exporters include the scope `StellaOps.Concelier.Merge`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Detection & Alerting
|
||||
|
||||
1. **Dashboard panels**
|
||||
- `concelier.merge.conflicts` - table grouped by `type/reason`. Alert when > 0 in a 15 minute window.
|
||||
- `concelier.merge.range_overrides` - stacked bar by `package_type`. Spikes highlight vendor PSIRT overrides over registry data.
|
||||
- `concelier.merge.overrides` with `primary_source|suppressed_source` - catches unexpected precedence flips (e.g., OSV overtaking GHSA).
|
||||
- `concelier.merge.identity_conflicts` - single-stat; alert when alias collisions occur more than once per day.
|
||||
2. **Log based alerts**
|
||||
- `eventId=1002` with `reason="equal_rank"` - indicates precedence table gaps; page merge owners.
|
||||
- `eventId=1002` with `reason="mismatch"` - severity disagreement; open connector bug if sustained.
|
||||
3. **Job health**
|
||||
- `stellaops-cli db merge` exit code `1` signifies unresolved conflicts. Pipe to automation that captures logs and notifies #concelier-ops.
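A bash sketch for the job-health hook in step 3. It assumes `stellaops-cli` is on `PATH` and that `SLACK_WEBHOOK_URL` points at the #concelier-ops incoming webhook; both are deployment-specific assumptions:

```bash
#!/usr/bin/env bash
set -uo pipefail

log_file="$(mktemp /tmp/concelier-merge.XXXXXX)"

# Exit code 1 signals unresolved conflicts: keep the log and notify the channel.
if ! stellaops-cli db merge --verbose >"$log_file" 2>&1; then
  payload="$(jq -n --arg text "Concelier merge reported conflicts; log saved at ${log_file}" '{text: $text}')"
  curl -sS -X POST -H 'Content-Type: application/json' -d "$payload" "$SLACK_WEBHOOK_URL"
fi
```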
|
||||
|
||||
### Threshold updates (2025-10-12)
|
||||
|
||||
- `concelier.merge.conflicts` – Page only when ≥ 2 events fire within 30 minutes; the synthetic conflict fixture run produces 0 conflicts, so the first event now routes to Slack for manual review instead of paging.
|
||||
- `concelier.merge.overrides` – Raise a warning when the 30-minute sum exceeds 10 (canonical triple yields exactly 1 summary override with `primary_source=osv`, `suppressed_source=ghsa`).
|
||||
- `concelier.merge.range_overrides` – Maintain the 15-minute alert at ≥ 3 but annotate dashboards that the regression triple emits a single `package_type=semver` override so ops can spot unexpected spikes.
|
||||
|
||||
---
|
||||
|
||||
## 4. Triage Workflow
|
||||
|
||||
1. **Confirm job context**
|
||||
- `stellaops-cli db merge` (CLI) or `POST /jobs/merge:reconcile` (API) to rehydrate the merge job. Use `--verbose` to stream structured logs during triage.
|
||||
2. **Inspect metrics**
|
||||
- Correlate spikes in `concelier.merge.conflicts` with `primary_source`/`suppressed_source` tags from `concelier.merge.overrides`.
|
||||
3. **Pull structured logs**
|
||||
- Example (vector output):
|
||||
```
|
||||
jq 'select(.EventId.Name=="PrecedenceConflict") | {advisory: .State[0].Value, type: .ConflictType, reason: .Reason, primary: .PrimarySources, suppressed: .SuppressedSources}' stellaops-concelier.log
|
||||
```
|
||||
4. **Review merge events**
|
||||
- `mongosh`:
|
||||
```javascript
|
||||
use concelier;
|
||||
db.merge_event.find({ advisoryKey: "CVE-2025-1234" }).sort({ mergedAt: -1 }).limit(5);
|
||||
```
|
||||
- Compare `beforeHash` vs `afterHash` to confirm the merge actually changed canonical output.
|
||||
5. **Interrogate provenance**
|
||||
- `db.advisories.findOne({ advisoryKey: "CVE-2025-1234" }, { title: 1, severity: 1, provenance: 1, "affectedPackages.provenance": 1 })`
|
||||
- Check `provenance[].decisionReason` values (`precedence`, `freshness`, `tie-breaker`) to understand why the winning field was chosen.
|
||||
|
||||
---
|
||||
|
||||
## 5. Conflict Classification Matrix
|
||||
|
||||
| Signal | Likely Cause | Immediate Action |
|
||||
|--------|--------------|------------------|
|
||||
| `reason="mismatch"` with `type="severity"` | Upstream feeds disagree on CVSS vector/severity. | Verify which feed is freshest; if correctness is known, adjust connector mapping or precedence override. |
|
||||
| `reason="primary_missing"` | Higher-ranked source lacks the field entirely. | Backfill connector data or temporarily allow lower-ranked source via precedence override. |
|
||||
| `reason="equal_rank"` | Two feeds share the same precedence rank (custom config or missing entry). | Update `concelier:merge:precedence:ranks` to break the tie; restart merge job. |
|
||||
| Rising `concelier.merge.range_overrides` for a package type | Vendor PSIRT now supplies richer ranges. | Validate connectors emit `decisionReason="precedence"` and update dashboards to treat registry ranges as fallback. |
|
||||
| `concelier.merge.identity_conflicts` > 0 | Alias scheme mapping produced collisions (duplicate CVE <-> advisory pairs). | Inspect `Alias collision` log payload; reconcile the alias graph by adjusting connector alias output. |
|
||||
|
||||
---
|
||||
|
||||
## 6. Resolution Playbook
|
||||
|
||||
1. **Connector data fix**
|
||||
- Re-run the offending connector stages (`stellaops-cli db fetch --source ghsa --stage map` etc.).
|
||||
- Once fixed, rerun merge and verify `decisionReason` reflects `freshness` or `precedence` as expected.
|
||||
2. **Temporary precedence override**
|
||||
- Edit `etc/concelier.yaml`:
|
||||
```yaml
|
||||
concelier:
|
||||
merge:
|
||||
precedence:
|
||||
ranks:
|
||||
osv: 1
|
||||
ghsa: 0
|
||||
```
|
||||
- Restart Concelier workers; confirm tags in `concelier.merge.overrides` show the new ranks.
|
||||
- Document the override with expiry in the change log.
|
||||
3. **Alias remediation**
|
||||
- Update connector mapping rules to weed out duplicate aliases (e.g., skip GHSA aliases that mirror CVE IDs).
|
||||
- Flush cached alias graphs if necessary (`db.alias_graph.drop()` is destructive; coordinate with Storage before issuing it).
|
||||
4. **Escalation**
|
||||
- If override metrics spike due to upstream regression, open an incident with Security Guild, referencing merge logs and `merge_event` IDs.
|
||||
|
||||
---
|
||||
|
||||
## 7. Validation Checklist
|
||||
|
||||
- [ ] Merge job rerun returns exit code `0`.
|
||||
- [ ] `concelier.merge.conflicts` baseline returns to zero after corrective action.
|
||||
- [ ] Latest `merge_event` entry shows expected hash delta.
|
||||
- [ ] Affected advisory document shows updated `provenance[].decisionReason`.
|
||||
- [ ] Ops change log updated with incident summary, config overrides, and rollback plan.
|
||||
|
||||
---
|
||||
|
||||
## 8. Reference Material
|
||||
|
||||
- Canonical conflict rules: `src/DEDUP_CONFLICTS_RESOLUTION_ALGO.md`.
|
||||
- Merge engine internals: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryPrecedenceMerger.cs`.
|
||||
- Metrics definitions: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/AdvisoryMergeService.cs` (identity conflicts) and `AdvisoryPrecedenceMerger`.
|
||||
- Storage audit trail: `src/Concelier/__Libraries/StellaOps.Concelier.Merge/Services/MergeEventWriter.cs`, `src/Concelier/__Libraries/StellaOps.Concelier.Storage.Mongo/MergeEvents`.
|
||||
|
||||
Keep this runbook synchronized with future sprint notes and update alert thresholds as baseline volumes change.
|
||||
|
||||
---
|
||||
|
||||
## 9. Synthetic Regression Fixtures
|
||||
|
||||
- **Locations** – Canonical conflict snapshots now live at `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/Fixtures/conflict-ghsa.canonical.json`, `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/Nvd/Fixtures/conflict-nvd.canonical.json`, and `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/Fixtures/conflict-osv.canonical.json`.
|
||||
- **Validation commands** – To regenerate and verify the fixtures offline, run:
|
||||
|
||||
```bash
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ghsa.Tests/StellaOps.Concelier.Connector.Ghsa.Tests.csproj --filter GhsaConflictFixtureTests
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Nvd.Tests/StellaOps.Concelier.Connector.Nvd.Tests.csproj --filter NvdConflictFixtureTests
|
||||
dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj --filter OsvConflictFixtureTests
|
||||
dotnet test src/Concelier/__Tests/StellaOps.Concelier.Merge.Tests/StellaOps.Concelier.Merge.Tests.csproj --filter MergeAsync_AppliesCanonicalRulesAndPersistsDecisions
|
||||
```
|
||||
|
||||
- **Expected signals** – The triple produces one freshness-driven summary override (`primary_source=osv`, `suppressed_source=ghsa`) and one range override for the npm SemVer package while leaving `concelier.merge.conflicts` at zero. Use these values as the baseline when tuning dashboards or load-testing alert pipelines.
|
||||
|
||||
---
|
||||
|
||||
## 10. Change Log
|
||||
|
||||
| Date (UTC) | Change | Notes |
|
||||
|------------|--------|-------|
|
||||
| 2025-10-16 | Ops review signed off after connector expansion (CCCS, CERT-Bund, KISA, ICS CISA, MSRC) landed. Alert thresholds from §3 reaffirmed; dashboards updated to watch attachment signals emitted by ICS CISA connector. | Ops sign-off recorded by Concelier Ops Guild; no additional overrides required. |
|
||||
@@ -1,151 +0,0 @@
|
||||
{
|
||||
"title": "Concelier CVE & KEV Observability",
|
||||
"uid": "concelier-cve-kev",
|
||||
"schemaVersion": 38,
|
||||
"version": 1,
|
||||
"editable": true,
|
||||
"timezone": "",
|
||||
"time": {
|
||||
"from": "now-24h",
|
||||
"to": "now"
|
||||
},
|
||||
"refresh": "5m",
|
||||
"templating": {
|
||||
"list": [
|
||||
{
|
||||
"name": "datasource",
|
||||
"type": "datasource",
|
||||
"query": "prometheus",
|
||||
"refresh": 1,
|
||||
"hide": 0
|
||||
}
|
||||
]
|
||||
},
|
||||
"panels": [
|
||||
{
|
||||
"type": "timeseries",
|
||||
"title": "CVE fetch success vs failure",
|
||||
"gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"drawStyle": "line",
|
||||
"lineWidth": 2,
|
||||
"fillOpacity": 10
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "rate(cve_fetch_success_total[5m])",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" },
|
||||
"legendFormat": "success"
|
||||
},
|
||||
{
|
||||
"refId": "B",
|
||||
"expr": "rate(cve_fetch_failures_total[5m])",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" },
|
||||
"legendFormat": "failure"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"type": "timeseries",
|
||||
"title": "KEV fetch cadence",
|
||||
"gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"drawStyle": "line",
|
||||
"lineWidth": 2,
|
||||
"fillOpacity": 10
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "rate(kev_fetch_success_total[30m])",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" },
|
||||
"legendFormat": "success"
|
||||
},
|
||||
{
|
||||
"refId": "B",
|
||||
"expr": "rate(kev_fetch_failures_total[30m])",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" },
|
||||
"legendFormat": "failure"
|
||||
},
|
||||
{
|
||||
"refId": "C",
|
||||
"expr": "rate(kev_fetch_unchanged_total[30m])",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" },
|
||||
"legendFormat": "unchanged"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"type": "table",
|
||||
"title": "KEV parse anomalies (24h)",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 9 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "sum by (reason) (increase(kev_parse_anomalies_total[24h]))",
|
||||
"format": "table",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" }
|
||||
}
|
||||
],
|
||||
"transformations": [
|
||||
{
|
||||
"id": "organize",
|
||||
"options": {
|
||||
"renameByName": {
|
||||
"Value": "count"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"type": "timeseries",
|
||||
"title": "Advisories emitted",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 9 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"drawStyle": "line",
|
||||
"lineWidth": 2,
|
||||
"fillOpacity": 10
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "rate(cve_map_success_total[15m])",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" },
|
||||
"legendFormat": "CVE"
|
||||
},
|
||||
{
|
||||
"refId": "B",
|
||||
"expr": "rate(kev_map_advisories_total[24h])",
|
||||
"datasource": { "type": "prometheus", "uid": "${datasource}" },
|
||||
"legendFormat": "KEV"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,143 +0,0 @@
|
||||
# Concelier CVE & KEV Connector Operations
|
||||
|
||||
This playbook equips operators with the steps required to roll out and monitor the CVE Services and CISA KEV connectors across environments.
|
||||
|
||||
## 1. CVE Services Connector (`source:cve:*`)
|
||||
|
||||
### 1.1 Prerequisites
|
||||
|
||||
- CVE Services API credentials (organisation ID, user ID, API key) with access to the JSON 5 API.
|
||||
- Network egress to `https://cveawg.mitre.org` (or a mirrored endpoint) from the Concelier workers.
|
||||
- Updated `concelier.yaml` (or the matching environment variables) with the following section:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
cve:
|
||||
baseEndpoint: "https://cveawg.mitre.org/api/"
|
||||
apiOrg: "ORG123"
|
||||
apiUser: "user@example.org"
|
||||
apiKeyFile: "/var/run/secrets/concelier/cve-api-key"
|
||||
seedDirectory: "./seed-data/cve"
|
||||
pageSize: 200
|
||||
maxPagesPerFetch: 5
|
||||
initialBackfill: "30.00:00:00"
|
||||
requestDelay: "00:00:00.250"
|
||||
failureBackoff: "00:10:00"
|
||||
```
|
||||
|
||||
> ℹ️ Store the API key outside source control. When using `apiKeyFile`, mount the secret file into the container/host; alternatively supply `apiKey` via `CONCELIER_SOURCES__CVE__APIKEY`.
|
||||
|
||||
> 🪙 When credentials are not yet available, configure `seedDirectory` to point at mirrored CVE JSON (for example, the repo’s `seed-data/cve/` bundle). The connector will ingest those records and log a warning instead of failing the job; live fetching resumes automatically once `apiOrg` / `apiUser` / `apiKey` are supplied.
|
||||
|
||||
### 1.2 Smoke Test (staging)
|
||||
|
||||
1. Deploy the updated configuration and restart the Concelier service so the connector picks up the credentials.
|
||||
2. Trigger one end-to-end cycle:
|
||||
- Concelier CLI: `stella db jobs run source:cve:fetch --and-then source:cve:parse --and-then source:cve:map`
|
||||
- REST fallback: `POST /jobs/run { "kind": "source:cve:fetch", "chain": ["source:cve:parse", "source:cve:map"] }`
|
||||
3. Observe the following metrics (exported via OTEL meter `StellaOps.Concelier.Connector.Cve`):
|
||||
- `cve.fetch.attempts`, `cve.fetch.success`, `cve.fetch.documents`, `cve.fetch.failures`, `cve.fetch.unchanged`
|
||||
- `cve.parse.success`, `cve.parse.failures`, `cve.parse.quarantine`
|
||||
- `cve.map.success`
|
||||
4. Verify Prometheus shows matching `concelier.source.http.requests_total{concelier_source="cve"}` deltas (list vs detail phases) while `concelier.source.http.failures_total{concelier_source="cve"}` stays flat.
|
||||
5. Confirm the info-level summary log `CVEs fetch window … pages=X detailDocuments=Y detailFailures=Z` appears once per fetch run and shows `detailFailures=0`.
|
||||
6. Verify the MongoDB advisory store contains fresh CVE advisories (`advisoryKey` prefix `cve/`) and that the source cursor (`source_states` collection) advanced.
|
||||
|
||||
### 1.3 Production Monitoring
|
||||
|
||||
- **Dashboards** – Plot `rate(cve_fetch_success_total[5m])`, `rate(cve_fetch_failures_total[5m])`, and `rate(cve_fetch_documents_total[5m])` alongside `concelier_source_http_requests_total{concelier_source="cve"}` to confirm HTTP and connector counters stay aligned. Keep `concelier.range.primitives{scheme=~"semver|vendor"}` on the same board for range coverage. Example alerts:
|
||||
- `rate(cve_fetch_failures_total[5m]) > 0` for 10 minutes (`severity=warning`)
|
||||
- `rate(cve_map_success_total[15m]) == 0` while `rate(cve_fetch_success_total[15m]) > 0` (`severity=critical`)
|
||||
- `sum_over_time(cve_parse_quarantine_total[1h]) > 0` to catch schema anomalies
|
||||
- **Logs** – Monitor warnings such as `Failed fetching CVE record {CveId}` and `Malformed CVE JSON`, and surface the summary info log `CVEs fetch window … detailFailures=0 detailUnchanged=0` on dashboards. A non-zero `detailFailures` usually indicates rate-limit or auth issues on detail requests.
|
||||
- **Grafana pack** – Import `docs/ops/concelier-cve-kev-grafana-dashboard.json` and filter by panel legend (`CVE`, `KEV`) to reuse the canned layout.
|
||||
- **Backfill window** – Operators can tighten or widen `initialBackfill` / `maxPagesPerFetch` after validating throughput. Update config and restart Concelier to apply changes.
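The example alerts above, expressed as a Prometheus rules sketch (series names assume the usual dots-to-underscores/`_total` translation; adjust to what your exporter actually emits):

```yaml
groups:
  - name: concelier-cve-connector
    rules:
      - alert: CveFetchFailures
        expr: rate(cve_fetch_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
      - alert: CveMapStalled
        expr: rate(cve_map_success_total[15m]) == 0 and rate(cve_fetch_success_total[15m]) > 0
        labels:
          severity: critical
      - alert: CveParseQuarantine
        expr: sum_over_time(cve_parse_quarantine_total[1h]) > 0
        labels:
          severity: warning
```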
|
||||
|
||||
### 1.4 Staging smoke log (2025-10-15)
|
||||
|
||||
While Ops finalises long-lived CVE Services credentials, we validated the connector end-to-end against the recorded CVE-2024-0001 payloads used in regression tests:
|
||||
|
||||
- Command: `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Cve.Tests/StellaOps.Concelier.Connector.Cve.Tests.csproj -l "console;verbosity=detailed"`
|
||||
- Summary log emitted by the connector:
|
||||
```
|
||||
CVEs fetch window 2024-09-01T00:00:00Z->2024-10-01T00:00:00Z pages=1 listSuccess=1 detailDocuments=1 detailFailures=0 detailUnchanged=0 pendingDocuments=0->1 pendingMappings=0->1 hasMorePages=False nextWindowStart=2024-09-15T12:00:00Z nextWindowEnd=(none) nextPage=1
|
||||
```
|
||||
- Telemetry captured by `Meter` `StellaOps.Concelier.Connector.Cve`:
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| `cve.fetch.attempts` | 1 |
|
||||
| `cve.fetch.success` | 1 |
|
||||
| `cve.fetch.documents` | 1 |
|
||||
| `cve.parse.success` | 1 |
|
||||
| `cve.map.success` | 1 |
|
||||
|
||||
The Grafana pack `docs/ops/concelier-cve-kev-grafana-dashboard.json` has been imported into staging so the panels referenced above render against these counters once the live API keys are in place.
|
||||
|
||||
## 2. CISA KEV Connector (`source:kev:*`)
|
||||
|
||||
### 2.1 Prerequisites
|
||||
|
||||
- Network egress (or mirrored content) for `https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`.
|
||||
- No credentials are required, but the HTTP allow-list must include `www.cisa.gov`.
|
||||
- Confirm the following snippet in `concelier.yaml` (defaults shown; tune as needed):
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
kev:
|
||||
feedUri: "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
|
||||
requestTimeout: "00:01:00"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
### 2.2 Schema validation & anomaly handling
|
||||
|
||||
The connector validates each catalog against `Schemas/kev-catalog.schema.json`. Failures increment `kev.parse.failures_total{reason="schema"}` and the document is quarantined (status `Failed`). Additional failure reasons include `download`, `invalidJson`, `deserialize`, `missingPayload`, and `emptyCatalog`. Entry-level anomalies are surfaced through `kev.parse.anomalies_total` with reasons:
|
||||
|
||||
| Reason | Meaning |
|
||||
| --- | --- |
|
||||
| `missingCveId` | Catalog entry omitted `cveID`; the entry is skipped. |
|
||||
| `countMismatch` | Catalog `count` field disagreed with the actual entry total. |
|
||||
| `nullEntry` | Upstream emitted a `null` entry object (rare upstream defect). |
|
||||
|
||||
Treat repeated schema failures or growing anomaly counts as an upstream regression and coordinate with CISA or mirror maintainers.
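When `countMismatch` fires, a quick manual check against the live feed (or a mirrored copy) shows whether the discrepancy originates upstream; this assumes `jq` is available on the operator host:

```bash
curl -s https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json \
  | jq '{declared: .count, actual: (.vulnerabilities | length)}'
```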
|
||||
|
||||
### 2.3 Smoke Test (staging)
|
||||
|
||||
1. Deploy the configuration and restart Concelier.
|
||||
2. Trigger a pipeline run:
|
||||
- CLI: `stella db jobs run source:kev:fetch --and-then source:kev:parse --and-then source:kev:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:kev:fetch", "chain": ["source:kev:parse", "source:kev:map"] }`
|
||||
3. Verify the metrics exposed by meter `StellaOps.Concelier.Connector.Kev`:
|
||||
- `kev.fetch.attempts`, `kev.fetch.success`, `kev.fetch.unchanged`, `kev.fetch.failures`
|
||||
- `kev.parse.entries` (tag `catalogVersion`), `kev.parse.failures`, `kev.parse.anomalies` (tag `reason`)
|
||||
- `kev.map.advisories` (tag `catalogVersion`)
|
||||
4. Confirm `concelier.source.http.requests_total{concelier_source="kev"}` increments once per fetch and that the paired `concelier.source.http.failures_total` stays flat (zero increase).
|
||||
5. Inspect the info logs `Fetched KEV catalog document … pendingDocuments=…` and `Parsed KEV catalog document … entries=…`—they should appear exactly once per run and `Mapped X/Y… skipped=0` should match the `kev.map.advisories` delta.
|
||||
6. Confirm MongoDB documents exist for the catalog JSON (`raw_documents` & `dtos`) and that advisories with prefix `kev/` are written.
|
||||
|
||||
### 2.4 Production Monitoring
|
||||
|
||||
- Alert when `rate(kev_fetch_success_total[8h]) == 0` during working hours (daily cadence breach) and when `increase(kev_fetch_failures_total[1h]) > 0`.
|
||||
- Page the on-call if `increase(kev_parse_failures_total{reason="schema"}[6h]) > 0`—this usually signals an upstream payload change. Treat repeated `reason="download"` spikes as networking issues to the mirror.
|
||||
- Track anomaly spikes through `sum_over_time(kev_parse_anomalies_total{reason="missingCveId"}[24h])`. Rising `countMismatch` trends point to catalog publishing bugs.
|
||||
- Surface the fetch/mapping info logs (`Fetched KEV catalog document …` and `Mapped X/Y KEV advisories … skipped=S`) on dashboards; absence of those logs while metrics show success typically means schema validation short-circuited the run.
|
||||
|
||||
### 2.5 Known good dashboard tiles
|
||||
|
||||
Add the following panels to the Concelier observability board:
|
||||
|
||||
| Metric | Recommended visualisation |
|
||||
|--------|---------------------------|
|
||||
| `rate(kev_fetch_success_total[30m])` | Single-stat (last 24 h) with warning threshold `>0` |
|
||||
| `rate(kev_parse_entries_total[1h])` by `catalogVersion` | Stacked area – highlights daily release size |
|
||||
| `sum_over_time(kev_parse_anomalies_total[1d])` by `reason` | Table – anomaly breakdown (matches dashboard panel) |
|
||||
| `rate(cve_map_success_total[15m])` vs `rate(kev_map_advisories_total[24h])` | Comparative timeseries for advisories emitted |
|
||||
|
||||
## 3. Runbook updates
|
||||
|
||||
- Record staging/production smoke test results (date, catalog version, advisory counts) in your team’s change log.
|
||||
- Add the CVE/KEV job kinds to the standard maintenance checklist so operators can manually trigger them after planned downtime.
|
||||
- Keep this document in sync with future connector changes (for example, new anomaly reasons or additional metrics).
|
||||
- Version-control dashboard tweaks alongside `docs/ops/concelier-cve-kev-grafana-dashboard.json` so operations can re-import the observability pack during restores.
|
||||
@@ -1,123 +0,0 @@
|
||||
# Concelier GHSA Connector – Operations Runbook
|
||||
|
||||
_Last updated: 2025-10-16_
|
||||
|
||||
## 1. Overview
|
||||
The GitHub Security Advisories (GHSA) connector pulls advisory metadata from the GitHub REST API `/security/advisories` endpoint. GitHub enforces both primary and secondary rate limits, so operators must monitor usage and configure retries to avoid throttling incidents.
|
||||
|
||||
## 2. Rate-limit telemetry
|
||||
The connector now surfaces rate-limit headers on every fetch and exposes the following metrics via OpenTelemetry:
|
||||
|
||||
| Metric | Description | Tags |
|
||||
|--------|-------------|------|
|
||||
| `ghsa.ratelimit.limit` (histogram) | Samples the reported request quota at fetch time. | `phase` = `list` or `detail`, `resource` (e.g., `core`). |
|
||||
| `ghsa.ratelimit.remaining` (histogram) | Remaining requests returned by `X-RateLimit-Remaining`. | `phase`, `resource`. |
|
||||
| `ghsa.ratelimit.reset_seconds` (histogram) | Seconds until `X-RateLimit-Reset`. | `phase`, `resource`. |
|
||||
| `ghsa.ratelimit.headroom_pct` (histogram) | Percentage of the quota still available (`remaining / limit * 100`). | `phase`, `resource`. |
|
||||
| `ghsa.ratelimit.headroom_pct_current` (observable gauge) | Latest headroom percentage reported per resource. | `phase`, `resource`. |
|
||||
| `ghsa.ratelimit.exhausted` (counter) | Incremented whenever GitHub returns a zero remaining quota and the connector delays before retrying. | `phase`. |
|
||||
|
||||
### Dashboards & alerts
|
||||
- Plot `ghsa.ratelimit.remaining` as the latest value to watch the runway. Alert when the value stays below **`RateLimitWarningThreshold`** (default `500`) for more than 5 minutes.
|
||||
- Use `ghsa.ratelimit.headroom_pct_current` to visualise remaining quota % — paging once it sits below **10 %** for longer than a single reset window helps avoid secondary limits.
|
||||
- Raise a separate alert on `increase(ghsa.ratelimit.exhausted[15m]) > 0` to catch hard throttles.
|
||||
- Overlay `ghsa.fetch.attempts` vs `ghsa.fetch.failures` to confirm retries are effective.
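Two of the alert conditions above, written as PromQL (assuming the OTEL gauge/counter names survive export with underscores; confirm against your metrics backend):

```
# Headroom below 10% for a full (assumed 60-minute) reset window.
max_over_time(ghsa_ratelimit_headroom_pct_current[60m]) < 10

# Any hard throttle in the last 15 minutes.
increase(ghsa_ratelimit_exhausted_total[15m]) > 0
```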
|
||||
|
||||
## 3. Logging signals
|
||||
When `X-RateLimit-Remaining` falls below `RateLimitWarningThreshold`, the connector emits:
|
||||
```
|
||||
GHSA rate limit warning: remaining {Remaining}/{Limit} for {Phase} {Resource} (headroom {Headroom}%)
|
||||
```
|
||||
When GitHub reports zero remaining calls, the connector logs and sleeps for the reported `Retry-After`/`X-RateLimit-Reset` interval (falling back to `SecondaryRateLimitBackoff`).
|
||||
|
||||
After the quota recovers above the warning threshold the connector writes an informational log with the refreshed remaining/headroom, letting operators clear alerts quickly.
|
||||
|
||||
## 4. Configuration knobs (`concelier.yaml`)
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
ghsa:
|
||||
apiToken: "${GITHUB_PAT}"
|
||||
pageSize: 50
|
||||
requestDelay: "00:00:00.200"
|
||||
failureBackoff: "00:05:00"
|
||||
rateLimitWarningThreshold: 500 # warn below this many remaining calls
|
||||
secondaryRateLimitBackoff: "00:02:00" # fallback delay when GitHub omits Retry-After
|
||||
```
|
||||
|
||||
### Recommendations
|
||||
- Increase `requestDelay` in air-gapped or burst-heavy deployments to smooth token consumption.
|
||||
- Lower `rateLimitWarningThreshold` only if your dashboards already page on the new histogram; never set it negative.
|
||||
- For bots using a low-privilege PAT, keep `secondaryRateLimitBackoff` at ≥60 seconds to respect GitHub’s secondary-limit guidance.
|
||||
|
||||
#### Default job schedule
|
||||
|
||||
| Job kind | Cron | Timeout | Lease |
|
||||
|----------|------|---------|-------|
|
||||
| `source:ghsa:fetch` | `1,11,21,31,41,51 * * * *` | 6 minutes | 4 minutes |
|
||||
| `source:ghsa:parse` | `3,13,23,33,43,53 * * * *` | 5 minutes | 4 minutes |
|
||||
| `source:ghsa:map` | `5,15,25,35,45,55 * * * *` | 5 minutes | 4 minutes |
|
||||
|
||||
These defaults spread GHSA stages across the hour so fetch completes before parse/map fire. Override them via `concelier.jobs.definitions[...]` when coordinating multiple connectors on the same runner.
|
||||
|
||||
## 5. Provisioning credentials
|
||||
|
||||
Concelier requires a GitHub personal access token (classic) with the **`read:org`** and **`security_events`** scopes to pull GHSA data. Store it as a secret and reference it via `concelier.sources.ghsa.apiToken`.
|
||||
|
||||
### Docker Compose (stack operators)
|
||||
```yaml
|
||||
services:
|
||||
concelier:
|
||||
environment:
|
||||
CONCELIER__SOURCES__GHSA__APITOKEN: /run/secrets/ghsa_pat
|
||||
secrets:
|
||||
- ghsa_pat
|
||||
|
||||
secrets:
|
||||
ghsa_pat:
|
||||
file: ./secrets/ghsa_pat.txt # contains only the PAT value
|
||||
```
|
||||
|
||||
### Helm values (cluster operators)
|
||||
```yaml
|
||||
concelier:
|
||||
extraEnv:
|
||||
- name: CONCELIER__SOURCES__GHSA__APITOKEN
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: concelier-ghsa
|
||||
key: apiToken
|
||||
|
||||
extraSecrets:
|
||||
concelier-ghsa:
|
||||
apiToken: "<paste PAT here or source from external secret store>"
|
||||
```
|
||||
|
||||
After rotating the PAT, restart the Concelier workers (or run `kubectl rollout restart deployment/concelier`) to ensure the configuration reloads.
|
||||
|
||||
When enabling GHSA the first time, run a staged backfill:
|
||||
|
||||
1. Trigger `source:ghsa:fetch` manually (CLI or API) outside of peak hours.
|
||||
2. Watch `concelier.jobs.health` for the GHSA jobs until they report `healthy`.
|
||||
3. Allow the scheduled cron cadence to resume once the initial backlog drains (typically < 30 minutes).
|
||||
|
||||
## 6. Runbook steps when throttled
|
||||
1. Check `ghsa.ratelimit.exhausted` for the affected phase (`list` vs `detail`).
|
||||
2. Confirm the connector is delaying—logs will show `GHSA rate limit exhausted...` with the chosen backoff.
|
||||
3. If rate limits stay exhausted:
|
||||
- Verify no other jobs are sharing the PAT.
|
||||
- Temporarily reduce `MaxPagesPerFetch` or `PageSize` to shrink burst size.
|
||||
- Consider provisioning a dedicated PAT (GHSA permissions only) for Concelier.
|
||||
4. After the quota resets, reset `rateLimitWarningThreshold`/`requestDelay` to their normal values and monitor the histograms for at least one hour.
|
||||
|
||||
## 7. Alert integration quick reference
|
||||
- Prometheus: `ghsa_ratelimit_remaining_bucket` (from histogram) – use `histogram_quantile(0.99, ...)` to trend capacity.
|
||||
- VictoriaMetrics: `LAST_over_time(ghsa_ratelimit_remaining_sum[5m])` for simple last-value graphs.
|
||||
- Grafana: stack remaining + used to visualise total limit per resource.
|
||||
|
||||
## 8. Canonical metric fallback analytics
|
||||
When GitHub omits CVSS vectors/scores, the connector now assigns a deterministic canonical metric id in the form `ghsa:severity/<level>` and publishes it to Merge so severity precedence still resolves against GHSA even without CVSS data.
|
||||
|
||||
- Metric: `ghsa.map.canonical_metric_fallbacks` (counter) with tags `severity`, `canonical_metric_id`, `reason=no_cvss`.
|
||||
- Monitor the counter alongside Merge parity checks; a sudden spike suggests GitHub is shipping advisories without vectors and warrants cross-checking downstream exporters.
|
||||
- Because the canonical id feeds Merge, parity dashboards should overlay this metric to confirm fallback advisories continue to merge ahead of downstream sources when GHSA supplies more recent data.
|
||||
@@ -1,122 +0,0 @@
|
||||
# Concelier CISA ICS Connector Operations
|
||||
|
||||
This runbook documents how to provision, rotate, and validate credentials for the CISA Industrial Control Systems (ICS) connector (`source:ics-cisa:*`). Follow it before enabling the connector in staging or offline installations.
|
||||
|
||||
## 1. Credential Provisioning
|
||||
|
||||
1. **Create a service mailbox** reachable by the Ops crew (shared mailbox recommended).
|
||||
2. Browse to `https://public.govdelivery.com/accounts/USDHSCISA/subscriber/new` and subscribe the mailbox to the following GovDelivery topics:
|
||||
- `USDHSCISA_16` — ICS-CERT advisories (legacy numbering: `ICSA-YY-###`).
|
||||
- `USDHSCISA_19` — ICS medical advisories (`ICSMA-YY-###`).
|
||||
- `USDHSCISA_17` — ICS alerts (`IR-ALERT-YY-###`) for completeness.
|
||||
3. Complete the verification email. After confirmation, note the **personalised subscription code** included in the “Manage Preferences” link. It has the shape `code=AB12CD34EF`.
|
||||
4. Store the code in the shared secret vault (or Offline Kit secrets bundle) as `concelier/sources/icscisa/govdelivery/code`.
|
||||
|
||||
> ℹ️ GovDelivery does not expose a one-time API key; the personalised code is what authenticates the RSS pull. Never commit it to git.
|
||||
|
||||
## 2. Feed Validation
|
||||
|
||||
Use the following command to confirm the feed is reachable before wiring it into Concelier (substitute `<CODE>` with the personalised value):
|
||||
|
||||
```bash
|
||||
curl -H "User-Agent: StellaOpsConcelier/ics-cisa" \
|
||||
"https://content.govdelivery.com/accounts/USDHSCISA/topics/ICS-CERT/feed.rss?format=xml&code=<CODE>"
|
||||
```
|
||||
|
||||
If the endpoint returns HTTP 200 and an RSS payload, record the sample response under `docs/artifacts/icscisa/` (see Task `FEEDCONN-ICSCISA-02-007`). HTTP 403 or 406 usually means the subscription was not confirmed or the code was mistyped.
|
||||
|
||||
## 3. Configuration Snippet
|
||||
|
||||
Add the connector configuration to `concelier.yaml` (or equivalent environment variables):
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
icscisa:
|
||||
govDelivery:
|
||||
code: "${CONCELIER_ICS_CISA_GOVDELIVERY_CODE}"
|
||||
topics:
|
||||
- "USDHSCISA_16"
|
||||
- "USDHSCISA_19"
|
||||
- "USDHSCISA_17"
|
||||
rssBaseUri: "https://content.govdelivery.com/accounts/USDHSCISA"
|
||||
requestDelay: "00:00:01"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
Environment variable example:
|
||||
|
||||
```bash
|
||||
export CONCELIER_SOURCES_ICSCISA_GOVDELIVERY_CODE="AB12CD34EF"
|
||||
```
|
||||
|
||||
Concelier automatically registers the host with the Source.Common HTTP allow-list when the connector assembly is loaded.
|
||||
|
||||
|
||||
Optional tuning keys (set only when needed):
|
||||
|
||||
- `proxyUri` — HTTP/HTTPS proxy URL used when Akamai blocks direct pulls.
|
||||
- `requestVersion` / `requestVersionPolicy` — override HTTP negotiation when the proxy requires HTTP/1.1.
|
||||
- `enableDetailScrape` — toggle HTML detail fallback (defaults to true).
|
||||
- `captureAttachments` — collect PDF attachments from detail pages (defaults to true).
|
||||
- `detailBaseUri` — alternate host for detail enrichment if CISA changes their layout.
|
||||
|
||||
## 4. Seeding Without GovDelivery
|
||||
|
||||
If credentials are still pending, populate the connector with the community CSV dataset before enabling the live fetch:
|
||||
|
||||
1. Run `./scripts/fetch-ics-cisa-seed.sh` (or `.ps1`) to download the latest `CISA_ICS_ADV_*.csv` files into `seed-data/ics-cisa/`.
|
||||
2. Copy the CSVs (and the generated `.sha256` files) into your Offline Kit staging area so they ship alongside the other feeds.
|
||||
3. Import the kit as usual. The connector can parse the seed data for historical context, but **live GovDelivery credentials are still required** for fresh advisories.
|
||||
4. Once credentials arrive, update `concelier:sources:icscisa:govDelivery:code` and re-trigger `source:ics-cisa:fetch` so the connector switches to the authorised feed.
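Before copying the CSVs in step 2, the generated digests can be verified with standard coreutils (this assumes the helper script writes them in `sha256sum` format):

```bash
cd seed-data/ics-cisa/
sha256sum -c ./*.sha256   # every file should report "OK" before it ships in the kit
```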
|
||||
|
||||
> The CSVs are licensed under ODbL 1.0 by the ICS Advisory Project. Preserve the attribution when redistributing them.
|
||||
|
||||
## 5. Integration Validation
|
||||
|
||||
1. Ensure secrets are in place and restart the Concelier workers.
|
||||
2. Run a dry-run fetch/parse/map chain against an Akamai-protected topic:
|
||||
```bash
|
||||
CONCELIER_SOURCES_ICSCISA_GOVDELIVERY_CODE=... \
|
||||
CONCELIER_SOURCES_ICSCISA_ENABLEDETAILSCRAPE=1 \
|
||||
stella db jobs run source:ics-cisa:fetch --and-then source:ics-cisa:parse --and-then source:ics-cisa:map
|
||||
```
|
||||
3. Confirm logs contain `ics-cisa detail fetch` entries and that new documents/DTOs include attachments (see `docs/artifacts/icscisa`). Canonical advisories should expose PDF links as `references.kind == "attachment"` and affected packages should surface `primitives.semVer.exactValue` for single-version hits.
|
||||
4. If Akamai blocks direct fetches, set `concelier:sources:icscisa:proxyUri` to your allow-listed egress proxy and rerun the dry-run.
|
||||
|
||||
## 6. Rotation & Incident Response
|
||||
|
||||
- Review GovDelivery access quarterly. Rotate the personalised code whenever Ops changes the service mailbox password or membership.
|
||||
- Revoking the subscription in GovDelivery invalidates the code immediately; update the vault and configuration in the same change.
|
||||
- If the code leaks, remove the subscription (`https://public.govdelivery.com/accounts/USDHSCISA/subscriber/manage_preferences?code=<CODE>`), resubscribe, and distribute the new value via the vault.
|
||||
|
||||
## 7. Offline Kit Handling
|
||||
|
||||
Include the personalised code in `offline-kit/secrets/concelier/icscisa.env`:
|
||||
|
||||
```
|
||||
CONCELIER_SOURCES_ICSCISA_GOVDELIVERY_CODE=AB12CD34EF
|
||||
```
|
||||
|
||||
The Offline Kit deployment script copies this file into the container secret directory mounted at `/run/secrets/concelier`. Ensure permissions are `600` and ownership matches the Concelier runtime user.
|
||||
|
||||
## 8. Telemetry & Monitoring
|
||||
|
||||
The connector emits metrics under the meter `StellaOps.Concelier.Connector.Ics.Cisa`. They allow operators to track Akamai fallbacks, detail enrichment health, and advisory fan-out.
|
||||
|
||||
- `icscisa.fetch.*` – counters for `attempts`, `success`, `failures`, `not_modified`, and `fallbacks`, plus histogram `icscisa.fetch.documents` showing documents added per topic pull (tags: `concelier.source`, `icscisa.topic`).
|
||||
- `icscisa.parse.*` – counters for `success`/`failures` and histograms `icscisa.parse.advisories`, `icscisa.parse.attachments`, `icscisa.parse.detail_fetches` to monitor enrichment workload per feed document.
|
||||
- `icscisa.detail.*` – counters `success` / `failures` per advisory (tagged with `icscisa.advisory`) to alert when Akamai blocks detail pages.
|
||||
- `icscisa.map.*` – counters for `success`/`failures` and histograms `icscisa.map.references`, `icscisa.map.packages`, `icscisa.map.aliases` capturing canonical fan-out.
|
||||
|
||||
Suggested alerts:
|
||||
|
||||
- `increase(icscisa.fetch.failures_total[15m]) > 0` or `increase(icscisa.fetch.fallbacks_total[15m]) > 5` — sustained Akamai or proxy issues.
|
||||
- `increase(icscisa.detail.failures_total[30m]) > 0` — detail enrichment breaking (potential HTML layout change).
|
||||
- `histogram_quantile(0.95, rate(icscisa.map.references_bucket[1h]))` trending sharply higher — sudden advisory reference explosion worth investigating.
|
||||
- Keep an eye on shared HTTP metrics (`concelier.source.http.*{concelier.source="ics-cisa"}`) for request latency and retry patterns.
|
||||
|
||||
## 9. Related Tasks
|
||||
|
||||
- `FEEDCONN-ICSCISA-02-009` (GovDelivery credential onboarding) — completed once this runbook is followed and secrets are placed in the vault.
|
||||
- `FEEDCONN-ICSCISA-02-007` (document inventory) — archive the first successful RSS response and any attachment URL schema under `docs/artifacts/icscisa/`.
|
||||
@@ -1,74 +0,0 @@
|
||||
# Concelier KISA Connector Operations
|
||||
|
||||
Operational guidance for the Korea Internet & Security Agency (KISA / KNVD) connector (`source:kisa:*`). Pair this with the engineering brief in `docs/dev/kisa_connector_notes.md`.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- Outbound HTTPS (or mirrored cache) for `https://knvd.krcert.or.kr/`.
|
||||
- Connector options defined under `concelier:sources:kisa`:
|
||||
|
||||
```yaml
|
||||
concelier:
|
||||
sources:
|
||||
kisa:
|
||||
feedUri: "https://knvd.krcert.or.kr/rss/securityInfo.do"
|
||||
detailApiUri: "https://knvd.krcert.or.kr/rssDetailData.do"
|
||||
detailPageUri: "https://knvd.krcert.or.kr/detailDos.do"
|
||||
maxAdvisoriesPerFetch: 10
|
||||
requestDelay: "00:00:01"
|
||||
failureBackoff: "00:05:00"
|
||||
```
|
||||
|
||||
> Ensure the URIs stay absolute—Concelier adds the `feedUri`/`detailApiUri` hosts to the HttpClient allow-list automatically.
|
||||
|
||||
## 2. Staging Smoke Test
|
||||
|
||||
1. Restart the Concelier workers so the KISA options bind.
|
||||
2. Run a full connector cycle:
|
||||
- CLI: `stella db jobs run source:kisa:fetch --and-then source:kisa:parse --and-then source:kisa:map`
|
||||
- REST: `POST /jobs/run { "kind": "source:kisa:fetch", "chain": ["source:kisa:parse", "source:kisa:map"] }`
|
||||
3. Confirm telemetry (Meter `StellaOps.Concelier.Connector.Kisa`):
|
||||
- `kisa.feed.success`, `kisa.feed.items`
|
||||
- `kisa.detail.success` / `.failures`
|
||||
- `kisa.parse.success` / `.failures`
|
||||
- `kisa.map.success` / `.failures`
|
||||
- `kisa.cursor.updates`
|
||||
4. Inspect logs for structured entries:
|
||||
- `KISA feed returned {ItemCount}`
|
||||
- `KISA fetched detail for {Idx} … category={Category}`
|
||||
- `KISA mapped advisory {AdvisoryId} (severity={Severity})`
|
||||
- Absence of warnings such as `document missing GridFS payload`.
|
||||
5. Validate MongoDB state:
|
||||
- `raw_documents.metadata` has `kisa.idx`, `kisa.category`, `kisa.title`.
|
||||
- DTO store contains `schemaVersion="kisa.detail.v1"`.
|
||||
- Advisories include aliases (`IDX`, CVE) and `language="ko"`.
|
||||
- `source_states` entry for `kisa` shows recent `cursor.lastFetchAt`.
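A mongosh spot-check for step 5; the database name follows the other runbooks in this set (`concelier`) and the filter field on `source_states` is an assumption, so adjust both to your schema:

```javascript
// Filter field is assumed; adjust to match the stored source-state documents.
use concelier;
db.source_states.find({ source: "kisa" }).limit(1);
```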
|
||||
|
||||
## 3. Production Monitoring
|
||||
|
||||
- **Dashboards** – Add the following Prometheus/OTEL expressions:
|
||||
- `rate(kisa_feed_items_total[15m])` versus `rate(concelier_source_http_requests_total{concelier_source="kisa"}[15m])`
|
||||
- `increase(kisa_detail_failures_total{reason!="empty-document"}[1h])` alert at `>0`
|
||||
- `increase(kisa_parse_failures_total[1h])` for storage/JSON issues
|
||||
- `increase(kisa_map_failures_total[1h])` to flag schema drift
|
||||
- `increase(kisa_cursor_updates_total[6h]) == 0` during active windows → warn
|
||||
- **Alerts** – Page when `rate(kisa_feed_success_total[2h]) == 0` while other connectors are active; back off for maintenance windows announced on `https://knvd.krcert.or.kr/`.
|
||||
- **Logs** – Watch for repeated warnings (`document missing`, `DTO missing`) or errors with reason tags `HttpRequestException`, `download`, `parse`, `map`.
|
||||
|
||||
## 4. Localisation Handling
|
||||
|
||||
- Hangul categories (for example `취약점정보`) flow into telemetry tags (`category=…`) and logs. Dashboards must render UTF‑8 and avoid transliteration.
|
||||
- HTML content is sanitised before storage; translation teams can consume the `ContentHtml` field safely.
|
||||
- Advisory severity remains as provided by KISA (`High`, `Medium`, etc.). Map-level failures include the severity tag for filtering.
|
||||
|
||||
## 5. Fixture & Regression Maintenance
|
||||
|
||||
- Regression fixtures: `src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/Fixtures/kisa-feed.xml` and `kisa-detail.json`.
|
||||
- Refresh via `UPDATE_KISA_FIXTURES=1 dotnet test src/Concelier/__Tests/StellaOps.Concelier.Connector.Kisa.Tests/StellaOps.Concelier.Connector.Kisa.Tests.csproj`.
|
||||
- The telemetry regression (`KisaConnectorTests.Telemetry_RecordsMetrics`) will fail if counters/log wiring drifts—treat failures as gating.
|
||||
|
||||
## 6. Known Issues
|
||||
|
||||
- RSS feeds only expose the latest 10 advisories; long outages require replay via archived feeds or manual IDX seeds.
|
||||
- Detail endpoint occasionally throttles; the connector honours `requestDelay` and reports failures with reason `HttpRequestException`. Consider increasing delay for weekend backfills.
|
||||
- If `kisa.category` tags suddenly appear as `unknown`, verify KISA has not renamed RSS elements; update the parser fixtures before production rollout.
|
||||
@@ -1,238 +0,0 @@
|
||||
# Concelier & Excititor Mirror Operations
|
||||
|
||||
This runbook describes how Stella Ops operates the managed mirrors under `*.stella-ops.org`.
|
||||
It covers Docker Compose and Helm deployment overlays, secret handling for multi-tenant
|
||||
authn, CDN fronting, and the recurring sync pipeline that keeps mirror bundles current.
|
||||
|
||||
## 1. Prerequisites
|
||||
|
||||
- **Authority access** – client credentials (`client_id` + secret) authorised for
|
||||
`concelier.mirror.read` and `excititor.mirror.read` scopes. Secrets live outside git.
|
||||
- **Signed TLS certificates** – wildcard or per-domain (`mirror-primary`, `mirror-community`).
|
||||
Store them under `deploy/compose/mirror-gateway/tls/` or in Kubernetes secrets.
|
||||
- **Mirror gateway credentials** – Basic Auth htpasswd files per domain. Generate with
|
||||
`htpasswd -B`. Operators distribute credentials to downstream consumers.
|
||||
- **Export artifact source** – read access to the canonical S3 buckets (or rsync share)
|
||||
that hold `concelier` JSON bundles and `excititor` VEX exports.
|
||||
- **Persistent volumes** – storage for Concelier job metadata and mirror export trees.
|
||||
For Helm, provision PVCs (`concelier-mirror-jobs`, `concelier-mirror-exports`,
|
||||
`excititor-mirror-exports`, `mirror-mongo-data`, `mirror-minio-data`) before rollout.
|
||||
|
||||
### 1.1 Service configuration quick reference
|
||||
|
||||
Concelier.WebService exposes the mirror HTTP endpoints once `CONCELIER__MIRROR__ENABLED=true`.
|
||||
Key knobs:
|
||||
|
||||
- `CONCELIER__MIRROR__EXPORTROOT` – root folder containing export snapshots (`<exportId>/mirror/*`).
|
||||
- `CONCELIER__MIRROR__ACTIVEEXPORTID` – optional explicit export id; otherwise the service auto-falls back to the `latest/` symlink or newest directory.
|
||||
- `CONCELIER__MIRROR__REQUIREAUTHENTICATION` – default auth requirement; override per domain with `CONCELIER__MIRROR__DOMAINS__{n}__REQUIREAUTHENTICATION`.
|
||||
- `CONCELIER__MIRROR__MAXINDEXREQUESTSPERHOUR` – budget for `/concelier/exports/index.json`. Domains inherit this value unless they define `__MAXDOWNLOADREQUESTSPERHOUR`.
|
||||
- `CONCELIER__MIRROR__DOMAINS__{n}__ID` – domain identifier matching the exporter manifest; additional keys configure display name and rate budgets.
|
||||
|
||||
> The service honours Stella Ops Authority when `CONCELIER__AUTHORITY__ENABLED=true` and `ALLOWANONYMOUSFALLBACK=false`. Use the bypass CIDR list (`CONCELIER__AUTHORITY__BYPASSNETWORKS__*`) for in-cluster ingress gateways that terminate Basic Auth. Unauthorized requests emit `WWW-Authenticate: Bearer` so downstream automation can detect token failures.
|
||||
|
||||
Mirror responses carry deterministic cache headers: `/index.json` returns `Cache-Control: public, max-age=60`, while per-domain manifests/bundles include `Cache-Control: public, max-age=300, immutable`. Rate limiting surfaces `Retry-After` when quotas are exceeded.
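Pulling the knobs above together, an illustrative environment sketch (values are examples, not shipped defaults):

```bash
CONCELIER__MIRROR__ENABLED=true
CONCELIER__MIRROR__EXPORTROOT=/exports/json
CONCELIER__MIRROR__REQUIREAUTHENTICATION=true
CONCELIER__MIRROR__MAXINDEXREQUESTSPERHOUR=600
CONCELIER__MIRROR__DOMAINS__0__ID=primary
CONCELIER__MIRROR__DOMAINS__0__REQUIREAUTHENTICATION=true
```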
|
||||
|
||||
### 1.2 Mirror connector configuration
|
||||
|
||||
Downstream Concelier instances ingest published bundles using the `StellaOpsMirrorConnector`. Operators running the connector in air‑gapped or limited connectivity environments can tune the following options (environment prefix `CONCELIER__SOURCES__STELLAOPSMIRROR__`):
|
||||
|
||||
- `BASEADDRESS` – absolute mirror root (e.g., `https://mirror-primary.stella-ops.org`).
|
||||
- `INDEXPATH` – relative path to the mirror index (`/concelier/exports/index.json` by default).
|
||||
- `DOMAINID` – mirror domain identifier from the index (`primary`, `community`, etc.).
|
||||
- `HTTPTIMEOUT` – request timeout; raise when mirrors sit behind slow WAN links.
|
||||
- `SIGNATURE__ENABLED` – require detached JWS verification for `bundle.json`.
|
||||
- `SIGNATURE__KEYID` / `SIGNATURE__PROVIDER` – expected signing key metadata.
|
||||
- `SIGNATURE__PUBLICKEYPATH` – PEM fallback used when the mirror key registry is offline.
|
||||
|
||||
The connector keeps a per-export fingerprint (bundle digest + generated-at timestamp) and tracks outstanding document IDs. If a scan is interrupted, the next run resumes parse/map work using the stored fingerprint and pending document lists—no network requests are reissued unless the upstream digest changes.
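A matching downstream-connector sketch using the option names above (the public key path is an assumption; point it at wherever you stage the mirror signing key):

```bash
CONCELIER__SOURCES__STELLAOPSMIRROR__BASEADDRESS=https://mirror-primary.stella-ops.org
CONCELIER__SOURCES__STELLAOPSMIRROR__DOMAINID=primary
CONCELIER__SOURCES__STELLAOPSMIRROR__SIGNATURE__ENABLED=true
CONCELIER__SOURCES__STELLAOPSMIRROR__SIGNATURE__PUBLICKEYPATH=/run/secrets/concelier/mirror-key.pub
```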
|
||||
|
||||
## 2. Secret & certificate layout
|
||||
|
||||
### Docker Compose (`deploy/compose/docker-compose.mirror.yaml`)
|
||||
|
||||
- `deploy/compose/env/mirror.env.example` – copy to `.env` and adjust quotas or domain IDs.
|
||||
- `deploy/compose/mirror-secrets/` – mount read-only into `/run/secrets`. Place:
|
||||
- `concelier-authority-client` – Authority client secret.
|
||||
- `excititor-authority-client` (optional) – reserve for future authn.
|
||||
- `deploy/compose/mirror-gateway/tls/` – PEM-encoded cert/key pairs:
|
||||
- `mirror-primary.crt`, `mirror-primary.key`
|
||||
- `mirror-community.crt`, `mirror-community.key`
|
||||
- `deploy/compose/mirror-gateway/secrets/` – htpasswd files:
|
||||
- `mirror-primary.htpasswd`
|
||||
- `mirror-community.htpasswd`
|
||||
|
||||
### Helm (`deploy/helm/stellaops/values-mirror.yaml`)
|
||||
|
||||
Create secrets in the target namespace:
|
||||
|
||||
```bash
|
||||
kubectl create secret generic concelier-mirror-auth \
|
||||
--from-file=concelier-authority-client=concelier-authority-client
|
||||
|
||||
kubectl create secret generic excititor-mirror-auth \
|
||||
--from-file=excititor-authority-client=excititor-authority-client
|
||||
|
||||
kubectl create secret tls mirror-gateway-tls \
|
||||
--cert=mirror-primary.crt --key=mirror-primary.key
|
||||
|
||||
kubectl create secret generic mirror-gateway-htpasswd \
|
||||
--from-file=mirror-primary.htpasswd --from-file=mirror-community.htpasswd
|
||||
```
|
||||
|
||||
> Keep Basic Auth lists short-lived (rotate quarterly) and document credential recipients.
|
||||
|
||||
## 3. Deployment
|
||||
|
||||
### 3.1 Docker Compose (edge mirrors, lab validation)
|
||||
|
||||
1. `cp deploy/compose/env/mirror.env.example deploy/compose/env/mirror.env`
|
||||
2. Populate secrets/tls directories as described above.
|
||||
3. Sync mirror bundles (see §4) into `deploy/compose/mirror-data/…` and ensure they are mounted
|
||||
on the host path backing the `concelier-exports` and `excititor-exports` volumes.
|
||||
4. Run the profile validator: `deploy/tools/validate-profiles.sh`.
|
||||
5. Launch: `docker compose --env-file env/mirror.env -f docker-compose.mirror.yaml up -d`.
|
||||
|
||||
### 3.2 Helm (production mirrors)
|
||||
|
||||
1. Provision PVCs sized for mirror bundles (baseline: 20 GiB per domain).
|
||||
2. Create secrets/tls config maps (§2).
|
||||
3. `helm upgrade --install mirror deploy/helm/stellaops -f deploy/helm/stellaops/values-mirror.yaml`.
|
||||
4. Annotate the `stellaops-mirror-gateway` service with ingress/LoadBalancer metadata required by
|
||||
your CDN (e.g., AWS load balancer scheme internal + NLB idle timeout).
|
||||
|
||||
## 4. Artifact sync workflow
|
||||
|
||||
Mirrors never generate exports—they ingest signed bundles produced by the Concelier and Excititor
|
||||
export jobs. Recommended sync pattern:
|
||||
|
||||
### 4.1 Compose host (systemd timer)
|
||||
|
||||
`/usr/local/bin/mirror-sync.sh`:
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
export AWS_ACCESS_KEY_ID=…
|
||||
export AWS_SECRET_ACCESS_KEY=…
|
||||
|
||||
aws s3 sync s3://mirror-stellaops/concelier/latest \
|
||||
/opt/stellaops/mirror-data/concelier --delete --size-only
|
||||
|
||||
aws s3 sync s3://mirror-stellaops/excititor/latest \
|
||||
/opt/stellaops/mirror-data/excititor --delete --size-only
|
||||
```
|
||||
|
||||
Schedule with a systemd timer every 5 minutes. The Compose volumes mount `/opt/stellaops/mirror-data/*`
|
||||
into the containers read-only, matching `CONCELIER__MIRROR__EXPORTROOT=/exports/json` and
|
||||
`EXCITITOR__ARTIFACTS__FILESYSTEM__ROOT=/exports`.
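A minimal systemd pairing for that cadence (unit file names and paths are illustrative):

```
# /etc/systemd/system/mirror-sync.service
[Unit]
Description=Sync Concelier and Excititor mirror bundles

[Service]
Type=oneshot
ExecStart=/usr/local/bin/mirror-sync.sh

# /etc/systemd/system/mirror-sync.timer
[Unit]
Description=Run mirror-sync every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```

Enable it with `systemctl daemon-reload && systemctl enable --now mirror-sync.timer`.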
|
||||
|
||||
### 4.2 Kubernetes (CronJob)

Create a CronJob running the AWS CLI (or rclone) in the same namespace, writing into the PVCs:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mirror-sync
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sync
              image: public.ecr.aws/aws-cli/aws-cli@sha256:5df5f52c29f5e3ba46d0ad9e0e3afc98701c4a0f879400b4c5f80d943b5fadea
              command:
                - /bin/sh
                - -c
                - >
                  aws s3 sync s3://mirror-stellaops/concelier/latest /exports/concelier --delete --size-only &&
                  aws s3 sync s3://mirror-stellaops/excititor/latest /exports/excititor --delete --size-only
              volumeMounts:
                - name: concelier-exports
                  mountPath: /exports/concelier
                - name: excititor-exports
                  mountPath: /exports/excititor
              envFrom:
                - secretRef:
                    name: mirror-sync-aws
          restartPolicy: OnFailure
          volumes:
            - name: concelier-exports
              persistentVolumeClaim:
                claimName: concelier-mirror-exports
            - name: excititor-exports
              persistentVolumeClaim:
                claimName: excititor-mirror-exports
```

## 5. CDN integration

1. Point the CDN origin at the mirror gateway (Compose host or Kubernetes LoadBalancer).
2. Honour the response headers emitted by the gateway and Concelier/Excititor: `Cache-Control: public, max-age=300, immutable` for mirror payloads.
3. Configure origin shields in the CDN to prevent cache stampedes. Recommended TTLs:
   - Index (`/concelier/exports/index.json`, `/excititor/mirror/*/index`) → 60 s.
   - Bundle/manifest payloads → 300 s.
4. Forward the `Authorization` header—Basic Auth terminates at the gateway.
5. Enforce per-domain rate limits at the CDN (matching gateway budgets) and enable logging to SIEM for anomaly detection.

## 6. Smoke tests

After each deployment or sync cycle (temporarily set low budgets if you need to observe 429 responses):

```bash
# Index with Basic Auth
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/index.json | jq 'keys'

# Mirror manifest signature and cache headers
curl -u $PRIMARY_CREDS -I https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/manifest.json \
  | tee /tmp/manifest-headers.txt
grep -E '^Cache-Control: ' /tmp/manifest-headers.txt   # expect public, max-age=300, immutable

# Excititor consensus bundle metadata
curl -u $COMMUNITY_CREDS https://mirror-community.stella-ops.org/excititor/mirror/community/index \
  | jq '.exports[].exportKey'

# Signed bundle + detached JWS (spot check digests)
curl -u $PRIMARY_CREDS https://mirror-primary.stella-ops.org/concelier/exports/mirror/primary/bundle.json.jws \
  -o bundle.json.jws
cosign verify-blob --signature bundle.json.jws --key mirror-key.pub bundle.json

# Service-level auth check (inside cluster – no gateway credentials)
kubectl exec deploy/stellaops-concelier -- curl -si http://localhost:8443/concelier/exports/mirror/primary/manifest.json \
  | head -n 5   # expect HTTP/1.1 401 with WWW-Authenticate: Bearer

# Rate limit smoke (repeat quickly; second call should return 429 + Retry-After)
for i in 1 2; do
  curl -s -o /dev/null -D - https://mirror-primary.stella-ops.org/concelier/exports/index.json \
    -u $PRIMARY_CREDS | grep -E '^(HTTP/|Retry-After:)'
  sleep 1
done
```

Watch the gateway metrics (`nginx_vts` or access logs) for cache hits. In Kubernetes, `kubectl logs deploy/stellaops-mirror-gateway` should show `X-Cache-Status: HIT/MISS`.

## 7. Maintenance & rotation

- **Bundle freshness** – alert if sync job lag exceeds 15 minutes or if `concelier` logs `Mirror export root is not configured`.
- **Secret rotation** – change Authority client secrets and Basic Auth credentials quarterly. Update the mounted secrets and restart deployments (`docker compose restart concelier` or `kubectl rollout restart deploy/stellaops-concelier`).
- **TLS renewal** – reissue certificates, place new files, and reload gateway (`docker compose exec mirror-gateway nginx -s reload`).
- **Quota tuning** – adjust per-domain `MAXDOWNLOADREQUESTSPERHOUR` in `.env` or values file. Align CDN rate limits and inform downstreams.

## 8. References

- Deployment profiles: `deploy/compose/docker-compose.mirror.yaml`, `deploy/helm/stellaops/values-mirror.yaml`
- Mirror architecture dossiers: `docs/ARCHITECTURE_CONCELIER.md`, `docs/ARCHITECTURE_EXCITITOR_MIRRORS.md`
- Export bundling: `docs/ARCHITECTURE_DEVOPS.md` §3, `docs/ARCHITECTURE_EXCITITOR.md` §7
@@ -1,86 +0,0 @@
# Concelier MSRC Connector – Azure AD Onboarding Brief

_Drafted: 2025-10-15_

## 1. App registration requirements

- **Tenant**: shared StellaOps production Azure AD.
- **Application type**: confidential client (web/API) issuing client credentials.
- **API permissions**: `api://api.msrc.microsoft.com/.default` (Application). Admin consent required once.
- **Token audience**: `https://api.msrc.microsoft.com/`.
- **Grant type**: client credentials. Concelier will request tokens via `POST https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token` (see the example request below).

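A token acquisition sketch against that endpoint (all values are placeholders; Concelier performs the equivalent call internally through its token cache):

```bash
curl -s -X POST "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token" \
  -d "client_id=${CLIENT_ID}" \
  -d "client_secret=${CLIENT_SECRET}" \
  -d "grant_type=client_credentials" \
  -d "scope=api://api.msrc.microsoft.com/.default" \
  | jq -r '.access_token'
```
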
## 2. Secret/credential policy

- Maintain two client secrets (primary + standby) rotating every 90 days.
- Store secrets in the Concelier secrets vault; Offline Kit deployments must mirror the secret payloads in their encrypted store.
- Record rotation cadence in Ops runbook and update Concelier configuration (`CONCELIER__SOURCES__VNDR__MSRC__CLIENTSECRET`) ahead of expiry.

## 3. Concelier configuration sample

```yaml
concelier:
  sources:
    vndr.msrc:
      tenantId: "<azure-tenant-guid>"
      clientId: "<app-registration-client-id>"
      clientSecret: "<pull from secret store>"
      apiVersion: "2024-08-01"
      locale: "en-US"
      requestDelay: "00:00:00.250"
      failureBackoff: "00:05:00"
      cursorOverlapMinutes: 10
      downloadCvrf: false   # set true to persist CVRF ZIP alongside JSON detail
```

## 4. CVRF artefacts

- The MSRC REST payload exposes `cvrfUrl` per advisory. Current connector persists the link as advisory metadata and reference; it does **not** download the ZIP by default.
- Ops should mirror CVRF ZIPs when preparing Offline Kits so air-gapped deployments can reconcile advisories without direct internet access.
- Once Offline Kit storage guidelines are finalised, extend the connector configuration with `downloadCvrf: true` to enable automatic attachment retrieval.

### 4.1 State seeding helper

Use `tools/SourceStateSeeder` to queue historical advisories (detail JSON + optional CVRF artefacts) for replay without manual Mongo edits. Example seed file:

```json
{
  "source": "vndr.msrc",
  "cursor": {
    "lastModifiedCursor": "2024-01-01T00:00:00Z"
  },
  "documents": [
    {
      "uri": "https://api.msrc.microsoft.com/sug/v2.0/vulnerability/ADV2024-0001",
      "contentFile": "./seeds/adv2024-0001.json",
      "contentType": "application/json",
      "metadata": { "msrc.vulnerabilityId": "ADV2024-0001" },
      "addToPendingDocuments": true
    },
    {
      "uri": "https://download.microsoft.com/msrc/2024/ADV2024-0001.cvrf.zip",
      "contentFile": "./seeds/adv2024-0001.cvrf.zip",
      "contentType": "application/zip",
      "status": "mapped",
      "addToPendingDocuments": false
    }
  ]
}
```

Run the helper:

```bash
dotnet run --project tools/SourceStateSeeder -- \
  --connection-string "mongodb://localhost:27017" \
  --database concelier \
  --input seeds/msrc-backfill.json
```

Any documents marked `addToPendingDocuments` will appear in the connector cursor; `DownloadCvrf` can remain disabled if the ZIP artefact is pre-seeded.

## 5. Outstanding items

- Ops to confirm tenant/app names and provide client credentials through the secure channel.
- Connector team monitors token cache health (already implemented); validate instrumentation once Ops supplies credentials.
- Offline Kit packaging: add encrypted blob containing client credentials with rotation instructions.
@@ -1,48 +0,0 @@
# NKCKI Connector Operations Guide

## Overview

The NKCKI connector ingests JSON bulletin archives from cert.gov.ru, expanding each `*.json.zip` attachment into per-vulnerability DTOs before canonical mapping. The fetch pipeline now supports cache-backed recovery, deterministic pagination, and telemetry suitable for production monitoring.

## Configuration

Key options exposed through `concelier:sources:ru-nkcki:http`:

- `maxBulletinsPerFetch` – limits new bulletin downloads in a single run (default `5`).
- `maxListingPagesPerFetch` – maximum listing pages visited during pagination (default `3`).
- `listingCacheDuration` – minimum interval between listing fetches before falling back to cached artefacts (default `00:10:00`).
- `cacheDirectory` – optional path for persisted bulletin archives used during offline or failure scenarios.
- `requestDelay` – delay inserted between bulletin downloads to respect upstream politeness.

When operating in offline-first mode, set `cacheDirectory` to a writable path (e.g. `/var/lib/concelier/cache/ru-nkcki`) and pre-populate bulletin archives via the offline kit.

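A configuration sketch for an offline-first deployment (the YAML nesting mirrors the `concelier:sources:ru-nkcki:http` key path; values other than the documented defaults are illustrative):

```yaml
concelier:
  sources:
    ru-nkcki:
      http:
        maxBulletinsPerFetch: 5
        maxListingPagesPerFetch: 3
        listingCacheDuration: "00:10:00"
        cacheDirectory: "/var/lib/concelier/cache/ru-nkcki"
        requestDelay: "00:00:01"   # illustrative politeness delay
```
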
## Telemetry

`RuNkckiDiagnostics` emits the following metrics under meter `StellaOps.Concelier.Connector.Ru.Nkcki`:

- `nkcki.listing.fetch.attempts` / `nkcki.listing.fetch.success` / `nkcki.listing.fetch.failures`
- `nkcki.listing.pages.visited` (histogram, `pages`)
- `nkcki.listing.attachments.discovered` / `nkcki.listing.attachments.new`
- `nkcki.bulletin.fetch.success` / `nkcki.bulletin.fetch.cached` / `nkcki.bulletin.fetch.failures`
- `nkcki.entries.processed` (histogram, `entries`)

Integrate these counters into standard Concelier observability dashboards to track crawl coverage and cache hit rates.

## Archive Backfill Strategy

Bitrix pagination surfaces archives via `?PAGEN_1=n`. The connector now walks up to `maxListingPagesPerFetch` pages, deduplicating bulletin IDs and maintaining a rolling `knownBulletins` window. Backfill strategy:

1. Enumerate pages from newest to oldest, respecting `maxListingPagesPerFetch` and `listingCacheDuration` to avoid refetch storms.
2. Persist every `*.json.zip` attachment to the configured cache directory. This enables replay when listing access is temporarily blocked.
3. During archive replay, `ProcessCachedBulletinsAsync` enqueues missing documents while respecting `maxVulnerabilitiesPerFetch`.
4. For historical HTML-only advisories, collect page URLs and metadata while offline (future work: HTML and PDF extraction pipeline documented in `docs/concelier-connector-research-20251011.md`).

For large migrations, seed caches with archived zip bundles, then run fetch/parse/map cycles in chronological order to maintain deterministic outputs.

## Failure Handling

- Listing failures mark the source state with exponential backoff while attempting cache replay.
- Bulletin fetches fall back to cached copies before surfacing an error.
- Mongo integration tests rely on bundled OpenSSL 1.1 libraries (`tools/openssl/linux-x64`) to keep `Mongo2Go` operational on modern distros.

Refer to `ru-nkcki` entries in `src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Ru.Nkcki/TASKS.md` for outstanding items.
@@ -1,24 +0,0 @@
# Concelier OSV Connector – Operations Notes

_Last updated: 2025-10-16_

The OSV connector ingests advisories from OSV.dev across OSS ecosystems. This note highlights the additional merge/export expectations introduced with the canonical metric fallback work in Sprint 4.

## 1. Canonical metric fallbacks
- When OSV omits CVSS vectors (common for CVSS v4-only payloads) the mapper now emits a deterministic canonical metric id in the form `osv:severity/<level>` and normalises the advisory severity to the same `<level>`.
- Metric: `osv.map.canonical_metric_fallbacks` (counter) with tags `severity`, `canonical_metric_id`, `ecosystem`, `reason=no_cvss`. Watch this alongside merge parity dashboards to catch spikes where OSV publishes severity-only advisories.
- Merge precedence still prefers GHSA over OSV; the shared severity-based canonical id keeps Merge/export parity deterministic even when only OSV supplies severity data.

## 2. CWE provenance
- `database_specific.cwe_ids` now populates provenance decision reasons for every mapped weakness. Expect `decisionReason="database_specific.cwe_ids"` on OSV weakness provenance and confirm exporters preserve the value.
- If OSV ever attaches `database_specific.cwe_notes`, the connector will surface the joined note string in `decisionReason` instead of the default marker.

## 3. Dashboards & alerts
- Extend existing merge dashboards with the new counter:
  - Overlay `sum(osv.map.canonical_metric_fallbacks{ecosystem=~".+"})` with Merge severity overrides to confirm fallback advisories are reconciling cleanly.
  - Alert when the 1-hour sum exceeds 50 for any ecosystem; baseline volume is currently <5 per day (mostly GHSA mirrors emitting CVSS v4 only). A rule sketch follows this list.
- Exporters already surface `canonicalMetricId`; no schema change is required, but ORAS/Trivy bundles should be spot-checked after deploying the connector update.

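A Prometheus alerting-rule sketch for that threshold (the exported metric name `osv_map_canonical_metric_fallbacks_total` is an assumption derived from the meter name `osv.map.canonical_metric_fallbacks`; group/alert names are placeholders):

```yaml
groups:
  - name: concelier-osv-fallbacks
    rules:
      - alert: OsvCanonicalMetricFallbackSpike
        # Fire when any ecosystem emits more than 50 fallback ids within the last hour.
        expr: sum by (ecosystem) (increase(osv_map_canonical_metric_fallbacks_total[1h])) > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "OSV canonical metric fallbacks above baseline for {{ $labels.ecosystem }}"
```
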
## 4. Runbook updates
- Fixture parity suites (`osv-ghsa.*`) now assert the fallback id and provenance notes. Regenerate via `dotnet test src/Concelier/StellaOps.Concelier.PluginBinaries/StellaOps.Concelier.Connector.Osv.Tests/StellaOps.Concelier.Connector.Osv.Tests.csproj`.
- When investigating merge severity conflicts, include the fallback counter and confirm OSV advisories carry the expected `osv:severity/<level>` id before raising connector bugs.
@@ -1,151 +0,0 @@
# Stella Ops Deployment Upgrade & Rollback Runbook

_Last updated: 2025-10-26 (Sprint 14 – DEVOPS-OPS-14-003)._

This runbook describes how to promote a new release across the supported deployment profiles (Helm and Docker Compose), how to roll back safely, and how to keep channels (`edge`, `stable`, `airgap`) aligned. All steps assume you are working from a clean checkout of the release branch/tag.

---

## 1. Channel overview

| Channel | Release manifest | Helm values | Compose profile |
|---------|------------------|-------------|-----------------|
| `edge` | `deploy/releases/2025.10-edge.yaml` | `deploy/helm/stellaops/values-dev.yaml` | `deploy/compose/docker-compose.dev.yaml` |
| `stable` | `deploy/releases/2025.09-stable.yaml` | `deploy/helm/stellaops/values-stage.yaml`, `deploy/helm/stellaops/values-prod.yaml` | `deploy/compose/docker-compose.stage.yaml`, `deploy/compose/docker-compose.prod.yaml` |
| `airgap` | `deploy/releases/2025.09-airgap.yaml` | `deploy/helm/stellaops/values-airgap.yaml` | `deploy/compose/docker-compose.airgap.yaml` |

Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set.

---

## 2. Pre-flight checklist

1. **Refresh release manifest**
   Pull the latest manifest for the channel you are promoting (`deploy/releases/<version>-<channel>.yaml`).

2. **Align deployment bundles with the manifest**
   Run the alignment checker for every profile that should pick up the release. Pass `--ignore-repo nats` to skip auxiliary services.
   ```bash
   ./deploy/tools/check-channel-alignment.py \
     --release deploy/releases/2025.10-edge.yaml \
     --target deploy/helm/stellaops/values-dev.yaml \
     --target deploy/compose/docker-compose.dev.yaml \
     --ignore-repo nats
   ```
   Repeat for other channels (`stable`, `airgap`), substituting the manifest and target files.

3. **Lint and template profiles**
   ```bash
   ./deploy/tools/validate-profiles.sh
   ```

4. **Smoke the Offline Kit debug store (edge/stable only)**
   When the release pipeline has generated `out/release/debug/.build-id/**`, mirror the assets into the Offline Kit staging tree:
   ```bash
   ./ops/offline-kit/mirror_debug_store.py \
     --release-dir out/release \
     --offline-kit-dir out/offline-kit
   ```
   Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle.

5. **Review compatibility matrix**
   Confirm MongoDB, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `mongo@sha256:c258…`, `minio@sha256:14ce…`, `rustfs:2025.10.0-edge`.

6. **Create a rollback bookmark**
   Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes.

---

## 3. Helm upgrade procedure (staging → production)

1. Switch to the deployment branch and ensure secrets/config maps are current.
2. Apply the upgrade in the staging cluster:
   ```bash
   helm upgrade stellaops deploy/helm/stellaops \
     -f deploy/helm/stellaops/values-stage.yaml \
     --namespace stellaops \
     --atomic \
     --timeout 15m
   ```
3. Run smoke tests (`scripts/smoke-tests.sh` or environment-specific checks).
4. Promote to production using the prod values file and the same command.
5. Record the new revision number and Git SHA in the change log.

### Rollback (Helm)

1. Identify the previous revision: `helm history stellaops -n stellaops`.
2. Execute:
   ```bash
   helm rollback stellaops <revision> \
     --namespace stellaops \
     --wait \
     --timeout 10m
   ```
3. Verify `kubectl get pods` returns healthy workloads; rerun smoke tests.
4. Update the incident/operations log with root cause and rollback details.

---

## 4. Docker Compose upgrade procedure

1. Update environment files (`deploy/compose/env/*.env.example`) with any new settings and sync secrets to hosts.
2. Pull the tagged repository state corresponding to the release (e.g. `git checkout 2025.09.2` for stable).
3. Apply the upgrade:
   ```bash
   docker compose \
     --env-file deploy/compose/env/prod.env \
     -f deploy/compose/docker-compose.prod.yaml \
     pull

   docker compose \
     --env-file deploy/compose/env/prod.env \
     -f deploy/compose/docker-compose.prod.yaml \
     up -d
   ```
4. Tail logs for critical services (`docker compose logs -f authority concelier`).
5. Update monitoring dashboards/alerts to confirm normal operation.

### Rollback (Compose)

1. Check out the previous release tag (e.g. `git checkout 2025.09.1`).
2. Re-run `docker compose pull` and `docker compose up -d` with that profile. Docker will restore the prior digests.
3. If reverting to a known-good snapshot is required, restore volume backups (see `docs/ops/authority-backup-restore.md` and associated service guides).
4. Log the rollback in the operations journal.

---

## 5. Channel promotion workflow

1. Author or update the channel manifest under `deploy/releases/`.
2. Mirror the new digests into Helm/Compose values and run the alignment script for each profile.
3. Commit the changes with a message that references the release version and channel (e.g. `deploy: promote 2025.10.0-edge`).
4. Publish release notes and update `deploy/releases/README.md` (if applicable).
5. Tag the repository when promoting stable or airgap builds.

---

## 6. Upgrade rehearsal & rollback drill log

Maintain rehearsal notes in `docs/ops/launch-cutover.md` or the relevant sprint planning document. After each drill capture:

- Release version tested
- Date/time
- Participants
- Issues encountered & fixes
- Rollback duration (if executed)

Attach the log to the sprint retro or operational wiki.

| Date (UTC) | Channel | Outcome | Notes |
|------------|---------|---------|-------|
| 2025-10-26 | Documentation dry-run | Planned | Runbook refreshed; next live drill scheduled for 2025-11 edge → stable promotion. |

---

## 7. References

- `deploy/README.md` – structure and validation workflow for deployment bundles.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release automation and signing pipeline.
- `docs/ARCHITECTURE_DEVOPS.md` – high-level DevOps architecture, SLOs, and compliance requirements.
- `ops/offline-kit/mirror_debug_store.py` – debug-store mirroring helper.
- `deploy/tools/check-channel-alignment.py` – release vs deployment digest alignment checker.
@@ -1,128 +0,0 @@
# Launch Cutover Runbook - Stella Ops

_Document owner: DevOps Guild (2025-10-26)_
_Scope:_ Full-platform launch from staging to production for release `2025.09.2`.

## 1. Roles and Communication

| Role | Primary | Backup | Contact |
| --- | --- | --- | --- |
| Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) |
| Authority stack | Authority Core guild rep | Security guild rep | `#authority` |
| Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` |
| Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation |
| Observability | Telemetry guild rep | SRE on-call | `#telemetry` |
| Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket |

Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes.

## 2. Timeline Overview (UTC)

| Time | Activity | Owner |
| --- | --- | --- |
| T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead |
| T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer |
| T-6h | Freeze non-launch deployments; notify guild leads. | Product owner |
| T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps |
| T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead |
| T0 | Execute production cutover steps (Section 4). | Cutover team |
| T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead |
| T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner |

## 3. Rehearsal (Staging) Checklist

1. `docker network create stellaops_frontdoor || true` (if not present on staging jump host).
2. Run `deploy/tools/validate-profiles.sh` and archive output.
3. Apply staging secrets (`kubectl apply -f secrets/stage/*.yaml` or `helm secrets upgrade`) ensuring `stellaops-stage` credentials align with `values-stage.yaml`.
4. Perform `helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml` in the staging cluster.
5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`.
6. Execute smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm report status in UI.
7. Document total wall time and any deviations in the rehearsal log.

Rehearsal must complete without manual interventions before proceeding to production.

## 4. Production Cutover Steps

### 4.1 Pre-flight
- Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`.
- Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host.
- Back up current configuration and data:
  - Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`.
  - MinIO policy export: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`.

### 4.2 Apply Updates (Compose)
1. On each compose node, pull updated images for release `2025.09.2`:
   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull
   ```
2. Deploy changes:
   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d
   ```
3. Confirm containers healthy via `docker compose ps` and `docker logs <service> --tail 50`.

### 4.3 Apply Updates (Helm/Kubernetes)
If using Kubernetes, perform:
```bash
helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --atomic --timeout 15m
```
Monitor rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service>`.

### 4.4 Configuration Validation
- Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`.
- Validate Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`.
- Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (returns success).
- Ensure Notify (legacy) still accessible while Notifier migration pending.

## 5. Smoke Tests

| Test | Command / Action | Expected Result |
| --- | --- | --- |
| API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` |
| Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes < 5 minutes; report accessible with signed DSSE |
| Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` |
| Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata |
| Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` |
| Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent |

Log results in the change ticket with timestamps and screenshots where applicable.

## 6. Rollback Procedure

1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
2. For Compose:
   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down
   docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d
   ```
3. For Helm:
   ```bash
   helm rollback stellaops <previous-release-number> --namespace stellaops
   ```
4. Restore Mongo snapshot if data inconsistency detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`.
5. Restore MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`.
6. Notify stakeholders of rollback and capture root cause notes in incident ticket.

## 7. Post-cutover Actions

- Keep heightened monitoring for 4 hours post cutover; track latency, error rates, and queue depth.
- Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
- Update `docs/ops/launch-readiness.md` if any new gaps or follow-ups discovered.
- Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner.

## 8. Approval Matrix

| Step | Required Approvers | Record Location |
| --- | --- | --- |
| Production deployment plan | CTO + DevOps lead | Change ticket comment |
| Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary |
| Post-smoke success | DevOps lead + product owner | Change ticket closure |
| Rollback (if invoked) | DevOps lead + CTO | Incident ticket |

Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.

## 9. Rehearsal Log

| Date (UTC) | What We Exercised | Outcome | Follow-up |
| --- | --- | --- | --- |
| 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. |
@@ -1,49 +0,0 @@
# Launch Readiness Record - Stella Ops

_Updated: 2025-10-26 (UTC)_

This document captures production launch sign-offs, deployment readiness checkpoints, and any open risks that must be tracked before GA cutover.

## 1. Sign-off Summary

| Module / Service | Guild / Point of Contact | Evidence (Task or Runbook) | Status | Timestamp (UTC) | Notes |
| --- | --- | --- | --- | --- | --- |
| Authority (Issuer) | Authority Core Guild | `AUTH-AOC-19-001` - scope issuance & configuration complete (DONE 2025-10-26) | READY | 2025-10-26T14:05Z | Tenant scope propagation follow-up (`AUTH-AOC-19-002`) tracked in gaps section. |
| Signer | Signer Guild | `SIGNER-API-11-101` / `SIGNER-REF-11-102` / `SIGNER-QUOTA-11-103` (DONE 2025-10-21) | READY | 2025-10-26T14:07Z | DSSE signing, referrer verification, and quota enforcement validated in CI. |
| Attestor | Attestor Guild | `ATTESTOR-API-11-201` / `ATTESTOR-VERIFY-11-202` / `ATTESTOR-OBS-11-203` (DONE 2025-10-19) | READY | 2025-10-26T14:10Z | Rekor submission/verification pipeline green; telemetry pack published. |
| Scanner Web + Worker | Scanner WebService Guild | `SCANNER-WEB-09-10x`, `SCANNER-RUNTIME-12-30x` (DONE 2025-10-18 -> 2025-10-24) | READY* | 2025-10-26T14:20Z | Orchestrator envelope work (`SCANNER-EVENTS-16-301/302`) still open; see gaps. |
| Concelier Core & Connectors | Concelier Core / Ops Guild | Ops runbook sign-off in `docs/ops/concelier-conflict-resolution.md` (2025-10-16) | READY | 2025-10-26T14:25Z | Conflict resolution & connector coverage accepted; Mongo schema hardening pending (see gaps). |
| Excititor API | Excititor Core Guild | Wave 0 connector ingest sign-offs (EXECPLAN.Section Wave 0) | READY | 2025-10-26T14:28Z | VEX linkset publishing complete for launch datasets. |
| Notify Web (legacy) | Notify Guild | Existing stack carried forward; Notifier program tracked separately (Sprint 38-40) | PENDING | 2025-10-26T14:32Z | Legacy notify web remains operational; migration to Notifier blocked on `SCANNER-EVENTS-16-301`. |
| Web UI | UI Guild | Stable build `registry.stella-ops.org/.../web-ui@sha256:10d9248...` deployed in stage and smoke-tested | READY | 2025-10-26T14:35Z | Policy editor GA items (Sprint 20) outside launch scope. |
| DevOps / Release | DevOps Guild | `deploy/tools/validate-profiles.sh` run (2025-10-26) covering dev/stage/prod/airgap/mirror | READY | 2025-10-26T15:02Z | Compose/Helm lint + docker compose config validated; see Section 2 for details. |
| Offline Kit | Offline Kit Guild | `DEVOPS-OFFLINE-18-004` (Go analyzer) and `DEVOPS-OFFLINE-18-005` (Python analyzer) complete; debug-store mirror pending (`DEVOPS-OFFLINE-17-004`). | PENDING | 2025-10-26T15:05Z | Awaiting release debug artefacts to finalise `DEVOPS-OFFLINE-17-004`; tracked in Section 3. |

_\* READY with caveat - remaining work noted in Section 3._

## 2. Deployment Readiness Checklist

- **Production profiles committed:** `deploy/compose/docker-compose.prod.yaml` and `deploy/helm/stellaops/values-prod.yaml` added with front-door network hand-off and secret references for Mongo/MinIO/core services.
- **Secrets placeholders documented:** `deploy/compose/env/prod.env.example` enumerates required credentials (`MONGO_INITDB_ROOT_PASSWORD`, `MINIO_ROOT_PASSWORD`, Redis/NATS endpoints, `FRONTDOOR_NETWORK`). Helm values reference Kubernetes secrets (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`).
- **Static validation executed:** `deploy/tools/validate-profiles.sh` run on 2025-10-26 (docker compose config + helm lint/template) with all profiles passing.
- **Ingress model defined:** Production compose profile introduces external `frontdoor` network; README updated with creation instructions and scope of externally reachable services.
- **Observability hooks:** Authority/Signer/Attestor telemetry packs verified; scanner runtime build-id metrics landed (`SCANNER-RUNTIME-17-401`). Grafana dashboards referenced in component runbooks.
- **Rollback assets:** Stage Compose profile remains aligned (`docker-compose.stage.yaml`), enabling rehearsals before prod cutover; release manifests (`deploy/releases/2025.09-stable.yaml`) map digests for reproducible rollback.
- **Rehearsal status:** 2025-10-26 validation dry-run executed (`deploy/tools/validate-profiles.sh` across dev/stage/prod/airgap/mirror). Full stage Helm rollout pending access to the managed staging cluster; target to complete once credentials are provisioned.

## 3. Outstanding Gaps & Follow-ups

| Item | Owner | Tracking Ref | Target / Next Step | Impact |
| --- | --- | --- | --- | --- |
| Tenant scope propagation and audit coverage | Authority Core Guild | `AUTH-AOC-19-002` (DOING 2025-10-26) | Land enforcement + audit fixtures by Sprint 19 freeze | Medium - required for multi-tenant GA but does not block initial cutover if tenants scoped manually. |
| Orchestrator event envelopes + Notifier handshake | Scanner WebService Guild | `SCANNER-EVENTS-16-301` (BLOCKED), `SCANNER-EVENTS-16-302` (DOING) | Coordinate with Gateway/Notifier owners on preview package replacement or binding redirects; rerun `dotnet test` once patch lands and refresh schema docs. Share envelope samples in `docs/events/` after tests pass. | High — gating Notifier migration; legacy notify path remains functional meanwhile. |
| Offline Kit Python analyzer bundle | Offline Kit Guild + Scanner Guild | `DEVOPS-OFFLINE-18-005` (DONE 2025-10-26) | Monitor for follow-up manifest updates and rerun smoke script when analyzers change. | Medium - ensures language analyzer coverage stays current for offline installs. |
| Offline Kit debug store mirror | Offline Kit Guild + DevOps Guild | `DEVOPS-OFFLINE-17-004` (BLOCKED 2025-10-26) | Release pipeline must publish `out/release/debug` artefacts; once available, run `mirror_debug_store.py` and commit `metadata/debug-store.json`. | Low - symbol lookup remains accessible from staging assets but required before next Offline Kit tag. |
| Mongo schema validators for advisory ingestion | Concelier Storage Guild | `CONCELIER-STORE-AOC-19-001` (TODO) | Finalize JSON schema + migration toggles; coordinate with Ops for rollout window | Low - current validation handled in app layer; schema guard adds defense-in-depth. |
| Authority plugin telemetry alignment | Security Guild | `SEC2.PLG`, `SEC3.PLG`, `SEC5.PLG` (BLOCKED pending AUTH DPoP/MTLS tasks) | Resume once upstream auth surfacing stabilises | Low - plugin remains optional; launch uses default Authority configuration. |

## 4. Approvals & Distribution

- Record shared in `#launch-readiness` (Mattermost) 2025-10-26 15:15 UTC with DevOps + Guild leads for acknowledgement.
- Updates to this document require dual sign-off from DevOps Guild (owner) and impacted module guild lead; retain change log via Git history.
- Cutover rehearsal and rollback drills are tracked separately in `docs/ops/launch-cutover.md` (see associated Task `DEVOPS-LAUNCH-18-001`).
@@ -1,50 +0,0 @@
# SemVer Style Backfill Runbook

_Last updated: 2025-10-11_

## Overview

The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures provenance `decisionReason` values are preserved during future reads. The migration is idempotent and only runs when the feature flag `concelier:storage:enableSemVerStyle` is enabled.

## Preconditions

1. **Review configuration** – set `concelier.storage.enableSemVerStyle` to `true` on all Concelier services.
2. **Confirm batch size** – adjust `concelier.storage.backfillBatchSize` if you need smaller batches for older deployments (default: `250`).
3. **Back up** – capture a fresh snapshot of the `advisory` collection or a full MongoDB backup.
4. **Staging dry-run** – enable the flag in a staging environment and observe the migration output before rolling to production.

## Execution

No manual command is required. After deploying the configuration change, restart the Concelier WebService or any component that hosts the Mongo migration runner. During startup you will see log entries similar to:

```
Applying Mongo migration 20251011-semver-style-backfill: Populate advisory.normalizedVersions for existing documents when SemVer style storage is enabled.
Mongo migration 20251011-semver-style-backfill applied
```

The migration reads advisories in batches (`concelier.storage.backfillBatchSize`) and writes flattened `normalizedVersions` arrays. Existing documents without SemVer ranges remain untouched.

## Post-checks

1. Verify the new indexes exist:
   ```
   db.advisory.getIndexes()
   ```
   You should see `advisory_normalizedVersions_pkg_scheme_type` and `advisory_normalizedVersions_value`.
2. Spot check a few advisories to confirm the top-level `normalizedVersions` array exists and matches the embedded package data (see the query sketch after this list).
3. Run `dotnet test` for `StellaOps.Concelier.Storage.Mongo.Tests` (optional but recommended) in CI to confirm the storage suite passes with the feature flag enabled.

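A spot-check sketch for the `mongosh` shell — the `advisory` collection and `normalizedVersions` field come from this runbook, while the other projection fields are assumptions; adjust them to your document shape:

```
// Sample a few advisories that carry SemVer ranges and eyeball the flattened
// top-level array against the embedded package data.
db.advisory.find(
  { normalizedVersions: { $exists: true, $ne: [] } },
  { normalizedVersions: 1, "affectedPackages.versionRanges": 1 }   // second field is illustrative
).limit(5).pretty()
```
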
## Rollback

Set `concelier.storage.enableSemVerStyle` back to `false` and redeploy. The migration will be skipped on subsequent startups. You can leave the populated `normalizedVersions` arrays in place; they are ignored when the feature flag is off. If you must remove them entirely, restore from the backup captured during preparation.
@@ -1,64 +0,0 @@
# NuGet Preview Bootstrap (Offline-Friendly)

The StellaOps build relies on .NET 10 RC2 packages (Microsoft.Extensions.*, JwtBearer 10.0 RC).
`NuGet.config` now wires three sources:

1. `local` → `./local-nuget` (preferred, air-gapped mirror)
2. `dotnet-public` → `https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json`
3. `nuget.org` → fallback for everything else

Follow the steps below whenever you refresh the repo or roll a new Offline Kit drop.

## 1. Mirror the preview packages

```bash
./ops/devops/sync-preview-nuget.sh
```

* Reads `ops/devops/nuget-preview-packages.csv`. Each line specifies the package, version, expected SHA-256 hash, and (optionally) the flat-container base URL (we pin to `dotnet-public`); an illustrative line is shown below.
* Downloads the `.nupkg` straight into `./local-nuget/` and re-verifies the checksum. Existing files are skipped when hashes already match.
* Use `NUGET_V2_BASE` if you need to temporarily point at a different mirror.

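For orientation only, a manifest line might look like this sketch — the column order (package, version, SHA-256, optional flat-container base URL) is inferred from the description above and the version/hash values are placeholders, not copied from the real manifest:

```
Microsoft.Extensions.Logging,10.0.0-rc.2.XXXXX.X,<sha256-hex>,https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/flat2
```
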
💡 The script never mutates packages in place—if a checksum changes you will see a “SHA mismatch … refreshing” message.

## 2. Restore using the shared `NuGet.config`

From the repo root:

```bash
DOTNET_NOLOGO=1 dotnet restore src/Excititor/__Libraries/StellaOps.Excititor.Connectors.Abstractions/StellaOps.Excititor.Connectors.Abstractions.csproj \
  --configfile NuGet.config
```

The `packageSourceMapping` section keeps `Microsoft.Extensions.*`, `Microsoft.AspNetCore.*`, and `Microsoft.Data.Sqlite` bound to `local`/`dotnet-public`, so `dotnet restore` never has to reach out to nuget.org when mirrors are populated.

Before committing changes (or when wiring up a new environment) run:

```bash
python3 ops/devops/validate_restore_sources.py
```

The validator asserts:

- `NuGet.config` lists `local` → `dotnet-public` → `nuget.org` in that order.
- `Directory.Build.props` pins `RestoreSources` so every project prioritises the local mirror.
- No stray `NuGet.config` files shadow the repo root configuration.

CI executes the validator in both the `build-test-deploy` and `release` workflows, so regressions trip before any restore/build begins.

If you run fully air-gapped, remember to clear the cache between SDK upgrades:

```bash
dotnet nuget locals all --clear
```

## 3. Troubleshooting

| Symptom | Fix |
| --- | --- |
| `dotnet restore` still hits nuget.org for preview packages | Re-run `sync-preview-nuget.sh` to ensure the `.nupkg` exists locally, then delete `~/.nuget/packages/microsoft.extensions.*` so the resolver picks up the mirrored copy. |
| SHA mismatch in the manifest | Update `ops/devops/nuget-preview-packages.csv` with the new version + checksum (from the feed) and re-run the sync script. |
| Azure DevOps feed throttling | Set `DOTNET_PUBLIC_FLAT_BASE` env var and point it at your own mirrored flat-container, then add the URL to the 4th column of the manifest. |

Keep this doc alongside Offline Kit instructions so air-gapped operators know exactly how to refresh the mirror and verify packages before restore.
@@ -1,66 +0,0 @@
# Registry Token Service Operations

_Component_: `src/Registry/StellaOps.Registry.TokenService`

The registry token service issues short-lived Docker registry bearer tokens after validating an Authority OpTok (DPoP/mTLS sender constraint) and the customer’s plan entitlements. It is fronted by the Docker registry’s `Bearer realm` flow.

## Configuration

Configuration lives in `etc/registry-token.yaml` and can be overridden through environment variables prefixed with `REGISTRY_TOKEN_`. Key sections:

| Section | Purpose |
| ------- | ------- |
| `authority` | Authority issuer/metadata URL, audience list, and scopes required to request tokens (default `registry.token.issue`). |
| `signing` | JWT issuer, signing key (PEM or PFX), optional key ID, and token lifetime (default five minutes). The repository ships **`etc/registry-signing-sample.pem`** for local testing only – replace it with a private key generated and stored per-environment before going live. |
| `registry` | Registry realm URL and optional allow-list of `service` values accepted from the registry challenge. |
| `plans` | Plan catalogue mapping plan name → repository patterns and allowed actions. Wildcards (`*`) are supported per path segment. |
| `defaultPlan` | Applied when the caller’s token omits `stellaops:plan`. |
| `revokedLicenses` | Blocks issuance when the caller presents a matching `stellaops:license` claim. |

Plan entries must cover every private repository namespace. Actions default to `pull` if omitted.

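An illustrative `plans` fragment for `etc/registry-token.yaml` — the key names beneath each plan are an assumption based on the table above, and the plan/repository names are placeholders; verify against the shipped sample configuration before relying on it:

```yaml
defaultPlan: community
plans:
  community:
    repositories:
      - "stella-ops/public/*"     # wildcard per path segment
    actions: [pull]
  enterprise:
    repositories:
      - "stella-ops/public/*"
      - "stella-ops/enterprise/*"
    actions: [pull, push]
```
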
## Request flow

1. Docker/OCI client contacts the registry and receives a `401` with `WWW-Authenticate: Bearer realm=...,service=...,scope=repository:...`.
2. Client acquires an OpTok from Authority (DPoP/mTLS bound) with the `registry.token.issue` scope.
3. Client calls `GET /token?service=<service>&scope=repository:<name>:<actions>` against the token service, presenting the OpTok and matching DPoP proof.
4. The service validates the token, plan, and requested scopes, then issues a JWT containing an `access` claim conforming to the Docker registry spec (claim shape sketched below).

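The `access` claim follows the Docker registry token specification — an array of resource grants. A decoded payload might carry something like the following (issuer, audience, repository name, and actions are placeholders for this sketch):

```json
{
  "iss": "stellaops-registry-token-service",
  "aud": "registry.localhost",
  "access": [
    {
      "type": "repository",
      "name": "stella-ops/public/base",
      "actions": ["pull"]
    }
  ]
}
```
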
All denial paths return RFC 6750-style problem responses (HTTP 400 for malformed scopes, 403 for plan or revocation failures).

## Monitoring

The service emits OpenTelemetry metrics via `registry_token_issued_total` and `registry_token_rejected_total`. Suggested Prometheus alerts:

| Metric | Condition | Action |
|--------|-----------|--------|
| `registry_token_rejected_total` | `increase(...) > 0` over 5 minutes | Investigate plan misconfiguration or licence revocation. |
| `registry_token_issued_total` | Sudden drop compared to baseline | Confirm registry is still challenging with the expected realm/service. |

Enable the built-in `/healthz` endpoint for liveness checks. Authentication and DPoP failures surface via the service logs (Serilog console output).

## Sample deployment

```bash
dotnet run --project src/Registry/StellaOps.Registry.TokenService \
  --urls "http://0.0.0.0:8085"

curl -H "Authorization: Bearer <OpTok>" \
  -H "DPoP: $(dpop-proof ...)" \
  "http://localhost:8085/token?service=registry.localhost&scope=repository:stella-ops/public/base:pull"
```

Replace `<OpTok>` and `DPoP` with tokens issued by Authority. The response contains `token`, `expires_in`, and `issued_at` fields suitable for Docker/OCI clients.
@@ -1,155 +0,0 @@
{
  "title": "StellaOps Scanner Analyzer Benchmarks",
  "uid": "scanner-analyzer-bench",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "refresh": 1,
        "hide": 0,
        "current": {}
      }
    ]
  },
  "annotations": {
    "list": []
  },
  "panels": [
    {
      "id": 1,
      "title": "Max Duration (ms)",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "displayName": "{{scenario}}"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "expr": "scanner_analyzer_bench_max_ms",
          "legendFormat": "{{scenario}}",
          "refId": "A"
        },
        {
          "expr": "scanner_analyzer_bench_baseline_max_ms",
          "legendFormat": "{{scenario}} baseline",
          "refId": "B"
        }
      ]
    },
    {
      "id": 2,
      "title": "Regression Ratio vs Limit",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "displayName": "{{scenario}}",
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 20
              }
            ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "multi",
          "sort": "none"
        }
      },
      "targets": [
        {
          "expr": "(scanner_analyzer_bench_regression_ratio - 1) * 100",
          "legendFormat": "{{scenario}} regression %",
          "refId": "A"
        },
        {
          "expr": "(scanner_analyzer_bench_regression_limit - 1) * 100",
          "legendFormat": "{{scenario}} limit %",
          "refId": "B"
        }
      ]
    },
    {
      "id": 3,
      "title": "Breached Scenarios",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "displayName": "{{scenario}}",
          "unit": "short"
        },
        "overrides": []
      },
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "center",
        "reduceOptions": {
          "calcs": [
            "last"
          ],
          "fields": "",
          "values": false
        }
      },
      "targets": [
        {
          "expr": "scanner_analyzer_bench_regression_breached",
          "legendFormat": "{{scenario}}",
          "refId": "A"
        }
      ]
    }
  ]
}
@@ -1,48 +0,0 @@
# Scanner Analyzer Benchmarks – Operations Guide

## Purpose
Keep the language analyzer microbench under the < 5 s SBOM pledge. CI emits Prometheus metrics and JSON fixtures so trend dashboards and alerts stay in lockstep with the repository baseline.

> **Grafana note:** Import `docs/ops/scanner-analyzers-grafana-dashboard.json` into your Prometheus-backed Grafana stack to monitor `scanner_analyzer_bench_*` metrics and alert on regressions.

## Publishing workflow
1. CI (or engineers running locally) execute:
   ```bash
   dotnet run \
     --project src/Bench/StellaOps.Bench/Scanner.Analyzers/StellaOps.Bench.ScannerAnalyzers/StellaOps.Bench.ScannerAnalyzers.csproj \
     -- \
     --repo-root . \
     --out src/Bench/StellaOps.Bench/Scanner.Analyzers/baseline.csv \
     --json out/bench/scanner-analyzers/latest.json \
     --prom out/bench/scanner-analyzers/latest.prom \
     --commit "$(git rev-parse HEAD)" \
     --environment "${CI_ENVIRONMENT_NAME:-local}"
   ```
2. Publish the artefacts (`baseline.csv`, `latest.json`, `latest.prom`) to `bench-artifacts/<date>/`.
3. Promtail (or the CI job) pushes `latest.prom` into Prometheus; JSON lands in long-term storage for workbook snapshots.
4. The harness exits non-zero if:
   - `max_ms` for any scenario breaches its configured threshold; or
   - `max_ms` regresses ≥ 20 % versus `baseline.csv`.

## Grafana dashboard
- Import `docs/ops/scanner-analyzers-grafana-dashboard.json`.
- Point the template variable `datasource` to the Prometheus instance ingesting `scanner_analyzer_bench_*` metrics.
- Panels:
  - **Max Duration (ms)** – compares live runs vs baseline.
  - **Regression Ratio vs Limit** – plots `(max / baseline_max - 1) * 100`.
  - **Breached Scenarios** – stat panel sourced from `scanner_analyzer_bench_regression_breached`.

## Alerting & on-call response
- **Primary alert**: fire when `scanner_analyzer_bench_regression_ratio{scenario=~".+"} >= 1.20` for 2 consecutive samples (10 min default). Suggested PromQL (a full alerting-rule sketch follows this section):
  ```
  max_over_time(scanner_analyzer_bench_regression_ratio[10m]) >= 1.20
  ```
- Suppress duplicates using the `scenario` label.
- Pager payload should include `scenario`, `max_ms`, `baseline_max_ms`, and `commit`.
- Immediate triage steps:
  1. Check `latest.json` artefact for the failing scenario – confirm commit and environment.
  2. Re-run the harness with `--captured-at` and `--baseline` pointing at the last known good CSV to verify determinism.
  3. If regression persists, open an incident ticket tagged `scanner-analyzer-perf` and page the owning language guild.
  4. Roll back the offending change or update the baseline after sign-off from the guild lead and Perf captain.

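A Prometheus alerting-rule sketch wrapping the suggested expression above (group and alert names are placeholders; tune `for:` and labels to your paging policy):

```yaml
groups:
  - name: scanner-analyzer-bench
    rules:
      - alert: ScannerAnalyzerBenchRegression
        # Ratio of current max_ms to the repository baseline; >= 1.20 means a 20 % regression.
        expr: max_over_time(scanner_analyzer_bench_regression_ratio[10m]) >= 1.20
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Analyzer bench regression for {{ $labels.scenario }}"
          description: "max_ms exceeded the 20 % regression budget versus baseline.csv."
```
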
Document the outcome in `docs/12_PERFORMANCE_WORKBOOK.md` (section 8) so trendlines reflect any accepted regressions.
@@ -1,88 +0,0 @@
|
||||
# Scanner Artifact Store Migration (MinIO → RustFS)
|
||||
|
||||
## Overview
|
||||
|
||||
Sprint 11 introduces **RustFS** as the default artifact store for the Scanner plane. Existing
|
||||
deployments running MinIO (or any S3-compatible backend) must migrate stored SBOM artefacts to RustFS
|
||||
before switching the Scanner hosts to `scanner.artifactStore.driver = "rustfs"`.
|
||||
|
||||
This runbook covers the recommended migration workflow and validation steps.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- RustFS service deployed and reachable from the Scanner control plane (`http(s)://rustfs:8080`).
|
||||
- Existing MinIO/S3 credentials with read access to the current bucket.
|
||||
- CLI environment with the StellaOps source tree (for the migration tool) and `dotnet 10` SDK.
|
||||
- Maintenance window sized to copy all artefacts (migration is read-only on the source bucket).
|
||||
|
||||
## 1. Snapshot source bucket (optional but recommended)
|
||||
|
||||
If the MinIO deployment offers versioning or snapshots, take one before migrating. For non-versioned
|
||||
deployments, capture an external backup (e.g., `mc mirror` to offline storage).
|
||||
|
||||
## 2. Dry-run the migrator
|
||||
|
||||
```
|
||||
dotnet run --project tools/RustFsMigrator -- \
|
||||
--s3-bucket scanner-artifacts \
|
||||
--s3-endpoint http://stellaops-minio:9000 \
|
||||
--s3-access-key stellaops \
|
||||
--s3-secret-key dev-minio-secret \
|
||||
--rustfs-endpoint http://stellaops-rustfs:8080 \
|
||||
--rustfs-bucket scanner-artifacts \
|
||||
--prefix scanner/ \
|
||||
--dry-run
|
||||
```
|
||||
|
||||
The dry-run enumerates keys and reports the object count without writing to RustFS. Use this to
|
||||
estimate migration time.
|
||||
|
||||
## 3. Execute migration
|
||||
|
||||
Remove the `--dry-run` flag to copy data. Optional flags:
|
||||
|
||||
- `--immutable` – mark all migrated objects as immutable (`X-RustFS-Immutable`).
|
||||
- `--retain-days 365` – request retention (in days) via `X-RustFS-Retain-Seconds`.
|
||||
- `--rustfs-api-key-header` / `--rustfs-api-key` – provide auth headers when RustFS is protected.
|
||||
|
||||
The tool streams each object from S3 and performs an idempotent `PUT` to RustFS preserving the key
|
||||
structure (e.g., `scanner/layers/<sha256>/sbom.cdx.json.zst`).
|
||||
|
||||
## 4. Verify sample objects

Pick a handful of SBOM digests and confirm (a spot-check script follows this list):

1. `GET /api/v1/buckets/<bucket>/objects/<key>` returns the expected payload (size + SHA-256).
2. Scanner WebService configured with `scanner.artifactStore.driver = "rustfs"` can fetch the same artefacts (smoke test: `GET /api/v1/scanner/sboms/<digest>?format=cdx-json`).
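
A quick way to spot-check step 1 is to compare the SHA-256 of an object served by RustFS against the original in MinIO. The sketch below assumes an `mc` alias named `minio` points at the legacy deployment and that the RustFS object API is reachable without extra auth; adjust the key, paths, and headers to your environment.

```bash
#!/usr/bin/env bash
set -euo pipefail

BUCKET="scanner-artifacts"
KEY="scanner/layers/<sha256>/sbom.cdx.json.zst"   # substitute a real key from the dry-run listing
RUSTFS="http://stellaops-rustfs:8080"

# Digest of the legacy copy (mc alias "minio" assumed)
src_digest=$(mc cat "minio/${BUCKET}/${KEY}" | sha256sum | cut -d' ' -f1)

# Digest of the migrated copy served by RustFS
dst_digest=$(curl -fsS "${RUSTFS}/api/v1/buckets/${BUCKET}/objects/${KEY}" | sha256sum | cut -d' ' -f1)

if [ "${src_digest}" = "${dst_digest}" ]; then
  echo "OK  ${KEY} (${src_digest})"
else
  echo "MISMATCH ${KEY}: ${src_digest} != ${dst_digest}" >&2
  exit 1
fi
```
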
## 5. Switch Scanner hosts

Update configuration (Helm/Compose/environment) to set:

```yaml
scanner:
  artifactStore:
    driver: rustfs
    endpoint: http://stellaops-rustfs:8080
    bucket: scanner-artifacts
    timeoutSeconds: 30
```

Redeploy Scanner WebService and Worker. Monitor logs for `RustFS` upload/download messages and confirm Prometheus is scraping `rustfs_requests_total`.
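
As a quick sanity check after the redeploy, confirm the counter is moving from Prometheus; the minimal query below deliberately omits grouping labels because the exporter's label set is not documented here.

```promql
sum(rate(rustfs_requests_total[5m]))
```
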
## 6. Cleanup legacy MinIO (optional)

After a complete migration and validation period, decommission the MinIO bucket or repurpose it for other components (Concelier still supports S3). Ensure backups reference RustFS snapshots going forward.

## Troubleshooting

- **Uploads fail (HTTP 4xx/5xx):** Check RustFS logs and confirm API key headers. Re-run the migrator for the affected keys.
- **Missing objects post-cutover:** Re-run the migrator with the specific `--prefix`. The tool is idempotent and safely overwrites existing objects.
- **Performance tuning:** Run multiple instances of the migrator with disjoint prefixes if needed; the RustFS API is stateless and supports parallel PUTs.
@@ -1,261 +0,0 @@

{
  "title": "Scheduler Worker – Planning & Rescan",
  "uid": "scheduler-worker-observability",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": { "from": "now-24h", "to": "now" },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "hide": 0,
        "refresh": 1,
        "current": {}
      },
      {
        "name": "mode",
        "label": "Mode",
        "type": "query",
        "datasource": { "type": "prometheus", "uid": "${datasource}" },
        "query": "label_values(scheduler_planner_runs_total, mode)",
        "refresh": 1,
        "multi": true,
        "includeAll": true,
        "allValue": ".*",
        "current": { "selected": false, "text": "All", "value": ".*" }
      }
    ]
  },
  "annotations": { "list": [] },
  "panels": [
    {
      "id": 1,
      "title": "Planner Runs per Status",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": { "defaults": { "unit": "ops", "displayName": "{{status}}" }, "overrides": [] },
      "options": { "legend": { "displayMode": "table", "placement": "bottom" } },
      "targets": [
        {
          "expr": "sum by (status) (rate(scheduler_planner_runs_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    },
    {
      "id": 2,
      "title": "Planner Latency P95 (s)",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": { "defaults": { "unit": "s" }, "overrides": [] },
      "options": { "legend": { "displayMode": "table", "placement": "bottom" } },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket{mode=~\"$mode\"}[5m])))",
          "legendFormat": "p95",
          "refId": "A"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
    },
    {
      "id": 3,
      "title": "Runner Segments per Status",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": { "defaults": { "unit": "ops", "displayName": "{{status}}" }, "overrides": [] },
      "options": { "legend": { "displayMode": "table", "placement": "bottom" } },
      "targets": [
        {
          "expr": "sum by (status) (rate(scheduler_runner_segments_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
    },
    {
      "id": 4,
      "title": "New Findings per Severity",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": { "defaults": { "unit": "ops", "displayName": "{{severity}}" }, "overrides": [] },
      "options": { "legend": { "displayMode": "table", "placement": "bottom" } },
      "targets": [
        { "expr": "sum(rate(scheduler_runner_delta_critical_total{mode=~\"$mode\"}[5m]))", "legendFormat": "critical", "refId": "A" },
        { "expr": "sum(rate(scheduler_runner_delta_high_total{mode=~\"$mode\"}[5m]))", "legendFormat": "high", "refId": "B" },
        { "expr": "sum(rate(scheduler_runner_delta_total{mode=~\"$mode\"}[5m]))", "legendFormat": "total", "refId": "C" }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
    },
    {
      "id": 5,
      "title": "Runner Backlog by Schedule",
      "type": "table",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": { "defaults": { "displayName": "{{scheduleId}}", "unit": "none" }, "overrides": [] },
      "options": { "showHeader": true },
      "targets": [
        { "expr": "max by (scheduleId) (scheduler_runner_backlog{mode=~\"$mode\"})", "format": "table", "refId": "A" }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
    },
    {
      "id": 6,
      "title": "Active Runs",
      "type": "stat",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": { "defaults": { "unit": "none" }, "overrides": [] },
      "options": { "orientation": "horizontal", "textMode": "value" },
      "targets": [
        { "expr": "sum(scheduler_runs_active{mode=~\"$mode\"})", "refId": "A" }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
    }
  ]
}
@@ -1,82 +0,0 @@

# Scheduler Worker – Observability & Runbook

## Purpose
Monitor planner and runner health for the Scheduler Worker (Sprint 16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.

> **Grafana note:** Import `docs/ops/scheduler-worker-grafana-dashboard.json` into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.

---

## Key metrics

| Metric | Use case | Suggested query |
| --- | --- | --- |
| `scheduler_planner_runs_total{status}` | Planner throughput & failure ratio | `sum by (status) (rate(scheduler_planner_runs_total[5m]))` |
| `scheduler_planner_latency_seconds_bucket` | Planning latency (p95 / p99) | `histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m])))` |
| `scheduler_runner_segments_total{status}` | Runner success vs retries | `sum by (status) (rate(scheduler_runner_segments_total[5m]))` |
| `scheduler_runner_delta_{critical,high,total}` | Newly detected findings | `sum(rate(scheduler_runner_delta_critical_total[5m]))` |
| `scheduler_runner_backlog{scheduleId}` | Remaining digests awaiting runner | `max by (scheduleId) (scheduler_runner_backlog)` |
| `scheduler_runs_active{mode}` | Active runs in flight | `sum(scheduler_runs_active)` |

Reference queries power the bundled Grafana dashboard panels. Use the `mode` template variable to focus on `analysisOnly` versus `contentRefresh` schedules.
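
For ad-hoc triage, the per-status query above folds into the planner failure ratio that the `SchedulerPlannerFailuresHigh` rule evaluates (the alert pages when this stays above 0.05 for 10 minutes):

```promql
# Fraction of planner runs failing over the last 5 minutes
sum(rate(scheduler_planner_runs_total{status="failed"}[5m]))
  /
sum(rate(scheduler_planner_runs_total[5m]))
```
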
---

## Grafana dashboard

1. Import `docs/ops/scheduler-worker-grafana-dashboard.json` (UID `scheduler-worker-observability`).
2. Point the `datasource` variable to the Prometheus instance scraping the collector. Optional: pin the `mode` variable to a specific schedule mode.
3. Panels included:
   - **Planner Runs per Status** – visualises the success vs failure ratio.
   - **Planner Latency P95** – highlights degradations in ImpactIndex or Mongo lookups.
   - **Runner Segments per Status** – shows retry pressure and queue health.
   - **New Findings per Severity** – rolls up delta counters (critical/high/total).
   - **Runner Backlog by Schedule** – tabulates outstanding digests per schedule.
   - **Active Runs** – stat panel showing the current number of in-flight runs.

Capture screenshots once Grafana provisioning completes and store them under `docs/assets/dashboards/` (pending automation ticket OBS-157).

---

## Prometheus alerts

Import `docs/ops/scheduler-worker-prometheus-rules.yaml` into your Prometheus rule configuration. The bundle defines:

- **SchedulerPlannerFailuresHigh** – 5%+ of planner runs failed for 10 minutes. Page SRE.
- **SchedulerPlannerLatencyHigh** – planner p95 latency remains above 45 s for 10 minutes. Investigate ImpactIndex, Mongo, and the Feedser/Vexer event queues.
- **SchedulerRunnerBacklogGrowing** – backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
- **SchedulerRunStuck** – active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.

Hook these alerts into the existing Observability notification pathway (`observability-pager` routing key) and ensure `service=scheduler-worker` is mapped to the on-call rotation.

---

## Runbook snapshot

1. **Planner failure/latency:**
   - Check planner logs for ImpactIndex or Mongo exceptions.
   - Verify Feedser/Vexer webhook health; requeue events if necessary.
   - If the planner is overwhelmed, temporarily reduce schedule parallelism via `stella scheduler schedule update`.
2. **Runner backlog spike:**
   - Confirm Scanner WebService health (`/healthz`).
   - Inspect the runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
   - Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
3. **Stuck runs:**
   - Use `stella scheduler runs list --state running` to identify affected runs.
   - Drill into the Grafana panel “Runner Backlog by Schedule” to see the offending schedule IDs.
   - If a segment will not progress, use `stella scheduler segments release --segment <id>` to force a retry after resolving the root cause.
4. **Unexpected critical deltas:**
   - Correlate `scheduler_runner_delta_critical_total` spikes with Notify events (`scheduler.rescan.delta`).
   - Pivot to the Scanner report links for impacted digests and confirm they match upstream advisories/policies.

Document incidents and mitigations in `ops/runbooks/INCIDENT_LOG.md` (per SRE policy) and attach Grafana screenshots for post-mortems.

---

## Checklist

- [ ] Grafana dashboard imported and wired to the Prometheus datasource.
- [ ] Prometheus alert rules deployed (see above).
- [ ] Runbook linked from the on-call rotation portal.
- [ ] Observability Guild sign-off captured for Sprint 16 telemetry (OWNER: @obs-guild).
@@ -1,42 +0,0 @@

groups:
  - name: scheduler-worker
    interval: 30s
    rules:
      - alert: SchedulerPlannerFailuresHigh
        expr: |
          sum(rate(scheduler_planner_runs_total{status="failed"}[5m]))
            /
          sum(rate(scheduler_planner_runs_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: scheduler-worker
        annotations:
          summary: "Planner failure ratio above 5%"
          description: "More than 5% of planning runs are failing. Inspect scheduler logs and ImpactIndex connectivity before queues back up."
      - alert: SchedulerPlannerLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m]))) > 45
        for: 10m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Planner latency p95 above 45s"
          description: "Planning latency p95 stayed above 45 seconds for 10 minutes. Check ImpactIndex, Mongo, or external selectors to prevent missed SLAs."
      - alert: SchedulerRunnerBacklogGrowing
        expr: max_over_time(scheduler_runner_backlog[15m]) > 500
        for: 15m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Runner backlog above 500 images"
          description: "Runner backlog exceeded 500 images over the last 15 minutes. Verify runner workers, scanner availability, and rate limits."
      # Both sides are aggregated with sum() so the `and` matches on empty label
      # sets; this assumes scheduler_runs_active carries per-mode labels.
      - alert: SchedulerRunStuck
        expr: |
          sum(scheduler_runs_active) > 0
          and
          max_over_time(sum(scheduler_runs_active)[30m:1m]) == min_over_time(sum(scheduler_runs_active)[30m:1m])
        for: 30m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Scheduler runs stuck without progress"
          description: "Active runs count has remained flat for 30 minutes. Investigate stuck segments or scanner timeouts."
@@ -1,113 +0,0 @@

# Telemetry Collector Deployment Guide

> **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003).

This guide describes how to deploy the default OpenTelemetry Collector packaged with Stella Ops, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments.

---

## 1. Overview

The collector terminates OTLP traffic from Stella Ops services and exports metrics, traces, and logs.

| Endpoint | Purpose | TLS | Authentication |
| -------- | ------- | --- | -------------- |
| `:4317` | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA |
| `:4318` | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA |
| `:9464` | Prometheus scrape | mTLS | Same client certificate |
| `:13133` | Health check | mTLS | Same client certificate |
| `:1777` | pprof diagnostics | mTLS | Same client certificate |

The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart.
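
Client workloads reach the gRPC endpoint with the same CA and client certificate pair. A minimal sketch using the standard OpenTelemetry OTLP exporter environment variables is shown below; the collector hostname and certificate mount paths are assumptions for illustration.

```bash
# Point a Stella Ops service (or any OTLP client) at the collector over mTLS.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_EXPORTER_OTLP_CERTIFICATE="/etc/telemetry/tls/ca.crt"            # CA that signed the collector cert
export OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE="/etc/telemetry/tls/client.crt" # client cert issued by the same CA
export OTEL_EXPORTER_OTLP_CLIENT_KEY="/etc/telemetry/tls/client.key"
```
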
---

## 2. Local validation (Compose)

```bash
# 1. Generate dev certificates (CA + collector + client)
./ops/devops/telemetry/generate_dev_tls.sh

# 2. Start the collector overlay
cd deploy/compose
docker compose -f docker-compose.telemetry.yaml up -d

# 3. Start the storage overlay (Prometheus, Tempo, Loki)
docker compose -f docker-compose.telemetry-storage.yaml up -d

# 4. Run the smoke test (OTLP HTTP)
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

The smoke test posts sample traces, metrics, and logs and verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay gives you a local Prometheus/Tempo/Loki stack to confirm end-to-end wiring, and the same client certificate can be reused by local services to stitch traces together. See [`Telemetry Storage Deployment`](telemetry-storage.md) for the storage configuration used in staging/production.

---

## 3. Kubernetes deployment

Enable the collector in Helm by setting the following values (example shown for the dev profile):

```yaml
telemetry:
  collector:
    enabled: true
    defaultTenant: <tenant>
    tls:
      secretName: stellaops-otel-tls-<env>
```

Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt` holding the collector certificate, private key, and issuing CA respectively. Example:

```bash
kubectl create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector.crt \
  --from-file=tls.key=collector.key \
  --from-file=ca.crt=ca.crt
```

Helm renders the collector deployment, service, and config map automatically:

```bash
helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml
```

Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA.

---

## 4. Offline packaging (DEVOPS-OBS-50-003)

Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites:

```bash
python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz
```

The script gathers:

- `deploy/telemetry/README.md`
- Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and the Helm copy)
- Helm template/values for the collector
- Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`)

The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide the `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` environment variables (or use the `--cosign-key` flag).
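
Consumers can verify the artefact offline before mirroring it. The signature file name and public key path below are assumptions; use whatever the `--sign` run actually produced alongside the tarball.

```bash
cd out/telemetry

# Checksum shipped next to the bundle
sha256sum -c telemetry-bundle.tar.gz.sha256

# Optional: verify the Cosign signature when --sign was used
cosign verify-blob \
  --key cosign.pub \
  --signature telemetry-bundle.tar.gz.sig \
  telemetry-bundle.tar.gz
```
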
Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls` secret.

---

## 5. Operational checks

1. **Health probes** – `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`.
2. **Metrics scrape** – confirm Prometheus ingests the `otelcol_receiver_accepted_*` counters.
3. **Trace correlation** – ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for the expected spans.
4. **Certificate rotation** – when rotating the CA, update the secret and restart the collector; roll out new client certificates before enabling `require_client_certificate` if the rollout is staged.

---

## 6. Related references

- `deploy/telemetry/README.md` – source configuration and local workflow.
- `ops/devops/telemetry/smoke_otel_collector.py` – OTLP smoke test.
- `docs/observability/observability.md` – metrics/traces/logs taxonomy.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release checklist for telemetry assets.
@@ -1,173 +0,0 @@

# Telemetry Storage Deployment (DEVOPS-OBS-50-002)

> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.

---

## 1. Components & Ports

| Service | Port | Purpose | TLS |
|-----------|------|---------|-----|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |

The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector’s `/metrics` endpoint, and Loki is used for log search.

---

## 2. Local validation (Compose)

```bash
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.

---

## 3. Kubernetes blueprint

Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (charts not yet versioned—define them in the observability repo):

```yaml
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
    additionalScrapeConfigsSecret: stellaops-prometheus-scrape
    extraSecretMounts:
      - name: otel-mtls
        secretName: stellaops-otel-tls-stage
        mountPath: /etc/telemetry/tls
        readOnly: true
      - name: otel-token
        secretName: stellaops-prometheus-token
        mountPath: /etc/telemetry/auth
        readOnly: true

loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides

tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
```

### Staging bootstrap commands

```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml

kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```

Provision the following secrets/configs (names can be overridden via Helm values):

| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-loki-tenant-overrides` | ConfigMap | Contents of `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Contents of `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |

---

## 4. Authority & tenancy integration

1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`):
   ```bash
   stella authority client create observability-prometheus \
     --scopes obs:read \
     --audience observability --description "Prometheus collector scrape"
   stella authority client create observability-loki \
     --scopes obs:logs timeline:read \
     --audience observability --description "Loki ingestion"
   stella authority client create observability-tempo \
     --scopes obs:traces \
     --audience observability --description "Tempo ingestion"
   ```
2. Mint tokens/credentials and store them in the secrets above (see the staging bootstrap commands). Example:
   ```bash
   stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
   ```
3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).

---

## 5. Retention & isolation

- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import it inside the Offline Kit staging directory.
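
As an illustration of the knobs those tenant files carry, a Loki runtime-overrides sketch might look like the following. The tenant names and limit keys follow upstream Loki's `limits_config` conventions and are not copied from the shipped defaults, so treat the snippet as an example only.

```yaml
# Illustrative per-tenant overrides (compare with deploy/telemetry/storage/tenants/loki-overrides.yaml)
overrides:
  tenant-a:
    retention_period: 720h        # keep 30 days of logs
    ingestion_rate_mb: 8
    ingestion_burst_size_mb: 16
  tenant-b:
    retention_period: 168h        # lower tier keeps 7 days
    ingestion_rate_mb: 4
```
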
---

## 6. Operational checklist

- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
- [ ] Component rule packs imported (e.g. `docs/ops/scheduler-worker-prometheus-rules.yaml`).

---

## 7. References

- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/ops/telemetry-collector.md`
- `docs/observability/observability.md`
@@ -1,32 +0,0 @@

# UI Auth Smoke Job (Playwright)

The DevOps Guild tracks **DEVOPS-UI-13-006** to wire the new Playwright auth smoke checks into CI and the Offline Kit pipeline. These tests exercise the Angular UI login flow against a stubbed Authority instance to verify that `/config.json` is discovered, DPoP proofs are minted, and error handling is surfaced when the backend rejects a request.

## What the job does

1. Builds the UI bundle (or consumes the artifact from the release pipeline).
2. Copies the environment stub from `src/config/config.sample.json` into the runtime directory as `config.json` so the UI can bootstrap without a live gateway.
3. Runs `npm run test:e2e`, which launches Playwright with the auth fixtures under `tests/e2e/auth.spec.ts`:
   - Validates that the Sign-in button generates an Authorization Code + PKCE redirect to `https://authority.local/connect/authorize`.
   - Confirms the callback view shows an actionable error when the redirect is missing the pending login state.
4. Publishes JUnit + Playwright traces (retain-on-failure) for troubleshooting.

## Pipeline integration notes

- Chromium must already be available (`npx playwright install --with-deps`).
- Set `PLAYWRIGHT_BASE_URL` if the UI serves on a non-default host/port.
- For Offline Kit packaging, bundle the Playwright browser cache under `.cache/ms-playwright/` so the job runs without network access.
- Failures should block release promotion; export the traces to the artifacts tab for debugging. A CI wiring sketch follows this list.

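The snippet below sketches how these steps could be wired together, written in GitHub-Actions-style syntax; adapt it to whichever pipeline runner is actually in use. The working directory, config destination path, dev-server port, and artifact paths are assumptions, not the canonical pipeline definition.

```yaml
# Illustrative CI job for the Playwright auth smoke checks (DEVOPS-UI-13-006)
ui-auth-smoke:
  runs-on: ubuntu-latest
  defaults:
    run:
      working-directory: src/Web/StellaOps.Web      # assumed UI project path
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with:
        node-version: 20
    - run: npm ci
    - run: npx playwright install --with-deps chromium
    # Stub gateway config so the UI bootstraps without a live backend
    - run: cp src/config/config.sample.json src/config/config.json
    - run: npm run test:e2e
      env:
        PLAYWRIGHT_BASE_URL: http://127.0.0.1:4400   # assumed dev-server host/port
    - uses: actions/upload-artifact@v4
      if: failure()
      with:
        name: playwright-traces
        path: test-results/
```
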
Refer to `ops/devops/TASKS.md` (DEVOPS-UI-13-006) for progress and ownership.
@@ -1,205 +0,0 @@

{
  "title": "Zastava Runtime Plane",
  "uid": "zastava-runtime",
  "timezone": "utc",
  "schemaVersion": 38,
  "version": 1,
  "refresh": "30s",
  "time": { "from": "now-6h", "to": "now" },
  "panels": [
    {
      "id": 1,
      "type": "timeseries",
      "title": "Observer Event Rate",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "targets": [
        {
          "expr": "sum by (tenant,component,kind) (rate(zastava_runtime_events_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{tenant}}/{{component}}/{{kind}}"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": { "mode": "absolute", "steps": [ { "color": "green" } ] }
        },
        "overrides": []
      },
      "options": {
        "legend": { "showLegend": true, "placement": "bottom" },
        "tooltip": { "mode": "multi" }
      }
    },
    {
      "id": 2,
      "type": "timeseries",
      "title": "Admission Decisions",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "targets": [
        {
          "expr": "sum by (decision) (rate(zastava_admission_decisions_total{tenant=~\"$tenant\"}[5m]))",
          "legendFormat": "{{decision}}"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "1/s",
          "thresholds": {
            "mode": "absolute",
            "steps": [ { "color": "green" }, { "color": "red", "value": 20 } ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": { "showLegend": true, "placement": "bottom" },
        "tooltip": { "mode": "multi" }
      }
    },
    {
      "id": 3,
      "type": "timeseries",
      "title": "Backend Latency P95",
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket{tenant=~\"$tenant\"}[5m])))",
          "legendFormat": "p95 latency"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "mode": "absolute",
            "steps": [ { "color": "green" }, { "color": "orange", "value": 500 }, { "color": "red", "value": 750 } ]
          }
        },
        "overrides": []
      },
      "options": {
        "legend": { "showLegend": true, "placement": "bottom" },
        "tooltip": { "mode": "multi" }
      }
    }
  ],
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "label": "Prometheus",
        "current": { "text": "Prometheus", "value": "Prometheus" }
      },
      {
        "name": "tenant",
        "type": "query",
        "datasource": { "type": "prometheus", "uid": "${datasource}" },
        "definition": "label_values(zastava_runtime_events_total, tenant)",
        "refresh": 1,
        "hide": 0,
        "current": { "text": ".*", "value": ".*" },
        "regex": "",
        "includeAll": true,
        "multi": true,
        "sort": 1
      }
    ]
  },
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "type": "tags",
        "datasource": { "type": "prometheus", "uid": "${datasource}" },
        "enable": true,
        "iconColor": "rgba(255, 96, 96, 1)"
      }
    ]
  }
}
@@ -1,174 +0,0 @@

# Zastava Runtime Operations Runbook

This runbook covers the runtime plane (Observer DaemonSet + Admission Webhook). It aligns with `Sprint 12 – Runtime Guardrails` and assumes components consume `StellaOps.Zastava.Core` (`AddZastavaRuntimeCore(...)`).

## 1. Prerequisites

- **Authority client credentials** – service principal `zastava-runtime` with scopes `aud:scanner` and `api:scanner.runtime.write`. Provision DPoP keys and mTLS client certs before rollout.
- **Scanner/WebService reachability** – cluster DNS entry (e.g. `scanner.internal`) resolvable from every node running Observer/Webhook.
- **Host mounts** – read-only access to `/proc`, container runtime state (`/var/lib/containerd`, `/var/run/containerd/containerd.sock`), and scratch space (`/var/run/zastava`).
- **Offline kit bundle** – operators staging air-gapped installs must download `offline-kit/zastava-runtime-{version}.tar.zst`, which contains container images, Grafana dashboards, and the Prometheus rules referenced below.
- **Secrets** – the Authority OpTok cache dir, DPoP private keys, and webhook TLS secrets live outside git. For air-gapped installs copy them to the sealed secrets vault.

### 1.1 Telemetry quick reference

| Metric | Description | Notes |
|--------|-------------|-------|
| `zastava.runtime.events.total{tenant,component,kind}` | Rate of observer events sent to Scanner | Expect >0 on busy nodes. |
| `zastava.runtime.backend.latency.ms` | Histogram (ms) for `/runtime/events` and `/policy/runtime` calls | P95 & P99 drive alerting. |
| `zastava.admission.decisions.total{decision}` | Admission verdict counts | Track deny spikes or fail-open fallbacks. |
| `zastava.admission.cache.hits.total` | (future) Cache utilisation once Observer batches land | Placeholder until Observer tasks 12-004 complete. |

## 2. Deployment workflows

### 2.1 Fresh install (Helm overlay)

1. Load the offline kit bundle: `oras cp offline-kit/zastava-runtime-*.tar.zst oci:registry.internal/zastava`.
2. Render values:
   - `zastava.runtime.tenant`, `environment`, `deployment` (cluster identifier).
   - `zastava.runtime.authority` block (issuer, clientId, audience, DPoP toggle).
   - `zastava.runtime.metrics.commonTags.cluster` for Prometheus labels.
3. Pre-create secrets:
   - `zastava-authority-dpop` (JWK + private key).
   - `zastava-authority-mtls` (client cert/key chain).
   - `zastava-webhook-tls` (serving cert; CSR bundle if using auto-approval).
4. Deploy the Observer DaemonSet and Webhook chart:
   ```sh
   helm upgrade --install zastava-runtime deploy/helm/zastava \
     -f values/zastava-runtime.yaml \
     --namespace stellaops \
     --create-namespace
   ```
5. Verify:
   - `kubectl -n stellaops get pods -l app=zastava-observer` reports ready pods.
   - `kubectl -n stellaops logs ds/zastava-observer --tail=20` shows the `Issued runtime OpTok` audit line with the DPoP token type.
   - The admission webhook is registered: `kubectl get validatingwebhookconfiguration zastava-webhook`.

### 2.2 Upgrades

1. Scale the webhook deployment to `--replicas=3` (rolling).
2. Drain one node per AZ to ensure Observer tolerates disruption.
3. Apply the chart upgrade; watch `zastava.runtime.backend.latency.ms` P95 (<250 ms).
4. Post-upgrade, run smoke tests:
   - Apply an unsigned Pod manifest → expect `deny` (policy fail).
   - Apply a signed Pod manifest → expect `allow`.
5. Record the upgrade in the ops log with Git SHA + Helm chart version.

### 2.3 Rollback

1. Use Helm revision history: `helm history zastava-runtime`.
2. Roll back: `helm rollback zastava-runtime <revision>`.
3. Invalidate cached OpToks:
   ```sh
   kubectl -n stellaops exec deploy/zastava-webhook -- \
     zastava-webhook invalidate-op-token --audience scanner
   ```
4. Confirm observers reconnect via metrics (`rate(zastava_runtime_events_total[5m])`).

## 3. Authority & security guardrails

- Tokens must be `DPoP` type when `requireDpop=true`. Logs emit the `authority.token.issue` scope with decision data; its absence indicates a misconfiguration.
- `requireMutualTls=true` enforces mTLS during token acquisition. Disable it only in lab clusters; expect the warning log `Mutual TLS requirement disabled`.
- Static fallback tokens (`allowStaticTokenFallback=true`) should exist only during initial bootstrap. Rotate them nightly; the preference is to disable them once Authority is reachable.
- Audit every change to `zastava.runtime.authority` through change management. Use `kubectl get secret zastava-authority-dpop -o jsonpath='{.metadata.annotations.revision}'` to confirm key rotation.

## 4. Incident response

### 4.1 Authority offline

1. Check the Prometheus alert `ZastavaAuthorityTokenStale`.
2. Inspect Observer logs for the `authority.token.fallback` scope.
3. If fallback engaged, verify the static token validity duration; rotate the secret if it is older than 24 h (a command sketch follows this list).
4. Once Authority is restored, delete the static fallback secret and restart pods to rebind DPoP keys.
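
A hedged triage sketch for steps 2–3; the static-token secret name below is hypothetical, so substitute whatever your overlay actually provisions.

```bash
# Did Observer fall back to the static token recently?
kubectl -n stellaops logs ds/zastava-observer --since=30m \
  | grep -i 'authority.token.fallback' || echo "no fallback events in the last 30m"

# How old is the static fallback secret? Rotate it if older than 24 h.
# (secret name is an assumption)
kubectl -n stellaops get secret zastava-authority-static-token \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
```
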
### 4.2 Scanner/WebService latency spike

1. The alert `ZastavaRuntimeBackendLatencyHigh` fires at P95 > 750 ms for 5 minutes.
2. Run a backend health check: `kubectl -n scanner exec deploy/scanner-web -- curl -f localhost:8080/healthz/ready`.
3. If the backend is degraded, the auto buffer may throttle. Confirm the disk-backed queue size via `kubectl logs ds/zastava-observer | grep buffer.drops`.
4. Consider enabling fail-open for the namespaces listed in runbook Appendix B (temporary).

### 4.3 Admission deny storm

1. The alert `ZastavaAdmissionDenySpike` indicates >20 denies/minute.
2. Pull a sample: `kubectl logs deploy/zastava-webhook --since=10m | jq '.decision'`.
3. Cross-check the policy backlog in Scanner (`/policy/runtime` logs). Engage the application owner; optionally add the namespace to `failOpenNamespaces` after a risk assessment.

## 5. Offline kit & air-gapped notes

- Bundle contents:
  - Observer/Webhook container images (multi-arch).
  - `docs/ops/zastava-runtime-prometheus-rules.yaml` + Grafana dashboard JSON.
  - Sample `zastava-runtime.values.yaml`.
- Verification:
  - Validate the signature: `cosign verify-blob offline-kit/zastava-runtime-*.tar.zst --certificate offline-kit/zastava-runtime.cert`.
  - Extract the Prometheus rules into the offline monitoring cluster (`/etc/prometheus/rules.d`).
  - Import the Grafana dashboard via `grafana-cli --config ...`.

## 6. Observability assets

- Prometheus alert rules: `docs/ops/zastava-runtime-prometheus-rules.yaml`.
- Grafana dashboard JSON: `docs/ops/zastava-runtime-grafana-dashboard.json`.
- Add both to the monitoring repo (`ops/monitoring/zastava`) and reference them in the Offline Kit manifest.

## 7. Build-id correlation & symbol retrieval

Runtime events emitted by Observer now include `process.buildId` (from the ELF `NT_GNU_BUILD_ID` note), and Scanner `/policy/runtime` surfaces the most recent `buildIds` list per digest. Operators can use these hashes to locate debug artifacts during incident response:

1. Capture the hash from the CLI/webhook/Scanner API—for example:
   ```bash
   stellaops-cli runtime policy test --image <digest> --namespace <ns>
   ```
   Copy one of the `Build IDs` (e.g. `5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789`).
2. Derive the debug path (`<aa>/<rest>` under `.build-id`) and check that it exists:
   ```bash
   ls /var/opt/debug/.build-id/5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug
   ```
3. If the file is missing, rehydrate it from Offline Kit bundles or the `debug-store` object bucket (a mirror of release artefacts):
   ```bash
   oras cp oci://registry.internal/debug-store:latest . --include \
     "5f/0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789.debug"
   ```
4. Confirm the running process advertises the same GNU build-id before symbolising:
   ```bash
   readelf -n /proc/$(pgrep -f payments-api | head -n1)/exe | grep -i 'Build ID'
   ```
5. Attach the `.debug` file in `gdb`/`lldb`, feed it to `eu-unstrip`, or cache it in `debuginfod` for fleet-wide symbol resolution:
   ```bash
   debuginfod-find debuginfo 5f0c7c3cb4d9f8a4f1c1d5c6b7e8f90123456789 >/tmp/payments-api.debug
   ```
6. For musl-based images, expect shorter build-id footprints. Missing hashes in runtime events indicate stripped binaries without the GNU note—schedule a rebuild with `-Wl,--build-id` enabled, or add the binary to the debug-store allowlist so the scanner can surface a fallback symbol package.

Monitor `scanner.policy.runtime` responses for the `buildIds` field; absence of data after ZASTAVA-OBS-17-005 implies containers launched before the Observer upgrade or non-ELF entrypoints (static scripts). Re-run the workload or restart Observer to trigger a fresh capture if symbol parity is required.
@@ -1,31 +0,0 @@

groups:
  - name: zastava-runtime
    interval: 30s
    rules:
      - alert: ZastavaRuntimeEventsSilent
        expr: sum(rate(zastava_runtime_events_total[10m])) == 0
        for: 15m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Observer events stalled"
          description: "No runtime events emitted in the last 15 minutes. Check observer DaemonSet health and container runtime mounts."
      - alert: ZastavaRuntimeBackendLatencyHigh
        # Threshold is in milliseconds to match the _ms_ histogram buckets (750 ms per the runbook and dashboard).
        expr: histogram_quantile(0.95, sum by (le) (rate(zastava_runtime_backend_latency_ms_bucket[5m]))) > 750
        for: 10m
        labels:
          severity: critical
          service: zastava-runtime
        annotations:
          summary: "Runtime backend latency p95 above 750 ms"
          description: "Latency to Scanner runtime APIs is elevated. Inspect Scanner.WebService readiness, Authority OpTok issuance, and cluster network."
      - alert: ZastavaAdmissionDenySpike
        expr: sum(rate(zastava_admission_decisions_total{decision="deny"}[5m])) > 20
        for: 5m
        labels:
          severity: warning
          service: zastava-runtime
        annotations:
          summary: "Admission webhook denies exceeding threshold"
          description: "Webhook is denying more than 20 pod admissions per minute. Confirm policy verdicts and consider a fail-open exception for impacted namespaces."