audit work, fixed StellaOps.sln warnings/errors, fixed tests, sprints work, new advisories
42
docs/operations/devops/AGENTS.md
Normal file
@@ -0,0 +1,42 @@
|
||||
# DevOps agent guide
|
||||
|
||||
## Mission
|
||||
The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments.
|
||||
|
||||
## Advisory Handling
|
||||
- Any new/updated advisory triggers immediate doc + sprint updates (no approval).
|
||||
- Update high-level + detailed docs; inline only short snippets; put runnable/long code in `docs/benchmarks/**` or `tests/**` (deterministic/offline) and link.
|
||||
- Add tasks + Execution Log entries in relevant `SPRINT_*.md` with doc paths/owners; add risks if schema/feed/transparency caps apply.
|
||||
- Check archived advisories; mark supersedes/extends if overlapping.
|
||||
- Defaults: hybrid reachability, deterministic/frozen feeds; act first, report after.
|
||||
|
||||
## Key docs
|
||||
- [Module README](./README.md)
|
||||
- [Architecture](./architecture.md)
|
||||
- [Implementation plan](./implementation_plan.md)
|
||||
- [Task board](./TASKS.md)
|
||||
- [Task Runner simulation notes](./task-runner-simulation.md)
|
||||
|
||||
## How to get started
|
||||
1. Open sprint file `/docs/implplan/SPRINT_*.md` and locate the stories referencing this module.
|
||||
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
|
||||
3. Read the architecture and README for domain context before editing code or docs.
|
||||
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
|
||||
|
||||
## Guardrails
|
||||
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
|
||||
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
|
||||
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
|
||||
- Update runbooks/observability assets when operational characteristics change.
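The determinism guardrail above (sorted outputs, UTC ISO-8601 timestamps) can be illustrated with a minimal sketch. The helper names `to_utc_iso` and `canonical_json` are hypothetical and not part of any StellaOps API; the sketch assumes JSON output and Python's standard library only.

```python
import json
from datetime import datetime, timedelta, timezone


def to_utc_iso(ts: datetime) -> str:
    """Render an aware datetime as UTC ISO-8601 with a trailing 'Z'."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamps are ambiguous; attach a timezone first")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def canonical_json(payload: dict) -> str:
    """Serialize with sorted keys and fixed separators so the bytes are stable."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Hypothetical record: a local (+02:00) timestamp and an unsorted component list.
local = datetime(2027, 6, 20, 14, 0, tzinfo=timezone(timedelta(hours=2)))
record = {
    "generatedAt": to_utc_iso(local),
    "components": sorted(["signer", "authority", "attestor"]),
}
print(canonical_json(record))
```

Rejecting naive datetimes is a design choice in this sketch: silently assuming the host time zone would reintroduce a machine-specific artefact.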
|
||||
## Required Reading
|
||||
- `docs/modules/devops/README.md`
|
||||
- `docs/modules/devops/architecture.md`
|
||||
- `docs/modules/devops/implementation_plan.md`
|
||||
- `docs/modules/platform/architecture-overview.md`
|
||||
|
||||
## Working Agreement
|
||||
- 1. Update task status to `DOING`/`DONE` in both the corresponding sprint file `/docs/implplan/SPRINT_*.md` and the local `TASKS.md` when you start or finish work.
|
||||
- 2. Review this charter and the Required Reading documents before coding; confirm prerequisites are met.
|
||||
- 3. Keep changes deterministic (stable ordering, timestamps, hashes) and align with offline/air-gap expectations.
|
||||
- 4. Coordinate doc updates, tests, and cross-guild communication whenever contracts or workflows change.
|
||||
- 5. Revert to `TODO` if you pause the task without shipping changes; leave notes in commit/PR descriptions for context.
|
||||
64
docs/operations/devops/README.md
Normal file
@@ -0,0 +1,64 @@
|
||||
# StellaOps DevOps
|
||||
|
||||
The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments.
|
||||
|
||||
## Responsibilities
|
||||
- Maintain CI pipelines, signing workflows, and release packaging steps.
|
||||
- Operate shared runbooks for launch readiness, upgrades, and NuGet previews.
|
||||
- Provide offline kit assembly instructions and tooling integration.
|
||||
- Wrap observability/telemetry bootstrap flows for platform teams.
|
||||
|
||||
## Key components
|
||||
- Runbooks under ./runbooks/ (launch, deployment, nuget).
|
||||
- Migration guidance under ./migrations/.
|
||||
- Architecture overview bridging CI/CD & infrastructure concerns.
|
||||
|
||||
## Integrations & dependencies
|
||||
- Ops pipelines (Gitea, GitHub Actions) and artifact registries.
|
||||
- Authority/Signer for supply chain signing.
|
||||
- Telemetry stack bootstrap scripts.
|
||||
|
||||
## Operational notes
|
||||
- Offline bundle packaging guidance in docs/modules/export-center/operations/runbook.md.
|
||||
- Dashboards for launch cutover rehearsals.
|
||||
- Coordination with Security for enforced guardrails.
|
||||
|
||||
## Related resources
|
||||
- ./runbooks/launch-readiness.md
|
||||
- ./runbooks/launch-cutover.md
|
||||
- ./runbooks/deployment-upgrade.md
|
||||
- ./runbooks/nuget-preview-bootstrap.md
|
||||
- ./migrations/semver-style.md
|
||||
- ./task-runner-simulation.md
|
||||
|
||||
## Backlog references
|
||||
- DEVOPS-LAUNCH-18-001 / 18-900 runbooks in ../../TASKS.md.
|
||||
- Telemetry bootstrap automation tracked in `ops/devops/TASKS.md`.
|
||||
|
||||
## Epic alignment
|
||||
- **Epic 1 – AOC enforcement:** bake AOC verifier steps, CI guards, and schema validation into pipelines.
|
||||
- **Epic 9 – Orchestrator Dashboard:** support operational dashboards, job recovery runbooks, and rate-limit governance.
|
||||
- **Epic 10 – Export Center:** manage signing workflows, Offline Kit packaging, and release promotion for exports.
|
||||
- **Epic 15 – Observability & Forensics:** coordinate telemetry deployment, evidence retention, and forensic automation.
|
||||
|
||||
## Implementation Status
|
||||
|
||||
### Objectives
|
||||
- Maintain deterministic behaviour and offline parity across releases
|
||||
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes
|
||||
|
||||
### Key Milestones
|
||||
- **Epic 1 – AOC enforcement:** ensure CI/CD guardrails, schema validation, and verifier pipelines are enforced
|
||||
- **Epic 9 – Orchestrator Dashboard:** deliver dashboards, recovery runbooks, and rate-limit governance
|
||||
- **Epic 10 – Export Center:** manage signing/promotions and Offline Kit bundle publishing
|
||||
- **Epic 15 – Observability & Forensics:** coordinate telemetry deployments, evidence retention, and forensic automation
|
||||
|
||||
### Workstreams
|
||||
- Backlog grooming: reconcile open stories with module roadmap
|
||||
- Implementation: collaborate with service owners to land feature work
|
||||
- Validation: extend tests/fixtures to preserve determinism and provenance requirements
|
||||
|
||||
### Coordination
|
||||
- Review ./AGENTS.md before picking up new work
|
||||
- Sync with cross-cutting teams noted in sprint files
|
||||
- Update plan whenever scope, dependencies, or guardrails change
|
||||
489
docs/operations/devops/architecture.md
Normal file
@@ -0,0 +1,489 @@
|
||||
# component_architecture_devops.md — **Stella Ops Release & Operations** (2025Q4)
|
||||
|
||||
> Draws from the AOC guardrails, Orchestrator, Export Center, and Observability module plans to describe how Stella Ops is built, signed, distributed, and operated.
|
||||
|
||||
> **Scope.** Implementation‑ready blueprint for **how Stella Ops is built, versioned, signed, distributed, upgraded, licensed (PoE)**, and operated in customer environments (online and air‑gapped). Covers reproducible builds, supply‑chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (RustFS default + PostgreSQL, S3 fallback), monitoring SLOs, and customer activation.
|
||||
|
||||
---
|
||||
|
||||
## 0) Product vision (operations lens)
|
||||
|
||||
Stella Ops must be **trustable at a glance** and **boringly operable**:
|
||||
|
||||
* Every release ships with **first‑party SBOMs, provenance, and signatures**; services verify **each other’s** integrity at runtime.
|
||||
* Customers can deploy by **digest** and stay aligned with **LTS/stable/edge** channels.
|
||||
* Paid customers receive **attestation authority** (Signer accepts their PoE) while the core platform remains **free to run**.
|
||||
* Air‑gapped customers receive **offline kits** with verifiable digests and deterministic import.
|
||||
* Artifacts expire predictably; operators know what’s kept, for how long, and why.
|
||||
|
||||
---
|
||||
|
||||
## 1) Release trains & versioning
|
||||
|
||||
### 1.1 Channels
|
||||
|
||||
* **LTS** (12‑month support window): quarterly cadence (Q1/Q2/Q3/Q4).
|
||||
* **Stable** (default): monthly rollup (bug fixes + compatible features).
|
||||
* **Edge**: weekly; for early adopters, no guarantees.
|
||||
|
||||
### 1.2 Version strings
|
||||
|
||||
Semantic core + calendar tag:
|
||||
|
||||
```
<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)   e.g., 2.4.1 (2027.06)
```
|
||||
|
||||
* **MAJOR**: breaking API/DB changes (rare).
|
||||
* **MINOR**: new features, compatible schema migrations (expand/contract pattern).
|
||||
* **PATCH**: bug fixes, perf and security updates.
|
||||
* **Calendar tag** exposes **release year** used by Signer for **PoE window checks**.
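As a rough illustration of the version string format above, the following sketch parses `<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)` into its parts. The `ReleaseVersion` type and `parse_release_version` function are hypothetical names introduced here, and the exact whitespace between the semantic core and the calendar tag is an assumption (a single space).

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    year: int
    month: int


# Assumed layout: semantic core, one space, calendar tag in parentheses.
_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+) \((\d{4})\.(\d{2})\)$")


def parse_release_version(text: str) -> ReleaseVersion:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a '<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)' string: {text!r}")
    version = ReleaseVersion(*(int(group) for group in match.groups()))
    if not 1 <= version.month <= 12:
        raise ValueError(f"calendar month out of range: {version.month}")
    return version


parsed = parse_release_version("2.4.1 (2027.06)")
print(parsed.minor, parsed.year)
```

The `year` field is the part a PoE window check would consume, per the calendar-tag bullet above.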
|
||||
|
||||
### 1.3 Component alignment
|
||||
|
||||
A release is a **bundle** of image digests + charts + manifests. All services in a bundle are **wire‑compatible**. Mixed minor versions are allowed within a bounded skew:
|
||||
|
||||
* **Web UI ↔ backend**: `±1 minor`.
|
||||
* **Scanner ↔ Policy/Excititor/Concelier**: `±1 minor`.
|
||||
* **Authority/Signer/Attestor triangle**: **must** be same minor (crypto and DPoP/mTLS binding rules).
|
||||
|
||||
At startup, services **self‑advertise** their semver & channel; the UI surfaces **mismatch warnings**.
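The skew rules above could be checked along the following lines. This is a sketch under stated assumptions: service keys such as `"scanner"` are hypothetical labels, the same major version is assumed to be required, and the Web UI ↔ backend rule is omitted because the source does not enumerate which services count as "backend".

```python
# (left, right, allowed minor skew), taken from the bullets above.
PAIR_RULES = [
    ("authority", "signer", 0),
    ("authority", "attestor", 0),
    ("signer", "attestor", 0),
    ("scanner", "policy", 1),
    ("scanner", "excititor", 1),
    ("scanner", "concelier", 1),
]


def major_minor(version: str) -> tuple:
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def within_skew(a: str, b: str, allowed: int) -> bool:
    (a_major, a_minor), (b_major, b_minor) = major_minor(a), major_minor(b)
    # Assumption: skew is only meaningful within one major version.
    return a_major == b_major and abs(a_minor - b_minor) <= allowed


def skew_warnings(versions: dict) -> list:
    """Return one human-readable warning per violated pair rule."""
    warnings = []
    for left, right, allowed in PAIR_RULES:
        if left in versions and right in versions:
            if not within_skew(versions[left], versions[right], allowed):
                warnings.append(
                    f"{left} {versions[left]} vs {right} {versions[right]}: "
                    f"allowed minor skew is {allowed}"
                )
    return warnings


advertised = {
    "authority": "2.4.1", "signer": "2.5.0", "attestor": "2.4.0",
    "scanner": "2.4.0", "concelier": "2.5.3",
}
for line in skew_warnings(advertised):
    print(line)
```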
|
||||
|
||||
---
|
||||
|
||||
## 2) Supply‑chain pipeline (how a release is built)
|
||||
|
||||
### 2.1 Deterministic builds
|
||||
|
||||
* **Builders**: isolated **BuildKit** workers with pinned base images (digest only).
|
||||
* **Pinning**: lock files and version manifests (`go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props`) are **frozen** at tag.
|
||||
* **Reproducibility**: timestamps normalized; source date epoch; deterministic zips/tars.
|
||||
* **Multi‑arch**: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).
|
||||
|
||||
### 2.2 First‑party SBOMs & provenance
|
||||
|
||||
* Each image gets **CycloneDX (JSON+Protobuf) SBOM** and **SLSA‑style provenance** attached as **OCI referrers**.
|
||||
* Scanner’s **Buildx generator** is used to produce SBOMs *during* build; a separate post‑build scan verifies parity (red flag if drift).
|
||||
* **Release manifest** (see §6.1) lists all digests and SBOM/attestation refs.
|
||||
|
||||
### 2.3 Signing & transparency
|
||||
|
||||
* Images are **cosign‑signed** (keyless) with a Stella Ops release identity; inclusion in a **transparency log** (Rekor) is required.
|
||||
* SBOM and provenance attestations are **DSSE** and also transparency‑logged.
|
||||
* Release keys (Fulcio roots or public keys) are embedded in **Signer** policy (for **scanner‑release validation** at customer side).
|
||||
|
||||
### 2.4 Gates & tests
|
||||
|
||||
* **Static**: linters, codegen checks, protobuf API freeze (backward‑compat tests).
|
||||
* **Unit/integration**: per-component, plus **end-to-end** flows (scan→vex→policy→sign→attest).
|
||||
* **Perf SLOs**: hot paths (SBOM compose, diff, export) measured against budgets.
|
||||
* **Security**: dependency audit vs Concelier export; container hardening tests; minimal caps.
|
||||
* **Deployment assets**: `Build Test Deploy` workflow’s `profile-validation` job installs Helm and runs `helm lint` + `helm template` against `devops/helm/stellaops` for every `values*.yaml`, catching ConfigMap/templating drift before merges.
|
||||
* **Analyzer smoke**: restart-time language plug-ins (currently Python) verified via `dotnet run --project src/Tools/LanguageAnalyzerSmoke` to ensure manifest integrity plus cold vs warm determinism (< 30 s / < 5 s budgets); the harness logs deviations from repository goldens for follow-up.
|
||||
* **Canary cohort**: internal staging + selected customers; one week on **edge** before **stable** tag.
|
||||
|
||||
### 2.5 Debug-store artefacts
|
||||
|
||||
* Every release exports stripped debug information for ELF binaries discovered in service images. Debug files follow the GNU build-id layout (`debug/.build-id/<aa>/<rest>.debug`) and are generated via `objcopy --only-keep-debug`.
|
||||
* `debug/debug-manifest.json` captures build-id → component/image/source mappings with SHA-256 checksums so operators can mirror the directory into debuginfod or offline symbol stores. The manifest (and its `.sha256` companion) ships with every release bundle and Offline Kit.
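To make the GNU build-id layout concrete, here is a small sketch that maps a build-id to its path under the debug store. The function name `debug_path` is hypothetical; the split (first two hex characters as the directory, the remainder as the file stem) follows the `debug/.build-id/<aa>/<rest>.debug` pattern quoted above.

```python
def debug_path(build_id: str) -> str:
    """Map a hex build-id to debug/.build-id/<aa>/<rest>.debug."""
    bid = build_id.strip().lower()
    if len(bid) < 3 or any(ch not in "0123456789abcdef" for ch in bid):
        raise ValueError(f"not a usable hex build-id: {build_id!r}")
    return f"debug/.build-id/{bid[:2]}/{bid[2:]}.debug"


print(debug_path("ABCDEF0123"))
```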
|
||||
|
||||
---
|
||||
|
||||
## 3) Distribution & activation
|
||||
|
||||
### 3.1 Registries
|
||||
|
||||
* **Primary**: `registry.stella-ops.org` (OCI v2, supports Referrers API).
|
||||
* **Mirrors**: GHCR (read‑only), regional mirrors for latency.
|
||||
* Operational runbook: see `docs/modules/concelier/operations/mirror.md` for deployment profiles, CDN guidance, and sync automation.
|
||||
* **Pull by digest only** in Kubernetes/Compose manifests.
|
||||
|
||||
**Gating policy**:
|
||||
|
||||
* **Core images** (Authority, Scanner, Concelier, Excititor, Attestor, UI): public **read**.
|
||||
* **Enterprise add‑ons** (if any) and **pre‑release**: private repos via the **Registry Token Service** (`src/Registry/StellaOps.Registry.TokenService`) which exchanges Authority-issued OpToks for short-lived Docker registry bearer tokens.
|
||||
|
||||
> Monetization lever is **signing** (PoE gate), not image pulls, so the core remains simple to consume.
|
||||
|
||||
### 3.2 OAuth2 token service (for private repos)
|
||||
|
||||
* Docker Registry’s token flow backed by **Authority**:
|
||||
|
||||
1. Client hits registry (`401` with `WWW-Authenticate: Bearer realm=…`).
|
||||
2. Client gets an **access token** from the token service (validated by Authority) with `scope=repository:…:pull`.
|
||||
3. Registry allows pull for the requested repo.
|
||||
* Tokens are **short‑lived** (60–300 s) and **DPoP‑bound**.
|
||||
|
||||
The token service enforces plan gating via `registry-token.yaml` (see `docs/modules/registry/operations/token-service.md`) and exposes Prometheus metrics (`registry_token_issued_total`, `registry_token_rejected_total`). Revoked licence identifiers halt issuance even when scope requirements are met.
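Steps 1 and 2 of the token flow can be sketched from the client side as follows. This is illustrative only: the host `registry.example.test` and the repository name are placeholders, the helper names are hypothetical, and DPoP binding and the Authority validation step are deliberately left out.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict:
    """Parse the key="value" pairs of a 'WWW-Authenticate: Bearer ...' value."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError("expected a Bearer challenge")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_request_url(challenge: dict) -> str:
    """Build the token endpoint URL carrying the service and scope parameters."""
    query = {key: challenge[key] for key in ("service", "scope") if key in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"


# Placeholder challenge as a registry might return it with a 401.
header = (
    'Bearer realm="https://registry.example.test/token",'
    'service="registry.example.test",'
    'scope="repository:stellaops/addon:pull"'
)
challenge = parse_bearer_challenge(header)
print(token_request_url(challenge))
```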
|
||||
|
||||
### 3.3 Offline kits (air‑gapped)
|
||||
|
||||
* Tarball per release channel:
|
||||
|
||||
```
stellaops-kit-<ver>-<channel>.tar.zst
  /images/     OCI layout with all first-party images (multi-arch)
  /sboms/      CycloneDX JSON+PB for each image
  /attest/     DSSE bundles + Rekor proofs
  /charts/     Helm charts + values templates
  /compose/    docker-compose.yml + .env template
  /plugins/    Concelier/Excititor connectors (restart-time)
  /policy/     example policies
  /manifest/   release.yaml (see §6.1)
```
|
||||
* Import via CLI `offline kit import`; checks digests and signatures before load.
|
||||
|
||||
---
|
||||
|
||||
## 4) Licensing (PoE) & monetization
|
||||
|
||||
**Principle**: **Only paid Stella Ops issues valid signed attestations.** Running the stack is free; signing requires PoE.
|
||||
|
||||
### 4.1 PoE issuance
|
||||
|
||||
* Customers purchase a plan and obtain a **PoE artifact** from `www.stella-ops.org`:
|
||||
|
||||
* **PoE‑JWT** (DPoP/mTLS‑bound) **or** **PoE mTLS client certificate**.
|
||||
* Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant/customer` IDs.
|
||||
|
||||
### 4.2 Online enforcement
|
||||
|
||||
* **Signer** calls **Licensing /license/introspect** on every signing request (see signer doc).
|
||||
* If **revoked/expired/out‑of‑window** → deny with machine‑readable reason.
|
||||
* All **valid** bundles are DSSE‑signed and **Attestor** logs them; Rekor UUID returned.
|
||||
* UI badges: “**Verified by Stella Ops**” with link to the public log.
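The revoked/expired/out-of-window decision might look roughly like the sketch below. Everything here is an assumption for illustration: the `PoeClaims` type, the reason strings, and especially the window rule (a release year later than `valid_release_year` is treated as out of window). The actual Signer semantics are defined in the signer doc, not here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PoeClaims:
    license_id: str
    valid_release_year: int
    exp: int               # expiry as Unix seconds
    revoked: bool = False  # assumed to come from the introspection response


def deny_reason(claims: PoeClaims, release_calendar: str, now: int) -> Optional[str]:
    """Return a machine-readable denial reason, or None when signing may proceed."""
    if claims.revoked:
        return "revoked"
    if now >= claims.exp:
        return "expired"
    release_year = int(release_calendar.split(".")[0])
    # Assumed rule: the PoE covers releases up to and including valid_release_year.
    if release_year > claims.valid_release_year:
        return "out_of_window"
    return None


claims = PoeClaims(license_id="lic-example", valid_release_year=2027, exp=1_900_000_000)
print(deny_reason(claims, "2027.06", now=1_800_000_000))
print(deny_reason(claims, "2028.01", now=1_800_000_000))
```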
|
||||
|
||||
### 4.3 Air‑gapped / offline
|
||||
|
||||
* Customers obtain a **time‑boxed PoE lease** (signed JSON, 7–30 days).
|
||||
* Signer accepts the lease and emits **provisional** attestations (clearly labeled).
|
||||
* When connectivity returns, a background job **endorses** the provisional entries with the cloud service, updating their status to **verified**.
|
||||
* Operators can export a **verification bundle** for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).
|
||||
|
||||
### 4.4 Stolen/abused PoE
|
||||
|
||||
* Customers report theft; **Licensing** flags `license_id` as **revoked**.
|
||||
* Subsequent Signer requests **deny**; previous attestations remain but can be marked **contested** (UI shows badge, optional re‑sign path upon new PoE).
|
||||
|
||||
---
|
||||
|
||||
## 5) Deployment path (customer side)
|
||||
|
||||
### 5.1 First install
|
||||
|
||||
* **Helm** (Kubernetes) or **Compose** (VMs). Example (K8s):
|
||||
|
||||
```bash
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
  --version 2.4.0 \
  --set global.channel=stable \
  --set authority.issuer=https://authority.stella.local \
  --set scanner.rustfs.endpoint=http://rustfs.stella.local:8080 \
  --set global.postgres.connectionString="Host=postgres.stella.local;Database=stellaops_platform;Username=stellaops;Password=<secret>"
```
|
||||
|
||||
* Post‑install job registers **Authority clients** (Scanner, Signer, Attestor, UI) and prints **bootstrap** URLs and client credentials (sealed secrets).
|
||||
* UI banner shows **release bundle** and verification state (cosign OK? Rekor OK?).
|
||||
|
||||
### 5.2 Updates
|
||||
|
||||
* **Blue/green**: pull new bundle by **digest**; deploy side‑by‑side; cut traffic.
|
||||
|
||||
* **Rolling**: upgrade stateful components in safe order:
|
||||
|
||||
1. Authority (stateless, dual‑key rotation ready)
|
||||
2. Signer/Attestor (same minor)
|
||||
3. Scanner WebService & Workers
|
||||
4. Concelier, then Excititor (schema migrations are expand/contract)
|
||||
5. UI last
|
||||
|
||||
* **DB migrations** are **expand/contract**:
|
||||
|
||||
* Phase A (release N): **add** new fields/indexes, write old+new.
|
||||
* Phase B (N+1): **read** new fields; **drop** old.
|
||||
* Rollback is a matter of redeploying previous images and keeping both schemas valid.
|
||||
|
||||
### 5.3 Rollback
|
||||
|
||||
* Images are referenced by **digest**; keep the release manifests for the previous `K` versions.
|
||||
* `helm rollback` or compose `docker compose -f release-K.yml up -d`.
|
||||
* PostgreSQL migrations are additive; **no destructive changes** within a single minor.
|
||||
|
||||
---
|
||||
|
||||
## 6) Release payloads & manifests
|
||||
|
||||
### 6.1 Release manifest (`release.yaml`)
|
||||
|
||||
```yaml
release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
  components:
    - name: scanner-webservice
      image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
      sbom: oci://.../referrers/cdx-json@sha256:11..22
      provenance: oci://.../attest/provenance@sha256:33..44
      signature: { rekorUUID: "…" }
    - name: signer
      image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
      signature: { rekorUUID: "…" }
  charts:
    - name: platform
      version: "2.4.1"
      digest: "sha256:ee..ff"
  compose:
    file: "docker-compose.yml"
    digest: "sha256:77..88"
  checksums:
    sha256: "… digest of this release.yaml …"
```
|
||||
|
||||
The manifest is **cosign‑signed**; UI/CLI can verify a bundle without talking to registries.
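The digest side of offline verification can be sketched as below: compare a file's SHA-256 against a `sha256:<hex>` string such as the `digest` fields in the manifest. The function name `matches_digest` is hypothetical, and this covers only checksum comparison; cosign signature verification of the manifest itself is a separate step not shown here.

```python
import hashlib
import hmac


def matches_digest(content: bytes, expected: str) -> bool:
    """Check content against an 'sha256:<hex>' digest string."""
    algorithm, _, hex_digest = expected.partition(":")
    if algorithm != "sha256" or not hex_digest:
        raise ValueError(f"unsupported digest format: {expected!r}")
    actual = hashlib.sha256(content).hexdigest()
    # Constant-time comparison; not strictly required for public digests.
    return hmac.compare_digest(actual, hex_digest.lower())


# SHA-256 of the empty byte string, used here as a well-known test vector.
EMPTY = "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(matches_digest(b"", EMPTY))
```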
|
||||
|
||||
> Deployment guardrails – The repository keeps channel-aligned Compose bundles
|
||||
> in `devops/compose/` and Helm overlays in `devops/helm/stellaops/`. Both sets
|
||||
> pull their digests from `deploy/releases/` and are validated by
|
||||
> `deploy/tools/validate-profiles.sh` to guarantee lint/dry-run cleanliness.
|
||||
|
||||
### 6.2 Image labels (release metadata)
|
||||
|
||||
Each image sets OCI labels:
|
||||
|
||||
```
org.opencontainers.image.version   = "2.4.1"
org.opencontainers.image.revision  = "<git sha>"
org.opencontainers.image.created   = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar     = "2027.06"
org.stellaops.release.channel      = "stable"
org.stellaops.build.slsaProvenance = "oci://…"
```
|
||||
|
||||
Signer validates **scanner** image’s cosign identity + calendar tag for **release window** checks.
|
||||
|
||||
---
|
||||
|
||||
## 7) Artifact lifecycle & storage (RustFS/PostgreSQL)
|
||||
|
||||
### 7.1 Buckets & prefixes (RustFS)
|
||||
|
||||
```
rustfs://stellaops/
  scanner/
    layers/<sha256>/sbom.cdx.json.zst
    images/<imgDigest>/inventory.cdx.pb
    images/<imgDigest>/usage.cdx.pb
    diffs/<old>_<new>/diff.json.zst
    attest/<artifactSha256>.dsse.json
  concelier/
    json/<exportId>/...
    trivy/<exportId>/...
  excititor/
    exports/<exportId>/...
  attestor/
    dsse/<bundleSha256>.json
    proof/<rekorUuid>.json
```
|
||||
|
||||
### 7.2 ILM classes
|
||||
|
||||
* **`short`**: working artifacts (diffs, queues) — TTL 7–14 days.
|
||||
* **`default`**: SBOMs & indexes — TTL 90–180 days (configurable).
|
||||
* **`compliance`**: signed reports & attested exports — retention enforced via RustFS hold or S3 Object Lock (governance/compliance) 1–7 years.
|
||||
|
||||
### 7.3 Artifact Lifecycle Controller (ALC)
|
||||
|
||||
* A background worker (part of Scanner.WebService) enforces **TTL** and **reference counting**:
|
||||
|
||||
* Artifacts referenced by **reports** or **tickets** are pinned.
|
||||
* ILM actions logged; UI shows per‑class usage & upcoming purges.
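A minimal sketch of the TTL plus reference-counting rule follows. It is not the ALC implementation: the record shape (`key`, `ilm_class`, `created`, `ref_count`) is hypothetical, and the TTL values simply take the upper bounds of the ranges in §7.2. The `compliance` class is skipped entirely because its retention is enforced by object hold rather than by a purge worker.

```python
from datetime import datetime, timedelta, timezone

# Assumed TTLs: upper bounds of the ranges given for each ILM class.
TTL_BY_CLASS = {"short": timedelta(days=14), "default": timedelta(days=180)}


def purge_candidates(artifacts: list, now: datetime) -> list:
    """Return the sorted keys of artifacts that are expired and unreferenced."""
    doomed = []
    for artifact in artifacts:
        ttl = TTL_BY_CLASS.get(artifact["ilm_class"])
        if ttl is None:
            continue  # compliance (or unknown) class: never purged here
        if artifact["ref_count"] > 0:
            continue  # pinned by a report or ticket
        if now - artifact["created"] >= ttl:
            doomed.append(artifact["key"])
    return sorted(doomed)


now = datetime(2027, 6, 20, tzinfo=timezone.utc)
inventory = [
    {"key": "scanner/diffs/a_b/diff.json.zst", "ilm_class": "short",
     "created": now - timedelta(days=20), "ref_count": 0},
    {"key": "scanner/diffs/c_d/diff.json.zst", "ilm_class": "short",
     "created": now - timedelta(days=20), "ref_count": 2},
    {"key": "attestor/dsse/x.json", "ilm_class": "compliance",
     "created": now - timedelta(days=900), "ref_count": 0},
]
print(purge_candidates(inventory, now))
```

Sorting the result keeps the purge log deterministic, in line with the guardrails elsewhere in these docs.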
|
||||
|
||||
> **Migration note.** Follow `docs/modules/scanner/operations/rustfs-migration.md` when transitioning existing
|
||||
> MinIO buckets to RustFS. The provided migrator is idempotent and safe to rerun per prefix.
|
||||
|
||||
### 7.4 PostgreSQL retention
|
||||
|
||||
* **Scanner**: `runtime.events` use TTL (e.g., 30–90 days); **catalog** permanent.
|
||||
* **Concelier/Excititor**: raw docs keep **last N windows**; canonical stores permanent.
|
||||
* **Attestor**: `entries` permanent; `dedupe` TTL 24–48h.
|
||||
|
||||
### 7.5 PostgreSQL server baseline
|
||||
|
||||
* **Minimum supported server:** PostgreSQL **16+**. Earlier versions lack required features (e.g., enhanced JSON functions, performance improvements).
|
||||
* **Deploy images:** Compose/Helm defaults stay on `postgres:16`. For air-gapped installs, refresh Offline Kit bundles so the packaged PostgreSQL image matches ≥16.
|
||||
* **Upgrade guard:** During rollout, verify PostgreSQL major version ≥16 before applying schema migrations; automation should hard-stop if version check fails.
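The upgrade guard can be sketched as a check on PostgreSQL's `server_version_num` setting, which for version 10 and later encodes the version as `major * 10000 + minor`. How the value is fetched (for example via `SHOW server_version_num`) is left out, and the function names are hypothetical.

```python
MIN_MAJOR = 16  # minimum supported server per the baseline above


def major_from_version_num(server_version_num: int) -> int:
    """Extract the major version from server_version_num (PostgreSQL 10+)."""
    return server_version_num // 10000


def assert_supported(server_version_num: int) -> None:
    """Hard-stop before migrations when the server is older than MIN_MAJOR."""
    major = major_from_version_num(server_version_num)
    if major < MIN_MAJOR:
        raise RuntimeError(
            f"PostgreSQL {major} detected; {MIN_MAJOR}+ is required before applying migrations"
        )


assert_supported(160004)  # a 16.4 server passes silently
```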
|
||||
|
||||
---
|
||||
|
||||
## 8) Observability & SLOs (operations)
|
||||
|
||||
* **Uptime SLO**: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Concelier 99.0%.
|
||||
* **Error budgets**: tracked per month; dashboards show burn rates.
|
||||
* **Golden signals**:
|
||||
|
||||
* **Latency**: token issuance, sign→attest round‑trip, scan enqueue→emit, export build.
|
||||
* **Saturation**: queue depth, PostgreSQL write IOPS, RustFS throughput / queue depth (or S3 metrics when in fallback mode).
|
||||
* **Traffic**: scans/min, attestations/min, webhook admits/min.
|
||||
* **Errors**: 5xx rates, cosign verification failures, Rekor timeouts.
|
||||
|
||||
Prometheus + OTLP; Grafana dashboards ship in the charts.
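For orientation, the uptime SLOs above translate into monthly error budgets as in this sketch. The 30-day window is an assumption (the source only says budgets are tracked per month), and the burn-rate definition used here is the common "observed error ratio divided by allowed error ratio" form rather than anything specified by the source.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """A value of 1.0 means the budget is consumed exactly at the end of the window."""
    return observed_error_ratio / (1.0 - slo)


for name, slo in [("signer", 0.999), ("scanner-web", 0.995), ("concelier", 0.990)]:
    print(f"{name}: {error_budget_minutes(slo):.1f} min per 30 days")
```

Under these assumptions, 99.9% leaves about 43 minutes per 30 days, 99.5% about 216 minutes, and 99.0% about 432 minutes.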
|
||||
|
||||
---
|
||||
|
||||
## 9) Security & compliance operations
|
||||
|
||||
* **Key rotation**:
|
||||
|
||||
* Authority JWKS: 60‑day cadence, dual‑key overlap.
|
||||
* Release signing identities: rotate per minor or quarterly.
|
||||
* Sigstore roots mirrored and pinned; alarms on drift.
|
||||
|
||||
* **FIPS mode** (Gov build):
|
||||
|
||||
* Enforce `ES256` + KMS/HSM; disable Ed25519; MLS ciphers only.
|
||||
* Local **Rekor v2** and **Fulcio** alternatives; **air‑gapped** CA.
|
||||
|
||||
* **Vulnerability response**:
|
||||
|
||||
* Concelier red-flag advisories trigger accelerated **stable** patch rollout; UI/CLI “security patch available” notice.
|
||||
* 2025-10: Pinned `SharpCompress` **0.41.0** across services (DEVOPS-SEC-10-301) to eliminate NU1903 warnings; future bumps follow the central override pattern. MongoDB dependencies were removed in Sprint 4400 (all persistence now uses PostgreSQL).
|
||||
|
||||
* **Backups/DR**:
|
||||
|
||||
* PostgreSQL nightly snapshots; MinIO versioning + replication (if configured).
|
||||
* Restore runbooks tested quarterly with synthetic data.
|
||||
|
||||
---
|
||||
|
||||
## 10) Customer update flow (how versions are fetched & activated)
|
||||
|
||||
### 10.1 Online clusters
|
||||
|
||||
* **UI** surfaces update banner with **release manifest** diff and risk notes.
|
||||
* Operator approves → **Controller** pulls new images by digest; health‑checks; moves traffic; deprecates old revision.
|
||||
* Post‑switch, **schema Phase B** migrations (if any) run automatically.
|
||||
|
||||
### 10.2 Air‑gapped clusters
|
||||
|
||||
* Operator downloads **offline kit** from a mirror → `stellaops offline kit import`.
|
||||
* Controller validates bundle checksums and **cosign signatures**; applies charts/compose by digest.
|
||||
* After install, **verify** page shows green checks: image sigs, SBOMs attached, provenance logged.
|
||||
|
||||
### 10.3 CLI self‑update (optional)
|
||||
|
||||
* `stellaops self-update` pulls a **signed release manifest** and verifies the **CLI binary** with cosign before swapping (admin can disable).
|
||||
|
||||
---
|
||||
|
||||
## 11) Compatibility & deprecation policy
|
||||
|
||||
* **APIs** are stable within a **major**; breaking changes imply **MAJOR++** and deprecation period of one minor.
|
||||
* **Storage**: expand/contract; “drop old fields” only after one minor grace.
|
||||
* **Config**: feature flags (default off) for risky features (e.g., eBPF).
|
||||
|
||||
---
|
||||
|
||||
## 12) Runbooks (selected)
|
||||
|
||||
### 12.1 Lost PoE
|
||||
|
||||
1. Suspend **automatic attestation** jobs.
|
||||
2. Use CLI `stellaops signer status` to confirm `entitlement_denied`.
|
||||
3. Obtain new PoE from portal; verify on Signer `/poe/verify`.
|
||||
4. Re‑enable; optionally **re‑sign** last N reports (UI button → batch).
|
||||
|
||||
### 12.2 Rekor outage (self‑hosted)
|
||||
|
||||
* Attestor returns `202 (pending)` with queued proof fetch.
|
||||
* Keep DSSE bundles locally; re‑submit on schedule; UI badge shows **Pending**.
|
||||
* If outage > SLA, you can switch to a **mirror** log in config; Attestor writes to both when restored.
|
||||
|
||||
### 12.3 Emergency downgrade
|
||||
|
||||
* Identify prior release manifest (UI → Admin → Releases).
|
||||
* `helm rollback stella <revision>` (or compose apply previous file).
|
||||
* Services tolerate skew per §1.3; ensure **Signer/Authority/Attestor** are rolled together.
|
||||
|
||||
---
|
||||
|
||||
## 13) Example: cluster bootstrap (Compose)
|
||||
|
||||
```yaml
version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__ARTIFACTSTORE__ENDPOINT=http://rustfs:8080
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  postgres:
    image: postgres:16
  valkey:
    image: valkey/valkey:8.0
  rustfs:
    image: registry.stella-ops.org/stellaops/rustfs:2025.10.0-edge
```
|
||||
|
||||
---
|
||||
|
||||
## 14) Governance & keys (who owns the trust root)
|
||||
|
||||
* **Release key policy**: only the Release Engineering group can push signed releases; 4‑eyes approval; TUF‑style manifest possible in future.
|
||||
* **Signer acceptance policy**: embedded release identities are updated **only** via minor upgrade; emergency CRL supported.
|
||||
* **Customer keys**: none needed for core use; enterprise add‑ons may require per‑customer registries and keys.
|
||||
|
||||
---
|
||||
|
||||
## 15) Roadmap (Ops)
|
||||
|
||||
* **Windows containers GA** (Scanner + Zastava).
|
||||
* **Key Transparency** for Signer certs.
|
||||
* **Delta‑kit** (offline) for incremental updates.
|
||||
* **Operator CRDs** (K8s) to manage policy and ILM declaratively.
|
||||
* **SBOM protobuf** as default transport at rest (smaller, faster).
|
||||
|
||||
---
|
||||
|
||||
### Appendix A — Minimal SLO monitors
|
||||
|
||||
* `authority.tokens_issued_total` slope ≈ normal.
|
||||
* `signer.requests_total{result="success"}/minute` > 0 (when scans occur).
|
||||
* `attestor.submit_latency_seconds{quantile=0.95}` < 0.3.
|
||||
* `scanner.scan_latency_seconds{quantile=0.95}` < target per image size.
|
||||
* `concelier.export.duration_seconds` stable; `excititor.consensus.conflicts_total` not exploding after policy changes.
|
||||
* RustFS request error rate near zero (or `s3_requests_errors_total` when operating against S3); PostgreSQL `pg_stat_bgwriter` counters hit expected baseline.
|
||||
|
||||
### Appendix B — Upgrade safety checklist
|
||||
|
||||
* Verify **release manifest** signature.
|
||||
* Ensure **Signer/Authority/Attestor** are same minor.
|
||||
* Verify **DB backups** < 24h old.
|
||||
* Confirm **ILM** won’t purge compliance artifacts during upgrade window.
|
||||
* Roll **one component** at a time; watch SLOs; abort on regression.
|
||||
|
||||
---
|
||||
|
||||
**End — component_architecture_devops.md**
|
||||
102
docs/operations/devops/console-ci-contract.md
Normal file
@@ -0,0 +1,102 @@
|
||||
# Console CI Contract (DEVOPS-CONSOLE-23-001)
|
||||
|
||||
## Scope
|
||||
Define a deterministic, offline-friendly CI pipeline for the Console web app covering lint, type-check, unit, Storybook a11y, Playwright smoke, Lighthouse perf/a11y, and artifact retention.
|
||||
|
||||
## Stages & Gates
|
||||
1. **Setup**
   - Node 20.x, pnpm 9.x from cached tarball (`tools/cache/node20.tgz`, `tools/cache/pnpm-9.tgz`).
   - Restore `node_modules` from `.pnpm-store` cache key `console-${{ hashFiles('pnpm-lock.yaml') }}`; fallback to offline tarball `local-npm-cache.tar.zst`.
   - Export `PLAYWRIGHT_BROWSERS_PATH=./.playwright` and hydrate from `tools/cache/playwright-browsers.tar.zst`.
2. **Lint/Format/Types** (fail-fast)
   - `pnpm lint`
   - `pnpm format:check`
   - `pnpm typecheck`
3. **Unit Tests**
   - `pnpm test -- --runInBand --reporter=junit --outputFile=.artifacts/junit.xml`
   - Collect coverage to `.artifacts/coverage` (lcov + summary).
4. **Storybook a11y**
   - `pnpm storybook:build` (static export)
   - `pnpm storybook:a11y --ci --output .artifacts/storybook-a11y.json`
5. **Playwright Smoke**
   - `pnpm playwright test --config=playwright.config.ci.ts --reporter=list,junit=.artifacts/playwright.xml`
   - Upload `playwright-report/` and `.artifacts/playwright.xml`.
6. **Lighthouse (CI mode)**
   - Serve built app with `pnpm serve --port 4173` and run `pnpm lhci autorun --config=lighthouserc.ci.js --upload.target=filesystem --upload.outputDir=.artifacts/lhci`
   - Enforce budgets: performance >= 0.80, accessibility >= 0.90, best-practices >= 0.90, seo >= 0.85.
7. **SBOM/Provenance**
   - `pnpm exec syft packages dir:dist --output=spdx-json=.artifacts/console.spdx.json`
   - Attach `.artifacts/console.spdx.json` and provenance attestation from release job.
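
The Lighthouse budgets in stage 6 could be checked with a small shell helper. This is a minimal sketch under stated assumptions, not the pipeline's actual gate: the function names `budget_ok` and `check_budgets` are hypothetical, and scores are assumed to arrive as decimals in the 0 to 1 range. In practice the assertions in `lighthouserc.ci.js` may already enforce the same thresholds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a category score meets its budget.
# awk performs the floating-point comparison that plain `[ ]` cannot.
budget_ok() {
  local score="$1" minimum="$2"
  awk -v s="$score" -v m="$minimum" 'BEGIN { exit !(s + 0 >= m + 0) }'
}

# Thresholds copied from the contract: performance, accessibility,
# best-practices, seo (in that argument order).
check_budgets() {
  local perf="$1" a11y="$2" best="$3" seo="$4"
  budget_ok "$perf" 0.80 &&
    budget_ok "$a11y" 0.90 &&
    budget_ok "$best" 0.90 &&
    budget_ok "$seo" 0.85
}
```

A job step could call `check_budgets` with the four scores and let a non-zero exit status fail the job.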
|
||||
|
||||
## Determinism & Offline
|
||||
- No network fetches after cache hydrate; fail if `pnpm install` hits the network (set `PNPM_FETCH_RETRIES=0`, `PNPM_OFFLINE=1`).
|
||||
- All artifacts written under `.artifacts/` and uploaded as CI artifacts.
|
||||
- Timestamps normalized via `SOURCE_DATE_EPOCH=${{ github.run_id }}` for reproducible Storybook/LH builds.
|
||||
|
||||
## Inputs/Secrets
|
||||
- Required only for Playwright auth flows: `CONSOLE_E2E_USER`, `CONSOLE_E2E_PASS` (scoped to non-prod tenant). Pipeline must soft-skip auth tests when unset.
|
||||
- No signing keys required in CI; release handles signing separately.
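
One way to implement the soft-skip is to decide up front whether both credentials are present. The sketch below is illustrative only: `auth_tests_enabled` and `auth_tests_decision` are hypothetical names, and how the decision is wired into Playwright (a tag filter, a separate project, or something else) is left open.

```shell
#!/usr/bin/env bash
# Hypothetical gate: succeed only when both E2E credentials are non-empty.
auth_tests_enabled() {
  [ -n "${CONSOLE_E2E_USER:-}" ] && [ -n "${CONSOLE_E2E_PASS:-}" ]
}

# Print the decision so the CI log records why auth tests ran or were skipped.
auth_tests_decision() {
  if auth_tests_enabled; then
    echo "run"
  else
    echo "skip"
  fi
}
```

Because a missing secret yields `skip` rather than an error, the pipeline keeps its exit status clean when the variables are unset.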
|
||||
|
||||
## Outputs
|
||||
- `.artifacts/junit.xml` (unit)
|
||||
- `.artifacts/playwright.xml`, `playwright-report/`
|
||||
- `.artifacts/storybook-a11y.json`
|
||||
- `.artifacts/lhci/` (Lighthouse reports)
|
||||
- `.artifacts/coverage/`
|
||||
- `.artifacts/console.spdx.json`
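
Before the upload step, a job could confirm that every output listed above exists. The following is a sketch, not part of the contract: the function name `missing_console_artifacts` and the optional root-directory argument are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical check: print each expected CI output that is missing under
# the given root directory and return non-zero if any are absent.
missing_console_artifacts() {
  local root="${1:-.}" path missing=0
  for path in \
    .artifacts/junit.xml \
    .artifacts/playwright.xml \
    playwright-report \
    .artifacts/storybook-a11y.json \
    .artifacts/lhci \
    .artifacts/coverage \
    .artifacts/console.spdx.json; do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  return "$missing"
}
```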
|
||||
|
||||
## Example Gitea workflow snippet
|
||||
```yaml
- name: Console CI (DEVOPS-CONSOLE-23-001)
  uses: actions/setup-node@v4
  with:
    node-version: '20'

- name: Prep pnpm
  run: |
    corepack enable
    corepack prepare pnpm@9 --activate

- name: Cache pnpm store
  uses: actions/cache@v4
  with:
    path: |
      ~/.pnpm-store
      ./node_modules
    key: console-${{ hashFiles('pnpm-lock.yaml') }}

- name: Install (offline)
  env:
    PNPM_FETCH_RETRIES: 0
    PNPM_OFFLINE: 1
  run: pnpm install --frozen-lockfile

- name: Lint/Types
  run: pnpm lint && pnpm format:check && pnpm typecheck

- name: Unit
  run: pnpm test -- --runInBand --reporter=junit --outputFile=.artifacts/junit.xml

- name: Storybook a11y
  run: pnpm storybook:build && pnpm storybook:a11y --ci --output .artifacts/storybook-a11y.json

- name: Playwright
  run: pnpm playwright test --config=playwright.config.ci.ts --reporter=list,junit=.artifacts/playwright.xml

- name: Lighthouse
  run: pnpm serve --port 4173 & pnpm lhci autorun --config=lighthouserc.ci.js --upload.target=filesystem --upload.outputDir=.artifacts/lhci

- name: SBOM
  run: pnpm exec syft packages dir:dist --output=spdx-json=.artifacts/console.spdx.json

- name: Upload artifacts
  uses: actions/upload-artifact@v4
  with:
    name: console-ci-artifacts
    path: .artifacts
```
|
||||
|
||||
## Acceptance to mark blocker cleared
|
||||
- Pipeline executes fully in a clean runner with network blocked after cache hydrate.
|
||||
- All artefacts uploaded and budgets enforced; failing budgets fail the job.
|
||||
- Soft-skip auth-dependent tests when secrets are absent, without failing the pipeline.
|
||||
41
docs/operations/devops/export-ci-contract.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# Export Center CI Contract (DEVOPS-EXPORT-35-001)
|
||||
|
||||
Goal: Deterministic, offline-friendly CI for Export Center services (WebService + Worker) with storage fixtures, smoke/perf gates, and observability artefacts.
|
||||
|
||||
## Pipeline stages
|
||||
1) **Setup**
   - .NET SDK 10.x (cached); Node 20.x only if UI assets present.
   - Restore NuGet from `local-nugets/` + cache; fail on external fetch (configure `RestoreDisableParallel` and source mapping).
   - Spin up MinIO (minio/minio:RELEASE.2024-10-08T09-56-18Z) via docker-compose fixture `ops/devops/export/minio-compose.yml` with deterministic creds (`exportci/exportci123`).
2) **Build & Lint**
   - `dotnet format --verify-no-changes` on `src/ExportCenter/**`.
   - `dotnet build src/ExportCenter/StellaOps.ExportCenter.WebService/StellaOps.ExportCenter.WebService.csproj -c Release /p:ContinuousIntegrationBuild=true`.
3) **Unit/Integration Tests**
   - `dotnet test src/ExportCenter/__Tests/StellaOps.ExportCenter.Tests/StellaOps.ExportCenter.Tests.csproj -c Release --logger "trx;LogFileName=export-tests.trx"`
   - Tests must use MinIO fixture with bucket `export-ci` and deterministic seed objects (see fixtures below).
4) **Perf/Smoke (optional gated)**
   - `dotnet test ... --filter Category=Smoke` against live MinIO; cap runtime < 90s.
5) **Artifacts**
   - Publish TRX to `.artifacts/export-tests.trx`.
   - Collect coverage to `.artifacts/coverage` (coverlet; lcov + summary).
   - Export appsettings used for the run to `.artifacts/appsettings.ci.json`.
   - Syft SBOM: `syft dir:./src/ExportCenter -o spdx-json=.artifacts/exportcenter.spdx.json`.
6) **Dashboards (seed)**
   - Produce starter Grafana JSON with: request rate, p95 latency, MinIO error rate, queue depth, export job duration histogram. Store under `.artifacts/grafana/export-center-ci.json` for import.
|
||||
|
||||
## Fixtures
|
||||
- MinIO compose file: `ops/devops/export/minio-compose.yml` (add if missing) with:
  - Access key: `exportci`
  - Secret key: `exportci123`
  - Bucket: `export-ci`
- Seed object script: `ops/devops/export/seed-minio.sh` to create bucket and upload deterministic sample (`sample-export.ndjson`).
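
A seed script along these lines would cover the bucket creation and sample upload. This is a hedged sketch, not the contents of `ops/devops/export/seed-minio.sh`: the alias name `exportci`, the `MINIO_ENDPOINT` default of `http://localhost:9000`, and the `MC` override are assumptions, while the credentials, bucket, and sample file name come from the fixture list above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a MinIO seed step using the `mc` client.
# MC can be overridden (for example with `echo`) to dry-run the commands.
seed_minio() {
  local mc="${MC:-mc}"
  local endpoint="${MINIO_ENDPOINT:-http://localhost:9000}"  # assumed endpoint
  local sample="${1:-sample-export.ndjson}"

  # Register the fixture credentials, create the bucket if absent,
  # then upload the deterministic sample object.
  "$mc" alias set exportci "$endpoint" exportci exportci123 &&
    "$mc" mb --ignore-existing exportci/export-ci &&
    "$mc" cp "$sample" exportci/export-ci/
}
```

Running `MC=echo seed_minio` prints the three client invocations without touching a server, which is handy when reviewing the fixture offline.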
|
||||
|
||||
## Determinism & Offline
|
||||
- No external network after restore; MinIO uses local image tag pinned above.
|
||||
- All timestamps emitted as UTC and tests assert deterministic ordering.
|
||||
- Coverage, SBOM, Grafana seed stored under `.artifacts/` and uploaded.
|
||||
|
||||
## Acceptance to clear blocker
|
||||
- CI run passes on clean runner with network blocked post-restore.
|
||||
- Artifacts (.trx, coverage, SBOM, Grafana JSON) uploaded and MinIO fixture exercised in tests.
|
||||
- Smoke perf subset completes < 90s.
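
The 90-second cap on the smoke subset can be enforced mechanically instead of by inspection. The wrapper below is a minimal sketch assuming GNU coreutils `timeout` is available on the runner; `run_capped` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command under a wall-clock cap.
# GNU `timeout` exits with status 124 when the cap is reached.
run_capped() {
  local seconds="$1"
  shift
  timeout "$seconds" "$@"
}
```

The smoke stage would then invoke its `dotnet test` command through `run_capped 90`, so an overrun fails the job with a distinct exit status.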
|
||||
24
docs/operations/devops/governance-rules.md
Normal file
@@ -0,0 +1,24 @@
|
||||
# DevOps Governance Rules Anchor (Sprint 33)
|
||||
|
||||
> **Scope** · Exit deliverable for `DEVOPS-RULES-33-001`
|
||||
> **Audience** · DevOps Guild, Platform leads, service owners
|
||||
> **Related** · `ops/devops/TASKS.md`, `docs/backlog/2025-10-cleanup.md`, `docs/modules/platform/architecture-overview.md`
|
||||
|
||||
This note consolidates the platform governance rules ratified on 30 October 2025.
|
||||
Each rule captures intent, affected surfaces, enforcement actions, and references to the
|
||||
source-of-truth backlogs so that subsequent sprints do not re‑introduce conflicting work.
|
||||
|
||||
| Rule | Intent & Rationale | Enforcement & Ownership | Follow-ups |
|------|--------------------|-------------------------|------------|
| **Gateway is a proxy only; Policy Engine owns overlays/simulations.** | Keep Gateway thin and deterministic: it authenticates, authorises, and forwards requests. All overlay composition, simulation, and policy evaluation stays inside Policy Engine so we avoid duplicated logic and time-of-check drift. | *Owners:* BE‑Base Platform Guild + Policy Engine Guild. <br/>*Enforcement:* Gateway PR reviews block embedded overlay code, new endpoints require `Policy Engine` contracts, CI parity checks compare Gateway ↔ Policy overlay schemas. | - Update open tasks referencing “gateway overlay” work to point at `POLICY-ENGINE-20-00x`.<br/>- Close or rewrite backlog items `WEB-POLICY-20-00x` that attempted to compute overlays in Gateway. |
| **AOC ingestion is canonical-only; no merges at ingest.** | Concelier/Excititor persist upstream truth plus provenance. Derived severity, merges, or dedupe belong to downstream Policy workflows. This keeps ingestion auditable and replayable. | *Owners:* Concelier & Excititor guilds, DevOps Guild for CI pipelines. <br/>*Enforcement:* `StellaOps.Aoc` guard library, Mongo validators, Roslyn analyzer backlog (`WEB-AOC-19-003`), CI job `stella aoc verify`. | - Ensure ingestion tasks reference the guard library (`StellaOps.Aoc`).<br/>- Retire legacy tasks that still mention merge-at-ingest (see backlog cleanup note). |
| **Single graph platform: Graph Indexer + Graph API (Cartographer retired).** | Replace the historical Cartographer service with the Graph Indexer + Graph API pairing so graph storage, overlays, and explorer flows share one platform. | *Owners:* Graph Platform Guild, Scheduler Guild, DevOps Guild. <br/>*Enforcement:* New graph work lands in `docs/modules/graph/**` and `src/Graph/**`. Gateway/UI/CLI tickets reference the Graph API endpoints only. | - Archive Cartographer handshake docs and mark Cartographer backlog items as historical.<br/>- Update Scheduler/SBOM/Console tickets to depend on `GRAPH-*` IDs instead of `CARTO-*`. |
|
||||
|
||||
## Tracking & documentation
|
||||
|
||||
- ✅ Rules recorded in the corresponding sprint file `/docs/implplan/SPRINT_*.md` (Sprint 33) and `/docs/ops/devops/TASKS.md`.
|
||||
- ✅ Repository-wide references to “Cartographer as active platform” updated (see backlog note amendment and doc banner).
|
||||
- ✅ Changelog entry (`docs/updates/2025-10-30-devops-governance.md`) captures reviewer acknowledgement.
|
||||
|
||||
Future adjustments to these rules must update this file and reference `DEVOPS-RULES-33-001`
|
||||
when proposing changes so the DevOps Guild can track history.
|
||||
52
docs/operations/devops/migrations/semver-style.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# SemVer Style Backfill Runbook
|
||||
|
||||
_Last updated: 2025-10-11_
|
||||
|
||||
> **Note (2025-12):** This runbook is obsolete. MongoDB was fully removed in Sprint 4400 and replaced with PostgreSQL. The migration functionality described here was executed during the transition period and is no longer applicable. Retained for historical reference only.
|
||||
|
||||
## Overview
|
||||
|
||||
The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures
|
||||
provenance `decisionReason` values are preserved during future reads. The migration is idempotent and only
|
||||
runs when the feature flag `concelier:storage:enableSemVerStyle` is enabled.
|
||||
|
||||
## Preconditions
|
||||
|
||||
1. **Review configuration** – set `concelier.storage.enableSemVerStyle` to `true` on all Concelier services.
2. **Confirm batch size** – adjust `concelier.storage.backfillBatchSize` if you need smaller batches for older
   deployments (default: `250`).
3. **Back up** – capture a fresh snapshot of the `advisory` collection or a full MongoDB backup.
4. **Staging dry-run** – enable the flag in a staging environment and observe the migration output before
   rolling to production.
|
||||
|
||||
## Execution
|
||||
|
||||
No manual command is required. After deploying the configuration change, restart the Concelier WebService or
|
||||
any component that hosts the Mongo migration runner. During startup you will see log entries similar to:
|
||||
|
||||
```
Applying Mongo migration 20251011-semver-style-backfill: Populate advisory.normalizedVersions for existing documents when SemVer style storage is enabled.
Mongo migration 20251011-semver-style-backfill applied
```
|
||||
|
||||
The migration reads advisories in batches (`concelier.storage.backfillBatchSize`) and writes flattened
|
||||
`normalizedVersions` arrays. Existing documents without SemVer ranges remain untouched.
|
||||
|
||||
## Post-checks
|
||||
|
||||
1. Verify the new indexes exist:
   ```
   db.advisory.getIndexes()
   ```
   You should see `advisory_normalizedVersions_pkg_scheme_type` and `advisory_normalizedVersions_value`.
2. Spot check a few advisories to confirm the top-level `normalizedVersions` array exists and matches
   the embedded package data.
3. Run `dotnet test` for `StellaOps.Concelier.Storage.Mongo.Tests` (optional but recommended) in CI to confirm
   the storage suite passes with the feature flag enabled.
|
||||
|
||||
## Rollback
|
||||
|
||||
Set `concelier.storage.enableSemVerStyle` back to `false` and redeploy. The migration will be skipped on
|
||||
subsequent startups. You can leave the populated `normalizedVersions` arrays in place; they are ignored when
|
||||
the feature flag is off. If you must remove them entirely, restore from the backup captured during
|
||||
preparation.
|
||||
27
docs/operations/devops/policy-schema-export.md
Normal file
@@ -0,0 +1,27 @@
|
||||
# Policy Schema Export Automation
|
||||
|
||||
This utility generates JSON Schema documents for the Policy Engine run contracts.
|
||||
|
||||
## Command
|
||||
|
||||
```
scripts/export-policy-schemas.sh [output-directory]
```
|
||||
|
||||
When no output directory is supplied, schemas are written to `docs/modules/policy/schemas/`.
|
||||
|
||||
The exporter builds against `StellaOps.Scheduler.Models` and emits:
|
||||
|
||||
- `policy-run-request.schema.json`
|
||||
- `policy-run-status.schema.json`
|
||||
- `policy-diff-summary.schema.json`
|
||||
- `policy-explain-trace.schema.json`
|
||||
|
||||
The build pipeline (`.gitea/workflows/build-test-deploy.yml`, job **Export policy run schemas**) runs this script on every push and pull request. Exports land under `artifacts/policy-schemas/<commit>/`, are published as the `policy-schema-exports` artifact, and changes trigger a Slack post to `#policy-engine` via the `POLICY_ENGINE_SCHEMA_WEBHOOK` secret. A unified diff is stored alongside the exports for downstream consumers.
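
The change detection behind the notification can be pictured as a directory diff between the previous and current exports. The sketch below is an assumption about how such a check might look, not the workflow's actual step: `schemas_changed` and its three arguments are hypothetical, and only standard `diff` behaviour is relied on.

```shell
#!/usr/bin/env bash
# Hypothetical change detector: write a unified diff between two export
# directories. Returns 0 when schemas differ, 1 when identical, 2 on error.
schemas_changed() {
  local previous="$1" current="$2" diff_file="$3" rc=0
  diff -ruN "$previous" "$current" > "$diff_file" || rc=$?
  case "$rc" in
    0) return 1 ;;  # identical: nothing to announce
    1) return 0 ;;  # differences found: diff_file holds the unified diff
    *) return 2 ;;  # diff itself failed (for example, unreadable path)
  esac
}
```

A pipeline step could post the notification only when the function returns 0 and publish the diff file next to the exported schemas.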
|
||||
|
||||
## CI integration checklist
|
||||
|
||||
- [x] Invoke the script in the DevOps pipeline (see `DEVOPS-POLICY-20-004`).
|
||||
- [x] Publish the generated schemas as pipeline artifacts.
|
||||
- [x] Notify downstream consumers when schemas change (Slack `#policy-engine`, changelog snippet).
|
||||
- [ ] Gate CLI validation once schema artifacts are available.
|
||||
151
docs/operations/devops/runbooks/deployment-upgrade.md
Normal file
@@ -0,0 +1,151 @@
|
||||
# Stella Ops Deployment Upgrade & Rollback Runbook
|
||||
|
||||
_Last updated: 2025-10-26 (Sprint 14 – DEVOPS-OPS-14-003)._
|
||||
|
||||
This runbook describes how to promote a new release across the supported deployment profiles (Helm and Docker Compose), how to roll back safely, and how to keep channels (`edge`, `stable`, `airgap`) aligned. All steps assume you are working from a clean checkout of the release branch/tag.
|
||||
|
||||
---
|
||||
|
||||
## 1. Channel overview
|
||||
|
||||
| Channel | Release manifest | Helm values | Compose profile |
|---------|------------------|-------------|-----------------|
| `edge` | `deploy/releases/2025.10-edge.yaml` | `devops/helm/stellaops/values-dev.yaml` | `devops/compose/docker-compose.dev.yaml` |
| `stable` | `deploy/releases/2025.09-stable.yaml` | `devops/helm/stellaops/values-stage.yaml`, `devops/helm/stellaops/values-prod.yaml` | `devops/compose/docker-compose.stage.yaml`, `devops/compose/docker-compose.prod.yaml` |
| `airgap` | `deploy/releases/2025.09-airgap.yaml` | `devops/helm/stellaops/values-airgap.yaml` | `devops/compose/docker-compose.airgap.yaml` |
|
||||
|
||||
Infrastructure components (PostgreSQL, Valkey, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `devops/compose/*.yaml` for the authoritative set.
|
||||
|
||||
---
|
||||
|
||||
## 2. Pre-flight checklist
|
||||
|
||||
1. **Refresh release manifest**
   Pull the latest manifest for the channel you are promoting (`deploy/releases/<version>-<channel>.yaml`).

2. **Align deployment bundles with the manifest**
   Run the alignment checker for every profile that should pick up the release. Pass `--ignore-repo nats` to skip auxiliary services.
   ```bash
   ./deploy/tools/check-channel-alignment.py \
     --release deploy/releases/2025.10-edge.yaml \
     --target devops/helm/stellaops/values-dev.yaml \
     --target devops/compose/docker-compose.dev.yaml \
     --ignore-repo nats
   ```
   Repeat for other channels (`stable`, `airgap`), substituting the manifest and target files.

3. **Lint and template profiles**
   ```bash
   ./deploy/tools/validate-profiles.sh
   ```

4. **Smoke the Offline Kit debug store (edge/stable only)**
   When the release pipeline has generated `out/release/debug/.build-id/**`, mirror the assets into the Offline Kit staging tree:
   ```bash
   ./ops/offline-kit/mirror_debug_store.py \
     --release-dir out/release \
     --offline-kit-dir out/offline-kit
   ```
   Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle.

5. **Review compatibility matrix**
   Confirm PostgreSQL, Valkey, and RustFS versions in the release manifest match platform SLOs. The default targets are `postgres:16-alpine`, `valkey:8.0`, `rustfs:2025.10.0-edge`.

6. **Create a rollback bookmark**
   Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes.
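
The rollback bookmark from step 6 can be captured in one place so it is easy to attach to the change log. This is a minimal sketch, not an existing tool in the repository: the function name, the default file name `rollback-bookmark.txt`, and the `HELM`/`GIT` overrides are assumptions; the two commands themselves are the ones quoted in the step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the current Helm history and Git tag to a
# bookmark file. HELM and GIT can be overridden for dry runs.
record_rollback_bookmark() {
  local out="${1:-rollback-bookmark.txt}"
  local helm="${HELM:-helm}" git="${GIT:-git}"
  {
    echo "# bookmark $(date -u +%Y-%m-%dT%H:%M:%SZ)"  # UTC, ISO-8601
    echo "## helm history"
    "$helm" history stellaops -n stellaops
    echo "## compose tag"
    "$git" describe --tags
  } >> "$out"
}
```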
|
||||
|
||||
---
|
||||
|
||||
## 3. Helm upgrade procedure (staging → production)
|
||||
|
||||
1. Switch to the deployment branch and ensure secrets/config maps are current.
2. Apply the upgrade in the staging cluster:
   ```bash
   helm upgrade stellaops devops/helm/stellaops \
     -f devops/helm/stellaops/values-stage.yaml \
     --namespace stellaops \
     --atomic \
     --timeout 15m
   ```
3. Run smoke tests (`scripts/smoke-tests.sh` or environment-specific checks).
4. Promote to production using the prod values file and the same command.
5. Record the new revision number and Git SHA in the change log.
|
||||
|
||||
### Rollback (Helm)
|
||||
|
||||
1. Identify the previous revision: `helm history stellaops -n stellaops`.
2. Execute:
   ```bash
   helm rollback stellaops <revision> \
     --namespace stellaops \
     --wait \
     --timeout 10m
   ```
3. Verify `kubectl get pods` returns healthy workloads; rerun smoke tests.
4. Update the incident/operations log with root cause and rollback details.
|
||||
|
||||
---
|
||||
|
||||
## 4. Docker Compose upgrade procedure
|
||||
|
||||
1. Update environment files (`devops/compose/env/*.env.example`) with any new settings and sync secrets to hosts.
2. Pull the tagged repository state corresponding to the release (e.g. `git checkout 2025.09.2` for stable).
3. Apply the upgrade:
   ```bash
   docker compose \
     --env-file devops/compose/env/prod.env \
     -f devops/compose/docker-compose.prod.yaml \
     pull

   docker compose \
     --env-file devops/compose/env/prod.env \
     -f devops/compose/docker-compose.prod.yaml \
     up -d
   ```
4. Tail logs for critical services (`docker compose logs -f authority concelier`).
5. Update monitoring dashboards/alerts to confirm normal operation.
|
||||
|
||||
### Rollback (Compose)
|
||||
|
||||
1. Check out the previous release tag (e.g. `git checkout 2025.09.1`).
|
||||
2. Re-run `docker compose pull` and `docker compose up -d` with that profile. Docker will restore the prior digests.
|
||||
3. If reverting to a known-good snapshot is required, restore volume backups (see `docs/modules/authority/operations/backup-restore.md` and associated service guides).
|
||||
4. Log the rollback in the operations journal.
|
||||
|
||||
---
|
||||
|
||||
## 5. Channel promotion workflow
|
||||
|
||||
1. Author or update the channel manifest under `deploy/releases/`.
|
||||
2. Mirror the new digests into Helm/Compose values and run the alignment script for each profile.
|
||||
3. Commit the changes with a message that references the release version and channel (e.g. `deploy: promote 2025.10.0-edge`).
|
||||
4. Publish release notes and update `deploy/releases/README.md` (if applicable).
|
||||
5. Tag the repository when promoting stable or airgap builds.
|
||||
|
||||
---
|
||||
|
||||
## 6. Upgrade rehearsal & rollback drill log
|
||||
|
||||
Maintain rehearsal notes in `docs/modules/devops/runbooks/launch-cutover.md` or the relevant sprint planning document. After each drill capture:
|
||||
|
||||
- Release version tested
|
||||
- Date/time
|
||||
- Participants
|
||||
- Issues encountered & fixes
|
||||
- Rollback duration (if executed)
|
||||
|
||||
Attach the log to the sprint retro or operational wiki.
|
||||
|
||||
| Date (UTC) | Channel | Outcome | Notes |
|------------|---------|---------|-------|
| 2025-10-26 | Documentation dry-run | Planned | Runbook refreshed; next live drill scheduled for 2025-11 edge → stable promotion. |
|
||||
|
||||
---
|
||||
|
||||
## 7. References
|
||||
|
||||
- `deploy/README.md` – structure and validation workflow for deployment bundles.
|
||||
- `docs/RELEASE_ENGINEERING_PLAYBOOK.md` – release automation and signing pipeline.
|
||||
- `docs/modules/devops/architecture.md` – high-level DevOps architecture, SLOs, and compliance requirements.
|
||||
- `ops/offline-kit/mirror_debug_store.py` – debug-store mirroring helper.
|
||||
- `deploy/tools/check-channel-alignment.py` – release vs deployment digest alignment checker.
|
||||
130
docs/operations/devops/runbooks/launch-cutover.md
Normal file
@@ -0,0 +1,130 @@
|
||||
# Launch Cutover Runbook - Stella Ops
|
||||
|
||||
_Document owner: DevOps Guild (2025-10-26)_
|
||||
_Scope:_ Full-platform launch from staging to production for release `2025.09.2`.
|
||||
|
||||
> **Note (2025-12):** This document reflects the state at initial launch. Since then, MongoDB has been fully removed (Sprint 4400) and replaced with PostgreSQL. MinIO references now use RustFS. Redis references now use Valkey. See current deployment docs in `deploy/` for up-to-date configuration.
|
||||
|
||||
## 1. Roles and Communication
|
||||
|
||||
| Role | Primary | Backup | Contact |
| --- | --- | --- | --- |
| Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) |
| Authority stack | Authority Core guild rep | Security guild rep | `#authority` |
| Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` |
| Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation |
| Observability | Telemetry guild rep | SRE on-call | `#telemetry` |
| Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket |
|
||||
|
||||
Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes.
|
||||
|
||||
## 2. Timeline Overview (UTC)
|
||||
|
||||
| Time | Activity | Owner |
| --- | --- | --- |
| T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead |
| T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer |
| T-6h | Freeze non-launch deployments; notify guild leads. | Product owner |
| T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps |
| T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead |
| T0 | Execute production cutover steps (Section 4). | Cutover team |
| T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead |
| T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner |
|
||||
|
||||
## 3. Rehearsal (Staging) Checklist
|
||||
|
||||
1. `docker network create stellaops_frontdoor || true` (if not present on staging jump host).
|
||||
2. Run `deploy/tools/validate-profiles.sh` and archive output.
|
||||
3. Apply staging secrets (`kubectl apply -f secrets/stage/*.yaml` or `helm secrets upgrade`) ensuring `stellaops-stage` credentials align with `values-stage.yaml`.
|
||||
4. Perform `helm upgrade stellaops devops/helm/stellaops -f devops/helm/stellaops/values-stage.yaml` in staging cluster.
|
||||
5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`.
|
||||
6. Execute smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm report status in UI.
|
||||
7. Document total wall time and any deviations in the rehearsal log.
|
||||
|
||||
Rehearsal must complete without manual interventions before proceeding to production.
|
||||
|
||||
## 4. Production Cutover Steps
|
||||
|
||||
### 4.1 Pre-flight
|
||||
- Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`.
|
||||
- Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host.
|
||||
- Back up current configuration and data:
  - Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`.
  - MinIO bucket mirror: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`.
|
||||
|
||||
### 4.2 Apply Updates (Compose)
|
||||
1. On each compose node, pull updated images for release `2025.09.2`:
   ```bash
   docker compose --env-file prod.env -f devops/compose/docker-compose.prod.yaml pull
   ```
2. Deploy changes:
   ```bash
   docker compose --env-file prod.env -f devops/compose/docker-compose.prod.yaml up -d
   ```
3. Confirm containers healthy via `docker compose ps` and `docker logs <service> --tail 50`.
|
||||
|
||||
### 4.3 Apply Updates (Helm/Kubernetes)
|
||||
If using Kubernetes, perform:
|
||||
```bash
helm upgrade stellaops devops/helm/stellaops -f devops/helm/stellaops/values-prod.yaml --atomic --timeout 15m
```
|
||||
Monitor rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service>`.
|
||||
|
||||
### 4.4 Configuration Validation
|
||||
- Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`.
|
||||
- Validate Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`.
|
||||
- Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (returns success).
|
||||
- Ensure Notify (legacy) still accessible while Notifier migration pending.
|
||||
|
||||
## 5. Smoke Tests
|
||||
|
||||
| Test | Command / Action | Expected Result |
| --- | --- | --- |
| API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` |
| Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes < 5 minutes; report accessible with signed DSSE |
| Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` |
| Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata |
| Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` |
| Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent |
|
||||
|
||||
Log results in the change ticket with timestamps and screenshots where applicable.
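
The API health row in the table can be turned into a scriptable check. The snippet below is a sketch under assumptions: `is_healthy` is a hypothetical helper that inspects a response body on stdin, and the exact JSON shape of `/healthz` is inferred only from the expected-result column.

```shell
#!/usr/bin/env bash
# Hypothetical check: read a /healthz response body on stdin and succeed
# when it reports "status":"Healthy" (whitespace around the colon allowed).
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"Healthy"'
}

# Example wiring (not executed here); SCANNER_HEALTH_URL is a placeholder
# for the real endpoint:
#   curl -fsS "$SCANNER_HEALTH_URL" | is_healthy
```

With `curl -f`, a non-2xx response also fails the pipeline when `set -o pipefail` is in effect, so both the status code and the body are covered.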
|
||||
|
||||
## 6. Rollback Procedure
|
||||
|
||||
1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
2. For Compose:
   ```bash
   docker compose --env-file prod.env -f devops/compose/docker-compose.prod.yaml down
   docker compose --env-file stage.env -f devops/compose/docker-compose.stage.yaml up -d
   ```
3. For Helm:
   ```bash
   helm rollback stellaops <previous-release-number> --namespace stellaops
   ```
4. Restore Mongo snapshot if data inconsistency detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`.
5. Restore MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`.
6. Notify stakeholders of rollback and capture root cause notes in incident ticket.
|
||||
|
||||
## 7. Post-cutover Actions
|
||||
|
||||
- Keep heightened monitoring for 4 hours post cutover; track latency, error rates, and queue depth.
|
||||
- Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
|
||||
- Update `docs/modules/devops/runbooks/launch-readiness.md` if any new gaps or follow-ups discovered.
|
||||
- Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner.
|
||||
|
||||
## 8. Approval Matrix
|
||||
|
||||
| Step | Required Approvers | Record Location |
| --- | --- | --- |
| Production deployment plan | CTO + DevOps lead | Change ticket comment |
| Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary |
| Post-smoke success | DevOps lead + product owner | Change ticket closure |
| Rollback (if invoked) | DevOps lead + CTO | Incident ticket |
|
||||
|
||||
Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.
|
||||
|
||||
## 9. Rehearsal Log
|
||||
|
||||
| Date (UTC) | What We Exercised | Outcome | Follow-up |
| --- | --- | --- | --- |
| 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. |
|
||||
51
docs/operations/devops/runbooks/launch-readiness.md
Normal file
@@ -0,0 +1,51 @@
# Launch Readiness Record - Stella Ops

_Updated: 2025-10-26 (UTC)_

> **Note (2025-12):** This document reflects the state at initial launch. Since then, MongoDB has been fully removed (Sprint 4400) and replaced with PostgreSQL. Redis references now use Valkey. See current deployment docs in `deploy/` for up-to-date configuration.

This document captures production launch sign-offs, deployment readiness checkpoints, and any open risks that must be tracked before GA cutover.

## 1. Sign-off Summary

| Module / Service | Guild / Point of Contact | Evidence (Task or Runbook) | Status | Timestamp (UTC) | Notes |
| --- | --- | --- | --- | --- | --- |
| Authority (Issuer) | Authority Core Guild | `AUTH-AOC-19-001` - scope issuance & configuration complete (DONE 2025-10-26) | READY | 2025-10-26T14:05Z | Tenant scope propagation follow-up (`AUTH-AOC-19-002`) tracked in gaps section. |
| Signer | Signer Guild | `SIGNER-API-11-101` / `SIGNER-REF-11-102` / `SIGNER-QUOTA-11-103` (DONE 2025-10-21) | READY | 2025-10-26T14:07Z | DSSE signing, referrer verification, and quota enforcement validated in CI. |
| Attestor | Attestor Guild | `ATTESTOR-API-11-201` / `ATTESTOR-VERIFY-11-202` / `ATTESTOR-OBS-11-203` (DONE 2025-10-19) | READY | 2025-10-26T14:10Z | Rekor submission/verification pipeline green; telemetry pack published. |
| Scanner Web + Worker | Scanner WebService Guild | `SCANNER-WEB-09-10x`, `SCANNER-RUNTIME-12-30x` (DONE 2025-10-18 -> 2025-10-24) | READY* | 2025-10-26T14:20Z | Orchestrator envelope work (`SCANNER-EVENTS-16-301/302`) still open; see gaps. |
| Concelier Core & Connectors | Concelier Core / Ops Guild | Ops runbook sign-off in `docs/modules/concelier/operations/conflict-resolution.md` (2025-10-16) | READY | 2025-10-26T14:25Z | Conflict resolution & connector coverage accepted; Mongo schema hardening pending (see gaps). |
| Excititor API | Excititor Core Guild | Wave 0 connector ingest sign-offs (Sprint backlog reference) | READY | 2025-10-26T14:28Z | VEX linkset publishing complete for launch datasets. |
| Notify Web (legacy) | Notify Guild | Existing stack carried forward; Notifier program tracked separately (Sprint 38-40) | PENDING | 2025-10-26T14:32Z | Legacy notify web remains operational; migration to Notifier blocked on `SCANNER-EVENTS-16-301`. |
| Web UI | UI Guild | Stable build `registry.stella-ops.org/.../web-ui@sha256:10d9248...` deployed in stage and smoke-tested | READY | 2025-10-26T14:35Z | Policy editor GA items (Sprint 20) outside launch scope. |
| DevOps / Release | DevOps Guild | `deploy/tools/validate-profiles.sh` run (2025-10-26) covering dev/stage/prod/airgap/mirror | READY | 2025-10-26T15:02Z | Compose/Helm lint + docker compose config validated; see Section 2 for details. |
| Offline Kit | Offline Kit Guild | `DEVOPS-OFFLINE-18-004` (Go analyzer) and `DEVOPS-OFFLINE-18-005` (Python analyzer) complete; debug-store mirror pending (`DEVOPS-OFFLINE-17-004`). | PENDING | 2025-11-23T15:05Z | Release workflow now ships `out/release/debug`; run `mirror_debug_store.py` on next release artefact and commit `metadata/debug-store.json`. |

_\* READY with caveat - remaining work noted in Section 3._

## 2. Deployment Readiness Checklist

- **Production profiles committed:** `devops/compose/docker-compose.prod.yaml` and `devops/helm/stellaops/values-prod.yaml` added with front-door network hand-off and secret references for Mongo/MinIO/core services.
- **Secrets placeholders documented:** `devops/compose/env/prod.env.example` enumerates required credentials (`MONGO_INITDB_ROOT_PASSWORD`, `MINIO_ROOT_PASSWORD`, Redis/NATS endpoints, `FRONTDOOR_NETWORK`). Helm values reference Kubernetes secrets (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`).
- **Static validation executed:** `deploy/tools/validate-profiles.sh` run on 2025-10-26 (docker compose config + helm lint/template) with all profiles passing.
- **Ingress model defined:** Production compose profile introduces external `frontdoor` network; README updated with creation instructions and scope of externally reachable services.
- **Observability hooks:** Authority/Signer/Attestor telemetry packs verified; scanner runtime build-id metrics landed (`SCANNER-RUNTIME-17-401`). Grafana dashboards referenced in component runbooks.
- **Rollback assets:** Stage Compose profile remains aligned (`docker-compose.stage.yaml`), enabling rehearsals before prod cutover; release manifests (`deploy/releases/2025.09-stable.yaml`) map digests for reproducible rollback.
- **Rehearsal status:** 2025-10-26 validation dry-run executed (`deploy/tools/validate-profiles.sh` across dev/stage/prod/airgap/mirror). Full stage Helm rollout pending access to the managed staging cluster; target to complete once credentials are provisioned.

## 3. Outstanding Gaps & Follow-ups

| Item | Owner | Tracking Ref | Target / Next Step | Impact |
| --- | --- | --- | --- | --- |
| Tenant scope propagation and audit coverage | Authority Core Guild | `AUTH-AOC-19-002` (DOING 2025-10-26) | Land enforcement + audit fixtures by Sprint 19 freeze | Medium - required for multi-tenant GA but does not block initial cutover if tenants scoped manually. |
| Orchestrator event envelopes + Notifier handshake | Scanner WebService Guild | `SCANNER-EVENTS-16-301` (BLOCKED), `SCANNER-EVENTS-16-302` (DOING) | Coordinate with Gateway/Notifier owners on preview package replacement or binding redirects; rerun `dotnet test` once patch lands and refresh schema docs. Share envelope samples in `docs/modules/signals/events/` after tests pass. | High - gating Notifier migration; legacy notify path remains functional meanwhile. |
| Offline Kit Python analyzer bundle | Offline Kit Guild + Scanner Guild | `DEVOPS-OFFLINE-18-005` (DONE 2025-10-26) | Monitor for follow-up manifest updates and rerun smoke script when analyzers change. | Medium - ensures language analyzer coverage stays current for offline installs. |
| Offline Kit debug store mirror | Offline Kit Guild + DevOps Guild | `DEVOPS-OFFLINE-17-004` (TODO 2025-11-23) | Release pipeline now publishes `out/release/debug`; run `mirror_debug_store.py`, verify hashes, and commit `metadata/debug-store.json`. | Low - symbol lookup remains accessible from staging assets but required before next Offline Kit tag. |
| Mongo schema validators for advisory ingestion | Concelier Storage Guild | `CONCELIER-STORE-AOC-19-001` (TODO) | Finalize JSON schema + migration toggles; coordinate with Ops for rollout window | Low - current validation handled in app layer; schema guard adds defense-in-depth. |
| Authority plugin telemetry alignment | Security Guild | `SEC2.PLG`, `SEC3.PLG`, `SEC5.PLG` (BLOCKED pending AUTH DPoP/MTLS tasks) | Resume once upstream auth surfacing stabilises | Low - plugin remains optional; launch uses default Authority configuration. |

## 4. Approvals & Distribution

- Record shared in `#launch-readiness` (Mattermost) 2025-10-26 15:15 UTC with DevOps + Guild leads for acknowledgement.
- Updates to this document require dual sign-off from DevOps Guild (owner) and impacted module guild lead; retain change log via Git history.
- Cutover rehearsal and rollback drills are tracked separately in `docs/modules/devops/runbooks/launch-cutover.md` (see associated Task `DEVOPS-LAUNCH-18-001`).
64
docs/operations/devops/runbooks/nuget-preview-bootstrap.md
Normal file
@@ -0,0 +1,64 @@
# NuGet Preview Bootstrap (Offline-Friendly)

The StellaOps build relies on .NET 10 RC2 packages (`Microsoft.Extensions.*`, JwtBearer 10.0 RC).
`NuGet.config` now wires three sources:

1. `local` → `./local-nuget` (preferred, air-gapped mirror)
2. `dotnet-public` → `https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json`
3. `nuget.org` → fallback for everything else

Follow the steps below whenever you refresh the repo or roll a new Offline Kit drop.

## 1. Mirror the preview packages

```bash
./ops/devops/sync-preview-nuget.sh
```

* Reads `ops/devops/nuget-preview-packages.csv`. Each line specifies the package, version, expected SHA-256 hash, and (optionally) the flat-container base URL (we pin to `dotnet-public`).
* Downloads the `.nupkg` straight into `./local-nuget/` and re-verifies the checksum. Existing files are skipped when hashes already match.
* Use `NUGET_V2_BASE` if you need to temporarily point at a different mirror.

💡 The script never mutates packages in place—if a checksum changes you will see a “SHA mismatch … refreshing” message.
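
The checksum step described above can be sketched offline in Python. This is an illustration only, not the contents of `sync-preview-nuget.sh`: the function names, the `#` comment syntax, the absence of a header row, and the `<package>.<version>.nupkg` lower-case file naming are all assumptions. Only the column order (package, version, SHA-256, optional base URL) comes from the description above.

```python
import csv
import hashlib
import io
from pathlib import Path


def parse_manifest(text):
    """Parse manifest rows: package, version, sha256, optional base URL."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        # Skip blank, short, and comment rows (comment syntax is an assumption).
        if len(cells) < 3 or cells[0].startswith("#"):
            continue
        base_url = cells[3] if len(cells) > 3 and cells[3] else None
        entries.append((cells[0], cells[1], cells[2].lower(), base_url))
    return entries


def sha256_file(path):
    """Hash a file in chunks so large packages do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_mirror(manifest_text, mirror_dir):
    """Return {file name: 'ok' | 'missing' | 'mismatch'} for each manifest entry."""
    results = {}
    for package, version, expected, _base_url in parse_manifest(manifest_text):
        # Assumed naming convention for mirrored packages.
        name = f"{package.lower()}.{version.lower()}.nupkg"
        path = Path(mirror_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `check_mirror(Path("ops/devops/nuget-preview-packages.csv").read_text(), "./local-nuget")` would then report which mirrored packages still need a refresh, without touching the network.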

## 2. Restore using the shared `NuGet.config`

From the repo root:

```bash
DOTNET_NOLOGO=1 dotnet restore src/Excititor/__Libraries/StellaOps.Excititor.Connectors.Abstractions/StellaOps.Excititor.Connectors.Abstractions.csproj \
  --configfile NuGet.config
```

The `packageSourceMapping` section keeps `Microsoft.Extensions.*`, `Microsoft.AspNetCore.*`, and `Microsoft.Data.Sqlite` bound to `local`/`dotnet-public`, so `dotnet restore` never has to reach out to nuget.org when mirrors are populated.

Before committing changes (or when wiring up a new environment) run:

```bash
python3 ops/devops/validate_restore_sources.py
```

The validator asserts:

- `NuGet.config` lists `local` → `dotnet-public` → `nuget.org` in that order.
- `Directory.Build.props` pins `RestoreSources` so every project prioritises the local mirror.
- No stray `NuGet.config` files shadow the repo root configuration.

CI executes the validator in both the `build-test-deploy` and `release` workflows, so regressions trip before any restore/build begins.
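
To make the first of those assertions concrete, here is a minimal sketch of a source-order check. It is not the code of `validate_restore_sources.py`; the function names and the error message format are hypothetical, and it covers only the `NuGet.config` ordering rule, not the `Directory.Build.props` or stray-config checks. It relies on the standard `NuGet.config` layout, where sources are `<add key="..." value="..."/>` elements under `<packageSources>`.

```python
import xml.etree.ElementTree as ET

# Order taken from the runbook: local mirror first, nuget.org last.
EXPECTED_ORDER = ["local", "dotnet-public", "nuget.org"]


def source_keys(config_xml):
    """Return the package source keys in document order."""
    root = ET.fromstring(config_xml)
    sources = root.find("packageSources")
    if sources is None:
        return []
    return [node.get("key") for node in sources.findall("add")]


def check_source_order(config_xml, expected=EXPECTED_ORDER):
    """Return a list of problems; an empty list means the order matches."""
    keys = source_keys(config_xml)
    if keys == list(expected):
        return []
    return [f"expected sources {list(expected)}, found {keys}"]
```

A caller would typically read the repo-root `NuGet.config`, pass its text to `check_source_order`, and exit non-zero when the returned list is not empty.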

If you run fully air-gapped, remember to clear the cache between SDK upgrades:

```bash
dotnet nuget locals all --clear
```

## 3. Troubleshooting

| Symptom | Fix |
| --- | --- |
| `dotnet restore` still hits nuget.org for preview packages | Re-run `sync-preview-nuget.sh` to ensure the `.nupkg` exists locally, then delete `~/.nuget/packages/microsoft.extensions.*` so the resolver picks up the mirrored copy. |
| SHA mismatch in the manifest | Update `ops/devops/nuget-preview-packages.csv` with the new version + checksum (from the feed) and re-run the sync script. |
| Azure DevOps feed throttling | Set `DOTNET_PUBLIC_FLAT_BASE` env var and point it at your own mirrored flat-container, then add the URL to the 4th column of the manifest. |

Keep this doc alongside Offline Kit instructions so air-gapped operators know exactly how to refresh the mirror and verify packages before restore.
49
docs/operations/devops/runbooks/zastava-deployment.md
Normal file
@@ -0,0 +1,49 @@
# Zastava Deployment Runbook

> **Audience:** DevOps, Zastava Guild
>
> **Purpose:** Provide steps for deploying Zastava Observer + Webhook in connected and air-gapped clusters.

## 1. Prerequisites

- Kubernetes 1.26+ with admission registration permissions.
- Access to StellaOps Container Registry or offline bundle with Zastava images.
- Authority scopes and certificates configured for Zastava identities.
- Surface.FS cache endpoint (RustFS/S3) reachable from nodes.

## 2. Installation Steps

1. **Prepare namespace & secrets**
   - Create Kubernetes namespace (default `stellaops-runtime`).
   - Provision secrets (`zastava-mtls`, `zastava-op-token`, `surface-secrets`).
2. **Deploy Observer**
   - Apply Helm chart `helm/zastava` with values aligning to Surface.Env settings.
   - Confirm DaemonSet pods schedule on all nodes; check `/healthz` endpoints.
3. **Deploy Webhook**
   - Install ValidatingWebhookConfiguration with CA bundle and service reference.
   - Enable dry-run mode first, monitor logs, then switch `enforce=true` once validations pass.
4. **Configure policies**
   - Populate admission policies in Policy Engine; ensure tokens contain `runtime:read` scopes.
   - Update CLI/Console settings for runtime posture view.
5. **Observability**
   - Scrape metrics (`zastava_observer_*`, `zastava_webhook_*`).
   - Stream logs to central collector.

## 3. Air-Gapped Deployment Notes

- Use Offline Kit bundle (`offline/zastava/`) to load images and configuration.
- Validate Surface.FS bundles before enabling enforcement.
- Replace webhook CA with offline authority; document rotation schedule.

## 4. Validation

- Run `stella runtime policy test` against sample workloads.
- Trigger deployment denial for unsigned images; verify Notifier emits alerts.
- Check timeline events for observer telemetry.

## 5. References

- `docs/modules/zastava/architecture.md`
- `docs/modules/scanner/architecture.md`
- `docs/airgap/airgap-mode.md`
- `docs/forensics/timeline.md`
48
docs/operations/devops/task-runner-simulation.md
Normal file
@@ -0,0 +1,48 @@
# Task Runner — Simulation & Failure Policy Notes

> **Status:** Draft (2025-11-04) — execution wiring + CLI simulate command landed; docs pending final polish

The Task Runner planning layer now materialises additional runtime metadata to unblock execution and simulation flows:

- **Execution graph builder** – converts `TaskPackPlan` steps (including `map` and `parallel`) into a deterministic graph with preserved enablement flags and per-step metadata (`maxParallel`, `continueOnError`, parameters, approval IDs).
- **Simulation engine** – walks the execution graph and classifies steps as `pending`, `skipped`, `requires-approval`, or `requires-policy`, producing a deterministic preview for CLI/UI consumers while surfacing declared outputs.
- **Failure policy** – pack-level `spec.failure.retries` is normalised into a `TaskPackPlanFailurePolicy` (default: `maxAttempts = 1`, `backoffSeconds = 0`). The new step state machine uses this policy to schedule retries and to determine when a run must abort.
- **Simulation API + Worker** – `POST /v1/task-runner/simulations` returns the deterministic preview; `GET /v1/task-runner/runs/{id}` exposes persisted retry windows now written by the worker as it honours `maxParallel`, `continueOnError`, and retry windows during execution.
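
As a rough illustration of the simulation walk described in the list above, the sketch below classifies steps into the four documented statuses. It is not the `PackRunSimulationEngine` implementation: the `Step` shape, its field names, the precedence order (disabled, then approval, then policy), and the rule that children of a disabled step are reported as skipped are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    # Hypothetical node shape; the real graph type is not shown in this document.
    step_id: str
    enabled: bool = True
    approval_id: Optional[str] = None
    policy_gate: bool = False
    children: List["Step"] = field(default_factory=list)


def classify(step):
    """Map one step to a status; the precedence order is an assumption."""
    if not step.enabled:
        return "skipped"
    if step.approval_id is not None:
        return "requires-approval"
    if step.policy_gate:
        return "requires-policy"
    return "pending"


def simulate(steps, parent_enabled=True):
    """Walk the graph depth-first in declaration order.

    Returns (step_id, status) pairs. The same input always yields the same
    list, which is what makes the preview deterministic.
    """
    preview = []
    for step in steps:
        status = classify(step) if parent_enabled else "skipped"
        preview.append((step.step_id, status))
        preview.extend(simulate(step.children, parent_enabled and step.enabled))
    return preview
```

Because the walk depends only on declaration order and step fields, never on wall-clock time or hash ordering, running it twice on the same plan produces identical previews.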

## Current behaviour

- Map steps expand into child iterations (`stepId[index]::templateId`) with per-item parameters preserved for runtime reference.
- Parallel blocks honour `maxParallel` (defaults to unlimited) and the worker executes children accordingly, short-circuiting when `continueOnError` is false.
- Simulation output mirrors approvals/policy gates, allowing the WebService/CLI to show which actions must occur before execution resumes.
- File-backed state store persists `PackRunState` snapshots (`nextAttemptAt`, attempts, reasons) so orchestration clients and CLI can resume runs deterministically even in air-gapped environments.
- Step state machine transitions:
  - `pending → running → succeeded`
  - `running → failed` (abort) once attempts ≥ `maxAttempts`
  - `running → pending` with scheduled `nextAttemptAt` when retries remain
  - `pending → skipped` for disabled steps (e.g., `when` expressions).
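
The transitions listed above can be modelled in a few lines. This is a sketch under stated assumptions, not the `PackRunStepStateMachine` code: the class and function names are hypothetical, attempts are assumed to be counted when a step starts, and the backoff is assumed to be a constant delay of `backoffSeconds`. The defaults (`maxAttempts = 1`, `backoffSeconds = 0`) are the ones documented earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class FailurePolicy:
    max_attempts: int = 1      # documented default
    backoff_seconds: int = 0   # documented default


@dataclass
class StepState:
    status: str = "pending"
    attempts: int = 0
    next_attempt_at: Optional[datetime] = None


def _require(state, expected):
    if state.status != expected:
        raise ValueError(f"expected {expected}, got {state.status}")


def start(state):
    """pending -> running; the attempt is counted here (an assumption)."""
    _require(state, "pending")
    state.status = "running"
    state.attempts += 1
    state.next_attempt_at = None


def succeed(state):
    """running -> succeeded."""
    _require(state, "running")
    state.status = "succeeded"


def fail(state, policy, now):
    """running -> failed once attempts are exhausted, else back to pending."""
    _require(state, "running")
    if state.attempts >= policy.max_attempts:
        state.status = "failed"
    else:
        state.status = "pending"
        # Constant backoff; `now` is passed in so the result is reproducible.
        state.next_attempt_at = now + timedelta(seconds=policy.backoff_seconds)


def skip(state):
    """pending -> skipped, for disabled steps."""
    _require(state, "pending")
    state.status = "skipped"
```

Passing `now` explicitly (for example `datetime(2025, 11, 4, tzinfo=timezone.utc)`) rather than reading the clock inside `fail` keeps the scheduled `next_attempt_at` deterministic, in line with the module's guardrails.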

## CLI usage

Run the simulation without mutating state:

```bash
stella task-runner simulate \
  --manifest ./packs/sample-pack.yaml \
  --inputs ./inputs.json \
  --format table
```

Use `--format json` (or `--output path.json`) to emit the raw payload produced by `POST /api/task-runner/simulations`.

## Follow-up gaps

- Fold the CLI command into the official reference/quickstart guides and capture exit-code conventions.

References:

- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Core/Execution/PackRunExecutionGraphBuilder.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Core/Execution/Simulation/PackRunSimulationEngine.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Core/Execution/PackRunStepStateMachine.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Infrastructure/Execution/FilePackRunStateStore.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.Worker/Services/PackRunWorkerService.cs`
- `src/TaskRunner/StellaOps.TaskRunner/StellaOps.TaskRunner.WebService/Program.cs`