feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
docs/modules/devops/AGENTS.md
							| @@ -0,0 +1,22 @@ | ||||
| # DevOps agent guide | ||||
|  | ||||
| ## Mission | ||||
| The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments. | ||||
|  | ||||
| ## Key docs | ||||
| - [Module README](./README.md) | ||||
| - [Architecture](./architecture.md) | ||||
| - [Implementation plan](./implementation_plan.md) | ||||
| - [Task board](./TASKS.md) | ||||
|  | ||||
| ## How to get started | ||||
| 1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module. | ||||
| 2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED). | ||||
| 3. Read the architecture and README for domain context before editing code or docs. | ||||
| 4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan. | ||||
|  | ||||
| ## Guardrails | ||||
| - Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md). | ||||
| - Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts. | ||||
| - Keep Offline Kit parity in mind—document air-gapped workflows for any new feature. | ||||
| - Update runbooks/observability assets when operational characteristics change. | ||||
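The determinism guardrail above (sorted outputs, UTC ISO-8601 timestamps) can be illustrated with a minimal sketch. The function names and the choice of compact JSON are assumptions made for illustration; they are not taken from the StellaOps codebase.

```python
import json
from datetime import datetime, timedelta, timezone


def normalise_timestamp(value: datetime) -> str:
    """Render a datetime as UTC ISO-8601 with a trailing 'Z'."""
    if value.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def deterministic_json(records: list[dict], sort_key: str) -> str:
    """Serialise records in a stable order with stable key ordering."""
    ordered = sorted(records, key=lambda record: record[sort_key])
    # sort_keys + fixed separators avoid machine-specific formatting drift.
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))
```

Two runs over the same records, in any input order, should then produce byte-identical output.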
							
								
								
									
docs/modules/devops/README.md
							| @@ -0,0 +1,41 @@ | ||||
| # StellaOps DevOps | ||||
|  | ||||
| The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments. | ||||
|  | ||||
| ## Responsibilities | ||||
| - Maintain CI pipelines, signing workflows, and release packaging steps. | ||||
| - Operate shared runbooks for launch readiness, upgrades, and NuGet previews. | ||||
| - Provide offline kit assembly instructions and tooling integration. | ||||
| - Wrap observability/telemetry bootstrap flows for platform teams. | ||||
|  | ||||
| ## Key components | ||||
| - Runbooks under ./runbooks/ (launch, deployment, nuget). | ||||
| - Migration guidance under ./migrations/. | ||||
| - Architecture overview bridging CI/CD & infrastructure concerns. | ||||
|  | ||||
| ## Integrations & dependencies | ||||
| - Ops pipelines (Gitea, GitHub Actions) and artifact registries. | ||||
| - Authority/Signer for supply chain signing. | ||||
| - Telemetry stack bootstrap scripts. | ||||
|  | ||||
| ## Operational notes | ||||
| - Offline bundle packaging guidance in docs/modules/export-center/operations/runbook.md. | ||||
| - Dashboards for launch cutover rehearsals. | ||||
| - Coordination with Security for enforced guardrails. | ||||
|  | ||||
| ## Related resources | ||||
| - ./runbooks/launch-readiness.md | ||||
| - ./runbooks/launch-cutover.md | ||||
| - ./runbooks/deployment-upgrade.md | ||||
| - ./runbooks/nuget-preview-bootstrap.md | ||||
| - ./migrations/semver-style.md | ||||
|  | ||||
| ## Backlog references | ||||
| - DEVOPS-LAUNCH-18-001 / 18-900 runbooks in ../../TASKS.md. | ||||
| - Telemetry bootstrap automation tracked in `ops/devops/TASKS.md`. | ||||
|  | ||||
| ## Epic alignment | ||||
| - **Epic 1 – AOC enforcement:** bake AOC verifier steps, CI guards, and schema validation into pipelines. | ||||
| - **Epic 9 – Orchestrator Dashboard:** support operational dashboards, job recovery runbooks, and rate-limit governance. | ||||
| - **Epic 10 – Export Center:** manage signing workflows, Offline Kit packaging, and release promotion for exports. | ||||
| - **Epic 15 – Observability & Forensics:** coordinate telemetry deployment, evidence retention, and forensic automation. | ||||
							
								
								
									
docs/modules/devops/TASKS.md
							| @@ -0,0 +1,9 @@ | ||||
| # Task board — DevOps | ||||
|  | ||||
| > Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable. | ||||
|  | ||||
| | ID | Status | Owner(s) | Description | Notes | | ||||
| |----|--------|----------|-------------|-------| | ||||
| | DEVOPS-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md | | ||||
| | DEVOPS-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md | | ||||
| | DEVOPS-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow | | ||||
							
								
								
									
docs/modules/devops/architecture.md
							| @@ -0,0 +1,488 @@ | ||||
| # component_architecture_devops.md — **Stella Ops Release & Operations** (2025Q4) | ||||
|  | ||||
| > Draws from the AOC guardrails, Orchestrator, Export Center, and Observability module plans to describe how Stella Ops is built, signed, distributed, and operated. | ||||
|  | ||||
| > **Scope.** Implementation‑ready blueprint for **how Stella Ops is built, versioned, signed, distributed, upgraded, licensed (PoE)**, and operated in customer environments (online and air‑gapped). Covers reproducible builds, supply‑chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (RustFS default + Mongo, S3 fallback), monitoring SLOs, and customer activation. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 0) Product vision (operations lens) | ||||
|  | ||||
| Stella Ops must be **trustable at a glance** and **boringly operable**: | ||||
|  | ||||
| * Every release ships with **first‑party SBOMs, provenance, and signatures**; services verify **each other’s** integrity at runtime. | ||||
| * Customers can deploy by **digest** and stay aligned with **LTS/stable/edge** channels. | ||||
| * Paid customers receive **attestation authority** (Signer accepts their PoE) while the core platform remains **free to run**. | ||||
| * Air‑gapped customers receive **offline kits** with verifiable digests and deterministic import. | ||||
| * Artifacts expire predictably; operators know what’s kept, for how long, and why. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1) Release trains & versioning | ||||
|  | ||||
| ### 1.1 Channels | ||||
|  | ||||
| * **LTS** (12‑month support window): quarterly cadence (Q1/Q2/Q3/Q4). | ||||
| * **Stable** (default): monthly rollup (bug fixes + compatible features). | ||||
| * **Edge**: weekly; for early adopters, no guarantees. | ||||
|  | ||||
| ### 1.2 Version strings | ||||
|  | ||||
| Semantic core + calendar tag: | ||||
|  | ||||
| ``` | ||||
| <MAJOR>.<MINOR>.<PATCH>  (<YYYY>.<MM>)   e.g., 2.4.1 (2027.06) | ||||
| ``` | ||||
|  | ||||
| * **MAJOR**: breaking API/DB changes (rare). | ||||
| * **MINOR**: new features, compatible schema migrations (expand/contract pattern). | ||||
| * **PATCH**: bug fixes, perf and security updates. | ||||
| * **Calendar tag** exposes **release year** used by Signer for **PoE window checks**. | ||||
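As a rough illustration of the version string format above, the following sketch parses `<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)` into its parts. The `ReleaseVersion` type and `parse_release_version` function are hypothetical names, and the month range check is an added assumption.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    year: int   # calendar tag year
    month: int  # calendar tag month


# Semantic core, whitespace, then the calendar tag in parentheses.
_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)\s+\((\d{4})\.(\d{2})\)$")


def parse_release_version(text: str) -> ReleaseVersion:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a release version string: {text!r}")
    version = ReleaseVersion(*(int(group) for group in match.groups()))
    if not 1 <= version.month <= 12:
        raise ValueError(f"calendar month out of range: {text!r}")
    return version
```

A consumer such as a release-window check would then read `version.year` rather than re-parsing the string.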
|  | ||||
| ### 1.3 Component alignment | ||||
|  | ||||
| A release is a **bundle** of image digests + charts + manifests. All services in a bundle are **wire‑compatible**. Mixed minor versions are allowed within a bounded skew: | ||||
|  | ||||
| * **Web UI ↔ backend**: `±1 minor`. | ||||
| * **Scanner ↔ Policy/Excititor/Concelier**: `±1 minor`. | ||||
| * **Authority/Signer/Attestor triangle**: **must** be same minor (crypto and DPoP/mTLS binding rules). | ||||
|  | ||||
| At startup, services **self‑advertise** their semver & channel; the UI surfaces **mismatch warnings**. | ||||
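The skew rules above could be checked along these lines. This is a sketch under stated assumptions: the component keys, the function names, and the rule that skew is only tolerated within the same major version are illustrative and not confirmed by the source.

```python
def major_minor(version: str) -> tuple[int, int]:
    major, minor, _patch = version.split(".")
    return int(major), int(minor)


def within_minor_skew(a: str, b: str, max_skew: int = 1) -> bool:
    (a_major, a_minor), (b_major, b_minor) = major_minor(a), major_minor(b)
    # Assumption: components on different majors are never compatible.
    return a_major == b_major and abs(a_minor - b_minor) <= max_skew


def bundle_problems(versions: dict[str, str]) -> list[str]:
    """Return human-readable violations of the alignment rules."""
    problems = []
    triangle = ("authority", "signer", "attestor")
    if len({major_minor(versions[name]) for name in triangle}) != 1:
        problems.append("authority/signer/attestor must share one minor")
    for left, right in (("scanner", "policy"),
                        ("scanner", "excititor"),
                        ("scanner", "concelier")):
        if left in versions and right in versions:
            if not within_minor_skew(versions[left], versions[right]):
                problems.append(f"{left} and {right} exceed the 1-minor skew")
    return problems
```

An empty list means the bundle satisfies the rules covered by this sketch; the Web UI to backend rule is omitted because the source does not say which backend service it is measured against.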
|  | ||||
| --- | ||||
|  | ||||
| ## 2) Supply‑chain pipeline (how a release is built) | ||||
|  | ||||
| ### 2.1 Deterministic builds | ||||
|  | ||||
| * **Builders**: isolated **BuildKit** workers with pinned base images (digest only). | ||||
| * **Pinning**: lock files or `go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props` are **frozen** at tag. | ||||
| * **Reproducibility**: timestamps normalized; source date epoch; deterministic zips/tars. | ||||
| * **Multi‑arch**: linux/amd64 + linux/arm64 (Windows images track M2 roadmap). | ||||
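To make the reproducibility bullet concrete, here is a minimal sketch of a deterministic tar archive: entries are sorted, and every entry carries the same timestamp and ownership. The function name is hypothetical, and passing the epoch as a parameter (rather than reading `SOURCE_DATE_EPOCH` from the environment) is a simplification.

```python
import io
import tarfile


def deterministic_tar(files: dict[str, bytes], source_date_epoch: int) -> bytes:
    """Build an uncompressed tar whose bytes depend only on the inputs."""
    buffer = io.BytesIO()
    # USTAR avoids pax extended headers, which can embed extra metadata.
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as archive:
        for name in sorted(files):          # stable entry order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = source_date_epoch  # fixed timestamp for every entry
            info.mode = 0o644
            info.uid = info.gid = 0         # no builder-specific ownership
            info.uname = info.gname = ""
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()
```

Compression is left out on purpose: gzip, for example, records its own timestamp in the header unless told otherwise, so it needs separate handling.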
|  | ||||
| ### 2.2 First‑party SBOMs & provenance | ||||
|  | ||||
| * Each image gets **CycloneDX (JSON+Protobuf) SBOM** and **SLSA‑style provenance** attached as **OCI referrers**. | ||||
| * Scanner’s **Buildx generator** is used to produce SBOMs *during* build; a separate post‑build scan verifies parity (red flag if drift). | ||||
| * **Release manifest** (see §6.1) lists all digests and SBOM/attestation refs. | ||||
|  | ||||
| ### 2.3 Signing & transparency | ||||
|  | ||||
| * Images are **cosign‑signed** (keyless) with a Stella Ops release identity; inclusion in a **transparency log** (Rekor) is required. | ||||
| * SBOM and provenance attestations are **DSSE** and also transparency‑logged. | ||||
| * Release keys (Fulcio roots or public keys) are embedded in **Signer** policy (for **scanner‑release validation** at customer side). | ||||
|  | ||||
| ### 2.4 Gates & tests | ||||
|  | ||||
| * **Static**: linters, codegen checks, protobuf API freeze (backward‑compat tests). | ||||
| * **Unit/integration**: per‑component, plus **end‑to‑end** flows (scan→vex→policy→sign→attest). | ||||
| * **Perf SLOs**: hot paths (SBOM compose, diff, export) measured against budgets. | ||||
| * **Security**: dependency audit vs Concelier export; container hardening tests; minimal caps. | ||||
| * **Analyzer smoke**: restart-time language plug-ins (currently Python) verified via `dotnet run --project src/Tools/LanguageAnalyzerSmoke` to ensure manifest integrity plus cold vs warm determinism (< 30 s / < 5 s budgets); the harness logs deviations from repository goldens for follow-up. | ||||
| * **Canary cohort**: internal staging + selected customers; one week on **edge** before **stable** tag. | ||||
|  | ||||
| ### 2.5 Debug-store artefacts | ||||
|  | ||||
| * Every release exports stripped debug information for ELF binaries discovered in service images. Debug files follow the GNU build-id layout (`debug/.build-id/<aa>/<rest>.debug`) and are generated via `objcopy --only-keep-debug`. | ||||
| * `debug/debug-manifest.json` captures build-id → component/image/source mappings with SHA-256 checksums so operators can mirror the directory into debuginfod or offline symbol stores. The manifest (and its `.sha256` companion) ships with every release bundle and Offline Kit. | ||||
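The GNU build-id layout mentioned above splits the hex build-id after its first two characters. A small sketch of that mapping follows; the function name is hypothetical.

```python
def debug_file_path(build_id: str) -> str:
    """Map a hex build-id to debug/.build-id/<aa>/<rest>.debug."""
    build_id = build_id.lower()
    is_hex = all(char in "0123456789abcdef" for char in build_id)
    if len(build_id) < 3 or not is_hex:
        raise ValueError(f"invalid build-id: {build_id!r}")
    # First two hex characters name the directory; the rest names the file.
    return f"debug/.build-id/{build_id[:2]}/{build_id[2:]}.debug"
```

Operators mirroring the directory into a symbol store can use the same mapping to look up a file for a given binary.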
|  | ||||
| --- | ||||
|  | ||||
| ## 3) Distribution & activation | ||||
|  | ||||
| ### 3.1 Registries | ||||
|  | ||||
| * **Primary**: `registry.stella-ops.org` (OCI v2, supports Referrers API). | ||||
| * **Mirrors**: GHCR (read‑only), regional mirrors for latency. | ||||
|   * Operational runbook: see `docs/modules/concelier/operations/mirror.md` for deployment profiles, CDN guidance, and sync automation. | ||||
| * **Pull by digest only** in Kubernetes/Compose manifests. | ||||
|  | ||||
| **Gating policy**: | ||||
|  | ||||
| * **Core images** (Authority, Scanner, Concelier, Excititor, Attestor, UI): public **read**. | ||||
| * **Enterprise add‑ons** (if any) and **pre‑release**: private repos via the **Registry Token Service** (`src/Registry/StellaOps.Registry.TokenService`) which exchanges Authority-issued OpToks for short-lived Docker registry bearer tokens. | ||||
|  | ||||
| > Monetization lever is **signing** (PoE gate), not image pulls, so the core remains simple to consume. | ||||
|  | ||||
| ### 3.2 OAuth2 token service (for private repos) | ||||
|  | ||||
| * Docker Registry’s token flow backed by **Authority**: | ||||
|  | ||||
|   1. Client hits registry (`401` with `WWW-Authenticate: Bearer realm=…`). | ||||
|   2. Client gets an **access token** from the token service (validated by Authority) with `scope=repository:…:pull`. | ||||
|   3. Registry allows pull for the requested repo. | ||||
| * Tokens are **short‑lived** (60–300 s) and **DPoP‑bound**. | ||||
|  | ||||
| The token service enforces plan gating via `registry-token.yaml` (see `docs/modules/registry/operations/token-service.md`) and exposes Prometheus metrics (`registry_token_issued_total`, `registry_token_rejected_total`). Revoked licence identifiers halt issuance even when scope requirements are met. | ||||
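Step 1 of the flow above returns a `WWW-Authenticate: Bearer` challenge that the client turns into a token request. The sketch below parses such a challenge and builds the request URL. The `service` parameter comes from the general Docker registry token protocol rather than from this document, and the helper names are hypothetical.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict[str, str]:
    """Parse 'Bearer realm="...",service="...",scope="..."' into a dict."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_request_url(challenge: dict[str, str]) -> str:
    """Build the URL the client calls to obtain a short-lived token."""
    query = {key: challenge[key] for key in ("service", "scope")
             if key in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"
```

The sketch stops before the request itself; DPoP binding and Authority validation happen on the server side and are not modelled here.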
|  | ||||
| ### 3.3 Offline kits (air‑gapped) | ||||
|  | ||||
| * Tarball per release channel: | ||||
|  | ||||
|   ``` | ||||
|   stellaops-kit-<ver>-<channel>.tar.zst | ||||
|     /images/   OCI layout with all first-party images (multi-arch) | ||||
|     /sboms/    CycloneDX JSON+PB for each image | ||||
|     /attest/   DSSE bundles + Rekor proofs | ||||
|     /charts/   Helm charts + values templates | ||||
|     /compose/  docker-compose.yml + .env template | ||||
|     /plugins/  Concelier/Excititor connectors (restart-time) | ||||
|     /policy/   example policies | ||||
|     /manifest/ release.yaml  (see §6.1) | ||||
|   ``` | ||||
| * Import via the CLI (`stellaops offline kit import`), which checks digests and signatures before loading anything. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 4) Licensing (PoE) & monetization | ||||
|  | ||||
| **Principle**: **Only paid Stella Ops issues valid signed attestations.** Running the stack is free; signing requires PoE. | ||||
|  | ||||
| ### 4.1 PoE issuance | ||||
|  | ||||
| * Customers purchase a plan and obtain a **PoE artifact** from `www.stella-ops.org`: | ||||
|  | ||||
|   * **PoE‑JWT** (DPoP/mTLS‑bound) **or** **PoE mTLS client certificate**. | ||||
|   * Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant/customer` IDs. | ||||
|  | ||||
| ### 4.2 Online enforcement | ||||
|  | ||||
| * **Signer** calls **Licensing /license/introspect** on every signing request (see signer doc). | ||||
| * If **revoked/expired/out‑of‑window** → deny with machine‑readable reason. | ||||
| * All **valid** bundles are DSSE‑signed and **Attestor** logs them; Rekor UUID returned. | ||||
| * UI badges: “**Verified by Stella Ops**” with link to the public log. | ||||
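The deny conditions above (revoked, expired, out of window) might be evaluated roughly as follows. This is a sketch only: the source does not define the claim semantics, so the comparison against `valid_release_year`, the reason strings, and the function name are all assumptions.

```python
def evaluate_poe(claims: dict, release_year: int, now_epoch: int,
                 revoked_ids: set[str]) -> tuple[bool, str]:
    """Return (allowed, machine-readable reason) for a signing request."""
    if claims["license_id"] in revoked_ids:
        return False, "revoked"
    if now_epoch >= claims["exp"]:
        return False, "expired"
    # Assumption: a PoE covers releases up to and including
    # valid_release_year; newer releases are outside the window.
    if release_year > claims["valid_release_year"]:
        return False, "out_of_window"
    return True, "ok"
```

The `release_year` argument would come from the calendar tag of the release being used, as described in §1.2.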
|  | ||||
| ### 4.3 Air‑gapped / offline | ||||
|  | ||||
| * Customers obtain a **time‑boxed PoE lease** (signed JSON, 7–30 days). | ||||
| * Signer accepts the lease and emits **provisional** attestations (clearly labeled). | ||||
| * When connectivity returns, a background job **endorses** the provisional entries with the cloud service, updating their status to **verified**. | ||||
| * Operators can export a **verification bundle** for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot). | ||||
|  | ||||
| ### 4.4 Stolen/abused PoE | ||||
|  | ||||
| * Customers report theft; **Licensing** flags `license_id` as **revoked**. | ||||
| * Subsequent Signer requests **deny**; previous attestations remain but can be marked **contested** (UI shows badge, optional re‑sign path upon new PoE). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 5) Deployment path (customer side) | ||||
|  | ||||
| ### 5.1 First install | ||||
|  | ||||
| * **Helm** (Kubernetes) or **Compose** (VMs). Example (K8s): | ||||
|  | ||||
| ```bash | ||||
| helm repo add stellaops https://charts.stella-ops.org | ||||
| helm install stella stellaops/platform \ | ||||
|   --version 2.4.0 \ | ||||
|   --set global.channel=stable \ | ||||
|   --set authority.issuer=https://authority.stella.local \ | ||||
|   --set scanner.minio.endpoint=http://minio.stella.local:9000 \ | ||||
|   --set scanner.mongo.uri=mongodb://mongo/scanner \ | ||||
|   --set concelier.mongo.uri=mongodb://mongo/concelier \ | ||||
|   --set excititor.mongo.uri=mongodb://mongo/excititor | ||||
| ``` | ||||
|  | ||||
| * Post‑install job registers **Authority clients** (Scanner, Signer, Attestor, UI) and prints **bootstrap** URLs and client credentials (sealed secrets). | ||||
| * UI banner shows **release bundle** and verification state (cosign OK? Rekor OK?). | ||||
|  | ||||
| ### 5.2 Updates | ||||
|  | ||||
| * **Blue/green**: pull new bundle by **digest**; deploy side‑by‑side; cut traffic. | ||||
|  | ||||
| * **Rolling**: upgrade components in a safe order: | ||||
|  | ||||
|   1. Authority (stateless, dual‑key rotation ready) | ||||
|   2. Signer/Attestor (same minor) | ||||
|   3. Scanner WebService & Workers | ||||
|   4. Concelier, then Excititor (schema migrations are expand/contract) | ||||
|   5. UI last | ||||
|  | ||||
| * **DB migrations** are **expand/contract**: | ||||
|  | ||||
|   * Phase A (release N): **add** new fields/indexes, write old+new. | ||||
|   * Phase B (N+1): **read** new fields; **drop** old. | ||||
|   * Rollback is a matter of redeploying previous images and keeping both schemas valid. | ||||
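The expand/contract phases can be illustrated with plain dictionaries standing in for documents. The field names `severity` and `severity_v2` are hypothetical; the point is only that a document written in Phase A stays readable by both the previous and the next release.

```python
def write_release_n(doc: dict, level: str) -> None:
    """Phase A writer (release N): populate both old and new fields."""
    doc["severity"] = level                 # old field, still read by N-1
    doc["severity_v2"] = {"level": level}   # new field, read from N+1 on


def read_release_n_minus_1(doc: dict) -> str:
    """The previous release only knows the old field."""
    return doc["severity"]


def read_release_n_plus_1(doc: dict) -> str:
    """Phase B reader (release N+1): rely on the new field only."""
    return doc["severity_v2"]["level"]
```

Because release N writes both shapes, redeploying the previous images after a failed upgrade finds every document in a form it understands.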
|  | ||||
| ### 5.3 Rollback | ||||
|  | ||||
| * Images are referenced by **digest**; keep the release manifests for the previous `K` versions. | ||||
| * `helm rollback` or compose `docker compose -f release-K.yml up -d`. | ||||
| * Mongo migrations are additive; **no destructive changes** within a single minor. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6) Release payloads & manifests | ||||
|  | ||||
| ### 6.1 Release manifest (`release.yaml`) | ||||
|  | ||||
| ```yaml | ||||
| release: | ||||
|   version: "2.4.1" | ||||
|   channel: "stable" | ||||
|   date: "2027-06-20T12:00:00Z" | ||||
|   calendar: "2027.06" | ||||
|   components: | ||||
|     - name: scanner-webservice | ||||
|       image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb | ||||
|       sbom: oci://.../referrers/cdx-json@sha256:11..22 | ||||
|       provenance: oci://.../attest/provenance@sha256:33..44 | ||||
|       signature: { rekorUUID: "…" } | ||||
|     - name: signer | ||||
|       image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd | ||||
|       signature: { rekorUUID: "…" } | ||||
|   charts: | ||||
|     - name: platform | ||||
|       version: "2.4.1" | ||||
|       digest: "sha256:ee..ff" | ||||
|   compose: | ||||
|     file: "docker-compose.yml" | ||||
|     digest: "sha256:77..88" | ||||
|   checksums: | ||||
|     sha256: "… digest of this release.yaml …" | ||||
| ``` | ||||
|  | ||||
| The manifest is **cosign‑signed**; UI/CLI can verify a bundle without talking to registries. | ||||
|  | ||||
| > Deployment guardrails – The repository keeps channel-aligned Compose bundles | ||||
| > in `deploy/compose/` and Helm overlays in `deploy/helm/stellaops/`. Both sets | ||||
| > pull their digests from `deploy/releases/` and are validated by | ||||
| > `deploy/tools/validate-profiles.sh` to guarantee lint/dry-run cleanliness. | ||||
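One check a validator could run against a parsed release manifest is that every component image is pinned by digest. The sketch below assumes the manifest has already been loaded into a dictionary with the shape shown in §6.1; it is not a description of what `validate-profiles.sh` does.

```python
import re

# An image reference pinned by a full sha256 digest, with no tag-only form.
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_components(manifest: dict) -> list[str]:
    """Return names of components whose image is not pinned by digest."""
    return [
        component["name"]
        for component in manifest["release"]["components"]
        if not _DIGEST_REF.match(component["image"])
    ]
```

An empty result means every listed image can be pulled by digest, in line with the registry policy in §3.1.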
|  | ||||
| ### 6.2 Image labels (release metadata) | ||||
|  | ||||
| Each image sets OCI labels: | ||||
|  | ||||
| ``` | ||||
| org.opencontainers.image.version = "2.4.1" | ||||
| org.opencontainers.image.revision = "<git sha>" | ||||
| org.opencontainers.image.created = "2027-06-20T12:00:00Z" | ||||
| org.stellaops.release.calendar = "2027.06" | ||||
| org.stellaops.release.channel  = "stable" | ||||
| org.stellaops.build.slsaProvenance = "oci://…" | ||||
| ``` | ||||
|  | ||||
| Signer validates **scanner** image’s cosign identity + calendar tag for **release window** checks. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 7) Artifact lifecycle & storage (RustFS/Mongo) | ||||
|  | ||||
| ### 7.1 Buckets & prefixes (RustFS) | ||||
|  | ||||
| ``` | ||||
| rustfs://stellaops/ | ||||
|   scanner/ | ||||
|     layers/<sha256>/sbom.cdx.json.zst | ||||
|     images/<imgDigest>/inventory.cdx.pb | ||||
|     images/<imgDigest>/usage.cdx.pb | ||||
|     diffs/<old>_<new>/diff.json.zst | ||||
|     attest/<artifactSha256>.dsse.json | ||||
|   concelier/ | ||||
|     json/<exportId>/... | ||||
|     trivy/<exportId>/... | ||||
|   excititor/ | ||||
|     exports/<exportId>/... | ||||
|   attestor/ | ||||
|     dsse/<bundleSha256>.json | ||||
|     proof/<rekorUuid>.json | ||||
| ``` | ||||
|  | ||||
| ### 7.2 ILM classes | ||||
|  | ||||
| * **`short`**: working artifacts (diffs, queues) — TTL 7–14 days. | ||||
| * **`default`**: SBOMs & indexes — TTL 90–180 days (configurable). | ||||
| * **`compliance`**: signed reports & attested exports — retention enforced via RustFS hold or S3 Object Lock (governance/compliance) 1–7 years. | ||||
|  | ||||
| ### 7.3 Artifact Lifecycle Controller (ALC) | ||||
|  | ||||
| * A background worker (part of Scanner.WebService) enforces **TTL** and **reference counting**: | ||||
|  | ||||
|   * Artifacts referenced by **reports** or **tickets** are pinned. | ||||
|   * ILM actions logged; UI shows per‑class usage & upcoming purges. | ||||
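The TTL and reference-counting behaviour could be sketched as a single purge decision. The TTL values below pick the upper bound of each range in §7.2, and the function and constant names are hypothetical; the real controller's rules may differ.

```python
from datetime import datetime, timedelta, timezone

# Assumption: upper bounds of the ranges in §7.2.
TTL_BY_CLASS = {
    "short": timedelta(days=14),
    "default": timedelta(days=180),
}


def is_purgeable(ilm_class: str, created: datetime, now: datetime,
                 reference_count: int) -> bool:
    """Decide whether an artifact may be purged right now."""
    if reference_count > 0:
        return False  # pinned by a report or ticket
    if ilm_class == "compliance":
        return False  # retention is enforced by the object store hold
    return now - created > TTL_BY_CLASS[ilm_class]
```

Keeping the decision in one pure function makes it straightforward to log each ILM action together with the inputs that produced it.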
|  | ||||
| > **Migration note.** Follow `docs/modules/scanner/operations/rustfs-migration.md` when transitioning existing | ||||
| > MinIO buckets to RustFS. The provided migrator is idempotent and safe to rerun per prefix. | ||||
|  | ||||
| ### 7.4 Mongo retention | ||||
|  | ||||
| * **Scanner**: `runtime.events` use TTL (e.g., 30–90 days); **catalog** permanent. | ||||
| * **Concelier/Excititor**: raw docs keep **last N windows**; canonical stores permanent. | ||||
| * **Attestor**: `entries` permanent; `dedupe` TTL 24–48h. | ||||
|  | ||||
| ### 7.5 Mongo server baseline | ||||
|  | ||||
| * **Minimum supported server:** MongoDB **4.2+**. Driver 3.5.0 removes compatibility shims for 4.0; upstream has already announced 4.0 support will be dropped in upcoming C# driver releases. | ||||
| * **Deploy images:** Compose/Helm defaults stay on `mongo:7.x`. For air-gapped installs, refresh Offline Kit bundles so the packaged `mongod` matches ≥4.2. | ||||
| * **Upgrade guard:** During rollout, verify replica sets reach FCV `4.2` or above before swapping binaries; automation should hard-stop if FCV is <4.2. | ||||
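The upgrade guard amounts to a version comparison that stops the rollout when any replica set member reports an FCV below 4.2. A minimal sketch follows; how the FCV values are collected from the cluster is out of scope here, and the function names are hypothetical.

```python
def fcv_at_least(fcv: str, minimum: str = "4.2") -> bool:
    """Compare dotted FCV strings numerically, not lexically."""
    def parse(value: str) -> tuple[int, ...]:
        return tuple(int(part) for part in value.split("."))
    return parse(fcv) >= parse(minimum)


def assert_upgrade_allowed(member_fcvs: dict[str, str]) -> None:
    """Hard-stop the rollout if any member is below the minimum FCV."""
    blocked = sorted(host for host, fcv in member_fcvs.items()
                     if not fcv_at_least(fcv))
    if blocked:
        raise RuntimeError(f"FCV below 4.2 on: {', '.join(blocked)}")
```

Comparing as integer tuples matters because a plain string comparison would order "4.10" before "4.2".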
|  | ||||
| --- | ||||
|  | ||||
| ## 8) Observability & SLOs (operations) | ||||
|  | ||||
| * **Uptime SLO**: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Concelier 99.0%. | ||||
| * **Error budgets**: tracked per month; dashboards show burn rates. | ||||
| * **Golden signals**: | ||||
|  | ||||
|   * **Latency**: token issuance, sign→attest round‑trip, scan enqueue→emit, export build. | ||||
|   * **Saturation**: queue depth, Mongo write IOPS, RustFS throughput / queue depth (or S3 metrics when in fallback mode). | ||||
|   * **Traffic**: scans/min, attestations/min, webhook admits/min. | ||||
|   * **Errors**: 5xx rates, cosign verification failures, Rekor timeouts. | ||||
|  | ||||
| Prometheus + OTLP; Grafana dashboards ship in the charts. | ||||
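For the uptime SLOs above, the monthly error budget follows directly from the target percentage. The sketch below assumes a 30-day month; the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for a given uptime SLO."""
    total_minutes = days * 24 * 60
    return round((100.0 - slo_percent) / 100.0 * total_minutes, 3)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if blown)."""
    budget = error_budget_minutes(slo_percent, days)
    return (budget - downtime_minutes) / budget
```

With these assumptions, 99.9% allows about 43.2 minutes per month, 99.5% about 216 minutes, and 99.0% about 432 minutes.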
|  | ||||
| --- | ||||
|  | ||||
| ## 9) Security & compliance operations | ||||
|  | ||||
| * **Key rotation**: | ||||
|  | ||||
|   * Authority JWKS: 60‑day cadence, dual‑key overlap. | ||||
|   * Release signing identities: rotate per minor or quarterly. | ||||
|   * Sigstore roots mirrored and pinned; alarms on drift. | ||||
|  | ||||
| * **FIPS mode** (Gov build): | ||||
|  | ||||
|   * Enforce `ES256` + KMS/HSM; disable Ed25519; MLS ciphers only. | ||||
|   * Local **Rekor v2** and **Fulcio** alternatives; **air‑gapped** CA. | ||||
|  | ||||
| * **Vulnerability response**: | ||||
|  | ||||
|   * Concelier red-flag advisories trigger accelerated **stable** patch rollout; UI/CLI “security patch available” notice. | ||||
|   * 2025-10: Pinned `MongoDB.Driver` **3.5.0** and `SharpCompress` **0.41.0** across services (DEVOPS-SEC-10-301) to eliminate NU1902/NU1903 warnings surfaced during scanner cache/worker test runs; repacked the local `Mongo2Go` feed so test fixtures inherit the patched dependencies; future bumps follow the same central override pattern. | ||||
|  | ||||
| * **Backups/DR**: | ||||
|  | ||||
|   * Mongo nightly snapshots; MinIO versioning + replication (if configured). | ||||
|   * Restore runbooks tested quarterly with synthetic data. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 10) Customer update flow (how versions are fetched & activated) | ||||
|  | ||||
| ### 10.1 Online clusters | ||||
|  | ||||
| * **UI** surfaces update banner with **release manifest** diff and risk notes. | ||||
| * Operator approves → **Controller** pulls new images by digest; health‑checks; moves traffic; deprecates old revision. | ||||
| * Post‑switch, **schema Phase B** migrations (if any) run automatically. | ||||
|  | ||||
| ### 10.2 Air‑gapped clusters | ||||
|  | ||||
| * Operator downloads **offline kit** from a mirror → `stellaops offline kit import`. | ||||
| * Controller validates bundle checksums and **cosign signatures**; applies charts/compose by digest. | ||||
| * After install, **verify** page shows green checks: image sigs, SBOMs attached, provenance logged. | ||||
|  | ||||
| ### 10.3 CLI self‑update (optional) | ||||
|  | ||||
| * `stellaops self-update` pulls a **signed release manifest** and verifies the **CLI binary** with cosign before swapping (admin can disable). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 11) Compatibility & deprecation policy | ||||
|  | ||||
| * **APIs** are stable within a **major**; breaking changes imply **MAJOR++** and a deprecation period of one minor. | ||||
| * **Storage**: expand/contract; “drop old fields” only after one minor grace. | ||||
| * **Config**: feature flags (default off) for risky features (e.g., eBPF). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 12) Runbooks (selected) | ||||
|  | ||||
| ### 12.1 Lost PoE | ||||
|  | ||||
| 1. Suspend **automatic attestation** jobs. | ||||
| 2. Use CLI `stellaops signer status` to confirm `entitlement_denied`. | ||||
| 3. Obtain new PoE from portal; verify on Signer `/poe/verify`. | ||||
| 4. Re‑enable; optionally **re‑sign** last N reports (UI button → batch). | ||||
|  | ||||
| ### 12.2 Rekor outage (self‑hosted) | ||||
|  | ||||
| * Attestor returns `202 (pending)` with queued proof fetch. | ||||
| * Keep DSSE bundles locally; re‑submit on schedule; UI badge shows **Pending**. | ||||
| * If outage > SLA, you can switch to a **mirror** log in config; Attestor writes to both when restored. | ||||
|  | ||||
| ### 12.3 Emergency downgrade | ||||
|  | ||||
| * Identify prior release manifest (UI → Admin → Releases). | ||||
| * `helm rollback stella <revision>` (or compose apply previous file). | ||||
| * Services tolerate skew per §1.3; ensure **Signer/Authority/Attestor** are rolled together. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 13) Example: cluster bootstrap (Compose) | ||||
|  | ||||
| ```yaml | ||||
| version: "3.9" | ||||
| services: | ||||
|   authority: | ||||
|     image: registry.stella-ops.org/stellaops/authority@sha256:... | ||||
|     env_file: ./env/authority.env | ||||
|     ports: ["8440:8440"] | ||||
|   signer: | ||||
|     image: registry.stella-ops.org/stellaops/signer@sha256:... | ||||
|     depends_on: [authority] | ||||
|     environment: | ||||
|       - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect | ||||
|   attestor: | ||||
|     image: registry.stella-ops.org/stellaops/attestor@sha256:... | ||||
|     depends_on: [signer] | ||||
|   scanner-web: | ||||
|     image: registry.stella-ops.org/stellaops/scanner-web@sha256:... | ||||
|     environment: | ||||
|       - SCANNER__S3__ENDPOINT=http://minio:9000 | ||||
|   scanner-worker: | ||||
|     image: registry.stella-ops.org/stellaops/scanner-worker@sha256:... | ||||
|     deploy: { replicas: 4 } | ||||
|   concelier: | ||||
|     image: registry.stella-ops.org/stellaops/concelier@sha256:... | ||||
|   excititor: | ||||
|     image: registry.stella-ops.org/stellaops/excititor@sha256:... | ||||
|   web-ui: | ||||
|     image: registry.stella-ops.org/stellaops/web-ui@sha256:... | ||||
|   mongo: | ||||
|     image: mongo:7 | ||||
|   minio: | ||||
|     image: minio/minio:RELEASE.2025-07-10T00-00-00Z | ||||
| ``` | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 14) Governance & keys (who owns the trust root) | ||||
|  | ||||
| * **Release key policy**: only the Release Engineering group can push signed releases; 4‑eyes approval; TUF‑style manifest possible in future. | ||||
| * **Signer acceptance policy**: embedded release identities are updated **only** via minor upgrade; emergency CRL supported. | ||||
| * **Customer keys**: none needed for core use; enterprise add‑ons may require per‑customer registries and keys. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 15) Roadmap (Ops) | ||||
|  | ||||
| * **Windows containers GA** (Scanner + Zastava). | ||||
| * **Key Transparency** for Signer certs. | ||||
| * **Delta‑kit** (offline) for incremental updates. | ||||
| * **Operator CRDs** (K8s) to manage policy and ILM declaratively. | ||||
| * **SBOM protobuf** as default transport at rest (smaller, faster). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ### Appendix A — Minimal SLO monitors | ||||
|  | ||||
| * `authority.tokens_issued_total` slope ≈ normal. | ||||
| * `signer.requests_total{result="success"}/minute` > 0 (when scans occur). | ||||
| * `attestor.submit_latency_seconds{quantile=0.95}` < 0.3. | ||||
| * `scanner.scan_latency_seconds{quantile=0.95}` < target per image size. | ||||
| * `concelier.export.duration_seconds` stable; `excititor.consensus.conflicts_total` not exploding after policy changes. | ||||
| * RustFS request error rate near zero (or `s3_requests_errors_total` when operating against S3); Mongo `opcounters` hit expected baseline. | ||||
|  | ||||
| ### Appendix B — Upgrade safety checklist | ||||
|  | ||||
| * Verify **release manifest** signature. | ||||
| * Ensure **Signer/Authority/Attestor** are same minor. | ||||
| * Verify **DB backups** < 24h old. | ||||
| * Confirm **ILM** won’t purge compliance artifacts during upgrade window. | ||||
| * Roll **one component** at a time; watch SLOs; abort on regression. | ||||
|  | ||||
| --- | ||||
|  | ||||
| **End — component_architecture_devops.md** | ||||
							
								
								
									
docs/modules/devops/implementation_plan.md
							| @@ -0,0 +1,22 @@ | ||||
| # Implementation plan — DevOps | ||||
|  | ||||
| ## Current objectives | ||||
| - Maintain deterministic behaviour and offline parity across releases. | ||||
| - Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes. | ||||
|  | ||||
| ## Workstreams | ||||
| - Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap. | ||||
| - Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs. | ||||
| - Validation: extend tests/fixtures to preserve determinism and provenance requirements. | ||||
|  | ||||
| ## Epic milestones | ||||
| - **Epic 1 – AOC enforcement:** ensure CI/CD guardrails, schema validation, and verifier pipelines are enforced. | ||||
| - **Epic 9 – Orchestrator Dashboard:** deliver dashboards, recovery runbooks, and rate-limit governance. | ||||
| - **Epic 10 – Export Center:** manage signing/promotions and Offline Kit bundle publishing. | ||||
| - **Epic 15 – Observability & Forensics:** coordinate telemetry deployments, evidence retention, and forensic automation. | ||||
| - Track module runbooks (DEVOPS-LAUNCH-18-001/900) and telemetry automation via ../../TASKS.md and ops/devops/TASKS.md. | ||||
|  | ||||
| ## Coordination | ||||
| - Review ./AGENTS.md before picking up new work. | ||||
| - Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md. | ||||
| - Update this plan whenever scope, dependencies, or guardrails change. | ||||
							
								
								
									
										50
									
								
								docs/modules/devops/migrations/semver-style.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,50 @@ | ||||
| # SemVer Style Backfill Runbook | ||||
|  | ||||
| _Last updated: 2025-10-11_ | ||||
|  | ||||
| ## Overview | ||||
|  | ||||
| The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures | ||||
| provenance `decisionReason` values are preserved during future reads. The migration is idempotent and only | ||||
| runs when the feature flag `concelier.storage.enableSemVerStyle` is enabled. | ||||
|  | ||||
| ## Preconditions | ||||
|  | ||||
| 1. **Review configuration** – set `concelier.storage.enableSemVerStyle` to `true` on all Concelier services. | ||||
| 2. **Confirm batch size** – adjust `concelier.storage.backfillBatchSize` if you need smaller batches for older | ||||
|    deployments (default: `250`). | ||||
| 3. **Back up** – capture a fresh snapshot of the `advisory` collection or a full MongoDB backup. | ||||
| 4. **Staging dry-run** – enable the flag in a staging environment and observe the migration output before | ||||
|    rolling to production. | ||||
|  | ||||
| ## Execution | ||||
|  | ||||
| No manual command is required. After deploying the configuration change, restart the Concelier WebService or | ||||
| any component that hosts the Mongo migration runner. During startup you will see log entries similar to: | ||||
|  | ||||
| ``` | ||||
| Applying Mongo migration 20251011-semver-style-backfill: Populate advisory.normalizedVersions for existing documents when SemVer style storage is enabled. | ||||
| Mongo migration 20251011-semver-style-backfill applied | ||||
| ``` | ||||
|  | ||||
| The migration reads advisories in batches (`concelier.storage.backfillBatchSize`) and writes flattened | ||||
| `normalizedVersions` arrays. Existing documents without SemVer ranges remain untouched. | ||||
|  | ||||
| ## Post-checks | ||||
|  | ||||
| 1. Verify the new indexes exist: | ||||
|    ``` | ||||
|    db.advisory.getIndexes() | ||||
|    ``` | ||||
|    You should see `advisory_normalizedVersions_pkg_scheme_type` and `advisory_normalizedVersions_value`. | ||||
| 2. Spot check a few advisories to confirm the top-level `normalizedVersions` array exists and matches | ||||
|    the embedded package data. | ||||
| 3. Run `dotnet test` for `StellaOps.Concelier.Storage.Mongo.Tests` (optional but recommended) in CI to confirm | ||||
|    the storage suite passes with the feature flag enabled. | ||||
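The index post-check above can be scripted instead of eyeballed. The sketch below assumes you can pipe one index name per line into the helper. The `missing_indexes` function and the `MONGO_URI` variable in the commented usage are hypothetical; only the two index names come from this runbook.

```shell
#!/usr/bin/env bash
# Index names expected after the backfill, as listed in the post-checks.
EXPECTED_INDEXES="advisory_normalizedVersions_pkg_scheme_type advisory_normalizedVersions_value"

# Hypothetical helper: reads one index name per line on stdin, prints each
# expected name that is absent, and returns non-zero if any is missing.
missing_indexes() {
  local names idx missing=0
  names="$(cat)"
  for idx in $EXPECTED_INDEXES; do
    if ! printf '%s\n' "$names" | grep -qxF "$idx"; then
      echo "missing index: $idx"
      missing=1
    fi
  done
  return "$missing"
}

# Possible live usage (assumes mongosh and a MONGO_URI pointing at the right database):
#   mongosh "$MONGO_URI" --quiet \
#     --eval 'db.advisory.getIndexes().forEach(i => print(i.name))' | missing_indexes

# Illustrative run against a hard-coded listing.
printf '%s\n' _id_ advisory_normalizedVersions_pkg_scheme_type \
  advisory_normalizedVersions_value | missing_indexes && echo "indexes present"
```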
|  | ||||
| ## Rollback | ||||
|  | ||||
| Set `concelier.storage.enableSemVerStyle` back to `false` and redeploy. The migration will be skipped on | ||||
| subsequent startups. You can leave the populated `normalizedVersions` arrays in place; they are ignored when | ||||
| the feature flag is off. If you must remove them entirely, restore from the backup captured in the | ||||
| Preconditions step. | ||||
							
								
								
									
										151
									
								
								docs/modules/devops/runbooks/deployment-upgrade.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,151 @@ | ||||
| # Stella Ops Deployment Upgrade & Rollback Runbook | ||||
|  | ||||
| _Last updated: 2025-10-26 (Sprint 14 – DEVOPS-OPS-14-003)._ | ||||
|  | ||||
| This runbook describes how to promote a new release across the supported deployment profiles (Helm and Docker Compose), how to roll back safely, and how to keep channels (`edge`, `stable`, `airgap`) aligned. All steps assume you are working from a clean checkout of the release branch/tag. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1. Channel overview | ||||
|  | ||||
| | Channel | Release manifest | Helm values | Compose profile | | ||||
| |---------|------------------|-------------|-----------------| | ||||
| | `edge`  | `deploy/releases/2025.10-edge.yaml` | `deploy/helm/stellaops/values-dev.yaml` | `deploy/compose/docker-compose.dev.yaml` | | ||||
| | `stable` | `deploy/releases/2025.09-stable.yaml` | `deploy/helm/stellaops/values-stage.yaml`, `deploy/helm/stellaops/values-prod.yaml` | `deploy/compose/docker-compose.stage.yaml`, `deploy/compose/docker-compose.prod.yaml` | | ||||
| | `airgap` | `deploy/releases/2025.09-airgap.yaml` | `deploy/helm/stellaops/values-airgap.yaml` | `deploy/compose/docker-compose.airgap.yaml` | | ||||
|  | ||||
| Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 2. Pre-flight checklist | ||||
|  | ||||
| 1. **Refresh release manifest**   | ||||
|    Pull the latest manifest for the channel you are promoting (`deploy/releases/<version>-<channel>.yaml`). | ||||
|  | ||||
| 2. **Align deployment bundles with the manifest**   | ||||
|    Run the alignment checker for every profile that should pick up the release. Pass `--ignore-repo nats` to skip auxiliary services. | ||||
|    ```bash | ||||
|    ./deploy/tools/check-channel-alignment.py \ | ||||
|        --release deploy/releases/2025.10-edge.yaml \ | ||||
|        --target deploy/helm/stellaops/values-dev.yaml \ | ||||
|        --target deploy/compose/docker-compose.dev.yaml \ | ||||
|        --ignore-repo nats | ||||
|    ``` | ||||
|    Repeat for other channels (`stable`, `airgap`), substituting the manifest and target files. | ||||
|  | ||||
| 3. **Lint and template profiles** | ||||
|    ```bash | ||||
|    ./deploy/tools/validate-profiles.sh | ||||
|    ``` | ||||
|  | ||||
| 4. **Smoke the Offline Kit debug store (edge/stable only)**   | ||||
|    When the release pipeline has generated `out/release/debug/.build-id/**`, mirror the assets into the Offline Kit staging tree: | ||||
|    ```bash | ||||
|    ./ops/offline-kit/mirror_debug_store.py \ | ||||
|        --release-dir out/release \ | ||||
|        --offline-kit-dir out/offline-kit | ||||
|    ``` | ||||
|    Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle. | ||||
|  | ||||
| 5. **Review compatibility matrix**   | ||||
|    Confirm MongoDB, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `mongo@sha256:c258…`, `minio@sha256:14ce…`, `rustfs:2025.10.0-edge`. | ||||
|  | ||||
| 6. **Create a rollback bookmark**   | ||||
|    Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes. | ||||
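Step 2 above asks you to repeat the alignment check per channel with different manifest and target files. The following sketch builds that command from the channel table in section 1 and only prints it, so it can be reviewed before anything runs. The `alignment_cmd` function is a hypothetical wrapper; the paths and flags are the ones already shown in this runbook.

```shell
#!/usr/bin/env bash
# Sketch only: prints (does not run) the alignment-check command for a channel.
# File names come from the channel table in section 1.
alignment_cmd() {
  local channel="$1" release targets t cmd
  case "$channel" in
    edge)
      release="deploy/releases/2025.10-edge.yaml"
      targets="deploy/helm/stellaops/values-dev.yaml deploy/compose/docker-compose.dev.yaml"
      ;;
    stable)
      release="deploy/releases/2025.09-stable.yaml"
      targets="deploy/helm/stellaops/values-stage.yaml deploy/helm/stellaops/values-prod.yaml deploy/compose/docker-compose.stage.yaml deploy/compose/docker-compose.prod.yaml"
      ;;
    airgap)
      release="deploy/releases/2025.09-airgap.yaml"
      targets="deploy/helm/stellaops/values-airgap.yaml deploy/compose/docker-compose.airgap.yaml"
      ;;
    *)
      echo "unknown channel: $channel" >&2
      return 2
      ;;
  esac
  cmd="./deploy/tools/check-channel-alignment.py --release $release"
  for t in $targets; do
    cmd="$cmd --target $t"
  done
  echo "$cmd --ignore-repo nats"
}

# Print the command for every channel so it can be reviewed before running.
for ch in edge stable airgap; do
  alignment_cmd "$ch"
done
```

Printing rather than executing keeps the sketch safe to run from any checkout; pipe a line to `sh` once the manifest versions have been confirmed.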
|  | ||||
| --- | ||||
|  | ||||
| ## 3. Helm upgrade procedure (staging → production) | ||||
|  | ||||
| 1. Switch to the deployment branch and ensure secrets/config maps are current. | ||||
| 2. Apply the upgrade in the staging cluster: | ||||
|    ```bash | ||||
|    helm upgrade stellaops deploy/helm/stellaops \ | ||||
|      -f deploy/helm/stellaops/values-stage.yaml \ | ||||
|      --namespace stellaops \ | ||||
|      --atomic \ | ||||
|      --timeout 15m | ||||
|    ``` | ||||
| 3. Run smoke tests (`scripts/smoke-tests.sh` or environment-specific checks). | ||||
| 4. Promote to production by re-running the same command with `-f deploy/helm/stellaops/values-prod.yaml`. | ||||
| 5. Record the new revision number and Git SHA in the change log. | ||||
|  | ||||
| ### Rollback (Helm) | ||||
|  | ||||
| 1. Identify the previous revision: `helm history stellaops -n stellaops`. | ||||
| 2. Execute: | ||||
|    ```bash | ||||
|    helm rollback stellaops <revision> \ | ||||
|      --namespace stellaops \ | ||||
|      --wait \ | ||||
|      --timeout 10m | ||||
|    ``` | ||||
| 3. Verify `kubectl get pods` returns healthy workloads; rerun smoke tests. | ||||
| 4. Update the incident/operations log with root cause and rollback details. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 4. Docker Compose upgrade procedure | ||||
|  | ||||
| 1. Update environment files (`deploy/compose/env/*.env.example`) with any new settings and sync secrets to hosts. | ||||
| 2. Pull the tagged repository state corresponding to the release (e.g. `git checkout 2025.09.2` for stable). | ||||
| 3. Apply the upgrade: | ||||
|    ```bash | ||||
|    docker compose \ | ||||
|      --env-file deploy/compose/env/prod.env \ | ||||
|      -f deploy/compose/docker-compose.prod.yaml \ | ||||
|      pull | ||||
|  | ||||
|    docker compose \ | ||||
|      --env-file deploy/compose/env/prod.env \ | ||||
|      -f deploy/compose/docker-compose.prod.yaml \ | ||||
|      up -d | ||||
|    ``` | ||||
| 4. Tail logs for critical services (`docker compose logs -f authority concelier`). | ||||
| 5. Check monitoring dashboards/alerts to confirm normal operation. | ||||
|  | ||||
| ### Rollback (Compose) | ||||
|  | ||||
| 1. Check out the previous release tag (e.g. `git checkout 2025.09.1`). | ||||
| 2. Re-run `docker compose pull` and `docker compose up -d` with that profile. Docker will restore the prior digests. | ||||
| 3. If reverting to a known-good snapshot is required, restore volume backups (see `docs/modules/authority/operations/backup-restore.md` and associated service guides). | ||||
| 4. Log the rollback in the operations journal. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 5. Channel promotion workflow | ||||
|  | ||||
| 1. Author or update the channel manifest under `deploy/releases/`. | ||||
| 2. Mirror the new digests into Helm/Compose values and run the alignment script for each profile. | ||||
| 3. Commit the changes with a message that references the release version and channel (e.g. `deploy: promote 2025.10.0-edge`). | ||||
| 4. Publish release notes and update `deploy/releases/README.md` (if applicable). | ||||
| 5. Tag the repository when promoting stable or airgap builds. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6. Upgrade rehearsal & rollback drill log | ||||
|  | ||||
| Maintain rehearsal notes in `docs/modules/devops/runbooks/launch-cutover.md` or the relevant sprint planning document. After each drill capture: | ||||
|  | ||||
| - Release version tested | ||||
| - Date/time | ||||
| - Participants | ||||
| - Issues encountered & fixes | ||||
| - Rollback duration (if executed) | ||||
|  | ||||
| Attach the log to the sprint retro or operational wiki. | ||||
|  | ||||
| | Date (UTC) | Channel | Outcome | Notes | | ||||
| |------------|---------|---------|-------| | ||||
| | 2025-10-26 | Documentation dry-run | Planned | Runbook refreshed; next live drill scheduled for 2025-11 edge → stable promotion. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 7. References | ||||
|  | ||||
| - `deploy/README.md` – structure and validation workflow for deployment bundles. | ||||
| - `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release automation and signing pipeline. | ||||
| - `docs/modules/devops/architecture.md` – high-level DevOps architecture, SLOs, and compliance requirements. | ||||
| - `ops/offline-kit/mirror_debug_store.py` – debug-store mirroring helper. | ||||
| - `deploy/tools/check-channel-alignment.py` – release vs deployment digest alignment checker. | ||||
							
								
								
									
										128
									
								
								docs/modules/devops/runbooks/launch-cutover.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,128 @@ | ||||
| # Launch Cutover Runbook - Stella Ops | ||||
|  | ||||
| _Document owner: DevOps Guild (2025-10-26)_   | ||||
| _Scope:_ Full-platform launch from staging to production for release `2025.09.2`. | ||||
|  | ||||
| ## 1. Roles and Communication | ||||
|  | ||||
| | Role | Primary | Backup | Contact | | ||||
| | --- | --- | --- | --- | | ||||
| | Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) | | ||||
| | Authority stack | Authority Core guild rep | Security guild rep | `#authority` | | ||||
| | Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` | | ||||
| | Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation | | ||||
| | Observability | Telemetry guild rep | SRE on-call | `#telemetry` | | ||||
| | Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket | | ||||
|  | ||||
| Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes. | ||||
|  | ||||
| ## 2. Timeline Overview (UTC) | ||||
|  | ||||
| | Time | Activity | Owner | | ||||
| | --- | --- | --- | | ||||
| | T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead | | ||||
| | T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer | | ||||
| | T-6h | Freeze non-launch deployments; notify guild leads. | Product owner | | ||||
| | T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps | | ||||
| | T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead | | ||||
| | T0 | Execute production cutover steps (Section 4). | Cutover team | | ||||
| | T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead | | ||||
| | T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner | | ||||
|  | ||||
| ## 3. Rehearsal (Staging) Checklist | ||||
|  | ||||
| 1. `docker network create stellaops_frontdoor || true` (if not present on staging jump host). | ||||
| 2. Run `deploy/tools/validate-profiles.sh` and archive output. | ||||
| 3. Apply staging secrets (`kubectl apply -f secrets/stage/*.yaml` or `helm secrets upgrade`) ensuring `stellaops-stage` credentials align with `values-stage.yaml`. | ||||
| 4. Perform `helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml` in staging cluster. | ||||
| 5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`. | ||||
| 6. Execute smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm report status in UI. | ||||
| 7. Document total wall time and any deviations in the rehearsal log. | ||||
|  | ||||
| Rehearsal must complete without manual interventions before proceeding to production. | ||||
|  | ||||
| ## 4. Production Cutover Steps | ||||
|  | ||||
| ### 4.1 Pre-flight | ||||
| - Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`. | ||||
| - Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host. | ||||
| - Back up current configuration and data: | ||||
|   - Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`. | ||||
|   - MinIO bucket mirror: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`. | ||||
|  | ||||
| ### 4.2 Apply Updates (Compose) | ||||
| 1. On each compose node, pull updated images for release `2025.09.2`: | ||||
|    ```bash | ||||
|    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull | ||||
|    ``` | ||||
| 2. Deploy changes: | ||||
|    ```bash | ||||
|    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d | ||||
|    ``` | ||||
| 3. Confirm containers are healthy via `docker compose ps` and `docker logs <service> --tail 50`. | ||||
|  | ||||
| ### 4.3 Apply Updates (Helm/Kubernetes) | ||||
| If using Kubernetes, perform: | ||||
| ```bash | ||||
| helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --namespace stellaops --atomic --timeout 15m | ||||
| ``` | ||||
| Monitor rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service> -n stellaops`. | ||||
|  | ||||
| ### 4.4 Configuration Validation | ||||
| - Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`. | ||||
| - Validate Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`. | ||||
| - Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (should return success). | ||||
| - Ensure Notify (legacy) is still accessible while the Notifier migration is pending. | ||||
|  | ||||
| ## 5. Smoke Tests | ||||
|  | ||||
| | Test | Command / Action | Expected Result | | ||||
| | --- | --- | --- | | ||||
| | API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` | | ||||
| | Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes < 5 minutes; report accessible with signed DSSE | | ||||
| | Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` | | ||||
| | Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata | | ||||
| | Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` | | ||||
| | Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent | | ||||
|  | ||||
| Log results in the change ticket with timestamps and screenshots where applicable. | ||||
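The "API health" row in the table above can be checked mechanically. The sketch below assumes the health endpoint returns a JSON body containing `"status":"Healthy"`, as the expected result suggests. Both `is_healthy` and `probe` are hypothetical helpers, and the URL must be supplied by the caller because the real hostnames are elided in this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the "API health" smoke test.

# Succeeds when the HTTP status is 200 and the body reports "status":"Healthy".
is_healthy() {
  local code="$1" body="$2"
  [ "$code" = "200" ] || return 1
  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"Healthy"'
}

# Fetches a health URL and evaluates it. curl appends the status code on its
# own line via -w, so the last line is split off from the response body.
probe() {
  local url="$1" raw code body
  raw="$(curl -sS --max-time 10 -w '\n%{http_code}' "$url")" || return 1
  code=${raw##*$'\n'}    # last line: status code written by -w
  body=${raw%$'\n'*}     # everything before it: response body
  is_healthy "$code" "$body"
}

# Possible usage (URL is a placeholder for your environment):
#   probe "$SCANNER_HEALTH_URL" && echo "scanner healthy"
```

Keeping the evaluation in `is_healthy` separate from the fetch makes it possible to replay a captured response when attaching evidence to the change ticket.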
|  | ||||
| ## 6. Rollback Procedure | ||||
|  | ||||
| 1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts. | ||||
| 2. For Compose: | ||||
|    ```bash | ||||
|    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down | ||||
|    docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d | ||||
|    ``` | ||||
| 3. For Helm: | ||||
|    ```bash | ||||
|    helm rollback stellaops <previous-release-number> --namespace stellaops | ||||
|    ``` | ||||
| 4. Restore Mongo snapshot if data inconsistency detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`. | ||||
| 5. Restore MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`. | ||||
| 6. Notify stakeholders of rollback and capture root cause notes in incident ticket. | ||||
|  | ||||
| ## 7. Post-cutover Actions | ||||
|  | ||||
| - Keep heightened monitoring for 4 hours post cutover; track latency, error rates, and queue depth. | ||||
| - Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored. | ||||
| - Update `docs/modules/devops/runbooks/launch-readiness.md` if any new gaps or follow-ups discovered. | ||||
| - Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner. | ||||
|  | ||||
| ## 8. Approval Matrix | ||||
|  | ||||
| | Step | Required Approvers | Record Location | | ||||
| | --- | --- | --- | | ||||
| | Production deployment plan | CTO + DevOps lead | Change ticket comment | | ||||
| | Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary | | ||||
| | Post-smoke success | DevOps lead + product owner | Change ticket closure | | ||||
| | Rollback (if invoked) | DevOps lead + CTO | Incident ticket | | ||||
|  | ||||
| Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned. | ||||
|  | ||||
| ## 9. Rehearsal Log | ||||
|  | ||||
| | Date (UTC) | What We Exercised | Outcome | Follow-up | | ||||
| | --- | --- | --- | --- | | ||||
| | 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. | | ||||
							
								
								
									
										49
									
								
								docs/modules/devops/runbooks/launch-readiness.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,49 @@ | ||||
| # Launch Readiness Record - Stella Ops | ||||
|  | ||||
| _Updated: 2025-10-26 (UTC)_ | ||||
|  | ||||
| This document captures production launch sign-offs, deployment readiness checkpoints, and any open risks that must be tracked before GA cutover. | ||||
|  | ||||
| ## 1. Sign-off Summary | ||||
|  | ||||
| | Module / Service | Guild / Point of Contact | Evidence (Task or Runbook) | Status | Timestamp (UTC) | Notes | | ||||
| | --- | --- | --- | --- | --- | --- | | ||||
| | Authority (Issuer) | Authority Core Guild | `AUTH-AOC-19-001` - scope issuance & configuration complete (DONE 2025-10-26) | READY | 2025-10-26T14:05Z | Tenant scope propagation follow-up (`AUTH-AOC-19-002`) tracked in gaps section. | | ||||
| | Signer | Signer Guild | `SIGNER-API-11-101` / `SIGNER-REF-11-102` / `SIGNER-QUOTA-11-103` (DONE 2025-10-21) | READY | 2025-10-26T14:07Z | DSSE signing, referrer verification, and quota enforcement validated in CI. | | ||||
| | Attestor | Attestor Guild | `ATTESTOR-API-11-201` / `ATTESTOR-VERIFY-11-202` / `ATTESTOR-OBS-11-203` (DONE 2025-10-19) | READY | 2025-10-26T14:10Z | Rekor submission/verification pipeline green; telemetry pack published. | | ||||
| | Scanner Web + Worker | Scanner WebService Guild | `SCANNER-WEB-09-10x`, `SCANNER-RUNTIME-12-30x` (DONE 2025-10-18 -> 2025-10-24) | READY* | 2025-10-26T14:20Z | Orchestrator envelope work (`SCANNER-EVENTS-16-301/302`) still open; see gaps. | | ||||
| | Concelier Core & Connectors | Concelier Core / Ops Guild | Ops runbook sign-off in `docs/modules/concelier/operations/conflict-resolution.md` (2025-10-16) | READY | 2025-10-26T14:25Z | Conflict resolution & connector coverage accepted; Mongo schema hardening pending (see gaps). | | ||||
| | Excititor API | Excititor Core Guild | Wave 0 connector ingest sign-offs (EXECPLAN.Section  Wave 0) | READY | 2025-10-26T14:28Z | VEX linkset publishing complete for launch datasets. | | ||||
| | Notify Web (legacy) | Notify Guild | Existing stack carried forward; Notifier program tracked separately (Sprint 38-40) | PENDING | 2025-10-26T14:32Z | Legacy notify web remains operational; migration to Notifier blocked on `SCANNER-EVENTS-16-301`. | | ||||
| | Web UI | UI Guild | Stable build `registry.stella-ops.org/.../web-ui@sha256:10d9248...` deployed in stage and smoke-tested | READY | 2025-10-26T14:35Z | Policy editor GA items (Sprint 20) outside launch scope. | | ||||
| | DevOps / Release | DevOps Guild | `deploy/tools/validate-profiles.sh` run (2025-10-26) covering dev/stage/prod/airgap/mirror | READY | 2025-10-26T15:02Z | Compose/Helm lint + docker compose config validated; see Section 2 for details. | | ||||
| | Offline Kit | Offline Kit Guild | `DEVOPS-OFFLINE-18-004` (Go analyzer) and `DEVOPS-OFFLINE-18-005` (Python analyzer) complete; debug-store mirror pending (`DEVOPS-OFFLINE-17-004`). | PENDING | 2025-10-26T15:05Z | Awaiting release debug artefacts to finalise `DEVOPS-OFFLINE-17-004`; tracked in Section 3. | | ||||
|  | ||||
| _\* READY with caveat - remaining work noted in Section 3._ | ||||
|  | ||||
| ## 2. Deployment Readiness Checklist | ||||
|  | ||||
| - **Production profiles committed:** `deploy/compose/docker-compose.prod.yaml` and `deploy/helm/stellaops/values-prod.yaml` added with front-door network hand-off and secret references for Mongo/MinIO/core services. | ||||
| - **Secrets placeholders documented:** `deploy/compose/env/prod.env.example` enumerates required credentials (`MONGO_INITDB_ROOT_PASSWORD`, `MINIO_ROOT_PASSWORD`, Redis/NATS endpoints, `FRONTDOOR_NETWORK`). Helm values reference Kubernetes secrets (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`). | ||||
| - **Static validation executed:** `deploy/tools/validate-profiles.sh` run on 2025-10-26 (docker compose config + helm lint/template) with all profiles passing. | ||||
| - **Ingress model defined:** Production compose profile introduces external `frontdoor` network; README updated with creation instructions and scope of externally reachable services. | ||||
| - **Observability hooks:** Authority/Signer/Attestor telemetry packs verified; scanner runtime build-id metrics landed (`SCANNER-RUNTIME-17-401`). Grafana dashboards referenced in component runbooks. | ||||
| - **Rollback assets:** Stage Compose profile remains aligned (`docker-compose.stage.yaml`), enabling rehearsals before prod cutover; release manifests (`deploy/releases/2025.09-stable.yaml`) map digests for reproducible rollback. | ||||
| - **Rehearsal status:** 2025-10-26 validation dry-run executed (`deploy/tools/validate-profiles.sh` across dev/stage/prod/airgap/mirror). Full stage Helm rollout pending access to the managed staging cluster; target to complete once credentials are provisioned. | ||||
|  | ||||
| ## 3. Outstanding Gaps & Follow-ups | ||||
|  | ||||
| | Item | Owner | Tracking Ref | Target / Next Step | Impact | | ||||
| | --- | --- | --- | --- | --- | | ||||
| | Tenant scope propagation and audit coverage | Authority Core Guild | `AUTH-AOC-19-002` (DOING 2025-10-26) | Land enforcement + audit fixtures by Sprint 19 freeze | Medium - required for multi-tenant GA but does not block initial cutover if tenants scoped manually. | | ||||
| | Orchestrator event envelopes + Notifier handshake | Scanner WebService Guild | `SCANNER-EVENTS-16-301` (BLOCKED), `SCANNER-EVENTS-16-302` (DOING) | Coordinate with Gateway/Notifier owners on preview package replacement or binding redirects; rerun `dotnet test` once patch lands and refresh schema docs. Share envelope samples in `docs/events/` after tests pass. | High - gating Notifier migration; legacy notify path remains functional meanwhile. | ||||
| | Offline Kit Python analyzer bundle | Offline Kit Guild + Scanner Guild | `DEVOPS-OFFLINE-18-005` (DONE 2025-10-26) | Monitor for follow-up manifest updates and rerun smoke script when analyzers change. | Medium - ensures language analyzer coverage stays current for offline installs. | | ||||
| | Offline Kit debug store mirror | Offline Kit Guild + DevOps Guild | `DEVOPS-OFFLINE-17-004` (BLOCKED 2025-10-26) | Release pipeline must publish `out/release/debug` artefacts; once available, run `mirror_debug_store.py` and commit `metadata/debug-store.json`. | Low - symbol lookup remains accessible from staging assets but required before next Offline Kit tag. | | ||||
| | Mongo schema validators for advisory ingestion | Concelier Storage Guild | `CONCELIER-STORE-AOC-19-001` (TODO) | Finalize JSON schema + migration toggles; coordinate with Ops for rollout window | Low - current validation handled in app layer; schema guard adds defense-in-depth. | | ||||
| | Authority plugin telemetry alignment | Security Guild | `SEC2.PLG`, `SEC3.PLG`, `SEC5.PLG` (BLOCKED pending AUTH DPoP/MTLS tasks) | Resume once upstream auth surfacing stabilises | Low - plugin remains optional; launch uses default Authority configuration. | | ||||
|  | ||||
| ## 4. Approvals & Distribution | ||||
|  | ||||
| - Record shared in `#launch-readiness` (Mattermost) 2025-10-26 15:15 UTC with DevOps + Guild leads for acknowledgement. | ||||
| - Updates to this document require dual sign-off from DevOps Guild (owner) and impacted module guild lead; retain change log via Git history. | ||||
| - Cutover rehearsal and rollback drills are tracked separately in `docs/modules/devops/runbooks/launch-cutover.md` (see associated Task `DEVOPS-LAUNCH-18-001`). | ||||
							
								
								
									
										64
									
								
								docs/modules/devops/runbooks/nuget-preview-bootstrap.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,64 @@ | ||||
| # NuGet Preview Bootstrap (Offline-Friendly) | ||||
|  | ||||
| The StellaOps build relies on .NET 10 RC2 packages (Microsoft.Extensions.*, JwtBearer 10.0 RC). | ||||
| `NuGet.config` now wires three sources: | ||||
|  | ||||
| 1. `local` → `./local-nuget` (preferred, air-gapped mirror) | ||||
| 2. `dotnet-public` → `https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json` | ||||
| 3. `nuget.org` → fallback for everything else | ||||
|  | ||||
| Follow the steps below whenever you refresh the repo or roll a new Offline Kit drop. | ||||
|  | ||||
| ## 1. Mirror the preview packages | ||||
|  | ||||
| ```bash | ||||
| ./ops/devops/sync-preview-nuget.sh | ||||
| ``` | ||||
|  | ||||
| * Reads `ops/devops/nuget-preview-packages.csv`. Each line specifies the package, version, expected SHA-256 hash, and (optionally) the flat-container base URL (we pin to `dotnet-public`). | ||||
| * Downloads the `.nupkg` straight into `./local-nuget/` and re-verifies the checksum. Existing files are skipped when hashes already match. | ||||
| * Use `NUGET_V2_BASE` if you need to temporarily point at a different mirror. | ||||
|  | ||||
| 💡 The script never patches a package in place: if a checksum no longer matches, you will see a “SHA mismatch … refreshing” message and the file is fetched again. | ||||
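The manifest-driven checks described above can be sketched in Python. This is an illustration only, not the contents of `sync-preview-nuget.sh`. It assumes the manifest columns appear in the order package, version, SHA-256, optional base URL, that `#` lines are comments, and that the mirror follows the standard NuGet v3 flat-container layout. All function names are hypothetical.

```python
import csv
import hashlib
import io
from pathlib import Path
from typing import NamedTuple, Optional


class ManifestEntry(NamedTuple):
    package: str
    version: str
    sha256: str
    base_url: Optional[str]  # optional 4th column; None means "use the default feed"


def parse_manifest(text: str) -> list[ManifestEntry]:
    """Parse manifest rows, skipping blank lines and '#' comments (assumed syntax)."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].lstrip().startswith("#"):
            continue
        cells = [cell.strip() for cell in row]
        if len(cells) < 3:
            raise ValueError(f"manifest row needs package,version,sha256: {row!r}")
        base = cells[3] if len(cells) > 3 and cells[3] else None
        entries.append(ManifestEntry(cells[0], cells[1], cells[2].lower(), base))
    return entries


def flat_container_url(base: str, package: str, version: str) -> str:
    """Build a v3 flat-container download URL (IDs and versions are lower-cased)."""
    pid, ver = package.lower(), version.lower()
    return f"{base.rstrip('/')}/{pid}/{ver}/{pid}.{ver}.nupkg"


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large packages are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_up_to_date(path: Path, expected_sha256: str) -> bool:
    """True when the mirrored file exists and its hash matches the manifest."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()
```

A sync loop built on these helpers would skip any entry for which `is_up_to_date` returns `True` and download the rest from `flat_container_url(...)`, hashing the result again before accepting it.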
|  | ||||
| ## 2. Restore using the shared `NuGet.config` | ||||
|  | ||||
| From the repo root: | ||||
|  | ||||
| ```bash | ||||
| DOTNET_NOLOGO=1 dotnet restore src/Excititor/__Libraries/StellaOps.Excititor.Connectors.Abstractions/StellaOps.Excititor.Connectors.Abstractions.csproj \ | ||||
|   --configfile NuGet.config | ||||
| ``` | ||||
|  | ||||
| The `packageSourceMapping` section keeps `Microsoft.Extensions.*`, `Microsoft.AspNetCore.*`, and `Microsoft.Data.Sqlite` bound to `local`/`dotnet-public`, so `dotnet restore` never has to reach out to nuget.org when mirrors are populated. | ||||
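To make the mapping behaviour concrete, here is a small Python model of how NuGet picks sources for a package ID: matching is case-insensitive, an exact ID beats any prefix pattern, longer prefixes beat shorter ones, and `*` is the last resort. The `EXAMPLE_MAPPING` below is an assumption reconstructed from the sentence above, not a copy of the repository's `NuGet.config`, and `sources_for` is a hypothetical helper.

```python
# Illustrative mapping only; the real patterns live in NuGet.config.
EXAMPLE_MAPPING = {
    "local": ["Microsoft.Extensions.*", "Microsoft.AspNetCore.*", "Microsoft.Data.Sqlite"],
    "dotnet-public": ["Microsoft.Extensions.*", "Microsoft.AspNetCore.*", "Microsoft.Data.Sqlite"],
    "nuget.org": ["*"],
}


def sources_for(package_id: str, mapping: dict[str, list[str]]) -> list[str]:
    """Return the sources whose most specific pattern matches package_id."""
    pid = package_id.lower()
    best_rank = -1
    winners: list[str] = []
    for source, patterns in mapping.items():
        for pattern in patterns:
            pat = pattern.lower()
            if pat.endswith("*"):
                prefix = pat[:-1]
                if not pid.startswith(prefix):
                    continue
                rank = len(prefix)      # longer prefix = more specific; "*" ranks 0
            elif pat == pid:
                rank = len(pid) + 1     # an exact ID outranks every prefix pattern
            else:
                continue
            if rank > best_rank:
                best_rank, winners = rank, [source]
            elif rank == best_rank and source not in winners:
                winners.append(source)
    return winners


if __name__ == "__main__":
    for name in ("Microsoft.Extensions.Logging", "Newtonsoft.Json"):
        print(name, "->", sources_for(name, EXAMPLE_MAPPING))
```

Under this model a `Microsoft.Extensions.*` package resolves only to `local` and `dotnet-public`, which is why a populated mirror keeps restore traffic away from nuget.org.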
|  | ||||
| Before committing changes (or when wiring up a new environment) run: | ||||
|  | ||||
| ```bash | ||||
| python3 ops/devops/validate_restore_sources.py | ||||
| ``` | ||||
|  | ||||
| The validator asserts: | ||||
|  | ||||
| - `NuGet.config` lists `local` → `dotnet-public` → `nuget.org` in that order. | ||||
| - `Directory.Build.props` pins `RestoreSources` so every project prioritises the local mirror. | ||||
| - No stray `NuGet.config` files shadow the repo root configuration. | ||||
|  | ||||
| CI executes the validator in both the `build-test-deploy` and `release` workflows, | ||||
| so a misconfigured source fails the pipeline before any restore or build begins. | ||||
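For readers who want a feel for what such a check involves, the sketch below covers two of the three assertions: source order and stray configuration files. It is not the code in `validate_restore_sources.py`. The function names are hypothetical, and the `Directory.Build.props` check is left out because the expected `RestoreSources` value is not stated here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

EXPECTED_ORDER = ["local", "dotnet-public", "nuget.org"]


def package_source_keys(config_path: Path) -> list[str]:
    """Return the <packageSources> keys in document order."""
    root = ET.parse(config_path).getroot()
    section = root.find("packageSources")
    if section is None:
        return []
    return [add.get("key", "") for add in section.findall("add")]


def stray_configs(repo_root: Path) -> list[Path]:
    """Find NuGet.config files (any casing) other than the one at the repo root."""
    root_config = (repo_root / "NuGet.config").resolve()
    return sorted(
        path
        for path in repo_root.rglob("*")
        if path.is_file()
        and path.name.lower() == "nuget.config"
        and path.resolve() != root_config
    )


def validate(repo_root: Path) -> list[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    keys = package_source_keys(repo_root / "NuGet.config")
    if keys != EXPECTED_ORDER:
        problems.append(f"package sources are {keys}, expected {EXPECTED_ORDER}")
    for stray in stray_configs(repo_root):
        problems.append(f"stray config shadows the root: {stray}")
    return problems


if __name__ == "__main__":
    issues = validate(Path.cwd())
    for issue in issues:
        print(f"ERROR: {issue}")
    raise SystemExit(1 if issues else 0)
```

Returning a list of problems instead of failing on the first one lets a CI job report every misconfiguration in a single run.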
|  | ||||
| If you run fully air-gapped, remember to clear the cache between SDK upgrades: | ||||
|  | ||||
| ```bash | ||||
| dotnet nuget locals all --clear | ||||
| ``` | ||||
|  | ||||
| ## 3. Troubleshooting | ||||
|  | ||||
| | Symptom | Fix | | ||||
| | --- | --- | | ||||
| | `dotnet restore` still hits nuget.org for preview packages | Re-run `sync-preview-nuget.sh` to ensure the `.nupkg` exists locally, then delete `~/.nuget/packages/microsoft.extensions.*` so the resolver picks up the mirrored copy. | | ||||
| | SHA mismatch in the manifest | Update `ops/devops/nuget-preview-packages.csv` with the new version + checksum (from the feed) and re-run the sync script. | | ||||
| | Azure DevOps feed throttling | Set the `DOTNET_PUBLIC_FLAT_BASE` environment variable to your own mirrored flat-container URL, then add that URL to the fourth column of the manifest. | | ||||
|  | ||||
| Keep this doc alongside Offline Kit instructions so air-gapped operators know exactly how to refresh the mirror and verify packages before restore. | ||||