feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
22
docs/modules/devops/AGENTS.md
Normal file
@@ -0,0 +1,22 @@
# DevOps agent guide

## Mission
The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments.

## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)

## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.

## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind: document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
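
The determinism guardrail can be illustrated with a minimal sketch. It is not taken from the StellaOps codebase: the helper names `normalise_timestamp` and `canonical_json` are hypothetical, and the sketch assumes that "deterministic" means byte-identical output for equal input.

```python
import json
from datetime import datetime, timezone


def normalise_timestamp(ts: datetime) -> str:
    """Render a timestamp as UTC ISO-8601 with a trailing 'Z' (hypothetical helper)."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def canonical_json(records: list[dict], sort_key: str) -> str:
    """Serialise records with a stable record order, sorted keys, and fixed separators."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
```

Sorting both the records and the keys means two runs over the same data produce the same bytes, which is what makes digests and golden-file comparisons meaningful.
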
41
docs/modules/devops/README.md
Normal file
@@ -0,0 +1,41 @@
# StellaOps DevOps

The DevOps module captures release, deployment, and migration playbooks that keep StellaOps deterministic across environments.

## Responsibilities
- Maintain CI pipelines, signing workflows, and release packaging steps.
- Operate shared runbooks for launch readiness, upgrades, and NuGet previews.
- Provide offline kit assembly instructions and tooling integration.
- Wrap observability/telemetry bootstrap flows for platform teams.

## Key components
- Runbooks under ./runbooks/ (launch, deployment, nuget).
- Migration guidance under ./migrations/.
- Architecture overview bridging CI/CD & infrastructure concerns.

## Integrations & dependencies
- Ops pipelines (Gitea, GitHub Actions) and artifact registries.
- Authority/Signer for supply chain signing.
- Telemetry stack bootstrap scripts.

## Operational notes
- Offline bundle packaging guidance in docs/modules/export-center/operations/runbook.md.
- Dashboards for launch cutover rehearsals.
- Coordination with Security for enforced guardrails.

## Related resources
- ./runbooks/launch-readiness.md
- ./runbooks/launch-cutover.md
- ./runbooks/deployment-upgrade.md
- ./runbooks/nuget-preview-bootstrap.md
- ./migrations/semver-style.md

## Backlog references
- DEVOPS-LAUNCH-18-001 / 18-900 runbooks in ../../TASKS.md.
- Telemetry bootstrap automation tracked in `ops/devops/TASKS.md`.

## Epic alignment
- **Epic 1 – AOC enforcement:** bake AOC verifier steps, CI guards, and schema validation into pipelines.
- **Epic 9 – Orchestrator Dashboard:** support operational dashboards, job recovery runbooks, and rate-limit governance.
- **Epic 10 – Export Center:** manage signing workflows, Offline Kit packaging, and release promotion for exports.
- **Epic 15 – Observability & Forensics:** coordinate telemetry deployment, evidence retention, and forensic automation.
9
docs/modules/devops/TASKS.md
Normal file
@@ -0,0 +1,9 @@
# Task board — DevOps

> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.

| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| DEVOPS-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| DEVOPS-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| DEVOPS-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |
488
docs/modules/devops/architecture.md
Normal file
@@ -0,0 +1,488 @@
# component_architecture_devops.md — **Stella Ops Release & Operations** (2025Q4)

> Draws from the AOC guardrails, Orchestrator, Export Center, and Observability module plans to describe how Stella Ops is built, signed, distributed, and operated.

> **Scope.** Implementation‑ready blueprint for **how Stella Ops is built, versioned, signed, distributed, upgraded, licensed (PoE)**, and operated in customer environments (online and air‑gapped). Covers reproducible builds, supply‑chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (RustFS default + Mongo, S3 fallback), monitoring SLOs, and customer activation.

---

## 0) Product vision (operations lens)

Stella Ops must be **trustable at a glance** and **boringly operable**:

* Every release ships with **first‑party SBOMs, provenance, and signatures**; services verify **each other’s** integrity at runtime.
* Customers can deploy by **digest** and stay aligned with **LTS/stable/edge** channels.
* Paid customers receive **attestation authority** (Signer accepts their PoE) while the core platform remains **free to run**.
* Air‑gapped customers receive **offline kits** with verifiable digests and deterministic import.
* Artifacts expire predictably; operators know what’s kept, for how long, and why.

---

## 1) Release trains & versioning

### 1.1 Channels

* **LTS** (12‑month support window): quarterly cadence (Q1/Q2/Q3/Q4).
* **Stable** (default): monthly rollup (bug fixes + compatible features).
* **Edge**: weekly; for early adopters, no guarantees.

### 1.2 Version strings

Semantic core + calendar tag:

```
<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)   e.g., 2.4.1 (2027.06)
```

* **MAJOR**: breaking API/DB changes (rare).
* **MINOR**: new features, compatible schema migrations (expand/contract pattern).
* **PATCH**: bug fixes, perf and security updates.
* **Calendar tag** exposes **release year** used by Signer for **PoE window checks**.

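As an illustration of the version string format in §1.2, the following sketch splits a string such as `2.4.1 (2027.06)` into its semantic and calendar parts. The `ReleaseVersion` type and `parse_release_version` function are hypothetical names, and the exact spacing accepted by real tooling is an assumption.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    year: int
    month: int


# Matches the documented shape "<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+) \((\d{4})\.(\d{2})\)$")


def parse_release_version(text: str) -> ReleaseVersion:
    """Parse a release version string or raise ValueError (hypothetical helper)."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a release version string: {text!r}")
    version = ReleaseVersion(*(int(g) for g in match.groups()))
    if not 1 <= version.month <= 12:
        raise ValueError(f"calendar month out of range: {text!r}")
    return version
```

Keeping the calendar year as its own field makes it straightforward to hand to a release-window check without re-parsing the string.
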
### 1.3 Component alignment

A release is a **bundle** of image digests + charts + manifests. All services in a bundle are **wire‑compatible**. Mixed minor versions are allowed within a bounded skew:

* **Web UI ↔ backend**: `±1 minor`.
* **Scanner ↔ Policy/Excititor/Concelier**: `±1 minor`.
* **Authority/Signer/Attestor triangle**: **must** be same minor (crypto and DPoP/mTLS binding rules).

At startup, services **self‑advertise** their semver & channel; the UI surfaces **mismatch warnings**.

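The skew rules in §1.3 could be checked along these lines. This is a sketch under assumptions: the lowercase service keys, the `check_bundle` name, and the warning wording are hypothetical, and the Web UI rule is left out because the document does not enumerate which services count as "backend".

```python
def skew_ok(a: tuple[int, int], b: tuple[int, int], max_minor_skew: int) -> bool:
    """Return True when two (major, minor) pairs are within the allowed minor skew."""
    if a[0] != b[0]:
        return False  # Assumption: different majors are never wire-compatible.
    return abs(a[1] - b[1]) <= max_minor_skew


def check_bundle(versions: dict[str, tuple[int, int]]) -> list[str]:
    """Collect mismatch warnings for the documented skew rules."""
    warnings = []
    trust = ["authority", "signer", "attestor"]
    # Equality is transitive, so comparing neighbours covers the whole triangle.
    for left, right in zip(trust, trust[1:]):
        if not skew_ok(versions[left], versions[right], 0):
            warnings.append(f"{left}/{right}: must share the same minor")
    for peer in ("policy", "excititor", "concelier"):
        if not skew_ok(versions["scanner"], versions[peer], 1):
            warnings.append(f"scanner/{peer}: more than one minor apart")
    return warnings
```
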
---

## 2) Supply‑chain pipeline (how a release is built)

### 2.1 Deterministic builds

* **Builders**: isolated **BuildKit** workers with pinned base images (digest only).
* **Pinning**: lock and pin files (`go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props`) are **frozen** at tag.
* **Reproducibility**: timestamps normalized via `SOURCE_DATE_EPOCH`; deterministic zips/tars.
* **Multi‑arch**: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).

### 2.2 First‑party SBOMs & provenance

* Each image gets **CycloneDX (JSON+Protobuf) SBOM** and **SLSA‑style provenance** attached as **OCI referrers**.
* Scanner’s **Buildx generator** is used to produce SBOMs *during* build; a separate post‑build scan verifies parity (red flag if drift).
* **Release manifest** (see §6.1) lists all digests and SBOM/attestation refs.

### 2.3 Signing & transparency

* Images are **cosign‑signed** (keyless) with a Stella Ops release identity; inclusion in a **transparency log** (Rekor) is required.
* SBOM and provenance attestations are **DSSE** and also transparency‑logged.
* Release keys (Fulcio roots or public keys) are embedded in **Signer** policy (for **scanner‑release validation** at customer side).

### 2.4 Gates & tests

* **Static**: linters, codegen checks, protobuf API freeze (backward‑compat tests).
* **Unit/integration**: per‑component, plus **end‑to‑end** flows (scan→vex→policy→sign→attest).
* **Perf SLOs**: hot paths (SBOM compose, diff, export) measured against budgets.
* **Security**: dependency audit vs Concelier export; container hardening tests; minimal caps.
* **Analyzer smoke**: restart-time language plug-ins (currently Python) verified via `dotnet run --project src/Tools/LanguageAnalyzerSmoke` to ensure manifest integrity plus cold vs warm determinism (< 30 s / < 5 s budgets); the harness logs deviations from repository goldens for follow-up.
* **Canary cohort**: internal staging + selected customers; one week on **edge** before **stable** tag.

### 2.5 Debug-store artefacts

* Every release exports stripped debug information for ELF binaries discovered in service images. Debug files follow the GNU build-id layout (`debug/.build-id/<aa>/<rest>.debug`) and are generated via `objcopy --only-keep-debug`.
* `debug/debug-manifest.json` captures build-id → component/image/source mappings with SHA-256 checksums so operators can mirror the directory into debuginfod or offline symbol stores. The manifest (and its `.sha256` companion) ships with every release bundle and Offline Kit.

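The GNU build-id layout mentioned in §2.5 splits the hex build-id into a two-character directory and the remainder. A minimal sketch of that mapping, with a hypothetical function name `debug_path`:

```python
from pathlib import PurePosixPath


def debug_path(build_id: str) -> PurePosixPath:
    """Map a hex GNU build-id to its debug/.build-id/<aa>/<rest>.debug location."""
    build_id = build_id.strip().lower()
    if len(build_id) < 3 or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError(f"not a hex build-id: {build_id!r}")
    return PurePosixPath("debug/.build-id") / build_id[:2] / f"{build_id[2:]}.debug"
```

An operator mirroring the directory into a symbol store can use the same mapping to look up the debug file for a given binary.
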
---

## 3) Distribution & activation

### 3.1 Registries

* **Primary**: `registry.stella-ops.org` (OCI v2, supports Referrers API).
* **Mirrors**: GHCR (read‑only), regional mirrors for latency.
* Operational runbook: see `docs/modules/concelier/operations/mirror.md` for deployment profiles, CDN guidance, and sync automation.
* **Pull by digest only** in Kubernetes/Compose manifests.

**Gating policy**:

* **Core images** (Authority, Scanner, Concelier, Excititor, Attestor, UI): public **read**.
* **Enterprise add‑ons** (if any) and **pre‑release**: private repos via the **Registry Token Service** (`src/Registry/StellaOps.Registry.TokenService`) which exchanges Authority-issued OpToks for short-lived Docker registry bearer tokens.

> Monetization lever is **signing** (PoE gate), not image pulls, so the core remains simple to consume.

### 3.2 OAuth2 token service (for private repos)

* Docker Registry’s token flow backed by **Authority**:

  1. Client hits registry (`401` with `WWW-Authenticate: Bearer realm=…`).
  2. Client gets an **access token** from the token service (validated by Authority) with `scope=repository:…:pull`.
  3. Registry allows pull for the requested repo.
* Tokens are **short‑lived** (60–300 s) and **DPoP‑bound**.

The token service enforces plan gating via `registry-token.yaml` (see `docs/modules/registry/operations/token-service.md`) and exposes Prometheus metrics (`registry_token_issued_total`, `registry_token_rejected_total`). Revoked licence identifiers halt issuance even when scope requirements are met.

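Steps 1 and 2 of the token flow in §3.2 hinge on reading the `WWW-Authenticate` challenge and turning it into a token request. The sketch below shows only that client-side parsing; the function names are hypothetical, the header is assumed to use quoted `key="value"` parameters, and DPoP binding and the actual HTTP calls are out of scope.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict[str, str]:
    """Extract key="value" parameters from a 'Bearer ...' WWW-Authenticate header."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError("expected a Bearer challenge")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_request_url(challenge: dict[str, str]) -> str:
    """Build the token-service URL the client would call next."""
    query = {k: challenge[k] for k in ("service", "scope") if k in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"
```

The `realm` tells the client where the token service lives, so the registry itself never needs to handle credentials.
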
### 3.3 Offline kits (air‑gapped)

* Tarball per release channel:

  ```
  stellaops-kit-<ver>-<channel>.tar.zst
    /images/         OCI layout with all first-party images (multi-arch)
    /sboms/          CycloneDX JSON+PB for each image
    /attest/         DSSE bundles + Rekor proofs
    /charts/         Helm charts + values templates
    /compose/        docker-compose.yml + .env template
    /plugins/        Concelier/Excititor connectors (restart-time)
    /policy/         example policies
    /manifest/       release.yaml (see §6.1)
  ```
* Import via the CLI (`stellaops offline kit import`); it checks digests and signatures before load.

---

## 4) Licensing (PoE) & monetization

**Principle**: **Only paid Stella Ops issues valid signed attestations.** Running the stack is free; signing requires PoE.

### 4.1 PoE issuance

* Customers purchase a plan and obtain a **PoE artifact** from `www.stella-ops.org`:

  * **PoE‑JWT** (DPoP/mTLS‑bound) **or** **PoE mTLS client certificate**.
  * Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant/customer` IDs.

### 4.2 Online enforcement

* **Signer** calls **Licensing /license/introspect** on every signing request (see signer doc).
* If **revoked/expired/out‑of‑window** → deny with machine‑readable reason.
* All **valid** bundles are DSSE‑signed and **Attestor** logs them; Rekor UUID returned.
* UI badges: “**Verified by Stella Ops**” with link to the public log.

### 4.3 Air‑gapped / offline

* Customers obtain a **time‑boxed PoE lease** (signed JSON, 7–30 days).
* Signer accepts the lease and emits **provisional** attestations (clearly labeled).
* When connectivity returns, a background job **endorses** the provisional entries with the cloud service, updating their status to **verified**.
* Operators can export a **verification bundle** for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).

### 4.4 Stolen/abused PoE

* Customers report theft; **Licensing** flags `license_id` as **revoked**.
* Subsequent Signer requests **deny**; previous attestations remain but can be marked **contested** (UI shows badge, optional re‑sign path upon new PoE).

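The revoked/expired/out‑of‑window decision from §4.2 can be sketched with the claims listed in §4.1. The precise window semantics are not specified here, so this sketch assumes a release is in window when its calendar year does not exceed `valid_release_year` and its version does not exceed `max_version`; the reason strings and the `deny_reason` name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PoeClaims:
    license_id: str
    valid_release_year: int
    max_version: tuple[int, int, int]
    exp: int  # Unix seconds


def deny_reason(claims: PoeClaims, release_year: int,
                release_version: tuple[int, int, int], now: int,
                revoked: frozenset[str] = frozenset()) -> Optional[str]:
    """Return None when signing may proceed, else a hypothetical reason code."""
    if claims.license_id in revoked:
        return "revoked"
    if now >= claims.exp:
        return "expired"
    # Assumption: the window covers releases up to valid_release_year and max_version.
    if release_year > claims.valid_release_year or release_version > claims.max_version:
        return "out_of_window"
    return None
```

Checking revocation first mirrors §4.4, where a revoked `license_id` is denied regardless of its other claims.
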
---

## 5) Deployment path (customer side)

### 5.1 First install

* **Helm** (Kubernetes) or **Compose** (VMs). Example (K8s):

```bash
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
  --version 2.4.0 \
  --set global.channel=stable \
  --set authority.issuer=https://authority.stella.local \
  --set scanner.minio.endpoint=http://minio.stella.local:9000 \
  --set scanner.mongo.uri=mongodb://mongo/scanner \
  --set concelier.mongo.uri=mongodb://mongo/concelier \
  --set excititor.mongo.uri=mongodb://mongo/excititor
```

* Post‑install job registers **Authority clients** (Scanner, Signer, Attestor, UI) and prints **bootstrap** URLs and client credentials (sealed secrets).
* UI banner shows **release bundle** and verification state (cosign OK? Rekor OK?).

### 5.2 Updates

* **Blue/green**: pull new bundle by **digest**; deploy side‑by‑side; cut traffic.

* **Rolling**: upgrade stateful components in safe order:

  1. Authority (stateless, dual‑key rotation ready)
  2. Signer/Attestor (same minor)
  3. Scanner WebService & Workers
  4. Concelier, then Excititor (schema migrations are expand/contract)
  5. UI last

* **DB migrations** are **expand/contract**:

  * Phase A (release N): **add** new fields/indexes, write old+new.
  * Phase B (N+1): **read** new fields; **drop** old.
  * Rollback is a matter of redeploying previous images and keeping both schemas valid.

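The expand/contract phases in §5.2 can be pictured with plain dictionaries standing in for documents. This is a sketch only: the field names `sev` and `severity` are hypothetical, and a real migration would run against MongoDB rather than in memory.

```python
def write_phase_a(doc: dict, severity: str) -> dict:
    """Release N: keep writing the old field and add the new one beside it."""
    updated = dict(doc)
    updated["sev"] = severity                  # old field (hypothetical name)
    updated["severity"] = {"level": severity}  # new field (hypothetical name)
    return updated


def read_compatible(doc: dict) -> str:
    """Reader that tolerates both shapes, so redeploying older images stays valid."""
    if "severity" in doc:
        return doc["severity"]["level"]
    return doc["sev"]


def contract_phase_b(doc: dict) -> dict:
    """Release N+1: drop the old field once every reader uses the new one."""
    return {k: v for k, v in doc.items() if k != "sev"}
```

Because Phase A writes both shapes, images from the previous release keep working against the same data, which is what makes rollback a redeploy rather than a data restore.
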
### 5.3 Rollback

* Images referenced by **digest**; keep previous release manifest `K` versions back.
* `helm rollback` or compose `docker compose -f release-K.yml up -d`.
* Mongo migrations are additive; **no destructive changes** within a single minor.

---

## 6) Release payloads & manifests

### 6.1 Release manifest (`release.yaml`)

```yaml
release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
components:
  - name: scanner-webservice
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
    sbom: oci://.../referrers/cdx-json@sha256:11..22
    provenance: oci://.../attest/provenance@sha256:33..44
    signature: { rekorUUID: "…" }
  - name: signer
    image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
    signature: { rekorUUID: "…" }
charts:
  - name: platform
    version: "2.4.1"
    digest: "sha256:ee..ff"
compose:
  file: "docker-compose.yml"
  digest: "sha256:77..88"
checksums:
  sha256: "… digest of this release.yaml …"
```

The manifest is **cosign‑signed**; UI/CLI can verify a bundle without talking to registries.

> Deployment guardrails – The repository keeps channel-aligned Compose bundles
> in `deploy/compose/` and Helm overlays in `deploy/helm/stellaops/`. Both sets
> pull their digests from `deploy/releases/` and are validated by
> `deploy/tools/validate-profiles.sh` to guarantee lint/dry-run cleanliness.

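Entries such as `compose.digest` in the release manifest carry `sha256:<hex>` strings. A minimal sketch of checking a local file's bytes against such an entry follows; the `matches_digest` name is hypothetical, and the check complements the cosign signature verification rather than replacing it.

```python
import hashlib
import hmac


def matches_digest(payload: bytes, expected: str) -> bool:
    """Check payload against a 'sha256:<hex>' digest string from the manifest."""
    algorithm, _, hex_digest = expected.partition(":")
    if algorithm != "sha256" or not hex_digest:
        raise ValueError(f"unsupported digest: {expected!r}")
    actual = hashlib.sha256(payload).hexdigest()
    # Constant-time comparison; case-insensitive on the hex part.
    return hmac.compare_digest(actual, hex_digest.lower())
```
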
### 6.2 Image labels (release metadata)

Each image sets OCI labels:

```
org.opencontainers.image.version        = "2.4.1"
org.opencontainers.image.revision       = "<git sha>"
org.opencontainers.image.created        = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar          = "2027.06"
org.stellaops.release.channel           = "stable"
org.stellaops.build.slsaProvenance      = "oci://…"
```

Signer validates **scanner** image’s cosign identity + calendar tag for **release window** checks.

---

## 7) Artifact lifecycle & storage (RustFS/Mongo)

### 7.1 Buckets & prefixes (RustFS)

```
rustfs://stellaops/
  scanner/
    layers/<sha256>/sbom.cdx.json.zst
    images/<imgDigest>/inventory.cdx.pb
    images/<imgDigest>/usage.cdx.pb
    diffs/<old>_<new>/diff.json.zst
    attest/<artifactSha256>.dsse.json
  concelier/
    json/<exportId>/...
    trivy/<exportId>/...
  excititor/
    exports/<exportId>/...
  attestor/
    dsse/<bundleSha256>.json
    proof/<rekorUuid>.json
```

### 7.2 ILM classes

* **`short`**: working artifacts (diffs, queues); TTL 7–14 days.
* **`default`**: SBOMs & indexes; TTL 90–180 days (configurable).
* **`compliance`**: signed reports & attested exports; retention enforced via RustFS hold or S3 Object Lock (governance/compliance), 1–7 years.

### 7.3 Artifact Lifecycle Controller (ALC)

* A background worker (part of Scanner.WebService) enforces **TTL** and **reference counting**:

  * Artifacts referenced by **reports** or **tickets** are pinned.
  * ILM actions logged; UI shows per‑class usage & upcoming purges.

> **Migration note.** Follow `docs/modules/scanner/operations/rustfs-migration.md` when transitioning existing
> MinIO buckets to RustFS. The provided migrator is idempotent and safe to rerun per prefix.

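The combination of TTL and reference counting described in §7.2 and §7.3 amounts to a small decision rule. The sketch below is illustrative only: the `should_purge` name is hypothetical, the TTL values are one pick from each documented range, and `compliance` artifacts are assumed to be governed solely by the storage-level hold.

```python
from datetime import datetime, timedelta

# Assumption: one value picked from each documented range; compliance is left to
# the storage-level hold and is never purged by this sketch.
TTL_DAYS = {"short": 14, "default": 180}


def should_purge(ilm_class: str, created: datetime, ref_count: int, now: datetime) -> bool:
    """Purge only unreferenced artifacts whose class TTL has elapsed."""
    if ref_count > 0 or ilm_class not in TTL_DAYS:
        return False  # pinned by a report/ticket, or not TTL-managed
    return now - created > timedelta(days=TTL_DAYS[ilm_class])
```
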
### 7.4 Mongo retention

* **Scanner**: `runtime.events` use TTL (e.g., 30–90 days); **catalog** permanent.
* **Concelier/Excititor**: raw docs keep **last N windows**; canonical stores permanent.
* **Attestor**: `entries` permanent; `dedupe` TTL 24–48h.

### 7.5 Mongo server baseline

* **Minimum supported server:** MongoDB **4.2+**. Driver 3.5.0 removes compatibility shims for 4.0; upstream has already announced 4.0 support will be dropped in upcoming C# driver releases.
* **Deploy images:** Compose/Helm defaults stay on `mongo:7.x`. For air-gapped installs, refresh Offline Kit bundles so the packaged `mongod` matches ≥4.2.
* **Upgrade guard:** During rollout, verify replica sets reach FCV `4.2` or above before swapping binaries; automation should hard-stop if FCV is <4.2.

---

## 8) Observability & SLOs (operations)

* **Uptime SLO**: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Concelier 99.0%.
* **Error budgets**: tracked per month; dashboards show burn rates.
* **Golden signals**:

  * **Latency**: token issuance, sign→attest round‑trip, scan enqueue→emit, export build.
  * **Saturation**: queue depth, Mongo write IOPS, RustFS throughput / queue depth (or S3 metrics when in fallback mode).
  * **Traffic**: scans/min, attestations/min, webhook admits/min.
  * **Errors**: 5xx rates, cosign verification failures, Rekor timeouts.

Prometheus + OTLP; Grafana dashboards ship in the charts.

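To make the uptime SLOs and monthly error budgets concrete: over a 30-day window, 99.9% allows about 43.2 minutes of downtime, 99.5% allows 216 minutes, and 99.0% allows 432 minutes. The sketch below reproduces that arithmetic and a simple burn rate; the function names and the 30-day window are assumptions, not the dashboards' actual definitions.

```python
def monthly_error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a window of `days` days."""
    return (1 - slo_percent / 100) * days * 24 * 60


def burn_rate(bad_minutes: float, elapsed_days: float,
              slo_percent: float, days: int = 30) -> float:
    """Budget consumed relative to an even spend; 1.0 means exactly on budget."""
    budget = monthly_error_budget_minutes(slo_percent, days)
    return (bad_minutes / budget) / (elapsed_days / days)
```

A burn rate of 2.0 a quarter of the way through the month means half the budget is already gone, which is the kind of signal a burn-rate dashboard is meant to surface early.
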
---

## 9) Security & compliance operations

* **Key rotation**:

  * Authority JWKS: 60‑day cadence, dual‑key overlap.
  * Release signing identities: rotate per minor or quarterly.
  * Sigstore roots mirrored and pinned; alarms on drift.

* **FIPS mode** (Gov build):

  * Enforce `ES256` + KMS/HSM; disable Ed25519; MLS ciphers only.
  * Local **Rekor v2** and **Fulcio** alternatives; **air‑gapped** CA.

* **Vulnerability response**:

  * Concelier red-flag advisories trigger accelerated **stable** patch rollout; UI/CLI “security patch available” notice.
  * 2025-10: Pinned `MongoDB.Driver` **3.5.0** and `SharpCompress` **0.41.0** across services (DEVOPS-SEC-10-301) to eliminate NU1902/NU1903 warnings surfaced during scanner cache/worker test runs; repacked the local `Mongo2Go` feed so test fixtures inherit the patched dependencies; future bumps follow the same central override pattern.

* **Backups/DR**:

  * Mongo nightly snapshots; MinIO versioning + replication (if configured).
  * Restore runbooks tested quarterly with synthetic data.

---

## 10) Customer update flow (how versions are fetched & activated)

### 10.1 Online clusters

* **UI** surfaces update banner with **release manifest** diff and risk notes.
* Operator approves → **Controller** pulls new images by digest; health‑checks; moves traffic; deprecates old revision.
* Post‑switch, **schema Phase B** migrations (if any) run automatically.

### 10.2 Air‑gapped clusters

* Operator downloads **offline kit** from a mirror → `stellaops offline kit import`.
* Controller validates bundle checksums and **cosign signatures**; applies charts/compose by digest.
* After install, **verify** page shows green checks: image sigs, SBOMs attached, provenance logged.

### 10.3 CLI self‑update (optional)

* `stellaops self-update` pulls a **signed release manifest** and verifies the **CLI binary** with cosign before swapping (admin can disable).

---

## 11) Compatibility & deprecation policy

* **APIs** are stable within a **major**; breaking changes imply **MAJOR++** and deprecation period of one minor.
* **Storage**: expand/contract; “drop old fields” only after one minor grace.
* **Config**: feature flags (default off) for risky features (e.g., eBPF).

---

## 12) Runbooks (selected)

### 12.1 Lost PoE

1. Suspend **automatic attestation** jobs.
2. Use CLI `stellaops signer status` to confirm `entitlement_denied`.
3. Obtain new PoE from portal; verify on Signer `/poe/verify`.
4. Re‑enable; optionally **re‑sign** last N reports (UI button → batch).

### 12.2 Rekor outage (self‑hosted)

* Attestor returns `202 (pending)` with queued proof fetch.
* Keep DSSE bundles locally; re‑submit on schedule; UI badge shows **Pending**.
* If outage > SLA, you can switch to a **mirror** log in config; Attestor writes to both when restored.

### 12.3 Emergency downgrade

* Identify prior release manifest (UI → Admin → Releases).
* `helm rollback stella <revision>` (or compose apply previous file).
* Services tolerate skew per §1.3; ensure **Signer/Authority/Attestor** are rolled together.

---

## 13) Example: cluster bootstrap (Compose)

```yaml
version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__S3__ENDPOINT=http://minio:9000
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  mongo:
    image: mongo:7
  minio:
    image: minio/minio:RELEASE.2025-07-10T00-00-00Z
```

---

## 14) Governance & keys (who owns the trust root)

* **Release key policy**: only the Release Engineering group can push signed releases; 4‑eyes approval; TUF‑style manifest possible in future.
* **Signer acceptance policy**: embedded release identities are updated **only** via minor upgrade; emergency CRL supported.
* **Customer keys**: none needed for core use; enterprise add‑ons may require per‑customer registries and keys.

---

## 15) Roadmap (Ops)

* **Windows containers GA** (Scanner + Zastava).
* **Key Transparency** for Signer certs.
* **Delta‑kit** (offline) for incremental updates.
* **Operator CRDs** (K8s) to manage policy and ILM declaratively.
* **SBOM protobuf** as default transport at rest (smaller, faster).

---

### Appendix A — Minimal SLO monitors

* `authority.tokens_issued_total` slope ≈ normal.
* `signer.requests_total{result="success"}/minute` > 0 (when scans occur).
* `attestor.submit_latency_seconds{quantile="0.95"}` < 0.3.
* `scanner.scan_latency_seconds{quantile="0.95"}` < target per image size.
* `concelier.export.duration_seconds` stable; `excititor.consensus.conflicts_total` not exploding after policy changes.
* RustFS request error rate near zero (or `s3_requests_errors_total` when operating against S3); Mongo `opcounters` hit expected baseline.

### Appendix B — Upgrade safety checklist

* Verify **release manifest** signature.
* Ensure **Signer/Authority/Attestor** are same minor.
* Verify **DB backups** < 24h old.
* Confirm **ILM** won’t purge compliance artifacts during upgrade window.
* Roll **one component** at a time; watch SLOs; abort on regression.

---

**End — component_architecture_devops.md**
22
docs/modules/devops/implementation_plan.md
Normal file
@@ -0,0 +1,22 @@
# Implementation plan — DevOps

## Current objectives
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.

## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.

## Epic milestones
- **Epic 1 – AOC enforcement:** ensure CI/CD guardrails, schema validation, and verifier pipelines are enforced.
- **Epic 9 – Orchestrator Dashboard:** deliver dashboards, recovery runbooks, and rate-limit governance.
- **Epic 10 – Export Center:** manage signing/promotions and Offline Kit bundle publishing.
- **Epic 15 – Observability & Forensics:** coordinate telemetry deployments, evidence retention, and forensic automation.
- Track module runbooks (DEVOPS-LAUNCH-18-001/900) and telemetry automation via ../../TASKS.md and ops/devops/TASKS.md.

## Coordination
- Review ./AGENTS.md before picking up new work.
- Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md.
- Update this plan whenever scope, dependencies, or guardrails change.
50
docs/modules/devops/migrations/semver-style.md
Normal file
@@ -0,0 +1,50 @@
|
||||
# SemVer Style Backfill Runbook
|
||||
|
||||
_Last updated: 2025-10-11_
|
||||
|
||||
## Overview
|
||||
|
||||
The SemVer style migration populates the new `normalizedVersions` field on advisory documents and ensures
|
||||
provenance `decisionReason` values are preserved during future reads. The migration is idempotent and only
|
||||
runs when the feature flag `concelier:storage:enableSemVerStyle` is enabled.
|
||||
|
||||
## Preconditions
|
||||
|
||||
1. **Review configuration** – set `concelier.storage.enableSemVerStyle` to `true` on all Concelier services.
|
||||
2. **Confirm batch size** – adjust `concelier.storage.backfillBatchSize` if you need smaller batches for older
|
||||
deployments (default: `250`).
|
||||
3. **Back up** – capture a fresh snapshot of the `advisory` collection or a full MongoDB backup.
|
||||
4. **Staging dry-run** – enable the flag in a staging environment and observe the migration output before
|
||||
rolling to production.
|
||||
|
||||
## Execution
|
||||
|
||||
No manual command is required. After deploying the configuration change, restart the Concelier WebService or
|
||||
any component that hosts the Mongo migration runner. During startup you will see log entries similar to:
|
||||
|
||||
```
|
||||
Applying Mongo migration 20251011-semver-style-backfill: Populate advisory.normalizedVersions for existing documents when SemVer style storage is enabled.
|
||||
Mongo migration 20251011-semver-style-backfill applied
|
||||
```
|
||||
|
||||
The migration reads advisories in batches (`concelier.storage.backfillBatchSize`) and writes flattened
|
||||
`normalizedVersions` arrays. Existing documents without SemVer ranges remain untouched.
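
One way to confirm the backfill ran is to scan the captured startup log for the completion line shown above. The following is a minimal sketch, assuming the log is piped in on stdin; the function name `semver_backfill_applied` is illustrative and not part of the Concelier tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the startup log on stdin contains the
# completion line for the SemVer style backfill migration.
semver_backfill_applied() {
  grep -qF 'Mongo migration 20251011-semver-style-backfill applied'
}

# Example (the log source is an assumption; adapt it to your log collector):
#   docker compose logs concelier | semver_backfill_applied && echo "backfill applied"
```

Matching the literal completion line, rather than the `Applying ...` line, avoids reporting success for a migration that started but did not finish.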
|
||||
|
||||
## Post-checks
|
||||
|
||||
1. Verify the new indexes exist:
|
||||
```
|
||||
db.advisory.getIndexes()
|
||||
```
|
||||
You should see `advisory_normalizedVersions_pkg_scheme_type` and `advisory_normalizedVersions_value`.
|
||||
2. Spot check a few advisories to confirm the top-level `normalizedVersions` array exists and matches
|
||||
the embedded package data.
|
||||
3. Run `dotnet test` for `StellaOps.Concelier.Storage.Mongo.Tests` (optional but recommended) in CI to confirm
|
||||
the storage suite passes with the feature flag enabled.
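
The index check in step 1 can be scripted so it fails loudly in automation. This sketch assumes index names arrive one per line on stdin; the helper name and the `MONGO_URI` variable in the usage comment are assumptions, not part of the runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads index names (one per line) on stdin and prints
# every expected SemVer style index that is absent. Returns non-zero when at
# least one expected index is missing.
missing_semver_indexes() {
  local names missing=0 expected
  names="$(cat)"
  for expected in \
    advisory_normalizedVersions_pkg_scheme_type \
    advisory_normalizedVersions_value; do
    if ! printf '%s\n' "$names" | grep -qxF "$expected"; then
      echo "$expected"
      missing=1
    fi
  done
  return "$missing"
}

# Example (connection string variable is an assumption):
#   mongosh "$MONGO_URI" --quiet \
#     --eval 'db.advisory.getIndexes().forEach(i => print(i.name))' \
#     | missing_semver_indexes
```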
|
||||
|
||||
## Rollback
|
||||
|
||||
Set `concelier.storage.enableSemVerStyle` back to `false` and redeploy. The migration will be skipped on
|
||||
subsequent startups. You can leave the populated `normalizedVersions` arrays in place; they are ignored when
|
||||
the feature flag is off. If you must remove them entirely, restore from the backup captured during
|
||||
the Preconditions step (step 3).
|
||||
151
docs/modules/devops/runbooks/deployment-upgrade.md
Normal file
@@ -0,0 +1,151 @@
|
||||
# Stella Ops Deployment Upgrade & Rollback Runbook
|
||||
|
||||
_Last updated: 2025-10-26 (Sprint 14 – DEVOPS-OPS-14-003)._
|
||||
|
||||
This runbook describes how to promote a new release across the supported deployment profiles (Helm and Docker Compose), how to roll back safely, and how to keep channels (`edge`, `stable`, `airgap`) aligned. All steps assume you are working from a clean checkout of the release branch/tag.
|
||||
|
||||
---
|
||||
|
||||
## 1. Channel overview
|
||||
|
||||
| Channel | Release manifest | Helm values | Compose profile |
|
||||
|---------|------------------|-------------|-----------------|
|
||||
| `edge` | `deploy/releases/2025.10-edge.yaml` | `deploy/helm/stellaops/values-dev.yaml` | `deploy/compose/docker-compose.dev.yaml` |
|
||||
| `stable` | `deploy/releases/2025.09-stable.yaml` | `deploy/helm/stellaops/values-stage.yaml`, `deploy/helm/stellaops/values-prod.yaml` | `deploy/compose/docker-compose.stage.yaml`, `deploy/compose/docker-compose.prod.yaml` |
|
||||
| `airgap` | `deploy/releases/2025.09-airgap.yaml` | `deploy/helm/stellaops/values-airgap.yaml` | `deploy/compose/docker-compose.airgap.yaml` |
|
||||
|
||||
Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set.
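
The channel table can also be expressed as a small lookup, which is convenient when looping the alignment checker over every profile of a channel. This is only a sketch: the function name `channel_files` is hypothetical, and the paths are copied from the table as of this revision.

```bash
#!/usr/bin/env bash
# Hypothetical lookup mirroring the channel table: prints the release manifest
# on the first line, followed by one deployment profile per line.
channel_files() {
  case "$1" in
    edge)
      printf '%s\n' \
        deploy/releases/2025.10-edge.yaml \
        deploy/helm/stellaops/values-dev.yaml \
        deploy/compose/docker-compose.dev.yaml ;;
    stable)
      printf '%s\n' \
        deploy/releases/2025.09-stable.yaml \
        deploy/helm/stellaops/values-stage.yaml \
        deploy/helm/stellaops/values-prod.yaml \
        deploy/compose/docker-compose.stage.yaml \
        deploy/compose/docker-compose.prod.yaml ;;
    airgap)
      printf '%s\n' \
        deploy/releases/2025.09-airgap.yaml \
        deploy/helm/stellaops/values-airgap.yaml \
        deploy/compose/docker-compose.airgap.yaml ;;
    *)
      echo "unknown channel: $1" >&2
      return 1 ;;
  esac
}

# Example: list only the deployment profiles for the stable channel.
#   channel_files stable | tail -n +2
```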
|
||||
|
||||
---
|
||||
|
||||
## 2. Pre-flight checklist
|
||||
|
||||
1. **Refresh release manifest**
|
||||
Pull the latest manifest for the channel you are promoting (`deploy/releases/<version>-<channel>.yaml`).
|
||||
|
||||
2. **Align deployment bundles with the manifest**
|
||||
Run the alignment checker for every profile that should pick up the release. Pass `--ignore-repo nats` to skip auxiliary services.
|
||||
```bash
|
||||
./deploy/tools/check-channel-alignment.py \
|
||||
--release deploy/releases/2025.10-edge.yaml \
|
||||
--target deploy/helm/stellaops/values-dev.yaml \
|
||||
--target deploy/compose/docker-compose.dev.yaml \
|
||||
--ignore-repo nats
|
||||
```
|
||||
Repeat for other channels (`stable`, `airgap`), substituting the manifest and target files.
|
||||
|
||||
3. **Lint and template profiles**
|
||||
```bash
|
||||
./deploy/tools/validate-profiles.sh
|
||||
```
|
||||
|
||||
4. **Smoke the Offline Kit debug store (edge/stable only)**
|
||||
When the release pipeline has generated `out/release/debug/.build-id/**`, mirror the assets into the Offline Kit staging tree:
|
||||
```bash
|
||||
./ops/offline-kit/mirror_debug_store.py \
|
||||
--release-dir out/release \
|
||||
--offline-kit-dir out/offline-kit
|
||||
```
|
||||
Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle.
|
||||
|
||||
5. **Review compatibility matrix**
|
||||
Confirm MongoDB, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `mongo@sha256:c258…`, `minio@sha256:14ce…`, `rustfs:2025.10.0-edge`.
|
||||
|
||||
6. **Create a rollback bookmark**
|
||||
Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes.
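
The rollback bookmark from step 6 can be captured in a small file so the values are on hand during an incident. The sketch below is illustrative only: the helper name, the file layout, and the use of `jq` in the usage comment are assumptions rather than part of the release tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: writes a rollback bookmark with the Helm revision and
# Git ref that were live before the upgrade. Arguments: revision, ref, file.
write_rollback_bookmark() {
  local helm_revision="$1" git_ref="$2" out_file="$3"
  {
    echo "recorded_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "helm_revision=${helm_revision}"
    echo "git_ref=${git_ref}"
  } > "$out_file"
}

# Example (requires helm, jq, and git on the operator host):
#   rev="$(helm history stellaops -n stellaops -o json | jq -r '.[-1].revision')"
#   write_rollback_bookmark "$rev" "$(git describe --tags)" rollback-bookmark.env
```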
|
||||
|
||||
---
|
||||
|
||||
## 3. Helm upgrade procedure (staging → production)
|
||||
|
||||
1. Switch to the deployment branch and ensure secrets/config maps are current.
|
||||
2. Apply the upgrade in the staging cluster:
|
||||
```bash
|
||||
helm upgrade stellaops deploy/helm/stellaops \
|
||||
-f deploy/helm/stellaops/values-stage.yaml \
|
||||
--namespace stellaops \
|
||||
--atomic \
|
||||
--timeout 15m
|
||||
```
|
||||
3. Run smoke tests (`scripts/smoke-tests.sh` or environment-specific checks).
|
||||
4. Promote to production using the prod values file and the same command.
|
||||
5. Record the new revision number and Git SHA in the change log.
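
Because staging and production use the same command with a different values file, the upgrade can be wrapped in one function. This is a sketch under stated assumptions: `helm_upgrade_env` and the `HELM` override variable are hypothetical conveniences, and the flags are the ones shown in step 2.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: runs the documented upgrade for one environment
# ("stage" or "prod"). Set HELM=echo to print the arguments instead of
# invoking helm.
helm_upgrade_env() {
  local env="$1"
  case "$env" in
    stage|prod) ;;
    *) echo "usage: helm_upgrade_env stage|prod" >&2; return 2 ;;
  esac
  "${HELM:-helm}" upgrade stellaops deploy/helm/stellaops \
    -f "deploy/helm/stellaops/values-${env}.yaml" \
    --namespace stellaops \
    --atomic \
    --timeout 15m
}

# Example: preview the production command without touching the cluster.
#   HELM=echo helm_upgrade_env prod
```

Restricting the argument to `stage` and `prod` keeps a typo from silently selecting a values file that does not exist.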
|
||||
|
||||
### Rollback (Helm)
|
||||
|
||||
1. Identify the previous revision: `helm history stellaops -n stellaops`.
|
||||
2. Execute:
|
||||
```bash
|
||||
helm rollback stellaops <revision> \
|
||||
--namespace stellaops \
|
||||
--wait \
|
||||
--timeout 10m
|
||||
```
|
||||
3. Verify `kubectl get pods` returns healthy workloads; rerun smoke tests.
|
||||
4. Update the incident/operations log with root cause and rollback details.
|
||||
|
||||
---
|
||||
|
||||
## 4. Docker Compose upgrade procedure
|
||||
|
||||
1. Update environment files (`deploy/compose/env/*.env.example`) with any new settings and sync secrets to hosts.
|
||||
2. Pull the tagged repository state corresponding to the release (e.g. `git checkout 2025.09.2` for stable).
|
||||
3. Apply the upgrade:
|
||||
```bash
|
||||
docker compose \
|
||||
--env-file deploy/compose/env/prod.env \
|
||||
-f deploy/compose/docker-compose.prod.yaml \
|
||||
pull
|
||||
|
||||
docker compose \
|
||||
--env-file deploy/compose/env/prod.env \
|
||||
-f deploy/compose/docker-compose.prod.yaml \
|
||||
up -d
|
||||
```
|
||||
4. Tail logs for critical services (`docker compose logs -f authority concelier`).
|
||||
5. Check monitoring dashboards/alerts to confirm normal operation.
|
||||
|
||||
### Rollback (Compose)
|
||||
|
||||
1. Check out the previous release tag (e.g. `git checkout 2025.09.1`).
|
||||
2. Re-run `docker compose pull` and `docker compose up -d` with that profile. Docker will restore the prior digests.
|
||||
3. If reverting to a known-good snapshot is required, restore volume backups (see `docs/modules/authority/operations/backup-restore.md` and associated service guides).
|
||||
4. Log the rollback in the operations journal.
|
||||
|
||||
---
|
||||
|
||||
## 5. Channel promotion workflow
|
||||
|
||||
1. Author or update the channel manifest under `deploy/releases/`.
|
||||
2. Mirror the new digests into Helm/Compose values and run the alignment script for each profile.
|
||||
3. Commit the changes with a message that references the release version and channel (e.g. `deploy: promote 2025.10.0-edge`).
|
||||
4. Publish release notes and update `deploy/releases/README.md` (if applicable).
|
||||
5. Tag the repository when promoting stable or airgap builds.
|
||||
|
||||
---
|
||||
|
||||
## 6. Upgrade rehearsal & rollback drill log
|
||||
|
||||
Maintain rehearsal notes in `docs/modules/devops/runbooks/launch-cutover.md` or the relevant sprint planning document. After each drill capture:
|
||||
|
||||
- Release version tested
|
||||
- Date/time
|
||||
- Participants
|
||||
- Issues encountered & fixes
|
||||
- Rollback duration (if executed)
|
||||
|
||||
Attach the log to the sprint retro or operational wiki.
|
||||
|
||||
| Date (UTC) | Channel | Outcome | Notes |
|
||||
|------------|---------|---------|-------|
|
||||
| 2025-10-26 | Documentation dry-run | Planned | Runbook refreshed; next live drill scheduled for 2025-11 edge → stable promotion. |
|
||||
|
||||
---
|
||||
|
||||
## 7. References
|
||||
|
||||
- `deploy/README.md` – structure and validation workflow for deployment bundles.
|
||||
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release automation and signing pipeline.
|
||||
- `docs/modules/devops/architecture.md` – high-level DevOps architecture, SLOs, and compliance requirements.
|
||||
- `ops/offline-kit/mirror_debug_store.py` – debug-store mirroring helper.
|
||||
- `deploy/tools/check-channel-alignment.py` – release vs deployment digest alignment checker.
|
||||
128
docs/modules/devops/runbooks/launch-cutover.md
Normal file
@@ -0,0 +1,128 @@
|
||||
# Launch Cutover Runbook - Stella Ops
|
||||
|
||||
_Document owner: DevOps Guild (2025-10-26)_
|
||||
_Scope:_ Full-platform launch from staging to production for release `2025.09.2`.
|
||||
|
||||
## 1. Roles and Communication
|
||||
|
||||
| Role | Primary | Backup | Contact |
|
||||
| --- | --- | --- | --- |
|
||||
| Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) |
|
||||
| Authority stack | Authority Core guild rep | Security guild rep | `#authority` |
|
||||
| Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` |
|
||||
| Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation |
|
||||
| Observability | Telemetry guild rep | SRE on-call | `#telemetry` |
|
||||
| Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket |
|
||||
|
||||
Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes.
|
||||
|
||||
## 2. Timeline Overview (UTC)
|
||||
|
||||
| Time | Activity | Owner |
|
||||
| --- | --- | --- |
|
||||
| T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead |
|
||||
| T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer |
|
||||
| T-6h | Freeze non-launch deployments; notify guild leads. | Product owner |
|
||||
| T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps |
|
||||
| T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead |
|
||||
| T0 | Execute production cutover steps (Section 4). | Cutover team |
|
||||
| T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead |
|
||||
| T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner |
|
||||
|
||||
## 3. Rehearsal (Staging) Checklist
|
||||
|
||||
1. `docker network create stellaops_frontdoor || true` (if not present on staging jump host).
|
||||
2. Run `deploy/tools/validate-profiles.sh` and archive output.
|
||||
3. Apply staging secrets (`kubectl apply -f secrets/stage/*.yaml` or `helm secrets upgrade`) ensuring `stellaops-stage` credentials align with `values-stage.yaml`.
|
||||
4. Perform `helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml` in staging cluster.
|
||||
5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`.
|
||||
6. Execute smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm report status in UI.
|
||||
7. Document total wall time and any deviations in the rehearsal log.
|
||||
|
||||
Rehearsal must complete without manual interventions before proceeding to production.
|
||||
|
||||
## 4. Production Cutover Steps
|
||||
|
||||
### 4.1 Pre-flight
|
||||
- Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`.
|
||||
- Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host.
|
||||
- Back up current configuration and data:
|
||||
- Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`.
|
||||
- MinIO bucket mirror: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`.
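
The two backup commands compute their timestamps separately, so the Mongo dump and the MinIO mirror can end up with different suffixes. One option, sketched here with a hypothetical `backup_label` helper, is to generate a single UTC label and reuse it for both.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a UTC timestamp without colons, so the same
# label works for directory names and MinIO prefixes.
backup_label() {
  date -u +%Y%m%dT%H%M%SZ
}

# Example (paths and aliases follow the commands in the list above):
#   label="$(backup_label)"
#   mongodump --uri "$MONGO_BACKUP_URI" --out "/backups/launch-${label}"
#   mc mirror --overwrite minio/stellaops "minio-backup/stellaops-${label}"
```

Recording the label in the change ticket gives the rollback steps one value to substitute for `<timestamp>`.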
|
||||
|
||||
### 4.2 Apply Updates (Compose)
|
||||
1. On each compose node, pull updated images for release `2025.09.2`:
|
||||
```bash
|
||||
docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull
|
||||
```
|
||||
2. Deploy changes:
|
||||
```bash
|
||||
docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d
|
||||
```
|
||||
3. Confirm containers healthy via `docker compose ps` and `docker logs <service> --tail 50`.
|
||||
|
||||
### 4.3 Apply Updates (Helm/Kubernetes)
|
||||
If using Kubernetes, perform:
|
||||
```bash
|
||||
helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --namespace stellaops --atomic --timeout 15m
|
||||
```
|
||||
Monitor rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service> -n stellaops`.
|
||||
|
||||
### 4.4 Configuration Validation
|
||||
- Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`.
|
||||
- Validate Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`.
|
||||
- Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (returns success).
|
||||
- Ensure Notify (legacy) is still accessible while the Notifier migration is pending.
|
||||
|
||||
## 5. Smoke Tests
|
||||
|
||||
| Test | Command / Action | Expected Result |
|
||||
| --- | --- | --- |
|
||||
| API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` |
|
||||
| Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes < 5 minutes; report accessible with signed DSSE |
|
||||
| Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` |
|
||||
| Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata |
|
||||
| Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` |
|
||||
| Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent |
|
||||
|
||||
Log results in the change ticket with timestamps and screenshots where applicable.
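
The API health row can be turned into a pass/fail check. This sketch assumes the health endpoint returns a JSON body containing a `status` field, as the expected result in the table suggests; the helper name and the `SCANNER_URL` variable are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the JSON body on stdin reports
# "status":"Healthy" (whitespace around the colon is tolerated).
body_is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"Healthy"'
}

# Example (SCANNER_URL is a placeholder for the real production host):
#   curl -fsS "$SCANNER_URL/healthz" | body_is_healthy || echo "scanner unhealthy"
```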
|
||||
|
||||
## 6. Rollback Procedure
|
||||
|
||||
1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
|
||||
2. For Compose:
|
||||
```bash
|
||||
docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down
|
||||
docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d
|
||||
```
|
||||
3. For Helm:
|
||||
```bash
|
||||
helm rollback stellaops <previous-release-number> --namespace stellaops
|
||||
```
|
||||
4. Restore Mongo snapshot if data inconsistency detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`.
|
||||
5. Restore MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`.
|
||||
6. Notify stakeholders of rollback and capture root cause notes in incident ticket.
|
||||
|
||||
## 7. Post-cutover Actions
|
||||
|
||||
- Keep heightened monitoring for 4 hours post cutover; track latency, error rates, and queue depth.
|
||||
- Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
|
||||
- Update `docs/modules/devops/runbooks/launch-readiness.md` if any new gaps or follow-ups are discovered.
|
||||
- Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner.
|
||||
|
||||
## 8. Approval Matrix
|
||||
|
||||
| Step | Required Approvers | Record Location |
|
||||
| --- | --- | --- |
|
||||
| Production deployment plan | CTO + DevOps lead | Change ticket comment |
|
||||
| Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary |
|
||||
| Post-smoke success | DevOps lead + product owner | Change ticket closure |
|
||||
| Rollback (if invoked) | DevOps lead + CTO | Incident ticket |
|
||||
|
||||
Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.
|
||||
|
||||
## 9. Rehearsal Log
|
||||
|
||||
| Date (UTC) | What We Exercised | Outcome | Follow-up |
|
||||
| --- | --- | --- | --- |
|
||||
| 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. |
|
||||
49
docs/modules/devops/runbooks/launch-readiness.md
Normal file
@@ -0,0 +1,49 @@
|
||||
# Launch Readiness Record - Stella Ops
|
||||
|
||||
_Updated: 2025-10-26 (UTC)_
|
||||
|
||||
This document captures production launch sign-offs, deployment readiness checkpoints, and any open risks that must be tracked before GA cutover.
|
||||
|
||||
## 1. Sign-off Summary
|
||||
|
||||
| Module / Service | Guild / Point of Contact | Evidence (Task or Runbook) | Status | Timestamp (UTC) | Notes |
|
||||
| --- | --- | --- | --- | --- | --- |
|
||||
| Authority (Issuer) | Authority Core Guild | `AUTH-AOC-19-001` - scope issuance & configuration complete (DONE 2025-10-26) | READY | 2025-10-26T14:05Z | Tenant scope propagation follow-up (`AUTH-AOC-19-002`) tracked in gaps section. |
|
||||
| Signer | Signer Guild | `SIGNER-API-11-101` / `SIGNER-REF-11-102` / `SIGNER-QUOTA-11-103` (DONE 2025-10-21) | READY | 2025-10-26T14:07Z | DSSE signing, referrer verification, and quota enforcement validated in CI. |
|
||||
| Attestor | Attestor Guild | `ATTESTOR-API-11-201` / `ATTESTOR-VERIFY-11-202` / `ATTESTOR-OBS-11-203` (DONE 2025-10-19) | READY | 2025-10-26T14:10Z | Rekor submission/verification pipeline green; telemetry pack published. |
|
||||
| Scanner Web + Worker | Scanner WebService Guild | `SCANNER-WEB-09-10x`, `SCANNER-RUNTIME-12-30x` (DONE 2025-10-18 -> 2025-10-24) | READY* | 2025-10-26T14:20Z | Orchestrator envelope work (`SCANNER-EVENTS-16-301/302`) still open; see gaps. |
|
||||
| Concelier Core & Connectors | Concelier Core / Ops Guild | Ops runbook sign-off in `docs/modules/concelier/operations/conflict-resolution.md` (2025-10-16) | READY | 2025-10-26T14:25Z | Conflict resolution & connector coverage accepted; Mongo schema hardening pending (see gaps). |
|
||||
| Excititor API | Excititor Core Guild | Wave 0 connector ingest sign-offs (EXECPLAN.Section Wave 0) | READY | 2025-10-26T14:28Z | VEX linkset publishing complete for launch datasets. |
|
||||
| Notify Web (legacy) | Notify Guild | Existing stack carried forward; Notifier program tracked separately (Sprint 38-40) | PENDING | 2025-10-26T14:32Z | Legacy notify web remains operational; migration to Notifier blocked on `SCANNER-EVENTS-16-301`. |
|
||||
| Web UI | UI Guild | Stable build `registry.stella-ops.org/.../web-ui@sha256:10d9248...` deployed in stage and smoke-tested | READY | 2025-10-26T14:35Z | Policy editor GA items (Sprint 20) outside launch scope. |
|
||||
| DevOps / Release | DevOps Guild | `deploy/tools/validate-profiles.sh` run (2025-10-26) covering dev/stage/prod/airgap/mirror | READY | 2025-10-26T15:02Z | Compose/Helm lint + docker compose config validated; see Section 2 for details. |
|
||||
| Offline Kit | Offline Kit Guild | `DEVOPS-OFFLINE-18-004` (Go analyzer) and `DEVOPS-OFFLINE-18-005` (Python analyzer) complete; debug-store mirror pending (`DEVOPS-OFFLINE-17-004`). | PENDING | 2025-10-26T15:05Z | Awaiting release debug artefacts to finalise `DEVOPS-OFFLINE-17-004`; tracked in Section 3. |
|
||||
|
||||
_\* READY with caveat - remaining work noted in Section 3._
|
||||
|
||||
## 2. Deployment Readiness Checklist
|
||||
|
||||
- **Production profiles committed:** `deploy/compose/docker-compose.prod.yaml` and `deploy/helm/stellaops/values-prod.yaml` added with front-door network hand-off and secret references for Mongo/MinIO/core services.
|
||||
- **Secrets placeholders documented:** `deploy/compose/env/prod.env.example` enumerates required credentials (`MONGO_INITDB_ROOT_PASSWORD`, `MINIO_ROOT_PASSWORD`, Redis/NATS endpoints, `FRONTDOOR_NETWORK`). Helm values reference Kubernetes secrets (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`).
|
||||
- **Static validation executed:** `deploy/tools/validate-profiles.sh` run on 2025-10-26 (docker compose config + helm lint/template) with all profiles passing.
|
||||
- **Ingress model defined:** Production compose profile introduces external `frontdoor` network; README updated with creation instructions and scope of externally reachable services.
|
||||
- **Observability hooks:** Authority/Signer/Attestor telemetry packs verified; scanner runtime build-id metrics landed (`SCANNER-RUNTIME-17-401`). Grafana dashboards referenced in component runbooks.
|
||||
- **Rollback assets:** Stage Compose profile remains aligned (`docker-compose.stage.yaml`), enabling rehearsals before prod cutover; release manifests (`deploy/releases/2025.09-stable.yaml`) map digests for reproducible rollback.
|
||||
- **Rehearsal status:** 2025-10-26 validation dry-run executed (`deploy/tools/validate-profiles.sh` across dev/stage/prod/airgap/mirror). Full stage Helm rollout pending access to the managed staging cluster; target to complete once credentials are provisioned.
|
||||
|
||||
## 3. Outstanding Gaps & Follow-ups
|
||||
|
||||
| Item | Owner | Tracking Ref | Target / Next Step | Impact |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| Tenant scope propagation and audit coverage | Authority Core Guild | `AUTH-AOC-19-002` (DOING 2025-10-26) | Land enforcement + audit fixtures by Sprint 19 freeze | Medium - required for multi-tenant GA but does not block initial cutover if tenants scoped manually. |
|
||||
| Orchestrator event envelopes + Notifier handshake | Scanner WebService Guild | `SCANNER-EVENTS-16-301` (BLOCKED), `SCANNER-EVENTS-16-302` (DOING) | Coordinate with Gateway/Notifier owners on preview package replacement or binding redirects; rerun `dotnet test` once patch lands and refresh schema docs. Share envelope samples in `docs/events/` after tests pass. | High - gating Notifier migration; legacy notify path remains functional meanwhile. |
|
||||
| Offline Kit Python analyzer bundle | Offline Kit Guild + Scanner Guild | `DEVOPS-OFFLINE-18-005` (DONE 2025-10-26) | Monitor for follow-up manifest updates and rerun smoke script when analyzers change. | Medium - ensures language analyzer coverage stays current for offline installs. |
|
||||
| Offline Kit debug store mirror | Offline Kit Guild + DevOps Guild | `DEVOPS-OFFLINE-17-004` (BLOCKED 2025-10-26) | Release pipeline must publish `out/release/debug` artefacts; once available, run `mirror_debug_store.py` and commit `metadata/debug-store.json`. | Low - symbol lookup remains accessible from staging assets but required before next Offline Kit tag. |
|
||||
| Mongo schema validators for advisory ingestion | Concelier Storage Guild | `CONCELIER-STORE-AOC-19-001` (TODO) | Finalize JSON schema + migration toggles; coordinate with Ops for rollout window | Low - current validation handled in app layer; schema guard adds defense-in-depth. |
|
||||
| Authority plugin telemetry alignment | Security Guild | `SEC2.PLG`, `SEC3.PLG`, `SEC5.PLG` (BLOCKED pending AUTH DPoP/MTLS tasks) | Resume once upstream auth surfacing stabilises | Low - plugin remains optional; launch uses default Authority configuration. |
|
||||
|
||||
## 4. Approvals & Distribution
|
||||
|
||||
- Record shared in `#launch-readiness` (Mattermost) 2025-10-26 15:15 UTC with DevOps + Guild leads for acknowledgement.
|
||||
- Updates to this document require dual sign-off from DevOps Guild (owner) and impacted module guild lead; retain change log via Git history.
|
||||
- Cutover rehearsal and rollback drills are tracked separately in `docs/modules/devops/runbooks/launch-cutover.md` (see associated Task `DEVOPS-LAUNCH-18-001`).
|
||||
64
docs/modules/devops/runbooks/nuget-preview-bootstrap.md
Normal file
@@ -0,0 +1,64 @@
|
||||
# NuGet Preview Bootstrap (Offline-Friendly)
|
||||
|
||||
The StellaOps build relies on .NET 10 RC2 packages (Microsoft.Extensions.*, JwtBearer 10.0 RC).
|
||||
`NuGet.config` now wires three sources:
|
||||
|
||||
1. `local` → `./local-nuget` (preferred, air-gapped mirror)
|
||||
2. `dotnet-public` → `https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json`
|
||||
3. `nuget.org` → fallback for everything else
|
||||
|
||||
Follow the steps below whenever you refresh the repo or roll a new Offline Kit drop.
|
||||
|
||||
## 1. Mirror the preview packages
|
||||
|
||||
```bash
|
||||
./ops/devops/sync-preview-nuget.sh
|
||||
```
|
||||
|
||||
* Reads `ops/devops/nuget-preview-packages.csv`. Each line specifies the package, version, expected SHA-256 hash, and (optionally) the flat-container base URL (we pin to `dotnet-public`).
|
||||
* Downloads the `.nupkg` straight into `./local-nuget/` and re-verifies the checksum. Existing files are skipped when hashes already match.
|
||||
* Use `NUGET_V2_BASE` if you need to temporarily point at a different mirror.
|
||||
|
||||
💡 The script never mutates packages in place—if a checksum changes you will see a “SHA mismatch … refreshing” message.
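
To make the download and verification steps concrete, here is a rough sketch of the two operations described above. It is not the contents of `sync-preview-nuget.sh`: the function names are hypothetical, and `sha256sum` is assumed to be available (on macOS, `shasum -a 256` is the usual substitute). The URL shape follows the NuGet v3 flat-container layout, which uses the lower-cased package ID and version.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the real logic lives in ops/devops/sync-preview-nuget.sh.

# Builds a flat-container download URL from a base URL, package ID, and
# version. The ID and version are lower-cased, as the flat-container
# layout requires.
nupkg_url() {
  local base="${1%/}" id version
  id="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  version="$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]')"
  printf '%s/%s/%s/%s.%s.nupkg\n' "$base" "$id" "$version" "$id" "$version"
}

# Succeeds when the file's SHA-256 matches the expected hex digest
# (comparison is case-insensitive).
sha256_matches() {
  local file="$1" expected actual
  expected="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  [ "$actual" = "$expected" ]
}
```

A manifest row would then translate to one `nupkg_url` call for the download and one `sha256_matches` call before the file is accepted into `./local-nuget/`.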
|
||||
|
||||
## 2. Restore using the shared `NuGet.config`
|
||||
|
||||
From the repo root:
|
||||
|
||||
```bash
|
||||
DOTNET_NOLOGO=1 dotnet restore src/Excititor/__Libraries/StellaOps.Excititor.Connectors.Abstractions/StellaOps.Excititor.Connectors.Abstractions.csproj \
|
||||
--configfile NuGet.config
|
||||
```
|
||||
|
||||
The `packageSourceMapping` section keeps `Microsoft.Extensions.*`, `Microsoft.AspNetCore.*`, and `Microsoft.Data.Sqlite` bound to `local`/`dotnet-public`, so `dotnet restore` never has to reach out to nuget.org when mirrors are populated.
|
||||
|
||||
Before committing changes (or when wiring up a new environment) run:
|
||||
|
||||
```bash
|
||||
python3 ops/devops/validate_restore_sources.py
|
||||
```
|
||||
|
||||
The validator asserts:
|
||||
|
||||
- `NuGet.config` lists `local` → `dotnet-public` → `nuget.org` in that order.
|
||||
- `Directory.Build.props` pins `RestoreSources` so every project prioritises the local mirror.
|
||||
- No stray `NuGet.config` files shadow the repo root configuration.
|
||||
|
||||
CI executes the validator in both the `build-test-deploy` and `release` workflows,
|
||||
so regressions trip before any restore/build begins.
|
||||
|
||||
If you run fully air-gapped, remember to clear the cache between SDK upgrades:
|
||||
|
||||
```bash
|
||||
dotnet nuget locals all --clear
|
||||
```
|
||||
|
||||
## 3. Troubleshooting
|
||||
|
||||
| Symptom | Fix |
|
||||
| --- | --- |
|
||||
| `dotnet restore` still hits nuget.org for preview packages | Re-run `sync-preview-nuget.sh` to ensure the `.nupkg` exists locally, then delete `~/.nuget/packages/microsoft.extensions.*` so the resolver picks up the mirrored copy. |
|
||||
| SHA mismatch in the manifest | Update `ops/devops/nuget-preview-packages.csv` with the new version + checksum (from the feed) and re-run the sync script. |
|
||||
| Azure DevOps feed throttling | Set `DOTNET_PUBLIC_FLAT_BASE` env var and point it at your own mirrored flat-container, then add the URL to the 4th column of the manifest. |
|
||||
|
||||
Keep this doc alongside Offline Kit instructions so air-gapped operators know exactly how to refresh the mirror and verify packages before restore.
|
||||