## Pack 14: Release Run / Deployment Timeline (workflow checkpoints, logs, rollback, evidence capture, replay/verify)

This pack adds the **“run view”** that ties together everything Stella Ops promises (*promote by digest, explain every decision, evidence-backed audit, deterministic replay*) without turning reachability into a top-level area.

---

# 14.1 Menu graph (Mermaid): where “Release Run” sits in the IA

```mermaid
flowchart TD
  ROOT[Stella Ops Console] --> REL[Releases]
  ROOT --> APPR[Approvals]
  ROOT --> EVID[Evidence]
  ROOT --> OPS[Operations]
  ROOT --> RC["Release Control (ROOT)"]
  ROOT --> INT[Integrations]
  ROOT --> SEC[Security]

  REL --> REL_LIST["Releases (Promotions)"]
  REL_LIST --> PROMO_DETAIL[Promotion Detail]
  PROMO_DETAIL --> RUN_TAB[Run / Timeline]
  RUN_TAB --> STEP_DETAIL[Step Detail: logs + artifacts + evidence]
  RUN_TAB --> ROLLBACK[Rollback / Re-run]
  RUN_TAB --> SCHEDULE[Schedule / Automation]

  STEP_DETAIL -. export evidence .-> EVID
  STEP_DETAIL -. replay policy .-> EVID
  RUN_TAB -. ops health .-> OPS

  EVID --> PKT[Packets]
  EVID --> CHAIN[Proof Chains]
  EVID --> REPLAY[Replay/Verify]
  EVID --> EXPORT[Export Center]
  EVID --> BUNDLES[Evidence Bundles]

  OPS --> ORCH[Orchestrator]
  OPS --> SCHED[Scheduler Runs]
  OPS --> DLQ[Dead Letter]
  OPS --> FEEDS[Feeds + AirGap Ops]
  OPS --> HEALTH[Platform Health]

  RUN_TAB -. links to .-> ORCH
  RUN_TAB -. links to .-> SCHED
  RUN_TAB -. links to .-> FEEDS
  RUN_TAB -. links to .-> HEALTH

  PROMO_DETAIL -. findings snapshot .-> SEC
  PROMO_DETAIL -. env inputs .-> RC
  PROMO_DETAIL -. secrets/providers .-> INT
```

---

# 14.2 Run lifecycle graph (Mermaid): promotion execution stages + checkpoints

```mermaid
flowchart LR
  A[Promotion Created] --> B[Inputs Materialized]
  B --> C[Policy Gate Eval]
  C --> D{Approval Required?}
  D -- yes --> E[Approval Decision]
  D -- no --> F[Deploy Workflow Start]

  E --> F
  F --> G[Canary 10%]
  G --> H{SLO/Health OK?}
  H -- no --> R[Auto-Rollback / Pause]
  H -- yes --> I[Canary 50%]
  I --> J{SLO/Health OK?}
  J -- no --> R
  J -- yes --> K[100% Rollout]
  K --> L[Post-Deploy Verify]
  L --> M[Finalize + Seal Evidence]
  M --> N[Promotion Complete]

  %% Evidence capture points
  C -. DSSE policy decision .-> EV[Evidence Pack]
  F -. provenance/attestations .-> EV
  L -. runtime reachability snapshot .-> EV
  M -. Rekor/tlog receipts .-> EV
```
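
The lifecycle graph can also be read as a small state machine. The following is a minimal sketch of that reading, not Stella Ops code: `Stage`, `next_stage`, and `walk` are hypothetical names, and the only logic encoded is the set of edges drawn above (approval is optional, each canary gate either advances or diverts to rollback/pause). Keeping the two decision points in one function makes it easy to check that every path ends in either "Promotion Complete" or "Auto-Rollback / Pause".

```python
from enum import Enum
from typing import Iterable, List, Optional


class Stage(Enum):
    CREATED = "Promotion Created"
    INPUTS = "Inputs Materialized"
    GATE = "Policy Gate Eval"
    APPROVAL = "Approval Decision"
    DEPLOY_START = "Deploy Workflow Start"
    CANARY_10 = "Canary 10%"
    CANARY_50 = "Canary 50%"
    ROLLOUT_100 = "100% Rollout"
    VERIFY = "Post-Deploy Verify"
    SEAL = "Finalize + Seal Evidence"
    COMPLETE = "Promotion Complete"
    ROLLBACK = "Auto-Rollback / Pause"


# Unconditional edges from the graph; decision points are handled in next_stage().
_LINEAR = {
    Stage.CREATED: Stage.INPUTS,
    Stage.INPUTS: Stage.GATE,
    Stage.APPROVAL: Stage.DEPLOY_START,
    Stage.DEPLOY_START: Stage.CANARY_10,
    Stage.ROLLOUT_100: Stage.VERIFY,
    Stage.VERIFY: Stage.SEAL,
    Stage.SEAL: Stage.COMPLETE,
}
# Canary stages are followed by an SLO/health decision.
_CANARY_GATES = {Stage.CANARY_10: Stage.CANARY_50, Stage.CANARY_50: Stage.ROLLOUT_100}


def next_stage(current: Stage, approval_required: bool = False,
               healthy: bool = True) -> Optional[Stage]:
    """Return the stage that follows `current`, or None for a terminal stage."""
    if current is Stage.GATE:
        return Stage.APPROVAL if approval_required else Stage.DEPLOY_START
    if current in _CANARY_GATES:
        return _CANARY_GATES[current] if healthy else Stage.ROLLBACK
    return _LINEAR.get(current)


def walk(approval_required: bool, health_checks: Iterable[bool]) -> List[Stage]:
    """Follow the graph from CREATED; one health result is consumed per canary gate."""
    checks = iter(health_checks)
    path = [Stage.CREATED]
    while True:
        current = path[-1]
        healthy = next(checks, True) if current in _CANARY_GATES else True
        following = next_stage(current, approval_required, healthy)
        if following is None:
            return path
        path.append(following)
```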

---

# 14.3 Screen: Run / Timeline (Promotion Run)

### Formerly (where it lived pre-redesign)

The pieces existed but were **fragmented**:

* The **Control Plane** dashboard showed *Active Deployments* (high-level only).
* **Operations → Orchestrator** (job access) and **Operations → Scheduler** (runs) were operational views, not a “release narrative”.
* Evidence lived in **Evidence → Packets / Proof Chains / Export**, but was not tied to a run timeline.
* Detailed logs typically lived outside Stella (CI/CD, deploy system, cluster logs).

### Why it changed

* A release promotion must be **auditable as a single storyline**:

  * what happened,
  * when,
  * what data it used,
  * what it decided,
  * what evidence was sealed at each checkpoint,
  * and what actions are safe now (pause, rollback, replay).
* This screen becomes the **single pane** that links out to specialized areas (Ops, Evidence), instead of forcing users to hunt across menus.

### Screen graph (Mermaid)

```mermaid
flowchart TD
  A[Run / Timeline] --> B[Stage timeline with checkpoints]
  A --> C[Current status + next step]
  A --> D[Links to logs, artifacts, evidence]
  A --> E[Actions: pause/retry/rollback]
  A --> F[Data health banner: feeds/jobs/integrations]
  A --> G[Drill into Step Detail]
```

### ASCII mock

```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Promotion Run / Timeline │
│ Legacy name/location: No single screen. Pieces were Control Plane "Active Deployments" + Ops. │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Platform Release 1.3.0-rc1 manifest sha256:beef... │
│ Target: EU-West / eu-stage → eu-prod Workflow: Canary 10→50→100 │
│ Status: RUNNING (Canary 10%) Started: Feb 18, 08:30 │
│ Data health: WARN - NVD stale 3h | Rescan job failed (worker) | Jenkins degraded │
│ Links: [Ops Feeds] [System Jobs] [Integrations] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Timeline (click any step) │
│ 08:30 ✓ Inputs Materialized (Vault/Consul resolved, 0 missing) [View] │
│ 08:31 ✓ Gate Eval (Policy) PASS/WARN (reach runtime 35%) [View] │
│ 08:32 ✓ Approval APPROVED by bob.smith [View] │
│ 08:33 ▶ Deploy Canary 10% RUNNING (2/10 targets healthy) [View] [Pause] │
│ ---- ○ Deploy Canary 50% PENDING [-] │
│ ---- ○ Deploy 100% PENDING [-] │
│ ---- ○ Post-Deploy Verify PENDING [-] │
│ ---- ○ Seal Evidence PENDING [-] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Quick actions: [Pause] [Retry step] [Rollback] [Export evidence (partial)] [Replay policy] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
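
To make the "current status + next step" panel concrete, here is a minimal sketch of how it could be derived from the timeline rows in the mock. Everything here is an assumption for illustration: the `Step` type, the status vocabulary (`DONE`, `RUNNING`, `FAILED`, `PENDING`), and the rule that a running step offers pause/rollback while a failed step offers retry/rollback are not defined by the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    name: str
    status: str  # assumed vocabulary: "DONE", "RUNNING", "FAILED", "PENDING"


def summarize(steps: List[Step]) -> Tuple[Optional[Step], Optional[Step]]:
    """Return (current step, next pending step) for the status panel."""
    current = next((s for s in steps if s.status in ("RUNNING", "FAILED")), None)
    upcoming = next((s for s in steps if s.status == "PENDING"), None)
    return current, upcoming


def allowed_actions(steps: List[Step]) -> List[str]:
    """Hypothetical rule for which quick actions are enabled right now."""
    current, _ = summarize(steps)
    if current is None:
        return []  # nothing in flight: run finished or not started
    if current.status == "RUNNING":
        return ["pause", "rollback"]
    return ["retry", "rollback"]  # current step has failed


# The timeline from the mock, expressed as data.
TIMELINE = [
    Step("Inputs Materialized", "DONE"),
    Step("Gate Eval (Policy)", "DONE"),
    Step("Approval", "DONE"),
    Step("Deploy Canary 10%", "RUNNING"),
    Step("Deploy Canary 50%", "PENDING"),
    Step("Deploy 100%", "PENDING"),
    Step("Post-Deploy Verify", "PENDING"),
    Step("Seal Evidence", "PENDING"),
]
```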

---

# 14.4 Screen: Step Detail (Logs + Artifacts + Evidence captured at that checkpoint)

### Formerly

* Logs: CI/CD (e.g., Jenkins), deploy agent logs, and platform logs all lived outside Stella.
* Evidence: visible only under the **Evidence** menus and not connected to “the step that created it”.

### Why it changed

* Step Detail is the “unit of explanation”.
* Every meaningful checkpoint should show:

  * the **inputs** used,
  * the **outputs** produced,
  * the **logs**,
  * the **evidence items** sealed (or pending),
  * and **links** to canonical storage (Evidence Packets / Proof Chains).

### Screen graph (Mermaid)

```mermaid
flowchart TD
  A[Step Detail] --> B[Overview: inputs/outputs + timestamps]
  A --> C["Logs (stream / download)"]
  A --> D["Artifacts (manifests, plans, diffs)"]
  A --> E["Evidence items (DSSE, receipts, proofs)"]
  A --> F[Actions: retry step / mark failed / pause]
  A --> G[Jump: Evidence Packet / Proof Chain]
```

### ASCII mock

```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Step Detail: Gate Eval (Policy) │
│ Legacy name/location: gate result surfaced loosely on Approvals; evidence elsewhere. │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Start: 08:31:00 End: 08:31:12 Duration: 12s Result: PASS (2 WARN) │
│ Inputs: bundle manifest sha256:beef... | baseline Prod-EU-West | feeds: NVD stale 3h │
│ Outputs: policy verdict id: verdict-123 | decision digest: sha256:dd77... │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Tabs: [Overview] [Logs] [Artifacts] [Evidence] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Evidence captured │
│ ✓ DSSE envelope: policy-decision.dsse (digest sha256:dd77...) │
│ ✓ Rekor receipt: rekor-entry.json (tlog index 9918271) │
│ ○ Proof chain: pending until "Seal Evidence" step │
│ Links: [Open Evidence Packet] [Open Proof Chain] [Replay this Verdict] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
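
The mock shows a decision digest and a DSSE envelope for the policy verdict. The sketch below illustrates how such values could be produced, under two stated assumptions: that the decision digest is a SHA-256 over a key-sorted JSON serialization (the source does not say how Stella Ops computes it), and that the envelope follows the public DSSE format, whose pre-authentication encoding is `"DSSEv1" SP len(type) SP type SP len(body) SP body`. The function names and the payload type used with them are hypothetical, and no signing is performed.

```python
import base64
import hashlib
import json


def canonical_digest(record: dict) -> str:
    """SHA-256 over a key-sorted, whitespace-free JSON form (assumed canonicalization)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def dsse_pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes a signer would sign."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %s %d %s" % (len(type_bytes), type_bytes, len(payload), payload)


def unsigned_envelope(payload_type: str, payload: bytes) -> dict:
    """Build a DSSE-shaped envelope with no signatures attached yet."""
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [],  # a signer would append {"keyid": ..., "sig": ...}
    }
```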

---

# 14.5 Screen: Deploy Stage View (targets, health, checkpoints, rollback triggers)

### Formerly

* “Active Deployments” showed minimal progress.
* Detailed rollout and target health likely lived in the deploy system (outside Stella).
* The Platform Health screen exists, but it is not contextualized to a specific promotion.

### Why it changed

* This is where “release operations” actually happen:

  * show **targets** in the region/env,
  * show **health gates** / SLO checks,
  * show **automatic rollback triggers**,
  * link to platform health and logs.

### Screen graph (Mermaid)

```mermaid
flowchart TD
  A[Deploy Stage View] --> B["Targets table (per region/env)"]
  A --> C[SLO / health checks]
  A --> D[Auto-rollback rules + trigger state]
  A --> E[Actions: pause/continue/rollback]
  A --> F[Link: Platform Health]
```

### ASCII mock

```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Step Detail: Deploy Canary 10% │
│ Legacy name/location: Control Plane "Active Deployments" (summary only) + external deploy logs │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Stage: Canary 10% Policy: proceed if 95% healthy for 5m, error rate < 1% │
│ Current: 2/10 healthy | Error rate: 0.4% | Latency p95: 210ms | SLO: OK │
│ Auto-rollback trigger: NOT TRIGGERED │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Targets (EU-West / eu-prod) │
│ ┌────────────┬─────────────┬────────┬─────────┬────────┐ │
│ │ Target     │ Version     │ Health │ Notes   │ Logs   │ │
│ ├────────────┼─────────────┼────────┼─────────┼────────┤ │
│ │ eu-prod-01 │ bundle@beef │ ✓      │ ok      │ [open] │ │
│ │ eu-prod-02 │ bundle@beef │ ✓      │ ok      │ [open] │ │
│ │ eu-prod-03 │ old         │ ○      │ pending │ [open] │ │
│ └────────────┴─────────────┴────────┴─────────┴────────┘ │
│ Actions: [Pause] [Continue to 50%] (disabled until criteria met) [Rollback] [Open Platform Health] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
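
The stage policy in the mock ("proceed if 95% healthy for 5m, error rate < 1%") implies a check over a time window rather than a single reading. The sketch below is one possible interpretation, not the actual gate logic: `Sample`, `passes`, `healthy_streak_s`, and `can_continue` are hypothetical names, and "for 5m" is assumed to mean that every sample in the most recent unbroken run of passing samples spans at least 300 seconds. With the mock's current reading (2/10 healthy) the gate stays closed, which matches the disabled "Continue to 50%" button.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    t: float           # seconds since the stage started
    healthy: int       # targets reporting healthy
    total: int         # targets in the stage
    error_rate: float  # fraction: 0.004 means 0.4%


def passes(s: Sample, min_healthy: float = 0.95, max_error_rate: float = 0.01) -> bool:
    """A single reading meets both thresholds from the stage policy."""
    return s.total > 0 and s.healthy / s.total >= min_healthy and s.error_rate < max_error_rate


def healthy_streak_s(samples: List[Sample], now: float) -> float:
    """Seconds covered by the latest unbroken run of passing samples."""
    since = None
    for s in sorted(samples, key=lambda x: x.t, reverse=True):  # newest first
        if not passes(s):
            break
        since = s.t
    return 0.0 if since is None else now - since


def can_continue(samples: List[Sample], now: float, window_s: float = 300.0) -> bool:
    """True once the passing streak covers the whole window (5 minutes by default)."""
    return healthy_streak_s(samples, now) >= window_s
```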

---

# 14.6 Screen: Rollback / Re-run (safe ops controls)

### Formerly

* Rollback existed only as a **status** (“ROLLED_BACK”) in the Releases list.
* Actual rollback execution likely happened externally or via Orchestrator privileges.

### Why it changed

* Rollback must be:

  * explicit,
  * traceable,
  * evidence-backed (what was rolled back, why, and what the resulting state is).
* Re-run is needed for transient failures (e.g., a feed sync delay or a rescan job retry), but it must preserve determinism: a re-run records new, timestamped evidence and keeps the old evidence.

### Screen graph (Mermaid)

```mermaid
flowchart TD
  A[Rollback/Re-run] --> B[Select scope: step / stage / full rollback]
  A --> C["Preview impact (targets + versions)"]
  A --> D[Reason + ticket]
  A --> E[Execute]
  E --> F[Run Timeline updates + evidence appended]
```

### ASCII mock

```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Rollback / Re-run │
│ Legacy name/location: Release status "ROLLED_BACK" existed; rollback execution was not unified │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Platform Release 1.3.0-rc1 → EU-West/eu-prod │
│ Current stage: Canary 10% (RUNNING) │
│ │
│ Choose action: │
│ ( ) Re-run current step (Deploy Canary 10%) │
│ ( ) Pause promotion │
│ ( ) Rollback to previously deployed bundle version (manifest sha256:prev...) │
│ │
│ Preview rollback impact: │
│ - 2 targets currently on new bundle → will revert to prev bundle │
│ - 8 targets still old → unchanged │
│ │
│ Reason (required): [ incident #1234: elevated latency ] │
│ [Execute] [Cancel] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
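
Two ideas from this screen lend themselves to a short sketch: the rollback impact preview, and the rule that a re-run appends new evidence while keeping the old. The code below is illustrative only. `preview_rollback`, `EvidenceRecord`, and `append_attempt` are hypothetical names, the digests are placeholders, and the append-only log is modeled as an immutable tuple so that earlier records cannot be modified in place.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def preview_rollback(targets: Dict[str, str], new_digest: str,
                     prev_digest: str) -> Tuple[Dict[str, str], List[str]]:
    """Split targets into those that would revert and those left unchanged.

    `targets` maps target name -> currently deployed bundle digest.
    """
    revert_plan = {name: prev_digest
                   for name, digest in sorted(targets.items())
                   if digest == new_digest}
    unchanged = sorted(name for name, digest in targets.items() if digest != new_digest)
    return revert_plan, unchanged


@dataclass(frozen=True)
class EvidenceRecord:
    step: str
    attempt: int      # 1 for the first run of a step, 2 for the first re-run, ...
    digest: str
    recorded_at: str  # ISO-8601 timestamp


def append_attempt(log: Tuple[EvidenceRecord, ...], step: str, digest: str,
                   recorded_at: str) -> Tuple[EvidenceRecord, ...]:
    """Return a new log with one more record; existing records are never touched."""
    attempt = 1 + sum(1 for record in log if record.step == step)
    return log + (EvidenceRecord(step, attempt, digest, recorded_at),)
```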

---

# 14.7 Screen: Evidence Timeline (what evidence exists now vs what seals at finalize)

### Formerly

* Evidence existed under:

  * **Evidence → Packets**
  * **Evidence → Proof Chains**
  * **Evidence → Export**
  * **Evidence → Evidence Bundles**
* The *relationship to the run stages*, however, was not visible.

### Why it changed

* Auditors and operators need to answer:

  * “What evidence is already available mid-run?”
  * “What is pending until completion?”
  * “What exactly was sealed and when?”
* This is the bridge between the *Ops timeline* and the *audit artifacts*.

### Screen graph (Mermaid)

```mermaid
flowchart LR
  A["Evidence Timeline (per promotion)"] --> B[Evidence items by checkpoint]
  A --> C[Open Packet]
  A --> D[Open Proof Chain]
  A --> E[Export Evidence Pack]
  A --> F[Generate Auditor Bundle]
```

### ASCII mock

```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Evidence Timeline: Promotion Run │
│ Legacy name/location: Evidence artifacts existed, but not linked to run checkpoints │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Checkpoint → Evidence │
│ Inputs Materialized │
│ ✓ resolved-inputs.json (hash sha256:aa11...) │
│ │
│ Gate Eval (Policy) │
│ ✓ policy-decision.dsse ✓ rekor receipt ✓ verdict-123 │
│ │
│ Deploy Canary 10% │
│ ○ deploy-attestation.dsse (pending) │
│ │
│ Seal Evidence (final) │
│ ○ proof-chain.json ○ audit-pack.tar.gz ○ evidence-bundle.zip │
│ │
│ Actions: [Open Evidence Packet] [Open Proof Chain] [Export Pack (partial)] [Generate Auditor Bundle] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
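
The "available now vs pending" split in the mock can be sketched as a simple grouping over evidence items. This is an illustration under assumptions: `EvidenceItem`, its `sealed` flag, and both helper functions are hypothetical, and the item list simply mirrors the rows of the mock. A partial export would then contain only the items returned by `exportable_now`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvidenceItem:
    checkpoint: str
    name: str
    sealed: bool  # True = the ✓ rows in the mock, False = the ○ (pending) rows


def by_checkpoint(items: List[EvidenceItem]) -> Dict[str, Dict[str, List[str]]]:
    """Group item names per checkpoint, split into available and pending."""
    grouped: Dict[str, Dict[str, List[str]]] = {}
    for item in items:
        bucket = grouped.setdefault(item.checkpoint, {"available": [], "pending": []})
        bucket["available" if item.sealed else "pending"].append(item.name)
    return grouped  # dicts keep insertion order, so checkpoints stay in run order


def exportable_now(items: List[EvidenceItem]) -> List[str]:
    """Names that a partial (mid-run) export could include."""
    return [item.name for item in items if item.sealed]


# The rows of the mock, expressed as data.
ITEMS = [
    EvidenceItem("Inputs Materialized", "resolved-inputs.json", True),
    EvidenceItem("Gate Eval (Policy)", "policy-decision.dsse", True),
    EvidenceItem("Gate Eval (Policy)", "rekor receipt", True),
    EvidenceItem("Gate Eval (Policy)", "verdict-123", True),
    EvidenceItem("Deploy Canary 10%", "deploy-attestation.dsse", False),
    EvidenceItem("Seal Evidence (final)", "proof-chain.json", False),
    EvidenceItem("Seal Evidence (final)", "audit-pack.tar.gz", False),
    EvidenceItem("Seal Evidence (final)", "evidence-bundle.zip", False),
]
```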

---

# 14.8 Screen: Replay/Verify (contextual replay for *this run*)

### Formerly

* **Evidence → Replay/Verify** (“Verdict Replay”) existed as a standalone screen:

  * the user enters a verdict id or image reference,
  * and sees replay requests plus a determinism overview.

### Why it changed

* Replay should be reachable from where it matters:

  * a specific policy decision checkpoint in a promotion run.
* Keep the existing Replay/Verify functionality, but add a **contextual wrapper** that:

  * pre-fills verdict id + bundle digest + env baseline,
  * shows determinism status for this promotion.

### Screen graph (Mermaid)

```mermaid
flowchart TD
  A[Run → Replay/Verify] --> B[Pre-filled replay request]
  B --> C[Replay requests list]
  C --> D[Determinism metrics]
  D --> E[Link: Evidence → Replay/Verify canonical view]
```

### ASCII mock

```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Replay/Verify: For this Promotion │
│ Legacy name/location: "Verdict Replay" (Evidence → Replay/Verify) │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Pre-filled replay request │
│ Verdict ID: verdict-123 │
│ Bundle: Platform Release 1.3.0-rc1 manifest sha256:beef... │
│ Baseline: Prod-EU-West │
│ Reason: [ Audit verification / policy change test ] │
│ [Request Replay] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Recent replay requests (for this promotion) │
│ rr-001 COMPLETED Feb 18, 08:30 match │
│ rr-002 RUNNING Feb 18, 07:30 │
│ Determinism: total 2 | matching 1 | mismatches 1 | match rate 50% │
│ Link: [Open canonical Replay/Verify screen] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
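
The determinism summary line (total, matching, mismatches, match rate) can be sketched as follows. This is a hedged illustration rather than the product's metric definition: `ReplayRequest` and `determinism_metrics` are hypothetical, a replay is assumed to "match" when the replayed decision digest equals the original one, and only completed requests are counted because an in-flight replay has no result yet. Whether running requests should count toward the total is not specified in the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class ReplayRequest:
    request_id: str
    status: str                            # assumed: "COMPLETED" or "RUNNING"
    original_digest: str                   # digest sealed at the policy checkpoint
    replayed_digest: Optional[str] = None  # None until the replay completes


def determinism_metrics(requests: List[ReplayRequest]) -> Dict[str, Union[int, float, None]]:
    """Summarize completed replays; match_rate is None when nothing has completed."""
    completed = [r for r in requests
                 if r.status == "COMPLETED" and r.replayed_digest is not None]
    matching = sum(1 for r in completed if r.replayed_digest == r.original_digest)
    total = len(completed)
    return {
        "total": total,
        "matching": matching,
        "mismatches": total - matching,
        "match_rate": (matching / total) if total else None,
    }
```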

---

# 14.9 Screen: Schedule / Automation (promotion scheduling + link to Scheduler Runs)

### Formerly

* **Operations → Scheduler** existed (“Scheduler Runs”) but was disconnected from promotions.
* The release list had statuses, but scheduling was not first-class in the release context.

### Why it changed

* Scheduling belongs to release operations, but we don’t want a new menu.
* This screen:

  * schedules this promotion (or a step),
  * writes a scheduler job,
  * then links to **Scheduler Runs** for execution diagnostics.

### Screen graph (Mermaid)

```mermaid
flowchart LR
  A[Schedule Promotion] --> B[Choose time/window]
  A --> C["Choose constraints (feeds fresh, scans complete)"]
  A --> D[Create scheduler job]
  D --> E[View Scheduler Runs]
  E --> F[Back to Run Timeline]
```

### ASCII mock

```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Schedule Promotion │
│ Legacy name/location: Ops → Scheduler (runs), no promotion-level scheduling UI │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Hotfix Bundle 1.2.4 → US-East/us-prod │
│ │
│ Schedule: [ Feb 19, 02:00 AM ] Window: [ 2h ] │
│ Preconditions: │
│ [x] NVD/OSV feeds fresh (< 1h) │
│ [x] SBOM rescans complete │
│ [ ] Integrations healthy (warn only) │
│ │
│ [Create Schedule] Link: [Open Scheduler Runs] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
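
The schedule form combines a start time, a window, and preconditions, one of which is marked "warn only". A minimal sketch of how a scheduler job might evaluate them is shown below. All names (`Precondition`, `feed_is_fresh`, `in_window`, `evaluate`) are hypothetical, and the interpretation is an assumption: blocking preconditions and the time window stop the run, while warn-only preconditions are reported but do not stop it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Precondition:
    name: str
    satisfied: bool
    blocking: bool = True  # False models the "(warn only)" checkbox


def feed_is_fresh(last_sync: datetime, now: datetime,
                  max_age: timedelta = timedelta(hours=1)) -> bool:
    """The "feeds fresh (< 1h)" check: the last sync is younger than max_age."""
    return now - last_sync < max_age


def in_window(now: datetime, start: datetime, window: timedelta) -> bool:
    """True while `now` falls inside [start, start + window]."""
    return start <= now <= start + window


def evaluate(now: datetime, start: datetime, window: timedelta,
             preconditions: List[Precondition]) -> Tuple[bool, List[str], List[str]]:
    """Return (may_run, blockers, warnings) for a scheduled promotion."""
    blockers = [p.name for p in preconditions if p.blocking and not p.satisfied]
    warnings = [p.name for p in preconditions if not p.blocking and not p.satisfied]
    if not in_window(now, start, window):
        blockers.insert(0, "outside schedule window")
    return (not blockers), blockers, warnings
```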

---

## Result: what you gain with Pack 14

* A promotion is now a **single auditable narrative**:

  * timeline + logs + checkpoints,
  * policy decision trace,
  * deploy stage health gates,
  * rollback controls,
  * evidence sealing,
  * deterministic replay.
* Hybrid reachability becomes a **second-class** signal woven into checkpoints (Policy + Post-Deploy Verify), not a top-level section.
* Existing PoC pages remain valid, but are now **linked meaningfully** from the run storyline.

---

Next: **Pack 15** will unify **Nightly Ops Report + Data Freshness** (feeds, rescans, integration degradation) into a single **Operations “Data Integrity”** view and show how it bubbles up to Dashboard/Releases/Approvals without duplicating screens.