git.stella-ops.org/docs/modules/ui/v2-rewire/pack-14.md
2026-02-18 23:03:07 +02:00

## Pack 14 — Release Run / Deployment Timeline (workflow checkpoints, logs, rollback, evidence capture, replay/verify)
This pack adds the **“run view”** that ties together everything Stella Ops promises: *promote by digest, explain every decision, evidence-backed audit, deterministic replay* — without turning reachability into a top-level area.
---
# 14.1 Menu graph (Mermaid) — where “Release Run” sits in the IA
```mermaid
flowchart TD
ROOT[Stella Ops Console] --> REL[Releases]
ROOT --> APPR[Approvals]
ROOT --> EVID[Evidence]
ROOT --> OPS[Operations]
ROOT --> RC["Release Control (ROOT)"]
ROOT --> INT[Integrations]
ROOT --> SEC[Security]
REL --> REL_LIST["Releases (Promotions)"]
REL_LIST --> PROMO_DETAIL[Promotion Detail]
PROMO_DETAIL --> RUN_TAB[Run / Timeline]
RUN_TAB --> STEP_DETAIL[Step Detail: logs + artifacts + evidence]
RUN_TAB --> ROLLBACK[Rollback / Re-run]
RUN_TAB --> SCHEDULE[Schedule / Automation]
STEP_DETAIL -. export evidence .-> EVID
STEP_DETAIL -. replay policy .-> EVID
RUN_TAB -. ops health .-> OPS
EVID --> PKT[Packets]
EVID --> CHAIN[Proof Chains]
EVID --> REPLAY[Replay/Verify]
EVID --> EXPORT[Export Center]
EVID --> BUNDLES[Evidence Bundles]
OPS --> ORCH[Orchestrator]
OPS --> SCHED[Scheduler Runs]
OPS --> DLQ[Dead Letter]
OPS --> FEEDS[Feeds + AirGap Ops]
OPS --> HEALTH[Platform Health]
RUN_TAB -. links to .-> ORCH
RUN_TAB -. links to .-> SCHED
RUN_TAB -. links to .-> FEEDS
RUN_TAB -. links to .-> HEALTH
PROMO_DETAIL -. findings snapshot .-> SEC
PROMO_DETAIL -. env inputs .-> RC
PROMO_DETAIL -. secrets/providers .-> INT
```
---
# 14.2 Run lifecycle graph (Mermaid) — promotion execution stages + checkpoints
```mermaid
flowchart LR
A[Promotion Created] --> B[Inputs Materialized]
B --> C[Policy Gate Eval]
C --> D{Approval Required?}
D -- yes --> E[Approval Decision]
D -- no --> F[Deploy Workflow Start]
E --> F
F --> G[Canary 10%]
G --> H{SLO/Health OK?}
H -- no --> R[Auto-Rollback / Pause]
H -- yes --> I[Canary 50%]
I --> J{SLO/Health OK?}
J -- no --> R
J -- yes --> K[100% Rollout]
K --> L[Post-Deploy Verify]
L --> M[Finalize + Seal Evidence]
M --> N[Promotion Complete]
%% Evidence capture points
C -. DSSE policy decision .-> EV[Evidence Pack]
F -. provenance/attestations .-> EV
L -. runtime reachability snapshot .-> EV
M -. Rekor/tlog receipts .-> EV
```
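The lifecycle graph above can be read as a small state machine. The following is a minimal Python sketch of it, not the actual orchestrator logic: the stage names mirror the diagram, while `Stage`, `next_stage`, and the `approval_required` / `health_ok` parameters are hypothetical names introduced for illustration. The diagram does not say what happens when an approval is rejected, so the sketch leaves that case out.

```python
from enum import Enum


class Stage(Enum):
    CREATED = "Promotion Created"
    INPUTS = "Inputs Materialized"
    GATE = "Policy Gate Eval"
    APPROVAL = "Approval Decision"
    DEPLOY_START = "Deploy Workflow Start"
    CANARY_10 = "Canary 10%"
    CANARY_50 = "Canary 50%"
    ROLLOUT_100 = "100% Rollout"
    VERIFY = "Post-Deploy Verify"
    SEAL = "Finalize + Seal Evidence"
    COMPLETE = "Promotion Complete"
    ROLLBACK = "Auto-Rollback / Pause"


# Unconditional transitions, taken directly from the diagram.
_LINEAR = {
    Stage.CREATED: Stage.INPUTS,
    Stage.INPUTS: Stage.GATE,
    Stage.APPROVAL: Stage.DEPLOY_START,
    Stage.DEPLOY_START: Stage.CANARY_10,
    Stage.ROLLOUT_100: Stage.VERIFY,
    Stage.VERIFY: Stage.SEAL,
    Stage.SEAL: Stage.COMPLETE,
}


def next_stage(current, approval_required=False, health_ok=True):
    """Return the stage that follows `current`, or None at a terminal stage."""
    if current is Stage.GATE:
        return Stage.APPROVAL if approval_required else Stage.DEPLOY_START
    if current is Stage.CANARY_10:
        return Stage.CANARY_50 if health_ok else Stage.ROLLBACK
    if current is Stage.CANARY_50:
        return Stage.ROLLOUT_100 if health_ok else Stage.ROLLBACK
    return _LINEAR.get(current)
```

Keeping the two decision points (approval, SLO/health) as explicit parameters makes every branch of the diagram testable in isolation.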
---
# 14.3 Screen — Run / Timeline (Promotion Run)
### Formerly (where it lived pre-redesign)
Pieces existed but were **fragmented**:
* **Control Plane** dashboard showed *Active Deployments* (high-level only).
* **Operations → Orchestrator** (job access) and **Operations → Scheduler** (runs) were operational views, not a “release narrative”.
* Evidence was in **Evidence → Packets / Proof Chains / Export**, but not tied to a run timeline.
* Any detailed logs typically lived outside Stella (CI/CD, deploy system, cluster logs).
### Why changed like this
* A release promotion must be **auditable as a single storyline**:
  * what happened,
  * when,
  * what data it used,
  * what it decided,
  * what evidence was sealed at each checkpoint,
  * and what actions are safe now (pause, rollback, replay).
* This screen becomes the **single pane** that links out to specialized areas (Ops, Evidence), instead of forcing users to hunt.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Run / Timeline] --> B[Stage timeline with checkpoints]
A --> C[Current status + next step]
A --> D[Links to logs, artifacts, evidence]
A --> E[Actions: pause/retry/rollback]
A --> F[Data health banner: feeds/jobs/integrations]
A --> G[Drill into Step Detail]
```
### ASCII mock
```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Promotion Run / Timeline │
│ Legacy name/location: No single screen. Pieces were Control Plane "Active Deployments" + Ops. │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Platform Release 1.3.0-rc1 manifest sha256:beef... │
│ Target: EU-West / eu-stage → eu-prod Workflow: Canary 10→50→100 │
│ Status: RUNNING (Canary 10%) Started: Feb 18, 08:30 │
│ Data health: WARN — NVD stale 3h | Rescan job failed (worker) | Jenkins degraded │
│ Links: [Ops Feeds] [System Jobs] [Integrations] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Timeline (click any step) │
│ 08:30 ✓ Inputs Materialized (Vault/Consul resolved, 0 missing) [View] │
│ 08:31 ✓ Gate Eval (Policy) PASS/WARN (reach runtime 35%) [View] │
│ 08:32 ✓ Approval APPROVED by bob.smith [View] │
│ 08:33 ▶ Deploy Canary 10% RUNNING (2/10 targets healthy) [View] [Pause] │
│ ---- ○ Deploy Canary 50% PENDING [—] │
│ ---- ○ Deploy 100% PENDING [—] │
│ ---- ○ Post-Deploy Verify PENDING [—] │
│ ---- ○ Seal Evidence PENDING [—] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Quick actions: [Pause] [Retry step] [Rollback] [Export evidence (partial)] [Replay policy] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
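As a rough illustration of how the timeline rows in the mock could be derived from step data, here is a small Python sketch. `TimelineStep`, `render_row`, and the status strings are hypothetical names; the glyphs and per-row actions are copied from the mock, and the mapping from status to actions is an assumption read off those rows.

```python
from dataclasses import dataclass
from typing import Optional

# Status glyphs and per-row actions as they appear in the mock above.
GLYPH = {"done": "✓", "running": "▶", "pending": "○"}
ROW_ACTIONS = {"done": ["View"], "running": ["View", "Pause"], "pending": []}


@dataclass
class TimelineStep:
    name: str
    status: str                      # "done" | "running" | "pending"
    started: Optional[str] = None    # "HH:MM"; None until the step starts
    summary: str = ""


def render_row(step: TimelineStep) -> str:
    """Format one timeline row: time, glyph, name, summary, actions."""
    time = step.started or "----"
    actions = " ".join(f"[{a}]" for a in ROW_ACTIONS[step.status]) or "[—]"
    parts = [time, GLYPH[step.status], step.name, step.summary, actions]
    return " ".join(p for p in parts if p)
```

For example, `render_row(TimelineStep("Seal Evidence", "pending", summary="PENDING"))` yields a row with the `----` time placeholder and no clickable action.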
---
# 14.4 Screen — Step Detail (Logs + Artifacts + Evidence captured at that checkpoint)
### Formerly
* Logs: CI/CD (e.g., Jenkins), deploy agent logs, platform logs — outside Stella.
* Evidence: visible only under **Evidence** menus and not connected to “the step that created it”.
### Why changed like this
* Step Detail is the “unit of explanation”.
* Every meaningful checkpoint should show:
  * **inputs** used,
  * **outputs** produced,
  * **logs**,
  * **evidence items** sealed (or pending),
  * and **links** to canonical storage (Evidence Packets / Proof Chains).
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Step Detail] --> B[Overview: inputs/outputs + timestamps]
A --> C["Logs (stream / download)"]
A --> D["Artifacts (manifests, plans, diffs)"]
A --> E["Evidence items (DSSE, receipts, proofs)"]
A --> F[Actions: retry step / mark failed / pause]
A --> G[Jump: Evidence Packet / Proof Chain]
```
### ASCII mock
```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Step Detail: Gate Eval (Policy) │
│ Legacy name/location: gate result surfaced loosely on Approvals; evidence elsewhere. │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Start: 08:31 End: 08:31:12 Duration: 12s Result: PASS (2 WARN) │
│ Inputs: bundle manifest sha256:beef... | baseline Prod-EU-West | feeds: NVD stale 3h │
│ Outputs: policy verdict id: verdict-123 | decision digest: sha256:dd77... │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Tabs: [Overview] [Logs] [Artifacts] [Evidence] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Evidence captured │
│ ✓ DSSE envelope: policy-decision.dsse (digest sha256:dd77...) │
│ ✓ Rekor receipt: rekor-entry.json (tlog index 9918271) │
│ ○ Proof chain: pending until "Seal Evidence" step │
│ Links: [Open Evidence Packet] [Open Proof Chain] [Replay this Verdict] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
---
# 14.5 Screen — Deploy Stage View (targets, health, checkpoints, rollback triggers)
### Formerly
* “Active Deployments” showed minimal progress.
* Detailed rollout/targets health likely lived in your deploy system (outside Stella).
* Platform Health screen exists, but not contextualized to a specific promotion.
### Why changed like this
* This is where “release operations” actually happens:
  * show **targets** in the region/env,
  * show **health gates** / SLO checks,
  * show **automatic rollback triggers**,
  * link to platform health and logs.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Deploy Stage View] --> B["Targets table (per region/env)"]
A --> C[SLO / health checks]
A --> D[Auto-rollback rules + trigger state]
A --> E[Actions: pause/continue/rollback]
A --> F[Link: Platform Health]
```
### ASCII mock
```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Step Detail: Deploy Canary 10% │
│ Legacy name/location: Control Plane "Active Deployments" (summary only) + external deploy logs │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Stage: Canary 10% Policy: proceed if 95% healthy for 5m, error rate < 1% │
│ Current: 2/10 healthy | Error rate: 0.4% | Latency p95: 210ms | SLO: OK │
│ Auto-rollback trigger: NOT TRIGGERED │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Targets (EU-West / eu-prod) │
│ ┌───────────────┬─────────────┬──────────┬──────────────┬───────────────┐ │
│ │ Target        │ Version     │ Health   │ Notes        │ Logs          │ │
│ ├───────────────┼─────────────┼──────────┼──────────────┼───────────────┤ │
│ │ eu-prod-01    │ bundle@beef │ ✓        │ ok           │ [open]        │ │
│ │ eu-prod-02    │ bundle@beef │ ✓        │ ok           │ [open]        │ │
│ │ eu-prod-03    │ old         │ ○        │ pending      │ [open]        │ │
│ └───────────────┴─────────────┴──────────┴──────────────┴───────────────┘ │
│ Actions: [Pause] [Continue to 50%] (disabled until criteria met) [Rollback] [Open Platform Health]│
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
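The stage policy shown in the mock ("proceed if 95% healthy for 5m, error rate < 1%") can be expressed as a simple gate check. The sketch below is one possible reading, not the real evaluator: `GatePolicy`, `StageSample`, and `can_continue` are hypothetical names, and it assumes "95% healthy" means the share of healthy targets in the stage.

```python
from dataclasses import dataclass


@dataclass
class GatePolicy:
    min_healthy_ratio: float = 0.95   # "95% healthy"
    hold_seconds: int = 300           # "for 5m"
    max_error_rate: float = 0.01      # "error rate < 1%"


@dataclass
class StageSample:
    healthy: int               # targets currently healthy
    total: int                 # targets in this stage
    error_rate: float          # fraction, e.g. 0.004 for 0.4%
    healthy_for_seconds: int   # how long the healthy ratio has held


def can_continue(policy: GatePolicy, sample: StageSample) -> bool:
    """True only when every criterion of the stage policy is met."""
    if sample.total == 0:
        return False
    ratio = sample.healthy / sample.total
    return (ratio >= policy.min_healthy_ratio
            and sample.healthy_for_seconds >= policy.hold_seconds
            and sample.error_rate < policy.max_error_rate)
```

With the values from the mock (2/10 healthy, 0.4% error rate) the check fails on the healthy ratio, which is consistent with `[Continue to 50%]` being disabled.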
---
# 14.6 Screen — Rollback / Re-run (safe ops controls)
### Formerly
* Rollback existed as a **status** (“ROLLED_BACK”) in Releases list.
* Actual rollback execution likely happened externally or via Orchestrator privileges.
### Why changed like this
* Rollback must be:
  * explicit,
  * traceable,
  * evidence-backed (what was rolled back, why, and what is the resulting state).
* Re-run is needed for transient failures (e.g., feed sync delay, rescan job retry), but it must preserve determinism: a re-run records new, timestamped evidence and keeps the earlier evidence intact.
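One way to satisfy the "new evidence, old evidence kept" requirement for re-runs is an append-only log keyed by step and attempt. This is a minimal sketch under that assumption; `EvidenceRecord` and `EvidenceLog` are hypothetical names, not existing Stella Ops types.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvidenceRecord:
    step: str
    attempt: int            # 1 for the first run, 2 for the first re-run, ...
    digest: str
    recorded_at: datetime


@dataclass
class EvidenceLog:
    _records: List[EvidenceRecord] = field(default_factory=list)

    def append(self, step: str, digest: str,
               now: Optional[datetime] = None) -> EvidenceRecord:
        """Add evidence for a step; earlier attempts are never modified."""
        attempt = 1 + sum(1 for r in self._records if r.step == step)
        record = EvidenceRecord(step, attempt, digest,
                                now or datetime.now(timezone.utc))
        self._records.append(record)
        return record

    def history(self, step: str) -> Tuple[EvidenceRecord, ...]:
        """All attempts for a step, oldest first."""
        return tuple(r for r in self._records if r.step == step)
```

Because records are frozen and the log exposes no update or delete operation, a re-run can only add to the history of a step.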
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Rollback/Re-run] --> B[Select scope: step / stage / full rollback]
A --> C["Preview impact (targets + versions)"]
A --> D[Reason + ticket]
A --> E[Execute]
E --> F[Run Timeline updates + evidence appended]
```
### ASCII mock
```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Rollback / Re-run │
│ Legacy name/location: Release status "ROLLED_BACK" existed; rollback execution was not unified │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Platform Release 1.3.0-rc1 → EU-West/eu-prod │
│ Current stage: Canary 10% (RUNNING) │
│ │
│ Choose action: │
│ ( ) Re-run current step (Deploy Canary 10%) │
│ ( ) Pause promotion │
│ ( ) Rollback to previously deployed bundle version (manifest sha256:prev...) │
│ │
│ Preview rollback impact: │
│ - 2 targets currently on new bundle → will revert to prev bundle │
│ - 8 targets still old → unchanged │
│ │
│ Reason (required): [ incident #1234: elevated latency ] │
│ [Execute] [Cancel] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
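The "Preview rollback impact" section of the mock can be computed from the currently deployed version of each target. The sketch below assumes a plain mapping from target name to deployed bundle identifier; `preview_rollback` and `rollback_request` are hypothetical helpers, and the bundle labels used with them are placeholders rather than real digests.

```python
from typing import Dict


def preview_rollback(deployed: Dict[str, str], new_bundle: str,
                     prev_bundle: str) -> Dict[str, object]:
    """Split targets into those that would revert and those left unchanged."""
    revert = sorted(t for t, v in deployed.items() if v == new_bundle)
    unchanged = sorted(t for t, v in deployed.items() if v != new_bundle)
    return {"revert_to": prev_bundle, "revert": revert, "unchanged": unchanged}


def rollback_request(preview: Dict[str, object], reason: str) -> Dict[str, object]:
    """Refuse to build a request without a reason, since the form requires one."""
    if not reason.strip():
        raise ValueError("a rollback reason is required")
    return {**preview, "reason": reason.strip()}
```

For the run in the mock (2 of 10 targets on the new bundle), the preview lists 2 targets to revert and 8 unchanged.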
---
# 14.7 Screen — Evidence Timeline (what evidence exists now vs what seals at finalize)
### Formerly
* Evidence existed under:
  * **Evidence → Packets**
  * **Evidence → Proof Chains**
  * **Evidence → Export**
  * **Evidence → Evidence Bundles**

  …but the *relationship to the run stages* wasn't visible.
### Why changed like this
* Auditors and operators need to answer:
  * “What evidence is already available mid-run?”
  * “What is pending until completion?”
  * “What exactly was sealed and when?”
* This is the bridge between *Ops timeline* and *audit artifacts*.
### Screen graph (Mermaid)
```mermaid
flowchart LR
A["Evidence Timeline (per promotion)"] --> B[Evidence items by checkpoint]
A --> C[Open Packet]
A --> D[Open Proof Chain]
A --> E[Export Evidence Pack]
A --> F[Generate Auditor Bundle]
```
### ASCII mock
```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Evidence Timeline — Promotion Run │
│ Legacy name/location: Evidence artifacts existed, but not linked to run checkpoints │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Checkpoint → Evidence │
│ Inputs Materialized │
│ ✓ resolved-inputs.json (hash sha256:aa11...) │
│ │
│ Gate Eval (Policy) │
│ ✓ policy-decision.dsse ✓ rekor receipt ✓ verdict-123 │
│ │
│ Deploy Canary 10% │
│ ○ deploy-attestation.dsse (pending) │
│ │
│ Seal Evidence (final) │
│ ○ proof-chain.json ○ audit-pack.tar.gz ○ evidence-bundle.zip │
│ │
│ Actions: [Open Evidence Packet] [Open Proof Chain] [Export Pack (partial)] [Generate Auditor Bundle]│
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
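The grouping shown in the mock (evidence by checkpoint, sealed versus pending) can be sketched as follows. The item list is copied from the mock; `by_checkpoint` and `partial_export` are hypothetical helpers, and the rule that a partial export contains only already-sealed items is an assumption.

```python
from collections import defaultdict

# (checkpoint, evidence item, sealed?) as listed in the mock above.
ITEMS = [
    ("Inputs Materialized", "resolved-inputs.json", True),
    ("Gate Eval (Policy)", "policy-decision.dsse", True),
    ("Gate Eval (Policy)", "rekor receipt", True),
    ("Gate Eval (Policy)", "verdict-123", True),
    ("Deploy Canary 10%", "deploy-attestation.dsse", False),
    ("Seal Evidence (final)", "proof-chain.json", False),
    ("Seal Evidence (final)", "audit-pack.tar.gz", False),
    ("Seal Evidence (final)", "evidence-bundle.zip", False),
]


def by_checkpoint(items):
    """Group items per checkpoint into 'sealed' and 'pending' lists."""
    grouped = defaultdict(lambda: {"sealed": [], "pending": []})
    for checkpoint, name, sealed in items:
        grouped[checkpoint]["sealed" if sealed else "pending"].append(name)
    return dict(grouped)


def partial_export(items):
    """Names of items that are already sealed and could be exported mid-run."""
    return [name for _, name, sealed in items if sealed]
```

This answers the two mid-run questions directly: `partial_export(ITEMS)` is what exists now, and the `pending` lists are what waits for later checkpoints.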
---
# 14.8 Screen — Replay/Verify (contextual replay for *this run*)
### Formerly
* **Evidence → Replay/Verify** (“Verdict Replay”) existed as a standalone screen:
  * user inputs verdict id or image reference,
  * sees replay requests + determinism overview.
### Why changed like this
* Replay should be reachable from where it matters:
  * a specific policy decision checkpoint in a promotion run.
* Keep the existing Replay/Verify functionality, but add a **contextual wrapper**:
  * pre-fills verdict id + bundle digest + env baseline,
  * shows determinism status for this promotion.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Run → Replay/Verify] --> B[Pre-filled replay request]
B --> C[Replay requests list]
C --> D[Determinism metrics]
D --> E[Link: Evidence → Replay/Verify canonical view]
```
### ASCII mock
```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Replay/Verify — For this Promotion │
│ Legacy name/location: "Verdict Replay" (Evidence → Replay/Verify) │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Pre-filled replay request │
│ Verdict ID: verdict-123 │
│ Bundle: Platform Release 1.3.0-rc1 manifest sha256:beef... │
│ Baseline: Prod-EU-West │
│ Reason: [ Audit verification / policy change test ] │
│ [Request Replay] │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Recent replay requests (for this promotion) │
│ rr-001 COMPLETED Feb 18, 08:30 match │
│ rr-002 RUNNING Feb 18, 07:30 │
│ Determinism: total 2 | matching 1 | mismatches 1 | match rate 50% │
│ Link: [Open canonical Replay/Verify screen] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
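The contextual wrapper does two things: it pre-fills a replay request from the run, and it summarizes determinism for this promotion. The following Python sketch illustrates both under stated assumptions: `ReplayRequest`, `prefill_replay`, `determinism_summary`, and the dictionary keys are hypothetical, and the summary only counts completed replays, since the source does not say how running requests are treated.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReplayRequest:
    verdict_id: str
    bundle_digest: str
    baseline: str
    reason: str


def prefill_replay(step_outputs: Dict[str, str], reason: str) -> ReplayRequest:
    """Build a replay request from values already shown on Step Detail."""
    return ReplayRequest(
        verdict_id=step_outputs["verdict_id"],
        bundle_digest=step_outputs["bundle_digest"],
        baseline=step_outputs["baseline"],
        reason=reason,
    )


def determinism_summary(outcomes: List[str]) -> Dict[str, float]:
    """Summarize completed replays; each outcome is "match" or "mismatch"."""
    total = len(outcomes)
    matching = sum(1 for o in outcomes if o == "match")
    rate = (100.0 * matching / total) if total else 0.0
    return {"total": total, "matching": matching,
            "mismatches": total - matching, "match_rate_pct": rate}
```

The user then only supplies the reason; everything else comes from the policy checkpoint the replay was launched from.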
---
# 14.9 Screen — Schedule / Automation (promotion scheduling + link to Scheduler Runs)
### Formerly
* **Operations → Scheduler** existed (“Scheduler Runs”) but disconnected from promotions.
* Release list had statuses, but scheduling wasn't first-class in the release context.
### Why changed like this
* Scheduling belongs to release operations, but we don't want a new menu.
* This screen:
  * schedules this promotion (or a step),
  * writes a scheduler job,
  * then links to **Scheduler Runs** for execution diagnostics.
### Screen graph (Mermaid)
```mermaid
flowchart LR
A[Schedule Promotion] --> B[Choose time/window]
A --> C["Choose constraints (feeds fresh, scans complete)"]
A --> D[Create scheduler job]
D --> E[View Scheduler Runs]
E --> F[Back to Run Timeline]
```
### ASCII mock
```text
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Schedule Promotion │
│ Legacy name/location: Ops → Scheduler (runs), no promotion-level scheduling UI │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Hotfix Bundle 1.2.4 → US-East/us-prod │
│ │
│ Schedule: [ Feb 19, 02:00 AM ] Window: [ 2h ] │
│ Preconditions: │
│ [x] NVD/OSV feeds fresh (< 1h) │
│ [x] SBOM rescans complete │
│ [ ] Integrations healthy (warn only) │
│ │
│ [Create Schedule] Link: [Open Scheduler Runs] │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
```
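The preconditions in the mock suggest a check that runs before the scheduled promotion starts, with some conditions blocking and others only warning. The sketch below is one interpretation: `Precondition`, `is_fresh`, and `evaluate` are hypothetical names, and treating "(warn only)" as non-blocking is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Precondition:
    name: str
    satisfied: bool
    warn_only: bool = False   # unmet warn-only conditions do not block


def is_fresh(last_sync: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=1)) -> bool:
    """Feed freshness check matching the "< 1h" rule in the mock."""
    return now - last_sync < max_age


def evaluate(preconditions: List[Precondition]) -> Tuple[bool, List[str], List[str]]:
    """Return (may_start, blockers, warnings)."""
    blockers = [p.name for p in preconditions
                if not p.satisfied and not p.warn_only]
    warnings = [p.name for p in preconditions
                if not p.satisfied and p.warn_only]
    return (not blockers, blockers, warnings)
```

A stale feed would therefore hold the promotion, while a degraded integration would only surface as a warning on the run.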
---
## Result: what you gain with Pack 14
* A promotion is now a **single auditable narrative**:
  * timeline + logs + checkpoints,
  * policy decision trace,
  * deploy stage health gates,
  * rollback controls,
  * evidence sealing,
  * deterministic replay.
* Hybrid reachability becomes a **second-class** signal woven into checkpoints (Policy Gate Eval + Post-Deploy Verify), not a top-level section.
* Existing PoC pages remain valid, but are now **linked meaningfully** from the run storyline.
---
If you want the next pack: **Pack 15** will unify **Nightly Ops Report + Data Freshness** (feeds, rescans, integration degradation) into a single **Operations “Data Integrity”** view and show how it bubbles up to Dashboard/Releases/Approvals without duplicating screens.