master c2f13fe588 preparation for ui re-shelling

2026-02-18 23:03:07 +02:00

Pack 14 — Release Run / Deployment Timeline (workflow checkpoints, logs, rollback, evidence capture, replay/verify)

This pack adds the “run view” that ties together everything Stella Ops promises: promote by digest, explain every decision, evidence-backed audit, deterministic replay — without turning reachability into a top-level area.

14.1 Menu graph (Mermaid) — where “Release Run” sits in the IA

flowchart TD
ROOT[Stella Ops Console] --> REL[Releases]
ROOT --> APPR[Approvals]
ROOT --> EVID[Evidence]
ROOT --> OPS[Operations]
ROOT --> RC["Release Control (ROOT)"]
ROOT --> INT[Integrations]
ROOT --> SEC[Security]

REL --> REL_LIST["Releases (Promotions)"]
REL_LIST --> PROMO_DETAIL[Promotion Detail]
PROMO_DETAIL --> RUN_TAB[Run / Timeline]
RUN_TAB --> STEP_DETAIL[Step Detail: logs + artifacts + evidence]
RUN_TAB --> ROLLBACK[Rollback / Re-run]
RUN_TAB --> SCHEDULE[Schedule / Automation]

STEP_DETAIL -. export evidence .-> EVID
STEP_DETAIL -. replay policy .-> EVID
RUN_TAB -. ops health .-> OPS

EVID --> PKT[Packets]
EVID --> CHAIN[Proof Chains]
EVID --> REPLAY[Replay/Verify]
EVID --> EXPORT[Export Center]
EVID --> BUNDLES[Evidence Bundles]

OPS --> ORCH[Orchestrator]
OPS --> SCHED[Scheduler Runs]
OPS --> DLQ[Dead Letter]
OPS --> FEEDS[Feeds + AirGap Ops]
OPS --> HEALTH[Platform Health]

RUN_TAB -. links to .-> ORCH
RUN_TAB -. links to .-> SCHED
RUN_TAB -. links to .-> FEEDS
RUN_TAB -. links to .-> HEALTH

PROMO_DETAIL -. findings snapshot .-> SEC
PROMO_DETAIL -. env inputs .-> RC
PROMO_DETAIL -. secrets/providers .-> INT

14.2 Run lifecycle graph (Mermaid) — promotion execution stages + checkpoints

flowchart LR
A[Promotion Created] --> B[Inputs Materialized]
B --> C[Policy Gate Eval]
C --> D{Approval Required?}
D -- yes --> E[Approval Decision]
D -- no --> F[Deploy Workflow Start]

E --> F
F --> G[Canary 10%]
G --> H{SLO/Health OK?}
H -- no --> R[Auto-Rollback / Pause]
H -- yes --> I[Canary 50%]
I --> J{SLO/Health OK?}
J -- no --> R
J -- yes --> K[100% Rollout]
K --> L[Post-Deploy Verify]
L --> M[Finalize + Seal Evidence]
M --> N[Promotion Complete]

%% Evidence capture points
C -. DSSE policy decision .-> EV[Evidence Pack]
F -. provenance/attestations .-> EV
L -. runtime reachability snapshot .-> EV
M -. Rekor/tlog receipts .-> EV
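
The lifecycle graph above can be read as a small state machine. The sketch below is illustrative only: the `Stage` enum, the `TRANSITIONS` table, and `advance` are hypothetical names, not part of Stella Ops, and the "Approval Required?" and "SLO/Health OK?" decision nodes are folded into the outgoing edges of the stage before them.

```python
from enum import Enum


class Stage(Enum):
    CREATED = "Promotion Created"
    INPUTS = "Inputs Materialized"
    GATE = "Policy Gate Eval"
    APPROVAL = "Approval Decision"
    DEPLOY_START = "Deploy Workflow Start"
    CANARY_10 = "Canary 10%"
    CANARY_50 = "Canary 50%"
    ROLLOUT_100 = "100% Rollout"
    VERIFY = "Post-Deploy Verify"
    SEAL = "Finalize + Seal Evidence"
    COMPLETE = "Promotion Complete"
    ROLLBACK = "Auto-Rollback / Pause"


# Allowed edges, transcribed from the lifecycle graph (decision nodes collapsed).
TRANSITIONS = {
    Stage.CREATED: {Stage.INPUTS},
    Stage.INPUTS: {Stage.GATE},
    Stage.GATE: {Stage.APPROVAL, Stage.DEPLOY_START},
    Stage.APPROVAL: {Stage.DEPLOY_START},
    Stage.DEPLOY_START: {Stage.CANARY_10},
    Stage.CANARY_10: {Stage.CANARY_50, Stage.ROLLBACK},
    Stage.CANARY_50: {Stage.ROLLOUT_100, Stage.ROLLBACK},
    Stage.ROLLOUT_100: {Stage.VERIFY},
    Stage.VERIFY: {Stage.SEAL},
    Stage.SEAL: {Stage.COMPLETE},
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return `target` if the graph allows the move, otherwise raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Keeping the edges in one table means the timeline UI and any backend validation could share the same definition of what "next step" means.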

14.3 Screen — Run / Timeline (Promotion Run)

Formerly (where it lived pre-redesign)

Pieces existed but were fragmented:

Control Plane dashboard showed Active Deployments (high-level only).
Operations → Orchestrator (jobs access) and Operations → Scheduler (runs) were operational but not “release narrative”.
Evidence was in Evidence → Packets / Proof Chains / Export, but not tied to a run timeline.
Any detailed logs typically lived outside Stella (CI/CD, deploy system, cluster logs).

Why changed like this

A release promotion must be auditable as a single storyline:
- what happened,
- when,
- what data it used,
- what it decided,
- what evidence was sealed at each checkpoint,
- and what actions are safe now (pause, rollback, replay).
This screen becomes the single pane that links out to specialized areas (Ops, Evidence), instead of forcing users to hunt.
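
One way to picture the storyline is as a list of step records, each carrying the "what, when, inputs, decision, evidence, safe actions" fields listed above. This is a minimal sketch under assumptions: `TimelineStep` and `current_step` are hypothetical names, and the field set is a guess at a reasonable shape, not the Stella Ops schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TimelineStep:
    """One row of the run timeline (hypothetical shape)."""
    name: str
    status: str                        # "done", "running", or "pending"
    started_at: Optional[str] = None   # wall-clock "HH:MM"; None while pending
    decision: str = ""                 # what the step decided, e.g. "PASS/WARN"
    inputs: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    safe_actions: List[str] = field(default_factory=list)


def current_step(steps: List[TimelineStep]) -> Optional[TimelineStep]:
    """Return the first step that is not done, i.e. where the run is now."""
    return next((s for s in steps if s.status != "done"), None)


# Sample rows in the spirit of the run timeline mock.
steps = [
    TimelineStep("Inputs Materialized", "done", "08:30", inputs=["Vault/Consul"]),
    TimelineStep("Gate Eval (Policy)", "done", "08:31", decision="PASS/WARN",
                 evidence=["policy-decision.dsse"]),
    TimelineStep("Approval", "done", "08:32", decision="APPROVED"),
    TimelineStep("Deploy Canary 10%", "running", "08:33", safe_actions=["pause"]),
    TimelineStep("Deploy Canary 50%", "pending"),
]
```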

Screen graph (Mermaid)

flowchart TD
A[Run / Timeline] --> B[Stage timeline with checkpoints]
A --> C[Current status + next step]
A --> D[Links to logs, artifacts, evidence]
A --> E[Actions: pause/retry/rollback]
A --> F[Data health banner: feeds/jobs/integrations]
A --> G[Drill into Step Detail]

ASCII mock

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Promotion Run / Timeline                                                                       │
│ Legacy name/location: No single screen. Pieces were Control Plane "Active Deployments" + Ops. │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Platform Release 1.3.0-rc1  manifest sha256:beef...                                 │
│ Target: EU-West / eu-stage → eu-prod         Workflow: Canary 10→50→100                         │
│ Status: RUNNING (Canary 10%)            Started: Feb 18, 08:30                                  │
│ Data health: WARN — NVD stale 3h | Rescan job failed (worker) | Jenkins degraded                │
│ Links: [Ops Feeds] [System Jobs] [Integrations]                                                 │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Timeline (click any step)                                                                      │
│  08:30  ✓ Inputs Materialized     (Vault/Consul resolved, 0 missing)           [View]          │
│  08:31  ✓ Gate Eval (Policy)      PASS/WARN (reach runtime 35%)               [View]          │
│  08:32  ✓ Approval               APPROVED by bob.smith                         [View]          │
│  08:33  ▶ Deploy Canary 10%       RUNNING (2/10 targets healthy)               [View] [Pause] │
│  ----    ○ Deploy Canary 50%      PENDING                                      [—]            │
│  ----    ○ Deploy 100%            PENDING                                      [—]            │
│  ----    ○ Post-Deploy Verify     PENDING                                      [—]            │
│  ----    ○ Seal Evidence          PENDING                                      [—]            │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Quick actions: [Pause] [Retry step] [Rollback] [Export evidence (partial)] [Replay policy]     │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

14.4 Screen — Step Detail (Logs + Artifacts + Evidence captured at that checkpoint)

Formerly

Logs: CI/CD (e.g., Jenkins), deploy agent logs, platform logs — outside Stella.
Evidence: visible only under Evidence menus and not connected to “the step that created it”.

Why changed like this

Step Detail is the “unit of explanation”.
Every meaningful checkpoint should show:
- inputs used,
- outputs produced,
- logs,
- evidence items sealed (or pending),
- and links to canonical storage (Evidence Packets / Proof Chains).
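
To make "evidence items sealed" concrete, the sketch below shows how a step could record a `sha256:` digest for an item and later confirm that the stored bytes still match it. The names `digest_ref`, `EvidenceItem`, and `matches` are hypothetical, and the envelope bytes are placeholder data, not a real DSSE envelope.

```python
import hashlib
from dataclasses import dataclass


def digest_ref(payload: bytes) -> str:
    """Format a SHA-256 digest as 'sha256:<hex>'."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class EvidenceItem:
    name: str
    digest: str   # digest recorded when the checkpoint captured the item


def matches(item: EvidenceItem, payload: bytes) -> bool:
    """True when the stored bytes still hash to the recorded digest."""
    return item.digest == digest_ref(payload)


# Placeholder bytes standing in for an evidence file.
envelope = b'{"payloadType": "example", "payload": ""}'
item = EvidenceItem("policy-decision.dsse", digest_ref(envelope))
```

A check like `matches(item, envelope)` is what lets a Step Detail view show an item as verified rather than merely present.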

Screen graph (Mermaid)

flowchart TD
A[Step Detail] --> B[Overview: inputs/outputs + timestamps]
A --> C["Logs (stream / download)"]
A --> D["Artifacts (manifests, plans, diffs)"]
A --> E["Evidence items (DSSE, receipts, proofs)"]
A --> F[Actions: retry step / mark failed / pause]
A --> G[Jump: Evidence Packet / Proof Chain]

ASCII mock

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Step Detail: Gate Eval (Policy)                                                                │
│ Legacy name/location: gate result surfaced loosely on Approvals; evidence elsewhere.           │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Start: 08:31  End: 08:31:12  Duration: 12s   Result: PASS (2 WARN)                             │
│ Inputs: bundle manifest sha256:beef... | baseline Prod-EU-West | feeds: NVD stale 3h            │
│ Outputs: policy verdict id: verdict-123 | decision digest: sha256:dd77...                       │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Tabs: [Overview] [Logs] [Artifacts] [Evidence]                                                  │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Evidence captured                                                                              │
│  ✓ DSSE envelope: policy-decision.dsse  (digest sha256:dd77...)                                 │
│  ✓ Rekor receipt: rekor-entry.json      (tlog index 9918271)                                    │
│  ○ Proof chain: pending until "Seal Evidence" step                                              │
│ Links: [Open Evidence Packet] [Open Proof Chain] [Replay this Verdict]                          │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

14.5 Screen — Deploy Stage View (targets, health, checkpoints, rollback triggers)

Formerly

“Active Deployments” showed minimal progress.
Detailed rollout/targets health likely lived in your deploy system (outside Stella).
Platform Health screen exists, but not contextualized to a specific promotion.

Why changed like this

This is where “release operations” actually happens:
- show targets in the region/env,
- show health gates / SLO checks,
- show automatic rollback triggers,
- link to platform health and logs.

Screen graph (Mermaid)

flowchart TD
A[Deploy Stage View] --> B["Targets table (per region/env)"]
A --> C[SLO / health checks]
A --> D[Auto-rollback rules + trigger state]
A --> E[Actions: pause/continue/rollback]
A --> F[Link: Platform Health]

ASCII mock

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Step Detail: Deploy Canary 10%                                                                  │
│ Legacy name/location: Control Plane "Active Deployments" (summary only) + external deploy logs │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Stage: Canary 10%   Policy: proceed if 95% healthy for 5m, error rate < 1%                      │
│ Current: 2/10 healthy  | Error rate: 0.4% | Latency p95: 210ms | SLO: OK                         │
│ Auto-rollback trigger: NOT TRIGGERED                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Targets (EU-West / eu-prod)                                                                     │
│ ┌───────────────┬───────────┬──────────┬──────────────┬───────────────┐                        │
│ │ Target         │ Version    │ Health   │ Notes         │ Logs          │                        │
│ ├───────────────┼───────────┼──────────┼──────────────┼───────────────┤                        │
│ │ eu-prod-01     │ bundle@beef│ ✓        │ ok            │ [open]        │                        │
│ │ eu-prod-02     │ bundle@beef│ ✓        │ ok            │ [open]        │                        │
│ │ eu-prod-03     │ old        │ ○        │ pending       │ [open]        │                        │
│ └───────────────┴───────────┴──────────┴──────────────┴───────────────┘                        │
│ Actions: [Pause] [Continue to 50%] (disabled until criteria met) [Rollback] [Open Platform Health]│
└──────────────────────────────────────────────────────────────────────────────────────────────┘
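
The stage policy in the mock ("proceed if 95% healthy for 5m, error rate < 1%") could be evaluated along these lines. This is a sketch under assumptions: `HealthSample` and `may_continue` are hypothetical names, and it assumes one health sample per minute, which the source does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthSample:
    minute: int        # minutes since the stage started
    healthy: int       # targets reporting healthy
    total: int         # targets in the stage
    error_rate: float  # fraction, so 0.004 means 0.4%


def may_continue(samples: List[HealthSample], min_healthy: float = 0.95,
                 window_min: int = 5, max_error_rate: float = 0.01) -> bool:
    """True when each of the last `window_min` one-minute samples passes both checks."""
    if not samples:
        return False
    latest = max(s.minute for s in samples)
    window = [s for s in samples if s.minute > latest - window_min]
    if len(window) < window_min:
        return False  # not enough history yet
    return all(
        s.total > 0
        and s.healthy / s.total >= min_healthy
        and s.error_rate < max_error_rate
        for s in window
    )
```

With the mock's current state (2/10 healthy), `may_continue` returns `False`, which corresponds to the disabled [Continue to 50%] action.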

14.6 Screen — Rollback / Re-run (safe ops controls)

Formerly

Rollback existed as a status (“ROLLED_BACK”) in Releases list.
Actual rollback execution likely happened externally or via Orchestrator privileges.

Why changed like this

Rollback must be:
- explicit,
- traceable,
- evidence-backed (what was rolled back, why, and what is the resulting state).
Re-run covers transient failures (e.g., feed sync delay, rescan job retry) but must preserve determinism: each re-run records new, timestamped evidence and keeps the earlier evidence intact.
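
The "record new evidence, keep old evidence" rule amounts to an append-only log keyed by step and attempt. The sketch below illustrates that idea only; `EvidenceRecord` and `EvidenceLog` are hypothetical names, and the timestamps and digests are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvidenceRecord:
    step: str
    attempt: int       # 1 for the first run, 2 for the first re-run, ...
    recorded_at: str   # timestamp string supplied by the caller
    digest: str


class EvidenceLog:
    """Append-only: records are added, never replaced or removed."""

    def __init__(self) -> None:
        self._records: List[EvidenceRecord] = []

    def append(self, step: str, recorded_at: str, digest: str) -> EvidenceRecord:
        # The attempt number is derived from how many records the step already has.
        attempt = 1 + sum(1 for r in self._records if r.step == step)
        record = EvidenceRecord(step, attempt, recorded_at, digest)
        self._records.append(record)
        return record

    def history(self, step: str) -> Tuple[EvidenceRecord, ...]:
        """All attempts for a step, oldest first."""
        return tuple(r for r in self._records if r.step == step)


log = EvidenceLog()
first = log.append("Deploy Canary 10%", "08:33", "sha256:placeholder-1")
second = log.append("Deploy Canary 10%", "08:41", "sha256:placeholder-2")
```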

Screen graph (Mermaid)

flowchart TD
A[Rollback/Re-run] --> B[Select scope: step / stage / full rollback]
A --> C["Preview impact (targets + versions)"]
A --> D[Reason + ticket]
A --> E[Execute]
E --> F[Run Timeline updates + evidence appended]

ASCII mock

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Rollback / Re-run                                                                              │
│ Legacy name/location: Release status "ROLLED_BACK" existed; rollback execution was not unified │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Platform Release 1.3.0-rc1  → EU-West/eu-prod                                       │
│ Current stage: Canary 10% (RUNNING)                                                             │
│                                                                                                 │
│ Choose action:                                                                                │
│  ( ) Re-run current step (Deploy Canary 10%)                                                    │
│  ( ) Pause promotion                                                                           │
│  ( ) Rollback to previously deployed bundle version (manifest sha256:prev...)                  │
│                                                                                                 │
│ Preview rollback impact:                                                                        │
│  - 2 targets currently on new bundle → will revert to prev bundle                               │
│  - 8 targets still old → unchanged                                                              │
│                                                                                                 │
│ Reason (required): [ incident #1234: elevated latency ]                                          │
│ [Execute]   [Cancel]                                                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
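
The "Preview rollback impact" panel boils down to partitioning targets by the bundle they currently run. A minimal sketch, assuming a hypothetical `preview_rollback` helper and using the opaque labels `"new"` and `"prev"` in place of real manifest digests:

```python
from typing import Dict, List


def preview_rollback(targets: Dict[str, str], new_bundle: str,
                     prev_bundle: str) -> Dict[str, object]:
    """Split targets into those that would revert and those left untouched."""
    revert: List[str] = sorted(t for t, b in targets.items() if b == new_bundle)
    unchanged: List[str] = sorted(t for t, b in targets.items() if b != new_bundle)
    return {"revert": revert, "unchanged": unchanged, "revert_to": prev_bundle}


# Ten targets, two of which already run the new bundle (as in the mock).
targets = {f"eu-prod-{i:02d}": ("new" if i <= 2 else "prev") for i in range(1, 11)}
impact = preview_rollback(targets, new_bundle="new", prev_bundle="prev")
```

Here `impact["revert"]` lists the two targets on the new bundle and `impact["unchanged"]` the remaining eight, matching the counts shown in the mock.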

14.7 Screen — Evidence Timeline (what evidence exists now vs what seals at finalize)

Formerly

Evidence existed under:
- Evidence → Packets
- Evidence → Proof Chains
- Evidence → Export
- Evidence → Evidence Bundles

The relationship between these artifacts and the run stages wasn’t visible.

Why changed like this

Auditors and operators need to answer:
- “What evidence is already available mid-run?”
- “What is pending until completion?”
- “What exactly was sealed and when?”
This is the bridge between Ops timeline and audit artifacts.
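
The first two questions above reduce to splitting evidence items into "available" and "pending" per checkpoint. The sketch below assumes a hypothetical `split_evidence` helper and a flat `(checkpoint, item, sealed)` tuple layout; the sample rows reuse names from the Evidence Timeline mock.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, bool]  # (checkpoint, evidence item name, already captured?)


def split_evidence(rows: Iterable[Row]) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Group evidence names by checkpoint into (available, pending)."""
    available: Dict[str, List[str]] = {}
    pending: Dict[str, List[str]] = {}
    for checkpoint, name, sealed in rows:
        bucket = available if sealed else pending
        bucket.setdefault(checkpoint, []).append(name)
    return available, pending


rows = [
    ("Inputs Materialized", "resolved-inputs.json", True),
    ("Gate Eval (Policy)", "policy-decision.dsse", True),
    ("Gate Eval (Policy)", "rekor receipt", True),
    ("Deploy Canary 10%", "deploy-attestation.dsse", False),
    ("Seal Evidence (final)", "proof-chain.json", False),
]
available, pending = split_evidence(rows)
```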

Screen graph (Mermaid)

flowchart LR
A["Evidence Timeline (per promotion)"] --> B[Evidence items by checkpoint]
A --> C[Open Packet]
A --> D[Open Proof Chain]
A --> E[Export Evidence Pack]
A --> F[Generate Auditor Bundle]

ASCII mock

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Evidence Timeline — Promotion Run                                                              │
│ Legacy name/location: Evidence artifacts existed, but not linked to run checkpoints             │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Checkpoint → Evidence                                                                           │
│  Inputs Materialized                                                                            │
│   ✓ resolved-inputs.json (hash sha256:aa11...)                                                   │
│                                                                                                  │
│  Gate Eval (Policy)                                                                             │
│   ✓ policy-decision.dsse  ✓ rekor receipt  ✓ verdict-123                                         │
│                                                                                                  │
│  Deploy Canary 10%                                                                              │
│   ○ deploy-attestation.dsse (pending)                                                            │
│                                                                                                  │
│  Seal Evidence (final)                                                                          │
│   ○ proof-chain.json  ○ audit-pack.tar.gz  ○ evidence-bundle.zip                                 │
│                                                                                                  │
│ Actions: [Open Evidence Packet] [Open Proof Chain] [Export Pack (partial)] [Generate Auditor Bundle]│
└──────────────────────────────────────────────────────────────────────────────────────────────┘

14.8 Screen — Replay/Verify (contextual replay for this run)

Formerly

Evidence → Replay/Verify (“Verdict Replay”) existed as a standalone screen:
- user inputs verdict id or image reference,
- sees replay requests + determinism overview.

Why changed like this

Replay should be reachable from where it matters:
- a specific policy decision checkpoint in a promotion run.
Keep the existing Replay/Verify functionality, but add a contextual wrapper:
- pre-fills verdict id + bundle digest + env baseline,
- shows determinism status for this promotion.
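
The contextual wrapper is essentially a pre-fill step: it copies the verdict id, bundle digest, and baseline from the policy checkpoint into a replay request. A minimal sketch, assuming a hypothetical `ReplayRequest` type and `prefill_replay` helper, with the checkpoint represented as a plain dict:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayRequest:
    verdict_id: str
    bundle_digest: str
    baseline: str
    reason: str = ""   # left for the user to fill in


def prefill_replay(gate_step: dict) -> ReplayRequest:
    """Build a replay request from the fields a policy-gate checkpoint recorded."""
    required = ("verdict_id", "bundle_digest", "baseline")
    missing = [key for key in required if not gate_step.get(key)]
    if missing:
        raise ValueError("checkpoint is missing: " + ", ".join(missing))
    return ReplayRequest(gate_step["verdict_id"], gate_step["bundle_digest"],
                         gate_step["baseline"])


# Values as displayed in the mocks; the digest is shown truncated there and stays truncated here.
gate_step = {"verdict_id": "verdict-123", "bundle_digest": "sha256:beef...",
             "baseline": "Prod-EU-West"}
request = prefill_replay(gate_step)
```

Failing loudly on a missing field keeps the pre-filled form from silently replaying against the wrong bundle or baseline.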

Screen graph (Mermaid)

flowchart TD
A[Run → Replay/Verify] --> B[Pre-filled replay request]
B --> C[Replay requests list]
C --> D[Determinism metrics]
D --> E[Link: Evidence → Replay/Verify canonical view]

ASCII mock

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Replay/Verify — For this Promotion                                                             │
│ Legacy name/location: "Verdict Replay" (Evidence → Replay/Verify)                               │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Pre-filled replay request                                                                       │
│  Verdict ID: verdict-123                                                                         │
│  Bundle: Platform Release 1.3.0-rc1  manifest sha256:beef...                                    │
│  Baseline: Prod-EU-West                                                                          │
│  Reason: [ Audit verification / policy change test ]                                             │
│ [Request Replay]                                                                                │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Recent replay requests (for this promotion)                                                     │
│  rr-001  COMPLETED  Feb 18, 08:30  match                                                        │
│  rr-002  RUNNING    Feb 18, 07:30                                                               │
│ Determinism: total 2 | matching 1 | mismatches 1 | match rate 50%                               │
│ Link: [Open canonical Replay/Verify screen]                                                     │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

14.9 Screen — Schedule / Automation (promotion scheduling + link to Scheduler Runs)

Formerly

Operations → Scheduler existed (“Scheduler Runs”) but disconnected from promotions.
Release list had statuses but scheduling wasn’t first-class in the release context.

Why changed like this

Scheduling belongs to release operations, but we don’t want a new menu.
This screen:
- schedules this promotion (or a step),
- writes a scheduler job,
- then links to Scheduler Runs for execution diagnostics.
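
Before a scheduled promotion starts, its preconditions have to be evaluated, with some treated as blocking and others as warn-only. The sketch below is one possible reading of that split; `Precondition` and `evaluate` are hypothetical names, and treating "Integrations healthy" as non-blocking is an assumption based on the "(warn only)" label in the mock.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Precondition:
    name: str
    satisfied: bool
    blocking: bool = True   # False means "warn only"


def evaluate(preconditions: List[Precondition]) -> Tuple[bool, List[str], List[str]]:
    """Return (may_start, blockers, warnings) for a scheduled promotion."""
    blockers = [p.name for p in preconditions if p.blocking and not p.satisfied]
    warnings = [p.name for p in preconditions if not p.blocking and not p.satisfied]
    return (not blockers, blockers, warnings)


checks = [
    Precondition("NVD/OSV feeds fresh (< 1h)", satisfied=True),
    Precondition("SBOM rescans complete", satisfied=True),
    Precondition("Integrations healthy", satisfied=False, blocking=False),
]
may_start, blockers, warnings = evaluate(checks)
```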

Screen graph (Mermaid)

flowchart LR
A[Schedule Promotion] --> B[Choose time/window]
A --> C["Choose constraints (feeds fresh, scans complete)"]
A --> D[Create scheduler job]
D --> E[View Scheduler Runs]
E --> F[Back to Run Timeline]

ASCII mock

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ Schedule Promotion                                                                             │
│ Legacy name/location: Ops → Scheduler (runs), no promotion-level scheduling UI                 │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Promotion: Hotfix Bundle 1.2.4 → US-East/us-prod                                                │
│                                                                                                 │
│ Schedule: [ Feb 19, 02:00 AM ]  Window: [ 2h ]                                                  │
│ Preconditions:                                                                                  │
│  [x] NVD/OSV feeds fresh (< 1h)                                                                 │
│  [x] SBOM rescans complete                                                                      │
│  [ ] Integrations healthy (warn only)                                                           │
│                                                                                                 │
│ [Create Schedule]   Link: [Open Scheduler Runs]                                                 │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

Result: what you gain with Pack 14

A promotion is now a single auditable narrative:
- timeline + logs + checkpoints,
- policy decision trace,
- deploy stage health gates,
- rollback controls,
- evidence sealing,
- deterministic replay.
Hybrid reachability becomes a second-class signal woven into checkpoints (Policy + Post-Deploy Verify), not a top-level section.
Existing PoC pages remain valid, but are now linked meaningfully from the run storyline.

Next pack: Pack 15 will unify Nightly Ops Report + Data Freshness (feeds, rescans, integration degradation) into a single Operations “Data Integrity” view and show how it bubbles up to Dashboard/Releases/Approvals without duplicating screens.
