## Pack 15 — Operations: **Data Integrity** (Nightly Ops Report + Data Freshness unified; bubbles into Dashboard / Releases / Approvals)
This pack creates a single **Operations → Data Integrity** area that answers:
“**Can we trust today's SBOM/CVE/reachability data to approve and promote? If not, what's broken, where, and what decisions are impacted?**”
It **does not duplicate** existing specialized pages (Scheduler/Orchestrator/Integrations/Feeds). It **summarizes + links** to them.
---
# 15.1 Operations menu graph (Mermaid) — Data Integrity added as a first-class Ops area
```mermaid
flowchart TD
OPS[Operations] --> OPS_DI[Data Integrity]
OPS --> OPS_PH[Platform Health]
OPS --> OPS_ORCH[Orchestrator]
OPS --> OPS_SCHED[Scheduler]
OPS --> OPS_DLQ[Dead Letter]
OPS --> OPS_QUOTA[Quotas]
OPS --> OPS_EXPORT[Export]
OPS_DI --> DI_OV[Overview]
OPS_DI --> DI_NIGHT[Nightly Ops Report]
OPS_DI --> DI_FEEDS[Feeds Freshness]
OPS_DI --> DI_SCAN[Scan Pipeline Health]
OPS_DI --> DI_REACH[Reachability Ingest Health]
OPS_DI --> DI_INTEG[Integration Connectivity]
OPS_DI --> DI_DLQ[DLQ & Replays]
OPS_DI --> DI_SLO[Data Quality SLOs]
```
**Design intent:** “Data Integrity” is the operator console for **freshness + pipeline status** that directly affects approvals/promotions.
---
# 15.2 Bubble-up graph (Mermaid) — how Data Integrity signals surface elsewhere (no duplication)
```mermaid
flowchart LR
DI["Ops: Data Integrity\n(single source of truth for data health)"] --> DASH[Dashboard\nNightly Ops Signals card]
DI --> REL[Releases List\nData Health column + banner]
DI --> APR[Approvals\nOps/Data tab + warnings]
DI --> SEC[Security Overview\nFeed freshness + scan freshness badges]
DI --> ENV[Env Detail\nSBOM freshness + runtime coverage]
DI --> INT[Integrations Hub\nconnector config & tests]
DI --> FEED[Feeds & AirGap Ops\nmirrors/locks/airgap artifacts]
DI --> SCHED[Scheduler Runs]
DI --> ORCH[Orchestrator Jobs]
DI --> DLQ[Dead Letter]
DI --> PH[Platform Health]
```
---
# 15.3 Screen — Data Integrity Overview
### Previously (where it lived)
* There was **no single overview**.
* Equivalent fragments existed in:
  * **Nightly Ops Report** (your new screen request),
  * **Operations → Feeds** (freshness),
  * **Settings → Integrations** (connectivity),
  * **Settings → System → Background Jobs** (job failures),
  * **Operations → Dead Letter** (queue stuck),
  * plus scattered banners on approvals.
### Why changed like this
You need **one** authoritative place to see:
* **SBOM scan / rescan status**
* **CVE feed sync freshness**
* **Integration connectivity**
* **Reachability ingest health (build / image / runtime)**
* **Which approvals/releases are currently “unsafe to approve” because data is stale**
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Data Integrity Overview] --> B[Nightly Ops Report]
A --> C[Feeds Freshness]
A --> D[Scan Pipeline Health]
A --> E[Reachability Ingest Health]
A --> F[Integration Connectivity]
A --> G[DLQ & Replays]
A --> H[Platform Health]
A --> I["Impacted Decisions\n(approvals/releases)"]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ OVERVIEW │
│ Legacy: N/A (new). Previously: Ops Feeds + Settings System Jobs + Integrations + DLQ scattered │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Scope: Region ▾ (All) Environment Type ▾ (All) Window ▾ (24h) │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ DATA TRUST SCORE (for approvals/promotions) │
│ Feeds Freshness: WARN (NVD stale 3h) SBOM Pipeline: FAIL (rescan job failing) │
│ Reachability Ingest: WARN (runtime coverage 35%) Integrations: DEGRADED (Jenkins) │
│ DLQ: WARN (reachability events queued: 1,230) │
│ Links: [Nightly Ops Report] [Feeds Freshness] [Integrations] [DLQ] │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ IMPACTED DECISIONS │
│ Approvals blocked due to data issues: 2 │
│ - Platform Release 1.3.0-rc1 → EU-West/eu-prod (SBOM incomplete + NVD stale) [Open] │
│ Promotions running with WARN confidence: 1 [Open Releases filtered] │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ TOP FAILURES (what to fix first) │
│ 1) Nightly SBOM rescan FAILED (registry auth timeout) → stale SBOM on 12 component versions │
│ 2) NVD feed stale 3h → CVE freshness gate WARN/FAIL depending on baseline │
│ 3) Runtime reachability ingest lagging (agent-apac-01 degraded) → runtime coverage 35% │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
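
The headline signals in the mock can be reduced to one overall status with a worst-of rollup. The sketch below is a minimal illustration, not the product's scoring logic: the severity order (OK < WARN < DEGRADED < FAIL) and the names `Status`, `rollup`, and `worst_signals` are assumptions, since this pack does not define how the trust score is computed.

```python
from enum import IntEnum


class Status(IntEnum):
    # Assumed severity order; the pack does not define one.
    OK = 0
    WARN = 1
    DEGRADED = 2
    FAIL = 3


def rollup(signals: dict[str, Status]) -> Status:
    """Return the worst status across all signals (OK when there are none)."""
    return max(signals.values(), default=Status.OK)


def worst_signals(signals: dict[str, Status]) -> list[str]:
    """Names of the signals that determine the rolled-up status."""
    worst = rollup(signals)
    return sorted(name for name, status in signals.items() if status == worst)


# Sample data copied from the Overview mock.
overview = {
    "Feeds Freshness": Status.WARN,
    "SBOM Pipeline": Status.FAIL,
    "Reachability Ingest": Status.WARN,
    "Integrations": Status.DEGRADED,
    "DLQ": Status.WARN,
}

print(rollup(overview).name, worst_signals(overview))
```

A worst-of rollup keeps the overview conservative: a single failing pipeline is enough to flag the data as untrustworthy for approvals.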
---
# 15.4 Screen — Nightly Ops Report (Jobs + causes + impact)
### Previously (where it lived)
* You asked for “some report about nightly jobs status” (new requirement).
* Related fragments existed in:
  * **Settings → System → Background Jobs**
  * **Operations → Scheduler** (runs)
  * **Operations → Orchestrator** (job execution)
  * plus manual checks in logs
### Why changed like this
Nightly Ops Report becomes the **release-impact view** of jobs:
* not just “job failed”
* but **what release governance capability is now untrustworthy** (feeds/scans/reachability/evidence).
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Nightly Ops Report] --> B[Job Run Detail]
A --> C[Scheduler Runs]
A --> D[Orchestrator]
A --> E[DLQ & Replays]
A --> F[Integrations Detail]
A --> G[Impacted Bundles/Envs]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ NIGHTLY OPS REPORT │
│ Legacy: Settings ▸ System ▸ Background Jobs + Ops Scheduler/Orchestrator (no release context) │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Window: Last 24h Region: All │
│ Summary: 4 jobs OK 2 WARN 1 FAIL │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Job Schedule Last Run Status Why it matters (release impact) │
│-------------------------------------------------------------------------------------------------│
│ cve-sync-osv 02:00 02:01 OK vulnerability data freshness │
│ cve-sync-nvd 02:05 02:05 WARN NVD stale → gating confidence drops │
│ sbom-ingest-registry 02:10 02:10 OK new images get SBOM │
│ sbom-nightly-rescan 02:20 02:21 FAIL stale SBOM → approvals may block │
│ reachability-ingest-image 02:30 02:31 OK image reachability evidence │
│ reachability-ingest-runtime 02:35 02:36 WARN runtime reach coverage degraded │
│ evidence-seal-bundles 02:45 02:46 OK audit pack completion │
│-------------------------------------------------------------------------------------------------│
│ Row actions: [View Run] [Open Scheduler] [Open Orchestrator] [Open Integration] [Open DLQ] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
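
The report pairs each job with the release capability it protects. A minimal sketch of that row model follows, using the seven rows of the mock as sample data. `JobRow`, `summarize`, and `needs_attention` are hypothetical names; the real Scheduler/Orchestrator data model is not described in this pack.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRow:
    job: str
    status: str  # "OK" | "WARN" | "FAIL"
    release_impact: str


# Rows copied from the Nightly Ops Report mock.
ROWS = [
    JobRow("cve-sync-osv", "OK", "vulnerability data freshness"),
    JobRow("cve-sync-nvd", "WARN", "NVD stale, gating confidence drops"),
    JobRow("sbom-ingest-registry", "OK", "new images get SBOM"),
    JobRow("sbom-nightly-rescan", "FAIL", "stale SBOM, approvals may block"),
    JobRow("reachability-ingest-image", "OK", "image reachability evidence"),
    JobRow("reachability-ingest-runtime", "WARN", "runtime reach coverage degraded"),
    JobRow("evidence-seal-bundles", "OK", "audit pack completion"),
]


def summarize(rows: list[JobRow]) -> dict[str, int]:
    """Count rows per status for the summary line."""
    counts = Counter(row.status for row in rows)
    return {status: counts.get(status, 0) for status in ("OK", "WARN", "FAIL")}


def needs_attention(rows: list[JobRow]) -> list[JobRow]:
    """Non-OK rows, failures first; ties keep their table order."""
    order = {"FAIL": 0, "WARN": 1}
    return sorted((r for r in rows if r.status in order), key=lambda r: order[r.status])


for row in needs_attention(ROWS):
    print(f"{row.status:4} {row.job}: {row.release_impact}")
```

Deriving the summary line from the rows, rather than storing it separately, keeps the header and the table from drifting apart.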
---
# 15.5 Screen — Job Run Detail (root cause + affected assets)
### Previously
* Scheduler/Orchestrator showed raw execution, but not mapped to:
  * “affected environments”
  * “affected bundles”
  * “approvals degraded”
### Why changed like this
This is the **investigation page** that bridges Ops mechanics to release decisions.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Job Run Detail] --> B[Logs & traces]
A --> C["Failed items list\n(images/components/envs)"]
A --> D[Open DLQ bucket]
A --> E[Open Integration Detail]
A --> F[Show impacted approvals/releases]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ Job Run Detail: sbom-nightly-rescan (Run #8841) │
│ Legacy: Scheduler/Orchestrator run detail (without release impact mapping) │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Status: FAIL Started: 02:21 Ended: 02:24 Error: registry auth timeout │
│ Integration: Harbor Registry (token expired) → [Open Integration Detail] │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Affected items │
│ - 12 images not rescanned (SBOM freshness > 24h) │
│ - 3 bundle versions impacted (approvals may block) │
│ - Regions impacted: EU-West, US-East │
│ Links: [Open impacted approvals] [Open bundles] [Open DLQ bucket] [Open logs] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
---
# 15.6 Screen — Feeds Freshness (operator view, but tied to gating)
### Previously (where it lived)
* **Operations → Feeds** (“Feed Mirror & AirGap Operations” → Sources & Freshness)
* Also partially visible as “feeds” cards under Integrations.
### Why changed like this
Feeds Freshness becomes a **Data Integrity sub-page** because it's primarily:
* “Can we trust vulnerability data for today's approvals?”
It still links to **Feeds & AirGap Ops** for mirrors/locks (no duplication).
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Feeds Freshness] --> B[Feeds & AirGap Ops: Sources]
A --> C[Version Locks]
A --> D[Mirror Detail]
A --> E["Impacted approvals\n(CVE freshness gate)"]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ FEEDS FRESHNESS │
│ Legacy: Operations ▸ Feeds ▸ Sources & Freshness (and partial cards in Integrations) │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Region: EU-West SLA profile: Prod (fresh < 2h) │
│ │
│ Source Status Last Sync SLA Resulting gate impact │
│----------------------------------------------------------------------------------------------- │
│ OSV OK 20m ago 6h OK │
│ NVD WARN 3h ago 2h approvals may WARN/FAIL depending on baseline │
│ CISA KEV OK 3h ago 24h OK │
│ │
│ Actions: [Open Feeds & AirGap Ops] [Apply Version Lock] [Retry NVD Sync] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
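
The Status column in the mock follows from comparing the time since the last sync with the per-source SLA. A minimal sketch of that comparison, assuming a source is WARN as soon as its age exceeds the SLA; the pack does not say when WARN escalates to FAIL, so that step is left out. `freshness_status` is a hypothetical helper.

```python
from datetime import timedelta


def freshness_status(age: timedelta, sla: timedelta) -> str:
    """OK while the last sync is within the SLA window, WARN once it is exceeded."""
    return "OK" if age <= sla else "WARN"


# (time since last sync, SLA) per source, copied from the mock.
feeds = {
    "OSV": (timedelta(minutes=20), timedelta(hours=6)),
    "NVD": (timedelta(hours=3), timedelta(hours=2)),
    "CISA KEV": (timedelta(hours=3), timedelta(hours=24)),
}

statuses = {name: freshness_status(age, sla) for name, (age, sla) in feeds.items()}
print(statuses)
```

The same 3h age is fine for CISA KEV and stale for NVD, which is why the SLA has to be tracked per source rather than globally.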
---
# 15.7 Screen — Scan Pipeline Health (SBOM ingest + rescan + vulnerability match)
### Previously
* SBOM status scattered across:
  * Security views (findings)
  * Jobs views (background jobs)
  * Registry integration
* No single “pipeline health” page to explain staleness.
### Why changed like this
You explicitly require:
* “nightly SBOM rescan issues”
* “CVE source not synced”
This page shows the pipeline chain end-to-end and where it's breaking.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Scan Pipeline Health] --> B[SBOM ingest status]
A --> C[SBOM rescan status]
A --> D[CVE match status]
A --> E[Open Nightly Ops Report]
A --> F[Open Integrations]
A --> G[Open Security findings impact]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ SCAN PIPELINE HEALTH │
│ Legacy: implied across Security + System Jobs + Registry integration │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Pipeline stages (last 24h) │
│ 1) Image discovery (registry) OK new images: 48 │
│ 2) SBOM generation/ingest OK sboms produced: 47 pending: 1 │
│ 3) Nightly SBOM rescan FAIL 12 images stale > 24h │
│ 4) CVE feeds sync WARN NVD stale 3h │
│ 5) CVE ↔ SBOM match/update WARN results may be incomplete │
│ │
│ Impact summary │
│ - Environments with “unknown SBOM freshness”: 2 (EU-West prod, APAC uat) │
│ - Approvals blocked due to missing SBOM: 1 │
│ Links: [Nightly Ops Report] [Feeds Freshness] [Integrations] [Security Findings] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
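
To show where the chain is breaking, the page needs the first stage that is not healthy. A minimal sketch, using the five stages of the mock in display order; `first_break` is a hypothetical helper, and walking the stages as a simple ordered list is an assumption (the pack does not specify the real dependencies between stages).

```python
from typing import Optional

# (stage, status) in display order, copied from the mock.
STAGES = [
    ("Image discovery (registry)", "OK"),
    ("SBOM generation/ingest", "OK"),
    ("Nightly SBOM rescan", "FAIL"),
    ("CVE feeds sync", "WARN"),
    ("CVE/SBOM match/update", "WARN"),
]


def first_break(stages: list[tuple[str, str]]) -> Optional[int]:
    """Index of the first stage that is not OK, or None if every stage is healthy."""
    for index, (_, status) in enumerate(stages):
        if status != "OK":
            return index
    return None


index = first_break(STAGES)
if index is not None:
    name, status = STAGES[index]
    print(f"Chain breaks at stage {index + 1}: {name} ({status})")
```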
---
# 15.8 Screen — Reachability Ingest Health (Build / Image / Runtime)
### Previously
* Reachability was referenced in approvals/security, but ingestion health wasn't first-class.
* Runtime evidence depended on agent telemetry; failures were seen indirectly.
### Why changed like this
You require hybrid reachability evidence from:
* **Dover image**
* **build**
* **running environment**
This screen makes it operationally visible when one source is missing so reachability confidence is downgraded.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Reachability Ingest Health] --> B[Image/Dover ingest]
A --> C[Build ingest]
A --> D[Runtime ingest]
A --> E[Agents health]
A --> F[DLQ bucket]
A --> G[Impact: approvals using reachability gate]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ REACHABILITY INGEST HEALTH │
│ Legacy: implicit (Approvals/Security reachability columns) + Agent health elsewhere │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Coverage (last 24h) │
│ Image/Dover: 100% (OK) Build: 78% (WARN) Runtime: 35% (WARN) │
│ │
│ Pipelines │
│ Image/Dover ingest: OK last batch: 02:31 backlog: 0 │
│ Build ingest: WARN last batch: 01:10 backlog: 220 (CI degraded) │
│ Runtime ingest: WARN last batch: 00:55 backlog: 1,230 (agent-apac-01 degraded) │
│ │
│ Links: [Open Agents] [Open DLQ bucket] [Open impacted approvals] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
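
The downgrade rule described above (confidence drops when any of the three sources is not healthy) can be sketched as follows. The labels `"full"` and `"downgraded"` and the function `reachability_confidence` are hypothetical; the pack does not name confidence levels or define thresholds.

```python
def reachability_confidence(source_status: dict[str, str]) -> tuple[str, list[str]]:
    """Return (confidence, degraded sources); full confidence needs every source OK."""
    degraded = sorted(name for name, status in source_status.items() if status != "OK")
    return ("full" if not degraded else "downgraded", degraded)


# Pipeline statuses copied from the mock.
sources = {"image": "OK", "build": "WARN", "runtime": "WARN"}

confidence, degraded = reachability_confidence(sources)
print(confidence, degraded)
```

Returning the degraded sources alongside the level lets an approval screen explain why confidence dropped, not just that it did.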
---
# 15.9 Screen — Integration Connectivity (data-plane dependencies)
### Previously
* **Settings → Integrations** (hub)
* But release operators need a data-integrity lens: “which pipeline is broken because which connector is down?”
### Why changed like this
This view is the “dependency slice” of Integrations:
* still links to the canonical **Integrations Hub** for configuration,
* but shows **pipeline impact** directly (feeds/scans/reachability/evidence).
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Integration Connectivity] --> B[Integrations Hub]
A --> C[Open Integration Detail]
A --> D[Show dependent jobs]
A --> E[Show impacted approvals/releases]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ INTEGRATION CONNECTIVITY │
│ Legacy: Settings ▸ Integrations (no “pipeline impact” slice) │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Connector Status Dependent pipelines Impact │
│----------------------------------------------------------------------------------------------- │
│ Harbor Registry WARN SBOM rescan, image discovery rescan failing │
│ Jenkins DEGRADED build reachability ingest, attestations build coverage down │
│ Vault OK env input materialization none │
│ Consul OK env config bindings none │
│ NVD Source DISCONNECTED CVE freshness approvals warn/block │
│ │
│ Actions per row: [Open Detail] [Test] [View dependent jobs] [View impacted approvals] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
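
The operator question “which pipeline is broken because which connector is down?” is an inversion of the connector table. A minimal sketch using the rows of the mock; `broken_pipelines` is a hypothetical helper, and treating every non-OK status as unhealthy is an assumption.

```python
# connector -> (status, dependent pipelines), copied from the mock.
CONNECTORS = {
    "Harbor Registry": ("WARN", ["SBOM rescan", "image discovery"]),
    "Jenkins": ("DEGRADED", ["build reachability ingest", "attestations"]),
    "Vault": ("OK", ["env input materialization"]),
    "Consul": ("OK", ["env config bindings"]),
    "NVD Source": ("DISCONNECTED", ["CVE freshness"]),
}


def broken_pipelines(connectors: dict[str, tuple[str, list[str]]]) -> dict[str, list[str]]:
    """Map each affected pipeline to the unhealthy connectors it depends on."""
    result: dict[str, list[str]] = {}
    for connector, (status, pipelines) in connectors.items():
        if status == "OK":
            continue
        for pipeline in pipelines:
            result.setdefault(pipeline, []).append(connector)
    return result


for pipeline, causes in broken_pipelines(CONNECTORS).items():
    print(f"{pipeline}: blocked by {', '.join(causes)}")
```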
---
# 15.10 Screen — DLQ & Replays (data pipelines stuck)
### Previously
* **Operations → Dead Letter** existed, but not clearly integrated into “why approvals are unsafe.”
### Why changed like this
This screen becomes the “last mile” of data integrity:
* When pipelines fail, DLQ grows.
* DLQ items correspond to missing SBOM updates, missing reachability evidence, failed evidence sealing.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[DLQ & Replays] --> B[DLQ buckets by pipeline]
A --> C[Item detail + payload]
A --> D[Replay item]
A --> E[Open Job Run Detail]
A --> F[Open Integration Detail]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ DLQ & REPLAYS │
│ Legacy: Operations ▸ Dead Letter (queue view without release impact context) │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Buckets (24h) │
│ reachability-runtime-ingest: 1,230 (agent degraded) │
│ sbom-nightly-rescan: 340 (registry auth timeout) │
│ evidence-seal-bundles: 12 (transparency log unreachable) │
│ │
│ Select bucket → items │
│ item-7781 payload: runtime-trace batch#991 age: 2h action: [Replay] [View] [Link job] │
│ item-7782 payload: runtime-trace batch#992 age: 2h action: [Replay] [View] [Link job] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
---
# 15.11 Screen — Data Quality SLOs (data-SLO slice, links to System SLO Monitoring)
### Previously
* **Settings/System → SLO Monitoring** (or System root in the redesigned IA)
### Why changed like this
Keep SLO engine canonical under **System**, but provide a “data integrity slice” here so operators see:
* feed freshness SLO
* SBOM staleness SLO
* runtime coverage SLO
…with deep links to the full SLO view.
### Screen graph (Mermaid)
```mermaid
flowchart TD
A[Data Quality SLOs] --> B["System SLO Monitoring (canonical)"]
A --> C[Show SLO breaches that impact approvals]
A --> D[Open impacted approvals/releases]
```
### ASCII mock
```text
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ DATA QUALITY SLOs │
│ Legacy: System ▸ SLO Monitoring (not scoped to data integrity) │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ SLO Target Current Status Approval impact │
│----------------------------------------------------------------------------------------------- │
│ CVE feed freshness (NVD/OSV) <2h 3h WARN gate may warn/fail │
│ SBOM staleness (prod envs) <24h 12 stale FAIL blocks prod promotions │
│ Runtime reach coverage (prod) >50% 35% WARN reduces confidence │
│ │
│ Links: [Open System SLO Monitoring] [Open impacted approvals] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
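
Each SLO row is a target, a current value, and a direction (lower is better for freshness, higher is better for coverage). A minimal sketch of the breach check for the two rows whose values are directly comparable; the SBOM staleness row is left out because the mock reports a count of stale items rather than an age. `Slo` and `breached` are hypothetical names, and how a breach maps to WARN versus FAIL is not defined in this pack.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Slo:
    name: str
    meets_target: Callable[[float, float], bool]  # comparator(current, target)
    target: float
    current: float

    def breached(self) -> bool:
        """True when the current value does not satisfy the target."""
        return not self.meets_target(self.current, self.target)


# Targets and current values copied from the mock.
slos = [
    Slo("CVE feed freshness (hours)", operator.lt, target=2, current=3),
    Slo("Runtime reach coverage (%)", operator.gt, target=50, current=35),
]

breaches = [slo.name for slo in slos if slo.breached()]
print(breaches)
```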
---
## What this pack changes on other screens (without duplicating them)
These are *UI hooks* (badges/banners/cards) driven by Data Integrity:
* **Dashboard**: “Nightly Ops Signals” card points to **Ops → Data Integrity Overview**.
* **Releases list**: “Data Health” column/badge links to **Data Integrity Overview** filtered to the region.
* **Approvals**: “Ops/Data” tab links to **Data Integrity Overview** + the exact failing job/feed/DLQ bucket.
* **Security Overview**: shows “feeds fresh / stale” and “SBOM freshness” badges, with a link to Data Integrity.
---
If you want the next pack: **Pack 16** can update the **Dashboard** mock explicitly to add the “Nightly Ops Signals” card and the **SBOM + reachable criticals by environment** summary you requested earlier, wired directly into the Data Integrity + Security + Env Detail pages.