master c2f13fe588 preparation for ui re-shelling

2026-02-18 23:03:07 +02:00

Pack 15 — Operations: Data Integrity (Nightly Ops Report + Data Freshness unified; bubbles into Dashboard / Releases / Approvals)

This pack creates a single Operations → Data Integrity area that answers: “Can we trust today’s SBOM/CVE/reachability data to approve and promote? If not, what’s broken, where, and what decisions are impacted?”

It does not duplicate existing specialized pages (Scheduler/Orchestrator/Integrations/Feeds). It summarizes + links to them.

15.1 Operations menu graph (Mermaid) — Data Integrity added as a first‑class Ops area

flowchart TD
  OPS[Operations] --> OPS_DI[Data Integrity]
  OPS --> OPS_PH[Platform Health]
  OPS --> OPS_ORCH[Orchestrator]
  OPS --> OPS_SCHED[Scheduler]
  OPS --> OPS_DLQ[Dead Letter]
  OPS --> OPS_QUOTA[Quotas]
  OPS --> OPS_EXPORT[Export]

  OPS_DI --> DI_OV[Overview]
  OPS_DI --> DI_NIGHT[Nightly Ops Report]
  OPS_DI --> DI_FEEDS[Feeds Freshness]
  OPS_DI --> DI_SCAN[Scan Pipeline Health]
  OPS_DI --> DI_REACH[Reachability Ingest Health]
  OPS_DI --> DI_INTEG[Integration Connectivity]
  OPS_DI --> DI_DLQ[DLQ & Replays]
  OPS_DI --> DI_SLO[Data Quality SLOs]

Design intent: “Data Integrity” is the operator console for freshness + pipeline status that directly affects approvals/promotions.

15.2 Bubble‑up graph (Mermaid) — how Data Integrity signals surface elsewhere (no duplication)

flowchart LR
  DI["Ops: Data Integrity\n(single source of truth for data health)"] --> DASH[Dashboard\nNightly Ops Signals card]
  DI --> REL[Releases List\nData Health column + banner]
  DI --> APR[Approvals\nOps/Data tab + warnings]
  DI --> SEC[Security Overview\nFeed freshness + scan freshness badges]
  DI --> ENV[Env Detail\nSBOM freshness + runtime coverage]

  DI --> INT[Integrations Hub\nconnector config & tests]
  DI --> FEED[Feeds & AirGap Ops\nmirrors/locks/airgap artifacts]
  DI --> SCHED[Scheduler Runs]
  DI --> ORCH[Orchestrator Jobs]
  DI --> DLQ[Dead Letter]
  DI --> PH[Platform Health]

15.3 Screen — Data Integrity Overview

Previously (where it lived)

There was no single overview.
Equivalent fragments existed in:
- Nightly Ops Report (your new screen request),
- Operations → Feeds (freshness),
- Settings → Integrations (connectivity),
- Settings → System → Background Jobs (job failures),
- Operations → Dead Letter (queue stuck),
- plus scattered banners on approvals.

Why changed like this

You need one authoritative place to see:

- SBOM scan / rescan status
- CVE feed sync freshness
- Integration connectivity
- Reachability ingest health (build / image / runtime)
- Which approvals/releases are currently “unsafe to approve” because data is stale

Screen graph (Mermaid)

flowchart TD
  A[Data Integrity Overview] --> B[Nightly Ops Report]
  A --> C[Feeds Freshness]
  A --> D[Scan Pipeline Health]
  A --> E[Reachability Ingest Health]
  A --> F[Integration Connectivity]
  A --> G[DLQ & Replays]
  A --> H[Platform Health]
  A --> I["Impacted Decisions\n(approvals/releases)"]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ OVERVIEW                                                         │
│ Legacy: N/A (new). Previously: Ops Feeds + Settings System Jobs + Integrations + DLQ scattered │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Scope: Region ▾ (All)   Environment Type ▾ (All)   Window ▾ (24h)                               │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ DATA TRUST SCORE (for approvals/promotions)                                                     │
│  Feeds Freshness:  WARN (NVD stale 3h)     SBOM Pipeline: FAIL (rescan job failing)             │
│  Reachability Ingest: WARN (runtime coverage 35%)   Integrations: DEGRADED (Jenkins)            │
│  DLQ: WARN (reachability events queued: 1,230)                                                  │
│ Links: [Nightly Ops Report] [Feeds Freshness] [Integrations] [DLQ]                              │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ IMPACTED DECISIONS                                                                             │
│  Approvals blocked due to data issues: 2                                                        │
│   - Platform Release 1.3.0-rc1 → EU-West/eu-prod  (SBOM incomplete + NVD stale)  [Open]         │
│  Promotions running with WARN confidence: 1  [Open Releases filtered]                           │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ TOP FAILURES (what to fix first)                                                                │
│  1) Nightly SBOM rescan FAILED (registry auth timeout) → stale SBOM on 12 component versions   │
│  2) NVD feed stale 3h → CVE freshness gate WARN/FAIL depending on baseline                      │
│  3) Runtime reachability ingest lagging (agent-apac-01 degraded) → runtime coverage 35%         │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
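
The trust summary in the mock can be read as a worst-of rollup over the individual signals. The following is a minimal sketch under that assumption; the `Status` ordering (in particular, placing DEGRADED between WARN and FAIL), the `rollup` helper, and the signal keys are hypothetical and not defined by this pack.

```python
from enum import IntEnum


class Status(IntEnum):
    # Assumed severity ordering; the pack does not define one.
    OK = 0
    WARN = 1
    DEGRADED = 2
    FAIL = 3


def rollup(signals: dict[str, Status]) -> Status:
    """Return the worst status across all signals (OK when there are none)."""
    return max(signals.values(), default=Status.OK)


# Values taken from the Overview mock above.
signals = {
    "feeds_freshness": Status.WARN,
    "sbom_pipeline": Status.FAIL,
    "reachability_ingest": Status.WARN,
    "integrations": Status.DEGRADED,
    "dlq": Status.WARN,
}
overall = rollup(signals)
```

With the mock's values, a single FAIL (the SBOM pipeline) is enough to make the overall status FAIL, which is the behaviour an approval gate would likely want.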

15.4 Screen — Nightly Ops Report (Jobs + causes + impact)

Previously (where it lived)

You asked for “some report about nightly jobs status” (new requirement).
Related fragments existed in:
- Settings → System → Background Jobs
- Operations → Scheduler (runs)
- Operations → Orchestrator (job execution)
- plus manual checks in logs

Why changed like this

Nightly Ops Report becomes the release‑impact view of jobs:

- not just “job failed”,
- but which release governance capability is now untrustworthy (feeds/scans/reachability/evidence).

Screen graph (Mermaid)

flowchart TD
  A[Nightly Ops Report] --> B[Job Run Detail]
  A --> C[Scheduler Runs]
  A --> D[Orchestrator]
  A --> E[DLQ & Replays]
  A --> F[Integrations Detail]
  A --> G[Impacted Bundles/Envs]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ NIGHTLY OPS REPORT                                                │
│ Legacy: Settings ▸ System ▸ Background Jobs + Ops Scheduler/Orchestrator (no release context)   │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Window: Last 24h   Region: All                                                                  │
│ Summary:  4 jobs OK   2 WARN   1 FAIL                                                           │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Job                          Schedule   Last Run   Status   Why it matters (release impact)     │
│-------------------------------------------------------------------------------------------------│
│ cve-sync-osv                  02:00     02:01      OK       vulnerability data freshness        │
│ cve-sync-nvd                  02:05     02:05      WARN     NVD stale → gating confidence drops │
│ sbom-ingest-registry          02:10     02:10      OK       new images get SBOM                 │
│ sbom-nightly-rescan           02:20     02:21      FAIL     stale SBOM → approvals may block    │
│ reachability-ingest-image      02:30     02:31      OK       image reachability evidence         │
│ reachability-ingest-runtime    02:35     02:36      WARN     runtime reach coverage degraded    │
│ evidence-seal-bundles          02:45     02:46      OK       audit pack completion              │
│-------------------------------------------------------------------------------------------------│
│ Row actions: [View Run] [Open Scheduler] [Open Orchestrator] [Open Integration] [Open DLQ]       │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
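
One way to derive the summary line and the "fix first" ordering from the job rows is sketched below. It counts only the seven rows shown in the mock; the `JobRun` type, the `SEVERITY` ranking, and the helper names are assumptions for illustration, not an existing API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    name: str
    status: str  # "OK", "WARN", or "FAIL"
    impact: str  # release-impact text shown in the last column


# Rows copied from the Nightly Ops Report mock.
RUNS = [
    JobRun("cve-sync-osv", "OK", "vulnerability data freshness"),
    JobRun("cve-sync-nvd", "WARN", "NVD stale, gating confidence drops"),
    JobRun("sbom-ingest-registry", "OK", "new images get SBOM"),
    JobRun("sbom-nightly-rescan", "FAIL", "stale SBOM, approvals may block"),
    JobRun("reachability-ingest-image", "OK", "image reachability evidence"),
    JobRun("reachability-ingest-runtime", "WARN", "runtime reach coverage degraded"),
    JobRun("evidence-seal-bundles", "OK", "audit pack completion"),
]

SEVERITY = {"OK": 0, "WARN": 1, "FAIL": 2}  # hypothetical ranking


def summarize(runs: list[JobRun]) -> Counter:
    """Count runs per status for the summary line."""
    return Counter(run.status for run in runs)


def needs_attention(runs: list[JobRun]) -> list[JobRun]:
    """Non-OK runs, most severe first (stable within a severity)."""
    bad = [run for run in runs if run.status != "OK"]
    return sorted(bad, key=lambda run: SEVERITY[run.status], reverse=True)
```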

15.5 Screen — Job Run Detail (root cause + affected assets)

Previously

Scheduler/Orchestrator showed raw execution, but did not map it to:
- “affected environments”
- “affected bundles”
- “approvals degraded”

Why changed like this

This is the investigation page that bridges Ops mechanics to release decisions.

Screen graph (Mermaid)

flowchart TD
  A[Job Run Detail] --> B[Logs & traces]
  A --> C["Failed items list\n(images/components/envs)"]
  A --> D[Open DLQ bucket]
  A --> E[Open Integration Detail]
  A --> F[Show impacted approvals/releases]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ Job Run Detail: sbom-nightly-rescan (Run #8841)                                                 │
│ Legacy: Scheduler/Orchestrator run detail (without release impact mapping)                      │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Status: FAIL   Started: 02:21   Ended: 02:24   Error: registry auth timeout                     │
│ Integration: Harbor Registry (token expired)  → [Open Integration Detail]                        │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Affected items                                                                                  │
│  - 12 images not rescanned (SBOM freshness > 24h)                                               │
│  - 3 bundle versions impacted (approvals may block)                                             │
│  - Regions impacted: EU-West, US-East                                                           │
│ Links: [Open impacted approvals] [Open bundles] [Open DLQ bucket] [Open logs]                   │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
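
The "affected items" mapping can be sketched as a set intersection between the images a run failed to process and the images each bundle version contains. The function name, the bundle identifiers, and the image tags below are hypothetical; the real data model is not specified in this pack.

```python
def impacted_bundles(
    stale_images: set[str],
    bundle_images: dict[str, set[str]],
) -> list[str]:
    """Return bundle versions that contain at least one stale image."""
    return sorted(
        bundle
        for bundle, images in bundle_images.items()
        if images & stale_images
    )


# Hypothetical example data.
bundle_images = {
    "platform-release-1.3.0-rc1": {"api:1.3.0", "worker:1.3.0"},
    "billing-2.0.1": {"billing:2.0.1"},
}
stale = {"worker:1.3.0"}
impacted = impacted_bundles(stale, bundle_images)
```

The same intersection, applied one level further (bundle version to pending approvals), would give the "Open impacted approvals" link its filter.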

15.6 Screen — Feeds Freshness (operator view, but tied to gating)

Previously (where it lived)

Operations → Feeds (“Feed Mirror & AirGap Operations” → Sources & Freshness)
Also partially visible as “feeds” cards under Integrations.

Why changed like this

Feeds Freshness becomes a Data Integrity subpage because it primarily answers:

- “Can we trust vulnerability data for today’s approvals?”

It still links to Feeds & AirGap Ops for mirrors/locks (no duplication).

Screen graph (Mermaid)

flowchart TD
  A[Feeds Freshness] --> B[Feeds & AirGap Ops: Sources]
  A --> C[Version Locks]
  A --> D[Mirror Detail]
  A --> E["Impacted approvals\n(CVE freshness gate)"]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ FEEDS FRESHNESS                                                   │
│ Legacy: Operations ▸ Feeds ▸ Sources & Freshness (and partial cards in Integrations)            │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Region: EU-West   SLA profile: Prod (fresh < 2h)                                                │
│                                                                                                 │
│ Source     Status     Last Sync   SLA   Resulting gate impact                                    │
│----------------------------------------------------------------------------------------------- │
│ OSV        OK         20m ago     6h    OK                                                       │
│ NVD        WARN       3h ago      2h    approvals may WARN/FAIL depending on baseline           │
│ CISA KEV   OK         3h ago      24h   OK                                                       │
│                                                                                                 │
│ Actions: [Open Feeds & AirGap Ops] [Apply Version Lock] [Retry NVD Sync]                         │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
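
The status column in this mock follows from comparing each source's age against its own SLA. A minimal sketch, assuming a breach maps to WARN (the pack does not say when a stale feed escalates to FAIL); `feed_status` and the `FEEDS` tuple layout are illustrative names.

```python
from datetime import timedelta

# (source, time since last sync, SLA) as shown in the mock.
FEEDS = [
    ("OSV", timedelta(minutes=20), timedelta(hours=6)),
    ("NVD", timedelta(hours=3), timedelta(hours=2)),
    ("CISA KEV", timedelta(hours=3), timedelta(hours=24)),
]


def feed_status(age: timedelta, sla: timedelta) -> str:
    """A feed is OK while its age is within the SLA, otherwise WARN."""
    return "OK" if age <= sla else "WARN"


report = {name: feed_status(age, sla) for name, age, sla in FEEDS}
```

Note that CISA KEV and NVD have the same age (3h) but different outcomes, because the SLA is per source.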

15.7 Screen — Scan Pipeline Health (SBOM ingest + rescan + vulnerability match)

Previously

SBOM status was scattered across:
- Security views (findings)
- Jobs views (background jobs)
- Registry integration

There was no single “pipeline health” page to explain staleness.

Why changed like this

You explicitly require:

- “nightly SBOM re‑scan issues”
- “CVE source not synced”

This page shows the pipeline chain end‑to‑end and where it’s breaking.

Screen graph (Mermaid)

flowchart TD
  A[Scan Pipeline Health] --> B[SBOM ingest status]
  A --> C[SBOM rescan status]
  A --> D[CVE match status]
  A --> E[Open Nightly Ops Report]
  A --> F[Open Integrations]
  A --> G[Open Security findings impact]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ SCAN PIPELINE HEALTH                                              │
│ Legacy: implied across Security + System Jobs + Registry integration                             │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Pipeline stages (last 24h)                                                                      │
│  1) Image discovery (registry)        OK      new images: 48                                    │
│  2) SBOM generation/ingest            OK      sboms produced: 47  pending: 1                    │
│  3) Nightly SBOM rescan               FAIL    12 images stale > 24h                             │
│  4) CVE feeds sync                    WARN    NVD stale 3h                                      │
│  5) CVE ↔ SBOM match/update           WARN    results may be incomplete                          │
│                                                                                                 │
│ Impact summary                                                                                  │
│  - Environments with “unknown SBOM freshness”: 2 (EU-West prod, APAC uat)                       │
│  - Approvals blocked due to missing SBOM: 1                                                     │
│ Links: [Nightly Ops Report] [Feeds Freshness] [Integrations] [Security Findings]                │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
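
Stage 5 in the mock is WARN even though nothing is reported as failing in the match step itself, which suggests status propagates from upstream stages. The sketch below illustrates one possible rule: a stage that is healthy on its own is downgraded to WARN when anything it depends on is not OK. The stage keys, the dependency edges, and the rule itself are assumptions.

```python
# stage -> (own status, upstream stages); edges are hypothetical.
STAGES = {
    "image_discovery": ("OK", []),
    "sbom_ingest": ("OK", ["image_discovery"]),
    "sbom_rescan": ("FAIL", ["sbom_ingest"]),
    "cve_feed_sync": ("WARN", []),
    "cve_sbom_match": ("OK", ["sbom_rescan", "cve_feed_sync"]),
}


def effective_status(name: str, stages: dict[str, tuple[str, list[str]]]) -> str:
    """Own status if not OK; otherwise WARN when any upstream stage is unhealthy."""
    own, upstream = stages[name]
    if own != "OK":
        return own
    if any(effective_status(dep, stages) != "OK" for dep in upstream):
        return "WARN"
    return "OK"
```

Under these assumptions `effective_status("cve_sbom_match", STAGES)` evaluates to `"WARN"`, matching the "results may be incomplete" row.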

15.8 Screen — Reachability Ingest Health (Build / Image / Runtime)

Previously

Reachability was referenced in approvals/security, but ingestion health wasn’t first-class.
Runtime evidence depended on agent telemetry; failures were seen indirectly.

Why changed like this

You require hybrid reachability evidence from:

- Dover image
- build
- running environment

This screen makes it operationally visible when one source is missing so reachability confidence is downgraded.

Screen graph (Mermaid)

flowchart TD
  A[Reachability Ingest Health] --> B[Image/Dover ingest]
  A --> C[Build ingest]
  A --> D[Runtime ingest]
  A --> E[Agents health]
  A --> F[DLQ bucket]
  A --> G[Impact: approvals using reachability gate]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ REACHABILITY INGEST HEALTH                                        │
│ Legacy: implicit (Approvals/Security reachability columns) + Agent health elsewhere             │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Coverage (last 24h)                                                                             │
│  Image/Dover: 100% (OK)   Build: 78% (WARN)   Runtime: 35% (WARN)                               │
│                                                                                                 │
│ Pipelines                                                                                       │
│  Image/Dover ingest: OK   last batch: 02:31   backlog: 0                                        │
│  Build ingest:       WARN last batch: 01:10   backlog: 220 (CI degraded)                        │
│  Runtime ingest:     WARN last batch: 00:55   backlog: 1,230 (agent-apac-01 degraded)           │
│                                                                                                 │
│ Links: [Open Agents] [Open DLQ bucket] [Open impacted approvals]                                │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

15.9 Screen — Integration Connectivity (data‑plane dependencies)

Previously

Settings → Integrations (hub)
But release operators need a data-integrity lens: “which pipeline is broken because which connector is down?”

Why changed like this

This view is the “dependency slice” of Integrations:

- still links to the canonical Integrations Hub for configuration,
- but shows pipeline impact directly (feeds/scans/reachability/evidence).

Screen graph (Mermaid)

flowchart TD
  A[Integration Connectivity] --> B[Integrations Hub]
  A --> C[Open Integration Detail]
  A --> D[Show dependent jobs]
  A --> E[Show impacted approvals/releases]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ INTEGRATION CONNECTIVITY                                          │
│ Legacy: Settings ▸ Integrations (no “pipeline impact” slice)                                    │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Connector        Status        Dependent pipelines                      Impact                  │
│-----------------------------------------------------------------------------------------------  │
│ Harbor Registry  WARN          SBOM rescan, image discovery             rescan failing          │
│ Jenkins          DEGRADED      build reachability ingest, attestations  build coverage down     │
│ Vault            OK            env input materialization                none                    │
│ Consul           OK            env config bindings                      none                    │
│ NVD Source       DISCONNECTED  CVE freshness                            approvals warn/block    │
│                                                                                                 │
│ Actions per row: [Open Detail] [Test] [View dependent jobs] [View impacted approvals]           │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
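
The question this screen answers ("which pipeline is broken because which connector is down?") is an inverted index over the connector table. A minimal sketch using the rows from the mock; the `CONNECTORS` structure and `broken_pipelines` helper are hypothetical names.

```python
# connector -> (status, dependent pipelines), taken from the mock.
CONNECTORS = {
    "Harbor Registry": ("WARN", ["SBOM rescan", "image discovery"]),
    "Jenkins": ("DEGRADED", ["build reachability ingest", "attestations"]),
    "Vault": ("OK", ["env input materialization"]),
    "Consul": ("OK", ["env config bindings"]),
    "NVD Source": ("DISCONNECTED", ["CVE freshness"]),
}


def broken_pipelines(
    connectors: dict[str, tuple[str, list[str]]],
) -> dict[str, list[str]]:
    """Map each affected pipeline to the unhealthy connectors it depends on."""
    affected: dict[str, list[str]] = {}
    for connector, (status, pipelines) in connectors.items():
        if status == "OK":
            continue
        for pipeline in pipelines:
            affected.setdefault(pipeline, []).append(connector)
    return affected


affected = broken_pipelines(CONNECTORS)
```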

15.10 Screen — DLQ & Replays (data pipelines stuck)

Previously

Operations → Dead Letter existed, but not clearly integrated into “why approvals are unsafe.”

Why changed like this

This screen becomes the “last mile” of data integrity:

- When pipelines fail, DLQ grows.
- DLQ items correspond to missing SBOM updates, missing reachability evidence, failed evidence sealing.

Screen graph (Mermaid)

flowchart TD
  A[DLQ & Replays] --> B[DLQ buckets by pipeline]
  A --> C[Item detail + payload]
  A --> D[Replay item]
  A --> E[Open Job Run Detail]
  A --> F[Open Integration Detail]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ DLQ & REPLAYS                                                     │
│ Legacy: Operations ▸ Dead Letter (queue view without release impact context)                    │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Buckets (24h)                                                                                   │
│  reachability-runtime-ingest: 1,230  (agent degraded)                                            │
│  sbom-nightly-rescan:         340    (registry auth timeout)                                     │
│  evidence-seal-bundles:        12    (transparency log unreachable)                              │
│                                                                                                 │
│ Select bucket → items                                                                            │
│  item-7781  payload: runtime-trace batch#991  age: 2h  action: [Replay] [View] [Link job]       │
│  item-7782  payload: runtime-trace batch#992  age: 2h  action: [Replay] [View] [Link job]       │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
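
The [Replay] action can be thought of as re-running a pipeline handler over the items of one bucket and keeping only those that fail again. The following sketch assumes an in-memory list and a caller-supplied handler; `DlqItem`, `replay`, and item `item-9001` are hypothetical and do not describe the actual Dead Letter implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DlqItem:
    item_id: str
    bucket: str
    payload: str


def replay(
    items: list[DlqItem],
    bucket: str,
    handler: Callable[[DlqItem], None],
) -> list[DlqItem]:
    """Re-run handler on every item in bucket; return items still dead-lettered."""
    remaining: list[DlqItem] = []
    for item in items:
        if item.bucket != bucket:
            remaining.append(item)  # other buckets are left untouched
            continue
        try:
            handler(item)
        except Exception:
            remaining.append(item)  # replay failed, keep the item queued
    return remaining


ITEMS = [
    DlqItem("item-7781", "reachability-runtime-ingest", "runtime-trace batch#991"),
    DlqItem("item-7782", "reachability-runtime-ingest", "runtime-trace batch#992"),
    DlqItem("item-9001", "sbom-nightly-rescan", "image digest"),
]
```

A real replay would also need idempotent handlers, since an item may have been partially processed before it was dead-lettered.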

15.11 Screen — Data Quality SLOs (data‑SLO slice, links to System SLO Monitoring)

Previously

Settings/System → SLO Monitoring (or System root in the redesigned IA)

Why changed like this

Keep SLO engine canonical under System, but provide a “data integrity slice” here so operators see:

- feed freshness SLO
- SBOM staleness SLO
- runtime coverage SLO

…with deep links to the full SLO view.

Screen graph (Mermaid)

flowchart TD
  A[Data Quality SLOs] --> B["System SLO Monitoring (canonical)"]
  A --> C[Show SLO breaches that impact approvals]
  A --> D[Open impacted approvals/releases]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ DATA QUALITY SLOs                                                 │
│ Legacy: System ▸ SLO Monitoring (not scoped to data integrity)                                  │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ SLO                                Target     Current     Status     Approval impact            │
│----------------------------------------------------------------------------------------------- │
│ CVE feed freshness (NVD/OSV)        <2h        3h          WARN       gate may warn/fail         │
│ SBOM staleness (prod envs)          <24h       12 stale    FAIL       blocks prod promotions     │
│ Runtime reach coverage (prod)       >50%       35%         WARN       reduces confidence         │
│                                                                                                 │
│ Links: [Open System SLO Monitoring] [Open impacted approvals]                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
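
Each row in the SLO table pairs a comparison with a breach severity (a missed freshness target warns, a missed SBOM staleness target fails). A minimal sketch of that evaluation follows; the `DataSlo` type is hypothetical, and modelling SBOM staleness as "number of images older than 24h, target 0" is an interpretation of the mock rather than something the pack states.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DataSlo:
    name: str
    current: float
    target: float
    meets: Callable[[float, float], bool]  # called as meets(current, target)
    on_breach: str  # status reported when the target is missed

    def status(self) -> str:
        return "OK" if self.meets(self.current, self.target) else self.on_breach


SLOS = [
    DataSlo("CVE feed freshness (hours since sync)", 3, 2, operator.lt, "WARN"),
    DataSlo("SBOM staleness (images older than 24h)", 12, 0, operator.le, "FAIL"),
    DataSlo("Runtime reach coverage (%)", 35, 50, operator.gt, "WARN"),
]
statuses = [slo.status() for slo in SLOS]
```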

What this pack changes on other screens (without duplicating them)

These are UI hooks (badges/banners/cards) driven by Data Integrity:

- Dashboard: “Nightly Ops Signals” card points to Ops → Data Integrity Overview.
- Releases list: “Data Health” column/badge links to Data Integrity Overview filtered to the region.
- Approvals: “Ops/Data” tab links to Data Integrity Overview + the exact failing job/feed/DLQ bucket.
- Security Overview: shows “feeds fresh / stale” and “SBOM freshness” badges, with a link to Data Integrity.

If you want the next pack: Pack 16 can update the Dashboard mock explicitly to add the “Nightly Ops Signals” card and the SBOM + reachable criticals by environment summary you requested earlier, wired directly into the Data Integrity + Security + Env Detail pages.
