git.stella-ops.org/docs/modules/ui/v2-rewire/pack-15.md
2026-02-18 23:03:07 +02:00


Pack 15 — Operations: Data Integrity (Nightly Ops Report + Data Freshness unified; bubbles into Dashboard / Releases / Approvals)

This pack creates a single Operations → Data Integrity area that answers: “Can we trust today’s SBOM/CVE/reachability data to approve and promote? If not, what’s broken, where, and what decisions are impacted?”

It does not duplicate existing specialized pages (Scheduler/Orchestrator/Integrations/Feeds). It summarizes + links to them.


15.1 Operations menu graph (Mermaid) — Data Integrity added as a first-class Ops area

flowchart TD
  OPS[Operations] --> OPS_DI[Data Integrity]
  OPS --> OPS_PH[Platform Health]
  OPS --> OPS_ORCH[Orchestrator]
  OPS --> OPS_SCHED[Scheduler]
  OPS --> OPS_DLQ[Dead Letter]
  OPS --> OPS_QUOTA[Quotas]
  OPS --> OPS_EXPORT[Export]

  OPS_DI --> DI_OV[Overview]
  OPS_DI --> DI_NIGHT[Nightly Ops Report]
  OPS_DI --> DI_FEEDS[Feeds Freshness]
  OPS_DI --> DI_SCAN[Scan Pipeline Health]
  OPS_DI --> DI_REACH[Reachability Ingest Health]
  OPS_DI --> DI_INTEG[Integration Connectivity]
  OPS_DI --> DI_DLQ[DLQ & Replays]
  OPS_DI --> DI_SLO[Data Quality SLOs]

Design intent: “Data Integrity” is the operator console for freshness + pipeline status that directly affects approvals/promotions.


15.2 Bubble-up graph (Mermaid) — how Data Integrity signals surface elsewhere (no duplication)

flowchart LR
  DI["Ops: Data Integrity\n(single source of truth for data health)"] --> DASH[Dashboard\nNightly Ops Signals card]
  DI --> REL[Releases List\nData Health column + banner]
  DI --> APR[Approvals\nOps/Data tab + warnings]
  DI --> SEC[Security Overview\nFeed freshness + scan freshness badges]
  DI --> ENV[Env Detail\nSBOM freshness + runtime coverage]

  DI --> INT[Integrations Hub\nconnector config & tests]
  DI --> FEED[Feeds & AirGap Ops\nmirrors/locks/airgap artifacts]
  DI --> SCHED[Scheduler Runs]
  DI --> ORCH[Orchestrator Jobs]
  DI --> DLQ[Dead Letter]
  DI --> PH[Platform Health]

15.3 Screen — Data Integrity Overview

Previously (where it lived)

  • There was no single overview.

  • Equivalent fragments existed in:

    • Nightly Ops Report (your new screen request),
    • Operations → Feeds (freshness),
    • Settings → Integrations (connectivity),
    • Settings → System → Background Jobs (job failures),
    • Operations → Dead Letter (queue stuck),
    • plus scattered banners on approvals.

Why changed like this

You need one authoritative place to see:

  • SBOM scan / rescan status
  • CVE feed sync freshness
  • Integration connectivity
  • Reachability ingest health (build / image / runtime)
  • Which approvals/releases are currently “unsafe to approve” because data is stale

Screen graph (Mermaid)

flowchart TD
  A[Data Integrity Overview] --> B[Nightly Ops Report]
  A --> C[Feeds Freshness]
  A --> D[Scan Pipeline Health]
  A --> E[Reachability Ingest Health]
  A --> F[Integration Connectivity]
  A --> G[DLQ & Replays]
  A --> H[Platform Health]
  A --> I["Impacted Decisions\n(approvals/releases)"]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ OVERVIEW                                                         │
│ Legacy: N/A (new). Previously: Ops Feeds + Settings System Jobs + Integrations + DLQ scattered │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Scope: Region ▾ (All)   Environment Type ▾ (All)   Window ▾ (24h)                               │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ DATA TRUST SCORE (for approvals/promotions)                                                     │
│  Feeds Freshness:  WARN (NVD stale 3h)     SBOM Pipeline: FAIL (rescan job failing)             │
│  Reachability Ingest: WARN (runtime coverage 35%)   Integrations: DEGRADED (Jenkins)            │
│  DLQ: WARN (reachability events queued: 1,230)                                                  │
│ Links: [Nightly Ops Report] [Feeds Freshness] [Integrations] [DLQ]                              │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ IMPACTED DECISIONS                                                                             │
│  Approvals blocked due to data issues: 2                                                        │
│   - Platform Release 1.3.0-rc1 → EU-West/eu-prod  (SBOM incomplete + NVD stale)  [Open]         │
│  Promotions running with WARN confidence: 1  [Open Releases filtered]                           │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ TOP FAILURES (what to fix first)                                                                │
│  1) Nightly SBOM rescan FAILED (registry auth timeout) → stale SBOM on 12 component versions   │
│  2) NVD feed stale 3h → CVE freshness gate WARN/FAIL depending on baseline                      │
│  3) Runtime reachability ingest lagging (agent-apac-01 degraded) → runtime coverage 35%         │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
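
The roll-up behind the Data Trust Score and the "what to fix first" ordering could be sketched roughly as follows. This is only an illustration: the `Signal` type, the `rollUp` and `topFailures` helpers, and the severity order (OK < WARN < DEGRADED < FAIL) are assumptions, not something the pack defines. The sample values are copied from the mock above.

```typescript
// Hypothetical model: type and field names are illustrative, not taken from the pack.
type Status = "OK" | "WARN" | "DEGRADED" | "FAIL";

interface Signal {
  name: string;
  status: Status;
  detail: string;
}

// Assumed severity order: OK < WARN < DEGRADED < FAIL.
const SEVERITY: Record<Status, number> = { OK: 0, WARN: 1, DEGRADED: 2, FAIL: 3 };

// The overall trust level is the worst status among the signals (OK if there are none).
function rollUp(signals: Signal[]): Status {
  return signals.reduce<Status>(
    (worst, s) => (SEVERITY[s.status] > SEVERITY[worst] ? s.status : worst),
    "OK"
  );
}

// Non-OK signals, worst first, as a candidate ordering for "what to fix first".
function topFailures(signals: Signal[]): Signal[] {
  return signals
    .filter((s) => s.status !== "OK")
    .sort((a, b) => SEVERITY[b.status] - SEVERITY[a.status]);
}

// Values copied from the Overview mock.
const overview: Signal[] = [
  { name: "Feeds Freshness", status: "WARN", detail: "NVD stale 3h" },
  { name: "SBOM Pipeline", status: "FAIL", detail: "rescan job failing" },
  { name: "Reachability Ingest", status: "WARN", detail: "runtime coverage 35%" },
  { name: "Integrations", status: "DEGRADED", detail: "Jenkins" },
  { name: "DLQ", status: "WARN", detail: "reachability events queued: 1,230" },
];

console.log(rollUp(overview));
console.log(topFailures(overview).map((s) => s.name));
```

A worst-of roll-up keeps the overview conservative: a single failing pipeline is enough to mark the data as untrustworthy for approvals, which matches the intent of the screen.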

15.4 Screen — Nightly Ops Report (Jobs + causes + impact)

Previously (where it lived)

  • You asked for “some report about nightly jobs status” (new requirement).

  • Related fragments existed in:

    • Settings → System → Background Jobs
    • Operations → Scheduler (runs)
    • Operations → Orchestrator (job execution)
    • plus manual checks in logs

Why changed like this

Nightly Ops Report becomes the release-impact view of jobs:

  • not just “job failed”
  • but what release governance capability is now untrustworthy (feeds/scans/reachability/evidence).

Screen graph (Mermaid)

flowchart TD
  A[Nightly Ops Report] --> B[Job Run Detail]
  A --> C[Scheduler Runs]
  A --> D[Orchestrator]
  A --> E[DLQ & Replays]
  A --> F[Integrations Detail]
  A --> G[Impacted Bundles/Envs]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ NIGHTLY OPS REPORT                                                │
│ Legacy: Settings ▸ System ▸ Background Jobs + Ops Scheduler/Orchestrator (no release context)   │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Window: Last 24h   Region: All                                                                  │
│ Summary:  4 jobs OK   2 WARN   1 FAIL                                                           │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Job                          Schedule   Last Run   Status   Why it matters (release impact)     │
│-------------------------------------------------------------------------------------------------│
│ cve-sync-osv                  02:00     02:01      OK       vulnerability data freshness        │
│ cve-sync-nvd                  02:05     02:05      WARN     NVD stale → gating confidence drops │
│ sbom-ingest-registry          02:10     02:10      OK       new images get SBOM                 │
│ sbom-nightly-rescan           02:20     02:21      FAIL     stale SBOM → approvals may block    │
│ reachability-ingest-image      02:30     02:31      OK       image reachability evidence         │
│ reachability-ingest-runtime    02:35     02:36      WARN     runtime reach coverage degraded    │
│ evidence-seal-bundles          02:45     02:46      OK       audit pack completion              │
│-------------------------------------------------------------------------------------------------│
│ Row actions: [View Run] [Open Scheduler] [Open Orchestrator] [Open Integration] [Open DLQ]       │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
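
One possible shape for the rows behind this report is sketched below. The `NightlyJob` type and the `summarize` and `needsAttention` helpers are hypothetical names introduced for illustration. The four sample rows are taken from the mock, not from a real job catalogue.

```typescript
// Hypothetical row model for the Nightly Ops Report; names are illustrative.
type JobStatus = "OK" | "WARN" | "FAIL";

interface NightlyJob {
  name: string;
  status: JobStatus;
  releaseImpact: string; // the "Why it matters" column
}

// Count jobs per status for the summary line.
function summarize(jobs: NightlyJob[]): Record<JobStatus, number> {
  const counts: Record<JobStatus, number> = { OK: 0, WARN: 0, FAIL: 0 };
  for (const job of jobs) counts[job.status] += 1;
  return counts;
}

// Jobs whose status puts a release governance capability in doubt.
function needsAttention(jobs: NightlyJob[]): NightlyJob[] {
  return jobs.filter((j) => j.status !== "OK");
}

// A subset of the rows shown in the mock.
const jobs: NightlyJob[] = [
  { name: "cve-sync-osv", status: "OK", releaseImpact: "vulnerability data freshness" },
  { name: "cve-sync-nvd", status: "WARN", releaseImpact: "NVD stale, gating confidence drops" },
  { name: "sbom-nightly-rescan", status: "FAIL", releaseImpact: "stale SBOM, approvals may block" },
  { name: "evidence-seal-bundles", status: "OK", releaseImpact: "audit pack completion" },
];

console.log(summarize(jobs));
console.log(needsAttention(jobs).map((j) => `${j.name}: ${j.releaseImpact}`));
```

Carrying the release impact on the row itself is what separates this report from a plain job list: the failing rows can be rendered with their governance consequence without a second lookup.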

15.5 Screen — Job Run Detail (root cause + affected assets)

Previously

  • Scheduler/Orchestrator showed raw execution, but not mapped to:

    • “affected environments”
    • “affected bundles”
    • “approvals degraded”

Why changed like this

This is the investigation page that bridges Ops mechanics to release decisions.

Screen graph (Mermaid)

flowchart TD
  A[Job Run Detail] --> B[Logs & traces]
  A --> C["Failed items list\n(images/components/envs)"]
  A --> D[Open DLQ bucket]
  A --> E[Open Integration Detail]
  A --> F[Show impacted approvals/releases]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ Job Run Detail: sbom-nightly-rescan (Run #8841)                                                 │
│ Legacy: Scheduler/Orchestrator run detail (without release impact mapping)                      │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Status: FAIL   Started: 02:21   Ended: 02:24   Error: registry auth timeout                     │
│ Integration: Harbor Registry (token expired)  → [Open Integration Detail]                        │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Affected items                                                                                  │
│  - 12 images not rescanned (SBOM freshness > 24h)                                               │
│  - 3 bundle versions impacted (approvals may block)                                             │
│  - Regions impacted: EU-West, US-East                                                           │
│ Links: [Open impacted approvals] [Open bundles] [Open DLQ bucket] [Open logs]                   │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
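
The mapping from failed items to impacted bundles and regions might look something like the sketch below. Everything here is an assumption made for illustration: the `BundleVersion` and `RunImpact` types, the `impactOf` helper, and the image and bundle names are hypothetical and do not come from the pack.

```typescript
// Hypothetical types; the pack does not define a data model for bundles.
interface BundleVersion {
  name: string;
  images: string[];
  regions: string[];
}

interface RunImpact {
  bundles: string[];
  regions: string[];
}

// A bundle is impacted when it contains at least one image the run failed to rescan.
function impactOf(failedImages: string[], bundles: BundleVersion[]): RunImpact {
  const hit = bundles.filter((b) =>
    b.images.some((img) => failedImages.indexOf(img) !== -1)
  );
  const regions: string[] = [];
  for (const b of hit) {
    for (const r of b.regions) {
      if (regions.indexOf(r) === -1) regions.push(r);
    }
  }
  return { bundles: hit.map((b) => b.name), regions: regions.sort() };
}

// Illustrative data only (image and bundle names are made up).
const bundles: BundleVersion[] = [
  { name: "bundle-a", images: ["img-api", "img-web"], regions: ["EU-West"] },
  { name: "bundle-b", images: ["img-worker"], regions: ["US-East", "EU-West"] },
  { name: "bundle-c", images: ["img-docs"], regions: ["APAC"] },
];

const impact = impactOf(["img-api", "img-worker"], bundles);
console.log(impact.bundles, impact.regions);
```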

15.6 Screen — Feeds Freshness (operator view, but tied to gating)

Previously (where it lived)

  • Operations → Feeds (“Feed Mirror & AirGap Operations” → Sources & Freshness)
  • Also partially visible as “feeds” cards under Integrations.

Why changed like this

Feeds Freshness becomes a Data Integrity subpage because it’s primarily about:

  • “Can we trust vulnerability data for today’s approvals?”

It still links to Feeds & AirGap Ops for mirrors/locks (no duplication).

Screen graph (Mermaid)

flowchart TD
  A[Feeds Freshness] --> B[Feeds & AirGap Ops: Sources]
  A --> C[Version Locks]
  A --> D[Mirror Detail]
  A --> E["Impacted approvals\n(CVE freshness gate)"]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ FEEDS FRESHNESS                                                   │
│ Legacy: Operations ▸ Feeds ▸ Sources & Freshness (and partial cards in Integrations)            │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Region: EU-West   SLA profile: Prod (fresh < 2h)                                                │
│                                                                                                 │
│ Source     Status     Last Sync   SLA   Resulting gate impact                                    │
│----------------------------------------------------------------------------------------------- │
│ OSV        OK         20m ago     6h    OK                                                       │
│ NVD        WARN       3h ago      2h    approvals may WARN/FAIL depending on baseline           │
│ CISA KEV   OK         3h ago      24h   OK                                                       │
│                                                                                                 │
│ Actions: [Open Feeds & AirGap Ops] [Apply Version Lock] [Retry NVD Sync]                         │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
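
The Status column can be derived by comparing the age of the last sync with the per-source SLA. The sketch below assumes the simplest rule, where anything past its SLA is WARN. The escalation to FAIL "depending on baseline" is not specified in the pack and is left out. `FeedSource` and `feedStatus` are hypothetical names, and the sample values come from the mock.

```typescript
// Hypothetical model; only OK/WARN are derived here because the pack does not
// say when a stale feed escalates from WARN to FAIL.
type FeedStatus = "OK" | "WARN";

interface FeedSource {
  name: string;
  lastSyncMinutesAgo: number;
  slaMinutes: number;
}

// A feed is fresh while its last sync is no older than its SLA.
function feedStatus(feed: FeedSource): FeedStatus {
  return feed.lastSyncMinutesAgo <= feed.slaMinutes ? "OK" : "WARN";
}

// Values copied from the mock (20m/6h, 3h/2h, 3h/24h).
const feeds: FeedSource[] = [
  { name: "OSV", lastSyncMinutesAgo: 20, slaMinutes: 6 * 60 },
  { name: "NVD", lastSyncMinutesAgo: 3 * 60, slaMinutes: 2 * 60 },
  { name: "CISA KEV", lastSyncMinutesAgo: 3 * 60, slaMinutes: 24 * 60 },
];

for (const feed of feeds) {
  console.log(`${feed.name}: ${feedStatus(feed)}`);
}
```

Note that NVD and CISA KEV have the same sync age in the mock but different outcomes, because the SLA is per source.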

15.7 Screen — Scan Pipeline Health (SBOM ingest + rescan + vulnerability match)

Previously

  • SBOM status scattered across:

    • Security views (findings)
    • Jobs views (background jobs)
    • Registry integration
  • No single “pipeline health” page to explain staleness.

Why changed like this

You explicitly require:

  • “nightly SBOM rescan issues”
  • “CVE source not synced”

This page shows the pipeline chain end-to-end and where it’s breaking.

Screen graph (Mermaid)

flowchart TD
  A[Scan Pipeline Health] --> B[SBOM ingest status]
  A --> C[SBOM rescan status]
  A --> D[CVE match status]
  A --> E[Open Nightly Ops Report]
  A --> F[Open Integrations]
  A --> G[Open Security findings impact]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ SCAN PIPELINE HEALTH                                              │
│ Legacy: implied across Security + System Jobs + Registry integration                             │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Pipeline stages (last 24h)                                                                      │
│  1) Image discovery (registry)        OK      new images: 48                                    │
│  2) SBOM generation/ingest            OK      sboms produced: 47  pending: 1                    │
│  3) Nightly SBOM rescan               FAIL    12 images stale > 24h                             │
│  4) CVE feeds sync                    WARN    NVD stale 3h                                      │
│  5) CVE ↔ SBOM match/update           WARN    results may be incomplete                          │
│                                                                                                 │
│ Impact summary                                                                                  │
│  - Environments with “unknown SBOM freshness”: 2 (EU-West prod, APAC uat)                       │
│  - Approvals blocked due to missing SBOM: 1                                                     │
│ Links: [Nightly Ops Report] [Feeds Freshness] [Integrations] [Security Findings]                │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

15.8 Screen — Reachability Ingest Health (Build / Image / Runtime)

Previously

  • Reachability was referenced in approvals/security, but ingestion health wasn’t first-class.
  • Runtime evidence depended on agent telemetry; failures were seen indirectly.

Why changed like this

You require hybrid reachability evidence from:

  • Dover image
  • build
  • running environment

This screen makes it operationally visible when one source is missing and reachability confidence is downgraded as a result.

Screen graph (Mermaid)

flowchart TD
  A[Reachability Ingest Health] --> B[Image/Dover ingest]
  A --> C[Build ingest]
  A --> D[Runtime ingest]
  A --> E[Agents health]
  A --> F[DLQ bucket]
  A --> G[Impact: approvals using reachability gate]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ REACHABILITY INGEST HEALTH                                        │
│ Legacy: implicit (Approvals/Security reachability columns) + Agent health elsewhere             │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Coverage (last 24h)                                                                             │
│  Image/Dover: 100% (OK)   Build: 78% (WARN)   Runtime: 35% (WARN)                               │
│                                                                                                 │
│ Pipelines                                                                                       │
│  Image/Dover ingest: OK   last batch: 02:31   backlog: 0                                        │
│  Build ingest:       WARN last batch: 01:10   backlog: 220 (CI degraded)                        │
│  Runtime ingest:     WARN last batch: 00:55   backlog: 1,230 (agent-apac-01 degraded)           │
│                                                                                                 │
│ Links: [Open Agents] [Open DLQ bucket] [Open impacted approvals]                                │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

15.9 Screen — Integration Connectivity (data-plane dependencies)

Previously

  • Settings → Integrations (hub)
  • But release operators need a data-integrity lens: “which pipeline is broken because which connector is down?”

Why changed like this

This view is the “dependency slice” of Integrations:

  • still links to the canonical Integrations Hub for configuration,
  • but shows pipeline impact directly (feeds/scans/reachability/evidence).

Screen graph (Mermaid)

flowchart TD
  A[Integration Connectivity] --> B[Integrations Hub]
  A --> C[Open Integration Detail]
  A --> D[Show dependent jobs]
  A --> E[Show impacted approvals/releases]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ INTEGRATION CONNECTIVITY                                          │
│ Legacy: Settings ▸ Integrations (no “pipeline impact” slice)                                    │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Connector        Status      Dependent pipelines                     Impact                      │
│----------------------------------------------------------------------------------------------- │
│ Harbor Registry  WARN        SBOM rescan, image discovery            rescan failing             │
│ Jenkins          DEGRADED    build reachability ingest, attestations build coverage down        │
│ Vault            OK          env input materialization               none                        │
│ Consul           OK          env config bindings                     none                        │
│ NVD Source       DISCONNECTED CVE freshness                          approvals warn/block       │
│                                                                                                 │
│ Actions per row: [Open Detail] [Test] [View dependent jobs] [View impacted approvals]           │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
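
The "dependency slice" amounts to inverting the connector table: for each pipeline, list the unhealthy connectors it depends on. A rough sketch, assuming any non-OK connector status puts its dependent pipelines at risk, is shown below. `Connector`, `PipelineRisk` and `pipelinesAtRisk` are hypothetical names, and the rows are copied from the mock.

```typescript
// Hypothetical model of the connector table; names are illustrative.
type ConnectorStatus = "OK" | "WARN" | "DEGRADED" | "DISCONNECTED";

interface Connector {
  name: string;
  status: ConnectorStatus;
  dependentPipelines: string[];
}

interface PipelineRisk {
  pipeline: string;
  causedBy: string[]; // unhealthy connectors this pipeline depends on
}

// Invert connector -> pipelines into pipeline -> unhealthy connectors.
function pipelinesAtRisk(connectors: Connector[]): PipelineRisk[] {
  const risks: PipelineRisk[] = [];
  for (const c of connectors) {
    if (c.status === "OK") continue;
    for (const pipeline of c.dependentPipelines) {
      const existing = risks.filter((r) => r.pipeline === pipeline)[0];
      if (existing) existing.causedBy.push(c.name);
      else risks.push({ pipeline, causedBy: [c.name] });
    }
  }
  return risks;
}

// Rows copied from the mock.
const connectors: Connector[] = [
  { name: "Harbor Registry", status: "WARN", dependentPipelines: ["SBOM rescan", "image discovery"] },
  { name: "Jenkins", status: "DEGRADED", dependentPipelines: ["build reachability ingest", "attestations"] },
  { name: "Vault", status: "OK", dependentPipelines: ["env input materialization"] },
  { name: "Consul", status: "OK", dependentPipelines: ["env config bindings"] },
  { name: "NVD Source", status: "DISCONNECTED", dependentPipelines: ["CVE freshness"] },
];

for (const risk of pipelinesAtRisk(connectors)) {
  console.log(`${risk.pipeline} <- ${risk.causedBy.join(", ")}`);
}
```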

15.10 Screen — DLQ & Replays (data pipelines stuck)

Previously

  • Operations → Dead Letter existed, but not clearly integrated into “why approvals are unsafe.”

Why changed like this

This screen becomes the “last mile” of data integrity:

  • When pipelines fail, DLQ grows.
  • DLQ items correspond to missing SBOM updates, missing reachability evidence, failed evidence sealing.

Screen graph (Mermaid)

flowchart TD
  A[DLQ & Replays] --> B[DLQ buckets by pipeline]
  A --> C[Item detail + payload]
  A --> D[Replay item]
  A --> E[Open Job Run Detail]
  A --> F[Open Integration Detail]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ DLQ & REPLAYS                                                     │
│ Legacy: Operations ▸ Dead Letter (queue view without release impact context)                    │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Buckets (24h)                                                                                   │
│  reachability-runtime-ingest: 1,230  (agent degraded)                                            │
│  sbom-nightly-rescan:         340    (registry auth timeout)                                     │
│  evidence-seal-bundles:        12    (transparency log unreachable)                              │
│                                                                                                 │
│ Select bucket → items                                                                            │
│  item-7781  payload: runtime-trace batch#991  age: 2h  action: [Replay] [View] [Link job]       │
│  item-7782  payload: runtime-trace batch#992  age: 2h  action: [Replay] [View] [Link job]       │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
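
The bucket view can be thought of as a grouping of DLQ items by pipeline, largest backlog first. The sketch below is one way to do that. `DlqItem`, `BucketSummary` and `summarizeBuckets` are hypothetical names. The first two items mirror the mock, and the third (`item-9001`) is made up for illustration.

```typescript
// Hypothetical DLQ item model; field names are illustrative.
interface DlqItem {
  id: string;
  bucket: string; // pipeline the item belongs to
  ageHours: number;
}

interface BucketSummary {
  bucket: string;
  count: number;
  oldestAgeHours: number;
}

// Group items by bucket, tracking size and oldest item; largest bucket first.
function summarizeBuckets(items: DlqItem[]): BucketSummary[] {
  const summaries: BucketSummary[] = [];
  for (const item of items) {
    const existing = summaries.filter((s) => s.bucket === item.bucket)[0];
    if (existing) {
      existing.count += 1;
      existing.oldestAgeHours = Math.max(existing.oldestAgeHours, item.ageHours);
    } else {
      summaries.push({ bucket: item.bucket, count: 1, oldestAgeHours: item.ageHours });
    }
  }
  return summaries.sort((a, b) => b.count - a.count);
}

const dlqItems: DlqItem[] = [
  { id: "item-7781", bucket: "reachability-runtime-ingest", ageHours: 2 },
  { id: "item-7782", bucket: "reachability-runtime-ingest", ageHours: 2 },
  { id: "item-9001", bucket: "sbom-nightly-rescan", ageHours: 5 }, // made-up item
];

console.log(summarizeBuckets(dlqItems));
```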

15.11 Screen — Data Quality SLOs (data-SLO slice, links to System SLO Monitoring)

Previously

  • Settings/System → SLO Monitoring (or System root in the redesigned IA)

Why changed like this

Keep SLO engine canonical under System, but provide a “data integrity slice” here so operators see:

  • feed freshness SLO
  • SBOM staleness SLO
  • runtime coverage SLO

…with deep links to the full SLO view.

Screen graph (Mermaid)

flowchart TD
  A[Data Quality SLOs] --> B["System SLO Monitoring (canonical)"]
  A --> C[Show SLO breaches that impact approvals]
  A --> D[Open impacted approvals/releases]

ASCII mock

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ OPERATIONS ▸ DATA INTEGRITY ▸ DATA QUALITY SLOs                                                 │
│ Legacy: System ▸ SLO Monitoring (not scoped to data integrity)                                  │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ SLO                                Target     Current     Status     Approval impact            │
│----------------------------------------------------------------------------------------------- │
│ CVE feed freshness (NVD/OSV)        <2h        3h          WARN       gate may warn/fail         │
│ SBOM staleness (prod envs)          <24h       12 stale    FAIL       blocks prod promotions     │
│ Runtime reach coverage (prod)       >50%       35%         WARN       reduces confidence         │
│                                                                                                 │
│ Links: [Open System SLO Monitoring] [Open impacted approvals]                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
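
A breach check for the targets in this table could be as small as the sketch below. `DataSlo` and `isBreached` are hypothetical names, and the check only answers "breached or not". How a breach maps to WARN versus FAIL is not defined in the pack, so it is not modelled. The SBOM staleness row is omitted because the mock reports it as a count ("12 stale") rather than a value comparable to the target.

```typescript
// Hypothetical SLO model; "below" means the current value must stay under the target.
interface DataSlo {
  name: string;
  direction: "below" | "above";
  target: number;
  current: number;
  unit: string;
}

// An SLO is breached when the current value is on the wrong side of the target.
function isBreached(slo: DataSlo): boolean {
  return slo.direction === "below"
    ? slo.current >= slo.target
    : slo.current <= slo.target;
}

// Two rows from the mock: "<2h, currently 3h" and ">50%, currently 35%".
const slos: DataSlo[] = [
  { name: "CVE feed freshness (NVD/OSV)", direction: "below", target: 2, current: 3, unit: "h" },
  { name: "Runtime reach coverage (prod)", direction: "above", target: 50, current: 35, unit: "%" },
];

for (const slo of slos) {
  console.log(`${slo.name}: ${isBreached(slo) ? "breached" : "within target"}`);
}
```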

What this pack changes on other screens (without duplicating them)

These are UI hooks (badges/banners/cards) driven by Data Integrity:

  • Dashboard: “Nightly Ops Signals” card points to Ops → Data Integrity Overview.
  • Releases list: “Data Health” column/badge links to Data Integrity Overview filtered to the region.
  • Approvals: “Ops/Data” tab links to Data Integrity Overview + the exact failing job/feed/DLQ bucket.
  • Security Overview: shows “feeds fresh / stale” and “SBOM freshness” badges, with a link to Data Integrity.

If you want the next pack: Pack 16 can update the Dashboard mock explicitly to add the “Nightly Ops Signals” card and the SBOM + reachable criticals by environment summary you requested earlier, wired directly into the Data Integrity + Security + Env Detail pages.