Add a dedicated “first_signal” event

Here's a lightweight pattern to make failures show up instantly while keeping backends decoupled: emit a tiny, versioned event the moment you know something failed, and attach pointers to heavier evidence that can arrive later.


Why this helps

  • UI reacts in real time: show “Failed at Step X (E123)” immediately—no waiting for logs, SBOMs, or artifacts to upload/process.
  • Backends evolve safely: logs, traces, SBOM/VEX, heap dumps, etc., can change format or arrive out of order without breaking the UI contract.
  • Deterministic UX: a small, stable schema prevents flaky pipelines from blocking visibility.
  • Great for airgapped/offline: the tiny event rides your internal bus/storage; bulky payloads sync or materialize when available.

The event itself (keep it tiny)

Fields (stable, versioned):

  • v — schema version (e.g., 1).
  • ts — event timestamp (UTC, ISO 8601).
  • run_id — pipeline/execution correlation ID.
  • stage — coarse phase (e.g., fetch, build, scan, policy, deploy).
  • step — fine-grained step (e.g., trivy-scan, dotnet-restore).
  • status — fail|warn|pass|info (for this pattern, you'll use fail).
  • error_class — stable classifier (e.g., NETWORK_DNS, AUTH_EXPIRED, POLICY_BLOCK, VULN_REACHABLE).
  • summary — short human string (“Reachable vuln blocks release”).
  • pointers — array of opaque, resolvable references (log offsets, artifact URIs, attestation IDs).
  • kv — optional tiny key/values for quick filtering (e.g., severity=A, package=openssl).
  • sig (optional) — detached/inline signature (DSSE) for integrity.

Example

{
  "v": 1,
  "ts": "2025-12-13T12:10:03Z",
  "run_id": "run_7f3c6a8",
  "stage": "policy",
  "step": "vex-gate",
  "status": "fail",
  "error_class": "VULN_REACHABLE",
  "summary": "Reachable CVE blocks release",
  "pointers": [
    {"type":"log", "ref":"logs://scanner/7f3c6a8#L1423-L1480"},
    {"type":"attestation", "ref":"rekor://sha256:…"},
    {"type":"sbom", "ref":"artifact://sbom/cyclonedx@run_7f3c6a8.json"}
  ],
  "kv": {"cve":"CVE-2025-12345", "component":"openssl", "severity":"A"}
}

UI behavior (instant, then enrich)

  1. Instant render (sub-200 ms): show a red card with stage/step, error_class, and summary.

  2. Progressive hydration: as pointers resolve, add:

    • “View log excerpt” (jump to #L1423-L1480)
    • “Open attestation” (verify DSSE/Rekor)
    • “Inspect SBOM diff” (component → version → callgraph)
  3. Stable affordances: UI never breaks if a pointer is slow/missing; it just shows a spinner or “awaiting evidence”.


Backend contract

  • Publish early: emit on first knowledge of failure (e.g., nonzero exit, policy deny, TLS error).
  • Don't embed heavy data: only pointers or tiny facts for filters.
  • Pointer resolution is pluggable: files, object storage, Postgres row, Valkey cache key, Rekor entry—whatever suits the deployment.
  • Version discipline: bump v only for breaking schema changes; additive fields are fine.

Minimal topic map (so teams agree on names)

  • stage: fetch|build|scan|policy|sign|package|deploy

  • error_class suggestions:

    • Infra: NETWORK_DNS, NETWORK_TIMEOUT, REGISTRY_403, DISK_FULL
    • AuthN/Z: AUTH_EXPIRED, TOKEN_SCOPE_MISS
    • Supply chain: ATTESTATION_MISSING, SIGNATURE_INVALID, SBOM_STALE
    • Secure build: POLICY_BLOCK, VULN_REACHABLE, MALWARE_FLAG
    • Runtime: IMAGE_DRIFT, PROVENANCE_MISMATCH

Keep each to a 1–2 line definition in a shared doc.


Drop-in for StellaOps (tailored)

  • Emitter: StellaOps.Events (tiny .NET lib) used by Scanner/Policy/Scheduler to publish TinyFailureEvent.
  • Transport: Postgres notify (default) + Valkey pub/sub accelerator. (Matches your Postgres+Valkey architecture choice.)
  • Resolver service: EvidenceGateway that turns pointers into viewable slices (log excerpts, SBOM component focus, Rekor proof).
  • UI: “Failure Feed” panel shows cards from the event stream; detail drawer resolves pointers on demand.
  • Signing: optional DSSE for events; Rekor (or mirror) for attestations—your “ProofLinked” moat.
  • Airgap: pointers use artifact:// and row:// schemes resolvable entirely on-prem.

Quick implementation checklist

  • Define TinyFailureEvent schema v1 and error_class registry.
  • Add emit helpers for each module (FailNow(summary, error_class, pointers, kv)); see the sketch after this list.
  • Build EvidenceGateway.Resolve(pointer) handlers.
  • UI: render card instantly; hydrate sections as resolvers return.
  • Telemetry: metrics on TTFE (TimeToFailureEvent) and pointer hydration latencies.
  • Docs: 1-page contract; examples for each error_class.
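
For concreteness, here's a minimal sketch of the emit-helper shape. It's in TypeScript to match the reducer spec later in this document (the production SDK would be .NET per the drop-in notes above); failNow, the publish callback, and the ulid dependency are illustrative assumptions, not existing StellaOps APIs.

// Sketch only: the shape of a FailNow-style helper.
import { ulid } from 'ulid'; // hypothetical dependency for time-sortable IDs

type Pointer = { type: string; ref: string; label?: string };

interface TinyFailureEvent {
  v: 1;
  event_id: string;
  ts: string;            // RFC3339 UTC
  run_id: string;
  stage: string;
  step: string;
  attempt: number;
  status: 'fail';
  error_class: string;
  summary: string;
  pointers: Pointer[];
  kv: Record<string, string>;
}

function failNow(
  publish: (e: TinyFailureEvent) => Promise<void>,
  run_id: string, stage: string, step: string, attempt: number,
  error_class: string, summary: string,
  pointers: Pointer[] = [], kv: Record<string, string> = {},
): Promise<void> {
  return publish({
    v: 1,
    event_id: `evt_${ulid()}`,      // ULID: globally unique, time-sortable
    ts: new Date().toISOString(),
    run_id, stage, step, attempt,
    status: 'fail',
    error_class,
    summary: summary.slice(0, 140), // enforce the 140-char cap
    pointers, kv,
  });
}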

If you want, I can draft the .NET 10 interfaces (ITinyEventEmitter, resolvers, and a small Razor/Angular card) and a Postgres schema you can paste into your repo.

Below is a PM-grade implementation spec for “Real-time Failure Signaling” using Tiny Failure Events + Evidence Pointers, written so engineers can build it without guessing.


Product: Real-time Failure Signaling (Tiny Failure Events)

Goal

When any pipeline run fails, users must see what failed and where (stage/step + error class + short summary) immediately, even if logs/SBOM/attestations are delayed, huge, or unavailable.

The UI must render a failure card from a tiny event and then progressively enrich with evidence as it becomes resolvable.

Outcomes we must deliver

  1. Instant visibility: “Failed at Step X” appears within seconds of failure.
  2. Decoupling: UI depends only on a stable tiny schema, not on log formats/artifact structures.
  3. Evidence linking: Users can open logs/SBOM/attestations when available, via pointers.
  4. Reliability: Duplicate/out-of-order events don't break the UI; state remains consistent.
  5. Security: Evidence access is authorized; pointers do not leak sensitive info.

Scope

In scope (MVP)

  • Emit TinyFailureEvent v1 on first detected failure for a step.

  • Transport events in near real-time to UI.

  • Store events durably and allow the UI to fetch a run's event timeline.

  • Support evidence pointers for:

    • logs (excerptable)
    • artifacts (SBOM, reports)
    • attestations (provenance/signature)
  • UI:

    • show run timeline
    • show failure card instantly
    • hydrate evidence sections on demand (or automatically where feasible)

Out of scope (MVP)

  • Full trace viewer / distributed tracing UI (we can link to external trace systems via pointer).
  • Automated remediation (“fix it”) actions.
  • Full-blown case management.

Key terms and definitions

  • Run: A single execution of a pipeline. Identified by run_id.
  • Stage: Coarse lifecycle phase (fetch, build, scan, policy, sign, package, deploy).
  • Step: A concrete activity within a stage (dotnet-restore, trivy-scan, vex-gate).
  • Tiny Failure Event: A small message representing “this step failed”, including stable classification and references to evidence.
  • Pointer: An opaque reference that can be resolved into evidence content or a link later.

User stories and acceptance criteria

Story 1: I see failure instantly

As a developer, I want to see which step failed immediately, so that I don't wait on logs/artifacts.

Acceptance criteria

  • When a step fails, the UI updates within ≤ 2 seconds p95 from the time the orchestrator/runner detects failure.

  • The failure card includes:

    • stage, step
    • error class
    • human summary
    • timestamp
    • (optional) primary key/value details (e.g., CVE, severity)

Story 2: I can open evidence when available

As a release engineer, I want to click evidence links (logs/SBOM/attestation), so that I can diagnose and root-cause failures.

Acceptance criteria

  • Failure card shows evidence sections as:

    • Available (clickable)
    • Pending (spinner / “awaiting evidence”)
    • Unavailable (“not produced” or “access denied”)
  • Clicking log evidence opens an excerpt view, not a 500MB file download.

  • Evidence access enforces authorization (same as run access).

Story 3: Events are robust to duplicates/out-of-order

As a user, I want the timeline to remain correct, even if event delivery is at-least-once.

Acceptance criteria

  • UI displays exactly one current “failed” state per step attempt.
  • Duplicate events do not create duplicate cards.
  • Out-of-order arrival does not revert a step from fail → pass.

Functional requirements (what developers must build)

FR1: TinyFailureEvent schema v1

Required fields

All producers MUST emit events that validate against this schema.

{
  "v": 1,
  "event_id": "evt_01J…", 
  "ts": "2025-12-13T12:10:03.123Z",
  "run_id": "run_7f3c6a8",
  "stage": "policy",
  "step": "vex-gate",
  "attempt": 1,
  "status": "fail",
  "error_class": "VULN_REACHABLE",
  "summary": "Reachable CVE blocks release",
  "pointers": [],
  "kv": {}
}

Field definitions & constraints

  • v (int, required): must be 1 for this spec.

  • event_id (string, required): globally unique.

    • Format: evt_<ULID> (ULID recommended for time-sortable IDs).
  • ts (RFC3339 UTC, required): creation timestamp.

  • run_id (string, required): stable correlation ID for the run.

  • stage (enum string, required): one of:

    • fetch|build|scan|policy|sign|package|deploy|runtime
  • step (string, required): lowercase kebab-case recommended; max 80 chars.

  • attempt (int, required): starts at 1; increments for retries.

  • status (enum string, required for this feature): fail (MVP supports fail only; schema allows later expansion)

  • error_class (string, required): stable classifier from a shared registry (see FR2).

    • max 64 chars; uppercase snake-case.
  • summary (string, required): human readable, max 140 chars.

  • pointers (array, optional): max 20 items; each item is a Pointer object (see FR3).

  • kv (object, optional): small metadata map for filtering.

    • max 20 keys
    • key max 32 chars; value max 120 chars
    • no nested objects/arrays

Size limits

  • Entire event payload MUST be ≤ 8 KB serialized JSON.
  • If producers exceed limits, they MUST truncate summary and drop low-priority kv keys before failing emission.
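
A sketch of that shrink-to-fit rule, reusing the TinyFailureEvent shape sketched earlier; the kv priority order (drop the last-inserted key first) is an assumption:

const MAX_EVENT_BYTES = 8 * 1024;

// Shrink an event to the 8 KB cap: cap the summary first, then drop kv keys
// (last-inserted first) until the serialized payload fits.
function enforceSizeLimit(event: TinyFailureEvent): TinyFailureEvent {
  const bytes = (e: TinyFailureEvent) =>
    new TextEncoder().encode(JSON.stringify(e)).length;

  let e: TinyFailureEvent = { ...event, summary: event.summary.slice(0, 140) };
  const keys = Object.keys(e.kv); // assumed priority: earlier keys matter more
  while (bytes(e) > MAX_EVENT_BYTES && keys.length > 0) {
    const dropped = keys.pop()!;
    const kv = { ...e.kv };
    delete kv[dropped];
    e = { ...e, kv };
  }
  return e;
}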

FR2: Error class registry (stable contract)

We maintain a canonical list of error_class values in a shared repo/module.

Requirements

  • Each error_class MUST have:

    • name (e.g., NETWORK_DNS)
    • short description
    • severity mapping (optional)
    • recommended remediation hints (optional, can be UI-side)
  • Producers MUST use a registry value if applicable.

  • Producers MAY emit error_class="UNKNOWN" if no mapping exists, but must log a warning and increment a metric.

Initial registry (minimum)

Infra/Network:

  • NETWORK_DNS
  • NETWORK_TIMEOUT
  • DISK_FULL

Auth:

  • AUTH_EXPIRED
  • REGISTRY_403

Supply chain:

  • SIGNATURE_INVALID
  • ATTESTATION_MISSING
  • SBOM_MISSING

Policy/Security:

  • POLICY_BLOCK
  • VULN_REACHABLE
  • MALWARE_FLAG

Runner/Orchestrator:

  • STEP_TIMEOUT
  • RUN_ABORTED
  • WORKER_LOST
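
One way to make this registry a typed artifact instead of prose (a sketch; the ErrorClassDef shape, descriptions, and severities are assumptions):

type ErrorClassDef = {
  name: string;
  description: string; // the 1–2 line definition from the shared doc
  severity?: 'low' | 'medium' | 'high' | 'critical';
};

// Subset of the initial registry above; entries here are illustrative.
const ERROR_CLASS_REGISTRY: Record<string, ErrorClassDef> = {
  NETWORK_DNS:    { name: 'NETWORK_DNS',    description: 'DNS resolution failed.' },
  AUTH_EXPIRED:   { name: 'AUTH_EXPIRED',   description: 'Credential or token expired.' },
  POLICY_BLOCK:   { name: 'POLICY_BLOCK',   description: 'Policy engine denied the gate.', severity: 'high' },
  VULN_REACHABLE: { name: 'VULN_REACHABLE', description: 'Reachable vulnerability blocks release.', severity: 'critical' },
  UNKNOWN:        { name: 'UNKNOWN',        description: 'No mapping; log a warning and increment a metric.' },
};

// Producers check membership before emitting; unknown classes become UNKNOWN.
function isKnownErrorClass(c: string): boolean {
  return c in ERROR_CLASS_REGISTRY;
}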

FR3: Evidence pointer format and rules

Pointer object schema

{
  "type": "log|artifact|attestation|url|trace",
  "ref": "logs://scanner/run_7f3c6a8#L1423-L1480",
  "mime": "text/plain",
  "label": "Scanner log excerpt",
  "expires_at": "2025-12-20T00:00:00Z",
  "sha256": "optional hex"
}

Rules

  • type and ref are required.
  • ref is opaque to UI; UI passes it to the resolver service.
  • label is optional, but strongly recommended for UI friendliness.
  • expires_at is optional; if present, the UI should show “may expire”.
  • sha256 optional for immutability verification (artifacts/attestations especially).

Allowed schemes (MVP)

  • logs://<provider>/<run_id>#Lx-Ly
  • artifact://<kind>/<name>@<version-or-run-id>
  • attestation://<store>/<id-or-digest>
  • url://<encoded> (only internal allowed; resolver enforces)
  • trace://<system>/<trace-id>
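
A parsing sketch for the logs:// scheme, the only one with internal structure (#Lx-Ly); the regex and the LogRef shape are assumptions:

type LogRef = {
  provider: string;
  runId: string;
  startLine: number;
  endLine: number;
};

// Parse logs://<provider>/<run_id>#Lx-Ly. Returns null for anything else so a
// resolver can fall through to the other scheme handlers.
function parseLogRef(ref: string): LogRef | null {
  const m = /^logs:\/\/([^/]+)\/([^#]+)#L(\d+)-L(\d+)$/.exec(ref);
  if (!m) return null;
  const [, provider, runId, start, end] = m;
  const startLine = Number(start);
  const endLine = Number(end);
  return endLine >= startLine ? { provider, runId, startLine, endLine } : null;
}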

Security constraints

  • Pointers MUST NOT embed secrets (tokens, passwords).
  • Any pointer that could expose sensitive data MUST be resolvable only through the Evidence Gateway (FR6), never directly client-side.
  • The resolver MUST enforce authorization for the requesting user.

FR4: Emission rules (when and how events are produced)

When to emit

Producers MUST emit a TinyFailureEvent when:

  1. A step exits non-zero.
  2. A policy decision is “deny/block”.
  3. A required artifact/attestation is missing at gate time.
  4. A step times out.
  5. The worker is lost (emitted by orchestrator watchdog).

Exactly-once vs at-least-once

  • Transport can be at-least-once.
  • Consumers MUST be idempotent using (run_id, stage, step, attempt, status) + event_id.

One failure event per step attempt

  • For a given (run_id, stage, step, attempt):

    • First emitted status=fail is canonical.
    • Later fail events for the same tuple are treated as “updates” only if they add pointers/kv (see FR5).

Updates / enrichment

We support enrichment without breaking “tiny”:

  • Producers MAY emit a second event with the same tuple (run_id/stage/step/attempt/status) that adds pointers or kv after the initial fail.
  • Consumers MUST merge pointers (dedupe identical type+ref) and merge kv (new keys overwrite old keys).
  • Producers MUST NOT spam; max 3 enrichment events per tuple.
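
A consumer-side sketch of these dedupe-and-merge rules, assuming a Set of event_ids and a Map keyed by tuple are acceptable in-memory structures:

const seenEventIds = new Set<string>();
const byTuple = new Map<string, TinyFailureEvent>();

// At-least-once-safe ingest: exact duplicates drop by event_id; same-tuple
// events merge pointers (dedupe on type+ref) and kv (new keys overwrite old).
function ingest(e: TinyFailureEvent): void {
  if (seenEventIds.has(e.event_id)) return; // duplicate delivery
  seenEventIds.add(e.event_id);

  const tuple = `${e.run_id}|${e.stage}|${e.step}|${e.attempt}|${e.status}`;
  const prev = byTuple.get(tuple);
  if (!prev) {
    byTuple.set(tuple, e); // first fail for the tuple is canonical
    return;
  }
  const ptrs = new Map<string, Pointer>();
  for (const p of [...prev.pointers, ...e.pointers]) ptrs.set(`${p.type}|${p.ref}`, p);
  byTuple.set(tuple, {
    ...prev, // canonical summary/error_class kept
    pointers: [...ptrs.values()],
    kv: { ...prev.kv, ...e.kv },
  });
}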

FR5: Event storage and aggregation

Required services/components

  1. Event Ingest (API or internal library endpoint)
  2. Event Store (durable DB table)
  3. Realtime Fanout (pub/sub channel)
  4. Run Timeline API (query per run)

Behavior

  • On ingest:

    • Validate schema (reject invalid with 400/validation error).
    • Persist to event store.
    • Publish to realtime channel.

Suggested DB model (Postgres)

Table: run_events

  • event_id PK
  • run_id indexed
  • ts indexed
  • stage, step, attempt, status indexed composite
  • payload jsonb
  • ingested_at

Uniqueness constraints:

  • event_id unique
  • Optional: unique on (run_id, stage, step, attempt, status, hash(summary)) if you want stronger dedupe
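
A DDL sketch matching the model above, carried in a TypeScript migration constant; column types and index names are assumptions:

// Migration sketch for run_events; types and index names are assumptions.
export const CREATE_RUN_EVENTS = `
  CREATE TABLE IF NOT EXISTS run_events (
    event_id    text PRIMARY KEY,
    run_id      text        NOT NULL,
    ts          timestamptz NOT NULL,
    stage       text        NOT NULL,
    step        text        NOT NULL,
    attempt     int         NOT NULL,
    status      text        NOT NULL,
    payload     jsonb       NOT NULL,
    ingested_at timestamptz NOT NULL DEFAULT now()
  );
  CREATE INDEX IF NOT EXISTS run_events_run_id_idx ON run_events (run_id);
  CREATE INDEX IF NOT EXISTS run_events_ts_idx     ON run_events (ts);
  CREATE INDEX IF NOT EXISTS run_events_tuple_idx
    ON run_events (run_id, stage, step, attempt, status);
`;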

Query API

  • GET /runs/{run_id}/events returns events sorted by ts ascending.
  • UI should also subscribe to the realtime channel to avoid polling.

FR6: Evidence Gateway (pointer resolver)

Purpose

A single service that resolves pointers into either:

  • log excerpts
  • signed download URLs
  • attestation display + verification data
  • external trace links (sanitized)

Endpoints (MVP)

  1. Resolve metadata

    • POST /evidence/resolve

    • body: { "run_id": "...", "pointers": [ { "type": "...", "ref": "..." } ] }

    • returns per pointer:

      • status: available|pending|missing|denied|expired|error
      • kind: inline|link
      • title
      • mime
      • size_bytes (if known)
      • link (if kind=link) must be short-lived, server-generated
      • inline_preview (optional, small excerpt)
  2. Fetch log excerpt

    • GET /evidence/log-excerpt?ref=...

    • returns:

      • text (max 64 KB)
      • start_line, end_line
      • source (provider info)
  3. Fetch artifact

    • GET /evidence/artifact?ref=...

    • returns either:

      • short-lived download link
      • or 404/403/410
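
The per-pointer and log-excerpt response shapes, restated as TypeScript types (a sketch mirroring the field lists above):

type PointerStatus =
  | 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';

// Per-pointer result of POST /evidence/resolve.
type ResolvedPointer = {
  status: PointerStatus;
  kind?: 'inline' | 'link';
  title?: string;
  mime?: string;
  size_bytes?: number;
  link?: string;           // short-lived, server-generated
  inline_preview?: string; // small excerpt, ≤ 4 KB
};

// Result of GET /evidence/log-excerpt.
type LogExcerpt = {
  text: string;       // max 64 KB
  start_line: number;
  end_line: number;
  source: string;     // provider info
};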

AuthZ requirements

  • Evidence Gateway MUST verify the caller has access to the run_id.
  • Gateway MUST validate that the pointer belongs to that run (or is explicitly declared “global shared”).
  • Gateway MUST audit-log every evidence resolution.

Resilience

  • If evidence is not ready, resolver returns pending, not 500.
  • If pointer is unknown format, return error with a safe message.

UI requirements (what the product must do)

UI1: Run timeline renders from events

  • The run detail page MUST show:

    • stages/steps list
    • current state per step (pass/warn/fail/running)
    • failure details if fail exists
  • The failure state MUST be derived from TinyFailureEvent without requiring any log fetch.

UI2: Failure card content (minimum)

When a fail event arrives:

  • Show a red failure card with:

    • stage + step
    • summary
    • error_class badge
    • ts (relative + absolute on hover)
    • key kv fields (up to 4 shown; remainder behind “Show more”)

UI3: Progressive hydration

  • The card MUST include an “Evidence” section.

  • For each pointer:

    • show a row with label and availability status
    • if available, show “Open”
    • if pending, show spinner + “Awaiting evidence”
    • if denied, show lock icon + “No access”
    • if missing, show “Not produced”
  • Clicking “Open”:

    • logs open excerpt viewer (modal/drawer)
    • artifacts open in viewer or download (type-dependent)
    • attestations open verification view

UI4: Realtime behavior

  • UI MUST subscribe to realtime events for the run.

  • UI MUST apply idempotent merge logic:

    • dedupe by event_id
    • merge enrichment events by tuple (run_id/stage/step/attempt/status)

UI5: Ordering and out-of-order handling

  • UI MUST sort by ts for display.

  • UI MUST NOT regress a step state if a late “pass/info” arrives after fail.

    • Rule: fail is terminal for a step attempt.

Non-functional requirements

Latency

  • From failure detection to UI update: ≤ 2s p95, ≤ 5s p99 (within the same network).

  • Evidence resolution:

    • resolve call should return in ≤ 300ms p95 for cached/known pointers.

Reliability

  • Event ingestion must be durable (stored) before fanout.

  • System must tolerate:

    • duplicates
    • retries
    • out-of-order delivery
    • partial evidence availability

Payload limits

  • Event size ≤ 8 KB
  • Evidence inline previews ≤ 4 KB per pointer

Retention

  • Tiny events retained ≥ 30 days (configurable).
  • Evidence retention depends on provider, but resolver must surface expiry.

Metrics and instrumentation (definition of success)

Producers + ingestion MUST emit:

  • ttfe_ms: time to failure event (from step start or from failure detection)
  • event_ingest_latency_ms
  • event_validation_fail_count
  • unknown_error_class_count
  • pointer_resolution_status_count{available|pending|missing|denied|expired|error}
  • pointer_hydration_latency_ms

UI MUST log:

  • time from run page open → first event rendered
  • evidence open clickthrough rate
  • evidence resolution failure rate

Edge cases we explicitly handle

  1. Runner killed before it can emit

    • Orchestrator watchdog emits WORKER_LOST with stage/step best-effort.
  2. Logs produced after failure

    • Initial fail event has no log pointer.
    • Later enrichment event adds log pointer (same tuple).
  3. Evidence exists but user lacks access

    • Resolver returns denied; UI shows locked state.
  4. Evidence link expired

    • Resolver returns expired and provides a “Refresh” action that re-resolves.
  5. Multiple retries

    • attempt increments; UI shows attempt number and keeps prior attempt history.

Definition of Done (engineers can ship when…)

Backend DoD

  • Schema validation implemented.

  • Ingest → store → fanout pipeline working.

  • Enrichment merge logic implemented.

  • Evidence Gateway resolves at least:

    • log excerpt pointers
    • artifact pointers
    • attestation pointers
  • AuthZ enforced.

Frontend DoD

  • Run page shows failure card from TinyFailureEvent alone.
  • Evidence hydration works and handles all resolver statuses.
  • Realtime updates work; duplicates/out-of-order are safe.

QA DoD (minimum test cases)

  1. Step fails; event arrives; UI shows failure without logs.
  2. Log pointer arrives later; UI updates evidence section.
  3. Duplicate event delivery; UI shows one failure.
  4. Evidence denied; UI shows lock state.
  5. Out-of-order: enrichment arrives before initial fail; UI still resolves correctly.

Implementation checklist (suggested division of work)

Team A: Producer SDK

  • StellaOps.Events library:

    • EmitFailure(run_id, stage, step, attempt, error_class, summary, pointers, kv)
    • ULID generation
    • payload size enforcement
    • retry w/ backoff

Team B: Ingest + Store + Fanout

  • API endpoint or internal gRPC for ingest
  • Postgres run_events table
  • Valkey pub/sub channel: run:{run_id}:events

Team C: Evidence Gateway

  • pointer parsing + resolvers
  • log excerpt adapter(s)
  • artifact download adapter(s)
  • attestation viewer adapter(s)

Team D: UI

  • realtime subscription
  • timeline state reducer w/ merge rules
  • evidence drawer + resolver calls

If you want, I can also provide:

  • a formal JSON Schema file for v1 (drop-in for validation),
  • a state reducer spec for the UI (pure function inputs/outputs),
  • and a Postgres migration + example queries.

Below is a UI State Reducer Spec (pure function inputs/outputs) for the Run Detail page that renders the timeline + step statuses + failure cards + evidence hydration using TinyFailureEvents (and is future-compatible with pass/warn/info).

This is written so devs can implement it as a deterministic reducer (Redux, Zustand w/ reducer, Elm-style update, etc.).


UI State Reducer Spec v1: Run Timeline + Failure Cards

Reducer contract

Pure function

reduceRunView(state: RunViewState, action: Action): RunViewState

Guarantees

  • Pure & deterministic: no IO, no timers, no random IDs, no Date.now() inside reducer.
  • Idempotent: applying the same RUN_EVENT_RECEIVED twice yields the same state after the first time.
  • Order-safe: out-of-order events never “downgrade” a step attempt from fail → pass.

1) Data types

1.1 Event type used by reducer

type StageName =
  | 'fetch' | 'build' | 'scan' | 'policy'
  | 'sign' | 'package' | 'deploy' | 'runtime';

type StepStatus =
  // present now (MVP)
  | 'fail'
  // future-compatible
  | 'warn' | 'pass' | 'running' | 'queued' | 'info' | 'unknown';

type PointerType = 'log' | 'artifact' | 'attestation' | 'url' | 'trace';

type Pointer = {
  type: PointerType;
  ref: string;
  mime?: string;
  label?: string;
  expires_at?: string; // RFC3339
  sha256?: string;
};

type TinyEventV1 = {
  v: 1;
  event_id: string;
  ts: string;          // RFC3339 UTC
  run_id: string;
  stage: StageName;
  step: string;
  attempt: number;
  status: StepStatus;  // MVP sends 'fail' only
  error_class: string;
  summary: string;
  pointers?: Pointer[];
  kv?: Record<string, string>;
};

// Normalized for sorting and comparisons (created outside or inside reducer deterministically)
type NormalizedEvent = TinyEventV1 & {
  tsMs: number; // parse(ts) -> number, invalid => 0
};

1.2 Keys and comparisons

type TupleKey = string;       // `${stage}|${step}|${attempt}|${status}`
type StepAttemptKey = string; // `${stage}|${step}|${attempt}`
type StepIdentityKey = string; // `${stage}|${step}` (no attempt)
type PointerKey = string;     // `${type}|${ref}`

function tupleKey(e: TinyEventV1): TupleKey {
  return `${e.stage}|${e.step}|${e.attempt}|${e.status}`;
}
function stepAttemptKey(e: TinyEventV1): StepAttemptKey {
  return `${e.stage}|${e.step}|${e.attempt}`;
}
function stepIdentityKey(e: TinyEventV1): StepIdentityKey {
  return `${e.stage}|${e.step}`;
}
function pointerKey(p: Pointer): PointerKey {
  return `${p.type}|${p.ref}`;
}

// Sort: ts ascending, then event_id lexicographically (stable deterministic tiebreak)
function compareEvent(a: NormalizedEvent, b: NormalizedEvent): number {
  if (a.tsMs !== b.tsMs) return a.tsMs - b.tsMs;
  return a.event_id < b.event_id ? -1 : (a.event_id > b.event_id ? 1 : 0);
}

1.3 Status ranking rule (terminal safety)

We need a single numeric ranking so we can:

  • prevent regressions (fail must remain terminal), and
  • compute rollups.

const STATUS_RANK: Record<StepStatus, number> = {
  unknown: 0,
  queued:  1,
  running: 2,
  info:    3,
  pass:    4,
  warn:    5,
  fail:    6,
};

function isTerminal(status: StepStatus): boolean {
  return status === 'fail' || status === 'warn' || status === 'pass';
}

Invariant: A step attempt's displayed status must never decrease in rank.


2) State shape

This state is for a single Run Detail page (one runId at a time). If you store multiple runs in a global store, wrap this in a Record<runId, RunViewState>.

type RealtimeStatus = 'idle' | 'connecting' | 'connected' | 'disconnected' | 'error';
type LoadStatus = 'idle' | 'loading' | 'loaded' | 'error';

type EvidenceResolveStatus =
  | 'unresolved'  // pointer exists but no resolver call made yet
  | 'loading'     // resolver call in-flight
  | 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';

type EvidenceResolution = {
  status: EvidenceResolveStatus;
  kind?: 'inline' | 'link';
  title?: string;
  mime?: string;
  size_bytes?: number;
  inline_preview?: string; // small preview
  link?: string;           // short-lived link
  error_message?: string;
};

type EvidenceState = {
  pointer: Pointer;          // latest metadata merged from events
  status: EvidenceResolveStatus;
  lastResolvedAtMs?: number; // from action payload (not Date.now)
  // for stale response protection
  seq: number;               // increments each request
  inFlightSeq?: number;      // seq currently in-flight
  resolution?: EvidenceResolution;
};

type PointerAggregate = {
  pointerKey: PointerKey;
  pointer: Pointer; // merged metadata
};

type TupleAggregate = {
  tupleKey: TupleKey;

  // all events contributing to this tuple (same stage/step/attempt/status)
  eventIdsSorted: string[];      // sorted by (tsMs, event_id)
  canonicalEventId: string;      // min by (tsMs, event_id)

  // merged view computed deterministically from eventIdsSorted
  merged: {
    summary: string;             // from canonical event
    error_class: string;         // from canonical event
    kv: Record<string, string>;  // merged by sorted order (later overwrites)
    pointers: PointerAggregate[]; // dedup by pointerKey, merged by sorted order
    updatedAtMs: number;         // max tsMs among contributing events
  };
};

type StepAttemptState = {
  key: StepAttemptKey;
  stage: StageName;
  step: string;
  attempt: number;

  // all tuple aggregates for this attempt (one per status)
  tuplesByStatus: Partial<Record<StepStatus, TupleKey>>;

  // derived “best” status for this attempt
  bestStatus: StepStatus;
  bestStatusRank: number;
  updatedAtMs: number; // max of all tupleAgg.updatedAtMs for this attempt
};

type StageRollup = {
  stage: StageName;
  // worst status among latest attempts of steps in this stage
  rollupStatus: StepStatus;
  rollupRank: number;
};

type RunViewState = {
  runId: string | null;

  loading: { initialEvents: LoadStatus; error?: string };
  realtime: { status: RealtimeStatus; error?: string };

  // storage
  eventsById: Record<string, NormalizedEvent>;
  timelineEventIds: string[];  // global timeline sorted by (tsMs, event_id)

  tupleAggByKey: Record<TupleKey, TupleAggregate>;
  stepAttemptByKey: Record<StepAttemptKey, StepAttemptState>;
  latestAttemptByStep: Record<StepIdentityKey, number>; // max attempt observed

  stageRollups: Record<StageName, StageRollup>;

  evidenceByPointer: Record<PointerKey, EvidenceState>;
};

3) Actions (inputs to reducer)

type Action =
  | { type: 'RUN_VIEW_OPENED'; runId: string }
  | { type: 'RUN_EVENTS_LOAD_STARTED'; runId: string }
  | { type: 'RUN_EVENTS_LOADED'; runId: string; events: TinyEventV1[] }
  | { type: 'RUN_EVENTS_LOAD_FAILED'; runId: string; error: string }

  | { type: 'REALTIME_STATUS_CHANGED'; runId: string; status: RealtimeStatus; error?: string }
  | { type: 'RUN_EVENT_RECEIVED'; event: TinyEventV1 }

  // Evidence hydration lifecycle (pure reducer; side-effects happen elsewhere)
  | { type: 'EVIDENCE_RESOLVE_REQUESTED'; runId: string; pointerKey: PointerKey }
  | { type: 'EVIDENCE_RESOLVE_RESULT'; runId: string; pointerKey: PointerKey; seq: number; resolvedAtMs: number; resolution: EvidenceResolution }
  | { type: 'EVIDENCE_RESOLVE_CLEARED'; runId: string; pointerKey: PointerKey };

Reducer must ignore any action whose runId does not match state.runId (RUN_VIEW_OPENED is the exception that sets it; RUN_EVENT_RECEIVED is matched via event.run_id instead).


4) Reducer semantics (outputs)

4.1 RUN_VIEW_OPENED

Input: { runId }. Output: resets all run-specific state.

Rules:

  • Set state.runId = runId
  • Clear events, aggregates, evidence, timeline.
  • Set loading.initialEvents = 'loading'
  • Set realtime.status = 'connecting' (optional)

4.2 RUN_EVENTS_LOAD_STARTED / LOADED / FAILED

RUN_EVENTS_LOAD_STARTED

  • If runId matches, set loading.initialEvents = 'loading'.

RUN_EVENTS_LOADED

  • If runId matches:

    • For each event in events: apply the exact same logic as RUN_EVENT_RECEIVED.
    • Then set loading.initialEvents = 'loaded'.

RUN_EVENTS_LOAD_FAILED

  • If runId matches: loading.initialEvents = 'error', store error string.

4.3 REALTIME_STATUS_CHANGED

  • Update realtime.status and realtime.error if runId matches.

4.4 RUN_EVENT_RECEIVED (core ingestion)

Preconditions

If state.runId is null, ignore (or treat as no-op). If event.run_id !== state.runId, ignore.

Step A — normalize + dedupe

  • Convert to NormalizedEvent:

    • tsMs = parseRFC3339ToMs(event.ts); if parse fails, tsMs = 0.
    • Default pointers = [], kv = {} if missing.
  • If eventsById[event_id] exists: no-op.
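
A minimal implementation of the normalization helper referenced above; Date.parse accepts RFC3339 timestamps, and invalid input maps to 0 as specified:

// Normalize an RFC3339 timestamp to epoch ms; invalid strings map to 0 so
// sorting stays deterministic even for malformed events.
function parseRFC3339ToMs(ts: string): number {
  const ms = Date.parse(ts);
  return Number.isNaN(ms) ? 0 : ms;
}

function normalizeEvent(e: TinyEventV1): NormalizedEvent {
  return { ...e, pointers: e.pointers ?? [], kv: e.kv ?? {}, tsMs: parseRFC3339ToMs(e.ts) };
}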

Step B — insert into global stores

  • Add to eventsById[event_id].
  • Insert event_id into timelineEventIds keeping sorted order by (tsMs, event_id).

Step C — ensure evidence entries exist for pointers

For each pointer p:

  • pk = pointerKey(p)

  • If evidenceByPointer[pk] is missing:

    • create { pointer: p, status: 'unresolved', seq: 0 }
  • Else merge pointer metadata into evidenceByPointer[pk].pointer using pointer-merge rules (below). (Do not overwrite existing resolver resolution fields.)

Step D — update tuple aggregate (merge/enrichment)

Let tk = tupleKey(event).

  • If tupleAggByKey[tk] missing, create new TupleAggregate with:

    • eventIdsSorted = [event_id]
    • canonicalEventId = event_id
    • merged from this event
  • Else:

    • Insert event_id into eventIdsSorted in sorted order (using compareEvent via eventsById).

    • Recompute:

      • canonicalEventId = min(eventIdsSorted) by compareEvent
      • merged deterministically from all contributing events (see merge rules)

Tuple merge rules (deterministic)

Given contributing events E sorted by (tsMs, event_id) ascending:

  • canonical = E[0]

  • merged.summary = canonical.summary

  • merged.error_class = canonical.error_class

  • merged.kv:

    • start empty {}
    • for each event e in order, for each (k,v) in e.kv: merged.kv[k] = v (later events overwrite earlier keys)
  • merged.pointers:

    • maintain map: Record<PointerKey, Pointer>

    • for each event e in order, for each pointer p:

      • pk = pointerKey(p)
      • if not present: set map[pk] = p
      • else: map[pk] = mergePointerMeta(map[pk], p) (see below)
    • output pointers as an array sorted by PointerKey lexicographically (for stable UI lists)

  • merged.updatedAtMs = max(e.tsMs)

Pointer metadata merge rule (non-null wins, later wins)

function mergePointerMeta(oldP: Pointer, newP: Pointer): Pointer {
  // type/ref must match
  return {
    type: oldP.type,
    ref: oldP.ref,
    // later non-empty wins
    mime:       newP.mime       ?? oldP.mime,
    label:      newP.label      ?? oldP.label,
    expires_at: newP.expires_at ?? oldP.expires_at,
    sha256:     newP.sha256     ?? oldP.sha256,
  };
}

4.5 Update StepAttemptState (best status + no regression)

After tuple aggregate update, update the parent step attempt:

  • Let sak = stepAttemptKey(event) and sid = stepIdentityKey(event).

latest attempt tracking

  • latestAttemptByStep[sid] = max(previous, event.attempt)

StepAttemptState update

  • If missing, create:

    • bestStatus = 'unknown', bestStatusRank = 0, tuplesByStatus = {}
  • Set tuplesByStatus[event.status] = tk

Recompute best status (never decreases)

Compute candidate best by checking all statuses present for this attempt:

candidateBest = argmax(status in tuplesByStatus) STATUS_RANK[status]

Then apply no-regression rule:

  • If STATUS_RANK[candidateBest] >= step.bestStatusRank:

    • update bestStatus, bestStatusRank
  • Else:

    • keep existing bestStatus (prevents fail → pass regressions)

Set updatedAtMs = max(updatedAtMs, tupleAgg.merged.updatedAtMs).

Important: This rule guarantees “late pass/info” cannot override a prior fail.
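
That recompute-plus-no-regression step as a pure helper (a sketch, assuming the types from sections 1 and 2):

// Recompute a step attempt's best status without regressions: the candidate
// is the max-rank status present; it replaces the stored status only if its
// rank is >= the old rank, so a late pass can never downgrade a fail.
function recomputeBestStatus(sa: StepAttemptState): StepAttemptState {
  let candidate: StepStatus = 'unknown';
  for (const s of Object.keys(sa.tuplesByStatus) as StepStatus[]) {
    if (STATUS_RANK[s] > STATUS_RANK[candidate]) candidate = s;
  }
  if (STATUS_RANK[candidate] >= sa.bestStatusRank) {
    return { ...sa, bestStatus: candidate, bestStatusRank: STATUS_RANK[candidate] };
  }
  return sa;
}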


4.6 Update stage rollups (derived)

Whenever any StepAttemptState changes, update stageRollups[stage] deterministically:

For each stage:

  • Consider only the latest attempt per step identity in that stage:

    • For each StepIdentityKey = stage|step, find attempt = latestAttemptByStep[stage|step]
    • Look up StepAttemptState for that attempt.
  • Roll up stage status as the worst rank among those:

    • rollupRank = max(step.bestStatusRank)
    • rollupStatus = status with that rank

If a stage has no steps yet, set rollupStatus='unknown'.
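
A rollup sketch over latest attempts (assumes the state shape from section 2 and that step names contain no '|'):

// Worst-rank rollup for one stage, considering only the latest attempt of
// each step identity belonging to that stage.
function computeStageRollup(state: RunViewState, stage: StageName): StageRollup {
  let rollupStatus: StepStatus = 'unknown';
  let rollupRank = 0;
  for (const sid in state.latestAttemptByStep) {
    if (!sid.startsWith(`${stage}|`)) continue; // other stages
    const attempt = state.latestAttemptByStep[sid];
    const sa = state.stepAttemptByKey[`${sid}|${attempt}`];
    if (sa && sa.bestStatusRank > rollupRank) {
      rollupRank = sa.bestStatusRank;
      rollupStatus = sa.bestStatus;
    }
  }
  return { stage, rollupStatus, rollupRank };
}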


5) Evidence hydration reducer rules

Evidence actions update evidenceByPointer only; they must not mutate events/aggregates.

5.1 EVIDENCE_RESOLVE_REQUESTED

Input: { pointerKey }

Rules:

  • If no evidence entry exists: create one with status unresolved and seq=0 (should be rare).
  • Increment seq = seq + 1
  • Set inFlightSeq = seq
  • Set status = 'loading'
  • Keep resolution (optional: clear it if you want UI to hide stale info; recommended to keep and show “Refreshing…”)

Middleware/effect contract (outside reducer):

  • After dispatching EVIDENCE_RESOLVE_REQUESTED, the effect layer reads inFlightSeq from state and uses it in the API call.
  • When the response returns, dispatch EVIDENCE_RESOLVE_RESULT with that same seq.

5.2 EVIDENCE_RESOLVE_RESULT

Input: { pointerKey, seq, resolvedAtMs, resolution }

Rules:

  • If evidenceByPointer[pointerKey] missing: ignore or create (implementation choice).

  • If evidence.inFlightSeq !== seq: ignore stale response.

  • Else:

    • status = resolution.status
    • resolution = resolution
    • lastResolvedAtMs = resolvedAtMs
    • inFlightSeq = undefined

5.3 EVIDENCE_RESOLVE_CLEARED

  • Reset entry back to { status:'unresolved', resolution: undefined, inFlightSeq: undefined }
  • Keep pointer metadata.

6) Selectors (pure outputs for rendering)

These are not reducer logic, but they define how UI consumes state deterministically.

6.1 Timeline view model

function selectTimeline(state: RunViewState): NormalizedEvent[] {
  return state.timelineEventIds.map(id => state.eventsById[id]);
}

6.2 Latest attempt cards per step identity

type StepCardVM = {
  stage: StageName;
  step: string;
  attempt: number;
  status: StepStatus;
  error_class?: string;
  summary?: string;
  kv: Record<string,string>;
  pointers: PointerAggregate[];
  updatedAtMs: number;
};

const STAGE_ORDER: StageName[] = [
  'fetch', 'build', 'scan', 'policy', 'sign', 'package', 'deploy', 'runtime',
];

function selectLatestStepCards(state: RunViewState): StepCardVM[] {
  const cards: StepCardVM[] = [];
  for (const sid in state.latestAttemptByStep) {
    const attempt = state.latestAttemptByStep[sid];
    const [stage, step] = sid.split('|') as [StageName, string];
    const sak = `${stage}|${step}|${attempt}`;

    const sa = state.stepAttemptByKey[sak];
    if (!sa) continue;

    // Prefer fail tuple for details if present
    const failTk = sa.tuplesByStatus['fail'];
    const bestTk = sa.tuplesByStatus[sa.bestStatus];
    const tk = failTk ?? bestTk;
    const agg = tk ? state.tupleAggByKey[tk] : undefined;

    cards.push({
      stage, step, attempt,
      status: sa.bestStatus,
      error_class: agg?.merged.error_class,
      summary: agg?.merged.summary,
      kv: agg?.merged.kv ?? {},
      pointers: agg?.merged.pointers ?? [],
      updatedAtMs: sa.updatedAtMs,
    });
  }
  // stable ordering: by stage order, then step name
  return cards.sort((a,b) =>
    (STAGE_ORDER.indexOf(a.stage) - STAGE_ORDER.indexOf(b.stage)) ||
    a.step.localeCompare(b.step)
  );
}

6.3 Failure banner (first failure by time)

function selectFirstFailure(state: RunViewState): StepCardVM | null {
  const cards = selectLatestStepCards(state).filter(c => c.status === 'fail');
  if (cards.length === 0) return null;
  return cards.sort((a,b) => a.updatedAtMs - b.updatedAtMs)[0];
}

7) Worked examples (expected reducer behavior)

Example A: fail event arrives, then enrichment adds pointers

  1. Receive fail event (no pointers)

    • Step card shows fail, summary, error_class; evidence list is empty.

  2. Receive a second event with the same tupleKey, carrying pointers

    • The same step card remains fail (no regression).
    • The evidence section now lists the pointers (status unresolved until resolved).

Example B: out-of-order enrichment arrives before initial fail

  • Enrichment event arrives first (later tsMs) → creates tupleAgg; canonical is that event (for now).

  • Later initial fail arrives with earlier tsMs:

    • canonical becomes the earlier event (smaller tsMs)
    • pointers remain, because merged pointers are union across all contributing events.

Example C: duplicate delivery

  • Same event_id received twice → second is ignored (idempotent).

Example D: late pass after fail (future-proof)

  • If a pass event arrives after a fail for the same step attempt:

    • bestStatusRank is already fail (6)
    • candidate is pass (4)
    • no-regression rule keeps fail

8) Implementation notes (non-binding but useful)

  • Event counts per run are usually small; simple array insert + sort is fine.
  • If you expect thousands, use binary insertion for timelineEventIds and eventIdsSorted (see the sketch after this list).
  • Keep all “current time” out of reducer. Any timestamps used in actions (e.g., resolvedAtMs) must be created outside.
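
A binary-insertion sketch for those sorted id arrays, using compareEvent from section 1.2:

// Insert an event id into an array kept sorted by (tsMs, event_id) using
// binary search; returns a new array rather than mutating in place.
function insertSorted(
  ids: string[],
  id: string,
  eventsById: Record<string, NormalizedEvent>,
): string[] {
  let lo = 0;
  let hi = ids.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (compareEvent(eventsById[ids[mid]], eventsById[id]) <= 0) lo = mid + 1;
    else hi = mid;
  }
  return [...ids.slice(0, lo), id, ...ids.slice(lo)];
}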

If you want next, I can provide:

  • a drop-in TypeScript implementation of reduceRunView with helper functions, and
  • a set of unit test vectors (Given actions → expect final state) covering all edge cases above.