Here’s a lightweight pattern to make failures show up instantly while keeping backends decoupled: **emit a tiny, versioned event the moment you know something failed**, and attach pointers to heavier evidence that can arrive later. --- # Why this helps * **UI reacts in real time**: show “Failed at Step X (E123)” immediately—no waiting for logs, SBOMs, or artifacts to upload/process. * **Backends evolve safely**: logs, traces, SBOM/VEX, heap dumps, etc., can change format or arrive out of order without breaking the UI contract. * **Deterministic UX**: a small, stable schema prevents flaky pipelines from blocking visibility. * **Great for air‑gapped/offline**: the tiny event rides your internal bus/storage; bulky payloads sync or materialize when available. --- # The event itself (keep it tiny) **Fields (stable, versioned):** * `v` — schema version (e.g., `1`). * `ts` — event timestamp (UTC, ISO 8601). * `run_id` — pipeline/execution correlation ID. * `stage` — coarse phase (e.g., `fetch`, `build`, `scan`, `policy`, `deploy`). * `step` — fine-grained step (e.g., `trivy-scan`, `dotnet-restore`). * `status` — `fail|warn|pass|info` (for this pattern, you’ll use `fail`). * `error_class` — stable classifier (e.g., `NETWORK_DNS`, `AUTH_EXPIRED`, `POLICY_BLOCK`, `VULN_REACHABLE`). * `summary` — short human string (“Reachable vuln blocks release”). * `pointers` — array of *opaque, resolvable references* (log offsets, artifact URIs, attestation IDs). * `kv` — optional tiny key/values for quick filtering (e.g., `severity=A`, `package=openssl`). * `sig` (optional) — detached/inline signature (DSSE) for integrity. 
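
The field list above could be expressed as a type plus a small runtime guard, so a consumer can decide quickly whether a message is renderable. This is a minimal sketch, not an agreed contract: the names `TinyFailureEvent`, `EvidencePointer`, and `isTinyFailureEvent` are hypothetical, `sig` is left untyped because its encoding is not specified here, and tolerating an absent `pointers` array is an assumption.

```typescript
// Sketch only: field names come from the list above; the guard is hypothetical.
type Status = 'fail' | 'warn' | 'pass' | 'info';

interface EvidencePointer {
  type: string; // e.g. 'log', 'attestation', 'sbom'
  ref: string;  // opaque reference, resolved later by some other component
}

interface TinyFailureEvent {
  v: number;
  ts: string; // UTC, ISO 8601
  run_id: string;
  stage: string;
  step: string;
  status: Status;
  error_class: string;
  summary: string;
  pointers?: EvidencePointer[]; // assumption: absent is treated like []
  kv?: Record<string, string>;
  sig?: unknown; // DSSE signature; encoding not specified in the text
}

const STATUSES: readonly string[] = ['fail', 'warn', 'pass', 'info'];

function isNonEmptyString(x: unknown): x is string {
  return typeof x === 'string' && x.length > 0;
}

function isPointer(p: unknown): p is EvidencePointer {
  if (typeof p !== 'object' || p === null) return false;
  const { type, ref } = p as Record<string, unknown>;
  return isNonEmptyString(type) && isNonEmptyString(ref);
}

// Checks only the fields the UI needs for the instant render.
function isTinyFailureEvent(x: unknown): x is TinyFailureEvent {
  if (typeof x !== 'object' || x === null) return false;
  const { v, ts, run_id, stage, step, status, error_class, summary, pointers } =
    x as Record<string, unknown>;
  return (
    typeof v === 'number' &&
    isNonEmptyString(ts) && !isNaN(Date.parse(ts)) &&
    isNonEmptyString(run_id) &&
    isNonEmptyString(stage) &&
    isNonEmptyString(step) &&
    typeof status === 'string' && STATUSES.indexOf(status) !== -1 &&
    isNonEmptyString(error_class) &&
    isNonEmptyString(summary) &&
    (pointers === undefined || (Array.isArray(pointers) && pointers.every(isPointer)))
  );
}
```

Keeping the guard this small is deliberate: anything it does not check (pointer schemes, `kv` contents) can change without affecting whether the failure card renders.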
**Example** ```json { "v": 1, "ts": "2025-12-13T12:10:03Z", "run_id": "run_7f3c6a8", "stage": "policy", "step": "vex-gate", "status": "fail", "error_class": "VULN_REACHABLE", "summary": "Reachable CVE blocks release", "pointers": [ {"type":"log", "ref":"logs://scanner/7f3c6a8#L1423-L1480"}, {"type":"attestation", "ref":"rekor://sha256:…"}, {"type":"sbom", "ref":"artifact://sbom/cyclonedx@run_7f3c6a8.json"} ], "kv": {"cve":"CVE-2025-12345", "component":"openssl", "severity":"A"} } ``` --- # UI behavior (instant, then enrich) 1. **Instant render** (sub-200 ms): show a red card with stage/step, `error_class`, and `summary`. 2. **Progressive hydration**: as pointers resolve, add: * “View log excerpt” (jump to `#L1423-L1480`) * “Open attestation” (verify DSSE/Rekor) * “Inspect SBOM diff” (component → version → call‑graph) 3. **Stable affordances**: UI never breaks if a pointer is slow/missing; it just shows a spinner or “awaiting evidence”. --- # Backend contract * **Publish early**: emit on first knowledge of failure (e.g., non‑zero exit, policy deny, TLS error). * **Don’t embed heavy data**: only pointers or tiny facts for filters. * **Pointer resolution is pluggable**: files, object storage, Postgres row, Valkey cache key, Rekor entry—whatever suits the deployment. * **Version discipline**: bump `v` only for breaking schema changes; additive fields are fine. --- # Minimal topic map (so teams agree on names) * `stage`: `fetch|build|scan|policy|sign|package|deploy` * `error_class` suggestions: * Infra: `NETWORK_DNS`, `NETWORK_TIMEOUT`, `REGISTRY_403`, `DISK_FULL` * AuthN/Z: `AUTH_EXPIRED`, `TOKEN_SCOPE_MISS` * Supply chain: `ATTESTATION_MISSING`, `SIGNATURE_INVALID`, `SBOM_STALE` * Secure build: `POLICY_BLOCK`, `VULN_REACHABLE`, `MALWARE_FLAG` * Runtime: `IMAGE_DRIFT`, `PROVENANCE_MISMATCH` Keep each to a 1–2 line definition in a shared doc. 
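
One way to keep the topic map honest is to hold the `error_class` names and their short definitions in a single typed table that both producers and the UI import. The sketch below is an assumption about shape only: `ERROR_CLASSES`, `lookupErrorClass`, and `classesIn` are hypothetical names, only a few classes are shown, and the definition strings are illustrative wording rather than agreed text.

```typescript
// Hypothetical registry shape. Class names come from the topic map above;
// the definition strings are illustrative placeholders, not an agreed contract.
type ErrorCategory = 'infra' | 'auth' | 'supply-chain' | 'secure-build' | 'runtime';

interface ErrorClassEntry {
  category: ErrorCategory;
  definition: string; // keep to 1-2 lines
}

const ERROR_CLASSES: Record<string, ErrorClassEntry> = {
  NETWORK_DNS: { category: 'infra', definition: 'A required hostname could not be resolved.' },
  DISK_FULL: { category: 'infra', definition: 'The worker ran out of disk space.' },
  AUTH_EXPIRED: { category: 'auth', definition: 'A credential expired before or during the step.' },
  SIGNATURE_INVALID: { category: 'supply-chain', definition: 'A signature failed verification.' },
  POLICY_BLOCK: { category: 'secure-build', definition: 'A policy decision denied the release.' },
  IMAGE_DRIFT: { category: 'runtime', definition: 'The running image differs from the approved one.' },
};

// Returns the entry for a known class, or undefined for anything unregistered.
function lookupErrorClass(name: string): ErrorClassEntry | undefined {
  return Object.prototype.hasOwnProperty.call(ERROR_CLASSES, name)
    ? ERROR_CLASSES[name]
    : undefined;
}

// Lists the registered class names in one category, sorted for stable output.
function classesIn(category: ErrorCategory): string[] {
  return Object.keys(ERROR_CLASSES)
    .filter((name) => ERROR_CLASSES[name].category === category)
    .sort();
}
```

A UI filter could call `classesIn('infra')` to build its dropdown, while a producer could call `lookupErrorClass` before emitting to catch typos early.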
--- # Drop‑in for Stella Ops (tailored) * **Emitter**: `StellaOps.Events` (tiny .NET lib) used by Scanner/Policy/Scheduler to publish `TinyFailureEvent`. * **Transport**: Postgres notify (default) + Valkey pub/sub accelerator. (Matches your Postgres+Valkey architecture choice.) * **Resolver service**: `EvidenceGateway` that turns `pointers` into viewable slices (log excerpts, SBOM component focus, Rekor proof). * **UI**: “Failure Feed” panel shows cards from the event stream; detail drawer resolves pointers on demand. * **Signing**: optional DSSE for events; Rekor (or mirror) for attestations—your “Proof‑Linked” moat. * **Air‑gap**: pointers use `artifact://` and `row://` schemes resolvable entirely on‑prem. --- # Quick implementation checklist * Define `TinyFailureEvent` schema v1 and `error_class` registry. * Add emit helpers for each module (`FailNow(summary, error_class, pointers, kv)`). * Build `EvidenceGateway.Resolve(pointer)` handlers. * UI: render card instantly; hydrate sections as resolvers return. * Telemetry: metrics on TTF**E** (Time‑To‑Failure‑Event) and pointer hydration latencies. * Docs: 1‑page contract; examples for each error_class. If you want, I can draft the .NET 10 interfaces (`ITinyEventEmitter`, resolvers, and a small Razor/Angular card) and a Postgres schema you can paste into your repo. Below is a **PM-grade implementation spec** for “Real-time Failure Signaling” using **Tiny Failure Events** + **Evidence Pointers**, written so engineers can build it without guessing. --- # Product: Real-time Failure Signaling (Tiny Failure Events) ## Goal When any pipeline run fails, users must see **what failed and where** (stage/step + error class + short summary) **immediately**, even if logs/SBOM/attestations are delayed, huge, or unavailable. The UI must render a failure card from a tiny event and then progressively enrich with evidence as it becomes resolvable. ## Outcomes we must deliver 1. 
**Instant visibility:** “Failed at Step X” appears within seconds of failure. 2. **Decoupling:** UI depends only on a stable tiny schema, not on log formats/artifact structures. 3. **Evidence linking:** Users can open logs/SBOM/attestations when available, via pointers. 4. **Reliability:** Duplicate/out-of-order events don’t break the UI; state remains consistent. 5. **Security:** Evidence access is authorized; pointers do not leak sensitive info. --- # Scope ## In scope (MVP) * Emit **TinyFailureEvent v1** on first detected failure for a step. * Transport events in near real-time to UI. * Store events durably and allow UI to fetch a run’s event timeline. * Support evidence pointers for: * logs (excerptable) * artifacts (SBOM, reports) * attestations (provenance/signature) * UI: * show run timeline * show failure card instantly * hydrate evidence sections on demand (or automatically where feasible) ## Out of scope (MVP) * Full trace viewer / distributed tracing UI (we can link to external trace systems via pointer). * Automated remediation (“fix it”) actions. * Full-blown case management. --- # Key terms and definitions * **Run:** A single execution of a pipeline. Identified by `run_id`. * **Stage:** Coarse lifecycle phase (`fetch`, `build`, `scan`, `policy`, `sign`, `package`, `deploy`). * **Step:** A concrete activity within a stage (`dotnet-restore`, `trivy-scan`, `vex-gate`). * **Tiny Failure Event:** A small message representing “this step failed”, including stable classification and references to evidence. * **Pointer:** An opaque reference that can be resolved into evidence content or a link later. --- # User stories and acceptance criteria ## Story 1: I see failure instantly **As a** developer **I want** to see which step failed immediately **So that** I don’t wait on logs/artifacts **Acceptance criteria** * When a step fails, the UI updates within **≤ 2 seconds p95** from the time the orchestrator/runner detects failure. 
* The failure card includes: * stage, step * error class * human summary * timestamp * (optional) primary key/value details (e.g., CVE, severity) ## Story 2: I can open evidence when available **As a** release engineer **I want** to click evidence links (logs/SBOM/attestation) **So that** I can diagnose/root-cause **Acceptance criteria** * Failure card shows evidence sections as: * **Available** (clickable) * **Pending** (spinner / “awaiting evidence”) * **Unavailable** (“not produced” or “access denied”) * Clicking log evidence opens an excerpt view, not a 500MB file download. * Evidence access enforces authorization (same as run access). ## Story 3: Events are robust to duplicates/out-of-order **As a** user **I want** the timeline to remain correct **Even if** event delivery is at-least-once **Acceptance criteria** * UI displays exactly one current “failed” state per step attempt. * Duplicate events do not create duplicate cards. * Out-of-order arrival does not revert a step from fail → pass. --- # Functional requirements (what developers must build) ## FR1: TinyFailureEvent schema v1 ### Required fields All producers MUST emit events that validate against this schema. ```json { "v": 1, "event_id": "evt_01J…", "ts": "2025-12-13T12:10:03.123Z", "run_id": "run_7f3c6a8", "stage": "policy", "step": "vex-gate", "attempt": 1, "status": "fail", "error_class": "VULN_REACHABLE", "summary": "Reachable CVE blocks release", "pointers": [], "kv": {} } ``` ### Field definitions & constraints * `v` (int, required): must be `1` for this spec. * `event_id` (string, required): globally unique. * Format: `evt_` (ULID recommended for time-sortable IDs). * `ts` (RFC3339 UTC, required): creation timestamp. * `run_id` (string, required): stable correlation id for run. * `stage` (enum string, required): one of: * `fetch|build|scan|policy|sign|package|deploy|runtime` * `step` (string, required): lowercase kebab-case recommended; max 80 chars. 
* `attempt` (int, required): starts at 1; increments for retries. * `status` (enum string, required for this feature): `fail` (MVP supports fail only; schema allows later expansion) * `error_class` (string, required): stable classifier from a shared registry (see FR2). * max 64 chars; uppercase snake-case. * `summary` (string, required): human readable, max 140 chars. * `pointers` (array, optional): max 20 items; each item is a `Pointer` object (see FR3). * `kv` (object, optional): small metadata map for filtering. * max 20 keys * key max 32 chars; value max 120 chars * no nested objects/arrays ### Size limits * Entire event payload MUST be ≤ **8 KB** serialized JSON. * If producers exceed limits, they MUST truncate `summary` and drop low-priority `kv` keys before failing emission. --- ## FR2: Error class registry (stable contract) We maintain a canonical list of `error_class` values in a shared repo/module. ### Requirements * Each `error_class` MUST have: * name (e.g., `NETWORK_DNS`) * short description * severity mapping (optional) * recommended remediation hints (optional, can be UI-side) * Producers MUST use a registry value if applicable. * Producers MAY emit `error_class="UNKNOWN"` if no mapping exists, but must log a warning and increment a metric. ### Initial registry (minimum) Infra/Network: * `NETWORK_DNS` * `NETWORK_TIMEOUT` * `DISK_FULL` Auth: * `AUTH_EXPIRED` * `REGISTRY_403` Supply chain: * `SIGNATURE_INVALID` * `ATTESTATION_MISSING` * `SBOM_MISSING` Policy/Security: * `POLICY_BLOCK` * `VULN_REACHABLE` * `MALWARE_FLAG` Runner/Orchestrator: * `STEP_TIMEOUT` * `RUN_ABORTED` * `WORKER_LOST` --- ## FR3: Evidence pointer format and rules ### Pointer object schema ```json { "type": "log|artifact|attestation|url|trace", "ref": "logs://scanner/run_7f3c6a8#L1423-L1480", "mime": "text/plain", "label": "Scanner log excerpt", "expires_at": "2025-12-20T00:00:00Z", "sha256": "optional hex" } ``` ### Rules * `type` and `ref` are required. 
* `ref` is opaque to UI; UI passes it to the resolver service. * `label` is optional, but strongly recommended for UI friendliness. * `expires_at` is optional; if present UI should show “may expire”. * `sha256` optional for immutability verification (artifacts/attestations especially). ### Allowed schemes (MVP) * `logs:///#Lx-Ly` * `artifact:///@` * `attestation:///` * `url://` (only internal allowed; resolver enforces) * `trace:///` ### Security constraints * Pointers MUST NOT embed secrets (tokens, passwords). * Any pointer that could expose sensitive data MUST be resolvable only through the Evidence Gateway (FR6), never directly client-side. * The resolver MUST enforce authorization for the requesting user. --- ## FR4: Emission rules (when and how events are produced) ### When to emit Producers MUST emit a TinyFailureEvent when: 1. A step exits non-zero. 2. A policy decision is “deny/block”. 3. A required artifact/attestation is missing at gate time. 4. A step times out. 5. The worker is lost (emitted by orchestrator watchdog). ### Exactly-once vs at-least-once * Transport can be **at-least-once**. * Consumers MUST be idempotent using `(run_id, stage, step, attempt, status)` + `event_id`. ### One failure event per step attempt * For a given `(run_id, stage, step, attempt)`: * First emitted `status=fail` is canonical. * Later fail events for the same tuple are treated as “updates” only if they add pointers/kv (see FR5). ### Updates / enrichment We support enrichment without breaking “tiny”: * Producers MAY emit a second event **with the same tuple** (run_id/stage/step/attempt/status) that adds pointers or kv after the initial fail. * Consumers MUST merge pointers (dedupe identical `type+ref`) and merge kv (new keys overwrite old keys). * Producers MUST NOT spam; max 3 enrichment events per tuple. --- ## FR5: Event storage and aggregation ### Required services/components 1. **Event Ingest** (API or internal library endpoint) 2. 
**Event Store** (durable DB table) 3. **Realtime Fanout** (pub/sub channel) 4. **Run Timeline API** (query per run) ### Behavior * On ingest: * Validate schema (reject invalid with 400/validation error). * Persist to event store. * Publish to realtime channel. ### Suggested DB model (Postgres) Table: `run_events` * `event_id` PK * `run_id` indexed * `ts` indexed * `stage`, `step`, `attempt`, `status` indexed composite * `payload` jsonb * `ingested_at` Uniqueness constraints: * `event_id` unique * Optional: unique on `(run_id, stage, step, attempt, status, hash(summary))` if you want stronger dedupe ### Query API * `GET /runs/{run_id}/events` returns events sorted by `ts` ascending. * UI should also subscribe realtime to avoid polling. --- ## FR6: Evidence Gateway (pointer resolver) ### Purpose A single service that resolves pointers into either: * log excerpts * signed download URLs * attestation display + verification data * external trace links (sanitized) ### Endpoints (MVP) 1. **Resolve metadata** * `POST /evidence/resolve` * body: `{ "run_id": "...", "pointers": [ { "type": "...", "ref": "..." } ] }` * returns per pointer: * `status`: `available|pending|missing|denied|expired|error` * `kind`: `inline|link` * `title` * `mime` * `size_bytes` (if known) * `link` (if kind=link) – must be short-lived, server-generated * `inline_preview` (optional, small excerpt) 2. **Fetch log excerpt** * `GET /evidence/log-excerpt?ref=...` * returns: * `text` (max 64 KB) * `start_line`, `end_line` * `source` (provider info) 3. **Fetch artifact** * `GET /evidence/artifact?ref=...` * returns either: * short-lived download link * or 404/403/410 ### AuthZ requirements * Evidence Gateway MUST verify the caller has access to the `run_id`. * Gateway MUST validate that the pointer belongs to that run (or is explicitly declared “global shared”). * Gateway MUST audit-log every evidence resolution. ### Resilience * If evidence is not ready, resolver returns `pending`, not 500. 
* If pointer is unknown format, return `error` with a safe message. --- # UI requirements (what the product must do) ## UI1: Run timeline renders from events * The run detail page MUST show: * stages/steps list * current state per step (pass/warn/fail/running) * failure details if fail exists * The failure state MUST be derived from TinyFailureEvent without requiring any log fetch. ## UI2: Failure card content (minimum) When a fail event arrives: * Show a red failure card with: * `stage` + `step` * `summary` * `error_class` badge * `ts` (relative + absolute on hover) * key kv fields (up to 4 shown; remainder behind “Show more”) ## UI3: Progressive hydration * The card MUST include an “Evidence” section. * For each pointer: * show a row with label and availability status * if available, show “Open” * if pending, show spinner + “Awaiting evidence” * if denied, show lock icon + “No access” * if missing, show “Not produced” * Clicking “Open”: * logs open excerpt viewer (modal/drawer) * artifacts open in viewer or download (type-dependent) * attestations open verification view ## UI4: Realtime behavior * UI MUST subscribe to realtime events for the run. * UI MUST apply idempotent merge logic: * dedupe by `event_id` * merge enrichment events by tuple (run_id/stage/step/attempt/status) ## UI5: Ordering and out-of-order handling * UI MUST sort by `ts` for display. * UI MUST NOT regress a step state if a late “pass/info” arrives after fail. * Rule: `fail` is terminal for a step attempt. --- # Non-functional requirements ## Latency * From failure detection to UI update: **≤ 2s p95**, **≤ 5s p99** (within the same network). * Evidence resolution: * `resolve` call should return in **≤ 300ms p95** for cached/known pointers. ## Reliability * Event ingestion must be durable (stored) before fanout. 
* System must tolerate: * duplicates * retries * out-of-order delivery * partial evidence availability ## Payload limits * Event size ≤ 8KB * Evidence inline previews ≤ 4KB per pointer ## Retention * Tiny events retained ≥ 30 days (configurable). * Evidence retention depends on provider, but resolver must surface expiry. --- # Metrics and instrumentation (definition of success) Producers + ingestion MUST emit: * `ttfe_ms`: time to failure event (from step start or from failure detection) * `event_ingest_latency_ms` * `event_validation_fail_count` * `unknown_error_class_count` * `pointer_resolution_status_count{available|pending|missing|denied|expired|error}` * `pointer_hydration_latency_ms` UI MUST log: * time from run page open → first event rendered * evidence open clickthrough rate * evidence resolution failure rate --- # Edge cases we explicitly handle 1. **Runner killed before it can emit** * Orchestrator watchdog emits `WORKER_LOST` with stage/step best-effort. 2. **Logs produced after failure** * Initial fail event has no log pointer. * Later enrichment event adds log pointer (same tuple). 3. **Evidence exists but user lacks access** * Resolver returns `denied`; UI shows locked state. 4. **Evidence link expired** * Resolver returns `expired` and provides a “Refresh” action that re-resolves. 5. **Multiple retries** * `attempt` increments; UI shows attempt number and keeps prior attempt history. --- # Definition of Done (engineers can ship when…) ## Backend DoD * Schema validation implemented. * Ingest → store → fanout pipeline working. * Enrichment merge logic implemented. * Evidence Gateway resolves at least: * log excerpt pointers * artifact pointers * attestation pointers * AuthZ enforced. ## Frontend DoD * Run page shows failure card from TinyFailureEvent alone. * Evidence hydration works and handles all resolver statuses. * Realtime updates work; duplicates/out-of-order are safe. ## QA DoD (minimum test cases) 1. 
Step fails; event arrives; UI shows failure without logs. 2. Log pointer arrives later; UI updates evidence section. 3. Duplicate event delivery; UI shows one failure. 4. Evidence denied; UI shows lock state. 5. Out-of-order: enrichment arrives before initial fail; UI still resolves correctly. --- # Implementation checklist (suggested division of work) ### Team A: Producer SDK * `StellaOps.Events` library: * `EmitFailure(run_id, stage, step, attempt, error_class, summary, pointers, kv)` * ULID generation * payload size enforcement * retry w/ backoff ### Team B: Ingest + Store + Fanout * API endpoint or internal gRPC for ingest * Postgres `run_events` table * Valkey pub/sub channel: `run:{run_id}:events` ### Team C: Evidence Gateway * pointer parsing + resolvers * log excerpt adapter(s) * artifact download adapter(s) * attestation viewer adapter(s) ### Team D: UI * realtime subscription * timeline state reducer w/ merge rules * evidence drawer + resolver calls --- If you want, I can also provide: * a **formal JSON Schema** file for v1 (drop-in for validation), * a **state reducer spec** for the UI (pure function inputs/outputs), * and a **Postgres migration** + example queries. Below is a **UI State Reducer Spec (Pure Function Inputs/Outputs)** for the Run Detail page that renders the **timeline + step statuses + failure cards + evidence hydration** using TinyFailureEvents (and future-compatible with pass/warn/info). This is written so devs can implement it as a deterministic reducer (Redux, Zustand w/ reducer, Elm-style update, etc.). --- # UI State Reducer Spec v1: Run Timeline + Failure Cards ## Reducer contract ### Pure function ```ts reduceRunView(state: RunViewState, action: Action): RunViewState ``` ### Guarantees * **Pure & deterministic**: no IO, no timers, no random IDs, no Date.now() inside reducer. * **Idempotent**: applying the same `RUN_EVENT_RECEIVED` twice yields the same state after the first time. 
* **Order-safe**: out-of-order events never “downgrade” a step attempt from `fail` → `pass`.

---

# 1) Data types

## 1.1 Event type used by reducer

```ts
type StageName =
  | 'fetch' | 'build' | 'scan' | 'policy' | 'sign' | 'package' | 'deploy' | 'runtime';

type StepStatus =
  // present now (MVP)
  | 'fail'
  // future-compatible
  | 'warn' | 'pass' | 'running' | 'queued' | 'info' | 'unknown';

type PointerType = 'log' | 'artifact' | 'attestation' | 'url' | 'trace';

type Pointer = {
  type: PointerType;
  ref: string;
  mime?: string;
  label?: string;
  expires_at?: string; // RFC3339
  sha256?: string;
};

type TinyEventV1 = {
  v: 1;
  event_id: string;
  ts: string; // RFC3339 UTC
  run_id: string;
  stage: StageName;
  step: string;
  attempt: number;
  status: StepStatus; // MVP sends 'fail' only
  error_class: string;
  summary: string;
  pointers?: Pointer[];
  kv?: Record<string, string>;
};

// Normalized for sorting and comparisons (created outside or inside reducer deterministically)
type NormalizedEvent = TinyEventV1 & {
  tsMs: number; // parse(ts) -> number, invalid => 0
};
```

---

## 1.2 Keys and comparisons

```ts
type TupleKey = string;        // `${stage}|${step}|${attempt}|${status}`
type StepAttemptKey = string;  // `${stage}|${step}|${attempt}`
type StepIdentityKey = string; // `${stage}|${step}` (no attempt)
type PointerKey = string;      // `${type}|${ref}`

function tupleKey(e: TinyEventV1): TupleKey {
  return `${e.stage}|${e.step}|${e.attempt}|${e.status}`;
}

function stepAttemptKey(e: TinyEventV1): StepAttemptKey {
  return `${e.stage}|${e.step}|${e.attempt}`;
}

function stepIdentityKey(e: TinyEventV1): StepIdentityKey {
  return `${e.stage}|${e.step}`;
}

function pointerKey(p: Pointer): PointerKey {
  return `${p.type}|${p.ref}`;
}

// Sort: ts ascending, then event_id lexicographically (stable deterministic tiebreak)
function compareEvent(a: NormalizedEvent, b: NormalizedEvent): number {
  if (a.tsMs !== b.tsMs) return a.tsMs - b.tsMs;
  return a.event_id < b.event_id ? -1 : (a.event_id > b.event_id ? 1 : 0);
}
```

---

## 1.3 Status ranking rule (terminal safety)

We need a single numeric ranking so we can:

* prevent regressions (`fail` must remain terminal), and
* compute rollups.

```ts
const STATUS_RANK: Record<StepStatus, number> = {
  unknown: 0,
  queued: 1,
  running: 2,
  info: 3,
  pass: 4,
  warn: 5,
  fail: 6,
};

function isTerminal(status: StepStatus): boolean {
  return status === 'fail' || status === 'warn' || status === 'pass';
}
```

**Invariant:** A step attempt’s displayed status must never decrease in rank.

---

# 2) State shape

This state is for a single Run Detail page (one `runId` at a time). If you store multiple runs in a global store, wrap this in a `Record<string, RunViewState>` keyed by run ID.

```ts
type RealtimeStatus = 'idle' | 'connecting' | 'connected' | 'disconnected' | 'error';
type LoadStatus = 'idle' | 'loading' | 'loaded' | 'error';

type EvidenceResolveStatus =
  | 'unresolved' // pointer exists but no resolver call made yet
  | 'loading'    // resolver call in-flight
  | 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';

type EvidenceResolution = {
  status: EvidenceResolveStatus;
  kind?: 'inline' | 'link';
  title?: string;
  mime?: string;
  size_bytes?: number;
  inline_preview?: string; // small preview
  link?: string;           // short-lived link
  error_message?: string;
};

type EvidenceState = {
  pointer: Pointer; // latest metadata merged from events
  status: EvidenceResolveStatus;
  lastResolvedAtMs?: number; // from action payload (not Date.now)
  // for stale response protection
  seq: number;          // increments each request
  inFlightSeq?: number; // seq currently in-flight
  resolution?: EvidenceResolution;
};

type PointerAggregate = {
  pointerKey: PointerKey;
  pointer: Pointer; // merged metadata
};

type TupleAggregate = {
  tupleKey: TupleKey;
  // all events contributing to this tuple (same stage/step/attempt/status)
  eventIdsSorted: string[]; // sorted by (tsMs, event_id)
  canonicalEventId: string; // min by (tsMs, event_id)
  // merged view computed deterministically from eventIdsSorted
  merged: {
    summary: string;              // from canonical event
    error_class: string;          // from canonical event
    kv: Record<string, string>;   // merged by sorted order (later overwrites)
    pointers: PointerAggregate[]; // dedup by pointerKey, merged by sorted order
    updatedAtMs: number;          // max tsMs among contributing events
  };
};

type StepAttemptState = {
  key: StepAttemptKey;
  stage: StageName;
  step: string;
  attempt: number;
  // all tuple aggregates for this attempt (one per status)
  tuplesByStatus: Partial<Record<StepStatus, TupleKey>>;
  // derived “best” status for this attempt
  bestStatus: StepStatus;
  bestStatusRank: number;
  updatedAtMs: number; // max of all tupleAgg.updatedAtMs for this attempt
};

type StageRollup = {
  stage: StageName;
  // worst status among latest attempts of steps in this stage
  rollupStatus: StepStatus;
  rollupRank: number;
};

type RunViewState = {
  runId: string | null;
  loading: { initialEvents: LoadStatus; error?: string };
  realtime: { status: RealtimeStatus; error?: string };
  // storage
  eventsById: Record<string, NormalizedEvent>;
  timelineEventIds: string[]; // global timeline sorted by (tsMs, event_id)
  tupleAggByKey: Record<TupleKey, TupleAggregate>;
  stepAttemptByKey: Record<StepAttemptKey, StepAttemptState>;
  latestAttemptByStep: Record<StepIdentityKey, number>; // max attempt observed
  stageRollups: Record<StageName, StageRollup>;
  evidenceByPointer: Record<PointerKey, EvidenceState>;
};
```

---

# 3) Actions (inputs to reducer)

```ts
type Action =
  | { type: 'RUN_VIEW_OPENED'; runId: string }
  | { type: 'RUN_EVENTS_LOAD_STARTED'; runId: string }
  | { type: 'RUN_EVENTS_LOADED'; runId: string; events: TinyEventV1[] }
  | { type: 'RUN_EVENTS_LOAD_FAILED'; runId: string; error: string }
  | { type: 'REALTIME_STATUS_CHANGED'; runId: string; status: RealtimeStatus; error?: string }
  | { type: 'RUN_EVENT_RECEIVED'; event: TinyEventV1 }
  // Evidence hydration lifecycle (pure reducer; side-effects happen elsewhere)
  | { type: 'EVIDENCE_RESOLVE_REQUESTED'; runId: string; pointerKey: PointerKey }
  | { type: 'EVIDENCE_RESOLVE_RESULT'; runId: string; pointerKey: PointerKey; seq: number; resolvedAtMs: number; resolution: EvidenceResolution }
  | { type: 'EVIDENCE_RESOLVE_CLEARED'; runId: string; pointerKey: PointerKey };
```
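
The state shape keeps `timelineEventIds` sorted by `(tsMs, event_id)` with no duplicates. A minimal sketch of how that could be maintained is a binary-search insert that returns the same state object when the `event_id` is already known, which also gives the idempotence guarantee for free. The names `OrderedEvent`, `TimelineSlice`, and `addToTimeline` are hypothetical, and only the two fields the ordering needs are modelled.

```typescript
// Minimal slice of an event: only the fields the ordering needs (assumption).
type OrderedEvent = { event_id: string; tsMs: number };

type TimelineSlice = {
  eventsById: Record<string, OrderedEvent>;
  timelineEventIds: string[]; // sorted by (tsMs, event_id)
};

// Same ordering as the spec: tsMs ascending, then event_id lexicographically.
function compareEvent(a: OrderedEvent, b: OrderedEvent): number {
  if (a.tsMs !== b.tsMs) return a.tsMs - b.tsMs;
  return a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0;
}

// Pure: never mutates `s`. Returns `s` itself when the event is a duplicate.
function addToTimeline(s: TimelineSlice, e: OrderedEvent): TimelineSlice {
  if (Object.prototype.hasOwnProperty.call(s.eventsById, e.event_id)) return s;

  // Binary search for the first position whose event does not sort before `e`.
  let lo = 0;
  let hi = s.timelineEventIds.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (compareEvent(s.eventsById[s.timelineEventIds[mid]], e) < 0) lo = mid + 1;
    else hi = mid;
  }

  return {
    eventsById: { ...s.eventsById, [e.event_id]: e },
    timelineEventIds: [
      ...s.timelineEventIds.slice(0, lo),
      e.event_id,
      ...s.timelineEventIds.slice(lo),
    ],
  };
}
```

Returning the identical object for duplicates lets memoized selectors skip recomputation under at-least-once delivery.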
**Reducer must ignore** any action where `action.runId !== state.runId` (except `RUN_VIEW_OPENED` which sets it). --- # 4) Reducer semantics (outputs) ## 4.1 RUN_VIEW_OPENED **Input:** `{ runId }` **Output:** resets all run-specific state. Rules: * Set `state.runId = runId` * Clear events, aggregates, evidence, timeline. * Set `loading.initialEvents = 'loading'` * Set `realtime.status = 'connecting'` (optional) --- ## 4.2 RUN_EVENTS_LOAD_STARTED / LOADED / FAILED ### RUN_EVENTS_LOAD_STARTED * If runId matches, set `loading.initialEvents = 'loading'`. ### RUN_EVENTS_LOADED * If runId matches: * For each event in `events`: apply the exact same logic as `RUN_EVENT_RECEIVED`. * Then set `loading.initialEvents = 'loaded'`. ### RUN_EVENTS_LOAD_FAILED * If runId matches: `loading.initialEvents = 'error'`, store error string. --- ## 4.3 REALTIME_STATUS_CHANGED * Update `realtime.status` and `realtime.error` if runId matches. --- ## 4.4 RUN_EVENT_RECEIVED (core ingestion) ### Preconditions If `state.runId` is null, ignore (or treat as no-op). If `event.run_id !== state.runId`, ignore. ### Step A — normalize + dedupe * Convert to `NormalizedEvent`: * `tsMs = parseRFC3339ToMs(event.ts)`; if parse fails, `tsMs = 0`. * Default `pointers = []`, `kv = {}` if missing. * If `eventsById[event_id]` exists: **no-op**. ### Step B — insert into global stores * Add to `eventsById[event_id]`. * Insert `event_id` into `timelineEventIds` keeping sorted order by `(tsMs, event_id)`. ### Step C — ensure evidence entries exist for pointers For each pointer `p`: * `pk = pointerKey(p)` * If `evidenceByPointer[pk]` is missing: * create `{ pointer: p, status: 'unresolved', seq: 0 }` * Else merge pointer metadata into `evidenceByPointer[pk].pointer` using pointer-merge rules (below). (Do **not** overwrite existing resolver resolution fields.) ### Step D — update tuple aggregate (merge/enrichment) Let `tk = tupleKey(event)`. 
* If `tupleAggByKey[tk]` missing, create new `TupleAggregate` with:
  * `eventIdsSorted = [event_id]`
  * `canonicalEventId = event_id`
  * `merged` from this event
* Else:
  * Insert `event_id` into `eventIdsSorted` in sorted order (using `compareEvent` via `eventsById`).
  * Recompute:
    * `canonicalEventId = min(eventIdsSorted)` by compareEvent
    * `merged` deterministically from all contributing events (see merge rules)

### Tuple merge rules (deterministic)

Given contributing events `E` sorted by `(tsMs, event_id)` ascending:

* `canonical = E[0]`
* `merged.summary = canonical.summary`
* `merged.error_class = canonical.error_class`
* `merged.kv`:
  * start empty `{}`
  * for each event `e` in order, for each `(k,v)` in `e.kv`: `merged.kv[k] = v` (later events overwrite earlier keys)
* `merged.pointers`:
  * maintain `map: Record<PointerKey, Pointer>`
  * for each event `e` in order, for each pointer `p`:
    * `pk = pointerKey(p)`
    * if not present: set map[pk] = p
    * else: map[pk] = mergePointerMeta(map[pk], p) (see below)
  * output pointers as an array sorted by `PointerKey` lexicographically (for stable UI lists)
* `merged.updatedAtMs = max(e.tsMs)`

### Pointer metadata merge rule (non-null wins, later wins)

```ts
function mergePointerMeta(oldP: Pointer, newP: Pointer): Pointer {
  // type/ref must match
  return {
    type: oldP.type,
    ref: oldP.ref,
    // later non-null wins: `??` falls back to the old value only when the
    // new one is null or undefined (an empty string still overwrites)
    mime: newP.mime ?? oldP.mime,
    label: newP.label ?? oldP.label,
    expires_at: newP.expires_at ?? oldP.expires_at,
    sha256: newP.sha256 ?? oldP.sha256,
  };
}
```

---

## 4.5 Update StepAttemptState (best status + no regression)

After tuple aggregate update, update the parent step attempt:

* Let `sak = stepAttemptKey(event)` and `sid = stepIdentityKey(event)`.
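
The tuple merge rules from 4.4 can be sketched as one pure function over the contributing events. This is a simplified illustration under stated assumptions: `Contribution`, `MergedTuple`, and `mergeTuple` are hypothetical names, the types are trimmed to the fields the merge touches, and the result holds plain pointers rather than `PointerAggregate` objects.

```typescript
// Trimmed types for illustration; the spec's full types carry more fields.
type Pointer = {
  type: string; ref: string;
  mime?: string; label?: string; expires_at?: string; sha256?: string;
};
type Contribution = {
  event_id: string; tsMs: number; summary: string; error_class: string;
  kv: Record<string, string>; pointers: Pointer[];
};
type MergedTuple = {
  summary: string; error_class: string;
  kv: Record<string, string>; pointers: Pointer[]; updatedAtMs: number;
};

function mergePointerMeta(oldP: Pointer, newP: Pointer): Pointer {
  return {
    type: oldP.type,
    ref: oldP.ref,
    mime: newP.mime ?? oldP.mime,
    label: newP.label ?? oldP.label,
    expires_at: newP.expires_at ?? oldP.expires_at,
    sha256: newP.sha256 ?? oldP.sha256,
  };
}

function mergeTuple(events: Contribution[]): MergedTuple {
  if (events.length === 0) throw new Error('mergeTuple needs at least one event');
  // Sort a copy by (tsMs, event_id) so arrival order cannot affect the result.
  const sorted = events.slice().sort((a, b) =>
    a.tsMs - b.tsMs || (a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0));
  const canonical = sorted[0];

  const kv: Record<string, string> = {};
  const byKey: Record<string, Pointer> = {};
  let updatedAtMs = 0;
  for (const e of sorted) {
    Object.assign(kv, e.kv); // later events overwrite earlier keys
    for (const p of e.pointers) {
      const pk = `${p.type}|${p.ref}`;
      const prev = byKey[pk];
      byKey[pk] = prev ? mergePointerMeta(prev, p) : p;
    }
    updatedAtMs = Math.max(updatedAtMs, e.tsMs);
  }

  return {
    summary: canonical.summary,         // canonical event wins
    error_class: canonical.error_class, // canonical event wins
    kv,
    pointers: Object.keys(byKey).sort().map((pk) => byKey[pk]), // stable UI order
    updatedAtMs,
  };
}
```

Because the function re-sorts its input, an enrichment event that arrives before the initial fail still produces the same merged view once both are present.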
### Latest attempt tracking

* `latestAttemptByStep[sid] = max(previous, event.attempt)`

### StepAttemptState update

* If missing, create:
  * `bestStatus = 'unknown'`, `bestStatusRank = 0`, `tuplesByStatus = {}`
* Set `tuplesByStatus[event.status] = tk`

### Recompute best status (never decreases)

Compute the candidate best by checking all statuses present for this attempt:

```ts
// pseudocode: the status with the highest STATUS_RANK among those present
candidateBest = argmax(status in tuplesByStatus) STATUS_RANK[status]
```

Then apply the **no-regression rule**:

* If `STATUS_RANK[candidateBest] >= step.bestStatusRank`:
  * update `bestStatus`, `bestStatusRank`
* Else:
  * keep existing `bestStatus` (prevents fail → pass regressions)

Set `updatedAtMs = max(updatedAtMs, tupleAgg.merged.updatedAtMs)`.

**Important:** This rule guarantees that a late `pass` or `info` event cannot override a prior `fail`.

---

## 4.6 Stage rollups (optional but recommended)

Whenever any `StepAttemptState` changes, update `stageRollups[stage]` deterministically.

For each stage:

* Consider only the **latest attempt per step identity** in that stage:
  * For each `StepIdentityKey = stage|step`, find `attempt = latestAttemptByStep[stage|step]`
  * Look up `StepAttemptState` for that attempt.
* Roll up stage status as the **worst rank** (the highest `STATUS_RANK`) among those:
  * `rollupRank = max(step.bestStatusRank)`
  * `rollupStatus = status with that rank`

If a stage has no steps yet, set `rollupStatus='unknown'`.

---

# 5) Evidence hydration reducer rules

Evidence actions update `evidenceByPointer` only; they must not mutate events/aggregates.

## 5.1 EVIDENCE_RESOLVE_REQUESTED

**Input:** `{ pointerKey }`

Rules:

* If no evidence entry exists: create one with status `unresolved` and `seq=0` (should be rare).
* Increment `seq = seq + 1`
* Set `inFlightSeq = seq`
* Set `status = 'loading'`
* Keep `resolution` (optional: clear it if you want the UI to hide stale info; recommended to keep it and show “Refreshing…”)

**Middleware/effect contract (outside reducer):**

* After dispatching `EVIDENCE_RESOLVE_REQUESTED`, the effect layer reads `inFlightSeq` from state and uses it in the API call.
* When the response returns, dispatch `EVIDENCE_RESOLVE_RESULT` with that same `seq`.

## 5.2 EVIDENCE_RESOLVE_RESULT

**Input:** `{ pointerKey, seq, resolvedAtMs, resolution }`

Rules:

* If `evidenceByPointer[pointerKey]` is missing: ignore or create (implementation choice).
* If `evidence.inFlightSeq !== seq`: **ignore stale response**.
* Else:
  * `status = resolution.status`
  * `resolution = resolution`
  * `lastResolvedAtMs = resolvedAtMs`
  * `inFlightSeq = undefined`

## 5.3 EVIDENCE_RESOLVE_CLEARED

* Reset entry back to `{ status:'unresolved', resolution: undefined, inFlightSeq: undefined }`
* Keep `pointer` metadata.

---

# 6) Selectors (pure outputs for rendering)

These are not reducer logic, but they define how the UI consumes state deterministically.
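Before the individual selectors, here is a minimal sketch of the evidence rules from section 5, since the `seq` / `inFlightSeq` handshake is easier to follow in code. The `Evidence`, `Resolution`, and `EvidenceAction` shapes, the `reduceEvidence` name, and the `'resolved'` status value in the usage note are assumptions made for illustration; the sketch also picks "ignore" for a result whose entry is missing, which the rules leave as an implementation choice.

```ts
// Hypothetical shapes; the real state types are defined elsewhere in this spec.
type Resolution = { status: string; url?: string };
type Evidence = {
  status: string;
  seq: number;
  inFlightSeq?: number;
  resolution?: Resolution;
  lastResolvedAtMs?: number;
};
type EvidenceMap = Record<string, Evidence | undefined>;

type EvidenceAction =
  | { type: "EVIDENCE_RESOLVE_REQUESTED"; pointerKey: string }
  | { type: "EVIDENCE_RESOLVE_RESULT"; pointerKey: string; seq: number; resolvedAtMs: number; resolution: Resolution }
  | { type: "EVIDENCE_RESOLVE_CLEARED"; pointerKey: string };

function reduceEvidence(state: EvidenceMap, action: EvidenceAction): EvidenceMap {
  const prev = state[action.pointerKey];

  if (action.type === "EVIDENCE_RESOLVE_REQUESTED") {
    // Create the entry if missing, then bump seq and mark the request in flight.
    const base: Evidence = prev ?? { status: "unresolved", seq: 0 };
    const seq = base.seq + 1;
    // Spreading `base` keeps any previous resolution so the UI can show "Refreshing…".
    return { ...state, [action.pointerKey]: { ...base, seq, inFlightSeq: seq, status: "loading" } };
  }

  if (action.type === "EVIDENCE_RESOLVE_RESULT") {
    // Missing entry: ignored in this sketch. Mismatched seq: stale response, always ignored.
    if (!prev || prev.inFlightSeq !== action.seq) return state;
    return {
      ...state,
      [action.pointerKey]: {
        ...prev,
        status: action.resolution.status,
        resolution: action.resolution,
        lastResolvedAtMs: action.resolvedAtMs, // timestamp comes from the action, never from the reducer
        inFlightSeq: undefined,
      },
    };
  }

  // EVIDENCE_RESOLVE_CLEARED: reset status and resolution, keep everything else on the entry.
  if (!prev) return state;
  return {
    ...state,
    [action.pointerKey]: { ...prev, status: "unresolved", resolution: undefined, inFlightSeq: undefined },
  };
}
```

With two back-to-back requests for the same pointer, `inFlightSeq` ends at 2, so a response tagged `seq: 1` leaves the state untouched while one tagged `seq: 2` (for example with `resolution.status` of `'resolved'`) is applied.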
## 6.1 Timeline view model

```ts
function selectTimeline(state): NormalizedEvent[] {
  return state.timelineEventIds.map(id => state.eventsById[id]);
}
```

## 6.2 Latest attempt cards per step identity

```ts
type StepCardVM = {
  stage: StageName;
  step: string;
  attempt: number;
  status: StepStatus;
  error_class?: string;
  summary?: string;
  kv: Record<string, string>;
  pointers: PointerAggregate[];
  updatedAtMs: number;
};

function selectLatestStepCards(state): StepCardVM[] {
  const cards: StepCardVM[] = [];
  for (const sid in state.latestAttemptByStep) {
    const attempt = state.latestAttemptByStep[sid];
    const [stage, step] = sid.split('|') as [StageName, string];
    const sak = `${stage}|${step}|${attempt}`;
    const sa = state.stepAttemptByKey[sak];
    if (!sa) continue;

    // Prefer fail tuple for details if present
    const failTk = sa.tuplesByStatus['fail'];
    const bestTk = sa.tuplesByStatus[sa.bestStatus];
    const tk = failTk ?? bestTk;
    const agg = tk ? state.tupleAggByKey[tk] : undefined;

    cards.push({
      stage,
      step,
      attempt,
      status: sa.bestStatus,
      error_class: agg?.merged.error_class,
      summary: agg?.merged.summary,
      kv: agg?.merged.kv ?? {},
      pointers: agg?.merged.pointers ?? [],
      updatedAtMs: sa.updatedAtMs,
    });
  }

  // stable ordering: by stage order, then step name
  return cards.sort((a, b) =>
    (STAGE_ORDER.indexOf(a.stage) - STAGE_ORDER.indexOf(b.stage)) ||
    a.step.localeCompare(b.step)
  );
}
```

## 6.3 Failure banner (first failure by time)

```ts
function selectFirstFailure(state): StepCardVM | null {
  const cards = selectLatestStepCards(state).filter(c => c.status === 'fail');
  if (cards.length === 0) return null;
  // earliest updatedAtMs first
  return cards.sort((a, b) => a.updatedAtMs - b.updatedAtMs)[0];
}
```

---

# 7) Worked examples (expected reducer behavior)

## Example A: fail event arrives, then enrichment adds pointers

1. Receive fail event (no pointers)
   * Step card shows `fail`, summary, error_class, evidence list empty.
2. Receive second event with the same tupleKey, now with pointers
   * Same step card remains `fail` (no regression)
   * Evidence section now lists pointers (status `unresolved` until resolved).

## Example B: out-of-order enrichment arrives before initial fail

* Enrichment event arrives first (later tsMs) → creates tupleAgg; the canonical event is the enrichment event (for now).
* Later the initial fail arrives with an earlier tsMs:
  * canonical becomes the earlier event (smaller tsMs)
  * **pointers remain**, because merged pointers are the union across all contributing events.

## Example C: duplicate delivery

* Same `event_id` received twice → second is ignored (idempotent).

## Example D: late pass after fail (future-proof)

* If a `pass` event arrives after a `fail` for the same step attempt:
  * `bestStatusRank` is already the rank of `fail` (6)
  * candidate is `pass` (4)
  * no-regression rule keeps `fail`

---

# 8) Implementation notes (non-binding but useful)

* Event counts per run are usually small; simple array insert + sort is fine.
* If you expect thousands of events, use binary insertion to keep `timelineEventIds` and `eventIdsSorted` sorted.
* Keep all “current time” reads out of the reducer. Any timestamps used in actions (e.g., `resolvedAtMs`) must be created outside.

---

If you want next, I can provide:

* a drop-in **TypeScript implementation** of `reduceRunView` with helper functions, and
* a set of **unit test vectors** (Given actions → expect final state) covering all edge cases above.