Here’s a lightweight pattern to make failures show up instantly while keeping backends decoupled: emit a tiny, versioned event the moment you know something failed, and attach pointers to heavier evidence that can arrive later.
Why this helps
- UI reacts in real time: show “Failed at Step X (E123)” immediately—no waiting for logs, SBOMs, or artifacts to upload/process.
- Backends evolve safely: logs, traces, SBOM/VEX, heap dumps, etc., can change format or arrive out of order without breaking the UI contract.
- Deterministic UX: a small, stable schema prevents flaky pipelines from blocking visibility.
- Great for air‑gapped/offline: the tiny event rides your internal bus/storage; bulky payloads sync or materialize when available.
The event itself (keep it tiny)
Fields (stable, versioned):
- v: schema version (e.g., 1).
- ts: event timestamp (UTC, ISO 8601).
- run_id: pipeline/execution correlation ID.
- stage: coarse phase (e.g., fetch, build, scan, policy, deploy).
- step: fine-grained step (e.g., trivy-scan, dotnet-restore).
- status: fail|warn|pass|info (for this pattern, you’ll use fail).
- error_class: stable classifier (e.g., NETWORK_DNS, AUTH_EXPIRED, POLICY_BLOCK, VULN_REACHABLE).
- summary: short human string (“Reachable vuln blocks release”).
- pointers: array of opaque, resolvable references (log offsets, artifact URIs, attestation IDs).
- kv: optional tiny key/values for quick filtering (e.g., severity=A, package=openssl).
- sig (optional): detached/inline signature (DSSE) for integrity.
Example
{
  "v": 1,
  "ts": "2025-12-13T12:10:03Z",
  "run_id": "run_7f3c6a8",
  "stage": "policy",
  "step": "vex-gate",
  "status": "fail",
  "error_class": "VULN_REACHABLE",
  "summary": "Reachable CVE blocks release",
  "pointers": [
    {"type":"log", "ref":"logs://scanner/7f3c6a8#L1423-L1480"},
    {"type":"attestation", "ref":"rekor://sha256:…"},
    {"type":"sbom", "ref":"artifact://sbom/cyclonedx@run_7f3c6a8.json"}
  ],
  "kv": {"cve":"CVE-2025-12345", "component":"openssl", "severity":"A"}
}
UI behavior (instant, then enrich)
- Instant render (sub-200 ms): show a red card with stage/step, error_class, and summary.
- Progressive hydration: as pointers resolve, add:
  - “View log excerpt” (jump to #L1423-L1480)
  - “Open attestation” (verify DSSE/Rekor)
  - “Inspect SBOM diff” (component → version → call‑graph)
- Stable affordances: UI never breaks if a pointer is slow/missing; it just shows a spinner or “awaiting evidence”.
Backend contract
- Publish early: emit on first knowledge of failure (e.g., non‑zero exit, policy deny, TLS error).
- Don’t embed heavy data: only pointers or tiny facts for filters.
- Pointer resolution is pluggable: files, object storage, Postgres row, Valkey cache key, Rekor entry—whatever suits the deployment.
- Version discipline: bump v only for breaking schema changes; additive fields are fine.
Minimal topic map (so teams agree on names)
- stage: fetch|build|scan|policy|sign|package|deploy
- error_class suggestions:
  - Infra: NETWORK_DNS, NETWORK_TIMEOUT, REGISTRY_403, DISK_FULL
  - AuthN/Z: AUTH_EXPIRED, TOKEN_SCOPE_MISS
  - Supply chain: ATTESTATION_MISSING, SIGNATURE_INVALID, SBOM_STALE
  - Secure build: POLICY_BLOCK, VULN_REACHABLE, MALWARE_FLAG
  - Runtime: IMAGE_DRIFT, PROVENANCE_MISMATCH
Keep each to a 1–2 line definition in a shared doc.
Drop‑in for Stella Ops (tailored)
- Emitter: StellaOps.Events (tiny .NET lib) used by Scanner/Policy/Scheduler to publish TinyFailureEvent.
- Transport: Postgres notify (default) + Valkey pub/sub accelerator. (Matches your Postgres+Valkey architecture choice.)
- Resolver service: EvidenceGateway that turns pointers into viewable slices (log excerpts, SBOM component focus, Rekor proof).
- UI: “Failure Feed” panel shows cards from the event stream; detail drawer resolves pointers on demand.
- Signing: optional DSSE for events; Rekor (or mirror) for attestations (your “Proof‑Linked” moat).
- Air‑gap: pointers use artifact:// and row:// schemes resolvable entirely on‑prem.
Quick implementation checklist
- Define TinyFailureEvent schema v1 and error_class registry.
- Add emit helpers for each module (FailNow(summary, error_class, pointers, kv)).
- Build EvidenceGateway.Resolve(pointer) handlers.
- UI: render card instantly; hydrate sections as resolvers return.
- Telemetry: metrics on TTFE (Time‑To‑Failure‑Event) and pointer hydration latencies.
- Docs: 1‑page contract; examples for each error_class.
If you want, I can draft the .NET 10 interfaces (ITinyEventEmitter, resolvers, and a small Razor/Angular card) and a Postgres schema you can paste into your repo.
Below is a PM-grade implementation spec for “Real-time Failure Signaling” using Tiny Failure Events + Evidence Pointers, written so engineers can build it without guessing.
Product: Real-time Failure Signaling (Tiny Failure Events)
Goal
When any pipeline run fails, users must see what failed and where (stage/step + error class + short summary) immediately, even if logs/SBOM/attestations are delayed, huge, or unavailable.
The UI must render a failure card from a tiny event and then progressively enrich with evidence as it becomes resolvable.
Outcomes we must deliver
- Instant visibility: “Failed at Step X” appears within seconds of failure.
- Decoupling: UI depends only on a stable tiny schema, not on log formats/artifact structures.
- Evidence linking: Users can open logs/SBOM/attestations when available, via pointers.
- Reliability: Duplicate/out-of-order events don’t break the UI; state remains consistent.
- Security: Evidence access is authorized; pointers do not leak sensitive info.
Scope
In scope (MVP)
- Emit TinyFailureEvent v1 on first detected failure for a step.
- Transport events in near real-time to UI.
- Store events durably and allow UI to fetch a run’s event timeline.
- Support evidence pointers for:
  - logs (excerptable)
  - artifacts (SBOM, reports)
  - attestations (provenance/signature)
- UI:
  - show run timeline
  - show failure card instantly
  - hydrate evidence sections on demand (or automatically where feasible)
Out of scope (MVP)
- Full trace viewer / distributed tracing UI (we can link to external trace systems via pointer).
- Automated remediation (“fix it”) actions.
- Full-blown case management.
Key terms and definitions
- Run: A single execution of a pipeline. Identified by run_id.
- Stage: Coarse lifecycle phase (fetch, build, scan, policy, sign, package, deploy, runtime; the full enum is defined in FR1).
- Step: A concrete activity within a stage (dotnet-restore, trivy-scan, vex-gate).
dotnet-restore,trivy-scan,vex-gate). - Tiny Failure Event: A small message representing “this step failed”, including stable classification and references to evidence.
- Pointer: An opaque reference that can be resolved into evidence content or a link later.
User stories and acceptance criteria
Story 1: I see failure instantly
As a developer
I want to see which step failed immediately
So that I don’t wait on logs/artifacts
Acceptance criteria
- When a step fails, the UI updates within 2 seconds (p95) of the orchestrator/runner detecting the failure.
- The failure card includes:
  - stage, step
  - error class
  - human summary
  - timestamp
  - (optional) primary key/value details (e.g., CVE, severity)
Story 2: I can open evidence when available
As a release engineer
I want to click evidence links (logs/SBOM/attestation)
So that I can diagnose/root-cause
Acceptance criteria
- Failure card shows evidence sections as:
  - Available (clickable)
  - Pending (spinner / “awaiting evidence”)
  - Unavailable (“not produced” or “access denied”)
- Clicking log evidence opens an excerpt view, not a 500MB file download.
- Evidence access enforces authorization (same as run access).
Story 3: Events are robust to duplicates/out-of-order
As a user
I want the timeline to remain correct
Even if event delivery is at-least-once
Acceptance criteria
- UI displays exactly one current “failed” state per step attempt.
- Duplicate events do not create duplicate cards.
- Out-of-order arrival does not revert a step from fail → pass.
Functional requirements (what developers must build)
FR1: TinyFailureEvent schema v1
Required fields
All producers MUST emit events that validate against this schema.
{
  "v": 1,
  "event_id": "evt_01J…",
  "ts": "2025-12-13T12:10:03.123Z",
  "run_id": "run_7f3c6a8",
  "stage": "policy",
  "step": "vex-gate",
  "attempt": 1,
  "status": "fail",
  "error_class": "VULN_REACHABLE",
  "summary": "Reachable CVE blocks release",
  "pointers": [],
  "kv": {}
}
Field definitions & constraints
- v (int, required): must be 1 for this spec.
- event_id (string, required): globally unique.
  - Format: evt_<ULID> (ULID recommended for time-sortable IDs).
- ts (RFC3339 UTC, required): creation timestamp.
- run_id (string, required): stable correlation id for run.
- stage (enum string, required): one of:
  - fetch|build|scan|policy|sign|package|deploy|runtime
- step (string, required): lowercase kebab-case recommended; max 80 chars.
- attempt (int, required): starts at 1; increments for retries.
- status (enum string, required for this feature):
  - fail (MVP supports fail only; schema allows later expansion)
- error_class (string, required): stable classifier from a shared registry (see FR2).
  - max 64 chars; uppercase snake-case.
- summary (string, required): human readable, max 140 chars.
- pointers (array, optional): max 20 items; each item is a Pointer object (see FR3).
- kv (object, optional): small metadata map for filtering.
  - max 20 keys
  - key max 32 chars; value max 120 chars
  - no nested objects/arrays
Size limits
- Entire event payload MUST be ≤ 8 KB serialized JSON.
- If producers exceed limits, they MUST truncate summary and drop low-priority kv keys before failing emission.
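The size rule above can be sketched as a small producer-side helper. This is only an illustration under assumptions: the names `fitToLimit`, `utf8Length` and the `kvPriority` argument are hypothetical, and the spec does not say how key priority is decided, so here the caller supplies it.

```typescript
// Hypothetical helper; names and the drop order are assumptions, not part of the spec.
type TinyEvent = {
  summary: string;
  kv?: Record<string, string>;
  [field: string]: unknown;
};

const MAX_EVENT_BYTES = 8 * 1024; // FR1: serialized JSON must be <= 8 KB
const MAX_SUMMARY_CHARS = 140;    // FR1: summary max length

// UTF-8 byte length of a string, computed without platform-specific APIs.
function utf8Length(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) bytes += 1;
    else if (c < 0x800) bytes += 2;
    else if (c >= 0xd800 && c <= 0xdbff) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

// Returns a copy that fits the limit, or null if it still does not fit
// (the caller then fails emission). kvPriority lists keys from most to
// least important; keys not listed are dropped first.
function fitToLimit(e: TinyEvent, kvPriority: string[] = []): TinyEvent | null {
  const kv: Record<string, string> = { ...(e.kv ?? {}) };
  const out: TinyEvent = { ...e, summary: e.summary.slice(0, MAX_SUMMARY_CHARS), kv };
  const rank = (k: string): number => {
    const i = kvPriority.indexOf(k);
    return i === -1 ? 1e9 : i;
  };
  const dropOrder = Object.keys(kv).sort((a, b) => rank(b) - rank(a));
  for (const key of dropOrder) {
    if (utf8Length(JSON.stringify(out)) <= MAX_EVENT_BYTES) break;
    delete kv[key]; // out.kv is the same object, so the event shrinks
  }
  return utf8Length(JSON.stringify(out)) <= MAX_EVENT_BYTES ? out : null;
}
```

Returning null rather than throwing keeps the "fail emission" decision with the caller, which can then log and count the dropped event.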
FR2: Error class registry (stable contract)
We maintain a canonical list of error_class values in a shared repo/module.
Requirements
- Each error_class MUST have:
  - name (e.g., NETWORK_DNS)
  - short description
  - severity mapping (optional)
  - recommended remediation hints (optional, can be UI-side)
- Producers MUST use a registry value if applicable.
- Producers MAY emit error_class="UNKNOWN" if no mapping exists, but must log a warning and increment a metric.
Initial registry (minimum)
Infra/Network:
- NETWORK_DNS
- NETWORK_TIMEOUT
- DISK_FULL
Auth:
- AUTH_EXPIRED
- REGISTRY_403
Supply chain:
- SIGNATURE_INVALID
- ATTESTATION_MISSING
- SBOM_MISSING
Policy/Security:
- POLICY_BLOCK
- VULN_REACHABLE
- MALWARE_FLAG
Runner/Orchestrator:
- STEP_TIMEOUT
- RUN_ABORTED
- WORKER_LOST
FR3: Evidence pointer format and rules
Pointer object schema
{
  "type": "log|artifact|attestation|url|trace",
  "ref": "logs://scanner/run_7f3c6a8#L1423-L1480",
  "mime": "text/plain",
  "label": "Scanner log excerpt",
  "expires_at": "2025-12-20T00:00:00Z",
  "sha256": "optional hex"
}
Rules
- type and ref are required.
- ref is opaque to UI; UI passes it to the resolver service.
- label is optional, but strongly recommended for UI friendliness.
- expires_at is optional; if present UI should show “may expire”.
- sha256 is optional, for immutability verification (artifacts/attestations especially).
Allowed schemes (MVP)
- logs://<provider>/<run_id>#Lx-Ly
- artifact://<kind>/<name>@<version-or-run-id>
- attestation://<store>/<id-or-digest>
- url://<encoded> (only internal allowed; resolver enforces)
- trace://<system>/<trace-id>
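Since ref stays opaque to the UI, any parsing of these schemes would live in the resolver. As a rough sketch for the logs:// form only, assuming the line range is always written as #Lx-Ly with x ≤ y (the `LogRef` type and `parseLogRef` name are hypothetical):

```typescript
// Hypothetical resolver-side parser for `logs://<provider>/<run_id>#Lx-Ly`.
type LogRef = { provider: string; runId: string; startLine: number; endLine: number };

function parseLogRef(ref: string): LogRef | null {
  const m = /^logs:\/\/([^\/#]+)\/([^\/#]+)#L(\d+)-L(\d+)$/.exec(ref);
  if (!m) return null; // unknown format: the caller maps this to status "error"
  const startLine = Number(m[3]);
  const endLine = Number(m[4]);
  if (startLine < 1 || endLine < startLine) return null; // reject empty or inverted ranges
  return { provider: m[1], runId: m[2], startLine, endLine };
}
```

Returning null for anything that does not match lets the gateway answer with a safe error status instead of guessing at a malformed pointer.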
Security constraints
- Pointers MUST NOT embed secrets (tokens, passwords).
- Any pointer that could expose sensitive data MUST be resolvable only through the Evidence Gateway (FR6), never directly client-side.
- The resolver MUST enforce authorization for the requesting user.
FR4: Emission rules (when and how events are produced)
When to emit
Producers MUST emit a TinyFailureEvent when:
- A step exits non-zero.
- A policy decision is “deny/block”.
- A required artifact/attestation is missing at gate time.
- A step times out.
- The worker is lost (emitted by orchestrator watchdog).
Exactly-once vs at-least-once
- Transport can be at-least-once.
- Consumers MUST be idempotent: dedupe on event_id, and merge events that share the tuple (run_id, stage, step, attempt, status).
One failure event per step attempt
- For a given (run_id, stage, step, attempt):
  - First emitted status=fail is canonical.
  - Later fail events for the same tuple are treated as “updates” only if they add pointers/kv (see “Updates / enrichment” below).
Updates / enrichment
We support enrichment without breaking “tiny”:
- Producers MAY emit a second event with the same tuple (run_id/stage/step/attempt/status) that adds pointers or kv after the initial fail.
- Consumers MUST merge pointers (dedupe identical type+ref) and merge kv (new keys overwrite old keys).
- Producers MUST NOT spam; max 3 enrichment events per tuple.
FR5: Event storage and aggregation
Required services/components
- Event Ingest (API or internal library endpoint)
- Event Store (durable DB table)
- Realtime Fanout (pub/sub channel)
- Run Timeline API (query per run)
Behavior
- On ingest:
  - Validate schema (reject invalid with 400/validation error).
  - Persist to event store.
  - Publish to realtime channel.
Suggested DB model (Postgres)
Table: run_events
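- event_id PK
- run_id indexed
- ts indexed
- stage, step, attempt, status indexed composite
- payload jsonb
- ingested_at
Uniqueness constraints:
- event_id unique
- Optional: unique on (run_id, stage, step, attempt, status, hash(summary)) if you want stronger dedupe. Note that this would also reject enrichment events (FR4) that repeat the original summary.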
Query API
- GET /runs/{run_id}/events returns events sorted by ts ascending.
- UI should also subscribe realtime to avoid polling.
FR6: Evidence Gateway (pointer resolver)
Purpose
A single service that resolves pointers into either:
- log excerpts
- signed download URLs
- attestation display + verification data
- external trace links (sanitized)
Endpoints (MVP)
- Resolve metadata
  - POST /evidence/resolve
  - body: { "run_id": "...", "pointers": [ { "type": "...", "ref": "..." } ] }
  - returns per pointer:
    - status: available|pending|missing|denied|expired|error
    - kind: inline|link
    - title
    - mime
    - size_bytes (if known)
    - link (if kind=link): must be short-lived, server-generated
    - inline_preview (optional, small excerpt)
- Fetch log excerpt
  - GET /evidence/log-excerpt?ref=...
  - returns:
    - text (max 64 KB)
    - start_line, end_line
    - source (provider info)
- Fetch artifact
  - GET /evidence/artifact?ref=...
  - returns either:
    - short-lived download link
    - or 404/403/410
AuthZ requirements
- Evidence Gateway MUST verify the caller has access to the run_id.
- Gateway MUST validate that the pointer belongs to that run (or is explicitly declared “global shared”).
- Gateway MUST audit-log every evidence resolution.
Resilience
- If evidence is not ready, resolver returns pending, not 500.
- If pointer is unknown format, return error with a safe message.
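One way to read the AuthZ and resilience rules together is as a single function that always returns a status and never throws. The sketch below is an assumption-heavy illustration: `resolvePointer`, the `Lookup` callback and its three return values are hypothetical stand-ins for whatever storage adapters a deployment has.

```typescript
type ResolveStatus = 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';
type PointerIn = { type: string; ref: string; expires_at?: string };
type ResolveResult = { status: ResolveStatus; message?: string };

// Hypothetical adapter callback: reports what storage knows about a ref.
type Lookup = (ref: string) => 'ready' | 'not_ready' | 'absent';

const KNOWN_TYPES = ['log', 'artifact', 'attestation', 'url', 'trace'];

function resolvePointer(
  p: PointerIn,
  canAccessRun: boolean, // result of the run-level authorization check
  lookup: Lookup,
  nowMs: number          // passed in so the function stays deterministic
): ResolveResult {
  if (!canAccessRun) return { status: 'denied' };
  if (KNOWN_TYPES.indexOf(p.type) === -1) {
    return { status: 'error', message: 'unsupported pointer type' }; // safe, no internals leaked
  }
  if (p.expires_at !== undefined) {
    const exp = Date.parse(p.expires_at);
    if (!isNaN(exp) && exp <= nowMs) return { status: 'expired' };
  }
  try {
    const r = lookup(p.ref);
    if (r === 'ready') return { status: 'available' };
    return r === 'not_ready' ? { status: 'pending' } : { status: 'missing' };
  } catch {
    return { status: 'error', message: 'resolver failed' }; // never surface a 500
  }
}
```

Checking authorization first means a caller without access learns nothing about whether the evidence exists or has expired.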
UI requirements (what the product must do)
UI1: Run timeline renders from events
- The run detail page MUST show:
  - stages/steps list
  - current state per step (pass/warn/fail/running)
  - failure details if fail exists
- The failure state MUST be derived from TinyFailureEvent without requiring any log fetch.
UI2: Failure card content (minimum)
When a fail event arrives:
- Show a red failure card with:
  - stage + step
  - summary
  - error_class badge
  - ts (relative + absolute on hover)
  - key kv fields (up to 4 shown; remainder behind “Show more”)
UI3: Progressive hydration
- The card MUST include an “Evidence” section.
- For each pointer:
  - show a row with label and availability status
  - if available, show “Open”
  - if pending, show spinner + “Awaiting evidence”
  - if denied, show lock icon + “No access”
  - if missing, show “Not produced”
- Clicking “Open”:
  - logs open excerpt viewer (modal/drawer)
  - artifacts open in viewer or download (type-dependent)
  - attestations open verification view
UI4: Realtime behavior
- UI MUST subscribe to realtime events for the run.
- UI MUST apply idempotent merge logic:
  - dedupe by event_id
  - merge enrichment events by tuple (run_id/stage/step/attempt/status)
UI5: Ordering and out-of-order handling
- UI MUST sort by ts for display.
- UI MUST NOT regress a step state if a late “pass/info” arrives after fail.
  - Rule: fail is terminal for a step attempt.
Non-functional requirements
Latency
- From failure detection to UI update: ≤ 2s p95, ≤ 5s p99 (within the same network).
- Evidence resolution: resolve call should return in ≤ 300ms p95 for cached/known pointers.
Reliability
- Event ingestion must be durable (stored) before fanout.
- System must tolerate:
  - duplicates
  - retries
  - out-of-order delivery
  - partial evidence availability
Payload limits
- Event size ≤ 8KB
- Evidence inline previews ≤ 4KB per pointer
Retention
- Tiny events retained ≥ 30 days (configurable).
- Evidence retention depends on provider, but resolver must surface expiry.
Metrics and instrumentation (definition of success)
Producers + ingestion MUST emit:
- ttfe_ms: time to failure event (from step start or from failure detection)
- event_ingest_latency_ms
- event_validation_fail_count
- unknown_error_class_count
- pointer_resolution_status_count{available|pending|missing|denied|expired|error}
- pointer_hydration_latency_ms
UI MUST log:
- time from run page open → first event rendered
- evidence open clickthrough rate
- evidence resolution failure rate
Edge cases we explicitly handle
- Runner killed before it can emit
  - Orchestrator watchdog emits WORKER_LOST with stage/step best-effort.
- Logs produced after failure
  - Initial fail event has no log pointer.
  - Later enrichment event adds log pointer (same tuple).
- Evidence exists but user lacks access
  - Resolver returns denied; UI shows locked state.
- Evidence link expired
  - Resolver returns expired; UI provides a “Refresh” action that re-resolves.
- Multiple retries
  - attempt increments; UI shows attempt number and keeps prior attempt history.
Definition of Done (engineers can ship when…)
Backend DoD
- Schema validation implemented.
- Ingest → store → fanout pipeline working.
- Enrichment merge logic implemented.
- Evidence Gateway resolves at least:
  - log excerpt pointers
  - artifact pointers
  - attestation pointers
- AuthZ enforced.
Frontend DoD
- Run page shows failure card from TinyFailureEvent alone.
- Evidence hydration works and handles all resolver statuses.
- Realtime updates work; duplicates/out-of-order are safe.
QA DoD (minimum test cases)
- Step fails; event arrives; UI shows failure without logs.
- Log pointer arrives later; UI updates evidence section.
- Duplicate event delivery; UI shows one failure.
- Evidence denied; UI shows lock state.
- Out-of-order: enrichment arrives before initial fail; UI still resolves correctly.
Implementation checklist (suggested division of work)
Team A: Producer SDK
- StellaOps.Events library:
  - EmitFailure(run_id, stage, step, attempt, error_class, summary, pointers, kv)
  - ULID generation
  - payload size enforcement
  - retry w/ backoff
Team B: Ingest + Store + Fanout
- API endpoint or internal gRPC for ingest
- Postgres run_events table
- Valkey pub/sub channel: run:{run_id}:events
Team C: Evidence Gateway
- pointer parsing + resolvers
- log excerpt adapter(s)
- artifact download adapter(s)
- attestation viewer adapter(s)
Team D: UI
- realtime subscription
- timeline state reducer w/ merge rules
- evidence drawer + resolver calls
If you want, I can also provide:
- a formal JSON Schema file for v1 (drop-in for validation),
- a state reducer spec for the UI (pure function inputs/outputs),
- and a Postgres migration + example queries.
Below is a UI State Reducer Spec (Pure Function Inputs/Outputs) for the Run Detail page that renders the timeline + step statuses + failure cards + evidence hydration using TinyFailureEvents (and future-compatible with pass/warn/info).
This is written so devs can implement it as a deterministic reducer (Redux, Zustand w/ reducer, Elm-style update, etc.).
UI State Reducer Spec v1: Run Timeline + Failure Cards
Reducer contract
Pure function
reduceRunView(state: RunViewState, action: Action): RunViewState
Guarantees
- Pure & deterministic: no IO, no timers, no random IDs, no Date.now() inside reducer.
- Idempotent: applying the same RUN_EVENT_RECEIVED twice yields the same state as applying it once.
- Order-safe: out-of-order events never “downgrade” a step attempt from fail → pass.
1) Data types
1.1 Event type used by reducer
type StageName =
| 'fetch' | 'build' | 'scan' | 'policy'
| 'sign' | 'package' | 'deploy' | 'runtime';
type StepStatus =
// present now (MVP)
| 'fail'
// future-compatible
| 'warn' | 'pass' | 'running' | 'queued' | 'info' | 'unknown';
type PointerType = 'log' | 'artifact' | 'attestation' | 'url' | 'trace';
type Pointer = {
type: PointerType;
ref: string;
mime?: string;
label?: string;
expires_at?: string; // RFC3339
sha256?: string;
};
type TinyEventV1 = {
v: 1;
event_id: string;
ts: string; // RFC3339 UTC
run_id: string;
stage: StageName;
step: string;
attempt: number;
status: StepStatus; // MVP sends 'fail' only
error_class: string;
summary: string;
pointers?: Pointer[];
kv?: Record<string, string>;
};
// Normalized for sorting and comparisons (created outside or inside reducer deterministically)
type NormalizedEvent = TinyEventV1 & {
tsMs: number; // parse(ts) -> number, invalid => 0
};
1.2 Keys and comparisons
type TupleKey = string; // `${stage}|${step}|${attempt}|${status}`
type StepAttemptKey = string; // `${stage}|${step}|${attempt}`
type StepIdentityKey = string;// `${stage}|${step}` (no attempt)
type PointerKey = string; // `${type}|${ref}`
function tupleKey(e: TinyEventV1): TupleKey {
return `${e.stage}|${e.step}|${e.attempt}|${e.status}`;
}
function stepAttemptKey(e: TinyEventV1): StepAttemptKey {
return `${e.stage}|${e.step}|${e.attempt}`;
}
function stepIdentityKey(e: TinyEventV1): StepIdentityKey {
return `${e.stage}|${e.step}`;
}
function pointerKey(p: Pointer): PointerKey {
return `${p.type}|${p.ref}`;
}
// Sort: ts ascending, then event_id lexicographically (stable deterministic tiebreak)
function compareEvent(a: NormalizedEvent, b: NormalizedEvent): number {
if (a.tsMs !== b.tsMs) return a.tsMs - b.tsMs;
return a.event_id < b.event_id ? -1 : (a.event_id > b.event_id ? 1 : 0);
}
1.3 Status ranking rule (terminal safety)
We need a single numeric ranking so we can:
- prevent regressions (fail must remain terminal), and
- compute rollups.
const STATUS_RANK: Record<StepStatus, number> = {
unknown: 0,
queued: 1,
running: 2,
info: 3,
pass: 4,
warn: 5,
fail: 6,
};
function isTerminal(status: StepStatus): boolean {
return status === 'fail' || status === 'warn' || status === 'pass';
}
Invariant: A step attempt’s displayed status must never decrease in rank.
2) State shape
This state is for a single Run Detail page (one runId at a time). If you store multiple runs in a global store, wrap this in a Record<runId, RunViewState>.
type RealtimeStatus = 'idle' | 'connecting' | 'connected' | 'disconnected' | 'error';
type LoadStatus = 'idle' | 'loading' | 'loaded' | 'error';
type EvidenceResolveStatus =
| 'unresolved' // pointer exists but no resolver call made yet
| 'loading' // resolver call in-flight
| 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';
type EvidenceResolution = {
status: EvidenceResolveStatus;
kind?: 'inline' | 'link';
title?: string;
mime?: string;
size_bytes?: number;
inline_preview?: string; // small preview
link?: string; // short-lived link
error_message?: string;
};
type EvidenceState = {
pointer: Pointer; // latest metadata merged from events
status: EvidenceResolveStatus;
lastResolvedAtMs?: number; // from action payload (not Date.now)
// for stale response protection
seq: number; // increments each request
inFlightSeq?: number; // seq currently in-flight
resolution?: EvidenceResolution;
};
type PointerAggregate = {
pointerKey: PointerKey;
pointer: Pointer; // merged metadata
};
type TupleAggregate = {
tupleKey: TupleKey;
// all events contributing to this tuple (same stage/step/attempt/status)
eventIdsSorted: string[]; // sorted by (tsMs, event_id)
canonicalEventId: string; // min by (tsMs, event_id)
// merged view computed deterministically from eventIdsSorted
merged: {
summary: string; // from canonical event
error_class: string; // from canonical event
kv: Record<string, string>; // merged by sorted order (later overwrites)
pointers: PointerAggregate[];// dedup by pointerKey, merged by sorted order
updatedAtMs: number; // max tsMs among contributing events
};
};
type StepAttemptState = {
key: StepAttemptKey;
stage: StageName;
step: string;
attempt: number;
// all tuple aggregates for this attempt (one per status)
tuplesByStatus: Partial<Record<StepStatus, TupleKey>>;
// derived “best” status for this attempt
bestStatus: StepStatus;
bestStatusRank: number;
updatedAtMs: number; // max of all tupleAgg.updatedAtMs for this attempt
};
type StageRollup = {
stage: StageName;
// worst status among latest attempts of steps in this stage
rollupStatus: StepStatus;
rollupRank: number;
};
type RunViewState = {
runId: string | null;
loading: { initialEvents: LoadStatus; error?: string };
realtime: { status: RealtimeStatus; error?: string };
// storage
eventsById: Record<string, NormalizedEvent>;
timelineEventIds: string[]; // global timeline sorted by (tsMs, event_id)
tupleAggByKey: Record<TupleKey, TupleAggregate>;
stepAttemptByKey: Record<StepAttemptKey, StepAttemptState>;
latestAttemptByStep: Record<StepIdentityKey, number>; // max attempt observed
stageRollups: Record<StageName, StageRollup>;
evidenceByPointer: Record<PointerKey, EvidenceState>;
};
3) Actions (inputs to reducer)
type Action =
| { type: 'RUN_VIEW_OPENED'; runId: string }
| { type: 'RUN_EVENTS_LOAD_STARTED'; runId: string }
| { type: 'RUN_EVENTS_LOADED'; runId: string; events: TinyEventV1[] }
| { type: 'RUN_EVENTS_LOAD_FAILED'; runId: string; error: string }
| { type: 'REALTIME_STATUS_CHANGED'; runId: string; status: RealtimeStatus; error?: string }
| { type: 'RUN_EVENT_RECEIVED'; event: TinyEventV1 }
// Evidence hydration lifecycle (pure reducer; side-effects happen elsewhere)
| { type: 'EVIDENCE_RESOLVE_REQUESTED'; runId: string; pointerKey: PointerKey }
| { type: 'EVIDENCE_RESOLVE_RESULT'; runId: string; pointerKey: PointerKey; seq: number; resolvedAtMs: number; resolution: EvidenceResolution }
| { type: 'EVIDENCE_RESOLVE_CLEARED'; runId: string; pointerKey: PointerKey };
Reducer must ignore any action where action.runId !== state.runId (except RUN_VIEW_OPENED which sets it). RUN_EVENT_RECEIVED carries no runId of its own and is matched on event.run_id instead (see 4.4).
4) Reducer semantics (outputs)
4.1 RUN_VIEW_OPENED
Input: { runId }
Output: resets all run-specific state.
Rules:
- Set state.runId = runId
- Clear events, aggregates, evidence, timeline.
- Set loading.initialEvents = 'loading'
- Set realtime.status = 'connecting' (optional)
4.2 RUN_EVENTS_LOAD_STARTED / LOADED / FAILED
RUN_EVENTS_LOAD_STARTED
- If runId matches, set loading.initialEvents = 'loading'.
RUN_EVENTS_LOADED
- If runId matches:
  - For each event in events: apply the exact same logic as RUN_EVENT_RECEIVED.
  - Then set loading.initialEvents = 'loaded'.
RUN_EVENTS_LOAD_FAILED
- If runId matches: set loading.initialEvents = 'error' and store the error string.
4.3 REALTIME_STATUS_CHANGED
- Update realtime.status and realtime.error if runId matches.
4.4 RUN_EVENT_RECEIVED (core ingestion)
Preconditions
If state.runId is null, ignore (or treat as no-op).
If event.run_id !== state.runId, ignore.
Step A — normalize + dedupe
- Convert to NormalizedEvent:
  - tsMs = parseRFC3339ToMs(event.ts); if parse fails, tsMs = 0.
  - Default pointers = [], kv = {} if missing.
- If eventsById[event_id] exists: no-op.
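Step A can be sketched as follows. The spec names parseRFC3339ToMs but does not define it, so the body here is an assumption built on Date.parse, which accepts more formats than strict RFC3339; `normalizeIfNew` and the trimmed-down event types are hypothetical and only carry the fields this step touches.

```typescript
// Assumed implementation: Date.parse is more lenient than strict RFC3339.
function parseRFC3339ToMs(ts: string): number {
  const ms = Date.parse(ts);
  return isNaN(ms) ? 0 : ms; // spec: invalid timestamps normalize to 0
}

type RawEvent = {
  event_id: string;
  ts: string;
  pointers?: unknown[];
  kv?: Record<string, string>;
};

type Normalized = {
  event_id: string;
  ts: string;
  tsMs: number;
  pointers: unknown[];
  kv: Record<string, string>;
};

// Returns the normalized event, or null when event_id was already ingested
// (the reducer then returns the state unchanged).
function normalizeIfNew(seen: Record<string, Normalized>, e: RawEvent): Normalized | null {
  if (Object.prototype.hasOwnProperty.call(seen, e.event_id)) return null;
  return {
    event_id: e.event_id,
    ts: e.ts,
    tsMs: parseRFC3339ToMs(e.ts),
    pointers: e.pointers ?? [],
    kv: e.kv ?? {},
  };
}
```

One consequence of the tsMs = 0 fallback is that an event with an unparseable timestamp sorts before every valid event, so it would become the canonical event of its tuple; rejecting such events at ingest validation avoids that.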
Step B — insert into global stores
- Add to eventsById[event_id].
- Insert event_id into timelineEventIds keeping sorted order by (tsMs, event_id).
Step C — ensure evidence entries exist for pointers
For each pointer p:
- pk = pointerKey(p)
- If evidenceByPointer[pk] is missing:
  - create { pointer: p, status: 'unresolved', seq: 0 }
- Else merge pointer metadata into evidenceByPointer[pk].pointer using pointer-merge rules (below). (Do not overwrite existing resolver resolution fields.)
Step D — update tuple aggregate (merge/enrichment)
Let tk = tupleKey(event).
- If tupleAggByKey[tk] missing, create new TupleAggregate with:
  - eventIdsSorted = [event_id]
  - canonicalEventId = event_id
  - merged from this event
- Else:
  - Insert event_id into eventIdsSorted in sorted order (using compareEvent via eventsById).
  - Recompute:
    - canonicalEventId = min(eventIdsSorted) by compareEvent
    - merged deterministically from all contributing events (see merge rules)
Tuple merge rules (deterministic)
Given contributing events E sorted by (tsMs, event_id) ascending:
- canonical = E[0]
- merged.summary = canonical.summary
- merged.error_class = canonical.error_class
- merged.kv:
  - start empty {}
  - for each event e in order, for each (k,v) in e.kv:
    - merged.kv[k] = v (later events overwrite earlier keys)
- merged.pointers:
  - maintain map: Record<PointerKey, Pointer>
  - for each event e in order, for each pointer p:
    - pk = pointerKey(p)
    - if not present: set map[pk] = p
    - else: map[pk] = mergePointerMeta(map[pk], p) (see below)
  - output pointers as an array sorted by PointerKey lexicographically (for stable UI lists)
- merged.updatedAtMs = max(e.tsMs)
Pointer metadata merge rule (non-null wins, later wins)
function mergePointerMeta(oldP: Pointer, newP: Pointer): Pointer {
// type/ref must match
return {
type: oldP.type,
ref: oldP.ref,
// later non-empty wins
mime: newP.mime ?? oldP.mime,
label: newP.label ?? oldP.label,
expires_at: newP.expires_at ?? oldP.expires_at,
sha256: newP.sha256 ?? oldP.sha256,
};
}
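The tuple merge rules and mergePointerMeta above can be combined into one pure function. This is a minimal sketch, assuming events are already normalized (kv and pointers defaulted); the `Ev` type is a trimmed stand-in for NormalizedEvent and `mergeTuple` is a hypothetical name.

```typescript
type Pointer = {
  type: string; ref: string;
  mime?: string; label?: string; expires_at?: string; sha256?: string;
};
type Ev = {
  event_id: string; tsMs: number; summary: string; error_class: string;
  kv: Record<string, string>; pointers: Pointer[];
};

// Same rule as in the spec: later non-null metadata wins.
function mergePointerMeta(oldP: Pointer, newP: Pointer): Pointer {
  return {
    type: oldP.type,
    ref: oldP.ref,
    mime: newP.mime ?? oldP.mime,
    label: newP.label ?? oldP.label,
    expires_at: newP.expires_at ?? oldP.expires_at,
    sha256: newP.sha256 ?? oldP.sha256,
  };
}

// Merges all events of one tuple; input order does not matter.
function mergeTuple(events: Ev[]) {
  if (events.length === 0) throw new Error('mergeTuple needs at least one event');
  const sorted = events.slice().sort((a, b) =>
    a.tsMs - b.tsMs || (a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0));
  const kv: Record<string, string> = {};
  const byKey: Record<string, Pointer> = {};
  let updatedAtMs = 0;
  for (const e of sorted) {
    for (const k of Object.keys(e.kv)) kv[k] = e.kv[k]; // later events overwrite
    for (const p of e.pointers) {
      const pk = `${p.type}|${p.ref}`;
      byKey[pk] = byKey[pk] ? mergePointerMeta(byKey[pk], p) : p;
    }
    updatedAtMs = Math.max(updatedAtMs, e.tsMs);
  }
  const canonical = sorted[0]; // earliest by (tsMs, event_id)
  return {
    summary: canonical.summary,
    error_class: canonical.error_class,
    kv,
    pointers: Object.keys(byKey).sort().map(k => byKey[k]), // stable UI order
    updatedAtMs,
  };
}
```

Because the function re-sorts its input, recomputing the aggregate from all contributing events gives the same result regardless of arrival order.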
4.5 Update StepAttemptState (best status + no regression)
After tuple aggregate update, update the parent step attempt:
- Let sak = stepAttemptKey(event) and sid = stepIdentityKey(event).
latest attempt tracking
latestAttemptByStep[sid] = max(previous, event.attempt)
StepAttemptState update
- If missing, create:
  - bestStatus = 'unknown', bestStatusRank = 0, tuplesByStatus = {}
- Set tuplesByStatus[event.status] = tk
Recompute best status (never decreases)
Compute candidate best by checking all statuses present for this attempt:
candidateBest = argmax(status in tuplesByStatus) STATUS_RANK[status]
Then apply no-regression rule:
- If STATUS_RANK[candidateBest] >= step.bestStatusRank:
  - update bestStatus, bestStatusRank
- Else:
  - keep existing bestStatus (prevents fail → pass regressions)
Set updatedAtMs = max(updatedAtMs, tupleAgg.merged.updatedAtMs).
Important: This rule guarantees “late pass/info” cannot override a prior fail.
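A compact way to express the no-regression rule is a helper that only ever moves the status up the ranking. The sketch below reuses the STATUS_RANK table from section 1.3; `nextBestStatus` is a hypothetical name.

```typescript
type StepStatus = 'fail' | 'warn' | 'pass' | 'running' | 'queued' | 'info' | 'unknown';

const STATUS_RANK: Record<StepStatus, number> = {
  unknown: 0, queued: 1, running: 2, info: 3, pass: 4, warn: 5, fail: 6,
};

// Returns the highest-ranked status among `seen`, but never lower than `prev`.
function nextBestStatus(prev: StepStatus, seen: StepStatus[]): StepStatus {
  let best = prev;
  for (const s of seen) {
    if (STATUS_RANK[s] > STATUS_RANK[best]) best = s;
  }
  return best;
}
```

Starting from the previous best makes the invariant hold by construction: a late pass or info can be recorded in tuplesByStatus without ever changing what the card displays.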
4.6 Stage rollups (optional but recommended)
Whenever any StepAttemptState changes, update stageRollups[stage] deterministically:
For each stage:
- Consider only the latest attempt per step identity in that stage:
  - For each StepIdentityKey = stage|step, find attempt = latestAttemptByStep[stage|step]
  - Look up StepAttemptState for that attempt.
- Roll up stage status as the worst rank among those:
  - rollupRank = max(step.bestStatusRank)
  - rollupStatus = status with that rank
If a stage has no steps yet, set rollupStatus='unknown'.
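The rollup can be sketched as a pure function over the two lookup tables. The `StepAttemptLite` type is a trimmed stand-in for StepAttemptState and `rollupStage` is a hypothetical name; the key formats follow section 1.2.

```typescript
type StepStatus = 'fail' | 'warn' | 'pass' | 'running' | 'queued' | 'info' | 'unknown';
type StepAttemptLite = { bestStatus: StepStatus; bestStatusRank: number };

function rollupStage(
  stage: string,
  latestAttemptByStep: Record<string, number>,       // key: `${stage}|${step}`
  stepAttemptByKey: Record<string, StepAttemptLite>  // key: `${stage}|${step}|${attempt}`
): { rollupStatus: StepStatus; rollupRank: number } {
  let rollupStatus: StepStatus = 'unknown'; // default when the stage has no steps
  let rollupRank = 0;
  const prefix = stage + '|';
  for (const sid of Object.keys(latestAttemptByStep)) {
    if (sid.indexOf(prefix) !== 0) continue; // step belongs to another stage
    const sa = stepAttemptByKey[`${sid}|${latestAttemptByStep[sid]}`];
    if (sa && sa.bestStatusRank > rollupRank) {
      rollupRank = sa.bestStatusRank;
      rollupStatus = sa.bestStatus;
    }
  }
  return { rollupStatus, rollupRank };
}
```

Only the latest attempt of each step counts, so a step that failed on attempt 1 and passed on attempt 2 no longer marks its stage as failed.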
5) Evidence hydration reducer rules
Evidence actions update evidenceByPointer only; they must not mutate events/aggregates.
5.1 EVIDENCE_RESOLVE_REQUESTED
Input: { pointerKey }
Rules:
- If no evidence entry exists: create one with status unresolved and seq=0 (should be rare).
- Increment seq = seq + 1
- Set inFlightSeq = seq
- Set status = 'loading'
- Keep resolution (optional: clear it if you want UI to hide stale info; recommended to keep and show “Refreshing…”)
Middleware/effect contract (outside reducer):
- After dispatching EVIDENCE_RESOLVE_REQUESTED, the effect layer reads inFlightSeq from state and uses it in the API call.
- When the response returns, dispatch EVIDENCE_RESOLVE_RESULT with that same seq.
5.2 EVIDENCE_RESOLVE_RESULT
Input: { pointerKey, seq, resolvedAtMs, resolution }
Rules:
- If evidenceByPointer[pointerKey] missing: ignore or create (implementation choice).
- If evidence.inFlightSeq !== seq: ignore stale response.
- Else:
  - status = resolution.status
  - resolution = resolution
  - lastResolvedAtMs = resolvedAtMs
  - inFlightSeq = undefined
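The request/result rules amount to a sequence-number guard against stale responses. Here is a minimal sketch for a single evidence entry, with the resolution payload left out for brevity; `onRequested` and `onResult` are hypothetical names.

```typescript
type EvidenceStatus =
  | 'unresolved' | 'loading'
  | 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';

type Evidence = {
  status: EvidenceStatus;
  seq: number;
  inFlightSeq?: number;
  lastResolvedAtMs?: number;
};

// EVIDENCE_RESOLVE_REQUESTED: bump seq and mark the entry as loading.
function onRequested(e: Evidence): Evidence {
  const seq = e.seq + 1;
  return { ...e, seq, inFlightSeq: seq, status: 'loading' };
}

// EVIDENCE_RESOLVE_RESULT: apply only if it answers the latest request.
function onResult(e: Evidence, seq: number, status: EvidenceStatus, resolvedAtMs: number): Evidence {
  if (e.inFlightSeq !== seq) return e; // stale response: state unchanged
  return { ...e, status, lastResolvedAtMs: resolvedAtMs, inFlightSeq: undefined };
}
```

If the user clicks "Refresh" twice, the response to the first request arrives carrying seq 1 while inFlightSeq is 2, so it is dropped and the entry stays in loading until the second response lands.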
5.3 EVIDENCE_RESOLVE_CLEARED
- Reset entry back to { status:'unresolved', resolution: undefined, inFlightSeq: undefined }
- Keep pointer metadata.
6) Selectors (pure outputs for rendering)
These are not reducer logic, but they define how UI consumes state deterministically.
6.1 Timeline view model
function selectTimeline(state: RunViewState): NormalizedEvent[] {
  return state.timelineEventIds.map(id => state.eventsById[id]);
}
6.2 Latest attempt cards per step identity
type StepCardVM = {
stage: StageName;
step: string;
attempt: number;
status: StepStatus;
error_class?: string;
summary?: string;
kv: Record<string,string>;
pointers: PointerAggregate[];
updatedAtMs: number;
};
// Display order for stages (assumed here to follow the StageName union order).
const STAGE_ORDER: StageName[] = ['fetch', 'build', 'scan', 'policy', 'sign', 'package', 'deploy', 'runtime'];
function selectLatestStepCards(state: RunViewState): StepCardVM[] {
  const cards: StepCardVM[] = [];
  for (const sid in state.latestAttemptByStep) {
    const attempt = state.latestAttemptByStep[sid];
    const [stage, step] = sid.split('|') as [StageName, string];
    const sak = `${stage}|${step}|${attempt}`;
    const sa = state.stepAttemptByKey[sak];
    if (!sa) continue;
    // Prefer fail tuple for details if present
    const failTk = sa.tuplesByStatus['fail'];
    const bestTk = sa.tuplesByStatus[sa.bestStatus];
    const tk = failTk ?? bestTk;
    const agg = tk ? state.tupleAggByKey[tk] : undefined;
    cards.push({
      stage, step, attempt,
      status: sa.bestStatus,
      error_class: agg?.merged.error_class,
      summary: agg?.merged.summary,
      kv: agg?.merged.kv ?? {},
      pointers: agg?.merged.pointers ?? [],
      updatedAtMs: sa.updatedAtMs,
    });
  }
  // stable ordering: by stage order, then step name
  return cards.sort((a, b) =>
    (STAGE_ORDER.indexOf(a.stage) - STAGE_ORDER.indexOf(b.stage)) ||
    a.step.localeCompare(b.step)
  );
}
6.3 Failure banner (first failure by time)
function selectFirstFailure(state: RunViewState): StepCardVM | null {
  const cards = selectLatestStepCards(state).filter(c => c.status === 'fail');
  if (cards.length === 0) return null;
  return cards.sort((a, b) => a.updatedAtMs - b.updatedAtMs)[0];
}
7) Worked examples (expected reducer behavior)
Example A: fail event arrives, then enrichment adds pointers
- Receive fail event (no pointers)
  - Step card shows fail, summary, error_class, evidence list empty.
- Receive second event same tupleKey with pointers
  - Same step card remains fail (no regression)
  - Evidence section now lists pointers (status unresolved until resolved).
Example B: out-of-order enrichment arrives before initial fail
- Enrichment event arrives first (later tsMs) → creates tupleAgg; canonical is that (for now).
- Later initial fail arrives with earlier tsMs:
  - canonical becomes the earlier event (smaller tsMs)
  - pointers remain, because merged pointers are union across all contributing events.
Example C: duplicate delivery
- Same event_id received twice → second is ignored (idempotent).
Example D: late pass after fail (future-proof)
- If a pass event arrives after a fail for the same step attempt:
  - bestStatusRank is already fail (6)
  - the incoming status is pass (4); candidateBest over the statuses present (fail, pass) is still fail
  - no-regression rule keeps fail
8) Implementation notes (non-binding but useful)
- Event counts per run are usually small; simple array insert + sort is fine.
- If you expect thousands of events, use binary insertion for timelineEventIds and eventIdsSorted.
- Keep all “current time” out of reducer. Any timestamps used in actions (e.g., resolvedAtMs) must be created outside.
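For the larger-run case, binary insertion over the sorted id arrays might look like the sketch below. It uses the same (tsMs, event_id) ordering as compareEvent; `insertSorted` and the `Keyed` type are hypothetical names, and the function returns a new array so the reducer stays pure.

```typescript
type Keyed = { tsMs: number; event_id: string };

// Same ordering as the spec's compareEvent: ts ascending, then event_id.
function compareEvent(a: Keyed, b: Keyed): number {
  if (a.tsMs !== b.tsMs) return a.tsMs - b.tsMs;
  return a.event_id < b.event_id ? -1 : (a.event_id > b.event_id ? 1 : 0);
}

// Inserts `id` into `ids` (already sorted by compareEvent) in O(log n) comparisons.
// `byId` must already contain the event for `id`.
function insertSorted(ids: string[], byId: Record<string, Keyed>, id: string): string[] {
  const target = byId[id];
  let lo = 0;
  let hi = ids.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (compareEvent(byId[ids[mid]], target) < 0) lo = mid + 1;
    else hi = mid;
  }
  return ids.slice(0, lo).concat(id, ids.slice(lo)); // new array, input untouched
}
```

The copy on insert is still linear in the array length; the saving is in comparisons, which is usually enough for timelines of a few thousand events.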
If you want next, I can provide:
- a drop-in TypeScript implementation of reduceRunView with helper functions, and
- a set of unit test vectors (Given actions → expect final state) covering all edge cases above.