Add a dedicated “first_signal” event

Here's a lightweight pattern to make failures show up instantly while keeping backends decoupled: emit a tiny, versioned event the moment you know something failed, and attach pointers to heavier evidence that can arrive later.


Why this helps

  • UI reacts in real time: show “Failed at Step X (E123)” immediately—no waiting for logs, SBOMs, or artifacts to upload/process.
  • Backends evolve safely: logs, traces, SBOM/VEX, heap dumps, etc., can change format or arrive out of order without breaking the UI contract.
  • Deterministic UX: a small, stable schema prevents flaky pipelines from blocking visibility.
  • Great for airgapped/offline: the tiny event rides your internal bus/storage; bulky payloads sync or materialize when available.

The event itself (keep it tiny)

Fields (stable, versioned):

  • v — schema version (e.g., 1).
  • ts — event timestamp (UTC, ISO 8601).
  • run_id — pipeline/execution correlation ID.
  • stage — coarse phase (e.g., fetch, build, scan, policy, deploy).
  • step — fine-grained step (e.g., trivy-scan, dotnet-restore).
  • status — fail|warn|pass|info (for this pattern, you'll use fail).
  • error_class — stable classifier (e.g., NETWORK_DNS, AUTH_EXPIRED, POLICY_BLOCK, VULN_REACHABLE).
  • summary — short human string (“Reachable vuln blocks release”).
  • pointers — array of opaque, resolvable references (log offsets, artifact URIs, attestation IDs).
  • kv — optional tiny key/values for quick filtering (e.g., severity=A, package=openssl).
  • sig (optional) — detached/inline signature (DSSE) for integrity.

Example

{
  "v": 1,
  "ts": "2025-12-13T12:10:03Z",
  "run_id": "run_7f3c6a8",
  "stage": "policy",
  "step": "vex-gate",
  "status": "fail",
  "error_class": "VULN_REACHABLE",
  "summary": "Reachable CVE blocks release",
  "pointers": [
    {"type":"log", "ref":"logs://scanner/7f3c6a8#L1423-L1480"},
    {"type":"attestation", "ref":"rekor://sha256:…"},
    {"type":"sbom", "ref":"artifact://sbom/cyclonedx@run_7f3c6a8.json"}
  ],
  "kv": {"cve":"CVE-2025-12345", "component":"openssl", "severity":"A"}
}

UI behavior (instant, then enrich)

  1. Instant render (sub-200 ms): show a red card with stage/step, error_class, and summary.

  2. Progressive hydration: as pointers resolve, add:

    • “View log excerpt” (jump to #L1423-L1480)
    • “Open attestation” (verify DSSE/Rekor)
    • “Inspect SBOM diff” (component → version → callgraph)
  3. Stable affordances: UI never breaks if a pointer is slow/missing; it just shows a spinner or “awaiting evidence”.


Backend contract

  • Publish early: emit on first knowledge of failure (e.g., nonzero exit, policy deny, TLS error).
  • Don't embed heavy data: only pointers or tiny facts for filters.
  • Pointer resolution is pluggable: files, object storage, Postgres row, Valkey cache key, Rekor entry—whatever suits the deployment.
  • Version discipline: bump v only for breaking schema changes; additive fields are fine.

Minimal topic map (so teams agree on names)

  • stage: fetch|build|scan|policy|sign|package|deploy

  • error_class suggestions:

    • Infra: NETWORK_DNS, NETWORK_TIMEOUT, REGISTRY_403, DISK_FULL
    • AuthN/Z: AUTH_EXPIRED, TOKEN_SCOPE_MISS
    • Supply chain: ATTESTATION_MISSING, SIGNATURE_INVALID, SBOM_STALE
    • Secure build: POLICY_BLOCK, VULN_REACHABLE, MALWARE_FLAG
    • Runtime: IMAGE_DRIFT, PROVENANCE_MISMATCH

Keep each to a 1–2 line definition in a shared doc.


Drop-in for StellaOps (tailored)

  • Emitter: StellaOps.Events (tiny .NET lib) used by Scanner/Policy/Scheduler to publish TinyFailureEvent.
  • Transport: Postgres notify (default) + Valkey pub/sub accelerator. (Matches your Postgres+Valkey architecture choice.)
  • Resolver service: EvidenceGateway that turns pointers into viewable slices (log excerpts, SBOM component focus, Rekor proof).
  • UI: “Failure Feed” panel shows cards from the event stream; detail drawer resolves pointers on demand.
  • Signing: optional DSSE for events; Rekor (or mirror) for attestations—your “ProofLinked” moat.
  • Airgap: pointers use artifact:// and row:// schemes resolvable entirely on-prem.

Quick implementation checklist

  • Define TinyFailureEvent schema v1 and error_class registry.
  • Add emit helpers for each module (FailNow(summary, error_class, pointers, kv)); see the sketch after this list.
  • Build EvidenceGateway.Resolve(pointer) handlers.
  • UI: render card instantly; hydrate sections as resolvers return.
  • Telemetry: metrics on TTFE (TimeToFailureEvent) and pointer hydration latencies.
  • Docs: 1-page contract; examples for each error_class.
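
For concreteness, here's a minimal sketch of the emit-helper shape. It's in TypeScript to match the reducer spec later in this document (the production SDK would be .NET per the drop-in notes above); failNow, the publish callback, and the ulid dependency are illustrative assumptions, not existing StellaOps APIs.

// Sketch only: the shape of a FailNow-style helper.
import { ulid } from 'ulid'; // hypothetical dependency for time-sortable IDs

type Pointer = { type: string; ref: string; label?: string };

interface TinyFailureEvent {
  v: 1;
  event_id: string;
  ts: string;            // RFC3339 UTC
  run_id: string;
  stage: string;
  step: string;
  attempt: number;
  status: 'fail';
  error_class: string;
  summary: string;
  pointers: Pointer[];
  kv: Record<string, string>;
}

function failNow(
  publish: (e: TinyFailureEvent) => Promise<void>,
  run_id: string, stage: string, step: string, attempt: number,
  error_class: string, summary: string,
  pointers: Pointer[] = [], kv: Record<string, string> = {},
): Promise<void> {
  return publish({
    v: 1,
    event_id: `evt_${ulid()}`,      // ULID: globally unique, time-sortable
    ts: new Date().toISOString(),
    run_id, stage, step, attempt,
    status: 'fail',
    error_class,
    summary: summary.slice(0, 140), // enforce the 140-char cap
    pointers, kv,
  });
}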

If you want, I can draft the .NET 10 interfaces (ITinyEventEmitter, resolvers, and a small Razor/Angular card) and a Postgres schema you can paste into your repo.

Below is a PM-grade implementation spec for “Real-time Failure Signaling” using Tiny Failure Events + Evidence Pointers, written so engineers can build it without guessing.


Product: Real-time Failure Signaling (Tiny Failure Events)

Goal

When any pipeline run fails, users must see what failed and where (stage/step + error class + short summary) immediately, even if logs/SBOM/attestations are delayed, huge, or unavailable.

The UI must render a failure card from a tiny event and then progressively enrich with evidence as it becomes resolvable.

Outcomes we must deliver

  1. Instant visibility: “Failed at Step X” appears within seconds of failure.
  2. Decoupling: UI depends only on a stable tiny schema, not on log formats/artifact structures.
  3. Evidence linking: Users can open logs/SBOM/attestations when available, via pointers.
  4. Reliability: Duplicate/out-of-order events don't break the UI; state remains consistent.
  5. Security: Evidence access is authorized; pointers do not leak sensitive info.

Scope

In scope (MVP)

  • Emit TinyFailureEvent v1 on first detected failure for a step.

  • Transport events in near real-time to UI.

  • Store events durably and allow the UI to fetch a run's event timeline.

  • Support evidence pointers for:

    • logs (excerptable)
    • artifacts (SBOM, reports)
    • attestations (provenance/signature)
  • UI:

    • show run timeline
    • show failure card instantly
    • hydrate evidence sections on demand (or automatically where feasible)

Out of scope (MVP)

  • Full trace viewer / distributed tracing UI (we can link to external trace systems via pointer).
  • Automated remediation (“fix it”) actions.
  • Full-blown case management.

Key terms and definitions

  • Run: A single execution of a pipeline. Identified by run_id.
  • Stage: Coarse lifecycle phase (fetch, build, scan, policy, sign, package, deploy).
  • Step: A concrete activity within a stage (dotnet-restore, trivy-scan, vex-gate).
  • Tiny Failure Event: A small message representing “this step failed”, including stable classification and references to evidence.
  • Pointer: An opaque reference that can be resolved into evidence content or a link later.

User stories and acceptance criteria

Story 1: I see failure instantly

As a developer, I want to see which step failed immediately, so that I don't wait on logs/artifacts.

Acceptance criteria

  • When a step fails, the UI updates within ≤ 2 seconds p95 from the time the orchestrator/runner detects failure.

  • The failure card includes:

    • stage, step
    • error class
    • human summary
    • timestamp
    • (optional) primary key/value details (e.g., CVE, severity)

Story 2: I can open evidence when available

As a release engineer, I want to click evidence links (logs/SBOM/attestation), so that I can diagnose and root-cause failures.

Acceptance criteria

  • Failure card shows evidence sections as:

    • Available (clickable)
    • Pending (spinner / “awaiting evidence”)
    • Unavailable (“not produced” or “access denied”)
  • Clicking log evidence opens an excerpt view, not a 500MB file download.

  • Evidence access enforces authorization (same as run access).

Story 3: Events are robust to duplicates/out-of-order

As a user, I want the timeline to remain correct, even if event delivery is at-least-once.

Acceptance criteria

  • UI displays exactly one current “failed” state per step attempt.
  • Duplicate events do not create duplicate cards.
  • Out-of-order arrival does not revert a step from fail → pass.

Functional requirements (what developers must build)

FR1: TinyFailureEvent schema v1

Required fields

All producers MUST emit events that validate against this schema.

{
  "v": 1,
  "event_id": "evt_01J…", 
  "ts": "2025-12-13T12:10:03.123Z",
  "run_id": "run_7f3c6a8",
  "stage": "policy",
  "step": "vex-gate",
  "attempt": 1,
  "status": "fail",
  "error_class": "VULN_REACHABLE",
  "summary": "Reachable CVE blocks release",
  "pointers": [],
  "kv": {}
}

Field definitions & constraints

  • v (int, required): must be 1 for this spec.

  • event_id (string, required): globally unique.

    • Format: evt_<ULID> (ULID recommended for time-sortable IDs).
  • ts (RFC3339 UTC, required): creation timestamp.

  • run_id (string, required): stable correlation ID for the run.

  • stage (enum string, required): one of:

    • fetch|build|scan|policy|sign|package|deploy|runtime
  • step (string, required): lowercase kebab-case recommended; max 80 chars.

  • attempt (int, required): starts at 1; increments for retries.

  • status (enum string, required for this feature): fail (MVP supports fail only; schema allows later expansion)

  • error_class (string, required): stable classifier from a shared registry (see FR2).

    • max 64 chars; uppercase snake-case.
  • summary (string, required): human readable, max 140 chars.

  • pointers (array, optional): max 20 items; each item is a Pointer object (see FR3).

  • kv (object, optional): small metadata map for filtering.

    • max 20 keys
    • key max 32 chars; value max 120 chars
    • no nested objects/arrays

Size limits

  • Entire event payload MUST be ≤ 8 KB serialized JSON.
  • If producers exceed limits, they MUST truncate summary and drop low-priority kv keys before failing emission.
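
A sketch of that shrink-to-fit rule, reusing the TinyFailureEvent shape sketched earlier; the kv priority order (drop the last-inserted key first) is an assumption:

const MAX_EVENT_BYTES = 8 * 1024;

// Shrink an event to the 8 KB cap: cap the summary first, then drop kv keys
// (last-inserted first) until the serialized payload fits.
function enforceSizeLimit(event: TinyFailureEvent): TinyFailureEvent {
  const bytes = (e: TinyFailureEvent) =>
    new TextEncoder().encode(JSON.stringify(e)).length;

  let e: TinyFailureEvent = { ...event, summary: event.summary.slice(0, 140) };
  const keys = Object.keys(e.kv); // assumed priority: earlier keys matter more
  while (bytes(e) > MAX_EVENT_BYTES && keys.length > 0) {
    const dropped = keys.pop()!;
    const kv = { ...e.kv };
    delete kv[dropped];
    e = { ...e, kv };
  }
  return e;
}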

FR2: Error class registry (stable contract)

We maintain a canonical list of error_class values in a shared repo/module.

Requirements

  • Each error_class MUST have:

    • name (e.g., NETWORK_DNS)
    • short description
    • severity mapping (optional)
    • recommended remediation hints (optional, can be UI-side)
  • Producers MUST use a registry value if applicable.

  • Producers MAY emit error_class="UNKNOWN" if no mapping exists, but must log a warning and increment a metric.

Initial registry (minimum)

Infra/Network:

  • NETWORK_DNS
  • NETWORK_TIMEOUT
  • DISK_FULL

Auth:

  • AUTH_EXPIRED
  • REGISTRY_403

Supply chain:

  • SIGNATURE_INVALID
  • ATTESTATION_MISSING
  • SBOM_MISSING

Policy/Security:

  • POLICY_BLOCK
  • VULN_REACHABLE
  • MALWARE_FLAG

Runner/Orchestrator:

  • STEP_TIMEOUT
  • RUN_ABORTED
  • WORKER_LOST
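
One way to make this registry a typed artifact instead of prose (a sketch; the ErrorClassDef shape, descriptions, and severities are assumptions):

type ErrorClassDef = {
  name: string;
  description: string; // the 1–2 line definition from the shared doc
  severity?: 'low' | 'medium' | 'high' | 'critical';
};

// Subset of the initial registry above; entries here are illustrative.
const ERROR_CLASS_REGISTRY: Record<string, ErrorClassDef> = {
  NETWORK_DNS:    { name: 'NETWORK_DNS',    description: 'DNS resolution failed.' },
  AUTH_EXPIRED:   { name: 'AUTH_EXPIRED',   description: 'Credential or token expired.' },
  POLICY_BLOCK:   { name: 'POLICY_BLOCK',   description: 'Policy engine denied the gate.', severity: 'high' },
  VULN_REACHABLE: { name: 'VULN_REACHABLE', description: 'Reachable vulnerability blocks release.', severity: 'critical' },
  UNKNOWN:        { name: 'UNKNOWN',        description: 'No mapping; log a warning and increment a metric.' },
};

// Producers check membership before emitting; unknown classes become UNKNOWN.
function isKnownErrorClass(c: string): boolean {
  return c in ERROR_CLASS_REGISTRY;
}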

FR3: Evidence pointer format and rules

Pointer object schema

{
  "type": "log|artifact|attestation|url|trace",
  "ref": "logs://scanner/run_7f3c6a8#L1423-L1480",
  "mime": "text/plain",
  "label": "Scanner log excerpt",
  "expires_at": "2025-12-20T00:00:00Z",
  "sha256": "optional hex"
}

Rules

  • type and ref are required.
  • ref is opaque to UI; UI passes it to the resolver service.
  • label is optional, but strongly recommended for UI friendliness.
  • expires_at is optional; if present, the UI should show “may expire”.
  • sha256 optional for immutability verification (artifacts/attestations especially).

Allowed schemes (MVP)

  • logs://<provider>/<run_id>#Lx-Ly
  • artifact://<kind>/<name>@<version-or-run-id>
  • attestation://<store>/<id-or-digest>
  • url://<encoded> (only internal allowed; resolver enforces)
  • trace://<system>/<trace-id>
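
A parsing sketch for the logs:// scheme, the only one with internal structure (#Lx-Ly); the regex and the LogRef shape are assumptions:

type LogRef = {
  provider: string;
  runId: string;
  startLine: number;
  endLine: number;
};

// Parse logs://<provider>/<run_id>#Lx-Ly. Returns null for anything else so a
// resolver can fall through to the other scheme handlers.
function parseLogRef(ref: string): LogRef | null {
  const m = /^logs:\/\/([^/]+)\/([^#]+)#L(\d+)-L(\d+)$/.exec(ref);
  if (!m) return null;
  const [, provider, runId, start, end] = m;
  const startLine = Number(start);
  const endLine = Number(end);
  return endLine >= startLine ? { provider, runId, startLine, endLine } : null;
}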

Security constraints

  • Pointers MUST NOT embed secrets (tokens, passwords).
  • Any pointer that could expose sensitive data MUST be resolvable only through the Evidence Gateway (FR6), never directly client-side.
  • The resolver MUST enforce authorization for the requesting user.

FR4: Emission rules (when and how events are produced)

When to emit

Producers MUST emit a TinyFailureEvent when:

  1. A step exits non-zero.
  2. A policy decision is “deny/block”.
  3. A required artifact/attestation is missing at gate time.
  4. A step times out.
  5. The worker is lost (emitted by orchestrator watchdog).

Exactly-once vs at-least-once

  • Transport can be at-least-once.
  • Consumers MUST be idempotent using (run_id, stage, step, attempt, status) + event_id.

One failure event per step attempt

  • For a given (run_id, stage, step, attempt):

    • First emitted status=fail is canonical.
    • Later fail events for the same tuple are treated as “updates” only if they add pointers/kv (see FR5).

Updates / enrichment

We support enrichment without breaking “tiny”:

  • Producers MAY emit a second event with the same tuple (run_id/stage/step/attempt/status) that adds pointers or kv after the initial fail.
  • Consumers MUST merge pointers (dedupe identical type+ref) and merge kv (new keys overwrite old keys).
  • Producers MUST NOT spam; max 3 enrichment events per tuple.
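
A consumer-side sketch of these dedupe-and-merge rules, assuming a Set of event_ids and a Map keyed by tuple are acceptable in-memory structures:

const seenEventIds = new Set<string>();
const byTuple = new Map<string, TinyFailureEvent>();

// At-least-once-safe ingest: exact duplicates drop by event_id; same-tuple
// events merge pointers (dedupe on type+ref) and kv (new keys overwrite old).
function ingest(e: TinyFailureEvent): void {
  if (seenEventIds.has(e.event_id)) return; // duplicate delivery
  seenEventIds.add(e.event_id);

  const tuple = `${e.run_id}|${e.stage}|${e.step}|${e.attempt}|${e.status}`;
  const prev = byTuple.get(tuple);
  if (!prev) {
    byTuple.set(tuple, e); // first fail for the tuple is canonical
    return;
  }
  const ptrs = new Map<string, Pointer>();
  for (const p of [...prev.pointers, ...e.pointers]) ptrs.set(`${p.type}|${p.ref}`, p);
  byTuple.set(tuple, {
    ...prev, // canonical summary/error_class kept
    pointers: [...ptrs.values()],
    kv: { ...prev.kv, ...e.kv },
  });
}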

FR5: Event storage and aggregation

Required services/components

  1. Event Ingest (API or internal library endpoint)
  2. Event Store (durable DB table)
  3. Realtime Fanout (pub/sub channel)
  4. Run Timeline API (query per run)

Behavior

  • On ingest:

    • Validate schema (reject invalid with 400/validation error).
    • Persist to event store.
    • Publish to realtime channel.

Suggested DB model (Postgres)

Table: run_events

  • event_id PK
  • run_id indexed
  • ts indexed
  • stage, step, attempt, status indexed composite
  • payload jsonb
  • ingested_at

Uniqueness constraints:

  • event_id unique
  • Optional: unique on (run_id, stage, step, attempt, status, hash(summary)) if you want stronger dedupe
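
A DDL sketch matching the model above, carried in a TypeScript migration constant; column types and index names are assumptions:

// Migration sketch for run_events; types and index names are assumptions.
export const CREATE_RUN_EVENTS = `
  CREATE TABLE IF NOT EXISTS run_events (
    event_id    text PRIMARY KEY,
    run_id      text        NOT NULL,
    ts          timestamptz NOT NULL,
    stage       text        NOT NULL,
    step        text        NOT NULL,
    attempt     int         NOT NULL,
    status      text        NOT NULL,
    payload     jsonb       NOT NULL,
    ingested_at timestamptz NOT NULL DEFAULT now()
  );
  CREATE INDEX IF NOT EXISTS run_events_run_id_idx ON run_events (run_id);
  CREATE INDEX IF NOT EXISTS run_events_ts_idx     ON run_events (ts);
  CREATE INDEX IF NOT EXISTS run_events_tuple_idx
    ON run_events (run_id, stage, step, attempt, status);
`;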

Query API

  • GET /runs/{run_id}/events returns events sorted by ts ascending.
  • UI should also subscribe to the realtime channel to avoid polling.

FR6: Evidence Gateway (pointer resolver)

Purpose

A single service that resolves pointers into either:

  • log excerpts
  • signed download URLs
  • attestation display + verification data
  • external trace links (sanitized)

Endpoints (MVP)

  1. Resolve metadata

    • POST /evidence/resolve

    • body: { "run_id": "...", "pointers": [ { "type": "...", "ref": "..." } ] }

    • returns per pointer:

      • status: available|pending|missing|denied|expired|error
      • kind: inline|link
      • title
      • mime
      • size_bytes (if known)
      • link (if kind=link) must be short-lived, server-generated
      • inline_preview (optional, small excerpt)
  2. Fetch log excerpt

    • GET /evidence/log-excerpt?ref=...

    • returns:

      • text (max 64 KB)
      • start_line, end_line
      • source (provider info)
  3. Fetch artifact

    • GET /evidence/artifact?ref=...

    • returns either:

      • short-lived download link
      • or 404/403/410
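
The per-pointer and log-excerpt response shapes, restated as TypeScript types (a sketch mirroring the field lists above):

type PointerStatus =
  | 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';

// Per-pointer result of POST /evidence/resolve.
type ResolvedPointer = {
  status: PointerStatus;
  kind?: 'inline' | 'link';
  title?: string;
  mime?: string;
  size_bytes?: number;
  link?: string;           // short-lived, server-generated
  inline_preview?: string; // small excerpt, ≤ 4 KB
};

// Result of GET /evidence/log-excerpt.
type LogExcerpt = {
  text: string;       // max 64 KB
  start_line: number;
  end_line: number;
  source: string;     // provider info
};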

AuthZ requirements

  • Evidence Gateway MUST verify the caller has access to the run_id.
  • Gateway MUST validate that the pointer belongs to that run (or is explicitly declared “global shared”).
  • Gateway MUST audit-log every evidence resolution.

Resilience

  • If evidence is not ready, resolver returns pending, not 500.
  • If pointer is unknown format, return error with a safe message.

UI requirements (what the product must do)

UI1: Run timeline renders from events

  • The run detail page MUST show:

    • stages/steps list
    • current state per step (pass/warn/fail/running)
    • failure details if fail exists
  • The failure state MUST be derived from TinyFailureEvent without requiring any log fetch.

UI2: Failure card content (minimum)

When a fail event arrives:

  • Show a red failure card with:

    • stage + step
    • summary
    • error_class badge
    • ts (relative + absolute on hover)
    • key kv fields (up to 4 shown; remainder behind “Show more”)

UI3: Progressive hydration

  • The card MUST include an “Evidence” section.

  • For each pointer:

    • show a row with label and availability status
    • if available, show “Open”
    • if pending, show spinner + “Awaiting evidence”
    • if denied, show lock icon + “No access”
    • if missing, show “Not produced”
  • Clicking “Open”:

    • logs open excerpt viewer (modal/drawer)
    • artifacts open in viewer or download (type-dependent)
    • attestations open verification view

UI4: Realtime behavior

  • UI MUST subscribe to realtime events for the run.

  • UI MUST apply idempotent merge logic:

    • dedupe by event_id
    • merge enrichment events by tuple (run_id/stage/step/attempt/status)

UI5: Ordering and out-of-order handling

  • UI MUST sort by ts for display.

  • UI MUST NOT regress a step state if a late “pass/info” arrives after fail.

    • Rule: fail is terminal for a step attempt.

Non-functional requirements

Latency

  • From failure detection to UI update: ≤ 2s p95, ≤ 5s p99 (within the same network).

  • Evidence resolution:

    • resolve call should return in ≤ 300ms p95 for cached/known pointers.

Reliability

  • Event ingestion must be durable (stored) before fanout.

  • System must tolerate:

    • duplicates
    • retries
    • out-of-order delivery
    • partial evidence availability

Payload limits

  • Event size ≤ 8 KB
  • Evidence inline previews ≤ 4 KB per pointer

Retention

  • Tiny events retained ≥ 30 days (configurable).
  • Evidence retention depends on provider, but resolver must surface expiry.

Metrics and instrumentation (definition of success)

Producers + ingestion MUST emit:

  • ttfe_ms: time to failure event (from step start or from failure detection)
  • event_ingest_latency_ms
  • event_validation_fail_count
  • unknown_error_class_count
  • pointer_resolution_status_count{available|pending|missing|denied|expired|error}
  • pointer_hydration_latency_ms

UI MUST log:

  • time from run page open → first event rendered
  • evidence open clickthrough rate
  • evidence resolution failure rate

Edge cases we explicitly handle

  1. Runner killed before it can emit

    • Orchestrator watchdog emits WORKER_LOST with stage/step best-effort.
  2. Logs produced after failure

    • Initial fail event has no log pointer.
    • Later enrichment event adds log pointer (same tuple).
  3. Evidence exists but user lacks access

    • Resolver returns denied; UI shows locked state.
  4. Evidence link expired

    • Resolver returns expired and provides a “Refresh” action that re-resolves.
  5. Multiple retries

    • attempt increments; UI shows attempt number and keeps prior attempt history.

Definition of Done (engineers can ship when…)

Backend DoD

  • Schema validation implemented.

  • Ingest → store → fanout pipeline working.

  • Enrichment merge logic implemented.

  • Evidence Gateway resolves at least:

    • log excerpt pointers
    • artifact pointers
    • attestation pointers
  • AuthZ enforced.

Frontend DoD

  • Run page shows failure card from TinyFailureEvent alone.
  • Evidence hydration works and handles all resolver statuses.
  • Realtime updates work; duplicates/out-of-order are safe.

QA DoD (minimum test cases)

  1. Step fails; event arrives; UI shows failure without logs.
  2. Log pointer arrives later; UI updates evidence section.
  3. Duplicate event delivery; UI shows one failure.
  4. Evidence denied; UI shows lock state.
  5. Out-of-order: enrichment arrives before initial fail; UI still resolves correctly.

Implementation checklist (suggested division of work)

Team A: Producer SDK

  • StellaOps.Events library:

    • EmitFailure(run_id, stage, step, attempt, error_class, summary, pointers, kv)
    • ULID generation
    • payload size enforcement
    • retry w/ backoff

Team B: Ingest + Store + Fanout

  • API endpoint or internal gRPC for ingest
  • Postgres run_events table
  • Valkey pub/sub channel: run:{run_id}:events

Team C: Evidence Gateway

  • pointer parsing + resolvers
  • log excerpt adapter(s)
  • artifact download adapter(s)
  • attestation viewer adapter(s)

Team D: UI

  • realtime subscription
  • timeline state reducer w/ merge rules
  • evidence drawer + resolver calls

If you want, I can also provide:

  • a formal JSON Schema file for v1 (drop-in for validation),
  • a state reducer spec for the UI (pure function inputs/outputs),
  • and a Postgres migration + example queries.

Below is a UI State Reducer Spec (pure function inputs/outputs) for the Run Detail page that renders the timeline + step statuses + failure cards + evidence hydration using TinyFailureEvents (and is future-compatible with pass/warn/info).

This is written so devs can implement it as a deterministic reducer (Redux, Zustand w/ reducer, Elm-style update, etc.).


UI State Reducer Spec v1: Run Timeline + Failure Cards

Reducer contract

Pure function

reduceRunView(state: RunViewState, action: Action): RunViewState

Guarantees

  • Pure & deterministic: no IO, no timers, no random IDs, no Date.now() inside reducer.
  • Idempotent: applying the same RUN_EVENT_RECEIVED twice yields the same state after the first time.
  • Order-safe: out-of-order events never “downgrade” a step attempt from fail → pass.

1) Data types

1.1 Event type used by reducer

type StageName =
  | 'fetch' | 'build' | 'scan' | 'policy'
  | 'sign' | 'package' | 'deploy' | 'runtime';

type StepStatus =
  // present now (MVP)
  | 'fail'
  // future-compatible
  | 'warn' | 'pass' | 'running' | 'queued' | 'info' | 'unknown';

type PointerType = 'log' | 'artifact' | 'attestation' | 'url' | 'trace';

type Pointer = {
  type: PointerType;
  ref: string;
  mime?: string;
  label?: string;
  expires_at?: string; // RFC3339
  sha256?: string;
};

type TinyEventV1 = {
  v: 1;
  event_id: string;
  ts: string;          // RFC3339 UTC
  run_id: string;
  stage: StageName;
  step: string;
  attempt: number;
  status: StepStatus;  // MVP sends 'fail' only
  error_class: string;
  summary: string;
  pointers?: Pointer[];
  kv?: Record<string, string>;
};

// Normalized for sorting and comparisons (created outside or inside reducer deterministically)
type NormalizedEvent = TinyEventV1 & {
  tsMs: number; // parse(ts) -> number, invalid => 0
};

1.2 Keys and comparisons

type TupleKey = string;       // `${stage}|${step}|${attempt}|${status}`
type StepAttemptKey = string; // `${stage}|${step}|${attempt}`
type StepIdentityKey = string; // `${stage}|${step}` (no attempt)
type PointerKey = string;     // `${type}|${ref}`

function tupleKey(e: TinyEventV1): TupleKey {
  return `${e.stage}|${e.step}|${e.attempt}|${e.status}`;
}
function stepAttemptKey(e: TinyEventV1): StepAttemptKey {
  return `${e.stage}|${e.step}|${e.attempt}`;
}
function stepIdentityKey(e: TinyEventV1): StepIdentityKey {
  return `${e.stage}|${e.step}`;
}
function pointerKey(p: Pointer): PointerKey {
  return `${p.type}|${p.ref}`;
}

// Sort: ts ascending, then event_id lexicographically (stable deterministic tiebreak)
function compareEvent(a: NormalizedEvent, b: NormalizedEvent): number {
  if (a.tsMs !== b.tsMs) return a.tsMs - b.tsMs;
  return a.event_id < b.event_id ? -1 : (a.event_id > b.event_id ? 1 : 0);
}

1.3 Status ranking rule (terminal safety)

We need a single numeric ranking so we can:

  • prevent regressions (fail must remain terminal), and
  • compute rollups.

const STATUS_RANK: Record<StepStatus, number> = {
  unknown: 0,
  queued:  1,
  running: 2,
  info:    3,
  pass:    4,
  warn:    5,
  fail:    6,
};

function isTerminal(status: StepStatus): boolean {
  return status === 'fail' || status === 'warn' || status === 'pass';
}

Invariant: A step attempt's displayed status must never decrease in rank.


2) State shape

This state is for a single Run Detail page (one runId at a time). If you store multiple runs in a global store, wrap this in a Record<runId, RunViewState>.

type RealtimeStatus = 'idle' | 'connecting' | 'connected' | 'disconnected' | 'error';
type LoadStatus = 'idle' | 'loading' | 'loaded' | 'error';

type EvidenceResolveStatus =
  | 'unresolved'  // pointer exists but no resolver call made yet
  | 'loading'     // resolver call in-flight
  | 'available' | 'pending' | 'missing' | 'denied' | 'expired' | 'error';

type EvidenceResolution = {
  status: EvidenceResolveStatus;
  kind?: 'inline' | 'link';
  title?: string;
  mime?: string;
  size_bytes?: number;
  inline_preview?: string; // small preview
  link?: string;           // short-lived link
  error_message?: string;
};

type EvidenceState = {
  pointer: Pointer;          // latest metadata merged from events
  status: EvidenceResolveStatus;
  lastResolvedAtMs?: number; // from action payload (not Date.now)
  // for stale response protection
  seq: number;               // increments each request
  inFlightSeq?: number;      // seq currently in-flight
  resolution?: EvidenceResolution;
};

type PointerAggregate = {
  pointerKey: PointerKey;
  pointer: Pointer; // merged metadata
};

type TupleAggregate = {
  tupleKey: TupleKey;

  // all events contributing to this tuple (same stage/step/attempt/status)
  eventIdsSorted: string[];      // sorted by (tsMs, event_id)
  canonicalEventId: string;      // min by (tsMs, event_id)

  // merged view computed deterministically from eventIdsSorted
  merged: {
    summary: string;             // from canonical event
    error_class: string;         // from canonical event
    kv: Record<string, string>;  // merged by sorted order (later overwrites)
    pointers: PointerAggregate[]; // dedup by pointerKey, merged by sorted order
    updatedAtMs: number;         // max tsMs among contributing events
  };
};

type StepAttemptState = {
  key: StepAttemptKey;
  stage: StageName;
  step: string;
  attempt: number;

  // all tuple aggregates for this attempt (one per status)
  tuplesByStatus: Partial<Record<StepStatus, TupleKey>>;

  // derived “best” status for this attempt
  bestStatus: StepStatus;
  bestStatusRank: number;
  updatedAtMs: number; // max of all tupleAgg.updatedAtMs for this attempt
};

type StageRollup = {
  stage: StageName;
  // worst status among latest attempts of steps in this stage
  rollupStatus: StepStatus;
  rollupRank: number;
};

type RunViewState = {
  runId: string | null;

  loading: { initialEvents: LoadStatus; error?: string };
  realtime: { status: RealtimeStatus; error?: string };

  // storage
  eventsById: Record<string, NormalizedEvent>;
  timelineEventIds: string[];  // global timeline sorted by (tsMs, event_id)

  tupleAggByKey: Record<TupleKey, TupleAggregate>;
  stepAttemptByKey: Record<StepAttemptKey, StepAttemptState>;
  latestAttemptByStep: Record<StepIdentityKey, number>; // max attempt observed

  stageRollups: Record<StageName, StageRollup>;

  evidenceByPointer: Record<PointerKey, EvidenceState>;
};

3) Actions (inputs to reducer)

type Action =
  | { type: 'RUN_VIEW_OPENED'; runId: string }
  | { type: 'RUN_EVENTS_LOAD_STARTED'; runId: string }
  | { type: 'RUN_EVENTS_LOADED'; runId: string; events: TinyEventV1[] }
  | { type: 'RUN_EVENTS_LOAD_FAILED'; runId: string; error: string }

  | { type: 'REALTIME_STATUS_CHANGED'; runId: string; status: RealtimeStatus; error?: string }
  | { type: 'RUN_EVENT_RECEIVED'; event: TinyEventV1 }

  // Evidence hydration lifecycle (pure reducer; side-effects happen elsewhere)
  | { type: 'EVIDENCE_RESOLVE_REQUESTED'; runId: string; pointerKey: PointerKey }
  | { type: 'EVIDENCE_RESOLVE_RESULT'; runId: string; pointerKey: PointerKey; seq: number; resolvedAtMs: number; resolution: EvidenceResolution }
  | { type: 'EVIDENCE_RESOLVE_CLEARED'; runId: string; pointerKey: PointerKey };

Reducer must ignore any action whose runId does not match state.runId (RUN_VIEW_OPENED is the exception that sets it; RUN_EVENT_RECEIVED is matched via event.run_id instead).


4) Reducer semantics (outputs)

4.1 RUN_VIEW_OPENED

Input: { runId }. Output: resets all run-specific state.

Rules:

  • Set state.runId = runId
  • Clear events, aggregates, evidence, timeline.
  • Set loading.initialEvents = 'loading'
  • Set realtime.status = 'connecting' (optional)

4.2 RUN_EVENTS_LOAD_STARTED / LOADED / FAILED

RUN_EVENTS_LOAD_STARTED

  • If runId matches, set loading.initialEvents = 'loading'.

RUN_EVENTS_LOADED

  • If runId matches:

    • For each event in events: apply the exact same logic as RUN_EVENT_RECEIVED.
    • Then set loading.initialEvents = 'loaded'.

RUN_EVENTS_LOAD_FAILED

  • If runId matches: loading.initialEvents = 'error', store error string.

4.3 REALTIME_STATUS_CHANGED

  • Update realtime.status and realtime.error if runId matches.

4.4 RUN_EVENT_RECEIVED (core ingestion)

Preconditions

If state.runId is null, ignore (or treat as no-op). If event.run_id !== state.runId, ignore.

Step A — normalize + dedupe

  • Convert to NormalizedEvent:

    • tsMs = parseRFC3339ToMs(event.ts); if parse fails, tsMs = 0.
    • Default pointers = [], kv = {} if missing.
  • If eventsById[event_id] exists: no-op.
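
A minimal implementation of the normalization helper referenced above; Date.parse accepts RFC3339 timestamps, and invalid input maps to 0 as specified:

// Normalize an RFC3339 timestamp to epoch ms; invalid strings map to 0 so
// sorting stays deterministic even for malformed events.
function parseRFC3339ToMs(ts: string): number {
  const ms = Date.parse(ts);
  return Number.isNaN(ms) ? 0 : ms;
}

function normalizeEvent(e: TinyEventV1): NormalizedEvent {
  return { ...e, pointers: e.pointers ?? [], kv: e.kv ?? {}, tsMs: parseRFC3339ToMs(e.ts) };
}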

Step B — insert into global stores

  • Add to eventsById[event_id].
  • Insert event_id into timelineEventIds keeping sorted order by (tsMs, event_id).

Step C — ensure evidence entries exist for pointers

For each pointer p:

  • pk = pointerKey(p)

  • If evidenceByPointer[pk] is missing:

    • create { pointer: p, status: 'unresolved', seq: 0 }
  • Else merge pointer metadata into evidenceByPointer[pk].pointer using pointer-merge rules (below). (Do not overwrite existing resolver resolution fields.)

Step D — update tuple aggregate (merge/enrichment)

Let tk = tupleKey(event).

  • If tupleAggByKey[tk] missing, create new TupleAggregate with:

    • eventIdsSorted = [event_id]
    • canonicalEventId = event_id
    • merged from this event
  • Else:

    • Insert event_id into eventIdsSorted in sorted order (using compareEvent via eventsById).

    • Recompute:

      • canonicalEventId = min(eventIdsSorted) by compareEvent
      • merged deterministically from all contributing events (see merge rules)

Tuple merge rules (deterministic)

Given contributing events E sorted by (tsMs, event_id) ascending:

  • canonical = E[0]

  • merged.summary = canonical.summary

  • merged.error_class = canonical.error_class

  • merged.kv:

    • start empty {}
    • for each event e in order, for each (k,v) in e.kv: merged.kv[k] = v (later events overwrite earlier keys)
  • merged.pointers:

    • maintain map: Record<PointerKey, Pointer>

    • for each event e in order, for each pointer p:

      • pk = pointerKey(p)
      • if not present: set map[pk] = p
      • else: map[pk] = mergePointerMeta(map[pk], p) (see below)
    • output pointers as an array sorted by PointerKey lexicographically (for stable UI lists)

  • merged.updatedAtMs = max(e.tsMs)

Pointer metadata merge rule (non-null wins, later wins)

function mergePointerMeta(oldP: Pointer, newP: Pointer): Pointer {
  // type/ref must match
  return {
    type: oldP.type,
    ref: oldP.ref,
    // later non-empty wins
    mime:       newP.mime       ?? oldP.mime,
    label:      newP.label      ?? oldP.label,
    expires_at: newP.expires_at ?? oldP.expires_at,
    sha256:     newP.sha256     ?? oldP.sha256,
  };
}

4.5 Update StepAttemptState (best status + no regression)

After tuple aggregate update, update the parent step attempt:

  • Let sak = stepAttemptKey(event) and sid = stepIdentityKey(event).

latest attempt tracking

  • latestAttemptByStep[sid] = max(previous, event.attempt)

StepAttemptState update

  • If missing, create:

    • bestStatus = 'unknown', bestStatusRank = 0, tuplesByStatus = {}
  • Set tuplesByStatus[event.status] = tk

Recompute best status (never decreases)

Compute candidate best by checking all statuses present for this attempt:

candidateBest = argmax(status in tuplesByStatus) STATUS_RANK[status]

Then apply no-regression rule:

  • If STATUS_RANK[candidateBest] >= step.bestStatusRank:

    • update bestStatus, bestStatusRank
  • Else:

    • keep existing bestStatus (prevents fail → pass regressions)

Set updatedAtMs = max(updatedAtMs, tupleAgg.merged.updatedAtMs).

Important: This rule guarantees “late pass/info” cannot override a prior fail.
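
That recompute-plus-no-regression step as a pure helper (a sketch, assuming the types from sections 1 and 2):

// Recompute a step attempt's best status without regressions: the candidate
// is the max-rank status present; it replaces the stored status only if its
// rank is >= the old rank, so a late pass can never downgrade a fail.
function recomputeBestStatus(sa: StepAttemptState): StepAttemptState {
  let candidate: StepStatus = 'unknown';
  for (const s of Object.keys(sa.tuplesByStatus) as StepStatus[]) {
    if (STATUS_RANK[s] > STATUS_RANK[candidate]) candidate = s;
  }
  if (STATUS_RANK[candidate] >= sa.bestStatusRank) {
    return { ...sa, bestStatus: candidate, bestStatusRank: STATUS_RANK[candidate] };
  }
  return sa;
}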


4.6 Update stage rollups (derived)

Whenever any StepAttemptState changes, update stageRollups[stage] deterministically:

For each stage:

  • Consider only the latest attempt per step identity in that stage:

    • For each StepIdentityKey = stage|step, find attempt = latestAttemptByStep[stage|step]
    • Look up StepAttemptState for that attempt.
  • Roll up stage status as the worst rank among those:

    • rollupRank = max(step.bestStatusRank)
    • rollupStatus = status with that rank

If a stage has no steps yet, set rollupStatus='unknown'.
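
A rollup sketch over latest attempts (assumes the state shape from section 2 and that step names contain no '|'):

// Worst-rank rollup for one stage, considering only the latest attempt of
// each step identity belonging to that stage.
function computeStageRollup(state: RunViewState, stage: StageName): StageRollup {
  let rollupStatus: StepStatus = 'unknown';
  let rollupRank = 0;
  for (const sid in state.latestAttemptByStep) {
    if (!sid.startsWith(`${stage}|`)) continue; // other stages
    const attempt = state.latestAttemptByStep[sid];
    const sa = state.stepAttemptByKey[`${sid}|${attempt}`];
    if (sa && sa.bestStatusRank > rollupRank) {
      rollupRank = sa.bestStatusRank;
      rollupStatus = sa.bestStatus;
    }
  }
  return { stage, rollupStatus, rollupRank };
}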


5) Evidence hydration reducer rules

Evidence actions update evidenceByPointer only; they must not mutate events/aggregates.

5.1 EVIDENCE_RESOLVE_REQUESTED

Input: { pointerKey }

Rules:

  • If no evidence entry exists: create one with status unresolved and seq=0 (should be rare).
  • Increment seq = seq + 1
  • Set inFlightSeq = seq
  • Set status = 'loading'
  • Keep resolution (optional: clear it if you want UI to hide stale info; recommended to keep and show “Refreshing…”)

Middleware/effect contract (outside reducer):

  • After dispatching EVIDENCE_RESOLVE_REQUESTED, the effect layer reads inFlightSeq from state and uses it in the API call.
  • When the response returns, dispatch EVIDENCE_RESOLVE_RESULT with that same seq.

5.2 EVIDENCE_RESOLVE_RESULT

Input: { pointerKey, seq, resolvedAtMs, resolution }

Rules:

  • If evidenceByPointer[pointerKey] missing: ignore or create (implementation choice).

  • If evidence.inFlightSeq !== seq: ignore stale response.

  • Else:

    • status = resolution.status
    • resolution = resolution
    • lastResolvedAtMs = resolvedAtMs
    • inFlightSeq = undefined

5.3 EVIDENCE_RESOLVE_CLEARED

  • Reset entry back to { status:'unresolved', resolution: undefined, inFlightSeq: undefined }
  • Keep pointer metadata.

6) Selectors (pure outputs for rendering)

These are not reducer logic, but they define how UI consumes state deterministically.

6.1 Timeline view model

function selectTimeline(state: RunViewState): NormalizedEvent[] {
  return state.timelineEventIds.map(id => state.eventsById[id]);
}

6.2 Latest attempt cards per step identity

type StepCardVM = {
  stage: StageName;
  step: string;
  attempt: number;
  status: StepStatus;
  error_class?: string;
  summary?: string;
  kv: Record<string,string>;
  pointers: PointerAggregate[];
  updatedAtMs: number;
};

const STAGE_ORDER: StageName[] = [
  'fetch', 'build', 'scan', 'policy', 'sign', 'package', 'deploy', 'runtime',
];

function selectLatestStepCards(state: RunViewState): StepCardVM[] {
  const cards: StepCardVM[] = [];
  for (const sid in state.latestAttemptByStep) {
    const attempt = state.latestAttemptByStep[sid];
    const [stage, step] = sid.split('|') as [StageName, string];
    const sak = `${stage}|${step}|${attempt}`;

    const sa = state.stepAttemptByKey[sak];
    if (!sa) continue;

    // Prefer fail tuple for details if present
    const failTk = sa.tuplesByStatus['fail'];
    const bestTk = sa.tuplesByStatus[sa.bestStatus];
    const tk = failTk ?? bestTk;
    const agg = tk ? state.tupleAggByKey[tk] : undefined;

    cards.push({
      stage, step, attempt,
      status: sa.bestStatus,
      error_class: agg?.merged.error_class,
      summary: agg?.merged.summary,
      kv: agg?.merged.kv ?? {},
      pointers: agg?.merged.pointers ?? [],
      updatedAtMs: sa.updatedAtMs,
    });
  }
  // stable ordering: by stage order, then step name
  return cards.sort((a,b) =>
    (STAGE_ORDER.indexOf(a.stage) - STAGE_ORDER.indexOf(b.stage)) ||
    a.step.localeCompare(b.step)
  );
}

6.3 Failure banner (first failure by time)

function selectFirstFailure(state: RunViewState): StepCardVM | null {
  const cards = selectLatestStepCards(state).filter(c => c.status === 'fail');
  if (cards.length === 0) return null;
  return cards.sort((a,b) => a.updatedAtMs - b.updatedAtMs)[0];
}

7) Worked examples (expected reducer behavior)

Example A: fail event arrives, then enrichment adds pointers

  1. Receive fail event (no pointers)

    • Step card shows fail, summary, error_class; evidence list is empty.

  2. Receive a second event with the same tupleKey, carrying pointers

    • The same step card remains fail (no regression).
    • The evidence section now lists the pointers (status unresolved until resolved).

Example B: out-of-order enrichment arrives before initial fail

  • Enrichment event arrives first (later tsMs) → creates tupleAgg; canonical is that event (for now).

  • Later initial fail arrives with earlier tsMs:

    • canonical becomes the earlier event (smaller tsMs)
    • pointers remain, because merged pointers are union across all contributing events.

Example C: duplicate delivery

  • Same event_id received twice → second is ignored (idempotent).

Example D: late pass after fail (future-proof)

  • If a pass event arrives after a fail for the same step attempt:

    • bestStatusRank is already fail (6)
    • candidate is pass (4)
    • no-regression rule keeps fail

8) Implementation notes (non-binding but useful)

  • Event counts per run are usually small; simple array insert + sort is fine.
  • If you expect thousands, use binary insertion for timelineEventIds and eventIdsSorted (see the sketch after this list).
  • Keep all “current time” out of reducer. Any timestamps used in actions (e.g., resolvedAtMs) must be created outside.
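
A binary-insertion sketch for those sorted id arrays, using compareEvent from section 1.2:

// Insert an event id into an array kept sorted by (tsMs, event_id) using
// binary search; returns a new array rather than mutating in place.
function insertSorted(
  ids: string[],
  id: string,
  eventsById: Record<string, NormalizedEvent>,
): string[] {
  let lo = 0;
  let hi = ids.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (compareEvent(eventsById[ids[mid]], eventsById[id]) <= 0) lo = mid + 1;
    else hi = mid;
  }
  return [...ids.slice(0, lo), id, ...ids.slice(lo)];
}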

If you want next, I can provide:

  • a drop-in TypeScript implementation of reduceRunView with helper functions, and
  • a set of unit test vectors (Given actions → expect final state) covering all edge cases above.