Here’s a simple, high‑leverage UX metric to add to your pipeline run view that will immediately make DevOps feel faster and calmer:

# Time‑to‑First‑Signal (TTFS)

**What it is:** the time from opening a run’s details page until the UI renders the **first actionable insight** (e.g., “Stage `build` failed – `dotnet restore` 401 – token expired”).

**Why it matters:** engineers don’t need *all* data instantly—just the first trustworthy clue to start acting. Lower TTFS = quicker triage, lower stress, tighter MTTR.

---

## What counts as a “first signal”

* Failed stage + reason (exit code, key log line, failing test name)
* Degraded but actionable status (e.g., flaky test signature)
* Policy gate block with the specific rule that failed
* Reachability‑aware security finding that blocks deploy (one concrete example, not the whole list)

> Not a signal: spinners, generic “loading…”, or unactionable counts.

---

## How to optimize TTFS (practical steps)

1. **Deferred loading (prioritize critical panes):**
   * Render the header + failing stage card first; lazy‑load artifacts, full logs, and graphs after.
   * Pre‑expand the *first failing node* in the stage graph.
2. **Log pre‑indexing at ingest:**
   * During CI, stream logs into chunks keyed by `[jobId, phase, severity, firstErrorLine]`.
   * Extract the **first error tuple** (timestamp, step, message) and store it next to the job record.
   * On UI open, fetch only that tuple (sub‑100 ms) before fetching the rest.
3. **Cached summaries:**
   * Persist a tiny JSON “run.summary.v1” (status, first failing stage, first error line, blocking policies) in Redis/Postgres.
   * Invalidate on new job events; always serve this summary first.
4. **Edge prefetch:**
   * When the runs table is visible, prefetch summaries for rows in the viewport so details pages open “warm”.
5. **Compress + cap first log burst:**
   * Send the first **5–10 error lines** (already extracted) immediately; stream the rest.
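The pre‑indexing step above can be sketched as a single pass over log lines that stops at the first error marker. This is a minimal sketch: the marker patterns and the tuple shape are illustrative assumptions, not a fixed spec.

```typescript
// Minimal first-error-tuple extractor (illustrative patterns, not exhaustive).
// Scans log lines once and stops at the first match so ingest cost stays low.

interface FirstErrorTuple {
  lineNo: number;  // 1-based line number in the job log
  message: string; // the matching line, trimmed and capped
}

// Common CI error markers; extend per provider/toolchain as needed.
const ERROR_MARKERS = /(FATAL|ERROR|##\[error\]|npm ERR!|BUILD FAILED|panic:)/;

function extractFirstError(log: string, maxLen = 300): FirstErrorTuple | null {
  const lines = log.split("\n");
  for (let i = 0; i < lines.length; i++) {
    if (ERROR_MARKERS.test(lines[i])) {
      return { lineNo: i + 1, message: lines[i].trim().slice(0, maxLen) };
    }
  }
  return null; // no error marker found (e.g., successful run)
}
```

Because the scan stops at the first hit, a failing job costs roughly "read until first error", not "read the whole log".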
---

## Instrumentation (so you can prove it)

Emit these points as telemetry:

* `ttfs_start`: when the run details route is entered (or when the tab becomes visible)
* `ttfs_signal_rendered`: when the first actionable card is in the DOM
* `ttfs_ms = signal_rendered - start`
* Dimensions: `pipeline_provider`, `repo`, `branch`, `run_type` (PR/main), `device`, `release`, `network_state`

**SLO:** *P50 ≤ 700 ms, P95 ≤ 2.5 s* (adjust to your infra).

**Dashboards to track:**

* TTFS distribution (P50/P90/P95) by release
* Correlate TTFS with bounce rate and “open → rerun” delay
* Error budget: % of views with TTFS > 3 s

---

## Minimal backend contract (example)

**GET** `/api/runs/{runId}/first-signal`

```json
{
  "runId": "123",
  "firstSignal": {
    "type": "stage_failed",
    "stage": "build",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "at": "2025-12-11T09:22:31Z",
    "artifact": {
      "kind": "log",
      "range": { "start": 1880, "end": 1896 }
    }
  },
  "summaryEtag": "W/\"a1b2c3\""
}
```

---

## Frontend pattern (Angular 17, signal‑first)

* Fire the `first-signal` request in a route resolver.
* Render `FirstSignalCard` immediately.
* Lazy‑load the stage graph, full logs, and security panes.
* Fire `ttfs_signal_rendered` when `FirstSignalCard` enters the viewport.

---

## CI adapter hints (GitLab/GitHub/Azure)

* Hook on job status webhooks to compute & store the first error tuple.
* For GitLab: scan the `trace` stream for the first `ERRO|FATAL|##[error]` match; store it in a DB table `ci_run_first_signal(run_id, stage, step, message, t)`.

---

## “Good TTFS” acceptance tests

* Run with an early fail → first signal < 1 s, shows the exact command + exit code.
* Run with a policy gate fail → rule name + fix hint visible first.
* Offline/slow network → cached summary still renders an actionable hint.

---

## Copy to put in your UX guidelines

> “Optimize **Time‑to‑First‑Signal (TTFS)** above all.
> Users must see one trustworthy, actionable clue within 1 second on a warm path—even if the rest of the UI is still loading.”

If you want, I can sketch the exact DB schema for the pre‑indexed log tuples and the Angular resolver + telemetry hooks next.

Below is an extended, end‑to‑end implementation plan for **Time‑to‑First‑Signal (TTFS)** that you can drop into your backlog. It includes architecture, data model, API contracts, frontend work, observability, QA, and rollout—structured as epics/phases with a “definition of done” and acceptance criteria.

---

# Scope extension

## What we’re building

A run details experience that renders **one actionable clue** fast—before loading heavy UI like full logs, graphs, and artifacts.

**“First signal”** is a small payload derived from run/job events and the earliest meaningful error evidence (stage/step + key log line(s) + reason/classification).

## What we’re extending beyond the initial idea

1. **First‑signal quality** (not just speed)
   * Classify the error type (auth, dependency, compilation, test, infra, policy, timeout).
   * Identify the “culprit step” and a stable “signature” for dedupe and search.
2. **Progressive disclosure UX**
   * Summary → first signal card → expanded context (stage graph, logs, artifacts).
3. **Provider‑agnostic ingestion**
   * Adapters for GitLab/GitHub/Azure (or your CI provider).
4. **Caching + prefetch**
   * Warm open from the list/table, with ETags and stale‑while‑revalidate.
5. **Observability & SLOs**
   * TTFS metrics, dashboards, alerting, and quality metrics (false signals).
6. **Rollout safety**
   * Feature flags, canary, A/B gating, and a guaranteed fallback path.

---

# Success criteria

## Primary metric

* **TTFS (ms)**: time from details page route enter → first actionable signal rendered.

## Targets (example SLOs)

* **P50 ≤ 700 ms**, **P95 ≤ 2500 ms** on the warm path.
* **Cold path**: P95 ≤ 4000 ms (depends on infra).
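To sanity-check those targets against raw TTFS samples, a nearest-rank percentile helper is enough. A minimal sketch; real dashboards would use your metrics backend’s percentile functions, and the sample format (plain milliseconds) is an assumption.

```typescript
// Nearest-rank percentile over raw TTFS samples (milliseconds).

function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

// Evaluate the example warm-path SLO: P50 ≤ 700 ms and P95 ≤ 2500 ms.
function meetsWarmSlo(samples: number[]): boolean {
  return percentile(samples, 50) <= 700 && percentile(samples, 95) <= 2500;
}
```

One slow outlier in a small sample is enough to blow the P95 target, which is exactly why the error-budget view (% of views above a threshold) is tracked alongside the raw percentiles.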
## Secondary outcome metrics

* **Open→action time**: time from opening a run to the first user action (rerun, cancel, assign, open failing log line).
* **Bounce rate**: closing the page within 10 seconds without interaction.
* **MTTR proxy**: time from failure to the first rerun or fix commit.

## Quality metrics

* **Signal availability rate**: % of run views that show a first signal card within 3 s.
* **Signal accuracy score** (sampled): an engineer confirms “helpful vs not”.
* **Extractor failure rate**: parsing errors / missing mappings / timeouts.

---

# Architecture overview

## Data flow

1. **CI provider events** (job started, job finished, stage failed, log appended) land in your backend.
2. A **run summarizer** maintains:
   * `run_summary` (small JSON)
   * `first_signal` (small, actionable payload)
3. **UI opens run details**:
   * Immediately calls `GET /runs/{id}/first-signal` (or `/summary`).
   * Renders the FirstSignalCard as soon as the payload arrives.
4. Background fetches:
   * Stage graph, full logs, artifacts, security scans, trends.

## Key decision: where to compute the first signal

* **Option A: at ingest time (recommended)**
  Compute the first signal when logs/events arrive, store it, serve it instantly.
* **Option B: on demand**
  Compute when the user opens run details (simpler initially, but worse TTFS and load).
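Option A can be sketched as a small summarizer that updates stored state as events arrive, so the UI read path is a single key lookup. A `Map` stands in for Redis/Postgres here, and the event shape is an illustrative assumption.

```typescript
// Ingest-time summarizer (Option A): compute on write, serve on read.

interface JobEvent {
  runId: string;
  stage: string;
  status: "running" | "success" | "failed";
  firstErrorLine?: string; // extracted earlier in the ingest pipeline
}

interface FirstSignal {
  stage: string;
  message: string;
}

class RunSummarizer {
  private signals = new Map<string, FirstSignal>();

  // Called from the event/webhook consumer, not from the UI request path.
  onJobEvent(e: JobEvent): void {
    if (e.status === "failed" && e.firstErrorLine && !this.signals.has(e.runId)) {
      this.signals.set(e.runId, { stage: e.stage, message: e.firstErrorLine });
    }
  }

  // The UI read path: a single key lookup, no log scanning.
  getFirstSignal(runId: string): FirstSignal | null {
    return this.signals.get(runId) ?? null;
  }
}
```

Note that the first failure wins: later failures in the same run do not overwrite the stored signal, matching the “first” in first signal.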
---

# Data model

## Tables (relational example)

### `ci_run`

* `run_id (pk)`
* `provider`
* `repo_id`
* `branch`
* `status`
* `created_at`, `updated_at`

### `ci_job`

* `job_id (pk)`
* `run_id (fk)`
* `stage_name`
* `job_name`
* `status`
* `started_at`, `finished_at`

### `ci_log_chunk`

* `chunk_id (pk)`
* `job_id (fk)`
* `seq` (monotonic)
* `byte_start`, `byte_end` (range into the blob)
* `first_error_line_no` (nullable)
* `first_error_excerpt` (nullable, short)
* `severity_max` (info/warn/error)

### `ci_run_summary`

* `run_id (pk)`
* `version` (e.g., `1`)
* `etag` (hash)
* `summary_json` (small, 1–5 KB)
* `updated_at`

### `ci_first_signal`

* `run_id (pk)`
* `etag`
* `signal_json` (small, 0.5–2 KB)
* `quality_flags` (bitmask or JSON)
* `updated_at`

## Cache layer

* Redis keys:
  * `run:{runId}:summary:v1`
  * `run:{runId}:first-signal:v1`
* TTL: generous but safe (e.g., 24 h), with write‑through on event updates.

---

# First signal definition

## `FirstSignal` object (recommended shape)

```json
{
  "runId": "123",
  "computedAt": "2025-12-12T09:22:31Z",
  "status": "failed",
  "firstSignal": {
    "type": "stage_failed",
    "classification": "dependency_auth",
    "stage": "build",
    "job": "build-linux-x64",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "signature": "dotnet-restore-401-unauthorized",
    "log": {
      "jobId": "job-789",
      "lines": [
        "error : Response status code does not indicate success: 401 (Unauthorized).",
        "error : The token is expired."
      ],
      "range": { "start": 1880, "end": 1896 }
    },
    "suggestedActions": [
      { "label": "Rotate token", "type": "doc", "target": "internal://docs/tokens" },
      { "label": "Rerun job", "type": "action", "target": "rerun-job:job-789" }
    ]
  },
  "etag": "W/\"a1b2c3\""
}
```

### Notes

* `signature` should be stable for grouping.
* `suggestedActions` is optional but hugely valuable (even 1–2 actions).
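The `etag` fields above can be derived from the serialized payload, so identical JSON always yields the same weak ETag. A minimal sketch using Node’s `crypto`; the SHA-1 choice and 12-character truncation are assumptions, any stable hash works.

```typescript
import { createHash } from "crypto";

// Weak ETag derived from the serialized payload: identical JSON in,
// identical ETag out, so If-None-Match comparisons stay cheap.

function etagFor(payload: object): string {
  const hash = createHash("sha1")
    .update(JSON.stringify(payload))
    .digest("hex")
    .slice(0, 12);
  return `W/"${hash}"`;
}
```

This is computed once at write time in the summarizer and stored in the `etag` column, not recomputed per request.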
---

# APIs

## 1) First signal endpoint

**GET** `/api/runs/{runId}/first-signal`

Headers:

* `If-None-Match: W/"..."` supported
* Response includes `ETag` and `Cache-Control`

Responses:

* `200`: full first signal object
* `304`: not modified
* `404`: run not found
* `204`: run exists but no signal is available yet (rare; should degrade gracefully)

## 2) Summary endpoint (optional but useful)

**GET** `/api/runs/{runId}/summary`

* Includes: status, first failing stage/job, timestamps, blocking policies, artifact counts.

## 3) SSE / WebSocket updates (nice-to-have)

**GET** `/api/runs/{runId}/events` (SSE)

* Push new signal or summary updates in near real time while the user is on the page.

---

# Frontend implementation plan (Angular 17)

## UX behavior

1. **Route enter**
   * Start the TTFS timer.
2. Render instantly:
   * Title, status badge, pipeline metadata (run id, commit, branch).
   * Skeleton for the details area.
3. Fetch the first signal:
   * Render `FirstSignalCard` immediately when available.
   * Fire the telemetry event when the card is **in the DOM and visible**.
4. Lazy-load:
   * Stage graph
   * Full logs viewer
   * Artifacts list
   * Security findings
   * Trends, flaky tests, etc.

## Angular structure

* `RunDetailsResolver` (or a `resolveFn`) requests the first signal.
* `RunDetailsComponent` uses signals to render quickly.
* `FirstSignalCardComponent` is standalone with minimal deps.

## Prefetch strategy from the runs list view

* When the runs table is visible, prefetch summaries/first signals for items in the viewport:
  * Use `IntersectionObserver` to prefetch only visible rows.
  * Store results in an in-memory cache (e.g., a `Map`).
  * Respect ETags to avoid redundant payloads.
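The conditional-request handling behind the `200`/`304` responses above can be sketched framework-agnostically: compare the client’s `If-None-Match` header against the stored ETag and short-circuit when nothing changed. The result shape is an illustrative assumption.

```typescript
// Conditional GET for /first-signal: 304 when the client copy is fresh.

interface ConditionalResult {
  status: 200 | 304;
  body?: string; // serialized first-signal payload, only on 200
}

function conditionalGet(
  currentEtag: string,
  body: string,
  ifNoneMatch?: string
): ConditionalResult {
  if (ifNoneMatch !== undefined && ifNoneMatch === currentEtag) {
    return { status: 304 }; // no body sent; client reuses its cache
  }
  return { status: 200, body };
}
```

Combined with the viewport prefetch, most details-page opens hit this `304` path, which is what makes the page feel “warm”.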
## Telemetry hooks

* `ttfs_start`: route activation + tab visible
* `ttfs_signal_rendered`: FirstSignalCard attached and visible
* Dimensions: provider, repo, branch, run_type, release_version, network_state

---

# Backend implementation plan

## Summarizer / first-signal service

A service or module that:

* subscribes to run/job events
* receives log chunks (or pointers to them)
* computes and stores:
  * `run_summary`
  * `first_signal`
* publishes updates (optional) to an event stream for SSE

### Concurrency rule

The first signal should be set once per run unless a “better” signal appears:

* if the current signal is missing → set
* if the current signal is “generic” and the new one is “specific” → replace
* otherwise keep (avoid churn)

---

# Extraction & classification logic

## Minimum viable extractor (Phase 1)

* Heuristics:
  * first match among patterns: `FATAL`, `ERROR`, `##[error]`, `panic:`, `Unhandled exception`, `npm ERR!`, `BUILD FAILED`, etc.
  * plus provider-specific fail markers
* Pull:
  * stage/job/step context (from job metadata or step boundaries)
  * 5–10 log lines around the first error line

## Improved extractor (Phase 2+)

* Language/tool-specific rules:
  * dotnet, maven/gradle, npm/yarn/pnpm, python/pytest, go test, docker build, terraform, helm
* Add `classification` and `signature`:
  * normalize common errors:
    * auth expired/forbidden
    * missing dependency / DNS / TLS
    * compilation error
    * test failure (include the test name)
    * infra capacity / agent lost
    * policy gate failure

## Guardrails

* **Secret redaction**: before storing excerpts, run your existing redaction pipeline.
* **Payload cap**: cap message length and excerpt lines.
* **PII discipline**: avoid including arbitrary stack traces if they contain sensitive paths; include only key lines.

---

# Development plan by phases (epics)

Each phase below includes deliverables + acceptance criteria. You can treat each as a sprint/iteration.
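The classification and signature steps from the extractor section above can be sketched as a rule table mapping a raw message to a class plus a stable slug. The rule table here is purely illustrative; a real one would be per-toolchain and data-driven.

```typescript
// Message → (classification, signature) normalization.

interface Classified {
  classification: string;
  signature: string; // stable slug for grouping/dedupe
}

// Illustrative rules; order matters (first match wins).
const RULES: Array<[RegExp, string]> = [
  [/401|unauthorized|token (is )?expired/i, "dependency_auth"],
  [/could not resolve|ENOTFOUND|no such host/i, "dependency_network"],
  [/error CS\d+|compilation failed/i, "compilation"],
  [/\d+ tests? failed|AssertionError/i, "test_failure"],
];

function classify(message: string): Classified {
  const match = RULES.find(([re]) => re.test(message));
  const classification = match ? match[1] : "unclassified";
  // Signature: lowercase alphanumeric tokens joined by dashes, capped at 6 tokens.
  const signature = message
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-|-$/g, "")
    .split("-")
    .slice(0, 6)
    .join("-");
  return { classification, signature };
}
```

The signature must stay stable across runs for the same root cause, which is why it is derived by normalization rather than hashing the raw line (timestamps and paths would otherwise break grouping).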
---

## Phase 0 — Baseline and alignment

### Deliverables

* Baseline TTFS measurement (current behavior)
* Definition of “actionable signal” and priority rules
* Performance budget for the run details view

### Tasks

* Add client-side telemetry for current page load steps:
  * route enter, summary loaded, logs loaded, graph loaded
* Measure today’s TTFS proxy (likely “time to status shown”)
* Identify the top 20 failure modes in your CI (from historical logs)

### Acceptance criteria

* Dashboard shows baseline P50/P95 for the current experience.
* “First signal” contract signed off by the UI + backend teams.

---

## Phase 1 — Data model and storage

### Deliverables

* DB migrations for `ci_run_summary` and `ci_first_signal`
* Redis cache keys and invalidation strategy
* ADR: where summaries live and how they update

### Tasks

* Create tables and indices:
  * index on `run_id`, `updated_at`, `provider`
* Add a serializer/deserializer for `summary_json` and `signal_json`
* Implement ETag generation (hash of the JSON payload)

### Acceptance criteria

* Can store and retrieve the summary + first signal for a run in < 50 ms (DB) and < 10 ms (cache).
* ETag works end-to-end.

---

## Phase 2 — Ingestion and first signal computation

### Deliverables

* First-signal computation module
* Provider adapter integration points (webhook consumers)
* “First error tuple” extraction from logs

### Tasks

* On job log append:
  * scan incrementally for first error markers
  * store the excerpt + line range + job/stage/step mapping
* On job finish/fail:
  * finalize the first signal with the best known context
* Implement the “better signal replaces generic” rule

### Acceptance criteria

* For a known failing run, the API returns the first signal without reading the full log blob.
* Computation does not exceed a small CPU budget per log chunk (guard with limits).
* Extraction failure rate < 1% for sampled runs (initial).
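The “better signal replaces generic” rule from Phase 2 can be sketched with a simple two-level specificity ordering; the levels themselves (and where they come from, e.g. the classifier) are assumptions.

```typescript
// "Better replaces generic": a candidate overwrites the stored signal
// only when it is strictly more specific, never sideways or downward.

type Specificity = "generic" | "specific";

interface StoredSignal {
  specificity: Specificity;
  message: string;
}

function shouldReplace(current: StoredSignal | null, candidate: StoredSignal): boolean {
  if (current === null) return true; // nothing stored yet → set
  // Replace only when moving generic → specific; otherwise keep the
  // existing signal to avoid churn while later log chunks stream in.
  return current.specificity === "generic" && candidate.specificity === "specific";
}
```

Keeping this predicate pure makes the concurrency story simple: the summarizer can run it inside a compare-and-swap or row-level-locked update without re-reading logs.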
---

## Phase 3 — API endpoints and caching

### Deliverables

* `/runs/{id}/first-signal` endpoint
* Optional `/runs/{id}/summary`
* Cache-Control + ETag support
* Access control checks consistent with existing run authorization

### Tasks

* Serve the cached first signal first; fall back to the DB
* If missing:
  * return `204` (or a “pending” object) and allow the UI fallback
* Add server-side metrics:
  * endpoint latency, cache hit rate, payload size

### Acceptance criteria

* Endpoint P95 latency meets the target (e.g., < 200 ms internal).
* Cache hit rate is high for active runs (after prefetch).

---

## Phase 4 — Frontend progressive rendering

### Deliverables

* FirstSignalCard component
* Route resolver + local cache
* Prefetch on the runs list view
* Telemetry for TTFS

### Tasks

* Render the shell immediately
* Fetch and render the first signal
* Lazy-load heavy panels using `@defer` / dynamic imports
* Implement the “open failing stage” default behavior

### Acceptance criteria

* In a throttled network test, the first signal card appears significantly earlier than logs and graphs.
* `ttfs_signal_rendered` fires exactly once per view, with correct dimensions.

---

## Phase 5 — Observability, dashboards, and alerting

### Deliverables

* TTFS dashboards by:
  * provider, repo, run type, release version
* Alerts:
  * P95 regression threshold
* Quality dashboard:
  * availability rate, extraction failures, “generic signal rate”

### Tasks

* Create an event pipeline for telemetry into your analytics system
* Define SLO/error budget alerts
* Add tracing (OpenTelemetry) for the endpoint and summarizer

### Acceptance criteria

* You can correlate TTFS with:
  * bounce rate
  * open→action time
* You can pinpoint whether regressions are backend, frontend, or provider‑specific.
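The Phase 4 criterion that `ttfs_signal_rendered` fires exactly once per view can be enforced with a small guard around the emitter. A minimal sketch: the event shape and the injectable clock are assumptions (in the real app the clock would be `performance.now` and the sink your analytics client).

```typescript
// Fire-once TTFS tracker: start on route enter, emit ttfs_signal_rendered
// at most once per page view, even if the card re-attaches on re-render.

interface TtfsEvent {
  name: "ttfs_signal_rendered";
  ttfsMs: number;
}

class TtfsTracker {
  private readonly startMs: number;
  private fired = false;

  constructor(
    private emit: (e: TtfsEvent) => void,
    private clock: () => number = Date.now
  ) {
    this.startMs = clock(); // route activation = ttfs_start
  }

  onSignalRendered(): void {
    if (this.fired) return; // guard: exactly once per view
    this.fired = true;
    this.emit({ name: "ttfs_signal_rendered", ttfsMs: this.clock() - this.startMs });
  }
}
```

A fresh tracker is created per route activation, so revisiting the page starts a new measurement rather than suppressing it.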
---

## Phase 6 — QA, performance testing, rollout

### Deliverables

* Automated tests
* Feature flag + gradual rollout
* A/B experiment (optional)

### Tasks

**Testing**

* Unit tests:
  * extractor patterns
  * classification rules
* Integration tests:
  * simulated job logs with known outcomes
* E2E (Playwright/Cypress):
  * verify the first signal appears before logs
  * verify the fallback path works if the endpoint fails
* Performance tests:
  * cold cache vs warm cache
  * throttled CPU/network profiles

**Rollout**

* Feature flag:
  * enabled for internal users first
  * ramp by repo or percentage
* Monitor key metrics during the ramp:
  * TTFS P95
  * API error rate
  * UI error rate
  * cache miss spikes

### Acceptance criteria

* No increase in overall error rates.
* TTFS improves at least X% for a meaningful slice of users (define X from the baseline).
* Fallback UX remains usable when signals are unavailable.

---

# Backlog examples (ready-to-create Jira tickets)

## Epic: Run summary and first signal storage

* Create the `ci_first_signal` table
* Create the `ci_run_summary` table
* Implement ETag hashing
* Implement the Redis caching layer
* Add an admin/debug endpoint (internal only) to inspect computed signals

## Epic: Log chunk extraction

* Implement incremental log scanning
* Store the first error excerpt + range
* Map the excerpt to a job + step
* Add a redaction pass to excerpts

## Epic: Run details progressive UI

* FirstSignalCard UI component
* Lazy-load the logs viewer
* Default to opening the failing stage
* Prefetch signals in the runs list

## Epic: Telemetry and dashboards

* Add `ttfs_start` and `ttfs_signal_rendered`
* Add endpoint latency metrics
* Build dashboards + alerts
* Add sampling for “signal helpfulness” feedback

---

# Risk register and mitigations

## Risk: First signal is wrong/misleading

* Mitigation:
  * track “generic signal rate” and “corrected by user” feedback
  * classification confidence scoring
  * always provide quick access to full logs as a fallback

## Risk: Logs contain secrets

* Mitigation:
  * redact excerpts
    before storing/serving
  * cap excerpt lines and length
  * keep raw logs behind existing permissions

## Risk: Increased ingest CPU cost

* Mitigation:
  * incremental scanning with an early stop after the first error is captured
  * limit scanning per chunk
  * sample/skip overly large logs until job completion

## Risk: Cache invalidation bugs

* Mitigation:
  * ETag-based correctness
  * versioned keys: `:v1`
  * “write-through” cache updates from the summarizer

---

# Definition of Done checklist

A phase is “done” when:

* ✅ TTFS measured with reliable client events
* ✅ FirstSignalCard renders from the `/first-signal` endpoint
* ✅ ETag caching works
* ✅ Fallback path is solid (no blank screens)
* ✅ Dashboards exist and are actively watched during rollout
* ✅ Security review completed for log excerpts/redaction
* ✅ Load tests show no unacceptable backend regressions

---

# Optional enhancements after the initial launch

1. **Next-step recommendations**
   Add action suggestions and deep links (rotate token, open failing test, open doc).
2. **Flaky test / known issue detection**
   Show “this matches a known flaky signature” with last-seen frequency.
3. **“Compare to last green”**
   Summarize what changed since the last successful run (commit diff, dependency bump).
4. **SSE live updates**
   Update the first signal as soon as the failure occurs while the user watches.

---

If you tell me your current backend stack (Node/Go/.NET), log storage (S3/Elastic/Loki), and which CI providers you support, I can translate this into a concrete set of modules/classes, exact schema migrations, and the Angular routing + signals code structure you’d implement.