Here's a simple, high-leverage UX metric to add to your pipeline run view that will immediately make DevOps feel faster and calmer:

Time-to-First-Signal (TTFS)

What it is: the time from opening a run's details page until the UI renders the first actionable insight (e.g., “Stage build failed: dotnet restore → 401 token expired”). Why it matters: engineers don't need all data instantly, just the first trustworthy clue to start acting. Lower TTFS = quicker triage, lower stress, tighter MTTR.


What counts as a “first signal”

  • Failed stage + reason (exit code, key log line, failing test name)
  • Degraded but actionable status (e.g., flaky test signature)
  • Policy gate block with the specific rule that failed
  • Reachability-aware security finding that blocks deploy (one concrete example, not the whole list)

Not a signal: spinners, generic “loading…”, or unactionable counts.


How to optimize TTFS (practical steps)

  1. Deferred loading (prioritize critical panes):

    • Render header + failing stage card first; lazy-load artifacts, full logs, and graphs after.
    • Pre-expand the first failing node in the stage graph.
  2. Log pre-indexing at ingest:

    • During CI, stream logs into chunks keyed by [jobId, phase, severity, firstErrorLine].
    • Extract the first error tuple (timestamp, step, message) and store it next to the job record (see the sketch after this list).
    • On UI open, fetch only that tuple (sub-100ms) before fetching the rest.
  3. Cached summaries:

    • Persist a tiny JSON “run.summary.v1” (status, first failing stage, first error line, blocking policies) in Redis/Postgres.
    • Invalidate on new job events; always serve this summary first.
  4. Edge prefetch:

    • When the runs table is visible, prefetch summaries for rows in viewport so details pages open “warm”.
  5. Compress + cap first log burst:

    • Send the first 5-10 error lines (already extracted) immediately; stream the rest.
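
The extraction in step 2 can be a small pure function at ingest. A minimal TypeScript sketch, assuming a streaming ingest that hands you log chunks per job; the LogChunk, FirstErrorTuple, and ERROR_MARKERS names are illustrative, not an existing API:

```typescript
// Illustrative types for the streaming ingest; not an existing API.
interface LogChunk {
  jobId: string;
  phase: string;
  startLine: number;   // line number of the first line in this chunk
  lines: string[];
}

interface FirstErrorTuple {
  jobId: string;
  phase: string;
  line: number;
  message: string;
  at: string;          // ISO timestamp when the tuple was captured
}

// Common failure markers; extend per toolchain (dotnet, npm, pytest, ...).
const ERROR_MARKERS = /FATAL|ERROR|##\[error\]|panic:|Unhandled exception|npm ERR!|BUILD FAILED/;

const firstErrors = new Map<string, FirstErrorTuple>();

// Called for every chunk as it streams in; stops scanning once a tuple exists.
export function indexChunk(chunk: LogChunk): FirstErrorTuple | undefined {
  const existing = firstErrors.get(chunk.jobId);
  if (existing) return existing;                           // early stop per job
  const idx = chunk.lines.findIndex((l) => ERROR_MARKERS.test(l));
  if (idx === -1) return undefined;
  const tuple: FirstErrorTuple = {
    jobId: chunk.jobId,
    phase: chunk.phase,
    line: chunk.startLine + idx,
    message: chunk.lines[idx].slice(0, 500),               // cap excerpt length
    at: new Date().toISOString(),
  };
  firstErrors.set(chunk.jobId, tuple);
  return tuple;                                            // caller persists this next to the job record
}
```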

Instrumentation (so you can prove it)

Emit these points as telemetry:

  • ttfs_start: when the run details route is entered (or when tab becomes visible)
  • ttfs_signal_rendered: when the first actionable card is in the DOM
  • ttfs_ms = signal_rendered - start
  • Dimensions: pipeline_provider, repo, branch, run_type (PR/main), device, release, network_state
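
A minimal sketch of emitting these two events from the client, assuming a generic track() function for whatever analytics sink you already use; all names here are illustrative:

```typescript
// Telemetry pair for TTFS; wire track() to your existing analytics client.
type TtfsDimensions = {
  pipeline_provider: string;
  repo: string;
  branch: string;
  run_type: 'PR' | 'main';
  device: string;
  release: string;
  network_state: string;
};

// Placeholder for whatever telemetry client you already use.
declare function track(event: string, payload: Record<string, unknown>): void;

let ttfsStart: number | undefined;

export function markTtfsStart(): void {
  // Route entered (or tab became visible); performance.now() is monotonic.
  ttfsStart = performance.now();
  track('ttfs_start', {});
}

export function markTtfsSignalRendered(dims: TtfsDimensions): void {
  if (ttfsStart === undefined) return;            // guard against double-fire
  const ttfs_ms = Math.round(performance.now() - ttfsStart);
  track('ttfs_signal_rendered', { ttfs_ms, ...dims });
  ttfsStart = undefined;                          // fire exactly once per view
}
```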

SLO: P50 ≤ 700ms, P95 ≤ 2.5s (adjust to your infra).

Dashboards to track:

  • TTFS distribution (P50/P90/P95) by release
  • Correlate TTFS with bounce rate and “open → rerun” delay
  • Error budget: % of views with TTFS > 3s

Minimal backend contract (example)

GET /api/runs/{runId}/first-signal
{
  "runId": "123",
  "firstSignal": {
    "type": "stage_failed",
    "stage": "build",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "at": "2025-12-11T09:22:31Z",
    "artifact": { "kind": "log", "range": {"start": 1880, "end": 1896} }
  },
  "summaryEtag": "W/\"a1b2c3\""
}

Frontend pattern (Angular 17, signal-first)

  • Fire first-signal request in route resolver.
  • Render FirstSignalCard immediately.
  • Lazy-load stage graph, full logs, security panes.
  • Fire ttfs_signal_rendered when FirstSignalCard enters viewport.
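
A minimal Angular 17 sketch of the resolver piece, assuming a functional resolver and the /first-signal contract above; type and component names are illustrative:

```typescript
import { inject } from '@angular/core';
import { HttpClient } from '@angular/common/http';
import { ResolveFn } from '@angular/router';

// Shape of the /first-signal response (see the backend contract above).
export interface FirstSignalResponse {
  runId: string;
  firstSignal: { type: string; stage: string; step: string; message: string; at: string } | null;
  summaryEtag: string;
}

// Functional resolver: the router waits for this (sub-100ms on the warm path)
// before activating the route, so FirstSignalCard can render from route data
// immediately while heavier panes are still deferred.
export const firstSignalResolver: ResolveFn<FirstSignalResponse> = (route) => {
  const http = inject(HttpClient);
  const runId = route.paramMap.get('runId');
  return http.get<FirstSignalResponse>(`/api/runs/${runId}/first-signal`);
};

// Route registration (illustrative):
// { path: 'runs/:runId', component: RunDetailsComponent, resolve: { firstSignal: firstSignalResolver } }
```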

CI adapter hints (GitLab/GitHub/Azure)

  • Hook on job status webhooks to compute & store the first error tuple.
  • For GitLab: scan the trace stream for the first ERROR|FATAL|##[error] match; store it in a DB table ci_run_first_signal(run_id, stage, step, message, t).

“Good TTFS” acceptance tests

  • Run with early fail → first signal < 1s, shows exact command + exit code.
  • Run with policy gate fail → rule name + fix hint visible first.
  • Offline/slow network → cached summary still renders an actionable hint.

Copy to put in your UX guidelines

“Optimize Time-to-First-Signal (TTFS) above all. Users must see one trustworthy, actionable clue within 1 second on a warm path, even if the rest of the UI is still loading.”

Below is an extended, end-to-end implementation plan for Time-to-First-Signal (TTFS) that you can drop into your backlog, including the DB schema for the pre-indexed log tuples and the Angular resolver + telemetry hooks. It covers architecture, data model, API contracts, frontend work, observability, QA, and rollout, structured as epics/phases with “definition of done” and acceptance criteria.


Scope extension

What we're building

A run details experience that renders one actionable clue fast—before loading heavy UI like full logs, graphs, artifacts.

“First signal” is a small payload derived from run/job events and the earliest meaningful error evidence (stage/step + key log line(s) + reason/classification).

What we're extending beyond the initial idea

  1. First-Signal Quality (not just speed)

    • Classify error type (auth, dependency, compilation, test, infra, policy, timeout).
    • Identify “culprit step” and a stable “signature” for dedupe and search.
  2. Progressive disclosure UX

    • Summary → First signal card → expanded context (stage graph, logs, artifacts).
  3. Provider-agnostic ingestion

    • Adapters for GitLab/GitHub/Azure (or your CI provider).
  4. Caching + prefetch

    • Warm open from list/table, with ETags and stale-while-revalidate.
  5. Observability & SLOs

    • TTFS metrics, dashboards, alerting, and quality metrics (false signals).
  6. Rollout safety

    • Feature flags, canary, A/B gating, and a guaranteed fallback path.

Success criteria

Primary metric

  • TTFS (ms): time from details page route enter → first actionable signal rendered.

Targets (example SLOs)

  • P50 ≤ 700 ms, P95 ≤ 2500 ms on warm path.
  • Cold path: P95 ≤ 4000 ms (depends on infra).

Secondary outcome metrics

  • Open→Action time: time from opening run to first user action (rerun, cancel, assign, open failing log line).
  • Bounce rate: close page within 10 seconds without interaction.
  • MTTR proxy: time from failure to first rerun or fix commit.

Quality metrics

  • Signal availability rate: % of run views that show a first signal card within 3s.
  • Signal accuracy score (sampled): engineer confirms “helpful vs not”.
  • Extractor failure rate: parsing errors / missing mappings / timeouts.

Architecture overview

Data flow

  1. CI provider events (job started, job finished, stage failed, log appended) land in your backend.

  2. Run summarizer maintains:

    • run_summary (small JSON)
    • first_signal (small, actionable payload)
  3. UI opens run details

    • Immediately calls GET /runs/{id}/first-signal (or /summary).
    • Renders FirstSignalCard as soon as payload arrives.
  4. Background fetches:

    • Stage graph, full logs, artifacts, security scans, trends.

Key decision: where to compute first signal

  • Option A: at ingest time (recommended). Compute the first signal when logs/events arrive, store it, and serve it instantly.
  • Option B: on demand. Compute when the user opens run details (simpler initially, but worse TTFS and load).

Data model

Tables (relational example)

ci_run

  • run_id (pk)
  • provider
  • repo_id
  • branch
  • status
  • created_at, updated_at

ci_job

  • job_id (pk)
  • run_id (fk)
  • stage_name
  • job_name
  • status
  • started_at, finished_at

ci_log_chunk

  • chunk_id (pk)
  • job_id (fk)
  • seq (monotonic)
  • byte_start, byte_end (range into blob)
  • first_error_line_no (nullable)
  • first_error_excerpt (nullable, short)
  • severity_max (info/warn/error)

ci_run_summary

  • run_id (pk)
  • version (e.g., 1)
  • etag (hash)
  • summary_json (small, 1-5 KB)
  • updated_at

ci_first_signal

  • run_id (pk)
  • etag
  • signal_json (small, 0.5-2 KB)
  • quality_flags (bitmask or json)
  • updated_at

Cache layer

  • Redis keys:

    • run:{runId}:summary:v1
    • run:{runId}:first-signal:v1
  • TTL: generous but safe (e.g., 24h) with “write-through” on event updates.
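
A minimal sketch of the write-through cache path, assuming an ioredis client and the key convention above; the TTL and helper names are illustrative:

```typescript
import Redis from 'ioredis';

const redis = new Redis();
const TTL_SECONDS = 24 * 60 * 60;                      // generous but safe

const firstSignalKey = (runId: string) => `run:${runId}:first-signal:v1`;
// The summary key follows the same pattern: run:{runId}:summary:v1.

// Write-through: the summarizer persists to the DB first (not shown),
// then refreshes the cache in the same code path on every relevant event.
export async function writeThroughFirstSignal(runId: string, signal: object): Promise<void> {
  await redis.set(firstSignalKey(runId), JSON.stringify(signal), 'EX', TTL_SECONDS);
}

// Read path for the /first-signal endpoint: cache first, DB as fallback,
// repopulating the cache on a miss.
export async function readFirstSignal(
  runId: string,
  loadFromDb: (id: string) => Promise<object | null>,
): Promise<object | null> {
  const cached = await redis.get(firstSignalKey(runId));
  if (cached) return JSON.parse(cached);
  const fromDb = await loadFromDb(runId);
  if (fromDb) await writeThroughFirstSignal(runId, fromDb);
  return fromDb;
}
```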


First signal definition

{
  "runId": "123",
  "computedAt": "2025-12-12T09:22:31Z",
  "status": "failed",
  "firstSignal": {
    "type": "stage_failed",
    "classification": "dependency_auth",
    "stage": "build",
    "job": "build-linux-x64",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "signature": "dotnet-restore-401-unauthorized",
    "log": {
      "jobId": "job-789",
      "lines": [
        "error : Response status code does not indicate success: 401 (Unauthorized).",
        "error : The token is expired."
      ],
      "range": { "start": 1880, "end": 1896 }
    },
    "suggestedActions": [
      { "label": "Rotate token", "type": "doc", "target": "internal://docs/tokens" },
      { "label": "Rerun job", "type": "action", "target": "rerun-job:job-789" }
    ]
  },
  "etag": "W/\"a1b2c3\""
}

Notes

  • signature should be stable for grouping.
  • suggestedActions is optional but hugely valuable (even 1-2 actions).
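
For reference, a TypeScript view of this payload that could be shared between the summarizer and the UI; field names mirror the JSON example above, everything else is an assumption:

```typescript
export interface FirstSignalPayload {
  runId: string;
  computedAt: string;                              // ISO 8601
  status: string;                                  // "failed" in the example above
  firstSignal: {
    type: string;                                  // e.g., "stage_failed"
    classification: string;                        // e.g., "dependency_auth"
    stage: string;
    job: string;
    step: string;
    message: string;
    signature: string;                             // stable key for grouping/dedupe
    log: {
      jobId: string;
      lines: string[];                             // redacted, capped excerpt
      range: { start: number; end: number };
    };
    suggestedActions?: Array<{ label: string; type: string; target: string }>;
  } | null;                                        // null while no signal is available yet
  etag: string;
}
```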

APIs

1) First signal endpoint

GET /api/runs/{runId}/first-signal

Headers:

  • If-None-Match: W/"..." supported
  • Response includes ETag and Cache-Control

Responses:

  • 200: full first signal object
  • 304: not modified
  • 404: run not found
  • 204: run exists but signal not available yet (rare; should degrade gracefully)
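
A sketch of the conditional-request handling, assuming an Express-style handler and a hypothetical lookup over the cache/DB layer described above; 404 and authorization checks are omitted:

```typescript
import type { Request, Response } from 'express';

// Hypothetical lookup returning the stored payload with its precomputed ETag.
declare function loadFirstSignal(runId: string): Promise<{ etag: string; [k: string]: unknown } | null>;

export async function getFirstSignal(req: Request, res: Response): Promise<void> {
  const payload = await loadFirstSignal(req.params.runId);
  if (payload === null) {
    res.status(204).end();                         // run exists but signal not available yet
    return;
  }
  if (req.header('If-None-Match') === payload.etag) {
    res.status(304).end();                         // client copy is still fresh
    return;
  }
  res.set('ETag', payload.etag);
  res.set('Cache-Control', 'private, max-age=5, stale-while-revalidate=30');
  res.json(payload);
}
```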

2) Summary endpoint (optional but useful)

GET /api/runs/{runId}/summary

  • Includes: status, first failing stage/job, timestamps, blocking policies, artifact counts.

3) SSE / WebSocket updates (nice-to-have)

GET /api/runs/{runId}/events (SSE)

  • Push new signal or summary updates in near real-time while user is on the page.

Frontend implementation plan (Angular 17)

UX behavior

  1. Route enter

    • Start TTFS timer.
  2. Render instantly:

    • Title, status badge, pipeline metadata (run id, commit, branch).
    • Skeleton for details area.
  3. Fetch first signal:

    • Render FirstSignalCard immediately when available.
    • Fire telemetry event when card is in DOM and visible.
  4. Lazy-load:

    • Stage graph
    • Full logs viewer
    • Artifacts list
    • Security findings
    • Trends, flaky tests, etc.

Angular structure

  • RunDetailsResolver (or resolveFn) requests first signal.
  • RunDetailsComponent uses signals to render quickly.
  • FirstSignalCardComponent is standalone + minimal deps.

Prefetch strategy from runs list view

  • When the runs table is visible, prefetch summaries/first signals for items in viewport:

    • Use IntersectionObserver to prefetch only visible rows.
    • Store results in an in-memory cache (e.g., Map<runId, FirstSignal>).
    • Respect ETag to avoid redundant payloads.
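
A browser-side sketch of that prefetch, assuming each table row carries a data-run-id attribute; the in-memory cache shape and endpoint reuse the contract above:

```typescript
// Prefetch first signals only for rows the user can actually see.
const prefetched = new Map<string, Promise<unknown>>();

const observer = new IntersectionObserver((entries) => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue;
    const runId = (entry.target as HTMLElement).dataset['runId'];
    if (!runId || prefetched.has(runId)) continue;
    // Warm the details page: the HTTP cache plus this map make the later
    // /first-signal call on navigation effectively free.
    prefetched.set(
      runId,
      fetch(`/api/runs/${runId}/first-signal`).then((r) => (r.ok ? r.json() : null)),
    );
    observer.unobserve(entry.target);              // prefetch each row once
  }
});

// Call this after the runs table renders (and again when rows change).
export function observeRunRows(rows: Iterable<HTMLElement>): void {
  for (const row of rows) observer.observe(row);
}
```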

Telemetry hooks

  • ttfs_start: route activation + tab visible
  • ttfs_signal_rendered: FirstSignalCard attached and visible
  • Dimensions: provider, repo, branch, run_type, release_version, network_state

Backend implementation plan

Summarizer / First-signal service

A service or module that:

  • subscribes to run/job events

  • receives log chunks (or pointers)

  • computes and stores:

    • run_summary
    • first_signal
  • publishes updates (optional) to an event stream for SSE

Concurrency rule

First signal should be set once per run unless a “better” signal appears:

  • if current signal is missing → set
  • if current signal is “generic” and new one is “specific” → replace
  • otherwise keep (avoid churn)
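
A minimal sketch of this rule, assuming a coarse specificity score per classification; the scores themselves are illustrative:

```typescript
type Signal = { classification: string; message: string };

// Higher score = more specific/actionable. "generic" is the catch-all bucket.
const SPECIFICITY: Record<string, number> = {
  generic: 0,
  infra: 1,
  timeout: 1,
  dependency: 2,
  dependency_auth: 3,
  compilation: 3,
  test: 3,
  policy: 3,
};

export function shouldReplace(current: Signal | null, incoming: Signal): boolean {
  if (current === null) return true;                        // missing → set
  const cur = SPECIFICITY[current.classification] ?? 0;
  const next = SPECIFICITY[incoming.classification] ?? 0;
  return next > cur;                                        // otherwise keep (avoid churn)
}
```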

Extraction & classification logic

Minimum viable extractor (Phase 1)

  • Heuristics:

    • first match among patterns: FATAL, ERROR, ##[error], panic:, Unhandled exception, npm ERR!, BUILD FAILED, etc.
    • plus provider-specific fail markers
  • Pull:

    • stage/job/step context (from job metadata or step boundaries)
    • 5-10 log lines around the first error line

Improved extractor (Phase 2+)

  • Language/tool specific rules:

    • dotnet, maven/gradle, npm/yarn/pnpm, python/pytest, go test, docker build, terraform, helm
  • Add classification and signature:

    • normalize common errors:

      • auth expired/forbidden
      • missing dependency / DNS / TLS
      • compilation error
      • test failure (include test name)
      • infra capacity / agent lost
      • policy gate failure
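
A sketch of the classification and signature step; the pattern table is illustrative and should grow from your top failure modes:

```typescript
const CLASSIFIERS: Array<{ classification: string; pattern: RegExp }> = [
  { classification: 'dependency_auth', pattern: /401|403|unauthorized|forbidden|token (is )?expired/i },
  { classification: 'dependency',      pattern: /could not resolve|ENOTFOUND|TLS handshake|certificate/i },
  { classification: 'compilation',     pattern: /error CS\d+|cannot find symbol|SyntaxError/i },
  { classification: 'test',            pattern: /assert|expected .* but|\d+ failing/i },
  { classification: 'infra',           pattern: /agent lost|no space left on device|OOMKilled/i },
  { classification: 'policy',          pattern: /policy gate|rule .* failed/i },
];

export function classify(message: string): string {
  return CLASSIFIERS.find((c) => c.pattern.test(message))?.classification ?? 'generic';
}

// Stable, lowercase, punctuation-free key for grouping and search, e.g.
// "dotnet restore" + "401 Unauthorized: token expired"
//   -> "dotnet-restore-401-unauthorized-token-expired"
export function signature(step: string, message: string): string {
  return `${step} ${message}`
    .toLowerCase()
    .replace(/\d{4,}/g, '')          // drop long numbers (line numbers, ids)
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
    .slice(0, 80);
}
```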

Guardrails

  • Secret redaction: before storing excerpts, run your existing redaction pipeline.
  • Payload cap: cap message length and excerpt lines.
  • PII discipline: avoid including arbitrary stack traces if they contain sensitive paths; include only key lines.

Development plan by phases (epics)

Each phase below includes deliverables + acceptance criteria. You can treat each as a sprint/iteration.


Phase 0 — Baseline and alignment

Deliverables

  • Baseline TTFS measurement (current behavior)
  • Definition of “actionable signal” and priority rules
  • Performance budget for run details view

Tasks

  • Add client-side telemetry for current page load steps:

    • route enter, summary loaded, logs loaded, graph loaded
  • Measure TTFS proxy today (likely “time to status shown”)

  • Identify top 20 failure modes in your CI (from historical logs)

Acceptance criteria

  • Dashboard shows baseline P50/P95 for current experience.
  • “First signal” contract signed off with UI + backend teams.

Phase 1 — Data model and storage

Deliverables

  • DB migrations for ci_run_summary and ci_first_signal
  • Redis cache keys and invalidation strategy
  • ADR: where summaries live and how they update

Tasks

  • Create tables and indices:

    • index on run_id, updated_at, provider
  • Add serializer/deserializer for summary_json and signal_json

  • Implement ETag generation (hash of JSON payload)
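
A minimal sketch of the ETag task, using Node's crypto module; the weak-ETag format matches the API examples above:

```typescript
import { createHash } from 'node:crypto';

export function etagFor(payload: unknown): string {
  // JSON.stringify is stable enough here as long as the serializer always
  // writes fields in the same order; otherwise sort keys before hashing.
  const hash = createHash('sha256').update(JSON.stringify(payload)).digest('hex').slice(0, 12);
  return `W/"${hash}"`;
}
```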

Acceptance criteria

  • Can store and retrieve summary + first signal for a run in < 50ms (DB) and < 10ms (cache).
  • ETag works end-to-end.

Phase 2 — Ingestion and first signal computation

Deliverables

  • First-signal computation module
  • Provider adapter integration points (webhook consumers)
  • “first error tuple” extraction from logs

Tasks

  • On job log append:

    • scan incrementally for first error markers
    • store excerpt + line range + job/stage/step mapping
  • On job finish/fail:

    • finalize first signal with best known context
  • Implement the “better signal replaces generic” rule

Acceptance criteria

  • For a known failing run, API returns first signal without reading full log blob.
  • Computation does not exceed a small CPU budget per log chunk (guard with limits).
  • Extraction failure rate < 1% for sampled runs (initial).

Phase 3 — API endpoints and caching

Deliverables

  • /runs/{id}/first-signal endpoint
  • Optional /runs/{id}/summary
  • Cache-control + ETag support
  • Access control checks consistent with existing run authorization

Tasks

  • Serve cached first signal first; fallback to DB

  • If missing:

    • return 204 (or a “pending” object) and allow UI fallback
  • Add server-side metrics:

    • endpoint latency, cache hit rate, payload size

Acceptance criteria

  • Endpoint P95 latency meets target (e.g., < 200ms internal).
  • Cache hit rate is high for active runs (after prefetch).

Phase 4 — Frontend progressive rendering

Deliverables

  • FirstSignalCard component
  • Route resolver + local cache
  • Prefetch on runs list view
  • Telemetry for TTFS

Tasks

  • Render shell immediately
  • Fetch and render first signal
  • Lazy-load heavy panels using @defer / dynamic imports
  • Implement “open failing stage” default behavior
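
A sketch of the progressive template using Angular 17 @defer blocks; the child component names and import paths are hypothetical placeholders for the panes named above:

```typescript
import { Component } from '@angular/core';
// Hypothetical child components from this plan.
import { FirstSignalCardComponent } from './first-signal-card.component';
import { StageGraphComponent } from './stage-graph.component';
import { LogsViewerComponent } from './logs-viewer.component';

@Component({
  selector: 'app-run-details',
  standalone: true,
  imports: [FirstSignalCardComponent, StageGraphComponent, LogsViewerComponent],
  template: `
    <app-first-signal-card />                <!-- rendered from resolver data -->

    @defer (on viewport) {
      <app-stage-graph />                    <!-- heavy pane, loads lazily -->
    } @placeholder {
      <div class="skeleton">Stage graph loading…</div>
    }

    @defer (on idle) {
      <app-logs-viewer />                    <!-- full logs after first paint -->
    } @placeholder {
      <div class="skeleton">Logs loading…</div>
    }
  `,
})
export class RunDetailsComponent {}
```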

Acceptance criteria

  • In throttled network test, first signal card appears significantly earlier than logs and graphs.
  • ttfs_signal_rendered fires exactly once per view, with correct dimensions.

Phase 5 — Observability, dashboards, and alerting

Deliverables

  • TTFS dashboards by:

    • provider, repo, run type, release version
  • Alerts:

    • P95 regression threshold
  • Quality dashboard:

    • availability rate, extraction failures, “generic signal rate”

Tasks

  • Create event pipeline for telemetry into your analytics system
  • Define SLO/error budget alerts
  • Add tracing (OpenTelemetry) for endpoint and summarizer

Acceptance criteria

  • You can correlate TTFS with:

    • bounce rate
    • open→action time
  • You can pinpoint whether regressions are backend, frontend, or provider-specific.


Phase 6 — QA, performance testing, rollout

Deliverables

  • Automated tests
  • Feature flag + gradual rollout
  • A/B experiment (optional)

Tasks

Testing

  • Unit tests:

    • extractor patterns
    • classification rules
  • Integration tests:

    • simulated job logs with known outcomes
  • E2E (Playwright/Cypress):

    • verify the first signal appears before logs (see the sketch after this list)
    • verify the fallback path works if the endpoint fails
  • Performance tests:

    • cold cache vs warm cache
    • throttled CPU/network profiles
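
A Playwright sketch of the two E2E checks, assuming data-testid hooks on the relevant components and a known failing run fixture:

```typescript
import { test, expect } from '@playwright/test';

test('first signal renders before full logs', async ({ page }) => {
  await page.goto('/runs/123');                                   // known failing run fixture

  // The actionable card must be visible well before the logs viewer.
  const firstSignal = page.getByTestId('first-signal-card');
  await expect(firstSignal).toBeVisible({ timeout: 1_000 });
  await expect(firstSignal).toContainText('dotnet restore');

  // Logs are allowed to arrive later; they must not block the signal.
  await expect(page.getByTestId('logs-viewer')).toBeVisible({ timeout: 10_000 });
});

test('fallback path when the endpoint fails', async ({ page }) => {
  await page.route('**/api/runs/*/first-signal', (route) => route.fulfill({ status: 500 }));
  await page.goto('/runs/123');
  // No blank screen: the shell and a retry/fallback hint still render.
  await expect(page.getByTestId('run-details-shell')).toBeVisible();
});
```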

Rollout

  • Feature flag:

    • enabled for internal users first
    • ramp by repo or percentage
  • Monitor key metrics during ramp:

    • TTFS P95
    • API error rate
    • UI error rate
    • cache miss spikes

Acceptance criteria

  • No increase in overall error rates.
  • TTFS improves at least X% for a meaningful slice of users (define X from baseline).
  • Fallback UX remains usable when signals are unavailable.

Backlog examples (ready-to-create Jira tickets)

Epic: Run summary and first signal storage

  • Create ci_first_signal table
  • Create ci_run_summary table
  • Implement ETag hashing
  • Implement Redis caching layer
  • Add admin/debug endpoint (internal only) to inspect computed signals

Epic: Log chunk extraction

  • Implement incremental log scanning
  • Store first error excerpt + range
  • Map excerpt to job + step
  • Add redaction pass to excerpts

Epic: Run details progressive UI

  • FirstSignalCard UI component
  • Lazy-load logs viewer
  • Default to opening failing stage
  • Prefetch signals in runs list

Epic: Telemetry and dashboards

  • Add ttfs_start and ttfs_signal_rendered
  • Add endpoint latency metrics
  • Build dashboards + alerts
  • Add sampling for “signal helpfulness” feedback

Risk register and mitigations

Risk: First signal is wrong/misleading

  • Mitigation:

    • track “generic signal rate” and “corrected by user” feedback
    • classification confidence scoring
    • always provide quick access to full logs as fallback

Risk: Logs contain secrets

  • Mitigation:

    • redact excerpts before storing/serving
    • cap excerpt lines and length
    • keep raw logs behind existing permissions

Risk: Increased ingest CPU cost

  • Mitigation:

    • incremental scanning with early stop after first error captured
    • limit scanning per chunk
    • sample/skip overly large logs until job completion

Risk: Cache invalidation bugs

  • Mitigation:

    • ETag-based correctness
    • versioned keys: :v1
    • “write-through” cache updates from summarizer

Definition of Done checklist

A phase is “done” when:

  • TTFS measured with reliable client events
  • FirstSignalCard renders from /first-signal endpoint
  • ETag caching works
  • Fallback path is solid (no blank screens)
  • Dashboards exist and are actively watched during rollout
  • Security review completed for log excerpts/redaction
  • Load tests show no unacceptable backend regressions

Optional enhancements after initial launch

  1. Next-step recommendations: add action suggestions and deep links (rotate token, open failing test, open doc).
  2. Flaky test / known issue detection: show “this matches a known flaky signature” with last-seen frequency.
  3. “Compare to last green”: summarize what changed since the last successful run (commit diff, dependency bump).
  4. SSE live updates: update the first signal as soon as the failure occurs while the user watches.

If you tell me your current backend stack (Node/Go/.NET), log storage (S3/Elastic/Loki), and which CI providers you support, I can translate this into a concrete set of modules/classes, exact schema migrations, and the Angular routing + signals code structure you'd implement.