Here’s a simple, high‑leverage UX metric to add to your pipeline run view that will immediately make DevOps feel faster and calmer:
Time‑to‑First‑Signal (TTFS)
What it is: the time from opening a run’s details page until the UI renders the first actionable insight (e.g., “Stage build failed – dotnet restore 401 – token expired”).
Why it matters: engineers don’t need all data instantly—just the first trustworthy clue to start acting. Lower TTFS = quicker triage, lower stress, tighter MTTR.
What counts as a “first signal”
- Failed stage + reason (exit code, key log line, failing test name)
- Degraded but actionable status (e.g., flaky test signature)
- Policy gate block with the specific rule that failed
- Reachability‑aware security finding that blocks deploy (one concrete example, not the whole list)
Not a signal: spinners, generic “loading…”, or unactionable counts.
How to optimize TTFS (practical steps)
- Deferred loading (prioritize critical panes):
  - Render header + failing stage card first; lazy‑load artifacts, full logs, and graphs after.
  - Pre‑expand the first failing node in the stage graph.
- Log pre‑indexing at ingest:
  - During CI, stream logs into chunks keyed by [jobId, phase, severity, firstErrorLine].
  - Extract the first error tuple (timestamp, step, message) and store it next to the job record.
  - On UI open, fetch only that tuple (sub‑100 ms) before fetching the rest.
- Cached summaries:
  - Persist a tiny JSON “run.summary.v1” (status, first failing stage, first error line, blocking policies) in Redis/Postgres.
  - Invalidate on new job events; always serve this summary first.
- Edge prefetch:
  - When the runs table is visible, prefetch summaries for rows in viewport so details pages open “warm”.
- Compress + cap first log burst:
  - Send the first 5–10 error lines (already extracted) immediately; stream the rest.
Instrumentation (so you can prove it)
Emit these points as telemetry:
- ttfs_start: when the run details route is entered (or when tab becomes visible)
- ttfs_signal_rendered: when the first actionable card is in the DOM
- ttfs_ms = signal_rendered - start
- Dimensions: pipeline_provider, repo, branch, run_type (PR/main), device, release, network_state
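The two telemetry points above can be captured with a small timer object. The following is a minimal sketch, not a definitive implementation: the `TtfsTimer` class, its injected clock, and the `emit` callback are all hypothetical names introduced here for illustration, and the sketch assumes one timer instance per run details view.

```typescript
// Dimension names follow the list above; values are supplied by the caller.
type TtfsDimensions = Record<string, string>;

interface TtfsEvent {
  name: "ttfs_signal_rendered";
  ttfsMs: number;
  dimensions: TtfsDimensions;
}

// Hypothetical helper: create one instance per run-details view.
class TtfsTimer {
  private startedAt: number | null = null;
  private reported = false;
  private now: () => number;
  private emit: (event: TtfsEvent) => void;

  // `now` is an injectable clock (for example () => performance.now()),
  // `emit` is whatever telemetry sink the application already uses.
  constructor(now: () => number, emit: (event: TtfsEvent) => void) {
    this.now = now;
    this.emit = emit;
  }

  // ttfs_start: call on route enter (or when the tab becomes visible).
  start(): void {
    if (this.startedAt === null) this.startedAt = this.now();
  }

  // ttfs_signal_rendered: call when the first actionable card is in the DOM.
  // Only the first call after start() emits; later calls are ignored.
  signalRendered(dimensions: TtfsDimensions): void {
    if (this.startedAt === null || this.reported) return;
    this.reported = true;
    this.emit({
      name: "ttfs_signal_rendered",
      ttfsMs: Math.round(this.now() - this.startedAt),
      dimensions,
    });
  }
}
```

Injecting the clock keeps the timer testable without a browser, and the `reported` guard means a re-render of the card cannot produce a second measurement for the same view.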
SLO: P50 ≤ 700 ms, P95 ≤ 2.5 s (adjust to your infra).
Dashboards to track:
- TTFS distribution (P50/P90/P95) by release
- Correlate TTFS with bounce rate and “open → rerun” delay
- Error budget: % of views with TTFS > 3 s
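As a rough illustration of the dashboard numbers above, the sketch below computes P50/P90/P95 and the share of views over the 3 s budget from raw `ttfs_ms` samples. It assumes the nearest-rank percentile definition; an analytics backend may interpolate differently, and the function names are hypothetical.

```typescript
// Nearest-rank percentile over TTFS samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Share of views (0..1) whose TTFS exceeded the budget threshold.
function overBudgetShare(samples: number[], thresholdMs: number): number {
  if (samples.length === 0) return 0;
  return samples.filter((ms) => ms > thresholdMs).length / samples.length;
}

interface TtfsReport {
  p50: number;
  p90: number;
  p95: number;
  overBudget: number;
}

// Builds the numbers a per-release dashboard row would show.
function summarize(samples: number[], thresholdMs = 3000): TtfsReport {
  return {
    p50: percentile(samples, 50),
    p90: percentile(samples, 90),
    p95: percentile(samples, 95),
    overBudget: overBudgetShare(samples, thresholdMs),
  };
}
```

Running `summarize` once per release (or per provider) gives directly comparable rows for the TTFS distribution chart.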
Minimal backend contract (example)
GET /api/runs/{runId}/first-signal
{
  "runId": "123",
  "firstSignal": {
    "type": "stage_failed",
    "stage": "build",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "at": "2025-12-11T09:22:31Z",
    "artifact": { "kind": "log", "range": { "start": 1880, "end": 1896 } }
  },
  "summaryEtag": "W/\"a1b2c3\""
}
Frontend pattern (Angular 17, signal‑first)
- Fire the first-signal request in the route resolver.
- Render FirstSignalCard immediately.
- Lazy‑load stage graph, full logs, security panes.
- Fire ttfs_signal_rendered when FirstSignalCard enters the viewport.
CI adapter hints (GitLab/GitHub/Azure)
- Hook on job status webhooks to compute & store the first error tuple.
- For GitLab: scan the trace stream for the first ERRO|FATAL|##[error] match; store it in DB table ci_run_first_signal(run_id, stage, step, message, t).
“Good TTFS” acceptance tests
- Run with early fail → first signal < 1 s, shows exact command + exit code.
- Run with policy gate fail → rule name + fix hint visible first.
- Offline/slow network → cached summary still renders an actionable hint.
Copy to put in your UX guidelines
“Optimize Time‑to‑First‑Signal (TTFS) above all. Users must see one trustworthy, actionable clue within 1 second on a warm path—even if the rest of the UI is still loading.”
If you want, I can sketch the exact DB schema for the pre‑indexed log tuples and the Angular resolver + telemetry hooks next.
Below is an extended, end‑to‑end implementation plan for Time‑to‑First‑Signal (TTFS) that you can drop into your backlog. It covers architecture, data model, API contracts, frontend work, observability, QA, and rollout, structured as epics/phases with a “definition of done” and acceptance criteria.
Scope extension
What we’re building
A run details experience that renders one actionable clue fast—before loading heavy UI like full logs, graphs, artifacts.
“First signal” is a small payload derived from run/job events and the earliest meaningful error evidence (stage/step + key log line(s) + reason/classification).
What we’re extending beyond the initial idea
- First‑Signal Quality (not just speed)
  - Classify error type (auth, dependency, compilation, test, infra, policy, timeout).
  - Identify “culprit step” and a stable “signature” for dedupe and search.
- Progressive disclosure UX
  - Summary → First signal card → expanded context (stage graph, logs, artifacts).
- Provider‑agnostic ingestion
  - Adapters for GitLab/GitHub/Azure (or your CI provider).
- Caching + prefetch
  - Warm open from list/table, with ETags and stale‑while‑revalidate.
- Observability & SLOs
  - TTFS metrics, dashboards, alerting, and quality metrics (false signals).
- Rollout safety
  - Feature flags, canary, A/B gating, and a guaranteed fallback path.
Success criteria
Primary metric
- TTFS (ms): time from details page route enter → first actionable signal rendered.
Targets (example SLOs)
- P50 ≤ 700 ms, P95 ≤ 2500 ms on warm path.
- Cold path: P95 ≤ 4000 ms (depends on infra).
Secondary outcome metrics
- Open→Action time: time from opening run to first user action (rerun, cancel, assign, open failing log line).
- Bounce rate: close page within 10 seconds without interaction.
- MTTR proxy: time from failure to first rerun or fix commit.
Quality metrics
- Signal availability rate: % of run views that show a first signal card within 3s.
- Signal accuracy score (sampled): engineer confirms “helpful vs not”.
- Extractor failure rate: parsing errors / missing mappings / timeouts.
Architecture overview
Data flow
- CI provider events (job started, job finished, stage failed, log appended) land in your backend.
- Run summarizer maintains:
  - run_summary (small JSON)
  - first_signal (small, actionable payload)
- UI opens run details
  - Immediately calls GET /runs/{id}/first-signal (or /summary).
  - Renders FirstSignalCard as soon as payload arrives.
- Background fetches:
  - Stage graph, full logs, artifacts, security scans, trends.
Key decision: where to compute first signal
- Option A: at ingest time (recommended). Compute first signal when logs/events arrive, store it, serve it instantly.
- Option B: on demand. Compute when user opens run details (simpler initially, worse TTFS and load).
Data model
Tables (relational example)
ci_run
- run_id (pk)
- provider
- repo_id
- branch
- status
- created_at, updated_at
ci_job
- job_id (pk)
- run_id (fk)
- stage_name
- job_name
- status
- started_at, finished_at
ci_log_chunk
- chunk_id (pk)
- job_id (fk)
- seq (monotonic)
- byte_start, byte_end (range into blob)
- first_error_line_no (nullable)
- first_error_excerpt (nullable, short)
- severity_max (info/warn/error)
ci_run_summary
- run_id (pk)
- version (e.g., 1)
- etag (hash)
- summary_json (small, 1–5 KB)
- updated_at
ci_first_signal
- run_id (pk)
- etag
- signal_json (small, 0.5–2 KB)
- quality_flags (bitmask or json)
- updated_at
Cache layer
- Redis keys:
  - run:{runId}:summary:v1
  - run:{runId}:first-signal:v1
- TTL: generous but safe (e.g., 24h) with “write‑through” on event updates.
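The key scheme and write-through behaviour above can be sketched as follows. This is an illustration under stated assumptions: `TtlCache` is an in-memory stand-in for Redis so the example runs anywhere, a plain `Map` stands in for the database, and `cacheKey`, `writeFirstSignal`, and `readFirstSignal` are hypothetical helper names.

```typescript
interface CacheEntry {
  value: string;
  expiresAt: number;
}

// In-memory stand-in for Redis with per-key TTL.
class TtlCache {
  private entries = new Map<string, CacheEntry>();
  private now: () => number;

  constructor(now: () => number) {
    this.now = now;
  }

  set(key: string, value: string, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Versioned keys, matching the scheme listed above.
function cacheKey(runId: string, kind: "summary" | "first-signal"): string {
  return `run:${runId}:${kind}:v1`;
}

// Write-through: update the store of record and the cache in the same code path.
function writeFirstSignal(db: Map<string, string>, cache: TtlCache, runId: string, json: string): void {
  db.set(runId, json);
  cache.set(cacheKey(runId, "first-signal"), json, DAY_MS);
}

// Read path: cache first, fall back to the store and re-warm the cache.
function readFirstSignal(db: Map<string, string>, cache: TtlCache, runId: string): string | undefined {
  const key = cacheKey(runId, "first-signal");
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const stored = db.get(runId);
  if (stored !== undefined) cache.set(key, stored, DAY_MS);
  return stored;
}
```

Because every summarizer write refreshes the cache entry, the TTL mostly acts as a safety net for runs that stop receiving events.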
First signal definition
FirstSignal object (recommended shape)
{
  "runId": "123",
  "computedAt": "2025-12-12T09:22:31Z",
  "status": "failed",
  "firstSignal": {
    "type": "stage_failed",
    "classification": "dependency_auth",
    "stage": "build",
    "job": "build-linux-x64",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "signature": "dotnet-restore-401-unauthorized",
    "log": {
      "jobId": "job-789",
      "lines": [
        "error : Response status code does not indicate success: 401 (Unauthorized).",
        "error : The token is expired."
      ],
      "range": { "start": 1880, "end": 1896 }
    },
    "suggestedActions": [
      { "label": "Rotate token", "type": "doc", "target": "internal://docs/tokens" },
      { "label": "Rerun job", "type": "action", "target": "rerun-job:job-789" }
    ]
  },
  "etag": "W/\"a1b2c3\""
}
Notes
- signature should be stable for grouping.
- suggestedActions is optional but hugely valuable (even 1–2 actions).
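On the client, the recommended shape could be mirrored by TypeScript types plus a small runtime guard, so a malformed payload falls back gracefully instead of breaking the card. The sketch below is one possible rendering: the interface names and the choice of which fields to validate are assumptions, not part of the contract.

```typescript
interface SuggestedAction {
  label: string;
  type: string;
  target: string;
}

interface LogExcerpt {
  jobId: string;
  lines: string[];
  range: { start: number; end: number };
}

interface FirstSignal {
  type: string;
  classification: string;
  stage: string;
  job: string;
  step: string;
  message: string;
  signature: string;
  log: LogExcerpt;
  suggestedActions?: SuggestedAction[]; // optional, per the notes above
}

interface FirstSignalResponse {
  runId: string;
  computedAt: string;
  status: string;
  firstSignal: FirstSignal;
  etag: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Checks only the fields the card needs in order to render; extra fields are tolerated.
function isFirstSignalResponse(value: unknown): value is FirstSignalResponse {
  if (!isRecord(value)) return false;
  const signal = value["firstSignal"];
  if (!isRecord(signal)) return false;
  const required = ["type", "stage", "step", "message", "signature"];
  return (
    typeof value["runId"] === "string" &&
    typeof value["etag"] === "string" &&
    required.every((field) => typeof signal[field] === "string") &&
    (signal["suggestedActions"] === undefined || Array.isArray(signal["suggestedActions"]))
  );
}
```

A response that fails the guard can be treated like a missing signal, which keeps the fallback path the same for both cases.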
APIs
1) First signal endpoint
GET /api/runs/{runId}/first-signal
Headers:
- If-None-Match: W/"..." supported
- Response includes ETag and Cache-Control
Responses:
- 200: full first signal object
- 304: not modified
- 404: run not found
- 204: run exists but signal not available yet (rare; should degrade gracefully)
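One way a client might map these status codes to UI states is sketched below. The HTTP call itself is left out so the logic stays pure and testable; `HttpResult`, `SignalState`, `conditionalHeaders`, and `interpret` are hypothetical names for this illustration.

```typescript
interface HttpResult {
  status: number;
  etag?: string;
  body?: string;
}

interface CachedSignal {
  etag: string;
  payload: unknown;
}

type SignalState =
  | { kind: "ready"; signal: CachedSignal }
  | { kind: "pending" } // 204: run exists, signal not computed yet
  | { kind: "not_found" }; // 404

// Headers for a conditional request; If-None-Match is sent only when a copy is cached.
function conditionalHeaders(cached?: CachedSignal): Record<string, string> {
  const headers: Record<string, string> = { Accept: "application/json" };
  if (cached) headers["If-None-Match"] = cached.etag;
  return headers;
}

// Maps a response from GET /api/runs/{runId}/first-signal to a UI state.
function interpret(res: HttpResult, cached?: CachedSignal): SignalState {
  switch (res.status) {
    case 200:
      return {
        kind: "ready",
        signal: { etag: res.etag ?? "", payload: JSON.parse(res.body ?? "null") },
      };
    case 304:
      // Not modified: the cached copy is still current.
      if (!cached) throw new Error("304 received without a cached copy");
      return { kind: "ready", signal: cached };
    case 204:
      return { kind: "pending" };
    case 404:
      return { kind: "not_found" };
    default:
      throw new Error(`unexpected status ${res.status}`);
  }
}
```

In this sketch the `pending` state is what triggers the fallback UI, so a 204 never leaves the details area blank.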
2) Summary endpoint (optional but useful)
GET /api/runs/{runId}/summary
- Includes: status, first failing stage/job, timestamps, blocking policies, artifact counts.
3) SSE / WebSocket updates (nice-to-have)
GET /api/runs/{runId}/events (SSE)
- Push new signal or summary updates in near real-time while user is on the page.
Frontend implementation plan (Angular 17)
UX behavior
- Route enter
  - Start TTFS timer.
- Render instantly:
  - Title, status badge, pipeline metadata (run id, commit, branch).
  - Skeleton for details area.
- Fetch first signal:
  - Render FirstSignalCard immediately when available.
  - Fire telemetry event when card is in DOM and visible.
- Lazy-load:
  - Stage graph
  - Full logs viewer
  - Artifacts list
  - Security findings
  - Trends, flaky tests, etc.
Angular structure
- RunDetailsResolver (or resolveFn) requests first signal.
- RunDetailsComponent uses signals to render quickly.
- FirstSignalCardComponent is standalone + minimal deps.
Prefetch strategy from runs list view
- When the runs table is visible, prefetch summaries/first signals for items in viewport:
  - Use IntersectionObserver to prefetch only visible rows.
  - Store results in an in-memory cache (e.g., Map<runId, FirstSignal>).
  - Respect ETag to avoid redundant payloads.
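The bookkeeping behind this prefetch strategy might look like the sketch below. It deliberately leaves out the DOM and network parts: an `IntersectionObserver` callback would pass the visible run IDs to `plan`, and the caller would issue the requests. `PrefetchCache` and its methods are hypothetical names, and the per-call request cap is an assumption.

```typescript
interface CachedSignal {
  etag: string;
  payload: unknown;
}

class PrefetchCache {
  private signals = new Map<string, CachedSignal>();
  private inFlight = new Set<string>();

  // Given the run IDs currently in the viewport, return those worth requesting now.
  // Skips runs that are already cached or already being fetched.
  plan(visibleRunIds: string[], maxRequests = 5): string[] {
    const picked: string[] = [];
    for (const id of visibleRunIds) {
      if (picked.length >= maxRequests) break;
      if (this.signals.has(id) || this.inFlight.has(id)) continue;
      this.inFlight.add(id);
      picked.push(id);
    }
    return picked;
  }

  // Call when a prefetch request succeeds.
  resolve(runId: string, signal: CachedSignal): void {
    this.inFlight.delete(runId);
    this.signals.set(runId, signal);
  }

  // Call when a prefetch request fails, so the run can be retried later.
  fail(runId: string): void {
    this.inFlight.delete(runId);
  }

  // Used by the details page; the stored ETag can be sent as If-None-Match to revalidate.
  get(runId: string): CachedSignal | undefined {
    return this.signals.get(runId);
  }
}
```

Tracking in-flight IDs separately prevents fast scrolling from issuing duplicate requests for the same row.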
Telemetry hooks
- ttfs_start: route activation + tab visible
- ttfs_signal_rendered: FirstSignalCard attached and visible
- Dimensions: provider, repo, branch, run_type, release_version, network_state
Backend implementation plan
Summarizer / First-signal service
A service or module that:
- subscribes to run/job events
- receives log chunks (or pointers)
- computes and stores:
  - run_summary
  - first_signal
- publishes updates (optional) to an event stream for SSE
Concurrency rule
First signal should be set once per run unless a “better” signal appears:
- if current signal is missing → set
- if current signal is “generic” and new one is “specific” → replace
- otherwise keep (avoid churn)
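The replacement rule above is small enough to express as a pure function. In this sketch the test for "specific" (a named step plus a non-generic classification) is an assumption made for illustration; the actual criteria belong to the priority rules agreed in Phase 0.

```typescript
interface Signal {
  message: string;
  step?: string;
  classification?: string;
}

// Assumption: a signal counts as "specific" when it names a step
// and carries a classification other than "generic".
function isSpecific(signal: Signal): boolean {
  return Boolean(signal.step) && Boolean(signal.classification) && signal.classification !== "generic";
}

// Returns the signal that should be stored after seeing a new candidate.
function chooseSignal(current: Signal | null, candidate: Signal): Signal {
  if (current === null) return candidate; // missing -> set
  if (!isSpecific(current) && isSpecific(candidate)) return candidate; // generic -> specific
  return current; // otherwise keep, to avoid churn
}
```

With several workers ingesting logs for the same run, the stored value would likely need an atomic compare-and-set around this function; that part depends on the datastore and is not shown.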
Extraction & classification logic
Minimum viable extractor (Phase 1)
- Heuristics:
  - first match among patterns: FATAL, ERROR, ##[error], panic:, Unhandled exception, npm ERR!, BUILD FAILED, etc.
  - plus provider-specific fail markers
- Pull:
  - stage/job/step context (from job metadata or step boundaries)
  - 5–10 log lines around first error line
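A minimal version of this heuristic extractor could look like the following. The marker list is taken from the patterns above; the context window sizes and the `FirstError` shape are assumptions, and mapping the hit to stage/job/step is left out because it depends on provider metadata.

```typescript
// Markers from the heuristics list; matching is case-sensitive on purpose.
const ERROR_MARKERS: RegExp[] = [
  /\bFATAL\b/,
  /\bERROR\b/,
  /##\[error\]/,
  /\bpanic:/,
  /Unhandled exception/,
  /npm ERR!/,
  /BUILD FAILED/,
];

interface FirstError {
  lineNo: number; // 1-based line number of the first matching line
  line: string;
  excerpt: string[]; // context lines around the match
  range: { start: number; end: number }; // 1-based, inclusive
}

// Finds the first line matching any marker and returns it with surrounding context.
function findFirstError(lines: string[], before = 2, after = 5): FirstError | null {
  const idx = lines.findIndex((line) => ERROR_MARKERS.some((re) => re.test(line)));
  if (idx === -1) return null;
  const start = Math.max(0, idx - before);
  const end = Math.min(lines.length - 1, idx + after);
  return {
    lineNo: idx + 1,
    line: lines[idx],
    excerpt: lines.slice(start, end + 1),
    range: { start: start + 1, end: end + 1 },
  };
}
```

The default window of two lines before and five after yields at most eight excerpt lines, which fits the 5–10 line target.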
Improved extractor (Phase 2+)
- Language/tool specific rules:
  - dotnet, maven/gradle, npm/yarn/pnpm, python/pytest, go test, docker build, terraform, helm
- Add classification and signature:
  - normalize common errors:
    - auth expired/forbidden
    - missing dependency / DNS / TLS
    - compilation error
    - test failure (include test name)
    - infra capacity / agent lost
    - policy gate failure
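To make the classification and signature idea concrete, here is a rough sketch. The regular expressions are illustrative guesses rather than a vetted rule set, and the signature recipe (lowercase, strip volatile tokens, keep the first four slug tokens) is an assumption chosen so that the earlier example message produces `dotnet-restore-401-unauthorized`.

```typescript
type Classification =
  | "auth" | "dependency" | "compilation" | "test"
  | "infra" | "policy" | "timeout" | "unknown";

// Illustrative rules only; first match wins, so order matters.
const RULES: Array<{ classification: Classification; pattern: RegExp }> = [
  { classification: "auth", pattern: /\b(401|403)\b|unauthorized|forbidden|token (is )?expired/i },
  { classification: "timeout", pattern: /timed out|timeout/i },
  { classification: "dependency", pattern: /could not resolve|ENOTFOUND|certificate|package .* not found/i },
  { classification: "compilation", pattern: /error (CS|TS)\d+|compilation failed|cannot find symbol/i },
  { classification: "test", pattern: /tests? failed|assertion/i },
  { classification: "infra", pattern: /agent lost|no space left on device/i },
  { classification: "policy", pattern: /policy (gate|check) failed/i },
];

function classify(message: string): Classification {
  const rule = RULES.find((r) => r.pattern.test(message));
  return rule ? rule.classification : "unknown";
}

// Builds a grouping key that stays the same when ids, hashes, or long numbers change.
function signature(step: string, message: string, maxTokens = 4): string {
  const slug = `${step} ${message}`
    .toLowerCase()
    .replace(/\b[0-9a-f]{8,}\b/g, "") // long hex ids and hashes
    .replace(/\b\d{4,}\b/g, "") // long numbers such as durations or ports
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return slug.split("-").slice(0, maxTokens).join("-");
}
```

Keeping short numbers such as `401` in the signature while dropping long ones is a trade-off: status codes tend to be meaningful for grouping, whereas request ids and timings are not.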
-
Guardrails
- Secret redaction: before storing excerpts, run your existing redaction pipeline.
- Payload cap: cap message length and excerpt lines.
- PII discipline: avoid including arbitrary stack traces if they contain sensitive paths; include only key lines.
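The first two guardrails can be combined in one helper that redacts before it truncates, so a secret is never cut in half and missed. In the sketch below, `demoRedactor` is a toy stand-in for an existing redaction pipeline and should not be treated as adequate on its own; the default caps are assumptions.

```typescript
type Redactor = (line: string) => string;

// Toy redactor for illustration only; a real deployment would plug in its existing pipeline.
const demoRedactor: Redactor = (line) =>
  line.replace(/(token|password|secret)(\s*[=:]\s*)\S+/gi, "$1$2[REDACTED]");

// Redacts every line, then caps both the number of lines and the length of each line.
function safeExcerpt(
  lines: string[],
  redact: Redactor,
  maxLines = 10,
  maxLineLength = 200,
): string[] {
  return lines.slice(0, maxLines).map((line) => {
    const redacted = redact(line); // redact first, truncate second
    return redacted.length > maxLineLength
      ? redacted.slice(0, maxLineLength - 1) + "…"
      : redacted;
  });
}
```

Passing the redactor as a parameter keeps the cap logic independent of whichever redaction rules the organization already maintains.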
Development plan by phases (epics)
Each phase below includes deliverables + acceptance criteria. You can treat each as a sprint/iteration.
Phase 0 — Baseline and alignment
Deliverables
- Baseline TTFS measurement (current behavior)
- Definition of “actionable signal” and priority rules
- Performance budget for run details view
Tasks
- Add client-side telemetry for current page load steps:
  - route enter, summary loaded, logs loaded, graph loaded
- Measure TTFS proxy today (likely “time to status shown”)
- Identify top 20 failure modes in your CI (from historical logs)
Acceptance criteria
- Dashboard shows baseline P50/P95 for current experience.
- “First signal” contract signed off with UI + backend teams.
Phase 1 — Data model and storage
Deliverables
- DB migrations for ci_run_summary and ci_first_signal
- Redis cache keys and invalidation strategy
- ADR: where summaries live and how they update
Tasks
- Create tables and indices:
  - index on run_id, updated_at, provider
- Add serializer/deserializer for summary_json and signal_json
- Implement ETag generation (hash of JSON payload)
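ETag generation from the JSON payload could be sketched as below. Two assumptions are worth flagging: keys are sorted before hashing so that logically equal payloads get the same tag, and a small FNV-1a style hash is used only to keep the example dependency-free. A production service would more likely use a cryptographic digest such as SHA-256.

```typescript
// Serializes with sorted object keys so key order does not change the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .filter((key) => obj[key] !== undefined) // mirror JSON.stringify, which drops undefined
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// FNV-1a style 32-bit hash over UTF-16 code units; a stand-in, not a security primitive.
function hash32Hex(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Weak ETag in the W/"..." form used by the API examples.
function weakEtag(payload: unknown): string {
  return `W/"${hash32Hex(stableStringify(payload))}"`;
}

// True when the client's If-None-Match value means a 304 can be returned.
function isNotModified(ifNoneMatch: string | undefined, currentEtag: string): boolean {
  return ifNoneMatch !== undefined && ifNoneMatch === currentEtag;
}
```

Computing the tag once at write time and storing it in the `etag` column avoids rehashing on every request.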
Acceptance criteria
- Can store and retrieve summary + first signal for a run in < 50ms (DB) and < 10ms (cache).
- ETag works end-to-end.
Phase 2 — Ingestion and first signal computation
Deliverables
- First-signal computation module
- Provider adapter integration points (webhook consumers)
- “first error tuple” extraction from logs
Tasks
- On job log append:
  - scan incrementally for first error markers
  - store excerpt + line range + job/stage/step mapping
- On job finish/fail:
  - finalize first signal with best known context
- Implement the “better signal replaces generic” rule
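The "scan incrementally" task could be approached with a small per-job scanner that remembers how many lines it has seen and stops matching once the first error is found. This is a sketch under assumptions: `IncrementalScanner` is a hypothetical name, chunks are assumed to arrive in order and already split into lines, and the per-chunk line cap is an arbitrary default.

```typescript
interface ScanHit {
  lineNo: number; // 1-based line number within the whole job log
  line: string;
}

class IncrementalScanner {
  private linesSeen = 0;
  private hit: ScanHit | null = null;
  private marker: RegExp;
  private maxLinesPerChunk: number;

  constructor(marker: RegExp, maxLinesPerChunk = 2000) {
    // Drop the g/y flags so RegExp.test() carries no state between lines.
    this.marker = new RegExp(marker.source, marker.flags.replace(/[gy]/g, ""));
    this.maxLinesPerChunk = maxLinesPerChunk;
  }

  // Feed one appended chunk; returns the first hit as soon as it is known.
  append(chunkLines: string[]): ScanHit | null {
    if (this.hit === null) {
      // Budget guard: inspect at most maxLinesPerChunk lines of each chunk.
      const limit = Math.min(chunkLines.length, this.maxLinesPerChunk);
      for (let i = 0; i < limit; i++) {
        if (this.marker.test(chunkLines[i])) {
          this.hit = { lineNo: this.linesSeen + i + 1, line: chunkLines[i] };
          break; // early stop: later errors are not the first signal
        }
      }
    }
    this.linesSeen += chunkLines.length;
    return this.hit;
  }
}
```

Once a hit exists, later chunks cost only a counter update, which keeps the ingest-time CPU overhead close to zero for long logs that fail early.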
Acceptance criteria
- For a known failing run, API returns first signal without reading full log blob.
- Computation does not exceed a small CPU budget per log chunk (guard with limits).
- Extraction failure rate < 1% for sampled runs (initial).
Phase 3 — API endpoints and caching
Deliverables
- /runs/{id}/first-signal endpoint
- Optional /runs/{id}/summary
- Access control checks consistent with existing run authorization
Tasks
- Serve cached first signal first; fallback to DB
- If missing:
  - return 204 (or a “pending” object) and allow UI fallback
- Add server-side metrics:
  - endpoint latency, cache hit rate, payload size
Acceptance criteria
- Endpoint P95 latency meets target (e.g., < 200ms internal).
- Cache hit rate is high for active runs (after prefetch).
Phase 4 — Frontend progressive rendering
Deliverables
- FirstSignalCard component
- Route resolver + local cache
- Prefetch on runs list view
- Telemetry for TTFS
Tasks
- Render shell immediately
- Fetch and render first signal
- Lazy-load heavy panels using @defer / dynamic imports
- Implement “open failing stage” default behavior
Acceptance criteria
- In throttled network test, first signal card appears significantly earlier than logs and graphs.
- ttfs_signal_rendered fires exactly once per view, with correct dimensions.
Phase 5 — Observability, dashboards, and alerting
Deliverables
- TTFS dashboards by:
  - provider, repo, run type, release version
- Alerts:
  - P95 regression threshold
- Quality dashboard:
  - availability rate, extraction failures, “generic signal rate”
Tasks
- Create event pipeline for telemetry into your analytics system
- Define SLO/error budget alerts
- Add tracing (OpenTelemetry) for endpoint and summarizer
Acceptance criteria
- You can correlate TTFS with:
  - bounce rate
  - open→action time
- You can pinpoint whether regressions are backend, frontend, or provider‑specific.
Phase 6 — QA, performance testing, rollout
Deliverables
- Automated tests
- Feature flag + gradual rollout
- A/B experiment (optional)
Tasks
Testing
- Unit tests:
  - extractor patterns
  - classification rules
- Integration tests:
  - simulated job logs with known outcomes
- E2E (Playwright/Cypress):
  - verify first signal appears before logs
  - verify fallback path works if endpoint fails
- Performance tests:
  - cold cache vs warm cache
  - throttled CPU/network profiles
Rollout
- Feature flag:
  - enabled for internal users first
  - ramp by repo or percentage
- Monitor key metrics during ramp:
  - TTFS P95
  - API error rate
  - UI error rate
  - cache miss spikes
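A percentage ramp is easiest to reason about when each user lands in a fixed bucket, so raising the percentage only ever adds users. The sketch below illustrates that idea; the `RolloutFlag` shape and function names are hypothetical, and an existing feature-flag service may already provide equivalent bucketing.

```typescript
interface RolloutFlag {
  internalUsers: Set<string>; // always enabled
  enabledRepos: Set<string>; // ramp by repo
  percentage: number; // 0..100, ramp by share of users
}

// Small deterministic string hash (FNV-1a style); any stable hash would do.
function hash32(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Each user maps to a fixed bucket in 0..99, so a higher percentage is a superset.
function isFirstSignalEnabled(flag: RolloutFlag, userId: string, repo: string): boolean {
  if (flag.internalUsers.has(userId)) return true;
  if (flag.enabledRepos.has(repo)) return true;
  return hash32(userId) % 100 < flag.percentage;
}
```

Users for whom the flag is off simply get the existing run details page, which doubles as the guaranteed fallback path.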
Acceptance criteria
- No increase in overall error rates.
- TTFS improves at least X% for a meaningful slice of users (define X from baseline).
- Fallback UX remains usable when signals are unavailable.
Backlog examples (ready-to-create Jira tickets)
Epic: Run summary and first signal storage
- Create ci_first_signal table
- Create ci_run_summary table
- Implement ETag hashing
- Implement Redis caching layer
- Add admin/debug endpoint (internal only) to inspect computed signals
Epic: Log chunk extraction
- Implement incremental log scanning
- Store first error excerpt + range
- Map excerpt to job + step
- Add redaction pass to excerpts
Epic: Run details progressive UI
- FirstSignalCard UI component
- Lazy-load logs viewer
- Default to opening failing stage
- Prefetch signals in runs list
Epic: Telemetry and dashboards
- Add ttfs_start and ttfs_signal_rendered
- Add endpoint latency metrics
- Build dashboards + alerts
- Add sampling for “signal helpfulness” feedback
Risk register and mitigations
Risk: First signal is wrong/misleading
- Mitigation:
  - track “generic signal rate” and “corrected by user” feedback
  - classification confidence scoring
  - always provide quick access to full logs as fallback
Risk: Logs contain secrets
- Mitigation:
  - redact excerpts before storing/serving
  - cap excerpt lines and length
  - keep raw logs behind existing permissions
Risk: Increased ingest CPU cost
- Mitigation:
  - incremental scanning with early stop after first error captured
  - limit scanning per chunk
  - sample/skip overly large logs until job completion
Risk: Cache invalidation bugs
- Mitigation:
  - ETag-based correctness
  - versioned keys: :v1
  - “write-through” cache updates from summarizer
Definition of Done checklist
A phase is “done” when:
- ✅ TTFS measured with reliable client events
- ✅ FirstSignalCard renders from /first-signal endpoint
- ✅ ETag caching works
- ✅ Fallback path is solid (no blank screens)
- ✅ Dashboards exist and are actively watched during rollout
- ✅ Security review completed for log excerpts/redaction
- ✅ Load tests show no unacceptable backend regressions
Optional enhancements after initial launch
- Next-step recommendations: add action suggestions and deep links (rotate token, open failing test, open doc).
- Flaky test / known issue detection: show “this matches known flaky signature” with last-seen frequency.
- “Compare to last green”: summarize what changed since last successful run (commit diff, dependency bump).
- SSE live updates: update first signal as soon as failure occurs while user watches.
If you tell me your current backend stack (Node/Go/.NET), log storage (S3/Elastic/Loki), and which CI providers you support, I can translate this into a concrete set of modules/classes, exact schema migrations, and the Angular routing + signals code structure you’d implement.