Here's a simple, high-leverage UX metric to add to your pipeline run view that will immediately make DevOps feel faster and calmer:

Time-to-First-Signal (TTFS)

What it is: the time from opening a run's details page until the UI renders the first actionable insight (e.g., “Stage build failed: dotnet restore → 401 token expired”). Why it matters: engineers don't need all data instantly, just the first trustworthy clue to start acting. Lower TTFS = quicker triage, lower stress, tighter MTTR.


What counts as a “first signal”

  • Failed stage + reason (exit code, key log line, failing test name)
  • Degraded but actionable status (e.g., flaky test signature)
  • Policy gate block with the specific rule that failed
  • Reachability-aware security finding that blocks deploy (one concrete example, not the whole list)

Not a signal: spinners, generic “loading…”, or unactionable counts.


How to optimize TTFS (practical steps)

  1. Deferred loading (prioritize critical panes):

    • Render header + failing stage card first; lazy-load artifacts, full logs, and graphs after.
    • Pre-expand the first failing node in the stage graph.
  2. Log pre-indexing at ingest:

    • During CI, stream logs into chunks keyed by [jobId, phase, severity, firstErrorLine].
    • Extract the first error tuple (timestamp, step, message) and store it next to the job record (see the sketch after this list).
    • On UI open, fetch only that tuple (sub-100ms) before fetching the rest.
  3. Cached summaries:

    • Persist a tiny JSON “run.summary.v1” (status, first failing stage, first error line, blocking policies) in Redis/Postgres.
    • Invalidate on new job events; always serve this summary first.
  4. Edge prefetch:

    • When the runs table is visible, prefetch summaries for rows in viewport so details pages open “warm”.
  5. Compress + cap first log burst:

    • Send the first 5-10 error lines (already extracted) immediately; stream the rest.
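
The extraction in step 2 can be a small pure function at ingest. A minimal TypeScript sketch, assuming a streaming ingest that hands you log chunks per job; the LogChunk, FirstErrorTuple, and ERROR_MARKERS names are illustrative, not an existing API:

```typescript
// Illustrative types for the streaming ingest; not an existing API.
interface LogChunk {
  jobId: string;
  phase: string;
  startLine: number;   // line number of the first line in this chunk
  lines: string[];
}

interface FirstErrorTuple {
  jobId: string;
  phase: string;
  line: number;
  message: string;
  at: string;          // ISO timestamp when the tuple was captured
}

// Common failure markers; extend per toolchain (dotnet, npm, pytest, ...).
const ERROR_MARKERS = /FATAL|ERROR|##\[error\]|panic:|Unhandled exception|npm ERR!|BUILD FAILED/;

const firstErrors = new Map<string, FirstErrorTuple>();

// Called for every chunk as it streams in; stops scanning once a tuple exists.
export function indexChunk(chunk: LogChunk): FirstErrorTuple | undefined {
  const existing = firstErrors.get(chunk.jobId);
  if (existing) return existing;                           // early stop per job
  const idx = chunk.lines.findIndex((l) => ERROR_MARKERS.test(l));
  if (idx === -1) return undefined;
  const tuple: FirstErrorTuple = {
    jobId: chunk.jobId,
    phase: chunk.phase,
    line: chunk.startLine + idx,
    message: chunk.lines[idx].slice(0, 500),               // cap excerpt length
    at: new Date().toISOString(),
  };
  firstErrors.set(chunk.jobId, tuple);
  return tuple;                                            // caller persists this next to the job record
}
```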

Instrumentation (so you can prove it)

Emit these points as telemetry:

  • ttfs_start: when the run details route is entered (or when tab becomes visible)
  • ttfs_signal_rendered: when the first actionable card is in the DOM
  • ttfs_ms = signal_rendered - start
  • Dimensions: pipeline_provider, repo, branch, run_type (PR/main), device, release, network_state
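
A minimal sketch of emitting these two events from the client, assuming a generic track() function for whatever analytics sink you already use; all names here are illustrative:

```typescript
// Telemetry pair for TTFS; wire track() to your existing analytics client.
type TtfsDimensions = {
  pipeline_provider: string;
  repo: string;
  branch: string;
  run_type: 'PR' | 'main';
  device: string;
  release: string;
  network_state: string;
};

// Placeholder for whatever telemetry client you already use.
declare function track(event: string, payload: Record<string, unknown>): void;

let ttfsStart: number | undefined;

export function markTtfsStart(): void {
  // Route entered (or tab became visible); performance.now() is monotonic.
  ttfsStart = performance.now();
  track('ttfs_start', {});
}

export function markTtfsSignalRendered(dims: TtfsDimensions): void {
  if (ttfsStart === undefined) return;            // guard against double-fire
  const ttfs_ms = Math.round(performance.now() - ttfsStart);
  track('ttfs_signal_rendered', { ttfs_ms, ...dims });
  ttfsStart = undefined;                          // fire exactly once per view
}
```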

SLO: P50 ≤ 700ms, P95 ≤ 2.5s (adjust to your infra).

Dashboards to track:

  • TTFS distribution (P50/P90/P95) by release
  • Correlate TTFS with bounce rate and “open → rerun” delay
  • Error budget: % of views with TTFS > 3s

Minimal backend contract (example)

GET /api/runs/{runId}/first-signal
{
  "runId": "123",
  "firstSignal": {
    "type": "stage_failed",
    "stage": "build",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "at": "2025-12-11T09:22:31Z",
    "artifact": { "kind": "log", "range": {"start": 1880, "end": 1896} }
  },
  "summaryEtag": "W/\"a1b2c3\""
}

Frontend pattern (Angular 17, signal-first)

  • Fire first-signal request in route resolver.
  • Render FirstSignalCard immediately.
  • Lazy-load stage graph, full logs, security panes.
  • Fire ttfs_signal_rendered when FirstSignalCard enters viewport.
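
A minimal Angular 17 sketch of the resolver piece, assuming a functional resolver and the /first-signal contract above; type and component names are illustrative:

```typescript
import { inject } from '@angular/core';
import { HttpClient } from '@angular/common/http';
import { ResolveFn } from '@angular/router';

// Shape of the /first-signal response (see the backend contract above).
export interface FirstSignalResponse {
  runId: string;
  firstSignal: { type: string; stage: string; step: string; message: string; at: string } | null;
  summaryEtag: string;
}

// Functional resolver: the router waits for this (sub-100ms on the warm path)
// before activating the route, so FirstSignalCard can render from route data
// immediately while heavier panes are still deferred.
export const firstSignalResolver: ResolveFn<FirstSignalResponse> = (route) => {
  const http = inject(HttpClient);
  const runId = route.paramMap.get('runId');
  return http.get<FirstSignalResponse>(`/api/runs/${runId}/first-signal`);
};

// Route registration (illustrative):
// { path: 'runs/:runId', component: RunDetailsComponent, resolve: { firstSignal: firstSignalResolver } }
```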

CI adapter hints (GitLab/GitHub/Azure)

  • Hook on job status webhooks to compute & store the first error tuple.
  • For GitLab: scan the trace stream for the first ERROR|FATAL|##[error] match; store it in a DB table ci_run_first_signal(run_id, stage, step, message, t).

“Good TTFS” acceptance tests

  • Run with early fail → first signal < 1s, shows exact command + exit code.
  • Run with policy gate fail → rule name + fix hint visible first.
  • Offline/slow network → cached summary still renders an actionable hint.

Copy to put in your UX guidelines

“Optimize Time-to-First-Signal (TTFS) above all. Users must see one trustworthy, actionable clue within 1 second on a warm path, even if the rest of the UI is still loading.”

Below is an extended, end-to-end implementation plan for Time-to-First-Signal (TTFS) that you can drop into your backlog, including the DB schema for the pre-indexed log tuples and the Angular resolver + telemetry hooks. It covers architecture, data model, API contracts, frontend work, observability, QA, and rollout, structured as epics/phases with “definition of done” and acceptance criteria.


Scope extension

What we're building

A run details experience that renders one actionable clue fast—before loading heavy UI like full logs, graphs, artifacts.

“First signal” is a small payload derived from run/job events and the earliest meaningful error evidence (stage/step + key log line(s) + reason/classification).

What we're extending beyond the initial idea

  1. First-Signal Quality (not just speed)

    • Classify error type (auth, dependency, compilation, test, infra, policy, timeout).
    • Identify “culprit step” and a stable “signature” for dedupe and search.
  2. Progressive disclosure UX

    • Summary → First signal card → expanded context (stage graph, logs, artifacts).
  3. Provider-agnostic ingestion

    • Adapters for GitLab/GitHub/Azure (or your CI provider).
  4. Caching + prefetch

    • Warm open from list/table, with ETags and stale-while-revalidate.
  5. Observability & SLOs

    • TTFS metrics, dashboards, alerting, and quality metrics (false signals).
  6. Rollout safety

    • Feature flags, canary, A/B gating, and a guaranteed fallback path.

Success criteria

Primary metric

  • TTFS (ms): time from details page route enter → first actionable signal rendered.

Targets (example SLOs)

  • P50 ≤ 700 ms, P95 ≤ 2500 ms on warm path.
  • Cold path: P95 ≤ 4000 ms (depends on infra).

Secondary outcome metrics

  • Open→Action time: time from opening run to first user action (rerun, cancel, assign, open failing log line).
  • Bounce rate: close page within 10 seconds without interaction.
  • MTTR proxy: time from failure to first rerun or fix commit.

Quality metrics

  • Signal availability rate: % of run views that show a first signal card within 3s.
  • Signal accuracy score (sampled): engineer confirms “helpful vs not”.
  • Extractor failure rate: parsing errors / missing mappings / timeouts.

Architecture overview

Data flow

  1. CI provider events (job started, job finished, stage failed, log appended) land in your backend.

  2. Run summarizer maintains:

    • run_summary (small JSON)
    • first_signal (small, actionable payload)
  3. UI opens run details

    • Immediately calls GET /runs/{id}/first-signal (or /summary).
    • Renders FirstSignalCard as soon as payload arrives.
  4. Background fetches:

    • Stage graph, full logs, artifacts, security scans, trends.

Key decision: where to compute first signal

  • Option A: at ingest time (recommended). Compute the first signal when logs/events arrive, store it, and serve it instantly.
  • Option B: on demand. Compute when the user opens run details (simpler initially, but worse TTFS and load).

Data model

Tables (relational example)

ci_run

  • run_id (pk)
  • provider
  • repo_id
  • branch
  • status
  • created_at, updated_at

ci_job

  • job_id (pk)
  • run_id (fk)
  • stage_name
  • job_name
  • status
  • started_at, finished_at

ci_log_chunk

  • chunk_id (pk)
  • job_id (fk)
  • seq (monotonic)
  • byte_start, byte_end (range into blob)
  • first_error_line_no (nullable)
  • first_error_excerpt (nullable, short)
  • severity_max (info/warn/error)

ci_run_summary

  • run_id (pk)
  • version (e.g., 1)
  • etag (hash)
  • summary_json (small, 1-5 KB)
  • updated_at

ci_first_signal

  • run_id (pk)
  • etag
  • signal_json (small, 0.5-2 KB)
  • quality_flags (bitmask or json)
  • updated_at

Cache layer

  • Redis keys:

    • run:{runId}:summary:v1
    • run:{runId}:first-signal:v1
  • TTL: generous but safe (e.g., 24h) with “write-through” on event updates.
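
A minimal sketch of the write-through cache path, assuming an ioredis client and the key convention above; the TTL and helper names are illustrative:

```typescript
import Redis from 'ioredis';

const redis = new Redis();
const TTL_SECONDS = 24 * 60 * 60;                      // generous but safe

const firstSignalKey = (runId: string) => `run:${runId}:first-signal:v1`;
// The summary key follows the same pattern: run:{runId}:summary:v1.

// Write-through: the summarizer persists to the DB first (not shown),
// then refreshes the cache in the same code path on every relevant event.
export async function writeThroughFirstSignal(runId: string, signal: object): Promise<void> {
  await redis.set(firstSignalKey(runId), JSON.stringify(signal), 'EX', TTL_SECONDS);
}

// Read path for the /first-signal endpoint: cache first, DB as fallback,
// repopulating the cache on a miss.
export async function readFirstSignal(
  runId: string,
  loadFromDb: (id: string) => Promise<object | null>,
): Promise<object | null> {
  const cached = await redis.get(firstSignalKey(runId));
  if (cached) return JSON.parse(cached);
  const fromDb = await loadFromDb(runId);
  if (fromDb) await writeThroughFirstSignal(runId, fromDb);
  return fromDb;
}
```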


First signal definition

{
  "runId": "123",
  "computedAt": "2025-12-12T09:22:31Z",
  "status": "failed",
  "firstSignal": {
    "type": "stage_failed",
    "classification": "dependency_auth",
    "stage": "build",
    "job": "build-linux-x64",
    "step": "dotnet restore",
    "message": "401 Unauthorized: token expired",
    "signature": "dotnet-restore-401-unauthorized",
    "log": {
      "jobId": "job-789",
      "lines": [
        "error : Response status code does not indicate success: 401 (Unauthorized).",
        "error : The token is expired."
      ],
      "range": { "start": 1880, "end": 1896 }
    },
    "suggestedActions": [
      { "label": "Rotate token", "type": "doc", "target": "internal://docs/tokens" },
      { "label": "Rerun job", "type": "action", "target": "rerun-job:job-789" }
    ]
  },
  "etag": "W/\"a1b2c3\""
}

Notes

  • signature should be stable for grouping.
  • suggestedActions is optional but hugely valuable (even 1-2 actions).
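
For reference, a TypeScript view of this payload that could be shared between the summarizer and the UI; field names mirror the JSON example above, everything else is an assumption:

```typescript
export interface FirstSignalPayload {
  runId: string;
  computedAt: string;                              // ISO 8601
  status: string;                                  // "failed" in the example above
  firstSignal: {
    type: string;                                  // e.g., "stage_failed"
    classification: string;                        // e.g., "dependency_auth"
    stage: string;
    job: string;
    step: string;
    message: string;
    signature: string;                             // stable key for grouping/dedupe
    log: {
      jobId: string;
      lines: string[];                             // redacted, capped excerpt
      range: { start: number; end: number };
    };
    suggestedActions?: Array<{ label: string; type: string; target: string }>;
  } | null;                                        // null while no signal is available yet
  etag: string;
}
```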

APIs

1) First signal endpoint

GET /api/runs/{runId}/first-signal

Headers:

  • If-None-Match: W/"..." supported
  • Response includes ETag and Cache-Control

Responses:

  • 200: full first signal object
  • 304: not modified
  • 404: run not found
  • 204: run exists but signal not available yet (rare; should degrade gracefully)
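
A sketch of the conditional-request handling, assuming an Express-style handler and a hypothetical lookup over the cache/DB layer described above; 404 and authorization checks are omitted:

```typescript
import type { Request, Response } from 'express';

// Hypothetical lookup returning the stored payload with its precomputed ETag.
declare function loadFirstSignal(runId: string): Promise<{ etag: string; [k: string]: unknown } | null>;

export async function getFirstSignal(req: Request, res: Response): Promise<void> {
  const payload = await loadFirstSignal(req.params.runId);
  if (payload === null) {
    res.status(204).end();                         // run exists but signal not available yet
    return;
  }
  if (req.header('If-None-Match') === payload.etag) {
    res.status(304).end();                         // client copy is still fresh
    return;
  }
  res.set('ETag', payload.etag);
  res.set('Cache-Control', 'private, max-age=5, stale-while-revalidate=30');
  res.json(payload);
}
```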

2) Summary endpoint (optional but useful)

GET /api/runs/{runId}/summary

  • Includes: status, first failing stage/job, timestamps, blocking policies, artifact counts.

3) SSE / WebSocket updates (nice-to-have)

GET /api/runs/{runId}/events (SSE)

  • Push new signal or summary updates in near real-time while user is on the page.

Frontend implementation plan (Angular 17)

UX behavior

  1. Route enter

    • Start TTFS timer.
  2. Render instantly:

    • Title, status badge, pipeline metadata (run id, commit, branch).
    • Skeleton for details area.
  3. Fetch first signal:

    • Render FirstSignalCard immediately when available.
    • Fire telemetry event when card is in DOM and visible.
  4. Lazy-load:

    • Stage graph
    • Full logs viewer
    • Artifacts list
    • Security findings
    • Trends, flaky tests, etc.

Angular structure

  • RunDetailsResolver (or resolveFn) requests first signal.
  • RunDetailsComponent uses signals to render quickly.
  • FirstSignalCardComponent is standalone + minimal deps.

Prefetch strategy from runs list view

  • When the runs table is visible, prefetch summaries/first signals for items in viewport:

    • Use IntersectionObserver to prefetch only visible rows.
    • Store results in an in-memory cache (e.g., Map<runId, FirstSignal>).
    • Respect ETag to avoid redundant payloads.
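
A browser-side sketch of that prefetch, assuming each table row carries a data-run-id attribute; the in-memory cache shape and endpoint reuse the contract above:

```typescript
// Prefetch first signals only for rows the user can actually see.
const prefetched = new Map<string, Promise<unknown>>();

const observer = new IntersectionObserver((entries) => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue;
    const runId = (entry.target as HTMLElement).dataset['runId'];
    if (!runId || prefetched.has(runId)) continue;
    // Warm the details page: the HTTP cache plus this map make the later
    // /first-signal call on navigation effectively free.
    prefetched.set(
      runId,
      fetch(`/api/runs/${runId}/first-signal`).then((r) => (r.ok ? r.json() : null)),
    );
    observer.unobserve(entry.target);              // prefetch each row once
  }
});

// Call this after the runs table renders (and again when rows change).
export function observeRunRows(rows: Iterable<HTMLElement>): void {
  for (const row of rows) observer.observe(row);
}
```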

Telemetry hooks

  • ttfs_start: route activation + tab visible
  • ttfs_signal_rendered: FirstSignalCard attached and visible
  • Dimensions: provider, repo, branch, run_type, release_version, network_state

Backend implementation plan

Summarizer / First-signal service

A service or module that:

  • subscribes to run/job events

  • receives log chunks (or pointers)

  • computes and stores:

    • run_summary
    • first_signal
  • publishes updates (optional) to an event stream for SSE

Concurrency rule

First signal should be set once per run unless a “better” signal appears:

  • if current signal is missing → set
  • if current signal is “generic” and new one is “specific” → replace
  • otherwise keep (avoid churn)
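
A minimal sketch of this rule, assuming a coarse specificity score per classification; the scores themselves are illustrative:

```typescript
type Signal = { classification: string; message: string };

// Higher score = more specific/actionable. "generic" is the catch-all bucket.
const SPECIFICITY: Record<string, number> = {
  generic: 0,
  infra: 1,
  timeout: 1,
  dependency: 2,
  dependency_auth: 3,
  compilation: 3,
  test: 3,
  policy: 3,
};

export function shouldReplace(current: Signal | null, incoming: Signal): boolean {
  if (current === null) return true;                        // missing → set
  const cur = SPECIFICITY[current.classification] ?? 0;
  const next = SPECIFICITY[incoming.classification] ?? 0;
  return next > cur;                                        // otherwise keep (avoid churn)
}
```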

Extraction & classification logic

Minimum viable extractor (Phase 1)

  • Heuristics:

    • first match among patterns: FATAL, ERROR, ##[error], panic:, Unhandled exception, npm ERR!, BUILD FAILED, etc.
    • plus provider-specific fail markers
  • Pull:

    • stage/job/step context (from job metadata or step boundaries)
    • 5-10 log lines around the first error line

Improved extractor (Phase 2+)

  • Language/tool specific rules:

    • dotnet, maven/gradle, npm/yarn/pnpm, python/pytest, go test, docker build, terraform, helm
  • Add classification and signature:

    • normalize common errors:

      • auth expired/forbidden
      • missing dependency / DNS / TLS
      • compilation error
      • test failure (include test name)
      • infra capacity / agent lost
      • policy gate failure
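
A sketch of the classification and signature step; the pattern table is illustrative and should grow from your top failure modes:

```typescript
const CLASSIFIERS: Array<{ classification: string; pattern: RegExp }> = [
  { classification: 'dependency_auth', pattern: /401|403|unauthorized|forbidden|token (is )?expired/i },
  { classification: 'dependency',      pattern: /could not resolve|ENOTFOUND|TLS handshake|certificate/i },
  { classification: 'compilation',     pattern: /error CS\d+|cannot find symbol|SyntaxError/i },
  { classification: 'test',            pattern: /assert|expected .* but|\d+ failing/i },
  { classification: 'infra',           pattern: /agent lost|no space left on device|OOMKilled/i },
  { classification: 'policy',          pattern: /policy gate|rule .* failed/i },
];

export function classify(message: string): string {
  return CLASSIFIERS.find((c) => c.pattern.test(message))?.classification ?? 'generic';
}

// Stable, lowercase, punctuation-free key for grouping and search, e.g.
// "dotnet restore" + "401 Unauthorized: token expired"
//   -> "dotnet-restore-401-unauthorized-token-expired"
export function signature(step: string, message: string): string {
  return `${step} ${message}`
    .toLowerCase()
    .replace(/\d{4,}/g, '')          // drop long numbers (line numbers, ids)
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
    .slice(0, 80);
}
```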

Guardrails

  • Secret redaction: before storing excerpts, run your existing redaction pipeline.
  • Payload cap: cap message length and excerpt lines.
  • PII discipline: avoid including arbitrary stack traces if they contain sensitive paths; include only key lines.

Development plan by phases (epics)

Each phase below includes deliverables + acceptance criteria. You can treat each as a sprint/iteration.


Phase 0 — Baseline and alignment

Deliverables

  • Baseline TTFS measurement (current behavior)
  • Definition of “actionable signal” and priority rules
  • Performance budget for run details view

Tasks

  • Add client-side telemetry for current page load steps:

    • route enter, summary loaded, logs loaded, graph loaded
  • Measure TTFS proxy today (likely “time to status shown”)

  • Identify top 20 failure modes in your CI (from historical logs)

Acceptance criteria

  • Dashboard shows baseline P50/P95 for current experience.
  • “First signal” contract signed off with UI + backend teams.

Phase 1 — Data model and storage

Deliverables

  • DB migrations for ci_run_summary and ci_first_signal
  • Redis cache keys and invalidation strategy
  • ADR: where summaries live and how they update

Tasks

  • Create tables and indices:

    • index on run_id, updated_at, provider
  • Add serializer/deserializer for summary_json and signal_json

  • Implement ETag generation (hash of JSON payload)
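
A minimal sketch of the ETag task, using Node's crypto module; the weak-ETag format matches the API examples above:

```typescript
import { createHash } from 'node:crypto';

export function etagFor(payload: unknown): string {
  // JSON.stringify is stable enough here as long as the serializer always
  // writes fields in the same order; otherwise sort keys before hashing.
  const hash = createHash('sha256').update(JSON.stringify(payload)).digest('hex').slice(0, 12);
  return `W/"${hash}"`;
}
```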

Acceptance criteria

  • Can store and retrieve summary + first signal for a run in < 50ms (DB) and < 10ms (cache).
  • ETag works end-to-end.

Phase 2 — Ingestion and first signal computation

Deliverables

  • First-signal computation module
  • Provider adapter integration points (webhook consumers)
  • “first error tuple” extraction from logs

Tasks

  • On job log append:

    • scan incrementally for first error markers
    • store excerpt + line range + job/stage/step mapping
  • On job finish/fail:

    • finalize first signal with best known context
  • Implement the “better signal replaces generic” rule

Acceptance criteria

  • For a known failing run, API returns first signal without reading full log blob.
  • Computation does not exceed a small CPU budget per log chunk (guard with limits).
  • Extraction failure rate < 1% for sampled runs (initial).

Phase 3 — API endpoints and caching

Deliverables

  • /runs/{id}/first-signal endpoint
  • Optional /runs/{id}/summary
  • Cache-control + ETag support
  • Access control checks consistent with existing run authorization

Tasks

  • Serve cached first signal first; fallback to DB

  • If missing:

    • return 204 (or a “pending” object) and allow UI fallback
  • Add server-side metrics:

    • endpoint latency, cache hit rate, payload size

Acceptance criteria

  • Endpoint P95 latency meets target (e.g., < 200ms internal).
  • Cache hit rate is high for active runs (after prefetch).

Phase 4 — Frontend progressive rendering

Deliverables

  • FirstSignalCard component
  • Route resolver + local cache
  • Prefetch on runs list view
  • Telemetry for TTFS

Tasks

  • Render shell immediately
  • Fetch and render first signal
  • Lazy-load heavy panels using @defer / dynamic imports
  • Implement “open failing stage” default behavior
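
A sketch of the progressive template using Angular 17 @defer blocks; the child component names and import paths are hypothetical placeholders for the panes named above:

```typescript
import { Component } from '@angular/core';
// Hypothetical child components from this plan.
import { FirstSignalCardComponent } from './first-signal-card.component';
import { StageGraphComponent } from './stage-graph.component';
import { LogsViewerComponent } from './logs-viewer.component';

@Component({
  selector: 'app-run-details',
  standalone: true,
  imports: [FirstSignalCardComponent, StageGraphComponent, LogsViewerComponent],
  template: `
    <app-first-signal-card />                <!-- rendered from resolver data -->

    @defer (on viewport) {
      <app-stage-graph />                    <!-- heavy pane, loads lazily -->
    } @placeholder {
      <div class="skeleton">Stage graph loading…</div>
    }

    @defer (on idle) {
      <app-logs-viewer />                    <!-- full logs after first paint -->
    } @placeholder {
      <div class="skeleton">Logs loading…</div>
    }
  `,
})
export class RunDetailsComponent {}
```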

Acceptance criteria

  • In throttled network test, first signal card appears significantly earlier than logs and graphs.
  • ttfs_signal_rendered fires exactly once per view, with correct dimensions.

Phase 5 — Observability, dashboards, and alerting

Deliverables

  • TTFS dashboards by:

    • provider, repo, run type, release version
  • Alerts:

    • P95 regression threshold
  • Quality dashboard:

    • availability rate, extraction failures, “generic signal rate”

Tasks

  • Create event pipeline for telemetry into your analytics system
  • Define SLO/error budget alerts
  • Add tracing (OpenTelemetry) for endpoint and summarizer

Acceptance criteria

  • You can correlate TTFS with:

    • bounce rate
    • open→action time
  • You can pinpoint whether regressions are backend, frontend, or provider-specific.


Phase 6 — QA, performance testing, rollout

Deliverables

  • Automated tests
  • Feature flag + gradual rollout
  • A/B experiment (optional)

Tasks

Testing

  • Unit tests:

    • extractor patterns
    • classification rules
  • Integration tests:

    • simulated job logs with known outcomes
  • E2E (Playwright/Cypress):

    • verify the first signal appears before logs (see the sketch after this list)
    • verify the fallback path works if the endpoint fails
  • Performance tests:

    • cold cache vs warm cache
    • throttled CPU/network profiles
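
A Playwright sketch of the two E2E checks, assuming data-testid hooks on the relevant components and a known failing run fixture:

```typescript
import { test, expect } from '@playwright/test';

test('first signal renders before full logs', async ({ page }) => {
  await page.goto('/runs/123');                                   // known failing run fixture

  // The actionable card must be visible well before the logs viewer.
  const firstSignal = page.getByTestId('first-signal-card');
  await expect(firstSignal).toBeVisible({ timeout: 1_000 });
  await expect(firstSignal).toContainText('dotnet restore');

  // Logs are allowed to arrive later; they must not block the signal.
  await expect(page.getByTestId('logs-viewer')).toBeVisible({ timeout: 10_000 });
});

test('fallback path when the endpoint fails', async ({ page }) => {
  await page.route('**/api/runs/*/first-signal', (route) => route.fulfill({ status: 500 }));
  await page.goto('/runs/123');
  // No blank screen: the shell and a retry/fallback hint still render.
  await expect(page.getByTestId('run-details-shell')).toBeVisible();
});
```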

Rollout

  • Feature flag:

    • enabled for internal users first
    • ramp by repo or percentage
  • Monitor key metrics during ramp:

    • TTFS P95
    • API error rate
    • UI error rate
    • cache miss spikes

Acceptance criteria

  • No increase in overall error rates.
  • TTFS improves at least X% for a meaningful slice of users (define X from baseline).
  • Fallback UX remains usable when signals are unavailable.

Backlog examples (ready-to-create Jira tickets)

Epic: Run summary and first signal storage

  • Create ci_first_signal table
  • Create ci_run_summary table
  • Implement ETag hashing
  • Implement Redis caching layer
  • Add admin/debug endpoint (internal only) to inspect computed signals

Epic: Log chunk extraction

  • Implement incremental log scanning
  • Store first error excerpt + range
  • Map excerpt to job + step
  • Add redaction pass to excerpts

Epic: Run details progressive UI

  • FirstSignalCard UI component
  • Lazy-load logs viewer
  • Default to opening failing stage
  • Prefetch signals in runs list

Epic: Telemetry and dashboards

  • Add ttfs_start and ttfs_signal_rendered
  • Add endpoint latency metrics
  • Build dashboards + alerts
  • Add sampling for “signal helpfulness” feedback

Risk register and mitigations

Risk: First signal is wrong/misleading

  • Mitigation:

    • track “generic signal rate” and “corrected by user” feedback
    • classification confidence scoring
    • always provide quick access to full logs as fallback

Risk: Logs contain secrets

  • Mitigation:

    • redact excerpts before storing/serving
    • cap excerpt lines and length
    • keep raw logs behind existing permissions

Risk: Increased ingest CPU cost

  • Mitigation:

    • incremental scanning with early stop after first error captured
    • limit scanning per chunk
    • sample/skip overly large logs until job completion

Risk: Cache invalidation bugs

  • Mitigation:

    • ETag-based correctness
    • versioned keys: :v1
    • “write-through” cache updates from summarizer

Definition of Done checklist

A phase is “done” when:

  • TTFS measured with reliable client events
  • FirstSignalCard renders from /first-signal endpoint
  • ETag caching works
  • Fallback path is solid (no blank screens)
  • Dashboards exist and are actively watched during rollout
  • Security review completed for log excerpts/redaction
  • Load tests show no unacceptable backend regressions

Optional enhancements after initial launch

  1. Next-step recommendations: add action suggestions and deep links (rotate token, open failing test, open doc).
  2. Flaky test / known issue detection: show “this matches a known flaky signature” with last-seen frequency.
  3. “Compare to last green”: summarize what changed since the last successful run (commit diff, dependency bump).
  4. SSE live updates: update the first signal as soon as the failure occurs while the user watches.

If you tell me your current backend stack (Node/Go/.NET), log storage (S3/Elastic/Loki), and which CI providers you support, I can translate this into a concrete set of modules/classes, exact schema migrations, and the Angular routing + signals code structure you'd implement.