Time-to-First-Signal (TTFS) Architecture
Derived from Product Advisory (14-Dec-2025): UX and Time-to-Evidence Technical Reference. This document details the TTFS subsystem, which provides immediate feedback on run and job status.
1) Overview
Time-to-First-Signal (TTFS) measures the latency from user action (opening a run, starting a scan, CLI invocation) to the first meaningful signal being displayed or logged. This architecture ensures users receive immediate feedback regardless of actual job completion time.
1.1 Design Goals
- Instant Feedback: P50 < 2s, P95 < 5s across all surfaces (UI, CLI, CI)
- Graceful Degradation: Skeleton → Cached Signal → Live Data progression
- Offline-First: Full functionality in air-gapped environments using PostgreSQL NOTIFY/LISTEN
- Predictive Context: Provide "last known outcome" and ETA estimates for in-progress jobs
1.2 Signal Flow
```text
TTFS Signal Flow

User Action          API Layer              Cache Layer   Data Layer
───────────          ─────────              ───────────   ──────────

[Route Enter] ──┬──► /first-signal ───────► Valkey/Redis ──┐
[CLI Start]   ──┤          │                     │         │
[CI Job]      ──┘          │                     │         ▼
                           │                     │  ┌──────────────┐
                           ▼                     │  │  PostgreSQL  │
                     ┌────────────┐              │  │ first_signal │
                     │    ETag    │◄─────────────┤  │  _snapshots  │
                     │ Validation │              │  └──────────────┘
                     └────────────┘              │
                           │                     │
                           ▼                     ▼
                     ┌──────────────────────────────┐
                     │      Response Assembly       │
                     │  • kind (status indicator)   │
                     │  • phase (current stage)     │
                     │  • summary (human text)      │
                     │  • eta_seconds (estimate)    │
                     │  • last_known_outcome        │
                     │  • next_actions              │
                     └──────────────────────────────┘
                                    │
                                    ▼
                     ┌──────────────────────────────┐
                     │     SSE / Polling Client     │
                     └──────────────────────────────┘
```
2) Component Budgets
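The 5-second P95 budget is allocated across components. Percentiles are not additive, so the total rows are end-to-end targets rather than sums of the component rows: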
| Component | P50 Budget | P95 Budget | Notes |
|---|---|---|---|
| Frontend (skeleton + hydration) | 100ms | 150ms | Network-independent |
| Edge API (auth + routing) | 150ms | 250ms | JWT validation, rate limiting |
| Core Services (lookup + assembly) | 700ms | 1,500ms | Cache hit vs cold path |
| SSE/WebSocket establishment | — | 300ms | Fallback to polling if exceeded |
| Total (warm path) | 700ms | 2,500ms | Cache hit scenario |
| Total (cold path) | 1,200ms | 4,000ms | Cache miss, compute required |
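The SSE row's "fallback to polling if exceeded" rule can be implemented by racing connection setup against the 300 ms budget. The following is a minimal sketch. It assumes a hypothetical `openSse` callback that resolves once the stream is open (for example, on an `EventSource` `open` event). `connectWithFallback` is an illustrative name.

```typescript
type Transport = "sse" | "polling";

// Hypothetical helper: race SSE establishment against the 300 ms budget and
// report which transport the client should use.
async function connectWithFallback(
  openSse: () => Promise<void>, // assumed to resolve when the stream is open
  budgetMs = 300,
): Promise<Transport> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Transport>((resolve) => {
    timer = setTimeout(() => resolve("polling"), budgetMs);
  });
  const sse: Promise<Transport> = openSse().then(
    () => "sse" as const,
    () => "polling" as const, // a failed connection also falls back to polling
  );
  try {
    return await Promise.race([sse, timeout]);
  } finally {
    clearTimeout(timer); // do not leave the budget timer running
  }
}
```

If the timer wins, the caller should still close any SSE connection that opens late so it does not compete with polling.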
3) Signal Kinds
The kind field indicates the current signal state:
| Kind | Description | Typical Duration | Icon |
|---|---|---|---|
| `queued` | Job waiting in queue | 0-30s | Queue |
| `started` | Job has begun execution | — | Play |
| `phase` | Job in specific phase | Varies | Progress |
| `blocked` | Waiting on dependency/policy | — | Pause |
| `failed` | Job has failed | — | Error |
| `succeeded` | Job completed successfully | — | Check |
| `canceled` | Job was canceled | — | Cancel |
| `unavailable` | Signal cannot be determined | — | Unknown |
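Modeling the kinds as a string-literal union lets UI and CLI code switch on them exhaustively. The sketch below rests on two assumptions: `failed`, `succeeded`, and `canceled` are the terminal kinds, and unrecognized values map to `unavailable`. The helper names are illustrative.

```typescript
// Signal kinds from the table above.
const SIGNAL_KINDS = [
  "queued", "started", "phase", "blocked",
  "failed", "succeeded", "canceled", "unavailable",
] as const;
type SignalKind = (typeof SIGNAL_KINDS)[number];

// Assumption: these kinds end a job; every other kind may still change.
const TERMINAL_KINDS: readonly SignalKind[] = ["failed", "succeeded", "canceled"];

function isTerminal(kind: SignalKind): boolean {
  return TERMINAL_KINDS.includes(kind);
}

// Narrow an untrusted string (e.g. from JSON) to a SignalKind.
// Anything unrecognized is reported as "unavailable".
function parseKind(raw: string): SignalKind {
  return (SIGNAL_KINDS as readonly string[]).includes(raw) ? (raw as SignalKind) : "unavailable";
}
```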
4) Signal Phases
The phase field indicates the current execution phase:
| Phase | Description | SLO Target |
|---|---|---|
| `resolve` | Dependency/artifact resolution | P95 < 30s |
| `fetch` | Data retrieval (registry, advisories) | P95 < 45s |
| `restore` | Cache/snapshot restoration | P95 < 10s |
| `analyze` | Analysis execution (scan, policy) | P95 < 120s |
| `policy` | Policy evaluation | P95 < 15s |
| `report` | Report generation/upload | P95 < 30s |
| `unknown` | Phase cannot be determined | — |
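The phase targets can also be used at runtime, for example to flag a phase that is already running slower than its P95 target. The following helper is hypothetical. Using a fleet-wide P95 as a per-job threshold is an assumption, not something the targets above prescribe.

```typescript
type SignalPhase = "resolve" | "fetch" | "restore" | "analyze" | "policy" | "report" | "unknown";

// P95 targets from the table above, in seconds; "unknown" has no target.
const PHASE_P95_TARGET_SECONDS: Record<SignalPhase, number | null> = {
  resolve: 30,
  fetch: 45,
  restore: 10,
  analyze: 120,
  policy: 15,
  report: 30,
  unknown: null,
};

// Hypothetical heuristic: true once a running phase has exceeded its P95 target.
function exceedsPhaseTarget(phase: SignalPhase, elapsedSeconds: number): boolean {
  const target = PHASE_P95_TARGET_SECONDS[phase];
  return target !== null && elapsedSeconds > target;
}
```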
5) API Contracts
5.1 First Signal Endpoint
Request:

```http
GET /api/v1/orchestrator/jobs/{jobId}/first-signal
Accept: application/json
If-None-Match: "{etag}"
```

Response:

```http
200 OK
ETag: "job-{id}-{updated_at.unix_ms}"
Cache-Control: private, max-age=1, stale-while-revalidate=5
X-Signal-Source: snapshot | cold_start | failure_index

{
  "kind": "started",
  "phase": "analyze",
  "summary": "Scanning image layers (47%)",
  "eta_seconds": 38,
  "last_known_outcome": {
    "status": "succeeded",
    "finished_at": "2025-12-13T10:15:00Z",
    "findings_count": 12
  },
  "next_actions": [
    {"label": "View previous run", "href": "/runs/abc-123"}
  ],
  "diagnostics": {
    "queue_position": null,
    "worker_id": "worker-7"
  }
}
```

If the `If-None-Match` value matches the current ETag, the endpoint returns `304 Not Modified` with no body.
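The ETag format and the `304` behavior can be sketched as follows. `firstSignalEtag` and `conditionalStatus` are hypothetical names. The comparison follows the HTTP rule that `If-None-Match` uses weak comparison, so a `W/` prefix is ignored. It also accepts `*` or a comma-separated list of tags.

```typescript
// ETag in the documented shape "job-{id}-{updated_at.unix_ms}", quotes included.
function firstSignalEtag(jobId: string, updatedAt: Date): string {
  return `"job-${jobId}-${updatedAt.getTime()}"`;
}

// Choose 200 or 304 for a GET that may carry If-None-Match.
// Weak comparison: strip any W/ prefix; "*" matches any current representation.
// Splitting on "," is sufficient here because this ETag format contains no commas.
function conditionalStatus(ifNoneMatch: string | undefined, currentEtag: string): 200 | 304 {
  if (ifNoneMatch === undefined) return 200;
  const tags = ifNoneMatch.split(",").map((tag) => tag.trim().replace(/^W\//, ""));
  const current = currentEtag.replace(/^W\//, "");
  return tags.includes("*") || tags.includes(current) ? 304 : 200;
}
```

Because `updated_at` is part of the tag, any write to the snapshot changes the ETag. That invalidation happens without a separate version counter.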
5.2 SSE Stream
```http
GET /api/v1/orchestrator/stream/jobs/{jobId}/first-signal
Accept: text/event-stream
```

```text
event: signal
data: {"kind":"started","phase":"analyze",...}

event: signal
data: {"kind":"phase","phase":"policy",...}

event: done
data: {"kind":"succeeded",...}
```
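In browsers, `EventSource` parses this stream. The endpoint uses named events (`signal`, `done`), so listeners must be registered with `addEventListener` rather than `onmessage`. Non-browser consumers such as a CLI could use a parser like the sketch below, which handles a complete stream body. It ignores `id`, `retry`, and incremental chunking, and `parseEventStream` is an illustrative name.

```typescript
interface SseMessage {
  event: string; // "signal" or "done" on this endpoint
  data: string;  // raw JSON payload; the caller decides when to JSON.parse
}

// Minimal parser for a complete text/event-stream body.
function parseEventStream(body: string): SseMessage[] {
  const messages: SseMessage[] = [];
  // A blank line terminates each event.
  for (const block of body.split(/\r?\n\r?\n/)) {
    let event = "message"; // default event type when no "event:" field is present
    const data: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line === "" || line.startsWith(":")) continue; // empty or comment line
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // drop one leading space
      if (field === "event") event = value;
      else if (field === "data") data.push(value); // multiple data lines join with "\n"
    }
    if (data.length > 0) messages.push({ event, data: data.join("\n") });
  }
  return messages;
}
```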
5.3 CLI Integration
```shell
# Job status with immediate signal
stella job status <job-id> --watch

# Output progression:
# [queued] Waiting in queue (position: 3)
# [started] Job started on worker-7
# [phase:analyze] Scanning image layers (47%)
# [succeeded] Completed in 2m 34s
```
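The sample output suggests a simple rendering rule. The following is a sketch under several assumptions: the CLI prints `[kind]` for most signals, prints `[phase:<phase>]` when `kind` is `phase`, and suppresses a line that repeats the previous one while watching. `formatSignalLine` and `watchLines` are hypothetical names.

```typescript
interface CliSignal {
  kind: string;    // e.g. "queued", "started", "phase", "succeeded"
  phase?: string;  // meaningful when kind === "phase"
  summary: string; // human-readable text from the signal
}

// Render one signal in the style of the sample output.
function formatSignalLine(signal: CliSignal): string {
  const tag = signal.kind === "phase" && signal.phase ? `phase:${signal.phase}` : signal.kind;
  return `[${tag}] ${signal.summary}`;
}

// While watching, emit a line only when it differs from the last one printed,
// so repeated polls of an unchanged signal stay quiet.
function watchLines(signals: CliSignal[]): string[] {
  const out: string[] = [];
  for (const signal of signals) {
    const line = formatSignalLine(signal);
    if (line !== out[out.length - 1]) out.push(line);
  }
  return out;
}
```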
6) Caching Strategy
6.1 Cache Tiers
| Tier | Storage | TTL | Use Case |
|---|---|---|---|
| L1 | In-memory (per-instance) | 1s | Hot path, same-instance requests |
| L2 | Valkey/Redis | 5s | Cross-instance, active jobs |
| L3 | PostgreSQL | 24h | Persistent snapshots, air-gap mode |
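A read-through lookup over the three tiers might look like the sketch below. Plain in-memory maps stand in for every tier here; in a deployment, L2 would be Valkey/Redis and L3 the `first_signal_snapshots` table. Backfilling faster tiers on a hit is an assumed policy, and `CacheTier` and `lookup` are illustrative names.

```typescript
// Stand-in for one cache tier with a fixed TTL.
class CacheTier {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();
  readonly name: string;
  readonly ttlMs: number;

  constructor(name: string, ttlMs: number) {
    this.name = name;
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number): string | undefined {
    const hit = this.entries.get(key);
    return hit !== undefined && hit.expiresAt > now ? hit.value : undefined;
  }

  set(key: string, value: string, now: number): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Try tiers fastest-first; on a hit, backfill the faster tiers that missed.
// On a full miss, run the cold path and populate every tier.
function lookup(
  tiers: CacheTier[],
  key: string,
  now: number,
  compute: () => string,
): { value: string; source: string } {
  for (let i = 0; i < tiers.length; i++) {
    const value = tiers[i].get(key, now);
    if (value !== undefined) {
      for (let j = 0; j < i; j++) tiers[j].set(key, value, now);
      return { value, source: tiers[i].name };
    }
  }
  const value = compute();
  for (const tier of tiers) tier.set(key, value, now);
  return { value, source: "cold_start" };
}
```

A cache hit on L1 never touches the network. Only a miss on all three tiers pays for the cold path.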
6.2 Cache Keys
```text
ttfs:job:{tenant_id}:{job_id}:signal     # Current signal
ttfs:job:{tenant_id}:{job_id}:eta        # ETA prediction
ttfs:run:{tenant_id}:{run_id}:signals    # Run-level aggregation
ttfs:tenant:{tenant_id}:failure_sig      # Failure signatures
```
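The following key builders for the patterns above are hypothetical. Placing `tenant_id` ahead of the entity id scopes every key to one tenant. Rejecting `:` inside a segment (an assumption) prevents two different tenant/job pairs from producing the same key.

```typescript
// ":" is the key separator, so it may not appear inside an id segment.
function segment(id: string): string {
  if (id === "" || id.includes(":")) throw new Error(`invalid cache key segment: "${id}"`);
  return id;
}

// Hypothetical builders mirroring the documented key patterns.
const ttfsKeys = {
  jobSignal: (tenantId: string, jobId: string) => `ttfs:job:${segment(tenantId)}:${segment(jobId)}:signal`,
  jobEta: (tenantId: string, jobId: string) => `ttfs:job:${segment(tenantId)}:${segment(jobId)}:eta`,
  runSignals: (tenantId: string, runId: string) => `ttfs:run:${segment(tenantId)}:${segment(runId)}:signals`,
  failureSignatures: (tenantId: string) => `ttfs:tenant:${segment(tenantId)}:failure_sig`,
};
```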
6.3 Air-Gap Mode
In air-gapped environments without Valkey/Redis:
- PostgreSQL NOTIFY/LISTEN replaces pub/sub for real-time updates
- Polling fallback with 2-second intervals
- first_signal_snapshots table serves as L2 cache
- All SSE endpoints gracefully degrade to long-polling
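The 2-second polling fallback reduces to a loop that stops on a terminal signal. The following is a minimal sketch. It assumes a hypothetical `fetchSignal` that calls the first-signal endpoint with `If-None-Match` and returns `null` on `304 Not Modified`. It also assumes `failed`, `succeeded`, and `canceled` are terminal. The sleep function is injectable so tests do not wait in real time, and the `maxPolls` cap is arbitrary.

```typescript
type Kind = "queued" | "started" | "phase" | "blocked" | "failed" | "succeeded" | "canceled" | "unavailable";
interface Signal { kind: Kind; phase?: string; summary?: string; }

// Assumption: these kinds end the job, so polling can stop.
const TERMINAL: Kind[] = ["failed", "succeeded", "canceled"];

async function pollUntilTerminal(
  fetchSignal: () => Promise<Signal | null>, // null means 304: nothing new
  onSignal: (signal: Signal) => void,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  intervalMs = 2000,
  maxPolls = 900, // roughly 30 minutes at 2 s; illustrative safety cap
): Promise<Signal | null> {
  let last: Signal | null = null;
  for (let i = 0; i < maxPolls; i++) {
    const next = await fetchSignal();
    if (next !== null) {
      last = next;
      onSignal(next);
      if (TERMINAL.includes(next.kind)) return next; // no sleep after the final signal
    }
    await sleep(intervalMs);
  }
  return last;
}
```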
7) Telemetry & Observability
7.1 Metrics
| Metric | Type | Description |
|---|---|---|
| `ttfs_latency_seconds` | Histogram | End-to-end signal latency |
| `ttfs_cache_latency_seconds` | Histogram | Cache lookup time |
| `ttfs_cold_latency_seconds` | Histogram | Cold path computation time |
| `ttfs_signal_total` | Counter | Signals by kind/surface |
| `ttfs_cache_hit_total` | Counter | Cache hits |
| `ttfs_cache_miss_total` | Counter | Cache misses |
| `ttfs_slo_breach_total` | Counter | SLO breaches |
| `ttfs_error_total` | Counter | Errors by type |
7.2 Labels
All metrics include the following labels:
- `surface`: `ui|cli|ci`
- `cache_hit`: `true|false`
- `signal_source`: `snapshot|cold_start|failure_index`
- `kind`: Signal kind enum
- `tenant_id`: Tenant identifier (for multi-tenant deployments)
7.3 SLO Definitions
```yaml
# Prometheus rule file (the group name is illustrative)
groups:
  - name: ttfs-slo
    rules:
      # Recording rules
      - record: ttfs:slo:p50_target
        expr: 2.0 # seconds
      - record: ttfs:slo:p95_target
        expr: 5.0 # seconds
      # 1 when the 5m P95 is below the 5s target, 0 otherwise
      # (`bool` yields a 0/1 value instead of filtering out breaching samples)
      - record: ttfs:slo:compliance
        expr: |
          histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
            < bool 5.0
      # Alerting rules
      - alert: TtfsSloBreachP95
        expr: histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le)) > 5.0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "TTFS P95 exceeds 5s SLO"
      - alert: TtfsHighErrorRate
        expr: rate(ttfs_error_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
```
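The alert compares an estimate, not an exact percentile. `histogram_quantile` finds the bucket that contains the target rank and interpolates linearly within it. The following simplified TypeScript port illustrates that calculation. It omits edge cases Prometheus handles, such as negative bucket bounds, so it is an explanation rather than a reference implementation. `histogramQuantile` and `breachesP95` are illustrative names.

```typescript
// One cumulative bucket: `count` observations were <= `le` seconds.
interface Bucket { le: number; count: number; }

// Buckets must be sorted by `le` and end with the +Inf bucket (le = Infinity).
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2 || q < 0 || q > 1) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lowerLe = i > 0 ? buckets[i - 1].le : 0;
  const lowerCount = i > 0 ? buckets[i - 1].count : 0;
  const { le: upperLe, count: upperCount } = buckets[i];
  // Linear interpolation inside the bucket that contains the rank.
  return lowerLe + (upperLe - lowerLe) * ((rank - lowerCount) / (upperCount - lowerCount));
}

const P95_TARGET_SECONDS = 5.0;
const breachesP95 = (buckets: Bucket[]): boolean => histogramQuantile(0.95, buckets) > P95_TARGET_SECONDS;
```

A consequence of the interpolation is that the reported P95 can only be as precise as the bucket boundaries around 5s. Placing a bucket boundary exactly at the SLO target keeps the alert threshold meaningful.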
8) Frontend Integration
8.1 Component Hierarchy
```text
FirstSignalCard (Smart Component)
├── FirstSignalStore (Signal-based State)
│   ├── SSE subscription
│   ├── Polling fallback
│   └── ETag caching
├── StatusIndicator (Dumb Component)
│   └── kind → icon + color mapping
├── PhaseProgress (Dumb Component)
│   └── phase → progress bar
└── ActionButtons (Dumb Component)
    └── next_actions rendering
```
8.2 State Machine
```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// State transitions:
// idle → loading (initial fetch)
// loading → streaming (SSE connected) | error (fetch failed)
// streaming → done (terminal signal) | error (connection lost)
// error → loading (retry)
```
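Encoding the listed transitions as a lookup table makes invalid transitions fail loudly instead of silently corrupting store state. The following is a sketch. The event names (`fetch`, `sse_connected`, and so on) are hypothetical labels for the triggers in the comments above.

```typescript
type FirstSignalLoadState = "idle" | "loading" | "streaming" | "error" | "done";

// Hypothetical event names for the documented triggers.
type LoadEvent =
  | "fetch"           // idle → loading (initial fetch)
  | "sse_connected"   // loading → streaming
  | "fetch_failed"    // loading → error
  | "terminal_signal" // streaming → done
  | "connection_lost" // streaming → error
  | "retry";          // error → loading

const TRANSITIONS: Record<FirstSignalLoadState, Partial<Record<LoadEvent, FirstSignalLoadState>>> = {
  idle: { fetch: "loading" },
  loading: { sse_connected: "streaming", fetch_failed: "error" },
  streaming: { terminal_signal: "done", connection_lost: "error" },
  error: { retry: "loading" },
  done: {}, // terminal: no outgoing transitions
};

// Return the next state, or throw if the transition is not in the table.
function transition(state: FirstSignalLoadState, event: LoadEvent): FirstSignalLoadState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) throw new Error(`invalid transition: ${state} on ${event}`);
  return next;
}
```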
8.3 Animation Tokens
| Token | Value | Usage |
|---|---|---|
| `--motion-duration-quick` | 150ms | Skeleton fade, icon transitions |
| `--motion-duration-normal` | 250ms | Card expansion, phase transitions |
| `--motion-duration-slow` | 400ms | Success/failure celebrations |
| `--motion-easing-standard` | `cubic-bezier(0.4, 0, 0.2, 1)` | Default easing |
| `--motion-easing-decelerate` | `cubic-bezier(0, 0, 0.2, 1)` | Entries |
| `--motion-easing-accelerate` | `cubic-bezier(0.4, 0, 1, 1)` | Exits |
9) Failure Signatures
Failure signatures enable predictive "last known outcome" by pattern-matching historical failures.
9.1 Signature Schema
```json
{
  "signature_hash": "sha256:abc123...",
  "pattern": {
    "phase": "analyze",
    "error_code": "LAYER_EXTRACT_FAILED",
    "image_pattern": "registry.io/.*:v1.*"
  },
  "outcome": {
    "likely_cause": "Registry rate limiting",
    "mttr_p50_seconds": 300,
    "suggested_action": "Wait 5 minutes and retry"
  },
  "confidence": 0.87,
  "sample_count": 42
}
```
9.2 Usage
When a job enters a known failure pattern:
- Match the current job state against the `failure_signatures` table
- Enrich the signal with `last_known_outcome.likely_cause`
- Predict ETA based on historical MTTR
- Suggest remediation via `next_actions`
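The four steps can be sketched as follows, with field names taken from the schema in 9.1. The sketch makes three assumptions: `phase` and `error_code` match exactly, `image_pattern` is anchored as a full-string regular expression, and the highest-confidence match wins. `matches`, `enrichFromSignatures`, and `JobFailureState` are hypothetical names.

```typescript
interface FailureSignature {
  signature_hash: string;
  pattern: { phase: string; error_code: string; image_pattern: string };
  outcome: { likely_cause: string; mttr_p50_seconds: number; suggested_action: string };
  confidence: number;
  sample_count: number;
}

interface JobFailureState { phase: string; errorCode: string; image: string; }

interface SignalEnrichment {
  likely_cause: string;           // → last_known_outcome.likely_cause
  eta_seconds: number;            // predicted from historical MTTR (p50)
  next_action: { label: string }; // → appended to next_actions
}

// Step 1: does this signature describe the job's current failure?
function matches(sig: FailureSignature, job: JobFailureState): boolean {
  return (
    sig.pattern.phase === job.phase &&
    sig.pattern.error_code === job.errorCode &&
    new RegExp(`^(?:${sig.pattern.image_pattern})$`).test(job.image)
  );
}

// Steps 2-4: pick the most confident match and derive the enrichment fields.
function enrichFromSignatures(signatures: FailureSignature[], job: JobFailureState): SignalEnrichment | null {
  const best: FailureSignature | undefined = signatures
    .filter((s) => matches(s, job))
    .sort((a, b) => b.confidence - a.confidence)[0];
  if (best === undefined) return null;
  return {
    likely_cause: best.outcome.likely_cause,
    eta_seconds: best.outcome.mttr_p50_seconds,
    next_action: { label: best.outcome.suggested_action },
  };
}
```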
10) Database Schema
See docs/db/schemas/ttfs.sql for the complete schema definition.
10.1 Core Tables
| Table | Purpose |
|---|---|
| `scheduler.first_signal_snapshots` | Cached signal state per job |
| `scheduler.ttfs_events` | Telemetry event log |
| `scheduler.failure_signatures` | Historical failure patterns |
10.2 Hourly Rollup View
The scheduler.ttfs_hourly_summary view provides pre-aggregated metrics for dashboard performance.
11) Testing Requirements
11.1 Unit Tests
- Signal store state machine transitions
- ETag generation and validation
- Cache hit/miss scenarios
- Failure signature matching
11.2 Integration Tests
- End-to-end API latency measurement
- SSE connection lifecycle
- Air-gap mode fallback
- Multi-tenant isolation
11.3 Deterministic Fixtures
```typescript
// tests/fixtures/ttfs/
export const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};
```
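Pairing a seeded generator with a frozen clock is one way to make fixture data reproducible. This sketch uses mulberry32, a small and widely used 32-bit PRNG, chosen here purely for illustration. The source does not specify which generator consumes `DETERMINISTIC_SEED`, and `frozenNow` and `sampleEtaSeconds` are hypothetical helpers.

```typescript
// Values copied from TTFS_FIXTURES above.
const FROZEN_TIMESTAMP = "2025-12-04T12:00:00.000Z";
const DETERMINISTIC_SEED = 0x5eed2025;

// mulberry32: the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Frozen clock: tests read this instead of Date.now().
const frozenNow = (): number => Date.parse(FROZEN_TIMESTAMP);

// Hypothetical fixture helper: reproducible ETA values in [0, maxSeconds).
function sampleEtaSeconds(rng: () => number, maxSeconds: number): number {
  return Math.floor(rng() * maxSeconds);
}
```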
12) References
- Advisory: `docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md`
- Sprint 1 (Foundation): `docs/implplan/SPRINT_0338_0001_0001_ttfs_foundation.md`
- Sprint 2 (API): `docs/implplan/SPRINT_0339_0001_0001_first_signal_api.md`
- Sprint 3 (UI): `docs/implplan/SPRINT_0340_0001_0001_first_signal_card_ui.md`
- Sprint 4 (Enhancements): `docs/implplan/SPRINT_0341_0001_0001_ttfs_enhancements.md`
- TTE Architecture: `docs/modules/telemetry/architecture.md`
- Telemetry Schema: `docs/schemas/ttfs-event.schema.json`
- Database Schema: `docs/db/schemas/ttfs.sql`