Time-to-First-Signal (TTFS) Architecture

Derived from the Product Advisory of 14-Dec-2025, "UX and Time-to-Evidence Technical Reference". This document describes the TTFS subsystem, which gives users immediate feedback on run and job status.

1) Overview

Time-to-First-Signal (TTFS) measures the latency from user action (opening a run, starting a scan, CLI invocation) to the first meaningful signal being displayed or logged. This architecture ensures users receive immediate feedback regardless of actual job completion time.

1.1 Design Goals

  • Instant Feedback: P50 < 2s, P95 < 5s across all surfaces (UI, CLI, CI)
  • Graceful Degradation: Skeleton → Cached Signal → Live Data progression
  • Offline-First: Full functionality in air-gapped environments using PostgreSQL NOTIFY/LISTEN
  • Predictive Context: Provide "last known outcome" and ETA estimates for in-progress jobs

1.2 Signal Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                           TTFS Signal Flow                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   User Action          API Layer              Cache Layer      Data Layer   │
│   ───────────          ─────────              ───────────      ──────────   │
│                                                                             │
│   [Route Enter] ──┬──► /first-signal ───────► Valkey/Redis ─┐              │
│   [CLI Start]  ───┤        │                      │         │              │
│   [CI Job]     ───┘        │                      │         ▼              │
│                            │                      │    ┌──────────────┐    │
│                            ▼                      │    │ PostgreSQL   │    │
│                      ┌──────────┐                 │    │ first_signal │    │
│                      │ ETag     │◄────────────────┤    │ _snapshots   │    │
│                      │ Validation│                │    └──────────────┘    │
│                      └──────────┘                 │                        │
│                            │                      │                        │
│                            ▼                      ▼                        │
│                      ┌──────────────────────────────┐                      │
│                      │     Response Assembly        │                      │
│                      │ • kind (status indicator)    │                      │
│                      │ • phase (current stage)      │                      │
│                      │ • summary (human text)       │                      │
│                      │ • eta_seconds (estimate)     │                      │
│                      │ • last_known_outcome         │                      │
│                      │ • next_actions               │                      │
│                      └──────────────────────────────┘                      │
│                                    │                                        │
│                                    ▼                                        │
│                      ┌──────────────────────────────┐                      │
│                      │     SSE / Polling Client     │                      │
│                      └──────────────────────────────┘                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

2) Component Budgets

The 5-second P95 budget is allocated across components:

| Component | P50 Budget | P95 Budget | Notes |
|-----------|------------|------------|-------|
| Frontend (skeleton + hydration) | 100ms | 150ms | Network-independent |
| Edge API (auth + routing) | 150ms | 250ms | JWT validation, rate limiting |
| Core Services (lookup + assembly) | 700ms | 1,500ms | Cache hit vs cold path |
| SSE/WebSocket establishment | - | 300ms | Fallback to polling if exceeded |
| Total (warm path) | 700ms | 2,500ms | Cache hit scenario |
| Total (cold path) | 1,200ms | 4,000ms | Cache miss, compute required |

3) Signal Kinds

The kind field indicates the current signal state:

| Kind | Description | Typical Duration | Icon |
|------|-------------|------------------|------|
| queued | Job waiting in queue | 0-30s | Queue |
| started | Job has begun execution | - | Play |
| phase | Job in specific phase | Varies | Progress |
| blocked | Waiting on dependency/policy | - | Pause |
| failed | Job has failed | - | Error |
| succeeded | Job completed successfully | - | Check |
| canceled | Job was canceled | - | Cancel |
| unavailable | Signal cannot be determined | - | Unknown |
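A client can model the kinds above as a closed union so that an unhandled kind becomes a compile-time error. The sketch below is one possible typing; the icon names come from the table, while `TERMINAL_KINDS` (treating failed, succeeded, and canceled as stream-ending) and the `parseKind` fallback to `unavailable` are assumptions made for illustration.

```typescript
type SignalKind =
  | "queued" | "started" | "phase" | "blocked"
  | "failed" | "succeeded" | "canceled" | "unavailable";

// Icon names as listed in the signal kinds table.
const KIND_ICONS: Record<SignalKind, string> = {
  queued: "Queue",
  started: "Play",
  phase: "Progress",
  blocked: "Pause",
  failed: "Error",
  succeeded: "Check",
  canceled: "Cancel",
  unavailable: "Unknown",
};

// Assumption: these kinds mean no further signals will arrive for the job.
const TERMINAL_KINDS: SignalKind[] = ["failed", "succeeded", "canceled"];

function isTerminal(kind: SignalKind): boolean {
  return TERMINAL_KINDS.indexOf(kind) !== -1;
}

// Narrows an untrusted string (for example a JSON field) to a SignalKind.
// Unknown values map to "unavailable" instead of throwing.
function parseKind(raw: string): SignalKind {
  return Object.keys(KIND_ICONS).indexOf(raw) !== -1 ? (raw as SignalKind) : "unavailable";
}
```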

4) Signal Phases

The phase field indicates the current execution phase:

| Phase | Description | SLO Target |
|-------|-------------|------------|
| resolve | Dependency/artifact resolution | P95 < 30s |
| fetch | Data retrieval (registry, advisories) | P95 < 45s |
| restore | Cache/snapshot restoration | P95 < 10s |
| analyze | Analysis execution (scan, policy) | P95 < 120s |
| policy | Policy evaluation | P95 < 15s |
| report | Report generation/upload | P95 < 30s |
| unknown | Phase cannot be determined | - |

5) API Contracts

5.1 First Signal Endpoint

GET /api/v1/orchestrator/jobs/{jobId}/first-signal
Accept: application/json
If-None-Match: "{etag}"

200 OK
ETag: "job-{id}-{updated_at.unix_ms}"
Cache-Control: private, max-age=1, stale-while-revalidate=5
X-Signal-Source: snapshot | cold_start | failure_index

{
  "kind": "started",
  "phase": "analyze",
  "summary": "Scanning image layers (47%)",
  "eta_seconds": 38,
  "last_known_outcome": {
    "status": "succeeded",
    "finished_at": "2025-12-13T10:15:00Z",
    "findings_count": 12
  },
  "next_actions": [
    {"label": "View previous run", "href": "/runs/abc-123"}
  ],
  "diagnostics": {
    "queue_position": null,
    "worker_id": "worker-7"
  }
}

304 Not Modified (if ETag matches)
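The conditional-request behavior above can be sketched as a pure function that decides between 200 and 304. This is a simplified illustration, not the orchestrator's handler: `SignalSnapshot`, `buildEtag`, and `respondFirstSignal` are hypothetical names, and only an exact single-value `If-None-Match` comparison is handled (no `*`, weak validators, or comma-separated lists).

```typescript
interface SignalSnapshot {
  jobId: string;
  updatedAt: Date;
  body: { kind: string; phase: string; summary: string };
}

interface SignalResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body?: SignalSnapshot["body"];
}

// Builds the quoted ETag in the documented job-{id}-{updated_at.unix_ms} shape.
function buildEtag(snapshot: SignalSnapshot): string {
  return `"job-${snapshot.jobId}-${snapshot.updatedAt.getTime()}"`;
}

// Returns 304 without a body when the client's validator is still current,
// otherwise 200 with the full signal payload.
function respondFirstSignal(snapshot: SignalSnapshot, ifNoneMatch?: string): SignalResponse {
  const etag = buildEtag(snapshot);
  const headers: Record<string, string> = {
    ETag: etag,
    "Cache-Control": "private, max-age=1, stale-while-revalidate=5",
  };
  if (ifNoneMatch !== undefined && ifNoneMatch === etag) {
    return { status: 304, headers };
  }
  return { status: 200, headers, body: snapshot.body };
}
```

Because the ETag embeds the snapshot's update time in milliseconds, any write to the snapshot invalidates the client's cached copy on the next request.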

5.2 SSE Stream

GET /api/v1/orchestrator/stream/jobs/{jobId}/first-signal
Accept: text/event-stream

event: signal
data: {"kind":"started","phase":"analyze",...}

event: signal
data: {"kind":"phase","phase":"policy",...}

event: done
data: {"kind":"succeeded",...}
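For clients that cannot use a built-in SSE implementation, the stream format above can be parsed by hand. The sketch below handles only the `event:` and `data:` fields of a fully buffered payload and stops at the `done` event; `parseSse` and `consumeStream` are hypothetical helpers, and a real client would also need incremental parsing, reconnection, and the other SSE fields.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Splits a buffered SSE payload into events. Events are separated by a blank line.
function parseSse(payload: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of payload.split(/\r?\n\r?\n/)) {
    let event = "message"; // default event type when no event: field is present
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith("event:")) {
        event = line.slice("event:".length).trim();
      } else if (line.startsWith("data:")) {
        // A single leading space after the colon is not part of the value.
        dataLines.push(line.slice("data:".length).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join("\n") });
    }
  }
  return events;
}

// Collects signal kinds in order and reports whether the done event was seen.
function consumeStream(payload: string): { kinds: string[]; done: boolean } {
  const kinds: string[] = [];
  let done = false;
  for (const e of parseSse(payload)) {
    const signal = JSON.parse(e.data) as { kind: string };
    kinds.push(signal.kind);
    if (e.event === "done") {
      done = true;
      break;
    }
  }
  return { kinds, done };
}
```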

5.3 CLI Integration

# Job status with immediate signal
stella job status <job-id> --watch

# Output progression:
# [queued] Waiting in queue (position: 3)
# [started] Job started on worker-7
# [phase:analyze] Scanning image layers (47%)
# [succeeded] Completed in 2m 34s
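The output progression above follows a `[tag] summary` shape, where the tag is the kind, or `phase:<name>` for phase signals. A formatter along these lines would produce it; `formatSignalLine` is a hypothetical helper and says nothing about how the `stella` CLI is actually implemented.

```typescript
interface CliSignal {
  kind: string;
  phase?: string;
  summary: string;
}

// Renders one watch-mode line. Phase signals include the phase name in the tag.
function formatSignalLine(signal: CliSignal): string {
  const tag =
    signal.kind === "phase" && signal.phase !== undefined
      ? `phase:${signal.phase}`
      : signal.kind;
  return `[${tag}] ${signal.summary}`;
}
```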

6) Caching Strategy

6.1 Cache Tiers

| Tier | Storage | TTL | Use Case |
|------|---------|-----|----------|
| L1 | In-memory (per-instance) | 1s | Hot path, same-instance requests |
| L2 | Valkey/Redis | 5s | Cross-instance, active jobs |
| L3 | PostgreSQL | 24h | Persistent snapshots, air-gap mode |
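The tier order and TTLs above imply a read-through lookup: try L1, then L2, then L3, and refill the faster tiers on a hit. The sketch below simulates all three tiers with in-memory maps and an injected clock so the expiry behavior is easy to follow. `TtlTier`, `TieredCache`, and `buildTtfsCache` are hypothetical names, and the refill-on-hit policy is an assumption rather than documented behavior.

```typescript
type Clock = () => number; // current time in milliseconds

interface Entry<V> {
  value: V;
  expiresAt: number;
}

class TtlTier<V> {
  readonly name: string;
  private readonly ttlMs: number;
  private readonly now: Clock;
  private readonly entries = new Map<string, Entry<V>>();

  constructor(name: string, ttlMs: number, now: Clock) {
    this.name = name;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

class TieredCache<V> {
  private readonly tiers: TtlTier<V>[];

  constructor(tiers: TtlTier<V>[]) {
    this.tiers = tiers;
  }

  // Returns the value and the tier that served it, refilling faster tiers on a hit.
  lookup(key: string): { value: V; tier: string } | undefined {
    for (let i = 0; i < this.tiers.length; i++) {
      const value = this.tiers[i].get(key);
      if (value !== undefined) {
        for (let j = 0; j < i; j++) this.tiers[j].set(key, value);
        return { value, tier: this.tiers[i].name };
      }
    }
    return undefined;
  }

  store(key: string, value: V): void {
    for (const tier of this.tiers) tier.set(key, value);
  }
}

// TTLs from the cache tiers table: 1s, 5s, 24h.
function buildTtfsCache<V>(now: Clock): TieredCache<V> {
  return new TieredCache<V>([
    new TtlTier<V>("L1", 1000, now),
    new TtlTier<V>("L2", 5000, now),
    new TtlTier<V>("L3", 24 * 60 * 60 * 1000, now),
  ]);
}
```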

6.2 Cache Keys

ttfs:job:{tenant_id}:{job_id}:signal      # Current signal
ttfs:job:{tenant_id}:{job_id}:eta         # ETA prediction
ttfs:run:{tenant_id}:{run_id}:signals     # Run-level aggregation
ttfs:tenant:{tenant_id}:failure_sig       # Failure signatures
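Centralizing key construction keeps the layout above consistent across services. In the sketch below, `ttfsKeys` and `segment` are hypothetical helpers; rejecting empty segments and segments containing `:` is an added assumption intended to keep one tenant's identifiers from colliding with another's key space.

```typescript
// Validates a single key segment so that identifiers cannot inject extra separators.
function segment(value: string): string {
  if (value.length === 0 || value.includes(":")) {
    throw new Error(`invalid cache key segment: "${value}"`);
  }
  return value;
}

const ttfsKeys = {
  // Current signal for a job.
  jobSignal: (tenantId: string, jobId: string): string =>
    `ttfs:job:${segment(tenantId)}:${segment(jobId)}:signal`,
  // ETA prediction for a job.
  jobEta: (tenantId: string, jobId: string): string =>
    `ttfs:job:${segment(tenantId)}:${segment(jobId)}:eta`,
  // Run-level aggregation of signals.
  runSignals: (tenantId: string, runId: string): string =>
    `ttfs:run:${segment(tenantId)}:${segment(runId)}:signals`,
  // Failure signatures for a tenant.
  tenantFailureSignatures: (tenantId: string): string =>
    `ttfs:tenant:${segment(tenantId)}:failure_sig`,
};
```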

6.3 Air-Gap Mode

In air-gapped environments without Valkey/Redis:

  1. PostgreSQL NOTIFY/LISTEN replaces pub/sub for real-time updates
  2. Polling fallback with 2-second intervals
  3. The first_signal_snapshots table (the L3 tier) takes over the L2 role that Valkey/Redis normally fills
  4. All SSE endpoints gracefully degrade to long-polling

7) Telemetry & Observability

7.1 Metrics

| Metric | Type | Description |
|--------|------|-------------|
| ttfs_latency_seconds | Histogram | End-to-end signal latency |
| ttfs_cache_latency_seconds | Histogram | Cache lookup time |
| ttfs_cold_latency_seconds | Histogram | Cold path computation time |
| ttfs_signal_total | Counter | Signals by kind/surface |
| ttfs_cache_hit_total | Counter | Cache hits |
| ttfs_cache_miss_total | Counter | Cache misses |
| ttfs_slo_breach_total | Counter | SLO breaches |
| ttfs_error_total | Counter | Errors by type |

7.2 Labels

All metrics include the following labels:

  • surface: ui | cli | ci
  • cache_hit: true | false
  • signal_source: snapshot | cold_start | failure_index
  • kind: Signal kind enum
  • tenant_id: Tenant identifier (for multi-tenant deployments)
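Typing the label set keeps emitters from drifting away from the list above. The sketch below defines the labels and renders one sample line in the Prometheus text exposition format; `TtfsLabels` and `renderSample` are hypothetical, and a real service would normally rely on a metrics client library rather than formatting lines by hand.

```typescript
type Surface = "ui" | "cli" | "ci";
type SignalSource = "snapshot" | "cold_start" | "failure_index";

interface TtfsLabels {
  surface: Surface;
  cache_hit: boolean;
  signal_source: SignalSource;
  kind: string;
  tenant_id: string;
}

// Fixed label order so rendered lines are stable and easy to compare in tests.
const LABEL_ORDER = ["surface", "cache_hit", "signal_source", "kind", "tenant_id"] as const;

// Escapes backslash, double quote, and newline as the text exposition format requires.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders a single sample, for example: metric{label="value",...} 1
function renderSample(metric: string, labels: TtfsLabels, value: number): string {
  const pairs = LABEL_ORDER.map(
    (name) => `${name}="${escapeLabelValue(String(labels[name]))}"`
  );
  return `${metric}{${pairs.join(",")}} ${value}`;
}
```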

7.3 SLO Definitions

# Prometheus recording rules
- record: ttfs:slo:p50_target
  expr: 2.0  # seconds

- record: ttfs:slo:p95_target
  expr: 5.0  # seconds

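- record: ttfs:slo:compliance
  # 1 while the 5-minute P95 is under the 5s target, 0 otherwise.
  # Without `bool`, the comparison filters the series out entirely on a breach.
  expr: |
    histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
    < bool 5.0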

# Alerting rules
- alert: TtfsSloBreachP95
  expr: histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le)) > 5.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "TTFS P95 exceeds 5s SLO"

- alert: TtfsHighErrorRate
  expr: rate(ttfs_error_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning

8) Frontend Integration

8.1 Component Hierarchy

FirstSignalCard (Smart Component)
├── FirstSignalStore (Signal-based State)
│   ├── SSE subscription
│   ├── Polling fallback
│   └── ETag caching
├── StatusIndicator (Dumb Component)
│   └── kind → icon + color mapping
├── PhaseProgress (Dumb Component)
│   └── phase → progress bar
└── ActionButtons (Dumb Component)
    └── next_actions rendering

8.2 State Machine

type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// State transitions:
// idle → loading (initial fetch)
// loading → streaming (SSE connected) | error (fetch failed)
// streaming → done (terminal signal) | error (connection lost)
// error → loading (retry)
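The transitions listed in the comments can be encoded as a lookup table so that illegal moves fail loudly. This is a sketch under assumptions: the `LoadEvent` names and the `transition` helper are hypothetical, and `done` is treated as a state with no outgoing transitions.

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// Hypothetical event names, one per arrow in the transition list.
type LoadEvent =
  | 'fetch'
  | 'sse_connected'
  | 'fetch_failed'
  | 'terminal_signal'
  | 'connection_lost'
  | 'retry';

const TRANSITIONS: Record<FirstSignalLoadState, Partial<Record<LoadEvent, FirstSignalLoadState>>> = {
  idle: { fetch: 'loading' },
  loading: { sse_connected: 'streaming', fetch_failed: 'error' },
  streaming: { terminal_signal: 'done', connection_lost: 'error' },
  error: { retry: 'loading' },
  done: {}, // terminal: no transitions out
};

// Returns the next state, or throws if the event is not valid in the current state.
function transition(state: FirstSignalLoadState, event: LoadEvent): FirstSignalLoadState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`Illegal transition: ${state} + ${event}`);
  }
  return next;
}
```

Keeping the table as data makes the unit test for "signal store state machine transitions" a matter of enumerating the rows.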

8.3 Animation Tokens

| Token | Value | Usage |
|-------|-------|-------|
| --motion-duration-quick | 150ms | Skeleton fade, icon transitions |
| --motion-duration-normal | 250ms | Card expansion, phase transitions |
| --motion-duration-slow | 400ms | Success/failure celebrations |
| --motion-easing-standard | cubic-bezier(0.4, 0, 0.2, 1) | Default easing |
| --motion-easing-decelerate | cubic-bezier(0, 0, 0.2, 1) | Entries |
| --motion-easing-accelerate | cubic-bezier(0.4, 0, 1, 1) | Exits |

9) Failure Signatures

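Failure signatures make a predictive "last known outcome" possible: the current job is matched against patterns extracted from historical failures, and the matching pattern supplies a likely cause, an expected recovery time, and a suggested action.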

9.1 Signature Schema

{
  "signature_hash": "sha256:abc123...",
  "pattern": {
    "phase": "analyze",
    "error_code": "LAYER_EXTRACT_FAILED",
    "image_pattern": "registry.io/.*:v1.*"
  },
  "outcome": {
    "likely_cause": "Registry rate limiting",
    "mttr_p50_seconds": 300,
    "suggested_action": "Wait 5 minutes and retry"
  },
  "confidence": 0.87,
  "sample_count": 42
}

9.2 Usage

When a job enters a known failure pattern:

  1. Match current job state against failure_signatures table
  2. Enrich signal with last_known_outcome.likely_cause
  3. Predict ETA based on historical MTTR
  4. Suggest remediation via next_actions
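The four steps above can be sketched as a match-then-enrich function over signatures shaped like the schema in 9.1. Several details here are assumptions made for illustration: `image_pattern` is treated as a regular expression anchored to the whole image reference, signatures below a `minConfidence` threshold are ignored, the highest-confidence match wins, and the names `matchesSignature` and `enrichFromSignatures` are hypothetical.

```typescript
interface FailureSignature {
  signature_hash: string;
  pattern: { phase: string; error_code: string; image_pattern: string };
  outcome: { likely_cause: string; mttr_p50_seconds: number; suggested_action: string };
  confidence: number;
  sample_count: number;
}

interface JobFailureState {
  phase: string;
  error_code: string;
  image: string;
}

interface SignalEnrichment {
  last_known_outcome: { likely_cause: string };
  eta_seconds: number;
  next_actions: { label: string; href?: string }[];
}

// Step 1: a signature matches when phase, error code, and image pattern all agree.
function matchesSignature(sig: FailureSignature, job: JobFailureState): boolean {
  return (
    sig.pattern.phase === job.phase &&
    sig.pattern.error_code === job.error_code &&
    new RegExp(`^(?:${sig.pattern.image_pattern})$`).test(job.image)
  );
}

// Steps 2-4: take the most confident match and derive cause, ETA, and next action.
function enrichFromSignatures(
  job: JobFailureState,
  signatures: FailureSignature[],
  minConfidence = 0.5
): SignalEnrichment | undefined {
  const candidates = signatures
    .filter((sig) => sig.confidence >= minConfidence && matchesSignature(sig, job))
    .sort((a, b) => b.confidence - a.confidence);
  if (candidates.length === 0) return undefined;
  const best = candidates[0];
  return {
    last_known_outcome: { likely_cause: best.outcome.likely_cause },
    eta_seconds: best.outcome.mttr_p50_seconds, // historical median time to recovery
    next_actions: [{ label: best.outcome.suggested_action }],
  };
}
```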

10) Database Schema

See docs/db/schemas/ttfs.sql for the complete schema definition.

10.1 Core Tables

| Table | Purpose |
|-------|---------|
| scheduler.first_signal_snapshots | Cached signal state per job |
| scheduler.ttfs_events | Telemetry event log |
| scheduler.failure_signatures | Historical failure patterns |

10.2 Hourly Rollup View

The scheduler.ttfs_hourly_summary view provides pre-aggregated metrics for dashboard performance.

11) Testing Requirements

11.1 Unit Tests

  • Signal store state machine transitions
  • ETag generation and validation
  • Cache hit/miss scenarios
  • Failure signature matching

11.2 Integration Tests

  • End-to-end API latency measurement
  • SSE connection lifecycle
  • Air-gap mode fallback
  • Multi-tenant isolation

11.3 Deterministic Fixtures

// tests/fixtures/ttfs/
export const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};
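One way to put these fixtures to work is a frozen clock plus a seeded generator, so that every test run sees the same timestamps and the same synthetic latencies. The sketch below repeats the fixture values for self-containment; `frozenNow`, `seededRandom` (a mulberry32-style generator), and `sampleLatencies` are hypothetical helpers, not part of the existing test suite.

```typescript
const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};

// Clock that always returns the frozen fixture time, in milliseconds since the epoch.
function frozenNow(): number {
  return new Date(TTFS_FIXTURES.FROZEN_TIMESTAMP).getTime();
}

// Small seeded PRNG (mulberry32-style). The same seed always yields the same sequence
// of values in the range [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generates `count` reproducible latencies between 0 and maxSeconds.
function sampleLatencies(count: number, maxSeconds: number): number[] {
  const next = seededRandom(TTFS_FIXTURES.DETERMINISTIC_SEED);
  const samples: number[] = [];
  for (let i = 0; i < count; i++) {
    samples.push(next() * maxSeconds);
  }
  return samples;
}
```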

12) References

  • Advisory: docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md
  • Sprint 1 (Foundation): docs/implplan/SPRINT_0338_0001_0001_ttfs_foundation.md
  • Sprint 2 (API): docs/implplan/SPRINT_0339_0001_0001_first_signal_api.md
  • Sprint 3 (UI): docs/implplan/SPRINT_0340_0001_0001_first_signal_card_ui.md
  • Sprint 4 (Enhancements): docs/implplan/SPRINT_0341_0001_0001_ttfs_enhancements.md
  • TTE Architecture: docs/modules/telemetry/architecture.md
  • Telemetry Schema: docs/schemas/ttfs-event.schema.json
  • Database Schema: docs/db/schemas/ttfs.sql