StellaOps Bot b058dbe031 up

2025-12-14 23:20:14 +02:00

Time-to-First-Signal (TTFS) Architecture

Derived from the Product Advisory of 14-Dec-2025, "UX and Time-to-Evidence Technical Reference". This document details the TTFS subsystem, which provides immediate feedback on run and job status.

1) Overview

Time-to-First-Signal (TTFS) measures the latency from a user action (opening a run, starting a scan, invoking the CLI) to the first meaningful signal being displayed or logged. The architecture is designed to give users feedback within the latency targets below, regardless of how long the job itself takes to complete.

1.1 Design Goals

Instant Feedback: P50 < 2s, P95 < 5s across all surfaces (UI, CLI, CI)
Graceful Degradation: Skeleton → Cached Signal → Live Data progression
Offline-First: Full functionality in air-gapped environments using PostgreSQL NOTIFY/LISTEN
Predictive Context: Provide "last known outcome" and ETA estimates for in-progress jobs

1.2 Signal Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                           TTFS Signal Flow                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   User Action          API Layer              Cache Layer      Data Layer   │
│   ───────────          ─────────              ───────────      ──────────   │
│                                                                             │
│   [Route Enter] ──┬──► /first-signal ───────► Valkey/Redis ─┐              │
│   [CLI Start]  ───┤        │                      │         │              │
│   [CI Job]     ───┘        │                      │         ▼              │
│                            │                      │    ┌──────────────┐    │
│                            ▼                      │    │ PostgreSQL   │    │
│                      ┌──────────┐                 │    │ first_signal │    │
│                      │ ETag     │◄────────────────┤    │ _snapshots   │    │
│                      │ Validation│                │    └──────────────┘    │
│                      └──────────┘                 │                        │
│                            │                      │                        │
│                            ▼                      ▼                        │
│                      ┌──────────────────────────────┐                      │
│                      │     Response Assembly        │                      │
│                      │ • kind (status indicator)    │                      │
│                      │ • phase (current stage)      │                      │
│                      │ • summary (human text)       │                      │
│                      │ • eta_seconds (estimate)     │                      │
│                      │ • last_known_outcome         │                      │
│                      │ • next_actions               │                      │
│                      └──────────────────────────────┘                      │
│                                    │                                        │
│                                    ▼                                        │
│                      ┌──────────────────────────────┐                      │
│                      │     SSE / Polling Client     │                      │
│                      └──────────────────────────────┘                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

2) Component Budgets

The 5-second P95 budget is allocated across components:

Component	P50 Budget	P95 Budget	Notes
Frontend (skeleton + hydration)	100ms	150ms	Network-independent
Edge API (auth + routing)	150ms	250ms	JWT validation, rate limiting
Core Services (lookup + assembly)	700ms	1,500ms	Cache hit vs cold path
SSE/WebSocket establishment	—	300ms	Fallback to polling if exceeded
Total (warm path)	700ms	2,500ms	Cache hit scenario
Total (cold path)	1,200ms	4,000ms	Cache miss, compute required
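
As a sanity check on the allocation, the component rows can be summed and compared with the 5-second P95 target. The sketch below does only that arithmetic. Percentiles do not add exactly, so the sum is a conservative upper bound rather than a prediction. Note that the Total rows in the table are not the arithmetic sums of the component rows (the components sum to 950ms at P50 and 2,200ms at P95); how those totals were derived is not stated here, so the sketch leaves them out. The type and constant names are illustrative.

```typescript
// Component budgets in milliseconds, copied from the table above.
// A null P50 means the table gives no P50 figure for that component.
interface ComponentBudget {
  component: string;
  p50Ms: number | null;
  p95Ms: number;
}

const COMPONENT_BUDGETS: ComponentBudget[] = [
  { component: 'Frontend (skeleton + hydration)', p50Ms: 100, p95Ms: 150 },
  { component: 'Edge API (auth + routing)', p50Ms: 150, p95Ms: 250 },
  { component: 'Core Services (lookup + assembly)', p50Ms: 700, p95Ms: 1500 },
  { component: 'SSE/WebSocket establishment', p50Ms: null, p95Ms: 300 },
];

// Overall P95 target from the design goals (5 seconds).
const P95_SLO_MS = 5000;

// Sum one column, treating a missing value as zero.
function sumBudget(
  rows: ComponentBudget[],
  pick: (row: ComponentBudget) => number | null,
): number {
  return rows.reduce((total, row) => total + (pick(row) ?? 0), 0);
}

const p50SumMs = sumBudget(COMPONENT_BUDGETS, (row) => row.p50Ms);
const p95SumMs = sumBudget(COMPONENT_BUDGETS, (row) => row.p95Ms);
const p95HeadroomMs = P95_SLO_MS - p95SumMs;
```

With the figures above, the P95 component budgets leave 2,800ms of headroom under the 5-second target.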

3) Signal Kinds

The kind field indicates the current signal state:

Kind	Description	Typical Duration	Icon
`queued`	Job waiting in queue	0-30s	Queue
`started`	Job has begun execution	—	Play
`phase`	Job in specific phase	Varies	Progress
`blocked`	Waiting on dependency/policy	—	Pause
`failed`	Job has failed	—	Error
`succeeded`	Job completed successfully	—	Check
`canceled`	Job was canceled	—	Cancel
`unavailable`	Signal cannot be determined	—	Unknown
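
The kinds in the table map naturally onto a string union on the client side. The following is a minimal sketch, not the project's actual type definitions. It assumes that `failed`, `succeeded` and `canceled` are the terminal kinds, and that an unrecognized value should fall back to `unavailable`. Both are assumptions that the table does not state explicitly.

```typescript
type SignalKind =
  | 'queued' | 'started' | 'phase' | 'blocked'
  | 'failed' | 'succeeded' | 'canceled' | 'unavailable';

// Icon names taken from the table above.
const KIND_ICONS: Record<SignalKind, string> = {
  queued: 'Queue',
  started: 'Play',
  phase: 'Progress',
  blocked: 'Pause',
  failed: 'Error',
  succeeded: 'Check',
  canceled: 'Cancel',
  unavailable: 'Unknown',
};

// Assumption: these three kinds end the job's lifecycle.
const TERMINAL_KINDS: SignalKind[] = ['failed', 'succeeded', 'canceled'];

function isTerminalKind(kind: SignalKind): boolean {
  return TERMINAL_KINDS.indexOf(kind) !== -1;
}

// Narrow an untrusted string to a SignalKind; unknown values become 'unavailable'.
function parseSignalKind(raw: string): SignalKind {
  return Object.prototype.hasOwnProperty.call(KIND_ICONS, raw)
    ? (raw as SignalKind)
    : 'unavailable';
}
```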

4) Signal Phases

The phase field indicates the current execution phase:

Phase	Description	SLO Target
`resolve`	Dependency/artifact resolution	P95 < 30s
`fetch`	Data retrieval (registry, advisories)	P95 < 45s
`restore`	Cache/snapshot restoration	P95 < 10s
`analyze`	Analysis execution (scan, policy)	P95 < 120s
`policy`	Policy evaluation	P95 < 15s
`report`	Report generation/upload	P95 < 30s
`unknown`	Phase cannot be determined	—
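
The per-phase targets can be kept as a lookup table so that a breach check is a single comparison. This is a sketch under the assumption that an observed P95 equal to the target already counts as a breach, since the table states the targets as strict "less than" bounds. The function name is hypothetical.

```typescript
type SignalPhase =
  | 'resolve' | 'fetch' | 'restore' | 'analyze' | 'policy' | 'report' | 'unknown';

// P95 targets in seconds from the table above; 'unknown' has no target.
const PHASE_P95_TARGET_SECONDS: Record<SignalPhase, number | null> = {
  resolve: 30,
  fetch: 45,
  restore: 10,
  analyze: 120,
  policy: 15,
  report: 30,
  unknown: null,
};

// True when the observed P95 for a phase fails its "P95 < target" bound.
function breachesPhaseSlo(phase: SignalPhase, observedP95Seconds: number): boolean {
  const target = PHASE_P95_TARGET_SECONDS[phase];
  return target !== null && observedP95Seconds >= target;
}
```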

5) API Contracts

5.1 First Signal Endpoint

GET /api/v1/orchestrator/jobs/{jobId}/first-signal
Accept: application/json
If-None-Match: "{etag}"

200 OK
ETag: "job-{id}-{updated_at.unix_ms}"
Cache-Control: private, max-age=1, stale-while-revalidate=5
X-Signal-Source: snapshot | cold_start | failure_index

{
  "kind": "started",
  "phase": "analyze",
  "summary": "Scanning image layers (47%)",
  "eta_seconds": 38,
  "last_known_outcome": {
    "status": "succeeded",
    "finished_at": "2025-12-13T10:15:00Z",
    "findings_count": 12
  },
  "next_actions": [
    {"label": "View previous run", "href": "/runs/abc-123"}
  ],
  "diagnostics": {
    "queue_position": null,
    "worker_id": "worker-7"
  }
}

304 Not Modified (if ETag matches)
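
The conditional request above can be sketched as a pure function that builds the ETag in the documented shape and decides between `200` and `304`. This is an illustration, not the service's handler: `respondToConditionalGet` and `ConditionalResult` are hypothetical names, and the handling of `*`, comma-separated lists, and weak validators follows the general HTTP rules for `If-None-Match` rather than anything stated in this document.

```typescript
// ETag in the documented shape: "job-{id}-{updated_at.unix_ms}" (quotes included).
function buildFirstSignalEtag(jobId: string, updatedAt: Date): string {
  return `"job-${jobId}-${updatedAt.getTime()}"`;
}

interface ConditionalResult {
  statusCode: 200 | 304;
  etag: string;
  body: string | null;
}

// Hypothetical helper: choose 200 or 304 from the If-None-Match request header.
function respondToConditionalGet(
  ifNoneMatch: string | undefined,
  jobId: string,
  updatedAt: Date,
  signalJson: string,
): ConditionalResult {
  const etag = buildFirstSignalEtag(jobId, updatedAt);
  // The header may carry "*" or a comma-separated list; the weak prefix W/ is
  // ignored because If-None-Match uses weak comparison.
  const candidates = (ifNoneMatch ?? '')
    .split(',')
    .map((tag) => tag.trim().replace(/^W\//, ''));
  const matches = candidates.indexOf('*') !== -1 || candidates.indexOf(etag) !== -1;
  return matches
    ? { statusCode: 304, etag, body: null }
    : { statusCode: 200, etag, body: signalJson };
}
```

Because the ETag is derived from the job's update timestamp, any change to the snapshot produces a new tag, and a client holding the old one receives the full body again.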

5.2 SSE Stream

GET /api/v1/orchestrator/stream/jobs/{jobId}/first-signal
Accept: text/event-stream

event: signal
data: {"kind":"started","phase":"analyze",...}

event: signal
data: {"kind":"phase","phase":"policy",...}

event: done
data: {"kind":"succeeded",...}
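
To make the framing of the stream concrete, here is a minimal parser for complete `text/event-stream` frames. It is a sketch that handles only the `event:` and `data:` fields used above and assumes each chunk contains whole frames. A browser client would normally rely on the built-in `EventSource` instead. Treating `done` as the end-of-stream marker follows the example above; the function names are hypothetical.

```typescript
interface SseFrame {
  eventName: string;
  data: string;
}

// Split a chunk into frames (separated by a blank line) and read the
// `event:` and `data:` fields of each frame. Other fields are ignored.
function parseSseFrames(chunk: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let eventName = 'message'; // default event type when no `event:` field is present
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.indexOf('event:') === 0) {
        eventName = line.slice(6).trim();
      } else if (line.indexOf('data:') === 0) {
        // Drop the single optional space after the colon.
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length > 0) {
      frames.push({ eventName, data: dataLines.join('\n') });
    }
  }
  return frames;
}

// The example stream ends with a `done` event carrying the terminal signal.
function isStreamFinished(frame: SseFrame): boolean {
  return frame.eventName === 'done';
}
```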

5.3 CLI Integration

# Job status with immediate signal
stella job status <job-id> --watch

# Output progression:
# [queued] Waiting in queue (position: 3)
# [started] Job started on worker-7
# [phase:analyze] Scanning image layers (47%)
# [succeeded] Completed in 2m 34s
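
The output lines above follow a simple pattern: the kind in brackets, with the phase appended for `phase` signals, followed by the summary. A sketch of that formatting, assuming the CLI receives the same `kind`, `phase` and `summary` fields as the API response (the function and interface names are hypothetical):

```typescript
interface CliSignal {
  kind: string;
  phase?: string;
  summary: string;
}

// Render one signal as a CLI line such as "[phase:analyze] Scanning image layers (47%)".
function formatCliLine(signal: CliSignal): string {
  const tag =
    signal.kind === 'phase' && signal.phase ? `phase:${signal.phase}` : signal.kind;
  return `[${tag}] ${signal.summary}`;
}
```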

6) Caching Strategy

6.1 Cache Tiers

Tier	Storage	TTL	Use Case
L1	In-memory (per-instance)	1s	Hot path, same-instance requests
L2	Valkey/Redis	5s	Cross-instance, active jobs
L3	PostgreSQL	24h	Persistent snapshots, air-gap mode
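
A read through the tiers can be sketched as a walk from fastest to slowest, backfilling the faster tiers on a hit. In this illustration all three tiers are in-memory stand-ins with the TTLs from the table; a real deployment would back L2 with Valkey/Redis and L3 with PostgreSQL. The backfill behaviour and the class and function names are assumptions, not something this document specifies.

```typescript
interface TierEntry {
  value: string;
  expiresAtMs: number;
}

// In-memory stand-in for one cache tier with a fixed TTL.
class TtlTier {
  private entries: Record<string, TierEntry> = Object.create(null);

  constructor(public readonly label: string, private readonly ttlMs: number) {}

  get(key: string, nowMs: number): string | undefined {
    const entry = this.entries[key];
    return entry !== undefined && entry.expiresAtMs > nowMs ? entry.value : undefined;
  }

  set(key: string, value: string, nowMs: number): void {
    this.entries[key] = { value, expiresAtMs: nowMs + this.ttlMs };
  }
}

// Walk tiers in order; on a hit, copy the value into the faster tiers that missed.
function lookupSignal(
  tiers: TtlTier[],
  key: string,
  nowMs: number,
): { value: string; tier: string } | undefined {
  const missed: TtlTier[] = [];
  for (const tier of tiers) {
    const value = tier.get(key, nowMs);
    if (value !== undefined) {
      for (const faster of missed) faster.set(key, value, nowMs);
      return { value, tier: tier.label };
    }
    missed.push(tier);
  }
  return undefined; // cold path: the caller has to compute the signal
}
```

Passing the clock in as `nowMs` keeps the lookup deterministic, which fits the frozen-timestamp fixtures used in testing.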

6.2 Cache Keys

ttfs:job:{tenant_id}:{job_id}:signal      # Current signal
ttfs:job:{tenant_id}:{job_id}:eta         # ETA prediction
ttfs:run:{tenant_id}:{run_id}:signals     # Run-level aggregation
ttfs:tenant:{tenant_id}:failure_sig       # Failure signatures
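
Building these keys through small helpers keeps the layout in one place. The sketch below reproduces the four key shapes exactly. The guard that rejects identifiers containing `:` is an added assumption, meant to keep one tenant's identifier from producing a key that collides with another's; the helper names are hypothetical.

```typescript
// Hypothetical guard: ":" is the key separator, so identifiers must not contain it.
function keyPart(value: string): string {
  if (value.length === 0 || value.indexOf(':') !== -1) {
    throw new Error(`invalid cache key segment: "${value}"`);
  }
  return value;
}

const ttfsKeys = {
  // Current signal for a job
  jobSignal: (tenantId: string, jobId: string): string =>
    `ttfs:job:${keyPart(tenantId)}:${keyPart(jobId)}:signal`,
  // ETA prediction for a job
  jobEta: (tenantId: string, jobId: string): string =>
    `ttfs:job:${keyPart(tenantId)}:${keyPart(jobId)}:eta`,
  // Run-level aggregation
  runSignals: (tenantId: string, runId: string): string =>
    `ttfs:run:${keyPart(tenantId)}:${keyPart(runId)}:signals`,
  // Failure signatures for a tenant
  tenantFailureSignatures: (tenantId: string): string =>
    `ttfs:tenant:${keyPart(tenantId)}:failure_sig`,
};
```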

6.3 Air-Gap Mode

In air-gapped environments without Valkey/Redis:

PostgreSQL NOTIFY/LISTEN replaces pub/sub for real-time updates
Polling fallback with 2-second intervals
The first_signal_snapshots table (the PostgreSQL L3 tier) takes over the L2 role
All SSE endpoints gracefully degrade to long-polling

7) Telemetry & Observability

7.1 Metrics

Metric	Type	Description
`ttfs_latency_seconds`	Histogram	End-to-end signal latency
`ttfs_cache_latency_seconds`	Histogram	Cache lookup time
`ttfs_cold_latency_seconds`	Histogram	Cold path computation time
`ttfs_signal_total`	Counter	Signals by kind/surface
`ttfs_cache_hit_total`	Counter	Cache hits
`ttfs_cache_miss_total`	Counter	Cache misses
`ttfs_slo_breach_total`	Counter	SLO breaches
`ttfs_error_total`	Counter	Errors by type

7.2 Labels

All metrics include the following labels:

surface: ui | cli | ci
cache_hit: true | false
signal_source: snapshot | cold_start | failure_index
kind: Signal kind enum
tenant_id: Tenant identifier (for multi-tenant deployments)

7.3 SLO Definitions

# Prometheus recording rules
- record: ttfs:slo:p50_target
  expr: 2.0  # seconds

- record: ttfs:slo:p95_target
  expr: 5.0  # seconds

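- record: ttfs:slo:compliance
  # `bool` makes the comparison yield 1 (compliant) or 0 (breaching);
  # without it the series would simply disappear while the SLO is breached.
  expr: |
    histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
    < bool 5.0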

# Alerting rules
- alert: TtfsSloBreachP95
  expr: histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le)) > 5.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "TTFS P95 exceeds 5s SLO"

- alert: TtfsHighErrorRate
  expr: rate(ttfs_error_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
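
For readers unfamiliar with `histogram_quantile`, the estimate behind these rules can be approximated as linear interpolation inside the bucket that contains the requested rank. The sketch below is a simplified illustration of that idea, not Prometheus's implementation; it skips several of its edge cases. The bucket boundaries and counts in the usage note are hypothetical.

```typescript
interface LatencyBucket {
  le: number;              // upper bound in seconds; Infinity stands for +Inf
  cumulativeCount: number; // observations with latency <= le
}

// Estimate quantile q (0..1) from cumulative buckets by linear interpolation.
function histogramQuantile(q: number, buckets: LatencyBucket[]): number {
  if (q < 0 || q > 1) return NaN; // simplification: out-of-range q is not handled
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.le !== Infinity || last.cumulativeCount === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the requested rank.
  const rank = q * last.cumulativeCount;
  let index = 0;
  while (sorted[index].cumulativeCount < rank) index++;

  // A quantile landing in the +Inf bucket is reported as the highest finite bound.
  if (index === sorted.length - 1) return sorted.length > 1 ? sorted[index - 1].le : NaN;

  const lowerBound = index === 0 ? 0 : sorted[index - 1].le;
  const countBelow = index === 0 ? 0 : sorted[index - 1].cumulativeCount;
  const countInBucket = sorted[index].cumulativeCount - countBelow;
  if (countInBucket === 0) return lowerBound;
  return lowerBound + (sorted[index].le - lowerBound) * ((rank - countBelow) / countInBucket);
}
```

For example, with hypothetical cumulative counts of 50, 80, 96, 99 and 100 at bounds 1s, 2.5s, 5s, 10s and +Inf, the P95 rank of 95 falls in the 2.5s to 5s bucket, and the estimate is 2.5 + 2.5 × 15/16 ≈ 4.84s, just inside the 5-second target. The precision of the estimate depends on where the bucket boundaries sit, so boundaries near 2s and 5s matter most for these SLOs.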

8) Frontend Integration

8.1 Component Hierarchy

FirstSignalCard (Smart Component)
├── FirstSignalStore (Signal-based State)
│   ├── SSE subscription
│   ├── Polling fallback
│   └── ETag caching
├── StatusIndicator (Dumb Component)
│   └── kind → icon + color mapping
├── PhaseProgress (Dumb Component)
│   └── phase → progress bar
└── ActionButtons (Dumb Component)
    └── next_actions rendering

8.2 State Machine

type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// State transitions:
// idle → loading (initial fetch)
// loading → streaming (SSE connected) | error (fetch failed)
// streaming → done (terminal signal) | error (connection lost)
// error → loading (retry)
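
The transitions listed in the comments can be encoded as a lookup table, which makes illegal transitions explicit no-ops. This is a sketch rather than the store's actual implementation: the event names (`fetch`, `sse_connected`, and so on) are hypothetical labels for the triggers in the comments, and ignoring an event that is not valid in the current state is an assumed policy.

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// Hypothetical event names for the triggers described in the comments above.
type LoadEvent =
  | 'fetch' | 'sse_connected' | 'fetch_failed'
  | 'terminal_signal' | 'connection_lost' | 'retry';

const TRANSITIONS: Record<FirstSignalLoadState, Partial<Record<LoadEvent, FirstSignalLoadState>>> = {
  idle: { fetch: 'loading' },
  loading: { sse_connected: 'streaming', fetch_failed: 'error' },
  streaming: { terminal_signal: 'done', connection_lost: 'error' },
  error: { retry: 'loading' },
  done: {}, // terminal: no outgoing transitions
};

// Return the next state, or the current state if the event is not valid here.
function nextLoadState(current: FirstSignalLoadState, event: LoadEvent): FirstSignalLoadState {
  return TRANSITIONS[current][event] ?? current;
}
```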

8.3 Animation Tokens

Token	Value	Usage
`--motion-duration-quick`	150ms	Skeleton fade, icon transitions
`--motion-duration-normal`	250ms	Card expansion, phase transitions
`--motion-duration-slow`	400ms	Success/failure celebrations
`--motion-easing-standard`	cubic-bezier(0.4, 0, 0.2, 1)	Default easing
`--motion-easing-decelerate`	cubic-bezier(0, 0, 0.2, 1)	Entries
`--motion-easing-accelerate`	cubic-bezier(0.4, 0, 1, 1)	Exits

9) Failure Signatures

Failure signatures support a predictive "last known outcome" by matching the current job against patterns extracted from historical failures.

9.1 Signature Schema

{
  "signature_hash": "sha256:abc123...",
  "pattern": {
    "phase": "analyze",
    "error_code": "LAYER_EXTRACT_FAILED",
    "image_pattern": "registry.io/.*:v1.*"
  },
  "outcome": {
    "likely_cause": "Registry rate limiting",
    "mttr_p50_seconds": 300,
    "suggested_action": "Wait 5 minutes and retry"
  },
  "confidence": 0.87,
  "sample_count": 42
}

9.2 Usage

When a job enters a known failure pattern:

Match current job state against failure_signatures table
Enrich signal with last_known_outcome.likely_cause
Predict ETA based on historical MTTR
Suggest remediation via next_actions
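
The steps above can be sketched as a match followed by an enrichment. The code assumes that `image_pattern` is a regular expression meant to match the whole image reference, that `phase` and `error_code` must match exactly, and that low-confidence signatures are skipped below a caller-chosen threshold. None of these rules is stated in this document. The function names, the `FailedJobState` shape and the default threshold of 0.5 are hypothetical.

```typescript
interface FailureSignature {
  signature_hash: string;
  pattern: { phase: string; error_code: string; image_pattern: string };
  outcome: { likely_cause: string; mttr_p50_seconds: number; suggested_action: string };
  confidence: number;
  sample_count: number;
}

// Hypothetical view of the failing job used for matching.
interface FailedJobState {
  phase: string;
  error_code: string;
  image: string;
}

// Return the highest-confidence signature whose pattern matches the job, if any.
function matchFailureSignature(
  job: FailedJobState,
  signatures: FailureSignature[],
  minConfidence = 0.5,
): FailureSignature | undefined {
  let best: FailureSignature | undefined;
  for (const sig of signatures) {
    if (sig.confidence < minConfidence) continue;
    if (sig.pattern.phase !== job.phase || sig.pattern.error_code !== job.error_code) continue;
    // Assumption: image_pattern is a regex anchored to the full image reference.
    if (!new RegExp(`^(?:${sig.pattern.image_pattern})$`).test(job.image)) continue;
    if (best === undefined || sig.confidence > best.confidence) best = sig;
  }
  return best;
}

// Derive the enrichment fields: likely cause, ETA from historical MTTR, suggested action.
function enrichFromSignature(sig: FailureSignature) {
  return {
    likely_cause: sig.outcome.likely_cause,
    eta_seconds: sig.outcome.mttr_p50_seconds,
    next_action_label: sig.outcome.suggested_action,
  };
}
```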

10) Database Schema

See docs/db/schemas/ttfs.sql for the complete schema definition.

10.1 Core Tables

Table	Purpose
`scheduler.first_signal_snapshots`	Cached signal state per job
`scheduler.ttfs_events`	Telemetry event log
`scheduler.failure_signatures`	Historical failure patterns

10.2 Hourly Rollup View

The scheduler.ttfs_hourly_summary view provides pre-aggregated metrics for dashboard performance.

11) Testing Requirements

11.1 Unit Tests

Signal store state machine transitions
ETag generation and validation
Cache hit/miss scenarios
Failure signature matching

11.2 Integration Tests

End-to-end API latency measurement
SSE connection lifecycle
Air-gap mode fallback
Multi-tenant isolation

11.3 Deterministic Fixtures

// tests/fixtures/ttfs/
export const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};
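
One way such a frozen timestamp is typically used is through an injectable clock, so that time-dependent logic returns the same result on every test run. The sketch below shows the idea with a hypothetical `Clock` type and helper functions; it is not taken from the project's test suite.

```typescript
// Timestamp copied from the fixture above.
const FROZEN_TIMESTAMP = '2025-12-04T12:00:00.000Z';

// Hypothetical clock abstraction: returns the current time in Unix milliseconds.
type Clock = () => number;

// A clock that always returns the same instant.
function frozenClock(isoTimestamp: string): Clock {
  const fixedMs = Date.parse(isoTimestamp);
  return () => fixedMs;
}

// Example of time-dependent logic that accepts the clock as a parameter.
function signalAgeSeconds(updatedAtMs: number, clock: Clock): number {
  return Math.max(0, (clock() - updatedAtMs) / 1000);
}

const testClock = frozenClock(FROZEN_TIMESTAMP);
const ageSeconds = signalAgeSeconds(testClock() - 2500, testClock);
```

Production code would pass `Date.now` as the clock, while tests pass the frozen one.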

12) References

Advisory: docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md
Sprint 1 (Foundation): docs/implplan/SPRINT_0338_0001_0001_ttfs_foundation.md
Sprint 2 (API): docs/implplan/SPRINT_0339_0001_0001_first_signal_api.md
Sprint 3 (UI): docs/implplan/SPRINT_0340_0001_0001_first_signal_card_ui.md
Sprint 4 (Enhancements): docs/implplan/SPRINT_0341_0001_0001_ttfs_enhancements.md
TTE Architecture: docs/modules/telemetry/architecture.md
Telemetry Schema: docs/schemas/ttfs-event.schema.json
Database Schema: docs/db/schemas/ttfs.sql
