# Time-to-First-Signal (TTFS) Architecture

> Derived from Product Advisory (14-Dec-2025): UX and Time-to-Evidence Technical Reference. Details the TTFS subsystem for providing immediate feedback on run/job status.

## 1) Overview

Time-to-First-Signal (TTFS) measures the latency from a user action (opening a run, starting a scan, invoking the CLI) to the first meaningful signal being displayed or logged. This architecture ensures users receive immediate feedback regardless of actual job completion time.

### 1.1 Design Goals

- **Instant Feedback:** P50 < 2s, P95 < 5s across all surfaces (UI, CLI, CI)
- **Graceful Degradation:** Skeleton → Cached Signal → Live Data progression
- **Offline-First:** Full functionality in air-gapped environments using PostgreSQL NOTIFY/LISTEN
- **Predictive Context:** Provide "last known outcome" and ETA estimates for in-progress jobs

### 1.2 Signal Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              TTFS Signal Flow                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  User Action          API Layer              Cache Layer        Data Layer  │
│  ───────────          ─────────              ───────────        ──────────  │
│                                                                             │
│  [Route Enter] ──┬──► /first-signal ───────► Valkey/Redis ─┐                │
│  [CLI Start]  ───┤          │                     │        │                │
│  [CI Job]     ───┘          │                     │        ▼                │
│                             │                     │ ┌──────────────┐        │
│                             ▼                     │ │ PostgreSQL   │        │
│                        ┌──────────┐               │ │ first_signal │        │
│                        │   ETag   │◄──────────────┤ │ _snapshots   │        │
│                        │Validation│               │ └──────────────┘        │
│                        └──────────┘               │                         │
│                             │                     │                         │
│                             ▼                     ▼                         │
│                        ┌──────────────────────────────┐                     │
│                        │      Response Assembly       │                     │
│                        │ • kind (status indicator)    │                     │
│                        │ • phase (current stage)      │                     │
│                        │ • summary (human text)       │                     │
│                        │ • eta_seconds (estimate)     │                     │
│                        │ • last_known_outcome         │                     │
│                        │ • next_actions               │                     │
│                        └──────────────────────────────┘                     │
│                                       │                                     │
│                                       ▼                                     │
│                        ┌──────────────────────────────┐                     │
│                        │     SSE / Polling Client     │                     │
│                        └──────────────────────────────┘                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## 2) Component Budgets

The 5-second P95 budget is allocated across components:

| Component | P50 Budget | P95 Budget | Notes |
|-----------|------------|------------|-------|
| Frontend (skeleton + hydration) | 100ms | 150ms | Network-independent |
| Edge API (auth + routing) | 150ms | 250ms | JWT validation, rate limiting |
| Core Services (lookup + assembly) | 700ms | 1,500ms | Cache hit vs cold path |
| SSE/WebSocket establishment | — | 300ms | Fallback to polling if exceeded |
| **Total (warm path)** | **700ms** | **2,500ms** | Cache hit scenario |
| **Total (cold path)** | **1,200ms** | **4,000ms** | Cache miss, compute required |

## 3) Signal Kinds

The `kind` field indicates the current signal state:

| Kind | Description | Typical Duration | Icon |
|------|-------------|------------------|------|
| `queued` | Job waiting in queue | 0-30s | Queue |
| `started` | Job has begun execution | — | Play |
| `phase` | Job in specific phase | Varies | Progress |
| `blocked` | Waiting on dependency/policy | — | Pause |
| `failed` | Job has failed | — | Error |
| `succeeded` | Job completed successfully | — | Check |
| `canceled` | Job was canceled | — | Cancel |
| `unavailable` | Signal cannot be determined | — | Unknown |
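
On the client side, the kinds in the table above could be modelled as a string union. The sketch below is illustrative only: the names `SignalKind`, `TERMINAL_KINDS`, `isTerminalKind`, and `KIND_ICONS` are assumptions, and so is the choice to treat `failed`, `succeeded`, and `canceled` as the terminal kinds.

```typescript
// Hypothetical client-side model of the `kind` field; names are illustrative.
type SignalKind =
  | 'queued' | 'started' | 'phase' | 'blocked'
  | 'failed' | 'succeeded' | 'canceled' | 'unavailable';

// Assumption: after one of these kinds, no further signals are expected.
const TERMINAL_KINDS: ReadonlySet<SignalKind> = new Set<SignalKind>([
  'failed', 'succeeded', 'canceled',
]);

function isTerminalKind(kind: SignalKind): boolean {
  return TERMINAL_KINDS.has(kind);
}

// Icon names copied from the Signal Kinds table.
const KIND_ICONS: Record<SignalKind, string> = {
  queued: 'Queue', started: 'Play', phase: 'Progress', blocked: 'Pause',
  failed: 'Error', succeeded: 'Check', canceled: 'Cancel', unavailable: 'Unknown',
};
```

Using `Record<SignalKind, string>` means the compiler rejects the icon map if a kind is added to the union without a matching icon.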

## 4) Signal Phases

The `phase` field indicates the current execution phase:

| Phase | Description | SLO Target |
|-------|-------------|------------|
| `resolve` | Dependency/artifact resolution | P95 < 30s |
| `fetch` | Data retrieval (registry, advisories) | P95 < 45s |
| `restore` | Cache/snapshot restoration | P95 < 10s |
| `analyze` | Analysis execution (scan, policy) | P95 < 120s |
| `policy` | Policy evaluation | P95 < 15s |
| `report` | Report generation/upload | P95 < 30s |
| `unknown` | Phase cannot be determined | — |
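
To make the per-phase targets concrete, the sketch below checks a sample of phase durations against the P95 values from the table. It is a rough illustration under stated assumptions: `SignalPhase`, `PHASE_P95_TARGET_SECONDS`, `percentile`, and `meetsPhaseSlo` are hypothetical names, and the nearest-rank percentile is one of several possible definitions, not necessarily the one the telemetry pipeline uses.

```typescript
type SignalPhase =
  | 'resolve' | 'fetch' | 'restore' | 'analyze' | 'policy' | 'report' | 'unknown';

// P95 targets in seconds, copied from the table; `unknown` has no target.
const PHASE_P95_TARGET_SECONDS: Record<SignalPhase, number | null> = {
  resolve: 30, fetch: 45, restore: 10, analyze: 120, policy: 15, report: 30,
  unknown: null,
};

// Nearest-rank percentile over a non-empty sample (assumed definition).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Returns null when the phase has no target or there is no data to judge.
function meetsPhaseSlo(phase: SignalPhase, durationsSeconds: number[]): boolean | null {
  const target = PHASE_P95_TARGET_SECONDS[phase];
  if (target === null || durationsSeconds.length === 0) return null;
  return percentile(durationsSeconds, 95) < target;
}
```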

## 5) API Contracts

### 5.1 First Signal Endpoint

```http
GET /api/v1/orchestrator/jobs/{jobId}/first-signal
Accept: application/json
If-None-Match: "{etag}"

200 OK
ETag: "job-{id}-{updated_at.unix_ms}"
Cache-Control: private, max-age=1, stale-while-revalidate=5
X-Signal-Source: snapshot | cold_start | failure_index

{
  "kind": "started",
  "phase": "analyze",
  "summary": "Scanning image layers (47%)",
  "eta_seconds": 38,
  "last_known_outcome": {
    "status": "succeeded",
    "finished_at": "2025-12-13T10:15:00Z",
    "findings_count": 12
  },
  "next_actions": [
    {"label": "View previous run", "href": "/runs/abc-123"}
  ],
  "diagnostics": {
    "queue_position": null,
    "worker_id": "worker-7"
  }
}

304 Not Modified (if ETag matches)
```
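
The ETag handling in the exchange above can be sketched as two small functions. This is a minimal illustration, not the service's implementation: `buildFirstSignalEtag` and `evaluateConditionalGet` are hypothetical names, and the comparison is deliberately simplified to exact equality against a single `If-None-Match` value (no lists, no `*`, no weak validators).

```typescript
// Builds an ETag in the documented shape: "job-{id}-{updated_at.unix_ms}".
function buildFirstSignalEtag(jobId: string, updatedAt: Date): string {
  return `"job-${jobId}-${updatedAt.getTime()}"`;
}

interface ConditionalResult {
  status: 200 | 304;
  etag: string;
}

// Simplified assumption: exact match against one If-None-Match value.
function evaluateConditionalGet(
  ifNoneMatch: string | undefined,
  jobId: string,
  updatedAt: Date,
): ConditionalResult {
  const etag = buildFirstSignalEtag(jobId, updatedAt);
  return { status: ifNoneMatch === etag ? 304 : 200, etag };
}
```

Because the ETag embeds the snapshot's update time in milliseconds, any change to the signal produces a new validator and the next conditional request falls through to a full `200` response.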

### 5.2 SSE Stream

```http
GET /api/v1/orchestrator/stream/jobs/{jobId}/first-signal
Accept: text/event-stream

event: signal
data: {"kind":"started","phase":"analyze",...}

event: signal
data: {"kind":"phase","phase":"policy",...}

event: done
data: {"kind":"succeeded",...}
```
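
For clients that cannot use the browser's `EventSource` (the CLI, for instance), the frames above can be parsed by hand. The sketch below is a simplified illustration: `SseEvent` and `parseSseFrames` are hypothetical names, only the `event:` and `data:` fields are handled, and the input is assumed to contain complete frames.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Parses complete SSE frames from a text chunk.
// Simplified: ignores comments, `id:` and `retry:` fields.
function parseSseFrames(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  // Frames are separated by a blank line.
  for (const frame of chunk.split(/\r?\n\r?\n/)) {
    let event = 'message'; // default event type when no `event:` field is sent
    const dataLines: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        event = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        // Strip the single optional space after the colon.
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join('\n') });
    }
  }
  return events;
}
```

A caller would typically `JSON.parse` each `data` payload and stop reading once an event named `done` arrives.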

### 5.3 CLI Integration

```bash
# Job status with immediate signal
stella job status <job-id> --watch

# Output progression:
# [queued] Waiting in queue (position: 3)
# [started] Job started on worker-7
# [phase:analyze] Scanning image layers (47%)
# [succeeded] Completed in 2m 34s
```
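
The bracketed prefix in the output above follows a simple rule: the kind alone, or `phase:<name>` when the kind is `phase`. A possible formatter, with `CliSignal` and `formatSignalLine` as hypothetical names rather than the CLI's actual code:

```typescript
interface CliSignal {
  kind: string;
  phase?: string;
  summary: string;
}

// Renders one --watch line: "[kind] summary" or "[phase:<name>] summary".
function formatSignalLine(signal: CliSignal): string {
  const tag =
    signal.kind === 'phase' && signal.phase
      ? `phase:${signal.phase}`
      : signal.kind;
  return `[${tag}] ${signal.summary}`;
}
```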

## 6) Caching Strategy

### 6.1 Cache Tiers

| Tier | Storage | TTL | Use Case |
|------|---------|-----|----------|
| L1 | In-memory (per-instance) | 1s | Hot path, same-instance requests |
| L2 | Valkey/Redis | 5s | Cross-instance, active jobs |
| L3 | PostgreSQL | 24h | Persistent snapshots, air-gap mode |
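
The read path through the three tiers can be sketched as an ordered lookup that back-fills faster tiers on a hit. Everything here is an assumption made for illustration: in-memory `Map`s stand in for Valkey/Redis and PostgreSQL, `Tier`, `makeTier`, and `tieredGet` are hypothetical names, and back-filling with a fresh timestamp is one possible policy rather than the documented one.

```typescript
interface Tier<V> {
  name: string;
  ttlMs: number;
  entries: Map<string, { value: V; storedAt: number }>;
}

function makeTier<V>(name: string, ttlMs: number): Tier<V> {
  return { name, ttlMs, entries: new Map() };
}

// Reads tiers in order; on a hit, copies the value into the faster tiers
// that missed. `now` is injected so the behaviour is deterministic in tests.
function tieredGet<V>(
  tiers: Tier<V>[],
  key: string,
  now: number,
): { value: V; tier: string } | undefined {
  for (let i = 0; i < tiers.length; i++) {
    const hit = tiers[i].entries.get(key);
    if (hit && now - hit.storedAt < tiers[i].ttlMs) {
      for (let j = 0; j < i; j++) {
        tiers[j].entries.set(key, { value: hit.value, storedAt: now });
      }
      return { value: hit.value, tier: tiers[i].name };
    }
  }
  return undefined; // full miss: the caller takes the cold path
}

// TTLs from the table: 1s, 5s, 24h.
const exampleTiers = [
  makeTier<string>('L1', 1_000),
  makeTier<string>('L2', 5_000),
  makeTier<string>('L3', 24 * 60 * 60 * 1_000),
];
```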

### 6.2 Cache Keys

```
ttfs:job:{tenant_id}:{job_id}:signal     # Current signal
ttfs:job:{tenant_id}:{job_id}:eta        # ETA prediction
ttfs:run:{tenant_id}:{run_id}:signals    # Run-level aggregation
ttfs:tenant:{tenant_id}:failure_sig      # Failure signatures
```
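
Keeping key construction in one place avoids drift between writers and readers. A small sketch of builders for the four key shapes above; the `ttfsKeys` object and its member names are hypothetical:

```typescript
// Builders for the documented key layout. Every key is tenant-scoped.
const ttfsKeys = {
  jobSignal: (tenantId: string, jobId: string): string =>
    `ttfs:job:${tenantId}:${jobId}:signal`,
  jobEta: (tenantId: string, jobId: string): string =>
    `ttfs:job:${tenantId}:${jobId}:eta`,
  runSignals: (tenantId: string, runId: string): string =>
    `ttfs:run:${tenantId}:${runId}:signals`,
  tenantFailureSignatures: (tenantId: string): string =>
    `ttfs:tenant:${tenantId}:failure_sig`,
};
```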

### 6.3 Air-Gap Mode

In air-gapped environments without Valkey/Redis:

1. **PostgreSQL NOTIFY/LISTEN** replaces pub/sub for real-time updates
2. **Polling fallback** with 2-second intervals
3. **first_signal_snapshots** table serves as L2 cache
4. All SSE endpoints gracefully degrade to long-polling

## 7) Telemetry & Observability

### 7.1 Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `ttfs_latency_seconds` | Histogram | End-to-end signal latency |
| `ttfs_cache_latency_seconds` | Histogram | Cache lookup time |
| `ttfs_cold_latency_seconds` | Histogram | Cold path computation time |
| `ttfs_signal_total` | Counter | Signals by kind/surface |
| `ttfs_cache_hit_total` | Counter | Cache hits |
| `ttfs_cache_miss_total` | Counter | Cache misses |
| `ttfs_slo_breach_total` | Counter | SLO breaches |
| `ttfs_error_total` | Counter | Errors by type |

### 7.2 Labels

All metrics include the following labels:

- `surface`: `ui` | `cli` | `ci`
- `cache_hit`: `true` | `false`
- `signal_source`: `snapshot` | `cold_start` | `failure_index`
- `kind`: Signal kind enum
- `tenant_id`: Tenant identifier (for multi-tenant deployments)

### 7.3 SLO Definitions

```yaml
# Prometheus recording rules
- record: ttfs:slo:p50_target
  expr: 2.0  # seconds

- record: ttfs:slo:p95_target
  expr: 5.0  # seconds

- record: ttfs:slo:compliance
  expr: |
    histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
    < 5.0

# Alerting rules
- alert: TtfsSloBreachP95
  expr: histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le)) > 5.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "TTFS P95 exceeds 5s SLO"

- alert: TtfsHighErrorRate
  expr: rate(ttfs_error_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
```

## 8) Frontend Integration

### 8.1 Component Hierarchy

```
FirstSignalCard (Smart Component)
├── FirstSignalStore (Signal-based State)
│   ├── SSE subscription
│   ├── Polling fallback
│   └── ETag caching
├── StatusIndicator (Dumb Component)
│   └── kind → icon + color mapping
├── PhaseProgress (Dumb Component)
│   └── phase → progress bar
└── ActionButtons (Dumb Component)
    └── next_actions rendering
```

### 8.2 State Machine

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// State transitions:
// idle → loading (initial fetch)
// loading → streaming (SSE connected) | error (fetch failed)
// streaming → done (terminal signal) | error (connection lost)
// error → loading (retry)
```
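
The transitions listed above can be encoded as a lookup table, which keeps the store's reducer trivial and makes illegal transitions explicit. This is a sketch under assumptions: the event names in `LoadEvent` and the `TRANSITIONS` and `nextState` identifiers are hypothetical, and ignoring an event that has no listed transition is a design choice made here, not one the document prescribes.

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// Hypothetical event names, one per arrow in the transition list.
type LoadEvent =
  | 'fetch' | 'sseConnected' | 'fetchFailed'
  | 'terminalSignal' | 'connectionLost' | 'retry';

const TRANSITIONS: Record<
  FirstSignalLoadState,
  Partial<Record<LoadEvent, FirstSignalLoadState>>
> = {
  idle: { fetch: 'loading' },
  loading: { sseConnected: 'streaming', fetchFailed: 'error' },
  streaming: { terminalSignal: 'done', connectionLost: 'error' },
  error: { retry: 'loading' },
  done: {}, // terminal: no outgoing transitions
};

// Assumption: events with no listed transition leave the state unchanged.
function nextState(state: FirstSignalLoadState, event: LoadEvent): FirstSignalLoadState {
  return TRANSITIONS[state][event] ?? state;
}
```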

### 8.3 Animation Tokens

| Token | Value | Usage |
|-------|-------|-------|
| `--motion-duration-quick` | 150ms | Skeleton fade, icon transitions |
| `--motion-duration-normal` | 250ms | Card expansion, phase transitions |
| `--motion-duration-slow` | 400ms | Success/failure celebrations |
| `--motion-easing-standard` | cubic-bezier(0.4, 0, 0.2, 1) | Default easing |
| `--motion-easing-decelerate` | cubic-bezier(0, 0, 0.2, 1) | Entries |
| `--motion-easing-accelerate` | cubic-bezier(0.4, 0, 1, 1) | Exits |

## 9) Failure Signatures

Failure signatures enable predictive "last known outcome" by pattern-matching historical failures.

### 9.1 Signature Schema

```json
{
  "signature_hash": "sha256:abc123...",
  "pattern": {
    "phase": "analyze",
    "error_code": "LAYER_EXTRACT_FAILED",
    "image_pattern": "registry.io/.*:v1.*"
  },
  "outcome": {
    "likely_cause": "Registry rate limiting",
    "mttr_p50_seconds": 300,
    "suggested_action": "Wait 5 minutes and retry"
  },
  "confidence": 0.87,
  "sample_count": 42
}
```

### 9.2 Usage

When a job enters a known failure pattern:

1. **Match** current job state against `failure_signatures` table
2. **Enrich** signal with `last_known_outcome.likely_cause`
3. **Predict** ETA based on historical MTTR
4. **Suggest** remediation via `next_actions`
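
The four steps above can be sketched end to end against the signature schema from 9.1. Several choices below are assumptions made for illustration: `image_pattern` is treated as a regular expression anchored to the whole image reference, the highest-confidence match wins, the ETA is taken directly from `mttr_p50_seconds`, and the names `JobFailureState`, `matchSignature`, and `enrichFromSignature` are hypothetical.

```typescript
interface FailureSignature {
  signature_hash: string;
  pattern: { phase: string; error_code: string; image_pattern: string };
  outcome: { likely_cause: string; mttr_p50_seconds: number; suggested_action: string };
  confidence: number;
  sample_count: number;
}

interface JobFailureState {
  phase: string;
  error_code: string;
  image: string;
}

// Step 1: match. Returns the highest-confidence matching signature, if any.
function matchSignature(
  job: JobFailureState,
  signatures: FailureSignature[],
): FailureSignature | undefined {
  return signatures
    .filter(
      (s) =>
        s.pattern.phase === job.phase &&
        s.pattern.error_code === job.error_code &&
        new RegExp(`^(?:${s.pattern.image_pattern})$`).test(job.image),
    )
    .sort((a, b) => b.confidence - a.confidence)[0];
}

// Steps 2-4: enrich, predict, suggest.
function enrichFromSignature(job: JobFailureState, signatures: FailureSignature[]) {
  const sig = matchSignature(job, signatures);
  if (!sig) return undefined;
  return {
    likely_cause: sig.outcome.likely_cause,
    eta_seconds: sig.outcome.mttr_p50_seconds,
    next_actions: [{ label: sig.outcome.suggested_action }],
  };
}
```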

## 10) Database Schema

See `docs/db/schemas/ttfs.sql` for the complete schema definition.

### 10.1 Core Tables

| Table | Purpose |
|-------|---------|
| `scheduler.first_signal_snapshots` | Cached signal state per job |
| `scheduler.ttfs_events` | Telemetry event log |
| `scheduler.failure_signatures` | Historical failure patterns |

### 10.2 Hourly Rollup View

The `scheduler.ttfs_hourly_summary` view provides pre-aggregated metrics for dashboard performance.

## 11) Testing Requirements

### 11.1 Unit Tests

- Signal store state machine transitions
- ETag generation and validation
- Cache hit/miss scenarios
- Failure signature matching

### 11.2 Integration Tests

- End-to-end API latency measurement
- SSE connection lifecycle
- Air-gap mode fallback
- Multi-tenant isolation

### 11.3 Deterministic Fixtures

```typescript
// tests/fixtures/ttfs/
export const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};
```

## 12) References

- Advisory: `docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md`
- Sprint 1 (Foundation): `docs/implplan/SPRINT_0338_0001_0001_ttfs_foundation.md`
- Sprint 2 (API): `docs/implplan/SPRINT_0339_0001_0001_first_signal_api.md`
- Sprint 3 (UI): `docs/implplan/SPRINT_0340_0001_0001_first_signal_card_ui.md`
- Sprint 4 (Enhancements): `docs/implplan/SPRINT_0341_0001_0001_ttfs_enhancements.md`
- TTE Architecture: `docs/modules/telemetry/architecture.md`
- Telemetry Schema: `docs/schemas/ttfs-event.schema.json`
- Database Schema: `docs/db/schemas/ttfs.sql`