# Time-to-First-Signal (TTFS) Architecture
> Derived from Product Advisory (14-Dec-2025): UX and Time-to-Evidence Technical Reference; details the TTFS subsystem for providing immediate feedback on run/job status.
## 1) Overview
Time-to-First-Signal (TTFS) measures the latency from user action (opening a run, starting a scan, CLI invocation) to the first meaningful signal being displayed or logged. This architecture ensures users receive immediate feedback regardless of actual job completion time.
### 1.1 Design Goals
- **Instant Feedback:** P50 < 2s, P95 < 5s across all surfaces (UI, CLI, CI)
- **Graceful Degradation:** Skeleton → Cached Signal → Live Data progression
- **Offline-First:** Full functionality in air-gapped environments using PostgreSQL NOTIFY/LISTEN
- **Predictive Context:** Provide "last known outcome" and ETA estimates for in-progress jobs
### 1.2 Signal Flow
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ TTFS Signal Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ User Action API Layer Cache Layer Data Layer │
│ ─────────── ───────── ─────────── ────────── │
│ │
│ [Route Enter] ──┬──► /first-signal ───────► Valkey/Redis ─┐ │
│ [CLI Start] ───┤ │ │ │ │
│ [CI Job] ───┘ │ │ ▼ │
│ │ │ ┌──────────────┐ │
│ ▼ │ │ PostgreSQL │ │
│ ┌──────────┐ │ │ first_signal │ │
│ │ ETag │◄────────────────┤ │ _snapshots │ │
│ │ Validation│ │ └──────────────┘ │
│ └──────────┘ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Response Assembly │ │
│ │ • kind (status indicator) │ │
│ │ • phase (current stage) │ │
│ │ • summary (human text) │ │
│ │ • eta_seconds (estimate) │ │
│ │ • last_known_outcome │ │
│ │ • next_actions │ │
│ └──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ SSE / Polling Client │ │
│ └──────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## 2) Component Budgets
The 5-second P95 budget is allocated across components:
| Component | P50 Budget | P95 Budget | Notes |
|-----------|------------|------------|-------|
| Frontend (skeleton + hydration) | 100ms | 150ms | Network-independent |
| Edge API (auth + routing) | 150ms | 250ms | JWT validation, rate limiting |
| Core Services (lookup + assembly) | 700ms | 1,500ms | Cache hit vs cold path |
| SSE/WebSocket establishment | | 300ms | Fallback to polling if exceeded |
| **Total (warm path)** | **700ms** | **2,500ms** | Cache hit scenario |
| **Total (cold path)** | **1,200ms** | **4,000ms** | Cache miss, compute required |
## 3) Signal Kinds
The `kind` field indicates the current signal state:
| Kind | Description | Typical Duration | Icon |
|------|-------------|------------------|------|
| `queued` | Job waiting in queue | 0-30s | Queue |
| `started` | Job has begun execution | | Play |
| `phase` | Job in specific phase | Varies | Progress |
| `blocked` | Waiting on dependency/policy | | Pause |
| `failed` | Job has failed | | Error |
| `succeeded` | Job completed successfully | | Check |
| `canceled` | Job was canceled | | Cancel |
| `unavailable` | Signal cannot be determined | | Unknown |
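
The kinds in this table map naturally onto a string union. The following is a minimal sketch, not the project's actual type definitions: the names `SIGNAL_KINDS`, `isSignalKind`, and `isTerminalKind` are hypothetical, and treating `failed`, `succeeded`, and `canceled` as terminal is an assumption inferred from their descriptions.

```typescript
// Signal kinds, copied from the table above.
const SIGNAL_KINDS = [
  'queued', 'started', 'phase', 'blocked',
  'failed', 'succeeded', 'canceled', 'unavailable',
] as const;

type SignalKind = (typeof SIGNAL_KINDS)[number];

// Assumption: these three kinds are terminal (no further signals follow).
const TERMINAL_KINDS: ReadonlySet<SignalKind> = new Set<SignalKind>([
  'failed', 'succeeded', 'canceled',
]);

// Type guard for untrusted input such as a parsed JSON payload.
function isSignalKind(value: unknown): value is SignalKind {
  return typeof value === 'string' &&
    (SIGNAL_KINDS as readonly string[]).indexOf(value) !== -1;
}

// True when no further signal is expected for the job.
function isTerminalKind(kind: SignalKind): boolean {
  return TERMINAL_KINDS.has(kind);
}
```

A client could use `isTerminalKind` to decide when to stop polling or close a stream.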
## 4) Signal Phases
The `phase` field indicates the current execution phase:
| Phase | Description | SLO Target |
|-------|-------------|------------|
| `resolve` | Dependency/artifact resolution | P95 < 30s |
| `fetch` | Data retrieval (registry, advisories) | P95 < 45s |
| `restore` | Cache/snapshot restoration | P95 < 10s |
| `analyze` | Analysis execution (scan, policy) | P95 < 120s |
| `policy` | Policy evaluation | P95 < 15s |
| `report` | Report generation/upload | P95 < 30s |
| `unknown` | Phase cannot be determined | |
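
To illustrate how the per-phase targets might be checked against observed durations, here is a hedged sketch. The target values come from the table; the helper names and the nearest-rank percentile method are assumptions, and a production system would more likely read these values from histogram metrics.

```typescript
type SignalPhase =
  | 'resolve' | 'fetch' | 'restore' | 'analyze' | 'policy' | 'report' | 'unknown';

// P95 targets in seconds, from the table above; `unknown` has no target.
const PHASE_P95_TARGET_SECONDS: Record<SignalPhase, number | null> = {
  resolve: 30,
  fetch: 45,
  restore: 10,
  analyze: 120,
  policy: 15,
  report: 30,
  unknown: null,
};

// Nearest-rank P95 over observed durations in seconds (hypothetical helper).
function nearestRankP95(durations: number[]): number {
  if (durations.length === 0) throw new Error('no samples');
  const sorted = durations.slice().sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length);
  return sorted[rank - 1];
}

// Returns null when the phase has no SLO target, otherwise whether P95 < target.
function phaseMeetsSlo(phase: SignalPhase, durations: number[]): boolean | null {
  const target = PHASE_P95_TARGET_SECONDS[phase];
  if (target === null) return null;
  return nearestRankP95(durations) < target;
}
```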
## 5) API Contracts
### 5.1 First Signal Endpoint
```http
GET /api/v1/orchestrator/jobs/{jobId}/first-signal
Accept: application/json
If-None-Match: "{etag}"

200 OK
ETag: "job-{id}-{updated_at.unix_ms}"
Cache-Control: private, max-age=1, stale-while-revalidate=5
X-Signal-Source: snapshot | cold_start | failure_index

{
  "kind": "started",
  "phase": "analyze",
  "summary": "Scanning image layers (47%)",
  "eta_seconds": 38,
  "last_known_outcome": {
    "status": "succeeded",
    "finished_at": "2025-12-13T10:15:00Z",
    "findings_count": 12
  },
  "next_actions": [
    {"label": "View previous run", "href": "/runs/abc-123"}
  ],
  "diagnostics": {
    "queue_position": null,
    "worker_id": "worker-7"
  }
}

304 Not Modified (if ETag matches)
```
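
The ETag handshake in this contract can be sketched as two small functions. This is an illustration under assumptions, not the service's actual code: `buildFirstSignalEtag` and `isNotModified` are hypothetical names, and the comparison follows the weak comparison that HTTP specifies for `If-None-Match`.

```typescript
// Builds the ETag in the documented format: "job-{id}-{updated_at.unix_ms}".
function buildFirstSignalEtag(jobId: string, updatedAt: Date): string {
  return `"job-${jobId}-${updatedAt.getTime()}"`;
}

// Returns true when the If-None-Match header matches the current ETag,
// in which case the server can answer 304 Not Modified without a body.
function isNotModified(ifNoneMatch: string | undefined, currentEtag: string): boolean {
  if (!ifNoneMatch) return false;
  if (ifNoneMatch.trim() === '*') return true;
  // The header may carry a comma-separated list. Weak validators (W/ prefix)
  // compare equal under the weak comparison used by If-None-Match.
  const strip = (tag: string): string => tag.trim().replace(/^W\//, '');
  const current = strip(currentEtag);
  return ifNoneMatch.split(',').some((tag) => strip(tag) === current);
}
```

Because the ETag embeds the snapshot's update time in milliseconds, any change to the snapshot yields a new validator and the client receives a fresh `200 OK` body.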
### 5.2 SSE Stream
```http
GET /api/v1/orchestrator/stream/jobs/{jobId}/first-signal
Accept: text/event-stream

event: signal
data: {"kind":"started","phase":"analyze",...}

event: signal
data: {"kind":"phase","phase":"policy",...}

event: done
data: {"kind":"succeeded",...}
```
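
For clients that cannot rely on a built-in SSE implementation, the stream format above can be parsed by hand. The sketch below is deliberately simplified and makes assumptions: it handles only the `event:` and `data:` fields and expects each chunk to end on an event boundary. The names `SseEvent` and `parseSseChunk` are hypothetical.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Parses a complete SSE text chunk into events. Events are separated by a
// blank line; multiple data lines within one event are joined with "\n".
function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let eventName = 'message'; // default event type when no `event:` field is sent
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        eventName = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event: eventName, data: dataLines.join('\n') });
    }
  }
  return events;
}
```

A consumer would typically `JSON.parse` each `data` payload and stop reading once a `done` event arrives.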
### 5.3 CLI Integration
```bash
# Job status with immediate signal
stella job status <job-id> --watch
# Output progression:
# [queued] Waiting in queue (position: 3)
# [started] Job started on worker-7
# [phase:analyze] Scanning image layers (47%)
# [succeeded] Completed in 2m 34s
```
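
The bracketed prefixes in the output progression suggest a simple formatting rule. The following sketch shows one way it might be expressed; `CliSignal` and `formatSignalLine` are hypothetical names, and the actual CLI may format lines differently.

```typescript
interface CliSignal {
  kind: string;
  phase?: string;
  summary: string;
}

// Mirrors the output progression above: phase signals render as
// `[phase:<name>]`, every other kind renders as `[<kind>]`.
function formatSignalLine(signal: CliSignal): string {
  const tag = signal.kind === 'phase' && signal.phase
    ? `phase:${signal.phase}`
    : signal.kind;
  return `[${tag}] ${signal.summary}`;
}
```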
## 6) Caching Strategy
### 6.1 Cache Tiers
| Tier | Storage | TTL | Use Case |
|------|---------|-----|----------|
| L1 | In-memory (per-instance) | 1s | Hot path, same-instance requests |
| L2 | Valkey/Redis | 5s | Cross-instance, active jobs |
| L3 | PostgreSQL | 24h | Persistent snapshots, air-gap mode |
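
A read-through lookup across these tiers could look like the sketch below. It is an illustration only: every tier is modeled as an in-memory map so the example stays self-contained, whereas real L2 and L3 tiers would call Valkey/Redis and PostgreSQL. The `CacheTier` interface and both function names are assumptions.

```typescript
interface CacheTier<V> {
  name: string;
  ttlMs: number;
  get(key: string, nowMs: number): V | undefined;
  set(key: string, value: V, nowMs: number): void;
}

// In-memory stand-in for any tier, with per-entry expiry based on the tier TTL.
function createMemoryTier<V>(name: string, ttlMs: number): CacheTier<V> {
  const entries = new Map<string, { value: V; expiresAt: number }>();
  return {
    name,
    ttlMs,
    get(key, nowMs) {
      const hit = entries.get(key);
      return hit && hit.expiresAt > nowMs ? hit.value : undefined;
    },
    set(key, value, nowMs) {
      entries.set(key, { value, expiresAt: nowMs + ttlMs });
    },
  };
}

// Reads tiers in order (fastest first) and back-fills faster tiers on a hit.
function readThrough<V>(
  tiers: CacheTier<V>[],
  key: string,
  nowMs: number,
): { value: V; tier: string } | undefined {
  for (let i = 0; i < tiers.length; i++) {
    const value = tiers[i].get(key, nowMs);
    if (value !== undefined) {
      for (let j = 0; j < i; j++) tiers[j].set(key, value, nowMs);
      return { value, tier: tiers[i].name };
    }
  }
  return undefined; // cold path: the caller must compute the signal
}
```

Passing the clock in as `nowMs` keeps expiry deterministic in tests, which fits the frozen-timestamp approach used by the fixtures in section 11.3.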
### 6.2 Cache Keys
```
ttfs:job:{tenant_id}:{job_id}:signal # Current signal
ttfs:job:{tenant_id}:{job_id}:eta # ETA prediction
ttfs:run:{tenant_id}:{run_id}:signals # Run-level aggregation
ttfs:tenant:{tenant_id}:failure_sig # Failure signatures
```
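
Centralizing key construction helps keep the tenant prefix consistent across callers. The builder below is a sketch: the key formats come from the list above, while `ttfsKeys`, `keyPart`, and the rule that rejects colons inside identifiers are assumptions added for illustration.

```typescript
// Rejects empty parts and parts containing the ":" separator, so one tenant's
// identifiers cannot be crafted to collide with another key namespace.
function keyPart(value: string): string {
  if (value.length === 0 || value.indexOf(':') !== -1) {
    throw new Error(`invalid cache key part: "${value}"`);
  }
  return value;
}

const ttfsKeys = {
  jobSignal: (tenantId: string, jobId: string): string =>
    `ttfs:job:${keyPart(tenantId)}:${keyPart(jobId)}:signal`,
  jobEta: (tenantId: string, jobId: string): string =>
    `ttfs:job:${keyPart(tenantId)}:${keyPart(jobId)}:eta`,
  runSignals: (tenantId: string, runId: string): string =>
    `ttfs:run:${keyPart(tenantId)}:${keyPart(runId)}:signals`,
  tenantFailureSig: (tenantId: string): string =>
    `ttfs:tenant:${keyPart(tenantId)}:failure_sig`,
};
```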
### 6.3 Air-Gap Mode
In air-gapped environments without Valkey/Redis:
1. **PostgreSQL NOTIFY/LISTEN** replaces pub/sub for real-time updates
2. **Polling fallback** with 2-second intervals
3. **first_signal_snapshots** table (the L3 tier in 6.1) takes over the role of the L2 cache
4. All SSE endpoints gracefully degrade to long-polling
## 7) Telemetry & Observability
### 7.1 Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `ttfs_latency_seconds` | Histogram | End-to-end signal latency |
| `ttfs_cache_latency_seconds` | Histogram | Cache lookup time |
| `ttfs_cold_latency_seconds` | Histogram | Cold path computation time |
| `ttfs_signal_total` | Counter | Signals by kind/surface |
| `ttfs_cache_hit_total` | Counter | Cache hits |
| `ttfs_cache_miss_total` | Counter | Cache misses |
| `ttfs_slo_breach_total` | Counter | SLO breaches |
| `ttfs_error_total` | Counter | Errors by type |
### 7.2 Labels
All metrics include the following labels:
- `surface`: `ui` | `cli` | `ci`
- `cache_hit`: `true` | `false`
- `signal_source`: `snapshot` | `cold_start` | `failure_index`
- `kind`: Signal kind enum
- `tenant_id`: Tenant identifier (for multi-tenant deployments)
### 7.3 SLO Definitions
```yaml
# Prometheus recording rules
- record: ttfs:slo:p50_target
  expr: 2.0  # seconds
- record: ttfs:slo:p95_target
  expr: 5.0  # seconds
- record: ttfs:slo:compliance
  expr: |
    histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
    < 5.0

# Alerting rules
- alert: TtfsSloBreachP95
  expr: histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le)) > 5.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "TTFS P95 exceeds 5s SLO"
- alert: TtfsHighErrorRate
  expr: rate(ttfs_error_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
```
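
To make the `histogram_quantile` expressions above more concrete, the sketch below re-implements the core of that calculation: linear interpolation within the bucket that contains the requested rank. It is a simplified approximation for illustration, not Prometheus's exact implementation, and the names `Bucket`, `histogramQuantile`, and `meetsTtfsP95Slo` are hypothetical. It assumes 0 < q < 1 and at least two buckets.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number;
  count: number;
}

// Buckets must be sorted by `le` ascending and end with le = Infinity.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const idx = buckets.findIndex((b) => b.count >= rank);
  // A rank that falls in the +Inf bucket is reported as the highest finite bound.
  if (idx === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lowerBound = idx === 0 ? 0 : buckets[idx - 1].le;
  const countBelow = idx === 0 ? 0 : buckets[idx - 1].count;
  const inBucket = buckets[idx].count - countBelow;
  return lowerBound + (buckets[idx].le - lowerBound) * ((rank - countBelow) / inBucket);
}

// Mirrors the ttfs:slo:compliance rule: P95 must stay below the target.
function meetsTtfsP95Slo(buckets: Bucket[], targetSeconds: number = 5.0): boolean {
  return histogramQuantile(0.95, buckets) < targetSeconds;
}
```

One consequence worth noting: the estimate can never exceed the highest finite bucket bound, so the `ttfs_latency_seconds` buckets need a boundary above 5s for the P95 alert to be able to fire.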
## 8) Frontend Integration
### 8.1 Component Hierarchy
```
FirstSignalCard (Smart Component)
├── FirstSignalStore (Signal-based State)
│   ├── SSE subscription
│   ├── Polling fallback
│   └── ETag caching
├── StatusIndicator (Dumb Component)
│   └── kind → icon + color mapping
├── PhaseProgress (Dumb Component)
│   └── phase → progress bar
└── ActionButtons (Dumb Component)
    └── next_actions rendering
```
### 8.2 State Machine
```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';
// State transitions:
// idle → loading (initial fetch)
// loading → streaming (SSE connected) | error (fetch failed)
// streaming → done (terminal signal) | error (connection lost)
// error → loading (retry)
```
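
The transitions listed in the comments can be encoded as a lookup table so that invalid transitions are ignored rather than silently applied. This is a minimal sketch: the event names (`fetch`, `sse_connected`, and so on) are assumptions chosen to match the comments, and the real store may model events differently.

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// Hypothetical event names, one per transition trigger in the comments above.
type LoadEvent =
  | 'fetch' | 'sse_connected' | 'fetch_failed'
  | 'terminal_signal' | 'connection_lost' | 'retry';

const TRANSITIONS: Record<FirstSignalLoadState, Partial<Record<LoadEvent, FirstSignalLoadState>>> = {
  idle: { fetch: 'loading' },
  loading: { sse_connected: 'streaming', fetch_failed: 'error' },
  streaming: { terminal_signal: 'done', connection_lost: 'error' },
  error: { retry: 'loading' },
  done: {}, // terminal: no outgoing transitions
};

// Returns the next state, or the current state when the event is not valid here.
function nextLoadState(state: FirstSignalLoadState, event: LoadEvent): FirstSignalLoadState {
  return TRANSITIONS[state][event] ?? state;
}
```

Keeping the table as data makes the unit tests in section 11.1 straightforward: each row is one assertion.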
### 8.3 Animation Tokens
| Token | Value | Usage |
|-------|-------|-------|
| `--motion-duration-quick` | 150ms | Skeleton fade, icon transitions |
| `--motion-duration-normal` | 250ms | Card expansion, phase transitions |
| `--motion-duration-slow` | 400ms | Success/failure celebrations |
| `--motion-easing-standard` | cubic-bezier(0.4, 0, 0.2, 1) | Default easing |
| `--motion-easing-decelerate` | cubic-bezier(0, 0, 0.2, 1) | Entries |
| `--motion-easing-accelerate` | cubic-bezier(0.4, 0, 1, 1) | Exits |
## 9) Failure Signatures
Failure signatures enable a predictive "last known outcome" by pattern-matching the current job against historical failures.
### 9.1 Signature Schema
```json
{
  "signature_hash": "sha256:abc123...",
  "pattern": {
    "phase": "analyze",
    "error_code": "LAYER_EXTRACT_FAILED",
    "image_pattern": "registry.io/.*:v1.*"
  },
  "outcome": {
    "likely_cause": "Registry rate limiting",
    "mttr_p50_seconds": 300,
    "suggested_action": "Wait 5 minutes and retry"
  },
  "confidence": 0.87,
  "sample_count": 42
}
```
### 9.2 Usage
When a job enters a known failure pattern:
1. **Match** current job state against `failure_signatures` table
2. **Enrich** signal with `last_known_outcome.likely_cause`
3. **Predict** ETA based on historical MTTR
4. **Suggest** remediation via `next_actions`
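
The four steps above can be sketched as a match function followed by an enrichment function. This is an illustration under several assumptions: `image_pattern` is treated as a regular expression anchored at both ends, the `minConfidence` threshold is hypothetical, ties are resolved by highest confidence, and the shape of the enrichment result is simplified relative to the API contract in section 5.1.

```typescript
interface FailureSignature {
  signature_hash: string;
  pattern: { phase: string; error_code: string; image_pattern: string };
  outcome: { likely_cause: string; mttr_p50_seconds: number; suggested_action: string };
  confidence: number;
  sample_count: number;
}

// Hypothetical view of the failing job's state used for matching.
interface JobFailureState {
  phase: string;
  errorCode: string;
  image: string;
}

// Step 1 (Match): pick the most confident signature that fits the job.
function matchSignature(
  job: JobFailureState,
  signatures: FailureSignature[],
  minConfidence: number = 0.5,
): FailureSignature | undefined {
  return signatures
    .filter((s) =>
      s.confidence >= minConfidence &&
      s.pattern.phase === job.phase &&
      s.pattern.error_code === job.errorCode &&
      new RegExp(`^(?:${s.pattern.image_pattern})$`).test(job.image))
    .sort((a, b) => b.confidence - a.confidence)[0];
}

// Steps 2-4 (Enrich, Predict, Suggest): derive signal fields from the match.
function enrichFromSignature(signature: FailureSignature) {
  return {
    last_known_outcome: { likely_cause: signature.outcome.likely_cause },
    eta_seconds: signature.outcome.mttr_p50_seconds,
    next_actions: [{ label: signature.outcome.suggested_action }],
  };
}
```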
## 10) Database Schema
See `docs/db/schemas/ttfs.sql` for the complete schema definition.
### 10.1 Core Tables
| Table | Purpose |
|-------|---------|
| `scheduler.first_signal_snapshots` | Cached signal state per job |
| `scheduler.ttfs_events` | Telemetry event log |
| `scheduler.failure_signatures` | Historical failure patterns |
### 10.2 Hourly Rollup View
The `scheduler.ttfs_hourly_summary` view provides pre-aggregated metrics for dashboard performance.
## 11) Testing Requirements
### 11.1 Unit Tests
- Signal store state machine transitions
- ETag generation and validation
- Cache hit/miss scenarios
- Failure signature matching
### 11.2 Integration Tests
- End-to-end API latency measurement
- SSE connection lifecycle
- Air-gap mode fallback
- Multi-tenant isolation
### 11.3 Deterministic Fixtures
```typescript
// tests/fixtures/ttfs/
export const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};
```
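
One way such fixtures might be consumed is through a seeded random source and a frozen clock, so that test runs are reproducible. The sketch below uses mulberry32, a small and widely used 32-bit PRNG; that choice, and the helper names `createSeededRandom` and `frozenNow`, are assumptions and not something the fixture file prescribes. The two constants repeat the fixture values so the example is self-contained.

```typescript
// Values repeated from TTFS_FIXTURES to keep this example self-contained.
const FIXTURE_SEED = 0x5EED2025;
const FIXTURE_FROZEN_TIMESTAMP = '2025-12-04T12:00:00.000Z';

// mulberry32: returns a function producing deterministic floats in [0, 1).
function createSeededRandom(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6D2B79F5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// A clock that always returns the frozen fixture instant.
function frozenNow(): Date {
  return new Date(FIXTURE_FROZEN_TIMESTAMP);
}
```

Two generators created from the same seed yield identical sequences, which lets latency jitter or sampled durations in tests be replayed exactly.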
## 12) References
- Advisory: `docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md`
- Sprint 1 (Foundation): `docs/implplan/SPRINT_0338_0001_0001_ttfs_foundation.md`
- Sprint 2 (API): `docs/implplan/SPRINT_0339_0001_0001_first_signal_api.md`
- Sprint 3 (UI): `docs/implplan/SPRINT_0340_0001_0001_first_signal_card_ui.md`
- Sprint 4 (Enhancements): `docs/implplan/SPRINT_0341_0001_0001_ttfs_enhancements.md`
- TTE Architecture: `docs/modules/telemetry/architecture.md`
- Telemetry Schema: `docs/schemas/ttfs-event.schema.json`
- Database Schema: `docs/db/schemas/ttfs.sql`