docs/modules/telemetry/ttfs-architecture.md
# Time-to-First-Signal (TTFS) Architecture

> Derived from Product Advisory (14-Dec-2025): UX and Time-to-Evidence Technical Reference; details the TTFS subsystem for providing immediate feedback on run/job status.

## 1) Overview

Time-to-First-Signal (TTFS) measures the latency from user action (opening a run, starting a scan, CLI invocation) to the first meaningful signal being displayed or logged. This architecture ensures users receive immediate feedback regardless of actual job completion time.

### 1.1 Design Goals

- **Instant Feedback:** P50 < 2s, P95 < 5s across all surfaces (UI, CLI, CI)
- **Graceful Degradation:** Skeleton → Cached Signal → Live Data progression
- **Offline-First:** Full functionality in air-gapped environments using PostgreSQL NOTIFY/LISTEN
- **Predictive Context:** Provide "last known outcome" and ETA estimates for in-progress jobs

### 1.2 Signal Flow

```
┌───────────────────────────────────────────────────────────────────────────┐
│                             TTFS Signal Flow                              │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  User Action           API Layer           Cache Layer   Data Layer       │
│  ───────────           ─────────           ───────────   ──────────       │
│                                                                           │
│  [Route Enter] ──┬──► /first-signal ───────► Valkey/Redis ─┐              │
│  [CLI Start]  ───┤          │                     │        │              │
│  [CI Job]     ───┘          │                     │        ▼              │
│                             │                     │ ┌──────────────┐      │
│                             ▼                     │ │ PostgreSQL   │      │
│                      ┌────────────┐               │ │ first_signal │      │
│                      │    ETag    │◄──────────────┤ │ _snapshots   │      │
│                      │ Validation │               │ └──────────────┘      │
│                      └────────────┘               │                       │
│                             │                     │                       │
│                             ▼                     ▼                       │
│                         ┌──────────────────────────────┐                  │
│                         │ Response Assembly            │                  │
│                         │ • kind (status indicator)    │                  │
│                         │ • phase (current stage)      │                  │
│                         │ • summary (human text)       │                  │
│                         │ • eta_seconds (estimate)     │                  │
│                         │ • last_known_outcome         │                  │
│                         │ • next_actions               │                  │
│                         └──────────────────────────────┘                  │
│                                        │                                  │
│                                        ▼                                  │
│                         ┌──────────────────────────────┐                  │
│                         │ SSE / Polling Client         │                  │
│                         └──────────────────────────────┘                  │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
```

## 2) Component Budgets

The 5-second P95 budget is allocated across components:

| Component | P50 Budget | P95 Budget | Notes |
|-----------|------------|------------|-------|
| Frontend (skeleton + hydration) | 100ms | 150ms | Network-independent |
| Edge API (auth + routing) | 150ms | 250ms | JWT validation, rate limiting |
| Core Services (lookup + assembly) | 700ms | 1,500ms | Cache hit vs cold path |
| SSE/WebSocket establishment | — | 300ms | Fallback to polling if exceeded |
| **Total (warm path)** | **700ms** | **2,500ms** | Cache hit scenario |
| **Total (cold path)** | **1,200ms** | **4,000ms** | Cache miss, compute required |

## 3) Signal Kinds

The `kind` field indicates the current signal state:

| Kind | Description | Typical Duration | Icon |
|------|-------------|------------------|------|
| `queued` | Job waiting in queue | 0-30s | Queue |
| `started` | Job has begun execution | — | Play |
| `phase` | Job in specific phase | Varies | Progress |
| `blocked` | Waiting on dependency/policy | — | Pause |
| `failed` | Job has failed | — | Error |
| `succeeded` | Job completed successfully | — | Check |
| `canceled` | Job was canceled | — | Cancel |
| `unavailable` | Signal cannot be determined | — | Unknown |

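
As a minimal sketch, the kinds in the table above could be modeled as a TypeScript union with an icon lookup. The names `SignalKind`, `KIND_ICONS`, `TERMINAL_KINDS`, and `isTerminal` are hypothetical, and treating only `failed`, `succeeded`, and `canceled` as terminal is an assumption; the table does not say whether `unavailable` ends a stream.

```typescript
// Signal kinds, copied from the table above.
type SignalKind =
  | 'queued' | 'started' | 'phase' | 'blocked'
  | 'failed' | 'succeeded' | 'canceled' | 'unavailable';

// Icon names as listed in the table's Icon column.
const KIND_ICONS: Record<SignalKind, string> = {
  queued: 'Queue',
  started: 'Play',
  phase: 'Progress',
  blocked: 'Pause',
  failed: 'Error',
  succeeded: 'Check',
  canceled: 'Cancel',
  unavailable: 'Unknown',
};

// Assumption: only these three kinds mean the job will emit no further signals.
const TERMINAL_KINDS: SignalKind[] = ['failed', 'succeeded', 'canceled'];

// True when the kind is one of the assumed terminal kinds.
function isTerminal(kind: SignalKind): boolean {
  return TERMINAL_KINDS.indexOf(kind) !== -1;
}
```

A client might use `isTerminal` to decide when to stop polling or close a stream.
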
## 4) Signal Phases

The `phase` field indicates the current execution phase:

| Phase | Description | SLO Target |
|-------|-------------|------------|
| `resolve` | Dependency/artifact resolution | P95 < 30s |
| `fetch` | Data retrieval (registry, advisories) | P95 < 45s |
| `restore` | Cache/snapshot restoration | P95 < 10s |
| `analyze` | Analysis execution (scan, policy) | P95 < 120s |
| `policy` | Policy evaluation | P95 < 15s |
| `report` | Report generation/upload | P95 < 30s |
| `unknown` | Phase cannot be determined | — |

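
The per-phase targets above can be kept in a lookup table so that a single observed duration can be compared against its phase's P95 target. This is a sketch only: `SignalPhase`, `PHASE_P95_TARGET_SECONDS`, and `exceedsPhaseTarget` are hypothetical names, and one slow observation does not by itself mean the P95 SLO is breached.

```typescript
// Phases, copied from the table above.
type SignalPhase =
  | 'resolve' | 'fetch' | 'restore' | 'analyze'
  | 'policy' | 'report' | 'unknown';

// P95 targets in seconds from the table; `unknown` has no target.
const PHASE_P95_TARGET_SECONDS: Record<SignalPhase, number | null> = {
  resolve: 30,
  fetch: 45,
  restore: 10,
  analyze: 120,
  policy: 15,
  report: 30,
  unknown: null,
};

// True when one observed duration is above the phase's P95 target.
// Phases without a target never report an overrun.
function exceedsPhaseTarget(phase: SignalPhase, elapsedSeconds: number): boolean {
  const target = PHASE_P95_TARGET_SECONDS[phase];
  return target !== null && elapsedSeconds > target;
}
```
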
## 5) API Contracts

### 5.1 First Signal Endpoint

```http
GET /api/v1/orchestrator/jobs/{jobId}/first-signal
Accept: application/json
If-None-Match: "{etag}"

200 OK
ETag: "job-{id}-{updated_at.unix_ms}"
Cache-Control: private, max-age=1, stale-while-revalidate=5
X-Signal-Source: snapshot | cold_start | failure_index

{
  "kind": "started",
  "phase": "analyze",
  "summary": "Scanning image layers (47%)",
  "eta_seconds": 38,
  "last_known_outcome": {
    "status": "succeeded",
    "finished_at": "2025-12-13T10:15:00Z",
    "findings_count": 12
  },
  "next_actions": [
    {"label": "View previous run", "href": "/runs/abc-123"}
  ],
  "diagnostics": {
    "queue_position": null,
    "worker_id": "worker-7"
  }
}

304 Not Modified (if ETag matches)
```

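
The conditional-request behaviour in the contract above might be sketched as a pure function that builds the ETag and chooses between 200 and 304. `JobSnapshot`, `SignalResponse`, `buildEtag`, and `respondFirstSignal` are hypothetical names, and the comparison assumes `If-None-Match` carries exactly one ETag (no lists, no `*`).

```typescript
// Hypothetical shape of the stored state for one job.
interface JobSnapshot {
  jobId: string;
  updatedAt: Date;
  signal: { kind: string; phase: string; summary: string };
}

interface SignalResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: JobSnapshot['signal'] | null;
}

// ETag format from the contract: "job-{id}-{updated_at.unix_ms}", quotes included.
function buildEtag(snapshot: JobSnapshot): string {
  return `"job-${snapshot.jobId}-${snapshot.updatedAt.getTime()}"`;
}

// Returns 304 with no body when the client's ETag is current, else 200 with the signal.
function respondFirstSignal(
  ifNoneMatch: string | null,
  snapshot: JobSnapshot,
): SignalResponse {
  const etag = buildEtag(snapshot);
  if (ifNoneMatch === etag) {
    return { status: 304, headers: { ETag: etag }, body: null };
  }
  return {
    status: 200,
    headers: {
      ETag: etag,
      'Cache-Control': 'private, max-age=1, stale-while-revalidate=5',
    },
    body: snapshot.signal,
  };
}
```

Because the ETag embeds `updated_at` in milliseconds, any update to the job changes the tag and the next conditional request receives a full 200 response.
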
### 5.2 SSE Stream

```http
GET /api/v1/orchestrator/stream/jobs/{jobId}/first-signal
Accept: text/event-stream

event: signal
data: {"kind":"started","phase":"analyze",...}

event: signal
data: {"kind":"phase","phase":"policy",...}

event: done
data: {"kind":"succeeded",...}
```

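
To illustrate how a client without a built-in `EventSource` might read the stream above, here is a small parser for complete SSE frames. It is a sketch under assumptions: `SseEvent` and `parseSseFrames` are hypothetical names, only the `event:` and `data:` fields are handled, and the input is assumed to contain whole frames rather than partial chunks.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Parses complete SSE frames; frames are separated by a blank line.
function parseSseFrames(text: string): SseEvent[] {
  const events: SseEvent[] = [];
  text.split(/\r?\n\r?\n/).forEach((frame) => {
    let event = 'message'; // default event type when no `event:` line is present
    const dataLines: string[] = [];
    frame.split(/\r?\n/).forEach((line) => {
      if (line.indexOf('event:') === 0) {
        event = line.slice(6).trim();
      } else if (line.indexOf('data:') === 0) {
        // Drop the single optional space after the colon.
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    });
    // Frames without any data line are skipped.
    if (dataLines.length > 0) {
      events.push({ event: event, data: dataLines.join('\n') });
    }
  });
  return events;
}
```

A consumer would `JSON.parse` each `data` payload and stop reading once an event named `done` arrives.
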
### 5.3 CLI Integration

```bash
# Job status with immediate signal
stella job status <job-id> --watch

# Output progression:
# [queued] Waiting in queue (position: 3)
# [started] Job started on worker-7
# [phase:analyze] Scanning image layers (47%)
# [succeeded] Completed in 2m 34s
```

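
The output progression above suggests a simple line format: the kind in brackets, with the phase appended when the kind is `phase`. The formatter below is only a guess at that rule, inferred from the sample output; `CliSignal` and `formatSignalLine` are hypothetical names and not part of the `stella` CLI.

```typescript
interface CliSignal {
  kind: string;
  phase?: string;
  summary: string;
}

// Renders one status line, e.g. "[phase:analyze] Scanning image layers (47%)".
function formatSignalLine(signal: CliSignal): string {
  const tag =
    signal.kind === 'phase' && signal.phase
      ? `phase:${signal.phase}`
      : signal.kind;
  return `[${tag}] ${signal.summary}`;
}
```
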
## 6) Caching Strategy

### 6.1 Cache Tiers

| Tier | Storage | TTL | Use Case |
|------|---------|-----|----------|
| L1 | In-memory (per-instance) | 1s | Hot path, same-instance requests |
| L2 | Valkey/Redis | 5s | Cross-instance, active jobs |
| L3 | PostgreSQL | 24h | Persistent snapshots, air-gap mode |

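
A tiered lookup over the three tiers above might look like the following sketch. In-memory objects stand in for Valkey/Redis and PostgreSQL, the clock is passed in so behaviour is deterministic, and copying a hit into the faster tiers (read-through promotion) is an assumption rather than something the table specifies. `Tier`, `tiers`, and `lookup` are hypothetical names.

```typescript
interface Entry {
  value: string;
  expiresAt: number; // milliseconds on the caller-supplied clock
}

// One cache tier with a fixed TTL; a plain object stands in for the real store.
class Tier {
  readonly name: string;
  private readonly ttlMs: number;
  private entries: { [key: string]: Entry } = Object.create(null);

  constructor(name: string, ttlMs: number) {
    this.name = name;
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number): string | null {
    const entry = this.entries[key];
    if (!entry || entry.expiresAt <= now) return null;
    return entry.value;
  }

  set(key: string, value: string, now: number): void {
    this.entries[key] = { value: value, expiresAt: now + this.ttlMs };
  }
}

// TTLs from the table: 1s, 5s, 24h.
const tiers: Tier[] = [
  new Tier('L1', 1000),
  new Tier('L2', 5000),
  new Tier('L3', 24 * 60 * 60 * 1000),
];

// Checks tiers fastest-first; on a hit, copies the value into the faster tiers.
function lookup(key: string, now: number): { value: string; tier: string } | null {
  for (let i = 0; i < tiers.length; i++) {
    const value = tiers[i].get(key, now);
    if (value !== null) {
      for (let j = 0; j < i; j++) tiers[j].set(key, value, now);
      return { value: value, tier: tiers[i].name };
    }
  }
  return null; // full miss: the caller falls through to the cold path
}
```

With these TTLs, a value found only in L3 is served from L1 for the next second and from L2 for the next five seconds before PostgreSQL is consulted again.
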
### 6.2 Cache Keys

```
ttfs:job:{tenant_id}:{job_id}:signal      # Current signal
ttfs:job:{tenant_id}:{job_id}:eta         # ETA prediction
ttfs:run:{tenant_id}:{run_id}:signals     # Run-level aggregation
ttfs:tenant:{tenant_id}:failure_sig       # Failure signatures
```

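
Small builder functions keep the key layout above in one place and make tenant scoping hard to forget. The function names here are hypothetical; only the key formats come from the listing.

```typescript
// Current signal for one job.
function jobSignalKey(tenantId: string, jobId: string): string {
  return `ttfs:job:${tenantId}:${jobId}:signal`;
}

// ETA prediction for one job.
function jobEtaKey(tenantId: string, jobId: string): string {
  return `ttfs:job:${tenantId}:${jobId}:eta`;
}

// Run-level aggregation of signals.
function runSignalsKey(tenantId: string, runId: string): string {
  return `ttfs:run:${tenantId}:${runId}:signals`;
}

// Failure signatures for one tenant.
function tenantFailureSigKey(tenantId: string): string {
  return `ttfs:tenant:${tenantId}:failure_sig`;
}
```

Every key carries the tenant identifier directly after the scope, so keys for different tenants never collide.
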
### 6.3 Air-Gap Mode

In air-gapped environments without Valkey/Redis:

1. **PostgreSQL NOTIFY/LISTEN** replaces pub/sub for real-time updates
2. **Polling fallback** with 2-second intervals
3. **first_signal_snapshots** table takes over the L2 cache role (it is the L3 tier when Valkey/Redis is available)
4. All SSE endpoints gracefully degrade to long-polling

## 7) Telemetry & Observability

### 7.1 Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `ttfs_latency_seconds` | Histogram | End-to-end signal latency |
| `ttfs_cache_latency_seconds` | Histogram | Cache lookup time |
| `ttfs_cold_latency_seconds` | Histogram | Cold path computation time |
| `ttfs_signal_total` | Counter | Signals by kind/surface |
| `ttfs_cache_hit_total` | Counter | Cache hits |
| `ttfs_cache_miss_total` | Counter | Cache misses |
| `ttfs_slo_breach_total` | Counter | SLO breaches |
| `ttfs_error_total` | Counter | Errors by type |

### 7.2 Labels

All metrics include the following labels:

- `surface`: `ui` | `cli` | `ci`
- `cache_hit`: `true` | `false`
- `signal_source`: `snapshot` | `cold_start` | `failure_index`
- `kind`: Signal kind enum
- `tenant_id`: Tenant identifier (for multi-tenant deployments)

### 7.3 SLO Definitions

```yaml
# Prometheus recording rules
- record: ttfs:slo:p50_target
  expr: 2.0  # seconds

- record: ttfs:slo:p95_target
  expr: 5.0  # seconds

- record: ttfs:slo:compliance
  expr: |
    histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
    < 5.0

# Alerting rules
- alert: TtfsSloBreachP95
  expr: histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le)) > 5.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "TTFS P95 exceeds 5s SLO"

- alert: TtfsHighErrorRate
  expr: rate(ttfs_error_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
```

## 8) Frontend Integration

### 8.1 Component Hierarchy

```
FirstSignalCard (Smart Component)
├── FirstSignalStore (Signal-based State)
│   ├── SSE subscription
│   ├── Polling fallback
│   └── ETag caching
├── StatusIndicator (Dumb Component)
│   └── kind → icon + color mapping
├── PhaseProgress (Dumb Component)
│   └── phase → progress bar
└── ActionButtons (Dumb Component)
    └── next_actions rendering
```

### 8.2 State Machine

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// State transitions:
// idle → loading (initial fetch)
// loading → streaming (SSE connected) | error (fetch failed)
// streaming → done (terminal signal) | error (connection lost)
// error → loading (retry)
```

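
The transitions listed in the comments above can be encoded as a lookup table so that illegal moves are impossible by construction. This is a sketch: the event names (`fetch`, `connected`, `failed`, `terminal`, `retry`) are hypothetical labels for the triggers in the comments, and ignoring an event that has no transition from the current state is an assumption.

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// Hypothetical event names for the triggers described in the comments.
type LoadEvent = 'fetch' | 'connected' | 'failed' | 'terminal' | 'retry';

// Allowed transitions, one row per state; `done` has no outgoing transitions.
const TRANSITIONS: Record<
  FirstSignalLoadState,
  Partial<Record<LoadEvent, FirstSignalLoadState>>
> = {
  idle: { fetch: 'loading' },
  loading: { connected: 'streaming', failed: 'error' },
  streaming: { terminal: 'done', failed: 'error' },
  error: { retry: 'loading' },
  done: {},
};

// Returns the next state, or the current state when the event does not apply.
function transition(
  state: FirstSignalLoadState,
  event: LoadEvent,
): FirstSignalLoadState {
  const next = TRANSITIONS[state][event];
  return next !== undefined ? next : state;
}
```

Keeping the table separate from the store makes the unit test for "signal store state machine transitions" a matter of walking the rows.
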
### 8.3 Animation Tokens

| Token | Value | Usage |
|-------|-------|-------|
| `--motion-duration-quick` | 150ms | Skeleton fade, icon transitions |
| `--motion-duration-normal` | 250ms | Card expansion, phase transitions |
| `--motion-duration-slow` | 400ms | Success/failure celebrations |
| `--motion-easing-standard` | cubic-bezier(0.4, 0, 0.2, 1) | Default easing |
| `--motion-easing-decelerate` | cubic-bezier(0, 0, 0.2, 1) | Entries |
| `--motion-easing-accelerate` | cubic-bezier(0.4, 0, 1, 1) | Exits |

## 9) Failure Signatures

Failure signatures enable predictive "last known outcome" by pattern-matching historical failures.

### 9.1 Signature Schema

```json
{
  "signature_hash": "sha256:abc123...",
  "pattern": {
    "phase": "analyze",
    "error_code": "LAYER_EXTRACT_FAILED",
    "image_pattern": "registry.io/.*:v1.*"
  },
  "outcome": {
    "likely_cause": "Registry rate limiting",
    "mttr_p50_seconds": 300,
    "suggested_action": "Wait 5 minutes and retry"
  },
  "confidence": 0.87,
  "sample_count": 42
}
```

### 9.2 Usage

When a job enters a known failure pattern:

1. **Match** current job state against `failure_signatures` table
2. **Enrich** signal with `last_known_outcome.likely_cause`
3. **Predict** ETA based on historical MTTR
4. **Suggest** remediation via `next_actions`

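
The four steps above can be sketched as a single matching function over the signature schema from 9.1. Several points are assumptions: `image_pattern` is treated as a regular expression that must match the whole image reference, the highest-confidence match wins, and the P50 MTTR is used directly as the ETA. `JobFailureState`, `matchSignature`, and `enrich` are hypothetical names.

```typescript
interface FailureSignature {
  signature_hash: string;
  pattern: { phase: string; error_code: string; image_pattern: string };
  outcome: { likely_cause: string; mttr_p50_seconds: number; suggested_action: string };
  confidence: number;
  sample_count: number;
}

// Hypothetical view of the failing job used for matching.
interface JobFailureState {
  phase: string;
  errorCode: string;
  image: string;
}

// Step 1: find the highest-confidence signature whose pattern fits the job.
function matchSignature(
  job: JobFailureState,
  signatures: FailureSignature[],
): FailureSignature | null {
  let best: FailureSignature | null = null;
  for (let i = 0; i < signatures.length; i++) {
    const sig = signatures[i];
    // Assumption: image_pattern is a regex anchored to the full image reference.
    const imageRe = new RegExp('^(?:' + sig.pattern.image_pattern + ')$');
    const fits =
      sig.pattern.phase === job.phase &&
      sig.pattern.error_code === job.errorCode &&
      imageRe.test(job.image);
    if (fits && (best === null || sig.confidence > best.confidence)) {
      best = sig;
    }
  }
  return best;
}

// Steps 2-4: derive the likely cause, an ETA from historical MTTR, and a suggestion.
function enrich(sig: FailureSignature) {
  return {
    likely_cause: sig.outcome.likely_cause,
    eta_seconds: sig.outcome.mttr_p50_seconds,
    suggested_action: sig.outcome.suggested_action,
  };
}
```

A caller would attach the result of `enrich` to the outgoing signal only when `matchSignature` returns a non-null signature.
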
## 10) Database Schema

See `docs/db/schemas/ttfs.sql` for the complete schema definition.

### 10.1 Core Tables

| Table | Purpose |
|-------|---------|
| `scheduler.first_signal_snapshots` | Cached signal state per job |
| `scheduler.ttfs_events` | Telemetry event log |
| `scheduler.failure_signatures` | Historical failure patterns |

### 10.2 Hourly Rollup View

The `scheduler.ttfs_hourly_summary` view provides pre-aggregated metrics for dashboard performance.

## 11) Testing Requirements

### 11.1 Unit Tests

- Signal store state machine transitions
- ETag generation and validation
- Cache hit/miss scenarios
- Failure signature matching

### 11.2 Integration Tests

- End-to-end API latency measurement
- SSE connection lifecycle
- Air-gap mode fallback
- Multi-tenant isolation

### 11.3 Deterministic Fixtures

```typescript
// tests/fixtures/ttfs/
export const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};
```

## 12) References

- Advisory: `docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md`
- Sprint 1 (Foundation): `docs/implplan/SPRINT_0338_0001_0001_ttfs_foundation.md`
- Sprint 2 (API): `docs/implplan/SPRINT_0339_0001_0001_first_signal_api.md`
- Sprint 3 (UI): `docs/implplan/SPRINT_0340_0001_0001_first_signal_card_ui.md`
- Sprint 4 (Enhancements): `docs/implplan/SPRINT_0341_0001_0001_ttfs_enhancements.md`
- TTE Architecture: `docs/modules/telemetry/architecture.md`
- Telemetry Schema: `docs/schemas/ttfs-event.schema.json`
- Database Schema: `docs/db/schemas/ttfs.sql`