Time-to-First-Signal (TTFS) Architecture

Derived from the Product Advisory "UX and Time-to-Evidence Technical Reference" (14-Dec-2025). This document details the TTFS subsystem, which provides immediate feedback on run and job status.

1) Overview

Time-to-First-Signal (TTFS) measures the latency from user action (opening a run, starting a scan, CLI invocation) to the first meaningful signal being displayed or logged. This architecture ensures users receive immediate feedback regardless of actual job completion time.

1.1 Design Goals

  • Instant Feedback: P50 < 2s, P95 < 5s across all surfaces (UI, CLI, CI)
  • Graceful Degradation: Skeleton → Cached Signal → Live Data progression
  • Offline-First: Full functionality in air-gapped environments using PostgreSQL NOTIFY/LISTEN
  • Predictive Context: Provide "last known outcome" and ETA estimates for in-progress jobs

1.2 Signal Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                           TTFS Signal Flow                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   User Action          API Layer              Cache Layer      Data Layer   │
│   ───────────          ─────────              ───────────      ──────────   │
│                                                                             │
│   [Route Enter] ──┬──► /first-signal ───────► Valkey/Redis ─┐              │
│   [CLI Start]  ───┤        │                      │         │              │
│   [CI Job]     ───┘        │                      │         ▼              │
│                            │                      │    ┌──────────────┐    │
│                            ▼                      │    │ PostgreSQL   │    │
│                      ┌──────────┐                 │    │ first_signal │    │
│                      │ ETag     │◄────────────────┤    │ _snapshots   │    │
│                      │ Validation│                │    └──────────────┘    │
│                      └──────────┘                 │                        │
│                            │                      │                        │
│                            ▼                      ▼                        │
│                      ┌──────────────────────────────┐                      │
│                      │     Response Assembly        │                      │
│                      │ • kind (status indicator)    │                      │
│                      │ • phase (current stage)      │                      │
│                      │ • summary (human text)       │                      │
│                      │ • eta_seconds (estimate)     │                      │
│                      │ • last_known_outcome         │                      │
│                      │ • next_actions               │                      │
│                      └──────────────────────────────┘                      │
│                                    │                                        │
│                                    ▼                                        │
│                      ┌──────────────────────────────┐                      │
│                      │     SSE / Polling Client     │                      │
│                      └──────────────────────────────┘                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

2) Component Budgets

The 5-second P95 budget is allocated across components:

| Component | P50 Budget | P95 Budget | Notes |
|---|---|---|---|
| Frontend (skeleton + hydration) | 100ms | 150ms | Network-independent |
| Edge API (auth + routing) | 150ms | 250ms | JWT validation, rate limiting |
| Core Services (lookup + assembly) | 700ms | 1,500ms | Cache hit vs cold path |
| SSE/WebSocket establishment | - | 300ms | Fallback to polling if exceeded |
| Total (warm path) | 700ms | 2,500ms | Cache hit scenario |
| Total (cold path) | 1,200ms | 4,000ms | Cache miss, compute required |
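
As an illustration of how these budgets could be checked in a test or a trace analyzer, the sketch below compares measured per-component latencies against the P95 column. It covers only the three request-path components; the names `P95_BUDGET_MS` and `overBudget` are hypothetical and not part of the TTFS codebase.

```typescript
// P95 budgets in milliseconds, copied from the component budget table.
const P95_BUDGET_MS = {
  frontend: 150,
  edgeApi: 250,
  coreServices: 1500,
} as const;

type BudgetedComponent = keyof typeof P95_BUDGET_MS;

// Returns the components whose measured latency exceeds their P95 budget.
function overBudget(measuredMs: Record<BudgetedComponent, number>): BudgetedComponent[] {
  return (Object.keys(P95_BUDGET_MS) as BudgetedComponent[]).filter(
    (component) => measuredMs[component] > P95_BUDGET_MS[component],
  );
}

// Example: only the edge API is over its 250ms budget here.
const offenders = overBudget({ frontend: 120, edgeApi: 300, coreServices: 900 });
```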

3) Signal Kinds

The kind field indicates the current signal state:

| Kind | Description | Typical Duration | Icon |
|---|---|---|---|
| queued | Job waiting in queue | 0-30s | Queue |
| started | Job has begun execution | - | Play |
| phase | Job in specific phase | Varies | Progress |
| blocked | Waiting on dependency/policy | - | Pause |
| failed | Job has failed | - | Error |
| succeeded | Job completed successfully | - | Check |
| canceled | Job was canceled | - | Cancel |
| unavailable | Signal cannot be determined | - | Unknown |
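
A client typically needs the kind as a closed type plus a way to tell when to stop listening. The following is a minimal sketch; treating `failed`, `succeeded`, and `canceled` as the terminal kinds is an assumption inferred from their descriptions, and the type and helper names are illustrative.

```typescript
// The eight signal kinds from the table.
type SignalKind =
  | 'queued' | 'started' | 'phase' | 'blocked'
  | 'failed' | 'succeeded' | 'canceled' | 'unavailable';

// Icon names from the table, keyed by kind.
const KIND_ICON: Record<SignalKind, string> = {
  queued: 'Queue',
  started: 'Play',
  phase: 'Progress',
  blocked: 'Pause',
  failed: 'Error',
  succeeded: 'Check',
  canceled: 'Cancel',
  unavailable: 'Unknown',
};

// Assumption: these three kinds end a job's signal stream.
const TERMINAL_KINDS: ReadonlySet<SignalKind> = new Set<SignalKind>([
  'failed', 'succeeded', 'canceled',
]);

function isTerminal(kind: SignalKind): boolean {
  return TERMINAL_KINDS.has(kind);
}
```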

4) Signal Phases

The phase field indicates the current execution phase:

| Phase | Description | SLO Target |
|---|---|---|
| resolve | Dependency/artifact resolution | P95 < 30s |
| fetch | Data retrieval (registry, advisories) | P95 < 45s |
| restore | Cache/snapshot restoration | P95 < 10s |
| analyze | Analysis execution (scan, policy) | P95 < 120s |
| policy | Policy evaluation | P95 < 15s |
| report | Report generation/upload | P95 < 30s |
| unknown | Phase cannot be determined | - |
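
The per-phase targets can be expressed as a lookup so that an observed P95 is checked against the right threshold. This is a sketch under the assumption that `unknown` carries no target; `PHASE_P95_TARGET_S` and `breachesPhaseSlo` are hypothetical names.

```typescript
type SignalPhase =
  | 'resolve' | 'fetch' | 'restore' | 'analyze' | 'policy' | 'report' | 'unknown';

// P95 targets in seconds from the table; null means no target is defined.
const PHASE_P95_TARGET_S: Record<SignalPhase, number | null> = {
  resolve: 30,
  fetch: 45,
  restore: 10,
  analyze: 120,
  policy: 15,
  report: 30,
  unknown: null,
};

// The targets are strict ("P95 < 30s"), so an observed P95 equal to the
// target already counts as a breach.
function breachesPhaseSlo(phase: SignalPhase, observedP95Seconds: number): boolean {
  const target = PHASE_P95_TARGET_S[phase];
  return target !== null && observedP95Seconds >= target;
}
```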

5) API Contracts

5.1 First Signal Endpoint

GET /api/v1/orchestrator/jobs/{jobId}/first-signal
Accept: application/json
If-None-Match: "{etag}"

200 OK
ETag: "job-{id}-{updated_at.unix_ms}"
Cache-Control: private, max-age=1, stale-while-revalidate=5
X-Signal-Source: snapshot | cold_start | failure_index

{
  "kind": "started",
  "phase": "analyze",
  "summary": "Scanning image layers (47%)",
  "eta_seconds": 38,
  "last_known_outcome": {
    "status": "succeeded",
    "finished_at": "2025-12-13T10:15:00Z",
    "findings_count": 12
  },
  "next_actions": [
    {"label": "View previous run", "href": "/runs/abc-123"}
  ],
  "diagnostics": {
    "queue_position": null,
    "worker_id": "worker-7"
  }
}

304 Not Modified (if ETag matches)
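
The conditional-request behavior above can be sketched as a pure function. The ETag format follows the response header shown in the contract; the function names, and the handling of `*` and weak validators in `If-None-Match`, are assumptions rather than the orchestrator's actual code.

```typescript
// Builds the quoted ETag "job-{id}-{updated_at.unix_ms}".
function buildEtag(jobId: string, updatedAt: Date): string {
  return `"job-${jobId}-${updatedAt.getTime()}"`;
}

interface ConditionalResult {
  status: 200 | 304;
  etag: string;
}

// Decides between 200 and 304 for a GET carrying an optional If-None-Match.
function evaluateConditional(
  ifNoneMatch: string | undefined,
  jobId: string,
  updatedAt: Date,
): ConditionalResult {
  const etag = buildEtag(jobId, updatedAt);
  // The header may hold a comma-separated list; drop any weak prefix (W/)
  // because If-None-Match uses weak comparison.
  const candidates = (ifNoneMatch ?? '')
    .split(',')
    .map((value) => value.trim().replace(/^W\//, ''));
  const matched = candidates.indexOf('*') !== -1 || candidates.indexOf(etag) !== -1;
  return { status: matched ? 304 : 200, etag };
}
```

A client would store the `etag` from a 200 response and send it back on the next request, so an unchanged snapshot costs only a header round trip.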

5.2 SSE Stream

GET /api/v1/orchestrator/stream/jobs/{jobId}/first-signal
Accept: text/event-stream

event: signal
data: {"kind":"started","phase":"analyze",...}

event: signal
data: {"kind":"phase","phase":"policy",...}

event: done
data: {"kind":"succeeded",...}
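
In browsers the `EventSource` API handles this framing, but a CLI or test harness may need to parse the stream text itself. The sketch below parses complete `text/event-stream` frames into events; it assumes whole frames arrive together (no buffering of partial chunks) and ignores the `id` and `retry` fields. `parseSse` is a hypothetical helper name.

```typescript
interface SseEvent {
  event: string; // defaults to "message" when no event field is present
  data: string;  // data lines joined with "\n"
}

function parseSse(text: string): SseEvent[] {
  const events: SseEvent[] = [];
  // Frames are separated by a blank line.
  for (const frame of text.split(/\r?\n\r?\n/)) {
    let event = 'message';
    const data: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line === '' || line.charAt(0) === ':') continue; // blank or comment line
      const colon = line.indexOf(':');
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? '' : line.slice(colon + 1);
      if (value.charAt(0) === ' ') value = value.slice(1); // strip one leading space
      if (field === 'event') event = value;
      else if (field === 'data') data.push(value);
    }
    if (data.length > 0) events.push({ event, data: data.join('\n') });
  }
  return events;
}
```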

5.3 CLI Integration

# Job status with immediate signal
stella job status <job-id> --watch

# Output progression:
# [queued] Waiting in queue (position: 3)
# [started] Job started on worker-7
# [phase:analyze] Scanning image layers (47%)
# [succeeded] Completed in 2m 34s
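
The output lines above follow a simple `[tag] summary` shape. A formatter for that shape could look like the sketch below; the rule that the `phase:<name>` tag is used only when `kind` is `phase` is inferred from the sample output, not confirmed by the CLI source.

```typescript
interface CliSignal {
  kind: string;
  phase?: string;
  summary: string;
}

// Renders one watch-mode line, e.g. "[phase:analyze] Scanning image layers (47%)".
function formatSignalLine(signal: CliSignal): string {
  const tag =
    signal.kind === 'phase' && signal.phase ? `phase:${signal.phase}` : signal.kind;
  return `[${tag}] ${signal.summary}`;
}
```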

6) Caching Strategy

6.1 Cache Tiers

| Tier | Storage | TTL | Use Case |
|---|---|---|---|
| L1 | In-memory (per-instance) | 1s | Hot path, same-instance requests |
| L2 | Valkey/Redis | 5s | Cross-instance, active jobs |
| L3 | PostgreSQL | 24h | Persistent snapshots, air-gap mode |
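
To make the tier interaction concrete, here is a read-through lookup sketch in which all three tiers are modeled as in-memory maps with the TTLs from the table. The real L2 and L3 tiers are Valkey/Redis and PostgreSQL; `TtlTier`, `lookupSignal`, and the backfill-on-hit behavior are assumptions made for illustration.

```typescript
class TtlTier {
  readonly name: string;
  private readonly ttlMs: number;
  private readonly now: () => number;
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();

  constructor(name: string, ttlMs: number, now: () => number = () => Date.now()) {
    this.name = name;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: string): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Checks tiers fastest-first and copies a hit into every faster tier.
function lookupSignal(
  tiers: TtlTier[],
  key: string,
): { value: string; tier: string } | undefined {
  for (let i = 0; i < tiers.length; i++) {
    const value = tiers[i].get(key);
    if (value !== undefined) {
      for (let j = 0; j < i; j++) tiers[j].set(key, value);
      return { value, tier: tiers[i].name };
    }
  }
  return undefined; // full miss: the caller falls through to the cold path
}
```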

6.2 Cache Keys

ttfs:job:{tenant_id}:{job_id}:signal      # Current signal
ttfs:job:{tenant_id}:{job_id}:eta         # ETA prediction
ttfs:run:{tenant_id}:{run_id}:signals     # Run-level aggregation
ttfs:tenant:{tenant_id}:failure_sig       # Failure signatures
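
Centralizing key construction keeps the tenant prefix from being forgotten at call sites. The sketch below builds the four key shapes listed above; rejecting identifiers that contain `:` is an added safeguard (an assumption, not documented behavior) so that one tenant's identifier cannot spill into another key segment.

```typescript
// Rejects empty segments and segments containing the ":" separator.
function keySegment(value: string): string {
  if (value === '' || value.indexOf(':') !== -1) {
    throw new Error(`invalid cache key segment: "${value}"`);
  }
  return value;
}

const ttfsKeys = {
  signal: (tenantId: string, jobId: string): string =>
    `ttfs:job:${keySegment(tenantId)}:${keySegment(jobId)}:signal`,
  eta: (tenantId: string, jobId: string): string =>
    `ttfs:job:${keySegment(tenantId)}:${keySegment(jobId)}:eta`,
  runSignals: (tenantId: string, runId: string): string =>
    `ttfs:run:${keySegment(tenantId)}:${keySegment(runId)}:signals`,
  failureSignatures: (tenantId: string): string =>
    `ttfs:tenant:${keySegment(tenantId)}:failure_sig`,
};
```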

6.3 Air-Gap Mode

In air-gapped environments without Valkey/Redis:

  1. PostgreSQL NOTIFY/LISTEN replaces pub/sub for real-time updates
  2. Polling fallback with 2-second intervals
  3. The first_signal_snapshots table (the L3 tier in section 6.1) takes over the L2 cache role
  4. All SSE endpoints gracefully degrade to long-polling
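
The polling fallback in item 2 can be sketched as a loop that re-sends the last ETag every 2 seconds and stops on a terminal signal. The fetch function is injected so the sketch stays transport-agnostic; `pollFirstSignal`, the `'not_modified'` sentinel, and the set of terminal kinds are all illustrative assumptions.

```typescript
interface PolledSignal {
  kind: string;
  etag: string;
}

type PollResult = PolledSignal | 'not_modified';

// Assumption: these kinds mean the job will emit no further signals.
const POLL_TERMINAL_KINDS = ['failed', 'succeeded', 'canceled'];

async function pollFirstSignal(
  fetchSignal: (etag: string | undefined) => Promise<PollResult>,
  onSignal: (signal: PolledSignal) => void,
  intervalMs: number = 2000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<void> {
  let etag: string | undefined;
  for (;;) {
    // Sending the previous ETag lets the server answer 304 when nothing changed.
    const result = await fetchSignal(etag);
    if (result !== 'not_modified') {
      etag = result.etag;
      onSignal(result);
      if (POLL_TERMINAL_KINDS.indexOf(result.kind) !== -1) return;
    }
    await sleep(intervalMs);
  }
}
```

Injecting `sleep` keeps tests fast: a test can pass a no-op sleeper instead of waiting on real timers.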

7) Telemetry & Observability

7.1 Metrics

| Metric | Type | Description |
|---|---|---|
| ttfs_latency_seconds | Histogram | End-to-end signal latency |
| ttfs_cache_latency_seconds | Histogram | Cache lookup time |
| ttfs_cold_latency_seconds | Histogram | Cold path computation time |
| ttfs_signal_total | Counter | Signals by kind/surface |
| ttfs_cache_hit_total | Counter | Cache hits |
| ttfs_cache_miss_total | Counter | Cache misses |
| ttfs_slo_breach_total | Counter | SLO breaches |
| ttfs_error_total | Counter | Errors by type |

7.2 Labels

All metrics include the following labels:

  • surface: ui | cli | ci
  • cache_hit: true | false
  • signal_source: snapshot | cold_start | failure_index
  • kind: Signal kind enum
  • tenant_id: Tenant identifier (for multi-tenant deployments)

7.3 SLO Definitions

# Prometheus recording rules
- record: ttfs:slo:p50_target
  expr: 2.0  # seconds

- record: ttfs:slo:p95_target
  expr: 5.0  # seconds

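- record: ttfs:slo:compliance
  expr: |
    # `bool` makes the rule emit 1 (compliant) or 0 (breached) instead of
    # dropping the series whenever the comparison is false.
    histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
    < bool 5.0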

# Alerting rules
- alert: TtfsSloBreachP95
  expr: histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le)) > 5.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "TTFS P95 exceeds 5s SLO"

- alert: TtfsHighErrorRate
  expr: rate(ttfs_error_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
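
To show what the `histogram_quantile(0.95, ...)` expressions compute, the sketch below re-implements the linear interpolation that Prometheus applies to classic cumulative buckets. It is a simplified illustration (it assumes positive bucket bounds and a `+Inf` bucket) and is not a replacement for the PromQL function; the bucket values in the example are made up.

```typescript
interface Bucket {
  le: number;    // upper bound; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (!last || last.le !== Infinity || last.count === 0 || q < 0 || q > 1) return NaN;

  const rank = q * last.count;
  const i = sorted.findIndex((b) => b.count >= rank);
  // A quantile that lands in the +Inf bucket is reported as the highest finite bound.
  if (i === sorted.length - 1) return sorted.length > 1 ? sorted[i - 1].le : NaN;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return sorted[i].le;
  // Assume observations are spread uniformly inside the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical bucket counts: the P95 falls inside the (2.5s, 5s] bucket.
const exampleP95 = histogramQuantile(0.95, [
  { le: 1, count: 50 },
  { le: 2.5, count: 80 },
  { le: 5, count: 96 },
  { le: Infinity, count: 100 },
]);
const exampleCompliant = exampleP95 < 5.0;
```

Because the result is interpolated within a bucket, the 5s SLO is only measured precisely if the histogram has a bucket boundary at or near 5 seconds.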

8) Frontend Integration

8.1 Component Hierarchy

FirstSignalCard (Smart Component)
├── FirstSignalStore (Signal-based State)
│   ├── SSE subscription
│   ├── Polling fallback
│   └── ETag caching
├── StatusIndicator (Dumb Component)
│   └── kind → icon + color mapping
├── PhaseProgress (Dumb Component)
│   └── phase → progress bar
└── ActionButtons (Dumb Component)
    └── next_actions rendering

8.2 State Machine

type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// State transitions:
// idle → loading (initial fetch)
// loading → streaming (SSE connected) | error (fetch failed)
// streaming → done (terminal signal) | error (connection lost)
// error → loading (retry)
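
The transitions in the comments can be encoded as a lookup table so that invalid transitions are ignored rather than silently applied. This is a minimal sketch; the event names are hypothetical labels for the conditions given in parentheses above.

```typescript
type FirstSignalLoadState = 'idle' | 'loading' | 'streaming' | 'error' | 'done';

// Hypothetical event names for the conditions listed in the transition comments.
type LoadEvent =
  | 'fetch' | 'sse_connected' | 'fetch_failed'
  | 'terminal_signal' | 'connection_lost' | 'retry';

const TRANSITIONS: Record<
  FirstSignalLoadState,
  Partial<Record<LoadEvent, FirstSignalLoadState>>
> = {
  idle: { fetch: 'loading' },
  loading: { sse_connected: 'streaming', fetch_failed: 'error' },
  streaming: { terminal_signal: 'done', connection_lost: 'error' },
  error: { retry: 'loading' },
  done: {},
};

// Returns the next state, or the current state if the event is not allowed.
function transition(state: FirstSignalLoadState, event: LoadEvent): FirstSignalLoadState {
  return TRANSITIONS[state][event] ?? state;
}
```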

8.3 Animation Tokens

| Token | Value | Usage |
|---|---|---|
| --motion-duration-quick | 150ms | Skeleton fade, icon transitions |
| --motion-duration-normal | 250ms | Card expansion, phase transitions |
| --motion-duration-slow | 400ms | Success/failure celebrations |
| --motion-easing-standard | cubic-bezier(0.4, 0, 0.2, 1) | Default easing |
| --motion-easing-decelerate | cubic-bezier(0, 0, 0.2, 1) | Entries |
| --motion-easing-accelerate | cubic-bezier(0.4, 0, 1, 1) | Exits |

9) Failure Signatures

Failure signatures enable a predictive "last known outcome" by pattern-matching the current job against historical failures.

9.1 Signature Schema

{
  "signature_hash": "sha256:abc123...",
  "pattern": {
    "phase": "analyze",
    "error_code": "LAYER_EXTRACT_FAILED",
    "image_pattern": "registry.io/.*:v1.*"
  },
  "outcome": {
    "likely_cause": "Registry rate limiting",
    "mttr_p50_seconds": 300,
    "suggested_action": "Wait 5 minutes and retry"
  },
  "confidence": 0.87,
  "sample_count": 42
}

9.2 Usage

When a job enters a known failure pattern:

  1. Match current job state against failure_signatures table
  2. Enrich signal with last_known_outcome.likely_cause
  3. Predict ETA based on historical MTTR
  4. Suggest remediation via next_actions
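
The four steps above could be sketched as follows, using the signature schema from 9.1. The matching rules here are assumptions: exact equality on `phase` and `error_code`, `image_pattern` treated as an anchored regular expression, a hypothetical minimum confidence of 0.5, and the highest-confidence match winning.

```typescript
interface FailureSignature {
  signature_hash: string;
  pattern: { phase: string; error_code: string; image_pattern: string };
  outcome: { likely_cause: string; mttr_p50_seconds: number; suggested_action: string };
  confidence: number;
  sample_count: number;
}

interface FailedJob {
  phase: string;
  errorCode: string;
  image: string;
}

// Step 1: find the best-matching signature for the current job state.
function matchSignature(
  job: FailedJob,
  signatures: FailureSignature[],
  minConfidence: number = 0.5,
): FailureSignature | undefined {
  let best: FailureSignature | undefined;
  for (const sig of signatures) {
    if (sig.confidence < minConfidence) continue;
    if (sig.pattern.phase !== job.phase) continue;
    if (sig.pattern.error_code !== job.errorCode) continue;
    if (!new RegExp(`^(?:${sig.pattern.image_pattern})$`).test(job.image)) continue;
    if (!best || sig.confidence > best.confidence) best = sig;
  }
  return best;
}

// Steps 2-4: derive the fields that enrich the outgoing signal.
function enrichFromSignature(sig: FailureSignature) {
  return {
    last_known_outcome: { likely_cause: sig.outcome.likely_cause },
    eta_seconds: sig.outcome.mttr_p50_seconds,
    next_actions: [{ label: sig.outcome.suggested_action }],
  };
}
```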

10) Database Schema

See docs/db/schemas/ttfs.sql for the complete schema definition.

10.1 Core Tables

| Table | Purpose |
|---|---|
| scheduler.first_signal_snapshots | Cached signal state per job |
| scheduler.ttfs_events | Telemetry event log |
| scheduler.failure_signatures | Historical failure patterns |

10.2 Hourly Rollup View

The scheduler.ttfs_hourly_summary view provides pre-aggregated metrics for dashboard performance.

11) Testing Requirements

11.1 Unit Tests

  • Signal store state machine transitions
  • ETag generation and validation
  • Cache hit/miss scenarios
  • Failure signature matching

11.2 Integration Tests

  • End-to-end API latency measurement
  • SSE connection lifecycle
  • Air-gap mode fallback
  • Multi-tenant isolation

11.3 Deterministic Fixtures

// tests/fixtures/ttfs/
export const TTFS_FIXTURES = {
  FROZEN_TIMESTAMP: '2025-12-04T12:00:00.000Z',
  DETERMINISTIC_SEED: 0x5EED2025,
  SAMPLE_JOB_ID: '550e8400-e29b-41d4-a716-446655440000',
  SAMPLE_TENANT_ID: 'tenant-test-001'
};
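
One way such fixtures might be consumed is to derive a frozen clock and a seeded random source from them, so tests never touch `Date.now()` or `Math.random()`. The sketch below uses mulberry32, a small public-domain PRNG chosen here for illustration; the document does not say which generator the test suite actually uses, and both helper names are hypothetical.

```typescript
// Returns a clock that always reports the same instant.
function frozenClock(isoTimestamp: string): () => Date {
  const ms = Date.parse(isoTimestamp);
  return () => new Date(ms);
}

// mulberry32: deterministic 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Usage with the fixture values shown above.
const fixtureNow = frozenClock('2025-12-04T12:00:00.000Z');
const fixtureRandom = mulberry32(0x5eed2025);
```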

12) Observability

12.1 Grafana Dashboard

The TTFS observability dashboard provides real-time visibility into signal latency, cache performance, and SLO compliance.

  • Dashboard file: docs/modules/telemetry/operations/dashboards/ttfs-observability.json
  • UID: ttfs-overview

Key panels:

  • TTFS P50/P95/P99 by Surface (timeseries)
  • Cache Hit Rate (stat)
  • SLO Breaches (stat with threshold coloring)
  • Signal Source Distribution (piechart)
  • Signals by Kind (stacked timeseries)
  • Error Rate (timeseries)
  • TTFS Latency Heatmap
  • Top Failure Signatures (table)

12.2 Alert Rules

TTFS alerts are defined in docs/modules/telemetry/operations/alerts/ttfs-alerts.yaml.

Critical alerts:

| Alert | Threshold | For |
|---|---|---|
| TtfsP95High | P95 > 5s | 5m |
| TtfsSloBreach | >10 breaches in 5m | 1m |
| FirstSignalEndpointDown | Orchestrator unavailable | 2m |

Warning alerts:

| Alert | Threshold | For |
|---|---|---|
| TtfsCacheHitRateLow | <70% | 10m |
| TtfsErrorRateHigh | >1% | 5m |
| FirstSignalEndpointLatencyHigh | P95 > 500ms | 5m |

12.3 Load Testing

Load tests validate TTFS performance under realistic conditions.

  • Test file: tests/load/ttfs-load-test.js
  • Framework: k6

Scenarios:

  • Sustained: 50 RPS for 5 minutes
  • Spike: Ramp to 200 RPS
  • Soak: 25 RPS for 15 minutes

Thresholds:

  • Cache-hit P95 ≤ 250ms
  • Cold-path P95 ≤ 500ms
  • Error rate < 0.1%
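
For orientation, the scenarios and thresholds above could map onto a k6 `options` object roughly as sketched here. This is not the content of tests/load/ttfs-load-test.js: the scenario names, the `cache` tag, the VU counts, the 1-minute spike ramp, and the `startTime` sequencing are all assumptions. In an actual k6 script the object would be exported as `options`.

```typescript
const ttfsLoadOptions = {
  scenarios: {
    // Sustained: 50 RPS for 5 minutes.
    sustained: {
      executor: 'constant-arrival-rate',
      rate: 50,
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 100, // assumed pool size
    },
    // Spike: ramp to 200 RPS (ramp length is an assumption).
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 50,
      timeUnit: '1s',
      preAllocatedVUs: 400, // assumed pool size
      stages: [{ target: 200, duration: '1m' }],
      startTime: '5m', // assumed: runs after the sustained scenario
    },
    // Soak: 25 RPS for 15 minutes.
    soak: {
      executor: 'constant-arrival-rate',
      rate: 25,
      timeUnit: '1s',
      duration: '15m',
      preAllocatedVUs: 50, // assumed pool size
      startTime: '6m', // assumed: runs after the spike scenario
    },
  },
  thresholds: {
    // Assumes requests are tagged with a custom `cache` tag (hit or miss).
    'http_req_duration{cache:hit}': ['p(95)<=250'],
    'http_req_duration{cache:miss}': ['p(95)<=500'],
    // Error rate below 0.1%.
    http_req_failed: ['rate<0.001'],
  },
};
```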

13) References

  • Advisory: docs/product-advisories/14-Dec-2025 - UX and Time-to-Evidence Technical Reference.md
  • Sprint 1 (Foundation): docs/implplan/SPRINT_0338_0001_0001_ttfs_foundation.md
  • Sprint 2 (API): docs/implplan/SPRINT_0339_0001_0001_first_signal_api.md
  • Sprint 3 (UI): docs/implplan/SPRINT_0340_0001_0001_first_signal_card_ui.md
  • Sprint 4 (Enhancements): docs/implplan/SPRINT_0341_0001_0001_ttfs_enhancements.md
  • TTE Architecture: docs/modules/telemetry/architecture.md
  • Telemetry Schema: docs/schemas/ttfs-event.schema.json
  • Database Schema: docs/db/schemas/ttfs.sql
  • Grafana Dashboard: docs/modules/telemetry/operations/dashboards/ttfs-observability.json
  • Alert Rules: docs/modules/telemetry/operations/alerts/ttfs-alerts.yaml
  • Load Tests: tests/load/ttfs-load-test.js