# Tile-Proxy Service Design
## Overview
The Tile-Proxy service acts as an intermediary between StellaOps clients and upstream Rekor transparency log APIs. It provides centralized tile caching, request coalescing, and offline support for air-gapped environments.
## Architecture
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  CI/CD Agents   │────►│   Tile Proxy    │────►│    Rekor API    │
│   (StellaOps)   │     │   (StellaOps)   │     │   (Upstream)    │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Tile Cache    │     │  TUF Metadata   │     │   Checkpoint    │
│   (CAS Store)   │     │   (TrustRepo)   │     │      Cache      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
## Core Responsibilities
1. **Tile Proxying**: Forward tile requests to upstream Rekor, caching responses locally
2. **Content-Addressed Storage**: Store tiles by hash for deduplication and immutability
3. **TUF Integration**: Optionally validate metadata using TUF trust anchors
4. **Request Coalescing**: Deduplicate concurrent requests for the same tile
5. **Checkpoint Caching**: Cache and serve recent checkpoints
6. **Offline Mode**: Serve from cache when upstream is unavailable
## API Surface
### Proxy Endpoints (Passthrough)
| Endpoint | Description |
|----------|-------------|
| `GET /tile/{level}/{index}` | Proxy tile request (cache-through) |
| `GET /tile/{level}/{index}.p/{partialWidth}` | Proxy partial tile |
| `GET /checkpoint` | Proxy checkpoint request |
| `GET /api/v1/log/entries/{uuid}` | Proxy entry lookup |
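
The two tile routes above can be parsed into a single request value before the cache lookup. The following is a minimal sketch, not the service's actual router: `TileRequest` and `TileRoute.TryParse` are hypothetical names, and it assumes `{level}` and `{index}` are plain non-negative decimal integers and that a partial width lies between 1 and 255 (one less than the full tile width of 256 mentioned under Eviction Policy).

```csharp
using System;
using System.Globalization;

// Hypothetical value type for a parsed tile route; not part of the StellaOps codebase.
public readonly record struct TileRequest(int Level, long Index, int? PartialWidth);

public static class TileRoute
{
    private const int FullTileWidth = 256; // full tile width, per the Eviction Policy section

    // Parses "/tile/{level}/{index}" or "/tile/{level}/{index}.p/{partialWidth}".
    public static bool TryParse(string path, out TileRequest request)
    {
        request = default;
        var parts = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 3 || parts.Length > 4 || parts[0] != "tile") return false;
        if (!TryParseNumber(parts[1], out var level) || level > int.MaxValue) return false;

        if (parts.Length == 3)
        {
            // Full tile: the last segment is the index itself.
            if (!TryParseNumber(parts[2], out var fullIndex)) return false;
            request = new TileRequest((int)level, fullIndex, null);
            return true;
        }

        // Partial tile: the index segment carries a ".p" suffix, followed by the width.
        if (!parts[2].EndsWith(".p", StringComparison.Ordinal)) return false;
        if (!TryParseNumber(parts[2][..^2], out var index)) return false;
        if (!TryParseNumber(parts[3], out var width) || width < 1 || width >= FullTileWidth) return false;

        request = new TileRequest((int)level, index, (int)width);
        return true;
    }

    // Digits only: no sign, whitespace, or thousands separators.
    private static bool TryParseNumber(string text, out long value) =>
        long.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out value);
}
```

Rejecting a partial width of 256 keeps full and partial tiles from aliasing the same cache entry; whether the real service normalizes or rejects such a request is not specified here.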
### Admin Endpoints
| Endpoint | Description |
|----------|-------------|
| `GET /_admin/cache/stats` | Cache statistics (hits, misses, size) |
| `POST /_admin/cache/sync` | Trigger manual sync job |
| `DELETE /_admin/cache/prune` | Prune old tiles |
| `GET /_admin/health` | Health check |
| `GET /_admin/ready` | Readiness check |
## Caching Strategy
### Content-Addressed Tile Storage
Tiles are stored under a per-origin directory, keyed by tile level and index. Each tile's SHA-256 content hash is recorded in its metadata file:
```
{cache_root}/
├── tiles/
│   ├── {origin_hash}/
│   │   ├── {level}/
│   │   │   ├── {index}.tile
│   │   │   └── {index}.meta.json
│   │   └── checkpoints/
│   │       └── {tree_size}.checkpoint
│   └── ...
└── metadata/
    └── cache_stats.json
```
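
To make the layout concrete, here is a small sketch of how the on-disk paths could be derived. `TileCachePaths` is a hypothetical helper, and the derivation of `{origin_hash}` is an assumption: this document does not say how it is computed, so the sketch uses the lowercase hex SHA-256 of the upstream origin string.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; names and the origin-hash derivation are illustrative only.
public static class TileCachePaths
{
    // ASSUMPTION: {origin_hash} is the lowercase hex SHA-256 of the origin string.
    public static string OriginHash(string origin) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(origin))).ToLowerInvariant();

    // {cache_root}/tiles/{origin_hash}/{level}/{index}.tile
    public static string TilePath(string cacheRoot, string origin, int level, long index) =>
        Path.Combine(LevelDir(cacheRoot, origin, level), Invariant(index) + ".tile");

    // {cache_root}/tiles/{origin_hash}/{level}/{index}.meta.json
    public static string MetaPath(string cacheRoot, string origin, int level, long index) =>
        Path.Combine(LevelDir(cacheRoot, origin, level), Invariant(index) + ".meta.json");

    // {cache_root}/tiles/{origin_hash}/checkpoints/{tree_size}.checkpoint
    public static string CheckpointPath(string cacheRoot, string origin, long treeSize) =>
        Path.Combine(cacheRoot, "tiles", OriginHash(origin), "checkpoints", Invariant(treeSize) + ".checkpoint");

    private static string LevelDir(string cacheRoot, string origin, int level) =>
        Path.Combine(cacheRoot, "tiles", OriginHash(origin), Invariant(level));

    // Culture-independent formatting so paths are identical on every host.
    private static string Invariant(long value) => value.ToString(CultureInfo.InvariantCulture);
}
```

For example, `TileCachePaths.TilePath("/var/cache/stellaops", "https://rekor.sigstore.dev", 0, 42)` yields a path ending in `0/42.tile` under that origin's hashed directory.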
### Tile Metadata
Each tile has associated metadata:
```json
{
  "cachedAt": "2026-01-25T10:00:00Z",
  "treeSize": 1050000,
  "isPartial": false,
  "contentHash": "sha256:abc123...",
  "upstreamUrl": "https://rekor.sigstore.dev"
}
```
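
The metadata document maps naturally onto a small record type. The sketch below shows one way to read and write it with `System.Text.Json`; `TileMetadata` and `TileMetadataJson` are hypothetical names chosen for illustration, and the field types are inferred from the sample above rather than taken from the implementation.

```csharp
using System;
using System.Text.Json;

// Hypothetical record mirroring the JSON fields in the sample above.
public sealed record TileMetadata(
    DateTimeOffset CachedAt,
    long TreeSize,
    bool IsPartial,
    string ContentHash,
    string UpstreamUrl);

public static class TileMetadataJson
{
    // Web defaults use camelCase property names, matching the sample document.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    public static string Serialize(TileMetadata metadata) =>
        JsonSerializer.Serialize(metadata, Options);

    public static TileMetadata Parse(string json) =>
        JsonSerializer.Deserialize<TileMetadata>(json, Options)
        ?? throw new JsonException("Tile metadata must not be null.");
}
```

Because records compare by value, a round trip through `Serialize` and `Parse` produces an equal `TileMetadata`, which is convenient for cache consistency checks.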
### Eviction Policy
1. **LRU by Access Time**: Least recently accessed tiles evicted first
2. **Max Size Limit**: Configurable maximum cache size
3. **TTL Override**: Force re-fetch after configurable time (for checkpoints)
4. **Immutability Preservation**: Full tiles (width=256) never evicted unless explicitly pruned
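
The interaction between the LRU rule, the size limit, and the protection of full tiles can be sketched as a pure planning function. This is an illustrative outline under assumptions: `CacheEntry` and `EvictionPlanner` are hypothetical names, and the sketch treats "explicitly pruned" as a boolean flag passed by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical cache index entry; field names are illustrative.
public sealed record CacheEntry(string Key, long SizeBytes, DateTimeOffset LastAccess, bool IsFullTile);

public static class EvictionPlanner
{
    // Returns the entries to evict, least recently accessed first, until the
    // cache fits within maxBytes. Full tiles are skipped unless explicitPrune
    // is true, mirroring the immutability-preservation rule.
    public static IReadOnlyList<CacheEntry> Plan(
        IEnumerable<CacheEntry> entries, long maxBytes, bool explicitPrune = false)
    {
        var all = entries.ToList();
        var excess = all.Sum(e => e.SizeBytes) - maxBytes;
        var victims = new List<CacheEntry>();

        foreach (var entry in all
            .Where(e => explicitPrune || !e.IsFullTile)
            .OrderBy(e => e.LastAccess))
        {
            if (excess <= 0) break; // already within the size limit
            victims.Add(entry);
            excess -= entry.SizeBytes;
        }

        return victims;
    }
}
```

Note that with `explicitPrune` left at `false` the plan may not free enough space when the cache holds mostly full tiles; how the service handles that case is not covered by this sketch.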
## Request Coalescing
Concurrent requests for the same tile are coalesced:
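```csharp
// Request coalescing sketch (illustrative class name): concurrent callers for
// the same tile share a single upstream fetch.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public abstract class CoalescingTileFetcher
{
    private readonly ConcurrentDictionary<string, Lazy<Task<byte[]>>> _inflightRequests = new();

    protected abstract Task<byte[]> FetchFromUpstream(string origin, int level, long index);

    public async Task<byte[]> GetTileAsync(string origin, int level, long index)
    {
        var key = $"{origin}/{level}/{index}";
        // GetOrAdd is atomic per key; Lazy ensures only the stored entry starts a fetch.
        var inflight = _inflightRequests.GetOrAdd(
            key, _ => new Lazy<Task<byte[]>>(() => FetchFromUpstream(origin, level, index)));
        try
        {
            // All callers await the same task, so a failed fetch faults every waiter
            // instead of leaving them blocked on a task that never completes.
            return await inflight.Value;
        }
        finally
        {
            // Remove only this entry so a newer in-flight request is left untouched.
            _inflightRequests.TryRemove(
                new KeyValuePair<string, Lazy<Task<byte[]>>>(key, inflight));
        }
    }
}
```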
## TUF Integration Point
When `TufValidationEnabled` is true:
1. Load service map from TUF to discover Rekor URL
2. Validate Rekor public key from TUF targets
3. Verify checkpoint signatures using TUF-loaded keys
4. Reject tiles if checkpoint signature invalid
## Upstream Failover
The proxy supports multiple upstream sources with failover:
```yaml
tile_proxy:
  upstreams:
    - url: https://rekor.sigstore.dev
      priority: 1
      timeout: 30s
    - url: https://rekor-mirror.internal
      priority: 2
      timeout: 10s
```
Failover behavior:
1. Try primary upstream first
2. On timeout/error, try next upstream
3. Remember the upstream that succeeded and try it first on subsequent requests
4. Reset failover state on explicit refresh
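
The failover steps above can be sketched as a small fetcher that walks the upstream list in priority order. This is an outline under assumptions rather than the service's implementation: `Upstream` and `FailoverFetcher` are hypothetical names, the actual HTTP call is injected as a delegate that is expected to honor the cancellation token, and the remembered upstream is not synchronized across threads.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model of one entry in the `upstreams` list.
public sealed record Upstream(string Url, int Priority, TimeSpan Timeout);

public sealed class FailoverFetcher
{
    private readonly IReadOnlyList<Upstream> _upstreams;
    private readonly Func<Upstream, string, CancellationToken, Task<byte[]>> _fetch;
    private Upstream? _preferred; // last upstream that succeeded (step 3)

    public FailoverFetcher(
        IEnumerable<Upstream> upstreams,
        Func<Upstream, string, CancellationToken, Task<byte[]>> fetch)
    {
        _upstreams = upstreams.OrderBy(u => u.Priority).ToList();
        _fetch = fetch;
    }

    // Step 4: an explicit refresh forgets the remembered upstream.
    public void Reset() => _preferred = null;

    public async Task<byte[]> FetchAsync(string path, CancellationToken ct = default)
    {
        var errors = new List<Exception>();
        var preferred = _preferred;
        IEnumerable<Upstream> order = preferred is null
            ? _upstreams
            : _upstreams.Where(u => u != preferred).Prepend(preferred);

        foreach (var upstream in order)
        {
            // Per-upstream timeout, linked to the caller's token.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
            cts.CancelAfter(upstream.Timeout);
            try
            {
                var body = await _fetch(upstream, path, cts.Token);
                _preferred = upstream;
                return body;
            }
            catch (Exception ex) when (!ct.IsCancellationRequested)
            {
                errors.Add(ex); // timeout or upstream error: try the next source
            }
        }

        throw new AggregateException("All upstreams failed.", errors);
    }
}
```

If the caller's own token is cancelled, the exception propagates immediately instead of triggering failover, so a client disconnect does not fan out to every mirror.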
## Deployment Model
### Standalone Service
Run as a dedicated service with a persistent volume:
```yaml
services:
  tile-proxy:
    image: stellaops/tile-proxy:latest
    ports:
      - "8090:8080"
    volumes:
      - tile-cache:/var/cache/stellaops/tiles
      - tuf-cache:/var/cache/stellaops/tuf
    environment:
      - TILE_PROXY__UPSTREAM_URL=https://rekor.sigstore.dev
      - TILE_PROXY__TUF_URL=https://trust.stella-ops.org/tuf/
```
### Sidecar Mode
Run alongside the Attestor service, sharing its network namespace:
```yaml
services:
  attestor:
    image: stellaops/attestor:latest
    environment:
      - ATTESTOR__REKOR_URL=http://localhost:8090 # Use sidecar
  tile-proxy:
    image: stellaops/tile-proxy:latest
    network_mode: "service:attestor"
```
## Metrics
Prometheus metrics are exposed at `/_admin/metrics`:
| Metric | Type | Description |
|--------|------|-------------|
| `tile_proxy_cache_hits_total` | Counter | Total cache hits |
| `tile_proxy_cache_misses_total` | Counter | Total cache misses |
| `tile_proxy_cache_size_bytes` | Gauge | Current cache size |
| `tile_proxy_upstream_requests_total` | Counter | Upstream requests by status |
| `tile_proxy_request_duration_seconds` | Histogram | Request latency |
| `tile_proxy_sync_last_success_timestamp` | Gauge | Last successful sync time |
## Configuration
```yaml
tile_proxy:
  # Upstream Rekor configuration
  upstream_url: https://rekor.sigstore.dev
  tile_base_url: https://rekor.sigstore.dev/tile/

  # TUF integration (optional)
  tuf:
    enabled: true
    url: https://trust.stella-ops.org/tuf/
    validate_checkpoint_signature: true

  # Cache configuration
  cache:
    base_path: /var/cache/stellaops/tiles
    max_size_gb: 10
    eviction_policy: lru
    checkpoint_ttl_minutes: 5

  # Sync job configuration
  sync:
    enabled: true
    schedule: "0 */6 * * *"
    depth: 10000

  # Request handling
  coalescing:
    enabled: true
    max_wait_ms: 5000

  # Failover
  failover:
    enabled: true
    retry_count: 2
    retry_delay_ms: 1000
```
## Security Considerations
1. **No Authentication by Default**: Designed for internal network use
2. **Optional mTLS**: Can enable client certificate validation
3. **Rate Limiting**: Optional rate limiting per client IP
4. **Audit Logging**: Log all cache operations for compliance
5. **Immutable Tiles**: Full tiles are never modified after caching
## Error Handling
| Scenario | Behavior |
|----------|----------|
| Upstream unavailable | Serve from cache if available; 503 otherwise |
| Invalid tile data | Reject, don't cache, log error |
| Cache full | Evict LRU tiles, continue serving |
| TUF validation fails | Reject request, return 502 |
| Checkpoint stale | Refresh from upstream, warn in logs |
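
The first two rows of the table can be sketched as a single decision function. This is a hedged outline, not the service's handler: `ProxyResult` and `OfflineFallback` are hypothetical names, the I/O is injected as delegates, and returning 502 for invalid tile data is an assumption (the table only specifies 502 for TUF validation failures).

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical response model: an HTTP status plus an optional body.
public sealed record ProxyResult(HttpStatusCode Status, byte[]? Body);

public static class OfflineFallback
{
    // fetchUpstream, readCache, and isValid are stand-ins for the real
    // upstream client, cache store, and tile validator.
    public static async Task<ProxyResult> ServeAsync(
        Func<Task<byte[]>> fetchUpstream,
        Func<byte[]?> readCache,
        Func<byte[], bool> isValid)
    {
        byte[] fresh;
        try
        {
            fresh = await fetchUpstream();
        }
        catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException or TimeoutException)
        {
            // Upstream unavailable: serve from cache if available; 503 otherwise.
            var cached = readCache();
            return cached is null
                ? new ProxyResult(HttpStatusCode.ServiceUnavailable, null)
                : new ProxyResult(HttpStatusCode.OK, cached);
        }

        // Invalid tile data: reject and do not cache.
        // ASSUMPTION: 502 is used here; the table does not name a status for this row.
        if (!isValid(fresh))
        {
            return new ProxyResult(HttpStatusCode.BadGateway, null);
        }

        return new ProxyResult(HttpStatusCode.OK, fresh);
    }
}
```

Keeping the decision separate from the I/O makes each row of the table straightforward to cover with a unit test.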
## Future Enhancements
1. **Tile Prefetching**: Prefetch tiles for known verification patterns
2. **Multi-Log Support**: Support multiple transparency logs
3. **Replication**: Sync cache between proxy instances
4. **Compression**: Optional tile compression for storage