docs(ops): Complete operations runbooks for Epic 3500

Sprint 3500.0004.0004 (Documentation & Handoff) - T2 DONE

Operations Runbooks Added:
- score-replay-runbook.md: Deterministic replay procedures
- proof-verification-runbook.md: DSSE/Merkle verification ops
- airgap-operations-runbook.md: Offline kit management

CLI Reference Docs:
- reachability-cli-reference.md
- score-proofs-cli-reference.md
- unknowns-cli-reference.md

Air-Gap Guides:
- score-proofs-reachability-airgap-runbook.md

Training Materials:
- score-proofs-concept-guide.md

UI API Clients:
- proof.client.ts
- reachability.client.ts
- unknowns.client.ts

All 5 operations runbooks now complete (reachability, unknowns-queue, score-replay, proof-verification, airgap-operations).
2025-12-20 22:30:02 +02:00
parent 09c7155f1b
commit 4b3db9ca85
13 changed files with 5630 additions and 12 deletions
--- a/docs/operations/score-replay-runbook.md
+++ b/docs/operations/score-replay-runbook.md
@@ -0,0 +1,518 @@
+# Score Replay Operations Runbook
+
+> **Version**: 1.0.0  
+> **Sprint**: 3500.0004.0004  
+> **Last Updated**: 2025-12-20
+
+This runbook covers operational procedures for Score Replay, including deterministic score computation verification, proof bundle validation, and troubleshooting replay discrepancies.
+
+---
+
+## Table of Contents
+
+1. [Overview](#1-overview)
+2. [Score Replay Operations](#2-score-replay-operations)
+3. [Determinism Verification](#3-determinism-verification)
+4. [Proof Bundle Management](#4-proof-bundle-management)
+5. [Troubleshooting](#5-troubleshooting)
+6. [Monitoring & Alerting](#6-monitoring--alerting)
+7. [Escalation Procedures](#7-escalation-procedures)
+
+---
+
+## 1. Overview
+
+### What is Score Replay?
+
+Score Replay is the ability to re-execute a vulnerability score computation using the exact same inputs (SBOM, rules, policies, feeds) that were used in the original scan. This provides:
+
+- **Auditability**: Prove that a score was computed correctly
+- **Determinism verification**: Confirm that identical inputs produce identical outputs
+- **Compliance evidence**: Generate proof bundles for regulatory requirements
+- **Dispute resolution**: Verify contested scan results
+
+### Key Concepts
+
+| Term | Definition |
+|------|------------|
+| **Manifest** | Content-addressed record of all scoring inputs (SBOM hash, rules hash, policy hash, feed hash) |
+| **Proof Bundle** | Signed attestation containing manifest, score, and Merkle proof |
+| **Root Hash** | Merkle tree root computed from all input hashes |
+| **DSSE Envelope** | Dead Simple Signing Envelope containing the signed proof |
+| **Freeze Timestamp** | Optional timestamp to replay scoring at a specific point in time |
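+
+To make the relationship between the input hashes and the root hash concrete, the following is a minimal sketch of a Merkle root computation over the four manifest hashes. It assumes SHA-256, a fixed leaf order (SBOM, rules, policy, feed), and duplication of the last node on odd-sized levels. The leaf encoding and tree layout actually used by the Attestor library are not specified in this runbook and may differ; `merkle_root` is a hypothetical helper, not part of the `stella` CLI.
+
+```python
+import hashlib
+
+
+def merkle_root(leaf_hashes: list[str]) -> str:
+    """Fold hex-encoded SHA-256 leaf hashes into a single root hash.
+
+    Assumed scheme (illustrative only): parent = SHA-256(left || right),
+    with the last node duplicated when a level has an odd node count.
+    """
+    if not leaf_hashes:
+        raise ValueError("at least one leaf hash is required")
+    level = [bytes.fromhex(h.removeprefix("sha256:")) for h in leaf_hashes]
+    while len(level) > 1:
+        if len(level) % 2 == 1:
+            level.append(level[-1])  # duplicate the unpaired node
+        level = [
+            hashlib.sha256(level[i] + level[i + 1]).digest()
+            for i in range(0, len(level), 2)
+        ]
+    return "sha256:" + level[0].hex()
+
+
+# Hypothetical stand-ins for the SBOM, rules, policy, and feed hashes.
+inputs = [hashlib.sha256(n).hexdigest() for n in (b"sbom", b"rules", b"policy", b"feed")]
+root = merkle_root(inputs)
+```
+
+Because the root depends on every leaf and on leaf order, changing any single input hash yields a different root, which is what makes the root hash a compact fingerprint of all scoring inputs.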
+
+### Architecture Components
+
+| Component | Purpose | Location |
+|-----------|---------|----------|
+| Score Engine | Computes vulnerability scores | Scanner Worker |
+| Manifest Store | Persists scoring manifests | `scanner.manifest` table |
+| Proof Chain | Generates Merkle proofs | Attestor library |
+| Signer | Signs proof bundles (DSSE) | Signer service |
+
+---
+
+## 2. Score Replay Operations
+
+### 2.1 Triggering a Score Replay
+
+#### Via CLI
+
+```bash
+# Basic replay
+stella score replay --scan <scan-id>
+
+# Replay with specific manifest
+stella score replay --scan <scan-id> --manifest-hash sha256:abc123...
+
+# Replay with frozen timestamp (for determinism testing)
+stella score replay --scan <scan-id> --freeze 2025-01-15T00:00:00Z
+
+# Output as JSON
+stella score replay --scan <scan-id> --output json
+```
+
+#### Via API
+
+```bash
+# POST /api/v1/scanner/score/{scanId}/replay
+curl -X POST "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/replay" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "manifestHash": "sha256:abc123...",
+    "freezeTimestamp": "2025-01-15T00:00:00Z"
+  }'
+```
+
+#### Expected Response
+
+```json
+{
+  "scanId": "scan-123",
+  "score": 7.5,
+  "rootHash": "sha256:def456...",
+  "bundleUri": "/api/v1/scanner/scans/scan-123/proofs/sha256:def456...",
+  "manifestHash": "sha256:abc123...",
+  "replayedAt": "2025-01-16T10:30:00Z",
+  "deterministic": true
+}
+```
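+
+A caller will typically check two fields of this response: `deterministic` and `rootHash`. The sketch below shows one way to do that in a script. It relies only on the field names in the example response above; `check_replay_response` is a hypothetical helper, and the real response may carry additional fields.
+
+```python
+import json
+
+
+def check_replay_response(body: str, expected_root_hash: str) -> list[str]:
+    """Return the problems found in a replay response (empty list if none)."""
+    response = json.loads(body)
+    problems = []
+    if response.get("deterministic") is not True:
+        problems.append("replay was not reported as deterministic")
+    actual = response.get("rootHash")
+    if actual != expected_root_hash:
+        problems.append(
+            f"root hash mismatch: got {actual!r}, expected {expected_root_hash!r}"
+        )
+    return problems
+```
+
+Returning a list rather than raising on the first problem lets an operator see every discrepancy in a single run.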
+
+### 2.2 Retrieving Proof Bundles
+
+#### Via CLI
+
+```bash
+# Get bundle for a scan
+stella score bundle --scan <scan-id>
+
+# Download bundle to file
+stella score bundle --scan <scan-id> --output bundle.tar.gz
+```
+
+#### Via API
+
+```bash
+# GET /api/v1/scanner/score/{scanId}/bundle
+curl "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/bundle" \
+  -H "Authorization: Bearer $TOKEN" \
+  -o bundle.tar.gz
+```
+
+### 2.3 Verifying Score Integrity
+
+#### Via CLI
+
+```bash
+# Verify against expected root hash
+stella score verify --scan <scan-id> --root-hash sha256:def456...
+
+# Verify downloaded bundle
+stella proof verify --bundle bundle.tar.gz
+```
+
+#### Via API
+
+```bash
+# POST /api/v1/scanner/score/{scanId}/verify
+curl -X POST "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/verify" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"expectedRootHash": "sha256:def456..."}'
+```
+
+---
+
+## 3. Determinism Verification
+
+### 3.1 What Affects Determinism?
+
+Score computation is deterministic when every input below meets its requirement:
+
+| Input | Requirement |
+|-------|-------------|
+| SBOM | Identical content (same hash) |
+| Rules | Same rule version and configuration |
+| Policy | Same policy document |
+| Feeds | Same feed snapshot (freeze timestamp) |
+| Ordering | Findings sorted deterministically |
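+
+The ordering requirement is the least obvious of these. As a minimal sketch, the snippet below hashes a list of findings so that the digest does not depend on the order in which the findings were produced. The finding fields (`id`, `severity`) are hypothetical, and the Scanner's actual canonicalization rules are not described in this runbook.
+
+```python
+import hashlib
+import json
+
+
+def findings_digest(findings: list[dict]) -> str:
+    """Hash findings independently of the order they were produced in."""
+    # Serialize each finding canonically (sorted keys, no extra whitespace),
+    # then sort the serialized forms so the overall order is total and stable.
+    canonical = sorted(
+        json.dumps(f, sort_keys=True, separators=(",", ":")) for f in findings
+    )
+    payload = "[" + ",".join(canonical) + "]"
+    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
+
+
+a = [{"id": "CVE-2024-0001", "severity": 7.5}, {"id": "CVE-2024-0002", "severity": 3.1}]
+b = list(reversed(a))
+same = findings_digest(a) == findings_digest(b)
+```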
+
+### 3.2 Running Determinism Checks
+
+```bash
+# Run replay twice and compare
+REPLAY1=$(stella score replay --scan "$SCAN_ID" --output json)
+REPLAY2=$(stella score replay --scan "$SCAN_ID" --output json)
+
+# Extract root hashes
+HASH1=$(echo "$REPLAY1" | jq -r '.rootHash')
+HASH2=$(echo "$REPLAY2" | jq -r '.rootHash')
+
+# Compare
+if [ "$HASH1" = "$HASH2" ]; then
+  echo "✓ Determinism verified: $HASH1"
+else
+  echo "✗ Non-deterministic! $HASH1 != $HASH2"
+  exit 1
+fi
+```
+
+### 3.3 Common Determinism Issues
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Different root hash | Feed data changed between replays | Use `--freeze` timestamp |
+| Score drift | Rule version mismatch | Pin rules version in manifest |
+| Ordering differences | Non-stable sort in findings | Check Scanner version (fixed in v2.1+) |
+| Timestamp in output | Current time used in computation | Replay with `--freeze` so the computation uses a fixed time |
+
+### 3.4 Feed Freeze for Reproducibility
+
+```bash
+# Replay with feed state frozen to original scan time
+stella score replay --scan "$SCAN_ID" \
+  --freeze "$(stella scan show "$SCAN_ID" --output json | jq -r '.scannedAt')"
+```
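+
+If the original scan time is recorded with a non-UTC offset, it may need normalizing before it is passed to `--freeze`. The sketch below converts an ISO 8601 timestamp to the UTC `Z` form used in the examples in this runbook. That `--freeze` requires this exact form is an assumption, and `to_freeze_timestamp` is a hypothetical helper.
+
+```python
+from datetime import datetime, timezone
+
+
+def to_freeze_timestamp(value: str) -> str:
+    """Normalize an ISO 8601 timestamp to UTC, e.g. 2025-01-15T00:00:00Z."""
+    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
+    # so rewrite it as an explicit offset first.
+    if value.endswith("Z"):
+        value = value[:-1] + "+00:00"
+    parsed = datetime.fromisoformat(value)
+    if parsed.tzinfo is None:
+        raise ValueError("timestamp must carry a UTC offset")
+    # Sub-second precision is dropped by this format string.
+    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+```
+
+Rejecting timestamps without an offset avoids silently interpreting them in the local time zone of whichever host runs the replay.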
+
+---
+
+## 4. Proof Bundle Management
+
+### 4.1 Bundle Contents
+
+A proof bundle (`.tar.gz`) contains:
+
+```
+bundle/
+├── manifest.json       # Input hashes and metadata
+├── score.json          # Computed score and findings summary
+├── merkle-proof.json   # Merkle tree with inclusion proofs
+├── dsse-envelope.json  # Signed attestation (DSSE format)
+└── certificate.pem     # Signing certificate (optional)
+```
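+
+A quick structural check can catch an incomplete bundle before any signature verification is attempted. The sketch below lists which of the required files are missing from a `.tar.gz` archive. File names come from the layout above, with `certificate.pem` treated as optional; `missing_bundle_files` is a hypothetical helper and does not replace `stella proof verify`.
+
+```python
+import io
+import tarfile
+
+REQUIRED = {"manifest.json", "score.json", "merkle-proof.json", "dsse-envelope.json"}
+
+
+def missing_bundle_files(bundle: bytes) -> set[str]:
+    """Return the required bundle files that are absent from a .tar.gz archive."""
+    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as archive:
+        present = {
+            member.name.removeprefix("./").removeprefix("bundle/")
+            for member in archive.getmembers()
+            if member.isfile()
+        }
+    return REQUIRED - present
+```
+
+For a bundle on disk, pass the file contents, for example `missing_bundle_files(open("bundle.tar.gz", "rb").read())`.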
+
+### 4.2 Inspecting Bundles
+
+```bash
+# Extract and view manifest
+tar -xzf bundle.tar.gz
+jq . bundle/manifest.json
+
+# Verify DSSE signature
+stella proof verify --bundle bundle.tar.gz --verbose
+
+# Check Merkle proof
+stella proof spine --bundle bundle.tar.gz
+```
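+
+When a signature check fails, it can help to see exactly which bytes a DSSE signature covers. In DSSE, signatures are computed over the pre-authentication encoding (PAE) of the payload type and the decoded payload, not over the envelope JSON itself. The sketch below rebuilds the PAE from an envelope such as `bundle/dsse-envelope.json`. It assumes standard Base64 for `payload` and does not verify any signature; `dsse_pae` is a hypothetical helper.
+
+```python
+import base64
+import json
+
+
+def dsse_pae(envelope_json: str) -> bytes:
+    """Build the DSSE pre-authentication encoding that signatures cover.
+
+    PAE = "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body,
+    where LEN is the ASCII decimal byte length.
+    """
+    envelope = json.loads(envelope_json)
+    payload_type = envelope["payloadType"].encode("utf-8")
+    payload = base64.b64decode(envelope["payload"])
+    return b" ".join(
+        [
+            b"DSSEv1",
+            str(len(payload_type)).encode("ascii"),
+            payload_type,
+            str(len(payload)).encode("ascii"),
+            payload,
+        ]
+    )
+```
+
+Comparing the PAE of a failing bundle with that of a known-good bundle for the same scan shows whether the payload or only the signature differs.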
+
+### 4.3 Bundle Retention Policy
+
+| Environment | Retention | Notes |
+|-------------|-----------|-------|
+| Production | 7 years | Regulatory compliance |
+| Staging | 90 days | Testing purposes |
+| Development | 30 days | Cleaned up automatically |
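+
+The retention periods above can be turned into a purge deadline per bundle. The sketch below is one way to compute it; how the platform actually counts the 7-year period (calendar years versus a fixed number of days) is not stated here, so calendar years are an assumption, and `retention_deadline` is a hypothetical helper.
+
+```python
+from datetime import date, timedelta
+
+
+def retention_deadline(created: date, environment: str) -> date:
+    """Return the first date on which a bundle may be purged."""
+    if environment == "production":
+        try:
+            return created.replace(year=created.year + 7)
+        except ValueError:
+            # 29 February has no counterpart in a non-leap target year.
+            return created.replace(year=created.year + 7, day=28)
+    days = {"staging": 90, "development": 30}[environment]
+    return created + timedelta(days=days)
+```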
+
+### 4.4 Archiving Bundles
+
+```bash
+# Export bundle to long-term storage
+stella score bundle --scan "$SCAN_ID" --output "/archive/proofs/$SCAN_ID.tar.gz"
+
+# Bulk export for compliance audit
+stella score bundle-export \
+  --since 2024-01-01 \
+  --until 2024-12-31 \
+  --output /archive/2024-proofs/
+```
+
+---
+
+## 5. Troubleshooting
+
+### 5.1 Replay Returns Different Score
+
+**Symptoms**: Replayed score differs from original scan score.
+
+**Diagnostic Steps**:
+
+1. Check manifest integrity:
+   ```bash
+   stella scan show $SCAN_ID --output json | jq '.manifest'
+   ```
+
+2. Verify feed state:
+   ```bash
+   # Replay at the original scan time and print the manifest hash for comparison
+   stella score replay --scan "$SCAN_ID" --freeze "$ORIGINAL_TIME" --output json | jq '.manifestHash'
+   ```
+
+3. Check for rule updates:
+   ```bash
+   stella rules show --version --output json
+   ```
+
+**Resolution**:
+- Use `--freeze` timestamp matching original scan
+- Pin rule versions in policy
+- Regenerate manifest if inputs changed legitimately
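+
+To see which input caused a score difference, it helps to compare the original and replayed manifests field by field. The sketch below does that for the four input hashes. The key names mirror the `scanner.manifest` columns in Appendix A; whether `manifest.json` uses the same names is an assumption, and `changed_inputs` is a hypothetical helper.
+
+```python
+INPUT_FIELDS = ("sbom_hash", "rules_hash", "policy_hash", "feed_hash")
+
+
+def changed_inputs(original: dict, replayed: dict) -> list[str]:
+    """List the input hashes that differ between two manifest records."""
+    return [
+        field
+        for field in INPUT_FIELDS
+        if original.get(field) != replayed.get(field)
+    ]
+
+
+original = {"sbom_hash": "sha256:aa", "rules_hash": "sha256:bb",
+            "policy_hash": "sha256:cc", "feed_hash": "sha256:dd"}
+replayed = dict(original, feed_hash="sha256:ee")
+drifted = changed_inputs(original, replayed)
+```
+
+A result containing only `feed_hash` points to feed drift, for which the `--freeze` resolution applies; a changed `rules_hash` points to a rule version mismatch.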
+
+### 5.2 Proof Verification Fails
+
+**Symptoms**: `stella proof verify` returns validation errors.
+
+**Diagnostic Steps**:
+
+1. Check DSSE signature:
+   ```bash
+   stella proof verify --bundle bundle.tar.gz --verbose 2>&1 | grep -i signature
+   ```
+
+2. Verify certificate validity:
+   ```bash
+   openssl x509 -in bundle/certificate.pem -noout -dates
+   ```
+
+3. Check Merkle proof:
+   ```bash
+   stella proof spine --bundle bundle.tar.gz --verify
+   ```
+
+**Common Errors**:
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| `SIGNATURE_INVALID` | Bundle tampered or wrong key | Re-download bundle |
+| `CERTIFICATE_EXPIRED` | Signing cert expired | Check signing key rotation |
+| `MERKLE_MISMATCH` | Root hash doesn't match | Verify correct bundle version |
+| `MANIFEST_MISSING` | Incomplete bundle | Re-export from API |
+
+### 5.3 Replay Timeout
+
+**Symptoms**: Replay request times out or takes too long.
+
+**Diagnostic Steps**:
+
+1. Check scan size:
+   ```bash
+   stella scan show $SCAN_ID --output json | jq '.findingsCount'
+   ```
+
+2. Monitor replay progress:
+   ```bash
+   stella score replay --scan $SCAN_ID --verbose
+   ```
+
+**Resolution**:
+- For large scans (>10k findings), increase timeout
+- Check Scanner Worker health
+- Consider async replay for very large scans
+
+### 5.4 Missing Manifest
+
+**Symptoms**: `Manifest not found` error on replay.
+
+**Diagnostic Steps**:
+
+1. Verify scan exists:
+   ```bash
+   stella scan show $SCAN_ID
+   ```
+
+2. Check manifest table:
+   ```sql
+   SELECT * FROM scanner.manifest WHERE scan_id = 'scan-123';
+   ```
+
+**Resolution**:
+- Manifest may have been purged (check retention policy)
+- Restore from backup if available
+- Re-run scan if original inputs available
+
+---
+
+## 6. Monitoring & Alerting
+
+### 6.1 Key Metrics
+
+| Metric | Description | Alert Threshold |
+|--------|-------------|-----------------|
+| `score_replay_duration_ms` | Time to complete replay | p99 > 30s |
+| `score_replay_determinism_failures_total` | Non-deterministic replays | > 0 |
+| `proof_verification_failures` | Failed verifications | > 5/hour |
+| `manifest_storage_size_bytes` | Manifest table size | > 100GB |
+
+### 6.2 Grafana Dashboard Queries
+
+```promql
+# Replay latency
+histogram_quantile(0.99, 
+  rate(score_replay_duration_ms_bucket[5m])
+)
+
+# Determinism failure rate
+rate(score_replay_determinism_failures_total[1h])
+
+# Proof verification success rate
+sum(rate(proof_verification_success_total[1h])) /
+sum(rate(proof_verification_total[1h]))
+```
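+
+The latency query estimates p99 from cumulative histogram buckets rather than from raw samples. The sketch below is a simplified version of the linear interpolation that `histogram_quantile` performs, useful for sanity-checking a dashboard value by hand. The bucket boundaries in the example are hypothetical, and the function ignores edge cases that PromQL handles (such as negative bucket bounds).
+
+```python
+import math
+
+
+def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
+    """Estimate a quantile from cumulative (upper_bound, count) buckets.
+
+    Requires at least one finite bucket plus the +Inf bucket.
+    """
+    if not 0 < q <= 1:
+        raise ValueError("q must be in (0, 1]")
+    buckets = sorted(buckets)
+    total = buckets[-1][1]
+    if total == 0:
+        return math.nan
+    rank = q * total
+    # Find the first bucket whose cumulative count reaches the rank.
+    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
+    upper, count = buckets[index]
+    if math.isinf(upper):
+        return buckets[-2][0]  # fall back to the highest finite bound
+    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
+    # Interpolate linearly inside the selected bucket.
+    return lower + (upper - lower) * (rank - below) / (count - below)
+
+
+# Hypothetical replay durations in milliseconds.
+sample = [(1000, 50), (10000, 90), (30000, 98), (60000, 100), (math.inf, 100)]
+p99 = histogram_quantile(0.99, sample)
+```
+
+With these sample buckets the estimate lands between the 30000 ms and 60000 ms bounds, so it would exceed the 30 s alert threshold. The result is only as precise as the bucket layout allows.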
+
+### 6.3 Alert Rules
+
+```yaml
+groups:
+  - name: score-replay
+    rules:
+      - alert: ScoreReplayLatencyHigh
+        expr: histogram_quantile(0.99, rate(score_replay_duration_ms_bucket[5m])) > 30000
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Score replay latency exceeds 30s at p99
+
+      - alert: DeterminismFailure
+        expr: increase(score_replay_determinism_failures_total[1h]) > 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: Non-deterministic score replay detected
+```
+
+---
+
+## 7. Escalation Procedures
+
+### 7.1 Escalation Matrix
+
+| Severity | Condition | Response Time | Escalate To |
+|----------|-----------|---------------|-------------|
+| P1 - Critical | Determinism failure in production | 15 minutes | Platform Team Lead |
+| P2 - High | Proof verification failures > 10/hour | 1 hour | Scanner Team |
+| P3 - Medium | Replay latency degradation | 4 hours | Scanner Team |
+| P4 - Low | Single replay failure | Next business day | Support Queue |
+
+### 7.2 P1: Determinism Failure Response
+
+1. **Immediate Actions** (0-15 min):
+   - Capture affected scan IDs
+   - Preserve original manifest data
+   - Check for recent deployments
+
+2. **Investigation** (15-60 min):
+   - Compare input hashes between replays
+   - Check feed synchronization status
+   - Review rule engine logs
+
+3. **Remediation**:
+   - Roll back if the failure is deployment-related
+   - Freeze feeds if feed data drift is the cause
+   - Ship a hotfix if a code bug is identified
+
+### 7.3 Contacts
+
+| Role | Contact | Availability |
+|------|---------|--------------|
+| Scanner Team Lead | scanner-lead@stellaops.io | Business hours |
+| Platform On-Call | platform-oncall@stellaops.io | 24/7 |
+| Security Team | security@stellaops.io | Business hours |
+
+---
+
+## Appendix A: SQL Queries
+
+### Check Manifest History
+
+```sql
+SELECT 
+  scan_id,
+  manifest_hash,
+  sbom_hash,
+  rules_hash,
+  policy_hash,
+  feed_hash,
+  created_at
+FROM scanner.manifest
+WHERE scan_id = 'scan-123'
+ORDER BY created_at DESC;
+```
+
+### Find Non-Deterministic Replays
+
+```sql
+SELECT 
+  scan_id,
+  COUNT(DISTINCT root_hash) as unique_hashes,
+  MIN(replayed_at) as first_replay,
+  MAX(replayed_at) as last_replay
+FROM scanner.replay_log
+GROUP BY scan_id
+HAVING COUNT(DISTINCT root_hash) > 1;
+```
+
+### Proof Bundle Statistics
+
+```sql
+SELECT 
+  DATE_TRUNC('day', created_at) as day,
+  COUNT(*) as bundles_created,
+  AVG(bundle_size_bytes) as avg_size,
+  SUM(bundle_size_bytes) as total_size
+FROM scanner.proof_bundle
+WHERE created_at > NOW() - INTERVAL '30 days'
+GROUP BY DATE_TRUNC('day', created_at)
+ORDER BY day DESC;
+```
+
+---
+
+## Appendix B: CLI Quick Reference
+
+```bash
+# Score Replay Commands
+stella score replay --scan <id>              # Replay score computation
+stella score replay --scan <id> --freeze <ts> # Replay with frozen time
+stella score bundle --scan <id>              # Get proof bundle
+stella score verify --scan <id> --root-hash <hash>  # Verify score
+
+# Proof Commands
+stella proof verify --bundle <path>          # Verify bundle file
+stella proof verify --bundle <path> --offline # Offline verification
+stella proof spine --bundle <path>           # Show Merkle spine
+
+# Output Formats
+--output json                                # JSON output
+--output table                               # Table output (default)
+--output yaml                                # YAML output
+```
+
+---
+
+## Revision History
+
+| Version | Date | Author | Changes |
+|---------|------|--------|---------|
+| 1.0.0 | 2025-12-20 | Agent | Initial release |