docs(ops): Complete operations runbooks for Epic 3500
Sprint 3500.0004.0004 (Documentation & Handoff) - T2 DONE

Operations Runbooks Added:
- score-replay-runbook.md: Deterministic replay procedures
- proof-verification-runbook.md: DSSE/Merkle verification ops
- airgap-operations-runbook.md: Offline kit management

CLI Reference Docs:
- reachability-cli-reference.md
- score-proofs-cli-reference.md
- unknowns-cli-reference.md

Air-Gap Guides:
- score-proofs-reachability-airgap-runbook.md

Training Materials:
- score-proofs-concept-guide.md

UI API Clients:
- proof.client.ts
- reachability.client.ts
- unknowns.client.ts

All 5 operations runbooks now complete (reachability, unknowns-queue, score-replay, proof-verification, airgap-operations).
518
docs/operations/score-replay-runbook.md
Normal file
@@ -0,0 +1,518 @@

# Score Replay Operations Runbook

> **Version**: 1.0.0
> **Sprint**: 3500.0004.0004
> **Last Updated**: 2025-12-20

This runbook covers operational procedures for Score Replay, including deterministic score computation verification, proof bundle validation, and troubleshooting replay discrepancies.

---

## Table of Contents

1. [Overview](#1-overview)
2. [Score Replay Operations](#2-score-replay-operations)
3. [Determinism Verification](#3-determinism-verification)
4. [Proof Bundle Management](#4-proof-bundle-management)
5. [Troubleshooting](#5-troubleshooting)
6. [Monitoring & Alerting](#6-monitoring--alerting)
7. [Escalation Procedures](#7-escalation-procedures)

---

## 1. Overview

### What is Score Replay?

Score Replay is the ability to re-execute a vulnerability score computation using the exact same inputs (SBOM, rules, policies, feeds) that were used in the original scan. This provides:

- **Auditability**: Prove that a score was computed correctly
- **Determinism verification**: Confirm that identical inputs produce identical outputs
- **Compliance evidence**: Generate proof bundles for regulatory requirements
- **Dispute resolution**: Verify contested scan results

### Key Concepts

| Term | Definition |
|------|------------|
| **Manifest** | Content-addressed record of all scoring inputs (SBOM hash, rules hash, policy hash, feed hash) |
| **Proof Bundle** | Signed attestation containing manifest, score, and Merkle proof |
| **Root Hash** | Merkle tree root computed from all input hashes |
| **DSSE Envelope** | Dead Simple Signing Envelope containing the signed proof |
| **Freeze Timestamp** | Optional timestamp to replay scoring at a specific point in time |
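
To make the **Root Hash** definition concrete, here is a minimal sketch of a binary Merkle root over a set of input hashes. It is an illustration only: the leaf ordering (sorted by input name), the odd-node duplication rule, and the absence of domain-separation prefixes are assumptions, and the placeholder digests are hypothetical. The Attestor library's actual tree construction may differ.

```python
import hashlib


def merkle_root(leaf_hashes):
    """Return the hex root of a simple binary Merkle tree over hex-encoded leaves."""
    if not leaf_hashes:
        raise ValueError("at least one leaf hash is required")
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: duplicate the last node when a level has an odd count.
            level.append(level[-1])
        # Hash each adjacent pair to build the next level up.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical digests standing in for the four scoring inputs.
inputs = {
    "sbom": hashlib.sha256(b"example sbom").hexdigest(),
    "rules": hashlib.sha256(b"example rules").hexdigest(),
    "policy": hashlib.sha256(b"example policy").hexdigest(),
    "feed": hashlib.sha256(b"example feed snapshot").hexdigest(),
}

# Assumption: leaves are ordered by input name so the root is reproducible.
leaves = [inputs[name] for name in sorted(inputs)]
root = merkle_root(leaves)
print(f"sha256:{root}")
```

Because every parent depends on both children, changing any single input hash (or the leaf order) changes the root, which is why comparing root hashes is enough to detect a differing input.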

### Architecture Components

| Component | Purpose | Location |
|-----------|---------|----------|
| Score Engine | Computes vulnerability scores | Scanner Worker |
| Manifest Store | Persists scoring manifests | `scanner.manifest` table |
| Proof Chain | Generates Merkle proofs | Attestor library |
| Signer | Signs proof bundles (DSSE) | Signer service |

---

## 2. Score Replay Operations

### 2.1 Triggering a Score Replay

#### Via CLI

```bash
# Basic replay
stella score replay --scan <scan-id>

# Replay with specific manifest
stella score replay --scan <scan-id> --manifest-hash sha256:abc123...

# Replay with frozen timestamp (for determinism testing)
stella score replay --scan <scan-id> --freeze 2025-01-15T00:00:00Z

# Output as JSON
stella score replay --scan <scan-id> --output json
```

#### Via API

```bash
# POST /api/v1/scanner/score/{scanId}/replay
curl -X POST "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/replay" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "manifestHash": "sha256:abc123...",
    "freezeTimestamp": "2025-01-15T00:00:00Z"
  }'
```

#### Expected Response

```json
{
  "scanId": "scan-123",
  "score": 7.5,
  "rootHash": "sha256:def456...",
  "bundleUri": "/api/v1/scanner/scans/scan-123/proofs/sha256:def456...",
  "manifestHash": "sha256:abc123...",
  "replayedAt": "2025-01-16T10:30:00Z",
  "deterministic": true
}
```
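
For scripted use, the same request can be assembled programmatically. The sketch below builds (but does not send) the replay request with Python's standard library, using the endpoint and body fields shown above. The helper name `build_replay_request`, the token value, and the choice to omit unset fields from the body are assumptions for illustration, not part of any official client.

```python
import json
import urllib.parse
import urllib.request


def build_replay_request(base_url, scan_id, token,
                         manifest_hash=None, freeze_timestamp=None):
    """Build a POST request for the replay endpoint without sending it."""
    body = {}
    if manifest_hash is not None:
        body["manifestHash"] = manifest_hash
    if freeze_timestamp is not None:
        body["freezeTimestamp"] = freeze_timestamp
    # Percent-encode the scan ID so it is safe as a single path segment.
    path = f"/api/v1/scanner/score/{urllib.parse.quote(scan_id, safe='')}/replay"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_replay_request(
    "https://scanner.stellaops.local",
    "scan-123",
    "example-token",  # placeholder, not a real credential
    freeze_timestamp="2025-01-15T00:00:00Z",
)
print(req.get_method(), req.full_url)
```

Sending is left to the caller, for example with `urllib.request.urlopen(req, timeout=60)`; the timeout value is likewise an assumption and should reflect the expected replay duration.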

### 2.2 Retrieving Proof Bundles

#### Via CLI

```bash
# Get bundle for a scan
stella score bundle --scan <scan-id>

# Download bundle to file
stella score bundle --scan <scan-id> --output bundle.tar.gz
```

#### Via API

```bash
# GET /api/v1/scanner/score/{scanId}/bundle
curl "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/bundle" \
  -H "Authorization: Bearer $TOKEN" \
  -o bundle.tar.gz
```

### 2.3 Verifying Score Integrity

#### Via CLI

```bash
# Verify against expected root hash
stella score verify --scan <scan-id> --root-hash sha256:def456...

# Verify downloaded bundle
stella proof verify --bundle bundle.tar.gz
```

#### Via API

```bash
# POST /api/v1/scanner/score/{scanId}/verify
curl -X POST "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/verify" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"expectedRootHash": "sha256:def456..."}'
```

---

## 3. Determinism Verification

### 3.1 What Affects Determinism?

Score computation is deterministic when every input below meets its requirement:

| Input | Requirement |
|-------|-------------|
| SBOM | Identical content (same hash) |
| Rules | Same rule version and configuration |
| Policy | Same policy document |
| Feeds | Same feed snapshot (freeze timestamp) |
| Ordering | Findings sorted deterministically |
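
"Same hash" only implies "same content" if the content is serialized the same way every time. The following sketch shows one common way to get that property: hash a canonical JSON form with sorted keys and fixed separators. The field names and the canonicalization rule are assumptions for illustration; the runbook does not specify how the Scanner actually serializes manifests.

```python
import hashlib
import json


def manifest_hash(manifest):
    """Hash a manifest dict via a canonical JSON form (sorted keys, no whitespace)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest fields, one per scoring input.
original = {
    "sbomHash": "sha256:aaa",
    "rulesHash": "sha256:bbb",
    "policyHash": "sha256:ccc",
    "feedHash": "sha256:ddd",
}

# Same content with keys in the opposite order.
reordered = dict(reversed(list(original.items())))

# Same content except for a newer feed snapshot.
drifted = dict(original, feedHash="sha256:eee")

print("reordered matches:", manifest_hash(original) == manifest_hash(reordered))
print("drifted matches:  ", manifest_hash(original) == manifest_hash(drifted))
```

Key order does not affect the canonical hash, while a change to any single input does, which is the behavior a content-addressed manifest relies on.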

### 3.2 Running Determinism Checks

```bash
# Run replay twice and compare
REPLAY1=$(stella score replay --scan "$SCAN_ID" --output json)
REPLAY2=$(stella score replay --scan "$SCAN_ID" --output json)

# Extract root hashes
HASH1=$(echo "$REPLAY1" | jq -r '.rootHash')
HASH2=$(echo "$REPLAY2" | jq -r '.rootHash')

# Compare
if [ "$HASH1" = "$HASH2" ]; then
  echo "✓ Determinism verified: $HASH1"
else
  echo "✗ Non-deterministic! $HASH1 != $HASH2"
  exit 1
fi
```

### 3.3 Common Determinism Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| Different root hash | Feed data changed between replays | Use `--freeze` timestamp |
| Score drift | Rule version mismatch | Pin rules version in manifest |
| Ordering differences | Non-stable sort in findings | Check Scanner version (fixed in v2.1+) |
| Timestamp in output | Current time in computation | Ensure frozen time mode |
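
The "Ordering differences" row is easiest to understand with an example. If findings are sorted on a key that can tie (such as score alone), the output order depends on the input order, and any hash computed over the list changes with it. The sketch below sorts on a total-order key so every permutation of the same findings produces the same sequence. The `Finding` fields and the tie-break order are hypothetical, not the Scanner's actual data model.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Finding:
    vuln_id: str
    package: str
    version: str
    score: float


def canonical_order(findings):
    """Sort by score (highest first), then tie-break on every identifying field."""
    return sorted(
        findings,
        key=lambda f: (-f.score, f.vuln_id, f.package, f.version),
    )


findings = [
    Finding("CVE-2024-0002", "libfoo", "1.2.0", 7.5),
    Finding("CVE-2024-0001", "libbar", "3.1.4", 7.5),  # same score as above
    Finding("CVE-2024-0003", "libbaz", "0.9.1", 9.8),
]

# Every input permutation should yield exactly one canonical ordering.
orderings = {tuple(canonical_order(list(p))) for p in permutations(findings)}
print("distinct orderings:", len(orderings))
```

Sorting on `score` alone would leave the two 7.5 findings in whatever order they arrived, which is the kind of instability the table describes.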

### 3.4 Feed Freeze for Reproducibility

```bash
# Replay with feed state frozen to original scan time
stella score replay --scan "$SCAN_ID" \
  --freeze "$(stella scan show "$SCAN_ID" --output json | jq -r '.scannedAt')"
```

---

## 4. Proof Bundle Management

### 4.1 Bundle Contents

A proof bundle (`.tar.gz`) contains:

```
bundle/
├── manifest.json       # Input hashes and metadata
├── score.json          # Computed score and findings summary
├── merkle-proof.json   # Merkle tree with inclusion proofs
├── dsse-envelope.json  # Signed attestation (DSSE format)
└── certificate.pem     # Signing certificate (optional)
```
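
Before running a full verification, it can be useful to confirm that a downloaded bundle is structurally complete. The sketch below checks a `.tar.gz` for the four non-optional files listed above using Python's `tarfile` module. The helper names are hypothetical, and the sample bundle is built in memory with placeholder contents purely so the example is self-contained.

```python
import io
import tarfile

# Files listed above as part of every bundle; certificate.pem is optional.
REQUIRED_FILES = {
    "manifest.json",
    "score.json",
    "merkle-proof.json",
    "dsse-envelope.json",
}


def missing_bundle_files(fileobj):
    """Return the required file names that are absent from a gzipped tar bundle."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        present = {
            member.name.rsplit("/", 1)[-1]
            for member in tar.getmembers()
            if member.isfile()
        }
    return REQUIRED_FILES - present


def make_sample_bundle(names):
    """Build an in-memory bundle with placeholder contents (demonstration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in names:
            data = b"{}"
            info = tarfile.TarInfo(name=f"bundle/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


# A deliberately incomplete bundle: merkle-proof.json is left out.
sample = make_sample_bundle(["manifest.json", "score.json", "dsse-envelope.json"])
print(sorted(missing_bundle_files(sample)))
```

For a real bundle, pass an open binary file instead, for example `missing_bundle_files(open("bundle.tar.gz", "rb"))`. A non-empty result corresponds to an incomplete bundle and is a reason to re-export it.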

### 4.2 Inspecting Bundles

```bash
# Extract and view manifest
tar -xzf bundle.tar.gz
jq . bundle/manifest.json

# Verify DSSE signature
stella proof verify --bundle bundle.tar.gz --verbose

# Check Merkle proof
stella proof spine --bundle bundle.tar.gz
```
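
When inspecting `dsse-envelope.json` by hand, it helps to know what a DSSE signature actually covers. Per the DSSE specification, the signer does not sign the raw payload but its pre-authentication encoding (PAE), which binds the payload type to the payload bytes. The sketch below decodes an envelope and reconstructs those signed bytes. The sample envelope content is hypothetical, standard (not URL-safe) base64 is assumed, and no cryptographic verification is performed here; that remains the job of `stella proof verify`.

```python
import base64
import json


def pae(payload_type, payload):
    """DSSE pre-authentication encoding: 'DSSEv1 LEN(type) type LEN(body) body'."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %s %d %s" % (
        len(type_bytes), type_bytes, len(payload), payload,
    )


def signed_bytes(envelope):
    """Reconstruct the exact bytes that each signature in the envelope covers."""
    payload = base64.b64decode(envelope["payload"])
    return pae(envelope["payloadType"], payload)


# Hypothetical envelope using the field names defined by the DSSE specification.
statement = json.dumps({"rootHash": "sha256:def456..."}).encode("utf-8")
envelope = {
    "payloadType": "application/vnd.in-toto+json",
    "payload": base64.b64encode(statement).decode("ascii"),
    "signatures": [{"keyid": "example-key", "sig": "placeholder"}],
}

print(signed_bytes(envelope).decode("utf-8"))
print("signature count:", len(envelope["signatures"]))
```

Because the payload type is part of the signed bytes, a signature cannot be replayed against the same payload under a different `payloadType`.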

### 4.3 Bundle Retention Policy

| Environment | Retention | Notes |
|-------------|-----------|-------|
| Production | 7 years | Regulatory compliance |
| Staging | 90 days | Testing purposes |
| Development | 30 days | Cleaned up automatically |
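
When deciding whether a missing bundle was purged by policy, the retention table translates into a simple date calculation. The sketch below computes the earliest purge date for a bundle from its creation date. The environment keys, the function name, and the handling of 29 February (rolled forward to 1 March) are assumptions; the actual cleanup job may compute expiry differently.

```python
from datetime import date, timedelta

# Retention periods from the table above, as (unit, amount) pairs.
RETENTION = {
    "production": ("years", 7),
    "staging": ("days", 90),
    "development": ("days", 30),
}


def purge_date(environment, created):
    """Return the first date on which a bundle is eligible for purging."""
    unit, amount = RETENTION[environment]
    if unit == "days":
        return created + timedelta(days=amount)
    try:
        return created.replace(year=created.year + amount)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return date(created.year + amount, 3, 1)


created = date(2025, 12, 20)
for env in RETENTION:
    print(f"{env}: {purge_date(env, created).isoformat()}")
```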

### 4.4 Archiving Bundles

```bash
# Export bundle to long-term storage
stella score bundle --scan "$SCAN_ID" --output "/archive/proofs/$SCAN_ID.tar.gz"

# Bulk export for compliance audit
stella score bundle-export \
  --since 2024-01-01 \
  --until 2024-12-31 \
  --output /archive/2024-proofs/
```

---

## 5. Troubleshooting

### 5.1 Replay Returns Different Score

**Symptoms**: Replayed score differs from original scan score.

**Diagnostic Steps**:

1. Check manifest integrity:
   ```bash
   stella scan show "$SCAN_ID" --output json | jq '.manifest'
   ```

2. Verify feed state:
   ```bash
   # Replay with feeds frozen to the original scan time and compare the resulting manifest hash
   stella score replay --scan "$SCAN_ID" --freeze "$ORIGINAL_TIME" --output json | jq '.manifestHash'
   ```

3. Check for rule updates:
   ```bash
   stella rules show --version --output json
   ```

**Resolution**:
- Use `--freeze` timestamp matching original scan
- Pin rule versions in policy
- Regenerate manifest if inputs changed legitimately
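
When the diagnostic steps above produce two manifests that differ, the next question is which input changed. A minimal sketch, assuming both manifests are available as flat dictionaries keyed by the hash columns of `scanner.manifest` (`sbom_hash`, `rules_hash`, `policy_hash`, `feed_hash`); the sample values are hypothetical:

```python
def changed_inputs(original, replayed):
    """Map each differing key to its (original, replayed) pair of values."""
    keys = sorted(set(original) | set(replayed))
    return {
        key: (original.get(key), replayed.get(key))
        for key in keys
        if original.get(key) != replayed.get(key)
    }


# Hypothetical manifests: only the feed snapshot differs.
original = {
    "sbom_hash": "sha256:aaa",
    "rules_hash": "sha256:bbb",
    "policy_hash": "sha256:ccc",
    "feed_hash": "sha256:ddd",
}
replayed = dict(original, feed_hash="sha256:eee")

for name, (before, after) in changed_inputs(original, replayed).items():
    print(f"{name}: {before} -> {after}")
```

A difference confined to `feed_hash` points to the `--freeze` resolution, while a difference in `rules_hash` points to pinning rule versions.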

### 5.2 Proof Verification Fails

**Symptoms**: `stella proof verify` returns validation errors.

**Diagnostic Steps**:

1. Check DSSE signature:
   ```bash
   stella proof verify --bundle bundle.tar.gz --verbose 2>&1 | grep -i signature
   ```

2. Verify certificate validity (requires an extracted bundle that includes the optional certificate):
   ```bash
   openssl x509 -in bundle/certificate.pem -noout -dates
   ```

3. Check Merkle proof:
   ```bash
   stella proof spine --bundle bundle.tar.gz --verify
   ```

**Common Errors**:

| Error | Cause | Fix |
|-------|-------|-----|
| `SIGNATURE_INVALID` | Bundle tampered or wrong key | Re-download bundle |
| `CERTIFICATE_EXPIRED` | Signing cert expired | Check signing key rotation |
| `MERKLE_MISMATCH` | Root hash doesn't match | Verify correct bundle version |
| `MANIFEST_MISSING` | Incomplete bundle | Re-export from API |

### 5.3 Replay Timeout

**Symptoms**: Replay request times out or takes too long.

**Diagnostic Steps**:

1. Check scan size:
   ```bash
   stella scan show "$SCAN_ID" --output json | jq '.findingsCount'
   ```

2. Monitor replay progress:
   ```bash
   stella score replay --scan "$SCAN_ID" --verbose
   ```

**Resolution**:
- For large scans (>10k findings), increase timeout
- Check Scanner Worker health
- Consider async replay for very large scans

### 5.4 Missing Manifest

**Symptoms**: `Manifest not found` error on replay.

**Diagnostic Steps**:

1. Verify scan exists:
   ```bash
   stella scan show "$SCAN_ID"
   ```

2. Check manifest table:
   ```sql
   SELECT * FROM scanner.manifest WHERE scan_id = 'scan-123';
   ```

**Resolution**:
- Manifest may have been purged (check retention policy)
- Restore from backup if available
- Re-run scan if original inputs available

---

## 6. Monitoring & Alerting

### 6.1 Key Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `score_replay_duration_ms` | Time to complete replay | p99 > 30s |
| `score_replay_determinism_failures` | Non-deterministic replays | > 0 |
| `proof_verification_failures` | Failed verifications | > 5/hour |
| `manifest_storage_size_bytes` | Manifest table size | > 100GB |

### 6.2 Grafana Dashboard Queries

```promql
# Replay latency
histogram_quantile(0.99,
  rate(score_replay_duration_ms_bucket[5m])
)

# Determinism failure rate
rate(score_replay_determinism_failures_total[1h])

# Proof verification success rate
sum(rate(proof_verification_success_total[1h])) /
sum(rate(proof_verification_total[1h]))
```

### 6.3 Alert Rules

```yaml
groups:
  - name: score-replay
    rules:
      - alert: ScoreReplayLatencyHigh
        expr: histogram_quantile(0.99, rate(score_replay_duration_ms_bucket[5m])) > 30000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Score replay latency exceeds 30s at p99

      - alert: DeterminismFailure
        expr: increase(score_replay_determinism_failures_total[1h]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Non-deterministic score replay detected
```

---

## 7. Escalation Procedures

### 7.1 Escalation Matrix

| Severity | Condition | Response Time | Escalate To |
|----------|-----------|---------------|-------------|
| P1 - Critical | Determinism failure in production | 15 minutes | Platform Team Lead |
| P2 - High | Proof verification failures > 10/hour | 1 hour | Scanner Team |
| P3 - Medium | Replay latency degradation | 4 hours | Scanner Team |
| P4 - Low | Single replay failure | Next business day | Support Queue |

### 7.2 P1: Determinism Failure Response

1. **Immediate Actions** (0-15 min):
   - Capture affected scan IDs
   - Preserve original manifest data
   - Check for recent deployments

2. **Investigation** (15-60 min):
   - Compare input hashes between replays
   - Check feed synchronization status
   - Review rule engine logs

3. **Remediation**:
   - Roll back if deployment-related
   - Freeze feeds if data drift
   - Hotfix if code bug identified

### 7.3 Contacts

| Role | Contact | Availability |
|------|---------|--------------|
| Scanner Team Lead | scanner-lead@stellaops.io | Business hours |
| Platform On-Call | platform-oncall@stellaops.io | 24/7 |
| Security Team | security@stellaops.io | Business hours |

---

## Appendix A: SQL Queries

### Check Manifest History

```sql
SELECT
    scan_id,
    manifest_hash,
    sbom_hash,
    rules_hash,
    policy_hash,
    feed_hash,
    created_at
FROM scanner.manifest
WHERE scan_id = 'scan-123'
ORDER BY created_at DESC;
```

### Find Non-Deterministic Replays

```sql
SELECT
    scan_id,
    COUNT(DISTINCT root_hash) as unique_hashes,
    MIN(replayed_at) as first_replay,
    MAX(replayed_at) as last_replay
FROM scanner.replay_log
GROUP BY scan_id
HAVING COUNT(DISTINCT root_hash) > 1;
```

### Proof Bundle Statistics

```sql
SELECT
    DATE_TRUNC('day', created_at) as day,
    COUNT(*) as bundles_created,
    AVG(bundle_size_bytes) as avg_size,
    SUM(bundle_size_bytes) as total_size
FROM scanner.proof_bundle
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', created_at)
ORDER BY day DESC;
```

---

## Appendix B: CLI Quick Reference

```bash
# Score Replay Commands
stella score replay --scan <id>                      # Replay score computation
stella score replay --scan <id> --freeze <ts>        # Replay with frozen time
stella score bundle --scan <id>                      # Get proof bundle
stella score verify --scan <id> --root-hash <hash>   # Verify score

# Proof Commands
stella proof verify --bundle <path>                  # Verify bundle file
stella proof verify --bundle <path> --offline        # Offline verification
stella proof spine --bundle <path>                   # Show Merkle spine

# Output Formats
--output json                                        # JSON output
--output table                                       # Table output (default)
--output yaml                                        # YAML output
```

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2025-12-20 | Agent | Initial release |