StellaOps Bot 4b3db9ca85 docs(ops): Complete operations runbooks for Epic 3500

Sprint 3500.0004.0004 (Documentation & Handoff) - T2 DONE

Operations Runbooks Added:
- score-replay-runbook.md: Deterministic replay procedures
- proof-verification-runbook.md: DSSE/Merkle verification ops
- airgap-operations-runbook.md: Offline kit management

CLI Reference Docs:
- reachability-cli-reference.md
- score-proofs-cli-reference.md
- unknowns-cli-reference.md

Air-Gap Guides:
- score-proofs-reachability-airgap-runbook.md

Training Materials:
- score-proofs-concept-guide.md

UI API Clients:
- proof.client.ts
- reachability.client.ts
- unknowns.client.ts

All 5 operations runbooks now complete (reachability, unknowns-queue,
score-replay, proof-verification, airgap-operations).

2025-12-20 22:30:02 +02:00

Score Replay Operations Runbook

Version: 1.0.0
Sprint: 3500.0004.0004
Last Updated: 2025-12-20

This runbook covers operational procedures for Score Replay: verifying that score computation is deterministic, validating proof bundles, and troubleshooting replay discrepancies.

Table of Contents

1. Overview
2. Score Replay Operations
3. Determinism Verification
4. Proof Bundle Management
5. Troubleshooting
6. Monitoring & Alerting
7. Escalation Procedures

1. Overview

What is Score Replay?

Score Replay is the ability to re-execute a vulnerability score computation using the exact same inputs (SBOM, rules, policies, feeds) that were used in the original scan. This provides:

- Auditability: Prove that a score was computed correctly
- Determinism verification: Confirm that identical inputs produce identical outputs
- Compliance evidence: Generate proof bundles for regulatory requirements
- Dispute resolution: Verify contested scan results

Key Concepts

Term	Definition
Manifest	Content-addressed record of all scoring inputs (SBOM hash, rules hash, policy hash, feed hash)
Proof Bundle	Signed attestation containing manifest, score, and Merkle proof
Root Hash	Merkle tree root computed from all input hashes
DSSE Envelope	Dead Simple Signing Envelope containing the signed proof
Freeze Timestamp	Optional timestamp to replay scoring at a specific point in time
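
To make the Root Hash row concrete, the following is a minimal Python sketch of a Merkle root folded from the four input hashes a manifest records. The leaf order, the repetition of the last node on odd levels, and the plain SHA-256 pairing are all assumptions made for illustration; this runbook does not specify how the Attestor library builds its tree, so the sketch is not expected to reproduce a production root hash.

```python
import hashlib


def merkle_root(leaf_hashes):
    """Fold 'sha256:<hex>' leaf digests into one 'sha256:<hex>' root."""
    if not leaf_hashes:
        raise ValueError("at least one leaf hash is required")
    # Decode each leaf from its hex form into raw digest bytes.
    level = [bytes.fromhex(h.split(":", 1)[1]) for h in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: an odd level repeats its last node.
            level.append(level[-1])
        # Hash adjacent pairs to build the next level up.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return "sha256:" + level[0].hex()


def digest(data):
    """Helper for the demo: SHA-256 of some bytes in 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Hypothetical stand-ins for the SBOM, rules, policy, and feed hashes.
inputs = [digest(b"sbom"), digest(b"rules"), digest(b"policy"), digest(b"feed")]
root = merkle_root(inputs)
print(root)
```

Changing any single input hash changes the root, which is why a root hash mismatch is enough to show that at least one scoring input differs.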

Architecture Components

Component	Purpose	Location
Score Engine	Computes vulnerability scores	Scanner Worker
Manifest Store	Persists scoring manifests	`scanner.manifest` table
Proof Chain	Generates Merkle proofs	Attestor library
Signer	Signs proof bundles (DSSE)	Signer service

2. Score Replay Operations

2.1 Triggering a Score Replay

Via CLI

```
# Basic replay
stella score replay --scan <scan-id>

# Replay with specific manifest
stella score replay --scan <scan-id> --manifest-hash sha256:abc123...

# Replay with frozen timestamp (for determinism testing)
stella score replay --scan <scan-id> --freeze 2025-01-15T00:00:00Z

# Output as JSON
stella score replay --scan <scan-id> --output json
```

Via API

```
# POST /api/v1/scanner/score/{scanId}/replay
curl -X POST "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/replay" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "manifestHash": "sha256:abc123...",
    "freezeTimestamp": "2025-01-15T00:00:00Z"
  }'
```

Expected Response

```
{
  "scanId": "scan-123",
  "score": 7.5,
  "rootHash": "sha256:def456...",
  "bundleUri": "/api/v1/scanner/scans/scan-123/proofs/sha256:def456...",
  "manifestHash": "sha256:abc123...",
  "replayedAt": "2025-01-16T10:30:00Z",
  "deterministic": true
}
```
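
For scripted use, the same request and response can be handled from Python's standard library. The sketch below only builds the request and parses a sample response; it does not contact a server. The helper names `build_replay_request` and `parse_replay_response` are hypothetical, and omitting unset fields from the request body is an assumption about what the API accepts.

```python
import json
import urllib.request


def build_replay_request(base_url, scan_id, token,
                         manifest_hash=None, freeze_timestamp=None):
    """Build (but do not send) the POST request for a score replay."""
    body = {}
    if manifest_hash is not None:
        body["manifestHash"] = manifest_hash
    if freeze_timestamp is not None:
        body["freezeTimestamp"] = freeze_timestamp
    return urllib.request.Request(
        url=f"{base_url}/api/v1/scanner/score/{scan_id}/replay",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_replay_response(raw):
    """Pick out the fields an operator would typically record."""
    doc = json.loads(raw)
    return {
        "score": doc["score"],
        "root_hash": doc["rootHash"],
        "deterministic": doc["deterministic"],
    }


req = build_replay_request(
    "https://scanner.stellaops.local", "scan-123", "example-token",
    manifest_hash="sha256:abc123...",
    freeze_timestamp="2025-01-15T00:00:00Z",
)
# Sample body shaped like the expected response above.
sample = ('{"scanId": "scan-123", "score": 7.5, '
          '"rootHash": "sha256:def456...", "deterministic": true}')
summary = parse_replay_response(sample)
print(req.get_method(), req.full_url)
print(summary)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.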

2.2 Retrieving Proof Bundles

Via CLI

```
# Get bundle for a scan
stella score bundle --scan <scan-id>

# Download bundle to file
stella score bundle --scan <scan-id> --output bundle.tar.gz
```

Via API

```
# GET /api/v1/scanner/score/{scanId}/bundle
curl "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/bundle" \
  -H "Authorization: Bearer $TOKEN" \
  -o bundle.tar.gz
```

2.3 Verifying Score Integrity

Via CLI

```
# Verify against expected root hash
stella score verify --scan <scan-id> --root-hash sha256:def456...

# Verify downloaded bundle
stella proof verify --bundle bundle.tar.gz
```

Via API

```
# POST /api/v1/scanner/score/{scanId}/verify
curl -X POST "https://scanner.stellaops.local/api/v1/scanner/score/scan-123/verify" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"expectedRootHash": "sha256:def456..."}'
```

3. Determinism Verification

3.1 What Affects Determinism?

Score computation is deterministic when:

Input	Requirement
SBOM	Identical content (same hash)
Rules	Same rule version and configuration
Policy	Same policy document
Feeds	Same feed snapshot (freeze timestamp)
Ordering	Findings sorted deterministically
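
The Ordering row matters because a hash is computed over bytes, not over a logical set. The sketch below shows one way to make a digest independent of input order and key order; the finding fields (`id`, `package`, `severity`) and the sort key are hypothetical and are not taken from the Scanner's actual schema.

```python
import hashlib
import json


def canonical_digest(findings):
    """Hash findings so that list order and key order do not matter."""
    # Sort on a stable key so every run sees the same sequence.
    ordered = sorted(findings, key=lambda f: (f["id"], f["package"]))
    # sort_keys plus fixed separators gives one byte sequence per value.
    payload = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical findings; identifiers and field names are illustrative only.
a = [
    {"id": "finding-001", "package": "openssl", "severity": 7.5},
    {"id": "finding-002", "package": "zlib", "severity": 5.3},
]
# Same findings, different list order and different key order.
b = [
    {"severity": 5.3, "package": "zlib", "id": "finding-002"},
    {"package": "openssl", "id": "finding-001", "severity": 7.5},
]
same = canonical_digest(a) == canonical_digest(b)
print(same)
```

Without the explicit sort, `a` and `b` would serialize differently and yield different digests even though they describe the same findings.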

3.2 Running Determinism Checks

```
# Run replay twice and compare
REPLAY1=$(stella score replay --scan "$SCAN_ID" --output json)
REPLAY2=$(stella score replay --scan "$SCAN_ID" --output json)

# Extract root hashes (variables are quoted so the JSON is passed through intact)
HASH1=$(echo "$REPLAY1" | jq -r '.rootHash')
HASH2=$(echo "$REPLAY2" | jq -r '.rootHash')

# Compare
if [ "$HASH1" = "$HASH2" ]; then
  echo "✓ Determinism verified: $HASH1"
else
  echo "✗ Non-deterministic! $HASH1 != $HASH2"
  exit 1
fi
```
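
Two runs can agree by chance, so a check over several runs is sometimes preferable. The following sketch generalizes the comparison to N replays. The `replay` argument is any callable returning the parsed replay JSON; the stubs below stand in for real CLI or API calls, and `check_determinism` is a hypothetical helper, not part of the `stella` tooling.

```python
def check_determinism(replay, runs=3):
    """Call replay() several times and report the distinct root hashes."""
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    hashes = [replay()["rootHash"] for _ in range(runs)]
    distinct = sorted(set(hashes))
    return {"deterministic": len(distinct) == 1, "root_hashes": distinct}


# Stub that always returns the same root hash.
def stable_replay():
    return {"rootHash": "sha256:aaa"}


# Stub that returns a new root hash on every call.
_counter = iter(range(100))


def drifting_replay():
    return {"rootHash": f"sha256:{next(_counter):03d}"}


stable = check_determinism(stable_replay)
drifting = check_determinism(drifting_replay)
print(stable)
print(drifting)
```

In practice `replay` could wrap `subprocess.run` around the `stella score replay ... --output json` command shown above and pass its standard output through `json.loads`.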

3.3 Common Determinism Issues

Issue	Cause	Resolution
Different root hash	Feed data changed between replays	Use `--freeze` timestamp
Score drift	Rule version mismatch	Pin rules version in manifest
Ordering differences	Non-stable sort in findings	Check Scanner version (fixed in v2.1+)
Timestamp in output	Current time in computation	Ensure frozen time mode

3.4 Feed Freeze for Reproducibility

```
# Replay with feed state frozen to original scan time
stella score replay --scan "$SCAN_ID" \
  --freeze "$(stella scan show "$SCAN_ID" --output json | jq -r '.scannedAt')"
```

4. Proof Bundle Management

4.1 Bundle Contents

A proof bundle (.tar.gz) contains:

```
bundle/
├── manifest.json       # Input hashes and metadata
├── score.json          # Computed score and findings summary
├── merkle-proof.json   # Merkle tree with inclusion proofs
├── dsse-envelope.json  # Signed attestation (DSSE format)
└── certificate.pem     # Signing certificate (optional)
```
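
Before running a full verification, it can help to confirm that a bundle has the expected layout. This sketch uses Python's `tarfile` module to list which required files are absent; `certificate.pem` is treated as optional, as in the listing above. The helper names are hypothetical, and the in-memory bundle exists only so the example runs without a real file.

```python
import io
import tarfile

REQUIRED = {"manifest.json", "score.json",
            "merkle-proof.json", "dsse-envelope.json"}


def missing_members(fileobj):
    """Return required bundle files absent from a .tar.gz stream."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        # Keep only entries under bundle/ and drop that prefix.
        present = {
            name.split("/", 1)[1]
            for name in tar.getnames()
            if name.startswith("bundle/")
        }
    return sorted(REQUIRED - present)


def make_bundle(names):
    """Build an in-memory bundle of empty JSON files, for demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in names:
            data = b"{}"
            info = tarfile.TarInfo(name=f"bundle/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


complete = missing_members(make_bundle(sorted(REQUIRED)))
partial = missing_members(
    make_bundle(["score.json", "merkle-proof.json", "dsse-envelope.json"])
)
print(complete, partial)
```

For a downloaded bundle, `missing_members(open("bundle.tar.gz", "rb"))` performs the same check against the file on disk.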

4.2 Inspecting Bundles

```
# Extract and view manifest
tar -xzf bundle.tar.gz
jq . bundle/manifest.json

# Verify DSSE signature
stella proof verify --bundle bundle.tar.gz --verbose

# Check Merkle proof
stella proof spine --bundle bundle.tar.gz
```

4.3 Bundle Retention Policy

Environment	Retention	Notes
Production	7 years	Regulatory compliance
Staging	90 days	Testing purposes
Development	30 days	Cleaned up automatically
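
The retention periods above translate into purge dates as in the sketch below. The environment keys and the `purge_date` helper are hypothetical; the handling of a 29 February creation date (moved to 28 February in a non-leap target year) is an assumption, since the policy does not say how that case is treated.

```python
from datetime import date, timedelta


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only reachable for Feb 29 when the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


def purge_date(environment, created):
    """Return the first date on which a bundle may be purged."""
    if environment == "production":
        return add_years(created, 7)
    if environment == "staging":
        return created + timedelta(days=90)
    if environment == "development":
        return created + timedelta(days=30)
    raise ValueError(f"unknown environment: {environment}")


for env in ("production", "staging", "development"):
    print(env, purge_date(env, date(2025, 1, 1)))
```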

4.4 Archiving Bundles

```
# Export bundle to long-term storage
stella score bundle --scan "$SCAN_ID" --output "/archive/proofs/$SCAN_ID.tar.gz"

# Bulk export for compliance audit
stella score bundle-export \
  --since 2024-01-01 \
  --until 2024-12-31 \
  --output /archive/2024-proofs/
```

5. Troubleshooting

5.1 Replay Returns Different Score

Symptoms: Replayed score differs from original scan score.

Diagnostic Steps:

1. Check manifest integrity:

```
stella scan show "$SCAN_ID" --output json | jq '.manifest'
```

2. Verify feed state:

```
# Replay at the original scan time and compare the manifest hash with the original
stella score replay --scan "$SCAN_ID" --freeze "$ORIGINAL_TIME" --output json | jq '.manifestHash'
```

3. Check for rule updates:

```
stella rules show --version --output json
```

Resolution:

- Use a `--freeze` timestamp matching the original scan
- Pin rule versions in policy
- Regenerate the manifest if inputs changed legitimately

5.2 Proof Verification Fails

Symptoms: `stella proof verify` returns validation errors.

Diagnostic Steps:

1. Check DSSE signature:

```
stella proof verify --bundle bundle.tar.gz --verbose 2>&1 | grep -i signature
```

2. Verify certificate validity (requires the extracted bundle; the certificate file is optional):

```
openssl x509 -in bundle/certificate.pem -noout -dates
```

3. Check Merkle proof:

```
stella proof spine --bundle bundle.tar.gz --verify
```

Common Errors:

Error	Cause	Fix
`SIGNATURE_INVALID`	Bundle tampered or wrong key	Re-download bundle
`CERTIFICATE_EXPIRED`	Signing cert expired	Check signing key rotation
`MERKLE_MISMATCH`	Root hash doesn't match	Verify correct bundle version
`MANIFEST_MISSING`	Incomplete bundle	Re-export from API

5.3 Replay Timeout

Symptoms: Replay request times out or takes too long.

Diagnostic Steps:

1. Check scan size:

```
stella scan show "$SCAN_ID" --output json | jq '.findingsCount'
```

2. Monitor replay progress:

```
stella score replay --scan "$SCAN_ID" --verbose
```

Resolution:

- For large scans (>10k findings), increase the timeout
- Check Scanner Worker health
- Consider async replay for very large scans

5.4 Missing Manifest

Symptoms: Replay fails with a "manifest not found" error.

Diagnostic Steps:

1. Verify scan exists:

```
stella scan show "$SCAN_ID"
```

2. Check manifest table:

```
SELECT * FROM scanner.manifest WHERE scan_id = 'scan-123';
```

Resolution:

- Manifest may have been purged (check retention policy)
- Restore from backup if available
- Re-run the scan if the original inputs are available

6. Monitoring & Alerting

6.1 Key Metrics

Metric	Description	Alert Threshold
`score_replay_duration_ms`	Time to complete replay	p99 > 30s
`score_replay_determinism_failures_total`	Non-deterministic replays	> 0
`proof_verification_failures`	Failed verifications	> 5/hour
`manifest_storage_size_bytes`	Manifest table size	> 100GB

6.2 Grafana Dashboard Queries

```
# Replay latency
histogram_quantile(0.99,
  rate(score_replay_duration_ms_bucket[5m])
)

# Determinism failure rate
rate(score_replay_determinism_failures_total[1h])

# Proof verification success rate
sum(rate(proof_verification_success_total[1h])) /
sum(rate(proof_verification_total[1h]))
```
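
To interpret the p99 value, it helps to see how a quantile is estimated from histogram buckets. The sketch below is a simplified Python rendering of the linear interpolation that Prometheus documents for `histogram_quantile`; it ignores rate windows, labels, and other edge cases, and the bucket boundaries are hypothetical rather than the Scanner's real configuration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Beyond the last finite bound: report that bound.
                return lower
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return lower + (upper - lower) * fraction
        lower, prev_count = upper, count
    return lower


# Hypothetical cumulative counts for replay durations in milliseconds.
buckets = [(1000, 50), (5000, 90), (30000, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)
p75 = histogram_quantile(0.75, buckets)
print(p50, p75)
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the buckets are spaced around the alert threshold.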

6.3 Alert Rules

```
groups:
  - name: score-replay
    rules:
      - alert: ScoreReplayLatencyHigh
        expr: histogram_quantile(0.99, rate(score_replay_duration_ms_bucket[5m])) > 30000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Score replay latency exceeds 30s at p99

      - alert: DeterminismFailure
        expr: increase(score_replay_determinism_failures_total[1h]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Non-deterministic score replay detected
```

7. Escalation Procedures

7.1 Escalation Matrix

Severity	Condition	Response Time	Escalate To
P1 - Critical	Determinism failure in production	15 minutes	Platform Team Lead
P2 - High	Proof verification failures > 10/hour	1 hour	Scanner Team
P3 - Medium	Replay latency degradation	4 hours	Scanner Team
P4 - Low	Single replay failure	Next business day	Support Queue
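
For teams that route alerts programmatically, the matrix can be expressed as a small decision function. This is a sketch only: the `classify` helper and its parameter names are hypothetical, and the rows are checked from most to least severe so that the highest applicable severity wins.

```python
def classify(determinism_failure_in_prod, proof_failures_per_hour,
             latency_degraded):
    """Map observed conditions onto the escalation matrix rows."""
    if determinism_failure_in_prod:
        return ("P1", "15 minutes", "Platform Team Lead")
    if proof_failures_per_hour > 10:
        return ("P2", "1 hour", "Scanner Team")
    if latency_degraded:
        return ("P3", "4 hours", "Scanner Team")
    # Anything else is handled as a single replay failure.
    return ("P4", "Next business day", "Support Queue")


print(classify(False, 12, True))
```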

7.2 P1: Determinism Failure Response

Immediate Actions (0-15 min):
- Capture affected scan IDs
- Preserve original manifest data
- Check for recent deployments

Investigation (15-60 min):
- Compare input hashes between replays
- Check feed synchronization status
- Review rule engine logs

Remediation:
- Roll back if deployment-related
- Freeze feeds if data drift is the cause
- Hotfix if a code bug is identified
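
For the investigation step that compares input hashes between replays, a field-by-field diff narrows down which input drifted. The sketch below uses the hash column names from the `scanner.manifest` query in Appendix A; the `changed_inputs` helper and the sample values are hypothetical.

```python
HASH_FIELDS = ("sbom_hash", "rules_hash", "policy_hash", "feed_hash")


def changed_inputs(original, replay):
    """List the input hash fields whose values differ between two manifests."""
    return [f for f in HASH_FIELDS if original.get(f) != replay.get(f)]


# Hypothetical manifest rows; only the feed hash differs.
original = {"sbom_hash": "sha256:s1", "rules_hash": "sha256:r1",
            "policy_hash": "sha256:p1", "feed_hash": "sha256:f1"}
replay = dict(original, feed_hash="sha256:f2")

drift = changed_inputs(original, replay)
print(drift)
```

A result containing only `feed_hash` points toward feed drift, while `rules_hash` points toward a rule version mismatch, matching the causes listed in section 3.3.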

7.3 Contacts

Role	Contact	Availability
Scanner Team Lead	scanner-lead@stellaops.io	Business hours
Platform On-Call	platform-oncall@stellaops.io	24/7
Security Team	security@stellaops.io	Business hours

Appendix A: SQL Queries

Check Manifest History

```
SELECT
  scan_id,
  manifest_hash,
  sbom_hash,
  rules_hash,
  policy_hash,
  feed_hash,
  created_at
FROM scanner.manifest
WHERE scan_id = 'scan-123'
ORDER BY created_at DESC;
```

Find Non-Deterministic Replays

```
SELECT
  scan_id,
  COUNT(DISTINCT root_hash) AS unique_hashes,
  MIN(replayed_at) AS first_replay,
  MAX(replayed_at) AS last_replay
FROM scanner.replay_log
GROUP BY scan_id
HAVING COUNT(DISTINCT root_hash) > 1;
```

Proof Bundle Statistics

```
SELECT
  DATE_TRUNC('day', created_at) AS day,
  COUNT(*) AS bundles_created,
  AVG(bundle_size_bytes) AS avg_size,
  SUM(bundle_size_bytes) AS total_size
FROM scanner.proof_bundle
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', created_at)
ORDER BY day DESC;
```

Appendix B: CLI Quick Reference

# Score Replay Commands
stella score replay --scan <id>              # Replay score computation
stella score replay --scan <id> --freeze <ts> # Replay with frozen time
stella score bundle --scan <id>              # Get proof bundle
stella score verify --scan <id> --root-hash <hash>  # Verify score

# Proof Commands
stella proof verify --bundle <path>          # Verify bundle file
stella proof verify --bundle <path> --offline # Offline verification
stella proof spine --bundle <path>           # Show Merkle spine

# Output Formats
--output json                                # JSON output
--output table                               # Table output (default)
--output yaml                                # YAML output

Revision History

Version	Date	Author	Changes
1.0.0	2025-12-20	Agent	Initial release
