Score Proofs Operations Runbook

Version: 1.0.0
Sprint: 3500.0004.0004
Last Updated: 2025-12-20

This runbook covers operational procedures for Score Proofs, including score replay, proof verification, and troubleshooting.


Table of Contents

  1. Overview
  2. Score Replay Operations
  3. Proof Verification Operations
  4. Proof Bundle Management
  5. Troubleshooting
  6. Monitoring & Alerting
  7. Escalation Procedures

1. Overview

What are Score Proofs?

Score Proofs provide cryptographically verifiable audit trails for vulnerability scoring decisions. Each proof:

  • Records inputs: SBOM, feed snapshots, VEX data, policy hashes
  • Traces computation: Every scoring rule application
  • Signs results: DSSE envelopes with configurable trust anchors
  • Enables replay: Same inputs → same outputs (deterministic)
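
Deterministic replay rests on content addressing: identical input bytes always hash to the same identifier. The following is a minimal sketch, assuming SHA-256 over raw file bytes. The `content_id` helper is hypothetical, and the canonicalization Score Proofs applies before hashing is not shown here.

```shell
# Hypothetical helper: derive a "sha256:<hex>" identifier from a file's bytes.
# Assumes sha256sum (GNU coreutils) is on PATH.
content_id() {
  printf 'sha256:%s\n' "$(sha256sum "$1" | cut -d' ' -f1)"
}

# Two files with identical bytes yield the same identifier.
tmp=$(mktemp -d)
printf '{"policy":"p1","sbom":"s1"}' > "$tmp/a.json"
printf '{"policy":"p1","sbom":"s1"}' > "$tmp/b.json"
if [ "$(content_id "$tmp/a.json")" = "$(content_id "$tmp/b.json")" ]; then
  echo "identical inputs, identical id"
fi
rm -rf "$tmp"
```

A single changed byte in any input produces a different identifier, which is why a replay against a different feed snapshot is expected to yield a different proof.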

Key Components

| Component | Purpose | Location |
|---|---|---|
| Scan Manifest | Records all inputs deterministically | scanner.scan_manifest table |
| Proof Ledger | DAG of scoring computation nodes | scanner.proof_bundle table |
| DSSE Envelope | Cryptographic signature wrapper | In proof bundle JSON |
| Proof Bundle | ZIP archive for offline verification | Stored in object storage |

Prerequisites

  • Access to Scanner WebService API
  • scanner.proofs OAuth scope
  • The stella CLI installed and configured
  • Trust anchor public keys (for verification)

2. Score Replay Operations

2.1 When to Replay Scores

Score replay is needed when:

  • Feed updates: New advisories from Concelier
  • VEX updates: New VEX statements from Excititor
  • Policy changes: Updated scoring policy rules
  • Audit requests: Need to verify historical scores
  • Investigation: Analyze why a score changed

2.2 Manual Score Replay (API)

# Get current scan manifest
curl -s "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/manifest" \
  -H "Authorization: Bearer $TOKEN" | jq '.manifest'

# Replay with current feeds (uses latest snapshots)
curl -X POST "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/score/replay" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}' | jq '.scoreProof.rootHash'

# Replay with specific feed snapshot
curl -X POST "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/score/replay" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "overrides": {
      "concelierSnapshotHash": "sha256:specific-feed-snapshot..."
    }
  }'
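
The two request bodies above differ only in the optional `overrides` object, so a script can build either from one helper. This is a sketch only: `replay_body` is a hypothetical name, and it assumes the snapshot hash contains no quotes or backslashes that would need JSON escaping.

```shell
# Hypothetical helper: print the JSON body for a replay request.
# No argument  -> {} (replay against the latest snapshots).
# One argument -> pin concelierSnapshotHash to that value.
replay_body() {
  if [ -z "${1:-}" ]; then
    printf '{}\n'
  else
    printf '{"overrides":{"concelierSnapshotHash":"%s"}}\n' "$1"
  fi
}

replay_body
replay_body "sha256:abc"
```

The output can be passed to curl as `-d "$(replay_body "$SNAPSHOT_HASH")"`, where `SNAPSHOT_HASH` is a variable you set yourself.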

2.3 Manual Score Replay (CLI)

# Replay with current feeds
stella score replay --scan-id $SCAN_ID

# Replay with specific snapshot
stella score replay --scan-id $SCAN_ID \
  --feed-snapshot sha256:specific-feed-snapshot...

# Replay and compare with original
stella score replay --scan-id $SCAN_ID --diff

# Replay in offline mode (air-gap)
stella score replay --scan-id $SCAN_ID \
  --offline \
  --bundle /path/to/offline-bundle.zip

2.4 Batch Score Replay

For bulk replay (e.g., after major feed update):

# List all scans from last 7 days
stella scan list --since 7d --format json > scans.json

# Replay each scan; report failures without stopping the loop
jq -r '.[].scanId' scans.json | while read -r SCAN_ID; do
  echo "Replaying $SCAN_ID..."
  stella score replay --scan-id "$SCAN_ID" --quiet || echo "Replay failed: $SCAN_ID" >&2
done

# Or use the batch API endpoint (more efficient)
curl -X POST "https://scanner.example.com/api/v1/scanner/batch/replay" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "scanIds": ["scan-1", "scan-2", "scan-3"],
    "parallel": true,
    "maxConcurrency": 10
  }'
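
For a long scan list, the batch body can be generated instead of written by hand. The sketch below assumes scan IDs arrive one per line and contain no quotes or backslashes. `batch_body` is a hypothetical helper, and the field names are copied from the request shown above.

```shell
# Hypothetical helper: read scan IDs (one per line) from stdin and print a
# batch replay request body. $1 is maxConcurrency (defaults to 10).
batch_body() {
  awk -v max="${1:-10}" '
    NF { ids = ids (ids == "" ? "" : ",") "\"" $0 "\"" }   # skip blank lines
    END {
      printf "{\"scanIds\":[%s],\"parallel\":true,\"maxConcurrency\":%d}\n", ids, max
    }
  '
}

printf 'scan-1\nscan-2\nscan-3\n' | batch_body 10
```

Combined with the listing step, `jq -r '.[].scanId' scans.json | batch_body 10` would produce the body for the whole list.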

2.5 Nightly Replay Job

The Scheduler automatically replays scores when Concelier publishes new snapshots:

# Job configuration in Scheduler
job:
  name: nightly-score-replay
  schedule: "0 3 * * *"  # 3 AM daily
  trigger:
    type: concelier-snapshot-published
  action:
    type: batch-replay
    config:
      maxAge: 30d
      parallel: true
      maxConcurrency: 20

Monitoring the nightly job:

# Check job status
stella scheduler job status nightly-score-replay

# View recent runs
stella scheduler job runs nightly-score-replay --last 7

# Check for failures
stella scheduler job runs nightly-score-replay --status failed

3. Proof Verification Operations

3.1 Online Verification

# Verify via API
curl -X POST "https://scanner.example.com/api/v1/proofs/verify" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "proofBundleId": "sha256:proof123...",
    "checkRekor": true,
    "anchorIds": ["anchor-001"]
  }'

# Verify via CLI
stella proof verify --bundle-id sha256:proof123... --check-rekor

3.2 Offline Verification (Air-Gap)

For air-gapped environments:

# 1. Download proof bundle (on connected system)
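# -f makes curl fail on HTTP errors instead of saving an error body as the ZIP
curl -fsS -o proof-bundle.zip \
  -H "Authorization: Bearer $TOKEN" \
  "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/proofs/sha256:proof123..."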

# 2. Transfer to air-gapped system (USB, etc.)

# 3. Verify offline (on air-gapped system)
stella proof verify --bundle proof-bundle.zip \
  --offline \
  --trust-anchor /path/to/trust-anchor.pem

# 4. Verify with explicit public key
stella proof verify --bundle proof-bundle.zip \
  --offline \
  --public-key /path/to/public-key.pem \
  --skip-rekor  # No network access

3.3 Verification Checks

| Check | Description | Can Skip? |
|---|---|---|
| Signature Valid | DSSE signature matches payload | No |
| ID Recomputed | Content-addressed ID matches | No |
| Merkle Path Valid | Merkle tree construction correct | No |
| Rekor Inclusion | Transparency log entry exists | Yes (offline) |
| Timestamp Valid | Proof created within valid window | Configurable |
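
When reading a "Signature Valid" failure, it helps to know that a DSSE signature covers the pre-authentication encoding (PAE) of the payload type and payload, not the raw payload alone. The sketch below builds the PAE string as the DSSE specification defines it. It assumes the bundles here use standard DSSE envelopes and handles text payloads only; `dsse_pae` is a hypothetical helper and not part of the stella CLI.

```shell
# DSSE pre-authentication encoding:
#   "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
# LEN is the byte length in decimal; wc -c counts bytes, not characters.
# The function body runs in a subshell so its variables do not leak.
dsse_pae() (
  type_len=$(printf '%s' "$1" | wc -c | tr -d ' ')
  body_len=$(printf '%s' "$2" | wc -c | tr -d ' ')
  printf 'DSSEv1 %s %s %s %s' "$type_len" "$1" "$body_len" "$2"
)

dsse_pae "http://example.com/HelloWorld" "hello world"
# prints: DSSEv1 29 http://example.com/HelloWorld 11 hello world
```

Because the payload type is part of the signed bytes, an envelope whose payload is intact but whose type field was altered still fails the signature check.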

3.4 Failed Verification Troubleshooting

# Get detailed verification report
stella proof verify --bundle-id sha256:proof123... --verbose

# Check specific failures
stella proof verify --bundle-id sha256:proof123... --check signatureValid
stella proof verify --bundle-id sha256:proof123... --check idRecomputed
stella proof verify --bundle-id sha256:proof123... --check merklePathValid

# Dump proof bundle contents for inspection
stella proof inspect --bundle proof-bundle.zip --output-dir ./inspection/

4. Proof Bundle Management

4.1 Download Proof Bundles

# Download single bundle
stella proof download --scan-id $SCAN_ID --output proof.zip

# Download with specific root hash
stella proof download --scan-id $SCAN_ID \
  --root-hash sha256:proof123... \
  --output proof.zip

# Download all bundles for a scan
stella proof download --scan-id $SCAN_ID --all --output-dir ./proofs/

4.2 Bundle Contents

# List bundle contents
unzip -l proof-bundle.zip

# Expected contents:
#   manifest.json        - Scan manifest (canonical JSON)
#   manifest.dsse.json   - DSSE signature of manifest
#   score_proof.json     - Proof ledger (ProofNode array)
#   proof_root.dsse.json - DSSE signature of proof root
#   meta.json            - Metadata (timestamps, versions)

# Extract and inspect
unzip proof-bundle.zip -d ./proof-contents/
jq . ./proof-contents/manifest.json
jq '.nodes | length' ./proof-contents/score_proof.json
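
Before deeper inspection, a quick completeness check on the extracted directory can catch a truncated download. This is a minimal sketch: `missing_bundle_files` is a hypothetical helper, and the file names are the expected contents listed above.

```shell
# Print each expected file that is missing from an extracted bundle directory.
# Exit status is non-zero if anything is missing.
# The function body runs in a subshell, so exit only ends the function.
missing_bundle_files() (
  dir=$1
  status=0
  for f in manifest.json manifest.dsse.json score_proof.json \
           proof_root.dsse.json meta.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "$f"
      status=1
    fi
  done
  exit "$status"
)

# Demo against a throwaway directory that lacks meta.json.
demo=$(mktemp -d)
for f in manifest.json manifest.dsse.json score_proof.json proof_root.dsse.json; do
  : > "$demo/$f"
done
missing_bundle_files "$demo" || echo "bundle incomplete"
rm -rf "$demo"
```

In practice it would be pointed at the extraction directory, for example `missing_bundle_files ./proof-contents`.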

4.3 Proof Retention

Proof bundles are retained based on policy:

| Tier | Retention | Description |
|---|---|---|
| Hot | 30 days | Recent proofs, fast access |
| Warm | 1 year | Archived proofs, slower access |
| Cold | 7 years | Compliance archive, retrieval required |
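
To estimate which tier a proof is likely in before requesting it, the table can be turned into a lookup. This sketch rests on assumptions the runbook does not confirm: every window is measured from the proof's creation date, and a year counts as 365 days. `retention_tier` is a hypothetical helper. The authoritative answer is the output of `stella proof status`.

```shell
# Hypothetical: map a proof's age in days to a tier name.
# Assumes each window is counted from creation and 1 year = 365 days.
retention_tier() {
  if [ "$1" -le 30 ]; then
    echo hot
  elif [ "$1" -le 365 ]; then
    echo warm
  elif [ "$1" -le 2555 ]; then   # 7 * 365
    echo cold
  else
    echo expired
  fi
}

retention_tier 12
retention_tier 400
```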

Check retention status:

stella proof status --scan-id $SCAN_ID
# Output: tier=hot, expires=2025-01-17, retrievable=true

Retrieve from cold storage:

# Request retrieval (async, may take hours)
stella proof retrieve --scan-id $SCAN_ID --root-hash sha256:proof123...

# Check retrieval status
stella proof retrieve-status --request-id req-001

4.4 Export for Audit

# Export proof bundle with full chain
stella proof export --scan-id $SCAN_ID \
  --include-chain \
  --include-anchors \
  --output audit-bundle.zip

# Export multiple scans for audit period
stella proof export-batch \
  --from 2025-01-01 \
  --to 2025-01-31 \
  --output-dir ./audit-jan-2025/

5. Troubleshooting

5.1 Score Mismatch After Replay

Symptom: Replayed score differs from original.

Diagnosis:

# Compare manifests
stella score diff --scan-id $SCAN_ID --original --replayed

# Check for feed changes
stella score manifest --scan-id $SCAN_ID | jq '.concelierSnapshotHash'

# Compare input hashes
stella score inputs --scan-id $SCAN_ID --hash

Common causes:

  1. Feed snapshot changed: Original used different advisory data
  2. Policy updated: Scoring rules changed between runs
  3. VEX statements added: New VEX data affects scores
  4. Non-deterministic seed: Check that deterministic: true is set in the manifest

Resolution:

# Replay with exact original snapshots
stella score replay --scan-id $SCAN_ID --use-original-snapshots

5.2 Proof Verification Failed

Symptom: Verification returns verified: false.

Diagnosis:

# Get detailed error
stella proof verify --bundle-id sha256:proof123... --verbose 2>&1 | head -50

# Common errors:
# - "Signature verification failed": Key mismatch or tampering
# - "ID recomputation failed": Canonical JSON issue
# - "Merkle path invalid": Proof chain corrupted
# - "Rekor entry not found": Not logged to transparency log

Resolution by error type:

| Error | Cause | Resolution |
|---|---|---|
| Signature failed | Key rotated | Use correct trust anchor |
| ID mismatch | Content modified | Re-generate proof |
| Merkle invalid | Partial upload | Re-download bundle |
| Rekor missing | Log lag or skip | Wait or verify offline |
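
For scripted triage, the error-to-resolution mapping can be expressed as a lookup. This is a sketch only: `suggest_resolution` is a hypothetical helper, and it matches the message strings listed under "Common errors" above. The actual CLI wording may differ.

```shell
# Map a verification error message to the first action from the table.
suggest_resolution() {
  case "$1" in
    *"Signature verification failed"*) echo "Use correct trust anchor" ;;
    *"ID recomputation failed"*)       echo "Re-generate proof" ;;
    *"Merkle path invalid"*)           echo "Re-download bundle" ;;
    *"Rekor entry not found"*)         echo "Wait or verify offline" ;;
    *)                                 echo "No documented resolution" ;;
  esac
}

suggest_resolution "error: Merkle path invalid at node 7"
```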

5.3 Missing Proof Bundle

Symptom: Proof bundle not found.

Diagnosis:

# Check if scan exists
stella scan status --scan-id $SCAN_ID

# Check proof generation status
stella proof status --scan-id $SCAN_ID

# Check if proof was generated
stella proof list --scan-id $SCAN_ID

Common causes:

  1. Scan still in progress: The proof is generated only after the scan completes
  2. Proof generation failed: Check worker logs
  3. Archived to cold storage: Needs retrieval
  4. Retention expired: Proof deleted per policy

5.4 Replay Performance Issues

Symptom: Replay taking too long.

Diagnosis:

# Check replay queue depth
stella scheduler queue status replay

# Check worker health
stella scanner workers status

# Check for resource constraints
kubectl top pods -l app=scanner-worker

Optimization:

# Reduce parallelism during peak hours
stella scheduler job update nightly-score-replay \
  --config.maxConcurrency=5

# Skip unchanged scans
stella score replay --scan-id $SCAN_ID --skip-unchanged

6. Monitoring & Alerting

6.1 Key Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| score_replay_duration_seconds | Time to replay a score | > 30s |
| proof_verification_success_rate | % of successful verifications | < 99% |
| proof_bundle_size_bytes | Size of proof bundles | > 100MB |
| replay_queue_depth | Pending replay jobs | > 1000 |
| proof_generation_failures | Failed proof generations | > 0/hour |
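
When checking the success-rate threshold by hand, for example from counts taken out of logs, the arithmetic is successes divided by total, compared against 99%. This is a minimal sketch with hypothetical helper names. How `proof_verification_success_rate` is computed server-side is not stated in this runbook.

```shell
# Print the verification success rate as a percentage (two decimals).
# $1 = successful verifications, $2 = total verifications.
success_rate() {
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * ok / total
  }'
}

# Exit status 0 when the rate is below the 99% alert threshold.
rate_below_threshold() {
  awk -v ok="$1" -v total="$2" 'BEGIN { exit !(total > 0 && 100 * ok / total < 99) }'
}

success_rate 985 1000
if rate_below_threshold 985 1000; then
  echo "below alert threshold"
fi
```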

6.2 Grafana Dashboard

Dashboard: Score Proofs Operations
Panels:
- Replay throughput (replays/minute)
- Replay latency (p50, p95, p99)
- Verification success rate
- Proof bundle storage usage
- Queue depth over time

6.3 Alerting Rules

# Prometheus alerting rules
groups:
  - name: score-proofs
    rules:
      - alert: ReplayLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(score_replay_duration_seconds_bucket[5m]))) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Score replay latency is high"
          
      - alert: ProofVerificationFailures
        expr: increase(proof_verification_failures_total[1h]) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Multiple proof verification failures detected"
          
      - alert: ReplayQueueBacklog
        expr: replay_queue_depth > 1000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Score replay queue backlog is growing"

7. Escalation Procedures

7.1 Escalation Matrix

| Severity | Condition | Response Time | Escalation Path |
|---|---|---|---|
| P1 | Proof verification failing for all scans | 15 min | On-call → Team Lead → VP Eng |
| P2 | Replay failures > 10% | 1 hour | On-call → Team Lead |
| P3 | Replay latency > 60s p95 | 4 hours | On-call |
| P4 | Queue backlog > 5000 | 24 hours | Ticket |

7.2 P1 Response Procedure

  1. Acknowledge alert in PagerDuty
  2. Triage:
    # Check service health
    stella health check --service scanner
    stella health check --service attestor
    
    # Check recent changes
    kubectl rollout history deployment/scanner-worker
    
  3. Mitigate:
    # If recent deployment, rollback
    kubectl rollout undo deployment/scanner-worker
    
    # If key rotation issue, restore previous anchor
    stella anchor restore --anchor-id anchor-001 --revision previous
    
  4. Communicate: Update status page, notify stakeholders
  5. Resolve: Fix root cause, verify fix
  6. Postmortem: Document incident within 48 hours

7.3 Contact Information

| Role | Contact | Availability |
|---|---|---|
| On-Call Engineer | PagerDuty scanner-oncall | 24/7 |
| Scanner Team Lead | @scanner-lead | Business hours |
| Security Team | security@stellaops.local | Business hours |
| VP Engineering | @vp-eng | Escalation only |

