Add integration tests for Proof Chain and Reachability workflows

- Implement ProofChainTestFixture for PostgreSQL-backed integration tests.
- Create StellaOps.Integration.ProofChain project with necessary dependencies.
- Add ReachabilityIntegrationTests to validate call graph extraction and reachability analysis.
- Introduce ReachabilityTestFixture for managing corpus and fixture paths.
- Establish StellaOps.Integration.Reachability project with required references.
- Develop UnknownsWorkflowTests to cover the unknowns lifecycle: detection, ranking, escalation, and resolution.
- Create StellaOps.Integration.Unknowns project with dependencies for unknowns workflow.
2025-12-20 22:19:26 +02:00
parent 3c6e14fca5
commit efe9bd8cfe
86 changed files with 9616 additions and 323 deletions
--- a/docs/operations/score-proofs-runbook.md
+++ b/docs/operations/score-proofs-runbook.md
@@ -0,0 +1,544 @@
+# Score Proofs Operations Runbook
+
+> **Version**: 1.0.0  
+> **Sprint**: 3500.0004.0004  
+> **Last Updated**: 2025-12-20
+
+This runbook covers operational procedures for Score Proofs, including score replay, proof verification, and troubleshooting.
+
+---
+
+## Table of Contents
+
+1. [Overview](#1-overview)
+2. [Score Replay Operations](#2-score-replay-operations)
+3. [Proof Verification Operations](#3-proof-verification-operations)
+4. [Proof Bundle Management](#4-proof-bundle-management)
+5. [Troubleshooting](#5-troubleshooting)
+6. [Monitoring & Alerting](#6-monitoring--alerting)
+7. [Escalation Procedures](#7-escalation-procedures)
+
+---
+
+## 1. Overview
+
+### What are Score Proofs?
+
+Score Proofs provide cryptographically verifiable audit trails for vulnerability scoring decisions. Each proof:
+
+- **Records inputs**: SBOM, feed snapshots, VEX data, policy hashes
+- **Traces computation**: Every scoring rule application
+- **Signs results**: DSSE envelopes with configurable trust anchors
+- **Enables replay**: Same inputs → same outputs (deterministic)
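
The replay property in the last bullet can be illustrated with a small shell sketch. This is an illustration only and makes no claim about how Score Proofs actually derive a root hash: `inputs_digest` is a hypothetical helper, the `name=hash` strings are made-up placeholders, and `sha256sum` (GNU coreutils) is assumed to be available.

```bash
# Illustration only: this is NOT how Score Proofs derive their root hash.
# inputs_digest hashes a set of "name=hash" input references, sorting them
# first so that the digest does not depend on argument order.
inputs_digest() {
  printf '%s\n' "$@" | LC_ALL=C sort | sha256sum | cut -d' ' -f1
}

original=$(inputs_digest "sbom=sha256:aaa" "feed=sha256:bbb" "policy=sha256:ccc")
replayed=$(inputs_digest "policy=sha256:ccc" "sbom=sha256:aaa" "feed=sha256:bbb")
new_feed=$(inputs_digest "sbom=sha256:aaa" "feed=sha256:ddd" "policy=sha256:ccc")

[ "$original" = "$replayed" ] && echo "same inputs: digests match"
[ "$original" != "$new_feed" ] && echo "feed changed: digest differs"
```

The point of the sketch is the comparison at the end: a replay over identical inputs should reproduce the recorded digest, and any difference indicates that at least one input changed.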
+
+### Key Components
+
+| Component | Purpose | Location |
+|-----------|---------|----------|
+| Scan Manifest | Records all inputs deterministically | `scanner.scan_manifest` table |
+| Proof Ledger | DAG of scoring computation nodes | `scanner.proof_bundle` table |
+| DSSE Envelope | Cryptographic signature wrapper | In proof bundle JSON |
+| Proof Bundle | ZIP archive for offline verification | Stored in object storage |
+
+### Prerequisites
+
+- Access to Scanner WebService API
+- `scanner.proofs` OAuth scope
+- CLI access with `stella` configured
+- Trust anchor public keys (for verification)
+
+---
+
+## 2. Score Replay Operations
+
+### 2.1 When to Replay Scores
+
+Score replay is needed when:
+
+- **Feed updates**: New advisories from Concelier
+- **VEX updates**: New VEX statements from Excititor
+- **Policy changes**: Updated scoring policy rules
+- **Audit requests**: Need to verify historical scores
+- **Investigation**: Analyze why a score changed
+
+### 2.2 Manual Score Replay (API)
+
+```bash
+# Get current scan manifest
+curl -s "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/manifest" \
+  -H "Authorization: Bearer $TOKEN" | jq '.manifest'
+
+# Replay with current feeds (uses latest snapshots)
+curl -X POST "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/score/replay" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{}' | jq '.scoreProof.rootHash'
+
+# Replay with specific feed snapshot
+curl -X POST "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/score/replay" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "overrides": {
+      "concelierSnapshotHash": "sha256:specific-feed-snapshot..."
+    }
+  }'
+```
+
+### 2.3 Manual Score Replay (CLI)
+
+```bash
+# Replay with current feeds
+stella score replay --scan-id $SCAN_ID
+
+# Replay with specific snapshot
+stella score replay --scan-id $SCAN_ID \
+  --feed-snapshot sha256:specific-feed-snapshot...
+
+# Replay and compare with original
+stella score replay --scan-id $SCAN_ID --diff
+
+# Replay in offline mode (air-gap)
+stella score replay --scan-id $SCAN_ID \
+  --offline \
+  --bundle /path/to/offline-bundle.zip
+```
+
+### 2.4 Batch Score Replay
+
+For bulk replay (e.g., after a major feed update):
+
+```bash
+# List all scans from last 7 days
+stella scan list --since 7d --format json > scans.json
+
+# Replay each scan
+jq -r '.[].scanId' scans.json | while read -r SCAN_ID; do
+  echo "Replaying $SCAN_ID..."
+  stella score replay --scan-id "$SCAN_ID" --quiet
+done
+
+# Or use the batch API endpoint (more efficient)
+curl -X POST "https://scanner.example.com/api/v1/scanner/batch/replay" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "scanIds": ["scan-1", "scan-2", "scan-3"],
+    "parallel": true,
+    "maxConcurrency": 10
+  }'
+```
+
+### 2.5 Nightly Replay Job
+
+The Scheduler automatically replays scores when Concelier publishes new snapshots:
+
+```yaml
+# Job configuration in Scheduler
+job:
+  name: nightly-score-replay
+  schedule: "0 3 * * *"  # 3 AM daily
+  trigger:
+    type: concelier-snapshot-published
+  action:
+    type: batch-replay
+    config:
+      maxAge: 30d
+      parallel: true
+      maxConcurrency: 20
+```
+
+**Monitoring the nightly job**:
+
+```bash
+# Check job status
+stella scheduler job status nightly-score-replay
+
+# View recent runs
+stella scheduler job runs nightly-score-replay --last 7
+
+# Check for failures
+stella scheduler job runs nightly-score-replay --status failed
+```
+
+---
+
+## 3. Proof Verification Operations
+
+### 3.1 Online Verification
+
+```bash
+# Verify via API
+curl -X POST "https://scanner.example.com/api/v1/proofs/verify" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "proofBundleId": "sha256:proof123...",
+    "checkRekor": true,
+    "anchorIds": ["anchor-001"]
+  }'
+
+# Verify via CLI
+stella proof verify --bundle-id sha256:proof123... --check-rekor
+```
+
+### 3.2 Offline Verification (Air-Gap)
+
+For air-gapped environments:
+
+```bash
+# 1. Download proof bundle (on connected system)
+curl --fail -o proof-bundle.zip \
+  -H "Authorization: Bearer $TOKEN" \
+  "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/proofs/sha256:proof123..."
+
+# 2. Transfer to air-gapped system (USB, etc.)
+
+# 3. Verify offline (on air-gapped system)
+stella proof verify --bundle proof-bundle.zip \
+  --offline \
+  --trust-anchor /path/to/trust-anchor.pem
+
+# 4. Verify with explicit public key
+stella proof verify --bundle proof-bundle.zip \
+  --offline \
+  --public-key /path/to/public-key.pem \
+  --skip-rekor  # No network access
+```
+
+### 3.3 Verification Checks
+
+| Check | Description | Can Skip? |
+|-------|-------------|-----------|
+| Signature Valid | DSSE signature matches payload | No |
+| ID Recomputed | Content-addressed ID matches | No |
+| Merkle Path Valid | Merkle tree construction correct | No |
+| Rekor Inclusion | Transparency log entry exists | Yes (offline) |
+| Timestamp Valid | Proof created within valid window | Configurable |
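
How the "Can Skip?" column could feed into an overall verdict is sketched below. `overall_verdict` is a hypothetical helper, not a `stella` command. The check names `signatureValid`, `idRecomputed`, and `merklePathValid` follow the `--check` values used in section 3.4, while `rekorInclusion` is an assumed name. The configurable timestamp check is not modeled.

```bash
# Hypothetical helper: not part of the stella CLI.
# Usage: overall_verdict MODE CHECK=RESULT...
#   MODE   is "online" or "offline"
#   RESULT is "pass" or "fail"
overall_verdict() {
  local mode=$1; shift
  local entry name result
  for entry in "$@"; do
    name=${entry%%=*}
    result=${entry#*=}
    # Rekor inclusion is the only check the table marks as skippable offline.
    if [ "$name" = "rekorInclusion" ] && [ "$mode" = "offline" ]; then
      continue
    fi
    if [ "$result" != "pass" ]; then
      echo "failed: $name"
      return 1
    fi
  done
  echo "verified"
}

overall_verdict offline signatureValid=pass idRecomputed=pass \
  merklePathValid=pass rekorInclusion=fail
```

In offline mode the failing Rekor result is ignored, so the call prints `verified`. The same arguments with `online` would print `failed: rekorInclusion` and return a non-zero status.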
+
+### 3.4 Failed Verification Troubleshooting
+
+```bash
+# Get detailed verification report
+stella proof verify --bundle-id sha256:proof123... --verbose
+
+# Check specific failures
+stella proof verify --bundle-id sha256:proof123... --check signatureValid
+stella proof verify --bundle-id sha256:proof123... --check idRecomputed
+stella proof verify --bundle-id sha256:proof123... --check merklePathValid
+
+# Dump proof bundle contents for inspection
+stella proof inspect --bundle proof-bundle.zip --output-dir ./inspection/
+```
+
+---
+
+## 4. Proof Bundle Management
+
+### 4.1 Download Proof Bundles
+
+```bash
+# Download single bundle
+stella proof download --scan-id $SCAN_ID --output proof.zip
+
+# Download with specific root hash
+stella proof download --scan-id $SCAN_ID \
+  --root-hash sha256:proof123... \
+  --output proof.zip
+
+# Download all bundles for a scan
+stella proof download --scan-id $SCAN_ID --all --output-dir ./proofs/
+```
+
+### 4.2 Bundle Contents
+
+```bash
+# List bundle contents
+unzip -l proof-bundle.zip
+
+# Expected contents:
+#   manifest.json        - Scan manifest (canonical JSON)
+#   manifest.dsse.json   - DSSE signature of manifest
+#   score_proof.json     - Proof ledger (ProofNode array)
+#   proof_root.dsse.json - DSSE signature of proof root
+#   meta.json            - Metadata (timestamps, versions)
+
+# Extract and inspect
+unzip proof-bundle.zip -d ./proof-contents/
+jq . ./proof-contents/manifest.json
+jq '.nodes | length' ./proof-contents/score_proof.json
+```
+
+### 4.3 Proof Retention
+
+Proof bundles are retained based on policy:
+
+| Tier | Retention | Description |
+|------|-----------|-------------|
+| Hot | 30 days | Recent proofs, fast access |
+| Warm | 1 year | Archived proofs, slower access |
+| Cold | 7 years | Compliance archive, retrieval required |
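
As a rough sketch of how the tiers might map onto a proof's age, the helper below classifies an age in days. It assumes the retention values are cumulative age boundaries (30 days, 1 year, 7 years), which the table does not state explicitly. `retention_tier` is a hypothetical name, and `stella proof status` remains the authoritative source for a given scan.

```bash
# Hypothetical helper: maps a proof's age in days to a retention tier.
# Assumes tier boundaries are cumulative ages (30 days, 1 year, 7 years).
retention_tier() {
  local age_days=$1
  if   [ "$age_days" -le 30 ];   then echo hot
  elif [ "$age_days" -le 365 ];  then echo warm
  elif [ "$age_days" -le 2555 ]; then echo cold   # 7 * 365 days, ignoring leap days
  else                                echo expired
  fi
}

retention_tier 12    # prints: hot
retention_tier 400   # prints: cold
```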
+
+**Check retention status**:
+
+```bash
+stella proof status --scan-id $SCAN_ID
+# Output: tier=hot, expires=2026-01-17, retrievable=true
+```
+
+**Retrieve from cold storage**:
+
+```bash
+# Request retrieval (async, may take hours)
+stella proof retrieve --scan-id $SCAN_ID --root-hash sha256:proof123...
+
+# Check retrieval status
+stella proof retrieve-status --request-id req-001
+```
+
+### 4.4 Export for Audit
+
+```bash
+# Export proof bundle with full chain
+stella proof export --scan-id $SCAN_ID \
+  --include-chain \
+  --include-anchors \
+  --output audit-bundle.zip
+
+# Export multiple scans for audit period
+stella proof export-batch \
+  --from 2025-01-01 \
+  --to 2025-01-31 \
+  --output-dir ./audit-jan-2025/
+```
+
+---
+
+## 5. Troubleshooting
+
+### 5.1 Score Mismatch After Replay
+
+**Symptom**: Replayed score differs from original.
+
+**Diagnosis**:
+
+```bash
+# Compare manifests
+stella score diff --scan-id $SCAN_ID --original --replayed
+
+# Check for feed changes
+stella score manifest --scan-id $SCAN_ID | jq '.concelierSnapshotHash'
+
+# Compare input hashes
+stella score inputs --scan-id $SCAN_ID --hash
+```
+
+**Common causes**:
+
+1. **Feed snapshot changed**: Original used different advisory data
+2. **Policy updated**: Scoring rules changed between runs
+3. **VEX statements added**: New VEX data affects scores
+4. **Non-deterministic seed**: Check that `deterministic: true` is set in the manifest
+
+**Resolution**:
+
+```bash
+# Replay with exact original snapshots
+stella score replay --scan-id $SCAN_ID --use-original-snapshots
+```
+
+### 5.2 Proof Verification Failed
+
+**Symptom**: Verification returns `verified: false`.
+
+**Diagnosis**:
+
+```bash
+# Get detailed error
+stella proof verify --bundle-id sha256:proof123... --verbose 2>&1 | head -50
+
+# Common errors:
+# - "Signature verification failed": Key mismatch or tampering
+# - "ID recomputation failed": Canonical JSON issue
+# - "Merkle path invalid": Proof chain corrupted
+# - "Rekor entry not found": Not logged to transparency log
+```
+
+**Resolution by error type**:
+
+| Error | Cause | Resolution |
+|-------|-------|------------|
+| Signature failed | Key rotated | Use correct trust anchor |
+| ID mismatch | Content modified | Re-generate proof |
+| Merkle invalid | Partial upload | Re-download bundle |
+| Rekor missing | Log lag or skip | Wait or verify offline |
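
For scripted triage, the table above can be turned into a lookup. The sketch below is a hypothetical helper, not a `stella` command. It matches on the error strings listed under "Common errors" and assumes the verbose output contains them verbatim.

```bash
# Hypothetical triage helper built from the table above; not a stella command.
resolution_for() {
  case "$1" in
    *"Signature verification failed"*) echo "Use correct trust anchor" ;;
    *"ID recomputation failed"*)       echo "Re-generate proof" ;;
    *"Merkle path invalid"*)           echo "Re-download bundle" ;;
    *"Rekor entry not found"*)         echo "Wait or verify offline" ;;
    *)                                 echo "Unrecognized error"; return 1 ;;
  esac
}

resolution_for "error: Merkle path invalid at node 17"   # prints: Re-download bundle
```

An unrecognized message returns a non-zero status, so a wrapper script can fall back to manual investigation or escalation.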
+
+### 5.3 Missing Proof Bundle
+
+**Symptom**: Proof bundle not found.
+
+**Diagnosis**:
+
+```bash
+# Check if scan exists
+stella scan status --scan-id $SCAN_ID
+
+# Check proof generation status
+stella proof status --scan-id $SCAN_ID
+
+# Check if proof was generated
+stella proof list --scan-id $SCAN_ID
+```
+
+**Common causes**:
+
+1. **Scan still in progress**: The proof is generated only after the scan completes
+2. **Proof generation failed**: Check worker logs
+3. **Archived to cold storage**: Needs retrieval
+4. **Retention expired**: Proof deleted per policy
+
+### 5.4 Replay Performance Issues
+
+**Symptom**: Replay taking too long.
+
+**Diagnosis**:
+
+```bash
+# Check replay queue depth
+stella scheduler queue status replay
+
+# Check worker health
+stella scanner workers status
+
+# Check for resource constraints
+kubectl top pods -l app=scanner-worker
+```
+
+**Optimization**:
+
+```bash
+# Reduce parallelism during peak hours
+stella scheduler job update nightly-score-replay \
+  --config.maxConcurrency=5
+
+# Skip unchanged scans
+stella score replay --scan-id $SCAN_ID --skip-unchanged
+```
+
+---
+
+## 6. Monitoring & Alerting
+
+### 6.1 Key Metrics
+
+| Metric | Description | Alert Threshold |
+|--------|-------------|-----------------|
+| `score_replay_duration_seconds` | Time to replay a score | > 30s |
+| `proof_verification_success_rate` | % of successful verifications | < 99% |
+| `proof_bundle_size_bytes` | Size of proof bundles | > 100MB |
+| `replay_queue_depth` | Pending replay jobs | > 1000 |
+| `proof_generation_failures` | Failed proof generations | > 0/hour |
+
+### 6.2 Grafana Dashboard
+
+```
+Dashboard: Score Proofs Operations
+Panels:
+- Replay throughput (replays/minute)
+- Replay latency (p50, p95, p99)
+- Verification success rate
+- Proof bundle storage usage
+- Queue depth over time
+```
+
+### 6.3 Alerting Rules
+
+```yaml
+# Prometheus alerting rules
+groups:
+  - name: score-proofs
+    rules:
+      - alert: ReplayLatencyHigh
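+        # Assumes score_replay_duration_seconds is exported as a Prometheus histogram
+        expr: histogram_quantile(0.95, sum by (le) (rate(score_replay_duration_seconds_bucket[5m]))) > 30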
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Score replay latency is high"
+          
+      - alert: ProofVerificationFailures
+        expr: increase(proof_verification_failures_total[1h]) > 10
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Multiple proof verification failures detected"
+          
+      - alert: ReplayQueueBacklog
+        expr: replay_queue_depth > 1000
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Score replay queue backlog is growing"
+```
+
+---
+
+## 7. Escalation Procedures
+
+### 7.1 Escalation Matrix
+
+| Severity | Condition | Response Time | Escalation Path |
+|----------|-----------|---------------|-----------------|
+| P1 | Proof verification failing for all scans | 15 min | On-call → Team Lead → VP Eng |
+| P2 | Replay failures > 10% | 1 hour | On-call → Team Lead |
+| P3 | Replay latency > 60s p95 | 4 hours | On-call |
+| P4 | Queue backlog > 5000 | 24 hours | Ticket |
+
+### 7.2 P1 Response Procedure
+
+1. **Acknowledge** alert in PagerDuty
+2. **Triage**:
+   ```bash
+   # Check service health
+   stella health check --service scanner
+   stella health check --service attestor
+   
+   # Check recent changes
+   kubectl rollout history deployment/scanner-worker
+   ```
+3. **Mitigate**:
+   ```bash
+   # If recent deployment, rollback
+   kubectl rollout undo deployment/scanner-worker
+   
+   # If key rotation issue, restore previous anchor
+   stella anchor restore --anchor-id anchor-001 --revision previous
+   ```
+4. **Communicate**: Update status page, notify stakeholders
+5. **Resolve**: Fix root cause, verify fix
+6. **Postmortem**: Document incident within 48 hours
+
+### 7.3 Contact Information
+
+| Role | Contact | Availability |
+|------|---------|--------------|
+| On-Call Engineer | PagerDuty `scanner-oncall` | 24/7 |
+| Scanner Team Lead | @scanner-lead | Business hours |
+| Security Team | security@stellaops.local | Business hours |
+| VP Engineering | @vp-eng | Escalation only |
+
+---
+
+## Related Documentation
+
+- [Score Proofs API Reference](../api/score-proofs-reachability-api-reference.md)
+- [Proof Chain Architecture](../modules/attestor/architecture.md)
+- [CLI Reference](./cli-reference.md)
+- [Air-Gap Operations](../airgap/operations.md)