# StellaOps Disaster Recovery Guide

> Sprint: SPRINT_20260125_003 - WORKFLOW-003
> Last updated: 2026-01-25

## Overview

This guide covers disaster recovery procedures for StellaOps trust infrastructure, including Rekor outages, key compromise, and TUF repository failures.

## Scenario 1: Rekor Service Outage

### Symptoms

- Attestation submissions failing
- Verification requests timing out
- Circuit breaker reporting OPEN state

### Immediate Actions

1. **Verify the outage**
   ```bash
   # Check Rekor health
   curl -sf https://rekor.sigstore.dev/api/v1/log | jq .

   # Check circuit breaker state
   stella trust status --show-circuit-breaker
   ```

2. **Check if mirror is active**
   ```bash
   # If mirror failover is enabled, verify it's working
   stella trust status --show-backends
   ```

3. **If mirror is not available, swap endpoints via TUF**
   ```bash
   # On TUF repository admin system
   ./devops/scripts/disaster-swap-endpoint.sh \
     --repo /path/to/tuf \
     --new-rekor-url https://rekor-mirror.internal:8080 \
     --note "Emergency: Production Rekor outage $(date -u)"
   ```

4. **Publish the update**
   ```bash
   cd /path/to/tuf
   ./scripts/sign-metadata.sh  # Sign updated metadata
   ./scripts/publish.sh        # Deploy to TUF server
   ```

5. **Force client sync (optional, for immediate effect)**
   ```bash
   stella trust sync --force
   ```

### Key Principle

**No client reconfiguration required.** Endpoint changes flow through TUF. Clients discover new endpoints within their configured refresh interval.

### Recovery

Once the primary Rekor is restored:

1. **Swap back to primary**
   ```bash
   ./devops/scripts/disaster-swap-endpoint.sh \
     --repo /path/to/tuf \
     --new-rekor-url https://rekor.sigstore.dev \
     --note "Recovery: Primary Rekor restored"
   ```

2. **Verify service map published**
   ```bash
   stella trust sync --force
   stella trust status --show-endpoints
   ```

3. **Reset circuit breakers**
   ```bash
   stella trust reset-circuits
   ```

## Scenario 2: Rekor Key Compromise

### Symptoms

- Security team reports potential key exposure
- Unusual entries in transparency log
- Third-party security advisory

### Immediate Actions

1. **Assess the compromise scope**
   - When was the key potentially exposed?
   - What entries may be affected?
   - Are there signed entries from the compromised period?

2. **Emergency key rotation**
   ```bash
   # Phase 1: Add new key immediately (no grace period)
   ./devops/scripts/rotate-rekor-key.sh add-key \
     --repo /path/to/tuf \
     --new-key /secure/new-rekor-key-v2.pub

   # Sign and publish immediately
   cd /path/to/tuf
   ./scripts/sign-metadata.sh
   ./scripts/publish.sh
   ```

3. **Force all clients to sync**
   - Announce emergency update to all teams
   - Clients should run: `stella trust sync --force`

4. **Revoke compromised key immediately**
   ```bash
   # Phase 2: Remove old key (skip grace period due to compromise)
   ./devops/scripts/rotate-rekor-key.sh remove-old \
     --repo /path/to/tuf \
     --old-key-name rekor-key-v1

   # Sign and publish
   cd /path/to/tuf
   ./scripts/sign-metadata.sh
   ./scripts/publish.sh
   ```

5. **Document the incident**
   - Log rotation time
   - Affected key ID and fingerprint
   - List of potentially affected entries
   - Remediation steps taken

### Forensics

Identify entries signed during the compromise window:

```bash
# Query entries by time range
stella rekor query \
  --after "2026-01-20T00:00:00Z" \
  --before "2026-01-25T00:00:00Z" \
  --key-id compromised-key-id
```

## Scenario 3: TUF Repository Unavailable

### Symptoms

- Clients cannot sync trust metadata
- `stella trust sync` failing with network errors
- TUF timestamp verification failing

### Immediate Actions

1. **Diagnose the issue**
   ```bash
   # Check TUF repository health
   curl -sf https://trust.example.com/tuf/timestamp.json | jq .

   # Check DNS resolution
   nslookup trust.example.com

   # Check TLS certificate
   openssl s_client -connect trust.example.com:443 -servername trust.example.com
   ```

2. **For clients - extend offline tolerance**
   ```bash
   # Temporarily allow stale metadata (use with caution)
   stella trust sync --allow-stale --max-age 7d
   ```

3. **Restore TUF server**
   - Check hosting infrastructure
   - Restore from backup if needed
   - Verify metadata integrity

4. **Deploy mirror (if available)**
   ```bash
   # Update DNS or load balancer to point to mirror
   # Or update clients directly (less preferred)
   stella trust init \
     --tuf-url https://trust-mirror.example.com/tuf/ \
     --force
   ```

## Scenario 4: Signing Key Compromise

### Symptoms

- Security team reports key exposure
- Unauthorized attestations appearing

### Immediate Actions

1. **Revoke the compromised key**
   ```bash
   ./devops/scripts/rotate-signing-key.sh retire \
     --old-key compromised-key-name
   ```

2. **Generate new signing key**
   ```bash
   ./devops/scripts/rotate-signing-key.sh generate \
     --key-type ecdsa-p256
   ```

3. **Update CI/CD immediately**
   - Remove compromised key from all pipelines
   - Add new key
   - Trigger rebuild of recent releases

4. **Notify downstream consumers**
   - Announce key rotation
   - Provide new public key
   - Advise re-verification of recent attestations

## Scenario 5: Root Key Ceremony Required

### When Required

- Scheduled root key rotation (typically annual)
- Root key compromise (emergency)
- Threshold change for root signatures

### Procedure

1. **Schedule ceremony**
   - Require M-of-N key holders present
   - Air-gapped ceremony machine
   - Hardware security modules

2. **Generate new root**
   ```bash
   # On air-gapped ceremony machine
   tuf-ceremony init \
     --threshold 3 \
     --keys 5 \
     --algorithm ed25519
   ```

3. **Sign new root with old keys**
   - Requires old threshold of signatures
   - Ensures continuous trust chain

4. **Distribute new root**
   - Publish to TUF repository
   - Update bootstrap documentation
   - Notify all operators

### Air-Gap Considerations

For air-gapped deployments after root rotation:

```bash
# Export new trust bundle with updated root
stella trust snapshot export \
  --include-root \
  --out post-rotation-bundle.tar.zst

# Transfer and import on air-gapped systems
./devops/scripts/bootstrap-trust-offline.sh \
  post-rotation-bundle.tar.zst \
  --force  # Required due to root change
```

## Communication Templates

### Outage Notification

```
Subject: [StellaOps] Rekor Service Disruption - Failover Active

Status: Service Degradation
Impact: Attestation submissions may be delayed
Mitigation: Automatic failover to mirror active
Action Required: None - clients will auto-discover new endpoint

Updates: Monitor status at https://status.example.com
```

### Key Rotation Notice

```
Subject: [StellaOps] Emergency Key Rotation - Action Required

Reason: Security precaution / Scheduled rotation
Affected Key: rekor-key-v1 (fingerprint: abc123...)
New Key: rekor-key-v2 (fingerprint: def456...)

Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys

Timeline: Old key will be revoked at [DATE/TIME UTC]
```

## Monitoring and Alerting

### Key Metrics

- Circuit breaker state changes
- TUF metadata freshness
- Rekor submission latency
- Verification success rate

### Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| TUF metadata age | > 12h | > 24h |
| Circuit breaker opens | > 2/hour | > 5/hour |
| Submission failures | > 5% | > 20% |
| Verification failures | > 1% | > 5% |

## Contacts

| Role | Contact | Escalation |
|------|---------|------------|
| TUF Admin | tuf-admin@example.com | On-call |
| Security Team | security@example.com | Immediate |
| Platform Team | platform@example.com | Business hours |

## Related Documentation

- [Bootstrap Guide](bootstrap-guide.md)
- [Key Rotation Runbook](key-rotation-runbook.md)
- [TUF Integration Guide](../modules/attestor/tuf-integration.md)
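
## Appendix: Threshold Check Sketch

The "TUF metadata age" row of the Alert Thresholds table (warning above 12h, critical above 24h) can be turned into a small check. The following is a minimal sketch, not part of the StellaOps tooling: the function names `metadata_age_hours` and `classify_metadata_age` are hypothetical, and it assumes the last successful sync time is already available as a Unix epoch timestamp from whatever monitoring source is in use.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with StellaOps) illustrating the
# "TUF metadata age" thresholds: > 12h is a warning, > 24h is critical.

# Compute metadata age in whole hours from two epoch timestamps (seconds).
# $1 = epoch of the last successful sync, $2 = current epoch.
metadata_age_hours() {
  local last_sync_epoch="$1"
  local now_epoch="$2"
  echo $(( (now_epoch - last_sync_epoch) / 3600 ))
}

# Map an age in whole hours to a severity label.
classify_metadata_age() {
  local age_hours="$1"
  if (( age_hours > 24 )); then
    echo "critical"
  elif (( age_hours > 12 )); then
    echo "warning"
  else
    echo "ok"
  fi
}
```

A caller might pass `"$(date +%s)"` as the current time and feed the result of `metadata_age_hours` into `classify_metadata_age`. Both comparisons are strict, matching the `>` in the table, so an age of exactly 12 hours is still reported as `ok`.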