StellaOps Disaster Recovery Guide
Sprint: SPRINT_20260125_003 - WORKFLOW-003
Last updated: 2026-01-25
Overview
This guide covers disaster recovery procedures for StellaOps trust infrastructure: Rekor outages, Rekor and signing key compromise, TUF repository failures, and root key ceremonies.
Scenario 1: Rekor Service Outage
Symptoms
- Attestation submissions failing
- Verification requests timing out
- Circuit breaker reporting OPEN state
Immediate Actions
- Verify the outage
    # Check Rekor health
    curl -sf https://rekor.sigstore.dev/api/v1/log | jq .
    # Check circuit breaker state
    stella trust status --show-circuit-breaker
- Check if mirror is active
    # If mirror failover is enabled, verify it's working
    stella trust status --show-backends
- If mirror is not available, swap endpoints via TUF
    # On TUF repository admin system
    ./devops/scripts/disaster-swap-endpoint.sh \
      --repo /path/to/tuf \
      --new-rekor-url https://rekor-mirror.internal:8080 \
      --note "Emergency: Production Rekor outage $(date -u)"
- Publish the update
    cd /path/to/tuf
    ./scripts/sign-metadata.sh  # Sign updated metadata
    ./scripts/publish.sh        # Deploy to TUF server
- Force client sync (optional, for immediate effect)
    stella trust sync --force
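Before swapping endpoints, it can help to confirm that the outage is sustained rather than a single failed request. The following is a minimal sketch under stated assumptions, not part of the StellaOps tooling: the function names, the `REKOR_URL` variable, the 10-second timeout, and the retry defaults are all hypothetical. It only wraps the `curl` health check from the first step.

```shell
# Hypothetical helper; REKOR_URL and the retry defaults are assumptions.
REKOR_URL="${REKOR_URL:-https://rekor.sigstore.dev}"

# Succeeds if the Rekor log endpoint answers within 10 seconds.
rekor_healthy() {
  curl -sf --max-time 10 "${REKOR_URL}/api/v1/log" > /dev/null
}

# Probe several times so one transient failure does not trigger a swap.
# Usage: confirm_outage [attempts] [delay-seconds]
# Returns 0 when every probe failed, 1 as soon as one probe succeeds.
confirm_outage() {
  local attempts="${1:-3}" delay="${2:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if rekor_healthy; then
      echo "Rekor healthy on attempt ${i}; no swap needed"
      return 1
    fi
    # Wait between probes, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "Rekor unreachable after ${attempts} attempts; continue with the mirror check"
  return 0
}
```

For example, `confirm_outage 3 10 && stella trust status --show-backends` would run the mirror check only after three consecutive failed probes.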
Key Principle
No client reconfiguration required. Endpoint changes flow through TUF. Clients discover new endpoints within their configured refresh interval.
Recovery
Once the primary Rekor is restored:
- Swap back to primary
    ./devops/scripts/disaster-swap-endpoint.sh \
      --repo /path/to/tuf \
      --new-rekor-url https://rekor.sigstore.dev \
      --note "Recovery: Primary Rekor restored"
- Verify service map published
    stella trust sync --force
    stella trust status --show-endpoints
- Reset circuit breakers
    stella trust reset-circuits
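The last two recovery steps can be chained so that circuit breakers are reset only once the primary endpoint is visible to the client. This is a sketch, not a documented workflow. It assumes that `stella trust status --show-endpoints` prints endpoint URLs as plain text, which this guide does not specify. The function name and the `PRIMARY_REKOR_URL` variable are hypothetical.

```shell
# Hypothetical helper; PRIMARY_REKOR_URL is an assumption.
PRIMARY_REKOR_URL="${PRIMARY_REKOR_URL:-https://rekor.sigstore.dev}"

# Sync, then reset breakers only if the primary URL appears in the endpoint list.
# ASSUMPTION: --show-endpoints prints the endpoint URLs as plain text.
restore_primary_client() {
  stella trust sync --force || return 1
  if stella trust status --show-endpoints | grep -qF "$PRIMARY_REKOR_URL"; then
    stella trust reset-circuits
  else
    echo "primary endpoint not yet visible; circuit breakers left untouched" >&2
    return 1
  fi
}
```

If the real output format differs, the `grep` match is the only part that would need to change.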
Scenario 2: Rekor Key Compromise
Symptoms
- Security team reports potential key exposure
- Unusual entries in transparency log
- Third-party security advisory
Immediate Actions
- Assess the compromise scope
  - When was the key potentially exposed?
  - What entries may be affected?
  - Are there signed entries from the compromised period?
- Emergency key rotation
    # Phase 1: Add new key immediately (no grace period)
    ./devops/scripts/rotate-rekor-key.sh add-key \
      --repo /path/to/tuf \
      --new-key /secure/new-rekor-key-v2.pub
    # Sign and publish immediately
    cd /path/to/tuf
    ./scripts/sign-metadata.sh
    ./scripts/publish.sh
- Force all clients to sync
  - Announce emergency update to all teams
  - Clients should run:
      stella trust sync --force
- Revoke compromised key immediately
    # Phase 2: Remove old key (skip grace period due to compromise)
    ./devops/scripts/rotate-rekor-key.sh remove-old \
      --repo /path/to/tuf \
      --old-key-name rekor-key-v1
    # Sign and publish
    cd /path/to/tuf
    ./scripts/sign-metadata.sh
    ./scripts/publish.sh
- Document the incident
  - Log rotation time
  - Affected key ID and fingerprint
  - List of potentially affected entries
  - Remediation steps taken
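For the incident record, a timestamped line written at each rotation step gives a consistent timeline to work from later. The sketch below is one possible shape. The function name, the field layout, and the log file name are assumptions, not an established StellaOps format.

```shell
# Hypothetical helper: emit one UTC-timestamped line per rotation event.
# Usage: log_rotation_event <key-name> <fingerprint> <action> >> incident.log
log_rotation_event() {
  printf '%s key=%s fingerprint=%s action=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3"
}
```

Calling `log_rotation_event rekor-key-v1 abc123 removed >> incident.log` right after Phase 2 records when the old key was removed.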
Forensics
Identify entries signed during the compromise window:
# Query entries by time range
stella rekor query \
--after "2026-01-20T00:00:00Z" \
--before "2026-01-25T00:00:00Z" \
--key-id compromised-key-id
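A swapped or malformed time window silently yields the wrong set of entries, so it can be worth validating the bounds before running the query. The wrapper below is a sketch. The function name is hypothetical, and it passes through only the `stella rekor query` flags shown above. It relies on the fact that UTC timestamps in the same fixed format sort lexicographically.

```shell
# Hypothetical wrapper around the query above.
# Usage: query_compromise_window <after> <before> <key-id>
query_compromise_window() {
  local after="$1" before="$2" key_id="$3" ts
  for ts in "$after" "$before"; do
    case "$ts" in
      [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z) ;;
      *) echo "not a UTC timestamp: $ts" >&2; return 2 ;;
    esac
  done
  # Appending "" forces a string comparison in awk.
  if ! awk -v a="$after" -v b="$before" 'BEGIN { exit !((a "") < (b "")) }'; then
    echo "window start must precede window end" >&2
    return 2
  fi
  stella rekor query --after "$after" --before "$before" --key-id "$key_id"
}
```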
Scenario 3: TUF Repository Unavailable
Symptoms
- Clients cannot sync trust metadata
- stella trust sync failing with network errors
- TUF timestamp verification failing
Immediate Actions
- Diagnose the issue
    # Check TUF repository health
    curl -sf https://trust.example.com/tuf/timestamp.json | jq .
    # Check DNS resolution
    nslookup trust.example.com
    # Check TLS certificate
    openssl s_client -connect trust.example.com:443 -servername trust.example.com
- For clients: extend offline tolerance
    # Temporarily allow stale metadata (use with caution)
    stella trust sync --allow-stale --max-age 7d
- Restore TUF server
  - Check hosting infrastructure
  - Restore from backup if needed
  - Verify metadata integrity
- Deploy mirror (if available)
    # Update DNS or load balancer to point to mirror
    # Or update clients directly (less preferred)
    stella trust init \
      --tuf-url https://trust-mirror.example.com/tuf/ \
      --force
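When the server is reachable but clients still fail timestamp verification, the served metadata may simply have expired. TUF timestamp metadata carries its expiry in the `signed.expires` field. The sketch below checks that field against the current UTC time. The function name and the `NOW_UTC` override are hypothetical, and the check does not verify signatures.

```shell
# Hypothetical helper: reads timestamp.json on stdin and succeeds only if
# .signed.expires is still in the future. Does NOT verify signatures.
# NOW_UTC is an assumed override, useful for testing the function.
timestamp_fresh() {
  local expires now
  # -e makes jq exit non-zero when the field is missing or null.
  expires="$(jq -er '.signed.expires')" || return 2
  [ -n "$expires" ] || return 2
  now="${NOW_UTC:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  # Same-format UTC timestamps compare correctly as strings.
  awk -v n="$now" -v e="$expires" 'BEGIN { exit !((n "") < (e "")) }'
}
```

One way to use it is `curl -sf https://trust.example.com/tuf/timestamp.json | timestamp_fresh || echo "timestamp metadata expired or unreadable"`.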
Scenario 4: Signing Key Compromise
Symptoms
- Security team reports key exposure
- Unauthorized attestations appearing
Immediate Actions
- Revoke the compromised key
    ./devops/scripts/rotate-signing-key.sh retire \
      --old-key compromised-key-name
- Generate new signing key
    ./devops/scripts/rotate-signing-key.sh generate \
      --key-type ecdsa-p256
- Update CI/CD immediately
  - Remove compromised key from all pipelines
  - Add new key
  - Trigger rebuild of recent releases
- Notify downstream consumers
  - Announce key rotation
  - Provide new public key
  - Advise re-verification of recent attestations
Scenario 5: Root Key Ceremony Required
When Required
- Scheduled root key rotation (typically annual)
- Root key compromise (emergency)
- Threshold change for root signatures
Procedure
- Schedule ceremony
  - Require M-of-N key holders present
  - Air-gapped ceremony machine
  - Hardware security modules
- Generate new root
    # On air-gapped ceremony machine
    tuf-ceremony init \
      --threshold 3 \
      --keys 5 \
      --algorithm ed25519
- Sign new root with old keys
  - Requires old threshold of signatures
  - Ensures continuous trust chain
- Distribute new root
  - Publish to TUF repository
  - Update bootstrap documentation
  - Notify all operators
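The "sign new root with old keys" step follows the TUF rule that a new root must carry a threshold of signatures from the previous root's keys as well as a threshold from its own keys. The sketch below is a pre-publication sanity check on signer key IDs. It only counts IDs and performs no cryptographic verification, and the function names and argument layout are hypothetical.

```shell
# Hypothetical helpers: count-only sanity check, NOT signature verification.

# Count distinct signing key IDs that appear in the trusted set.
# Usage: count_trusted_sigs "<trusted keyids>" "<signing keyids>"
count_trusted_sigs() {
  local trusted="$1" signers="$2" count=0 seen="" id
  for id in $signers; do
    case " $seen " in *" $id "*) continue ;; esac  # count each key once
    seen="$seen $id"
    case " $trusted " in *" $id "*) count=$((count + 1)) ;; esac
  done
  echo "$count"
}

# Succeeds only if both the old and the new threshold are met.
# Usage: root_rotation_ok <old-threshold> "<old keyids>" \
#                         <new-threshold> "<new keyids>" "<signing keyids>"
root_rotation_ok() {
  [ "$(count_trusted_sigs "$2" "$5")" -ge "$1" ] &&
    [ "$(count_trusted_sigs "$4" "$5")" -ge "$3" ]
}
```

With the 3-of-5 setup from the ceremony step, `root_rotation_ok 3 "a b c d e" 3 "f g h i j" "a b c f g h"` succeeds, while dropping signer `c` makes it fail because only two old keys have signed.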
Air-Gap Considerations
For air-gapped deployments after root rotation:
# Export new trust bundle with updated root
stella trust snapshot export \
--include-root \
--out post-rotation-bundle.tar.zst
# Transfer and import on air-gapped systems
./devops/scripts/bootstrap-trust-offline.sh \
post-rotation-bundle.tar.zst \
--force # Required due to root change
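Because the bundle crosses the air gap on removable media, recording a checksum before transfer and checking it before import catches corruption along the way. This is a sketch that assumes GNU `sha256sum` is available on both sides. The function names are hypothetical. A checksum detects accidental damage only and is no substitute for the signature checks done at import.

```shell
# Hypothetical helpers; assume GNU coreutils sha256sum on both systems.

# Connected side: write <bundle>.sha256 next to the bundle.
write_checksum() {
  ( cd "$(dirname "$1")" &&
    sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Air-gapped side: succeed only if the bundle matches its recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" &&
    sha256sum -c "$(basename "$1").sha256" > /dev/null 2>&1 )
}
```

Run `write_checksum post-rotation-bundle.tar.zst` after the export, transfer both files, and run `verify_checksum post-rotation-bundle.tar.zst` before the offline bootstrap script.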
Communication Templates
Outage Notification
Subject: [StellaOps] Rekor Service Disruption - Failover Active
Status: Service Degradation
Impact: Attestation submissions may be delayed
Mitigation: Automatic failover to mirror active
Action Required: None - clients will auto-discover new endpoint
Updates: Monitor status at https://status.example.com
Key Rotation Notice
Subject: [StellaOps] Emergency Key Rotation - Action Required
Reason: Security precaution / Scheduled rotation
Affected Key: rekor-key-v1 (fingerprint: abc123...)
New Key: rekor-key-v2 (fingerprint: def456...)
Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys
Timeline: Old key will be revoked at [DATE/TIME UTC]
Monitoring and Alerting
Key Metrics
- Circuit breaker state changes
- TUF metadata freshness
- Rekor submission latency
- Verification success rate
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| TUF metadata age | > 12h | > 24h |
| Circuit breaker opens | > 2/hour | > 5/hour |
| Submission failures | > 5% | > 20% |
| Verification failures | > 1% | > 5% |
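The thresholds in the table can be expressed as a small classifier, for instance inside a cron-driven check script. The sketch below mirrors the table values. The function names are hypothetical, and it handles whole-number inputs only (hours, opens per hour, or percent).

```shell
# Hypothetical helpers mirroring the alert table; integer inputs only.

# Usage: classify <value> <warning-threshold> <critical-threshold>
classify() {
  if [ "$1" -gt "$3" ]; then
    echo critical
  elif [ "$1" -gt "$2" ]; then
    echo warning
  else
    echo ok
  fi
}

tuf_age_status()              { classify "$1" 12 24; }  # metadata age, hours
breaker_open_status()         { classify "$1" 2 5; }    # opens per hour
submission_failure_status()   { classify "$1" 5 20; }   # percent
verification_failure_status() { classify "$1" 1 5; }    # percent
```

The comparisons are strict, matching the table's `>` notation, so a metadata age of exactly 12 hours is still `ok`.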
Contacts
| Role | Contact | Escalation |
|---|---|---|
| TUF Admin | tuf-admin@example.com | On-call |
| Security Team | security@example.com | Immediate |
| Platform Team | platform@example.com | Business hours |