# StellaOps Disaster Recovery Guide
> Sprint: SPRINT_20260125_003 - WORKFLOW-003
> Last updated: 2026-01-25
## Overview
This guide covers disaster recovery procedures for StellaOps trust
infrastructure, including Rekor outages, key compromise, and TUF repository
failures.
## Scenario 1: Rekor Service Outage
### Symptoms
- Attestation submissions failing
- Verification requests timing out
- Circuit breaker reporting OPEN state
### Immediate Actions
1. **Verify the outage**
```bash
# Check Rekor health
curl -sf https://rekor.sigstore.dev/api/v1/log | jq .
# Check circuit breaker state
stella trust status --show-circuit-breaker
```
2. **Check if mirror is active**
```bash
# If mirror failover is enabled, verify it's working
stella trust status --show-backends
```
3. **If mirror is not available, swap endpoints via TUF**
```bash
# On TUF repository admin system
./devops/scripts/disaster-swap-endpoint.sh \
--repo /path/to/tuf \
--new-rekor-url https://rekor-mirror.internal:8080 \
--note "Emergency: Production Rekor outage $(date -u +%Y-%m-%dT%H:%M:%SZ)"
```
4. **Publish the update**
```bash
cd /path/to/tuf
./scripts/sign-metadata.sh # Sign updated metadata
./scripts/publish.sh # Deploy to TUF server
```
5. **Force client sync (optional, for immediate effect)**
```bash
stella trust sync --force
```
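Before swapping endpoints, it can help to rule out a transient blip. The sketch below retries a health probe a few times and reports an outage only when every attempt fails. The `confirm_outage` helper, its arguments, and the retry counts are illustrative assumptions, not part of the StellaOps tooling; the probe in the usage comment is the same `curl` call as in step 1.

```bash
# Hypothetical helper: returns 0 (outage confirmed) only if the probe
# command fails on every attempt; returns 1 as soon as one probe succeeds.
# Usage: confirm_outage <attempts> <delay-seconds> <probe command...>
confirm_outage() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 1   # probe succeeded, so the service is reachable
    fi
    # Pause between attempts, but not after the last one
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 0       # every probe failed
}
# Example (three attempts, ten seconds apart):
#   confirm_outage 3 10 curl -sf --max-time 5 https://rekor.sigstore.dev/api/v1/log \
#     && echo "Rekor outage confirmed"
```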
### Key Principle
**No client reconfiguration required.** Endpoint changes flow through TUF.
Clients discover new endpoints within their configured refresh interval.
### Recovery
Once the primary Rekor is restored:
1. **Swap back to primary**
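```bash
./devops/scripts/disaster-swap-endpoint.sh \
--repo /path/to/tuf \
--new-rekor-url https://rekor.sigstore.dev \
--note "Recovery: Primary Rekor restored"
# Sign and publish, as in step 4 of Immediate Actions
cd /path/to/tuf
./scripts/sign-metadata.sh
./scripts/publish.sh
```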
2. **Verify service map published**
```bash
stella trust sync --force
stella trust status --show-endpoints
```
3. **Reset circuit breakers**
```bash
stella trust reset-circuits
```
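Swapping back too early risks flapping between endpoints. As a rough sketch, the hypothetical helper below requires a number of consecutive successful probes before the primary is treated as restored. The helper name, the streak length, and the probe interval are all assumptions to tune locally.

```bash
# Hypothetical helper: succeeds only after <needed> consecutive successful
# probes; any failure resets the streak. Gives up after <max_probes> probes.
# Usage: stable_for <needed> <max_probes> <delay-seconds> <probe command...>
stable_for() {
  local needed=$1 max_probes=$2 delay=$3
  shift 3
  local streak=0 i
  for ((i = 1; i <= max_probes; i++)); do
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      if ((streak >= needed)); then
        return 0
      fi
    else
      streak=0   # a single failure restarts the count
    fi
    if ((i < max_probes)); then
      sleep "$delay"
    fi
  done
  return 1
}
# Example: require 5 healthy probes in a row, 30 seconds apart, before step 1:
#   stable_for 5 20 30 curl -sf --max-time 5 https://rekor.sigstore.dev/api/v1/log
```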
## Scenario 2: Rekor Key Compromise
### Symptoms
- Security team reports potential key exposure
- Unusual entries in transparency log
- Third-party security advisory
### Immediate Actions
1. **Assess the compromise scope**
- When was the key potentially exposed?
- What entries may be affected?
- Are there signed entries from the compromised period?
2. **Emergency key rotation**
```bash
# Phase 1: Add new key immediately (no grace period)
./devops/scripts/rotate-rekor-key.sh add-key \
--repo /path/to/tuf \
--new-key /secure/new-rekor-key-v2.pub
# Sign and publish immediately
cd /path/to/tuf
./scripts/sign-metadata.sh
./scripts/publish.sh
```
3. **Force all clients to sync**
- Announce emergency update to all teams
- Clients should run: `stella trust sync --force`
4. **Revoke compromised key immediately**
```bash
# Phase 2: Remove old key (skip grace period due to compromise)
./devops/scripts/rotate-rekor-key.sh remove-old \
--repo /path/to/tuf \
--old-key-name rekor-key-v1
# Sign and publish
cd /path/to/tuf
./scripts/sign-metadata.sh
./scripts/publish.sh
```
5. **Document the incident**
- Log rotation time
- Affected key ID and fingerprint
- List of potentially affected entries
- Remediation steps taken
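
For step 5, a small script can capture the rotation time and key fingerprint in a consistent form. The sketch below is one possible approach: it treats the SHA-256 digest of the public key file as the fingerprint, which is an assumption (your tooling may define fingerprints differently), and `incident_record` is a hypothetical helper that relies on `sha256sum` from coreutils.

```bash
# Hypothetical helper: prints a minimal incident record for a rotated key.
# Assumption: "fingerprint" here means the SHA-256 digest of the key file.
# Usage: incident_record <key-name> <public-key-file>
incident_record() {
  local key_name=$1 key_file=$2
  local fingerprint rotated_at
  if [[ ! -r $key_file ]]; then
    echo "cannot read key file: $key_file" >&2
    return 1
  fi
  # sha256sum prints "<digest>  <file>"; keep only the digest
  fingerprint=$(sha256sum -- "$key_file" | cut -d' ' -f1)
  rotated_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf 'rotated_at: %s\n' "$rotated_at"
  printf 'key_name: %s\n' "$key_name"
  printf 'fingerprint_sha256: %s\n' "$fingerprint"
}
# Example:
#   incident_record rekor-key-v1 /secure/old-rekor-key.pub >> incident-notes.txt
```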
### Forensics
Identify entries signed during the compromise window:
```bash
# Query entries by time range
stella rekor query \
--after "2026-01-20T00:00:00Z" \
--before "2026-01-25T00:00:00Z" \
--key-id compromised-key-id
```
## Scenario 3: TUF Repository Unavailable
### Symptoms
- Clients cannot sync trust metadata
- `stella trust sync` failing with network errors
- TUF timestamp verification failing
### Immediate Actions
1. **Diagnose the issue**
```bash
# Check TUF repository health
curl -sf https://trust.example.com/tuf/timestamp.json | jq .
# Check DNS resolution
nslookup trust.example.com
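# Check TLS certificate (</dev/null closes stdin so s_client exits instead of waiting for input)
openssl s_client -connect trust.example.com:443 -servername trust.example.com </dev/null 2>/dev/null \
| openssl x509 -noout -subject -dates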
```
2. **For clients - extend offline tolerance**
```bash
# Temporarily allow stale metadata (use with caution)
stella trust sync --allow-stale --max-age 7d
```
3. **Restore TUF server**
- Check hosting infrastructure
- Restore from backup if needed
- Verify metadata integrity
4. **Deploy mirror (if available)**
```bash
# Update DNS or load balancer to point to mirror
# Or update clients directly (less preferred)
stella trust init \
--tuf-url https://trust-mirror.example.com/tuf/ \
--force
```
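When the server is reachable but clients still report timestamp verification failures, expired metadata is one possible cause. TUF metadata carries a `signed.expires` field in the form `YYYY-MM-DDTHH:MM:SSZ`, so a rough check can compare it with the current UTC time as plain strings. The `tuf_expired` helper is hypothetical, and the comparison assumes the fixed-width format without fractional seconds.

```bash
# Hypothetical helper: returns 0 if the metadata has expired.
# $1: the `signed.expires` value from timestamp.json
# $2: optional "now" override (same format); defaults to the current UTC time
tuf_expired() {
  local expires=$1
  local now=${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  # Both values are fixed-width UTC timestamps, so string order is time order.
  # Expired means now >= expires, i.e. now is NOT strictly before expires.
  [[ ! $now < $expires ]]
}
# Example, reusing the curl + jq probe from step 1:
#   expires=$(curl -sf https://trust.example.com/tuf/timestamp.json | jq -r '.signed.expires')
#   tuf_expired "$expires" && echo "timestamp metadata has expired"
```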
## Scenario 4: Signing Key Compromise
### Symptoms
- Security team reports key exposure
- Unauthorized attestations appearing
### Immediate Actions
1. **Revoke the compromised key**
```bash
./devops/scripts/rotate-signing-key.sh retire \
--old-key compromised-key-name
```
2. **Generate new signing key**
```bash
./devops/scripts/rotate-signing-key.sh generate \
--key-type ecdsa-p256
```
3. **Update CI/CD immediately**
- Remove compromised key from all pipelines
- Add new key
- Trigger rebuild of recent releases
4. **Notify downstream consumers**
- Announce key rotation
- Provide new public key
- Advise re-verification of recent attestations
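
For step 3, finding every pipeline that still references the compromised key is easy to get wrong by hand. One possible approach is a recursive literal search across checked-out CI configuration. The `find_key_refs` helper and the `./ci-configs` directory in the example are assumptions for illustration.

```bash
# Hypothetical helper: lists files under a directory that mention a key name.
# Exits 0 if at least one reference was found, non-zero if none were found.
# Usage: find_key_refs <key-name> <directory>
find_key_refs() {
  local key_name=$1 search_dir=$2
  # -r recurse, -l print matching file names only, -F match the name literally
  grep -rlF -- "$key_name" "$search_dir"
}
# Example:
#   find_key_refs compromised-key-name ./ci-configs \
#     || echo "no remaining references"
```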
## Scenario 5: Root Key Ceremony Required
### When Required
- Scheduled root key rotation (typically annual)
- Root key compromise (emergency)
- Threshold change for root signatures
### Procedure
1. **Schedule ceremony**
- Require M-of-N key holders to be present
- Use an air-gapped ceremony machine
- Use hardware security modules
2. **Generate new root**
```bash
# On air-gapped ceremony machine
tuf-ceremony init \
--threshold 3 \
--keys 5 \
--algorithm ed25519
```
3. **Sign new root with old keys**
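- Requires a threshold of signatures from the old root keys, in addition to a threshold from the new root keys
- Keeps the chain of trust continuous, so clients holding the old root can verify the new one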
4. **Distribute new root**
- Publish to TUF repository
- Update bootstrap documentation
- Notify all operators
### Air-Gap Considerations
For air-gapped deployments after root rotation:
```bash
# Export new trust bundle with updated root
stella trust snapshot export \
--include-root \
--out post-rotation-bundle.tar.zst
# Transfer and import on air-gapped systems
./devops/scripts/bootstrap-trust-offline.sh \
post-rotation-bundle.tar.zst \
--force # Required due to root change
```
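Because the bundle crosses the air gap on removable media, it may be worth recording a checksum at export time and verifying it before import. This sketch uses `sha256sum`; the `.sha256` sidecar file and both helper names are assumptions, not something the StellaOps scripts are known to produce or consume.

```bash
# Hypothetical helpers: write and verify a SHA-256 sidecar next to a bundle.
write_checksum() {
  local bundle=$1 dir name
  dir=$(dirname -- "$bundle")
  name=$(basename -- "$bundle")
  # Run inside the bundle's directory so the sidecar stores a relative name
  # and still verifies after the files are copied elsewhere together.
  (cd "$dir" && sha256sum -- "$name" > "$name.sha256")
}

verify_checksum() {
  local bundle=$1 dir name
  dir=$(dirname -- "$bundle")
  name=$(basename -- "$bundle")
  # Non-zero exit means the bundle does not match its recorded digest
  (cd "$dir" && sha256sum -c -- "$name.sha256" >/dev/null)
}
# Example:
#   write_checksum post-rotation-bundle.tar.zst     # on the connected side
#   verify_checksum post-rotation-bundle.tar.zst    # on the air-gapped side
```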
## Communication Templates
### Outage Notification
```
Subject: [StellaOps] Rekor Service Disruption - Failover Active
Status: Service Degradation
Impact: Attestation submissions may be delayed
Mitigation: Automatic failover to mirror active
Action Required: None - clients will auto-discover new endpoint
Updates: Monitor status at https://status.example.com
```
### Key Rotation Notice
```
Subject: [StellaOps] Emergency Key Rotation - Action Required
Reason: Security precaution / Scheduled rotation
Affected Key: rekor-key-v1 (fingerprint: abc123...)
New Key: rekor-key-v2 (fingerprint: def456...)
Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys
Timeline: Old key will be revoked at [DATE/TIME UTC]
```
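During an incident, filling in the template by hand invites mistakes. As a sketch, the notice above can be rendered from a few variables with a here-document. The `render_rotation_notice` function and its argument order are illustrative assumptions.

```bash
# Hypothetical helper: renders the key rotation notice from its arguments.
# Usage: render_rotation_notice <reason> <old-key> <old-fp> <new-key> <new-fp> <revoke-at>
render_rotation_notice() {
  local reason=$1 old_key=$2 old_fp=$3 new_key=$4 new_fp=$5 revoke_at=$6
  cat <<EOF
Subject: [StellaOps] Emergency Key Rotation - Action Required
Reason: ${reason}
Affected Key: ${old_key} (fingerprint: ${old_fp})
New Key: ${new_key} (fingerprint: ${new_fp})
Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys
Timeline: Old key will be revoked at ${revoke_at}
EOF
}
# Example:
#   render_rotation_notice "Security precaution" rekor-key-v1 abc123... \
#     rekor-key-v2 def456... "2026-01-26T00:00:00Z"
```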
## Monitoring and Alerting
### Key Metrics
- Circuit breaker state changes
- TUF metadata freshness
- Rekor submission latency
- Verification success rate
### Alert Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| TUF metadata age | > 12h | > 24h |
| Circuit breaker opens | > 2/hour | > 5/hour |
| Submission failures | > 5% | > 20% |
| Verification failures | > 1% | > 5% |
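
The TUF metadata age row can be expressed as a simple classifier, for example inside a scheduled check. The sketch assumes the age is already available in whole seconds; how that value is obtained, and the `classify_tuf_age` name, are assumptions.

```bash
# Hypothetical helper: maps a metadata age in seconds to an alert level,
# using the thresholds from the table above (> 12h warning, > 24h critical).
classify_tuf_age() {
  local age_seconds=$1
  local warn=$((12 * 3600)) crit=$((24 * 3600))
  if ((age_seconds > crit)); then
    echo critical
  elif ((age_seconds > warn)); then
    echo warning
  else
    echo ok
  fi
}
# Example: classify_tuf_age 50000
```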
## Contacts
| Role | Contact | Escalation |
|------|---------|------------|
| TUF Admin | tuf-admin@example.com | On-call |
| Security Team | security@example.com | Immediate |
| Platform Team | platform@example.com | Business hours |
## Related Documentation
- [Bootstrap Guide](bootstrap-guide.md)
- [Key Rotation Runbook](key-rotation-runbook.md)
- [TUF Integration Guide](../modules/attestor/tuf-integration.md)