# StellaOps Disaster Recovery Guide

> Sprint: SPRINT_20260125_003 - WORKFLOW-003
> Last updated: 2026-01-25

## Overview

This guide covers disaster recovery procedures for StellaOps trust
infrastructure, including Rekor outages, key compromise, and TUF repository
failures.

## Scenario 1: Rekor Service Outage

### Symptoms
- Attestation submissions failing
- Verification requests timing out
- Circuit breaker reporting OPEN state

### Immediate Actions

1. **Verify the outage**
   ```bash
   # Check Rekor health
   curl -sf https://rekor.sigstore.dev/api/v1/log | jq .

   # Check circuit breaker state
   stella trust status --show-circuit-breaker
   ```

2. **Check if mirror is active**
   ```bash
   # If mirror failover is enabled, verify it's working
   stella trust status --show-backends
   ```

3. **If mirror is not available, swap endpoints via TUF**
   ```bash
   # On TUF repository admin system
   ./devops/scripts/disaster-swap-endpoint.sh \
     --repo /path/to/tuf \
     --new-rekor-url https://rekor-mirror.internal:8080 \
     --note "Emergency: Production Rekor outage $(date -u)"
   ```

4. **Publish the update**
   ```bash
   cd /path/to/tuf
   ./scripts/sign-metadata.sh  # Sign updated metadata
   ./scripts/publish.sh        # Deploy to TUF server
   ```

5. **Force client sync (optional, for immediate effect)**
   ```bash
   stella trust sync --force
   ```

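A single failed health check can be a transient network blip rather than an outage. The sketch below is not part of the StellaOps tooling; it wraps any probe command in a retry loop so that step 1 confirms a sustained failure before endpoints are swapped. The function name `probe_with_retries`, the attempt count, and the delay are illustrative assumptions.

```bash
# Hypothetical helper: run a probe command up to N times before declaring an outage.
# Usage: probe_with_retries <attempts> <delay_seconds> <command...>
probe_with_retries() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0            # probe succeeded, service is reachable
    fi
    i=$((i + 1))
    # Wait between attempts, but not after the final one
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1                # every attempt failed
}

# Example (left commented out so that sourcing this file has no side effects):
# probe_with_retries 3 5 curl -sf https://rekor.sigstore.dev/api/v1/log \
#   || echo "Rekor unreachable after 3 attempts; continue with step 2"
```

The probe is passed as arguments rather than hard-coded, so the same loop can wrap the Rekor health check or any other command from the steps above.
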
### Key Principle

**No client reconfiguration required.** Endpoint changes flow through TUF.
Clients discover new endpoints within their configured refresh interval.

### Recovery

Once the primary Rekor is restored:

1. **Swap back to primary**
   ```bash
   ./devops/scripts/disaster-swap-endpoint.sh \
     --repo /path/to/tuf \
     --new-rekor-url https://rekor.sigstore.dev \
     --note "Recovery: Primary Rekor restored"
   ```
   Sign and publish the updated metadata as in step 4 of Immediate Actions.

2. **Verify service map published**
   ```bash
   stella trust sync --force
   stella trust status --show-endpoints
   ```

3. **Reset circuit breakers**
   ```bash
   stella trust reset-circuits
   ```

## Scenario 2: Rekor Key Compromise

### Symptoms
- Security team reports potential key exposure
- Unusual entries in transparency log
- Third-party security advisory

### Immediate Actions

1. **Assess the compromise scope**
   - When was the key potentially exposed?
   - What entries may be affected?
   - Are there signed entries from the compromised period?

2. **Emergency key rotation**
   ```bash
   # Phase 1: Add new key immediately (no grace period)
   ./devops/scripts/rotate-rekor-key.sh add-key \
     --repo /path/to/tuf \
     --new-key /secure/new-rekor-key-v2.pub

   # Sign and publish immediately
   cd /path/to/tuf
   ./scripts/sign-metadata.sh
   ./scripts/publish.sh
   ```

3. **Force all clients to sync**
   - Announce emergency update to all teams
   - Clients should run: `stella trust sync --force`

4. **Revoke compromised key immediately**
   ```bash
   # Phase 2: Remove old key (skip grace period due to compromise)
   ./devops/scripts/rotate-rekor-key.sh remove-old \
     --repo /path/to/tuf \
     --old-key-name rekor-key-v1

   # Sign and publish
   cd /path/to/tuf
   ./scripts/sign-metadata.sh
   ./scripts/publish.sh
   ```

5. **Document the incident**
   - Log rotation time
   - Affected key ID and fingerprint
   - List of potentially affected entries
   - Remediation steps taken

### Forensics

Identify entries signed during the compromise window:

```bash
# Query entries by time range
# (the window and key ID below are examples; substitute the incident's values)
stella rekor query \
  --after "2026-01-20T00:00:00Z" \
  --before "2026-01-25T00:00:00Z" \
  --key-id compromised-key-id
```

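During an incident it is easy to swap the two bounds or paste a date without its time component. A minimal sketch of a guard that could run before the query is shown below. It is an illustration only: the function names are hypothetical, and it assumes both bounds use the fixed-width `YYYY-MM-DDTHH:MM:SSZ` form seen in the example above.

```bash
# Hypothetical guard: accept only fixed-width UTC timestamps (YYYY-MM-DDTHH:MM:SSZ).
is_utc_timestamp() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Returns 0 only if both bounds are well formed and the start precedes the end.
# Usage: valid_window <after> <before>
valid_window() {
  is_utc_timestamp "$1" || return 1
  is_utc_timestamp "$2" || return 1
  # Fixed-width UTC timestamps sort lexicographically in chronological order,
  # so a plain string comparison is enough here.
  [ "$1" \< "$2" ]
}

# Example (commented out; runs the query only when the window is sane):
# valid_window "2026-01-20T00:00:00Z" "2026-01-25T00:00:00Z" \
#   && stella rekor query --after "2026-01-20T00:00:00Z" \
#        --before "2026-01-25T00:00:00Z" --key-id compromised-key-id
```

The pattern check validates shape only, not calendar correctness, so a value such as month 13 would still pass.
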
## Scenario 3: TUF Repository Unavailable

### Symptoms
- Clients cannot sync trust metadata
- `stella trust sync` failing with network errors
- TUF timestamp verification failing

### Immediate Actions

1. **Diagnose the issue**
   ```bash
   # Check TUF repository health
   curl -sf https://trust.example.com/tuf/timestamp.json | jq .

   # Check DNS resolution
   nslookup trust.example.com

   # Check TLS certificate (</dev/null closes the session after the handshake)
   openssl s_client -connect trust.example.com:443 -servername trust.example.com </dev/null
   ```

2. **For clients - extend offline tolerance**
   ```bash
   # Temporarily allow stale metadata (use with caution)
   stella trust sync --allow-stale --max-age 7d
   ```

3. **Restore TUF server**
   - Check hosting infrastructure
   - Restore from backup if needed
   - Verify metadata integrity

4. **Deploy mirror (if available)**
   ```bash
   # Update DNS or load balancer to point to mirror
   # Or update clients directly (less preferred)
   stella trust init \
     --tuf-url https://trust-mirror.example.com/tuf/ \
     --force
   ```

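The three diagnostic commands in step 1 probe different layers (DNS, TLS, the repository itself), and knowing which layer fails first narrows the cause quickly. The following sketch shows one way to run them as labelled checks. It is not part of the StellaOps tooling, and `run_check` is a hypothetical helper name.

```bash
# Hypothetical helper: run one diagnostic command and print a labelled result.
# Usage: run_check <label> <command...>
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# Example (commented out so that sourcing this file makes no network calls).
# Ordered from the lowest layer upwards, so the first FAIL points at the cause:
# run_check "DNS resolution" nslookup trust.example.com
# run_check "TLS handshake"  sh -c 'openssl s_client -connect trust.example.com:443 -servername trust.example.com </dev/null'
# run_check "TUF timestamp"  curl -sf https://trust.example.com/tuf/timestamp.json
```

Command output is discarded on purpose so the summary stays readable; rerun a failing command by hand to see its full output.
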
## Scenario 4: Signing Key Compromise

### Symptoms
- Security team reports key exposure
- Unauthorized attestations appearing

### Immediate Actions

1. **Revoke the compromised key**
   ```bash
   ./devops/scripts/rotate-signing-key.sh retire \
     --old-key compromised-key-name
   ```

2. **Generate new signing key**
   ```bash
   ./devops/scripts/rotate-signing-key.sh generate \
     --key-type ecdsa-p256
   ```

3. **Update CI/CD immediately**
   - Remove compromised key from all pipelines
   - Add new key
   - Trigger rebuild of recent releases

4. **Notify downstream consumers**
   - Announce key rotation
   - Provide new public key
   - Advise re-verification of recent attestations

## Scenario 5: Root Key Ceremony Required

### When Required
- Scheduled root key rotation (typically annual)
- Root key compromise (emergency)
- Threshold change for root signatures

### Procedure

1. **Schedule ceremony**
   - M-of-N key holders must be present
   - Air-gapped ceremony machine
   - Hardware security modules

2. **Generate new root**
   ```bash
   # On air-gapped ceremony machine
   tuf-ceremony init \
     --threshold 3 \
     --keys 5 \
     --algorithm ed25519
   ```

3. **Sign new root with old keys**
   - Requires the old threshold of signatures
   - Ensures a continuous trust chain

4. **Distribute new root**
   - Publish to TUF repository
   - Update bootstrap documentation
   - Notify all operators

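The M-of-N rule in steps 1 to 3 reduces to two integer comparisons, which can be checked before the ceremony starts. The sketch below is a minimal illustration and is not part of `tuf-ceremony`; both function names are hypothetical.

```bash
# Hypothetical pre-ceremony checks for an M-of-N root signing policy.

# Returns 0 when the policy itself is usable: 1 <= threshold <= keys.
# Usage: valid_threshold <threshold> <keys>
valid_threshold() {
  threshold=$1
  keys=$2
  [ "$threshold" -ge 1 ] && [ "$threshold" -le "$keys" ]
}

# Returns 0 when enough signatures have been collected to meet the threshold.
# Usage: threshold_met <signatures_collected> <threshold>
threshold_met() {
  signatures=$1
  threshold=$2
  [ "$signatures" -ge "$threshold" ]
}

# Example (commented out): the 3-of-5 policy from step 2, with 3 holders present.
# valid_threshold 3 5 && threshold_met 3 3 && echo "ceremony can proceed"
```

For step 3 the threshold to pass to `threshold_met` is the old root's threshold, since the old keys are what carry trust forward to the new root.
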
### Air-Gap Considerations

For air-gapped deployments after root rotation:

```bash
# Export new trust bundle with updated root
stella trust snapshot export \
  --include-root \
  --out post-rotation-bundle.tar.zst

# Transfer and import on air-gapped systems
./devops/scripts/bootstrap-trust-offline.sh \
  post-rotation-bundle.tar.zst \
  --force  # Required due to root change
```

## Communication Templates

### Outage Notification

```
Subject: [StellaOps] Rekor Service Disruption - Failover Active

Status: Service Degradation
Impact: Attestation submissions may be delayed
Mitigation: Automatic failover to mirror active

Action Required: None - clients will auto-discover new endpoint

Updates: Monitor status at https://status.example.com
```

### Key Rotation Notice

```
Subject: [StellaOps] Emergency Key Rotation - Action Required

Reason: Security precaution / Scheduled rotation
Affected Key: rekor-key-v1 (fingerprint: abc123...)
New Key: rekor-key-v2 (fingerprint: def456...)

Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys

Timeline: Old key will be revoked at [DATE/TIME UTC]
```

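Filling in the key rotation notice by hand under time pressure invites copy-paste mistakes in key names and fingerprints. One possible approach, sketched here as an assumption rather than an existing StellaOps script, is to render the template from arguments. The function name and argument order are hypothetical.

```bash
# Hypothetical helper: render the key rotation notice from its variable fields.
# Usage: render_rotation_notice <reason> <old_key> <old_fp> <new_key> <new_fp> <revoke_at>
render_rotation_notice() {
  reason=$1
  old_key=$2
  old_fp=$3
  new_key=$4
  new_fp=$5
  revoke_at=$6
  cat <<EOF
Subject: [StellaOps] Emergency Key Rotation - Action Required

Reason: $reason
Affected Key: $old_key (fingerprint: $old_fp)
New Key: $new_key (fingerprint: $new_fp)

Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys

Timeline: Old key will be revoked at $revoke_at
EOF
}

# Example (commented out; all values are placeholders):
# render_rotation_notice "Security precaution" \
#   rekor-key-v1 abc123 rekor-key-v2 def456 "[DATE/TIME UTC]"
```
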
## Monitoring and Alerting

### Key Metrics

- Circuit breaker state changes
- TUF metadata freshness
- Rekor submission latency
- Verification success rate

### Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| TUF metadata age | > 12h | > 24h |
| Circuit breaker opens | > 2/hour | > 5/hour |
| Submission failures | > 5% | > 20% |
| Verification failures | > 1% | > 5% |

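Every row of the table follows the same rule: a value strictly above the critical threshold is critical, otherwise a value strictly above the warning threshold is a warning. A minimal sketch of that rule for integer values follows. The `classify` function is hypothetical and is not tied to any particular monitoring system.

```bash
# Hypothetical helper: map an integer metric value to an alert level.
# Usage: classify <value> <warning_threshold> <critical_threshold>
classify() {
  value=$1
  warn=$2
  crit=$3
  if [ "$value" -gt "$crit" ]; then
    echo critical
  elif [ "$value" -gt "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

# Examples (commented out), using the TUF metadata age row in hours:
# classify 13 12 24    # warning
# classify 25 12 24    # critical
# classify 12 12 24    # ok, because the thresholds are strict ("> 12h")
```

The comparisons use shell integer arithmetic, so fractional percentages would need to be scaled to integers (for example, tenths of a percent) before being passed in.
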
## Contacts

| Role | Contact | Escalation |
|------|---------|------------|
| TUF Admin | tuf-admin@example.com | On-call |
| Security Team | security@example.com | Immediate |
| Platform Team | platform@example.com | Business hours |

## Related Documentation

- [Bootstrap Guide](bootstrap-guide.md)
- [Key Rotation Runbook](key-rotation-runbook.md)
- [TUF Integration Guide](../modules/attestor/tuf-integration.md)