fix tests. new product advisories enhancements
328
docs/operations/disaster-recovery.md
Normal file
@@ -0,0 +1,328 @@
# StellaOps Disaster Recovery Guide

> Sprint: SPRINT_20260125_003 - WORKFLOW-003
> Last updated: 2026-01-25

## Overview

This guide covers disaster recovery procedures for StellaOps trust
infrastructure, including Rekor outages, key compromise, and TUF repository
failures.

## Scenario 1: Rekor Service Outage

### Symptoms
- Attestation submissions failing
- Verification requests timing out
- Circuit breaker reporting OPEN state

### Immediate Actions

1. **Verify the outage**
   ```bash
   # Check Rekor health
   curl -sf https://rekor.sigstore.dev/api/v1/log | jq .

   # Check circuit breaker state
   stella trust status --show-circuit-breaker
   ```

2. **Check if mirror is active**
   ```bash
   # If mirror failover is enabled, verify it's working
   stella trust status --show-backends
   ```

3. **If mirror is not available, swap endpoints via TUF**
   ```bash
   # On TUF repository admin system
   ./devops/scripts/disaster-swap-endpoint.sh \
     --repo /path/to/tuf \
     --new-rekor-url https://rekor-mirror.internal:8080 \
     --note "Emergency: Production Rekor outage $(date -u)"
   ```

4. **Publish the update**
   ```bash
   cd /path/to/tuf
   ./scripts/sign-metadata.sh  # Sign updated metadata
   ./scripts/publish.sh        # Deploy to TUF server
   ```

5. **Force client sync (optional, for immediate effect)**
   ```bash
   stella trust sync --force
   ```
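
A single failed request does not always mean an outage. As a minimal sketch, the helper below repeats a probe command and reports an outage only if every attempt fails. `confirm_outage` is a hypothetical function, not part of the `stella` CLI, and the attempt count and delay are illustrative assumptions.

```bash
# Hypothetical helper (not part of the stella CLI).
# Usage: confirm_outage <attempts> <delay-seconds> <probe command...>
# Returns 0 only if every probe fails; returns 1 as soon as one succeeds.
confirm_outage() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "probe $i succeeded; not an outage"
      return 1
    fi
    echo "probe $i failed"
    # Wait between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "outage confirmed after $attempts failed probes"
  return 0
}
```

For example, `confirm_outage 3 10 curl -sf --max-time 5 https://rekor.sigstore.dev/api/v1/log` would probe the health endpoint from step 1 three times, ten seconds apart, before an endpoint swap is considered.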

### Key Principle

**No client reconfiguration required.** Endpoint changes flow through TUF.
Clients discover new endpoints within their configured refresh interval.

### Recovery

Once the primary Rekor is restored:

1. **Swap back to primary**
   ```bash
   ./devops/scripts/disaster-swap-endpoint.sh \
     --repo /path/to/tuf \
     --new-rekor-url https://rekor.sigstore.dev \
     --note "Recovery: Primary Rekor restored"
   ```

2. **Verify service map published**
   ```bash
   stella trust sync --force
   stella trust status --show-endpoints
   ```

3. **Reset circuit breakers**
   ```bash
   stella trust reset-circuits
   ```

## Scenario 2: Rekor Key Compromise

### Symptoms
- Security team reports potential key exposure
- Unusual entries in transparency log
- Third-party security advisory

### Immediate Actions

1. **Assess the compromise scope**
   - When was the key potentially exposed?
   - What entries may be affected?
   - Are there signed entries from the compromised period?

2. **Emergency key rotation**
   ```bash
   # Phase 1: Add new key immediately (no grace period)
   ./devops/scripts/rotate-rekor-key.sh add-key \
     --repo /path/to/tuf \
     --new-key /secure/new-rekor-key-v2.pub

   # Sign and publish immediately
   cd /path/to/tuf
   ./scripts/sign-metadata.sh
   ./scripts/publish.sh
   ```

3. **Force all clients to sync**
   - Announce emergency update to all teams
   - Clients should run: `stella trust sync --force`

4. **Revoke compromised key immediately**
   ```bash
   # Phase 2: Remove old key (skip grace period due to compromise)
   ./devops/scripts/rotate-rekor-key.sh remove-old \
     --repo /path/to/tuf \
     --old-key-name rekor-key-v1

   # Sign and publish
   cd /path/to/tuf
   ./scripts/sign-metadata.sh
   ./scripts/publish.sh
   ```

5. **Document the incident**
   - Log rotation time
   - Affected key ID and fingerprint
   - List of potentially affected entries
   - Remediation steps taken

### Forensics

Identify entries signed during the compromise window:

```bash
# Query entries by time range
stella rekor query \
  --after "2026-01-20T00:00:00Z" \
  --before "2026-01-25T00:00:00Z" \
  --key-id compromised-key-id
```
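
Before running a time-range query, it may be worth checking that the window is well formed, since a reversed window could return nothing and be mistaken for a clean result. The sketch below assumes both bounds are UTC timestamps in the fixed `YYYY-MM-DDTHH:MM:SSZ` form used above, in which string order matches chronological order. `valid_window` is a hypothetical helper, not a `stella` command.

```bash
# Hypothetical helper: validate a compromise window before querying.
# Usage: valid_window <after> <before>
# Returns 0 if both are UTC timestamps and <after> is strictly earlier,
# 1 if the window is empty or reversed, 2 if a timestamp is malformed.
valid_window() {
  after=$1
  before=$2
  pattern='^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
  for ts in "$after" "$before"; do
    if ! printf '%s\n' "$ts" | grep -Eq "$pattern"; then
      echo "not a UTC timestamp: $ts" >&2
      return 2
    fi
  done
  # Fixed-width timestamps compare correctly as plain strings
  if expr "$after" \< "$before" >/dev/null; then
    return 0
  fi
  echo "window is empty or reversed: $after .. $before" >&2
  return 1
}
```

The check could then guard the query, for instance `valid_window "$AFTER" "$BEFORE" && stella rekor query --after "$AFTER" --before "$BEFORE" --key-id compromised-key-id`, where `AFTER` and `BEFORE` are shell variables set by the operator.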

## Scenario 3: TUF Repository Unavailable

### Symptoms
- Clients cannot sync trust metadata
- `stella trust sync` failing with network errors
- TUF timestamp verification failing

### Immediate Actions

1. **Diagnose the issue**
   ```bash
   # Check TUF repository health
   curl -sf https://trust.example.com/tuf/timestamp.json | jq .

   # Check DNS resolution
   nslookup trust.example.com

   # Check TLS certificate (</dev/null ends the session after the handshake
   # instead of leaving s_client waiting for input)
   openssl s_client -connect trust.example.com:443 -servername trust.example.com </dev/null
   ```

2. **For clients - extend offline tolerance**
   ```bash
   # Temporarily allow stale metadata (use with caution)
   stella trust sync --allow-stale --max-age 7d
   ```

3. **Restore TUF server**
   - Check hosting infrastructure
   - Restore from backup if needed
   - Verify metadata integrity

4. **Deploy mirror (if available)**
   ```bash
   # Update DNS or load balancer to point to mirror
   # Or update clients directly (less preferred)
   stella trust init \
     --tuf-url https://trust-mirror.example.com/tuf/ \
     --force
   ```

## Scenario 4: Signing Key Compromise

### Symptoms
- Security team reports key exposure
- Unauthorized attestations appearing

### Immediate Actions

1. **Revoke the compromised key**
   ```bash
   ./devops/scripts/rotate-signing-key.sh retire \
     --old-key compromised-key-name
   ```

2. **Generate new signing key**
   ```bash
   ./devops/scripts/rotate-signing-key.sh generate \
     --key-type ecdsa-p256
   ```

3. **Update CI/CD immediately**
   - Remove compromised key from all pipelines
   - Add new key
   - Trigger rebuild of recent releases

4. **Notify downstream consumers**
   - Announce key rotation
   - Provide new public key
   - Advise re-verification of recent attestations

## Scenario 5: Root Key Ceremony Required

### When Required
- Scheduled root key rotation (typically annual)
- Root key compromise (emergency)
- Threshold change for root signatures

### Procedure

1. **Schedule ceremony**
   - Require M-of-N key holders to be present
   - Air-gapped ceremony machine
   - Hardware security modules

2. **Generate new root**
   ```bash
   # On air-gapped ceremony machine
   tuf-ceremony init \
     --threshold 3 \
     --keys 5 \
     --algorithm ed25519
   ```

3. **Sign new root with old keys**
   - Requires old threshold of signatures
   - Ensures continuous trust chain

4. **Distribute new root**
   - Publish to TUF repository
   - Update bootstrap documentation
   - Notify all operators
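
Step 3 reflects a rule from the TUF specification: a client accepts a new root only if it carries a threshold of valid signatures from the previous root's keys and a threshold from the new root's own keys. The sketch below shows only that acceptance condition. `root_rotation_ok` is a hypothetical helper, and counting the valid signatures is assumed to happen elsewhere, for example in the ceremony tooling.

```bash
# Hypothetical helper: the dual-threshold condition for accepting a new root.
# Usage: root_rotation_ok <old_valid_sigs> <old_threshold> \
#                         <new_valid_sigs> <new_threshold>
root_rotation_ok() {
  old_sigs=$1
  old_threshold=$2
  new_sigs=$3
  new_threshold=$4
  # The previous root must vouch for the new one (continuous trust chain)
  if [ "$old_sigs" -lt "$old_threshold" ]; then
    echo "rejected: $old_sigs of $old_threshold old-root signatures" >&2
    return 1
  fi
  # The new root must also satisfy its own threshold
  if [ "$new_sigs" -lt "$new_threshold" ]; then
    echo "rejected: $new_sigs of $new_threshold new-root signatures" >&2
    return 1
  fi
  echo "accepted"
}
```

With the 3-of-5 configuration from step 2 and an old threshold that is also 3, `root_rotation_ok 3 3 3 3` succeeds, while `root_rotation_ok 2 3 5 3` is rejected because too few old key holders signed.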

### Air-Gap Considerations

For air-gapped deployments after root rotation:

```bash
# Export new trust bundle with updated root
stella trust snapshot export \
  --include-root \
  --out post-rotation-bundle.tar.zst

# Transfer and import on air-gapped systems
./devops/scripts/bootstrap-trust-offline.sh \
  post-rotation-bundle.tar.zst \
  --force  # Required due to root change
```
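
Because the bundle crosses the air gap on removable media, it can be useful to confirm that it arrived intact before importing it. This sketch assumes GNU coreutils `sha256sum` is available on both sides. `seal_bundle` and `check_bundle` are hypothetical helper names, and a checksum only detects corruption in transit; it is not a substitute for signature verification.

```bash
# Hypothetical helpers; assume GNU coreutils sha256sum is installed.

# On the connected side: write <bundle>.sha256 next to the bundle.
seal_bundle() {
  sha256sum "$1" > "$1.sha256"
}

# On the air-gapped side: recompute the digest and compare it with the
# recorded one. Only the hash is compared, so the bundle may live at a
# different path than it did when it was sealed.
check_bundle() {
  expected=$(cut -d' ' -f1 < "$1.sha256")
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  [ "$expected" = "$actual" ]
}
```

A typical flow would be `seal_bundle post-rotation-bundle.tar.zst` after the export, copying both files across, and running `check_bundle post-rotation-bundle.tar.zst` before the bootstrap script.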

## Communication Templates

### Outage Notification

```
Subject: [StellaOps] Rekor Service Disruption - Failover Active

Status: Service Degradation
Impact: Attestation submissions may be delayed
Mitigation: Automatic failover to mirror active

Action Required: None - clients will auto-discover new endpoint

Updates: Monitor status at https://status.example.com
```

### Key Rotation Notice

```
Subject: [StellaOps] Emergency Key Rotation - Action Required

Reason: Security precaution / Scheduled rotation
Affected Key: rekor-key-v1 (fingerprint: abc123...)
New Key: rekor-key-v2 (fingerprint: def456...)

Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys

Timeline: Old key will be revoked at [DATE/TIME UTC]
```

## Monitoring and Alerting

### Key Metrics

- Circuit breaker state changes
- TUF metadata freshness
- Rekor submission latency
- Verification success rate

### Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| TUF metadata age | > 12h | > 24h |
| Circuit breaker opens | > 2/hour | > 5/hour |
| Submission failures | > 5% | > 20% |
| Verification failures | > 1% | > 5% |
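
The table uses strict "greater than" thresholds, so a value exactly at the limit does not alert. As a minimal sketch of how one row could be evaluated, the hypothetical `classify` helper below takes an integer measurement plus the warning and critical limits; how the measurement is collected is left out and would depend on the monitoring stack.

```bash
# Hypothetical helper: map a measurement to an alert level.
# Usage: classify <value> <warning-threshold> <critical-threshold>
# All arguments are integers (hours, opens per hour, or whole percent).
classify() {
  value=$1
  warn=$2
  crit=$3
  if [ "$value" -gt "$crit" ]; then
    echo CRITICAL
  elif [ "$value" -gt "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

classify 13 12 24   # TUF metadata age of 13h: prints WARNING
classify 6 2 5      # 6 circuit breaker opens in an hour: prints CRITICAL
classify 5 5 20     # 5% submission failures: prints OK (not above 5%)
```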

## Contacts

| Role | Contact | Escalation |
|------|---------|------------|
| TUF Admin | tuf-admin@example.com | On-call |
| Security Team | security@example.com | Immediate |
| Platform Team | platform@example.com | Business hours |

## Related Documentation

- [Bootstrap Guide](bootstrap-guide.md)
- [Key Rotation Runbook](key-rotation-runbook.md)
- [TUF Integration Guide](../modules/attestor/tuf-integration.md)