StellaOps Disaster Recovery Guide

Sprint: SPRINT_20260125_003 - WORKFLOW-003
Last updated: 2026-01-25

Overview

This guide covers disaster recovery procedures for StellaOps trust infrastructure, including Rekor outages, key compromise, and TUF repository failures.

Scenario 1: Rekor Service Outage

Symptoms

  • Attestation submissions failing
  • Verification requests timing out
  • Circuit breaker reporting OPEN state

Immediate Actions

  1. Verify the outage

    # Check Rekor health
    curl -sf https://rekor.sigstore.dev/api/v1/log | jq .
    
    # Check circuit breaker state
    stella trust status --show-circuit-breaker
    
  2. Check if mirror is active

    # If mirror failover is enabled, verify it's working
    stella trust status --show-backends
    
  3. If mirror is not available, swap endpoints via TUF

    # On TUF repository admin system
    ./devops/scripts/disaster-swap-endpoint.sh \
      --repo /path/to/tuf \
      --new-rekor-url https://rekor-mirror.internal:8080 \
      --note "Emergency: Production Rekor outage $(date -u)"
    
  4. Publish the update

    cd /path/to/tuf
    ./scripts/sign-metadata.sh  # Sign updated metadata
    ./scripts/publish.sh        # Deploy to TUF server
    
  5. Force client sync (optional, for immediate effect)

    stella trust sync --force
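
A single failed request can be a transient blip, so it may be worth repeating the health check from step 1 a few times before swapping endpoints. The sketch below is one way to do that. `probe_with_retries` is a hypothetical helper, not part of the `stella` CLI, and the attempt count and delay are placeholders to tune.

```shell
# Run a health probe up to <attempts> times before declaring an outage.
# Usage: probe_with_retries <attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
probe_with_retries() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy (attempt $i of $attempts)"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "unhealthy after $attempts attempts"
  return 1
}
```

For example, `probe_with_retries 3 10 curl -sf --max-time 5 https://rekor.sigstore.dev/api/v1/log` makes up to three attempts ten seconds apart. Only a non-zero exit status would then point to a real outage.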
    

Key Principle

No client reconfiguration required. Endpoint changes flow through TUF. Clients discover new endpoints within their configured refresh interval.

Recovery

Once the primary Rekor is restored:

  1. Swap back to primary

    ./devops/scripts/disaster-swap-endpoint.sh \
      --repo /path/to/tuf \
      --new-rekor-url https://rekor.sigstore.dev \
      --note "Recovery: Primary Rekor restored"
    
  2. Publish the change (sign and publish as in step 4 of Immediate Actions), then verify the service map

    stella trust sync --force
    stella trust status --show-endpoints
    
  3. Reset circuit breakers

    stella trust reset-circuits
    

Scenario 2: Rekor Key Compromise

Symptoms

  • Security team reports potential key exposure
  • Unusual entries in transparency log
  • Third-party security advisory

Immediate Actions

  1. Assess the compromise scope

    • When was the key potentially exposed?
    • What entries may be affected?
    • Are there signed entries from the compromised period?
  2. Emergency key rotation

    # Phase 1: Add new key immediately (no grace period)
    ./devops/scripts/rotate-rekor-key.sh add-key \
      --repo /path/to/tuf \
      --new-key /secure/new-rekor-key-v2.pub
    
    # Sign and publish immediately
    cd /path/to/tuf
    ./scripts/sign-metadata.sh
    ./scripts/publish.sh
    
  3. Force all clients to sync

    • Announce emergency update to all teams
    • Clients should run: stella trust sync --force
  4. Revoke compromised key immediately

    # Phase 2: Remove old key (skip grace period due to compromise)
    ./devops/scripts/rotate-rekor-key.sh remove-old \
      --repo /path/to/tuf \
      --old-key-name rekor-key-v1
    
    # Sign and publish
    cd /path/to/tuf
    ./scripts/sign-metadata.sh
    ./scripts/publish.sh
    
  5. Document the incident

    • Log rotation time
    • Affected key ID and fingerprint
    • List of potentially affected entries
    • Remediation steps taken
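
The order of steps 2 to 4 matters: the new key has to be published before the old one is removed. The sketch below shows that sequence, assuming the script paths and flags used in the steps above. `run_step`, `publish_tuf`, and `emergency_rotate` are hypothetical wrappers. `DRY_RUN` defaults to `1`, which prints each command instead of running it.

```shell
# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sign and publish from inside the TUF repository.
publish_tuf() {
  local repo="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ (cd $repo && ./scripts/sign-metadata.sh && ./scripts/publish.sh)"
  else
    (cd "$repo" && ./scripts/sign-metadata.sh && ./scripts/publish.sh)
  fi
}

# Usage: emergency_rotate <tuf_repo> <new_public_key> <old_key_name>
emergency_rotate() {
  local repo="$1" new_key="$2" old_key_name="$3"
  run_step ./devops/scripts/rotate-rekor-key.sh add-key \
    --repo "$repo" --new-key "$new_key" || return 1
  publish_tuf "$repo" || return 1
  echo "# reminder: announce the update; clients run: stella trust sync --force"
  run_step ./devops/scripts/rotate-rekor-key.sh remove-old \
    --repo "$repo" --old-key-name "$old_key_name" || return 1
  publish_tuf "$repo"
}
```

Running `emergency_rotate /path/to/tuf /secure/new-rekor-key-v2.pub rekor-key-v1` with the default `DRY_RUN=1` prints the five steps for review. Set `DRY_RUN=0` only on the TUF repository admin system.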

Forensics

Identify entries signed during the compromise window:

# Query entries by time range
stella rekor query \
  --after "2026-01-20T00:00:00Z" \
  --before "2026-01-25T00:00:00Z" \
  --key-id compromised-key-id
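
If candidate entries are exported together with their timestamps, a local filter can double-check which ones fall inside the window. This sketch assumes every timestamp uses the same UTC format as the query above (`YYYY-MM-DDTHH:MM:SSZ`). It treats the window as including the start and excluding the end; this guide does not say whether `stella rekor query` applies the same boundary rule. Both function names are hypothetical.

```shell
# Reduce a timestamp such as 2026-01-20T00:00:00Z to the integer
# 20260120000000 so that two timestamps can be compared numerically.
ts_to_int() {
  printf '%s' "$1" | tr -cd '0-9'
}

# Succeeds when <timestamp> lies in the half-open window [after, before).
# Usage: in_compromise_window <timestamp> <after> <before>
in_compromise_window() {
  local ts after before
  ts=$(ts_to_int "$1")
  after=$(ts_to_int "$2")
  before=$(ts_to_int "$3")
  [ "$ts" -ge "$after" ] && [ "$ts" -lt "$before" ]
}
```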

Scenario 3: TUF Repository Unavailable

Symptoms

  • Clients cannot sync trust metadata
  • stella trust sync failing with network errors
  • TUF timestamp verification failing

Immediate Actions

  1. Diagnose the issue

    # Check TUF repository health
    curl -sf https://trust.example.com/tuf/timestamp.json | jq .
    
    # Check DNS resolution
    nslookup trust.example.com
    
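    # Check TLS certificate (stdin is closed so s_client exits after the handshake)
    openssl s_client -connect trust.example.com:443 -servername trust.example.com </dev/null 2>/dev/null \
      | openssl x509 -noout -subject -dates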
    
  2. On clients: extend offline tolerance

    # Temporarily allow stale metadata (use with caution)
    stella trust sync --allow-stale --max-age 7d
    
  3. Restore TUF server

    • Check hosting infrastructure
    • Restore from backup if needed
    • Verify metadata integrity
  4. Deploy mirror (if available)

    # Update DNS or load balancer to point to mirror
    # Or update clients directly (less preferred)
    stella trust init \
      --tuf-url https://trust-mirror.example.com/tuf/ \
      --force
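
The three checks in step 1 (HTTP, DNS, TLS) can often be narrowed down from the exit code of a single `curl` probe. The mapping below uses exit codes documented in the curl manual. `explain_curl_exit` is a hypothetical helper, and its labels are only a first hint about which layer to look at.

```shell
# Map a curl exit code to the layer that most likely failed.
# Codes come from the "EXIT CODES" section of the curl manual.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: metadata fetched" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "network: failed to connect" ;;
    22) echo "http: server returned an error status" ;;
    28) echo "network: operation timed out" ;;
    35) echo "tls: handshake failed" ;;
    60) echo "tls: certificate could not be verified" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}
```

To use it, run `curl -sf --max-time 10 -o /dev/null https://trust.example.com/tuf/timestamp.json` and then `explain_curl_exit "$?"`. Code 22 appears only because `-f` makes curl fail on HTTP error statuses.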
    

Scenario 4: Signing Key Compromise

Symptoms

  • Security team reports key exposure
  • Unauthorized attestations appearing

Immediate Actions

  1. Revoke the compromised key

    ./devops/scripts/rotate-signing-key.sh retire \
      --old-key compromised-key-name
    
  2. Generate new signing key

    ./devops/scripts/rotate-signing-key.sh generate \
      --key-type ecdsa-p256
    
  3. Update CI/CD immediately

    • Remove compromised key from all pipelines
    • Add new key
    • Trigger rebuild of recent releases
  4. Notify downstream consumers

    • Announce key rotation
    • Provide new public key
    • Advise re-verification of recent attestations
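
Step 3 depends on finding every pipeline that still references the compromised key. The sketch below assumes the pipeline definitions are available as files in a local checkout and that the key is referenced by name. `find_key_references` is a hypothetical helper, and secrets held in a CI system's own store would need a separate check.

```shell
# List files under <dir> that still mention <key_name>.
# -r recurses, -l prints only file names, -F matches the name literally.
# Exits non-zero when no file references the key.
find_key_references() {
  local key_name="$1" dir="$2"
  grep -rlF -- "$key_name" "$dir"
}
```

An empty result with a non-zero exit status means no remaining references were found in that directory tree.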

Scenario 5: Root Key Ceremony Required

When Required

  • Scheduled root key rotation (typically annual)
  • Root key compromise (emergency)
  • Threshold change for root signatures

Procedure

  1. Schedule ceremony

    • Require M-of-N key holders to be present
    • Air-gapped ceremony machine
    • Hardware security modules
  2. Generate new root

    # On air-gapped ceremony machine
    tuf-ceremony init \
      --threshold 3 \
      --keys 5 \
      --algorithm ed25519
    
  3. Sign new root with old keys

    • Requires old threshold of signatures
    • Ensures continuous trust chain
  4. Distribute new root

    • Publish to TUF repository
    • Update bootstrap documentation
    • Notify all operators
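
Step 3 is what keeps the trust chain continuous. Under the TUF specification, clients accept a new root only if it carries a threshold of signatures from the previous root's keys and a threshold from its own keys. The check below is a sketch of that rule for use as a ceremony checklist item. `root_rotation_signed` is a hypothetical helper that only compares counts; it does not verify any signatures.

```shell
# Succeeds when a new root has enough signatures from both key sets.
# Usage: root_rotation_signed <old_sigs> <old_threshold> <new_sigs> <new_threshold>
root_rotation_signed() {
  local old_sigs="$1" old_threshold="$2" new_sigs="$3" new_threshold="$4"
  if [ "$old_sigs" -lt "$old_threshold" ]; then
    echo "missing $((old_threshold - old_sigs)) signature(s) from the old root keys"
    return 1
  fi
  if [ "$new_sigs" -lt "$new_threshold" ]; then
    echo "missing $((new_threshold - new_sigs)) signature(s) from the new root keys"
    return 1
  fi
  echo "trust chain intact"
}
```

With the 3-of-5 setup from step 2 and an old threshold that is also 3, `root_rotation_signed 3 3 3 3` succeeds, while `root_rotation_signed 2 3 5 3` reports one missing signature from the old root keys.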

Air-Gap Considerations

For air-gapped deployments after root rotation:

# Export new trust bundle with updated root
stella trust snapshot export \
  --include-root \
  --out post-rotation-bundle.tar.zst

# Transfer and import on air-gapped systems
./devops/scripts/bootstrap-trust-offline.sh \
  post-rotation-bundle.tar.zst \
  --force  # Required due to root change
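
Because the bundle crosses the air gap on removable media, it can be worth comparing its digest before the import. The sketch assumes the expected SHA-256 digest is communicated out of band and that GNU `sha256sum` is available (on macOS, `shasum -a 256` is the usual equivalent). This guide does not say whether `bootstrap-trust-offline.sh` performs its own integrity check. `verify_bundle_digest` is a hypothetical helper.

```shell
# Succeeds when the SHA-256 digest of <file> equals <expected_hex>.
# Usage: verify_bundle_digest <file> <expected_hex>
verify_bundle_digest() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" | cut -d ' ' -f 1)
  if [ "$actual" = "$expected" ]; then
    echo "digest ok: $file"
  else
    echo "digest mismatch: $file" >&2
    return 1
  fi
}
```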

Communication Templates

Outage Notification

Subject: [StellaOps] Rekor Service Disruption - Failover Active

Status: Service Degradation
Impact: Attestation submissions may be delayed
Mitigation: Automatic failover to mirror active

Action Required: None - clients will auto-discover new endpoint

Updates: Monitor status at https://status.example.com

Key Rotation Notice

Subject: [StellaOps] Emergency Key Rotation - Action Required

Reason: Security precaution / Scheduled rotation
Affected Key: rekor-key-v1 (fingerprint: abc123...)
New Key: rekor-key-v2 (fingerprint: def456...)

Action Required:
1. Run: stella trust sync --force
2. Verify: stella trust status --show-keys

Timeline: Old key will be revoked at [DATE/TIME UTC]

Monitoring and Alerting

Key Metrics

  • Circuit breaker state changes
  • TUF metadata freshness
  • Rekor submission latency
  • Verification success rate

Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| TUF metadata age | > 12h | > 24h |
| Circuit breaker opens | > 2/hour | > 5/hour |
| Submission failures | > 5% | > 20% |
| Verification failures | > 1% | > 5% |
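
Each row above follows the same pattern: a value beyond the second threshold is critical, and a value beyond the first is a warning. The sketch below encodes that pattern for integer values. `classify_metric` is a hypothetical helper, and how the values are collected depends on the monitoring stack in use.

```shell
# Classify a measured value against warning and critical thresholds.
# Usage: classify_metric <value> <warning_threshold> <critical_threshold>
# All three arguments are integers in the same unit (hours, percent, ...).
classify_metric() {
  local value="$1" warning="$2" critical="$3"
  if [ "$value" -gt "$critical" ]; then
    echo "critical"
  elif [ "$value" -gt "$warning" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

For TUF metadata age in hours, `classify_metric 13 12 24` prints `warning` and `classify_metric 30 12 24` prints `critical`. The comparisons are strict, matching the `>` in the table.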

Contacts

| Role | Contact | Escalation |
|------|---------|------------|
| TUF Admin | tuf-admin@example.com | On-call |
| Security Team | security@example.com | Immediate |
| Platform Team | platform@example.com | Business hours |