# Key Escrow and Recovery Runbook
This runbook documents Shamir secret sharing key escrow and recovery procedures in Stella Ops.
> **Sprint:** SPRINT_20260112_018_CRYPTO_key_escrow_shamir
## Overview
Key escrow ensures critical cryptographic keys can be recovered if primary access is lost. Stella Ops uses Shamir's Secret Sharing to split keys into shares distributed among trusted custodians.
Key features:
- M-of-N threshold recovery (any M shares reconstruct the key)
- Share encryption at rest
- Custodian-based share distribution
- Integration with dual-control ceremonies
- Full audit trail
## When to Use Key Escrow
| Scenario | Escrow Required |
|----------|-----------------|
| Root signing keys | Yes |
| HSM master keys | Yes |
| Trust anchor keys | Yes |
| Service signing keys | Recommended |
| User signing keys | Optional |
| Ephemeral keys | No |
## Shamir Secret Sharing
### How It Works
Shamir's Secret Sharing splits a secret into N shares such that any M of them can reconstruct the original, while fewer than M reveal nothing about it:
```
Secret S → Split(S, M, N) → [Share₁, Share₂, ..., Shareₙ]
Any M shares → Combine → Secret S
Fewer than M shares → Cannot reconstruct
```
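The split and combine steps above can be sketched in Python. This is a minimal illustration of the textbook scheme over a prime field, not the Stella Ops implementation: the prime `PRIME`, the function names `split` and `combine`, and the integer encoding of the secret are all assumptions made for the example.

```python
import secrets

# A Mersenne prime used as the field modulus (illustrative choice).
# Secrets must be integers smaller than this value.
PRIME = 2**127 - 1


def split(secret: int, threshold: int, total: int) -> list[tuple[int, int]]:
    """Split `secret` into `total` shares; any `threshold` of them recover it."""
    if not 2 <= threshold <= total:
        raise ValueError("need 2 <= threshold <= total")
    if not 0 <= secret < PRIME:
        raise ValueError("secret must fit in the field")
    # Random polynomial of degree threshold-1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, total + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def combine(shares: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Calling `split(s, 3, 5)` and passing any three of the returned shares to `combine` yields `s` again. Production implementations often work byte-wise over GF(2^8) instead, so that secrets of any length can be split.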
### Configuration Parameters
| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| Threshold (M) | Minimum shares needed | 2-3, depending on key type (see Threshold Guidelines) |
| Total Shares (N) | Total shares created | M + 1 or more; M + 2 for root and HSM keys |
| Share Encryption | Encrypt shares at rest | Always enabled |
### Threshold Guidelines
| Key Type | Minimum M | Recommended N | Rationale |
|----------|-----------|---------------|-----------|
| Root keys | 3 | 5 | High assurance |
| HSM keys | 2 | 4 | Availability + security |
| Service keys | 2 | 3 | Operational recovery |
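The guidelines above can be checked mechanically before an escrow is created. The following is a sketch only: the key-type labels `root`, `hsm`, and `service` and both function names are hypothetical, and the numbers are copied from the table.

```python
# Minimum M and recommended N per key type, copied from the table above.
# The dictionary keys are illustrative labels, not Stella Ops identifiers.
GUIDELINES = {"root": (3, 5), "hsm": (2, 4), "service": (2, 3)}


def check_parameters(key_type: str, threshold: int, total: int) -> list[str]:
    """Return a list of problems; an empty list means the choice fits."""
    min_m, rec_n = GUIDELINES[key_type]
    problems = []
    if threshold > total:
        problems.append("threshold exceeds total shares")
    if threshold < min_m:
        problems.append(f"threshold {threshold} is below minimum {min_m}")
    if total < rec_n:
        problems.append(f"total shares {total} is below recommended {rec_n}")
    return problems


def loss_tolerance(threshold: int, total: int) -> int:
    """Number of shares that can be lost while recovery stays possible."""
    return total - threshold
```

For a 3-of-5 root key, `loss_tolerance(3, 5)` is 2: two custodians can be unavailable and recovery still succeeds.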
## Escrowing a Key
### Via CLI
```bash
stella escrow create \
  --key-id root-signing-key-2026 \
  --threshold 3 \
  --shares 5 \
  --custodians custodian-1,custodian-2,custodian-3,custodian-4,custodian-5 \
  --expires-in 365d \
  --reason "Annual key escrow for root signing key"
```
### Via API
```bash
curl -X POST https://signer.example.com/api/v1/escrow \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "keyId": "root-signing-key-2026",
    "threshold": 3,
    "totalShares": 5,
    "custodianIds": [
      "custodian-1", "custodian-2", "custodian-3",
      "custodian-4", "custodian-5"
    ],
    "expirationDays": 365,
    "reason": "Annual key escrow for root signing key"
  }'
```
### Escrow Response
```json
{
  "escrowId": "esc-abc123",
  "keyId": "root-signing-key-2026",
  "threshold": 3,
  "totalShares": 5,
  "status": "Active",
  "createdAt": "2026-01-16T10:00:00Z",
  "expiresAt": "2027-01-16T10:00:00Z",
  "shares": [
    { "shareId": "shr-001", "custodianId": "custodian-1", "distributed": true },
    { "shareId": "shr-002", "custodianId": "custodian-2", "distributed": true },
    { "shareId": "shr-003", "custodianId": "custodian-3", "distributed": true },
    { "shareId": "shr-004", "custodianId": "custodian-4", "distributed": true },
    { "shareId": "shr-005", "custodianId": "custodian-5", "distributed": true }
  ]
}
```
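An automated caller may want to confirm that every share in the response was actually distributed. A minimal sketch, assuming the response body has exactly the fields shown above (the function name `undistributed_shares` is hypothetical):

```python
import json


def undistributed_shares(response_body: str) -> list[str]:
    """Return custodian IDs whose share is not yet marked as distributed."""
    escrow = json.loads(response_body)
    if len(escrow["shares"]) != escrow["totalShares"]:
        raise ValueError("share list does not match totalShares")
    return [s["custodianId"] for s in escrow["shares"] if not s["distributed"]]
```

An empty list means all custodians have received their share; any returned ID should be followed up before the escrow is relied upon.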
## Share Distribution
### Distribution Methods
| Method | Security | Use Case |
|--------|----------|----------|
| Direct API delivery | High | Automated systems |
| Encrypted email | Medium | Remote custodians |
| In-person ceremony | Highest | Root keys |
| Hardware token | Highest | HSM keys |
### Custodian Requirements
Each custodian must:
1. Have verified identity in Authority
2. Complete escrow custodian training
3. Have secure share storage capability
4. Be geographically distributed (recommended)
### Verifying Share Distribution
```bash
stella escrow status --escrow-id esc-abc123
# Output:
# Escrow: esc-abc123
# Key: root-signing-key-2026
# Status: Active
# Threshold: 3 of 5
# Shares:
# [1] custodian-1: Distributed ✓
# [2] custodian-2: Distributed ✓
# [3] custodian-3: Distributed ✓
# [4] custodian-4: Distributed ✓
# [5] custodian-5: Distributed ✓
```
## Key Recovery
### Prerequisites
Recovery requires:
1. Valid recovery request (incident, key loss, rotation)
2. Dual-control ceremony approval (if configured)
3. At least M custodians available with their shares
4. Secure recovery environment
### Recovery Workflow
```
1. Initiate recovery request
2. (If required) Dual-control ceremony approval
3. Collect shares from M custodians
4. Verify share checksums
5. Reconstruct key
6. Verify reconstructed key
7. Log recovery event
```
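Steps 3 and 4 of the workflow (collect shares, verify checksums) can be sketched as a small collector that only reports readiness once M valid shares are in. This is an illustration under assumptions: the class `RecoverySession` and helper `share_checksum` are hypothetical, and the checksum is assumed to be a hex SHA-256 digest of the encrypted share bytes with a `sha256:` prefix.

```python
import hashlib
import hmac


def share_checksum(encrypted_share: bytes) -> str:
    """Assumed checksum format: 'sha256:' + hex digest of the encrypted bytes."""
    return "sha256:" + hashlib.sha256(encrypted_share).hexdigest()


class RecoverySession:
    """Collects verified shares until the threshold is reached."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.shares: dict[str, bytes] = {}

    @property
    def ready(self) -> bool:
        return len(self.shares) >= self.threshold

    def submit(self, share_id: str, encrypted_share: bytes, checksum: str) -> bool:
        """Accept a share if its checksum matches; return True once ready."""
        if share_id in self.shares:
            raise ValueError(f"share {share_id} already submitted")
        # Constant-time comparison avoids leaking how much of the digest matched.
        if not hmac.compare_digest(share_checksum(encrypted_share), checksum):
            raise ValueError(f"checksum mismatch for {share_id}")
        self.shares[share_id] = encrypted_share
        return self.ready
```

Rejecting duplicate share IDs matters here: the same share submitted twice must not count twice toward the threshold.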
### Via CLI
```bash
# Step 1: Initiate recovery
stella escrow recover init \
  --escrow-id esc-abc123 \
  --reason "HSM failure - emergency key recovery" \
  --ceremony-required

# Step 2: Collect shares (each custodian runs this with their own share)
stella escrow recover submit-share \
  --recovery-id rec-xyz789 \
  --share-file /secure/my-share.enc \
  --passphrase-file /secure/passphrase

# Step 3: Execute recovery (after the threshold is reached)
stella escrow recover execute \
  --recovery-id rec-xyz789 \
  --output-key-file /secure/recovered-key.pem
```
### Via API
```bash
# Initiate recovery
curl -X POST https://signer.example.com/api/v1/escrow/esc-abc123/recover \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "reason": "HSM failure - emergency key recovery",
    "requireCeremony": true
  }'

# Submit share
curl -X POST https://signer.example.com/api/v1/recovery/rec-xyz789/shares \
  -H "Authorization: Bearer $CUSTODIAN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "shareId": "shr-001",
    "encryptedShare": "base64-encoded-share",
    "checksum": "sha256:abc123..."
  }'

# Execute recovery (after threshold)
curl -X POST https://signer.example.com/api/v1/recovery/rec-xyz789/execute \
  -H "Authorization: Bearer $TOKEN"
```
### Recovery Response
```json
{
  "recoveryId": "rec-xyz789",
  "status": "Completed",
  "keyId": "root-signing-key-2026",
  "sharesCollected": 3,
  "threshold": 3,
  "completedAt": "2026-01-16T15:30:00Z",
  "keyFingerprint": "SHA256:xyz789...",
  "verified": true
}
```
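Before the recovered key is put back into service, the response can be checked against a fingerprint recorded when the key was originally escrowed. A sketch, assuming the response fields shown above; the function name and the idea of an independently stored `expected_fingerprint` are assumptions for illustration.

```python
import hmac
import json


def recovery_succeeded(response_body: str, expected_fingerprint: str) -> bool:
    """True only if the recovery completed, verified, and matches the known key."""
    r = json.loads(response_body)
    return (
        r["status"] == "Completed"
        and r["verified"] is True
        and r["sharesCollected"] >= r["threshold"]
        and hmac.compare_digest(r["keyFingerprint"], expected_fingerprint)
    )
```

Comparing against an independently stored fingerprint guards against a reconstruction that "succeeds" but yields the wrong key.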
## Share Management
### Custodian Share Storage
Custodians should store their shares using one of the following:
| Storage | Security Level | Notes |
|---------|----------------|-------|
| HSM | Highest | Preferred for root keys |
| Hardware token | High | YubiKey, smart card |
| Encrypted file | Medium | AES-256-GCM minimum |
| Password manager | Medium | Enterprise vault only |
### Share Format
```json
{
  "shareId": "shr-001",
  "escrowId": "esc-abc123",
  "index": 1,
  "threshold": 3,
  "totalShares": 5,
  "encryptedData": "base64-encoded-aes-256-gcm-ciphertext",
  "checksum": "sha256:abc123...",
  "createdAt": "2026-01-16T10:00:00Z",
  "expiresAt": "2027-01-16T10:00:00Z"
}
```
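A custodian-side sanity check on a share record might look like the following. It is a sketch that relies only on the fields shown above; the function name `share_is_usable` is hypothetical, and `now` must be a timezone-aware datetime.

```python
from datetime import datetime, timezone


def share_is_usable(share: dict, now: datetime) -> bool:
    """Check index bounds, threshold consistency, and expiry of a share record."""
    # fromisoformat() in older Python versions does not accept a trailing 'Z'.
    expires = datetime.fromisoformat(share["expiresAt"].replace("Z", "+00:00"))
    return (
        1 <= share["index"] <= share["totalShares"]
        and share["threshold"] <= share["totalShares"]
        and now < expires
    )
```

For example, `share_is_usable(share, datetime.now(timezone.utc))` returns `False` once the `expiresAt` timestamp has passed, which is a cue to re-escrow.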
### Share Rotation
Re-escrow keys periodically:
```bash
stella escrow re-escrow \
  --escrow-id esc-abc123 \
  --new-custodians custodian-1,custodian-2,custodian-6,custodian-7,custodian-8 \
  --reason "Annual share rotation"
```
This creates new shares and revokes old ones.
## Audit Trail
### Audit Events
| Event | Description |
|-------|-------------|
| `escrow.created` | Key escrowed |
| `escrow.share.distributed` | Share sent to custodian |
| `escrow.share.accessed` | Custodian accessed share |
| `recovery.initiated` | Recovery started |
| `recovery.share.submitted` | Share submitted for recovery |
| `recovery.completed` | Key reconstructed |
| `recovery.failed` | Recovery failed |
| `escrow.revoked` | Escrow revoked |
### Query Audit Logs
```bash
stella audit query \
  --event-type "escrow.*,recovery.*" \
  --escrow-id esc-abc123 \
  --since 30d
```
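The same filter can be applied to exported audit events. This sketch assumes the `--event-type` value is a comma-separated list of shell-style wildcard patterns, which the CLI's actual matching rules may or may not follow; the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_event_filter(event_type: str, patterns: str) -> bool:
    """True if `event_type` matches any comma-separated wildcard pattern."""
    return any(fnmatchcase(event_type, p.strip()) for p in patterns.split(","))
```

With the filter `"escrow.*,recovery.*"`, the event `escrow.share.distributed` matches, while an unrelated event type does not.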
## Configuration
### Escrow Settings
```yaml
# escrow-config.yaml
escrow:
  enabled: true
  defaultThreshold: 2
  minimumThreshold: 2
  maximumShares: 10
  shareEncryption:
    algorithm: AES-256-GCM
    keyDerivation: HKDF-SHA256
  requireDualControlForRecovery: true
  maxRecoveryAttempts: 3
  recoveryTimeoutHours: 24
```
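The two recovery limits in this configuration interact: a recovery is only viable while it is inside its time window and has attempts left. The sketch below assumes that `recoveryTimeoutHours` is measured from initiation and that `maxRecoveryAttempts` counts failed attempts; both interpretations, and the function name, are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the escrow settings above.
MAX_RECOVERY_ATTEMPTS = 3
RECOVERY_TIMEOUT_HOURS = 24


def recovery_allowed(initiated_at: datetime, now: datetime, failed_attempts: int) -> bool:
    """True while the recovery is inside its time window and attempt budget."""
    within_window = now - initiated_at < timedelta(hours=RECOVERY_TIMEOUT_HOURS)
    return within_window and failed_attempts < MAX_RECOVERY_ATTEMPTS
```

Once this returns `False`, the recovery has to be restarted, which matches the "Recovery timeout" entry under Troubleshooting.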
### Custodian Configuration
```yaml
# custodians.yaml
custodians:
  - id: custodian-1
    name: "Security Lead"
    email: security-lead@company.com
    publicKey: "-----BEGIN PUBLIC KEY-----..."
    location: "US-East"
  - id: custodian-2
    name: "Key Officer A"
    email: key-officer-a@company.com
    publicKey: "-----BEGIN PUBLIC KEY-----..."
    location: "EU-West"
```
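The `location` field makes it possible to check the geographic-distribution recommendation automatically. A minimal sketch, assuming the custodian list has been loaded into dictionaries with the keys shown above; the function name `location_risks` is hypothetical.

```python
from collections import Counter


def location_risks(custodians: list[dict], threshold: int) -> list[str]:
    """Locations that by themselves hold at least `threshold` shares."""
    counts = Counter(c["location"] for c in custodians)
    return sorted(loc for loc, n in counts.items() if n >= threshold)
```

Any location returned here concentrates enough custodians to reconstruct the key alone, so a single-site incident or local collusion could reach the threshold.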
## Security Considerations
### Share Security
- Never transmit shares in plaintext
- Encrypt shares with custodian's public key
- Verify checksums before and after storage
- Use secure channels for distribution
### Recovery Security
- Require dual-control ceremonies for critical keys
- Limit recovery time window
- Verify recovered key fingerprint
- Audit all recovery attempts
### Custodian Security
- Verify custodian identity before share access
- Geographic distribution reduces collusion risk
- Rotate custodians periodically
- Train custodians on secure handling
## Troubleshooting
### Common Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Share checksum mismatch | Corrupted share | Request re-distribution |
| Cannot decrypt share | Wrong passphrase | Verify passphrase |
| Recovery timeout | Shares not collected in time | Restart recovery |
| Key verification failed | Wrong shares combined | Verify share indices |
### Verification Failures
```bash
# Verify share integrity
stella escrow verify-share --share-file share.enc

# Test reconstruction with a subset of shares
stella escrow test-recovery \
  --escrow-id esc-abc123 \
  --share-files share1.enc,share2.enc,share3.enc
```
## Emergency Procedures
### Lost Share
If a custodian loses their share:
1. Verify at least M shares remain accessible
2. Re-escrow with new share set
3. Revoke the old escrow
4. Document incident
### Compromised Custodian
If a custodian is compromised:
1. Do NOT use their share for any recovery
2. Re-escrow immediately with new custodians
3. Revoke old escrow
4. Consider rotating the key itself if M or more shares may have been exposed
### Multiple Lost Shares
If fewer than M shares are available:
1. Key cannot be recovered via escrow
2. Use backup key if available
3. Generate new key and re-establish trust
4. Document as key loss incident
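The three emergency cases above reduce to a comparison between the threshold M and the number of shares that are lost or compromised. The following sketch encodes that decision; the function name and the returned action labels are hypothetical, and compromised shares are treated as unusable, in line with the guidance not to use them for recovery.

```python
def emergency_action(total: int, threshold: int, lost: int, compromised: int) -> str:
    """Map share losses and compromises to the matching procedure above."""
    available = total - lost - compromised
    if compromised >= threshold:
        # Enough shares may be exposed to reconstruct the key.
        return "consider-key-rotation"
    if available < threshold:
        # Escrow can no longer reconstruct the key.
        return "key-loss"
    if lost or compromised:
        # Recovery is still possible, but the safety margin has shrunk.
        return "re-escrow"
    return "none"
```

For a 3-of-5 escrow, one lost share gives `"re-escrow"`, while three lost shares give `"key-loss"`.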
## Related Documentation
- [Dual-Control Ceremony Runbook](./dual-control-ceremony-runbook.md)
- [Key Rotation Runbook](./key-rotation-runbook.md)
- [HSM Setup Runbook](./hsm-setup-runbook.md)
- [Cryptography Architecture](../modules/cryptography/architecture.md)