sprints completion. new product advisories prepared

2026-01-16 16:30:03 +02:00
parent a927d924e3
commit 4ca3ce8fb4
255 changed files with 42434 additions and 1020 deletions
--- a/docs/operations/break-glass-runbook.md
+++ b/docs/operations/break-glass-runbook.md
@@ -0,0 +1,331 @@
+# Break-Glass Account Runbook
+
+This runbook documents emergency access procedures using the break-glass account system when standard authentication is unavailable.
+
+> **Sprint:** SPRINT_20260112_018_AUTH_local_rbac_fallback
+
+## Overview
+
+Break-glass accounts provide emergency administrative access when:
+- PostgreSQL database is unavailable
+- OIDC/OAuth2 identity provider is unreachable
+- Authority service is degraded
+- Network isolation prevents standard authentication
+
+Break-glass access is fully audited and time-limited by design.
+
+## When to Use Break-Glass Access
+
+| Scenario | Standard Auth | Break-Glass |
+|----------|---------------|-------------|
+| Database maintenance | N/A | Use |
+| IdP outage | Unavailable | Use |
+| Network partition | Unavailable | Use |
+| Routine operations | Available | Do NOT use |
+| Security incident response | May be unavailable | Use with incident code |
+
+**CRITICAL:** Break-glass access should only be used when standard authentication is genuinely unavailable. All usage is logged and auditable.
+
+## Prerequisites
+
+### Configuration Requirements
+
+Break-glass must be explicitly enabled in local policy:
+
+```yaml
+# /etc/stellaops/authority/local-policy.yaml
+breakGlass:
+  enabled: true
+  sessionTimeoutMinutes: 15
+  maxExtensions: 2
+  allowedReasonCodes:
+    - database_maintenance
+    - idp_outage
+    - network_partition
+    - security_incident
+    - disaster_recovery
+  accounts:
+    - id: "break-glass-admin"
+      passwordHash: "$argon2id$v=19$m=65536,t=3,p=4$..."
+      roles: ["admin"]
+```
+
+### Password Hash Generation
+
+Generate password hashes using Argon2id:
+
+```bash
+# Using argon2 CLI tool (-m 16 means 2^16 KiB = 65536 KiB, matching m=65536 above)
+echo -n "your-secure-password" | argon2 "$(openssl rand -base64 16)" -id -t 3 -m 16 -p 4 -l 32 -e
+
+# Or using stella CLI
+stella auth hash-password --algorithm argon2id
+```
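+
+Before a hash is pasted into `local-policy.yaml`, it can help to confirm that its encoded parameters match the ones used above. The following is a minimal sketch, not part of the stella CLI: the function name `check_argon2_params` and the minimum values (taken from the example configuration) are assumptions for illustration.
+
+```bash
+# Hypothetical helper: checks that an encoded Argon2 hash uses argon2id with
+# at least the memory (m, KiB), time (t) and parallelism (p) costs from the
+# example configuration. The minimums are assumptions, not a stella default.
+check_argon2_params() {
+  local encoded=$1 empty variant version params rest m t p
+  # Encoded form: $argon2id$v=19$m=65536,t=3,p=4$<salt>$<hash>
+  IFS='$' read -r empty variant version params rest <<< "$encoded"
+  if [ "$variant" != "argon2id" ]; then
+    echo "not argon2id"
+    return 1
+  fi
+  IFS=',' read -r m t p <<< "$params"
+  m=${m#m=}; t=${t#t=}; p=${p#p=}
+  if [ "$m" -ge 65536 ] && [ "$t" -ge 3 ] && [ "$p" -ge 4 ]; then
+    echo "ok"
+  else
+    echo "weak parameters: m=$m t=$t p=$p"
+    return 1
+  fi
+}
+
+# Salt and hash below are placeholders, not a real credential.
+check_argon2_params '$argon2id$v=19$m=65536,t=3,p=4$c2FsdA$aGFzaA'   # prints ok
+```
+
+Single quotes around the hash matter here, because the `$` separators would otherwise be expanded by the shell.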
+
+## Break-Glass Login Procedure
+
+### Step 1: Verify Standard Auth is Unavailable
+
+Before using break-glass, confirm standard authentication is genuinely unavailable:
+
+```bash
+# Check Authority health
+curl -s https://authority.example.com/health | jq .
+
+# Check OIDC endpoint
+curl -s https://idp.example.com/.well-known/openid-configuration
+
+# Check database connectivity
+stella doctor check --component postgres
+```
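+
+The three checks above can be wrapped in a small gate so that break-glass is only considered when at least one of them fails. This is a sketch under assumptions: the function name `standard_auth_unavailable` is hypothetical, and it assumes each check command exits non-zero on failure (for `curl`, the `-f` flag makes HTTP errors count as failures).
+
+```bash
+# Hypothetical gate: succeeds (exit 0) only if at least one of the given
+# check commands fails, i.e. standard authentication looks unavailable.
+standard_auth_unavailable() {
+  local check
+  for check in "$@"; do
+    if ! bash -c "$check" >/dev/null 2>&1; then
+      echo "check failed: $check"
+      return 0
+    fi
+  done
+  echo "all checks passed; use standard authentication"
+  return 1
+}
+
+# Usage with the checks from Step 1:
+# standard_auth_unavailable \
+#   "curl -sf https://authority.example.com/health" \
+#   "curl -sf https://idp.example.com/.well-known/openid-configuration" \
+#   "stella doctor check --component postgres"
+```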
+
+### Step 2: Access Break-Glass Login
+
+Navigate to the break-glass endpoint:
+
+```
+https://authority.example.com/break-glass/login
+```
+
+Or use the CLI:
+
+```bash
+stella auth break-glass login \
+  --account break-glass-admin \
+  --reason database_maintenance
+```
+
+### Step 3: Provide Credentials and Reason
+
+| Field | Description | Required |
+|-------|-------------|----------|
+| Account ID | Break-glass account identifier | Yes |
+| Password | Account password | Yes |
+| Reason Code | Pre-approved reason code | Yes |
+| Reason Details | Free-text explanation | Recommended |
+
+**Approved Reason Codes:**
+
+| Code | Description |
+|------|-------------|
+| `database_maintenance` | Scheduled or emergency database work |
+| `idp_outage` | Identity provider unavailable |
+| `network_partition` | Network connectivity issues |
+| `security_incident` | Active security incident response |
+| `disaster_recovery` | DR/BCP activation |
+
+### Step 4: Session Created
+
+On successful authentication:
+
+- Session token issued with limited TTL (default: 15 minutes)
+- Audit event logged: `breakglass.session.created`
+- All subsequent actions are tagged with break-glass context
+
+## Session Management
+
+### Session Timeout
+
+Break-glass sessions have strict time limits:
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `sessionTimeoutMinutes` | 15 | Session lifetime |
+| `maxExtensions` | 2 | Maximum session extensions |
+| Extension period | 15 min | Time added per extension |
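+
+As a worked example of the limits above, the longest a single session can last is the initial timeout plus every extension. The sketch assumes that each extension adds its full period on top of the remaining time; the function name is hypothetical.
+
+```bash
+# Upper bound on one break-glass session, in minutes:
+# initial timeout + (max extensions x extension period).
+max_session_minutes() {
+  local timeout=$1 max_extensions=$2 extension=$3
+  echo $(( timeout + max_extensions * extension ))
+}
+
+max_session_minutes 15 2 15   # prints 45
+```
+
+With the defaults, an operator who needs more than 45 minutes has to log out and authenticate again with a fresh reason.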
+
+### Extending a Session
+
+If additional time is needed:
+
+```bash
+# CLI
+stella auth break-glass extend \
+  --session-id <session-id> \
+  --reason "database migration still running"
+
+# UI
+# Click "Extend Session" button in break-glass banner
+```
+
+Extension requires:
+1. Re-entering password
+2. Providing extension reason
+3. Not exceeding `maxExtensions` limit
+
+### Session Termination
+
+Sessions end when:
+- User explicitly logs out
+- Session timeout expires
+- Max extensions reached
+- Administrator force-terminates
+
+```bash
+# Explicit logout
+stella auth break-glass logout --session-id <session-id>
+
+# Force terminate (admin)
+stella auth break-glass terminate --session-id <session-id> --reason "normal auth restored"
+```
+
+## Audit Trail
+
+### Audit Events
+
+All break-glass activity is logged:
+
+| Event | Description |
+|-------|-------------|
+| `breakglass.session.created` | Session started |
+| `breakglass.session.extended` | Session extended |
+| `breakglass.session.terminated` | User logout |
+| `breakglass.session.expired` | Timeout reached |
+| `breakglass.auth.failed` | Authentication failed |
+| `breakglass.reason.invalid` | Invalid reason code |
+| `breakglass.extensions.exceeded` | Max extensions reached |
+
+### Audit Event Structure
+
+```json
+{
+  "eventType": "breakglass.session.created",
+  "timestamp": "2026-01-16T10:30:00Z",
+  "accountId": "break-glass-admin",
+  "sessionId": "bg-sess-abc123",
+  "reasonCode": "database_maintenance",
+  "reasonDetails": "PostgreSQL major version upgrade",
+  "sourceIp": "10.0.1.50",
+  "userAgent": "stella-cli/2027.Q1"
+}
+```
+
+### Querying Audit Logs
+
+```bash
+# List all break-glass events
+stella audit query --event-type "breakglass.*" --since "24h"
+
+# Export for compliance
+stella audit export \
+  --event-type "breakglass.*" \
+  --start 2026-01-01 \
+  --end 2026-01-31 \
+  --format json \
+  --output break-glass-audit-jan2026.json
+```
+
+## Fallback Policy Store
+
+### Automatic Failover
+
+When PostgreSQL becomes unavailable:
+
+1. Authority detects health check failures
+2. After `failureThreshold` (default: 3) consecutive failures, Authority switches to the local policy store
+3. Mode changes to `Fallback`
+4. Event logged: `authority.mode.changed`
+
+### Policy Store Modes
+
+| Mode | Description | Available Features |
+|------|-------------|-------------------|
+| `Primary` | PostgreSQL available | Full RBAC, user management |
+| `Fallback` | Using local policy | Break-glass only |
+| `Degraded` | PostgreSQL and local policy store both degraded | Emergency access only |
+
+### Recovery
+
+When PostgreSQL recovers:
+
+1. Health checks pass
+2. After `minFallbackDurationMs` (default: 30s) cooldown
+3. Authority switches back to Primary
+4. Fallback sessions can continue until expiry
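+
+The failover and recovery steps above amount to counting consecutive health check failures. The sketch below illustrates that logic only; it is not the Authority implementation, the function name is hypothetical, and the `minFallbackDurationMs` cooldown is deliberately left out to keep it short.
+
+```bash
+# Hypothetical model of the mode switch.
+# Args: failure threshold, then a sequence of health results ("ok" or "fail").
+# Prints the mode after the whole sequence has been processed.
+policy_store_mode() {
+  local threshold=$1; shift
+  local mode="Primary" consecutive=0 result
+  for result in "$@"; do
+    if [ "$result" = "fail" ]; then
+      consecutive=$((consecutive + 1))
+      if [ "$consecutive" -ge "$threshold" ]; then
+        mode="Fallback"
+      fi
+    else
+      # A passing check resets the streak; cooldown is not modelled here.
+      consecutive=0
+      mode="Primary"
+    fi
+  done
+  echo "$mode"
+}
+
+policy_store_mode 3 fail fail ok fail    # prints Primary (streak was reset)
+policy_store_mode 3 ok fail fail fail    # prints Fallback
+```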
+
+## Security Considerations
+
+### Password Policy
+
+Break-glass account passwords should:
+- Be at least 20 characters
+- Include uppercase and lowercase letters, digits, and symbols
+- Be stored securely (HSM, Vault, split custody)
+- Be rotated on a schedule (quarterly recommended)
+
+### Access Control
+
+- Limit break-glass accounts to essential personnel
+- Use separate accounts per operator when possible
+- Review access list quarterly
+- Disable unused accounts immediately
+
+### Monitoring
+
+Set up alerts for break-glass activity:
+
+```yaml
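+# Alert rule example (Prometheus-style; the 5m window is an illustrative choice).
+# increase() is used because a raw counter stays above 0 after the first session.
+- alert: BreakGlassSessionCreated
+  expr: increase(stellaops_breakglass_sessions_created_total[5m]) > 0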
+  for: 0m
+  labels:
+    severity: warning
+  annotations:
+    summary: Break-glass session created
+    description: A break-glass session was created. Verify this is expected.
+```
+
+## Troubleshooting
+
+### Login Failures
+
+| Error | Cause | Resolution |
+|-------|-------|------------|
+| `invalid_credentials` | Wrong password | Verify password |
+| `invalid_reason_code` | Reason not in allowed list | Use approved reason code |
+| `account_disabled` | Account explicitly disabled | Contact administrator |
+| `break_glass_disabled` | Feature disabled in config | Enable in local-policy.yaml |
+
+### Session Issues
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Session expired immediately | Clock skew | Sync server time |
+| Cannot extend | Max extensions reached | Log out and re-authenticate |
+| Actions failing | Insufficient roles | Verify account has required roles |
+
+### Policy Store Issues
+
+```bash
+# Check policy store status
+stella doctor check --component authority
+
+# Verify local policy file
+stella auth policy validate --file /etc/stellaops/authority/local-policy.yaml
+
+# Force reload policy
+stella auth policy reload
+```
+
+## Compliance Notes
+
+Break-glass usage must be:
+- Documented in incident reports
+- Reviewed during security audits
+- Reported in compliance dashboards
+- Justified for each session
+
+Retain audit logs for:
+- SOC 2: 1 year minimum
+- HIPAA: 6 years
+- PCI-DSS: 1 year
+- Internal policy: As defined
+
+## Related Documentation
+
+- [Local RBAC Policy Schema](../modules/authority/local-policy-schema.md)
+- [Authority Architecture](../modules/authority/architecture.md)
+- [Offline Operations](../operations/airgap-operations-runbook.md)
+- [Audit System](../modules/audit/architecture.md)
--- a/docs/operations/checkpoint-divergence-runbook.md
+++ b/docs/operations/checkpoint-divergence-runbook.md
@@ -0,0 +1,262 @@
+# Checkpoint Divergence Detection and Incident Response
+
+This runbook covers the detection of Rekor checkpoint divergence, anomaly types, alert handling, and incident response procedures.
+
+## Overview
+
+Checkpoint divergence detection monitors the integrity of Rekor transparency logs by:
+- Comparing root hashes at the same tree size
+- Verifying tree size monotonicity (only increases)
+- Cross-checking primary logs against mirrors
+- Detecting stale or unresponsive logs
+
+Divergence can indicate:
+- Split-view attacks (malicious log server showing different trees to different clients)
+- Rollback attacks (hiding recent log entries)
+- Log compromise or key theft
+- Network partitions or operational issues
+
+## Detection Rules
+
+| Check | Condition | Severity | Recommended Action |
+|-------|-----------|----------|-------------------|
+| Root hash mismatch | Same tree_size, different root_hash | CRITICAL | Quarantine + immediate investigation |
+| Tree size rollback | new_tree_size < stored_tree_size | CRITICAL | Reject checkpoint + alert |
+| Cross-log divergence | Primary root ≠ mirror root at same size | WARNING | Alert + investigate |
+| Stale checkpoint | Checkpoint age > threshold | WARNING | Alert + monitor |
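+
+The first two rules in the table compare a newly fetched checkpoint with the last stored one. The following is a minimal sketch of that comparison, assuming a hypothetical function name and plain string hashes; a real verifier would also check a consistency proof when the tree has grown, which is not shown.
+
+```bash
+# Hypothetical classifier for the root hash mismatch and rollback rules.
+# Args: stored_tree_size stored_root_hash new_tree_size new_root_hash
+classify_checkpoint() {
+  local stored_size=$1 stored_root=$2 new_size=$3 new_root=$4
+  if [ "$new_size" -lt "$stored_size" ]; then
+    echo "rollback"                 # CRITICAL: reject checkpoint + alert
+  elif [ "$new_size" -eq "$stored_size" ] && [ "$new_root" != "$stored_root" ]; then
+    echo "root_hash_mismatch"       # CRITICAL: quarantine + investigate
+  else
+    echo "ok"                       # same checkpoint, or the log has grown
+  fi
+}
+
+classify_checkpoint 12345678 sha256:abc123 12345600 sha256:abc123   # prints rollback
+classify_checkpoint 12345678 sha256:abc123 12345678 sha256:def456   # prints root_hash_mismatch
+classify_checkpoint 12345678 sha256:abc123 12345700 sha256:0a1b2c   # prints ok
+```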
+
+## Alert Payloads
+
+### Root Hash Mismatch Alert
+```json
+{
+  "eventType": "rekor.checkpoint.divergence",
+  "severity": "critical",
+  "origin": "rekor.sigstore.dev",
+  "treeSize": 12345678,
+  "expectedRootHash": "sha256:abc123...",
+  "actualRootHash": "sha256:def456...",
+  "detectedAt": "2026-01-15T12:34:56Z",
+  "backend": "sigstore-prod",
+  "description": "Checkpoint root hash mismatch detected. Possible split-view attack.",
+  "recommendedAction": "Quarantine"
+}
+```
+
+### Rollback Attempt Alert
+```json
+{
+  "eventType": "rekor.checkpoint.rollback",
+  "severity": "critical",
+  "origin": "rekor.sigstore.dev",
+  "previousTreeSize": 12345678,
+  "attemptedTreeSize": 12345600,
+  "detectedAt": "2026-01-15T12:34:56Z",
+  "description": "Tree size regression detected. Possible rollback attack."
+}
+```
+
+### Cross-Log Divergence Alert
+```json
+{
+  "eventType": "rekor.checkpoint.cross_log_divergence",
+  "severity": "warning",
+  "primaryOrigin": "rekor.sigstore.dev",
+  "mirrorOrigin": "rekor.mirror.example.com",
+  "treeSize": 12345678,
+  "primaryRootHash": "sha256:abc123...",
+  "mirrorRootHash": "sha256:def456...",
+  "description": "Cross-log divergence detected between primary and mirror."
+}
+```
+
+## Metrics
+
+```
+# Counter: total checkpoint mismatches
+attestor_rekor_checkpoint_mismatch_total{backend="sigstore-prod",origin="rekor.sigstore.dev"} 0
+
+# Counter: rollback attempts detected
+attestor_rekor_checkpoint_rollback_detected_total{backend="sigstore-prod"} 0
+
+# Counter: cross-log divergences detected
+attestor_rekor_cross_log_divergence_total{primary="rekor.sigstore.dev",mirror="rekor.mirror.example.com"} 0
+
+# Gauge: seconds since last valid checkpoint
+attestor_rekor_checkpoint_age_seconds{backend="sigstore-prod"} 120
+
+# Counter: total anomalies detected (all types)
+attestor_rekor_anomalies_detected_total{type="RootHashMismatch",severity="critical"} 0
+```
+
+## Incident Response Procedures
+
+### Level 1: Root Hash Mismatch (CRITICAL)
+
+**Symptoms:**
+- `attestor_rekor_checkpoint_mismatch_total` increments
+- Alert received: "rekor.checkpoint.divergence"
+
+**Immediate Actions:**
+1. **Quarantine all affected proofs** - Do not rely on any inclusion proofs from the affected log until resolved
+2. **Suspend automated verifications** - Halt any automated systems that depend on the log
+3. **Preserve evidence** - Capture both checkpoints (expected and actual) with full metadata
+4. **Alert security team** - This is a potential compromise indicator
+
+**Investigation Steps:**
+1. Verify the mismatch isn't a local storage corruption
+   ```bash
+   stella attestor checkpoint verify --origin rekor.sigstore.dev --tree-size 12345678
+   ```
+2. Cross-check with independent sources (other clients, mirrors)
+3. Check if Sigstore has published any incident reports
+4. Review network logs for MITM indicators
+
+**Resolution:**
+- If confirmed attack: Follow security incident process
+- If local corruption: Resync from trusted source
+- If upstream issue: Wait for Sigstore remediation, follow their guidance
+
+### Level 2: Tree Size Rollback (CRITICAL)
+
+**Symptoms:**
+- `attestor_rekor_checkpoint_rollback_detected_total` increments
+- Alert received: "rekor.checkpoint.rollback"
+
+**Immediate Actions:**
+1. **Reject the checkpoint** - Do not accept or store it
+2. **Log full details** for forensic analysis
+3. **Check network path** - Could indicate MITM or DNS hijacking
+
+**Investigation Steps:**
+1. Verify current log state directly:
+   ```bash
+   curl -s https://rekor.sigstore.dev/api/v1/log | jq .treeSize
+   ```
+2. Compare with stored latest tree size
+3. Check DNS resolution and TLS certificate chain
+
+**Resolution:**
+- If network attack: Remediate network path, rotate credentials
+- If temporary glitch: Monitor for repetition
+- If persistent: Escalate to upstream provider
+
+### Level 3: Cross-Log Divergence (WARNING)
+
+**Symptoms:**
+- `attestor_rekor_cross_log_divergence_total` increments
+- Alert received: "rekor.checkpoint.cross_log_divergence"
+
+**Immediate Actions:**
+1. **Do not panic** - Mirrors may have legitimate lag
+2. **Check mirror sync status** - May be catching up
+
+**Investigation Steps:**
+1. Compare tree sizes:
+   ```bash
+   stella attestor checkpoint list --origins rekor.sigstore.dev,rekor.mirror.example.com
+   ```
+2. If same tree size with different roots: Escalate to CRITICAL
+3. If different tree sizes: Allow time for sync
+4. If persistent: Investigate mirror operator
+
+**Resolution:**
+- Sync lag: Monitor until caught up
+- Persistent divergence: Disable mirror, investigate, or remove from trust list
+
+### Level 4: Stale Checkpoint (WARNING)
+
+**Symptoms:**
+- `attestor_rekor_checkpoint_age_seconds` exceeds threshold
+- Log health status: DEGRADED or UNHEALTHY
+
+**Immediate Actions:**
+1. Check log service status
+2. Verify network connectivity to log
+
+**Investigation Steps:**
+1. Check Sigstore status page
+2. Test direct API access:
+   ```bash
+   curl -I https://rekor.sigstore.dev/api/v1/log
+   ```
+3. Review recent checkpoint fetch attempts
+
+**Resolution:**
+- Upstream outage: Wait, rely on cached data
+- Local network issue: Restore connectivity
+- Persistent: Consider failover to mirror
+
+## Configuration
+
+### Detector Options
+
+```yaml
+attestor:
+  divergenceDetection:
+    # Enable checkpoint monitoring
+    enabled: true
+
+    # Threshold for "stale checkpoint" warning
+    staleCheckpointThreshold: 1h
+
+    # Threshold for "stale tree size" (no growth)
+    staleTreeSizeThreshold: 2h
+
+    # Log health thresholds
+    degradedCheckpointAgeThreshold: 30m
+    unhealthyCheckpointAgeThreshold: 2h
+
+    # Enable cross-log consistency checks
+    enableCrossLogChecks: true
+
+    # Mirror origins to check against primary
+    mirrorOrigins:
+      - rekor.mirror.example.com
+      - rekor.mirror2.example.com
+```
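+
+To illustrate how the two health thresholds above relate to `attestor_rekor_checkpoint_age_seconds`, the sketch below maps an age in seconds to a health status. The function name, the `HEALTHY` label, and the inclusive comparison at each boundary are assumptions.
+
+```bash
+# Hypothetical mapping from checkpoint age (seconds) to log health, using the
+# example thresholds: degraded at 30m (1800 s), unhealthy at 2h (7200 s).
+log_health() {
+  local age=$1 degraded=${2:-1800} unhealthy=${3:-7200}
+  if [ "$age" -ge "$unhealthy" ]; then
+    echo "UNHEALTHY"
+  elif [ "$age" -ge "$degraded" ]; then
+    echo "DEGRADED"
+  else
+    echo "HEALTHY"
+  fi
+}
+
+log_health 120     # prints HEALTHY
+log_health 2700    # prints DEGRADED
+log_health 9000    # prints UNHEALTHY
+```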
+
+### Alert Options
+
+```yaml
+attestor:
+  alerts:
+    # Enable alert publishing to Notify service
+    enabled: true
+
+    # Default tenant for system alerts
+    defaultTenant: system
+
+    # Severity thresholds for alerting
+    alertOnHighSeverity: true
+    alertOnWarning: true
+    alertOnInfo: false
+
+    # Alert stream name
+    stream: attestor.alerts
+```
+
+## Runbook Checklist
+
+### Daily Operations
+- [ ] Verify `attestor_rekor_checkpoint_age_seconds` < threshold
+- [ ] Check for any anomaly counter increments
+- [ ] Review divergence detector logs for warnings
+
+### Weekly Review
+- [ ] Audit checkpoint storage integrity
+- [ ] Verify mirror sync status
+- [ ] Review and tune alerting thresholds
+
+### Post-Incident
+- [ ] Document root cause
+- [ ] Update detection rules if needed
+- [ ] Review and improve response procedures
+- [ ] Share learnings with team
+
+## See Also
+
+- [Rekor Verification Design](../modules/attestor/rekor-verification-design.md)
+- [Attestor Architecture](../modules/attestor/architecture.md)
+- [Sigstore Rekor Documentation](https://docs.sigstore.dev/rekor/overview/)
+- [Certificate Transparency RFC 6962](https://www.rfc-editor.org/rfc/rfc6962)
--- a/docs/operations/dual-control-ceremony-runbook.md
+++ b/docs/operations/dual-control-ceremony-runbook.md
@@ -0,0 +1,443 @@
+# Dual-Control Ceremony Runbook
+
+This runbook documents M-of-N threshold signing ceremonies for high-assurance key operations in Stella Ops.
+
+> **Sprint:** SPRINT_20260112_018_SIGNER_dual_control_ceremonies
+
+## Overview
+
+Dual-control ceremonies ensure critical cryptographic operations require approval from multiple authorized individuals before execution. This prevents single points of compromise for sensitive operations like:
+
+- Root key rotation
+- Trust anchor updates
+- Emergency key revocation
+- HSM key generation
+- Recovery key activation
+
+## When Ceremonies Are Required
+
+| Operation | Default Threshold | Configurable |
+|-----------|------------------|--------------|
+| Root signing key rotation | 2-of-3 | Yes |
+| Trust anchor update | 2-of-3 | Yes |
+| Key revocation | 2-of-3 | Yes |
+| HSM key generation | 2-of-4 | Yes |
+| Recovery key activation | 3-of-5 | Yes |
+
+## Ceremony Lifecycle
+
+### State Machine
+
+```
+         +------------------+
+         |     Pending      |
+         +--------+---------+
+                  |
+                  | Approvals collected
+                  v
+    +-------------+-------------+
+    |     PartiallyApproved     |
+    +-------------+-------------+
+                  |
+                  | Threshold reached
+                  v
+         +--------+---------+
+         |     Approved     |
+         +--------+---------+
+                  |
+                  | Execute
+                  v
+         +--------+---------+
+         |     Executed     |
+         +------------------+
+
+   Alternative paths:
+   - Pending -> Expired (timeout)
+   - Pending -> Cancelled (initiator cancel)
+   - PartiallyApproved -> Expired (timeout)
+   - PartiallyApproved -> Cancelled
+```
+
+### State Descriptions
+
+| State | Description |
+|-------|-------------|
+| `Pending` | Ceremony created, awaiting first approval |
+| `PartiallyApproved` | At least one approval, threshold not reached |
+| `Approved` | Threshold reached, ready for execution |
+| `Executed` | Operation completed successfully |
+| `Expired` | Timeout reached without execution |
+| `Cancelled` | Explicitly cancelled before execution |
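+
+The state machine above can be expressed as a lookup of permitted transitions. This sketch covers only the arrows drawn in the diagram and its listed alternative paths; the function name is hypothetical and the Signer service may enforce additional rules.
+
+```bash
+# Hypothetical transition check derived from the state diagram.
+# Prints "allowed" or "rejected" for a from -> to pair.
+ceremony_transition() {
+  case "$1->$2" in
+    "Pending->PartiallyApproved" | \
+    "PartiallyApproved->Approved" | \
+    "Approved->Executed" | \
+    "Pending->Expired" | "Pending->Cancelled" | \
+    "PartiallyApproved->Expired" | "PartiallyApproved->Cancelled")
+      echo "allowed" ;;
+    *)
+      echo "rejected" ;;
+  esac
+}
+
+ceremony_transition Approved Executed    # prints allowed
+ceremony_transition Executed Pending     # prints rejected
+```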
+
+## Creating a Ceremony
+
+### Via CLI
+
+```bash
+stella ceremony create \
+  --type key-rotation \
+  --subject "Root signing key Q1-2026" \
+  --threshold 2 \
+  --required-approvers 3 \
+  --expires-in 24h \
+  --payload '{"keyId": "root-2026-q1", "algorithm": "ecdsa-p384"}'
+```
+
+### Via API
+
+```bash
+curl -X POST https://signer.example.com/api/v1/ceremonies \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "type": "key-rotation",
+    "subject": "Root signing key Q1-2026",
+    "threshold": 2,
+    "requiredApprovers": 3,
+    "expiresAt": "2026-01-17T10:00:00Z",
+    "payload": {
+      "keyId": "root-2026-q1",
+      "algorithm": "ecdsa-p384"
+    }
+  }'
+```
+
+### Response
+
+```json
+{
+  "ceremonyId": "cer-abc123",
+  "type": "key-rotation",
+  "state": "Pending",
+  "threshold": 2,
+  "requiredApprovers": 3,
+  "currentApprovals": 0,
+  "createdAt": "2026-01-16T10:00:00Z",
+  "expiresAt": "2026-01-17T10:00:00Z",
+  "initiator": "admin@company.com"
+}
+```
+
+## Approving a Ceremony
+
+### Prerequisites
+
+Approvers must:
+1. Be in the ceremony's allowed approvers list
+2. Have the `ceremony:approve` scope
+3. Have valid authentication (OIDC or break-glass)
+4. Not have already approved this ceremony
+
+### Via CLI
+
+```bash
+stella ceremony approve \
+  --ceremony-id cer-abc123 \
+  --reason "Reviewed rotation plan, verified key parameters" \
+  --sign
+```
+
+The `--sign` flag creates a DSSE signature over the approval using the approver's signing key.
+
+### Via API
+
+```bash
+curl -X POST https://signer.example.com/api/v1/ceremonies/cer-abc123/approve \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "reason": "Reviewed rotation plan, verified key parameters",
+    "signature": "base64-encoded-dsse-signature"
+  }'
+```
+
+### Approval Response
+
+```json
+{
+  "ceremonyId": "cer-abc123",
+  "state": "PartiallyApproved",
+  "currentApprovals": 1,
+  "threshold": 2,
+  "approval": {
+    "approvalId": "apr-def456",
+    "approver": "security-lead@company.com",
+    "approvedAt": "2026-01-16T11:30:00Z",
+    "reason": "Reviewed rotation plan, verified key parameters",
+    "signatureValid": true
+  }
+}
+```
+
+## Executing a Ceremony
+
+Once the approval threshold is reached:
+
+### Via CLI
+
+```bash
+stella ceremony execute --ceremony-id cer-abc123
+```
+
+### Via API
+
+```bash
+curl -X POST https://signer.example.com/api/v1/ceremonies/cer-abc123/execute \
+  -H "Authorization: Bearer $TOKEN"
+```
+
+### Execution Response
+
+```json
+{
+  "ceremonyId": "cer-abc123",
+  "state": "Executed",
+  "executedAt": "2026-01-16T14:00:00Z",
+  "result": {
+    "keyId": "root-2026-q1",
+    "publicKey": "-----BEGIN PUBLIC KEY-----...",
+    "fingerprint": "SHA256:abc123...",
+    "activatedAt": "2026-01-16T14:00:00Z"
+  }
+}
+```
+
+## Monitoring Ceremonies
+
+### List Active Ceremonies
+
+```bash
+# CLI
+stella ceremony list --state pending,partially-approved
+
+# API
+curl "https://signer.example.com/api/v1/ceremonies?state=pending,partially-approved"
+```
+
+### Check Ceremony Status
+
+```bash
+# CLI
+stella ceremony status --ceremony-id cer-abc123
+
+# API
+curl "https://signer.example.com/api/v1/ceremonies/cer-abc123"
+```
+
+## Cancelling a Ceremony
+
+Ceremonies can be cancelled before execution:
+
+```bash
+# CLI
+stella ceremony cancel \
+  --ceremony-id cer-abc123 \
+  --reason "Postponed due to schedule conflict"
+
+# API
+curl -X DELETE https://signer.example.com/api/v1/ceremonies/cer-abc123 \
+  -H "Authorization: Bearer $TOKEN"
+```
+
+Only the initiator or users with `ceremony:cancel` scope can cancel.
+
+## Audit Events
+
+All ceremony actions are logged:
+
+| Event | Description |
+|-------|-------------|
+| `signer.ceremony.initiated` | Ceremony created |
+| `signer.ceremony.approved` | Approval submitted |
+| `signer.ceremony.approval_rejected` | Approval rejected (invalid signature, unauthorized) |
+| `signer.ceremony.executed` | Operation executed |
+| `signer.ceremony.expired` | Timeout reached |
+| `signer.ceremony.cancelled` | Explicitly cancelled |
+
+### Audit Event Structure
+
+```json
+{
+  "eventType": "signer.ceremony.approved",
+  "timestamp": "2026-01-16T11:30:00Z",
+  "ceremonyId": "cer-abc123",
+  "ceremonyType": "key-rotation",
+  "actor": "security-lead@company.com",
+  "approvalId": "apr-def456",
+  "currentApprovals": 1,
+  "threshold": 2,
+  "signatureAlgorithm": "ecdsa-p256",
+  "signatureKeyId": "user-signing-key-456"
+}
+```
+
+### Query Audit Logs
+
+```bash
+stella audit query \
+  --event-type "signer.ceremony.*" \
+  --since 7d \
+  --ceremony-id cer-abc123
+```
+
+## Configuration
+
+### Ceremony Settings
+
+```yaml
+# signer-config.yaml
+ceremonies:
+  enabled: true
+  defaultTimeout: 24h
+  maxTimeout: 168h  # 7 days
+  requireSignedApprovals: true
+  
+  thresholds:
+    key-rotation: 
+      minimum: 2
+      default: 2
+      maximum: 5
+    key-revocation:
+      minimum: 2
+      default: 2
+      maximum: 5
+    trust-anchor-update:
+      minimum: 2
+      default: 2
+      maximum: 4
+```
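+
+The `minimum` and `maximum` bounds above suggest a simple validation when a ceremony is created. The following sketch shows one way such a check could look; the function name and the extra rule that a threshold cannot exceed the number of approvers are assumptions, not documented Signer behaviour.
+
+```bash
+# Hypothetical validation of a requested threshold.
+# Args: requested minimum maximum approvers
+validate_threshold() {
+  local requested=$1 minimum=$2 maximum=$3 approvers=$4
+  if [ "$requested" -lt "$minimum" ] || [ "$requested" -gt "$maximum" ]; then
+    echo "rejected: threshold outside [$minimum, $maximum]"
+    return 1
+  fi
+  if [ "$requested" -gt "$approvers" ]; then
+    echo "rejected: threshold exceeds number of approvers"
+    return 1
+  fi
+  echo "accepted"
+}
+
+validate_threshold 2 2 5 3   # key-rotation, 2-of-3: prints accepted
+validate_threshold 6 2 5 8   # above maximum: prints a rejection
+```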
+
+### Approver Configuration
+
+```yaml
+# approvers.yaml
+approverGroups:
+  - name: key-custodians
+    members:
+      - security-lead@company.com
+      - ciso@company.com
+      - key-officer-1@company.com
+      - key-officer-2@company.com
+    operations:
+      - key-rotation
+      - key-revocation
+      
+  - name: trust-admins
+    members:
+      - trust-admin@company.com
+      - security-lead@company.com
+    operations:
+      - trust-anchor-update
+```
+
+## Notifications
+
+Ceremonies trigger notifications to approvers:
+
+| Event | Notification |
+|-------|-------------|
+| Ceremony created | Email/Slack to all eligible approvers |
+| Approval submitted | Email/Slack to remaining approvers |
+| Threshold reached | Email/Slack to initiator |
+| Approaching expiry | Email/Slack at 75% and 90% of timeout |
+| Expired | Email/Slack to initiator and approvers |
+
+Configure notifications in `notifier-config.yaml`:
+
+```yaml
+notifications:
+  ceremonies:
+    enabled: true
+    channels:
+      - type: email
+        recipients: "@approverGroup"
+      - type: slack
+        webhook: ${SLACK_CEREMONY_WEBHOOK}
+        channel: "#key-ceremonies"
+```
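+
+As a worked example of the "approaching expiry" reminders (75% and 90% of the timeout), the sketch below computes when they fall for a given timeout. The function name is hypothetical, and integer arithmetic rounds down to whole minutes.
+
+```bash
+# Minutes after creation at which the 75% and 90% expiry reminders fall,
+# for a timeout given in hours.
+reminder_minutes() {
+  local timeout_minutes=$(( $1 * 60 ))
+  echo "$(( timeout_minutes * 75 / 100 )) $(( timeout_minutes * 90 / 100 ))"
+}
+
+reminder_minutes 24    # prints 1080 1296
+```
+
+For the default 24h timeout this means reminders 18 hours and 21 hours 36 minutes after the ceremony was created.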
+
+## Security Best Practices
+
+### Approver Requirements
+
+- Maintain at least M+1 eligible approvers for M-of-N ceremonies, so that one unavailable approver cannot block the threshold
+- Distribute approvers across security domains
+- Require hardware tokens for signing keys
+- Rotate approver list quarterly
+
+### Ceremony Hygiene
+
+- Use descriptive subjects for audit clarity
+- Set timeouts long enough to collect the required approvals but no longer than necessary (the default is 24h, the maximum 168h)
+- Document approval reasons thoroughly
+- Review executed ceremonies monthly
+
+### Monitoring
+
+Set up alerts for:
+
+```yaml
+alerts:
+  - name: CeremonyPendingTooLong
+    condition: ceremony.pending_duration > 12h
+    severity: warning
+    
+  - name: CeremonyApprovalRejected
+    condition: ceremony.approval_rejected
+    severity: critical
+    
+  - name: UnauthorizedCeremonyAttempt
+    condition: ceremony.unauthorized_access
+    severity: critical
+```
+
+## Troubleshooting
+
+### Common Issues
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Approval rejected | Invalid signature | Re-sign with correct key |
+| Cannot approve | Already approved | Different approver must approve |
+| Cannot execute | Threshold not met | Collect more approvals |
+| Ceremony expired | Timeout reached | Create new ceremony |
+
+### Signature Verification Failures
+
+```bash
+# Verify signing key is accessible
+stella auth keys list
+
+# Test signature
+echo "test" | stella sign --key-id my-signing-key | stella verify
+
+# Check key permissions
+stella auth keys info --key-id my-signing-key
+```
+
+## Emergency Procedures
+
+### Stuck Ceremony
+
+If a ceremony is stuck (approvers unavailable):
+
+1. Cancel the stuck ceremony
+2. Create new ceremony with available approvers
+3. Document the situation in audit notes
+
+### Compromised Approver
+
+If an approver's credentials are compromised:
+
+1. Revoke approver's signing key immediately
+2. Cancel any pending ceremonies they created
+3. Review recent approvals for anomalies
+4. Remove from approver groups
+5. Document in incident report
+
+## Related Documentation
+
+- [Key Rotation Runbook](./key-rotation-runbook.md)
+- [HSM Setup Runbook](./hsm-setup-runbook.md)
+- [Signer Architecture](../modules/signer/architecture.md)
+- [Break-Glass Runbook](./break-glass-runbook.md)
--- a/docs/operations/evidence-migration.md
+++ b/docs/operations/evidence-migration.md
@@ -0,0 +1,278 @@
+# Evidence Migration Guide
+
+This guide covers evidence-specific migration procedures during upgrades, schema changes, or disaster recovery scenarios.
+
+## Overview
+
+Evidence bundles are cryptographically linked data structures that must maintain integrity across upgrades. This guide ensures chain-of-custody is preserved during migrations.
+
+## Quick Reference
+
+| Scenario | CLI Command | Risk Level | Downtime |
+|----------|-------------|------------|----------|
+| Schema upgrade | `stella evidence migrate` | Medium | Minutes |
+| Reindex after algorithm change | `stella evidence reindex` | Low | None |
+| Cross-version continuity check | `stella evidence verify-continuity` | None | None |
+| Full evidence export | `stella evidence export --all` | None | None |
+
+## Pre-Migration Checklist
+
+### 1. Capture Current State
+
+```bash
+# Record current evidence statistics
+stella evidence stats --detailed > pre-migration-stats.json
+
+# Export Merkle roots for all tenants
+stella evidence roots-export --all > pre-migration-roots.json
+
+# Verify existing evidence integrity; abort if any check fails
+if ! stella evidence verify-all --output pre-migration-verify.json; then
+  echo "ABORT: Evidence integrity check failed" >&2
+  exit 1
+fi
+```
+
+### 2. Create Evidence Backup
+
+```bash
+# Full evidence bundle export
+BACKUP_DIR="/backup/evidence-$(date +%Y%m%d)"
+stella evidence export \
+  --all \
+  --include-attestations \
+  --include-proofs \
+  --output "${BACKUP_DIR}/"
+
+# Verify export integrity (only the backup just created, not older ones)
+stella evidence verify-bundle "${BACKUP_DIR}/"
+```
+
+### 3. Document Chain-of-Custody
+
+```bash
+# Record the current root hashes
+OLD_MERKLE_ROOT=$(stella evidence roots-export --format json | jq -r '.globalRoot')
+echo "Pre-migration Merkle root: ${OLD_MERKLE_ROOT}" > custody-log.txt
+date >> custody-log.txt
+```
+
+## Migration Procedures
+
+### Schema Migration (Version Upgrade)
+
+When upgrading between versions with schema changes:
+
+```bash
+# Step 1: Assess migration impact (dry-run)
+stella evidence migrate \
+  --from-version 1.0 \
+  --to-version 2.0 \
+  --dry-run
+
+# Step 2: Review migration plan output
+# Ensure all changes are expected
+
+# Step 3: Execute migration
+stella evidence migrate \
+  --from-version 1.0 \
+  --to-version 2.0
+
+# Step 4: Verify migration
+stella evidence verify-all
+```
+
+### Evidence Reindex (Algorithm Change)
+
+When the hashing algorithm or Merkle tree structure changes:
+
+```bash
+# Step 1: Assess reindex impact
+stella evidence reindex \
+  --dry-run \
+  --output reindex-plan.json
+
+# Review reindex-plan.json for:
+# - Total records affected
+# - Estimated duration
+# - New schema version
+
+# Step 2: Execute reindex with batching
+stella evidence reindex \
+  --batch-size 100 \
+  --since 2026-01-01
+
+# Step 3: Capture new root
+NEW_MERKLE_ROOT=$(stella evidence roots-export --format json | jq -r '.globalRoot')
+echo "Post-migration Merkle root: ${NEW_MERKLE_ROOT}" >> custody-log.txt
+date >> custody-log.txt
+```
+
+### Chain-of-Custody Verification
+
+After any evidence migration, verify continuity:
+
+```bash
+# Verify that old proofs remain valid
+stella evidence verify-continuity \
+  --old-root "${OLD_MERKLE_ROOT}" \
+  --new-root "${NEW_MERKLE_ROOT}" \
+  --output continuity-report.html \
+  --format html
+
+# Check verification results
+if grep -q "FAIL" continuity-report.html; then
+  echo "ERROR: Chain-of-custody verification failed!"
+  echo "Review continuity-report.html for details"
+  exit 1
+fi
+```
+
+## Rollback Procedures
+
+### Immediate Rollback (Within Migration Window)
+
+```bash
+# If migration fails mid-way, rollback is automatic
+# Check current migration state
+stella evidence migrate --status
+
+# Force rollback if needed
+stella evidence migrate \
+  --rollback \
+  --from-version 2.0
+```
+
+### Restore from Backup
+
+```bash
+# Step 1: Stop evidence-related services
+kubectl scale deployment evidence-locker --replicas=0
+
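+# Step 2: Restore PostgreSQL evidence tables
+# pg_restore's --table does not accept wildcards or schema-qualified names,
+# so select the whole schema instead (assumes the tables live in schema "evidence")
+pg_restore -d stellaops \
+  --schema=evidence \
+  /backup/stellaops-backup.dump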
+
+# Step 3: Restore evidence files
+stella evidence import /backup/evidence-*/
+
+# Step 4: Verify restoration
+stella evidence verify-all
+
+# Step 5: Restart services
+kubectl scale deployment evidence-locker --replicas=3
+```
+
+## Air-Gap Migration
+
+For air-gapped environments without network access:
+
+### Export Phase (Online Environment)
+
+```bash
+# Create portable evidence bundle
+stella evidence export \
+  --all \
+  --portable \
+  --include-schemas \
+  --output /media/airgap-evidence.tar.gz
+
+# Generate checksums
+sha256sum /media/airgap-evidence.tar.gz > /media/checksums.txt
+```
+
+### Transfer Phase
+
+1. Copy `airgap-evidence.tar.gz` and `checksums.txt` to removable media
+2. Verify checksums at the destination before importing
+3. Scan the media according to your site's security policy
+
+### Import Phase (Air-Gap Environment)
+
+```bash
+# Verify transfer integrity
+sha256sum -c /media/checksums.txt
+
+# Import evidence bundle
+stella evidence import \
+  --portable \
+  /media/airgap-evidence.tar.gz
+
+# Verify import
+stella evidence verify-all
+```
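+
+`sha256sum -c` looks up each file under the path recorded in the checksum file, so the absolute `/media/...` path written during export must also exist at the destination. A minimal sketch that avoids this by recording names relative to the bundle directory, assuming GNU coreutils `sha256sum`; the helper names are illustrative:
+
+```bash
+# Write checksums with relative file names (run in the online environment).
+# Hypothetical helpers; only sha256sum from GNU coreutils is assumed.
+write_checksums() {
+  local dir="$1"; shift
+  ( cd "$dir" && sha256sum -- "$@" > checksums.txt )
+}
+
+# Verify every listed file; returns non-zero on any mismatch or missing file.
+verify_checksums() {
+  local dir="$1"
+  ( cd "$dir" && sha256sum --check --quiet checksums.txt )
+}
+
+# Usage:
+#   write_checksums /media airgap-evidence.tar.gz        # before transfer
+#   verify_checksums /media || { echo "ABORT: checksum mismatch"; exit 1; }
+```
+
+Because the names are relative, the same `checksums.txt` verifies regardless of where the media is mounted in the air-gapped environment.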
+
+## Troubleshooting
+
+### Migration Stuck or Timeout
+
+```bash
+# Check migration status
+stella evidence migrate --status
+
+# View migration logs
+stella evidence migrate --logs
+
+# Resume from last checkpoint
+stella evidence migrate --resume
+```
+
+### Root Hash Mismatch
+
+If verification reports a root hash mismatch:
+
+1. **Do not proceed** with the upgrade
+2. Check for data corruption:
+   ```bash
+   stella evidence integrity-check --deep
+   ```
+3. Review recent changes to evidence store
+4. Contact support with integrity report
+
+### Missing Evidence Records
+
+```bash
+# Count records by type
+stella evidence stats --by-type
+
+# Find orphaned records
+stella evidence orphans --list
+
+# Reconcile with source systems
+stella evidence reconcile --source attestor
+```
+
+### Performance Issues
+
+For large evidence stores (>1M records):
+
+```bash
+# Run reindex in parallel batches
+stella evidence reindex \
+  --parallel 4 \
+  --batch-size 500 \
+  --since 2026-01-01
+
+# Monitor progress
+stella evidence reindex --progress
+```
+
+## Audit Trail Requirements
+
+All evidence migrations must maintain an audit trail:
+
+| Event | Required Data | Retention |
+|-------|---------------|-----------|
+| Migration Start | Timestamp, version, operator | Permanent |
+| Schema Change | Before/after schema versions | Permanent |
+| Root Hash Change | Old root, new root, cross-reference | Permanent |
+| Verification | Pass/fail, anomalies, timestamps | 7 years |
+| Rollback | Reason, restored version | Permanent |
+
+## Related Documents
+
+- [Upgrade Runbook](upgrade-runbook.md) - Overall upgrade procedures
+- [Blue-Green Deployment](blue-green-deployment.md) - Zero-downtime deployment
+- [Evidence Locker Architecture](../modules/evidencelocker/architecture.md) - Technical design
+- [Air-Gap Operations](airgap-operations-runbook.md) - Offline deployment guide
--- a/docs/operations/hsm-setup-runbook.md
+++ b/docs/operations/hsm-setup-runbook.md
@@ -34,6 +34,8 @@ pkcs11-tool --version

 ## SoftHSM2 Setup (Development)

+See [docs/operations/softhsm2-test-environment.md](./softhsm2-test-environment.md) for a focused test environment setup.
+
 ### Step 1: Initialize SoftHSM

 ```bash
@@ -197,7 +199,7 @@ stringData:

 ```bash
 # Run HSM connectivity doctor check
-stella doctor --check hsm
+stella doctor --check check.crypto.hsm

 # Expected output:
 # [PASS] HSM Connectivity
--- a/docs/operations/key-escrow-runbook.md
+++ b/docs/operations/key-escrow-runbook.md
@@ -0,0 +1,417 @@
+# Key Escrow and Recovery Runbook
+
+This runbook documents Shamir secret sharing key escrow and recovery procedures in Stella Ops.
+
+> **Sprint:** SPRINT_20260112_018_CRYPTO_key_escrow_shamir
+
+## Overview
+
+Key escrow ensures critical cryptographic keys can be recovered if primary access is lost. Stella Ops uses Shamir's Secret Sharing to split keys into shares distributed among trusted custodians.
+
+Key features:
+- M-of-N threshold recovery (any M shares reconstruct the key)
+- Share encryption at rest
+- Custodian-based share distribution
+- Integration with dual-control ceremonies
+- Full audit trail
+
+## When to Use Key Escrow
+
+| Scenario | Escrow Required |
+|----------|-----------------|
+| Root signing keys | Yes |
+| HSM master keys | Yes |
+| Trust anchor keys | Yes |
+| Service signing keys | Recommended |
+| User signing keys | Optional |
+| Ephemeral keys | No |
+
+## Shamir Secret Sharing
+
+### How It Works
+
+Shamir's Secret Sharing splits a secret into N shares where any M shares can reconstruct the original:
+
+```
+Secret S → Split(S, M, N) → [Share₁, Share₂, ..., Shareₙ]
+
+Any M shares → Combine → Secret S
+Fewer than M shares → Cannot reconstruct
+```
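+
+The split and combine steps above can be written out as follows. This is the textbook construction over a prime field; the field, share encoding, and randomness source that Stella Ops actually uses are not specified in this runbook, so treat the symbols as illustrative:
+
+```latex
+% Split: choose random coefficients a_1, ..., a_{M-1}; the secret S is the constant term
+f(x) = S + a_1 x + a_2 x^2 + \dots + a_{M-1} x^{M-1} \pmod{p},
+\qquad \text{Share}_i = \bigl(i,\ f(i)\bigr), \quad i = 1, \dots, N
+
+% Combine: Lagrange interpolation at x = 0 from any M shares (x_j, y_j)
+S = f(0) = \sum_{j=1}^{M} y_j \prod_{k \neq j} \frac{x_k}{x_k - x_j} \pmod{p}
+```
+
+As a small worked example with M = 2, N = 3, p = 13, S = 7, and a₁ = 3: the polynomial is f(x) = 7 + 3x mod 13, giving shares (1, 10), (2, 0), and (3, 3). Combining shares 1 and 3 yields 10·(3/2) + 3·(1/(−2)) ≡ 10·8 + 3·6 ≡ 7 (mod 13), using 2⁻¹ ≡ 7. A single share is consistent with every possible secret, which is why fewer than M shares reveal nothing.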
+
+### Configuration Parameters
+
+| Parameter | Description | Recommended |
+|-----------|-------------|-------------|
+| Threshold (M) | Minimum shares needed to reconstruct the key | 2 or 3, depending on key type |
+| Total Shares (N) | Total shares created | At least M + 1; M + 2 for root and HSM keys |
+| Share Encryption | Encrypt shares at rest | Always enabled |
+
+### Threshold Guidelines
+
+| Key Type | Minimum M | Recommended N | Rationale |
+|----------|-----------|---------------|-----------|
+| Root keys | 3 | 5 | High assurance |
+| HSM keys | 2 | 4 | Availability + security |
+| Service keys | 2 | 3 | Operational recovery |
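+
+The guideline values trade availability against collusion resistance: an M-of-N escrow still recovers after losing up to N − M shares, and an attacker holding up to M − 1 shares cannot reconstruct the key. A small sketch that makes those margins explicit; `escrow_margins` is a hypothetical helper, not a `stella` subcommand:
+
+```bash
+# Print how many shares may be lost (recovery still possible) and how many
+# may leak (key still secret) for an M-of-N split. Hypothetical helper.
+escrow_margins() {
+  local m="$1" n="$2"
+  if [ "$m" -lt 2 ] || [ "$n" -lt "$m" ]; then
+    echo "invalid threshold: need 2 <= M <= N" >&2
+    return 1
+  fi
+  echo "lost_ok=$(( n - m )) leak_ok=$(( m - 1 ))"
+}
+
+escrow_margins 3 5   # root keys    → lost_ok=2 leak_ok=2
+escrow_margins 2 4   # HSM keys     → lost_ok=2 leak_ok=1
+escrow_margins 2 3   # service keys → lost_ok=1 leak_ok=1
+```
+
+Raising M improves secrecy but shrinks the loss margin for a fixed N, which is why the higher-assurance key types also get more total shares.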
+
+## Escrowing a Key
+
+### Via CLI
+
+```bash
+stella escrow create \
+  --key-id root-signing-key-2026 \
+  --threshold 3 \
+  --shares 5 \
+  --custodians custodian-1,custodian-2,custodian-3,custodian-4,custodian-5 \
+  --expires-in 365d \
+  --reason "Annual key escrow for root signing key"
+```
+
+### Via API
+
+```bash
+curl -X POST https://signer.example.com/api/v1/escrow \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "keyId": "root-signing-key-2026",
+    "threshold": 3,
+    "totalShares": 5,
+    "custodianIds": [
+      "custodian-1", "custodian-2", "custodian-3",
+      "custodian-4", "custodian-5"
+    ],
+    "expirationDays": 365,
+    "reason": "Annual key escrow for root signing key"
+  }'
+```
+
+### Escrow Response
+
+```json
+{
+  "escrowId": "esc-abc123",
+  "keyId": "root-signing-key-2026",
+  "threshold": 3,
+  "totalShares": 5,
+  "status": "Active",
+  "createdAt": "2026-01-16T10:00:00Z",
+  "expiresAt": "2027-01-16T10:00:00Z",
+  "shares": [
+    { "shareId": "shr-001", "custodianId": "custodian-1", "distributed": true },
+    { "shareId": "shr-002", "custodianId": "custodian-2", "distributed": true },
+    { "shareId": "shr-003", "custodianId": "custodian-3", "distributed": true },
+    { "shareId": "shr-004", "custodianId": "custodian-4", "distributed": true },
+    { "shareId": "shr-005", "custodianId": "custodian-5", "distributed": true }
+  ]
+}
+```
+
+## Share Distribution
+
+### Distribution Methods
+
+| Method | Security | Use Case |
+|--------|----------|----------|
+| Direct API delivery | High | Automated systems |
+| Encrypted email | Medium | Remote custodians |
+| In-person ceremony | Highest | Root keys |
+| Hardware token | Highest | HSM keys |
+
+### Custodian Requirements
+
+Each custodian must:
+1. Have verified identity in Authority
+2. Complete escrow custodian training
+3. Have secure share storage capability
+4. Be geographically distributed (recommended)
+
+### Verifying Share Distribution
+
+```bash
+stella escrow status --escrow-id esc-abc123
+
+# Output:
+# Escrow: esc-abc123
+# Key: root-signing-key-2026
+# Status: Active
+# Threshold: 3 of 5
+# Shares:
+#   [1] custodian-1: Distributed ✓
+#   [2] custodian-2: Distributed ✓
+#   [3] custodian-3: Distributed ✓
+#   [4] custodian-4: Distributed ✓
+#   [5] custodian-5: Distributed ✓
+```
+
+## Key Recovery
+
+### Prerequisites
+
+Recovery requires:
+1. Valid recovery request (incident, key loss, rotation)
+2. Dual-control ceremony approval (if configured)
+3. Minimum M custodians available with shares
+4. Secure recovery environment
+
+### Recovery Workflow
+
+```
+1. Initiate recovery request
+2. (If required) Dual-control ceremony approval
+3. Collect shares from M custodians
+4. Verify share checksums
+5. Reconstruct key
+6. Verify reconstructed key
+7. Log recovery event
+```
+
+### Via CLI
+
+```bash
+# Step 1: Initiate recovery
+stella escrow recover init \
+  --escrow-id esc-abc123 \
+  --reason "HSM failure - emergency key recovery" \
+  --ceremony-required
+
+# Step 2: Collect shares (each custodian runs)
+stella escrow recover submit-share \
+  --recovery-id rec-xyz789 \
+  --share-file /secure/my-share.enc \
+  --passphrase-file /secure/passphrase
+
+# Step 3: Execute recovery (after threshold reached)
+stella escrow recover execute \
+  --recovery-id rec-xyz789 \
+  --output-key-file /secure/recovered-key.pem
+```
+
+### Via API
+
+```bash
+# Initiate recovery
+curl -X POST https://signer.example.com/api/v1/escrow/esc-abc123/recover \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "reason": "HSM failure - emergency key recovery",
+    "requireCeremony": true
+  }'
+
+# Submit share
+curl -X POST https://signer.example.com/api/v1/recovery/rec-xyz789/shares \
+  -H "Authorization: Bearer $CUSTODIAN_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "shareId": "shr-001",
+    "encryptedShare": "base64-encoded-share",
+    "checksum": "sha256:abc123..."
+  }'
+
+# Execute recovery (after threshold)
+curl -X POST https://signer.example.com/api/v1/recovery/rec-xyz789/execute \
+  -H "Authorization: Bearer $TOKEN"
+```
+
+### Recovery Response
+
+```json
+{
+  "recoveryId": "rec-xyz789",
+  "status": "Completed",
+  "keyId": "root-signing-key-2026",
+  "sharesCollected": 3,
+  "threshold": 3,
+  "completedAt": "2026-01-16T15:30:00Z",
+  "keyFingerprint": "SHA256:xyz789...",
+  "verified": true
+}
+```
+
+## Share Management
+
+### Custodian Share Storage
+
+Custodians should store shares using one of the following options:
+
+| Storage | Security Level | Notes |
+|---------|----------------|-------|
+| HSM | Highest | Preferred for root keys |
+| Hardware token | High | YubiKey, smart card |
+| Encrypted file | Medium | AES-256-GCM minimum |
+| Password manager | Medium | Enterprise vault only |
+
+### Share Format
+
+```json
+{
+  "shareId": "shr-001",
+  "escrowId": "esc-abc123",
+  "index": 1,
+  "threshold": 3,
+  "totalShares": 5,
+  "encryptedData": "base64-encoded-aes-256-gcm-ciphertext",
+  "checksum": "sha256:abc123...",
+  "createdAt": "2026-01-16T10:00:00Z",
+  "expiresAt": "2027-01-16T10:00:00Z"
+}
+```
+
+### Share Rotation
+
+Re-escrow keys periodically:
+
+```bash
+stella escrow re-escrow \
+  --escrow-id esc-abc123 \
+  --new-custodians custodian-1,custodian-2,custodian-6,custodian-7,custodian-8 \
+  --reason "Annual share rotation"
+```
+
+This creates new shares and revokes old ones.
+
+## Audit Trail
+
+### Audit Events
+
+| Event | Description |
+|-------|-------------|
+| `escrow.created` | Key escrowed |
+| `escrow.share.distributed` | Share sent to custodian |
+| `escrow.share.accessed` | Custodian accessed share |
+| `recovery.initiated` | Recovery started |
+| `recovery.share.submitted` | Share submitted for recovery |
+| `recovery.completed` | Key reconstructed |
+| `recovery.failed` | Recovery failed |
+| `escrow.revoked` | Escrow revoked |
+
+### Query Audit Logs
+
+```bash
+stella audit query \
+  --event-type "escrow.*,recovery.*" \
+  --escrow-id esc-abc123 \
+  --since 30d
+```
+
+## Configuration
+
+### Escrow Settings
+
+```yaml
+# escrow-config.yaml
+escrow:
+  enabled: true
+  defaultThreshold: 2
+  minimumThreshold: 2
+  maximumShares: 10
+  shareEncryption:
+    algorithm: AES-256-GCM
+    keyDerivation: HKDF-SHA256
+  requireDualControlForRecovery: true
+  maxRecoveryAttempts: 3
+  recoveryTimeoutHours: 24
+```
+
+### Custodian Configuration
+
+```yaml
+# custodians.yaml
+custodians:
+  - id: custodian-1
+    name: "Security Lead"
+    email: security-lead@company.com
+    publicKey: "-----BEGIN PUBLIC KEY-----..."
+    location: "US-East"
+
+  - id: custodian-2
+    name: "Key Officer A"
+    email: key-officer-a@company.com
+    publicKey: "-----BEGIN PUBLIC KEY-----..."
+    location: "EU-West"
+```
+
+## Security Considerations
+
+### Share Security
+
+- Never transmit shares in plaintext
+- Encrypt shares with custodian's public key
+- Verify checksums before and after storage
+- Use secure channels for distribution
+
+### Recovery Security
+
+- Require dual-control ceremonies for critical keys
+- Limit recovery time window
+- Verify recovered key fingerprint
+- Audit all recovery attempts
+
+### Custodian Security
+
+- Verify custodian identity before share access
+- Geographic distribution reduces collusion risk
+- Rotate custodians periodically
+- Train custodians on secure handling
+
+## Troubleshooting
+
+### Common Issues
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Share checksum mismatch | Corrupted share | Request re-distribution |
+| Cannot decrypt share | Wrong passphrase | Verify passphrase |
+| Recovery timeout | Shares not collected in time | Restart recovery |
+| Key verification failed | Wrong shares combined | Verify share indices |
+
+### Verification Failures
+
+```bash
+# Verify share integrity
+stella escrow verify-share --share-file share.enc
+
+# Test reconstruction with subset
+stella escrow test-recovery \
+  --escrow-id esc-abc123 \
+  --share-files share1.enc,share2.enc,share3.enc
+```
+
+## Emergency Procedures
+
+### Lost Share
+
+If a custodian loses their share:
+
+1. Verify at least M shares remain accessible
+2. Re-escrow the key with a new share set
+3. Revoke the old escrow once the new one is active
+4. Document the incident
+
+### Compromised Custodian
+
+If a custodian is compromised:
+
+1. Do NOT use their share for any recovery
+2. Re-escrow immediately with new custodians
+3. Revoke old escrow
+4. Consider key rotation if M or more shares may have been exposed
+
+### Multiple Lost Shares
+
+If fewer than M shares are available:
+
+1. Key cannot be recovered via escrow
+2. Use backup key if available
+3. Generate new key and re-establish trust
+4. Document as key loss incident
+
+## Related Documentation
+
+- [Dual-Control Ceremony Runbook](./dual-control-ceremony-runbook.md)
+- [Key Rotation Runbook](./key-rotation-runbook.md)
+- [HSM Setup Runbook](./hsm-setup-runbook.md)
+- [Cryptography Architecture](../modules/cryptography/architecture.md)
--- a/docs/operations/rekor-sync-guide.md
+++ b/docs/operations/rekor-sync-guide.md
@@ -0,0 +1,362 @@
+# Rekor Checkpoint Sync Configuration and Operations
+
+This guide covers the configuration and operational procedures for the Rekor periodic checkpoint synchronization service.
+
+## Overview
+
+The Rekor sync service maintains a local mirror of Rekor transparency log checkpoints and tiles. This enables:
+
+- **Offline verification**: Verify attestations without network access to Sigstore
+- **Air-gapped operation**: Run in environments without internet connectivity
+- **Performance**: Reduce latency by using local checkpoint data
+- **Auditability**: Maintain local evidence of log state at verification time
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                      RekorSyncBackgroundService                  │
+│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
+│  │  Checkpoint  │     │   Signature  │     │    Tile      │     │
+│  │   Fetcher    │────▶│   Verifier   │────▶│   Syncer     │     │
+│  └──────────────┘     └──────────────┘     └──────────────┘     │
+└─────────────────────────────────────────────────────────────────┘
+          │                     │                     │
+          ▼                     ▼                     ▼
+   ┌──────────────┐      ┌──────────────┐     ┌──────────────┐
+   │  HTTP Tile   │      │  Checkpoint  │     │    Tile      │
+   │   Client     │      │    Store     │     │   Cache      │
+   └──────────────┘      │ (PostgreSQL) │     │(File System) │
+          │              └──────────────┘     └──────────────┘
+          ▼
+   ┌──────────────┐
+   │    Rekor     │
+   │   Server     │
+   └──────────────┘
+```
+
+## Configuration
+
+### Basic Configuration
+
+```yaml
+attestor:
+  rekorSync:
+    # Enable or disable sync service
+    enabled: true
+
+    # How often to fetch new checkpoints
+    syncInterval: 5m
+
+    # Delay before first sync after startup
+    initialDelay: 30s
+
+    # Enable tile synchronization for full offline support
+    enableTileSync: true
+
+    # Maximum tiles to fetch per sync cycle
+    maxTilesPerSync: 100
+
+    # Backend configurations
+    backends:
+      - id: sigstore-prod
+        origin: rekor.sigstore.dev
+        baseUrl: https://rekor.sigstore.dev
+        publicKeyPath: /etc/stella/keys/rekor-sigstore-prod.pub
+
+      - id: sigstore-staging
+        origin: rekor.sigstage.dev
+        baseUrl: https://rekor.sigstage.dev
+        publicKeyPath: /etc/stella/keys/rekor-sigstore-staging.pub
+```
+
+### Checkpoint Store Configuration (PostgreSQL)
+
+```yaml
+attestor:
+  checkpointStore:
+    connectionString: "Host=localhost;Database=stella;Username=stella;Password=secret"
+    schema: attestor
+    autoInitializeSchema: true
+```
+
+### Tile Cache Configuration (File System)
+
+```yaml
+attestor:
+  tileCache:
+    # Base directory for tile storage
+    basePath: /var/lib/stella/attestor/tiles
+
+    # Maximum cache size (0 = unlimited)
+    maxCacheSizeBytes: 10737418240  # 10 GB
+
+    # Auto-prune tiles older than this
+    autoPruneAfter: 720h  # 30 days
+```
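+
+The two literal values in the block above are easy to mistype. A quick sketch for deriving them, assuming `maxCacheSizeBytes` is meant in binary units (GiB), which is what the value 10737418240 implies; the helper names are illustrative:
+
+```bash
+# Convert GiB to bytes and days to an hour-based duration string.
+gib_to_bytes()  { echo $(( $1 * 1024 * 1024 * 1024 )); }
+days_to_hours() { echo "$(( $1 * 24 ))h"; }
+
+gib_to_bytes 10    # → 10737418240
+days_to_hours 30   # → 720h
+```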
+
+## Operational Procedures
+
+### Initial Setup
+
+1. **Initialize the checkpoint store schema**:
+   ```bash
+   stella attestor checkpoint-store init --connection "Host=localhost;..."
+   ```
+
+2. **Configure backend(s)**:
+   ```bash
+   stella attestor backend add sigstore-prod \
+     --origin rekor.sigstore.dev \
+     --url https://rekor.sigstore.dev \
+     --public-key /path/to/rekor.pub
+   ```
+
+3. **Perform initial sync**:
+   ```bash
+   stella attestor sync --backend sigstore-prod --full
+   ```
+
+### Manual Sync Operations
+
+**Force immediate sync**:
+```bash
+stella attestor sync --backend sigstore-prod
+```
+
+**Sync all backends**:
+```bash
+stella attestor sync --all
+```
+
+**Full tile sync** (for offline kit preparation):
+```bash
+stella attestor sync --backend sigstore-prod --full-tiles
+```
+
+### Monitoring
+
+**Check sync status**:
+```bash
+stella attestor sync-status
+```
+
+Output:
+```
+Backend           Origin               Tree Size    Last Sync            Age
+sigstore-prod     rekor.sigstore.dev   45,678,901   2026-01-15 12:34:56  2m 15s
+sigstore-staging  rekor.sigstage.dev   1,234,567    2026-01-15 12:30:00  6m 30s
+```
+
+**Check checkpoint history**:
+```bash
+stella attestor checkpoints list --backend sigstore-prod --last 10
+```
+
+**Check tile cache status**:
+```bash
+stella attestor tiles stats --backend sigstore-prod
+```
+
+Output:
+```
+Origin: rekor.sigstore.dev
+Total Tiles: 45,678
+Cache Size: 1.4 GB
+Coverage: 100% (tree size 45,678,901)
+Oldest Tile: 2026-01-01 00:00:00
+Newest Tile: 2026-01-15 12:34:56
+```
+
+### Metrics
+
+The sync service exposes the following Prometheus metrics:
+
+```
+# Counter: checkpoints fetched from remote
+attestor_rekor_sync_checkpoints_fetched_total{backend="sigstore-prod"} 1234
+
+# Counter: checkpoints stored locally
+attestor_rekor_sync_checkpoints_stored_total{backend="sigstore-prod"} 1234
+
+# Counter: tiles fetched from remote
+attestor_rekor_sync_tiles_fetched_total{backend="sigstore-prod"} 56789
+
+# Counter: tiles cached locally
+attestor_rekor_sync_tiles_cached_total{backend="sigstore-prod"} 56789
+
+# Histogram: checkpoint age at sync time (seconds)
+attestor_rekor_sync_checkpoint_age_seconds{backend="sigstore-prod"} 
+
+# Gauge: total tiles cached
+attestor_rekor_sync_tiles_cached{backend="sigstore-prod"} 45678
+
+# Gauge: time since last successful sync (seconds)
+attestor_rekor_sync_last_success_seconds{backend="sigstore-prod"} 135
+
+# Counter: sync errors
+attestor_rekor_sync_errors_total{backend="sigstore-prod",error_type="network"} 5
+```
+
+### Alerting Recommendations
+
+```yaml
+groups:
+  - name: attestor-rekor-sync
+    rules:
+      - alert: RekorSyncStale
+        expr: attestor_rekor_sync_last_success_seconds > 900
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Rekor sync is stale
+          description: "No successful sync in {{ $value }}s for {{ $labels.backend }}"
+
+      - alert: RekorSyncFailing
+        expr: rate(attestor_rekor_sync_errors_total[5m]) > 0.1
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: Rekor sync experiencing errors
+          description: "Sync errors detected for {{ $labels.backend }}"
+```
+
+### Maintenance Tasks
+
+**Prune old checkpoints**:
+```bash
+# Keep only last 30 days of checkpoints
+stella attestor checkpoints prune --older-than 720h --keep-latest
+```
+
+**Prune old tiles**:
+```bash
+# Remove cached tiles older than 30 days
+stella attestor tiles prune --older-than 720h
+```
+
+**Verify checkpoint store integrity**:
+```bash
+stella attestor checkpoints verify --backend sigstore-prod
+```
+
+**Export checkpoints for air-gap**:
+```bash
+stella attestor export \
+  --backend sigstore-prod \
+  --output /mnt/airgap/attestor-bundle.tar.gz \
+  --include-tiles
+```
+
+## Troubleshooting
+
+### Sync Not Running
+
+1. Check service logs:
+   ```bash
+   journalctl -u stella-attestor -f
+   ```
+
+2. Verify configuration:
+   ```bash
+   stella attestor config validate
+   ```
+
+3. Check database connectivity:
+   ```bash
+   stella attestor checkpoint-store test
+   ```
+
+### Signature Verification Failing
+
+1. Verify public key is correct:
+   ```bash
+   stella attestor backend verify-key sigstore-prod
+   ```
+
+2. Check for key rotation:
+   - Monitor Sigstore announcements
+   - Update public key if rotated
+
+3. Compare with direct fetch:
+   ```bash
+   curl -s https://rekor.sigstore.dev/api/v1/log | jq
+   ```
+
+### Tile Cache Issues
+
+1. Check disk space:
+   ```bash
+   df -h /var/lib/stella/attestor/tiles
+   ```
+
+2. Verify permissions:
+   ```bash
+   ls -la /var/lib/stella/attestor/tiles
+   ```
+
+3. Clear and resync:
+   ```bash
+   stella attestor tiles clear --backend sigstore-prod
+   stella attestor sync --backend sigstore-prod --full-tiles
+   ```
+
+### Database Issues
+
+1. Check PostgreSQL connectivity:
+   ```bash
+   psql -h localhost -U stella -d stella -c "SELECT 1"
+   ```
+
+2. Verify schema exists:
+   ```sql
+   SELECT * FROM attestor.rekor_checkpoints LIMIT 1;
+   ```
+
+3. Reinitialize schema if needed:
+   ```bash
+   stella attestor checkpoint-store init --force
+   ```
+
+## Air-Gap Operations
+
+### Preparing an Offline Bundle
+
+1. Sync to latest checkpoint:
+   ```bash
+   stella attestor sync --backend sigstore-prod --full-tiles
+   ```
+
+2. Export bundle:
+   ```bash
+   stella attestor export \
+     --backend sigstore-prod \
+     --output offline-attestor-bundle.tar.gz \
+     --include-tiles \
+     --checkpoints-only-verified
+   ```
+
+3. Transfer bundle to air-gapped environment
+
+### Importing in Air-Gapped Environment
+
+1. Import the bundle:
+   ```bash
+   stella attestor import offline-attestor-bundle.tar.gz
+   ```
+
+2. Verify import:
+   ```bash
+   stella attestor sync-status
+   ```
+
+3. Checkpoints and tiles are now available for offline verification
+
+## See Also
+
+- [Rekor Verification Design](../modules/attestor/rekor-verification-design.md)
+- [Checkpoint Divergence Detection](./checkpoint-divergence-runbook.md)
+- [Offline Kit Preparation](./offline-kit-guide.md)
+- [Sigstore Rekor Documentation](https://docs.sigstore.dev/rekor/overview/)
--- a/docs/operations/softhsm2-test-environment.md
+++ b/docs/operations/softhsm2-test-environment.md
@@ -0,0 +1,70 @@
+# SoftHSM2 Test Environment Setup
+
+This guide describes how to configure SoftHSM2 for PKCS#11 integration tests and local validation.
+
+## Install SoftHSM2
+
+```bash
+# Ubuntu/Debian
+sudo apt-get install softhsm2 opensc
+
+# Verify installation
+softhsm2-util --version
+pkcs11-tool --version
+```
+
+## Initialize Token
+
+```bash
+# Create token directory
+mkdir -p /var/lib/softhsm/tokens
+chmod 700 /var/lib/softhsm/tokens
+
+# Initialize token
+softhsm2-util --init-token \
+  --slot 0 \
+  --label "StellaOps-Dev" \
+  --so-pin 12345678 \
+  --pin 87654321
+
+# Verify token
+softhsm2-util --show-slots
+```
+
+## Create a Test Key
+
+```bash
+# Generate RSA keypair
+pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
+  --login --pin 87654321 \
+  --keypairgen \
+  --key-type rsa:2048 \
+  --id 01 \
+  --label "stellaops-hsm-test"
+
+# List objects
+pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
+  --login --pin 87654321 \
+  --list-objects
+```
+
+## Environment Variables for Tests
+
+```bash
+export STELLAOPS_SOFTHSM_LIB="/usr/lib/softhsm/libsofthsm2.so"
+export STELLAOPS_SOFTHSM_SLOT="0"
+export STELLAOPS_SOFTHSM_PIN="87654321"
+export STELLAOPS_SOFTHSM_KEY_ID="stellaops-hsm-test"
+export STELLAOPS_SOFTHSM_MECHANISM="RsaSha256"
+```
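+
+Because the integration tests skip silently when these variables are missing, a preflight check can catch a misconfigured shell before a run reports zero executed tests. A minimal sketch; the function name is illustrative and the variable list simply mirrors the exports above:
+
+```bash
+# Return non-zero and name the first missing variable, if any.
+# Hypothetical helper, not part of the test suite.
+softhsm_env_ready() {
+  local name value
+  for name in STELLAOPS_SOFTHSM_LIB STELLAOPS_SOFTHSM_SLOT \
+              STELLAOPS_SOFTHSM_PIN STELLAOPS_SOFTHSM_KEY_ID \
+              STELLAOPS_SOFTHSM_MECHANISM; do
+    # Look up the variable named by $name; the names come from the fixed
+    # list above, so eval is safe here
+    eval "value=\${$name:-}"
+    if [ -z "$value" ]; then
+      echo "missing: $name" >&2
+      return 1
+    fi
+  done
+}
+
+# Usage:
+#   softhsm_env_ready || echo "SoftHSM2 integration tests would be skipped"
+```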
+
+## Run Integration Tests
+
+```bash
+dotnet test src/Cryptography/__Tests/StellaOps.Cryptography.Tests/StellaOps.Cryptography.Tests.csproj \
+  --filter FullyQualifiedName~Pkcs11HsmClientIntegrationTests
+```
+
+## Notes
+
+- The integration tests skip automatically if SoftHSM2 variables are not configured.
+- Use a dedicated test token; never reuse production tokens.
--- a/docs/operations/unknowns-queue-runbook.md
+++ b/docs/operations/unknowns-queue-runbook.md
@@ -628,9 +628,150 @@ To allow approved exceptions to cover specific unknown reason codes, set excepti
 - [Triage Technical Reference](../product/advisories/14-Dec-2025%20-%20Triage%20and%20Unknowns%20Technical%20Reference.md)
 - [Score Proofs Runbook](./score-proofs-runbook.md)
 - [Policy Engine](../modules/policy/architecture.md)
+- [Determinization API](../modules/policy/determinization-api.md)
+- [VEX Consensus Guide](../VEX_CONSENSUS_GUIDE.md)

 ---

-**Last Updated**: 2025-12-22  
-**Version**: 1.0.0  
-**Sprint**: 3500.0004.0004
+## 8. Grey Queue Operations
+
+> **Sprint**: SPRINT_20260112_010_CLI_unknowns_grey_queue_cli
+
+The Grey Queue handles observations with uncertain status requiring operator attention or additional evidence. These are distinct from standard HOT/WARM/COLD band unknowns.
+
+### 8.1 Grey Queue Overview
+
+Grey Queue items have:
+- **Observation state**: `PendingDeterminization`, `Disputed`, or `GuardedPass`
+- **Reanalysis fingerprint**: Deterministic ID for reproducible replays
+- **Triggers**: Events that caused reanalysis
+- **Conflicts**: Detected evidence disagreements
+- **Next actions**: Suggested resolution paths
+
+### 8.2 List Grey Queue Items
+
+```bash
+# List all grey queue items
+stella unknowns list --state grey
+
+# List by observation state
+stella unknowns list --observation-state pending-determinization
+stella unknowns list --observation-state disputed
+stella unknowns list --observation-state guarded-pass
+
+# List with fingerprint details
+stella unknowns list --state grey --show-fingerprint
+
+# List with conflict summary
+stella unknowns list --state grey --show-conflicts
+```
+
+### 8.3 View Grey Queue Details
+
+```bash
+# Show grey queue item with full details
+stella unknowns show unk-12345678-... --grey
+
+# Output:
+# ID: unk-12345678-...
+# Observation State: Disputed
+# 
+# Reanalysis Fingerprint:
+#   ID: sha256:abc123...
+#   Computed At: 2026-01-15T10:00:00Z
+#   Policy Config Hash: sha256:def456...
+# 
+# Triggers (2):
+#   - epss.updated@1 (2026-01-15T09:55:00Z) delta=0.15
+#   - vex.updated@1 (2026-01-15T09:50:00Z)
+# 
+# Conflicts (1):
+#   - VexStatusConflict: vendor-a reports 'not_affected', vendor-b reports 'affected'
+#     Severity: high
+#     Adjudication: manual_review
+# 
+# Next Actions:
+#   - trust_resolution: Resolve issuer trust conflict
+#   - manual_review: Escalate to security team
+
+# Show fingerprint only
+stella unknowns fingerprint unk-12345678-...
+
+# Show triggers only
+stella unknowns triggers unk-12345678-...
+```
+
+### 8.4 Grey Queue Triage Actions
+
+```bash
+# Resolve a grey queue item (operator determination)
+stella unknowns resolve unk-12345678-... \
+  --status not_affected \
+  --justification "Verified vendor VEX is authoritative" \
+  --evidence-ref "vex-observation-id-123"
+
+# Escalate for manual review
+stella unknowns escalate unk-12345678-... \
+  --priority P1 \
+  --reason "Conflicting VEX requires security team decision"
+
+# Defer pending additional evidence
+stella unknowns defer unk-12345678-... \
+  --await vex \
+  --reason "Waiting for upstream vendor VEX statement"
+```
+
+### 8.5 Grey Queue Conflict Resolution
+
+```bash
+# List items with conflicts
+stella unknowns list --has-conflicts
+
+# Filter by conflict type
+stella unknowns list --conflict-type vex-status-conflict
+stella unknowns list --conflict-type vex-reachability-contradiction
+stella unknowns list --conflict-type trust-tie
+
+# Resolve a conflict manually
+stella unknowns resolve-conflict unk-12345678-... \
+  --winner vendor-a \
+  --reason "vendor-a is the upstream maintainer"
+```
+
+### 8.6 Grey Queue Summary
+
+```bash
+# Get grey queue summary
+stella unknowns summary --grey
+
+# Output:
+# Grey Queue: 23 items
+# 
+# By State:
+#   PendingDeterminization: 15 (65%)
+#   Disputed: 5 (22%)
+#   GuardedPass: 3 (13%)
+# 
+# Conflicts: 8 items have conflicts
+# Avg. Triggers: 2.3 per item
+# Oldest: 7 days
+```
+
+### 8.7 Grey Queue Export
+
+```bash
+# Export grey queue for analysis
+stella unknowns export --state grey --format json --output grey-queue.json
+
+# Export with full fingerprints and triggers
+stella unknowns export --state grey --verbose --output grey-full.json
+
+# Export conflicts only
+stella unknowns export --has-conflicts --format csv --output conflicts.csv
+```
+
+---
+
+**Last Updated**: 2026-01-16  
+**Version**: 1.1.0  
+**Sprint**: SPRINT_20260112_010_CLI_unknowns_grey_queue_cli