sprints completion. new product advisories prepared

This commit is contained in:
master
2026-01-16 16:30:03 +02:00
parent a927d924e3
commit 4ca3ce8fb4
255 changed files with 42434 additions and 1020 deletions


@@ -0,0 +1,331 @@
# Break-Glass Account Runbook
This runbook documents emergency access procedures using the break-glass account system when standard authentication is unavailable.
> **Sprint:** SPRINT_20260112_018_AUTH_local_rbac_fallback
## Overview
Break-glass accounts provide emergency administrative access when:
- PostgreSQL database is unavailable
- OIDC/OAuth2 identity provider is unreachable
- Authority service is degraded
- Network isolation prevents standard authentication
Break-glass access is fully audited and time-limited by design.
## When to Use Break-Glass Access
| Scenario | Standard Auth | Break-Glass |
|----------|---------------|-------------|
| Database maintenance | N/A | Use |
| IdP outage | Unavailable | Use |
| Network partition | Unavailable | Use |
| Routine operations | Available | Do NOT use |
| Security incident response | May be unavailable | Use with incident code |
**CRITICAL:** Break-glass access should only be used when standard authentication is genuinely unavailable. All usage is logged and auditable.
## Prerequisites
### Configuration Requirements
Break-glass must be explicitly enabled in local policy:
```yaml
# /etc/stellaops/authority/local-policy.yaml
breakGlass:
  enabled: true
  sessionTimeoutMinutes: 15
  maxExtensions: 2
  allowedReasonCodes:
    - database_maintenance
    - idp_outage
    - network_partition
    - security_incident
    - disaster_recovery
  accounts:
    - id: "break-glass-admin"
      passwordHash: "$argon2id$v=19$m=65536,t=3,p=4$..."
      roles: ["admin"]
```
### Password Hash Generation
Generate password hashes using Argon2id:
```bash
# Using argon2 CLI tool
echo -n "your-secure-password" | argon2 $(openssl rand -base64 16) -id -t 3 -m 16 -p 4 -l 32 -e
# Or using stella CLI
stella auth hash-password --algorithm argon2id
```
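The `-m 16` flag above and the `m=65536` field in the configured hash describe the same memory cost: the `argon2` CLI takes the exponent (2^16 KiB), while the encoded hash stores the value in KiB. The following is a minimal Python sketch for checking the parameters of an encoded hash, assuming the usual `$argon2id$v=…$m=…,t=…,p=…$salt$hash` layout. The function name `parse_argon2_params` and the sample salt and hash are hypothetical and not part of any Stella Ops tooling.

```python
def parse_argon2_params(encoded: str) -> dict:
    """Extract variant, version, and cost parameters from an encoded Argon2 hash."""
    # The leading "$" yields an empty first field:
    # ["", "argon2id", "v=19", "m=...,t=...,p=...", salt, hash]
    fields = encoded.split("$")
    if len(fields) < 4 or fields[0] != "":
        raise ValueError("not an encoded Argon2 hash")
    params = dict(pair.split("=") for pair in fields[3].split(","))
    return {
        "variant": fields[1],
        "version": int(fields[2].split("=")[1]),
        "memory_kib": int(params["m"]),
        "iterations": int(params["t"]),
        "parallelism": int(params["p"]),
    }


# Sample value only; the salt and hash parts are placeholders.
info = parse_argon2_params("$argon2id$v=19$m=65536,t=3,p=4$c2FsdHNhbHQ$aGFzaGhhc2g")
print(info["memory_kib"] == 2 ** 16)  # compares the stored cost with `-m 16`
```

A check like this can help catch a hash generated with weaker parameters before it is placed in `local-policy.yaml`.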
## Break-Glass Login Procedure
### Step 1: Verify Standard Auth is Unavailable
Before using break-glass, confirm standard authentication is genuinely unavailable:
```bash
# Check Authority health
curl -s https://authority.example.com/health | jq .
# Check OIDC endpoint
curl -s https://idp.example.com/.well-known/openid-configuration
# Check database connectivity
stella doctor check --component postgres
```
### Step 2: Access Break-Glass Login
Navigate to the break-glass endpoint:
```
https://authority.example.com/break-glass/login
```
Or use the CLI:
```bash
stella auth break-glass login \
--account break-glass-admin \
--reason database_maintenance
```
### Step 3: Provide Credentials and Reason
| Field | Description | Required |
|-------|-------------|----------|
| Account ID | Break-glass account identifier | Yes |
| Password | Account password | Yes |
| Reason Code | Pre-approved reason code | Yes |
| Reason Details | Free-text explanation | Recommended |
**Approved Reason Codes:**
| Code | Description |
|------|-------------|
| `database_maintenance` | Scheduled or emergency database work |
| `idp_outage` | Identity provider unavailable |
| `network_partition` | Network connectivity issues |
| `security_incident` | Active security incident response |
| `disaster_recovery` | DR/BCP activation |
### Step 4: Session Created
On successful authentication:
- Session token issued with limited TTL (default: 15 minutes)
- Audit event logged: `breakglass.session.created`
- All subsequent actions are tagged with break-glass context
## Session Management
### Session Timeout
Break-glass sessions have strict time limits:
| Setting | Default | Description |
|---------|---------|-------------|
| `sessionTimeoutMinutes` | 15 | Session lifetime |
| `maxExtensions` | 2 | Maximum session extensions |
| Extension period | 15 min | Time added per extension |
### Extending a Session
If additional time is needed:
```bash
# CLI
stella auth break-glass extend \
--session-id <session-id> \
--reason "database migration still running"
# UI
# Click "Extend Session" button in break-glass banner
```
Extension requires:
1. Re-entering password
2. Providing extension reason
3. Not exceeding `maxExtensions` limit
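
With the defaults above, a session can last at most 15 + 2 × 15 = 45 minutes. The sketch below computes the expiry after each extension. It assumes every extension adds the extension period to the current expiry (rather than restarting the clock at the moment of extension); the function name `session_deadlines` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def session_deadlines(created_at, timeout_minutes=15, max_extensions=2, extension_minutes=15):
    """Return the expiry time after 0, 1, ..., max_extensions extensions."""
    first_expiry = created_at + timedelta(minutes=timeout_minutes)
    return [
        first_expiry + timedelta(minutes=extension_minutes * n)
        for n in range(max_extensions + 1)
    ]


created = datetime(2026, 1, 16, 10, 30, tzinfo=timezone.utc)
deadlines = session_deadlines(created)
hard_limit = deadlines[-1] - created  # longest possible session lifetime
print(hard_limit)
```

If the work cannot finish within that window, plan for a fresh login with a new reason entry instead of relying on extensions.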
### Session Termination
Sessions end when:
- User explicitly logs out
- Session timeout expires
- Max extensions reached
- Administrator force-terminates
```bash
# Explicit logout
stella auth break-glass logout --session-id <session-id>
# Force terminate (admin)
stella auth break-glass terminate --session-id <session-id> --reason "normal auth restored"
```
## Audit Trail
### Audit Events
All break-glass activity is logged:
| Event | Description |
|-------|-------------|
| `breakglass.session.created` | Session started |
| `breakglass.session.extended` | Session extended |
| `breakglass.session.terminated` | User logout |
| `breakglass.session.expired` | Timeout reached |
| `breakglass.auth.failed` | Authentication failed |
| `breakglass.reason.invalid` | Invalid reason code |
| `breakglass.extensions.exceeded` | Max extensions reached |
### Audit Event Structure
```json
{
"eventType": "breakglass.session.created",
"timestamp": "2026-01-16T10:30:00Z",
"accountId": "break-glass-admin",
"sessionId": "bg-sess-abc123",
"reasonCode": "database_maintenance",
"reasonDetails": "PostgreSQL major version upgrade",
"sourceIp": "10.0.1.50",
"userAgent": "stella-cli/2027.Q1"
}
```
### Querying Audit Logs
```bash
# List all break-glass events
stella audit query --event-type "breakglass.*" --since "24h"
# Export for compliance
stella audit export \
--event-type "breakglass.*" \
--start 2026-01-01 \
--end 2026-01-31 \
--format json \
--output break-glass-audit-jan2026.json
```
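For compliance reviews it can help to summarize an export rather than read it event by event. The sketch below assumes the exported file is a JSON array of events shaped like the "Audit Event Structure" example; that layout is an assumption about the export format, and `summarize_break_glass` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_break_glass(events):
    """Count created sessions per reason code and failed logins."""
    created = [e for e in events if e.get("eventType") == "breakglass.session.created"]
    return {
        "sessions": len(created),
        "byReason": dict(Counter(e.get("reasonCode", "unknown") for e in created)),
        "failedLogins": sum(e.get("eventType") == "breakglass.auth.failed" for e in events),
    }


# Inline stand-in for the exported file; field names follow "Audit Event Structure".
sample = json.loads("""
[
  {"eventType": "breakglass.session.created", "reasonCode": "database_maintenance"},
  {"eventType": "breakglass.auth.failed", "accountId": "break-glass-admin"},
  {"eventType": "breakglass.session.created", "reasonCode": "idp_outage"},
  {"eventType": "breakglass.session.extended", "sessionId": "bg-sess-abc123"}
]
""")
summary = summarize_break_glass(sample)
print(summary)
```

To run it against a real export, replace the inline sample with `json.load(open("break-glass-audit-jan2026.json"))`, provided the file really is a JSON array.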
## Fallback Policy Store
### Automatic Failover
When PostgreSQL becomes unavailable:
1. Authority detects health check failures.
2. After `failureThreshold` (default: 3) consecutive failures, Authority switches to the local policy store.
3. The mode changes to `Fallback`.
4. The event `authority.mode.changed` is logged.
### Policy Store Modes
| Mode | Description | Available Features |
|------|-------------|-------------------|
| `Primary` | PostgreSQL available | Full RBAC, user management |
| `Fallback` | Using local policy | Break-glass only |
| `Degraded` | Both PostgreSQL and the local policy store degraded | Emergency access only |
### Recovery
When PostgreSQL recovers:
1. Health checks pass again.
2. After the `minFallbackDurationMs` cooldown (default: 30s), Authority switches back to `Primary`.
3. Sessions started in fallback mode can continue until they expire.
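
The failover and recovery rules can be sketched as a small state machine. This is an illustration under stated assumptions, not the Authority implementation: the class name `PolicyStoreMode` is hypothetical, timestamps are plain seconds, and the `Degraded` mode is left out.

```python
class PolicyStoreMode:
    """Tracks Primary/Fallback switching from a stream of health check results."""

    def __init__(self, failure_threshold=3, min_fallback_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.min_fallback_seconds = min_fallback_seconds
        self.mode = "Primary"
        self.consecutive_failures = 0
        self.fallback_since = None

    def observe(self, healthy: bool, now: float) -> str:
        if self.mode == "Primary":
            # A healthy check resets the streak; a failed one extends it.
            self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
            if self.consecutive_failures >= self.failure_threshold:
                self.mode, self.fallback_since = "Fallback", now
        elif healthy and now - self.fallback_since >= self.min_fallback_seconds:
            # Switch back only once the cooldown has elapsed.
            self.mode, self.consecutive_failures = "Primary", 0
        return self.mode


store = PolicyStoreMode()
modes = [store.observe(False, t) for t in (0, 5, 10)]  # three failures in a row
early = store.observe(True, 20)  # healthy, but only 10 s into fallback
late = store.observe(True, 45)   # healthy and past the 30 s cooldown
print(modes, early, late)
```

The cooldown keeps the mode from flapping when PostgreSQL recovers briefly and fails again.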
## Security Considerations
### Password Policy
Break-glass account passwords should:
- Be at least 20 characters
- Include uppercase and lowercase letters, numbers, and symbols
- Be stored securely (HSM, Vault, split custody)
- Be rotated on a schedule (quarterly recommended)
### Access Control
- Limit break-glass accounts to essential personnel
- Use separate accounts per operator when possible
- Review access list quarterly
- Disable unused accounts immediately
### Monitoring
Set up alerts for break-glass activity:
```yaml
# Alert rule example
- alert: BreakGlassSessionCreated
  # increase() fires only on new sessions; a bare "counter > 0" would keep firing after the first one
  expr: increase(stellaops_breakglass_sessions_created_total[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Break-glass session created
    description: A break-glass session was created. Verify this is expected.
```
## Troubleshooting
### Login Failures
| Error | Cause | Resolution |
|-------|-------|------------|
| `invalid_credentials` | Wrong password | Verify password |
| `invalid_reason_code` | Reason not in allowed list | Use approved reason code |
| `account_disabled` | Account explicitly disabled | Contact administrator |
| `break_glass_disabled` | Feature disabled in config | Enable in local-policy.yaml |
### Session Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Session expired immediately | Clock skew | Sync server time |
| Cannot extend | Max extensions reached | Log out and re-authenticate |
| Actions failing | Insufficient roles | Verify account has required roles |
### Policy Store Issues
```bash
# Check policy store status
stella doctor check --component authority
# Verify local policy file
stella auth policy validate --file /etc/stellaops/authority/local-policy.yaml
# Force reload policy
stella auth policy reload
```
## Compliance Notes
Break-glass usage must be:
- Documented in incident reports
- Reviewed during security audits
- Reported in compliance dashboards
- Justified for each session
Retain audit logs for:
- SOC 2: 1 year minimum
- HIPAA: 6 years
- PCI-DSS: 1 year
- Internal policy: As defined
## Related Documentation
- [Local RBAC Policy Schema](../modules/authority/local-policy-schema.md)
- [Authority Architecture](../modules/authority/architecture.md)
- [Offline Operations](../operations/airgap-operations-runbook.md)
- [Audit System](../modules/audit/architecture.md)


@@ -0,0 +1,262 @@
# Checkpoint Divergence Detection and Incident Response
This runbook covers the detection of Rekor checkpoint divergence, anomaly types, alert handling, and incident response procedures.
## Overview
Checkpoint divergence detection monitors the integrity of Rekor transparency logs by:
- Comparing root hashes at the same tree size
- Verifying tree size monotonicity (only increases)
- Cross-checking primary logs against mirrors
- Detecting stale or unresponsive logs
Divergence can indicate:
- Split-view attacks (malicious log server showing different trees to different clients)
- Rollback attacks (hiding recent log entries)
- Log compromise or key theft
- Network partitions or operational issues
## Detection Rules
| Check | Condition | Severity | Recommended Action |
|-------|-----------|----------|-------------------|
| Root hash mismatch | Same tree_size, different root_hash | CRITICAL | Quarantine + immediate investigation |
| Tree size rollback | new_tree_size < stored_tree_size | CRITICAL | Reject checkpoint + alert |
| Cross-log divergence | Primary root ≠ mirror root at same size | WARNING | Alert + investigate |
| Stale checkpoint | Checkpoint age > threshold | WARNING | Alert + monitor |
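
The rules in this table can be sketched as a pair of pure functions. This is a minimal illustration, not the Attestor detector: the `Checkpoint` type, the function names, and the anomaly labels other than `RootHashMismatch` are hypothetical, and the one-hour staleness default mirrors `staleCheckpointThreshold`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Checkpoint:
    origin: str
    tree_size: int
    root_hash: str
    age_seconds: float = 0.0


def classify(stored: Checkpoint, new: Checkpoint,
             stale_after: float = 3600.0) -> Optional[Tuple[str, str]]:
    """Return (anomaly, severity) for a newly fetched checkpoint, or None if it looks fine."""
    if new.tree_size < stored.tree_size:
        return ("TreeSizeRollback", "critical")
    if new.tree_size == stored.tree_size and new.root_hash != stored.root_hash:
        return ("RootHashMismatch", "critical")
    if new.age_seconds > stale_after:
        return ("StaleCheckpoint", "warning")
    return None


def cross_log_divergence(primary: Checkpoint, mirror: Checkpoint) -> bool:
    """True when primary and mirror disagree on the root at the same tree size."""
    return primary.tree_size == mirror.tree_size and primary.root_hash != mirror.root_hash


stored = Checkpoint("rekor.sigstore.dev", 12345678, "sha256:abc123")
mismatch = classify(stored, Checkpoint("rekor.sigstore.dev", 12345678, "sha256:def456"))
rollback = classify(stored, Checkpoint("rekor.sigstore.dev", 12345600, "sha256:abc123"))
print(mismatch, rollback)
```

A mirror that reports a smaller tree size than the primary is not flagged by `cross_log_divergence`, which matches the guidance that mirrors may legitimately lag.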
## Alert Payloads
### Root Hash Mismatch Alert
```json
{
"eventType": "rekor.checkpoint.divergence",
"severity": "critical",
"origin": "rekor.sigstore.dev",
"treeSize": 12345678,
"expectedRootHash": "sha256:abc123...",
"actualRootHash": "sha256:def456...",
"detectedAt": "2026-01-15T12:34:56Z",
"backend": "sigstore-prod",
"description": "Checkpoint root hash mismatch detected. Possible split-view attack.",
"recommendedAction": "Quarantine"
}
```
### Rollback Attempt Alert
```json
{
"eventType": "rekor.checkpoint.rollback",
"severity": "critical",
"origin": "rekor.sigstore.dev",
"previousTreeSize": 12345678,
"attemptedTreeSize": 12345600,
"detectedAt": "2026-01-15T12:34:56Z",
"description": "Tree size regression detected. Possible rollback attack."
}
```
### Cross-Log Divergence Alert
```json
{
"eventType": "rekor.checkpoint.cross_log_divergence",
"severity": "warning",
"primaryOrigin": "rekor.sigstore.dev",
"mirrorOrigin": "rekor.mirror.example.com",
"treeSize": 12345678,
"primaryRootHash": "sha256:abc123...",
"mirrorRootHash": "sha256:def456...",
"description": "Cross-log divergence detected between primary and mirror."
}
```
## Metrics
```
# Counter: total checkpoint mismatches
attestor_rekor_checkpoint_mismatch_total{backend="sigstore-prod",origin="rekor.sigstore.dev"} 0
# Counter: rollback attempts detected
attestor_rekor_checkpoint_rollback_detected_total{backend="sigstore-prod"} 0
# Counter: cross-log divergences detected
attestor_rekor_cross_log_divergence_total{primary="rekor.sigstore.dev",mirror="rekor.mirror.example.com"} 0
# Gauge: seconds since last valid checkpoint
attestor_rekor_checkpoint_age_seconds{backend="sigstore-prod"} 120
# Counter: total anomalies detected (all types)
attestor_rekor_anomalies_detected_total{type="RootHashMismatch",severity="critical"} 0
```
## Incident Response Procedures
### Level 1: Root Hash Mismatch (CRITICAL)
**Symptoms:**
- `attestor_rekor_checkpoint_mismatch_total` increments
- Alert received: "rekor.checkpoint.divergence"
**Immediate Actions:**
1. **Quarantine all affected proofs** - Do not rely on any inclusion proofs from the affected log until resolved
2. **Suspend automated verifications** - Halt any automated systems that depend on the log
3. **Preserve evidence** - Capture both checkpoints (expected and actual) with full metadata
4. **Alert security team** - This is a potential compromise indicator
**Investigation Steps:**
1. Verify the mismatch isn't a local storage corruption
```bash
stella attestor checkpoint verify --origin rekor.sigstore.dev --tree-size 12345678
```
2. Cross-check with independent sources (other clients, mirrors)
3. Check if Sigstore has published any incident reports
4. Review network logs for MITM indicators
**Resolution:**
- If confirmed attack: Follow security incident process
- If local corruption: Resync from trusted source
- If upstream issue: Wait for Sigstore remediation, follow their guidance
### Level 2: Tree Size Rollback (CRITICAL)
**Symptoms:**
- `attestor_rekor_checkpoint_rollback_detected_total` increments
- Alert received: "rekor.checkpoint.rollback"
**Immediate Actions:**
1. **Reject the checkpoint** - Do not accept or store it
2. **Log full details** for forensic analysis
3. **Check network path** - Could indicate MITM or DNS hijacking
**Investigation Steps:**
1. Verify current log state directly:
```bash
curl -s https://rekor.sigstore.dev/api/v1/log | jq .treeSize
```
2. Compare with stored latest tree size
3. Check DNS resolution and TLS certificate chain
**Resolution:**
- If network attack: Remediate network path, rotate credentials
- If temporary glitch: Monitor for repetition
- If persistent: Escalate to upstream provider
### Level 3: Cross-Log Divergence (WARNING)
**Symptoms:**
- `attestor_rekor_cross_log_divergence_total` increments
- Alert received: "rekor.checkpoint.cross_log_divergence"
**Immediate Actions:**
1. **Do not panic** - Mirrors may have legitimate lag
2. **Check mirror sync status** - May be catching up
**Investigation Steps:**
1. Compare tree sizes:
```bash
stella attestor checkpoint list --origins rekor.sigstore.dev,rekor.mirror.example.com
```
2. If same tree size with different roots: Escalate to CRITICAL
3. If different tree sizes: Allow time for sync
4. If persistent: Investigate mirror operator
**Resolution:**
- Sync lag: Monitor until caught up
- Persistent divergence: Disable mirror, investigate, or remove from trust list
### Level 4: Stale Checkpoint (WARNING)
**Symptoms:**
- `attestor_rekor_checkpoint_age_seconds` exceeds threshold
- Log health status: DEGRADED or UNHEALTHY
**Immediate Actions:**
1. Check log service status
2. Verify network connectivity to log
**Investigation Steps:**
1. Check Sigstore status page
2. Test direct API access:
```bash
curl -I https://rekor.sigstore.dev/api/v1/log
```
3. Review recent checkpoint fetch attempts
**Resolution:**
- Upstream outage: Wait, rely on cached data
- Local network issue: Restore connectivity
- Persistent: Consider failover to mirror
## Configuration
### Detector Options
```yaml
attestor:
  divergenceDetection:
    # Enable checkpoint monitoring
    enabled: true
    # Threshold for "stale checkpoint" warning
    staleCheckpointThreshold: 1h
    # Threshold for "stale tree size" (no growth)
    staleTreeSizeThreshold: 2h
    # Log health thresholds
    degradedCheckpointAgeThreshold: 30m
    unhealthyCheckpointAgeThreshold: 2h
    # Enable cross-log consistency checks
    enableCrossLogChecks: true
    # Mirror origins to check against primary
    mirrorOrigins:
      - rekor.mirror.example.com
      - rekor.mirror2.example.com
```
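The two health thresholds map a checkpoint age to a log health status. A minimal sketch, assuming the thresholds are inclusive and that the healthy state is named `HEALTHY` (only `DEGRADED` and `UNHEALTHY` appear in this runbook); `log_health` is a hypothetical name.

```python
def log_health(age_seconds: float,
               degraded_after: float = 30 * 60,
               unhealthy_after: float = 2 * 60 * 60) -> str:
    """Map checkpoint age to a health status using the configured thresholds."""
    if age_seconds >= unhealthy_after:
        return "UNHEALTHY"
    if age_seconds >= degraded_after:
        return "DEGRADED"
    return "HEALTHY"


for age in (120, 45 * 60, 3 * 60 * 60):
    print(age, log_health(age))
```

The input corresponds to the `attestor_rekor_checkpoint_age_seconds` gauge, so the same thresholds can be reused in alert rules.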
### Alert Options
```yaml
attestor:
  alerts:
    # Enable alert publishing to Notify service
    enabled: true
    # Default tenant for system alerts
    defaultTenant: system
    # Severity thresholds for alerting
    alertOnHighSeverity: true
    alertOnWarning: true
    alertOnInfo: false
    # Alert stream name
    stream: attestor.alerts
```
## Runbook Checklist
### Daily Operations
- [ ] Verify `attestor_rekor_checkpoint_age_seconds` < threshold
- [ ] Check for any anomaly counter increments
- [ ] Review divergence detector logs for warnings
### Weekly Review
- [ ] Audit checkpoint storage integrity
- [ ] Verify mirror sync status
- [ ] Review and tune alerting thresholds
### Post-Incident
- [ ] Document root cause
- [ ] Update detection rules if needed
- [ ] Review and improve response procedures
- [ ] Share learnings with team
## See Also
- [Rekor Verification Design](../modules/attestor/rekor-verification-design.md)
- [Attestor Architecture](../modules/attestor/architecture.md)
- [Sigstore Rekor Documentation](https://docs.sigstore.dev/rekor/overview/)
- [Certificate Transparency RFC 6962](https://www.rfc-editor.org/rfc/rfc6962)


@@ -0,0 +1,443 @@
# Dual-Control Ceremony Runbook
This runbook documents M-of-N threshold signing ceremonies for high-assurance key operations in Stella Ops.
> **Sprint:** SPRINT_20260112_018_SIGNER_dual_control_ceremonies
## Overview
Dual-control ceremonies ensure critical cryptographic operations require approval from multiple authorized individuals before execution. This prevents single points of compromise for sensitive operations like:
- Root key rotation
- Trust anchor updates
- Emergency key revocation
- HSM key generation
- Recovery key activation
## When Ceremonies Are Required
| Operation | Default Threshold | Configurable |
|-----------|------------------|--------------|
| Root signing key rotation | 2-of-3 | Yes |
| Trust anchor update | 2-of-3 | Yes |
| Key revocation | 2-of-3 | Yes |
| HSM key generation | 2-of-4 | Yes |
| Recovery key activation | 3-of-5 | Yes |
## Ceremony Lifecycle
### State Machine
```
     +------------------+
     |     Pending      |
     +--------+---------+
              |
              | Approvals collected
              v
+-------------+-------------+
|     PartiallyApproved     |
+-------------+-------------+
              |
              | Threshold reached
              v
     +--------+---------+
     |     Approved     |
     +--------+---------+
              |
              | Execute
              v
     +--------+---------+
     |     Executed     |
     +------------------+

Alternative paths:
- Pending -> Expired (timeout)
- Pending -> Cancelled (initiator cancel)
- PartiallyApproved -> Expired (timeout)
- PartiallyApproved -> Cancelled
```
### State Descriptions
| State | Description |
|-------|-------------|
| `Pending` | Ceremony created, awaiting first approval |
| `PartiallyApproved` | At least one approval, threshold not reached |
| `Approved` | Threshold reached, ready for execution |
| `Executed` | Operation completed successfully |
| `Expired` | Timeout reached without execution |
| `Cancelled` | Explicitly cancelled before execution |
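
The lifecycle can be sketched as a small class that enforces the transitions in the diagram. This is an illustration only, not the Signer implementation: the `Ceremony` class and its method names are hypothetical, and timeouts are reduced to an explicit `expire()` call.

```python
class Ceremony:
    """Minimal model of the ceremony states and their allowed transitions."""

    OPEN_STATES = ("Pending", "PartiallyApproved")

    def __init__(self, threshold, allowed_approvers):
        self.threshold = threshold
        self.allowed = set(allowed_approvers)
        self.approvals = set()
        self.state = "Pending"

    def approve(self, approver):
        if self.state not in self.OPEN_STATES:
            raise ValueError(f"cannot approve in state {self.state}")
        if approver not in self.allowed:
            raise PermissionError(f"{approver} is not an allowed approver")
        if approver in self.approvals:
            raise ValueError(f"{approver} has already approved")
        self.approvals.add(approver)
        reached = len(self.approvals) >= self.threshold
        self.state = "Approved" if reached else "PartiallyApproved"
        return self.state

    def execute(self):
        if self.state != "Approved":
            raise ValueError(f"cannot execute in state {self.state}")
        self.state = "Executed"

    def cancel(self):
        self._close("Cancelled")

    def expire(self):
        self._close("Expired")

    def _close(self, final_state):
        # Per the diagram, only Pending and PartiallyApproved can expire or be cancelled.
        if self.state not in self.OPEN_STATES:
            raise ValueError(f"cannot move from {self.state} to {final_state}")
        self.state = final_state


ceremony = Ceremony(2, ["security-lead@company.com", "ciso@company.com", "key-officer-1@company.com"])
print(ceremony.approve("security-lead@company.com"))
print(ceremony.approve("ciso@company.com"))
ceremony.execute()
print(ceremony.state)
```

Keeping approvals in a set makes the "one approval per approver" rule explicit: a second approval from the same identity is rejected instead of counted.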
## Creating a Ceremony
### Via CLI
```bash
stella ceremony create \
--type key-rotation \
--subject "Root signing key Q1-2026" \
--threshold 2 \
--required-approvers 3 \
--expires-in 24h \
--payload '{"keyId": "root-2026-q1", "algorithm": "ecdsa-p384"}'
```
### Via API
```bash
curl -X POST https://signer.example.com/api/v1/ceremonies \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"type": "key-rotation",
"subject": "Root signing key Q1-2026",
"threshold": 2,
"requiredApprovers": 3,
"expiresAt": "2026-01-17T10:00:00Z",
"payload": {
"keyId": "root-2026-q1",
"algorithm": "ecdsa-p384"
}
}'
```
### Response
```json
{
"ceremonyId": "cer-abc123",
"type": "key-rotation",
"state": "Pending",
"threshold": 2,
"requiredApprovers": 3,
"currentApprovals": 0,
"createdAt": "2026-01-16T10:00:00Z",
"expiresAt": "2026-01-17T10:00:00Z",
"initiator": "admin@company.com"
}
```
## Approving a Ceremony
### Prerequisites
Approvers must:
1. Be in the ceremony's allowed approvers list
2. Have the `ceremony:approve` scope
3. Have valid authentication (OIDC or break-glass)
4. Not have already approved this ceremony
### Via CLI
```bash
stella ceremony approve \
--ceremony-id cer-abc123 \
--reason "Reviewed rotation plan, verified key parameters" \
--sign
```
The `--sign` flag creates a DSSE signature over the approval using the approver's signing key.
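
A DSSE signature is computed over a pre-authentication encoding (PAE) of the payload type and payload rather than over the raw payload. The sketch below builds those bytes with the standard library only. The PAE layout follows the DSSE specification; the payload type string and the approval fields are hypothetical placeholders, since the actual Stella Ops approval schema is not shown in this runbook.

```python
import json


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


# Hypothetical payload type and approval body; the real schema may differ.
PAYLOAD_TYPE = "application/vnd.example.ceremony-approval+json"
approval = {
    "ceremonyId": "cer-abc123",
    "reason": "Reviewed rotation plan, verified key parameters",
}
# Canonical-looking JSON so that the signed bytes are reproducible.
payload = json.dumps(approval, sort_keys=True, separators=(",", ":")).encode("utf-8")
to_sign = pae(PAYLOAD_TYPE, payload)
print(to_sign[:60])
```

Because the payload type is part of the signed bytes, a signature made for one kind of document cannot be replayed as an approval of another kind.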
### Via API
```bash
curl -X POST https://signer.example.com/api/v1/ceremonies/cer-abc123/approve \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "Reviewed rotation plan, verified key parameters",
"signature": "base64-encoded-dsse-signature"
}'
```
### Approval Response
```json
{
"ceremonyId": "cer-abc123",
"state": "PartiallyApproved",
"currentApprovals": 1,
"threshold": 2,
"approval": {
"approvalId": "apr-def456",
"approver": "security-lead@company.com",
"approvedAt": "2026-01-16T11:30:00Z",
"reason": "Reviewed rotation plan, verified key parameters",
"signatureValid": true
}
}
```
## Executing a Ceremony
Once the approval threshold is reached:
### Via CLI
```bash
stella ceremony execute --ceremony-id cer-abc123
```
### Via API
```bash
curl -X POST https://signer.example.com/api/v1/ceremonies/cer-abc123/execute \
-H "Authorization: Bearer $TOKEN"
```
### Execution Response
```json
{
"ceremonyId": "cer-abc123",
"state": "Executed",
"executedAt": "2026-01-16T14:00:00Z",
"result": {
"keyId": "root-2026-q1",
"publicKey": "-----BEGIN PUBLIC KEY-----...",
"fingerprint": "SHA256:abc123...",
"activatedAt": "2026-01-16T14:00:00Z"
}
}
```
## Monitoring Ceremonies
### List Active Ceremonies
```bash
# CLI
stella ceremony list --state pending,partially-approved
# API
curl "https://signer.example.com/api/v1/ceremonies?state=pending,partially-approved"
```
### Check Ceremony Status
```bash
# CLI
stella ceremony status --ceremony-id cer-abc123
# API
curl "https://signer.example.com/api/v1/ceremonies/cer-abc123"
```
## Cancelling a Ceremony
Ceremonies can be cancelled before execution:
```bash
# CLI
stella ceremony cancel \
--ceremony-id cer-abc123 \
--reason "Postponed due to schedule conflict"
# API
curl -X DELETE https://signer.example.com/api/v1/ceremonies/cer-abc123 \
-H "Authorization: Bearer $TOKEN"
```
Only the initiator or users with `ceremony:cancel` scope can cancel.
## Audit Events
All ceremony actions are logged:
| Event | Description |
|-------|-------------|
| `signer.ceremony.initiated` | Ceremony created |
| `signer.ceremony.approved` | Approval submitted |
| `signer.ceremony.approval_rejected` | Approval rejected (invalid signature, unauthorized) |
| `signer.ceremony.executed` | Operation executed |
| `signer.ceremony.expired` | Timeout reached |
| `signer.ceremony.cancelled` | Explicitly cancelled |
### Audit Event Structure
```json
{
"eventType": "signer.ceremony.approved",
"timestamp": "2026-01-16T11:30:00Z",
"ceremonyId": "cer-abc123",
"ceremonyType": "key-rotation",
"actor": "security-lead@company.com",
"approvalId": "apr-def456",
"currentApprovals": 1,
"threshold": 2,
"signatureAlgorithm": "ecdsa-p256",
"signatureKeyId": "user-signing-key-456"
}
```
### Query Audit Logs
```bash
stella audit query \
--event-type "signer.ceremony.*" \
--since 7d \
--ceremony-id cer-abc123
```
## Configuration
### Ceremony Settings
```yaml
# signer-config.yaml
ceremonies:
  enabled: true
  defaultTimeout: 24h
  maxTimeout: 168h # 7 days
  requireSignedApprovals: true
  thresholds:
    key-rotation:
      minimum: 2
      default: 2
      maximum: 5
    key-revocation:
      minimum: 2
      default: 3
      maximum: 5
    trust-anchor-update:
      minimum: 2
      default: 2
      maximum: 4
```
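One plausible reading of the `minimum`/`default`/`maximum` settings is that a requested threshold must fall inside the configured range and the default applies when none is given. The sketch below encodes that reading; how the Signer actually applies these limits is an assumption, and `resolve_threshold` is a hypothetical helper.

```python
THRESHOLDS = {
    "key-rotation": {"minimum": 2, "default": 2, "maximum": 5},
    "key-revocation": {"minimum": 2, "default": 3, "maximum": 5},
    "trust-anchor-update": {"minimum": 2, "default": 2, "maximum": 4},
}


def resolve_threshold(ceremony_type, requested=None, required_approvers=None):
    """Pick the effective threshold and reject values outside the configured range."""
    limits = THRESHOLDS[ceremony_type]
    value = limits["default"] if requested is None else requested
    if not limits["minimum"] <= value <= limits["maximum"]:
        raise ValueError(
            f"threshold {value} outside {limits['minimum']}..{limits['maximum']} for {ceremony_type}"
        )
    if required_approvers is not None and value > required_approvers:
        raise ValueError("threshold cannot exceed the number of approvers")
    return value


print(resolve_threshold("key-revocation"))
print(resolve_threshold("key-rotation", requested=2, required_approvers=3))
```

Validating at creation time avoids ceremonies that can never reach `Approved`, for example a threshold of 4 with only 3 approvers.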
### Approver Configuration
```yaml
# approvers.yaml
approverGroups:
  - name: key-custodians
    members:
      - security-lead@company.com
      - ciso@company.com
      - key-officer-1@company.com
      - key-officer-2@company.com
    operations:
      - key-rotation
      - key-revocation
  - name: trust-admins
    members:
      - trust-admin@company.com
      - security-lead@company.com
    operations:
      - trust-anchor-update
```
## Notifications
Ceremonies trigger notifications to approvers:
| Event | Notification |
|-------|-------------|
| Ceremony created | Email/Slack to all eligible approvers |
| Approval submitted | Email/Slack to remaining approvers |
| Threshold reached | Email/Slack to initiator |
| Approaching expiry | Email/Slack at 75% and 90% of timeout |
| Expired | Email/Slack to initiator and approvers |
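
The "approaching expiry" reminders are defined as fractions of the ceremony window. A small sketch of that arithmetic, using the `createdAt` and `expiresAt` values from the creation example; `expiry_reminders` is a hypothetical helper, not part of the notifier.

```python
from datetime import datetime, timezone


def expiry_reminders(created_at, expires_at, fractions=(0.75, 0.90)):
    """Return the instants at which the given fractions of the window have elapsed."""
    window = expires_at - created_at
    return [created_at + window * fraction for fraction in fractions]


created = datetime(2026, 1, 16, 10, 0, tzinfo=timezone.utc)
expires = datetime(2026, 1, 17, 10, 0, tzinfo=timezone.utc)
reminders = expiry_reminders(created, expires)
for moment in reminders:
    print(moment.isoformat())
```

For a 24-hour ceremony this leaves 6 hours after the first reminder and 2 hours 24 minutes after the second.
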
Configure notifications in `notifier-config.yaml`:
```yaml
notifications:
  ceremonies:
    enabled: true
    channels:
      - type: email
        recipients: "@approverGroup"
      - type: slack
        webhook: ${SLACK_CEREMONY_WEBHOOK}
        channel: "#key-ceremonies"
```
## Security Best Practices
### Approver Requirements
- Maintain at least M+1 eligible approvers for M-of-N ceremonies, so that one unavailable approver cannot block the threshold
- Distribute approvers across security domains
- Require hardware tokens for signing keys
- Rotate approver list quarterly
### Ceremony Hygiene
- Use descriptive subjects for audit clarity
- Set timeouts long enough for approvers to respond but no longer than necessary (default 24h, maximum 168h)
- Document approval reasons thoroughly
- Review executed ceremonies monthly
### Monitoring
Set up alerts for:
```yaml
alerts:
  - name: CeremonyPendingTooLong
    condition: ceremony.pending_duration > 12h
    severity: warning
  - name: CeremonyApprovalRejected
    condition: ceremony.approval_rejected
    severity: critical
  - name: UnauthorizedCeremonyAttempt
    condition: ceremony.unauthorized_access
    severity: critical
```
## Troubleshooting
### Common Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Approval rejected | Invalid signature | Re-sign with correct key |
| Cannot approve | Already approved | Different approver must approve |
| Cannot execute | Threshold not met | Collect more approvals |
| Ceremony expired | Timeout reached | Create new ceremony |
### Signature Verification Failures
```bash
# Verify signing key is accessible
stella auth keys list
# Test signature
echo "test" | stella sign --key-id my-signing-key | stella verify
# Check key permissions
stella auth keys info --key-id my-signing-key
```
## Emergency Procedures
### Stuck Ceremony
If a ceremony is stuck (approvers unavailable):
1. Cancel the stuck ceremony
2. Create new ceremony with available approvers
3. Document the situation in audit notes
### Compromised Approver
If an approver's credentials are compromised:
1. Revoke approver's signing key immediately
2. Cancel any pending ceremonies they created
3. Review recent approvals for anomalies
4. Remove from approver groups
5. Document in incident report
## Related Documentation
- [Key Rotation Runbook](./key-rotation-runbook.md)
- [HSM Setup Runbook](./hsm-setup-runbook.md)
- [Signer Architecture](../modules/signer/architecture.md)
- [Break-Glass Runbook](./break-glass-runbook.md)


@@ -0,0 +1,278 @@
# Evidence Migration Guide
This guide covers evidence-specific migration procedures during upgrades, schema changes, or disaster recovery scenarios.
## Overview
Evidence bundles are cryptographically linked data structures that must maintain integrity across upgrades. This guide ensures chain-of-custody is preserved during migrations.
## Quick Reference
| Scenario | CLI Command | Risk Level | Downtime |
|----------|-------------|------------|----------|
| Schema upgrade | `stella evidence migrate` | Medium | Minutes |
| Reindex after algorithm change | `stella evidence reindex` | Low | None |
| Cross-version continuity check | `stella evidence verify-continuity` | None | None |
| Full evidence export | `stella evidence export --all` | None | None |
## Pre-Migration Checklist
### 1. Capture Current State
```bash
# Record current evidence statistics
stella evidence stats --detailed > pre-migration-stats.json
# Export Merkle roots for all tenants
stella evidence roots-export --all > pre-migration-roots.json
# Verify existing evidence integrity; abort if any check fails
if ! stella evidence verify-all --output pre-migration-verify.json; then
  echo "ABORT: Evidence integrity check failed"
  exit 1
fi
```
### 2. Create Evidence Backup
```bash
# Full evidence bundle export
BACKUP_DIR="/backup/evidence-$(date +%Y%m%d)"
stella evidence export \
  --all \
  --include-attestations \
  --include-proofs \
  --output "${BACKUP_DIR}/"
# Verify export integrity (only the bundle just written, not older backups)
stella evidence verify-bundle "${BACKUP_DIR}/"
```
### 3. Document Chain-of-Custody
```bash
# Record the current root hashes
OLD_MERKLE_ROOT=$(stella evidence roots-export --format json | jq -r '.globalRoot')
echo "Pre-migration Merkle root: ${OLD_MERKLE_ROOT}" > custody-log.txt
date >> custody-log.txt
```
## Migration Procedures
### Schema Migration (Version Upgrade)
When upgrading between versions with schema changes:
```bash
# Step 1: Assess migration impact (dry-run)
stella evidence migrate \
--from-version 1.0 \
--to-version 2.0 \
--dry-run
# Step 2: Review migration plan output
# Ensure all changes are expected
# Step 3: Execute migration
stella evidence migrate \
--from-version 1.0 \
--to-version 2.0
# Step 4: Verify migration
stella evidence verify-all
```
### Evidence Reindex (Algorithm Change)
When the hashing algorithm or Merkle tree structure changes:
```bash
# Step 1: Assess reindex impact
stella evidence reindex \
--dry-run \
--output reindex-plan.json
# Review reindex-plan.json for:
# - Total records affected
# - Estimated duration
# - New schema version
# Step 2: Execute reindex with batching
stella evidence reindex \
--batch-size 100 \
--since 2026-01-01
# Step 3: Capture new root
NEW_MERKLE_ROOT=$(stella evidence roots-export --format json | jq -r '.globalRoot')
echo "Post-migration Merkle root: ${NEW_MERKLE_ROOT}" >> custody-log.txt
date >> custody-log.txt
```
### Chain-of-Custody Verification
After any evidence migration, verify continuity:
```bash
# Verify that old proofs remain valid
stella evidence verify-continuity \
--old-root "${OLD_MERKLE_ROOT}" \
--new-root "${NEW_MERKLE_ROOT}" \
--output continuity-report.html \
--format html
# Check verification results
if grep -q "FAIL" continuity-report.html; then
  echo "ERROR: Chain-of-custody verification failed!"
  echo "Review continuity-report.html for details"
  exit 1
fi
```
## Rollback Procedures
### Immediate Rollback (Within Migration Window)
```bash
# If migration fails mid-way, rollback is automatic
# Check current migration state
stella evidence migrate --status
# Force rollback if needed
stella evidence migrate \
--rollback \
--from-version 2.0
```
### Restore from Backup
```bash
# Step 1: Stop evidence-related services
kubectl scale deployment evidence-locker --replicas=0
# Step 2: Restore the PostgreSQL evidence schema
# (pg_restore --table does not accept wildcards, so select by schema instead)
pg_restore -d stellaops \
  --schema=evidence \
  /backup/stellaops-backup.dump
# Step 3: Restore evidence files
stella evidence import /backup/evidence-*/
# Step 4: Verify restoration
stella evidence verify-all
# Step 5: Restart services
kubectl scale deployment evidence-locker --replicas=3
```
## Air-Gap Migration
For air-gapped environments without network access:
### Export Phase (Online Environment)
```bash
# Create portable evidence bundle
stella evidence export \
--all \
--portable \
--include-schemas \
--output /media/airgap-evidence.tar.gz
# Generate checksums
sha256sum /media/airgap-evidence.tar.gz > /media/checksums.txt
```
### Transfer Phase
1. Copy to removable media
2. Verify checksums at destination
3. Scan the media for malware according to site security policy
### Import Phase (Air-Gap Environment)
```bash
# Verify transfer integrity
sha256sum -c /media/checksums.txt
# Import evidence bundle
stella evidence import \
--portable \
/media/airgap-evidence.tar.gz
# Verify import
stella evidence verify-all
```
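Where `sha256sum` is not available on the air-gapped host, the same check can be done with the Python standard library. The sketch below reads a checksum file in the `sha256sum` output format (`<hex digest>  <path>`) and recomputes each digest; the function names are hypothetical, and escaped file names are not handled.

```python
import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large bundles do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return {path: True/False} for every entry in a sha256sum-style file."""
    results = {}
    with open(checksum_file, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            expected, name = line.rstrip("\n").split(None, 1)
            name = name.lstrip("*")  # "*" marks binary mode in sha256sum output
            results[name] = sha256_file(name) == expected.lower()
    return results
```

Calling `verify_checksums("/media/checksums.txt")` should return `True` for the bundle before `stella evidence import` is run; any `False` entry means the transfer must be repeated.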
## Troubleshooting
### Migration Stuck or Timeout
```bash
# Check migration status
stella evidence migrate --status
# View migration logs
stella evidence migrate --logs
# Resume from last checkpoint
stella evidence migrate --resume
```
### Root Hash Mismatch
If verification reports root hash mismatch:
1. **Do not proceed** with upgrade
2. Check for data corruption:
```bash
stella evidence integrity-check --deep
```
3. Review recent changes to evidence store
4. Contact support with integrity report
### Missing Evidence Records
```bash
# Count records by type
stella evidence stats --by-type
# Find orphaned records
stella evidence orphans --list
# Reconcile with source systems
stella evidence reconcile --source attestor
```
### Performance Issues
For large evidence stores (>1M records):
```bash
# Run reindex in parallel batches
stella evidence reindex \
--parallel 4 \
--batch-size 500 \
--since 2026-01-01
# Monitor progress
stella evidence reindex --progress
```
## Audit Trail Requirements
All evidence migrations must maintain audit trail:
| Event | Required Data | Retention |
|-------|---------------|-----------|
| Migration Start | Timestamp, version, operator | Permanent |
| Schema Change | Before/after schema versions | Permanent |
| Root Hash Change | Old root, new root, cross-reference | Permanent |
| Verification | Pass/fail, anomalies, timestamps | 7 years |
| Rollback | Reason, restored version | Permanent |
## Related Documents
- [Upgrade Runbook](upgrade-runbook.md) - Overall upgrade procedures
- [Blue-Green Deployment](blue-green-deployment.md) - Zero-downtime deployment
- [Evidence Locker Architecture](../modules/evidencelocker/architecture.md) - Technical design
- [Air-Gap Operations](airgap-operations-runbook.md) - Offline deployment guide


@@ -34,6 +34,8 @@ pkcs11-tool --version
## SoftHSM2 Setup (Development)
See [docs/operations/softhsm2-test-environment.md](operations/softhsm2-test-environment.md) for a focused test environment setup.
### Step 1: Initialize SoftHSM
```bash
@@ -197,7 +199,7 @@ stringData:
```bash
# Run HSM connectivity doctor check
stella doctor --check hsm
stella doctor --check check.crypto.hsm
# Expected output:
# [PASS] HSM Connectivity


@@ -0,0 +1,417 @@
# Key Escrow and Recovery Runbook
This runbook documents Shamir secret sharing key escrow and recovery procedures in Stella Ops.
> **Sprint:** SPRINT_20260112_018_CRYPTO_key_escrow_shamir
## Overview
Key escrow ensures critical cryptographic keys can be recovered if primary access is lost. Stella Ops uses Shamir's Secret Sharing to split keys into shares distributed among trusted custodians.
Key features:
- M-of-N threshold recovery (any M shares reconstruct the key)
- Share encryption at rest
- Custodian-based share distribution
- Integration with dual-control ceremonies
- Full audit trail
## When to Use Key Escrow
| Scenario | Escrow Required |
|----------|-----------------|
| Root signing keys | Yes |
| HSM master keys | Yes |
| Trust anchor keys | Yes |
| Service signing keys | Recommended |
| User signing keys | Optional |
| Ephemeral keys | No |
## Shamir Secret Sharing
### How It Works
Shamir's Secret Sharing splits a secret into N shares where any M shares can reconstruct the original:
```
Secret S → Split(S, M, N) → [Share₁, Share₂, ..., Shareₙ]
Any M shares → Combine → Secret S
Fewer than M shares → Cannot reconstruct
```
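
The split and combine steps can be illustrated with a small Python sketch. It assumes arithmetic over a prime field (here the Mersenne prime 2^127 - 1) and treats the secret as a single integer. This runbook does not state which field or encoding Stella Ops uses, so `split`, `combine`, and `PRIME` are illustrative names only, not the product implementation.

```python
import secrets

# Illustrative field modulus: the Mersenne prime 2**127 - 1.
# The secret must be a non-negative integer smaller than this value.
PRIME = 2**127 - 1


def split(secret: int, threshold: int, total: int) -> list[tuple[int, int]]:
    """Return `total` points on a random polynomial of degree threshold - 1."""
    if not (1 <= threshold <= total < PRIME) or not (0 <= secret < PRIME):
        raise ValueError("invalid parameters")
    # The constant term is the secret; the other coefficients are random.
    coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, total + 1):
        y = 0
        for coefficient in reversed(coefficients):  # Horner evaluation
            y = (y * x + coefficient) % PRIME
        shares.append((x, y))
    return shares


def combine(shares: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                numerator = (numerator * -xj) % PRIME
                denominator = (denominator * (xi - xj)) % PRIME
        secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret


shares = split(123456789, threshold=3, total=5)
assert combine(shares[:3]) == 123456789                          # any 3 shares suffice
assert combine([shares[0], shares[2], shares[4]]) == 123456789   # order and choice do not matter
```

With fewer than `threshold` points the interpolation still returns a number, but it is unrelated to the secret, which is why recovery must also verify the reconstructed key.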
### Configuration Parameters
| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| Threshold (M) | Minimum shares needed | 2-3 for keys |
| Total Shares (N) | Total shares created | M + 2 where possible (M + 1 minimum) |
| Share Encryption | Encrypt shares at rest | Always enabled |
### Threshold Guidelines
| Key Type | Minimum M | Recommended N | Rationale |
|----------|-----------|---------------|-----------|
| Root keys | 3 | 5 | High assurance |
| HSM keys | 2 | 4 | Availability + security |
| Service keys | 2 | 3 | Operational recovery |
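
The trade-off behind these rows can be made explicit: N - M shares can be lost before recovery becomes impossible, and M custodians must cooperate to reconstruct the key. The helper below is a hypothetical sketch for reasoning about a chosen (M, N) pair. It is not a Stella Ops API.

```python
def escrow_tolerances(threshold: int, total: int) -> dict[str, int]:
    """Summarise what a given M-of-N choice tolerates."""
    if not 1 <= threshold <= total:
        raise ValueError("threshold must satisfy 1 <= M <= N")
    return {
        "shares_that_can_be_lost": total - threshold,
        "custodians_needed_to_recover": threshold,
    }


# (M, N) pairs taken from the guideline table above.
GUIDELINES = {"root": (3, 5), "hsm": (2, 4), "service": (2, 3)}

for key_type, (m, n) in GUIDELINES.items():
    print(key_type, escrow_tolerances(m, n))
```

Under these guidelines, root and HSM keys tolerate two lost shares, while service keys tolerate only one.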
## Escrowing a Key
### Via CLI
```bash
stella escrow create \
--key-id root-signing-key-2026 \
--threshold 3 \
--shares 5 \
--custodians custodian-1,custodian-2,custodian-3,custodian-4,custodian-5 \
--expires-in 365d \
--reason "Annual key escrow for root signing key"
```
### Via API
```bash
curl -X POST https://signer.example.com/api/v1/escrow \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"keyId": "root-signing-key-2026",
"threshold": 3,
"totalShares": 5,
"custodianIds": [
"custodian-1", "custodian-2", "custodian-3",
"custodian-4", "custodian-5"
],
"expirationDays": 365,
"reason": "Annual key escrow for root signing key"
}'
```
### Escrow Response
```json
{
"escrowId": "esc-abc123",
"keyId": "root-signing-key-2026",
"threshold": 3,
"totalShares": 5,
"status": "Active",
"createdAt": "2026-01-16T10:00:00Z",
"expiresAt": "2027-01-16T10:00:00Z",
"shares": [
{ "shareId": "shr-001", "custodianId": "custodian-1", "distributed": true },
{ "shareId": "shr-002", "custodianId": "custodian-2", "distributed": true },
{ "shareId": "shr-003", "custodianId": "custodian-3", "distributed": true },
{ "shareId": "shr-004", "custodianId": "custodian-4", "distributed": true },
{ "shareId": "shr-005", "custodianId": "custodian-5", "distributed": true }
]
}
```
## Share Distribution
### Distribution Methods
| Method | Security | Use Case |
|--------|----------|----------|
| Direct API delivery | High | Automated systems |
| Encrypted email | Medium | Remote custodians |
| In-person ceremony | Highest | Root keys |
| Hardware token | Highest | HSM keys |
### Custodian Requirements
Each custodian must:
1. Have verified identity in Authority
2. Complete escrow custodian training
3. Have secure share storage capability
4. Be located apart from the other custodians where possible (geographic distribution is recommended)
### Verifying Share Distribution
```bash
stella escrow status --escrow-id esc-abc123
# Output:
# Escrow: esc-abc123
# Key: root-signing-key-2026
# Status: Active
# Threshold: 3 of 5
# Shares:
# [1] custodian-1: Distributed ✓
# [2] custodian-2: Distributed ✓
# [3] custodian-3: Distributed ✓
# [4] custodian-4: Distributed ✓
# [5] custodian-5: Distributed ✓
```
## Key Recovery
### Prerequisites
Recovery requires:
1. Valid recovery request (incident, key loss, rotation)
2. Dual-control ceremony approval (if configured)
3. Minimum M custodians available with shares
4. Secure recovery environment
### Recovery Workflow
```
1. Initiate recovery request
2. (If required) Dual-control ceremony approval
3. Collect shares from M custodians
4. Verify share checksums
5. Reconstruct key
6. Verify reconstructed key
7. Log recovery event
```
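
Steps 3 to 5 of this workflow amount to a gate: a share counts toward the threshold only if its checksum verifies and its index has not been seen before. The sketch below shows that gating logic in Python. It assumes the checksum is a SHA-256 digest of the submitted share bytes, written with a `sha256:` prefix as in the API examples. The `RecoverySession` class is hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class RecoverySession:
    threshold: int
    accepted: dict[int, bytes] = field(default_factory=dict)  # share index -> share bytes

    def submit(self, index: int, share: bytes, checksum: str) -> bool:
        """Accept a share only if its checksum matches and its index is new."""
        expected = "sha256:" + hashlib.sha256(share).hexdigest()
        if checksum != expected:
            raise ValueError(f"checksum mismatch for share {index}")
        if index in self.accepted:
            return False  # duplicate submissions do not count toward the threshold
        self.accepted[index] = share
        return True

    @property
    def ready(self) -> bool:
        """True once at least `threshold` distinct shares have been accepted."""
        return len(self.accepted) >= self.threshold
```

Rejecting duplicates matters because one custodian submitting the same share twice must not satisfy a threshold intended to require several custodians.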
### Via CLI
```bash
# Step 1: Initiate recovery
stella escrow recover init \
--escrow-id esc-abc123 \
--reason "HSM failure - emergency key recovery" \
--ceremony-required
# Step 2: Collect shares (each custodian runs this command with their own share)
stella escrow recover submit-share \
--recovery-id rec-xyz789 \
--share-file /secure/my-share.enc \
--passphrase-file /secure/passphrase
# Step 3: Execute recovery (after threshold reached)
stella escrow recover execute \
--recovery-id rec-xyz789 \
--output-key-file /secure/recovered-key.pem
```
### Via API
```bash
# Initiate recovery
curl -X POST https://signer.example.com/api/v1/escrow/esc-abc123/recover \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "HSM failure - emergency key recovery",
"requireCeremony": true
}'
# Submit share
curl -X POST https://signer.example.com/api/v1/recovery/rec-xyz789/shares \
-H "Authorization: Bearer $CUSTODIAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"shareId": "shr-001",
"encryptedShare": "base64-encoded-share",
"checksum": "sha256:abc123..."
}'
# Execute recovery (after threshold)
curl -X POST https://signer.example.com/api/v1/recovery/rec-xyz789/execute \
-H "Authorization: Bearer $TOKEN"
```
### Recovery Response
```json
{
"recoveryId": "rec-xyz789",
"status": "Completed",
"keyId": "root-signing-key-2026",
"sharesCollected": 3,
"threshold": 3,
"completedAt": "2026-01-16T15:30:00Z",
"keyFingerprint": "SHA256:xyz789...",
"verified": true
}
```
## Share Management
### Custodian Share Storage
Custodians should store shares using one of the following options:
| Storage | Security Level | Notes |
|---------|----------------|-------|
| HSM | Highest | Preferred for root keys |
| Hardware token | High | YubiKey, smart card |
| Encrypted file | Medium | AES-256-GCM minimum |
| Password manager | Medium | Enterprise vault only |
### Share Format
```json
{
"shareId": "shr-001",
"escrowId": "esc-abc123",
"index": 1,
"threshold": 3,
"totalShares": 5,
"encryptedData": "base64-encoded-aes-256-gcm-ciphertext",
"checksum": "sha256:abc123...",
"createdAt": "2026-01-16T10:00:00Z",
"expiresAt": "2027-01-16T10:00:00Z"
}
```
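
A custodian-side sanity check of a share record might look like the following Python sketch. It relies only on the fields shown above and assumes that `index` is 1-based and that timestamps are ISO 8601 in UTC. The function name and the problem messages are illustrative.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting the trailing 'Z' used in the examples."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def share_record_problems(record: dict, now: datetime) -> list[str]:
    """Return human-readable problems found in a share record (empty if none)."""
    problems = []
    if not 1 <= record["threshold"] <= record["totalShares"]:
        problems.append("threshold must be between 1 and totalShares")
    if not 1 <= record["index"] <= record["totalShares"]:
        problems.append("index outside 1..totalShares")
    if not record["checksum"].startswith("sha256:"):
        problems.append("unexpected checksum algorithm")
    if parse_timestamp(record["expiresAt"]) <= now:
        problems.append("share has expired")
    return problems


SAMPLE = {
    "shareId": "shr-001",
    "escrowId": "esc-abc123",
    "index": 1,
    "threshold": 3,
    "totalShares": 5,
    "checksum": "sha256:abc123...",
    "createdAt": "2026-01-16T10:00:00Z",
    "expiresAt": "2027-01-16T10:00:00Z",
}

print(share_record_problems(SAMPLE, datetime(2026, 6, 1, tzinfo=timezone.utc)))
```

The check is structural only. It does not decrypt `encryptedData` or recompute the checksum.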
### Share Rotation
Re-escrow keys periodically:
```bash
stella escrow re-escrow \
--escrow-id esc-abc123 \
--new-custodians custodian-1,custodian-2,custodian-6,custodian-7,custodian-8 \
--reason "Annual share rotation"
```
This creates new shares and revokes old ones.
## Audit Trail
### Audit Events
| Event | Description |
|-------|-------------|
| `escrow.created` | Key escrowed |
| `escrow.share.distributed` | Share sent to custodian |
| `escrow.share.accessed` | Custodian accessed share |
| `recovery.initiated` | Recovery started |
| `recovery.share.submitted` | Share submitted for recovery |
| `recovery.completed` | Key reconstructed |
| `recovery.failed` | Recovery failed |
| `escrow.revoked` | Escrow revoked |
### Query Audit Logs
```bash
stella audit query \
--event-type "escrow.*,recovery.*" \
--escrow-id esc-abc123 \
--since 30d
```
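
The `--event-type` value above reads as a comma-separated list of wildcard patterns. How the CLI actually interprets it is not documented here, so the Python sketch below only illustrates which events from the table such patterns would select if they follow shell-style glob semantics.

```python
from fnmatch import fnmatchcase

# Event names copied from the audit events table.
AUDIT_EVENTS = [
    "escrow.created",
    "escrow.share.distributed",
    "escrow.share.accessed",
    "recovery.initiated",
    "recovery.share.submitted",
    "recovery.completed",
    "recovery.failed",
    "escrow.revoked",
]


def select_events(filter_expression: str, events: list[str]) -> list[str]:
    """Return the events matching any comma-separated glob pattern."""
    patterns = [p.strip() for p in filter_expression.split(",") if p.strip()]
    return [e for e in events if any(fnmatchcase(e, p) for p in patterns)]


print(select_events("recovery.*", AUDIT_EVENTS))
```

Under that assumption, `escrow.*,recovery.*` selects every event in the table, and `recovery.*` selects only the four recovery events.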
## Configuration
### Escrow Settings
```yaml
# escrow-config.yaml
escrow:
  enabled: true
  defaultThreshold: 2
  minimumThreshold: 2
  maximumShares: 10
  shareEncryption:
    algorithm: AES-256-GCM
    keyDerivation: HKDF-SHA256
  requireDualControlForRecovery: true
  maxRecoveryAttempts: 3
  recoveryTimeoutHours: 24
```
### Custodian Configuration
```yaml
# custodians.yaml
custodians:
  - id: custodian-1
    name: "Security Lead"
    email: security-lead@company.com
    publicKey: "-----BEGIN PUBLIC KEY-----..."
    location: "US-East"
  - id: custodian-2
    name: "Key Officer A"
    email: key-officer-a@company.com
    publicKey: "-----BEGIN PUBLIC KEY-----..."
    location: "EU-West"
```
## Security Considerations
### Share Security
- Never transmit shares in plaintext
- Encrypt shares with custodian's public key
- Verify checksums before and after storage
- Use secure channels for distribution
### Recovery Security
- Require dual-control ceremonies for critical keys
- Limit recovery time window
- Verify recovered key fingerprint
- Audit all recovery attempts
### Custodian Security
- Verify custodian identity before share access
- Geographic distribution reduces collusion risk
- Rotate custodians periodically
- Train custodians on secure handling
## Troubleshooting
### Common Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Share checksum mismatch | Corrupted share | Request re-distribution |
| Cannot decrypt share | Wrong passphrase | Verify passphrase |
| Recovery timeout | Shares not collected in time | Restart recovery |
| Key verification failed | Wrong shares combined | Verify share indices |
### Verification Failures
```bash
# Verify share integrity
stella escrow verify-share --share-file share.enc
# Test reconstruction with a subset of shares
stella escrow test-recovery \
--escrow-id esc-abc123 \
--share-files share1.enc,share2.enc,share3.enc
```
## Emergency Procedures
### Lost Share
If a custodian loses their share:
1. Verify at least M shares remain accessible
2. Re-escrow with new share set
3. Revoke the old escrow
4. Document incident
### Compromised Custodian
If a custodian is compromised:
1. Do NOT use their share for any recovery
2. Re-escrow immediately with new custodians
3. Revoke old escrow
4. Consider key rotation if M or more shares may have been exposed
### Multiple Lost Shares
If fewer than M shares are available:
1. Key cannot be recovered via escrow
2. Use backup key if available
3. Generate new key and re-establish trust
4. Document as key loss incident
## Related Documentation
- [Dual-Control Ceremony Runbook](./dual-control-ceremony-runbook.md)
- [Key Rotation Runbook](./key-rotation-runbook.md)
- [HSM Setup Runbook](./hsm-setup-runbook.md)
- [Cryptography Architecture](../modules/cryptography/architecture.md)


@@ -0,0 +1,362 @@
# Rekor Checkpoint Sync Configuration and Operations
This guide covers the configuration and operational procedures for the Rekor periodic checkpoint synchronization service.
## Overview
The Rekor sync service maintains a local mirror of Rekor transparency log checkpoints and tiles. This enables:
- **Offline verification**: Verify attestations without network access to Sigstore
- **Air-gapped operation**: Run in environments without internet connectivity
- **Performance**: Reduce latency by using local checkpoint data
- **Auditability**: Maintain local evidence of log state at verification time
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                   RekorSyncBackgroundService                    │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
│  │  Checkpoint  │     │  Signature   │     │     Tile     │     │
│  │   Fetcher    │────▶│   Verifier   │────▶│    Syncer    │     │
│  └──────────────┘     └──────────────┘     └──────────────┘     │
└─────────────────────────────────────────────────────────────────┘
          │                    │                    │
          ▼                    ▼                    ▼
   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
   │  HTTP Tile   │     │  Checkpoint  │     │     Tile     │
   │   Client     │     │    Store     │     │    Cache     │
   └──────────────┘     │ (PostgreSQL) │     │(File System) │
          │             └──────────────┘     └──────────────┘
   ┌──────────────┐
   │    Rekor     │
   │   Server     │
   └──────────────┘
```
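
The fetcher and verifier operate on checkpoint text. The sketch below parses one in Python, assuming the signed-note layout that Rekor publishes: an origin line, a decimal tree size, a base64 root hash, a blank line, and then signature lines that begin with an em dash. It does not verify the signature, and `parse_checkpoint` is an illustrative name rather than a Stella Ops function.

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    origin: str
    tree_size: int
    root_hash: bytes
    signatures: list[tuple[str, bytes]]  # (signer name, raw signature bytes)


def parse_checkpoint(text: str) -> Checkpoint:
    """Split a checkpoint into its body fields and signature lines."""
    body, separator, signature_block = text.partition("\n\n")
    if not separator:
        raise ValueError("checkpoint has no signature block")
    lines = body.split("\n")
    if len(lines) < 3:
        raise ValueError("checkpoint body needs origin, tree size and root hash")
    signatures = []
    for line in signature_block.splitlines():
        # A signature line is: em dash, space, signer name, space, base64 signature.
        if line.startswith("\u2014 "):
            name, _, encoded = line[2:].rpartition(" ")
            signatures.append((name, base64.b64decode(encoded)))
    return Checkpoint(lines[0], int(lines[1]), base64.b64decode(lines[2]), signatures)
```

A verifier would then check each signature over the body text against the key configured for the matching backend before the checkpoint is stored.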
## Configuration
### Basic Configuration
```yaml
attestor:
  rekorSync:
    # Enable or disable sync service
    enabled: true
    # How often to fetch new checkpoints
    syncInterval: 5m
    # Delay before first sync after startup
    initialDelay: 30s
    # Enable tile synchronization for full offline support
    enableTileSync: true
    # Maximum tiles to fetch per sync cycle
    maxTilesPerSync: 100
    # Backend configurations
    backends:
      - id: sigstore-prod
        origin: rekor.sigstore.dev
        baseUrl: https://rekor.sigstore.dev
        publicKeyPath: /etc/stella/keys/rekor-sigstore-prod.pub
      - id: sigstore-staging
        origin: rekor.sigstage.dev
        baseUrl: https://rekor.sigstage.dev
        publicKeyPath: /etc/stella/keys/rekor-sigstore-staging.pub
```
### Checkpoint Store Configuration (PostgreSQL)
```yaml
attestor:
  checkpointStore:
    connectionString: "Host=localhost;Database=stella;Username=stella;Password=secret"
    schema: attestor
    autoInitializeSchema: true
```
### Tile Cache Configuration (File System)
```yaml
attestor:
  tileCache:
    # Base directory for tile storage
    basePath: /var/lib/stella/attestor/tiles
    # Maximum cache size (0 = unlimited)
    maxCacheSizeBytes: 10737418240  # 10 GiB
    # Auto-prune tiles older than this
    autoPruneAfter: 720h  # 30 days
```
## Operational Procedures
### Initial Setup
1. **Initialize the checkpoint store schema**:
```bash
stella attestor checkpoint-store init --connection "Host=localhost;..."
```
2. **Configure backend(s)**:
```bash
stella attestor backend add sigstore-prod \
--origin rekor.sigstore.dev \
--url https://rekor.sigstore.dev \
--public-key /path/to/rekor.pub
```
3. **Perform initial sync**:
```bash
stella attestor sync --backend sigstore-prod --full
```
### Manual Sync Operations
**Force immediate sync**:
```bash
stella attestor sync --backend sigstore-prod
```
**Sync all backends**:
```bash
stella attestor sync --all
```
**Full tile sync** (for offline kit preparation):
```bash
stella attestor sync --backend sigstore-prod --full-tiles
```
### Monitoring
**Check sync status**:
```bash
stella attestor sync-status
```
Output:
```
Backend           Origin              Tree Size    Last Sync            Age
sigstore-prod     rekor.sigstore.dev  45,678,901   2026-01-15 12:34:56  2m 15s
sigstore-staging  rekor.sigstage.dev  1,234,567    2026-01-15 12:30:00  6m 30s
```
**Check checkpoint history**:
```bash
stella attestor checkpoints list --backend sigstore-prod --last 10
```
**Check tile cache status**:
```bash
stella attestor tiles stats --backend sigstore-prod
```
Output:
```
Origin: rekor.sigstore.dev
Total Tiles: 45,678
Cache Size: 1.4 GB
Coverage: 100% (tree size 45,678,901)
Oldest Tile: 2026-01-01 00:00:00
Newest Tile: 2026-01-15 12:34:56
```
### Metrics
The sync service exposes the following Prometheus metrics:
```
# Counter: checkpoints fetched from remote
attestor_rekor_sync_checkpoints_fetched_total{backend="sigstore-prod"} 1234
# Counter: checkpoints stored locally
attestor_rekor_sync_checkpoints_stored_total{backend="sigstore-prod"} 1234
# Counter: tiles fetched from remote
attestor_rekor_sync_tiles_fetched_total{backend="sigstore-prod"} 56789
# Counter: tiles cached locally
attestor_rekor_sync_tiles_cached_total{backend="sigstore-prod"} 56789
# Histogram: checkpoint age at sync time (seconds)
attestor_rekor_sync_checkpoint_age_seconds{backend="sigstore-prod"}
# Gauge: total tiles cached
attestor_rekor_sync_tiles_cached{backend="sigstore-prod"} 45678
# Gauge: time since last successful sync (seconds)
attestor_rekor_sync_last_success_seconds{backend="sigstore-prod"} 135
# Counter: sync errors
attestor_rekor_sync_errors_total{backend="sigstore-prod",error_type="network"} 5
```
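
For ad hoc checks without a Prometheus server, the scraped text can be inspected directly. The Python sketch below is a deliberately simplified parser (no escaped quotes, no timestamps) that reports backends whose `attestor_rekor_sync_last_success_seconds` value exceeds a limit. The helper names are illustrative.

```python
import re

# Metric name, optional {labels}, whitespace, value.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(exposition: str) -> list[tuple[str, dict, float]]:
    """Parse 'name{labels} value' lines, skipping comments and malformed lines."""
    samples = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if not match:
            continue
        try:
            value = float(match.group(3))
        except ValueError:
            continue
        labels = dict(LABEL.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, value))
    return samples


def stale_backends(exposition: str, max_age_seconds: float = 900.0) -> list[str]:
    """Return backends whose last successful sync is older than the limit."""
    return [
        labels.get("backend", "")
        for name, labels, value in parse_samples(exposition)
        if name == "attestor_rekor_sync_last_success_seconds" and value > max_age_seconds
    ]
```

The default of 900 seconds mirrors the staleness threshold used in the alerting recommendations.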
### Alerting Recommendations
```yaml
groups:
  - name: attestor-rekor-sync
    rules:
      - alert: RekorSyncStale
        expr: attestor_rekor_sync_last_success_seconds > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Rekor sync is stale
          description: "No successful sync in {{ $value }}s for {{ $labels.backend }}"
      - alert: RekorSyncFailing
        expr: rate(attestor_rekor_sync_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Rekor sync experiencing errors
          description: "Sync errors detected for {{ $labels.backend }}"
```
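
It helps to translate these thresholds into sync cycles before adopting them. The Python sketch below works through the arithmetic using the 5-minute `syncInterval` from the basic configuration. The assumption that a failed cycle records exactly one error is hypothetical, since the source does not say how errors are counted.

```python
def missed_cycles_before_stale(stale_after_s: int, sync_interval_s: int) -> int:
    """How many consecutive sync cycles must be missed before the gauge crosses the threshold."""
    return stale_after_s // sync_interval_s


def errors_needed_per_window(rate_threshold: float, window_s: int) -> int:
    """Errors required inside the rate() window to reach a per-second threshold."""
    return round(rate_threshold * window_s)


SYNC_INTERVAL_S = 5 * 60   # syncInterval: 5m
WINDOW_S = 5 * 60          # rate(...[5m])

print(missed_cycles_before_stale(900, SYNC_INTERVAL_S))   # 3
print(errors_needed_per_window(0.1, WINDOW_S))            # 30

# Hypothetical worst case: every cycle fails and records exactly one error.
worst_case_rate = 1 / SYNC_INTERVAL_S
print(worst_case_rate > 0.1)                              # False
```

If errors really are counted once per failed cycle, the `RekorSyncFailing` expression would not fire at a 5-minute interval, and a lower threshold may be closer to the intent. Confirm how the service increments `attestor_rekor_sync_errors_total` before changing the rule.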
### Maintenance Tasks
**Prune old checkpoints**:
```bash
# Keep only last 30 days of checkpoints
stella attestor checkpoints prune --older-than 720h --keep-latest
```
**Prune old tiles**:
```bash
# Remove tiles for entries no longer needed
stella attestor tiles prune --older-than 720h
```
**Verify checkpoint store integrity**:
```bash
stella attestor checkpoints verify --backend sigstore-prod
```
**Export checkpoints for air-gap**:
```bash
stella attestor export \
--backend sigstore-prod \
--output /mnt/airgap/attestor-bundle.tar.gz \
--include-tiles
```
## Troubleshooting
### Sync Not Running
1. Check service logs:
```bash
journalctl -u stella-attestor -f
```
2. Verify configuration:
```bash
stella attestor config validate
```
3. Check database connectivity:
```bash
stella attestor checkpoint-store test
```
### Signature Verification Failing
1. Verify public key is correct:
```bash
stella attestor backend verify-key sigstore-prod
```
2. Check for key rotation:
- Monitor Sigstore announcements
- Update public key if rotated
3. Compare with direct fetch:
```bash
curl -s https://rekor.sigstore.dev/api/v1/log | jq
```
### Tile Cache Issues
1. Check disk space:
```bash
df -h /var/lib/stella/attestor/tiles
```
2. Verify permissions:
```bash
ls -la /var/lib/stella/attestor/tiles
```
3. Clear and resync:
```bash
stella attestor tiles clear --backend sigstore-prod
stella attestor sync --backend sigstore-prod --full-tiles
```
### Database Issues
1. Check PostgreSQL connectivity:
```bash
psql -h localhost -U stella -d stella -c "SELECT 1"
```
2. Verify schema exists:
```sql
SELECT * FROM attestor.rekor_checkpoints LIMIT 1;
```
3. Reinitialize schema if needed:
```bash
stella attestor checkpoint-store init --force
```
## Air-Gap Operations
### Preparing an Offline Bundle
1. Sync to latest checkpoint:
```bash
stella attestor sync --backend sigstore-prod --full-tiles
```
2. Export bundle:
```bash
stella attestor export \
--backend sigstore-prod \
--output offline-attestor-bundle.tar.gz \
--include-tiles \
--checkpoints-only-verified
```
3. Transfer bundle to air-gapped environment
### Importing in Air-Gapped Environment
1. Import the bundle:
```bash
stella attestor import offline-attestor-bundle.tar.gz
```
2. Verify import:
```bash
stella attestor sync-status
```
3. Checkpoints and tiles are now available for offline verification
## See Also
- [Rekor Verification Design](../modules/attestor/rekor-verification-design.md)
- [Checkpoint Divergence Detection](./checkpoint-divergence-runbook.md)
- [Offline Kit Preparation](./offline-kit-guide.md)
- [Sigstore Rekor Documentation](https://docs.sigstore.dev/rekor/overview/)


@@ -0,0 +1,70 @@
# SoftHSM2 Test Environment Setup
This guide describes how to configure SoftHSM2 for PKCS#11 integration tests and local validation.
## Install SoftHSM2
```bash
# Ubuntu/Debian
sudo apt-get install softhsm2 opensc
# Verify installation
softhsm2-util --version
pkcs11-tool --version
```
## Initialize Token
```bash
# Create token directory
mkdir -p /var/lib/softhsm/tokens
chmod 700 /var/lib/softhsm/tokens
# Initialize token
softhsm2-util --init-token \
--slot 0 \
--label "StellaOps-Dev" \
--so-pin 12345678 \
--pin 87654321
# Verify token (SoftHSM2 reassigns an initialized token to a new slot number; note the slot shown here)
softhsm2-util --show-slots
```
## Create a Test Key
```bash
# Generate RSA keypair
pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
--login --pin 87654321 \
--keypairgen \
--key-type rsa:2048 \
--id 01 \
--label "stellaops-hsm-test"
# List objects
pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
--login --pin 87654321 \
--list-objects
```
## Environment Variables for Tests
```bash
export STELLAOPS_SOFTHSM_LIB="/usr/lib/softhsm/libsofthsm2.so"
export STELLAOPS_SOFTHSM_SLOT="0"
export STELLAOPS_SOFTHSM_PIN="87654321"
export STELLAOPS_SOFTHSM_KEY_ID="stellaops-hsm-test"
export STELLAOPS_SOFTHSM_MECHANISM="RsaSha256"
```
## Run Integration Tests
```bash
dotnet test src/Cryptography/__Tests/StellaOps.Cryptography.Tests/StellaOps.Cryptography.Tests.csproj \
--filter FullyQualifiedName~Pkcs11HsmClientIntegrationTests
```
## Notes
- The integration tests skip automatically if SoftHSM2 variables are not configured.
- Use a dedicated test token; never reuse production tokens.
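
Before running the suite, it can be useful to confirm that the environment is complete, since a silent skip is easy to mistake for a pass. The Python sketch below checks the variables listed above. Treating all five as required is an assumption, because this guide does not say which ones the tests inspect.

```python
import os

# Variable names taken from the "Environment Variables for Tests" section.
REQUIRED_VARIABLES = (
    "STELLAOPS_SOFTHSM_LIB",
    "STELLAOPS_SOFTHSM_SLOT",
    "STELLAOPS_SOFTHSM_PIN",
    "STELLAOPS_SOFTHSM_KEY_ID",
    "STELLAOPS_SOFTHSM_MECHANISM",
)


def missing_softhsm_settings(environ=os.environ) -> list[str]:
    """Return unset variables, plus a note if the PKCS#11 library file is absent."""
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    library = environ.get("STELLAOPS_SOFTHSM_LIB")
    if library and not os.path.isfile(library):
        missing.append(f"STELLAOPS_SOFTHSM_LIB (file not found: {library})")
    return missing


if __name__ == "__main__":
    problems = missing_softhsm_settings()
    print("SoftHSM2 test environment looks complete" if not problems else problems)
```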


@@ -628,9 +628,150 @@ To allow approved exceptions to cover specific unknown reason codes, set excepti
- [Triage Technical Reference](../product/advisories/14-Dec-2025%20-%20Triage%20and%20Unknowns%20Technical%20Reference.md)
- [Score Proofs Runbook](./score-proofs-runbook.md)
- [Policy Engine](../modules/policy/architecture.md)
- [Determinization API](../modules/policy/determinization-api.md)
- [VEX Consensus Guide](../VEX_CONSENSUS_GUIDE.md)
---
**Last Updated**: 2025-12-22
**Version**: 1.0.0
**Sprint**: 3500.0004.0004
## 8. Grey Queue Operations
> **Sprint**: SPRINT_20260112_010_CLI_unknowns_grey_queue_cli
The Grey Queue handles observations with uncertain status requiring operator attention or additional evidence. These are distinct from standard HOT/WARM/COLD band unknowns.
### 8.1 Grey Queue Overview
Grey Queue items have:
- **Observation state**: `PendingDeterminization`, `Disputed`, or `GuardedPass`
- **Reanalysis fingerprint**: Deterministic ID for reproducible replays
- **Triggers**: Events that caused reanalysis
- **Conflicts**: Detected evidence disagreements
- **Next actions**: Suggested resolution paths
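
A deterministic fingerprint generally means hashing a canonical serialization of the inputs, so that the same inputs always yield the same ID regardless of field order. The Python sketch below illustrates the idea with canonical JSON and SHA-256. The input field names are hypothetical, and the actual fingerprint inputs and encoding used by Stella Ops are not described in this runbook.

```python
import hashlib
import json


def reanalysis_fingerprint(inputs: dict) -> str:
    """Hash a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical inputs; the same content in a different key order.
first = {"policyConfigHash": "sha256:def456", "triggers": ["epss.updated@1", "vex.updated@1"]}
second = {"triggers": ["epss.updated@1", "vex.updated@1"], "policyConfigHash": "sha256:def456"}

assert reanalysis_fingerprint(first) == reanalysis_fingerprint(second)
```

List order still affects the result, so inputs such as triggers would need a defined ordering before hashing.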
### 8.2 List Grey Queue Items
```bash
# List all grey queue items
stella unknowns list --state grey
# List by observation state
stella unknowns list --observation-state pending-determinization
stella unknowns list --observation-state disputed
stella unknowns list --observation-state guarded-pass
# List with fingerprint details
stella unknowns list --state grey --show-fingerprint
# List with conflict summary
stella unknowns list --state grey --show-conflicts
```
### 8.3 View Grey Queue Details
```bash
# Show grey queue item with full details
stella unknowns show unk-12345678-... --grey
# Output:
# ID: unk-12345678-...
# Observation State: Disputed
#
# Reanalysis Fingerprint:
# ID: sha256:abc123...
# Computed At: 2026-01-15T10:00:00Z
# Policy Config Hash: sha256:def456...
#
# Triggers (2):
# - epss.updated@1 (2026-01-15T09:55:00Z) delta=0.15
# - vex.updated@1 (2026-01-15T09:50:00Z)
#
# Conflicts (1):
# - VexStatusConflict: vendor-a reports 'not_affected', vendor-b reports 'affected'
# Severity: high
# Adjudication: manual_review
#
# Next Actions:
# - trust_resolution: Resolve issuer trust conflict
# - manual_review: Escalate to security team
# Show fingerprint only
stella unknowns fingerprint unk-12345678-...
# Show triggers only
stella unknowns triggers unk-12345678-...
```
### 8.4 Grey Queue Triage Actions
```bash
# Resolve a grey queue item (operator determination)
stella unknowns resolve unk-12345678-... \
--status not_affected \
--justification "Verified vendor VEX is authoritative" \
--evidence-ref "vex-observation-id-123"
# Escalate for manual review
stella unknowns escalate unk-12345678-... \
--priority P1 \
--reason "Conflicting VEX requires security team decision"
# Defer pending additional evidence
stella unknowns defer unk-12345678-... \
--await vex \
--reason "Waiting for upstream vendor VEX statement"
```
### 8.5 Grey Queue Conflict Resolution
```bash
# List items with conflicts
stella unknowns list --has-conflicts
# Filter by conflict type
stella unknowns list --conflict-type vex-status-conflict
stella unknowns list --conflict-type vex-reachability-contradiction
stella unknowns list --conflict-type trust-tie
# Resolve a conflict manually
stella unknowns resolve-conflict unk-12345678-... \
--winner vendor-a \
--reason "vendor-a is the upstream maintainer"
```
### 8.6 Grey Queue Summary
```bash
# Get grey queue summary
stella unknowns summary --grey
# Output:
# Grey Queue: 23 items
#
# By State:
# PendingDeterminization: 15 (65%)
# Disputed: 5 (22%)
# GuardedPass: 3 (13%)
#
# Conflicts: 8 items have conflicts
# Avg. Triggers: 2.3 per item
# Oldest: 7 days
```
### 8.7 Grey Queue Export
```bash
# Export grey queue for analysis
stella unknowns export --state grey --format json --output grey-queue.json
# Export with full fingerprints and triggers
stella unknowns export --state grey --verbose --output grey-full.json
# Export conflicts only
stella unknowns export --has-conflicts --format csv --output conflicts.csv
```
---
**Last Updated**: 2026-01-16
**Version**: 1.1.0
**Sprint**: SPRINT_20260112_010_CLI_unknowns_grey_queue_cli