Checkpoint Divergence Detection and Incident Response

This runbook covers the detection of Rekor checkpoint divergence, anomaly types, alert handling, and incident response procedures.

Overview

Checkpoint divergence detection monitors the integrity of Rekor transparency logs by:

  • Comparing root hashes at the same tree size
  • Verifying tree size monotonicity (the tree size must never decrease)
  • Cross-checking primary logs against mirrors
  • Detecting stale or unresponsive logs

Divergence can indicate:

  • Split-view attacks (malicious log server showing different trees to different clients)
  • Rollback attacks (hiding recent log entries)
  • Log compromise or key theft
  • Network partitions or operational issues

Detection Rules

| Check | Condition | Severity | Recommended Action |
|---|---|---|---|
| Root hash mismatch | Same tree_size, different root_hash | CRITICAL | Quarantine + immediate investigation |
| Tree size rollback | new_tree_size < stored_tree_size | CRITICAL | Reject checkpoint + alert |
| Cross-log divergence | Primary root ≠ mirror root at same size | WARNING | Alert + investigate |
| Stale checkpoint | Checkpoint age > threshold | WARNING | Alert + monitor |
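
The single-log rules in the table can be sketched as a small classifier. This is a minimal illustration, not the detector's actual implementation: the `Checkpoint` type, the `classify` function, the anomaly names other than `RootHashMismatch`, and the one-hour default are assumptions made for the example. The cross-log rule compares two origins and is left out here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint record; field names are illustrative."""
    origin: str
    tree_size: int
    root_hash: str
    age_seconds: float = 0.0  # time elapsed since the checkpoint was fetched


def classify(stored: Checkpoint, new: Checkpoint,
             stale_after_seconds: float = 3600.0) -> Optional[Tuple[str, str]]:
    """Return (anomaly, severity) for the first rule that fires, else None."""
    # Tree size rollback: the log claims to be smaller than what we stored.
    if new.tree_size < stored.tree_size:
        return ("TreeSizeRollback", "CRITICAL")
    # Root hash mismatch: same size, but a different Merkle root.
    if new.tree_size == stored.tree_size and new.root_hash != stored.root_hash:
        return ("RootHashMismatch", "CRITICAL")
    # Stale checkpoint: nothing wrong with the content, but it is too old.
    if new.age_seconds > stale_after_seconds:
        return ("StaleCheckpoint", "WARNING")
    return None
```

The critical rules are checked first so that a stale but also inconsistent checkpoint is reported at the higher severity.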

Alert Payloads

Root Hash Mismatch Alert

{
  "eventType": "rekor.checkpoint.divergence",
  "severity": "critical",
  "origin": "rekor.sigstore.dev",
  "treeSize": 12345678,
  "expectedRootHash": "sha256:abc123...",
  "actualRootHash": "sha256:def456...",
  "detectedAt": "2026-01-15T12:34:56Z",
  "backend": "sigstore-prod",
  "description": "Checkpoint root hash mismatch detected. Possible split-view attack.",
  "recommendedAction": "Quarantine"
}

Rollback Attempt Alert

{
  "eventType": "rekor.checkpoint.rollback",
  "severity": "critical",
  "origin": "rekor.sigstore.dev",
  "previousTreeSize": 12345678,
  "attemptedTreeSize": 12345600,
  "detectedAt": "2026-01-15T12:34:56Z",
  "description": "Tree size regression detected. Possible rollback attack."
}

Cross-Log Divergence Alert

{
  "eventType": "rekor.checkpoint.cross_log_divergence",
  "severity": "warning",
  "primaryOrigin": "rekor.sigstore.dev",
  "mirrorOrigin": "rekor.mirror.example.com",
  "treeSize": 12345678,
  "primaryRootHash": "sha256:abc123...",
  "mirrorRootHash": "sha256:def456...",
  "description": "Cross-log divergence detected between primary and mirror."
}
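
As an illustration of alert handling, the sketch below maps the `eventType` of an incoming payload to the matching procedure in this runbook. The `route_alert` function and the `RUNBOOK_SECTIONS` table are hypothetical helpers, not part of the Notify service or the `stella` CLI.

```python
import json

# eventType values are taken from the payloads above; the mapping is illustrative.
RUNBOOK_SECTIONS = {
    "rekor.checkpoint.divergence": "Level 1: Root Hash Mismatch",
    "rekor.checkpoint.rollback": "Level 2: Tree Size Rollback",
    "rekor.checkpoint.cross_log_divergence": "Level 3: Cross-Log Divergence",
}


def route_alert(raw: str) -> str:
    """Return a one-line instruction for the on-call responder."""
    alert = json.loads(raw)
    event_type = alert.get("eventType")
    section = RUNBOOK_SECTIONS.get(event_type)
    if section is None:
        # Unknown event types are surfaced rather than silently dropped.
        return f"Unrecognized eventType {event_type!r}: triage manually"
    severity = str(alert.get("severity", "unknown")).upper()
    return f"{severity}: follow '{section}'"
```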

Metrics

# Counter: total checkpoint mismatches
attestor_rekor_checkpoint_mismatch_total{backend="sigstore-prod",origin="rekor.sigstore.dev"} 0

# Counter: rollback attempts detected
attestor_rekor_checkpoint_rollback_detected_total{backend="sigstore-prod"} 0

# Counter: cross-log divergences detected
attestor_rekor_cross_log_divergence_total{primary="rekor.sigstore.dev",mirror="rekor.mirror.example.com"} 0

# Gauge: seconds since last valid checkpoint
attestor_rekor_checkpoint_age_seconds{backend="sigstore-prod"} 120

# Counter: total anomalies detected (all types)
attestor_rekor_anomalies_detected_total{type="RootHashMismatch",severity="critical"} 0
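
For a quick manual check of a metrics scrape, something like the following could flag nonzero anomaly counters and an excessive checkpoint age. It is a simplified line parser written for this example (it skips samples with timestamps and does not handle `}` inside label values), and the `findings` helper and its one-hour default are assumptions. A nonzero counter only shows that an anomaly occurred since the process started, so production alerting would normally look at the rate of increase instead.

```python
import re

# Matches "name{labels} value" or "name value"; deliberately minimal.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            yield match["name"], match["labels"] or "", float(match["value"])


def findings(text, max_age_seconds=3600.0):
    """Return human-readable lines for metrics that need attention."""
    out = []
    for name, labels, value in parse_samples(text):
        if name.endswith("_total") and value > 0:
            out.append(f"{name}{{{labels}}} = {value:g}")
        elif name == "attestor_rekor_checkpoint_age_seconds" and value > max_age_seconds:
            out.append(f"{name}{{{labels}}} = {value:g} (above {max_age_seconds:g}s)")
    return out
```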

Incident Response Procedures

Level 1: Root Hash Mismatch (CRITICAL)

Symptoms:

  • attestor_rekor_checkpoint_mismatch_total increments
  • Alert received: "rekor.checkpoint.divergence"

Immediate Actions:

  1. Quarantine all affected proofs - Do not rely on any inclusion proofs from the affected log until resolved
  2. Suspend automated verifications - Halt any automated systems that depend on the log
  3. Preserve evidence - Capture both checkpoints (expected and actual) with full metadata
  4. Alert security team - This is a potential compromise indicator
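
One possible way to preserve evidence is to store both checkpoints in a single record together with a digest, so later tampering or corruption of the record is detectable. The record layout and the `build_evidence_record` and `verify_evidence_record` helpers below are assumptions for illustration; they are not a format defined by the attestor.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) gives a stable digest.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_evidence_record(expected: dict, actual: dict, captured_at=None) -> dict:
    """Bundle both checkpoints with a capture time and a SHA-256 digest."""
    captured_at = captured_at or datetime.now(timezone.utc)
    record = {
        "capturedAt": captured_at.isoformat(),
        "expected": expected,
        "actual": actual,
    }
    record["sha256"] = _digest(record)
    return record


def verify_evidence_record(record: dict) -> bool:
    """Recompute the digest over everything except the digest itself."""
    body = {k: v for k, v in record.items() if k != "sha256"}
    return _digest(body) == record.get("sha256")
```

The digest only protects against accidental modification; for forensic use the record would also need to be written to storage the suspected attacker cannot alter.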

Investigation Steps:

  1. Verify that the mismatch is not caused by local storage corruption:
    stella attestor checkpoint verify --origin rekor.sigstore.dev --tree-size 12345678
    
  2. Cross-check with independent sources (other clients, mirrors)
  3. Check if Sigstore has published any incident reports
  4. Review network logs for MITM indicators

Resolution:

  • If confirmed attack: Follow security incident process
  • If local corruption: Resync from trusted source
  • If upstream issue: Wait for Sigstore remediation, follow their guidance

Level 2: Tree Size Rollback (CRITICAL)

Symptoms:

  • attestor_rekor_checkpoint_rollback_detected_total increments
  • Alert received: "rekor.checkpoint.rollback"

Immediate Actions:

  1. Reject the checkpoint - Do not accept or store it
  2. Log full details for forensic analysis
  3. Check network path - Could indicate MITM or DNS hijacking

Investigation Steps:

  1. Verify current log state directly:
    curl -s https://rekor.sigstore.dev/api/v1/log | jq .treeSize
    
  2. Compare with stored latest tree size
  3. Check DNS resolution and TLS certificate chain
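
Steps 1 and 2 can be scripted along these lines. The URL and the `treeSize` field come from the `curl` command above; the `fetch_tree_size` and `compare_tree_size` helpers and the message wording are hypothetical. The stored tree size has to be supplied by the caller, since how it is read from local storage is not covered here.

```python
import json
import urllib.request

LOG_INFO_URL = "https://rekor.sigstore.dev/api/v1/log"  # same endpoint as the curl command


def fetch_tree_size(url: str = LOG_INFO_URL, timeout: float = 10.0) -> int:
    """Fetch the log info document and return its treeSize (network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return int(json.load(response)["treeSize"])


def compare_tree_size(stored: int, live: int) -> str:
    """Describe how the live tree size relates to the stored one."""
    if live < stored:
        return f"ROLLBACK: live tree size {live} is below stored {stored}"
    if live == stored:
        return f"UNCHANGED: live tree size equals stored {stored}"
    return f"OK: log grew by {live - stored} entries"
```

A typical call would be `compare_tree_size(stored, fetch_tree_size())`. If the live value looks healthy while the rejected checkpoint did not, that points toward the network path rather than the log itself.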

Resolution:

  • If network attack: Remediate network path, rotate credentials
  • If temporary glitch: Monitor for repetition
  • If persistent: Escalate to upstream provider

Level 3: Cross-Log Divergence (WARNING)

Symptoms:

  • attestor_rekor_cross_log_divergence_total increments
  • Alert received: "rekor.checkpoint.cross_log_divergence"

Immediate Actions:

  1. Do not panic - Mirrors may have legitimate lag
  2. Check mirror sync status - May be catching up

Investigation Steps:

  1. Compare tree sizes:
    stella attestor checkpoint list --origins rekor.sigstore.dev,rekor.mirror.example.com
    
  2. If the roots differ at the same tree size: escalate to CRITICAL, because sync lag cannot explain this
  3. If the tree sizes differ: allow the mirror time to sync, then compare again
  4. If the divergence persists: investigate with the mirror operator
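
The decision in these steps reduces to a small function. The `Action` enum and `triage` function are illustrative names, not part of the `stella` CLI. Root hashes taken at different tree sizes cannot be compared directly, which is why the sketch defers any judgment until the sizes match.

```python
from enum import Enum


class Action(Enum):
    CONSISTENT = "consistent"
    WAIT_FOR_SYNC = "wait_for_sync"
    ESCALATE_CRITICAL = "escalate_critical"


def triage(primary_size: int, primary_root: str,
           mirror_size: int, mirror_root: str) -> Action:
    """Decide the next step from one primary and one mirror checkpoint."""
    if primary_size != mirror_size:
        # Different sizes: roots are expected to differ, so wait and re-check.
        return Action.WAIT_FOR_SYNC
    if primary_root != mirror_root:
        # Same size, different roots: this is not lag.
        return Action.ESCALATE_CRITICAL
    return Action.CONSISTENT
```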

Resolution:

  • Sync lag: Monitor until caught up
  • Persistent divergence: Disable mirror, investigate, or remove from trust list

Level 4: Stale Checkpoint (WARNING)

Symptoms:

  • attestor_rekor_checkpoint_age_seconds exceeds threshold
  • Log health status: DEGRADED or UNHEALTHY

Immediate Actions:

  1. Check log service status
  2. Verify network connectivity to log

Investigation Steps:

  1. Check Sigstore status page
  2. Test direct API access:
    curl -I https://rekor.sigstore.dev/api/v1/log
    
  3. Review recent checkpoint fetch attempts

Resolution:

  • Upstream outage: Wait for recovery and rely on cached checkpoint data in the meantime
  • Local network issue: Restore connectivity
  • Persistent: Consider failover to mirror

Configuration

Detector Options

attestor:
  divergenceDetection:
    # Enable checkpoint monitoring
    enabled: true

    # Threshold for "stale checkpoint" warning
    staleCheckpointThreshold: 1h

    # Threshold for "stale tree size" (no growth)
    staleTreeSizeThreshold: 2h

    # Log health thresholds
    degradedCheckpointAgeThreshold: 30m
    unhealthyCheckpointAgeThreshold: 2h

    # Enable cross-log consistency checks
    enableCrossLogChecks: true

    # Mirror origins to check against primary
    mirrorOrigins:
      - rekor.mirror.example.com
      - rekor.mirror2.example.com
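
To make the health thresholds concrete, the sketch below classifies a checkpoint age against `degradedCheckpointAgeThreshold` and `unhealthyCheckpointAgeThreshold`. It assumes durations are written as an integer followed by `s`, `m`, or `h`, as in the example values above, and that a log below both thresholds is reported as `HEALTHY`; the real parser and status names may differ.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '30m' or '2h' to seconds (assumed syntax)."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def health_status(age_seconds: float,
                  degraded: str = "30m", unhealthy: str = "2h") -> str:
    """Map a checkpoint age to a health status using the two thresholds."""
    if age_seconds > parse_duration(unhealthy):
        return "UNHEALTHY"
    if age_seconds > parse_duration(degraded):
        return "DEGRADED"
    return "HEALTHY"
```

With the values shown, a checkpoint becomes `DEGRADED` after 30 minutes, triggers the stale checkpoint warning after one hour, and becomes `UNHEALTHY` after two hours.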

Alert Options

attestor:
  alerts:
    # Enable alert publishing to Notify service
    enabled: true

    # Default tenant for system alerts
    defaultTenant: system

    # Severity levels that trigger alerts
    alertOnHighSeverity: true
    alertOnWarning: true
    alertOnInfo: false

    # Alert stream name
    stream: attestor.alerts
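
The severity toggles could be applied roughly as follows. This is a sketch under two assumptions that the configuration above does not confirm: that `alertOnHighSeverity` covers the `critical` severity used in the alert payloads, and that alerts with an unrecognized severity are published rather than dropped. The `AlertOptions` class and `should_publish` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertOptions:
    """Mirrors the YAML keys above; defaults match the example values."""
    enabled: bool = True
    alert_on_high_severity: bool = True
    alert_on_warning: bool = True
    alert_on_info: bool = False


def should_publish(severity: str, options: AlertOptions) -> bool:
    """Return True if an alert of this severity should be published."""
    if not options.enabled:
        return False
    level = severity.lower()
    if level in ("critical", "high"):  # assumption: critical counts as high severity
        return options.alert_on_high_severity
    if level == "warning":
        return options.alert_on_warning
    if level == "info":
        return options.alert_on_info
    return True  # assumption: fail open on unknown severities
```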

Runbook Checklist

Daily Operations

  • Verify attestor_rekor_checkpoint_age_seconds is below the configured staleCheckpointThreshold
  • Check for any anomaly counter increments
  • Review divergence detector logs for warnings

Weekly Review

  • Audit checkpoint storage integrity
  • Verify mirror sync status
  • Review and tune alerting thresholds

Post-Incident

  • Document root cause
  • Update detection rules if needed
  • Review and improve response procedures
  • Share learnings with team

See Also