Identity Watchlist Monitoring Runbook

This runbook covers operational procedures for the Stella Ops identity watchlist monitoring system.

Service Overview

The identity watchlist monitor is a background service that:

  1. Monitors new Attestor entries in real-time (or via polling)
  2. Matches signer identities against configured watchlist patterns
  3. Emits alerts through the notification system
  4. Applies deduplication to prevent alert storms
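
The matching step (item 2 above) can be pictured with a minimal sketch. It assumes watchlist patterns are regular expressions applied to the full signer identity. The `WatchlistEntry` type, its field names, and the sample identities are hypothetical and not taken from the Stella Ops codebase.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WatchlistEntry:
    entry_id: str   # hypothetical identifier
    pattern: str    # assumed to be a regex over the whole signer identity
    severity: str   # "critical", "warning" or "info"


def match_identity(identity: str, entries: list[WatchlistEntry]) -> list[WatchlistEntry]:
    """Return every watchlist entry whose pattern matches the signer identity."""
    return [e for e in entries if re.fullmatch(e.pattern, identity)]


entries = [
    WatchlistEntry("wl-1", r".*@untrusted\.example", "critical"),
    WatchlistEntry("wl-2", r"ci-bot@.*", "info"),
]

hits = match_identity("ci-bot@untrusted.example", entries)
print([(e.entry_id, e.severity) for e in hits])
```

One identity can match several entries, which is why overly broad patterns multiply alert volume.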

Configuration

// appsettings.json
{
  "Attestor": {
    "Watchlist": {
      "Enabled": true,
      "Mode": "ChangeFeed",           // or "Polling" for air-gap
      "PollingInterval": "00:00:05",  // 5 seconds
      "MaxEventsPerSecond": 100,
      "DefaultDedupWindowMinutes": 60,
      "RegexTimeoutMs": 100,
      "MaxWatchlistEntriesPerTenant": 1000,
      "PatternCacheSize": 1000,
      "InitialDelay": "00:00:10",
      "NotifyChannelName": "attestor_entries_inserted"
    }
  }
}
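
Before deploying a changed configuration, a quick sanity check can catch typos. The sketch below is a hypothetical helper, not part of the service. It assumes the interval values use the simple "hh:mm:ss" form shown above and that every numeric limit must be positive.

```python
from datetime import timedelta

# Values copied from the Watchlist section above.
WATCHLIST_SETTINGS = {
    "Enabled": True,
    "Mode": "ChangeFeed",
    "PollingInterval": "00:00:05",
    "MaxEventsPerSecond": 100,
    "DefaultDedupWindowMinutes": 60,
    "RegexTimeoutMs": 100,
    "MaxWatchlistEntriesPerTenant": 1000,
    "PatternCacheSize": 1000,
    "InitialDelay": "00:00:10",
    "NotifyChannelName": "attestor_entries_inserted",
}


def parse_timespan(value: str) -> timedelta:
    """Parse a simple "hh:mm:ss" value; day or fraction parts are not handled."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def check_watchlist_settings(settings: dict) -> list[str]:
    """Return a list of human-readable problems (empty when none are found)."""
    problems = []
    if settings["Mode"] not in ("ChangeFeed", "Polling"):
        problems.append(f"unknown Mode: {settings['Mode']}")
    if parse_timespan(settings["PollingInterval"]) <= timedelta(0):
        problems.append("PollingInterval must be positive")
    # Assumption: all numeric limits are expected to be greater than zero.
    for key in ("MaxEventsPerSecond", "DefaultDedupWindowMinutes", "RegexTimeoutMs",
                "MaxWatchlistEntriesPerTenant", "PatternCacheSize"):
        if settings[key] <= 0:
            problems.append(f"{key} must be positive")
    return problems


print(check_watchlist_settings(WATCHLIST_SETTINGS))
```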

Alert Triage Procedures

Critical Severity Alert

Response Time: Immediate (< 15 minutes)

  1. Acknowledge the alert in your incident management system
  2. Verify the matched identity in Rekor:
    rekor-cli get --uuid <rekor-uuid>
    
  3. Determine impact:
    • What artifact was signed?
    • Is this a known/expected signer?
    • What systems consume this artifact?
  4. Escalate if malicious activity is confirmed
  5. Document findings in incident record

Warning Severity Alert

Response Time: Within 1 hour

  1. Review the alert details
  2. Check context:
    • Is this a new legitimate workflow?
    • Is the pattern too broad?
  3. Adjust watchlist entry if needed:
    stella watchlist update <id> --severity info
    # or
    stella watchlist update <id> --enabled false
    
  4. Document decision rationale

Info Severity Alert

Response Time: Next business day

  1. Review for patterns or trends
  2. Consider whether the alert should be disabled or tuned
  3. Archive after review

Performance Tuning

High Scan Latency

Symptom: attestor.watchlist.scan_latency_seconds above 10 ms (0.010 in the metric's native unit of seconds)

Investigation:

  1. Count the enabled entries and compare the result with PatternCacheSize (more entries than cache slots points to cache misses):
    SELECT COUNT(*) FROM attestor.identity_watchlist WHERE enabled = true;
    
  2. Review regex patterns for complexity
  3. Check tenant watchlist count

Resolution:

  • Increase PatternCacheSize if cache misses are high
  • Simplify complex regex patterns
  • Consider splitting overly broad patterns
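
To see why PatternCacheSize matters, the following sketch models a compiled-pattern cache with Python's `functools.lru_cache` and reports its hit rate. It illustrates the idea only: how the service actually caches patterns is not described here, and Python's `re` module has no equivalent of RegexTimeoutMs, so the timeout is not modeled.

```python
import re
from functools import lru_cache

PATTERN_CACHE_SIZE = 1000  # mirrors PatternCacheSize from the configuration


@lru_cache(maxsize=PATTERN_CACHE_SIZE)
def compile_pattern(pattern: str) -> re.Pattern:
    """Compile a pattern once; repeated requests are served from the LRU cache."""
    return re.compile(pattern)


def cache_hit_rate() -> float:
    """Fraction of compile_pattern calls answered from the cache."""
    info = compile_pattern.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


for p in [r"ci-bot@.*", r"ci-bot@.*", r".*@example\.org", r"ci-bot@.*"]:
    compile_pattern(p)

print(f"hit rate: {cache_hit_rate():.2f}")
```

When the number of distinct enabled patterns exceeds the cache size, entries are evicted and recompiled on each pass, so the hit rate drops and scan latency rises.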

High Alert Volume

Symptom: attestor.watchlist.alerts_emitted_total growing rapidly

Investigation:

  1. Identify top-triggering entries:
    stella watchlist alerts --since 1h --format json | jq 'group_by(.watchlistEntryId) | map({id: .[0].watchlistEntryId, count: length}) | sort_by(-.count)'
    
  2. Check if pattern is too broad
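
Where jq is not available, the same grouping can be done in Python. This sketch assumes the JSON output is an array of alert objects that each carry a `watchlistEntryId` field, as the jq filter above implies. The sample data is made up.

```python
import json
from collections import Counter


def top_triggering_entries(alerts_json: str, limit: int = 10) -> list[tuple[str, int]]:
    """Count alerts per watchlist entry, most frequent first."""
    alerts = json.loads(alerts_json)
    counts = Counter(alert["watchlistEntryId"] for alert in alerts)
    return counts.most_common(limit)


# Made-up sample standing in for the CLI's JSON output.
sample = json.dumps([
    {"watchlistEntryId": "wl-1"},
    {"watchlistEntryId": "wl-2"},
    {"watchlistEntryId": "wl-1"},
])

for entry_id, count in top_triggering_entries(sample):
    print(entry_id, count)
```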

Resolution:

  • Narrow pattern scope
  • Increase dedup window
  • Reduce severity if appropriate

Database Performance

Symptom: Slow list/match queries

Investigation:

EXPLAIN ANALYZE
SELECT * FROM attestor.identity_watchlist
WHERE enabled = true AND (tenant_id = 'tenant-1' OR scope IN ('Global', 'System'));

Resolution:

  • Verify indexes exist:
    SELECT indexname FROM pg_indexes WHERE schemaname = 'attestor' AND tablename = 'identity_watchlist';
    
  • Run VACUUM ANALYZE if needed
  • Consider partitioning for large deployments

Deduplication Table Maintenance

Cleanup Expired Records

Run periodically (daily recommended):

DELETE FROM attestor.identity_alert_dedup
WHERE last_alert_at < NOW() - INTERVAL '7 days';

Check Dedup Effectiveness

SELECT
  watchlist_id,
  COUNT(*) as suppressed_identities,
  SUM(alert_count) as total_suppressions
FROM attestor.identity_alert_dedup
GROUP BY watchlist_id
ORDER BY total_suppressions DESC
LIMIT 10;

Air-Gap Operation

For air-gapped environments where the service cannot use PostgreSQL LISTEN/NOTIFY:

  1. Set "Mode" to "Polling" in configuration
  2. Adjust PollingInterval based on acceptable delay (default: 5s)
  3. Ensure sufficient database connection pool size
  4. Monitor for missed entries during polling gaps
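
A single polling pass can be sketched as below, assuming the poller keeps a checkpoint timestamp and asks for entries integrated after it. The `Entry` type, `make_fetcher`, and `poll_once` are hypothetical names; the real service's query and checkpoint storage are not shown in this runbook.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entry:
    uuid: str
    integrated_time_utc: datetime


def make_fetcher(store):
    """Build an in-memory stand-in for the database query used by the poller."""
    def fetch_since(checkpoint):
        return [e for e in store if e.integrated_time_utc > checkpoint]
    return fetch_since


def poll_once(fetch_since, checkpoint, handle):
    """Process entries newer than the checkpoint and return the new checkpoint."""
    entries = sorted(fetch_since(checkpoint), key=lambda e: e.integrated_time_utc)
    for entry in entries:
        handle(entry)
        checkpoint = max(checkpoint, entry.integrated_time_utc)
    return checkpoint


def at(second):
    return datetime(2024, 1, 1, 0, 0, second, tzinfo=timezone.utc)


store = [Entry("b", at(2)), Entry("a", at(1)), Entry("c", at(3))]
checkpoint = poll_once(make_fetcher(store), at(0), lambda e: print("scan", e.uuid))
print("checkpoint:", checkpoint.isoformat())
```

With a strict "newer than" comparison, an entry that arrives later but shares the checkpoint's timestamp would be skipped, which is one kind of gap that step 4 is meant to catch.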

Disaster Recovery

Service Restart

  1. Entries are processed based on IntegratedTimeUtc
  2. On restart, the service resumes from last checkpoint
  3. Some duplicate alerts may occur during recovery (handled by dedup)
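
The way deduplication absorbs replayed entries after a restart can be sketched as follows. This assumes a fixed window measured from the last emitted alert per (watchlist entry, identity) pair. The `DedupWindow` class is hypothetical and keeps state in memory, whereas the service persists it in attestor.identity_alert_dedup.

```python
from datetime import datetime, timedelta, timezone


class DedupWindow:
    """Suppress repeat alerts for the same (watchlist_id, identity) pair."""

    def __init__(self, window_minutes: int = 60):  # DefaultDedupWindowMinutes
        self.window = timedelta(minutes=window_minutes)
        self.suppressed = 0
        self._last_alert_at = {}

    def should_alert(self, watchlist_id: str, identity: str, now: datetime) -> bool:
        key = (watchlist_id, identity)
        last = self._last_alert_at.get(key)
        if last is not None and now - last < self.window:
            self.suppressed += 1  # still inside the window: suppress
            return False
        self._last_alert_at[key] = now
        return True


dedup = DedupWindow(window_minutes=60)
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

print(dedup.should_alert("wl-1", "ci-bot@example.org", start))                          # first match
print(dedup.should_alert("wl-1", "ci-bot@example.org", start + timedelta(minutes=5)))   # replay
print(dedup.should_alert("wl-1", "ci-bot@example.org", start + timedelta(minutes=61)))  # window elapsed
```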

Database Failover

  1. Service will retry connections automatically
  2. The in-memory pattern cache survives brief outages
  3. Long outages may require service restart

Watchlist Export/Import

Export:

stella watchlist list --include-global --format json > watchlist-backup.json

Import (manual):

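# Process each entry and recreate it
jq -c '.[]' watchlist-backup.json | while IFS= read -r entry; do
  # Extract fields from "$entry" and call stella watchlist add with them.
  # The echo keeps the loop body non-empty until that call is filled in.
  echo "$entry"
done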

Metrics Reference

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| attestor.watchlist.entries_scanned_total | Processing volume | N/A (informational) |
| attestor.watchlist.matches_total | Match frequency | > 100/min (review patterns) |
| attestor.watchlist.alerts_emitted_total | Alert volume | > 50/min (check notification capacity) |
| attestor.watchlist.alerts_suppressed_total | Dedup effectiveness | None (a high ratio means dedup is working) |
| attestor.watchlist.scan_latency_seconds | Performance | p99 > 50 ms (tune cache/patterns) |
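
The thresholds above can be expressed as a small check, for example inside a custom health script. The function below is a hypothetical sketch. It assumes the per-minute rates and the p99 latency in seconds have already been derived from the raw metrics by your monitoring stack.

```python
def evaluate_thresholds(matches_per_min: float,
                        alerts_per_min: float,
                        scan_latency_p99_seconds: float) -> list[str]:
    """Return the follow-up action for every threshold that is exceeded."""
    findings = []
    if matches_per_min > 100:
        findings.append("matches_total: review patterns")
    if alerts_per_min > 50:
        findings.append("alerts_emitted_total: check notification capacity")
    if scan_latency_p99_seconds > 0.050:  # 50 ms expressed in seconds
        findings.append("scan_latency_seconds: tune cache/patterns")
    return findings


print(evaluate_thresholds(matches_per_min=120, alerts_per_min=10,
                          scan_latency_p99_seconds=0.020))
```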

Escalation Contacts

| Severity | Contact | Response SLA |
|----------|---------|--------------|
| Critical | On-call Security | 15 minutes |
| Warning | Security Team | 1 hour |
| Info | Security Analyst | Next business day |