# Identity Watchlist Monitoring Runbook

This runbook covers operational procedures for the Stella Ops identity watchlist monitoring system.

## Service Overview

The identity watchlist monitor is a background service that:

1. Monitors new Attestor entries in real time (or via polling)
2. Matches signer identities against configured watchlist patterns
3. Emits alerts through the notification system
4. Applies deduplication to prevent alert storms

### Configuration

```jsonc
// appsettings.json
{
  "Attestor": {
    "Watchlist": {
      "Enabled": true,
      "Mode": "ChangeFeed",           // or "Polling" for air-gap
      "PollingInterval": "00:00:05",  // 5 seconds
      "MaxEventsPerSecond": 100,
      "DefaultDedupWindowMinutes": 60,
      "RegexTimeoutMs": 100,
      "MaxWatchlistEntriesPerTenant": 1000,
      "PatternCacheSize": 1000,
      "InitialDelay": "00:00:10",
      "NotifyChannelName": "attestor_entries_inserted"
    }
  }
}
```

## Alert Triage Procedures

### Critical Severity Alert

**Response Time**: Immediate (< 15 minutes)

1. **Acknowledge** the alert in your incident management system
2. **Verify** the matched identity in Rekor:
   ```bash
   rekor-cli get --uuid <entry-uuid>
   ```
3. **Determine impact**:
   - What artifact was signed?
   - Is this a known/expected signer?
   - What systems consume this artifact?
4. **Escalate** if malicious activity is confirmed
5. **Document** findings in the incident record

### Warning Severity Alert

**Response Time**: Within 1 hour

1. **Review** the alert details
2. **Check context**:
   - Is this a new legitimate workflow?
   - Is the pattern too broad?
3. **Adjust** the watchlist entry if needed:
   ```bash
   stella watchlist update <entry-id> --severity info
   # or
   stella watchlist update <entry-id> --enabled false
   ```
4. **Document** the decision rationale

### Info Severity Alert

**Response Time**: Next business day

1. **Review** for patterns or trends
2. **Consider** whether the alert should be disabled or tuned
3. **Archive** after review

## Performance Tuning

### High Scan Latency

**Symptom**: `attestor.watchlist.scan_latency_seconds` > 10 ms

**Investigation**:

1. Estimate pattern cache pressure by comparing the number of enabled entries with `PatternCacheSize`:
   ```sql
   SELECT COUNT(*) FROM attestor.identity_watchlist WHERE enabled = true;
   ```
2. Review regex patterns for complexity
3. Check the tenant watchlist count

**Resolution**:

- Increase `PatternCacheSize` if cache misses are high
- Simplify complex regex patterns
- Consider splitting overly broad patterns

### High Alert Volume

**Symptom**: `attestor.watchlist.alerts_emitted_total` growing rapidly

**Investigation**:

1. Identify top-triggering entries:
   ```bash
   stella watchlist alerts --since 1h --format json | jq 'group_by(.watchlistEntryId) | map({id: .[0].watchlistEntryId, count: length}) | sort_by(-.count)'
   ```
2. Check whether the pattern is too broad

**Resolution**:

- Narrow the pattern scope
- Increase the dedup window
- Reduce severity if appropriate

### Database Performance

**Symptom**: Slow list/match queries

**Investigation**:

```sql
EXPLAIN ANALYZE
SELECT * FROM attestor.identity_watchlist
WHERE enabled = true
  AND (tenant_id = 'tenant-1' OR scope IN ('Global', 'System'));
```

**Resolution**:

- Verify indexes exist:
  ```sql
  SELECT indexname FROM pg_indexes WHERE tablename = 'identity_watchlist';
  ```
- Run `VACUUM ANALYZE` if needed
- Consider partitioning for large deployments

## Deduplication Table Maintenance

### Cleanup Expired Records

Run periodically (daily recommended):

```sql
DELETE FROM attestor.identity_alert_dedup
WHERE last_alert_at < NOW() - INTERVAL '7 days';
```

### Check Dedup Effectiveness

```sql
SELECT
  watchlist_id,
  COUNT(*) AS suppressed_identities,
  SUM(alert_count) AS total_suppressions
FROM attestor.identity_alert_dedup
GROUP BY watchlist_id
ORDER BY total_suppressions DESC
LIMIT 10;
```

## Air-Gap Operation

For environments where PostgreSQL LISTEN/NOTIFY is not available:

1. Set `Mode: Polling` in configuration
2. Adjust `PollingInterval` based on acceptable delay (default: 5s)
3. Ensure sufficient database connection pool size
4. Monitor for missed entries during polling gaps

## Disaster Recovery

### Service Restart

1. Entries are processed based on `IntegratedTimeUtc`
2. On restart, the service resumes from the last checkpoint
3. Some duplicate alerts may occur during recovery (handled by dedup)

### Database Failover

1. The service retries connections automatically
2. The in-memory pattern cache survives brief outages
3. Long outages may require a service restart

### Watchlist Export/Import

Export:

```bash
stella watchlist list --include-global --format json > watchlist-backup.json
```

Import (manual):

```bash
# Process each entry and recreate it
jq -c '.[]' watchlist-backup.json | while IFS= read -r entry; do
  # Extract fields from "$entry" and call stella watchlist add
  printf '%s\n' "$entry"  # placeholder so the loop body is valid shell
done
```

## Metrics Reference

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `attestor.watchlist.entries_scanned_total` | Processing volume | N/A (informational) |
| `attestor.watchlist.matches_total` | Match frequency | > 100/min (review patterns) |
| `attestor.watchlist.alerts_emitted_total` | Alert volume | > 50/min (check notification capacity) |
| `attestor.watchlist.alerts_suppressed_total` | Dedup effectiveness | N/A (a high ratio means deduplication is working) |
| `attestor.watchlist.scan_latency_seconds` | Performance | p99 > 50 ms (tune cache/patterns) |

## Escalation Contacts

| Severity | Contact | Response SLA |
|----------|---------|--------------|
| Critical | On-call Security | 15 minutes |
| Warning | Security Team | 1 hour |
| Info | Security Analyst | Next business day |

## Related Documents

- [Identity Watchlist User Guide](../modules/attestor/guides/identity-watchlist.md)
- [Attestor Architecture](../modules/attestor/architecture.md)
- [Notification System](../modules/notify/architecture.md)
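
## Appendix: Checking Rate Thresholds Manually

The per-minute thresholds in the Metrics Reference table apply to rates, while the `_total` metrics are presumably cumulative counters. When no dashboard is at hand, a rate can be derived from two counter samples. The following is a minimal sketch, not part of the Stella Ops tooling: the function names `rate_per_min` and `exceeds_threshold` are hypothetical, and the reset handling assumes counters restart from zero when the service restarts.

```bash
#!/bin/sh
# Sketch only: helper names are illustrative, not part of any Stella Ops CLI.

# rate_per_min PREV CURR ELAPSED_SECONDS
# Prints the per-minute rate between two samples of a cumulative counter.
rate_per_min() {
  prev=$1
  curr=$2
  elapsed=$3
  if [ "$elapsed" -le 0 ]; then
    echo "elapsed seconds must be positive" >&2
    return 1
  fi
  if [ "$curr" -lt "$prev" ]; then
    # Assumption: a counter that went backwards was reset by a restart,
    # so everything counted since the reset is the delta.
    delta=$curr
  else
    delta=$((curr - prev))
  fi
  # Integer arithmetic: the result is truncated toward zero.
  echo $((delta * 60 / elapsed))
}

# exceeds_threshold RATE THRESHOLD
# Exit status 0 when RATE is strictly greater than THRESHOLD.
exceeds_threshold() {
  [ "$1" -gt "$2" ]
}
```

For example, two samples of `attestor.watchlist.alerts_emitted_total` taken 120 seconds apart with values 1000 and 1250 give `rate_per_min 1000 1250 120`, which prints `125`. That is above the 50/min alert-volume threshold, so `exceeds_threshold 125 50` succeeds.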