stella-ops.org/git.stella-ops.org

master 55744f6a39 tests fixes and some product advisories tunes ups

2026-01-30 07:57:43 +02:00

Identity Watchlist Monitoring Runbook

This runbook covers operational procedures for the Stella Ops identity watchlist monitoring system.

Service Overview

The identity watchlist monitor is a background service that:

- Monitors new Attestor entries in real-time (or via polling)
- Matches signer identities against configured watchlist patterns
- Emits alerts through the notification system
- Applies deduplication to prevent alert storms
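The flow above (match, then deduplicate, then alert) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class name, the entry id `wl-1`, the use of full-string regex matching, and the rule that the dedup window is measured from the last emitted alert are all hypothetical and not taken from the Stella Ops implementation.

```python
import re
from datetime import datetime, timedelta, timezone


class WatchlistMonitor:
    """Toy model of the match -> dedup -> alert flow (not the real service)."""

    def __init__(self, patterns, dedup_window_minutes=60):
        # patterns: hypothetical mapping of watchlist entry id -> regex string
        self.patterns = {eid: re.compile(p) for eid, p in patterns.items()}
        self.window = timedelta(minutes=dedup_window_minutes)
        self.last_alert = {}  # (entry id, identity) -> time of last emitted alert

    def process(self, identity, now):
        """Return the entry ids that should alert for this signer identity."""
        alerts = []
        for eid, rx in self.patterns.items():
            if not rx.fullmatch(identity):
                continue
            key = (eid, identity)
            previous = self.last_alert.get(key)
            if previous is not None and now - previous < self.window:
                continue  # suppressed: same entry and identity inside the window
            self.last_alert[key] = now
            alerts.append(eid)
        return alerts


monitor = WatchlistMonitor({"wl-1": r".*@example\.com"})
t0 = datetime(2026, 1, 30, 8, 0, tzinfo=timezone.utc)
first = monitor.process("dev@example.com", t0)
repeat = monitor.process("dev@example.com", t0 + timedelta(minutes=5))
later = monitor.process("dev@example.com", t0 + timedelta(minutes=61))
print(first, repeat, later)  # ['wl-1'] [] ['wl-1']
```

The second call is suppressed because it falls inside the 60-minute window; the third alerts again because the window has elapsed.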

Configuration

// appsettings.json
{
  "Attestor": {
    "Watchlist": {
      "Enabled": true,
      "Mode": "ChangeFeed",           // or "Polling" for air-gap
      "PollingInterval": "00:00:05",  // 5 seconds
      "MaxEventsPerSecond": 100,
      "DefaultDedupWindowMinutes": 60,
      "RegexTimeoutMs": 100,
      "MaxWatchlistEntriesPerTenant": 1000,
      "PatternCacheSize": 1000,
      "InitialDelay": "00:00:10",
      "NotifyChannelName": "attestor_entries_inserted"
    }
  }
}
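Before deploying a changed configuration, it can help to sanity-check the values. The sketch below is a hypothetical helper, not part of the product: it assumes durations use the `hh:mm:ss` form shown above, that `ChangeFeed` and `Polling` are the only valid modes (the only two this runbook mentions), and that all numeric limits must be positive.

```python
from datetime import timedelta


def parse_timespan(value):
    """Parse an "hh:mm:ss" string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def check_watchlist_settings(settings):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Assumption: only the two modes named in this runbook are valid.
    if settings["Mode"] not in ("ChangeFeed", "Polling"):
        problems.append("Mode must be ChangeFeed or Polling")
    if parse_timespan(settings["PollingInterval"]) <= timedelta(0):
        problems.append("PollingInterval must be positive")
    for key in ("MaxEventsPerSecond", "DefaultDedupWindowMinutes",
                "RegexTimeoutMs", "MaxWatchlistEntriesPerTenant",
                "PatternCacheSize"):
        if settings[key] <= 0:
            problems.append(f"{key} must be positive")
    return problems


sample = {
    "Enabled": True,
    "Mode": "ChangeFeed",
    "PollingInterval": "00:00:05",
    "MaxEventsPerSecond": 100,
    "DefaultDedupWindowMinutes": 60,
    "RegexTimeoutMs": 100,
    "MaxWatchlistEntriesPerTenant": 1000,
    "PatternCacheSize": 1000,
}
print(check_watchlist_settings(sample))  # []
```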

Alert Triage Procedures

Critical Severity Alert

Response Time: Immediate (< 15 minutes)

1. Acknowledge the alert in your incident management system
2. Verify the matched identity in Rekor:
   ```
   rekor-cli get --uuid <rekor-uuid>
   ```
3. Determine impact:
   - What artifact was signed?
   - Is this a known/expected signer?
   - What systems consume this artifact?
4. Escalate if malicious activity is confirmed
5. Document findings in the incident record

Warning Severity Alert

Response Time: Within 1 hour

1. Review the alert details
2. Check context:
   - Is this a new legitimate workflow?
   - Is the pattern too broad?
3. Adjust the watchlist entry if needed:

stella watchlist update <id> --severity info
# or
stella watchlist update <id> --enabled false

4. Document the decision rationale

Info Severity Alert

Response Time: Next business day

- Review for patterns or trends
- Consider whether the alert should be disabled or tuned
- Archive after review

Performance Tuning

High Scan Latency

Symptom: attestor.watchlist.scan_latency_seconds exceeds 10 ms (0.010 in the metric's native unit of seconds)

Investigation:

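1. Estimate pattern cache pressure. The query below counts enabled entries rather than reporting a hit rate; a total above PatternCacheSize suggests the cache cannot hold every compiled pattern:

SELECT COUNT(*) FROM attestor.identity_watchlist WHERE enabled = true;

2. Review regex patterns for complexity
3. Check the per-tenant watchlist count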

Resolution:

- Increase PatternCacheSize if cache misses are high
- Simplify complex regex patterns
- Consider splitting overly broad patterns
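To decide which patterns are worth simplifying, one option is to time each pattern against a representative signer identity and rank the results. The sketch below is illustrative only: the function name and sample patterns are hypothetical, and Python's `re` engine differs from the .NET regex engine the service presumably uses, so treat the ranking as a rough hint rather than a measurement of production latency.

```python
import re
import time


def time_patterns(patterns, sample, repeats=200):
    """Return (name, mean milliseconds per search) pairs, slowest first."""
    results = []
    for name, pattern in patterns.items():
        rx = re.compile(pattern)
        start = time.perf_counter()
        for _ in range(repeats):
            rx.search(sample)
        elapsed_ms = (time.perf_counter() - start) * 1000 / repeats
        results.append((name, elapsed_ms))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical watchlist patterns keyed by entry name.
candidates = {
    "prefix": r"^dev@",
    "suffix": r".*@example\.com$",
}
ranking = time_patterns(candidates, "dev@example.com", repeats=50)
for name, mean_ms in ranking:
    print(f"{name}: {mean_ms:.4f} ms")
```

Any pattern whose mean time approaches the configured RegexTimeoutMs budget is a candidate for rewriting or splitting.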

High Alert Volume

Symptom: attestor.watchlist.alerts_emitted_total growing rapidly

Investigation:

1. Identify top-triggering entries:

stella watchlist alerts --since 1h --format json | jq 'group_by(.watchlistEntryId) | map({id: .[0].watchlistEntryId, count: length}) | sort_by(-.count)'

2. Check whether the pattern is too broad

Resolution:

- Narrow pattern scope
- Increase dedup window
- Reduce severity if appropriate
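The effect of widening the dedup window can be estimated offline from historical match times. The following sketch assumes a simple rule that may not match the service exactly: for one (entry, identity) pair, an alert is emitted only when at least `window_minutes` have passed since the last emitted alert. The function name and the sample data are hypothetical.

```python
def alerts_emitted(match_minutes, window_minutes):
    """Count alerts emitted for one (entry, identity) pair.

    match_minutes: times of matches, in minutes from an arbitrary origin.
    """
    emitted = 0
    last = None
    for t in sorted(match_minutes):
        if last is None or t - last >= window_minutes:
            emitted += 1
            last = t  # the window restarts at each emitted alert
    return emitted


matches = [0, 10, 20, 70, 75, 130, 200]
for window in (15, 60, 120):
    print(window, alerts_emitted(matches, window))
# 15 5
# 60 4
# 120 2
```

In this sample, doubling the window from 60 to 120 minutes halves the alert count, at the cost of a longer gap before a recurring match is reported again.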

Database Performance

Symptom: Slow list/match queries

Investigation:

EXPLAIN ANALYZE
SELECT * FROM attestor.identity_watchlist
WHERE enabled = true AND (tenant_id = 'tenant-1' OR scope IN ('Global', 'System'));

Resolution:

- Verify indexes exist:

SELECT indexname FROM pg_indexes WHERE schemaname = 'attestor' AND tablename = 'identity_watchlist';

- Run VACUUM ANALYZE if needed
- Consider partitioning for large deployments

Deduplication Table Maintenance

Cleanup Expired Records

Run periodically (daily recommended):

DELETE FROM attestor.identity_alert_dedup
WHERE last_alert_at < NOW() - INTERVAL '7 days';

Check Dedup Effectiveness

SELECT
  watchlist_id,
  COUNT(*) as suppressed_identities,
  SUM(alert_count) as total_suppressions
FROM attestor.identity_alert_dedup
GROUP BY watchlist_id
ORDER BY total_suppressions DESC
LIMIT 10;

Air-Gap Operation

For environments where the service cannot use PostgreSQL LISTEN/NOTIFY:

1. Set "Mode": "Polling" in the configuration
2. Adjust PollingInterval based on the acceptable detection delay (default: 5 s)
3. Ensure the database connection pool is large enough
4. Monitor for missed entries during polling gaps

Disaster Recovery

Service Restart

- Entries are processed based on IntegratedTimeUtc
- On restart, the service resumes from the last checkpoint
- Some duplicate alerts may occur during recovery (handled by dedup)

Database Failover

- The service retries connections automatically
- The in-memory pattern cache survives brief outages
- Long outages may require a service restart

Watchlist Export/Import

Export:

stella watchlist list --include-global --format json > watchlist-backup.json

Import (manual):

# Process each entry and recreate it
jq -c '.[]' watchlist-backup.json | while IFS= read -r entry; do
  # Extract fields and call stella watchlist add.
  # Placeholder: print the entry so the loop body is valid shell.
  printf '%s\n' "$entry"
done
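Before relying on a backup for recovery, it is worth confirming that the export is well formed. The sketch below assumes the export is a JSON array of objects, which is what the `jq -c '.[]'` loop above implies; the function name and the `id` field in the sample are hypothetical.

```python
import json


def summarize_backup(text):
    """Validate exported watchlist JSON and return the number of entries."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of watchlist entries")
    for index, entry in enumerate(data):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
    return len(data)


# Hypothetical export content; real entries will carry more fields.
sample_backup = '[{"id": "wl-1"}, {"id": "wl-2"}]'
print(summarize_backup(sample_backup))  # 2
```

In practice the text would come from reading watchlist-backup.json, and the returned count can be compared with the number of entries listed by the CLI at export time.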

Metrics Reference

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `attestor.watchlist.entries_scanned_total` | Processing volume | N/A (informational) |
| `attestor.watchlist.matches_total` | Match frequency | > 100/min (review patterns) |
| `attestor.watchlist.alerts_emitted_total` | Alert volume | > 50/min (check notification capacity) |
| `attestor.watchlist.alerts_suppressed_total` | Dedup effectiveness | High ratio indicates dedup is working |
| `attestor.watchlist.scan_latency_seconds` | Performance | p99 > 50 ms (tune cache/patterns) |
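The thresholds in the table apply to per-minute rates, while the `_total` metrics are cumulative counters, so two samples are needed to derive a rate. The sketch below shows that arithmetic and a threshold check; the function names are hypothetical, the suppression ratio is defined here as suppressed / (emitted + suppressed), and counter resets are not handled.

```python
def per_minute_rate(previous_total, current_total, elapsed_seconds):
    """Convert two samples of a cumulative counter into a per-minute rate."""
    return (current_total - previous_total) * 60.0 / elapsed_seconds


def suppression_ratio(emitted, suppressed):
    """Share of would-be alerts that dedup suppressed (0.0 when idle)."""
    total = emitted + suppressed
    return suppressed / total if total else 0.0


def evaluate(matches_per_min, alerts_per_min, p99_latency_seconds):
    """Return the follow-up actions from the table whose threshold is exceeded."""
    findings = []
    if matches_per_min > 100:
        findings.append("review patterns")
    if alerts_per_min > 50:
        findings.append("check notification capacity")
    if p99_latency_seconds > 0.050:  # 50 ms expressed in seconds
        findings.append("tune cache/patterns")
    return findings


match_rate = per_minute_rate(1000, 1300, 120)  # 300 matches in 2 minutes
print(match_rate)                               # 150.0
print(evaluate(match_rate, 10, 0.020))          # ['review patterns']
print(suppression_ratio(20, 80))                # 0.8
```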

Escalation Contacts

| Severity | Contact | Response SLA |
|----------|---------|--------------|
| Critical | On-call Security | 15 minutes |
| Warning | Security Team | 1 hour |
| Info | Security Analyst | Next business day |
