tests fixes and some product advisories tunes ups
docs/operations/watchlist-monitoring-runbook.md
# Identity Watchlist Monitoring Runbook

This runbook covers operational procedures for the Stella Ops identity watchlist monitoring system.

## Service Overview

The identity watchlist monitor is a background service that:

1. Monitors new Attestor entries in real-time (or via polling)
2. Matches signer identities against configured watchlist patterns
3. Emits alerts through the notification system
4. Applies deduplication to prevent alert storms
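
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: `WatchlistEntry`, `Monitor`, and the choice to key deduplication on the (entry, identity) pair are assumptions made for the example, and Python's `re` module stands in for whatever regex engine the service uses.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class WatchlistEntry:
    """Hypothetical watchlist entry: an id, a regex pattern, and a severity."""
    entry_id: str
    pattern: str
    severity: str = "warning"


@dataclass
class Monitor:
    """Illustrative monitor: match identities, then dedup per (entry, identity)."""
    entries: list
    dedup_window: timedelta = timedelta(minutes=60)
    _last_alert: dict = field(default_factory=dict)

    def process(self, signer_identity: str, seen_at: datetime) -> list:
        """Return the alerts to emit for one new Attestor entry (step 1)."""
        alerts = []
        for entry in self.entries:
            if not re.search(entry.pattern, signer_identity):
                continue  # step 2: no match, nothing to do
            key = (entry.entry_id, signer_identity)
            last = self._last_alert.get(key)
            if last is not None and seen_at - last < self.dedup_window:
                continue  # step 4: suppressed, still inside the dedup window
            self._last_alert[key] = seen_at
            # step 3: this tuple stands in for a call to the notification system
            alerts.append((entry.severity, entry.entry_id, signer_identity))
        return alerts
```

With a 60-minute window, a second match for the same entry and identity ten minutes after the first is suppressed, while a match after the window has elapsed alerts again.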

### Configuration

Settings live in `appsettings.json`. `Mode` is either `ChangeFeed` or `Polling` (use `Polling` for air-gapped deployments), and `PollingInterval` is an `hh:mm:ss` duration (`00:00:05` is 5 seconds):

```json
{
  "Attestor": {
    "Watchlist": {
      "Enabled": true,
      "Mode": "ChangeFeed",
      "PollingInterval": "00:00:05",
      "MaxEventsPerSecond": 100,
      "DefaultDedupWindowMinutes": 60,
      "RegexTimeoutMs": 100,
      "MaxWatchlistEntriesPerTenant": 1000,
      "PatternCacheSize": 1000,
      "InitialDelay": "00:00:10",
      "NotifyChannelName": "attestor_entries_inserted"
    }
  }
}
```
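
Before rolling out a configuration change, it can help to sanity-check the `Watchlist` section offline. The sketch below is a hypothetical pre-deployment check, not the service's own validation: `parse_hms` and `check_watchlist_settings` are names invented here, only plain `hh:mm:ss` durations are handled, and the "positive integer" rule for the numeric settings is an assumption.

```python
from datetime import timedelta


def parse_hms(value: str) -> timedelta:
    """Parse an "hh:mm:ss" string such as "00:00:05" into a timedelta."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected hh:mm:ss, got {value!r}")
    hours, minutes, seconds = (int(p) for p in parts)
    if not (0 <= minutes < 60 and 0 <= seconds < 60) or hours < 0:
        raise ValueError(f"out-of-range component in {value!r}")
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def check_watchlist_settings(settings: dict) -> list:
    """Return human-readable problems found in a Watchlist settings dict."""
    problems = []
    if settings.get("Mode") not in ("ChangeFeed", "Polling"):
        problems.append("Mode must be ChangeFeed or Polling")
    for key in ("PollingInterval", "InitialDelay"):
        try:
            parse_hms(settings.get(key, ""))
        except ValueError as exc:
            problems.append(f"{key}: {exc}")
    # Assumption: every numeric setting is expected to be a positive integer.
    for key in ("MaxEventsPerSecond", "DefaultDedupWindowMinutes", "RegexTimeoutMs",
                "MaxWatchlistEntriesPerTenant", "PatternCacheSize"):
        if not isinstance(settings.get(key), int) or settings[key] <= 0:
            problems.append(f"{key} must be a positive integer")
    return problems
```

An empty list means the section passed these checks; load the dict with `json.load(...)["Attestor"]["Watchlist"]` from a copy of the settings file.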

## Alert Triage Procedures

### Critical Severity Alert

**Response Time**: Immediate (< 15 minutes)

1. **Acknowledge** the alert in your incident management system
2. **Verify** the matched identity in Rekor:
   ```bash
   rekor-cli get --uuid <rekor-uuid>
   ```
3. **Determine impact**:
   - What artifact was signed?
   - Is this a known/expected signer?
   - What systems consume this artifact?
4. **Escalate** if malicious activity is confirmed
5. **Document** findings in the incident record

### Warning Severity Alert

**Response Time**: Within 1 hour

1. **Review** the alert details
2. **Check context**:
   - Is this a new legitimate workflow?
   - Is the pattern too broad?
3. **Adjust** the watchlist entry if needed:
   ```bash
   stella watchlist update <id> --severity info
   # or
   stella watchlist update <id> --enabled false
   ```
4. **Document** the decision rationale

### Info Severity Alert

**Response Time**: Next business day

1. **Review** for patterns or trends
2. **Consider** whether the alert should be disabled or tuned
3. **Archive** after review

## Performance Tuning

### High Scan Latency

**Symptom**: `attestor.watchlist.scan_latency_seconds` > 10 ms (0.010 in the metric's unit of seconds)

**Investigation**:
1. Estimate pattern cache pressure by counting enabled entries and comparing the result with `PatternCacheSize` (an enabled count above the cache size may indicate frequent cache misses):
   ```sql
   SELECT COUNT(*) FROM attestor.identity_watchlist WHERE enabled = true;
   ```
2. Review regex patterns for complexity
3. Check the per-tenant watchlist entry count

**Resolution**:
- Increase `PatternCacheSize` if cache misses are high
- Simplify complex regex patterns
- Consider splitting overly broad patterns
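
To find which patterns are worth simplifying, one option is to time them offline against a sample of real signer identities. The following is a rough sketch under stated assumptions: `time_patterns` is a hypothetical helper, Python's `re` engine behaves differently from the service's regex engine, and `budget_ms` here applies to the whole sample set rather than to a single match as `RegexTimeoutMs` presumably does. Treat the numbers as relative rankings only.

```python
import re
import time


def time_patterns(patterns, sample_identities, budget_ms=100.0):
    """Time each regex against the samples; return (pattern, elapsed_ms, verdict) rows."""
    rows = []
    for pattern in patterns:
        try:
            compiled = re.compile(pattern)
        except re.error:
            # Patterns that do not even compile are reported rather than raised.
            rows.append((pattern, 0.0, "invalid"))
            continue
        start = time.perf_counter()
        for identity in sample_identities:
            compiled.search(identity)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        rows.append((pattern, elapsed_ms, "slow" if elapsed_ms > budget_ms else "ok"))
    # Slowest patterns first so the worst offenders are at the top.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Patterns near the top of the result are the first candidates for simplification or splitting.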

### High Alert Volume

**Symptom**: `attestor.watchlist.alerts_emitted_total` growing rapidly

**Investigation**:
1. Identify top-triggering entries:
   ```bash
   stella watchlist alerts --since 1h --format json | jq 'group_by(.watchlistEntryId) | map({id: .[0].watchlistEntryId, count: length}) | sort_by(-.count)'
   ```
2. Check whether the pattern is too broad

**Resolution**:
- Narrow pattern scope
- Increase dedup window
- Reduce severity if appropriate

### Database Performance

**Symptom**: Slow list/match queries

**Investigation**:
```sql
EXPLAIN ANALYZE
SELECT * FROM attestor.identity_watchlist
WHERE enabled = true AND (tenant_id = 'tenant-1' OR scope IN ('Global', 'System'));
```

**Resolution**:
- Verify indexes exist:
  ```sql
  SELECT indexname FROM pg_indexes WHERE tablename = 'identity_watchlist';
  ```
- Run `VACUUM ANALYZE` if needed
- Consider partitioning for large deployments

## Deduplication Table Maintenance

### Cleanup Expired Records

Run periodically (daily recommended):
```sql
DELETE FROM attestor.identity_alert_dedup
WHERE last_alert_at < NOW() - INTERVAL '7 days';
```

### Check Dedup Effectiveness

```sql
SELECT
    watchlist_id,
    COUNT(*) AS suppressed_identities,
    SUM(alert_count) AS total_suppressions
FROM attestor.identity_alert_dedup
GROUP BY watchlist_id
ORDER BY total_suppressions DESC
LIMIT 10;
```

## Air-Gap Operation

For environments where PostgreSQL LISTEN/NOTIFY is not available:

1. Set `"Mode": "Polling"` in configuration
2. Adjust `PollingInterval` based on acceptable detection delay (default: 5s)
3. Ensure the database connection pool is large enough for the polling load
4. Monitor for missed entries during polling gaps
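
One way to reason about polling gaps is a checkpointed poll that re-reads a small overlap window on each pass. This is only a sketch of the idea, not how the service is known to work: `poll_once`, the `fetch_since` callable, and the one-second `overlap` are all assumptions, and the approach relies on deduplication to absorb entries that are handled twice.

```python
from datetime import datetime, timedelta


def poll_once(fetch_since, checkpoint, handle, overlap=timedelta(seconds=1)):
    """Fetch entries at or after (checkpoint - overlap), handle them, return the new checkpoint.

    fetch_since(ts) is a hypothetical callable returning (integrated_time, entry) pairs.
    """
    newest = checkpoint
    for integrated_time, entry in fetch_since(checkpoint - overlap):
        handle(entry)  # may re-handle overlapped entries; dedup absorbs the repeats
        if integrated_time > newest:
            newest = integrated_time
    return newest
```

A wider overlap lowers the risk of skipping entries that become visible late, at the cost of more repeated work per poll.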

## Disaster Recovery

### Service Restart

1. Entries are processed based on `IntegratedTimeUtc`
2. On restart, the service resumes from the last checkpoint
3. Some duplicate alerts may occur during recovery (handled by dedup)

### Database Failover

1. The service retries connections automatically
2. The in-memory pattern cache survives brief outages
3. Long outages may require a service restart

### Watchlist Export/Import

Export:
```bash
stella watchlist list --include-global --format json > watchlist-backup.json
```

Import (manual):
```bash
# Process each entry and recreate it
jq -c '.[]' watchlist-backup.json | while IFS= read -r entry; do
  # Extract fields from "$entry" and call `stella watchlist add` with them.
  # The printf is a placeholder so the loop body is valid shell.
  printf 'Would recreate: %s\n' "$entry"
done
```

## Metrics Reference

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `attestor.watchlist.entries_scanned_total` | Processing volume | N/A (informational) |
| `attestor.watchlist.matches_total` | Match frequency | > 100/min (review patterns) |
| `attestor.watchlist.alerts_emitted_total` | Alert volume | > 50/min (check notification capacity) |
| `attestor.watchlist.alerts_suppressed_total` | Dedup effectiveness | N/A (a high suppressed-to-emitted ratio indicates dedup is working) |
| `attestor.watchlist.scan_latency_seconds` | Performance | p99 > 50 ms (tune cache/patterns) |
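
The thresholds in the table can be expressed as a small check, which is useful when prototyping alert rules. This is a sketch with assumptions: `evaluate_metrics` is a hypothetical helper, the per-minute rates and the p99 value are expected to be computed by your metrics backend beforehand, and the suppression ratio is defined here as suppressed / (suppressed + emitted).

```python
def evaluate_metrics(matches_per_min, alerts_per_min, scan_latency_p99_s,
                     suppressed, emitted):
    """Compare sampled values against the table's thresholds.

    Returns (findings, suppression_ratio).
    """
    findings = []
    if matches_per_min > 100:
        findings.append("matches_total above 100/min: review patterns")
    if alerts_per_min > 50:
        findings.append("alerts_emitted_total above 50/min: check notification capacity")
    if scan_latency_p99_s > 0.050:  # metric is in seconds; 50 ms = 0.050 s
        findings.append("scan latency p99 above 50 ms: tune cache/patterns")
    total = suppressed + emitted
    # Assumed definition: share of would-be alerts that dedup suppressed.
    suppression_ratio = suppressed / total if total else 0.0
    return findings, suppression_ratio
```

Note the unit conversion: the latency metric is reported in seconds, so the 50 ms threshold is compared as `0.050`.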

## Escalation Contacts

| Severity | Contact | Response SLA |
|----------|---------|--------------|
| Critical | On-call Security | 15 minutes |
| Warning | Security Team | 1 hour |
| Info | Security Analyst | Next business day |

## Related Documents

- [Identity Watchlist User Guide](../modules/attestor/guides/identity-watchlist.md)
- [Attestor Architecture](../modules/attestor/architecture.md)
- [Notification System](../modules/notify/architecture.md)