Notification Rules and Alerting Engine
Version: 1.0 Date: 2025-11-29 Status: Canonical
This advisory defines the product rationale, rules engine semantics, and implementation strategy for the Notify module, covering channel connectors, throttling, digests, and delivery management.
1. Executive Summary
The Notify module provides rules-driven, tenant-aware notification delivery across security workflows. Key capabilities:
- Rules Engine - Declarative matchers for event routing
- Multi-Channel Delivery - Slack, Teams, Email, Webhooks
- Noise Control - Throttling, deduplication, digest windows
- Approval Tokens - DSSE-signed ack tokens for one-click workflows
- Audit Trail - Complete delivery history with redacted payloads
2. Market Drivers
2.1 Target Segments
| Segment | Notification Requirements | Use Case |
|---|---|---|
| Security Teams | Real-time critical alerts | Incident response |
| DevSecOps | CI/CD integration | Pipeline notifications |
| Compliance | Audit trails | Delivery verification |
| Management | Digest summaries | Executive reporting |
2.2 Competitive Positioning
Most vulnerability tools offer basic email alerts. Stella Ops differentiates with:
- Rules-based routing with fine-grained matchers
- Native Slack/Teams integration with rich formatting
- Digest windows to prevent alert fatigue
- Cryptographic ack tokens for approval workflows
- Tenant isolation with quota controls
3. Rules Engine
3.1 Rule Structure
name: "critical-alerts-prod"
enabled: true
tenant: "acme-corp"
match:
  eventKinds:
    - "scanner.report.ready"
    - "scheduler.rescan.delta"
    - "zastava.admission"
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"
  kev: true
  verdict: ["fail", "deny"]
  vex:
    includeRejectedJustifications: false
actions:
  - channel: "slack:sec-alerts"
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"
3.2 Matcher Types
| Matcher | Description | Example |
|---|---|---|
| eventKinds | Event type filter | ["scanner.report.ready"] |
| namespaces | Namespace patterns | ["prod-*", "staging"] |
| repos | Repository patterns | ["ghcr.io/acme/*"] |
| minSeverity | Minimum severity | "high" |
| kev | KEV-tagged required | true |
| verdict | Report/admission verdict | ["fail", "deny"] |
| labels | Kubernetes labels | {"env": "production"} |
3.3 Evaluation Order
- Tenant check - Discard if rule tenant ≠ event tenant
- Kind filter - Early discard for non-matching kinds
- Scope match - Namespace/repo/label matching
- Delta gates - Severity threshold evaluation
- VEX gate - Filter based on VEX status
- Throttle/dedup - Idempotency key check
- Actions - Enqueue per-channel jobs
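The first four steps of this order can be sketched as a single predicate that runs the cheapest checks first. This is a minimal illustration, not the module's implementation: the `Rule` class, the severity ranking, the `payload.maxSeverity` field, and the "empty pattern list matches anything" behavior are all assumptions. The VEX gate, throttling, and action enqueueing are omitted.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional

# Severity ordering is an assumption; the document only names "high" and "critical".
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Rule:
    """Hypothetical in-memory form of a rule's tenant and match section."""
    tenant: str
    event_kinds: List[str]
    namespaces: List[str] = field(default_factory=list)
    repos: List[str] = field(default_factory=list)
    min_severity: Optional[str] = None


def matches(rule: Rule, event: dict) -> bool:
    # 1. Tenant check: discard events that belong to another tenant.
    if rule.tenant != event["tenant"]:
        return False
    # 2. Kind filter: early discard for non-matching kinds.
    if event["kind"] not in rule.event_kinds:
        return False
    # 3. Scope match: glob patterns; an empty list matches anything (assumption).
    scope = event.get("scope", {})
    namespace = scope.get("namespace", "")
    repo = scope.get("repo", "")
    if rule.namespaces and not any(fnmatchcase(namespace, p) for p in rule.namespaces):
        return False
    if rule.repos and not any(fnmatchcase(repo, p) for p in rule.repos):
        return False
    # 4. Delta gate: compare the event's highest severity with the threshold.
    if rule.min_severity is not None:
        severity = event.get("payload", {}).get("maxSeverity", "low")  # hypothetical field
        if SEVERITY_RANK.get(severity, 0) < SEVERITY_RANK[rule.min_severity]:
            return False
    return True
```

With shell-style globbing as assumed here, the pattern `prod-*` matches `prod-eu` but not `production`, so pattern choice matters when namespaces do not share a hyphenated prefix.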
4. Channel Connectors
4.1 Built-in Channels
| Channel | Features | Rate Limits |
|---|---|---|
| Slack | Blocks, threads, reactions | 1 msg/sec per channel |
| Teams | Adaptive Cards, webhooks | 4 msgs/sec |
| Email | HTML+text, attachments | Relay-dependent |
| Webhook | JSON, HMAC signing | 10 req/sec |
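One common way to honor per-channel limits like these is a token bucket per connector. The sketch below is an assumption about how such limits could be enforced, not a description of the Notify implementation; the `TokenBucket` class and the `LIMITS` keys are illustrative names.

```python
import time


class TokenBucket:
    """Token bucket: refills `rate` tokens per second, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so tests can control time
        self.tokens = capacity      # start full so the first send is not delayed
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Rates in messages per second, taken from the table above (keys are illustrative).
LIMITS = {"slack": 1.0, "teams": 4.0, "webhook": 10.0}
```

A sender would call `try_acquire()` before each delivery and requeue the job when it returns `False`.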
4.2 Channel Configuration
channels:
  - name: "slack:sec-alerts"
    type: slack
    config:
      channel: "#security-alerts"
      workspace: "acme-corp"
    secretRef: "ref://notify/slack-token"
  - name: "email:soc"
    type: email
    config:
      to: ["soc@acme.com"]
      from: "stellaops@acme.com"
      smtpHost: "smtp.acme.com"
    secretRef: "ref://notify/smtp-creds"
  - name: "webhook:siem"
    type: webhook
    config:
      url: "https://siem.acme.com/api/events"
      signMethod: "ed25519"
      signKeyRef: "ref://notify/webhook-key"
4.3 Connector Contract
public interface INotifyConnector
{
    string Type { get; }
    Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
    Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
5. Noise Control
5.1 Throttling
- Per-action throttle - Suppress duplicates within window
- Idempotency key - hash(ruleId | actionId | event.kind | scope.digest | day)
- Configurable windows - 5m, 15m, 1h, 1d
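The key derivation and window check can be sketched as follows. SHA-256, the `|` join, the UTC day bucket, and the in-memory `Throttle` store are assumptions made for illustration; the document specifies only the fields that feed the hash.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(rule_id: str, action_id: str, kind: str, digest: str, ts: datetime) -> str:
    """Hash the fields listed above; `day` is the event's UTC calendar day."""
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    raw = "|".join([rule_id, action_id, kind, digest, day])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Throttle:
    """Suppress repeats of the same key inside a fixed window (in-memory sketch)."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self._last_sent = {}  # key -> time of the last delivery that was allowed

    def allow(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate within the window: suppress
        self._last_sent[key] = now
        return True
```

Two events for the same rule, action, kind, and image digest on the same UTC day produce the same key, so the second is suppressed until the window (for example 300 seconds for `5m`) has elapsed.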
5.2 Digest Windows
actions:
  - channel: "email:weekly-summary"
    digest: "weekly"
    digestOptions:
      maxItems: 100
      groupBy: ["severity", "namespace"]
    template: "digest-summary"
Behavior:
- Coalesce events within window
- Summarize top N items with counts
- Flush on window close or max items
- Safe truncation with "and X more" links
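The coalescing behavior above can be sketched as a pure function over the events buffered for one window. The function name, the output shape, and the flat event dictionaries are assumptions for illustration only.

```python
from collections import Counter
from typing import List


def build_digest(events: List[dict], group_by: List[str], max_items: int) -> dict:
    """Coalesce buffered events into grouped counts plus a truncated item list."""
    # Count events per combination of the groupBy fields.
    counts = Counter(tuple(e.get(k) for k in group_by) for e in events)
    shown = events[:max_items]
    remaining = len(events) - len(shown)
    return {
        "groups": [
            {**dict(zip(group_by, key)), "count": n}
            for key, n in counts.most_common()  # largest groups first
        ],
        "items": shown,
        # Safe truncation: tell the reader how many items were left out.
        "footer": f"and {remaining} more" if remaining > 0 else "",
    }
```

A scheduler would call this when the window closes or when the buffer reaches `maxItems`, then hand the result to the digest template.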
5.3 Quiet Hours
notify:
  quietHours:
    enabled: true
    window: "22:00-06:00"
    timezone: "America/New_York"
    minSeverity: "critical"
During quiet hours, only alerts at or above minSeverity (critical in this example) are delivered immediately; all others are deferred to digests.
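A window such as 22:00-06:00 wraps past midnight, which is the main pitfall when checking it. The sketch below shows one way to handle that; the function names are hypothetical, and the caller is assumed to pass a timezone object for the configured zone (for example `zoneinfo.ZoneInfo("America/New_York")`).

```python
from datetime import datetime, time, timedelta, timezone, tzinfo
from typing import Tuple


def parse_window(window: str) -> Tuple[time, time]:
    """Split "HH:MM-HH:MM" into start and end times."""
    start, end = window.split("-")
    return time.fromisoformat(start), time.fromisoformat(end)


def in_quiet_hours(now: datetime, window: str, tz: tzinfo) -> bool:
    """Return True if `now`, converted to `tz`, falls inside the window."""
    start, end = parse_window(window)
    local = now.astimezone(tz).time()
    if start <= end:
        # Window within a single day, e.g. 01:00-05:00.
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00-06:00.
    return local >= start or local < end
```

An alert would then be sent immediately when `in_quiet_hours(...)` is false or its severity meets `minSeverity`, and queued for the next digest otherwise.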
6. Templates & Rendering
6.1 Template Engine
- Handlebars-style safe templates
- No arbitrary code execution
- Deterministic outputs (stable property order)
- Locale-aware formatting
6.2 Template Variables
| Variable | Description |
|---|---|
| event.kind | Event type |
| event.ts | Timestamp |
| scope.namespace | Kubernetes namespace |
| scope.repo | Repository |
| scope.digest | Image digest |
| payload.verdict | Policy verdict |
| payload.delta.newCritical | New critical count |
| payload.links.ui | UI deep link |
| topFindings[] | Top N findings |
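To illustrate how dotted variables like these can be resolved without executing arbitrary code, here is a small stand-in renderer. It is not Handlebars and not the module's engine: it only substitutes `{{path}}` placeholders by dictionary lookup, renders missing paths as empty strings (an assumption), and performs no HTML escaping.

```python
import re

# Matches {{ dotted.path }} with optional surrounding whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(context: dict, path: str):
    """Walk a dotted path through nested dicts; missing keys yield ""."""
    value = context
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return ""
        value = value[part]
    return value


def render(template: str, context: dict) -> str:
    """Substitute placeholders by lookup only; no expressions are evaluated."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), template)
```

Because the renderer only reads values from the context, the same template and context always produce the same output.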
6.3 Channel-Specific Rendering
Slack:
{
  "blocks": [
    {"type": "header", "text": {"type": "plain_text", "text": "Policy FAIL: nginx:latest"}},
    {"type": "section", "text": {"type": "mrkdwn", "text": "*2 critical*, 3 high vulnerabilities"}}
  ]
}
Email:
<h2>Policy FAIL: nginx:latest</h2>
<table>
  <tr><td>Critical</td><td>2</td></tr>
  <tr><td>High</td><td>3</td></tr>
</table>
<a href="https://ui.internal/reports/...">View Details</a>
7. Ack Tokens
7.1 Token Structure
DSSE-signed tokens for one-click acknowledgements:
{
  "payloadType": "application/vnd.stellaops.notify-ack-token+json",
  "payload": {
    "tenant": "acme-corp",
    "deliveryId": "delivery-123",
    "notificationId": "notif-456",
    "channel": "slack:sec-alerts",
    "webhookUrl": "https://notify.internal/ack",
    "nonce": "random-nonce",
    "actions": ["acknowledge", "escalate"],
    "expiresAt": "2025-11-29T13:00:00Z"
  },
  "signatures": [{"keyid": "notify-ack-key-01", "sig": "..."}]
}
7.2 Token Workflow
- Issue - POST /notify/ack-tokens/issue
- Embed - Token included in message action button
- Click - User clicks button, token sent to webhook
- Verify - POST /notify/ack-tokens/verify
- Audit - Ack event recorded
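The verify step can be sketched as a signature check over the DSSE pre-authentication encoding (PAE) followed by claim checks. Two loud assumptions apply: HMAC-SHA256 stands in for the real signing algorithm so the example needs only the standard library, and the payload is handled as raw JSON bytes (a DSSE envelope normally carries it base64-encoded). Nonce replay tracking is omitted, and the function names are hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE PAE: "DSSEv1" SP len(type) SP type SP len(body) SP body."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)


def sign(key: bytes, payload_type: str, payload: bytes) -> str:
    # Stand-in signature (assumption): HMAC-SHA256 over the PAE bytes.
    return hmac.new(key, pae(payload_type, payload), hashlib.sha256).hexdigest()


def verify(key: bytes, payload_type: str, payload: bytes, sig: str,
           action: str, now: datetime) -> bool:
    """Check the signature first, then expiry and the requested action."""
    expected = sign(key, payload_type, payload)
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return False
    claims = json.loads(payload)
    expires = datetime.fromisoformat(claims["expiresAt"].replace("Z", "+00:00"))
    return now < expires and action in claims["actions"]
```

Signing the PAE rather than the bare payload binds the payload type into the signature, so a token cannot be replayed under a different `payloadType`.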
7.3 Token Rotation
# Rotate ack token signing key
stella notify rotate-ack-key --key-source kms://notify/ack-key
8. Implementation Strategy
8.1 Phase 1: Core Engine (Complete)
- Rules engine with matchers
- Slack connector
- Teams connector
- Email connector
- Webhook connector
8.2 Phase 2: Noise Control (Complete)
- Throttling
- Digest windows
- Idempotency
- Quiet hours
8.3 Phase 3: Ack Tokens (In Progress)
- Token issuance
- Token verification
- Token rotation API (NOTIFY-ACK-45-001)
- Escalation workflows (NOTIFY-ESC-46-001)
8.4 Phase 4: Advanced Features (Planned)
- PagerDuty connector
- Jira ticket creation
- In-app notifications
- Anomaly suppression
9. API Surface
9.1 Channels
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/v1/notify/channels | GET/POST | notify.read/admin | List/create channels |
| /api/v1/notify/channels/{id} | GET/PATCH/DELETE | notify.admin | Manage channel |
| /api/v1/notify/channels/{id}/test | POST | notify.admin | Send test message |
| /api/v1/notify/channels/{id}/health | GET | notify.read | Health check |
9.2 Rules
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/v1/notify/rules | GET/POST | notify.read/admin | List/create rules |
| /api/v1/notify/rules/{id} | GET/PATCH/DELETE | notify.admin | Manage rule |
| /api/v1/notify/rules/{id}/test | POST | notify.admin | Dry-run rule |
9.3 Deliveries
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/v1/notify/deliveries | GET | notify.read | List deliveries |
| /api/v1/notify/deliveries/{id} | GET | notify.read | Delivery detail |
| /api/v1/notify/deliveries/{id}/retry | POST | notify.admin | Retry delivery |
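As a usage illustration, a client could build a retry request like this. The base URL and the bearer-token authorization header are assumptions; the document lists only the paths, methods, and scopes. The request is constructed but not sent.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://notify.internal"  # hypothetical host


def retry_delivery_request(delivery_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) POST /api/v1/notify/deliveries/{id}/retry."""
    safe_id = urllib.parse.quote(delivery_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/notify/deliveries/{safe_id}/retry",
        method="POST",
        headers={
            # Assumed auth scheme; the token would need the notify.admin scope.
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would perform the call.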
10. Event Sources
10.1 Subscribed Events
| Event | Source | Typical Actions |
|---|---|---|
| scanner.scan.completed | Scanner | Immediate/digest |
| scanner.report.ready | Scanner | Immediate |
| scheduler.rescan.delta | Scheduler | Immediate/digest |
| attestor.logged | Attestor | Immediate |
| zastava.admission | Zastava | Immediate |
| concelier.export.completed | Concelier | Digest |
| excititor.export.completed | Excititor | Digest |
10.2 Event Envelope
{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "acme-corp",
  "ts": "2025-11-29T12:00:00Z",
  "actor": "scanner-webservice",
  "scope": {
    "namespace": "production",
    "repo": "ghcr.io/acme/api",
    "digest": "sha256:..."
  },
  "payload": {
    "reportId": "report-123",
    "verdict": "fail",
    "summary": {"total": 12, "blocked": 2},
    "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]}
  }
}
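A consumer might validate the envelope before rule evaluation. The sketch below is illustrative: the `Event` class and the choice of required fields are assumptions, since the document shows an example envelope rather than a schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Fields treated as mandatory in this sketch (assumption).
REQUIRED = ("eventId", "kind", "tenant", "ts", "scope", "payload")


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str
    tenant: str
    ts: datetime
    scope: dict
    payload: dict


def parse_event(raw: dict) -> Event:
    """Reject envelopes with missing fields and normalize the timestamp."""
    missing = [k for k in REQUIRED if k not in raw]
    if missing:
        raise ValueError(f"missing envelope fields: {', '.join(missing)}")
    return Event(
        event_id=raw["eventId"],
        kind=raw["kind"],
        tenant=raw["tenant"],
        # Accept the trailing "Z" used in the example timestamp.
        ts=datetime.fromisoformat(raw["ts"].replace("Z", "+00:00")),
        scope=raw["scope"],
        payload=raw["payload"],
    )
```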
11. Observability
11.1 Metrics
- notify.events_consumed_total{kind}
- notify.rules_matched_total{ruleId}
- notify.throttled_total{reason}
- notify.digest_coalesced_total{window}
- notify.sent_total{channel}
- notify.failed_total{channel,code}
- notify.delivery_latency_seconds{channel}
11.2 SLO Targets
| Metric | Target |
|---|---|
| Event-to-delivery p95 | < 60 seconds |
| Failure rate | < 0.5% per hour |
| Duplicate rate | < 0.01% |
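To make the p95 target concrete, here is a sketch that checks a batch of latency samples using the nearest-rank percentile method. The choice of method and the function names are assumptions; a metrics backend would typically compute this from histogram buckets instead.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the value at position ceil(p/100 * n), 1-based."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(latencies_seconds: Sequence[float], target: float = 60.0) -> bool:
    """True when the 95th percentile event-to-delivery latency is under target."""
    return percentile_nearest_rank(latencies_seconds, 95) < target
```

With 20 samples, the rank is 19, so a single slow delivery out of 20 does not breach the target but two do.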
12. Security Considerations
12.1 Secret Management
- Secrets stored as references only
- Just-in-time fetch at send time
- No plaintext secrets stored in MongoDB
12.2 Webhook Signing
X-StellaOps-Signature: t=1764417600,v1=abc123...
X-StellaOps-Timestamp: 2025-11-29T12:00:00Z
- HMAC-SHA256 or Ed25519
- Replay window protection
- Canonical body hash
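A receiver-side check for the HMAC-SHA256 variant might look like the sketch below. The signed string (`timestamp.sha256(body)`), the 300-second tolerance, and the function names are assumptions; the document specifies only the header shape, the algorithms, and that a canonical body hash and replay window are involved.

```python
import hashlib
import hmac


def _mac(secret: bytes, body: bytes, timestamp: int) -> str:
    # Sign "<timestamp>.<sha256(body)>" (assumed layout).
    body_hash = hashlib.sha256(body).hexdigest()
    message = f"{timestamp}.{body_hash}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_webhook(secret: bytes, body: bytes, timestamp: int) -> str:
    """Produce the signature header value: t=<unix seconds>,v1=<hex mac>."""
    return f"t={timestamp},v1={_mac(secret, body, timestamp)}"


def verify_webhook(secret: bytes, body: bytes, header: str, now: int,
                   tolerance: int = 300) -> bool:
    try:
        fields = dict(part.split("=", 1) for part in header.split(","))
        timestamp = int(fields["t"])
        received = fields["v1"]
    except (KeyError, ValueError):
        return False  # malformed header
    # Replay window: reject signatures that are too old or too far ahead.
    if abs(now - timestamp) > tolerance:
        return False
    expected = _mac(secret, body, timestamp)
    return hmac.compare_digest(expected.encode(), received.encode())
```

Including the timestamp in the signed string prevents an attacker from replaying an old body with a fresh `t` value.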
12.3 Loop Prevention
- Webhook target allowlist
- Event origin tags
- Own webhooks rejected
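The allowlist and self-rejection checks can be sketched as a single guard applied before a webhook channel is saved or used. The function name, the exact-host matching, and the HTTPS-only requirement are assumptions; event origin tags are not covered here.

```python
from typing import Set
from urllib.parse import urlsplit


def is_allowed_target(url: str, allowlist: Set[str], own_hosts: Set[str]) -> bool:
    """Allow only HTTPS targets on the allowlist that are not Notify itself."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    # Reject our own endpoints even if they were allowlisted by mistake.
    if host in own_hosts:
        return False
    return host in allowlist
```

Checking `own_hosts` before the allowlist means a misconfigured allowlist cannot reintroduce a delivery loop.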
13. Related Documentation
| Resource | Location |
|---|---|
| Notify architecture | docs/modules/notify/architecture.md |
| Channel schemas | docs/modules/notify/resources/schemas/ |
| Sample payloads | docs/modules/notify/resources/samples/ |
| Bootstrap pack | docs/modules/notify/bootstrap-pack.md |
14. Sprint Mapping
- Primary Sprint: SPRINT_0170_0001_0001_notify_engine.md (NEW)
- Related Sprints:
- SPRINT_0171_0001_0002_notify_connectors.md
- SPRINT_0172_0001_0003_notify_ack_tokens.md
Key Task IDs:
- NOTIFY-ENGINE-40-001 - Rules engine (DONE)
- NOTIFY-CONN-41-001 - Connectors (DONE)
- NOTIFY-NOISE-42-001 - Throttling/digests (DONE)
- NOTIFY-ACK-45-001 - Token rotation (IN PROGRESS)
- NOTIFY-ESC-46-001 - Escalation workflows (TODO)
15. Success Metrics
| Metric | Target |
|---|---|
| Delivery latency | < 60s p95 |
| Delivery success rate | > 99.5% |
| Duplicate rate | < 0.01% |
| Rule evaluation time | < 10ms |
| Channel health | 99.9% uptime |
Last updated: 2025-11-29