true the date

This commit is contained in:
StellaOps Bot
2025-11-30 19:23:21 +02:00
parent 71e9a56cfd
commit 0bef705bcc
14 changed files with 0 additions and 0 deletions

# Notification Rules and Alerting Engine
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, rules engine semantics, and implementation strategy for the Notify module, covering channel connectors, throttling, digests, and delivery management.
---
## 1. Executive Summary
The Notify module provides **rules-driven, tenant-aware notification delivery** across security workflows. Key capabilities:
- **Rules Engine** - Declarative matchers for event routing
- **Multi-Channel Delivery** - Slack, Teams, Email, Webhooks
- **Noise Control** - Throttling, deduplication, digest windows
- **Approval Tokens** - DSSE-signed ack tokens for one-click workflows
- **Audit Trail** - Complete delivery history with redacted payloads
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Notification Requirements | Use Case |
|---------|--------------------------|----------|
| **Security Teams** | Real-time critical alerts | Incident response |
| **DevSecOps** | CI/CD integration | Pipeline notifications |
| **Compliance** | Audit trails | Delivery verification |
| **Management** | Digest summaries | Executive reporting |
### 2.2 Competitive Positioning
Most vulnerability tools offer basic email alerts. Stella Ops differentiates with:
- **Rules-based routing** with fine-grained matchers
- **Native Slack/Teams integration** with rich formatting
- **Digest windows** to prevent alert fatigue
- **Cryptographic ack tokens** for approval workflows
- **Tenant isolation** with quota controls
---
## 3. Rules Engine
### 3.1 Rule Structure
```yaml
name: "critical-alerts-prod"
enabled: true
tenant: "acme-corp"
match:
  eventKinds:
    - "scanner.report.ready"
    - "scheduler.rescan.delta"
    - "zastava.admission"
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"
  kev: true
  verdict: ["fail", "deny"]
  vex:
    includeRejectedJustifications: false
actions:
  - channel: "slack:sec-alerts"
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"
```
### 3.2 Matcher Types
| Matcher | Description | Example |
|---------|-------------|---------|
| `eventKinds` | Event type filter | `["scanner.report.ready"]` |
| `namespaces` | Namespace patterns | `["prod-*", "staging"]` |
| `repos` | Repository patterns | `["ghcr.io/acme/*"]` |
| `minSeverity` | Minimum severity | `"high"` |
| `kev` | KEV-tagged required | `true` |
| `verdict` | Report/admission verdict | `["fail", "deny"]` |
| `labels` | Kubernetes labels | `{"env": "production"}` |
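
The matchers above can be read as a conjunction of optional filters. The following is a minimal Python sketch of that reading, not the module's actual implementation: glob semantics are borrowed from `fnmatch`, and the severity ordering, `payload.maxSeverity`, and `scope.labels` are assumptions introduced only for illustration.

```python
from fnmatch import fnmatchcase

# Assumed severity ordering; the source only names "high" and "critical".
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def any_glob(patterns, value):
    """True when no patterns are configured or at least one glob matches."""
    return not patterns or any(fnmatchcase(value, p) for p in patterns)


def matches(match, event):
    """Apply each configured matcher; an absent matcher does not filter."""
    scope = event.get("scope", {})
    payload = event.get("payload", {})
    if match.get("eventKinds") and event["kind"] not in match["eventKinds"]:
        return False
    if not any_glob(match.get("namespaces"), scope.get("namespace", "")):
        return False
    if not any_glob(match.get("repos"), scope.get("repo", "")):
        return False
    min_sev = match.get("minSeverity")
    # payload.maxSeverity is a hypothetical field holding the worst finding.
    if min_sev and SEVERITY_RANK.get(payload.get("maxSeverity"), 0) < SEVERITY_RANK[min_sev]:
        return False
    # kev: true requires at least one KEV-tagged entry in payload.delta.kev.
    if match.get("kev") and not payload.get("delta", {}).get("kev"):
        return False
    if match.get("verdict") and payload.get("verdict") not in match["verdict"]:
        return False
    # scope.labels is a hypothetical field; every configured label must be equal.
    labels = scope.get("labels", {})
    return all(labels.get(k) == v for k, v in match.get("labels", {}).items())
```

Treating an omitted matcher as "match everything" keeps short rules usable; whether the real engine does the same is not stated in this advisory.
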
### 3.3 Evaluation Order
1. **Tenant check** - Discard if rule tenant ≠ event tenant
2. **Kind filter** - Early discard for non-matching kinds
3. **Scope match** - Namespace/repo/label matching
4. **Delta gates** - Severity threshold evaluation
5. **VEX gate** - Filter based on VEX status
6. **Throttle/dedup** - Idempotency key check
7. **Actions** - Enqueue per-channel jobs
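
The ordering above puts the cheapest checks first, so most events are discarded before any throttle state is touched. A minimal Python sketch of the seven stages follows; it is an illustration under assumptions, not the engine's code. The `payload.maxSeverity` and `payload.vexStatus` fields, the severity ordering, and the in-memory `seen_keys` set are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed ordering; the source names only "high" and "critical".
RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def evaluate(rule, event, seen_keys):
    """Walk the stages in order; return (stage_reached, jobs)."""
    match, scope, payload = rule["match"], event["scope"], event["payload"]
    # 1. Tenant check
    if rule["tenant"] != event["tenant"]:
        return "tenant", []
    # 2. Kind filter
    if event["kind"] not in match["eventKinds"]:
        return "kind", []
    # 3. Scope match (namespace globs only, to keep the sketch short)
    if not any(fnmatchcase(scope["namespace"], p) for p in match.get("namespaces", ["*"])):
        return "scope", []
    # 4. Delta gate (hypothetical payload.maxSeverity field)
    if RANK.get(payload.get("maxSeverity"), 0) < RANK[match.get("minSeverity", "low")]:
        return "delta", []
    # 5. VEX gate (hypothetical payload.vexStatus field)
    if payload.get("vexStatus") == "not_affected":
        return "vex", []
    # 6. Throttle/dedup, then 7. enqueue one job per surviving action
    jobs = []
    for action in rule["actions"]:
        key = (rule["name"], action["channel"], event["kind"], scope["digest"])
        if key not in seen_keys:
            seen_keys.add(key)
            jobs.append({"channel": action["channel"], "eventId": event["eventId"]})
    return ("actions" if jobs else "throttle"), jobs
```

Returning the stage name alongside the jobs makes it straightforward to label a counter such as `notify.throttled_total{reason}`.
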
---
## 4. Channel Connectors
### 4.1 Built-in Channels
| Channel | Features | Rate Limits |
|---------|----------|-------------|
| **Slack** | Blocks, threads, reactions | 1 msg/sec per channel |
| **Teams** | Adaptive Cards, webhooks | 4 msgs/sec |
| **Email** | HTML+text, attachments | Relay-dependent |
| **Webhook** | JSON, HMAC signing | 10 req/sec |
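
One common way to honor per-channel limits like those in the table is a token bucket per channel. The sketch below is an assumption about how such a limiter could look, not a description of the connectors' internals; the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allows `rate_per_sec` sends per second with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity=1.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so the first send is not delayed
        self.updated = 0.0

    def try_acquire(self, now):
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Rates taken from the table above; burst capacities are assumptions.
BUCKETS = {
    "slack": TokenBucket(1.0, capacity=1.0),
    "teams": TokenBucket(4.0, capacity=4.0),
    "webhook": TokenBucket(10.0, capacity=10.0),
}
```

A send that fails `try_acquire` would typically be re-queued with a short delay rather than dropped.
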
### 4.2 Channel Configuration
```yaml
channels:
  - name: "slack:sec-alerts"
    type: slack
    config:
      channel: "#security-alerts"
      workspace: "acme-corp"
    secretRef: "ref://notify/slack-token"
  - name: "email:soc"
    type: email
    config:
      to: ["soc@acme.com"]
      from: "stellaops@acme.com"
      smtpHost: "smtp.acme.com"
    secretRef: "ref://notify/smtp-creds"
  - name: "webhook:siem"
    type: webhook
    config:
      url: "https://siem.acme.com/api/events"
      signMethod: "ed25519"
      signKeyRef: "ref://notify/webhook-key"
```
### 4.3 Connector Contract
```csharp
public interface INotifyConnector
{
    string Type { get; }
    Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
    Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
```
---
## 5. Noise Control
### 5.1 Throttling
- **Per-action throttle** - Suppress duplicates within window
- **Idempotency key** - `hash(ruleId | actionId | event.kind | scope.digest | day)`
- **Configurable windows** - 5m, 15m, 1h, 1d
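
The key formula and window check above can be sketched as follows. This is a hedged illustration: the advisory says only `hash(...)`, so SHA-256, the `|` separator, and the UTC day bucket are assumptions, as is the in-memory `Throttle` store.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# The windows listed above, expressed as durations.
WINDOWS = {
    "5m": timedelta(minutes=5),
    "15m": timedelta(minutes=15),
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
}


def idempotency_key(rule_id, action_id, kind, digest, ts):
    """hash(ruleId | actionId | event.kind | scope.digest | day), SHA-256 assumed."""
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    raw = "|".join([rule_id, action_id, kind, digest, day])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Throttle:
    """Remembers the last send per key and suppresses repeats inside the window."""

    def __init__(self):
        self._last_sent = {}

    def allow(self, key, window, now):
        last = self._last_sent.get(key)
        if last is not None and now - last < WINDOWS[window]:
            return False  # duplicate within the throttle window
        self._last_sent[key] = now
        return True
```

Because the day is part of the key, an otherwise identical event on the next UTC day produces a new key and is delivered again.
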
### 5.2 Digest Windows
```yaml
actions:
  - channel: "email:weekly-summary"
    digest: "weekly"
    digestOptions:
      maxItems: 100
      groupBy: ["severity", "namespace"]
    template: "digest-summary"
```
**Behavior:**
- Coalesce events within window
- Summarize top N items with counts
- Flush on window close or max items
- Safe truncation with "and X more" links
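
The summarization and truncation steps above might look like the sketch below. It assumes events are flat dictionaries carrying the `groupBy` fields directly, and it covers only the flush-time summary, not window scheduling; the function name and output shape are hypothetical.

```python
from collections import Counter


def build_digest(events, max_items=100, group_by=("severity", "namespace")):
    """Group coalesced events, keep the first max_items, and note the remainder."""
    counts = Counter(tuple(e.get(f) for f in group_by) for e in events)
    groups = [
        {"group": dict(zip(group_by, key)), "count": n}
        for key, n in counts.most_common()  # largest groups first
    ]
    shown = events[:max_items]
    remaining = len(events) - len(shown)
    footer = f"and {remaining} more" if remaining > 0 else ""
    return {"groups": groups, "items": shown, "footer": footer}
```

Counting over all events while listing only the first `max_items` keeps the group totals accurate even when the item list is truncated.
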
### 5.3 Quiet Hours
```yaml
notify:
  quietHours:
    enabled: true
    window: "22:00-06:00"
    timezone: "America/New_York"
    minSeverity: "critical"
```
During quiet hours, only alerts at or above `minSeverity` (here `critical`) are delivered immediately; all other alerts are deferred to digests.
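
A quiet-hours window such as `22:00-06:00` wraps past midnight, which is easy to get wrong. The sketch below shows one way to handle it; the `route` function, its return values, and the severity ordering are assumptions for illustration rather than the module's API.

```python
from datetime import datetime, time, timedelta, timezone

# Assumed ordering; the source names only "high" and "critical".
RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def in_window(local, window):
    """True if local time falls inside "HH:MM-HH:MM", including wrapped windows."""
    start_s, end_s = window.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window wraps past midnight


def route(severity, now_utc, cfg, tz):
    """Return "send" or "defer" for an alert under the quietHours config."""
    if not cfg.get("enabled"):
        return "send"
    local = now_utc.astimezone(tz).time()
    if in_window(local, cfg["window"]) and RANK[severity] < RANK[cfg["minSeverity"]]:
        return "defer"  # held back for the next digest
    return "send"
```

In practice `tz` would be built from the configured name, for example `zoneinfo.ZoneInfo("America/New_York")`, so that daylight saving transitions are handled by the time zone database.
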
---
## 6. Templates & Rendering
### 6.1 Template Engine
- Handlebars-style safe templates
- No arbitrary code execution
- Deterministic outputs (stable property order)
- Locale-aware formatting
### 6.2 Template Variables
| Variable | Description |
|----------|-------------|
| `event.kind` | Event type |
| `event.ts` | Timestamp |
| `scope.namespace` | Kubernetes namespace |
| `scope.repo` | Repository |
| `scope.digest` | Image digest |
| `payload.verdict` | Policy verdict |
| `payload.delta.newCritical` | New critical count |
| `payload.links.ui` | UI deep link |
| `topFindings[]` | Top N findings |
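
To illustrate how dotted variables such as `payload.delta.newCritical` can be resolved without executing any code, here is a deliberately small Python sketch. It is not a Handlebars implementation: it supports only `{{ dotted.path }}` lookups, and the choice to render missing values as an empty string is an assumption.

```python
import re

# Matches {{ dotted.path }} placeholders; nothing else is interpreted.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(context, path):
    """Follow a dotted path through nested dicts; missing keys yield ""."""
    value = context
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return ""
        value = value[part]
    return value


def render(template, context):
    """Substitute placeholders with looked-up values; no expressions are evaluated."""
    return _PLACEHOLDER.sub(lambda m: str(resolve(context, m.group(1))), template)
```

Because rendering is pure lookup and substitution, the same template and context always produce the same output.
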
### 6.3 Channel-Specific Rendering
**Slack:**
```json
{
  "blocks": [
    {"type": "header", "text": {"type": "plain_text", "text": "Policy FAIL: nginx:latest"}},
    {"type": "section", "text": {"type": "mrkdwn", "text": "*2 critical*, 3 high vulnerabilities"}}
  ]
}
```
**Email:**
```html
<h2>Policy FAIL: nginx:latest</h2>
<table>
<tr><td>Critical</td><td>2</td></tr>
<tr><td>High</td><td>3</td></tr>
</table>
<a href="https://ui.internal/reports/...">View Details</a>
```
---
## 7. Ack Tokens
### 7.1 Token Structure
DSSE-signed tokens for one-click acknowledgements:
```json
{
  "payloadType": "application/vnd.stellaops.notify-ack-token+json",
  "payload": {
    "tenant": "acme-corp",
    "deliveryId": "delivery-123",
    "notificationId": "notif-456",
    "channel": "slack:sec-alerts",
    "webhookUrl": "https://notify.internal/ack",
    "nonce": "random-nonce",
    "actions": ["acknowledge", "escalate"],
    "expiresAt": "2025-11-29T13:00:00Z"
  },
  "signatures": [{"keyid": "notify-ack-key-01", "sig": "..."}]
}
```
### 7.2 Token Workflow
1. **Issue** - `POST /notify/ack-tokens/issue`
2. **Embed** - Token included in message action button
3. **Click** - User clicks button, token sent to webhook
4. **Verify** - `POST /notify/ack-tokens/verify`
5. **Audit** - Ack event recorded
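
The issue and verify steps can be sketched around the DSSE pre-authentication encoding (PAE). Two caveats apply to this illustration: HMAC-SHA256 stands in for the real signature scheme so that the example needs only the standard library, and the payload is base64-encoded as in the DSSE envelope format rather than shown inline. The function names are hypothetical and do not describe the `/notify/ack-tokens` endpoints.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone

PAYLOAD_TYPE = "application/vnd.stellaops.notify-ack-token+json"


def pae(payload_type, body):
    """DSSE pre-authentication encoding over the payload type and raw body."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(body), body)


def issue(payload, key, keyid):
    """Serialize the payload and sign PAE(type, body). HMAC is a stand-in signer."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    sig = hmac.new(key, pae(PAYLOAD_TYPE, body), hashlib.sha256).digest()
    return {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(body).decode("ascii"),
        "signatures": [{"keyid": keyid, "sig": base64.b64encode(sig).decode("ascii")}],
    }


def verify(envelope, key, action, now):
    """Return the payload if signature, expiry, and action all check out, else None."""
    body = base64.b64decode(envelope["payload"])
    expected = hmac.new(key, pae(envelope["payloadType"], body), hashlib.sha256).digest()
    sig = base64.b64decode(envelope["signatures"][0]["sig"])
    if not hmac.compare_digest(sig, expected):
        return None
    payload = json.loads(body)
    expires = datetime.fromisoformat(payload["expiresAt"].replace("Z", "+00:00"))
    if now >= expires or action not in payload["actions"]:
        return None
    return payload
```

Signing the PAE rather than the bare payload binds the signature to the payload type, so a token cannot be replayed as a different kind of DSSE document.
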
### 7.3 Token Rotation
```bash
# Rotate ack token signing key
stella notify rotate-ack-key --key-source kms://notify/ack-key
```
---
## 8. Implementation Strategy
### 8.1 Phase 1: Core Engine (Complete)
- [x] Rules engine with matchers
- [x] Slack connector
- [x] Teams connector
- [x] Email connector
- [x] Webhook connector
### 8.2 Phase 2: Noise Control (Complete)
- [x] Throttling
- [x] Digest windows
- [x] Idempotency
- [x] Quiet hours
### 8.3 Phase 3: Ack Tokens (In Progress)
- [x] Token issuance
- [x] Token verification
- [ ] Token rotation API (NOTIFY-ACK-45-001)
- [ ] Escalation workflows (NOTIFY-ESC-46-001)
### 8.4 Phase 4: Advanced Features (Planned)
- [ ] PagerDuty connector
- [ ] Jira ticket creation
- [ ] In-app notifications
- [ ] Anomaly suppression
---
## 9. API Surface
### 9.1 Channels
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/notify/channels` | GET/POST | `notify.read` (GET), `notify.admin` (POST) | List/create channels |
| `/api/v1/notify/channels/{id}` | GET/PATCH/DELETE | `notify.admin` | Manage channel |
| `/api/v1/notify/channels/{id}/test` | POST | `notify.admin` | Send test message |
| `/api/v1/notify/channels/{id}/health` | GET | `notify.read` | Health check |
### 9.2 Rules
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/notify/rules` | GET/POST | `notify.read` (GET), `notify.admin` (POST) | List/create rules |
| `/api/v1/notify/rules/{id}` | GET/PATCH/DELETE | `notify.admin` | Manage rule |
| `/api/v1/notify/rules/{id}/test` | POST | `notify.admin` | Dry-run rule |
### 9.3 Deliveries
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/notify/deliveries` | GET | `notify.read` | List deliveries |
| `/api/v1/notify/deliveries/{id}` | GET | `notify.read` | Delivery detail |
| `/api/v1/notify/deliveries/{id}/retry` | POST | `notify.admin` | Retry delivery |
---
## 10. Event Sources
### 10.1 Subscribed Events
| Event | Source | Typical Actions |
|-------|--------|-----------------|
| `scanner.scan.completed` | Scanner | Immediate/digest |
| `scanner.report.ready` | Scanner | Immediate |
| `scheduler.rescan.delta` | Scheduler | Immediate/digest |
| `attestor.logged` | Attestor | Immediate |
| `zastava.admission` | Zastava | Immediate |
| `concelier.export.completed` | Concelier | Digest |
| `excititor.export.completed` | Excititor | Digest |
### 10.2 Event Envelope
```json
{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "acme-corp",
  "ts": "2025-11-29T12:00:00Z",
  "actor": "scanner-webservice",
  "scope": {
    "namespace": "production",
    "repo": "ghcr.io/acme/api",
    "digest": "sha256:..."
  },
  "payload": {
    "reportId": "report-123",
    "verdict": "fail",
    "summary": {"total": 12, "blocked": 2},
    "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]}
  }
}
```
---
## 11. Observability
### 11.1 Metrics
- `notify.events_consumed_total{kind}`
- `notify.rules_matched_total{ruleId}`
- `notify.throttled_total{reason}`
- `notify.digest_coalesced_total{window}`
- `notify.sent_total{channel}`
- `notify.failed_total{channel,code}`
- `notify.delivery_latency_seconds{channel}`
### 11.2 SLO Targets
| Metric | Target |
|--------|--------|
| Event-to-delivery p95 | < 60 seconds |
| Failure rate | < 0.5% per hour |
| Duplicate rate | < 0.01% |
---
## 12. Security Considerations
### 12.1 Secret Management
- Secrets stored as references only
- Just-in-time fetch at send time
- No plaintext secrets persisted in MongoDB
### 12.2 Webhook Signing
```
X-StellaOps-Signature: t=1764417600,v1=abc123...
X-StellaOps-Timestamp: 2025-11-29T12:00:00Z
```
- HMAC-SHA256 or Ed25519
- Replay window protection
- Canonical body hash
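
A receiver-side check for the HMAC-SHA256 variant might look like the sketch below. The header layout `t=...,v1=...` follows the example above, but the exact signed string (`"{t}.{sha256(body)}"`), the hex encoding, and the 300-second tolerance are assumptions for illustration; the authoritative format is whatever the Notify webhook connector documents.

```python
import hashlib
import hmac


def sign(secret, body, timestamp):
    """Build a "t=<unix>,v1=<hex>" header over the timestamp and body hash."""
    body_hash = hashlib.sha256(body).hexdigest()
    signed = f"{timestamp}.{body_hash}".encode("ascii")
    mac = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={mac}"


def verify(secret, body, header, now, tolerance=300):
    """Reject malformed headers, stale timestamps, and mismatched signatures."""
    try:
        parts = dict(p.split("=", 1) for p in header.split(","))
        ts = int(parts["t"])
    except (KeyError, ValueError):
        return False
    if abs(now - ts) > tolerance:
        return False  # outside the replay window
    expected = sign(secret, body, ts)
    return hmac.compare_digest(expected.encode("utf-8"), header.encode("utf-8"))
```

Including the timestamp in the signed string is what makes the replay window enforceable: an attacker cannot refresh `t` without invalidating `v1`.
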
### 12.3 Loop Prevention
- Webhook target allowlist
- Event origin tags
- Own webhooks rejected
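
The three loop-prevention checks can be sketched as two small predicates. Everything here is a hedged illustration: the `origin` field name, the `own_hosts` set, and the HTTPS-only requirement are assumptions that do not appear in this advisory.

```python
from urllib.parse import urlsplit


def is_allowed_target(url, allowlist, own_hosts):
    """Accept only allowlisted HTTPS hosts that are not Notify's own endpoints."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False  # HTTPS-only is an assumption of this sketch
    if host in own_hosts:
        return False  # own webhooks rejected
    return host in allowlist


def should_process(event, own_origin="notify"):
    """Drop events tagged as originating from Notify itself (hypothetical field)."""
    return event.get("origin") != own_origin
```

Checking the own-host set before the allowlist means a misconfigured allowlist cannot re-enable a delivery loop.
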
---
## 13. Related Documentation
| Resource | Location |
|----------|----------|
| Notify architecture | `docs/modules/notify/architecture.md` |
| Channel schemas | `docs/modules/notify/resources/schemas/` |
| Sample payloads | `docs/modules/notify/resources/samples/` |
| Bootstrap pack | `docs/modules/notify/bootstrap-pack.md` |
---
## 14. Sprint Mapping
- **Primary Sprint:** SPRINT_0170_0001_0001_notify_engine.md (NEW)
- **Related Sprints:**
  - SPRINT_0171_0001_0002_notify_connectors.md
  - SPRINT_0172_0001_0003_notify_ack_tokens.md

**Key Task IDs:**
- `NOTIFY-ENGINE-40-001` - Rules engine (DONE)
- `NOTIFY-CONN-41-001` - Connectors (DONE)
- `NOTIFY-NOISE-42-001` - Throttling/digests (DONE)
- `NOTIFY-ACK-45-001` - Token rotation (IN PROGRESS)
- `NOTIFY-ESC-46-001` - Escalation workflows (TODO)
---
## 15. Success Metrics
| Metric | Target |
|--------|--------|
| Delivery latency | < 60s p95 |
| Delivery success rate | > 99.5% |
| Duplicate rate | < 0.01% |
| Rule evaluation time | < 10ms |
| Channel health | 99.9% uptime |
---
*Last updated: 2025-11-29*