
Notification Rules and Alerting Engine

Version: 1.0
Date: 2025-11-29
Status: Canonical

This advisory defines the product rationale, rules engine semantics, and implementation strategy for the Notify module, covering channel connectors, throttling, digests, and delivery management.


1. Executive Summary

The Notify module provides rules-driven, tenant-aware notification delivery across security workflows. Key capabilities:

  • Rules Engine - Declarative matchers for event routing
  • Multi-Channel Delivery - Slack, Teams, Email, Webhooks
  • Noise Control - Throttling, deduplication, digest windows
  • Approval Tokens - DSSE-signed ack tokens for one-click workflows
  • Audit Trail - Complete delivery history with redacted payloads

2. Market Drivers

2.1 Target Segments

| Segment | Notification Requirements | Use Case |
|---|---|---|
| Security Teams | Real-time critical alerts | Incident response |
| DevSecOps | CI/CD integration | Pipeline notifications |
| Compliance | Audit trails | Delivery verification |
| Management | Digest summaries | Executive reporting |

2.2 Competitive Positioning

Most vulnerability tools offer basic email alerts. Stella Ops differentiates with:

  • Rules-based routing with fine-grained matchers
  • Native Slack/Teams integration with rich formatting
  • Digest windows to prevent alert fatigue
  • Cryptographic ack tokens for approval workflows
  • Tenant isolation with quota controls

3. Rules Engine

3.1 Rule Structure

name: "critical-alerts-prod"
enabled: true
tenant: "acme-corp"

match:
  eventKinds:
    - "scanner.report.ready"
    - "scheduler.rescan.delta"
    - "zastava.admission"
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"
  kev: true
  verdict: ["fail", "deny"]
  vex:
    includeRejectedJustifications: false

actions:
  - channel: "slack:sec-alerts"
    template: "concise"
    throttle: "5m"

  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"

3.2 Matcher Types

| Matcher | Description | Example |
|---|---|---|
| eventKinds | Event type filter | ["scanner.report.ready"] |
| namespaces | Namespace patterns | ["prod-*", "staging"] |
| repos | Repository patterns | ["ghcr.io/acme/*"] |
| minSeverity | Minimum severity | "high" |
| kev | KEV-tagged required | true |
| verdict | Report/admission verdict | ["fail", "deny"] |
| labels | Kubernetes labels | {"env": "production"} |

3.3 Evaluation Order

  1. Tenant check - Discard if rule tenant ≠ event tenant
  2. Kind filter - Early discard for non-matching kinds
  3. Scope match - Namespace/repo/label matching
  4. Delta gates - Severity threshold evaluation
  5. VEX gate - Filter based on VEX status
  6. Throttle/dedup - Idempotency key check
  7. Actions - Enqueue per-channel jobs
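The first four gates of the evaluation order can be sketched as a chain of early returns. This is a minimal illustration, not the Notify implementation: the `Rule` class, the `SEVERITY_RANK` ordering, and the `payload.maxSeverity` field are assumptions made for the example, and the VEX gate, throttle check, and action enqueueing are left out.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Assumed severity ordering; the advisory does not enumerate the levels.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Rule:
    """Hypothetical in-memory form of a rule's tenant and match block."""
    tenant: str
    event_kinds: list
    namespaces: list = field(default_factory=list)
    repos: list = field(default_factory=list)
    min_severity: str = "low"


def _any_glob(patterns, value):
    # Assumption: an empty pattern list means "match anything".
    return not patterns or any(fnmatchcase(value, p) for p in patterns)


def evaluate(rule, event):
    """Return (matched, reason); reason names the first gate that discarded the event."""
    if rule.tenant != event["tenant"]:            # 1. tenant check
        return False, "tenant"
    if event["kind"] not in rule.event_kinds:     # 2. kind filter
        return False, "kind"
    scope = event.get("scope", {})                # 3. scope match
    if not _any_glob(rule.namespaces, scope.get("namespace", "")):
        return False, "scope"
    if not _any_glob(rule.repos, scope.get("repo", "")):
        return False, "scope"
    # 4. severity threshold; "maxSeverity" is a hypothetical payload field.
    severity = event.get("payload", {}).get("maxSeverity", "low")
    if SEVERITY_RANK[severity] < SEVERITY_RANK[rule.min_severity]:
        return False, "severity"
    return True, "matched"
```

Ordering the cheap equality checks before the glob matching keeps the common discard path short, which is the point of the early kind filter.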

4. Channel Connectors

4.1 Built-in Channels

| Channel | Features | Rate Limits |
|---|---|---|
| Slack | Blocks, threads, reactions | 1 msg/sec per channel |
| Teams | Adaptive Cards, webhooks | 4 msgs/sec |
| Email | HTML+text, attachments | Relay-dependent |
| Webhook | JSON, HMAC signing | 10 req/sec |
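One common way to honor per-channel limits like those above is a token bucket. The sketch below is an assumption about how a connector might pace itself, not a description of the shipped limiter; the `TokenBucket` class and its burst `capacity` parameter are hypothetical.

```python
import time


class TokenBucket:
    """Allows `rate_per_sec` sends per second on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity=1.0, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so tests can control time
        self.tokens = capacity      # start full
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return False when the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

For the Slack row, `TokenBucket(1.0)` per channel would admit one message per second; a Teams connector might use `TokenBucket(4.0, capacity=4.0)`.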

4.2 Channel Configuration

channels:
  - name: "slack:sec-alerts"
    type: slack
    config:
      channel: "#security-alerts"
      workspace: "acme-corp"
      secretRef: "ref://notify/slack-token"

  - name: "email:soc"
    type: email
    config:
      to: ["soc@acme.com"]
      from: "stellaops@acme.com"
      smtpHost: "smtp.acme.com"
      secretRef: "ref://notify/smtp-creds"

  - name: "webhook:siem"
    type: webhook
    config:
      url: "https://siem.acme.com/api/events"
      signMethod: "ed25519"
      signKeyRef: "ref://notify/webhook-key"

4.3 Connector Contract

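using System.Threading;
using System.Threading.Tasks;

public interface INotifyConnector
{
    // Channel type this connector handles (for example "slack" or "webhook").
    string Type { get; }
    // Delivers one notification described by the delivery context.
    Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
    // Reports channel health for the given configuration.
    Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}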
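To show the shape of the contract in isolation, here is a rough Python analogue with an in-memory test double. The names `NotifyConnector`, `InMemoryConnector`, and `send_sync`, and the dictionary-based context, are illustrative assumptions; the real contract is the C# interface above.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class DeliveryResult:
    ok: bool
    detail: str = ""


class NotifyConnector(Protocol):
    """Python rendition of the connector contract (hypothetical)."""
    type: str

    async def send(self, ctx: dict) -> DeliveryResult: ...
    async def health(self, cfg: dict) -> DeliveryResult: ...


class InMemoryConnector:
    """Test double that records message bodies instead of sending them."""
    type = "memory"

    def __init__(self):
        self.sent = []

    async def send(self, ctx):
        self.sent.append(ctx["body"])
        return DeliveryResult(ok=True)

    async def health(self, cfg):
        return DeliveryResult(ok=True, detail="in-memory")


def send_sync(connector, ctx):
    """Convenience wrapper for callers that are not already in an event loop."""
    return asyncio.run(connector.send(ctx))
```

A double like this is useful for dry-running rules without touching a real channel.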

5. Noise Control

5.1 Throttling

  • Per-action throttle - Suppress duplicates within window
  • Idempotency key - hash(ruleId | actionId | event.kind | scope.digest | day)
  • Configurable windows - 5m, 15m, 1h, 1d
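The idempotency key and window check can be sketched as follows. SHA-256, the `|` join, and the UTC day bucket are assumptions (the advisory only says `hash(...)` and `day`), and `Throttle` is a hypothetical in-memory stand-in for whatever store the engine uses.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(rule_id: str, action_id: str, kind: str, digest: str, ts: datetime) -> str:
    """hash(ruleId | actionId | event.kind | scope.digest | day), with day taken in UTC."""
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    material = "|".join([rule_id, action_id, kind, digest, day])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class Throttle:
    """Suppresses repeats of the same key inside a fixed window (seconds)."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.last_sent = {}

    def allow(self, key: str, now: float) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            return False            # duplicate within the window
        self.last_sent[key] = now
        return True
```

With a `5m` throttle this would be `Throttle(300)`: a second event with the same key is dropped until 300 seconds have passed since the last accepted send.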

5.2 Digest Windows

actions:
  - channel: "email:weekly-summary"
    digest: "weekly"
    digestOptions:
      maxItems: 100
      groupBy: ["severity", "namespace"]
      template: "digest-summary"

Behavior:

  • Coalesce events within window
  • Summarize top N items with counts
  • Flush on window close or max items
  • Safe truncation with "and X more" links
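The coalescing behavior can be illustrated with a small grouping function. This is a sketch under assumptions: `build_digest` is a hypothetical helper, events are plain dictionaries carrying the `groupBy` fields, and the window flush itself is left to the caller.

```python
from collections import Counter


def build_digest(events, group_by, max_items):
    """Group events by the given fields, keep at most max_items, and note the overflow."""
    # Count events per group key, e.g. ("high", "prod").
    counts = Counter(tuple(e[k] for k in group_by) for e in events)
    shown = events[:max_items]
    remaining = len(events) - len(shown)
    # Largest groups first; ties keep first-seen order.
    lines = [f"{' / '.join(key)}: {n}" for key, n in counts.most_common()]
    if remaining > 0:
        lines.append(f"and {remaining} more")   # safe truncation marker
    return {"groups": dict(counts), "items": shown, "summary": lines}
```

Given two high and one critical finding in `prod` with `max_items=2`, the summary lists the `high / prod` group first and ends with "and 1 more".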

5.3 Quiet Hours

notify:
  quietHours:
    enabled: true
    window: "22:00-06:00"
    timezone: "America/New_York"
    minSeverity: "critical"

During quiet hours, only alerts at or above minSeverity (critical in this example) are delivered immediately; all other alerts are deferred to digests.
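The one subtle part of a quiet-hours check is a window that crosses midnight, as `22:00-06:00` does. The sketch below assumes a severity ordering and uses a fixed UTC-5 offset as a stand-in time zone; `in_quiet_hours` and `should_defer` are hypothetical names.

```python
from datetime import datetime, time, timedelta, timezone

# Assumed severity ordering; not enumerated in the advisory.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

# Fixed-offset stand-in for the example. Real code would pass a DST-aware
# zone such as zoneinfo.ZoneInfo("America/New_York").
EXAMPLE_TZ = timezone(timedelta(hours=-5))


def in_quiet_hours(now: datetime, window: str, tz) -> bool:
    """True when `now`, converted to `tz`, falls inside a "HH:MM-HH:MM" window."""
    start_s, end_s = window.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    local = now.astimezone(tz).time()
    if start <= end:
        return start <= local < end
    # Window wraps past midnight: late evening OR early morning.
    return local >= start or local < end


def should_defer(now, severity, window, tz, min_severity):
    """Defer to a digest when inside quiet hours and below the minimum severity."""
    below = SEVERITY_RANK[severity] < SEVERITY_RANK[min_severity]
    return in_quiet_hours(now, window, tz) and below
```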


6. Templates & Rendering

6.1 Template Engine

  • Handlebars-style safe templates
  • No arbitrary code execution
  • Deterministic outputs (stable property order)
  • Locale-aware formatting

6.2 Template Variables

| Variable | Description |
|---|---|
| event.kind | Event type |
| event.ts | Timestamp |
| scope.namespace | Kubernetes namespace |
| scope.repo | Repository |
| scope.digest | Image digest |
| payload.verdict | Policy verdict |
| payload.delta.newCritical | New critical count |
| payload.links.ui | UI deep link |
| topFindings[] | Top N findings |
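Dotted variables such as `payload.delta.newCritical` can be resolved with a plain dictionary walk, which is what makes this style of template safe: nothing is evaluated. The renderer below is a toy written for illustration, not the Notify template engine; it handles only `{{dotted.path}}` placeholders, does not iterate `topFindings[]`, and does no HTML escaping.

```python
import re

# Matches {{ dotted.path }} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(context, path):
    """Walk nested dicts along a dotted path; missing segments resolve to ''."""
    value = context
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return ""
        value = value[part]
    return value


def render(template, context):
    """Substitute placeholders by lookup only; no expressions are executed."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), template)
```

Because output depends only on the template string and the looked-up values, the same inputs always produce the same text.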

6.3 Channel-Specific Rendering

Slack:

{
  "blocks": [
    {"type": "header", "text": {"type": "plain_text", "text": "Policy FAIL: nginx:latest"}},
    {"type": "section", "text": {"type": "mrkdwn", "text": "*2 critical*, 3 high vulnerabilities"}}
  ]
}

Email:

<h2>Policy FAIL: nginx:latest</h2>
<table>
  <tr><td>Critical</td><td>2</td></tr>
  <tr><td>High</td><td>3</td></tr>
</table>
<a href="https://ui.internal/reports/...">View Details</a>

7. Ack Tokens

7.1 Token Structure

DSSE-signed tokens for one-click acknowledgements:

{
  "payloadType": "application/vnd.stellaops.notify-ack-token+json",
  "payload": {
    "tenant": "acme-corp",
    "deliveryId": "delivery-123",
    "notificationId": "notif-456",
    "channel": "slack:sec-alerts",
    "webhookUrl": "https://notify.internal/ack",
    "nonce": "random-nonce",
    "actions": ["acknowledge", "escalate"],
    "expiresAt": "2025-11-29T13:00:00Z"
  },
  "signatures": [{"keyid": "notify-ack-key-01", "sig": "..."}]
}

7.2 Token Workflow

  1. Issue - POST /notify/ack-tokens/issue
  2. Embed - Token included in message action button
  3. Click - User clicks button, token sent to webhook
  4. Verify - POST /notify/ack-tokens/verify
  5. Audit - Ack event recorded
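The issue and verify steps can be sketched with the DSSE pre-authentication encoding (PAE). Two caveats: HMAC-SHA256 stands in for the real signature scheme so the example needs only the standard library, and the nonce set is a hypothetical in-memory replay guard. In a DSSE envelope the payload travels base64-encoded; the structure in 7.1 shows it decoded.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone

PAYLOAD_TYPE = "application/vnd.stellaops.notify-ack-token+json"


def pae(payload_type: str, body: bytes) -> bytes:
    """DSSE PAE: "DSSEv1" SP len(type) SP type SP len(body) SP body."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(body), body)


def issue(payload: dict, key: bytes, keyid: str) -> dict:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # HMAC is a stand-in; a real deployment would sign with an asymmetric key.
    sig = hmac.new(key, pae(PAYLOAD_TYPE, body), hashlib.sha256).digest()
    return {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(body).decode("ascii"),
        "signatures": [{"keyid": keyid, "sig": base64.b64encode(sig).decode("ascii")}],
    }


def verify(token: dict, key: bytes, now=None, seen_nonces=None):
    """Return the payload if signature, expiry, and nonce all check out, else None."""
    now = now or datetime.now(timezone.utc)
    seen_nonces = seen_nonces if seen_nonces is not None else set()
    body = base64.b64decode(token["payload"])
    expected = hmac.new(key, pae(token["payloadType"], body), hashlib.sha256).digest()
    sigs = [base64.b64decode(s["sig"]) for s in token["signatures"]]
    if not any(hmac.compare_digest(expected, s) for s in sigs):
        return None
    payload = json.loads(body)
    expires = datetime.fromisoformat(payload["expiresAt"].replace("Z", "+00:00"))
    if now >= expires or payload["nonce"] in seen_nonces:
        return None                      # expired or replayed
    seen_nonces.add(payload["nonce"])
    return payload
```

Signing over the PAE rather than the raw body binds the payload type into the signature, so a token cannot be reinterpreted under a different type.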

7.3 Token Rotation

# Rotate ack token signing key
stella notify rotate-ack-key --key-source kms://notify/ack-key

8. Implementation Strategy

8.1 Phase 1: Core Engine (Complete)

  • Rules engine with matchers
  • Slack connector
  • Teams connector
  • Email connector
  • Webhook connector

8.2 Phase 2: Noise Control (Complete)

  • Throttling
  • Digest windows
  • Idempotency
  • Quiet hours

8.3 Phase 3: Ack Tokens (In Progress)

  • Token issuance
  • Token verification
  • Token rotation API (NOTIFY-ACK-45-001)
  • Escalation workflows (NOTIFY-ESC-46-001)

8.4 Phase 4: Advanced Features (Planned)

  • PagerDuty connector
  • Jira ticket creation
  • In-app notifications
  • Anomaly suppression

9. API Surface

9.1 Channels

| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/v1/notify/channels | GET/POST | notify.read/admin | List/create channels |
| /api/v1/notify/channels/{id} | GET/PATCH/DELETE | notify.admin | Manage channel |
| /api/v1/notify/channels/{id}/test | POST | notify.admin | Send test message |
| /api/v1/notify/channels/{id}/health | GET | notify.read | Health check |

9.2 Rules

| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/v1/notify/rules | GET/POST | notify.read/admin | List/create rules |
| /api/v1/notify/rules/{id} | GET/PATCH/DELETE | notify.admin | Manage rule |
| /api/v1/notify/rules/{id}/test | POST | notify.admin | Dry-run rule |

9.3 Deliveries

| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/v1/notify/deliveries | GET | notify.read | List deliveries |
| /api/v1/notify/deliveries/{id} | GET | notify.read | Delivery detail |
| /api/v1/notify/deliveries/{id}/retry | POST | notify.admin | Retry delivery |

10. Event Sources

10.1 Subscribed Events

| Event | Source | Typical Actions |
|---|---|---|
| scanner.scan.completed | Scanner | Immediate/digest |
| scanner.report.ready | Scanner | Immediate |
| scheduler.rescan.delta | Scheduler | Immediate/digest |
| attestor.logged | Attestor | Immediate |
| zastava.admission | Zastava | Immediate |
| conselier.export.completed | Concelier | Digest |
| excitor.export.completed | Excititor | Digest |

10.2 Event Envelope

{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "acme-corp",
  "ts": "2025-11-29T12:00:00Z",
  "actor": "scanner-webservice",
  "scope": {
    "namespace": "production",
    "repo": "ghcr.io/acme/api",
    "digest": "sha256:..."
  },
  "payload": {
    "reportId": "report-123",
    "verdict": "fail",
    "summary": {"total": 12, "blocked": 2},
    "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]}
  }
}
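A consumer would typically validate the envelope before rule evaluation. The sketch below assumes that every top-level field except `actor` is required, which the advisory does not state; `Event` and `parse_event` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumption: these fields are mandatory; "actor" is treated as optional.
REQUIRED = ("eventId", "kind", "tenant", "ts", "scope", "payload")


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str
    tenant: str
    ts: datetime
    scope: dict
    payload: dict
    actor: str = ""


def parse_event(raw: dict) -> Event:
    """Validate required fields and parse the ISO 8601 timestamp."""
    missing = [k for k in REQUIRED if k not in raw]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    # fromisoformat on older Python versions does not accept a trailing "Z".
    ts = datetime.fromisoformat(raw["ts"].replace("Z", "+00:00"))
    return Event(
        event_id=raw["eventId"],
        kind=raw["kind"],
        tenant=raw["tenant"],
        ts=ts,
        scope=raw["scope"],
        payload=raw["payload"],
        actor=raw.get("actor", ""),
    )
```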

11. Observability

11.1 Metrics

  • notify.events_consumed_total{kind}
  • notify.rules_matched_total{ruleId}
  • notify.throttled_total{reason}
  • notify.digest_coalesced_total{window}
  • notify.sent_total{channel}
  • notify.failed_total{channel,code}
  • notify.delivery_latency_seconds{channel}

11.2 SLO Targets

| Metric | Target |
|---|---|
| Event-to-delivery p95 | < 60 seconds |
| Failure rate | < 0.5% per hour |
| Duplicate rate | ~0% |
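For reference, a p95 target can be checked against raw latency samples with a nearest-rank percentile. This is a generic sketch, not how the module computes its SLOs (a metrics backend would normally derive p95 from histogram buckets); `percentile` and `meets_latency_slo` are hypothetical helpers.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def meets_latency_slo(latencies_seconds, target_seconds=60):
    """True when the p95 event-to-delivery latency is under the target."""
    return percentile(latencies_seconds, 95) < target_seconds
```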

12. Security Considerations

12.1 Secret Management

  • Secrets stored as references only
  • Just-in-time fetch at send time
  • No plaintext secrets stored in MongoDB

12.2 Webhook Signing

X-StellaOps-Signature: t=1764417600,v1=abc123...
X-StellaOps-Timestamp: 2025-11-29T12:00:00Z

  • HMAC-SHA256 or Ed25519
  • Replay window protection
  • Canonical body hash
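The HMAC-SHA256 variant with replay protection can be sketched as below. The signed string layout (`<timestamp>.<sha256 of body>`) and the 300-second tolerance are assumptions chosen for the example; only the `t=...,v1=...` header shape comes from the advisory.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes, timestamp: int) -> str:
    """Build a signature header value of the form t=<unix-ts>,v1=<hex-mac>."""
    body_hash = hashlib.sha256(body).hexdigest()
    # Assumed signed string: "<timestamp>.<body hash>".
    mac = hmac.new(secret, f"{timestamp}.{body_hash}".encode("ascii"), hashlib.sha256)
    return f"t={timestamp},v1={mac.hexdigest()}"


def verify(secret: bytes, body: bytes, header: str, now: int, tolerance: int = 300) -> bool:
    """Reject malformed headers, stale timestamps (replay window), and bad MACs."""
    try:
        fields = dict(part.split("=", 1) for part in header.split(","))
        timestamp = int(fields["t"])
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > tolerance:
        return False
    # Recompute and compare in constant time.
    return hmac.compare_digest(sign(secret, body, timestamp), header)
```

Including the timestamp in the signed material is what makes the replay window enforceable: an attacker cannot refresh `t` without invalidating `v1`.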

12.3 Loop Prevention

  • Webhook target allowlist
  • Event origin tags
  • Own webhooks rejected
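These three checks might look like the following. The HTTPS-only requirement, the `origin` field on events, and both function names are assumptions introduced for the sketch.

```python
from urllib.parse import urlsplit


def is_allowed_target(url: str, allowlist: set, own_hosts: set) -> bool:
    """Accept a webhook URL only if its host is allowlisted and not one of our own."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:   # assumption: HTTPS targets only
        return False
    if host in own_hosts:                     # own webhooks rejected
        return False
    return host in allowlist


def should_process(event: dict, own_origin: str) -> bool:
    """Drop events tagged as originating from this Notify instance (hypothetical tag)."""
    return event.get("origin") != own_origin
```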

13. Related Documentation

| Resource | Location |
|---|---|
| Notify architecture | docs/modules/notify/architecture.md |
| Channel schemas | docs/modules/notify/resources/schemas/ |
| Sample payloads | docs/modules/notify/resources/samples/ |
| Bootstrap pack | docs/modules/notify/bootstrap-pack.md |

14. Sprint Mapping

  • Primary Sprint: SPRINT_0170_0001_0001_notify_engine.md (NEW)
  • Related Sprints:
    • SPRINT_0171_0001_0002_notify_connectors.md
    • SPRINT_0172_0001_0003_notify_ack_tokens.md

Key Task IDs:

  • NOTIFY-ENGINE-40-001 - Rules engine (DONE)
  • NOTIFY-CONN-41-001 - Connectors (DONE)
  • NOTIFY-NOISE-42-001 - Throttling/digests (DONE)
  • NOTIFY-ACK-45-001 - Token rotation (IN PROGRESS)
  • NOTIFY-ESC-46-001 - Escalation workflows (TODO)

15. Success Metrics

| Metric | Target |
|---|---|
| Delivery latency | < 60s p95 |
| Delivery success rate | > 99.5% |
| Duplicate rate | < 0.01% |
| Rule evaluation time | < 10ms |
| Channel health | 99.9% uptime |

Last updated: 2025-11-29