Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/notify/email-configured.md
+++ b/docs/doctor/articles/notify/email-configured.md
@@ -0,0 +1,86 @@
+---
+checkId: check.notify.email.configured
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, email, smtp, quick, configuration]
+---
+# Email Configuration
+
+## What It Checks
+Verifies that the email (SMTP) notification channel is properly configured. The check reads the `Notify:Channels:Email` configuration section and validates:
+
+- **SMTP host** (`SmtpHost` or `Host`): must be set and non-empty.
+- **SMTP port** (`SmtpPort` or `Port`): must be a valid number between 1 and 65535.
+- **From address** (`FromAddress` or `From`): must be set so outbound emails have a valid sender.
+- **Enabled flag** (`Enabled`): if explicitly set to `false`, reports a warning that the channel is configured but disabled.
+
+The check only runs when the `Notify:Channels:Email` configuration section exists.
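
The validation rules above can be sketched as follows. This is a minimal Python illustration, not the plugin's actual implementation: the function names, the dict-based configuration section, and the result strings (`pass`, `warn`) are assumptions made for the example.

```python
from typing import Mapping, Optional


def first_set(section: Mapping[str, object], *keys: str) -> Optional[str]:
    """Return the first non-empty value among the given keys, as a string."""
    for key in keys:
        value = section.get(key)
        if value is not None and str(value).strip():
            return str(value).strip()
    return None


def validate_email_channel(section: Mapping[str, object]) -> tuple[str, list[str]]:
    """Hypothetical restatement of the documented rules; returns (status, issues)."""
    issues: list[str] = []

    if first_set(section, "SmtpHost", "Host") is None:
        issues.append("SMTP host is not set")

    port = first_set(section, "SmtpPort", "Port")
    port_is_number = port is not None and port.isascii() and port.isdigit()
    if not port_is_number or not 1 <= int(port) <= 65535:
        issues.append("SMTP port must be a number between 1 and 65535")

    if first_set(section, "FromAddress", "From") is None:
        issues.append("From address is not set")

    if issues:
        # Assumption: problems surface at the check's declared severity (warn).
        return "warn", issues

    # An explicit Enabled=false is reported even when everything else is valid.
    if str(section.get("Enabled", "true")).strip().lower() == "false":
        return "warn", ["Email channel is configured but disabled"]

    return "pass", []
```

Accepting both spellings of each key (`SmtpHost` or `Host`, and so on) mirrors the alternatives listed above; the order in which they are tried is a guess.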
+
+## Why It Matters
+Email notifications deliver critical alerts for release gate failures, policy violations, and security findings. Without a properly configured SMTP host, no email notifications can be sent, leaving operators blind to events that require immediate action. A missing from address can cause receiving mail servers to reject the emails.
+
+## Common Causes
+- SMTP host not set in configuration (missing `Notify:Channels:Email:SmtpHost` setting)
+- SMTP port not specified or set to an invalid value
+- From address not configured
+- Email channel explicitly disabled in configuration
+
+## How to Fix
+
+### Docker Compose
+Add environment variables to your service definition:
+
+```yaml
+environment:
+  Notify__Channels__Email__SmtpHost: "smtp.example.com"
+  Notify__Channels__Email__SmtpPort: "587"
+  Notify__Channels__Email__FromAddress: "noreply@example.com"
+  Notify__Channels__Email__UseSsl: "true"
+```
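
The variable names above appear to follow the standard .NET convention in which a double underscore in an environment variable stands for the `:` separator of a configuration key, so `Notify__Channels__Email__SmtpHost` would bind to `Notify:Channels:Email:SmtpHost`. The sketch below shows that mapping in Python, assuming the Notify service uses the stock environment-variable configuration provider; both helper names are hypothetical.

```python
def env_to_config_key(name: str) -> str:
    """Translate an environment variable name into a colon-separated config key."""
    return name.replace("__", ":")


def flatten_to_env(settings: dict, prefix: str = "") -> dict[str, str]:
    """Flatten a nested settings dict into double-underscore variable names."""
    env: dict[str, str] = {}
    for key, value in settings.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            env.update(flatten_to_env(value, name))
        elif isinstance(value, bool):
            env[name] = "true" if value else "false"
        else:
            env[name] = str(value)
    return env
```

Feeding the nested structure from an `appsettings.json` file to `flatten_to_env` yields the equivalent `environment:` entries for a compose file.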
+
+### Bare Metal / systemd
+Edit `appsettings.json`:
+
+```json
+{
+  "Notify": {
+    "Channels": {
+      "Email": {
+        "SmtpHost": "smtp.example.com",
+        "SmtpPort": 587,
+        "FromAddress": "noreply@example.com",
+        "UseSsl": true
+      }
+    }
+  }
+}
+```
+
+Restart the service:
+```bash
+sudo systemctl restart stellaops-notify
+```
+
+### Kubernetes / Helm
+Set values in your Helm `values.yaml`:
+
+```yaml
+notify:
+  channels:
+    email:
+      smtpHost: "smtp.example.com"
+      smtpPort: 587
+      fromAddress: "noreply@example.com"
+      useSsl: true
+      credentialsSecret: "stellaops-smtp-credentials"
+```
+
+## Verification
+```bash
+stella doctor run --check check.notify.email.configured
+```
+
+## Related Checks
+- `check.notify.email.connectivity` — tests whether the configured SMTP server is reachable
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy
--- a/docs/doctor/articles/notify/email-connectivity.md
+++ b/docs/doctor/articles/notify/email-connectivity.md
@@ -0,0 +1,78 @@
+---
+checkId: check.notify.email.connectivity
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, email, smtp, connectivity, network]
+---
+# Email Connectivity
+
+## What It Checks
+Verifies that the configured SMTP server is reachable by opening a TCP connection to the SMTP host and port. The check:
+
+- Opens a TCP socket to `SmtpHost:SmtpPort` with a 10-second timeout.
+- Reads the SMTP banner and verifies it starts with `220` (standard SMTP greeting).
+- Reports an info-level result if the connection succeeds but the banner is not a recognized SMTP response.
+- Fails if the connection times out, is refused, or encounters a socket error.
+
+The check only runs when both `SmtpHost` and `SmtpPort` are configured with valid values.
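
The probe described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the check's real code: the function names and result strings are hypothetical, and a read timeout after a successful connect is treated as a failure here, which the description above does not spell out.

```python
import socket


def classify_banner(banner: bytes) -> str:
    """Map the first bytes sent by the server to a result level."""
    if banner.startswith(b"220"):
        return "pass"  # standard SMTP service-ready greeting
    return "info"      # connected, but not a recognized SMTP greeting


def check_smtp(host: str, port: int, timeout: float = 10.0) -> str:
    """Connect, read the greeting, and classify it; socket errors fail."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return classify_banner(sock.recv(512))
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution errors.
        return "fail"
```

Keeping the banner classification separate from the socket handling makes the `220` rule easy to exercise without a live mail server.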
+
+## Why It Matters
+A configured but unreachable SMTP server means email notifications will silently fail. Release gate alerts, security finding notifications, and approval requests will never reach operators, potentially delaying incident response.
+
+## Common Causes
+- SMTP server not running
+- Wrong host or port in configuration
+- Firewall blocking outbound SMTP connections
+- DNS resolution failure for the SMTP hostname
+- Network latency too high (exceeding 10-second timeout)
+
+## How to Fix
+
+### Docker Compose
+Verify network connectivity from the container:
+
+```bash
+docker exec <notify-container> nc -zv smtp.example.com 587
+docker exec <notify-container> nslookup smtp.example.com
+```
+
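+Ensure the container network can reach the SMTP server. `HTTP_PROXY` applies only to HTTP clients and does not route raw SMTP connections, so set it only if the container's other outbound HTTP traffic must go through a proxy:
+```yaml
+environment:
+  HTTP_PROXY: "http://proxy.example.com:8080"
+```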
+
+### Bare Metal / systemd
+Test connectivity manually:
+
+```bash
+nc -zv smtp.example.com 587
+telnet smtp.example.com 587
+nslookup smtp.example.com
+```
+
+Check firewall rules:
+```bash
+sudo iptables -L -n | grep 587
+```
+
+### Kubernetes / Helm
+Verify connectivity from the pod:
+
+```bash
+kubectl exec -it <notify-pod> -- nc -zv smtp.example.com 587
+```
+
+Check NetworkPolicy resources that might block egress:
+```bash
+kubectl get networkpolicy -n stellaops
+```
+
+## Verification
+```bash
+stella doctor run --check check.notify.email.connectivity
+```
+
+## Related Checks
+- `check.notify.email.configured` — verifies SMTP configuration is complete
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy
--- a/docs/doctor/articles/notify/queue-health.md
+++ b/docs/doctor/articles/notify/queue-health.md
@@ -0,0 +1,93 @@
+---
+checkId: check.notify.queue.health
+plugin: stellaops.doctor.notify
+severity: fail
+tags: [notify, queue, redis, nats, infrastructure]
+---
+# Notification Queue Health
+
+## What It Checks
+Verifies that the notification event and delivery queues are healthy. The check:
+
+- Reads the `Notify:Queue:Transport` (or `Kind`) setting to determine the queue transport type (Redis/Valkey or NATS).
+- Resolves `NotifyQueueHealthCheck` and `NotifyDeliveryQueueHealthCheck` from the DI container.
+- Invokes each registered health check and aggregates the results.
+- Fails if any queue reports an `Unhealthy` status; warns if degraded; passes if all are healthy.
+
+The check only runs when a queue transport is configured in `Notify:Queue:Transport`.
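
The aggregation rule in the last bullet amounts to taking the worst status across the registered health checks. A minimal Python sketch follows; the `Health` enum and the `aggregate` function are illustrative names rather than the plugin's types, and rejecting an empty result set is an assumption because the behavior with no registered checks is not described.

```python
from enum import IntEnum


class Health(IntEnum):
    """Statuses ordered so that a larger value is worse."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def aggregate(results: dict[str, Health]) -> str:
    """Return the Doctor result for a set of per-queue health statuses."""
    if not results:
        raise ValueError("no queue health checks were resolved")
    worst = max(results.values())
    if worst is Health.UNHEALTHY:
        return "fail"
    if worst is Health.DEGRADED:
        return "warn"
    return "pass"
```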
+
+## Why It Matters
+The notification queue is the backbone of the notification pipeline. If the event queue is unhealthy, new notification events are lost. If the delivery queue is unhealthy, pending notifications to email, Slack, Teams, and webhook channels will not be delivered. The check has severity `fail` because a queue failure means a complete notification blackout.
+
+## Common Causes
+- Queue server (Redis/Valkey/NATS) not running
+- Network connectivity issues between the Notify service and the queue server
+- Authentication failure (wrong password or credentials)
+- Incorrect connection string in configuration
+
+## How to Fix
+
+### Docker Compose
+For Redis/Valkey transport:
+
+```bash
+# Check Redis health
+docker exec <redis-container> redis-cli ping
+
+# Check connection string
+docker exec <notify-container> env | grep Notify__Queue
+
+# Restart Redis if needed
+docker restart <redis-container>
+```
+
+For NATS transport:
+
+```bash
+# Check NATS server status
+docker exec <nats-container> nats server ping
+
+# Check NATS logs
+docker logs <nats-container> --tail 50
+```
+
+### Bare Metal / systemd
+```bash
+# Redis/Valkey
+redis-cli ping
+redis-cli info server
+
+# NATS
+nats server ping
+systemctl status nats
+```
+
+Verify the connection string in `appsettings.json`:
+```json
+{
+  "Notify": {
+    "Queue": {
+      "Transport": "redis",
+      "Redis": {
+        "ConnectionString": "127.1.1.2:6379"
+      }
+    }
+  }
+}
+```
+
+### Kubernetes / Helm
+```bash
+kubectl exec -it <redis-pod> -- redis-cli ping
+kubectl logs <notify-pod> --tail 50 | grep -i queue
+```
+
+## Verification
+```bash
+stella doctor run --check check.notify.queue.health
+```
+
+## Related Checks
+- `check.notify.email.configured` — verifies email channel configuration
+- `check.notify.slack.configured` — verifies Slack channel configuration
+- `check.notify.webhook.configured` — verifies webhook channel configuration
--- a/docs/doctor/articles/notify/slack-configured.md
+++ b/docs/doctor/articles/notify/slack-configured.md
@@ -0,0 +1,72 @@
+---
+checkId: check.notify.slack.configured
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, slack, quick, configuration]
+---
+# Slack Configuration
+
+## What It Checks
+Verifies that the Slack notification channel is properly configured. The check reads `Notify:Channels:Slack` and validates:
+
+- **Webhook URL** (`WebhookUrl`): must be set and non-empty.
+- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning that Slack is configured but disabled.
+
+The check only runs when the `Notify:Channels:Slack` configuration section exists.
+
+## Why It Matters
+Slack is a primary real-time notification channel for many operations teams. Without a configured webhook URL, security alerts, release gate notifications, and approval requests cannot reach Slack channels, delaying incident response.
+
+## Common Causes
+- Slack webhook URL not set in configuration (missing `Notify:Channels:Slack:WebhookUrl` setting)
+- Environment variable not bound to configuration
+- Slack notifications explicitly disabled
+
+## How to Fix
+
+### Docker Compose
+```yaml
+environment:
+  Notify__Channels__Slack__WebhookUrl: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
+```
+
+> **Security note:** Slack webhook URLs are secrets. Store them in a secrets manager or Docker secrets, not in plain-text compose files.
+
+### Bare Metal / systemd
+Edit `appsettings.json`:
+
+```json
+{
+  "Notify": {
+    "Channels": {
+      "Slack": {
+        "WebhookUrl": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
+      }
+    }
+  }
+}
+```
+
+### Kubernetes / Helm
+```yaml
+notify:
+  channels:
+    slack:
+      webhookUrlSecret: "stellaops-slack-webhook"
+```
+
+Create the secret:
+```bash
+kubectl create secret generic stellaops-slack-webhook \
+  --from-literal=url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
+```
+
+## Verification
+```bash
+stella doctor run --check check.notify.slack.configured
+```
+
+## Related Checks
+- `check.notify.slack.connectivity` — tests whether the Slack webhook endpoint is reachable
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy
--- a/docs/doctor/articles/notify/slack-connectivity.md
+++ b/docs/doctor/articles/notify/slack-connectivity.md
@@ -0,0 +1,68 @@
+---
+checkId: check.notify.slack.connectivity
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, slack, connectivity, network]
+---
+# Slack Connectivity
+
+## What It Checks
+Verifies that the configured Slack webhook endpoint is reachable. The check:
+
+- Sends an empty-text POST payload to the webhook URL with a 10-second timeout.
+- Relies on Slack returning `no_text` for empty messages, which shows the endpoint is alive without posting a visible message.
+- Passes if the response is successful or contains `no_text`.
+- Warns if an unexpected HTTP status is returned (e.g., invalid or revoked webhook).
+- Fails on connection timeout or HTTP request exceptions.
+
+The check only runs when `Notify:Channels:Slack:WebhookUrl` is set and is a valid absolute URL.
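
The pass, warn, and fail rules above can be expressed as a small classifier plus a probe. The Python sketch below is illustrative only: the function names are hypothetical, and the exact payload shape (`{"text": ""}`) is an assumption about what an empty-text POST looks like.

```python
import json
import urllib.error
import urllib.request


def classify_slack_response(status_code: int, body: str) -> str:
    """Pass on success or on Slack's no_text reply; warn on anything else."""
    if 200 <= status_code < 300 or "no_text" in body:
        return "pass"
    return "warn"


def probe_slack(webhook_url: str, timeout: float = 10.0) -> str:
    """POST an empty-text payload and classify the reply; network errors fail."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": ""}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return classify_slack_response(response.status, body)
    except urllib.error.HTTPError as error:
        # Non-2xx replies still carry a body, for example no_text.
        body = error.read().decode("utf-8", "replace")
        return classify_slack_response(error.code, body)
    except OSError:
        # URLError and socket timeouts are OSError subclasses.
        return "fail"
```

Catching `HTTPError` before the broader `OSError` matters because an error status with a `no_text` body should still count as a pass.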
+
+## Why It Matters
+A configured but unreachable Slack webhook means notifications are silently dropped. Teams relying on Slack for release alerts and security findings will miss critical events.
+
+## Common Causes
+- Invalid or expired webhook URL
+- Slack workspace configuration changed
+- Webhook URL revoked or regenerated
+- Rate limiting by Slack
+- Firewall blocking outbound HTTPS to hooks.slack.com
+- Proxy configuration required but not set
+
+## How to Fix
+
+### Docker Compose
+Test connectivity from the container:
+
+```bash
+docker exec <notify-container> curl -v https://hooks.slack.com/
+```
+
+If behind a proxy:
+```yaml
+environment:
+  HTTPS_PROXY: "http://proxy.example.com:8080"
+```
+
+### Bare Metal / systemd
+```bash
+curl -v https://hooks.slack.com/
+curl -X POST -H 'Content-type: application/json' \
+  --data '{"text":"Doctor test"}' \
+  'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
+```
+
+### Kubernetes / Helm
+```bash
+kubectl exec -it <notify-pod> -- curl -v https://hooks.slack.com/
+```
+
+If the webhook URL has been revoked, create a new one in the Slack App settings under **Incoming Webhooks** and update the configuration.
+
+## Verification
+```bash
+stella doctor run --check check.notify.slack.connectivity
+```
+
+## Related Checks
+- `check.notify.slack.configured` — verifies Slack webhook URL is set
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy
--- a/docs/doctor/articles/notify/teams-configured.md
+++ b/docs/doctor/articles/notify/teams-configured.md
@@ -0,0 +1,67 @@
+---
+checkId: check.notify.teams.configured
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, teams, quick, configuration]
+---
+# Teams Configuration
+
+## What It Checks
+Verifies that the Microsoft Teams notification channel is properly configured. The check reads `Notify:Channels:Teams` and validates:
+
+- **Webhook URL** (`WebhookUrl`): must be set and non-empty.
+- **URL format**: validates that the URL belongs to a Microsoft domain (`webhook.office.com` or `microsoft.com`).
+- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning.
+
+The check only runs when the `Notify:Channels:Teams` configuration section exists.
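
One way to express the domain rule above is a suffix match on the URL's host name. This Python sketch is an assumption about how such a test could be written; the real check may compare the URL differently (for example with a substring match), and `is_microsoft_webhook` is a hypothetical name.

```python
from urllib.parse import urlsplit

# Domains named in the description above.
ALLOWED_SUFFIXES = ("webhook.office.com", "microsoft.com")


def is_microsoft_webhook(url: str) -> bool:
    """True when the URL's host is one of the allowed domains or a subdomain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == suffix or host.endswith("." + suffix)
               for suffix in ALLOWED_SUFFIXES)
```

Matching on the parsed host rather than the raw string avoids accepting look-alike hosts such as `webhook.office.com.example.net`.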
+
+## Why It Matters
+Teams is a common enterprise notification channel. Without a valid webhook URL, notifications about release decisions, policy violations, and security findings cannot reach Teams channels.
+
+## Common Causes
+- Teams webhook URL not set in configuration
+- Webhook URL is not from a Microsoft domain (malformed or legacy URL)
+- Teams notifications explicitly disabled
+- Environment variable not bound to configuration
+
+## How to Fix
+
+### Docker Compose
+```yaml
+environment:
+  Notify__Channels__Teams__WebhookUrl: "https://YOUR_TENANT.webhook.office.com/webhookb2/..."
+```
+
+> **Security note:** Teams webhook URLs are secrets. Use Docker secrets or a vault.
+
+### Bare Metal / systemd
+```json
+{
+  "Notify": {
+    "Channels": {
+      "Teams": {
+        "WebhookUrl": "https://YOUR_TENANT.webhook.office.com/webhookb2/..."
+      }
+    }
+  }
+}
+```
+
+### Kubernetes / Helm
+```yaml
+notify:
+  channels:
+    teams:
+      webhookUrlSecret: "stellaops-teams-webhook"
+```
+
+To create the webhook in Teams: Channel > Connectors > Incoming Webhook > Create.
+
+## Verification
+```bash
+stella doctor run --check check.notify.teams.configured
+```
+
+## Related Checks
+- `check.notify.teams.connectivity` — tests whether the Teams webhook endpoint is reachable
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy
--- a/docs/doctor/articles/notify/teams-connectivity.md
+++ b/docs/doctor/articles/notify/teams-connectivity.md
@@ -0,0 +1,60 @@
+---
+checkId: check.notify.teams.connectivity
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, teams, connectivity, network]
+---
+# Teams Connectivity
+
+## What It Checks
+Verifies that the configured Microsoft Teams webhook endpoint is reachable. The check:
+
+- Sends a minimal Adaptive Card payload to the webhook URL with a 10-second timeout.
+- Passes if the response is successful (HTTP 2xx).
+- Warns if an unexpected HTTP status is returned (invalid, expired, or revoked webhook).
+- Fails on connection timeout or HTTP request exceptions.
+
+The check only runs when `Notify:Channels:Teams:WebhookUrl` is set and is a valid absolute URL.
+
+## Why It Matters
+An unreachable Teams webhook means notifications silently fail to deliver. Operations teams will miss release alerts and security findings if the webhook is broken.
+
+## Common Causes
+- Invalid or expired webhook URL
+- Teams connector disabled or deleted
+- Microsoft 365 tenant configuration changed
+- Firewall blocking outbound HTTPS to webhook.office.com
+- Proxy configuration required
+
+## How to Fix
+
+### Docker Compose
+```bash
+docker exec <notify-container> curl -v https://webhook.office.com/
+```
+
+### Bare Metal / systemd
+```bash
+curl -v https://webhook.office.com/
+curl -H 'Content-Type: application/json' \
+  -d '{"text":"Doctor test"}' \
+  'https://YOUR_TENANT.webhook.office.com/webhookb2/...'
+```
+
+Check Microsoft 365 service status at https://status.office.com.
+
+### Kubernetes / Helm
+```bash
+kubectl exec -it <notify-pod> -- curl -v https://webhook.office.com/
+```
+
+If the webhook is broken, recreate it: Teams channel > Connectors > Incoming Webhook > delete and recreate.
+
+## Verification
+```bash
+stella doctor run --check check.notify.teams.connectivity
+```
+
+## Related Checks
+- `check.notify.teams.configured` — verifies Teams webhook URL is set and valid
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy
--- a/docs/doctor/articles/notify/webhook-configured.md
+++ b/docs/doctor/articles/notify/webhook-configured.md
@@ -0,0 +1,68 @@
+---
+checkId: check.notify.webhook.configured
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, webhook, quick, configuration]
+---
+# Webhook Configuration
+
+## What It Checks
+Verifies that the generic webhook notification channel is properly configured. The check reads `Notify:Channels:Webhook` and validates:
+
+- **URL** (`Url` or `Endpoint`): must be set and be a valid HTTP or HTTPS URL.
+- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning.
+- **Method and content type** (`Method`, `ContentType`): read for evidence only; they default to `POST` and `application/json`.
+
+The check only runs when the `Notify:Channels:Webhook` configuration section exists.
+
+## Why It Matters
+Generic webhooks integrate Stella Ops notifications with third-party systems (PagerDuty, OpsGenie, custom dashboards, SIEM tools). A missing or malformed URL prevents these integrations from receiving events.
+
+## Common Causes
+- Webhook URL not set in configuration
+- Malformed URL (missing protocol `http://` or `https://`)
+- Invalid characters in URL
+- Webhook channel explicitly disabled
+
+## How to Fix
+
+### Docker Compose
+```yaml
+environment:
+  Notify__Channels__Webhook__Url: "https://your-endpoint/webhook"
+  Notify__Channels__Webhook__Method: "POST"
+  Notify__Channels__Webhook__ContentType: "application/json"
+```
+
+### Bare Metal / systemd
+```json
+{
+  "Notify": {
+    "Channels": {
+      "Webhook": {
+        "Url": "https://your-endpoint/webhook",
+        "Method": "POST",
+        "ContentType": "application/json"
+      }
+    }
+  }
+}
+```
+
+### Kubernetes / Helm
+```yaml
+notify:
+  channels:
+    webhook:
+      url: "https://your-endpoint/webhook"
+      method: "POST"
+```
+
+## Verification
+```bash
+stella doctor run --check check.notify.webhook.configured
+```
+
+## Related Checks
+- `check.notify.webhook.connectivity` — tests whether the webhook endpoint is reachable
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy
--- a/docs/doctor/articles/notify/webhook-connectivity.md
+++ b/docs/doctor/articles/notify/webhook-connectivity.md
@@ -0,0 +1,58 @@
+---
+checkId: check.notify.webhook.connectivity
+plugin: stellaops.doctor.notify
+severity: warn
+tags: [notify, webhook, connectivity, network]
+---
+# Webhook Connectivity
+
+## What It Checks
+Verifies that the configured generic webhook endpoint is reachable. The check:
+
+- Sends a HEAD request to the webhook URL (falls back to OPTIONS if HEAD is unsupported) with a 10-second timeout.
+- Treats any response with an HTTP status below 500 as reachable (even 401/403, which indicate the endpoint exists but requires authentication).
+- Warns on HTTP 5xx responses (server-side errors).
+- Fails on connection timeout or HTTP request exceptions.
+
+The check only runs when `Notify:Channels:Webhook:Url` (or `Endpoint`) is set and is a valid absolute URL.
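
The HEAD-then-OPTIONS probe and the status rule above can be sketched in Python as shown below. The names are hypothetical, and treating 405 or 501 as the signal that HEAD is unsupported is an assumption; the description does not say which responses trigger the fallback. The HTTP call is injected as a function so the decision logic stays independent of any client library.

```python
from typing import Callable


def probe_webhook(send: Callable[[str], int]) -> str:
    """Classify an endpoint given send(method) -> HTTP status code."""
    status = send("HEAD")
    if status in (405, 501):  # assumption: HEAD not supported by the endpoint
        status = send("OPTIONS")
    # Anything below 500, including 401/403, shows the endpoint is reachable.
    return "pass" if status < 500 else "warn"
```

A caller would wrap its HTTP client in `send` and map timeouts or connection errors to a failing result before this function is reached.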
+
+## Why It Matters
+A configured but unreachable webhook endpoint means third-party integrations silently stop receiving notifications. Events that should trigger PagerDuty alerts, SIEM ingestion, or custom dashboard updates will be lost.
+
+## Common Causes
+- Endpoint server not responding
+- Network connectivity issue or firewall blocking connection
+- DNS resolution failure
+- TLS/SSL certificate problem on the endpoint
+- Webhook endpoint service is down
+
+## How to Fix
+
+### Docker Compose
+```bash
+docker exec <notify-container> curl -v --max-time 10 https://your-endpoint/webhook
+docker exec <notify-container> nslookup your-endpoint
+```
+
+### Bare Metal / systemd
+```bash
+curl -I https://your-endpoint/webhook
+nslookup your-endpoint
+nc -zv your-endpoint 443
+```
+
+### Kubernetes / Helm
+```bash
+kubectl exec -it <notify-pod> -- curl -v https://your-endpoint/webhook
+```
+
+Check that egress NetworkPolicies allow traffic to the webhook destination.
+
+## Verification
+```bash
+stella doctor run --check check.notify.webhook.connectivity
+```
+
+## Related Checks
+- `check.notify.webhook.configured` — verifies webhook URL is set and valid
+- `check.notify.queue.health` — verifies the notification delivery queue is healthy