Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions


@@ -0,0 +1,86 @@
---
checkId: check.notify.email.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, email, smtp, quick, configuration]
---
# Email Configuration
## What It Checks
Verifies that the email (SMTP) notification channel is properly configured. The check reads the `Notify:Channels:Email` configuration section and validates:
- **SMTP host** (`SmtpHost` or `Host`): must be set and non-empty.
- **SMTP port** (`SmtpPort` or `Port`): must be a valid number between 1 and 65535.
- **From address** (`FromAddress` or `From`): must be set so outbound emails have a valid sender.
- **Enabled flag** (`Enabled`): if explicitly set to `false`, reports a warning that the channel is configured but disabled.
The check only runs when the `Notify:Channels:Email` configuration section exists.
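
The validation rules above can be sketched in a few lines of Python. This is an illustration only, not the plugin's actual implementation: the function names, the status strings (`skip`, `warn`, `pass`), and the choice to report every missing field as a warning are assumptions made for the example.

```python
from __future__ import annotations

from typing import Mapping, Optional


def first_set(section: Mapping[str, object], *keys: str) -> Optional[str]:
    """Return the first non-empty value among the given keys, or None."""
    for key in keys:
        value = section.get(key)
        if value is not None and str(value).strip():
            return str(value).strip()
    return None


def check_email_config(section: Optional[Mapping[str, object]]) -> tuple[str, list[str]]:
    """Return (status, issues) for a Notify:Channels:Email-style section."""
    if section is None:
        return "skip", []  # the check only runs when the section exists

    issues: list[str] = []
    if first_set(section, "SmtpHost", "Host") is None:
        issues.append("SMTP host is not set")

    port = first_set(section, "SmtpPort", "Port")
    if port is None or not port.isdecimal() or not 1 <= int(port) <= 65535:
        issues.append("SMTP port is missing or outside 1-65535")

    if first_set(section, "FromAddress", "From") is None:
        issues.append("From address is not set")

    if issues:
        return "warn", issues  # assumed mapping, based on the article's severity

    enabled = first_set(section, "Enabled")
    if enabled is not None and enabled.lower() == "false":
        return "warn", ["channel is configured but disabled"]
    return "pass", []
```

Accepting either key of each pair (`SmtpHost` or `Host`, and so on) mirrors the alternatives listed above; the first non-empty value wins in this sketch.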
## Why It Matters
Email notifications deliver critical alerts for release gate failures, policy violations, and security findings. Without a properly configured SMTP host, no email notifications can be sent, leaving operators unaware of events that require immediate action. A missing from-address can cause receiving mail servers to reject the messages.
## Common Causes
- SMTP host not set in configuration
- Missing `Notify:Channels:Email:SmtpHost` setting
- SMTP port not specified or set to an invalid value
- From address not configured
- Email channel explicitly disabled in configuration
## How to Fix
### Docker Compose
Add environment variables to your service definition:
```yaml
environment:
  Notify__Channels__Email__SmtpHost: "smtp.example.com"
  Notify__Channels__Email__SmtpPort: "587"
  Notify__Channels__Email__FromAddress: "noreply@example.com"
  Notify__Channels__Email__UseSsl: "true"
```
### Bare Metal / systemd
Edit `appsettings.json`:
```json
{
  "Notify": {
    "Channels": {
      "Email": {
        "SmtpHost": "smtp.example.com",
        "SmtpPort": 587,
        "FromAddress": "noreply@example.com",
        "UseSsl": true
      }
    }
  }
}
```
Restart the service:
```bash
sudo systemctl restart stellaops-notify
```
### Kubernetes / Helm
Set values in your Helm `values.yaml`:
```yaml
notify:
  channels:
    email:
      smtpHost: "smtp.example.com"
      smtpPort: 587
      fromAddress: "noreply@example.com"
      useSsl: true
      credentialsSecret: "stellaops-smtp-credentials"
```
## Verification
```
stella doctor run --check check.notify.email.configured
```
## Related Checks
- `check.notify.email.connectivity` — tests whether the configured SMTP server is reachable
- `check.notify.queue.health` — verifies the notification delivery queue is healthy


@@ -0,0 +1,78 @@
---
checkId: check.notify.email.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, email, smtp, connectivity, network]
---
# Email Connectivity
## What It Checks
Verifies that the configured SMTP server is reachable by opening a TCP connection to the SMTP host and port. The check:
- Opens a TCP socket to `SmtpHost:SmtpPort` with a 10-second timeout.
- Reads the SMTP banner and verifies it starts with `220` (standard SMTP greeting).
- Reports an info-level result if the connection succeeds but the banner is not a recognized SMTP response.
- Fails if the connection times out, is refused, or encounters a socket error.
The check only runs when both `SmtpHost` and `SmtpPort` are configured with valid values.
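
A minimal Python sketch of the probe described above may help when reproducing the check by hand. It is not the plugin's code: the function names and status strings are hypothetical, and treating a timeout while waiting for the banner as a failure is an assumption of this example.

```python
import socket


def classify_banner(banner: bytes) -> str:
    """Map the first bytes sent by the server to a check status."""
    # A standard SMTP greeting starts with the reply code 220.
    return "pass" if banner.startswith(b"220") else "info"


def probe_smtp(host: str, port: int, timeout: float = 10.0) -> str:
    """Open a TCP connection, read the greeting, and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512)
    except OSError:
        # Covers timeouts, refused connections, DNS failures, and other socket errors.
        return "fail"
    return classify_banner(banner)
```

For example, `probe_smtp("smtp.example.com", 587)` would return `pass`, `info`, or `fail` depending on what the server at that placeholder address does.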
## Why It Matters
A configured but unreachable SMTP server means email notifications will silently fail. Release gate alerts, security finding notifications, and approval requests will never reach operators, potentially delaying incident response.
## Common Causes
- SMTP server not running
- Wrong host or port in configuration
- Firewall blocking outbound SMTP connections
- DNS resolution failure for the SMTP hostname
- Network latency too high (exceeding 10-second timeout)
## How to Fix
### Docker Compose
Verify network connectivity from the container:
```bash
docker exec <notify-container> nc -zv smtp.example.com 587
docker exec <notify-container> nslookup smtp.example.com
```
Ensure the container network can reach the SMTP server. If behind a proxy, configure it:
```yaml
environment:
  HTTP_PROXY: "http://proxy.example.com:8080"
```
### Bare Metal / systemd
Test connectivity manually:
```bash
nc -zv smtp.example.com 587
telnet smtp.example.com 587
nslookup smtp.example.com
```
Check firewall rules:
```bash
sudo iptables -L -n | grep 587
```
### Kubernetes / Helm
Verify connectivity from the pod:
```bash
kubectl exec -it <notify-pod> -- nc -zv smtp.example.com 587
```
Check NetworkPolicy resources that might block egress:
```bash
kubectl get networkpolicy -n stellaops
```
## Verification
```
stella doctor run --check check.notify.email.connectivity
```
## Related Checks
- `check.notify.email.configured` — verifies SMTP configuration is complete
- `check.notify.queue.health` — verifies the notification delivery queue is healthy


@@ -0,0 +1,93 @@
---
checkId: check.notify.queue.health
plugin: stellaops.doctor.notify
severity: fail
tags: [notify, queue, redis, nats, infrastructure]
---
# Notification Queue Health
## What It Checks
Verifies that the notification event and delivery queues are healthy. The check:
- Reads the `Notify:Queue:Transport` (or `Kind`) setting to determine the queue transport type (Redis/Valkey or NATS).
- Resolves `NotifyQueueHealthCheck` and `NotifyDeliveryQueueHealthCheck` from the DI container.
- Invokes each registered health check and aggregates the results.
- Fails if any queue reports an `Unhealthy` status; warns if degraded; passes if all are healthy.
The check only runs when a queue transport is configured in `Notify:Queue:Transport`.
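
The aggregation rule (any unhealthy queue fails the check, any degraded queue produces a warning, otherwise the check passes) can be illustrated with a short Python sketch. The enum, the function name, and the `skip` result for an empty input are assumptions made for this example, not names taken from the plugin.

```python
from enum import IntEnum


class Health(IntEnum):
    """Per-queue health states, ordered from best to worst."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


# Doctor status reported for the worst state seen across all queues.
STATUS_BY_WORST = {
    Health.HEALTHY: "pass",
    Health.DEGRADED: "warn",
    Health.UNHEALTHY: "fail",
}


def aggregate(results: dict[str, Health]) -> str:
    """Collapse per-queue health results into one Doctor status."""
    if not results:
        return "skip"  # assumption: nothing registered means nothing to report
    return STATUS_BY_WORST[max(results.values())]
```

Ordering the states numerically lets `max` pick the worst result, so adding a third queue health check would not change the aggregation logic.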
## Why It Matters
The notification queue is the backbone of the notification pipeline. If the event queue is unhealthy, new notification events are lost. If the delivery queue is unhealthy, pending notifications to email, Slack, Teams, and webhook channels will not be delivered. This check has severity `fail` because a queue failure means a complete notification blackout.
## Common Causes
- Queue server (Redis/Valkey/NATS) not running
- Network connectivity issues between the Notify service and the queue server
- Authentication failure (wrong password or credentials)
- Incorrect connection string in configuration
## How to Fix
### Docker Compose
For Redis/Valkey transport:
```bash
# Check Redis health
docker exec <redis-container> redis-cli ping
# Check connection string
docker exec <notify-container> env | grep Notify__Queue
# Restart Redis if needed
docker restart <redis-container>
```
For NATS transport:
```bash
# Check NATS server status
docker exec <nats-container> nats server ping
# Check NATS logs
docker logs <nats-container> --tail 50
```
### Bare Metal / systemd
```bash
# Redis/Valkey
redis-cli ping
redis-cli info server
# NATS
nats server ping
systemctl status nats
```
Verify the connection string in `appsettings.json`:
```json
{
  "Notify": {
    "Queue": {
      "Transport": "redis",
      "Redis": {
        "ConnectionString": "127.1.1.2:6379"
      }
    }
  }
}
```
### Kubernetes / Helm
```bash
kubectl exec -it <redis-pod> -- redis-cli ping
kubectl logs <notify-pod> --tail 50 | grep -i queue
```
## Verification
```
stella doctor run --check check.notify.queue.health
```
## Related Checks
- `check.notify.email.configured` — verifies email channel configuration
- `check.notify.slack.configured` — verifies Slack channel configuration
- `check.notify.webhook.configured` — verifies webhook channel configuration


@@ -0,0 +1,72 @@
---
checkId: check.notify.slack.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, slack, quick, configuration]
---
# Slack Configuration
## What It Checks
Verifies that the Slack notification channel is properly configured. The check reads `Notify:Channels:Slack` and validates:
- **Webhook URL** (`WebhookUrl`): must be set and non-empty.
- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning that Slack is configured but disabled.
The check only runs when the `Notify:Channels:Slack` configuration section exists.
## Why It Matters
Slack is a primary real-time notification channel for many operations teams. Without a configured webhook URL, security alerts, release gate notifications, and approval requests cannot reach Slack channels, delaying incident response.
## Common Causes
- Slack webhook URL not set in configuration
- Missing `Notify:Channels:Slack:WebhookUrl` setting
- Environment variable not bound to configuration
- Slack notifications explicitly disabled
## How to Fix
### Docker Compose
```yaml
environment:
  Notify__Channels__Slack__WebhookUrl: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```
> **Security note:** Slack webhook URLs are secrets. Store them in a secrets manager or Docker secrets, not in plain-text compose files.
### Bare Metal / systemd
Edit `appsettings.json`:
```json
{
  "Notify": {
    "Channels": {
      "Slack": {
        "WebhookUrl": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
      }
    }
  }
}
```
### Kubernetes / Helm
```yaml
notify:
  channels:
    slack:
      webhookUrlSecret: "stellaops-slack-webhook"
```
Create the secret:
```bash
kubectl create secret generic stellaops-slack-webhook \
  --from-literal=url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```
## Verification
```
stella doctor run --check check.notify.slack.configured
```
## Related Checks
- `check.notify.slack.connectivity` — tests whether the Slack webhook endpoint is reachable
- `check.notify.queue.health` — verifies the notification delivery queue is healthy


@@ -0,0 +1,68 @@
---
checkId: check.notify.slack.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, slack, connectivity, network]
---
# Slack Connectivity
## What It Checks
Verifies that the configured Slack webhook endpoint is reachable. The check:
- Sends an empty-text POST payload to the webhook URL with a 10-second timeout.
- Slack returns `no_text` for empty messages, which proves the endpoint is alive without posting a visible message.
- Passes if the response is successful or contains `no_text`.
- Warns if an unexpected HTTP status is returned (e.g., invalid or revoked webhook).
- Fails on connection timeout or HTTP request exceptions.
The check only runs when `Notify:Channels:Slack:WebhookUrl` is set and is a valid absolute URL.
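
The pass/warn/fail decision described above can be sketched in Python with the standard library. This is an illustrative approximation rather than the plugin's implementation: the function names are hypothetical, and the exact request body (`{"text": ""}`) is an assumption about what "empty-text payload" means.

```python
import json
import urllib.error
import urllib.request


def classify_slack_response(status_code: int, body: str) -> str:
    """Pass on any 2xx response or on Slack's `no_text` reply; otherwise warn."""
    if 200 <= status_code < 300 or "no_text" in body:
        return "pass"
    return "warn"


def probe_slack(webhook_url: str, timeout: float = 10.0) -> str:
    """POST an empty-text payload and classify the result."""
    payload = json.dumps({"text": ""}).encode("utf-8")  # assumed payload shape
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return classify_slack_response(response.status, body)
    except urllib.error.HTTPError as error:
        # Non-2xx responses still carry a body, which may contain `no_text`.
        body = error.read().decode("utf-8", "replace")
        return classify_slack_response(error.code, body)
    except OSError:
        # URLError, timeouts, and lower-level socket errors all derive from OSError.
        return "fail"
```

`HTTPError` is caught before the broader `OSError` so that an HTTP error status is classified by its body instead of being reported as a connection failure.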
## Why It Matters
A configured but unreachable Slack webhook means notifications are silently dropped. Teams relying on Slack for release alerts and security findings will miss critical events.
## Common Causes
- Invalid or expired webhook URL
- Slack workspace configuration changed
- Webhook URL revoked or regenerated
- Rate limiting by Slack
- Firewall blocking outbound HTTPS to hooks.slack.com
- Proxy configuration required but not set
## How to Fix
### Docker Compose
Test connectivity from the container:
```bash
docker exec <notify-container> curl -v https://hooks.slack.com/
```
If behind a proxy:
```yaml
environment:
  HTTPS_PROXY: "http://proxy.example.com:8080"
```
### Bare Metal / systemd
```bash
curl -v https://hooks.slack.com/
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Doctor test"}' \
  'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```
### Kubernetes / Helm
```bash
kubectl exec -it <notify-pod> -- curl -v https://hooks.slack.com/
```
If the webhook URL has been revoked, create a new one in the Slack App settings under **Incoming Webhooks** and update the configuration.
## Verification
```
stella doctor run --check check.notify.slack.connectivity
```
## Related Checks
- `check.notify.slack.configured` — verifies Slack webhook URL is set
- `check.notify.queue.health` — verifies the notification delivery queue is healthy


@@ -0,0 +1,67 @@
---
checkId: check.notify.teams.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, teams, quick, configuration]
---
# Teams Configuration
## What It Checks
Verifies that the Microsoft Teams notification channel is properly configured. The check reads `Notify:Channels:Teams` and validates:
- **Webhook URL** (`WebhookUrl`): must be set and non-empty.
- **URL format**: validates that the URL belongs to a Microsoft domain (`webhook.office.com` or `microsoft.com`).
- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning.
The check only runs when the `Notify:Channels:Teams` configuration section exists.
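
To make the URL format rule concrete, here is a small Python sketch of a domain test. It is not the plugin's code, and it assumes a suffix match on the hostname; whether the actual check matches by suffix or by substring is not stated here.

```python
from urllib.parse import urlparse

# Domains named in the check description above.
ALLOWED_SUFFIXES = ("webhook.office.com", "microsoft.com")


def is_microsoft_webhook(url: str) -> bool:
    """Return True if the URL's host is one of the allowed domains or a subdomain."""
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        return False  # malformed URL, for example an unterminated IPv6 literal
    return any(host == suffix or host.endswith("." + suffix) for suffix in ALLOWED_SUFFIXES)
```

Matching on the parsed hostname, and not on the raw string, avoids accepting a look-alike such as `https://webhook.office.com.evil.example/`.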
## Why It Matters
Teams is a common enterprise notification channel. Without a valid webhook URL, notifications about release decisions, policy violations, and security findings cannot reach Teams channels.
## Common Causes
- Teams webhook URL not set in configuration
- Webhook URL is not from a Microsoft domain (malformed or legacy URL)
- Teams notifications explicitly disabled
- Environment variable not bound to configuration
## How to Fix
### Docker Compose
```yaml
environment:
  Notify__Channels__Teams__WebhookUrl: "https://YOUR_TENANT.webhook.office.com/webhookb2/..."
```
> **Security note:** Teams webhook URLs are secrets. Use Docker secrets or a vault.
### Bare Metal / systemd
```json
{
  "Notify": {
    "Channels": {
      "Teams": {
        "WebhookUrl": "https://YOUR_TENANT.webhook.office.com/webhookb2/..."
      }
    }
  }
}
```
### Kubernetes / Helm
```yaml
notify:
  channels:
    teams:
      webhookUrlSecret: "stellaops-teams-webhook"
```
To create the webhook in Teams: Channel > Connectors > Incoming Webhook > Create.
## Verification
```
stella doctor run --check check.notify.teams.configured
```
## Related Checks
- `check.notify.teams.connectivity` — tests whether the Teams webhook endpoint is reachable
- `check.notify.queue.health` — verifies the notification delivery queue is healthy


@@ -0,0 +1,60 @@
---
checkId: check.notify.teams.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, teams, connectivity, network]
---
# Teams Connectivity
## What It Checks
Verifies that the configured Microsoft Teams webhook endpoint is reachable. The check:
- Sends a minimal Adaptive Card payload to the webhook URL with a 10-second timeout.
- Passes if the response is successful (HTTP 2xx).
- Warns if an unexpected HTTP status is returned (invalid, expired, or revoked webhook).
- Fails on connection timeout or HTTP request exceptions.
The check only runs when `Notify:Channels:Teams:WebhookUrl` is set and is a valid absolute URL.
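
For reference, a minimal Adaptive Card message for a Teams incoming webhook might look like the payload built below. The card text, the card version, and the function names are assumptions made for this sketch; the exact payload sent by the check may differ.

```python
import json


def build_probe_payload(text: str = "Doctor connectivity probe") -> dict:
    """Build a minimal Adaptive Card message envelope (hypothetical probe text)."""
    return {
        "type": "message",
        "attachments": [
            {
                "contentType": "application/vnd.microsoft.card.adaptive",
                "content": {
                    "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
                    "type": "AdaptiveCard",
                    "version": "1.2",
                    "body": [{"type": "TextBlock", "text": text}],
                },
            }
        ],
    }


def classify_teams_status(status_code: int) -> str:
    """Pass on HTTP 2xx; any other status is treated as a warning."""
    return "pass" if 200 <= status_code < 300 else "warn"


if __name__ == "__main__":
    # Print the JSON body that would be POSTed to the webhook URL.
    print(json.dumps(build_probe_payload(), indent=2))
```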
## Why It Matters
An unreachable Teams webhook means notifications silently fail to deliver. Operations teams will miss release alerts and security findings if the webhook is broken.
## Common Causes
- Invalid or expired webhook URL
- Teams connector disabled or deleted
- Microsoft 365 tenant configuration changed
- Firewall blocking outbound HTTPS to webhook.office.com
- Proxy configuration required
## How to Fix
### Docker Compose
```bash
docker exec <notify-container> curl -v https://webhook.office.com/
```
### Bare Metal / systemd
```bash
curl -v https://webhook.office.com/
curl -H 'Content-Type: application/json' \
  -d '{"text":"Doctor test"}' \
  'https://YOUR_TENANT.webhook.office.com/webhookb2/...'
```
Check Microsoft 365 service status at https://status.office.com.
### Kubernetes / Helm
```bash
kubectl exec -it <notify-pod> -- curl -v https://webhook.office.com/
```
If the webhook is broken, recreate it: Teams channel > Connectors > Incoming Webhook > delete and recreate.
## Verification
```
stella doctor run --check check.notify.teams.connectivity
```
## Related Checks
- `check.notify.teams.configured` — verifies Teams webhook URL is set and valid
- `check.notify.queue.health` — verifies the notification delivery queue is healthy


@@ -0,0 +1,68 @@
---
checkId: check.notify.webhook.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, webhook, quick, configuration]
---
# Webhook Configuration
## What It Checks
Verifies that the generic webhook notification channel is properly configured. The check reads `Notify:Channels:Webhook` and validates:
- **URL** (`Url` or `Endpoint`): must be set and be a valid HTTP or HTTPS URL.
- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning.
- Also reads `Method` (defaults to POST) and `ContentType` (defaults to application/json) for evidence.
The check only runs when the `Notify:Channels:Webhook` configuration section exists.
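
The URL rule and the defaults for `Method` and `ContentType` can be sketched as follows. This is a hedged illustration, not the plugin's implementation; the function names and the shape of the returned evidence dictionary are hypothetical.

```python
from __future__ import annotations

from typing import Mapping, Optional
from urllib.parse import urlparse


def is_valid_webhook_url(url: Optional[str]) -> bool:
    """Accept only absolute http:// or https:// URLs that include a host."""
    if not url or not url.strip():
        return False
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.hostname)


def webhook_evidence(section: Mapping[str, str]) -> dict[str, Optional[str]]:
    """Collect the values the check reports, applying the documented defaults."""
    return {
        "url": section.get("Url") or section.get("Endpoint"),
        "method": section.get("Method") or "POST",
        "content_type": section.get("ContentType") or "application/json",
    }
```

A value such as `your-endpoint/webhook` fails this test because it has no scheme, which corresponds to the "missing protocol" cause listed in this article.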
## Why It Matters
Generic webhooks integrate Stella Ops notifications with third-party systems (PagerDuty, OpsGenie, custom dashboards, SIEM tools). A missing or malformed URL prevents these integrations from receiving events.
## Common Causes
- Webhook URL not set in configuration
- Malformed URL (missing protocol `http://` or `https://`)
- Invalid characters in URL
- Webhook channel explicitly disabled
## How to Fix
### Docker Compose
```yaml
environment:
  Notify__Channels__Webhook__Url: "https://your-endpoint/webhook"
  Notify__Channels__Webhook__Method: "POST"
  Notify__Channels__Webhook__ContentType: "application/json"
```
### Bare Metal / systemd
```json
{
  "Notify": {
    "Channels": {
      "Webhook": {
        "Url": "https://your-endpoint/webhook",
        "Method": "POST",
        "ContentType": "application/json"
      }
    }
  }
}
```
### Kubernetes / Helm
```yaml
notify:
  channels:
    webhook:
      url: "https://your-endpoint/webhook"
      method: "POST"
```
## Verification
```
stella doctor run --check check.notify.webhook.configured
```
## Related Checks
- `check.notify.webhook.connectivity` — tests whether the webhook endpoint is reachable
- `check.notify.queue.health` — verifies the notification delivery queue is healthy


@@ -0,0 +1,58 @@
---
checkId: check.notify.webhook.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, webhook, connectivity, network]
---
# Webhook Connectivity
## What It Checks
Verifies that the configured generic webhook endpoint is reachable. The check:
- Sends a HEAD request to the webhook URL (falls back to OPTIONS if HEAD is unsupported) with a 10-second timeout.
- Any response with HTTP status < 500 is considered reachable (even 401/403, which indicate the endpoint exists but requires authentication).
- Warns on HTTP 5xx responses (server-side errors).
- Fails on connection timeout or HTTP request exceptions.
The check only runs when `Notify:Channels:Webhook:Url` (or `Endpoint`) is set and is a valid absolute URL.
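
A rough Python equivalent of this probe is shown below. It is a sketch under stated assumptions, not the plugin's code: the function names are hypothetical, and treating HTTP 405 or 501 as "HEAD is unsupported" is a guess at how the fallback to OPTIONS is triggered.

```python
import urllib.error
import urllib.request


def classify_webhook_status(status_code: int) -> str:
    """Any status below 500 proves the endpoint exists; 5xx is a warning."""
    return "pass" if status_code < 500 else "warn"


def request_status(url: str, method: str, timeout: float) -> int:
    """Send a body-less request and return the HTTP status code."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses still count as an answer


def probe_webhook(url: str, timeout: float = 10.0) -> str:
    """Try HEAD first, fall back to OPTIONS, and classify the final status."""
    try:
        status = request_status(url, "HEAD", timeout)
        if status in (405, 501):  # assumed signal that HEAD is unsupported
            status = request_status(url, "OPTIONS", timeout)
    except OSError:
        # Timeouts, DNS failures, TLS errors, and refused connections.
        return "fail"
    return classify_webhook_status(status)
```

Under this rule a 401 or 403 counts as reachable: the endpoint answered and only rejected the unauthenticated probe.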
## Why It Matters
A configured but unreachable webhook endpoint means third-party integrations silently stop receiving notifications. Events that should trigger PagerDuty alerts, SIEM ingestion, or custom dashboard updates will be lost.
## Common Causes
- Endpoint server not responding
- Network connectivity issue or firewall blocking connection
- DNS resolution failure
- TLS/SSL certificate problem on the endpoint
- Webhook endpoint service is down
## How to Fix
### Docker Compose
```bash
docker exec <notify-container> curl -v --max-time 10 https://your-endpoint/webhook
docker exec <notify-container> nslookup your-endpoint
```
### Bare Metal / systemd
```bash
curl -I https://your-endpoint/webhook
nslookup your-endpoint
nc -zv your-endpoint 443
```
### Kubernetes / Helm
```bash
kubectl exec -it <notify-pod> -- curl -v https://your-endpoint/webhook
```
Check that egress NetworkPolicies allow traffic to the webhook destination.
## Verification
```
stella doctor run --check check.notify.webhook.connectivity
```
## Related Checks
- `check.notify.webhook.configured` — verifies webhook URL is set and valid
- `check.notify.queue.health` — verifies the notification delivery queue is healthy