Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/notify/queue-health.md
+++ b/docs/doctor/articles/notify/queue-health.md
@@ -0,0 +1,93 @@
+---
+checkId: check.notify.queue.health
+plugin: stellaops.doctor.notify
+severity: fail
+tags: [notify, queue, redis, nats, infrastructure]
+---
+# Notification Queue Health
+
+## What It Checks
+Verifies that the notification event and delivery queues are healthy. The check:
+
+- Reads the `Notify:Queue:Transport` (or `Kind`) setting to determine the queue transport type (Redis/Valkey or NATS).
+- Resolves `NotifyQueueHealthCheck` and `NotifyDeliveryQueueHealthCheck` from the DI container.
+- Invokes each registered health check and aggregates the results.
+- Fails if any queue reports an `Unhealthy` status; warns if degraded; passes if all are healthy.
+
+The check only runs when a queue transport is configured in `Notify:Queue:Transport`.
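The pass/warn/fail rule described in the list above can be sketched as a small shell function. This is a minimal illustration, not the plugin's code: the function name `aggregate_status` and the lowercase `pass`/`warn`/`fail` outputs are assumptions, and the status names `Healthy` and `Degraded` are assumed to sit alongside the `Unhealthy` value the article mentions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the aggregation rule; not the Doctor plugin's code.
# Arguments: one status per queue health check (Healthy, Degraded, Unhealthy).
# Prints the overall result: fail, warn, or pass.
aggregate_status() {
  local result="pass"
  local status
  for status in "$@"; do
    case "$status" in
      Unhealthy) echo "fail"; return 0 ;;  # any unhealthy queue fails the check
      Degraded)  result="warn" ;;          # remember degradation, keep scanning
      Healthy)   ;;                        # leaves the result unchanged
      *) echo "unknown status: $status" >&2; return 2 ;;
    esac
  done
  echo "$result"
}

# Example: event queue healthy, delivery queue degraded
aggregate_status Healthy Degraded   # prints: warn
```

Returning as soon as an `Unhealthy` status is seen reflects that one failing queue is enough to fail the whole check, regardless of the other queue's state.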
+
+## Why It Matters
+The notification queue is the backbone of the notification pipeline. If the event queue is unhealthy, new notification events are lost. If the delivery queue is unhealthy, pending notifications to email, Slack, Teams, and webhook channels are not delivered. The check has severity `fail` because a queue failure means a complete notification blackout.
+
+## Common Causes
+- Queue server (Redis/Valkey/NATS) not running
+- Network connectivity issues between the Notify service and the queue server
+- Authentication failure (wrong password or other credentials)
+- Incorrect connection string in configuration
+
+## How to Fix
+
+### Docker Compose
+For Redis/Valkey transport:
+
+```bash
+# Check Redis health
+docker exec <redis-container> redis-cli ping
+
+# Check the queue settings (transport, connection string) passed to the Notify container
+docker exec <notify-container> env | grep -i Notify__Queue
+
+# Restart Redis if needed
+docker restart <redis-container>
+```
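The setting is written `Notify:Queue:Transport` in configuration files but is searched for as `Notify__Queue` in the container environment. Assuming the Notify service uses the standard .NET environment-variable configuration provider, each `:` in a configuration key corresponds to `__` in the variable name. A minimal sketch of that mapping follows; the helper name `config_key_to_env` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a colon-delimited configuration key to its
# environment-variable form by replacing every ":" with "__".
# Assumes the standard .NET environment-variable configuration convention.
config_key_to_env() {
  printf '%s\n' "$1" | sed 's/:/__/g'
}

config_key_to_env "Notify:Queue:Transport"   # prints: Notify__Queue__Transport
```

The printed name is the one to look for in the `env` output of the Notify container.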
+
+For NATS transport:
+
+```bash
+# Check NATS server status (requires the nats CLI to be available in the container)
+docker exec <nats-container> nats server ping
+
+# Check NATS logs
+docker logs <nats-container> --tail 50
+```
+
+### Bare Metal / systemd
+```bash
+# Redis/Valkey
+redis-cli ping
+redis-cli info server
+
+# NATS
+nats server ping
+systemctl status nats
+```
+
+Verify the connection string in `appsettings.json`:
+```json
+{
+  "Notify": {
+    "Queue": {
+      "Transport": "redis",
+      "Redis": {
+        "ConnectionString": "127.1.1.2:6379"
+      }
+    }
+  }
+}
+```
+
+### Kubernetes / Helm
+```bash
+kubectl exec -it <redis-pod> -- redis-cli ping
+kubectl logs <notify-pod> --tail 50 | grep -i queue
+```
+
+## Verification
+```bash
+stella doctor run --check check.notify.queue.health
+```
+
+## Related Checks
+- `check.notify.email.configured` — verifies email channel configuration
+- `check.notify.slack.configured` — verifies Slack channel configuration
+- `check.notify.webhook.configured` — verifies webhook channel configuration