Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/integration/webhook-health.md
+++ b/docs/doctor/articles/integration/webhook-health.md
@@ -0,0 +1,77 @@
+---
+checkId: check.integration.webhooks
+plugin: stellaops.doctor.integration
+severity: warn
+tags: [integration, webhooks, notifications, events]
+---
+# Integration Webhook Health
+
+## What It Checks
+Iterates over all webhook endpoints defined under `Webhooks:Endpoints`. For **outbound** webhooks it sends an HTTP HEAD request to the target URL and treats the endpoint as reachable if the response status code is below 500. **Inbound** webhooks are always marked reachable because the endpoint is local. It then calculates each webhook's delivery failure rate from its `TotalDeliveries` and `SuccessfulDeliveries` counters. The check **fails** if any outbound endpoint is unreachable or if any webhook's failure rate exceeds 20%, **warns** if any webhook's failure rate is between 5% and 20%, and **passes** otherwise.
+
+## Why It Matters
+Webhooks are the primary event-driven communication channel between Stella Ops and external systems. Unreachable outbound endpoints mean notifications, CI triggers, and audit event deliveries silently fail. A rising failure rate is an early warning of endpoint degradation that can cascade into missed alerts, delayed approvals, and incomplete audit trails.
+
+## Common Causes
+- Webhook endpoint is down or returning 5xx errors
+- Network connectivity issue or DNS resolution failure
+- TLS certificate expired or untrusted
+- Payload format changed causing receiver to reject events
+- Rate limiting by the receiving service
+- Intermittent timeouts under load
+
+## How to Fix
+
+### Docker Compose
+```bash
+# List configured webhooks
+grep -i 'webhooks__' .env
+
+# Test an outbound webhook endpoint
+docker compose exec gateway curl -I https://hooks.example.com/stellaops
+
+# View webhook delivery logs
+docker compose logs platform | grep -i webhook
+
+# Update a webhook URL
+echo 'Webhooks__Endpoints__0__Url=https://hooks.example.com/v2/stellaops' >> .env
+docker compose restart platform
+```
+
+### Bare Metal / systemd
+```bash
+# Check webhook configuration
+jq '.Webhooks' /etc/stellaops/appsettings.Production.json
+
+# Test endpoint connectivity
+curl -I https://hooks.example.com/stellaops
+
+# Review delivery history
+stella webhooks logs <webhook-name> --status failed
+
+# Retry failed deliveries
+stella webhooks retry <webhook-name>
+```
+
+### Kubernetes / Helm
+```yaml
+# values.yaml
+webhooks:
+  endpoints:
+    - name: slack-releases
+      url: https://hooks.example.com/stellaops
+      direction: outbound
+```
+```bash
+helm upgrade stellaops ./chart -f values.yaml
+```
+
+## Verification
+```bash
+stella doctor run --check check.integration.webhooks
+```
+
+## Related Checks
+- `check.integration.slack` -- Slack-specific webhook validation
+- `check.integration.teams` -- Teams-specific webhook validation
+- `check.integration.ci.system` -- CI systems that receive webhook events
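
Note on the pass/warn/fail rule described under "What It Checks" in the article above: the thresholds can be sketched as a small shell function for a single webhook. This is only an illustration, not the plugin's implementation. The function name `classify_webhook` and its arguments are hypothetical, and three behaviors are assumptions the article does not confirm: the failure rate is taken as `(TotalDeliveries - SuccessfulDeliveries) / TotalDeliveries`, a webhook with no deliveries passes, and a rate of exactly 5% or exactly 20% warns.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the stella CLI): classify one webhook
# from its delivery counters and reachability.
# Usage: classify_webhook TOTAL SUCCESSFUL [REACHABLE]
classify_webhook() {
  local total=$1 successful=$2 reachable=${3:-true}

  # An unreachable outbound endpoint fails regardless of its failure rate.
  if [ "$reachable" != "true" ]; then
    echo fail
    return
  fi

  # Assumption: with no recorded deliveries there is no failure rate to judge.
  if [ "$total" -le 0 ]; then
    echo pass
    return
  fi

  local failed=$(( total - successful ))

  # Compare failed/total with the thresholds using integer math
  # (failed * 100 > total * 20 is the same as a rate above 20%).
  if [ $(( failed * 100 )) -gt $(( total * 20 )) ]; then
    echo fail
  elif [ $(( failed * 100 )) -ge $(( total * 5 )) ]; then
    echo warn   # assumption: the 5% and 20% boundaries are inclusive
  else
    echo pass
  fi
}
```

For example, `classify_webhook 100 90` prints `warn` (10% failures) and `classify_webhook 100 70` prints `fail` (30% failures). The check-level result would then be the worst outcome across all configured webhooks.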