Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,77 @@
---
checkId: check.integration.webhooks
plugin: stellaops.doctor.integration
severity: warn
tags: [integration, webhooks, notifications, events]
---
# Integration Webhook Health
## What It Checks
Iterates over all webhook endpoints defined under `Webhooks:Endpoints`. For **outbound** webhooks it sends an HTTP HEAD request to the target URL and considers the endpoint reachable if the response status code is below 500. **Inbound** webhooks are always treated as reachable because the endpoint is local. For every webhook it then calculates the delivery failure rate from the `TotalDeliveries` and `SuccessfulDeliveries` counters. The check **fails** if any outbound endpoint is unreachable or any webhook's failure rate exceeds 20%, **warns** if any webhook's failure rate is between 5% and 20%, and **passes** otherwise.
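
The decision rules above can be sketched as a small shell function. This is a minimal illustration, not the plugin's actual implementation: the function name `classify_webhook`, its argument order, and the use of status `0` to mean "no response" are hypothetical. It also assumes that a failure rate of exactly 5% still passes and that a webhook with no recorded deliveries has a failure rate of zero; the source does not specify either boundary.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; names and conventions here are assumptions.

# classify_webhook DIRECTION HTTP_STATUS TOTAL_DELIVERIES SUCCESSFUL_DELIVERIES
# Prints pass, warn, or fail for a single webhook.
classify_webhook() {
  local direction=$1 status=$2 total=$3 successful=$4

  # Outbound endpoints are reachable when the HEAD status is below 500.
  # Status 0 stands in for "no response at all" (hypothetical convention).
  # Inbound endpoints are local, so reachability is not tested.
  if [ "$direction" = "outbound" ]; then
    if [ "$status" -le 0 ] || [ "$status" -ge 500 ]; then
      echo fail
      return
    fi
  fi

  local failed=$(( total - successful ))

  # Integer-only comparison: failed/total > 20%  <=>  failed*100 > 20*total.
  # With total = 0 both comparisons are false, so the result is pass.
  if [ $(( failed * 100 )) -gt $(( 20 * total )) ]; then
    echo fail
  elif [ $(( failed * 100 )) -gt $(( 5 * total )) ]; then
    echo warn
  else
    echo pass
  fi
}
```

For example, `classify_webhook outbound 200 100 90` prints `warn` (a 10% failure rate), while `classify_webhook outbound 503 100 100` prints `fail` because the endpoint counts as unreachable regardless of its delivery counters.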
## Why It Matters
Webhooks are the primary event-driven communication channel between Stella Ops and external systems. Unreachable outbound endpoints mean notifications, CI triggers, and audit event deliveries silently fail. A rising failure rate is an early warning of endpoint degradation that can cascade into missed alerts, delayed approvals, and incomplete audit trails.
## Common Causes
- Webhook endpoint is down or returning 5xx errors
- Network connectivity issue or DNS resolution failure
- TLS certificate expired or untrusted
- Payload format changed causing receiver to reject events
- Rate limiting by the receiving service
- Intermittent timeouts under load
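
As a rough aid for narrowing down which of these causes applies, the sketch below maps the status code of a HEAD probe to a likely cause. The helper names `probe_status` and `likely_cause` are hypothetical and are not part of the `stella` CLI; the mapping is a heuristic built from the list above, and it relies on curl's `%{http_code}` write-out variable, which reports `000` when no HTTP response was received.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; helper names are assumptions.

# probe_status URL
# Prints the HTTP status of a HEAD request, or 000 if no response arrived.
probe_status() {
  curl -sS -I -o /dev/null -w '%{http_code}' --max-time 10 "$1" 2>/dev/null || true
}

# likely_cause STATUS
# Maps a status code to the most likely entry in the causes list.
likely_cause() {
  case "$1" in
    ""|000) echo "no response: check DNS, network path, and TLS certificate" ;;
    5??)    echo "endpoint is down or returning server errors" ;;
    429)    echo "rate limited by the receiving service" ;;
    4??)    echo "reachable, but the receiver rejected the probe request" ;;
    *)      echo "endpoint responded normally" ;;
  esac
}
```

A typical invocation would be `likely_cause "$(probe_status https://hooks.example.com/stellaops)"`. Note that a 4xx answer to a HEAD probe still counts as reachable for this check, and may only mean that the receiver does not accept HEAD requests.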
## How to Fix
### Docker Compose
```bash
# List configured webhooks (case-insensitive, so Webhooks__ and WEBHOOKS__ both match)
grep -i 'webhooks__' .env
# Test an outbound webhook endpoint
docker compose exec gateway curl -I https://hooks.example.com/stellaops
# View webhook delivery logs
docker compose logs platform | grep -i webhook
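# Update a webhook URL (edit the existing line instead if the key is already set)
echo 'Webhooks__Endpoints__0__Url=https://hooks.example.com/v2/stellaops' >> .env
# Recreate the container; a plain restart does not pick up environment changes
docker compose up -d platform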
```
### Bare Metal / systemd
```bash
# Check webhook configuration
jq '.Webhooks' /etc/stellaops/appsettings.Production.json
# Test endpoint connectivity
curl -I https://hooks.example.com/stellaops
# Review delivery history
stella webhooks logs <webhook-name> --status failed
# Retry failed deliveries
stella webhooks retry <webhook-name>
```
### Kubernetes / Helm
```yaml
# values.yaml
webhooks:
  endpoints:
    - name: slack-releases
      url: https://hooks.example.com/stellaops
      direction: outbound
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```
## Verification
```bash
stella doctor run --check check.integration.webhooks
```
## Related Checks
- `check.integration.slack` -- Slack-specific webhook validation
- `check.integration.teams` -- Teams-specific webhook validation
- `check.integration.ci.system` -- CI systems that receive webhook events