| checkId | plugin | severity | tags |
|---|---|---|---|
| check.integration.webhooks | stellaops.doctor.integration | warn | |
Integration Webhook Health
What It Checks
Iterates over all webhook endpoints defined under Webhooks:Endpoints. For outbound webhooks it sends an HTTP HEAD request to the target URL and considers the endpoint reachable if the response status code is below 500. For inbound webhooks it marks reachability as true (endpoint is local). It then calculates the delivery failure rate from TotalDeliveries and SuccessfulDeliveries counters. The check fails if any outbound endpoint is unreachable or if any webhook's failure rate exceeds 20%, warns if any webhook's failure rate is between 5% and 20%, and passes otherwise.
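The pass/warn/fail decision described above can be sketched as a small shell function. This is an illustration only, not the plugin's actual code: the function name `classify_webhook` is hypothetical, the failure rate is assumed to be (TotalDeliveries - SuccessfulDeliveries) / TotalDeliveries, a webhook with zero deliveries is assumed to pass, and the exact handling of the 5% and 20% boundaries is a guess.

```shell
# Hypothetical helper: classify one webhook from its delivery counters.
# Usage: classify_webhook TOTAL_DELIVERIES SUCCESSFUL_DELIVERIES REACHABLE
#   REACHABLE is "true" or "false" (inbound webhooks are always "true").
classify_webhook() {
  cw_total=$1
  cw_ok=$2
  cw_reachable=$3

  # An unreachable outbound endpoint fails regardless of its counters.
  if [ "$cw_reachable" != "true" ]; then
    echo fail
    return
  fi

  # Assumption: no deliveries yet means there is no failure rate to judge.
  if [ "$cw_total" -eq 0 ]; then
    echo pass
    return
  fi

  cw_failed=$(( cw_total - cw_ok ))

  # Compare failed/total against the thresholds using integer math:
  #   failed/total > 20%   <=>  failed*100 > total*20
  #   failed/total >= 5%   <=>  failed*100 >= total*5
  if [ $(( cw_failed * 100 )) -gt $(( cw_total * 20 )) ]; then
    echo fail
  elif [ $(( cw_failed * 100 )) -ge $(( cw_total * 5 )) ]; then
    echo warn
  else
    echo pass
  fi
}
```

For example, `classify_webhook 100 90 true` yields `warn` (10% failures), while `classify_webhook 100 70 true` yields `fail` (30% failures). The overall check result would then be the worst outcome across all configured webhooks.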
Why It Matters
Webhooks are the primary event-driven communication channel between Stella Ops and external systems. Unreachable outbound endpoints mean notifications, CI triggers, and audit event deliveries silently fail. A rising failure rate is an early warning of endpoint degradation that can cascade into missed alerts, delayed approvals, and incomplete audit trails.
Common Causes
- Webhook endpoint is down or returning 5xx errors
- Network connectivity issue or DNS resolution failure
- TLS certificate expired or untrusted
- Payload format changed causing receiver to reject events
- Rate limiting by the receiving service
- Intermittent timeouts under load
How to Fix
Docker Compose
# List configured webhooks (keys are mixed case, so match case-insensitively)
grep -i 'WEBHOOKS__' .env
# Test an outbound webhook endpoint
docker compose exec gateway curl -I https://hooks.example.com/stellaops
# View webhook delivery logs
docker compose logs platform | grep -i webhook
# Update a webhook URL
echo 'Webhooks__Endpoints__0__Url=https://hooks.example.com/v2/stellaops' >> .env
# Recreate the container so it picks up the new value (a plain restart does not reload environment changes)
docker compose up -d platform
Bare Metal / systemd
# Check webhook configuration
jq '.Webhooks' /etc/stellaops/appsettings.Production.json
# Test endpoint connectivity
curl -I https://hooks.example.com/stellaops
# Review delivery history
stella webhooks logs <webhook-name> --status failed
# Retry failed deliveries
stella webhooks retry <webhook-name>
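When testing an endpoint by hand, it can help to apply the same rule the check uses: send a HEAD request and treat any response status below 500 as reachable. The following is a minimal sketch under that assumption. The function names and the 10-second timeout are illustrative choices, not part of Stella Ops, and whether the real check follows redirects is not specified here.

```shell
# Hypothetical helper: decide reachability from an HTTP status code.
# Any real status from 100 to 499 counts as reachable; 5xx does not.
# curl reports "000" when no HTTP response was received at all.
status_is_reachable() {
  case "$1" in
    [1-4][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: probe one URL with a HEAD request.
# Usage: probe_webhook https://hooks.example.com/stellaops
probe_webhook() {
  # -I sends HEAD, -o /dev/null discards the headers,
  # -w prints only the status code; the timeout is an assumed value.
  pw_code=$(curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$1" || true)
  if status_is_reachable "$pw_code"; then
    echo "reachable ($pw_code)"
  else
    echo "unreachable ($pw_code)"
    return 1
  fi
}
```

Note that under this rule a 401 or 404 still counts as reachable, so a passing probe says the endpoint is up, not that deliveries to it will be accepted.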
Kubernetes / Helm
# values.yaml
webhooks:
  endpoints:
    - name: slack-releases
      url: https://hooks.example.com/stellaops
      direction: outbound
# Apply the updated values
helm upgrade stellaops ./chart -f values.yaml
Verification
stella doctor run --check check.integration.webhooks
Related Checks
- check.integration.slack: Slack-specific webhook validation
- check.integration.teams: Teams-specific webhook validation
- check.integration.ci.system: CI systems that receive webhook events