git.stella-ops.org/docs/doctor/articles/integration/webhook-health.md
master c58a236d70 Doctor plugin checks: implement health check classes and documentation
2026-03-27 12:28:00 +02:00

checkId: check.integration.webhooks
plugin: stellaops.doctor.integration
severity: warn
tags: integration, webhooks, notifications, events

Integration Webhook Health

What It Checks

Iterates over all webhook endpoints defined under Webhooks:Endpoints. For outbound webhooks it sends an HTTP HEAD request to the target URL and considers the endpoint reachable if the response status code is below 500. For inbound webhooks it marks reachability as true (endpoint is local). It then calculates the delivery failure rate from TotalDeliveries and SuccessfulDeliveries counters. The check fails if any outbound endpoint is unreachable or if any webhook's failure rate exceeds 20%, warns if any webhook's failure rate is between 5% and 20%, and passes otherwise.
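The decision rule described above can be sketched in a few lines of Python. This is an illustrative model only, not the plugin's actual implementation: the `Webhook` class, its field names, the treatment of endpoints with zero deliveries, and the exact handling of the 5% boundary are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

FAIL_THRESHOLD = 0.20  # failure rate above this fails the check
WARN_THRESHOLD = 0.05  # failure rate from here up to 20% warns (boundary handling assumed)


@dataclass
class Webhook:
    """Hypothetical stand-in for one entry under Webhooks:Endpoints."""
    name: str
    direction: str                      # "inbound" or "outbound"
    total_deliveries: int
    successful_deliveries: int
    head_status: Optional[int] = None   # HEAD response status; None if the request failed


def is_reachable(hook: Webhook) -> bool:
    # Inbound endpoints are local, so they are always treated as reachable.
    if hook.direction == "inbound":
        return True
    # Any response below 500 counts as reachable, including 4xx.
    return hook.head_status is not None and hook.head_status < 500


def failure_rate(hook: Webhook) -> float:
    # Assumption: an endpoint with no deliveries yet has a failure rate of 0.
    if hook.total_deliveries == 0:
        return 0.0
    failed = hook.total_deliveries - hook.successful_deliveries
    return failed / hook.total_deliveries


def evaluate(hooks: List[Webhook]) -> str:
    rates = [failure_rate(h) for h in hooks]
    if any(not is_reachable(h) for h in hooks) or any(r > FAIL_THRESHOLD for r in rates):
        return "fail"
    if any(r >= WARN_THRESHOLD for r in rates):
        return "warn"
    return "pass"
```

Note that under this rule a receiver answering HEAD with 404 or 405 still counts as reachable, so a passing reachability probe does not by itself prove that deliveries succeed; the failure-rate counters cover that case.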

Why It Matters

Webhooks are the primary event-driven communication channel between Stella Ops and external systems. Unreachable outbound endpoints mean notifications, CI triggers, and audit event deliveries silently fail. A rising failure rate is an early warning of endpoint degradation that can cascade into missed alerts, delayed approvals, and incomplete audit trails.

Common Causes

  • Webhook endpoint is down or returning 5xx errors
  • Network connectivity issue or DNS resolution failure
  • TLS certificate expired or untrusted
  • Payload format changed causing receiver to reject events
  • Rate limiting by the receiving service
  • Intermittent timeouts under load
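To narrow down which of these causes applies, a HEAD probe can be run by hand and the resulting error mapped back to the list above. The following is a minimal sketch using only the Python standard library; the function names and the returned labels are illustrative and are not part of the Stella Ops CLI or the Doctor plugin.

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify_error(exc: BaseException) -> str:
    """Map a probe exception to one of the common causes listed above."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 429:
            return "rate limited by receiver"
        if exc.code >= 500:
            return "endpoint returning 5xx"
        return "reachable (client error)"
    # urllib wraps low-level failures in URLError; unwrap to inspect the reason.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ssl.SSLError):
        return "TLS certificate problem"
    if isinstance(reason, socket.gaierror):
        return "DNS resolution failure"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(reason, OSError):
        return "network connectivity issue"
    return "unknown"


def probe(url: str, timeout: float = 5.0) -> str:
    """Send a HEAD request and describe the outcome."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return f"reachable (HTTP {response.status})"
    except Exception as exc:
        return classify_error(exc)
```

For example, `probe("https://hooks.example.com/stellaops")` returns a short label that points at the matching entry in the list. Payload-format rejections will not show up here, since a HEAD request carries no body; those need the delivery logs.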

How to Fix

Docker Compose

# List configured webhooks (case-insensitive: keys may be written as Webhooks__...)
grep -i 'WEBHOOKS__' .env

# Test an outbound webhook endpoint
docker compose exec gateway curl -I https://hooks.example.com/stellaops

# View webhook delivery logs
docker compose logs platform | grep -i webhook

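# Update a webhook URL (edit the existing line instead if the key is already in .env)
echo 'Webhooks__Endpoints__0__Url=https://hooks.example.com/v2/stellaops' >> .env
# Recreate the container so the new value is applied; a plain restart does not pick up environment changes
docker compose up -d platform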
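The environment variable names above appear to follow the .NET configuration convention in which a double underscore stands for the `:` separator, so `Webhooks__Endpoints__0__Url` corresponds to `Webhooks:Endpoints:0:Url`. Assuming that convention applies here, the small sketch below lists the webhook settings in a `.env` file under their configuration keys. The helper names are hypothetical, and the parsing is deliberately simpler than what Docker Compose itself does.

```python
from typing import Dict


def env_to_config_key(name: str) -> str:
    """Translate an environment variable name into a colon-separated config key."""
    return name.replace("__", ":")


def find_webhook_settings(env_text: str) -> Dict[str, str]:
    """Collect Webhooks__* entries from .env content, keyed by config path."""
    settings: Dict[str, str] = {}
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.lower().startswith("webhooks__"):
            # In this sketch a repeated key simply overwrites the earlier value.
            settings[env_to_config_key(key)] = value.strip()
    return settings
```

Calling `find_webhook_settings(open(".env").read())` makes it easy to spot an entry that was appended twice or written under the wrong index.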

Bare Metal / systemd

# Check webhook configuration
jq '.Webhooks' /etc/stellaops/appsettings.Production.json

# Test endpoint connectivity
curl -I https://hooks.example.com/stellaops

# Review delivery history
stella webhooks logs <webhook-name> --status failed

# Retry failed deliveries
stella webhooks retry <webhook-name>

Kubernetes / Helm

# values.yaml
webhooks:
  endpoints:
    - name: slack-releases
      url: https://hooks.example.com/stellaops
      direction: outbound

# Apply the updated values
helm upgrade stellaops ./chart -f values.yaml

Verification

stella doctor run --check check.integration.webhooks

Related Checks

  • check.integration.slack -- Slack-specific webhook validation
  • check.integration.teams -- Teams-specific webhook validation
  • check.integration.ci.system -- CI systems that receive webhook events