master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00



Integration Webhook Health

What It Checks

Iterates over all webhook endpoints defined under Webhooks:Endpoints. For outbound webhooks it sends an HTTP HEAD request to the target URL and considers the endpoint reachable if the response status code is below 500. For inbound webhooks it marks reachability as true (endpoint is local). It then calculates the delivery failure rate from TotalDeliveries and SuccessfulDeliveries counters. The check fails if any outbound endpoint is unreachable or if any webhook's failure rate exceeds 20%, warns if any webhook's failure rate is between 5% and 20%, and passes otherwise.
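The failure-rate thresholds described above can be sketched as a small shell function. This is an illustration, not the check's actual implementation: the function name `classify_webhook` is hypothetical, the failure rate is assumed to be (TotalDeliveries - SuccessfulDeliveries) / TotalDeliveries, and the boundary handling (exactly 5% warns, exactly 20% still warns) and the treatment of a webhook with zero deliveries are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one webhook from its delivery counters.
# The 5% and 20% thresholds come from the check description; boundary
# handling and the zero-delivery case are assumptions.
classify_webhook() {
  local total=$1 successful=$2
  # No deliveries yet: nothing to measure, treat as passing (assumption).
  if (( total <= 0 )); then
    echo "pass"
    return
  fi
  local failed=$(( total - successful ))
  # Integer comparison avoids floating point:
  # failed/total > 20%  is the same as  failed * 100 > total * 20
  if (( failed * 100 > total * 20 )); then
    echo "fail"
  elif (( failed * 100 >= total * 5 )); then
    echo "warn"
  else
    echo "pass"
  fi
}

classify_webhook 1000 990   # 1% failures  -> pass
classify_webhook 1000 900   # 10% failures -> warn
classify_webhook 1000 700   # 30% failures -> fail
```

Note that, per the description, an unreachable outbound endpoint fails the check regardless of its failure rate; the sketch covers only the rate-based part.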

Why It Matters

Webhooks are the primary event-driven communication channel between Stella Ops and external systems. Unreachable outbound endpoints mean notifications, CI triggers, and audit event deliveries silently fail. A rising failure rate is an early warning of endpoint degradation that can cascade into missed alerts, delayed approvals, and incomplete audit trails.

Common Causes

Webhook endpoint is down or returning 5xx errors
Network connectivity issue or DNS resolution failure
TLS certificate expired or untrusted
Payload format changed causing receiver to reject events
Rate limiting by the receiving service
Intermittent timeouts under load
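Several of the causes above can be told apart from the exit code of a manual `curl` probe. The sketch below maps the exit codes documented in curl's man page to the most likely cause; the function name `explain_curl_exit` and the wording of each hint are illustrative, and the mapping is a diagnostic aid rather than anything the Doctor check itself reports.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a curl exit code to a likely cause.
# Exit codes are taken from curl's man page; the hints are illustrative.
explain_curl_exit() {
  case $1 in
    0)  echo "request completed; inspect the HTTP status code" ;;
    6)  echo "DNS resolution failure" ;;
    7)  echo "connection refused or host unreachable" ;;
    28) echo "timeout" ;;
    35) echo "TLS handshake failure" ;;
    60) echo "TLS certificate expired or untrusted" ;;
    *)  echo "other curl error ($1); see 'man curl'" ;;
  esac
}

# Usage against the placeholder URL used in this article:
#   curl -sS -o /dev/null -I --max-time 10 https://hooks.example.com/stellaops
#   explain_curl_exit $?
```

Payload rejections and rate limiting do not show up as curl errors; they appear as 4xx responses (for example 400 or 429) in the delivery logs.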

How to Fix

Docker Compose

# List configured webhooks (case-insensitive, so Webhooks__ keys also match)
grep -i 'WEBHOOKS__' .env

# Test an outbound webhook endpoint
docker compose exec gateway curl -I https://hooks.example.com/stellaops

# View webhook delivery logs
docker compose logs platform | grep -i webhook

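# Update a webhook URL (edit the existing line instead if the key is already set)
echo 'Webhooks__Endpoints__0__Url=https://hooks.example.com/v2/stellaops' >> .env
# Recreate the container; `docker compose restart` does not pick up .env changes
docker compose up -d platform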

Bare Metal / systemd

# Check webhook configuration
jq '.Webhooks' /etc/stellaops/appsettings.Production.json

# Test endpoint connectivity
curl -I https://hooks.example.com/stellaops

# Review delivery history
stella webhooks logs <webhook-name> --status failed

# Retry failed deliveries
stella webhooks retry <webhook-name>
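To judge a manual probe by the same rule the check description gives (any HTTP status below 500 counts as reachable), the `curl -I` test can be wrapped in a short script. This is a sketch under stated assumptions: the function names `is_reachable_status` and `probe_webhook` and the 10-second timeout are hypothetical, and the real check may use different timeouts or retries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the "status below 500 is reachable" rule.
# curl reports 000 when no HTTP response was received at all
# (DNS, TLS, or connection failure), which counts as unreachable here.
is_reachable_status() {
  local code=$1
  [[ $code =~ ^[0-9]{3}$ ]] && (( 10#$code >= 100 && 10#$code < 500 ))
}

# Hypothetical helper: send a HEAD request and print the verdict.
# The 10-second timeout is an assumption, not a documented value.
probe_webhook() {
  local url=$1 code
  code=$(curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$url")
  if is_reachable_status "$code"; then
    echo "reachable ($code): $url"
  else
    echo "unreachable ($code): $url"
    return 1
  fi
}

# Example: probe_webhook https://hooks.example.com/stellaops
```

Under this rule a 404 or 405 response still counts as reachable, so a passing probe confirms only that the receiver answers, not that it accepts deliveries.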

Kubernetes / Helm

# values.yaml
webhooks:
  endpoints:
    - name: slack-releases
      url: https://hooks.example.com/stellaops
      direction: outbound

# Apply the updated values
helm upgrade stellaops ./chart -f values.yaml

Verification

stella doctor run --check check.integration.webhooks

Related Checks

check.integration.slack -- Slack-specific webhook validation
check.integration.teams -- Teams-specific webhook validation
check.integration.ci.system -- CI systems that receive webhook events
