Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,86 @@
---
checkId: check.telemetry.otlp.endpoint
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, telemetry, otlp]
---
# OTLP Endpoint
## What It Checks
Verifies that the OTLP (OpenTelemetry Protocol) collector endpoint is reachable. The check:
- Reads the endpoint from `Telemetry:OtlpEndpoint` configuration.
- Sends a GET request to `{endpoint}/v1/health` with a 5-second timeout.
- Passes if the endpoint returns a successful HTTP response.
- Warns on non-success status codes, timeouts, or connection failures.

The check only runs when `Telemetry:OtlpEndpoint` is configured.
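
The steps above can be approximated in a few lines of shell. This is a minimal sketch, not the plugin's actual implementation: `classify_status` and `probe_otlp` are hypothetical helper names, treating only 2xx codes as "successful" is an assumption, and the endpoint is read from the `Telemetry__OtlpEndpoint` environment variable for illustration.

```bash
#!/usr/bin/env bash
# Sketch of the check's logic under stated assumptions; not the plugin's code.

# classify_status (hypothetical helper): assume 2xx passes, anything else warns.
classify_status() {
  local code="$1"
  if [[ "$code" =~ ^2[0-9][0-9]$ ]]; then
    echo "pass"
  else
    echo "warn"
  fi
}

# probe_otlp (hypothetical helper): GET {endpoint}/v1/health with a 5-second timeout.
probe_otlp() {
  local endpoint="${1%/}"   # drop a single trailing slash, if present
  local code
  # With --write-out '%{http_code}', curl prints 000 when no HTTP response arrived
  # (timeout, refused connection, DNS failure), which maps to "warn" below.
  code="$(curl --silent --output /dev/null --max-time 5 \
    --write-out '%{http_code}' "${endpoint}/v1/health")" || code="000"
  classify_status "$code"
}

# Mirror the check's precondition: do nothing when no endpoint is configured.
if [[ -n "${Telemetry__OtlpEndpoint:-}" ]]; then
  probe_otlp "$Telemetry__OtlpEndpoint"
fi
```

Calling `probe_otlp` with an endpoint URL prints `pass` or `warn`, which makes it easy to reproduce the check's verdict by hand from the host or container where the platform runs.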
## Why It Matters
OTLP is the standard protocol for exporting traces, metrics, and logs to observability backends (Grafana, Jaeger, Datadog, etc.). If the collector is unreachable, telemetry data is lost, making it impossible to monitor service performance, trace request flows, or detect anomalies.
## Common Causes
- OTLP collector not running
- Wrong endpoint configured
- Network connectivity issue or firewall blocking connection
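- Collector health endpoint not available at `/v1/health` (for example, when the endpoint targets port 4317, the conventional OTLP gRPC port, which typically does not answer plain HTTP GET requests)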
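
To narrow down which of these causes applies, the exit code of a manual `curl` probe is often enough. The sketch below assumes `curl` is installed; `explain_curl_exit` is a hypothetical helper, and the mapping from exit codes to causes is a rough triage aid rather than the check's own diagnostics.

```bash
#!/usr/bin/env bash
# Rough triage sketch: map curl exit codes to the likely causes listed above.

# explain_curl_exit (hypothetical helper): translate a curl exit code into a hint.
explain_curl_exit() {
  case "$1" in
    0)  echo "HTTP response received; inspect the status code and the /v1/health path" ;;
    6)  echo "host name did not resolve; check the configured endpoint" ;;
    7)  echo "connection failed; collector not running or port blocked" ;;
    28) echo "request timed out; check network path and firewall rules" ;;
    *)  echo "curl exit code $1; see the curl manual for details" ;;
  esac
}

# Usage: pass the endpoint as the first argument; nothing runs without one.
if [[ -n "${1:-}" ]]; then
  rc=0
  curl --silent --output /dev/null --max-time 5 "${1%/}/v1/health" || rc=$?
  explain_curl_exit "$rc"
fi
```

Because `curl` is run without `--fail`, an HTTP error status such as 404 still yields exit code 0, so that branch points at the health path rather than at connectivity.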
## How to Fix
### Docker Compose
```yaml
environment:
  Telemetry__OtlpEndpoint: "http://otel-collector:4317"
```
```bash
# Check if collector is running
docker ps | grep otel
# Check collector logs
docker logs otel-collector --tail 50
# Test connectivity
docker exec <platform-container> curl -v http://otel-collector:4317/v1/health
```
### Bare Metal / systemd
```bash
# Check collector status
systemctl status otel-collector
# Test endpoint
curl -v http://localhost:4317/v1/health
# Check port binding
netstat -an | grep 4317
```
Edit `appsettings.json`:
```json
{
  "Telemetry": {
    "OtlpEndpoint": "http://localhost:4317"
  }
}
```
### Kubernetes / Helm
```yaml
telemetry:
  otlpEndpoint: "http://otel-collector.monitoring.svc:4317"
```
```bash
kubectl get pods -n monitoring | grep otel
kubectl logs -n monitoring <otel-collector-pod> --tail 50
```
## Verification
```bash
stella doctor run --check check.telemetry.otlp.endpoint
```
## Related Checks
- `check.metrics.prometheus.scrape` — verifies Prometheus metrics endpoint accessibility
- `check.logs.directory.writable` — verifies log directory is writable