Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/doctor/articles/observability/otlp-endpoint.md
@@ -0,0 +1,86 @@
---
checkId: check.telemetry.otlp.endpoint
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, telemetry, otlp]
---
# OTLP Endpoint

## What It Checks
Verifies that the OTLP (OpenTelemetry Protocol) collector endpoint is reachable. The check:

- Reads the endpoint from `Telemetry:OtlpEndpoint` configuration.
- Sends a GET request to `{endpoint}/v1/health` with a 5-second timeout.
- Passes if the endpoint returns a successful HTTP response.
- Warns on non-success status codes, timeouts, or connection failures.

The check only runs when `Telemetry:OtlpEndpoint` is configured.
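
The steps above can be approximated from a shell for manual testing. The following is a minimal sketch, not the plugin's actual implementation: the function names `classify_otlp_status` and `probe_otlp_endpoint` are hypothetical, and it assumes that "successful HTTP response" means a 2xx status code.

```bash
# Sketch only: mirrors the documented behaviour, not the plugin source.

# Map an HTTP status code to a check result.
# Assumption: "successful" means any 2xx code. curl reports 000 when it
# received no HTTP response at all (timeout, refused connection, DNS failure).
classify_otlp_status() {
  case "$1" in
    2[0-9][0-9]) echo "pass" ;;
    *)           echo "warn" ;;
  esac
}

# Probe {endpoint}/v1/health with a 5-second timeout.
# Prints "skip" when no endpoint is given, mirroring the run condition.
probe_otlp_endpoint() {
  local endpoint="${1:-}"
  if [ -z "$endpoint" ]; then
    echo "skip"
    return 0
  fi
  local code
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  # "|| true" keeps the 000 status that curl prints on a failed transfer.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${endpoint%/}/v1/health")" || true
  classify_otlp_status "$code"
}
```

For example, `probe_otlp_endpoint "http://otel-collector:4317"` prints `pass` or `warn` depending on what the collector returns.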

## Why It Matters
OTLP is the standard protocol for exporting traces, metrics, and logs to observability backends (Grafana, Jaeger, Datadog, etc.). If the collector is unreachable, telemetry data is lost, making it impossible to monitor service performance, trace request flows, or detect anomalies.

## Common Causes
- OTLP collector not running
- Wrong endpoint configured
- Network connectivity issue or firewall blocking connection
- Collector health endpoint not available at `/v1/health`
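
When testing by hand, curl's exit code gives a rough hint as to which of these causes applies. The helper below is a heuristic sketch: `explain_curl_exit` is a hypothetical name, and the mapping from exit codes (0, 6, 7, and 28 as documented in curl's EXIT CODES section) to causes is an informed guess rather than something the check itself reports.

```bash
# Translate a curl exit code into the most likely cause from the list above.
# The cause attached to each code is a heuristic, not a definitive diagnosis.
explain_curl_exit() {
  case "$1" in
    0)  echo "connected: inspect the HTTP status; a non-2xx code suggests /v1/health is not served" ;;
    6)  echo "could not resolve host: wrong endpoint configured" ;;
    7)  echo "failed to connect: collector not running or connection blocked" ;;
    28) echo "timed out: network connectivity issue or firewall dropping traffic" ;;
    *)  echo "other curl error ($1): see the EXIT CODES section of the curl manual" ;;
  esac
}
```

A typical use is `curl -s -o /dev/null --max-time 5 "$endpoint/v1/health"; explain_curl_exit $?`, where `$endpoint` holds the configured value.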

## How to Fix

### Docker Compose
```yaml
environment:
  Telemetry__OtlpEndpoint: "http://otel-collector:4317"
```

```bash
# Check if collector is running
docker ps | grep otel

# Check collector logs
docker logs otel-collector --tail 50

# Test connectivity
docker exec <platform-container> curl -v http://otel-collector:4317/v1/health
```

### Bare Metal / systemd
```bash
# Check collector status
systemctl status otel-collector

# Test endpoint
curl -v http://localhost:4317/v1/health

# Check port binding
netstat -an | grep 4317
```

Edit `appsettings.json`:
```json
{
  "Telemetry": {
    "OtlpEndpoint": "http://localhost:4317"
  }
}
```
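
The `Telemetry:OtlpEndpoint` key in `appsettings.json` and the `Telemetry__OtlpEndpoint` variable in the Docker Compose example name the same setting: .NET configuration maps `__` in environment variable names to the `:` section separator. The sketch below shows that conversion. `config_key_to_env` is a hypothetical helper, and it assumes the service loads configuration from environment variables, as the Compose example suggests.

```bash
# Convert a .NET configuration key to its environment-variable form
# by replacing every ":" section separator with "__".
config_key_to_env() {
  printf '%s\n' "$1" | sed 's/:/__/g'
}
```

For example, `export "$(config_key_to_env 'Telemetry:OtlpEndpoint')=http://localhost:4317"` sets `Telemetry__OtlpEndpoint` for the current shell.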

### Kubernetes / Helm
```yaml
telemetry:
  otlpEndpoint: "http://otel-collector.monitoring.svc:4317"
```

```bash
kubectl get pods -n monitoring | grep otel
kubectl logs -n monitoring <otel-collector-pod> --tail 50
```

## Verification
```bash
stella doctor run --check check.telemetry.otlp.endpoint
```

## Related Checks
- `check.metrics.prometheus.scrape`: verifies Prometheus metrics endpoint accessibility
- `check.logs.directory.writable`: verifies log directory is writable