Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/observability/prometheus-scrape.md
+++ b/docs/doctor/articles/observability/prometheus-scrape.md
@@ -0,0 +1,90 @@
+---
+checkId: check.metrics.prometheus.scrape
+plugin: stellaops.doctor.observability
+severity: warn
+tags: [observability, metrics, prometheus]
+---
+# Prometheus Scrape
+
+## What It Checks
+Verifies that the application metrics endpoint is accessible for Prometheus scraping. The check:
+
+- Reads `Metrics:Path` (default `/metrics`), `Metrics:Port` (default `8080`), and `Metrics:Host` (default `localhost`).
+- Sends a GET request to `http://{host}:{port}{path}` with a 5-second timeout.
+- Counts the number of Prometheus-formatted metric lines in the response.
+- Passes if the endpoint returns a successful response with metrics.
+- Warns on non-success status codes, timeouts, or connection failures.
+
+The check only runs when `Metrics:Enabled` is set to `true`.
+
+## Why It Matters
+Prometheus metrics provide real-time visibility into service health, request latencies, error rates, and resource utilization. Without a scrapeable metrics endpoint, alerting rules cannot fire, dashboards go blank, and capacity planning has no data.
+
+## Common Causes
+- Metrics endpoint not enabled in configuration
+- Wrong port configured
+- Service not running on the expected port
+- Authentication required but not configured for Prometheus
+- Firewall blocking the metrics port
+
+## How to Fix
+
+### Docker Compose
+```yaml
+environment:
+  Metrics__Enabled: "true"
+  Metrics__Path: "/metrics"
+  Metrics__Port: "8080"
+```
+
+```bash
+# Test metrics endpoint
+docker exec <platform-container> curl -s http://localhost:8080/metrics | head -5
+```
+
+### Bare Metal / systemd
+Edit `appsettings.json`:
+```json
+{
+  "Metrics": {
+    "Enabled": true,
+    "Path": "/metrics",
+    "Port": 8080
+  }
+}
+```
+
+```bash
+# Verify metrics are exposed
+curl -s http://localhost:8080/metrics | head -5
+
+# Check port binding
+netstat -an | grep 8080
+```
+
+### Kubernetes / Helm
+```yaml
+metrics:
+  enabled: true
+  port: 8080
+  path: "/metrics"
+  serviceMonitor:
+    enabled: true
+```
+
+If Prometheus discovers targets through pod annotations rather than a ServiceMonitor, add the annotations to the pod:
+```yaml
+annotations:
+  prometheus.io/scrape: "true"
+  prometheus.io/port: "8080"
+  prometheus.io/path: "/metrics"
+```
+
+## Verification
+```bash
+stella doctor run --check check.metrics.prometheus.scrape
+```
+
+## Related Checks
+- `check.telemetry.otlp.endpoint` — verifies OTLP collector endpoint reachability
+- `check.logs.directory.writable` — verifies log directory is writable
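
The behaviour listed under "What It Checks" in the article above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugin's implementation: the function names `check_scrape` and `count_metric_lines`, the flat settings dictionary, and the choice to warn when a successful response contains zero metric lines are all hypothetical. The article itself only says the check passes on a successful response with metrics and warns on non-success status codes, timeouts, or connection failures.

```python
import urllib.error
import urllib.request
from typing import Tuple


def count_metric_lines(body: str) -> int:
    """Count sample lines in Prometheus text format, skipping blanks and '#' comment lines."""
    return sum(
        1
        for line in body.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def check_scrape(settings: dict, timeout: float = 5.0) -> Tuple[str, str]:
    """Return (status, message); status is 'skip', 'pass', or 'warn' (hypothetical labels)."""
    # The check only runs when Metrics:Enabled is true.
    if str(settings.get("Metrics:Enabled", "false")).lower() != "true":
        return ("skip", "Metrics:Enabled is not true")

    # Defaults taken from the article: localhost, 8080, /metrics.
    host = settings.get("Metrics:Host", "localhost")
    port = settings.get("Metrics:Port", 8080)
    path = settings.get("Metrics:Path", "/metrics")
    url = f"http://{host}:{port}{path}"

    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # Non-success status codes surface as HTTPError.
        return ("warn", f"{url} returned HTTP {exc.code}")
    except OSError as exc:
        # URLError, timeouts, and refused connections are all OSError subclasses.
        return ("warn", f"{url} unreachable: {exc}")

    count = count_metric_lines(body)
    if count == 0:
        # Assumption: an empty but successful response is treated as a warning.
        return ("warn", f"{url} responded but exposed no metric lines")
    return ("pass", f"{url} exposed {count} metric lines")
```

Calling `check_scrape({"Metrics:Enabled": "true"})` probes `http://localhost:8080/metrics` with the documented defaults. Keeping the line counter separate from the HTTP call makes the parsing step easy to test without a running service.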