Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Date: 2026-03-27 12:28:00 +02:00
Parent: fbd24e71de
Commit: c58a236d70
326 changed files with 18500 additions and 463 deletions

---
checkId: check.logs.directory.writable
plugin: stellaops.doctor.observability
severity: fail
tags: [observability, logs, quick]
---
# Log Directory Writable
## What It Checks
Verifies that the log directory exists and is writable. The check:
- Reads the log path from `Logging:Path` configuration. Falls back to platform defaults: `/var/log/stellaops` on Linux, `%ProgramData%\StellaOps\logs` on Windows.
- Verifies the directory exists.
- Writes a temporary file to test write access, then deletes it.
- Fails if the directory does not exist, is not writable due to permissions, or encounters an I/O error.
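The write-access probe described above can be sketched as a standalone function (an illustrative reconstruction, not the plugin's actual implementation; the probe-file naming is a placeholder):

```python
import os
import tempfile

def check_log_dir_writable(path: str) -> tuple[str, str]:
    """Mirror the check's logic: verify the directory exists, then
    prove write access by creating and deleting a temporary file."""
    if not os.path.isdir(path):
        return ("fail", f"log directory does not exist: {path}")
    try:
        # Write a temporary probe file, then clean it up.
        fd, probe = tempfile.mkstemp(prefix=".doctor-probe-", dir=path)
        os.close(fd)
        os.remove(probe)
    except OSError as exc:  # permissions, read-only fs, disk full, ...
        return ("fail", f"log directory is not writable: {exc}")
    return ("pass", f"log directory is writable: {path}")
```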
## Why It Matters
If the log directory is not writable, application logs are silently lost. Without logs, troubleshooting service failures, debugging policy evaluation issues, and performing security incident investigations become impossible. This is a severity-fail check because log loss breaks the auditability guarantee.
## Common Causes
- Log directory not created during installation
- Directory was deleted
- Configuration points to wrong path
- Insufficient permissions or directory owned by different user
- Read-only file system
- Disk full
## How to Fix
### Docker Compose
```yaml
volumes:
- log-data:/var/log/stellaops
```
```bash
docker exec <platform-container> mkdir -p /var/log/stellaops
```
### Bare Metal / systemd
```bash
# Create log directory
sudo mkdir -p /var/log/stellaops
# Set ownership and permissions
sudo chown -R stellaops:stellaops /var/log/stellaops
sudo chmod 755 /var/log/stellaops
```
### Kubernetes / Helm
```yaml
logging:
path: "/var/log/stellaops"
persistence:
enabled: true
size: 10Gi
```
Or use an `emptyDir` volume for ephemeral log storage with a sidecar shipping logs to an external system.
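A minimal sketch of that emptyDir-plus-sidecar pattern (container names and the shipper image are placeholders; wire the sidecar to your actual log backend):

```yaml
spec:
  containers:
    - name: stellaops
      volumeMounts:
        - name: logs
          mountPath: /var/log/stellaops
    - name: log-shipper            # placeholder sidecar image
      image: example/log-shipper:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/stellaops
          readOnly: true
  volumes:
    - name: logs
      emptyDir: {}                 # ephemeral; logs survive only the pod's lifetime
```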
## Verification
```bash
stella doctor run --check check.logs.directory.writable
```
## Related Checks
- `check.logs.rotation.configured` — verifies log rotation is configured
- `check.storage.diskspace` — verifies sufficient disk space is available

---
checkId: check.logs.rotation.configured
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, logs]
---
# Log Rotation
## What It Checks
Verifies that log rotation is configured to prevent disk exhaustion. The check:
- Looks for application-level rotation via `Logging:RollingPolicy` configuration.
- Checks for Serilog rolling configuration at `Serilog:WriteTo:0:Args:rollingInterval`.
- On Linux, checks for system-level logrotate at `/etc/logrotate.d/stellaops`.
- Scans log files in the log directory and flags any file exceeding 100MB.
- Warns if rotation is not configured and either a large log file exists or total log size exceeds 200MB.
- Reports info if rotation is not configured but logs are still small.
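The decision logic above can be sketched as a pure function (thresholds taken from the description; this is an illustrative reconstruction of the documented behavior, not the plugin code):

```python
def rotation_check_result(rotation_configured: bool,
                          file_sizes_mb: list[float]) -> str:
    """Map rotation config plus observed log sizes to a check status.

    Documented thresholds: flag any file over 100 MB; warn when
    rotation is absent and either a large file exists or total log
    size exceeds 200 MB; otherwise report info when rotation is
    absent but logs are still small.
    """
    if rotation_configured:
        return "pass"
    has_large_file = any(size > 100 for size in file_sizes_mb)
    total_mb = sum(file_sizes_mb)
    if has_large_file or total_mb > 200:
        return "warn"
    # Rotation missing but logs still small: informational only.
    return "info"
```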
## Why It Matters
Without log rotation, log files grow unbounded until they exhaust disk space. Disk exhaustion causes cascading failures across all services. Even before exhaustion, very large log files are slow to search and analyze during incident response.
## Common Causes
- Log rotation not configured in application settings
- logrotate not installed or stellaops config missing from `/etc/logrotate.d/`
- Application-level rotation disabled
- Rotation threshold set too high
- Very high log volume overwhelming rotation schedule
## How to Fix
### Docker Compose
Set application-level log rotation:
```yaml
environment:
Logging__RollingPolicy: "Size"
Serilog__WriteTo__0__Args__rollingInterval: "Day"
Serilog__WriteTo__0__Args__fileSizeLimitBytes: "104857600" # 100MB
```
### Bare Metal / systemd
Option 1 -- Application-level rotation in `appsettings.json`:
```json
{
"Logging": {
"RollingPolicy": "Size"
}
}
```
Option 2 -- System-level logrotate:
```bash
sudo cp /usr/share/stellaops/logrotate.conf /etc/logrotate.d/stellaops
# Or create manually:
cat <<EOF | sudo tee /etc/logrotate.d/stellaops
/var/log/stellaops/*.log {
daily
rotate 14
compress
missingok
notifempty
maxsize 100M
}
EOF
```
### Kubernetes / Helm
```yaml
logging:
rollingPolicy: "Size"
maxFileSizeMB: 100
retainFiles: 14
```
## Verification
```bash
stella doctor run --check check.logs.rotation.configured
```
## Related Checks
- `check.logs.directory.writable` — verifies log directory exists and is writable
- `check.storage.diskspace` — verifies sufficient disk space is available

---
checkId: check.telemetry.otlp.endpoint
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, telemetry, otlp]
---
# OTLP Endpoint
## What It Checks
Verifies that the OTLP (OpenTelemetry Protocol) collector endpoint is reachable. The check:
- Reads the endpoint from `Telemetry:OtlpEndpoint` configuration.
- Sends a GET request to `{endpoint}/v1/health` with a 5-second timeout.
- Passes if the endpoint returns a successful HTTP response.
- Warns on non-success status codes, timeouts, or connection failures.
The check only runs when `Telemetry:OtlpEndpoint` is configured.
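The reachability probe amounts to an HTTP GET with a timeout, mapping any failure mode to a warning. A hedged sketch using only the Python standard library (the status mapping follows the description above; the real check lives in the plugin):

```python
import urllib.error
import urllib.request

def probe_otlp_health(endpoint: str, timeout: float = 5.0) -> str:
    """GET {endpoint}/v1/health; 'pass' on HTTP success, 'warn' otherwise."""
    url = endpoint.rstrip("/") + "/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "pass" if 200 <= resp.status < 300 else "warn"
    except urllib.error.HTTPError:
        return "warn"   # non-success status code
    except (urllib.error.URLError, OSError):
        return "warn"   # timeout, DNS failure, connection refused
```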
## Why It Matters
OTLP is the standard protocol for exporting traces, metrics, and logs to observability backends (Grafana, Jaeger, Datadog, etc.). If the collector is unreachable, telemetry data is lost, making it impossible to monitor service performance, trace request flows, or detect anomalies.
## Common Causes
- OTLP collector not running
- Wrong endpoint configured
- Network connectivity issue or firewall blocking connection
- Collector health endpoint not available at `/v1/health`
## How to Fix
### Docker Compose
```yaml
environment:
Telemetry__OtlpEndpoint: "http://otel-collector:4317"
```
```bash
# Check if collector is running
docker ps | grep otel
# Check collector logs
docker logs otel-collector --tail 50
# Test connectivity
docker exec <platform-container> curl -v http://otel-collector:4317/v1/health
```
### Bare Metal / systemd
```bash
# Check collector status
systemctl status otel-collector
# Test endpoint
curl -v http://localhost:4317/v1/health
# Check port binding
netstat -an | grep 4317
```
Edit `appsettings.json`:
```json
{
"Telemetry": {
"OtlpEndpoint": "http://localhost:4317"
}
}
```
### Kubernetes / Helm
```yaml
telemetry:
otlpEndpoint: "http://otel-collector.monitoring.svc:4317"
```
```bash
kubectl get pods -n monitoring | grep otel
kubectl logs -n monitoring <otel-collector-pod> --tail 50
```
## Verification
```bash
stella doctor run --check check.telemetry.otlp.endpoint
```
## Related Checks
- `check.metrics.prometheus.scrape` — verifies Prometheus metrics endpoint accessibility
- `check.logs.directory.writable` — verifies log directory is writable

---
checkId: check.metrics.prometheus.scrape
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, metrics, prometheus]
---
# Prometheus Scrape
## What It Checks
Verifies that the application metrics endpoint is accessible for Prometheus scraping. The check:
- Reads `Metrics:Path` (default `/metrics`), `Metrics:Port` (default `8080`), and `Metrics:Host` (default `localhost`).
- Sends a GET request to `http://{host}:{port}{path}` with a 5-second timeout.
- Counts the number of Prometheus-formatted metric lines in the response.
- Passes if the endpoint returns a successful response with metrics.
- Warns on non-success status codes, timeouts, or connection failures.
The check only runs when `Metrics:Enabled` is set to `true`.
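Counting Prometheus-formatted metric lines, as the check does, amounts to skipping comment and blank lines in the text exposition format (a hypothetical helper, not the plugin's parser):

```python
def count_metric_lines(exposition: str) -> int:
    """Count sample lines in Prometheus text exposition format,
    ignoring '# HELP' / '# TYPE' comments and blank lines."""
    return sum(
        1
        for line in exposition.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
```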
## Why It Matters
Prometheus metrics provide real-time visibility into service health, request latencies, error rates, and resource utilization. Without a scrapeable metrics endpoint, alerting rules cannot fire, dashboards go blank, and capacity planning has no data.
## Common Causes
- Metrics endpoint not enabled in configuration
- Wrong port configured
- Service not running on the expected port
- Authentication required but not configured for Prometheus
- Firewall blocking the metrics port
## How to Fix
### Docker Compose
```yaml
environment:
Metrics__Enabled: "true"
Metrics__Path: "/metrics"
Metrics__Port: "8080"
```
```bash
# Test metrics endpoint
docker exec <platform-container> curl -s http://localhost:8080/metrics | head -5
```
### Bare Metal / systemd
Edit `appsettings.json`:
```json
{
"Metrics": {
"Enabled": true,
"Path": "/metrics",
"Port": 8080
}
}
```
```bash
# Verify metrics are exposed
curl -s http://localhost:8080/metrics | head -5
# Check port binding
netstat -an | grep 8080
```
### Kubernetes / Helm
```yaml
metrics:
enabled: true
port: 8080
path: "/metrics"
serviceMonitor:
enabled: true
```
Add Prometheus annotations to the pod:
```yaml
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
```
## Verification
```bash
stella doctor run --check check.metrics.prometheus.scrape
```
## Related Checks
- `check.telemetry.otlp.endpoint` — verifies OTLP collector endpoint reachability
- `check.logs.directory.writable` — verifies log directory is writable