Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/operations/scheduler.md
+++ b/docs/doctor/articles/operations/scheduler.md
@@ -0,0 +1,108 @@
+---
+checkId: check.operations.scheduler
+plugin: stellaops.doctor.operations
+severity: warn
+tags: [operations, scheduler, core]
+---
+# Scheduler Health
+
+## What It Checks
+Evaluates the scheduler service status, scheduled jobs, and execution history:
+
+- **Service status**: fail if the scheduler service is not running.
+- **Missed executions**: warn if any scheduled job executions were missed (scheduled time passed without the job running).
+
+Evidence collected: `ServiceStatus`, `ScheduledJobs`, `MissedExecutions`, `LastExecution`, `NextExecution`, `CompletedToday`.
+
+This check always runs (no configuration prerequisites).
+
+## Why It Matters
+The scheduler is responsible for triggering time-based operations across the platform: vulnerability database syncs, periodic scans, evidence expiration, report generation, feed updates, and more. If the scheduler is down, none of these periodic tasks run, causing data staleness across the system. Missed executions indicate that the scheduler was unable to trigger a job at its scheduled time, which can cause cascading data freshness issues.
+
+## Common Causes
+- Scheduler service crashed or was not started
+- Service configuration error preventing startup
+- System was down during a scheduled execution time (maintenance, outage)
+- Scheduler overloaded with too many concurrent scheduled jobs
+- Clock skew between the scheduler and other services
+- Resource exhaustion preventing the scheduler from processing triggers
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Check scheduler/orchestrator service status
+docker compose -f docker-compose.stella-ops.yml ps orchestrator
+
+# View scheduler-related log lines ("schedule" also matches "scheduler" and "scheduled")
+docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "schedule"
+
+# Restart the service
+docker compose -f docker-compose.stella-ops.yml restart orchestrator
+
+# Review missed executions
+stella scheduler preview --missed
+
+# Trigger catch-up for missed jobs
+stella scheduler catchup --dry-run
+stella scheduler catchup
+```
+
+### Bare Metal / systemd
+```bash
+# Check scheduler service status
+sudo systemctl status stellaops-scheduler
+
+# Start the scheduler if stopped
+sudo systemctl start stellaops-scheduler
+
+# View scheduler logs
+sudo journalctl -u stellaops-scheduler --since "4 hours ago"
+
+# Review missed executions
+stella scheduler preview --missed
+
+# Preview the catch-up with --dry-run; rerun without the flag to execute the missed jobs
+stella scheduler catchup --dry-run
+
+# Verify system clock is synchronized
+timedatectl status
+```
+
+### Kubernetes / Helm
+```bash
+# Check scheduler pod status
+kubectl get pods -l app=stellaops-scheduler
+
+# View logs for the scheduler pod
+kubectl logs -l app=stellaops-scheduler --tail=200
+
+# Restart the scheduler
+kubectl rollout restart deployment stellaops-scheduler
+
+# Print the pod's UTC time and compare it with a trusted clock (pods share the node's clock)
+kubectl exec -it <scheduler-pod> -- date -u
+```
+
+Set in Helm `values.yaml`:
+
+```yaml
+scheduler:
+  replicas: 1  # only one scheduler instance to avoid duplicate execution
+  resources:
+    limits:
+      memory: 512Mi
+      cpu: "0.5"
+  catchupOnStart: true  # run missed jobs on startup
+```
+
+## Verification
+```bash
+stella doctor run --check check.operations.scheduler
+```
+
+## Related Checks
+- `check.operations.job-queue` -- scheduler feeds jobs into the queue
+- `check.operations.dead-letter` -- scheduler-triggered jobs that fail end up in dead letter
+- `check.release.schedule` -- release schedule depends on the scheduler service
+- `check.scanner.vuln` -- vulnerability database sync is scheduler-driven