Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.
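
As a rough sketch of that metadata shape (field names here are illustrative only, not the actual class added in this commit):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Remediation:
    """Illustrative shape of the remediation metadata each check emits."""
    severity: str                                  # e.g. "warn" or "fail", as in the article frontmatter
    category: str                                  # e.g. "operations"
    runbook_url: str                               # link into docs/doctor/articles/
    fix_suggestions: list[str] = field(default_factory=list)  # commands or steps shown in the dashboard
```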

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in: master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions


@@ -0,0 +1,90 @@
---
checkId: check.operations.dead-letter
plugin: stellaops.doctor.operations
severity: warn
tags: [operations, queue, dead-letter]
---
# Dead Letter Queue
## What It Checks
Examines the dead letter queue for failed jobs that have exhausted their retry attempts and require manual review:
- **Critical threshold**: fail when more than 50 failed jobs accumulate in the dead letter queue.
- **Warning threshold**: warn when more than 10 failed jobs are present.
- **Acceptable range**: 1-10 failed jobs pass with an informational note.
Evidence collected: `FailedJobs`, `OldestFailure`, `MostCommonError`, `RetryableCount`.
This check always runs (no configuration prerequisites).
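
The thresholds above can be sketched as a small status function (an illustrative Python sketch, not the actual check implementation):

```python
def dead_letter_status(failed_jobs: int) -> str:
    """Map dead letter queue depth to a doctor check status.

    Mirrors the documented thresholds: more than 50 fails, more than 10
    warns, 1-10 passes with an informational note, 0 is a clean pass.
    """
    if failed_jobs > 50:
        return "fail"
    if failed_jobs > 10:
        return "warn"
    if failed_jobs > 0:
        return "pass-with-note"
    return "pass"
```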
## Why It Matters
Dead letter queue entries represent work that the system was unable to complete after all retry attempts. Each entry is a job that may have had side effects (partial writes, notifications sent, resources allocated) and now sits in an inconsistent state. A growing dead letter queue indicates a systemic issue -- a downstream service outage, a configuration error, or a bug that is causing repeated failures. Left unattended, dead letters accumulate and can mask the root cause of operational issues.
## Common Causes
- Persistent downstream service failures (registry unavailable, external API down)
- Configuration errors causing jobs to fail deterministically (wrong credentials, missing endpoints)
- Resource exhaustion (out of memory, disk full) during job execution
- Integration service outage (SCM, CI, secrets manager)
- Transient failures accumulating faster than the retry mechanism can clear them
- Jobs consistently failing on specific artifact types or inputs
## How to Fix
### Docker Compose
```bash
# List dead letter queue entries
stella orchestrator deadletter list --limit 20
# Analyze common failure patterns
stella orchestrator deadletter analyze
# Retry jobs that are eligible for retry
stella orchestrator deadletter retry --filter retryable
# Retry all failed jobs
stella orchestrator deadletter retry --all
# View orchestrator logs for root cause
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "error\|fail"
```
### Bare Metal / systemd
```bash
# List recent failures
stella orchestrator deadletter list --since 1h
# Analyze failure patterns
stella orchestrator deadletter analyze
# Retry retryable jobs
stella orchestrator deadletter retry --filter retryable
# Check orchestrator service health
sudo systemctl status stellaops-orchestrator
sudo journalctl -u stellaops-orchestrator --since "4 hours ago" | grep -i "deadletter\|error"
```
### Kubernetes / Helm
```bash
# List dead letter entries
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter list --limit 20
# Analyze failures
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter analyze
# Retry retryable jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter retry --filter retryable
# Check orchestrator pod logs
kubectl logs -l app=stellaops-orchestrator --tail=200 | grep -i dead.letter
```
## Verification
```bash
stella doctor run --check check.operations.dead-letter
```
## Related Checks
- `check.operations.job-queue` -- job queue backlog can indicate the same underlying issue
- `check.operations.scheduler` -- scheduler failures may produce dead letter entries
- `check.postgres.connectivity` -- database issues are a common root cause of job failures


@@ -0,0 +1,113 @@
---
checkId: check.operations.job-queue
plugin: stellaops.doctor.operations
severity: fail
tags: [operations, queue, jobs, core]
---
# Job Queue Health
## What It Checks
Evaluates the platform job queue health across three dimensions:
- **Worker availability**: fail immediately if no workers are active.
- **Queue depth**: warn at 100+ pending jobs, fail at 500+ pending jobs.
- **Processing rate**: warn if processing rate drops below 10 jobs/minute.
Evidence collected: `QueueDepth`, `ActiveWorkers`, `TotalWorkers`, `ProcessingRate`, `OldestJobAge`, `CompletedLast24h`, `CriticalThreshold`, `WarningThreshold`, `RateStatus`.
This check always runs (no configuration prerequisites).
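
The three dimensions combine roughly as follows (an illustrative Python sketch, not the actual check implementation):

```python
def job_queue_status(active_workers: int, queue_depth: int,
                     processing_rate: float) -> str:
    """Evaluate job queue health across the three documented dimensions.

    Order matters: zero active workers is an immediate fail regardless of
    queue depth; then depth thresholds (500+ fail, 100+ warn); then the
    processing-rate floor of 10 jobs/minute.
    """
    if active_workers == 0:
        return "fail"
    if queue_depth >= 500:
        return "fail"
    if queue_depth >= 100 or processing_rate < 10:
        return "warn"
    return "pass"
```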
## Why It Matters
The job queue is the backbone of asynchronous processing in Stella Ops. It handles scan jobs, SBOM generation, vulnerability matching, evidence collection, notification delivery, and many other background tasks. If no workers are available, all background processing stops. A deep queue means jobs are waiting longer than expected, which cascades into delayed scan results, stale findings, and blocked release gates. A low processing rate indicates a performance bottleneck that will only get worse under load.
## Common Causes
- Worker service not running (crashed, not started, configuration error)
- All workers crashed or became unhealthy simultaneously
- Job processing slower than submission rate during high-activity periods
- Workers overloaded or misconfigured (too few workers for the workload)
- Downstream service bottleneck (database slow, external API rate-limited)
- Database performance issues slowing job dequeue operations
- Higher than normal job submission rate (bulk scan, new integration)
## How to Fix
### Docker Compose
```bash
# Check orchestrator service status
docker compose -f docker-compose.stella-ops.yml ps orchestrator
# View worker logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator
# Restart the orchestrator service
docker compose -f docker-compose.stella-ops.yml restart orchestrator
# Scale workers
docker compose -f docker-compose.stella-ops.yml up -d --scale orchestrator=4
```
```yaml
services:
orchestrator:
environment:
Orchestrator__Workers__Count: "8"
Orchestrator__Workers__MaxConcurrent: "4"
```
### Bare Metal / systemd
```bash
# Check orchestrator service
sudo systemctl status stellaops-orchestrator
# View logs for worker errors
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -i "worker\|queue"
# Restart workers
stella orchestrator workers restart
# Scale workers
stella orchestrator workers scale --count 8
# Monitor queue depth trend
stella orchestrator queue watch
```
### Kubernetes / Helm
```bash
# Check orchestrator pods
kubectl get pods -l app=stellaops-orchestrator
# View worker logs
kubectl logs -l app=stellaops-orchestrator --tail=200
# Scale workers
kubectl scale deployment stellaops-orchestrator --replicas=4
# Check for stuck jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator jobs list --status stuck
```
Set in Helm `values.yaml`:
```yaml
orchestrator:
replicas: 4
workers:
count: 8
maxConcurrent: 4
resources:
limits:
memory: 2Gi
cpu: "2"
```
## Verification
```bash
stella doctor run --check check.operations.job-queue
```
## Related Checks
- `check.operations.dead-letter` -- failed jobs end up in the dead letter queue
- `check.operations.scheduler` -- scheduler feeds jobs into the queue
- `check.scanner.queue` -- scanner-specific queue health
- `check.postgres.connectivity` -- database issues affect job dequeue performance


@@ -0,0 +1,108 @@
---
checkId: check.operations.scheduler
plugin: stellaops.doctor.operations
severity: warn
tags: [operations, scheduler, core]
---
# Scheduler Health
## What It Checks
Evaluates the scheduler service status, scheduled jobs, and execution history:
- **Service status**: fail if the scheduler service is not running.
- **Missed executions**: warn if any scheduled job executions were missed (scheduled time passed without the job running).
Evidence collected: `ServiceStatus`, `ScheduledJobs`, `MissedExecutions`, `LastExecution`, `NextExecution`, `CompletedToday`.
This check always runs (no configuration prerequisites).
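
The missed-execution rule can be sketched as follows (an illustrative Python sketch, not the actual check implementation):

```python
from datetime import datetime

def missed_executions(schedule: dict[str, datetime],
                      last_run: dict[str, datetime],
                      now: datetime) -> list[str]:
    """Return job ids whose scheduled time has passed without a run.

    A job is missed when its scheduled time is in the past and its last
    recorded run (if any) predates that scheduled time.
    """
    missed = []
    for job_id, due in schedule.items():
        if due <= now and last_run.get(job_id, datetime.min) < due:
            missed.append(job_id)
    return missed

def scheduler_status(service_running: bool, missed: list[str]) -> str:
    """Fail if the scheduler is down, warn on any missed executions."""
    if not service_running:
        return "fail"
    return "warn" if missed else "pass"
```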
## Why It Matters
The scheduler is responsible for triggering time-based operations across the platform: vulnerability database syncs, periodic scans, evidence expiration, report generation, feed updates, and more. If the scheduler is down, none of these periodic tasks run, causing data staleness across the system. Missed executions indicate that the scheduler was unable to trigger a job at its scheduled time, which can cause cascading data freshness issues.
## Common Causes
- Scheduler service crashed or was not started
- Service configuration error preventing startup
- System was down during a scheduled execution time (maintenance, outage)
- Scheduler overloaded with too many concurrent scheduled jobs
- Clock skew between the scheduler and other services
- Resource exhaustion preventing the scheduler from processing triggers
## How to Fix
### Docker Compose
```bash
# Check scheduler/orchestrator service status
docker compose -f docker-compose.stella-ops.yml ps orchestrator
# View scheduler logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "scheduler\|schedule"
# Restart the service
docker compose -f docker-compose.stella-ops.yml restart orchestrator
# Review missed executions
stella scheduler preview --missed
# Trigger catch-up for missed jobs
stella scheduler catchup --dry-run
stella scheduler catchup
```
### Bare Metal / systemd
```bash
# Check scheduler service status
sudo systemctl status stellaops-scheduler
# Start the scheduler if stopped
sudo systemctl start stellaops-scheduler
# View scheduler logs
sudo journalctl -u stellaops-scheduler --since "4 hours ago"
# Review missed executions
stella scheduler preview --missed
# Trigger catch-up
stella scheduler catchup --dry-run
# Verify system clock is synchronized
timedatectl status
```
### Kubernetes / Helm
```bash
# Check scheduler pod status
kubectl get pods -l app=stellaops-scheduler
# View logs for the scheduler pod
kubectl logs -l app=stellaops-scheduler --tail=200
# Restart the scheduler
kubectl rollout restart deployment stellaops-scheduler
# Check NTP synchronization in the node
kubectl exec -it <scheduler-pod> -- date -u
```
Set in Helm `values.yaml`:
```yaml
scheduler:
replicas: 1 # only one scheduler instance to avoid duplicate execution
resources:
limits:
memory: 512Mi
cpu: "0.5"
catchupOnStart: true # run missed jobs on startup
```
## Verification
```bash
stella doctor run --check check.operations.scheduler
```
## Related Checks
- `check.operations.job-queue` -- scheduler feeds jobs into the queue
- `check.operations.dead-letter` -- scheduler-triggered jobs that fail end up in dead letter
- `check.release.schedule` -- release schedule depends on the scheduler service
- `check.scanner.vuln` -- vulnerability database sync is scheduler-driven