Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/scanner/queue.md
+++ b/docs/doctor/articles/scanner/queue.md
@@ -0,0 +1,110 @@
+---
+checkId: check.scanner.queue
+plugin: stellaops.doctor.scanner
+severity: warn
+tags: [scanner, queue, jobs, processing]
+---
+# Scanner Queue Health
+
+## What It Checks
+Queries the Scanner service at `/api/v1/queue/stats` and evaluates job queue health across four dimensions:
+
+- **Queue depth**: warn at 100+ pending jobs, fail at 500+.
+- **Failure rate**: warn at 5%+ of processed jobs failing, fail at 15%+.
+- **Stuck jobs**: any stuck jobs trigger an immediate fail.
+- **Backlog growth**: a growing backlog triggers a warning.
+
+Evidence collected: `queue_depth`, `processing_rate_per_min`, `stuck_jobs`, `failed_jobs`, `failure_rate`, `oldest_job_age_min`, `backlog_growing`.
+
+The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured; otherwise it is skipped.
+
+## Why It Matters
+The scanner queue is the central work pipeline for SBOM generation, vulnerability scanning, and reachability analysis. A backlogged or stuck queue delays security findings, blocks release gates that depend on scan results, and can cascade into approval timeouts. Stuck jobs indicate a worker crash or resource failure that will not self-heal.
+
+## Common Causes
+- Scanner worker process crashed or was OOM-killed
+- Job dependency (registry, database) became unavailable mid-scan
+- Resource exhaustion (CPU, memory, disk) on the scanner host
+- Database connection lost during job processing
+- Sudden spike in image pushes overwhelming worker capacity
+- Processing rate slower than ingest rate during bulk import
+
+## How to Fix
+
+### Docker Compose
+Check scanner worker status and restart if needed:
+
+```bash
+# View scanner container logs for errors
+docker compose -f docker-compose.stella-ops.yml logs --tail 200 scanner
+
+# Restart the scanner service
+docker compose -f docker-compose.stella-ops.yml restart scanner
+
+# Scale scanner workers (if using replicas)
+docker compose -f docker-compose.stella-ops.yml up -d --scale scanner=4
+```
+
+Adjust concurrency and the stuck-job timeout via environment variables:
+
+```yaml
+environment:
+  Scanner__Queue__MaxConcurrentJobs: "4"
+  Scanner__Queue__StuckJobTimeoutMinutes: "30"
+```
+
+### Bare Metal / systemd
+```bash
+# Check scanner service status
+sudo systemctl status stellaops-scanner
+
+# View recent logs
+sudo journalctl -u stellaops-scanner --since "1 hour ago"
+
+# Restart the service
+sudo systemctl restart stellaops-scanner
+```
+
+Edit `/etc/stellaops/scanner/appsettings.json`:
+
+```json
+{
+  "Scanner": {
+    "Queue": {
+      "MaxConcurrentJobs": 4,
+      "StuckJobTimeoutMinutes": 30
+    }
+  }
+}
+```
+
+### Kubernetes / Helm
+```bash
+# Check scanner pod status
+kubectl get pods -l app=stellaops-scanner
+
+# View logs for crash loops
+kubectl logs -l app=stellaops-scanner --tail=200
+
+# Scale scanner deployment
+kubectl scale deployment stellaops-scanner --replicas=4
+```
+
+Set in Helm `values.yaml`:
+
+```yaml
+scanner:
+  replicas: 4
+  queue:
+    maxConcurrentJobs: 4
+    stuckJobTimeoutMinutes: 30
+```
+
+## Verification
+```bash
+stella doctor run --check check.scanner.queue
+```
+
+## Related Checks
+- `check.scanner.resources` -- scanner CPU/memory utilization affecting processing rate
+- `check.scanner.sbom` -- SBOM generation failures may originate from queue issues
+- `check.scanner.vuln` -- vulnerability scan health depends on queue throughput
+- `check.operations.job-queue` -- platform-wide job queue health
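
The thresholds listed under "What It Checks" in the article above can be sketched as a small classifier. This is a minimal illustration in Python, not the plugin's actual implementation: the `QueueStats` dataclass and the `classify_queue` function are hypothetical names, the field names simply mirror the evidence keys from the article, `failure_rate` is assumed to be a fraction (0.05 for 5%), and the rule that any fail condition takes precedence over a warn condition is also an assumption.

```python
from dataclasses import dataclass

# Thresholds copied from the article; the constant names are illustrative.
DEPTH_WARN, DEPTH_FAIL = 100, 500
FAILURE_WARN, FAILURE_FAIL = 0.05, 0.15


@dataclass
class QueueStats:
    """Hypothetical container mirroring a subset of the evidence keys."""
    queue_depth: int
    failure_rate: float  # assumed to be a fraction, e.g. 0.05 for 5%
    stuck_jobs: int
    backlog_growing: bool


def classify_queue(stats: QueueStats) -> str:
    """Return "fail", "warn", or "pass" for the given queue stats.

    Assumes fail conditions are evaluated before warn conditions.
    """
    if (stats.stuck_jobs > 0
            or stats.queue_depth >= DEPTH_FAIL
            or stats.failure_rate >= FAILURE_FAIL):
        return "fail"
    if (stats.queue_depth >= DEPTH_WARN
            or stats.failure_rate >= FAILURE_WARN
            or stats.backlog_growing):
        return "warn"
    return "pass"
```

Under these assumptions, a queue with 120 pending jobs and no other problem classifies as `warn`, while a single stuck job classifies as `fail` regardless of depth.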