Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Date: 2026-03-27 12:28:00 +02:00
Commit: c58a236d70 (parent: fbd24e71de)
326 changed files with 18,500 additions and 463 deletions

---
checkId: check.scanner.queue
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, queue, jobs, processing]
---
# Scanner Queue Health
## What It Checks
Queries the Scanner service at `/api/v1/queue/stats` and evaluates job queue health across four dimensions:
- **Queue depth**: warn at 100+ pending jobs, fail at 500+.
- **Failure rate**: warn at 5%+ of processed jobs failing, fail at 15%+.
- **Stuck jobs**: any stuck jobs trigger an immediate fail.
- **Backlog growth**: a growing backlog triggers a warning.
Evidence collected: `queue_depth`, `processing_rate_per_min`, `stuck_jobs`, `failed_jobs`, `failure_rate`, `oldest_job_age_min`, `backlog_growing`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured; otherwise it is skipped.
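The documented thresholds can be sketched as a small shell snippet showing how the four dimensions combine into a single verdict. This is an illustrative sketch only; the variable values are sample data, not a real `/api/v1/queue/stats` response.

```shell
# Sample evidence values (illustrative, not real API output)
queue_depth=120
failure_rate=3     # percent of processed jobs
stuck_jobs=0

# Fail conditions take precedence over warn conditions
verdict=ok
if [ "$stuck_jobs" -gt 0 ] || [ "$queue_depth" -ge 500 ] || [ "$failure_rate" -ge 15 ]; then
  verdict=fail
elif [ "$queue_depth" -ge 100 ] || [ "$failure_rate" -ge 5 ]; then
  verdict=warn
fi
echo "$verdict"
```

With the sample values above, the queue depth of 120 crosses the 100-job warn threshold but no fail threshold, so the check reports `warn`.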
## Why It Matters
The scanner queue is the central work pipeline for SBOM generation, vulnerability scanning, and reachability analysis. A backlogged or stuck queue delays security findings, blocks release gates that depend on scan results, and can cascade into approval timeouts. Stuck jobs indicate a worker crash or resource failure that will not self-heal.
## Common Causes
- Scanner worker process crashed or was OOM-killed
- Job dependency (registry, database) became unavailable mid-scan
- Resource exhaustion (CPU, memory, disk) on the scanner host
- Database connection lost during job processing
- Sudden spike in image pushes overwhelming worker capacity
- Processing rate slower than ingest rate during bulk import
## How to Fix
### Docker Compose
Check scanner worker status and restart if needed:
```bash
# View scanner container logs for errors
docker compose -f docker-compose.stella-ops.yml logs --tail 200 scanner
# Restart the scanner service
docker compose -f docker-compose.stella-ops.yml restart scanner
# Scale scanner workers (if using replicas)
docker compose -f docker-compose.stella-ops.yml up -d --scale scanner=4
```
Adjust concurrency via environment variables:
```yaml
environment:
Scanner__Queue__MaxConcurrentJobs: "4"
Scanner__Queue__StuckJobTimeoutMinutes: "30"
```
### Bare Metal / systemd
```bash
# Check scanner service status
sudo systemctl status stellaops-scanner
# View recent logs
sudo journalctl -u stellaops-scanner --since "1 hour ago"
# Restart the service
sudo systemctl restart stellaops-scanner
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Queue": {
"MaxConcurrentJobs": 4,
"StuckJobTimeoutMinutes": 30
}
}
```
### Kubernetes / Helm
```bash
# Check scanner pod status
kubectl get pods -l app=stellaops-scanner
# View logs for crash loops
kubectl logs -l app=stellaops-scanner --tail=200
# Scale scanner deployment
kubectl scale deployment stellaops-scanner --replicas=4
```
Set in Helm `values.yaml`:
```yaml
scanner:
replicas: 4
queue:
maxConcurrentJobs: 4
stuckJobTimeoutMinutes: 30
```
## Verification
```bash
stella doctor run --check check.scanner.queue
```
## Related Checks
- `check.scanner.resources` -- scanner CPU/memory utilization affecting processing rate
- `check.scanner.sbom` -- SBOM generation failures may originate from queue issues
- `check.scanner.vuln` -- vulnerability scan health depends on queue throughput
- `check.operations.job-queue` -- platform-wide job queue health

---
checkId: check.scanner.reachability
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, reachability, analysis, performance]
---
# Reachability Computation Health
## What It Checks
Queries the Scanner service at `/api/v1/reachability/stats` and evaluates reachability analysis performance and accuracy:
- **Computation failures**: fail if failure rate exceeds 10% of total computations.
- **Average computation time**: warn at 5,000ms, fail at 30,000ms.
- **Vulnerability filtering effectiveness**: reported as evidence (ratio of unreachable to total vulnerabilities).
Evidence collected: `total_computations`, `computation_failures`, `failure_rate`, `avg_computation_time_ms`, `p95_computation_time_ms`, `reachable_vulns`, `unreachable_vulns`, `filter_rate`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
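The latency and failure thresholds above combine as sketched below. The values are illustrative sample data, not a real `/api/v1/reachability/stats` response.

```shell
# Sample evidence values (illustrative only)
avg_computation_time_ms=7200
failure_rate=4     # percent of total computations

verdict=ok
if [ "$failure_rate" -gt 10 ] || [ "$avg_computation_time_ms" -ge 30000 ]; then
  verdict=fail
elif [ "$avg_computation_time_ms" -ge 5000 ]; then
  verdict=warn
fi
echo "$verdict"
```

Here the 7,200 ms average exceeds the 5,000 ms warn threshold but stays under the 30,000 ms fail threshold, so the check reports `warn`.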
## Why It Matters
Reachability analysis is what separates actionable vulnerability findings from noise. It determines which vulnerabilities are actually reachable in the call graph, filtering out false positives that would otherwise block releases or waste triage time. Slow computations delay security feedback loops, and failures mean vulnerabilities are reported without reachability context, inflating finding counts and eroding operator trust.
## Common Causes
- Invalid or incomplete call graph data from the SBOM/slice pipeline
- Missing slice cache entries forcing full recomputation
- Timeout on large codebases with deep dependency trees
- Memory exhaustion during graph traversal on complex projects
- Complex call graphs with high fan-out or cyclical references
- Insufficient CPU/memory allocated to scanner workers
## How to Fix
### Docker Compose
```bash
# Check scanner logs for reachability errors
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "reachability\|computation"
# Warm the slice cache to speed up subsequent computations
stella scanner cache warm
# Increase scanner resources
```
```yaml
services:
scanner:
deploy:
resources:
limits:
memory: 4G
cpus: "4.0"
environment:
Scanner__Reachability__TimeoutMs: "60000"
Scanner__Reachability__MaxGraphDepth: "100"
```
### Bare Metal / systemd
```bash
# View reachability computation errors
sudo journalctl -u stellaops-scanner --since "1 hour ago" | grep -i reachability
# Retry failed computations
stella scanner reachability retry --failed
# Warm the slice cache
stella scanner cache warm
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Reachability": {
"TimeoutMs": 60000,
"MaxGraphDepth": 100,
"MaxConcurrentComputations": 4
}
}
```
### Kubernetes / Helm
```bash
# Check scanner pod resource usage
kubectl top pods -l app=stellaops-scanner
# Scale scanner workers for parallel computation
kubectl scale deployment stellaops-scanner --replicas=4
```
Set in Helm `values.yaml`:
```yaml
scanner:
replicas: 4
resources:
limits:
memory: 4Gi
cpu: "4"
reachability:
timeoutMs: 60000
maxGraphDepth: 100
```
## Verification
```bash
stella doctor run --check check.scanner.reachability
```
## Related Checks
- `check.scanner.slice.cache` -- cache misses are a primary cause of slow computations
- `check.scanner.witness.graph` -- reachability depends on witness graph integrity
- `check.scanner.sbom` -- SBOM quality directly affects reachability accuracy
- `check.scanner.resources` -- resource constraints cause computation timeouts

---
checkId: check.scanner.resources
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, resources, cpu, memory, workers]
---
# Scanner Resource Utilization
## What It Checks
Queries the Scanner service at `/api/v1/resources/stats` and evaluates CPU, memory, and worker pool health:
- **CPU utilization**: warn at 75%, fail at 90%.
- **Memory utilization**: warn at 80%, fail at 95%.
- **Worker pool saturation**: warn when all workers are busy (zero idle workers).
Evidence collected: `cpu_utilization`, `memory_utilization`, `memory_used_mb`, `active_workers`, `total_workers`, `idle_workers`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
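As a sketch, the three resource dimensions combine as follows; the values are illustrative, not a real `/api/v1/resources/stats` payload.

```shell
# Sample evidence values (illustrative only)
cpu_utilization=82   # percent
memory_utilization=60
idle_workers=3

verdict=ok
if [ "$cpu_utilization" -ge 90 ] || [ "$memory_utilization" -ge 95 ]; then
  verdict=fail
elif [ "$cpu_utilization" -ge 75 ] || [ "$memory_utilization" -ge 80 ] || [ "$idle_workers" -eq 0 ]; then
  verdict=warn
fi
echo "$verdict"
```

CPU at 82% crosses the 75% warn line while memory and the worker pool stay healthy, so the sketch reports `warn`.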
## Why It Matters
The scanner is one of the most resource-intensive services in the Stella Ops stack. It processes container images, generates SBOMs, runs vulnerability matching, and performs reachability analysis. When scanner resources are exhausted, all downstream pipelines stall: queue depth grows, scan latency increases, and release gates time out waiting for scan results. Memory exhaustion can cause OOM kills that lose in-progress work.
## Common Causes
- High scan volume during bulk import or CI surge
- Memory leak from accumulated scan artifacts not being garbage collected
- Large container images (multi-GB layers) being processed concurrently
- Insufficient CPU/memory allocation relative to workload
- All workers busy with no capacity for new jobs
- Worker scaling not keeping up with demand
## How to Fix
### Docker Compose
```bash
# Check scanner resource usage
docker stats scanner --no-stream
# Reduce concurrent jobs to lower resource pressure
# In docker-compose.stella-ops.yml:
```
```yaml
services:
scanner:
deploy:
resources:
limits:
memory: 4G
cpus: "4.0"
environment:
Scanner__MaxConcurrentJobs: "2"
Scanner__Workers__Count: "4"
```
```bash
# Restart scanner to apply new resource limits
docker compose -f docker-compose.stella-ops.yml up -d scanner
```
### Bare Metal / systemd
```bash
# Check current resource usage
top -p "$(pgrep -d, -f stellaops-scanner)"
# Reduce concurrent processing
stella scanner config set MaxConcurrentJobs 2
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Scanner": {
"MaxConcurrentJobs": 2,
"Workers": {
"Count": 4
}
}
}
```
```bash
sudo systemctl restart stellaops-scanner
```
### Kubernetes / Helm
```bash
# Check pod resource usage
kubectl top pods -l app=stellaops-scanner
# Scale horizontally instead of vertically
kubectl scale deployment stellaops-scanner --replicas=4
```
Set in Helm `values.yaml`:
```yaml
scanner:
replicas: 4
resources:
requests:
memory: 2Gi
cpu: "2"
limits:
memory: 4Gi
cpu: "4"
maxConcurrentJobs: 2
```
## Verification
```bash
stella doctor run --check check.scanner.resources
```
## Related Checks
- `check.scanner.queue` -- resource exhaustion causes queue backlog growth
- `check.scanner.sbom` -- memory exhaustion causes SBOM generation failures
- `check.scanner.reachability` -- CPU constraints slow computation times
- `check.scanner.slice.cache` -- cache effectiveness reduces resource demand

---
checkId: check.scanner.sbom
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, sbom, cyclonedx, spdx, compliance]
---
# SBOM Generation Health
## What It Checks
Queries the Scanner service at `/api/v1/sbom/stats` and evaluates SBOM generation health:
- **Success rate**: warn when below 95%, fail when below 80%.
- **Validation failures**: any schema validation failures trigger a warning regardless of success rate.
Evidence collected: `total_generated`, `successful_generations`, `failed_generations`, `success_rate`, `format_cyclonedx`, `format_spdx`, `validation_failures`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
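The interplay of the two conditions (success rate bands, plus validation failures warning regardless) can be sketched as below, with sample values rather than a real `/api/v1/sbom/stats` response.

```shell
# Sample evidence values (illustrative only)
success_rate=97      # percent
validation_failures=2

verdict=ok
if [ "$success_rate" -lt 80 ]; then
  verdict=fail
elif [ "$success_rate" -lt 95 ] || [ "$validation_failures" -gt 0 ]; then
  verdict=warn
fi
echo "$verdict"
```

A 97% success rate alone would pass, but the two schema validation failures still pull the verdict down to `warn`.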
## Why It Matters
SBOMs are the foundation of the entire Stella Ops security pipeline. Without valid SBOMs, vulnerability scanning produces incomplete results, reachability analysis cannot run, and release gates that require an SBOM attestation will block promotions. Compliance frameworks (e.g., EO 14028, EU CRA) mandate accurate SBOMs for every shipped artifact.
## Common Causes
- Invalid or corrupted source artifacts (truncated layers, missing manifests)
- Parser errors for specific ecosystems (e.g., unsupported lockfile format)
- Memory exhaustion on large monorepo or multi-module projects
- SBOM schema validation failures due to generator version mismatch
- Unsupported container base image format
- Minor parsing issues in transitive dependency resolution
## How to Fix
### Docker Compose
```bash
# View recent SBOM generation failures
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "sbom.*fail"
# Restart the scanner to clear any cached bad state
docker compose -f docker-compose.stella-ops.yml restart scanner
# Increase memory limit if OOM is suspected
# In docker-compose.stella-ops.yml:
```
```yaml
services:
scanner:
deploy:
resources:
limits:
memory: 4G
environment:
Scanner__Sbom__ValidationMode: "Strict"
Scanner__Sbom__MaxArtifactSizeMb: "500"
```
### Bare Metal / systemd
```bash
# Check scanner logs for SBOM errors
sudo journalctl -u stellaops-scanner --since "1 hour ago" | grep -i sbom
# Retry failed SBOMs
stella scanner sbom retry --failed
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Sbom": {
"ValidationMode": "Strict",
"MaxArtifactSizeMb": 500
}
}
```
### Kubernetes / Helm
```bash
# Check for OOMKilled scanner pods
kubectl get pods -l app=stellaops-scanner -o wide
kubectl describe pod <scanner-pod> | grep -A 5 "Last State"
# View SBOM-related logs
kubectl logs -l app=stellaops-scanner --tail=200 | grep -i sbom
```
Set in Helm `values.yaml`:
```yaml
scanner:
resources:
limits:
memory: 4Gi
sbom:
validationMode: Strict
maxArtifactSizeMb: 500
```
## Verification
```bash
stella doctor run --check check.scanner.sbom
```
## Related Checks
- `check.scanner.queue` -- queue backlog can delay SBOM generation
- `check.scanner.witness.graph` -- witness graphs depend on successful SBOM output
- `check.scanner.resources` -- resource exhaustion is a top cause of SBOM failures

---
checkId: check.scanner.slice.cache
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, cache, slice, performance]
---
# Slice Cache Health
## What It Checks
Queries the Scanner service at `/api/v1/cache/stats` and evaluates slice cache effectiveness:
- **Storage utilization**: warn at 80% full, fail at 95% full.
- **Hit rate**: warn below 50%, fail below 20%.
- **Eviction rate**: reported as evidence.
Evidence collected: `hit_rate`, `hits`, `misses`, `entry_count`, `used_bytes`, `total_bytes`, `storage_utilization`, `eviction_rate`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
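The storage and hit-rate bands combine as sketched below; the values are illustrative, not a real `/api/v1/cache/stats` response.

```shell
# Sample evidence values (illustrative only)
storage_utilization=70  # percent full
hit_rate=42             # percent of lookups served from cache

verdict=ok
if [ "$storage_utilization" -ge 95 ] || [ "$hit_rate" -lt 20 ]; then
  verdict=fail
elif [ "$storage_utilization" -ge 80 ] || [ "$hit_rate" -lt 50 ]; then
  verdict=warn
fi
echo "$verdict"
```

Storage is comfortably under its thresholds, but the 42% hit rate falls below the 50% warn line, so the sketch reports `warn`.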
## Why It Matters
The slice cache stores pre-computed code slices used by the reachability engine. A healthy cache avoids re-analyzing the same dependency trees on every scan, reducing computation time from seconds to milliseconds. When the cache hit rate drops, reachability computations slow dramatically, causing queue backlog and delayed security feedback. When storage fills up, evictions accelerate and the cache thrashes, making it effectively useless.
## Common Causes
- Cache size limit too small for the working set of scanned artifacts
- TTL configured too long, preventing eviction of stale entries
- Eviction policy not working (configuration error)
- Unexpected growth in the number of unique slices (new projects onboarded)
- Cache was recently cleared (restart, volume reset)
- Working set larger than cache capacity
## How to Fix
### Docker Compose
```bash
# Clear stale cache entries
stella scanner cache prune --stale
# Warm the cache for active projects
stella scanner cache warm
```
Increase cache size in `docker-compose.stella-ops.yml`:
```yaml
services:
scanner:
environment:
Scanner__Cache__MaxSizeBytes: "4294967296" # 4 GB
Scanner__Cache__TtlHours: "72"
Scanner__Cache__EvictionPolicy: "LRU"
volumes:
- scanner-cache:/data/cache
```
### Bare Metal / systemd
```bash
# Check cache directory size
du -sh /var/lib/stellaops/scanner/cache
# Prune stale entries
stella scanner cache prune --stale
# Warm cache
stella scanner cache warm
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Cache": {
"MaxSizeBytes": 4294967296,
"TtlHours": 72,
"EvictionPolicy": "LRU",
"DataPath": "/var/lib/stellaops/scanner/cache"
}
}
```
```bash
sudo systemctl restart stellaops-scanner
```
### Kubernetes / Helm
```bash
# Check PVC usage for cache volume
kubectl exec -it <scanner-pod> -- df -h /data/cache
```
Set in Helm `values.yaml`:
```yaml
scanner:
cache:
maxSizeBytes: 4294967296 # 4 GB
ttlHours: 72
evictionPolicy: LRU
persistence:
enabled: true
size: 10Gi
storageClass: fast-ssd
```
## Verification
```bash
stella doctor run --check check.scanner.slice.cache
```
## Related Checks
- `check.scanner.reachability` -- cache misses directly increase computation time
- `check.scanner.resources` -- cache thrashing increases CPU and memory usage
- `check.scanner.queue` -- slow cache performance cascades into queue backlog

---
checkId: check.scanner.vuln
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, vulnerability, cve, database]
---
# Vulnerability Scan Health
## What It Checks
Queries the Scanner service at `/api/v1/vuln/stats` and evaluates vulnerability scanning health, focusing on database freshness:
- **Database freshness**: warn when the vulnerability database is older than 24 hours, fail when older than 72 hours.
- **Scan failure rate**: warn when scan failure rate exceeds 10%.
Evidence collected: `database_age_hours`, `last_db_update`, `total_cves`, `scans_completed`, `scan_failures`, `failure_rate`, `vulnerabilities_found`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
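The freshness and failure-rate thresholds combine as sketched below, using illustrative values rather than a real `/api/v1/vuln/stats` payload.

```shell
# Sample evidence values (illustrative only)
database_age_hours=30
failure_rate=2       # percent of scans failing

verdict=ok
if [ "$database_age_hours" -gt 72 ]; then
  verdict=fail
elif [ "$database_age_hours" -gt 24 ] || [ "$failure_rate" -gt 10 ]; then
  verdict=warn
fi
echo "$verdict"
```

A 30-hour-old database is past the 24-hour warn threshold but inside the 72-hour fail window, so the check reports `warn`.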
## Why It Matters
A stale vulnerability database means newly disclosed CVEs are not detected in scans, creating a false sense of security. Artifacts that pass policy gates with an outdated database may contain exploitable vulnerabilities that would have been caught with current data. In regulated environments, scan freshness is an auditable compliance requirement. High scan failure rates mean some artifacts are not being scanned at all.
## Common Causes
- Vulnerability database sync job failed or is not scheduled
- Feed source (NVD, OSV, vendor advisory) unavailable or rate-limited
- Network connectivity issue preventing feed downloads
- Scheduled sync delayed due to system overload
- Parsing errors on specific artifact formats
- Unsupported package ecosystem or lockfile format
## How to Fix
### Docker Compose
```bash
# Trigger an immediate database sync
stella scanner db sync
# Check sync job status
stella scanner db status
# View scanner logs for sync errors
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "sync\|feed\|vuln"
```
Configure sync schedule in `docker-compose.stella-ops.yml`:
```yaml
services:
scanner:
environment:
Scanner__VulnDb__SyncIntervalHours: "6"
Scanner__VulnDb__FeedSources: "nvd,osv,github"
Scanner__VulnDb__RetryCount: "3"
```
### Bare Metal / systemd
```bash
# Trigger manual sync
stella scanner db sync
# Check sync schedule
stella scanner db schedule
# View sync logs
sudo journalctl -u stellaops-scanner --since "24 hours ago" | grep -i "sync\|feed"
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"VulnDb": {
"SyncIntervalHours": 6,
"FeedSources": ["nvd", "osv", "github"],
"RetryCount": 3
}
}
```
### Kubernetes / Helm
```bash
# Check scanner pod logs for sync status
kubectl logs -l app=stellaops-scanner --tail=100 | grep -i sync
# Verify CronJob for database sync exists and is running
kubectl get cronjobs -l app=stellaops-scanner-sync
```
Set in Helm `values.yaml`:
```yaml
scanner:
vulnDb:
syncIntervalHours: 6
feedSources:
- nvd
- osv
- github
retryCount: 3
syncCronJob:
schedule: "0 */6 * * *"
```
## Verification
```bash
stella doctor run --check check.scanner.vuln
```
## Related Checks
- `check.scanner.queue` -- scan failures may originate from queue processing issues
- `check.scanner.sbom` -- vulnerability matching depends on SBOM quality
- `check.scanner.reachability` -- reachability analysis uses vulnerability data to filter findings

---
checkId: check.scanner.witness.graph
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, witness, graph, reachability, evidence]
---
# Witness Graph Health
## What It Checks
Queries the Scanner service at `/api/v1/witness/stats` and evaluates witness graph construction health:
- **Construction failures**: fail if failure rate exceeds 10% of total constructions.
- **Incomplete graphs**: warn if any graphs are incomplete (missing nodes or edges).
- **Consistency errors**: warn if any consistency errors are detected (orphaned nodes, version mismatches).
Evidence collected: `total_constructed`, `construction_failures`, `failure_rate`, `incomplete_graphs`, `avg_nodes_per_graph`, `avg_edges_per_graph`, `avg_completeness`, `consistency_errors`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
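The three conditions combine as sketched below; values are illustrative sample data, not a real `/api/v1/witness/stats` response.

```shell
# Sample evidence values (illustrative only)
failure_rate=3        # percent of constructions failing
incomplete_graphs=1
consistency_errors=0

verdict=ok
if [ "$failure_rate" -gt 10 ]; then
  verdict=fail
elif [ "$incomplete_graphs" -gt 0 ] || [ "$consistency_errors" -gt 0 ]; then
  verdict=warn
fi
echo "$verdict"
```

The failure rate is within bounds, but the single incomplete graph is enough to trigger a `warn`.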
## Why It Matters
Witness graphs are the evidence artifacts that prove how a vulnerability reachability verdict was reached. They record the call chain from application entry point to vulnerable function. Without intact witness graphs, reachability findings lack provenance, attestation of scan results is weakened, and auditors cannot verify that "unreachable" verdicts are legitimate. Incomplete or inconsistent graphs can cause incorrect reachability conclusions.
## Common Causes
- Missing SBOM input (SBOM generation failed for the artifact)
- Parser error on specific artifact types or ecosystems
- Cyclical dependency detected causing infinite traversal
- Resource exhaustion during graph construction on large projects
- Partial SBOM data (some dependencies resolved, others missing)
- Missing transitive dependencies in the dependency tree
- Version mismatch between SBOM and slice data
- Orphaned nodes from stale cache entries
## How to Fix
### Docker Compose
```bash
# View recent construction failures
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "witness.*fail\|graph.*error"
# Rebuild failed graphs
stella scanner witness rebuild --failed
# Check SBOM pipeline health (witness graphs depend on SBOMs)
stella doctor run --check check.scanner.sbom
```
```yaml
services:
scanner:
environment:
Scanner__WitnessGraph__MaxDepth: "50"
Scanner__WitnessGraph__TimeoutMs: "30000"
Scanner__WitnessGraph__ConsistencyCheckEnabled: "true"
```
### Bare Metal / systemd
```bash
# View construction errors
sudo journalctl -u stellaops-scanner --since "1 hour ago" | grep -i witness
# Rebuild failed graphs
stella scanner witness rebuild --failed
# View graph statistics
stella scanner witness stats
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"WitnessGraph": {
"MaxDepth": 50,
"TimeoutMs": 30000,
"ConsistencyCheckEnabled": true
}
}
```
### Kubernetes / Helm
```bash
# Check scanner logs for witness graph issues
kubectl logs -l app=stellaops-scanner --tail=200 | grep -i witness
# Rebuild failed graphs
kubectl exec -it <scanner-pod> -- stella scanner witness rebuild --failed
```
Set in Helm `values.yaml`:
```yaml
scanner:
witnessGraph:
maxDepth: 50
timeoutMs: 30000
consistencyCheckEnabled: true
```
## Verification
```bash
stella doctor run --check check.scanner.witness.graph
```
## Related Checks
- `check.scanner.sbom` -- witness graphs are constructed from SBOM data
- `check.scanner.reachability` -- reachability verdicts depend on witness graph integrity
- `check.scanner.slice.cache` -- stale cache entries can cause consistency errors
- `check.scanner.resources` -- resource exhaustion causes construction failures