master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

Slice Cache Health

What It Checks

Queries the Scanner service at /api/v1/cache/stats and evaluates slice cache effectiveness:

Storage utilization: warn at 80% full, fail at 95% full.
Hit rate: warn below 50%, fail below 20%.
Eviction rate: reported as evidence.

Evidence collected: hit_rate, hits, misses, entry_count, used_bytes, total_bytes, storage_utilization, eviction_rate.
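
The thresholds above can be sketched as a small classifier. This is an illustrative sketch only, not the check's actual implementation. It assumes that hit_rate and storage_utilization are reported as fractions between 0 and 1, that "at 80%" is inclusive while "below 50%" is strict, that the worse of the two results wins, and that a healthy result is called "pass" (a hypothetical label).

```python
from typing import Mapping

# Thresholds taken from the check description.
# Assumption: both metrics are fractions in [0, 1], not percentages.
UTILIZATION_WARN, UTILIZATION_FAIL = 0.80, 0.95
HIT_RATE_WARN, HIT_RATE_FAIL = 0.50, 0.20

# Ordering used to pick the worse of two results ("pass" is a hypothetical label).
_ORDER = {"pass": 0, "warn": 1, "fail": 2}


def classify_utilization(utilization: float) -> str:
    """Warn at 80% full, fail at 95% full (boundaries assumed inclusive)."""
    if utilization >= UTILIZATION_FAIL:
        return "fail"
    if utilization >= UTILIZATION_WARN:
        return "warn"
    return "pass"


def classify_hit_rate(hit_rate: float) -> str:
    """Warn below 50%, fail below 20% (boundaries assumed exclusive)."""
    if hit_rate < HIT_RATE_FAIL:
        return "fail"
    if hit_rate < HIT_RATE_WARN:
        return "warn"
    return "pass"


def classify_cache(stats: Mapping[str, float]) -> str:
    """Combine both results; assumes the more severe one decides the outcome."""
    results = [
        classify_utilization(stats["storage_utilization"]),
        classify_hit_rate(stats["hit_rate"]),
    ]
    return max(results, key=_ORDER.__getitem__)
```

Under these assumptions, a cache that is 96% full fails even with a healthy hit rate, because the worse of the two results decides the outcome.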

The check requires Scanner:Url or Services:Scanner:Url to be configured.
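
As a rough illustration of the configuration requirement, the sketch below resolves the Scanner base URL from a flat key/value mapping and appends the stats path. The lookup order (Scanner:Url first, then Services:Scanner:Url) and the function name resolve_stats_url are assumptions made for this example. The source does not say which key takes precedence or what the check does when neither is set.

```python
from typing import Mapping, Optional

# Keys named in the check description. The precedence order is an assumption.
CONFIG_KEYS = ("Scanner:Url", "Services:Scanner:Url")
STATS_PATH = "/api/v1/cache/stats"


def resolve_stats_url(config: Mapping[str, str]) -> Optional[str]:
    """Return the cache stats URL, or None when no Scanner URL is configured."""
    for key in CONFIG_KEYS:
        base = (config.get(key) or "").strip()
        if base:
            # Avoid a double slash when the base URL ends with "/".
            return base.rstrip("/") + STATS_PATH
    return None
```

A caller would treat a None result as "not configured" and decide separately how to report it.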

Why It Matters

The slice cache stores pre-computed code slices used by the reachability engine. A healthy cache avoids re-analyzing the same dependency trees on every scan, reducing computation time from seconds to milliseconds. When the cache hit rate drops, reachability computations slow dramatically, causing queue backlog and delayed security feedback. When storage fills up, evictions accelerate and the cache thrashes, making it effectively useless.
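
The effect of a falling hit rate can be estimated with a simple weighted average: expected time per lookup = hit_rate × hit time + (1 − hit_rate) × miss time. The sketch below applies that formula. The latencies in the usage note are hypothetical placeholders, not measurements from the Scanner.

```python
def expected_lookup_seconds(hit_rate: float, hit_seconds: float, miss_seconds: float) -> float:
    """Average time per slice lookup for a given cache hit rate.

    hit_seconds is the cost of serving a cached slice and miss_seconds the
    cost of recomputing it. Both are inputs supplied by the caller.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_seconds + (1.0 - hit_rate) * miss_seconds
```

With a hypothetical 5 ms hit and 2 s miss, a 90% hit rate averages roughly 0.2 s per lookup and a 20% hit rate roughly 1.6 s. The miss cost dominates well before the cache is empty.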

Common Causes

Cache size limit too small for the working set of scanned artifacts
TTL configured too long, preventing eviction of stale entries
Eviction policy not working (configuration error)
Unexpected growth in the number of unique slices (new projects onboarded)
Cache was recently cleared (restart, volume reset)
Working set larger than cache capacity

How to Fix

Docker Compose

# Clear stale cache entries
stella scanner cache prune --stale

# Warm the cache for active projects
stella scanner cache warm

Increase cache size in docker-compose.stella-ops.yml:

services:
  scanner:
    environment:
      Scanner__Cache__MaxSizeBytes: "4294967296"  # 4 GiB
      Scanner__Cache__TtlHours: "72"
      Scanner__Cache__EvictionPolicy: "LRU"
    volumes:
      - scanner-cache:/data/cache
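
The size value 4294967296 is 4 × 1024³ bytes. When choosing a different limit, a small helper like the hypothetical gib_to_bytes below avoids arithmetic slips. It is only a convenience for computing the number to paste into the configuration and is not part of the stella tooling.

```python
def gib_to_bytes(gib: int) -> int:
    """Convert a size in GiB (1024**3 bytes) to a byte count."""
    if gib < 0:
        raise ValueError("size must not be negative")
    return gib * 1024 ** 3
```

For example, gib_to_bytes(4) gives the 4294967296 used in the configuration examples in this article.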

Bare Metal / systemd

# Check cache directory size
du -sh /var/lib/stellaops/scanner/cache

# Prune stale entries
stella scanner cache prune --stale

# Warm cache
stella scanner cache warm

Edit /etc/stellaops/scanner/appsettings.json:

{
  "Cache": {
    "MaxSizeBytes": 4294967296,
    "TtlHours": 72,
    "EvictionPolicy": "LRU",
    "DataPath": "/var/lib/stellaops/scanner/cache"
  }
}

sudo systemctl restart stellaops-scanner

Kubernetes / Helm

# Check PVC usage for cache volume
kubectl exec -it <scanner-pod> -- df -h /data/cache

Set in Helm values.yaml:

scanner:
  cache:
    maxSizeBytes: 4294967296  # 4 GiB
    ttlHours: 72
    evictionPolicy: LRU
    persistence:
      enabled: true
      size: 10Gi
      storageClass: fast-ssd

Verification

stella doctor run --check check.scanner.slice.cache

Related Checks

check.scanner.reachability -- cache misses directly increase computation time
check.scanner.resources -- cache thrashing increases CPU and memory usage
check.scanner.queue -- slow cache performance cascades into queue backlog