| checkId | plugin | severity | tags |
|---|---|---|---|
| check.compliance.evidence-rate | stellaops.doctor.compliance | fail | |
Evidence Generation Rate
What It Checks
Monitors evidence generation success rate by querying the Evidence Locker at /api/v1/evidence/metrics. The check computes the success rate as (totalGenerated - failed) / totalGenerated over the last 24 hours and compares it against two thresholds:
| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| Success rate < 95% | Fail |
| Success rate >= 95% and < 99% | Warn |
| Success rate >= 99% | Pass |
Evidence collected: success_rate, total_generated_24h, failed_24h, pending_24h, avg_generation_time_ms.
The check only runs when EvidenceLocker:Url or Services:EvidenceLocker:Url is configured. It uses a 10-second HTTP timeout. If no evidence has been generated (totalGenerated == 0), the success rate defaults to 100%.
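The threshold logic described above can be sketched in shell. This is a minimal illustration, not the plugin's actual implementation: the function name `classify_evidence_rate` is hypothetical, it takes the two counts as plain arguments rather than reading them from the metrics endpoint, and it does not cover the "Evidence Locker unreachable" case, which yields Warn before any rate is computed.

```shell
#!/usr/bin/env bash
# classify_evidence_rate TOTAL FAILED   (hypothetical helper, for illustration only)
# Prints "<result> <success-rate-percent>" using the thresholds from the table above.
classify_evidence_rate() {
  awk -v total="$1" -v failed="$2" 'BEGIN {
    # No evidence generated in the window: the rate defaults to 100%.
    if (total == 0) { print "Pass 100.00"; exit }
    ok = total - failed
    # Compare cross-multiplied counts to avoid rounding errors at the boundaries.
    if (ok * 100 < 95 * total)      result = "Fail"
    else if (ok * 100 < 99 * total) result = "Warn"
    else                            result = "Pass"
    printf "%s %.2f\n", result, ok * 100 / total
  }'
}

classify_evidence_rate 1000 5    # Pass 99.50
classify_evidence_rate 1000 30   # Warn 97.00
classify_evidence_rate 1000 80   # Fail 92.00
classify_evidence_rate 0 0       # Pass 100.00
```

Feeding in `total_generated_24h` and `failed_24h` from the collected evidence should reproduce the result the check reported, which is a quick way to confirm which threshold was crossed.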
Why It Matters
Evidence generation is on the critical path of the release pipeline. Every release decision, scan result, and policy evaluation produces evidence that feeds compliance audits and attestation chains. A falling success rate means evidence records are being lost, which creates gaps in the audit trail. Below 95%, the system is losing more than 1 in 20 evidence records, which makes compliance reporting unreliable and can invalidate release approvals that lack supporting evidence.
Common Causes
- Evidence generation service failures (internal errors, OOM)
- Database connectivity issues preventing evidence persistence
- Signing key unavailable, blocking signed evidence creation
- Storage quota exceeded on the evidence backend
- Intermittent failures due to high load or resource contention
How to Fix
Docker Compose
# Check evidence locker logs for errors
docker compose logs evidence-locker --since 1h | grep -i error
# Verify signing keys
docker compose exec evidence-locker stella evidence keys status
# Check database connectivity
docker compose exec evidence-locker stella evidence db check
# Check storage capacity
docker compose exec evidence-locker df -h /data/evidence
# If storage is full, preview a cleanup of evidence older than 90 days (dry run), or expand the volume
docker compose exec evidence-locker stella evidence cleanup --older-than 90d --dry-run
Bare Metal / systemd
# Check service logs
journalctl -u stellaops-evidence-locker --since "1 hour ago" | grep -i error
# Verify signing keys
stella evidence keys status
# Check database connectivity
stella evidence db check
# Check storage usage
df -h /var/lib/stellaops/evidence
# Restart the service after applying a fix
sudo systemctl restart stellaops-evidence-locker
Kubernetes / Helm
# Check evidence locker pod logs
kubectl logs deploy/stellaops-evidence-locker --since=1h | grep -i error
# Verify signing keys
kubectl exec deploy/stellaops-evidence-locker -- stella evidence keys status
# Check persistent volume usage
kubectl exec deploy/stellaops-evidence-locker -- df -h /data/evidence
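# Check for OOMKilled pods (OOMKilled is recorded as a container termination reason in pod status)
kubectl get pods -n stellaops -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled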
Verification
stella doctor run --check check.compliance.evidence-rate
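If the check still reports Warn or Fail, the raw metrics it consumes can be inspected directly. The sketch below queries the /api/v1/evidence/metrics endpoint with the same 10-second timeout the check uses. The helper names and the `EVIDENCE_LOCKER_URL` variable are hypothetical, and any authentication headers your deployment requires are not shown.

```shell
# Hypothetical helper: build the metrics URL from the configured base URL,
# dropping one trailing slash so the path separator is not doubled.
metrics_url() {
  printf '%s/api/v1/evidence/metrics\n' "${1%/}"
}

# Hypothetical helper: fetch the raw metrics document.
# --fail makes curl exit non-zero on HTTP errors; --max-time 10 mirrors the check's timeout.
fetch_evidence_metrics() {
  curl --silent --show-error --fail --max-time 10 "$(metrics_url "$1")"
}

# Usage (EVIDENCE_LOCKER_URL is an assumed environment variable):
#   fetch_evidence_metrics "$EVIDENCE_LOCKER_URL"
```

A non-zero exit from `fetch_evidence_metrics` suggests a connectivity or HTTP-level problem rather than a low success rate, which corresponds to the "Evidence Locker unreachable" row in the table.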
Related Checks
- check.compliance.attestation-signing: signing key health affects evidence generation
- check.compliance.evidence-integrity: integrity of generated evidence
- check.compliance.provenance-completeness: provenance depends on evidence generation
- check.compliance.audit-readiness: overall audit readiness depends on evidence availability