Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,101 @@
---
checkId: check.compliance.attestation-signing
plugin: stellaops.doctor.compliance
severity: fail
tags: [compliance, attestation, signing, crypto]
---
# Attestation Signing Health
## What It Checks
Monitors attestation signing capability by querying the Attestor service at `/api/v1/signing/status`. The check validates:
- **Key availability**: whether a signing key is loaded and accessible (via `keyAvailable` in the response).
- **Key expiration**: if the key has an `expiresAt` timestamp, the check fails when the key is already expired, warns when expiry is within 30 days, and passes otherwise.
- **Signing activity**: reports the key type and the number of signatures produced in the last 24 hours.

The check only runs when `Attestor:Url` or `Services:Attestor:Url` is configured. It uses a 10-second HTTP timeout.

| Condition | Result |
|---|---|
| Attestor unreachable or HTTP error | Fail |
| Key not available | Fail |
| Key expired | Fail |
| Key expires within 30 days | Warn |
| Key available and not expiring soon | Pass |
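
The decision table above can be read as a small pure function. The following is a minimal Python sketch, not the plugin's actual implementation: the function name, parameter names, and the inclusive 30-day boundary are assumptions made for illustration, and the unreachable/HTTP-error case is assumed to be handled before this point.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Warning window taken from the table above; treating the boundary as
# inclusive is an assumption.
WARN_WINDOW = timedelta(days=30)


def classify_signing_status(key_available: bool,
                            expires_at: Optional[datetime],
                            now: datetime) -> str:
    """Map a signing-status response to Pass/Warn/Fail (hypothetical helper)."""
    if not key_available:
        return "Fail"
    if expires_at is not None:
        if expires_at <= now:
            return "Fail"          # key already expired
        if expires_at - now <= WARN_WINDOW:
            return "Warn"          # key expires soon
    return "Pass"                  # key available, no near-term expiry


now = datetime(2026, 3, 27, tzinfo=timezone.utc)
print(classify_signing_status(True, now + timedelta(days=10), now))
```

A key with no `expiresAt` value is treated as non-expiring here, which matches the "if the key has an `expiresAt` timestamp" wording above.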
## Why It Matters
Attestation signing is the foundation of Stella Ops' evidence chain. Without a working signing key, the system cannot create attestations for releases, SBOM scans, or policy decisions. This breaks the entire compliance audit trail and makes releases unverifiable. An expired key that is not rotated in time has the same downstream impact as a missing key, and without monitoring it arrives with no advance warning.
## Common Causes
- HSM/KMS connectivity issue preventing key access
- Key rotation in progress (brief window of unavailability)
- Key expired or revoked without replacement
- Permission denied on the key management backend
- Attestor service unavailable or misconfigured endpoint URL
## How to Fix
### Docker Compose
Verify the Attestor service is running and the URL is correct:
```bash
# Check attestor container health
docker compose ps attestor
# Verify signing key status
docker compose exec attestor stella attestor key status
# If key is expired, rotate it
docker compose exec attestor stella attestor key rotate
# Ensure the URL is correct in your .env or compose override
# Attestor__Url=http://attestor:5082
```
### Bare Metal / systemd
Check the Attestor service and key configuration:
```bash
# Check service status
sudo systemctl status stellaops-attestor
# Verify key status
stella attestor key status
# Test HSM/KMS connectivity
stella attestor hsm test
# Rotate an expired key
stella attestor key rotate
# If using appsettings.json, verify Attestor:Url is correct
jq '.Attestor' /etc/stellaops/appsettings.json
```
### Kubernetes / Helm
```bash
# Check attestor pod status
kubectl get pods -l app=stellaops-attestor
# Check signing key status
kubectl exec deploy/stellaops-attestor -- stella attestor key status
# Verify HSM/KMS connectivity from the pod
kubectl exec deploy/stellaops-attestor -- stella attestor hsm test
# Schedule key rotation via Helm values
helm upgrade stellaops ./charts/stellaops \
  --set attestor.keyRotation.enabled=true \
  --set attestor.keyRotation.scheduleBeforeExpiryDays=30
```
## Verification
```bash
stella doctor run --check check.compliance.attestation-signing
```
## Related Checks
- `check.compliance.evidence-rate` — monitors evidence generation success rate, which depends on signing
- `check.compliance.provenance-completeness` — verifies provenance records exist for releases (requires working signing)
- `check.compliance.evidence-integrity` — verifies signatures on stored evidence
- `check.crypto.hsm` — validates HSM/PKCS#11 module availability used by the signing key

@@ -0,0 +1,100 @@
---
checkId: check.compliance.audit-readiness
plugin: stellaops.doctor.compliance
severity: warn
tags: [compliance, audit, evidence]
---
# Audit Readiness
## What It Checks
Verifies the system is ready for compliance audits by querying the Evidence Locker at `/api/v1/evidence/audit-readiness`. The check evaluates four readiness criteria:
- **Retention policy configured**: whether a data retention policy is active.
- **Audit logging enabled**: whether audit log capture is turned on.
- **Backup verified**: whether the most recent backup has been validated.
- **Evidence retention age**: whether the oldest evidence meets the required retention period (default 365 days).

| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| 3 or more issues found | Fail |
| 1-2 issues found | Warn |
| All criteria satisfied | Pass |

Evidence collected: `issues_count`, `retention_policy_configured`, `audit_log_enabled`, `backup_verified`, `evidence_count`, `oldest_evidence_days`.

The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 15-second HTTP timeout.
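
The issue-count thresholds above can be sketched as follows. This is an illustrative Python snippet under stated assumptions, not the check's real code: the function name and the boolean inputs (including `retention_age_ok`, which stands in for whatever comparison the check makes against the required retention period) are hypothetical.

```python
from typing import Tuple


def classify_audit_readiness(retention_policy_configured: bool,
                             audit_log_enabled: bool,
                             backup_verified: bool,
                             retention_age_ok: bool) -> Tuple[str, int]:
    """Return (result, issues_count) for the four readiness criteria."""
    criteria = [retention_policy_configured, audit_log_enabled,
                backup_verified, retention_age_ok]
    issues_count = criteria.count(False)   # each unmet criterion is one issue
    if issues_count >= 3:
        return "Fail", issues_count
    if issues_count >= 1:
        return "Warn", issues_count
    return "Pass", issues_count


print(classify_audit_readiness(True, False, True, True))
```

Because three of the four criteria must be unmet before the result becomes Fail, a single gap such as disabled audit logging surfaces only as a warning.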
## Why It Matters
Compliance audits (SOC2, FedRAMP, HIPAA, PCI-DSS) require verifiable evidence retention, continuous audit logging, and validated backups. If any of these controls is missing, the organization cannot demonstrate compliance during an audit. A missing retention policy means evidence may be silently deleted. Disabled audit logging creates gaps in the chain of custody. Unverified backups risk data loss during incident recovery.
## Common Causes
- No retention policy configured (no default policy is set)
- Audit logging disabled, either deliberately in configuration or by mistake
- Backup verification job not running or failing silently
- Evidence retention shorter than the required period (e.g., 90 days configured but 365 required)
## How to Fix
### Docker Compose
```bash
# Configure retention policy
docker compose exec evidence-locker stella evidence retention set --days 365
# Enable audit logging
docker compose exec platform stella audit enable
# Verify backup status
docker compose exec evidence-locker stella evidence backup verify
# Set environment variables if needed
# EvidenceLocker__Retention__Days=365
# AuditLog__Enabled=true
```
### Bare Metal / systemd
```bash
# Configure retention policy
stella evidence retention set --days 365
# Enable audit logging
stella audit enable
# Verify backup status
stella evidence backup verify
# Edit appsettings.json
# "EvidenceLocker": { "Retention": { "Days": 365 } }
# "AuditLog": { "Enabled": true }
sudo systemctl restart stellaops-evidence-locker
```
### Kubernetes / Helm
```yaml
# values.yaml
evidenceLocker:
  retention:
    days: 365
  backup:
    enabled: true
    schedule: "0 2 * * *"
    verifyAfterBackup: true
auditLog:
  enabled: true
```
```bash
helm upgrade stellaops ./charts/stellaops -f values.yaml
```
## Verification
```bash
stella doctor run --check check.compliance.audit-readiness
```
## Related Checks
- `check.compliance.evidence-integrity` — verifies evidence has not been tampered with
- `check.compliance.export-readiness` — verifies evidence can be exported for auditors
- `check.compliance.evidence-rate` — monitors evidence generation health
- `check.compliance.framework` — verifies compliance framework controls are passing

@@ -0,0 +1,100 @@
---
checkId: check.compliance.evidence-integrity
plugin: stellaops.doctor.compliance
severity: fail
tags: [compliance, security, integrity, signatures]
---
# Evidence Integrity
## What It Checks
Detects evidence tampering or integrity issues by querying the Evidence Locker at `/api/v1/evidence/integrity-check`. The check verifies cryptographic signatures and hash chains across all stored evidence records. It evaluates:
- **Tampered records**: evidence records where the signature or hash does not match the stored content.
- **Verification errors**: records that could not be verified (e.g., missing certificates, unsupported algorithms).
- **Hash chain validity**: whether the sequential hash chain linking evidence records is intact.

| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| Any tampered records detected (tamperedCount > 0) | Fail (CRITICAL) |
| Verification errors but no tampering | Warn |
| All records verified, no tampering | Pass |

Evidence collected: `tampered_count`, `verified_count`, `total_checked`, `first_tampered_id`, `verification_errors`, `hash_chain_valid`.

The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 60-second HTTP timeout due to the intensive nature of the integrity scan.
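
The precedence in the table above (tampering outranks verification errors) can be sketched as below. This is a hedged Python illustration with hypothetical names. The table does not say how `hash_chain_valid` affects the result, so this sketch leaves it out rather than guess.

```python
def classify_integrity(tampered_count: int, verification_errors: int) -> str:
    """Map integrity-scan counts to a result (hypothetical helper)."""
    if tampered_count > 0:
        return "Fail"   # any tampering is critical, regardless of other counts
    if verification_errors > 0:
        return "Warn"   # records could not be verified, but none are tampered
    return "Pass"


print(classify_integrity(tampered_count=0, verification_errors=4))
```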
## Why It Matters
Evidence integrity is the cornerstone of compliance and audit trust. Tampered evidence records indicate storage corruption, a security breach, or malicious modification of release decisions. Any tampering invalidates the entire evidence chain and must be treated as a security incident. Verification errors, while less severe, mean some evidence cannot be independently validated, weakening the audit posture.
## Common Causes
- Evidence modification after signing (accidental or malicious)
- Storage corruption (disk errors, incomplete writes)
- Malicious tampering by an attacker with storage access
- Key or certificate mismatch after key rotation
- Missing signing certificates needed for verification
- Certificate expiration rendering signatures unverifiable
- Unsupported signature algorithm in older evidence records
## How to Fix
### Docker Compose
```bash
# List tampered evidence (DO NOT DELETE - preserve for investigation)
docker compose exec evidence-locker stella evidence audit --tampered
# Check for storage corruption
docker compose exec evidence-locker stella evidence integrity-check --verbose
# If tampering is confirmed, escalate to security team
# Preserve all logs and evidence for forensic analysis
docker compose logs evidence-locker > evidence-locker-forensic.log
# For verification errors (missing certs), import the required certificates
docker compose exec evidence-locker stella evidence certs import --path /certs/
```
### Bare Metal / systemd
```bash
# List tampered evidence
stella evidence audit --tampered
# Full integrity check with details
stella evidence integrity-check --verbose
# Check for disk errors (replace /dev/sda with the device backing the evidence store)
sudo smartctl -H /dev/sda
sudo fsck -n /dev/sda1
# Import missing certificates for verification
stella evidence certs import --path /etc/stellaops/certs/
# DO NOT delete tampered evidence - preserve for investigation
```
### Kubernetes / Helm
```bash
# List tampered evidence
kubectl exec deploy/stellaops-evidence-locker -- stella evidence audit --tampered
# Full integrity check
kubectl exec deploy/stellaops-evidence-locker -- stella evidence integrity-check --verbose
# Check persistent volume health
kubectl describe pvc stellaops-evidence-data
# Export forensic logs
kubectl logs deploy/stellaops-evidence-locker --all-containers > forensic.log
```
## Verification
```bash
stella doctor run --check check.compliance.evidence-integrity
```
## Related Checks
- `check.compliance.attestation-signing` — signing key health affects evidence signature creation
- `check.compliance.evidence-rate` — evidence generation failures may relate to integrity issues
- `check.evidencelocker.merkle` — Merkle anchor verification provides additional integrity guarantees
- `check.evidencelocker.provenance` — provenance chain integrity validates the evidence chain
- `check.compliance.audit-readiness` — overall audit readiness depends on evidence integrity

@@ -0,0 +1,94 @@
---
checkId: check.compliance.evidence-rate
plugin: stellaops.doctor.compliance
severity: fail
tags: [compliance, evidence, attestation]
---
# Evidence Generation Rate
## What It Checks
Monitors evidence generation success rate by querying the Evidence Locker at `/api/v1/evidence/metrics`. The check computes the success rate as `(totalGenerated - failed) / totalGenerated` over the last 24 hours and compares it against two thresholds:

| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| Success rate < 95% | Fail |
| Success rate >= 95% and < 99% | Warn |
| Success rate >= 99% | Pass |

Evidence collected: `success_rate`, `total_generated_24h`, `failed_24h`, `pending_24h`, `avg_generation_time_ms`.

The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 10-second HTTP timeout. If no evidence has been generated (`totalGenerated == 0`), the success rate defaults to 100%.
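
The rate formula and thresholds above can be worked through in a short sketch. This is illustrative Python, not the plugin's code: the function names are hypothetical, and the integer comparison is a choice made here to avoid floating-point edge cases at exactly 95% and 99%.

```python
def success_rate(total_generated: int, failed: int) -> float:
    """(totalGenerated - failed) / totalGenerated, defaulting to 100%."""
    if total_generated == 0:
        return 1.0
    return (total_generated - failed) / total_generated


def classify_evidence_rate(total_generated: int, failed: int) -> str:
    """Apply the 95% / 99% thresholds (hypothetical helper)."""
    if total_generated == 0:
        return "Pass"                      # no evidence generated: rate is 100%
    succeeded = total_generated - failed
    if succeeded * 100 < 95 * total_generated:
        return "Fail"                      # below 95%
    if succeeded * 100 < 99 * total_generated:
        return "Warn"                      # at least 95%, below 99%
    return "Pass"                          # 99% or higher


print(success_rate(1000, 60), classify_evidence_rate(1000, 60))
```

For example, 60 failures out of 1000 generated records gives a 94% success rate, which falls below the 95% threshold and fails the check.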
## Why It Matters
Evidence generation is a critical path in the release pipeline. Every release decision, scan result, and policy evaluation produces evidence that feeds compliance audits and attestation chains. A dropping success rate means evidence records are being lost, which creates gaps in the audit trail. Below 95%, the system is losing more than 1 in 20 evidence records, making compliance reporting unreliable and potentially invalidating release approvals that lack supporting evidence.
## Common Causes
- Evidence generation service failures (internal errors, OOM)
- Database connectivity issues preventing evidence persistence
- Signing key unavailable, blocking signed evidence creation
- Storage quota exceeded on the evidence backend
- Intermittent failures due to high load or resource contention
## How to Fix
### Docker Compose
```bash
# Check evidence locker logs for errors
docker compose logs evidence-locker --since 1h | grep -i error
# Verify signing keys
docker compose exec evidence-locker stella evidence keys status
# Check database connectivity
docker compose exec evidence-locker stella evidence db check
# Check storage capacity
docker compose exec evidence-locker df -h /data/evidence
# If storage is full, preview a cleanup (dry run) or expand the volume
docker compose exec evidence-locker stella evidence cleanup --older-than 90d --dry-run
```
### Bare Metal / systemd
```bash
# Check service logs
journalctl -u stellaops-evidence-locker --since "1 hour ago" | grep -i error
# Verify signing keys
stella evidence keys status
# Check database connectivity
stella evidence db check
# Check storage usage
df -h /var/lib/stellaops/evidence
# Restart the service once the underlying issue is resolved
sudo systemctl restart stellaops-evidence-locker
```
### Kubernetes / Helm
```bash
# Check evidence locker pod logs
kubectl logs deploy/stellaops-evidence-locker --since=1h | grep -i error
# Verify signing keys
kubectl exec deploy/stellaops-evidence-locker -- stella evidence keys status
# Check persistent volume usage
kubectl exec deploy/stellaops-evidence-locker -- df -h /data/evidence
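# Check for OOMKilled containers (reported in pod status, not as an event reason)
kubectl get pods -n stellaops -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'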
```
## Verification
```bash
stella doctor run --check check.compliance.evidence-rate
```
## Related Checks
- `check.compliance.attestation-signing` — signing key health affects evidence generation
- `check.compliance.evidence-integrity` — integrity of generated evidence
- `check.compliance.provenance-completeness` — provenance depends on evidence generation
- `check.compliance.audit-readiness` — overall audit readiness depends on evidence availability

@@ -0,0 +1,104 @@
---
checkId: check.compliance.export-readiness
plugin: stellaops.doctor.compliance
severity: warn
tags: [compliance, export, audit]
---
# Evidence Export Readiness
## What It Checks
Verifies that evidence can be exported in auditor-ready formats by querying the Evidence Locker at `/api/v1/evidence/export/capabilities`. The check evaluates four export capabilities:
- **PDF export**: ability to generate PDF evidence reports.
- **JSON export**: ability to export evidence as structured JSON.
- **Signed bundle export**: ability to create cryptographically signed evidence bundles.
- **Chain of custody report**: ability to generate chain-of-custody documentation.

| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| 2 or more export formats unavailable | Fail |
| 1 export format unavailable | Warn |
| All 4 export formats available | Pass |

Evidence collected: `pdf_export`, `json_export`, `signed_bundle`, `chain_of_custody`, `available_formats`.

The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 10-second HTTP timeout.
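
Counting unavailable formats, as the table above describes, can be sketched like this. The Python below is a hedged illustration: the dictionary keys reuse the evidence names listed above, but whether the service response uses those exact field names is an assumption, as are the function and constant names.

```python
from typing import Dict, List, Tuple

# Names borrowed from the "Evidence collected" list; the real response
# fields may differ.
EXPORT_FORMATS = ("pdf_export", "json_export", "signed_bundle", "chain_of_custody")


def classify_export_readiness(capabilities: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return (result, unavailable_formats); missing keys count as unavailable."""
    unavailable = [name for name in EXPORT_FORMATS
                   if not capabilities.get(name, False)]
    if len(unavailable) >= 2:
        return "Fail", unavailable
    if len(unavailable) == 1:
        return "Warn", unavailable
    return "Pass", unavailable


print(classify_export_readiness({"pdf_export": False, "json_export": True,
                                 "signed_bundle": True, "chain_of_custody": True}))
```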
## Why It Matters
Auditors require evidence in specific formats. PDF reports are the most common delivery format for compliance reviews. Signed bundles provide cryptographic proof of evidence authenticity. The chain of custody report demonstrates that evidence has not been modified since collection. If these export capabilities are not available when an auditor requests them, it delays the audit process and may raise concerns about evidence integrity.
## Common Causes
- Export dependencies not installed (e.g., PDF rendering libraries)
- Signing keys not configured for evidence bundle signing
- Template files missing for PDF report generation
- Evidence Locker deployed without export module enabled
## How to Fix
### Docker Compose
```bash
# Check export configuration
docker compose exec evidence-locker stella evidence export --check
# Verify export dependencies are installed
docker compose exec evidence-locker dpkg -l | grep -i wkhtmltopdf
# Enable export features in environment
# EvidenceLocker__Export__PdfEnabled=true
# EvidenceLocker__Export__SignedBundleEnabled=true
# EvidenceLocker__Export__ChainOfCustodyEnabled=true
# Restart after configuration changes
docker compose restart evidence-locker
```
### Bare Metal / systemd
```bash
# Check export configuration
stella evidence export --check
# Install PDF rendering dependencies if missing
sudo apt install wkhtmltopdf
# Configure export in appsettings.json
# "EvidenceLocker": {
#   "Export": {
#     "PdfEnabled": true,
#     "SignedBundleEnabled": true,
#     "ChainOfCustodyEnabled": true
#   }
# }
sudo systemctl restart stellaops-evidence-locker
```
### Kubernetes / Helm
```yaml
# values.yaml
evidenceLocker:
  export:
    pdfEnabled: true
    jsonEnabled: true
    signedBundleEnabled: true
    chainOfCustodyEnabled: true
    signingKeySecret: "stellaops-export-signing-key"
```
```bash
# Create signing key secret for bundles
kubectl create secret generic stellaops-export-signing-key \
  --from-file=key.pem=./export-signing-key.pem
helm upgrade stellaops ./charts/stellaops -f values.yaml
```
## Verification
```bash
stella doctor run --check check.compliance.export-readiness
```
## Related Checks
- `check.compliance.audit-readiness` — overall audit readiness including retention and logging
- `check.compliance.attestation-signing` — signing key health required for signed bundle export
- `check.compliance.evidence-integrity` — integrity of the evidence being exported

@@ -0,0 +1,90 @@
---
checkId: check.compliance.framework
plugin: stellaops.doctor.compliance
severity: warn
tags: [compliance, framework, soc2, fedramp]
---
# Compliance Framework
## What It Checks
Verifies that configured compliance framework requirements are met by querying the Policy service at `/api/v1/compliance/status`. The check supports SOC2, FedRAMP, HIPAA, PCI-DSS, and custom frameworks. It evaluates:
- **Failing controls**: any compliance controls in a failed state trigger a fail result.
- **Compliance score**: a score below 100% (but with zero failing controls) triggers a warning.
- **Control counts**: reports total, passing, and failing control counts along with the framework name.

| Condition | Result |
|---|---|
| Policy service unreachable | Warn |
| Any controls failing (failingControls > 0) | Fail |
| Compliance score < 100% with no failing controls | Warn |
| All controls passing, score = 100% | Pass |

The check only runs when `Compliance:Frameworks` is configured. It uses a 15-second HTTP timeout.
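
The ordering of the two rules above (failing controls first, then score) can be sketched as follows. This is a minimal Python illustration with hypothetical names. It assumes the score is expressed as a percentage from 0 to 100, which the text implies but does not state.

```python
def classify_framework(failing_controls: int, compliance_score: float) -> str:
    """Map compliance status to a result (hypothetical helper).

    compliance_score is assumed to be a percentage in the range 0-100.
    """
    if failing_controls > 0:
        return "Fail"   # any failing control fails the check
    if compliance_score < 100.0:
        return "Warn"   # no failing controls, but not fully compliant
    return "Pass"


print(classify_framework(failing_controls=0, compliance_score=97.5))
```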
## Why It Matters
Compliance frameworks define the security and operational controls your organization must satisfy. Failing controls mean the system is not meeting regulatory requirements, which can result in audit findings, failed certifications, or legal exposure. Even partial non-compliance (score below 100%) indicates controls that need attention before the next audit cycle.
## Common Causes
- Control requirements not implemented in the platform configuration
- Evidence gaps where expected artifacts are missing
- Policy violations detected by the policy engine
- Configuration drift from the established compliance baseline
- New controls added to the framework that have not been addressed
## How to Fix
### Docker Compose
```bash
# List all failing controls
docker compose exec policy stella compliance audit --failing
# Generate remediation plan
docker compose exec policy stella compliance remediate --plan
# Review compliance status in detail
docker compose exec policy stella compliance status --framework soc2
# Configure frameworks in your .env
# Compliance__Frameworks=soc2,hipaa
```
### Bare Metal / systemd
```bash
# List failing controls
stella compliance audit --failing
# Generate remediation plan
stella compliance remediate --plan
# Configure frameworks in appsettings.json
# "Compliance": { "Frameworks": "soc2,hipaa" }
sudo systemctl restart stellaops-policy
```
### Kubernetes / Helm
```yaml
# values.yaml
compliance:
  frameworks: "soc2,hipaa"
  autoRemediate: false
  reportSchedule: "0 6 * * 1"  # Weekly Monday 6am
```
```bash
# Apply and check
helm upgrade stellaops ./charts/stellaops -f values.yaml
kubectl exec deploy/stellaops-policy -- stella compliance audit --failing
```
## Verification
```bash
stella doctor run --check check.compliance.framework
```
## Related Checks
- `check.compliance.audit-readiness` verifies the system is ready for compliance audits
- `check.compliance.evidence-integrity` verifies evidence integrity for compliance evidence
- `check.compliance.provenance-completeness` verifies provenance records support compliance claims
- `check.compliance.export-readiness` verifies evidence can be exported for auditor review

@@ -0,0 +1,102 @@
---
checkId: check.compliance.provenance-completeness
plugin: stellaops.doctor.compliance
severity: fail
tags: [compliance, provenance, slsa]
---
# Provenance Completeness
## What It Checks
Verifies that provenance records exist for all releases by querying the Provenance service at `/api/v1/provenance/completeness`. The check computes a completeness rate as `(totalReleases - missingCount) / totalReleases` and evaluates the SLSA (Supply-chain Levels for Software Artifacts) level:

| Condition | Result |
|---|---|
| Provenance service unreachable | Warn |
| Completeness rate < 99% | Fail |
| SLSA level < 2 (but completeness >= 99%) | Warn |
| Completeness >= 99% and SLSA level >= 2 | Pass |

Evidence collected: `completeness_rate`, `total_releases`, `missing_count`, `slsa_level`.

The check only runs when `Provenance:Url` or `Services:Provenance:Url` is configured. It uses a 15-second HTTP timeout. If no releases exist (`totalReleases == 0`), completeness defaults to 100%.
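
The completeness formula and SLSA rule above combine as in the sketch below. This is hedged, illustrative Python rather than the check's implementation: the function name is hypothetical, the integer comparison is a choice made here to sidestep rounding at exactly 99%, and applying the SLSA rule when there are no releases is an assumption.

```python
def classify_provenance(total_releases: int, missing_count: int,
                        slsa_level: int) -> str:
    """Apply the completeness and SLSA thresholds (hypothetical helper)."""
    if total_releases > 0:
        covered = total_releases - missing_count
        if covered * 100 < 99 * total_releases:
            return "Fail"   # completeness below 99%
    # With zero releases, completeness defaults to 100%, so only the SLSA
    # rule remains (assumed to still apply in that case).
    if slsa_level < 2:
        return "Warn"
    return "Pass"


print(classify_provenance(total_releases=200, missing_count=3, slsa_level=3))
```

With 200 releases, 3 missing records give 98.5% completeness and fail the check, while 2 missing records give exactly 99% and pass the completeness rule.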
## Why It Matters
Provenance records document the complete history of how a software artifact was built, including the source code, build system, and build steps. Without provenance, there is no verifiable link between source code and the deployed artifact. This is a foundational requirement for SLSA compliance and supply-chain security. Missing provenance for even a small percentage of releases creates audit gaps that undermine the trustworthiness of the entire release pipeline.
## Common Causes
- Build pipeline not configured to generate provenance attestations
- Provenance upload failures due to network or authentication issues
- Legacy releases created before provenance generation was enabled
- Manual deployments that bypass the standard build pipeline
- Build system not meeting SLSA level 2+ requirements
## How to Fix
### Docker Compose
```bash
# List releases missing provenance
docker compose exec provenance stella provenance audit --missing
# Generate backfill provenance for existing releases (dry run first)
docker compose exec provenance stella provenance backfill --dry-run
# If dry run looks correct, run the actual backfill
docker compose exec provenance stella provenance backfill
# Check SLSA level
docker compose exec provenance stella provenance slsa-level
# Ensure provenance generation is enabled in the pipeline
# Provenance__Enabled=true
# Provenance__SlsaLevel=2
```
### Bare Metal / systemd
```bash
# List releases missing provenance
stella provenance audit --missing
# Backfill provenance (dry run first)
stella provenance backfill --dry-run
# Check SLSA level configuration
stella provenance slsa-level
# Configure in appsettings.json
# "Provenance": { "Enabled": true, "SlsaLevel": 2 }
sudo systemctl restart stellaops-provenance
```
### Kubernetes / Helm
```yaml
# values.yaml
provenance:
  enabled: true
  slsaLevel: 2
  backfill:
    enabled: true
    schedule: "0 3 * * 0"  # Weekly Sunday 3am
```
```bash
# List missing provenance
kubectl exec deploy/stellaops-provenance -- stella provenance audit --missing
# Preview the backfill (dry run first)
kubectl exec deploy/stellaops-provenance -- stella provenance backfill --dry-run
helm upgrade stellaops ./charts/stellaops -f values.yaml
```
## Verification
```bash
stella doctor run --check check.compliance.provenance-completeness
```
## Related Checks
- `check.compliance.attestation-signing` — signing key required for provenance attestations
- `check.compliance.evidence-rate` — evidence generation rate includes provenance records
- `check.compliance.evidence-integrity` — integrity of provenance evidence
- `check.evidencelocker.provenance` — provenance chain integrity at the storage level
- `check.compliance.framework` — compliance frameworks may require specific SLSA levels