Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in: master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,86 @@
---
checkId: check.environment.drift
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, drift, configuration, consistency]
---
# Environment Drift Detection
## What It Checks
Queries the Release Orchestrator drift report API (`/api/v1/environments/drift`) and compares configuration snapshots across environments. The check requires at least two environments to perform a comparison. Each drift item carries a severity classification, and the check result is derived from those classifications:
- **Fail** if any drift is classified as `critical` (e.g., security-relevant configuration differences between staging and production)
- **Warn** if drift exists but no item is critical
- **Pass** if no configuration drift is detected between environments

Evidence includes the specific configuration keys that drifted and which environments are affected.
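
The result rules above can be sketched as a small function. This is an illustrative Python sketch, not the plugin's actual implementation: the `DriftItem` type, its field names, and `classify_drift` are hypothetical, and treating every non-`critical` severity as a warning is an assumption drawn from the list above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DriftItem:
    """Hypothetical shape of one drift entry; the real API schema may differ."""
    config_key: str
    environments: Tuple[str, ...]
    severity: str  # assumed: "critical" or any other, non-critical label


def classify_drift(items: Iterable[DriftItem]) -> str:
    """Map drift items to a check result: 'pass', 'warn', or 'fail'."""
    items = list(items)
    if not items:
        return "pass"  # no drift detected
    if any(item.severity == "critical" for item in items):
        return "fail"  # at least one critical drift item
    return "warn"      # drift exists, none of it critical
```

A single critical item is enough to fail the check, regardless of how many non-critical items accompany it.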
## Why It Matters
Configuration drift between environments undermines the core promise of promotion-based releases: that what you test in staging is what runs in production. Drift can cause subtle behavioral differences that only manifest under production load, making bugs nearly impossible to reproduce. Critical drift in security-related configuration (TLS settings, authentication, network policies) can create compliance violations and security exposures.
## Common Causes
- Manual configuration changes applied directly to one environment (bypassing the release pipeline)
- Failed deployment that left partial configuration in one environment
- Configuration sync job that did not propagate to all environments
- Environment restored from an outdated backup
- Intentional per-environment overrides that were not tracked as accepted exceptions
## How to Fix
### Docker Compose
```bash
# View the current drift report
stella env drift show
# Compare specific configuration between environments
diff <(docker exec stellaops-staging cat /app/appsettings.json) \
     <(docker exec stellaops-prod cat /app/appsettings.json)
# Reconcile by redeploying from the canonical source
docker compose -f docker-compose.stella-ops.yml up -d --force-recreate <service>
# If drift is intentional, mark it as accepted
stella env drift accept <config-key>
```
### Bare Metal / systemd
```bash
# View drift report
stella env drift show
# Compare config files between environments
diff /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json
# Reconcile by copying from the source of truth (this example treats staging as canonical)
sudo cp /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json
sudo systemctl restart stellaops-<service>
# Or accept drift as intentional
stella env drift accept <config-key>
```
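
A line-based `diff` of two `appsettings.json` files can be noisy when only key order or formatting differs. The following is a minimal sketch of a key-level comparison, assuming both files are plain JSON without comments and have already been loaded into dictionaries (for example with `json.load`). The `flatten` and `drifted_keys` helpers and the sample settings are hypothetical; nested keys are joined with `:` in the style of .NET configuration keys.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # placeholder for a key absent from one side


def flatten(node: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested JSON objects into {"Section:Key": value} pairs."""
    if isinstance(node, dict):
        flat: Dict[str, Any] = {}
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}:{key}" if prefix else key))
        return flat
    # Lists and scalars are compared as whole values in this sketch.
    return {prefix: node}


def drifted_keys(left: Dict[str, Any], right: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return keys whose values differ, mapped to (left_value, right_value)."""
    a, b = flatten(left), flatten(right)
    return {
        key: (a.get(key, _MISSING), b.get(key, _MISSING))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, _MISSING) != b.get(key, _MISSING)
    }


# Illustrative settings only; real files would be read with json.load().
staging = {"Tls": {"MinVersion": "1.3"}, "Logging": {"Level": "Info"}}
prod = {"Tls": {"MinVersion": "1.2"}, "Logging": {"Level": "Info"}}

for key, (left_value, right_value) in drifted_keys(staging, prod).items():
    print(f"{key}: staging={left_value!r} prod={right_value!r}")
```

The keys reported this way are in the same flattened form as a configuration key, which may make it easier to decide which entries to reconcile and which to accept as intentional.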
### Kubernetes / Helm
```bash
# View drift between environments
stella env drift show
# Compare Helm values between environments
diff <(helm get values stellaops -n stellaops-staging -o yaml) \
     <(helm get values stellaops -n stellaops-prod -o yaml)
# Reconcile by redeploying with consistent values
helm upgrade stellaops stellaops/stellaops -n stellaops-prod \
  -f values-prod.yaml
# Compare the live ConfigMap in the cluster against the local manifest
kubectl diff -f configmap.yaml -n stellaops-prod
```
## Verification
```bash
stella doctor run --check check.environment.drift
```
## Related Checks
- `check.environment.deployments` - drift can cause service failures after redeployment
- `check.environment.secrets` - secret configuration differences between environments
- `check.environment.network.policy` - network policy drift is a security concern